NLP Overview

Overview

NLP-Overview

  1. Introduction
  • The field of NPL encompasses a variety of topics which involve the computational processing and understanding of human languages. Since the 1980s, the field has increasingly relied on data-driven computation involving statistics, probability, and machine learning.

  • Recent increases in computational power and parallelization, harnessed by Graphical Processing Units (GPUs), now allow for “deep learning”, which utilizes artificial neural networks (ANNs), sometimes with billions of trainable parameters. Additionally, the contemporary availability of large datasets, facilitated by sophisticated data collection processes, enables the training of such deep architectures.

  1. Natural Language Processing
  • The field of natural language processing, also known as computational linguistics, involves the engineering of computational models and processes to solve practical problems in understanding human languages. Although it is sometimes difficult to distinguish clearly to which areas issues belong.

  • Work in NLP can be divided into two broad sub-areas: core areas and applications.

  • Often one needs to handle one or more of the core issues successfully and apply those ideas and procedures to solve practical problems.

  • Currently, NLP is primarily a data-driven field using statistical and probabilistic computations along with machine learning.

2.1. Core area of NLP

  • The core issues are those that are inherently present in any computational linguistic system. To perform translation, text summarization, image captioning, or any other linguistic task, there must be some understanding of the underlying language. This understanding can be broken down into at least four main areas: language modeling, morphology, parsing, and semantics. The number of scholarly works in each area over the last decade is shown in Figure 3.

image

Fig. 3: Publication Volume for Core Areas of NLP. The number of publications, indexed by Google Scholar, relating to each topic over the last decade is shown. While all areas have experienced growth, language modeling has grown the most.

2.1.1 Language Modeling and Word Embeddings

  • Language modeling can be viewed in two ways. First, it determines which words follow which. By extension, however, this can be viewed as determining what words mean, as individual words are only weakly meaningful, deriving their full value only from their interactions with other words.

  • The core areas address fundamental problems such as language modeling, which underscores quantifying associations among naturally occurring words.

  • Arguably, the most important task in NLP is that of language modeling. Language modeling (LM) is an essential piece of almost any application of NLP. Language modeling is the process of creating a model to predict words or simple linguistic components given previous words or components. This is useful for applications in which a user types input, to provide predictive ability for fast text entry. However, its power and versatility emanate from the fact that it can implicitly capture syntactic and semantic relationships among words or components in a linear neighborhood, making it useful for tasks such as machine translation or text summarization. Using prediction, such programs are able to generate more relevant, human-sounding sentences

2.1.2. Morphology

  • Morphology is the study of how words themselves are formed. It considers the roots of words and the use of prefixes and suffixes, compounds, and other intraword devices, to display tense, gender, plurality, and a other linguistic constructs. (Understanding later)

  • Morphological processing, dealing with segmentation of meaningful components of words and identifying the true parts of speech of words as used.

  • Morphology is concerned with finding segments within single words, including roots and stems, prefixes, suffixes,and—in some languages—infixes. Affixes (prefixes, suffixes, or infixes) are used to overtly modify stems for gender, number, person, et cetera.

2.1.3. Parsing

  • Parsing considers which words modify others, forming constituents, leading to a sentential structure. The area of semantics is the study of what words mean. It takes into account the meanings of the individual words and how they relate to and modify others, as well as the context these words appear in and some degree of world knowledge, i.e., “common sense”. There is a significant amount of overlap between each of these areas.

  • Syntactic processing, or parsing, which builds sentence diagrams as possible precursors to semantic processing.

  • Parsing examines how different words and phrases relate to each other within a sentence. There are at least two distinct forms of parsing: constituency parsing and dependency parsing. In constituency parsing, phrasal constituents are extracted from a sentence in a hierarchical fashion. Dependency parsing looks at the relationships between pairs of individual words.

  • Remaining Challenges: Outside of universal parsing, a parsing challenge that needs to be further investigated is the building of syntactic structures without the use of treebanks for training. Attempts have been made using attention scores and Tree-LSTMs, as well as outside-inside auto-encoders. If such approaches are successful, they have potential use in many environments, including in the context of low-resource languages and out-of-domain scenarios. While a number of other challenges remain, these are the largest and are expected to receive the most focus.

D. Semantics

  • Semantic processing, which attempts to distill meaning of words, phrases, and higher level components in text.

  • Semantic processing involves understanding the meaning of words, phrases, sentences, or documents at some level. Word embeddings, such as Word2Vec and GloVe, claim to capture meanings of words, following the Distributional Hypothesis of Meaning. As a corollary, when vectors corresponding to phrases, sentences, or other components of text are processed using a neural network, a representation that can be loosely thought to be semantically representative is computed compositionally.

  • In this section, neural semantic processing research is separated into two distinct areas: Work on comparing the semantic similarity of two portions of text, and work on capturing and transferring meaning in high level constituents, particularly sentences.

  • Semantic Challenges: In addition to the challenges already mentioned, researchers believe that being able to solve tasks well does not indicate actual understanding. Integrating deep networks with general word-graphs (e.g. WordNet) or knowledge-graphs (e.g. DBPedia) may be able to endow a sense of understanding. Graph-embedding is an active area of research, and work on integrating language-based models and graph models has only recently begun to take off, giving hope for better machine understanding.

  1. Summary of Core Issues
  • Deep learning has generally performed very well, surpassing existing states of the art in many individual core NLP tasks, and has thus created the foundation on which useful natural language applications can and are being built. However, it is clear from examining the research reviewed here that natural language is an enigmatically complex topic, with myriad core or basic tasks, of which deep learning has only grazed the surface. It is also not clear how architectures for ably executing individual core tasks can be synthesized to build a common edifice, possibly a much more complex distributed neural architecture, to show competence in multiple or “all” core tasks. More fundamentally, it is also not clear, how mastering of basic tasks, may lead to superior performance in applied tasks, which are the ultimate engineering goals, especially in the context of building effective and efficient deep learning models. Many, if not most, successful deep learning architectures for applied tasks, discussed in the next section, seem to forgo explicit architectural components for core tasks, and learn such tasks implicitly. Thus, some researchers argue that the relevance of the large amount of work on core issues is not fully justified, while others argue that further extensive research in such areas is necessary to better understand and develop systems which more perfectly perform these tasks, whether explicitly or implicitly.
  1. Application of NLP Using Deep Learning
  • While the study of core areas of NLP is important to understanding how neural models work, it is meaningless in and of itself from an engineering perspective, which values applications that benefit humanity, not pure philosophical and scientific inquiry. Current approaches to solving several immediately useful NLP tasks are summarized here. Note that the issues included here are only those involving the processing of text, not the processing of verbal speech. Because speech processing [162], [163] requires expertise on several other topics including acoustic processing, it is generally considered another field of its own, sharing many commonalities with the field of NLP. The number of studies in each discussed area over the last decade is shown in Figure 4.

image

Fig. 4: Publication Volume for Applied Areas of NLP. All areas of applied natural language processing discussed have witnessed growth in recent years, with the largest growth occurring in the last two to three years.

  • The application areas involve topics

    • Extraction of useful information (e.g. named entities and relations), translation of text between and among languages, summarization of written works, automatic answering of questions by inferring answers, and classification and clustering of documents.

4.1. Information Retrieval

  • The purpose of Information Retrieval (IR) systems is to help people find the right (most useful) information in the right (most convenient) format at the right time (when they need it) [164]. Among many issues in IR, a primary problem that needs addressing pertains to ranking documents with respect to a query string in terms of relevance scores for ad-hoc retrieval tasks, similar to what happens in a search engine.

4.2. Information Extraction

  • Information extraction extracts explicit or implicit information from text. The outputs of systems vary, but often the extracted data and the relationships within it are saved in relational databases [172]. Commonly extracted information includes named entities and relations, events and their participants, temporal information, and tuples of facts.

4.3. Text Generation

  • Many NLP tasks require the generation of human-like language. Summarization and machine translation convert one text to another in a sequence-to-sequence (seq2seq) fashion. Other tasks, such as image and video captioning and automatic weather and sports reporting, convert non-textual data to text. Some tasks, however, produce text without any input data to convert (or with only small amounts used as a topic or guide). These tasks include poetry generation, joke generation, and story generation.

4.4. Summarization

  • Summarization finds elements of interest in documents in order to produce an encapsulation of the most important content. There are two primary types of summarization: extractive and abstractive. The first focuses on sentence extraction, simplification, reordering, and concatenation to relay the important information in documents using text taken directly from the documents. Abstractive summaries rely on expressing documents’ contents through generation-style abstraction, possibly using words never seen in the documents [48].

4.5. Question Answering

  • Similar to summarization and information extraction, question answering (QA) gathers relevant words, phrases, or sentences from a document. QA returns this information in a coherent fashion in response to a request. Current methods resemble those of summarization.

4.6. Machine Translation

  • Machine translation (MT) is the quintessential application of NLP. It involves the use of mathematical and algorithmic techniques to translate documents in one language to another. Performing effective translation is intrinsically onerous even for humans, requiring proficiency in areas such as morphology, syntax, and semantics, as well as an adept understanding and discernment of cultural sensitivities, for both of the languages (and associated societies) under consideration [48].
  1. Summary of Deep Learning NLP Applications
  • Numerous other applications of natural language processing exist including grammar correction, as seen in word processors, and author mimicking, which, given sufficient data, generates text replicating the style of a particular writer. Many of these applications are infrequently used, understudied, or not yet exposed to deep learning. However, the area of sentiment analysis should be noted, as it is becoming increasingly popular and utilizing deep learning. In large part a semantic task, it is the extraction of a writer’s sentiment—their positive, negative, or neutral inclination towards some subject or idea [268]. Applications are varied, including product research, futures prediction, social media analysis, and classification of spam [269], [270]. The current state of the art uses an ensemble including both LSTMs and CNNs [271].

  • This section has provided a number of select examples of the applied usages of deep learning in natural language processing. Countless studies have been conducted in these and similar areas, chronicling the ways in which deep learning has facilitated the successful use of natural language in a wide variety of applications. Only a minuscule fraction of such work has been referred to in this survey.

  • While more specific recommendations for practitioners have been discussed in some individual subsections, the current trend in state-of-the-art models in all application areas is to use pre-trained stacks of Transformer units in some configuration, whether in encoder-decoder configurations or just as encoders. Thus, self-attention which is the mainstay of Transformer has become the norm, along with cross-attention between encoder and decoder units, if decoders are present. In fact, in many recent papers, if not most, Transformers have begun to replace LSTM units that were preponderant just a few months ago. Pre-training of these large Transformer models has also become the accepted way to endow a model with generalized knowledge of language. Models such as BERT, which have been trained on corpora of billions of words, are available for download, thus providing a practitioner with a model that possesses a great amount of general knowledge of language already. A practitioner can further train it with one’s own general corpora, if desired, but such training is not always necessary, considering the enormous sizes of the pre-training that downloaded models have received. To train a model to perform a certain task well, the last step a practitioner must go through is to use available downloadable task-specific corpora, or build one’s own task-specific corpus. This last training step is usually supervised. It is also recommended that if several tasks are to be performed, multi-task training be used wherever possible.

Reference:

[1] https://arxiv.org/pdf/1807.10854.pdf

[2] https://arxiv.org/pdf/1708.02709.pdf

[3] https://arxiv.org/pdf/2108.05542.pdf

[4] Recent Advances in Natural Language Processing via Large Pre-TrainedLanguage Models: A Survey https://arxiv.org/pdf/2111.01243.pdf

[5] AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing https://arxiv.org/pdf/2108.05542.pdf Appendix:

Encompasses (v) to include different types of things

Underscores (v) to emphasize something, or to show that it is important

Data-driven science is an interdisciplinary field of scientific methods to extract knowledge from data

Data-driven learning a learning approach driven by research-like access to data

Inherently (v) is a basic or essential feature that gives something its character

Arguably (adv) used for stating your opinion or belief, especially when you think other people may disagree (Được cho là)

Versatility (adj) able to change easily or to be used for different purposes (Tính linh hoạt)

Emanate (v) to come from a particular place (bắt nguồn).

Implicitly (adj) not stated directly, but expressed in the way that someone behaves, or understood from what they are saying (ngầm hiểu).

Corollary (n) something that will also be true if a particular idea or statement is true, or something that will also exist if a particular situation exists (kết quả tất yếu)1

Owner
PeterPham
PhD Student at National Chung Cheng University
PeterPham
Fidibo.com comments Sentiment Analyser

Fidibo.com comments Sentiment Analyser Introduction This project first asynchronously grab Fidibo.com books comment data using grabber.py and then sav

Iman Kermani 3 Apr 15, 2022
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

42 Dec 31, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 01, 2023
The code for the Subformer, from the EMNLP 2021 Findings paper: "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers", by Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo

Subformer This repository contains the code for the Subformer. To help overcome this we propose the Subformer, allowing us to retain performance while

Machel Reid 10 Dec 27, 2022
State of the art faster Natural Language Processing in Tensorflow 2.0 .

tf-transformers: faster and easier state-of-the-art NLP in TensorFlow 2.0 ****************************************************************************

74 Dec 05, 2022
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
All the code I wrote for Overwatch-related projects that I still own the rights to.

overwatch_shit.zip This is (eventually) going to contain all the software I wrote during my five-year imprisonment stay playing Overwatch. I'll be add

zkxjzmswkwl 2 Dec 31, 2021
Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2. It is trained (finetuned) on a curated list of approximately 45K Python (~470MB) files gathered from the

Galois Autocompleter 91 Sep 23, 2022
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural languag

Benjamin Heinzerling 1.1k Jan 03, 2023
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5

NLP-Summarizer Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5 This project aimed to provide in

Samuel Sharkey 1 Feb 07, 2022
My implementation of Safaricom Machine Learning Codility test. The code has bugs, logical I guess I made errors and any correction will be appreciated.

Safaricom_Codility Machine Learning 2022 The test entails two questions. Question 1 was on Machine Learning. Question 2 was on SQL I ran out of time.

Lawrence M. 1 Mar 03, 2022
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application with a focus on embedded systems.

Welcome to Spokestack Python! This library is intended for developing voice interfaces in Python. This can include anything from Raspberry Pi applicat

Spokestack 133 Sep 20, 2022
A PyTorch-based model pruning toolkit for pre-trained language models

English | 中文说明 TextPruner是一个为预训练语言模型设计的模型裁剪工具包,通过轻量、快速的裁剪方法对模型进行结构化剪枝,从而实现压缩模型体积、提升模型速度。 其他相关资源: 知识蒸馏工具TextBrewer:https://github.com/airaria/TextBrewe

Ziqing Yang 231 Jan 08, 2023
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
Natural language computational chemistry command line interface.

nlcc Install pip install nlcc Must have Open-AI Codex key: export OPENAI_API_KEY=your key here then nlcc key bindings ctrl-w copy to clipboard (Note

Andrew White 37 Dec 14, 2022
Official PyTorch implementation of Time-aware Large Kernel (TaLK) Convolutions (ICML 2020)

Time-aware Large Kernel (TaLK) Convolutions (Lioutas et al., 2020) This repository contains the source code, pre-trained models, as well as instructio

Vasileios Lioutas 28 Dec 07, 2022
Predicting the usefulness of reviews given the review text and metadata surrounding the reviews.

Predicting Yelp Review Quality Table of Contents Introduction Motivation Goal and Central Questions The Data Data Storage and ETL EDA Data Pipeline Da

Jeff Johannsen 3 Nov 27, 2022