Natural Language Processing Best Practices & Examples

NLP Best Practices

In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that these tools can significantly reduce the "time to market" by simplifying the experience from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools across a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks, such as the GLUE and SQuAD leaderboards. The models can be used in a number of applications, ranging from simple text classification to sophisticated intelligent chatbots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist that do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity recognition, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.
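As a rough illustration, here is a minimal sketch of calling the sentiment endpoint over plain HTTP, assuming a Text Analytics v3.0 resource; the resource name and key below are placeholders you would replace with your own:

    import requests

    # Placeholders: substitute your own Cognitive Services resource and key.
    endpoint = "https://<your-resource>.cognitiveservices.azure.com"
    key = "<your-subscription-key>"

    documents = {"documents": [
        {"id": "1", "language": "en", "text": "The new release works great!"},
    ]}

    response = requests.post(
        f"{endpoint}/text/analytics/v3.0/sentiment",
        headers={"Ocp-Apim-Subscription-Key": key},
        json=documents,
    )
    print(response.json())  # per-document sentiment labels and confidence scores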

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service to train and deploy a model as a REST API given a user-provided training set. You can perform intent classification as well as named entity extraction by simply providing example utterances and labelling them. It supports active learning, so your model keeps learning and improving.

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions:

Scenarios

We aim to have end-to-end examples of common tasks and scenarios, such as text classification and named entity recognition.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face, which allows users to easily load pretrained models and fine-tune them for different tasks.
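As a minimal sketch of what this looks like with the transformers package (the model name and toy batch are illustrative, not the repository's own utilities):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load a pretrained checkpoint with a fresh classification head.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # One fine-tuning step on a toy batch.
    texts = ["a great movie", "a waste of time"]
    labels = torch.tensor([1, 0])
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    outputs = model(**batch, labels=labels)  # loss is computed internally
    outputs.loss.backward()
    optimizer.step()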

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and fastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
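As a quick illustration of that multilingual support, a single checkpoint such as bert-base-multilingual-cased tokenizes text in many languages with one shared vocabulary (a hedged sketch, not a repository utility):

    from transformers import AutoTokenizer

    # One multilingual checkpoint covers 100+ languages out of the box.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    for text in ["The weather is nice.", "Das Wetter ist schön.", "天气很好。"]:
        print(tokenizer.tokenize(text))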

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
|----------|--------|-------------|-----------|
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task that summarizes the input text into a shorter paragraph. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine whether the text agrees with the hypothesis. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, given a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score given a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension (see the sketch after the table). | English |
| Sentiment Analysis | Dependency parser, GloVe | Provides an example of training and using Aspect-Based Sentiment Analysis with Azure ML and Intel NLP Architect. | English |
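To make the Embeddings and Sentence Similarity rows concrete, here is a minimal sketch, assuming gensim 4.x, that trains Word2Vec on a toy corpus and scores a sentence pair by cosine similarity of averaged word vectors (the repository's notebooks use larger corpora and stronger models):

    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus; real training would use a large tokenized text collection.
    corpus = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "lay", "on", "the", "rug"],
        ["stocks", "fell", "on", "monday"],
    ]
    model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

    def sentence_vector(tokens):
        # Average the word vectors of in-vocabulary tokens.
        vectors = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vectors, axis=0)

    a = sentence_vector(["the", "cat", "sat"])
    b = sentence_vector(["the", "dog", "lay"])
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"similarity: {cosine:.3f}")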

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond the bounds of the prebuilt services and you want to explore custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is used in notebooks across different scenarios to enhance the efficiency of developing natural language systems at scale and for various AI model development tasks, such as tracking experiments, distributed training, and hyperparameter tuning.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the notebooks; an introduction and/or references for those are provided in the notebooks themselves.
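As a minimal sketch, assuming the azureml-core SDK and a workspace config.json downloaded from the portal, submitting a training script to a GPU cluster looks roughly like this (the script and compute target names are hypothetical placeholders):

    from azureml.core import Experiment, ScriptRunConfig, Workspace

    # Assumes config.json for your workspace is present in the working directory.
    ws = Workspace.from_config()

    # "train.py" and "gpu-cluster" are hypothetical names for illustration.
    run_config = ScriptRunConfig(
        source_directory=".",
        script="train.py",
        compute_target="gpu-cluster",
    )
    run = Experiment(workspace=ws, name="nlp-finetune").submit(run_config)
    run.wait_for_completion(show_output=True)  # stream logs until the run finishes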

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
|------------|-------------|
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

| Build | Branch | Status |
|-------|--------|--------|
| Linux CPU | master | Build Status |
| Linux CPU | staging | Build Status |
| Linux GPU | master | Build Status |
| Linux GPU | staging | Build Status |
Comments
  • [ASK] Remove 'repo_metrics' folder

    Description

    During a team discussion, we agreed that the files in the 'repo_metrics' folder should be removed from the NLP repo and maintained in a centralized way.

    Other Comments

    enhancement 
    opened by yijingchen 31
  • Fix broken data path and add git clone cell

    Description

    This notebook, 'embedding_trainer.ipynb', cannot be run end-to-end. Fixed the related issues:

    • Added the `!git clone http://github.com/stanfordnlp/glove` command inside the Jupyter notebook. I found the experience smoother this way; however, I don't know why the author decided to leave it out. Please ask the author to verify this in case there are other risks that I'm not aware of.

    • The command `cd glove && make` gives an error, and I didn't add a cell for `!cd glove && make`. The notebook runs fine without this command. Please check with the author to see whether it is necessary to include it.

    • The import `from utils_nlp.dataset import stsbenchmark` had its data path updated in the util file, but this notebook hadn't been updated accordingly. I modified the path in this notebook.

    • With these fixes, the test pipeline should be able to run this notebook end-to-end.

    Related Issues

    https://github.com/microsoft/nlp/issues/230

    Checklist:

    • [X] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [x] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by yijingchen 19
  • GenSen on AML deep dive notebook (sentence similarity)

    1. This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity, building one of the advanced models, GenSen, on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPUs.

    The notebook includes data loading and preprocessing, training the GenSen model with distributed PyTorch with Horovod on AzureML, and tuning with HyperDrive. Evaluation and deployment will be added later. In addition, the comparison results for training and tuning on AML vs. a VM will be added once this initial PR is merged into staging.

    2. Provides refactored GenSen code in utils_nlp to make the model reusable.

    We provide a distributed PyTorch with Horovod implementation of the paper, along with pre-trained models and code to evaluate these models on a variety of transfer learning benchmarks. This code is based on the GitHub codebase from Maluuba, but we have refactored the code in the following aspects:

    1. Support distributed PyTorch with Horovod
    2. Clean and refactor the original code into a more structured form
    3. Change the training script (train.py) from non-stopping to stopping when the validation loss reaches a local minimum
    4. Update the code from Python 2.7 to 3+ and PyTorch from 0.2/0.3 to 1.0.1
    5. Add some necessary comments
    6. Add code for training on the AzureML platform
    7. Fix a bug where setting the batch size to 1 raised an error during training
    opened by catherine667 16
  • [BUG] Cannot set up nlp_gpu environment

    Description

    The following errors show up while setting up the GPU environment:

        Collecting package metadata: done
        Solving environment: failed

        ResolvePackageNotFound: cudatoolkit==9.2

    How do we replicate the bug?

    Machine: Microsoft Azure Deep Learning Virtual Machine, Standard NC6
    Operating System: Windows
    Code:

        cd nlp
        python tools/generate_conda_file.py --gpu
        conda env create -n nlp_gpu -f nlp_gpu.yaml

    Expected behavior (i.e. solution)

    The installation should complete without errors.

    Other Comments

    I changed some package versions in the yaml file to this:

    • cudatoolkit>=9.2
    • tensorflow-gpu>=1.12.0

    The installation proceeds with the above configuration; however, another error occurs, shown below:

        ----------------------------------------
        ERROR: Command "'C:\Anaconda\envs\nlp_gpu\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\adminyijing\AppData\Local\Temp\2\pip-record-su74tqwf\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\

    bug 
    opened by yijingchen 13
  • [FEATURE] Check that all AzureML notebooks are tested

    Description

    Need to add a .azureml folder to provide the common AzureML subscription for the AzureML notebooks.

    How do we replicate the bug?

    Under the .azureml folder, there should be a config.json file (downloadable from the workspace) that looks like this:

        {"Id": null, "Scope": "/subscriptions/[ID]/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/[Workspace Name]"}

    Related notebooks: GenSen https://github.com/microsoft/nlp/pull/199 and BERT https://github.com/microsoft/nlp/pull/191 notebook testing

    related to https://github.com/microsoft/nlp/issues/143

    Expected behavior (i.e. solution)

    Other Comments

    #262

    bug release-blocker 
    opened by catherine667 10
  • Staging to master to add github metrics

    Description

    In order to start recording the metrics, we need to merge the metrics code to master.

    @irshaffe when this is in master, I need to activate it from DevOps so that it is executed every day; it will store the metrics so you can start populating the Power BI dashboard. The original Power BI dashboard for Recommenders was done with Scott (@gramhagen).

    Related Issues

    #24

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 10
  • V chguan/add icml ex nlp code

    Description

    We have added the code for our ICML paper. The related files are:

    • interpreter.py and README.md files under utils_nlp\interpreter. The interpreter.py file is the main functional file we utilize; README.md is an instruction file for it.
    • explain_simple_model.ipynb and explain_BERT_model.ipynb files under scenarios\interpret_NLP_models, covering two scenarios for how to use interpreter.py.
    • test_interpreter.py under tests\unit. This file contains 6 unit tests for interpreter.py (which, on my machine, take about 2.25 s to run).
    • example.png under the utils_nlp\interpreter folder, used by README.md, and regular.json under the scenarios\interpret_NLP_models folder, used by explain_BERT_model.ipynb. I know from other pull requests that files like these are not allowed to be merged. So, can anyone help me upload these two files somewhere? Thanks for your help in advance :)

    Related Issues

    Our issue is #62.

    Checklist:

    • My code follows the code style of this project, as detailed in our contribution guidelines.
    • I have added tests.
    • [ ] I have updated the documentation accordingly (I now add README.md to utils_nlp only. What other .md files should I modify or add?).
    opened by Frozenmad 9
  • Transformers

    Description

    Related Issues

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by saidbleik 8
  • Integration tests

    Description

    Integration and smoke tests

    @bethz, @jainr In the code I would like to have the sequence:

    1. create conda env
    2. run smoke
    3. run integration
    4. remove conda

    Beth told me that there might be a more elegant way of doing this. Can you please offer some guidance?

    @saidbleik @sharatsc the scheduler is not working at the moment (you might have seen the emails to DevOps). As a temporary solution, I thought of running this pipeline every time there is a PR to master. Feel free to propose another idea.

    Related Issues

    #25

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 8
  • Hlu/bert ner utils

    5/31: Notebook is updated with the new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


    5/29 updates: @saidbleik @miguelgfierro I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments). Please ignore the bert_data_utils.py file for now; I need to update it for the new dataset. Some functions in common_ner.py are from Said's sequence classification PR. I will merge this with common.py once Said completes his PR.


    5/20 updates: @saidbleik @miguelgfierro I made another update based on our discussion last week.

    I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:

    a. Single-sentence data with label: (sentence_text, label)
    b. Single-sentence data without label: (sentence_text,)
    c. Two-sentence data with label: (sentence_1_text, sentence_2_text, label)
    d. Two-sentence data without label: (sentence_1_text, sentence_2_text)

    As you can see, a and d can be confusing, unless we have different sets of code for single-sentence tasks and two-sentence tasks. I renamed InputExample to BertInputData and created a namedtuple version of it (a rough sketch follows below). Please take a look at bert_data_utils.py.
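    A rough illustration of what such a namedtuple might look like; the field names here are assumptions for illustration, not the actual definition in bert_data_utils.py:

        from collections import namedtuple

        # Hypothetical shape of BertInputData; field names are illustrative.
        # Optional fields default to None, so single-sentence and unlabeled
        # examples stay unambiguous, unlike bare tuples of varying length.
        BertInputData = namedtuple("BertInputData", ["text_a", "text_b", "label"])
        BertInputData.__new__.__defaults__ = (None, None)  # text_b, label optional

        labeled_single = BertInputData("The cat sat on the mat.", label=1)
        unlabeled_pair = BertInputData("A man is eating.", "Someone is having a meal.")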

    I'm still keeping the tokenization step outside of the classifier, but I changed the tokenization utility function to output a TensorDataset instead of InputFeature. TensorDataset helps wrap multiple tensors without using InputFeature.

    I'm flexible with using or not using the configuration class.

    Let's seek more evidence to finalize these decisions as Miguel suggested.

    5/16 updates: @saidbleik @miguelgfierro I made another pass through the code. Three major changes:

    1. Consolidated some utility functions into the BertTokenClassifier class.
    2. Removed some unnecessary configurations.
    3. Added docstring.

    In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.

    • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized with a dictionary. Users can use a yaml file to set parameters, then load the yaml file into a dictionary. I think this makes the code less verbose when we want to give the user more control, and also makes it easier for users to document how they run their experiments.
      I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
    • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps users understand better what they are doing. I think we want to abstract things to improve reusability, but a sequence of smaller black boxes may help users understand the process better than one big black box.
    • Keep the InputExample and InputFeatures classes, and use the PyTorch DataLoader instead of a custom function to create batches. I think using standard data structures will make the code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help the user understand how BERT works.

    I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

    I still need to refine some functions and improve the formatting, but want to create this PR for people to review and comment. @miguelgfierro

    opened by hlums 8
  • [ASK] Improve user experience for long running notebooks

    Description

    Some notebooks take a long time to run. For an external data scientist who wants to try things out quickly and see how they work, this is not a pleasant experience. Here are some ideas for improvements:

    • Each notebook could add a note section describing the machine configuration (e.g., number of GPUs) and the estimated time to finish running the notebook, so users won't be surprised.
    • Another idea is to set the defaults of the notebooks to run on smaller data with smaller parameters, and then add another section to guide users through changing them for a larger experiment, so they know they'll face a long running time.

    Notebook running time (Last update: 8/1/2019)

    Machine: Azure DLVM Standard_NC12 with 2 GPUs

    | Scenario | Notebook Name | CPU | GPU |
    |----------|---------------|-----|-----|
    | entailment | entailment_xnli_multilingual | NA | ~20 hrs |
    | name_entity_recognition | ner_wikigold_bert | ~37 mins | ~6 mins |
    | embeddings | embedding_trainer | ~5 mins | ~5 mins |
    | interpret_NLP_models | understand_models | ~4 mins | ~2 mins |
    | text_classification | tc_mnil_bert | ~8.2 hrs | ~1.2 hrs |

    enhancement 
    opened by yijingchen 7
  • Add `$schema` to `cgmanifest.json`

    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you to ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs, like Visual Studio and Visual Studio Code. Most modern IDEs also provide code completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel-case and Pascal-case properties. However, the JSON schema doesn't have a case-insensitive mode. We therefore suggest camel case, as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again, we use Prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.

    opened by JamieMagee 0
  • This repo is missing important files

    There are important files that all Microsoft projects should have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 1
  • Adding Microsoft SECURITY.MD

    Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • [ASK] How to run on GPU

    Description

    I was running the code for text classification (tc_mnli_transformers.ipynb) and it keeps running on my CPU instead of my GPU. How can I change that? It's taking way too long to train as a result. Please help.

    Other Comments

    opened by poko1 0
  • [ASK] transformers.abstractive_summarization_bertsum.py not importing transformers

    Description

    I ran the following code in Google Colab:

    !pip install --upgrade 
    !pip install -q git+https://github.com/microsoft/nlp-recipes.git
    !pip install jsonlines
    !pip install pyrouge
    !pip install scrapbook
    
    import os
    import shutil
    import sys
    from tempfile import TemporaryDirectory
    import torch
    import nltk
    from nltk import tokenize
    import pandas as pd
    import pprint
    import scrapbook as sb
    
    nlp_path = os.path.abspath("../../")
    if nlp_path not in sys.path:
        sys.path.insert(0, nlp_path)
    
    from utils_nlp import models
    from utils_nlp.models import transformers 
    from utils_nlp.models.transformers.abstractive_summarization_bertsum \
         import BertSumAbs, BertSumAbsProcessor
    

    It breaks on the last line, and I get the following error:

    /usr/local/lib/python3.7/dist-packages/utils_nlp/models/transformers/abstractive_summarization_bertsum.py in <module>()
         15 from torch.utils.data.distributed import DistributedSampler
         16 from tqdm import tqdm
    ---> 17 from transformers import AutoTokenizer, BertModel
         18 
         19 from utils_nlp.common.pytorch_utils import (
    
    ModuleNotFoundError: No module named 'transformers'
    

    In summary, the code in abstractive_summarization_bertsum.py doesn't resolve the transformers package where it is imported inside the transformers folder. Is it something to be fixed on your side?

    opened by neqkir 1
  • [ASK] Error while running extractive_summarization_cnndm_transformer.ipynb

    When I run the code below:

        summarizer.fit(
            ext_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE,
            gradient_accumulation_steps=2,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
            clip_grad_norm=False,
            use_preprocessed_data=USE_PREPROCSSED_DATA,
        )

    it gives me an error like this:

    Iteration:   0%|          | 0/199 [00:00<?, ?it/s]
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-40-343cf59f0aa4> in <module>()
         12             report_every=REPORT_EVERY,
         13             clip_grad_norm=False,
    ---> 14             use_preprocessed_data=USE_PREPROCSSED_DATA
         15         )
         16 
    
    11 frames
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in fit(self, train_dataset, num_gpus, gpu_ids, batch_size, local_rank, max_steps, warmup_steps, learning_rate, optimization_method, max_grad_norm, beta1, beta2, decay_method, gradient_accumulation_steps, report_every, verbose, seed, save_every, world_size, rank, use_preprocessed_data, **kwargs)
        775             report_every=report_every,
        776             clip_grad_norm=False,
    --> 777             save_every=save_every,
        778         )
        779 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/common.py in fine_tune(self, train_dataloader, get_inputs, device, num_gpus, max_steps, global_step, max_grad_norm, gradient_accumulation_steps, optimizer, scheduler, fp16, amp, local_rank, verbose, seed, report_every, save_every, clip_grad_norm, validation_function)
        191                 disable=local_rank not in [-1, 0] or not verbose,
        192             )
    --> 193             for step, batch in enumerate(epoch_iterator):
        194                 inputs = get_inputs(batch, device, self.model_name)
        195                 outputs = self.model(**inputs)
    
    /usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
       1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
       1103 
    -> 1104         for obj in iterable:
       1105             yield obj
       1106             # Update and possibly print the progressbar.
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
        519             if self._sampler_iter is None:
        520                 self._reset()
    --> 521             data = self._next_data()
        522             self._num_yielded += 1
        523             if self._dataset_kind == _DatasetKind.Iterable and \
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
        559     def _next_data(self):
        560         index = self._next_index()  # may raise StopIteration
    --> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        562         if self._pin_memory:
        563             data = _utils.pin_memory.pin_memory(data)
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         45         else:
         46             data = self.dataset[possibly_batched_index]
    ---> 47         return self.collate_fn(data)
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate_fn(data)
        744             def collate_fn(data):
        745                 return self.processor.collate(
    --> 746                     data, block_size=self.max_pos_length, device=device
        747                 )
        748 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate(self, data, block_size, device, train_mode)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in <listcomp>(.0)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in encode_single(self, d, block_size, train_mode)
        539             + ["[SEP]"]
        540         )
    --> 541         src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        542         _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        543         segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in convert_tokens_to_ids(self, tokens)
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _convert_token_to_id_with_added_voc(self, token)
    
    TypeError: Can't convert 0 to PyString
    

    P.S. I tried to run this code using the Google Colab free GPU.

    Any help is welcome :)

    opened by ToonicTie 2
Releases (2.2.0)
  • 2.2.0 (Mar 30, 2020)

    Text Summarization

    In this release, we support both abstractive and extractive text summarization.

    New Model: UniLM

    UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can be fine-tuned on different types of labeled data for various NLP tasks, such as text classification and abstractive summarization.

    Supported Models

    • unilm-large-cased
    • unilm-base-cased

    For more info about UniLM, please refer to the following:

    Thanks to the UniLM team, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, for their great work and support for the integration.

    New Model: BERTSum

    BERTSum is an encoder architecture designed for text summarization. It can be used together with different decoders to support both extractive and abstractive summarization.

    Supported Models

    • bert-base-uncased (extractive and abstractive)
    • distilbert-base-uncased (extractive)

    Thanks to the original authors Yang Liu and Mirella Lapata for their great contribution.

    All model implementations support distributed training and multi-GPU inference. For abstractive summarization, we also support mixed-precision training and inference.
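    A rough usage sketch, pieced together from class names that appear elsewhere on this page (BertSumAbs, BertSumAbsProcessor); the constructor arguments and fit parameters shown are assumptions, so check the summarization notebooks for the actual API:

        # Hedged sketch only: argument names below are assumptions, not the verified API.
        from utils_nlp.models.transformers.abstractive_summarization_bertsum import (
            BertSumAbs,
            BertSumAbsProcessor,
        )

        processor = BertSumAbsProcessor()   # wraps tokenization and batch collation
        summarizer = BertSumAbs(processor)  # BERT encoder with an abstractive decoder

        # train_dataset would hold source/summary pairs prepared by the processor;
        # see the repository's summarization notebooks for the exact preparation steps.
        # summarizer.fit(train_dataset, num_gpus=1, batch_size=4, max_steps=1000)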

    Source code(tar.gz)
    Source code(zip)
  • v2.0.0 (Dec 4, 2019)

Owner
Microsoft (Open source projects and samples from Microsoft)