Natural Language Processing Best Practices & Examples

NLP Best Practices

In recent years, natural language processing (NLP) has improved rapidly in quality and usability, which has helped drive business adoption of artificial intelligence (AI) solutions. Researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that the tools can reduce the “time to market” by orders of magnitude by simplifying the experience from defining the business problem to developing the solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks like GLUE and SQuAD leaderboards. The models can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist that do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity detection, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service to train and deploy a model as a REST API given a user-provided training set. You could do Intent Classification as well as Named Entity Extraction by performing simple steps of providing example utterances and labelling them. It supports Active Learning, so your model always keeps learning and improving.

Target Audience

For this repository our target audience includes data scientists and machine learning engineers with varying levels of NLP knowledge as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions:

Scenarios

We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition etc.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face which allows users to easily load pretrained models and fine-tune them for different tasks.

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and fastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
|---|---|---|---|
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine if the text agrees with the hypothesis or not. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score given a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text to a continuous vector space of real numbers, usually of low dimension. | English |
| Sentiment Analysis | Dependency Parser, GloVe | Provides an example of how to train and use Aspect Based Sentiment Analysis with Azure ML and Intel NLP Architect. | English |
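
To make the Sentence Similarity and Embeddings rows more concrete, here is a minimal sketch that scores two texts by the cosine similarity of their averaged word vectors. The three-dimensional vectors and the helper names are invented for illustration; in practice the vectors would come from a pretrained model such as Word2Vec, fastText, or GloVe, and the repository's own utilities may work differently.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented for illustration only.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}


def average_vector(tokens, vectors):
    """Average the vectors of the known tokens (a very simple text embedding)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("none of the tokens has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


close = cosine_similarity(
    average_vector(["cat"], TOY_VECTORS), average_vector(["kitten"], TOY_VECTORS)
)
far = cosine_similarity(
    average_vector(["cat"], TOY_VECTORS), average_vector(["car"], TOY_VECTORS)
)
print(f"cat vs kitten: {close:.3f}, cat vs car: {far:.3f}")
```

With these toy values the related pair scores much higher than the unrelated pair, which is the behavior a real embedding model is expected to show at scale.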

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond what the prebuilt cognitive services offer and you want to explore custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is presented in notebooks across different scenarios to enhance the efficiency of developing natural language systems at scale and to support various tasks related to AI model development.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. There may be other Azure services or products used in the notebooks. Introductions and/or references to those will be provided in the notebooks themselves.

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
|---|---|
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

Build Branch Status
Linux CPU master Build Status
Linux CPU staging Build Status
Linux GPU master Build Status
Linux GPU staging Build Status
Comments
  • [ASK] Remove 'repo_metrics' folder

    Description

    During the team discussion, we agreed that the files in the 'repo_metrics' folder should be removed from the NLP repo and maintained in a centralized way instead.

    Other Comments

    enhancement 
    opened by yijingchen 31
  • Fix broken data path and add git clone cell

    Description

    The notebook 'embedding_trainer.ipynb' cannot be run end to end. Fixed related issues:

    • Added the !git clone http://github.com/stanfordnlp/glove command inside the Jupyter Notebook. I found the experience smoother this way. However, I don't know why the author decided to leave it out. Please ask the author to verify this in case there are other risks that I'm not aware of.

    • The command cd glove && make gives an error, so I didn't add a cell for !cd glove && make. The notebook runs fine without this command. Please check with the author whether it is necessary to include it.

    • The data path was updated in the util file behind from utils_nlp.dataset import stsbenchmark, but this notebook had not been updated to match. I modified the path in this notebook.

    • With these fixes, test pipeline should be able to run this notebook end to end.

    Related Issues

    https://github.com/microsoft/nlp/issues/230

    Checklist:

    • [X] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [x] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by yijingchen 19
  • GenSen on AML deep dive notebook (sentence similarity)

    1. This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity, building one of the advanced models, GenSen, on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPUs.

    The notebook includes: data loading and preprocessing, training the GenSen model with distributed PyTorch and Horovod on AzureML, and tuning with HyperDrive. Evaluation and deployment will be added later. In addition, a comparison of training and tuning on AML vs. a VM will be added once this initial PR is merged with staging.

    2. Provide refactored GenSen code in utils_nlp to make the model reusable.

    We provide a distributed PyTorch with Horovod implementation of the paper along with pre-trained models as well as code to evaluate these models on a variety of transfer learning benchmarks. This code is based on the GitHub codebase from Maluuba, but we have refactored the code in the following aspects:

    1. Support a distributed PyTorch with Horovod
    2. Clean and refactor the original code in a more structured form
    3. Change the training file (train.py) so that training stops when the validation loss reaches a local minimum instead of running indefinitely
    4. Update the code from Python 2.7 to 3+ and PyTorch from 0.2 or 0.3 to 1.0.1
    5. Add some necessary comments
    6. Add some code for training on AzureML platform
    7. Fix the bug where setting the batch size to 1 raises an error during training
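
The stopping rule in item 3 can be sketched as follows. This is a simplified illustration, not the code in train.py: the function name, the `patience` parameter, and the use of a plain list of losses in place of a real training loop are all assumptions.

```python
def train_with_early_stopping(validation_losses, patience=1):
    """Stop once the validation loss has not improved for `patience` epochs.

    `validation_losses` stands in for a real loop that trains one epoch and then
    evaluates; returns (index of the last epoch run, best validation loss seen).
    """
    best_loss = float("inf")
    bad_epochs = 0
    epoch = -1  # stays -1 if no epoch is run at all
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # the loss has passed a local minimum
    return epoch, best_loss


stopped_at, best = train_with_early_stopping([1.0, 0.8, 0.7, 0.75, 0.6])
print(stopped_at, best)
```

Note that with `patience=1` the loop stops at the first increase and never sees the later, lower loss of 0.6; a larger patience trades extra training time for robustness against such local minima.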
    opened by catherine667 16
  • [BUG] Cannot set up nlp_gpu environment

    Description

    The following errors show up while setting up the GPU environment:

    Collecting package metadata: done Solving environment: failed

    ResolvePackageNotFound: cudatoolkit==9.2

    How do we replicate the bug?

    Machine: Microsoft Azure Deep Learning Virtual Machine Standard NC6
    Operating System: Windows
    Code:

        cd nlp
        python tools/generate_conda_file.py --gpu
        conda env create -n nlp_gpu -f nlp_gpu.yaml

    Expected behavior (i.e. solution)

    The installation should complete without errors.

    Other Comments

    I changed some package versions in the yaml file to this:

    • cudatoolkit>=9.2
    • tensorflow-gpu>=1.12.0

    The installation proceeds with the above configuration, but another error occurs, shown below:

    ---------------------------------------- ERROR: Command "'C:\Anaconda\envs\nlp_gpu\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\adminyijing\AppData\Local\Temp\2\pip-record-su74tqwf\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\

    bug 
    opened by yijingchen 13
  • [FEATURE] Check that all AzureML notebooks are tested

    Description

    Need to add .azureml folder to provide the common AzureML subscription for AzureML notebooks.

    How do we replicate the bug?

    The .azureml folder should contain a config.json file, which can be downloaded from the workspace and should look like this: "{"Id": null, "Scope": "/subscriptions/[ID]/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/[Workspace Name]"}".
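
For illustration, a file of that shape could be generated as in the sketch below. The helper name and its arguments are hypothetical, the subscription ID and workspace name are placeholders, and the example writes into a temporary directory rather than a real repository checkout.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(root, subscription_id, workspace_name, resource_group="nlprg"):
    """Write .azureml/config.json under `root` in the shape quoted in this issue."""
    scope = (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}"
    )
    config_dir = Path(root) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    # "Id" is null in the example from the issue, hence None here.
    config_path.write_text(json.dumps({"Id": None, "Scope": scope}))
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_workspace_config(tmp, "[ID]", "[Workspace Name]")
    loaded = json.loads(path.read_text())
    print(loaded["Scope"])
```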

    Related notebooks: GenSen https://github.com/microsoft/nlp/pull/199 and BERT https://github.com/microsoft/nlp/pull/191 notebook testing

    related to https://github.com/microsoft/nlp/issues/143

    Expected behavior (i.e. solution)

    Other Comments

    #262

    bug release-blocker 
    opened by catherine667 10
  • Staging to master to add github metrics

    Description

    In order to start recording the metrics we need to merge the metrics to master.

    @irshaffe when this is in master, I need to activate it from devops to be executed every day, it will store the metrics so you can start populating the powerbi dashboard. The original powerbi dashboard for Recommenders was done with Scott (@gramhagen)

    Related Issues

    #24

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 10
  • V chguan/add icml ex nlp code

    Description

    We have added the code of our ICML paper. The related files are:

    • interpreter.py and README.md files under utils_nlp\interpreter. The interpreter.py file is the main functional file we utilize. README.md is an instruction file on it.
    • explain_simple_model.ipynb and explain_BERT_model.ipynb files under scenarios\interpret_NLP_models for two scenarios on how to use interpreter.py.
    • test_interpreter.py under tests\unit. This file contains 6 unit tests for interpreter.py (which take about 2.25s to run on my machine).
    • example.png under utils_nlp\interpreter folder used by README.md, and regular.json under scenarios\interpret_NLP_models folder used by explain_BERT_model.ipynb. I know from other pull requests that files like these are not allowed to merge. So, can anyone help me upload these two files to somewhere? Thanks for your help in advance : )

    Related Issues

    Our issue is #62.

    Checklist:

    • My code follows the code style of this project, as detailed in our contribution guidelines.
    • I have added tests.
    • [ ] I have updated the documentation accordingly (I now add README.md to utils_nlp only. What other .md files should I modify or add?).
    opened by Frozenmad 9
  • Transformers

    Description

    Related Issues

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by saidbleik 8
  • Integration tests

    Description

    Integration and smoke tests

    @bethz, @jainr In the code I would like to have the sequence:

    1. create conda env
    2. run smoke
    3. run integration
    4. remove conda
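
The four-step sequence above could be sketched in Python as follows. This is only one possible shape for it: the environment name, the nlp_cpu.yaml file name, the tests directory, and the pytest markers `smoke` and `integration` are assumptions, and the `try`/`finally` is there so that the environment is removed even when a test stage fails.

```python
import subprocess


def pipeline_commands(env_name="nlp_test", env_file="nlp_cpu.yaml"):
    """Build the four stages as argument lists (names and markers are assumptions)."""
    create = ["conda", "env", "create", "-n", env_name, "-f", env_file]
    smoke = ["conda", "run", "-n", env_name, "pytest", "tests", "-m", "smoke"]
    integration = ["conda", "run", "-n", env_name, "pytest", "tests", "-m", "integration"]
    remove = ["conda", "env", "remove", "-n", env_name, "-y"]
    return create, [smoke, integration], remove


def run_pipeline(run=subprocess.check_call, **kwargs):
    """Run create -> smoke -> integration, and always remove the env afterwards."""
    create, tests, remove = pipeline_commands(**kwargs)
    run(create)
    try:
        for command in tests:
            run(command)
    finally:
        run(remove)


# Dry run: record the commands instead of executing them.
recorded = []
run_pipeline(run=recorded.append)
for command in recorded:
    print(" ".join(command))
```

Passing the runner as a parameter keeps the ordering logic testable without conda installed; calling `run_pipeline()` with the default runner would execute the commands for real.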

    Beth told me that there might be a more elegant way of doing this. Can you please offer some guidance?

    @saidbleik @sharatsc the scheduler is not working at the moment (you might have seen the emails to devops). As a temporary solution I thought of running this pipeline every time there is a PR to master. Feel free to propose another idea.

    Related Issues

    #25

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 8
  • Hlu/bert ner utils

    5/31: Notebook is updated with new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


    5/29 updates: @saidbleik @miguelgfierro I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments.) Please ignore the bert_data_utils.py file for now, I need to update it for the new dataset. Some functions in the common_ner.py are from Said's sequence classification PR. I will merge this with the common.py once Said completes his PR.


    5/20 updates: @saidbleik @miguelgfierro I made another update based on our discussion last week.

    I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:

    a. Single sentence data with label: (sentence_text, label)
    b. Single sentence data without label: (sentence_text,)
    c. Two sentence data with label: (sentence_1_text, sentence_2_text, label)
    d. Two sentence data without label: (sentence_1_text, sentence_2_text)

    As you can see, a and d can be confusing, unless we have different sets of code for single-sentence tasks and two-sentence tasks. I renamed InputExample to BertInputData and created a namedtuple version of it. Please take a look at bert_data_utils.py.
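
The ambiguity between scenarios a and d, and how a namedtuple avoids it, can be illustrated with the sketch below. The field names `text_a`, `text_b`, and `label` are assumptions made for this example; the actual BertInputData in bert_data_utils.py may be defined differently.

```python
from collections import namedtuple

# Hypothetical field names; text_b and label default to None when absent.
BertInputData = namedtuple(
    "BertInputData", ["text_a", "text_b", "label"], defaults=(None, None)
)

# As plain tuples, scenario a (sentence, label) and scenario d (sentence, sentence)
# are both 2-tuples of strings, so code receiving them cannot tell them apart.
single_with_label = ("great movie", "positive")
pair_without_label = ("a man is eating", "someone is eating")
print(len(single_with_label) == len(pair_without_label))

# With named fields, the role of each element is explicit.
example_a = BertInputData(text_a="great movie", label="positive")
example_d = BertInputData(text_a="a man is eating", text_b="someone is eating")
print(example_a)
print(example_d)
```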

    I'm still keeping the tokenization step outside of the classifier, but changed the tokenization utility function to output TensorDataset instead of InputFeature. TensorDataset helps wrapping multiple tensors without using InputFeature.

    I'm flexible with using or not using the configuration class.

    Let's seek more evidence to finalize these decisions as Miguel suggested.

    5/16 updates: @saidbleik @miguelgfierro I made another pass through the code. Three major changes:

    1. Consolidated some utility functions into the BertTokenClassifier class.
    2. Removed some unnecessary configurations.
    3. Added docstring.

    In general, I followed the BertSequenceClassifer Said wrote, but made a few different design decisions.

    • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized using a dictionary. User can use a yaml file to set parameters, then load the yaml file into a dictionary. I think this would make code less verbose when we want to give the user more control and also make it easier for user to document how they run their experiments.
      I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
    • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps the users understand better what they are doing. I think we want to abstract things to improve reusability, but a sequence of smaller black boxes may help the user understand the process better than one big black box.
    • Keep the InputExample and InputFeatures classes, and use PyTorch Dataloader instead of a custom function to create batches. I think using some standard data structures will make the code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help the user understand how BERT works.
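
The first design decision, a single configuration object initialized from a dictionary, might look roughly like the sketch below. The class name `FineTuneConfig`, the parameter names, and the default values are hypothetical stand-ins and are not taken from the PR; the dictionary could equally be produced by loading a YAML file (for example with PyYAML's `yaml.safe_load`).

```python
class FineTuneConfig:
    """Hypothetical stand-in for the dictionary-initialized config described above."""

    # Illustrative parameters and defaults only.
    DEFAULTS = {"learning_rate": 5e-5, "batch_size": 32, "num_train_epochs": 1}

    def __init__(self, config_dict=None):
        config_dict = config_dict or {}
        unknown = set(config_dict) - set(self.DEFAULTS)
        if unknown:
            # Fail loudly on typos instead of silently ignoring a setting.
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        for name, value in {**self.DEFAULTS, **config_dict}.items():
            setattr(self, name, value)

    def to_dict(self):
        """Return all settings, e.g. to log them alongside experiment results."""
        return {name: getattr(self, name) for name in self.DEFAULTS}


config = FineTuneConfig({"batch_size": 16})
print(config.to_dict())
```

Keeping every setting in one serializable object is what makes it straightforward to store the configuration with the model or document how an experiment was run.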

    I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

    I still need to refine some functions and improve the formatting, but want to create this PR for people to review and comment. @miguelgfierro

    opened by hlums 8
  • [ASK] Improve user experience for long running notebooks

    Description

    Some notebooks take a long time to run. For external data scientists who want to try them out quickly and see how things work, this is not a pleasant experience. Here are some ideas for improvements:

    • Add a note section to each notebook describing the machine configuration (e.g. number of GPUs) and the estimated time to finish running the notebook, so that users won't be surprised.
    • Set the notebooks to run on smaller data and smaller parameters by default, and add another section that guides users in switching to the larger experiment, so they know they will face a long running time.
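
The second idea could be sketched as a single switch at the top of a notebook. The flag name `QUICK_RUN`, the helper, the sample fraction, and the epoch counts are all hypothetical values chosen for illustration.

```python
import random

QUICK_RUN = True  # hypothetical notebook-level switch; set to False for the full experiment


def select_training_data(examples, quick_run, sample_fraction=0.1, seed=0):
    """Return all examples, or a reproducible random subset when quick_run is set."""
    examples = list(examples)
    if not quick_run:
        return examples
    sample_size = min(len(examples), max(1, int(len(examples) * sample_fraction)))
    return random.Random(seed).sample(examples, sample_size)


data = list(range(1000))  # stand-in for a real dataset
train = select_training_data(data, QUICK_RUN)
num_epochs = 1 if QUICK_RUN else 3
print(len(train), num_epochs)
```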

    Notebook running time (Last update: 8/1/2019)

    Machine: Azure DLVM Standard_NC12 with 2 GPU

    | Scenario | Notebook Name | CPU | GPU |
    |---|---|---|---|
    | entailment | entailment_xnli_multilingual | NA | ~20hrs |
    | name_entity_recognition | ner_wikigold_bert | ~37mins | ~6mins |
    | embeddings | embedding_trainer | ~5mins | ~5mins |
    | interpret_NLP_models | understand_models | ~4mins | ~2mins |
    | text_classification | tc_mnil_bert | ~8.2hrs | ~1.2hrs |

    enhancement 
    opened by yijingchen 7
  • Add `$schema` to `cgmanifest.json`

    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you to ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs like Visual Studio and Visual Studio Code. Most modern IDEs also provide code-completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel case and pascal case properties. However, the JSON schema doesn't have a case-insensitive mode. We therefore suggest camel case as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again we use prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.
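
As a rough illustration of the change itself, the sketch below adds a `$schema` key to a manifest. It is written in Python rather than the JSON.parse() and prettier tooling the pull request describes, and the helper name, the toy manifest contents, and the placeholder schema URL are assumptions made for this example.

```python
import json


def add_schema(manifest_text, schema_url):
    """Return the manifest re-serialized with "$schema" as its first key."""
    manifest = json.loads(manifest_text)
    manifest.pop("$schema", None)  # avoid keeping a stale value in its old position
    updated = {"$schema": schema_url, **manifest}
    return json.dumps(updated, indent=2) + "\n"


# Toy manifest and placeholder URL, for illustration only.
original = '{"version": 1, "registrations": []}'
result = add_schema(original, "https://example.com/cgmanifest.schema.json")
print(result)
```

Re-serializing the whole file is also why such a change can produce a large diff: indentation and spacing follow the serializer's rules, not the original file's.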

    opened by JamieMagee 0
  • This repo is missing important files

    There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 1
  • Adding Microsoft SECURITY.MD

    Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light-up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • [ASK] How to run on GPU

    Description

    I was running the code for text classification (tc_mnli_transformers.ipynb) and it keeps running on my CPU instead of my GPU. How can I change that? It's taking way too long to train as a result. Please help.
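
A first diagnostic step, assuming the notebook trains with PyTorch, is to check whether PyTorch can see a CUDA device at all. The sketch below uses only `torch.cuda.is_available()` and `torch.cuda.device_count()`; the helper name and the returned dictionary are made up for illustration, and the import is guarded so the check also runs where PyTorch is missing.

```python
try:
    import torch  # a CUDA-enabled PyTorch build is required for GPU training
except ImportError:
    torch = None


def describe_device():
    """Report which device PyTorch would be able to use in this environment."""
    if torch is None:
        return {"device": "cpu", "num_gpus": 0, "reason": "PyTorch is not installed"}
    if not torch.cuda.is_available():
        return {"device": "cpu", "num_gpus": 0, "reason": "CUDA is not visible to PyTorch"}
    return {"device": "cuda", "num_gpus": torch.cuda.device_count(), "reason": "ok"}


info = describe_device()
print(info)
```

If this reports `cpu`, the environment itself has no usable GPU setup (for example a CPU-only PyTorch build or missing drivers), and changing notebook parameters alone will not move training to the GPU.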

    Other Comments

    opened by poko1 0
  • [ASK] transformers.abstractive_summarization_bertsum.py not importing transformers

    Description

    I ran the following code in Google Colab:

    !pip install --upgrade 
    !pip install -q git+https://github.com/microsoft/nlp-recipes.git
    !pip install jsonlines
    !pip install pyrouge
    !pip install scrapbook
    
    import os
    import shutil
    import sys
    from tempfile import TemporaryDirectory
    import torch
    import nltk
    from nltk import tokenize
    import pandas as pd
    import pprint
    import scrapbook as sb
    
    nlp_path = os.path.abspath("../../")
    if nlp_path not in sys.path:
        sys.path.insert(0, nlp_path)
    
    from utils_nlp import models
    from utils_nlp.models import transformers 
    from utils_nlp.models.transformers.abstractive_summarization_bertsum \
         import BertSumAbs, BertSumAbsProcessor
    

    It breaks on the last line and I get the following error:

    /usr/local/lib/python3.7/dist-packages/utils_nlp/models/transformers/abstractive_summarization_bertsum.py in <module>()
         15 from torch.utils.data.distributed import DistributedSampler
         16 from tqdm import tqdm
    ---> 17 from transformers import AutoTokenizer, BertModel
         18 
         19 from utils_nlp.common.pytorch_utils import (
    
    ModuleNotFoundError: No module named 'transformers'
    

    In summary, the code in abstractive_summarization_bertsum.py, which is located in the transformers folder, fails to resolve transformers. Is this something to be fixed on your side?
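
The import on line 17 of the traceback is an absolute import of a top-level package named transformers, so the error means that no such package is importable in the Colab environment. A quick way to check which top-level packages are missing, sketched with the standard library only (the helper name is made up for this example):

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages the failing cell relies on, directly or through utils_nlp.
missing = missing_packages(["torch", "transformers", "nltk"])
print("missing:", missing)
```

Any package listed as missing would need to be installed before the imports in the cell can succeed.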

    opened by neqkir 1
  • [ASK] Error while running extractive_summarization_cnndm_transformer.ipynb

    When I run the code below:

        summarizer.fit(
            ext_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE,
            gradient_accumulation_steps=2,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
            clip_grad_norm=False,
            use_preprocessed_data=USE_PREPROCSSED_DATA
        )

    it gives me an error like this:

    Iteration:   0%|          | 0/199 [00:00<?, ?it/s]
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-40-343cf59f0aa4> in <module>()
         12             report_every=REPORT_EVERY,
         13             clip_grad_norm=False,
    ---> 14             use_preprocessed_data=USE_PREPROCSSED_DATA
         15         )
         16 
    
    11 frames
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in fit(self, train_dataset, num_gpus, gpu_ids, batch_size, local_rank, max_steps, warmup_steps, learning_rate, optimization_method, max_grad_norm, beta1, beta2, decay_method, gradient_accumulation_steps, report_every, verbose, seed, save_every, world_size, rank, use_preprocessed_data, **kwargs)
        775             report_every=report_every,
        776             clip_grad_norm=False,
    --> 777             save_every=save_every,
        778         )
        779 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/common.py in fine_tune(self, train_dataloader, get_inputs, device, num_gpus, max_steps, global_step, max_grad_norm, gradient_accumulation_steps, optimizer, scheduler, fp16, amp, local_rank, verbose, seed, report_every, save_every, clip_grad_norm, validation_function)
        191                 disable=local_rank not in [-1, 0] or not verbose,
        192             )
    --> 193             for step, batch in enumerate(epoch_iterator):
        194                 inputs = get_inputs(batch, device, self.model_name)
        195                 outputs = self.model(**inputs)
    
    /usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
       1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
       1103 
    -> 1104         for obj in iterable:
       1105             yield obj
       1106             # Update and possibly print the progressbar.
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
        519             if self._sampler_iter is None:
        520                 self._reset()
    --> 521             data = self._next_data()
        522             self._num_yielded += 1
        523             if self._dataset_kind == _DatasetKind.Iterable and \
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
        559     def _next_data(self):
        560         index = self._next_index()  # may raise StopIteration
    --> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        562         if self._pin_memory:
        563             data = _utils.pin_memory.pin_memory(data)
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         45         else:
         46             data = self.dataset[possibly_batched_index]
    ---> 47         return self.collate_fn(data)
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate_fn(data)
        744             def collate_fn(data):
        745                 return self.processor.collate(
    --> 746                     data, block_size=self.max_pos_length, device=device
        747                 )
        748 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate(self, data, block_size, device, train_mode)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in <listcomp>(.0)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in encode_single(self, d, block_size, train_mode)
        539             + ["[SEP]"]
        540         )
    --> 541         src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        542         _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        543         segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in convert_tokens_to_ids(self, tokens)
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _convert_token_to_id_with_added_voc(self, token)
    
    TypeError: Can't convert 0 to PyString
    

    P.S. I am running this code on a free Google Colab GPU.

    Any help is welcome :)

    opened by ToonicTie
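
Going only by the final message in the traceback above, the value `0` appears to have reached `convert_tokens_to_ids` where a string token was expected. A minimal sketch for checking a token list before it is handed to the tokenizer is shown below. The helper name `find_non_string_tokens` and the sample list are hypothetical and are not part of nlp-recipes or transformers; whether a stray integer is the actual cause in this issue is an assumption.

```python
from typing import Any, List, Tuple


def find_non_string_tokens(tokens: List[Any]) -> List[Tuple[int, Any]]:
    """Return (position, value) pairs for every entry that is not a str."""
    return [(i, t) for i, t in enumerate(tokens) if not isinstance(t, str)]


# Hypothetical token list with a stray integer, mimicking the suspected input.
sample_tokens = ["[CLS]", "hello", 0, "world", "[SEP]"]

bad_entries = find_non_string_tokens(sample_tokens)
if bad_entries:
    # Report the offending positions instead of letting the tokenizer fail.
    print("Non-string tokens found:", bad_entries)
```

If a check like this reports entries for the data passed to `encode_single`, the preprocessing step that produced the `src` tokens would be the next place to look.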
Releases (2.2.0)
  • 2.2.0 (Mar 30, 2020)

    Text Summarization

    In this release, we support both abstractive and extractive text summarization.

    New Model: UniLM

    UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is pretrained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can be fine-tuned on different types of labeled data for various NLP tasks such as text classification and abstractive summarization.

    Supported Models

    • unilm-large-cased
    • unilm-base-cased

    For more info about UniLM, please refer to the following:

    Thanks to the UniLM team, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, for their great work and support for the integration.

    New Model: BERTSum

    BERTSum is an encoder architecture designed for text summarization. It can be used together with different decoders to support both extractive and abstractive summarization.

    Supported Models

    • bert-base-uncased (extractive and abstractive)
    • distilbert-base-uncased (extractive)

    Thanks to the original authors Yang Liu and Mirella Lapata for their great contribution.

    All model implementations support distributed training and multi-GPU inference. For abstractive summarization, we also support mixed-precision training and inference.

  • v2.0.0 (Dec 4, 2019)

Owner
Microsoft
Open source projects and samples from Microsoft
Improving Elasticsearch for semantic search using Sentence Embeddings

Improving Elasticsearch for semantic search using Sentence Embeddings. In this article I will use the pretrained model SimCS

Vo Van Phuc 18 Nov 25, 2022
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
CoNLL-English NER Task (NER in English)

CoNLL-English NER Task en | ch Motivation Course Project review the pytorch framework and sequence-labeling task practice using the transformers of Hu

Kevin 2 Jan 14, 2022
DaCy: The State of the Art Danish NLP pipeline using SpaCy

DaCy: A SpaCy NLP Pipeline for Danish DaCy is a Danish preprocessing pipeline trained in SpaCy. At the time of writing it has achieved State-of-the-Ar

Kenneth Enevoldsen 71 Jan 06, 2023
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
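2021 Haihua AI Challenge · Chinese Reading Comprehension · Technical Track · Third Place

Text is the most basic tool humans use to record and express themselves, and an important medium for spreading information. Through text and symbols we can trace the origins of human civilization and pass on knowledge and experience; understanding text is the first step toward knowing and understanding. For artificial intelligence, one of the core problems is cognition, and the core of cognition is semantic understanding.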

21 Dec 26, 2022
Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023
Prompt tuning toolkit for GPT-2 and GPT-Neo

mkultra mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo. Prompt tuning injects a string of 20-100 special tokens into the context in order to

61 Jan 01, 2023
RecipeReduce: Simplified Recipe Processing for Lazy Programmers

RecipeReduce This repo will help you figure out the amount of ingredients to buy for a certain number of meals with selected recipes. RecipeReduce Get

Qibin Chen 9 Apr 22, 2022
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 02, 2023
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 01, 2023
BERN2: an advanced neural biomedical named entity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by

DMIS Laboratory - Korea University 99 Jan 06, 2023
Various Algorithms for Short Text Mining

Short Text Mining in Python Introduction This package shorttext is a Python package that facilitates supervised and unsupervised learning for short te

Kwan-Yuet 466 Dec 06, 2022
Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference Source code for RCDG model in AAAI20 Generating Persona Consistent Di

16 Oct 08, 2022
Optimal Transport Tools (OTT), A toolbox for all things Wasserstein.

Optimal Transport Tools (OTT), A toolbox for all things Wasserstein. See full documentation for detailed info on the toolbox. The goal of OTT is to pr

OTT-JAX 255 Dec 26, 2022
Unofficial PyTorch implementation of Google AI's VoiceFilter system

VoiceFilter Note from Seung-won (2020.10.25) Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-sour

MINDs Lab 881 Jan 03, 2023
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Amazon Web Services - Labs 1.1k Dec 27, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
A Python package for deep multilingual punctuation prediction.

This Python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022
Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

Vikash Singh 5.3k Jan 01, 2023