Natural Language Processing Best Practices & Examples

NLP Best Practices

In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that these tools can significantly reduce the "time to market" by simplifying the experience from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools across a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks, such as the GLUE and SQuAD leaderboards. The models can be used in a number of applications, ranging from simple text classification to sophisticated intelligent chatbots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist that do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity recognition, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.
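As a rough illustration, here is a minimal sketch of calling the sentiment endpoint over plain HTTP, assuming a Text Analytics v3.0 resource; the resource name and key below are placeholders you would replace with your own:

    import requests

    # Placeholders: substitute your own Cognitive Services resource and key.
    endpoint = "https://<your-resource>.cognitiveservices.azure.com"
    key = "<your-subscription-key>"

    documents = {"documents": [
        {"id": "1", "language": "en", "text": "The new release works great!"},
    ]}

    response = requests.post(
        f"{endpoint}/text/analytics/v3.0/sentiment",
        headers={"Ocp-Apim-Subscription-Key": key},
        json=documents,
    )
    print(response.json())  # per-document sentiment labels and confidence scores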

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service to train and deploy a model as a REST API given a user-provided training set. You can perform intent classification as well as named entity extraction by simply providing example utterances and labelling them. It supports active learning, so your model keeps learning and improving.

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions:

Scenarios

We aim to have end-to-end examples of common tasks and scenarios, such as text classification and named entity recognition.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face, which allows users to easily load pretrained models and fine-tune them for different tasks.
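As a minimal sketch of what this looks like with the transformers package (the model name and toy batch are illustrative, not the repository's own utilities):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load a pretrained checkpoint with a fresh classification head.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # One fine-tuning step on a toy batch.
    texts = ["a great movie", "a waste of time"]
    labels = torch.tensor([1, 0])
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    outputs = model(**batch, labels=labels)  # loss is computed internally
    outputs.loss.backward()
    optimizer.step()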

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and fastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
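As a quick illustration of that multilingual support, a single checkpoint such as bert-base-multilingual-cased tokenizes text in many languages with one shared vocabulary (a hedged sketch, not a repository utility):

    from transformers import AutoTokenizer

    # One multilingual checkpoint covers 100+ languages out of the box.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    for text in ["The weather is nice.", "Das Wetter ist schön.", "天气很好。"]:
        print(tokenizer.tokenize(text))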

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
|----------|--------|-------------|-----------|
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task that summarizes the input text into a shorter paragraph. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine whether the text agrees with the hypothesis. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, given a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score given a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension (see the sketch after the table). | English |
| Sentiment Analysis | Dependency parser, GloVe | Provides an example of training and using Aspect-Based Sentiment Analysis with Azure ML and Intel NLP Architect. | English |
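To make the Embeddings and Sentence Similarity rows concrete, here is a minimal sketch, assuming gensim 4.x, that trains Word2Vec on a toy corpus and scores a sentence pair by cosine similarity of averaged word vectors (the repository's notebooks use larger corpora and stronger models):

    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus; real training would use a large tokenized text collection.
    corpus = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "lay", "on", "the", "rug"],
        ["stocks", "fell", "on", "monday"],
    ]
    model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

    def sentence_vector(tokens):
        # Average the word vectors of in-vocabulary tokens.
        vectors = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vectors, axis=0)

    a = sentence_vector(["the", "cat", "sat"])
    b = sentence_vector(["the", "dog", "lay"])
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"similarity: {cosine:.3f}")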

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond the bounds of the prebuilt services and you want to explore custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is used in notebooks across different scenarios to enhance the efficiency of developing natural language systems at scale and for various AI model development tasks, such as tracking experiments, distributed training, and hyperparameter tuning.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the notebooks; an introduction and/or references for those are provided in the notebooks themselves.
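As a minimal sketch, assuming the azureml-core SDK and a workspace config.json downloaded from the portal, submitting a training script to a GPU cluster looks roughly like this (the script and compute target names are hypothetical placeholders):

    from azureml.core import Experiment, ScriptRunConfig, Workspace

    # Assumes config.json for your workspace is present in the working directory.
    ws = Workspace.from_config()

    # "train.py" and "gpu-cluster" are hypothetical names for illustration.
    run_config = ScriptRunConfig(
        source_directory=".",
        script="train.py",
        compute_target="gpu-cluster",
    )
    run = Experiment(workspace=ws, name="nlp-finetune").submit(run_config)
    run.wait_for_completion(show_output=True)  # stream logs until the run finishes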

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
|------------|-------------|
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

| Build | Branch | Status |
|-------|--------|--------|
| Linux CPU | master | Build Status |
| Linux CPU | staging | Build Status |
| Linux GPU | master | Build Status |
| Linux GPU | staging | Build Status |
Comments
  • [ASK] Remove 'repo_metrics' folder

    Description

    During a team discussion, we agreed that the files in the 'repo_metrics' folder should be removed from the NLP repo and maintained in a centralized way.

    Other Comments

    enhancement 
    opened by yijingchen 31
  • Fix broken data path and add git clone cell

    Description

    This notebook, 'embedding_trainer.ipynb', cannot be run end-to-end. Fixed the related issues:

    • Added the `!git clone http://github.com/stanfordnlp/glove` command inside the Jupyter notebook. I found the experience smoother this way; however, I don't know why the author decided to leave it out. Please ask the author to verify this in case there are other risks that I'm not aware of.

    • The command `cd glove && make` gives an error, and I didn't add a cell for `!cd glove && make`. The notebook runs fine without this command. Please check with the author to see whether it is necessary to include it.

    • The import `from utils_nlp.dataset import stsbenchmark` had its data path updated in the util file, but this notebook hadn't been updated accordingly. I modified the path in this notebook.

    • With these fixes, the test pipeline should be able to run this notebook end-to-end.

    Related Issues

    https://github.com/microsoft/nlp/issues/230

    Checklist:

    • [X] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [x] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by yijingchen 19
  • GenSen on AML deep dive notebook (sentence similarity)

    1. This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity, building one of the advanced models, GenSen, on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPUs.

    The notebook includes data loading and preprocessing, training the GenSen model with distributed PyTorch with Horovod on AzureML, and tuning with HyperDrive. Evaluation and deployment will be added later. In addition, the comparison results for training and tuning on AML vs. a VM will be added once this initial PR is merged into staging.

    2. Provides refactored GenSen code in utils_nlp to make the model reusable.

    We provide a distributed PyTorch with Horovod implementation of the paper, along with pre-trained models and code to evaluate these models on a variety of transfer learning benchmarks. This code is based on the GitHub codebase from Maluuba, but we have refactored the code in the following aspects:

    1. Support distributed PyTorch with Horovod
    2. Clean and refactor the original code into a more structured form
    3. Change the training script (train.py) from non-stopping to stopping when the validation loss reaches a local minimum
    4. Update the code from Python 2.7 to 3+ and PyTorch from 0.2/0.3 to 1.0.1
    5. Add some necessary comments
    6. Add code for training on the AzureML platform
    7. Fix a bug where setting the batch size to 1 raised an error during training
    opened by catherine667 16
  • [BUG] Cannot set up nlp_gpu environment

    Description

    The following errors show up while setting up the GPU environment:

        Collecting package metadata: done
        Solving environment: failed

        ResolvePackageNotFound: cudatoolkit==9.2

    How do we replicate the bug?

    Machine: Microsoft Azure Deep Learning Virtual Machine, Standard NC6
    Operating System: Windows
    Code:

        cd nlp
        python tools/generate_conda_file.py --gpu
        conda env create -n nlp_gpu -f nlp_gpu.yaml

    Expected behavior (i.e. solution)

    The installation should complete without errors.

    Other Comments

    I changed some package versions in the yaml file to this:

    • cudatoolkit>=9.2
    • tensorflow-gpu>=1.12.0

    The installation proceeds with the above configuration; however, another error occurs, shown below:

        ----------------------------------------
        ERROR: Command "'C:\Anaconda\envs\nlp_gpu\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\adminyijing\AppData\Local\Temp\2\pip-record-su74tqwf\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\

    bug 
    opened by yijingchen 13
  • [FEATURE] Check that all AzureML notebooks are tested

    Description

    Need to add a .azureml folder to provide the common AzureML subscription for the AzureML notebooks.

    How do we replicate the bug?

    Under the .azureml folder, there should be a config.json file (downloadable from the workspace) that looks like this:

        {"Id": null, "Scope": "/subscriptions/[ID]/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/[Workspace Name]"}

    Related notebooks: GenSen https://github.com/microsoft/nlp/pull/199 and BERT https://github.com/microsoft/nlp/pull/191 notebook testing

    related to https://github.com/microsoft/nlp/issues/143

    Expected behavior (i.e. solution)

    Other Comments

    #262

    bug release-blocker 
    opened by catherine667 10
  • Staging to master to add github metrics

    Description

    In order to start recording the metrics, we need to merge the metrics code to master.

    @irshaffe when this is in master, I need to activate it from DevOps so that it is executed every day; it will store the metrics so you can start populating the Power BI dashboard. The original Power BI dashboard for Recommenders was done with Scott (@gramhagen).

    Related Issues

    #24

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 10
  • V chguan/add icml ex nlp code

    Description

    We have added the code for our ICML paper. The related files are:

    • interpreter.py and README.md files under utils_nlp\interpreter. The interpreter.py file is the main functional file we utilize; README.md is an instruction file for it.
    • explain_simple_model.ipynb and explain_BERT_model.ipynb files under scenarios\interpret_NLP_models, covering two scenarios for how to use interpreter.py.
    • test_interpreter.py under tests\unit. This file contains 6 unit tests for interpreter.py (which, on my machine, take about 2.25 s to run).
    • example.png under the utils_nlp\interpreter folder, used by README.md, and regular.json under the scenarios\interpret_NLP_models folder, used by explain_BERT_model.ipynb. I know from other pull requests that files like these are not allowed to be merged. So, can anyone help me upload these two files somewhere? Thanks for your help in advance :)

    Related Issues

    Our issue is #62.

    Checklist:

    • My code follows the code style of this project, as detailed in our contribution guidelines.
    • I have added tests.
    • [ ] I have updated the documentation accordingly (I now add README.md to utils_nlp only. What other .md files should I modify or add?).
    opened by Frozenmad 9
  • Transformers

    Description

    Related Issues

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by saidbleik 8
  • Integration tests

    Description

    Integration and smoke tests

    @bethz, @jainr In the code I would like to have the sequence:

    1. create conda env
    2. run smoke
    3. run integration
    4. remove conda

    Beth told me that there might be a more elegant way of doing this. Can you please offer some guidance?

    @saidbleik @sharatsc the scheduler is not working at the moment (you might have seen the emails to DevOps). As a temporary solution, I thought of running this pipeline every time there is a PR to master. Feel free to propose another idea.

    Related Issues

    #25

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 8
  • Hlu/bert ner utils

    5/31: Notebook is updated with the new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


    5/29 updates: @saidbleik @miguelgfierro I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments). Please ignore the bert_data_utils.py file for now; I need to update it for the new dataset. Some functions in common_ner.py are from Said's sequence classification PR. I will merge this with common.py once Said completes his PR.


    5/20 updates: @saidbleik @miguelgfierro I made another update based on our discussion last week.

    I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:

    a. Single-sentence data with label: (sentence_text, label)
    b. Single-sentence data without label: (sentence_text,)
    c. Two-sentence data with label: (sentence_1_text, sentence_2_text, label)
    d. Two-sentence data without label: (sentence_1_text, sentence_2_text)

    As you can see, a and d can be confusing, unless we have different sets of code for single-sentence tasks and two-sentence tasks. I renamed InputExample to BertInputData and created a namedtuple version of it (a rough sketch follows below). Please take a look at bert_data_utils.py.
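    A rough illustration of what such a namedtuple might look like; the field names here are assumptions for illustration, not the actual definition in bert_data_utils.py:

        from collections import namedtuple

        # Hypothetical shape of BertInputData; field names are illustrative.
        # Optional fields default to None, so single-sentence and unlabeled
        # examples stay unambiguous, unlike bare tuples of varying length.
        BertInputData = namedtuple("BertInputData", ["text_a", "text_b", "label"])
        BertInputData.__new__.__defaults__ = (None, None)  # text_b, label optional

        labeled_single = BertInputData("The cat sat on the mat.", label=1)
        unlabeled_pair = BertInputData("A man is eating.", "Someone is having a meal.")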

    I'm still keeping the tokenization step outside of the classifier, but I changed the tokenization utility function to output a TensorDataset instead of InputFeature. TensorDataset helps wrap multiple tensors without using InputFeature.

    I'm flexible with using or not using the configuration class.

    Let's seek more evidence to finalize these decisions as Miguel suggested.

    5/16 updates: @saidbleik @miguelgfierro I made another pass through the code. Three major changes:

    1. Consolidated some utility functions into the BertTokenClassifier class.
    2. Removed some unnecessary configurations.
    3. Added docstring.

    In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.

    • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized with a dictionary. Users can use a yaml file to set parameters, then load the yaml file into a dictionary. I think this makes the code less verbose when we want to give the user more control, and also makes it easier for users to document how they run their experiments.
      I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
    • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps users understand better what they are doing. I think we want to abstract things to improve reusability, but a sequence of smaller black boxes may help users understand the process better than one big black box.
    • Keep the InputExample and InputFeatures classes, and use the PyTorch DataLoader instead of a custom function to create batches. I think using standard data structures will make the code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help the user understand how BERT works.

    I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

    I still need to refine some functions and improve the formatting, but want to create this PR for people to review and comment. @miguelgfierro

    opened by hlums 8
  • [ASK] Improve user experience for long running notebooks

    Description

    Some notebooks take a long time to run. For an external data scientist who wants to try things out quickly and see how they work, this is not a pleasant experience. Here are some ideas for improvements:

    • Each notebook could add a note section describing the machine configuration (e.g., number of GPUs) and the estimated time to finish running the notebook, so users won't be surprised.
    • Another idea is to set the defaults of the notebooks to run on smaller data with smaller parameters, and then add another section to guide users through changing them for a larger experiment, so they know they'll face a long running time.

    Notebook running time (Last update: 8/1/2019)

    Machine: Azure DLVM Standard_NC12 with 2 GPUs

    | Scenario | Notebook Name | CPU | GPU |
    |----------|---------------|-----|-----|
    | entailment | entailment_xnli_multilingual | NA | ~20 hrs |
    | name_entity_recognition | ner_wikigold_bert | ~37 mins | ~6 mins |
    | embeddings | embedding_trainer | ~5 mins | ~5 mins |
    | interpret_NLP_models | understand_models | ~4 mins | ~2 mins |
    | text_classification | tc_mnil_bert | ~8.2 hrs | ~1.2 hrs |

    enhancement 
    opened by yijingchen 7
  • Add `$schema` to `cgmanifest.json`

    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you to ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs, like Visual Studio and Visual Studio Code. Most modern IDEs also provide code completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel-case and Pascal-case properties. However, the JSON schema doesn't have a case-insensitive mode. We therefore suggest camel case, as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again, we use Prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.

    opened by JamieMagee 0
  • This repo is missing important files

    There are important files that all Microsoft projects should have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 1
  • Adding Microsoft SECURITY.MD

    Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • [ASK] How to run on GPU

    Description

    I was running the code for text classification (tc_mnli_transformers.ipynb) and it keeps running on my CPU instead of my GPU. How can I change that? It's taking way too long to train as a result. Please help.

    Other Comments

    opened by poko1 0
  • [ASK] transformers.abstractive_summarization_bertsum.py not importing transformers

    Description

    I ran the following code in Google Colab:

    !pip install --upgrade 
    !pip install -q git+https://github.com/microsoft/nlp-recipes.git
    !pip install jsonlines
    !pip install pyrouge
    !pip install scrapbook
    
    import os
    import shutil
    import sys
    from tempfile import TemporaryDirectory
    import torch
    import nltk
    from nltk import tokenize
    import pandas as pd
    import pprint
    import scrapbook as sb
    
    nlp_path = os.path.abspath("../../")
    if nlp_path not in sys.path:
        sys.path.insert(0, nlp_path)
    
    from utils_nlp import models
    from utils_nlp.models import transformers 
    from utils_nlp.models.transformers.abstractive_summarization_bertsum \
         import BertSumAbs, BertSumAbsProcessor
    

    It breaks on the last line, and I get the following error:

    /usr/local/lib/python3.7/dist-packages/utils_nlp/models/transformers/abstractive_summarization_bertsum.py in <module>()
         15 from torch.utils.data.distributed import DistributedSampler
         16 from tqdm import tqdm
    ---> 17 from transformers import AutoTokenizer, BertModel
         18 
         19 from utils_nlp.common.pytorch_utils import (
    
    ModuleNotFoundError: No module named 'transformers'
    

    In summary, the code in abstractive_summarization_bertsum.py doesn't resolve the transformers package where it is imported inside the transformers folder. Is it something to be fixed on your side?

    opened by neqkir 1
  • [ASK] Error while running extractive_summarization_cnndm_transformer.ipynb

    When I run the code below:

        summarizer.fit(
            ext_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE,
            gradient_accumulation_steps=2,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
            clip_grad_norm=False,
            use_preprocessed_data=USE_PREPROCSSED_DATA,
        )

    it gives me an error like this:

    Iteration:   0%|          | 0/199 [00:00<?, ?it/s]
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-40-343cf59f0aa4> in <module>()
         12             report_every=REPORT_EVERY,
         13             clip_grad_norm=False,
    ---> 14             use_preprocessed_data=USE_PREPROCSSED_DATA
         15         )
         16 
    
    11 frames
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in fit(self, train_dataset, num_gpus, gpu_ids, batch_size, local_rank, max_steps, warmup_steps, learning_rate, optimization_method, max_grad_norm, beta1, beta2, decay_method, gradient_accumulation_steps, report_every, verbose, seed, save_every, world_size, rank, use_preprocessed_data, **kwargs)
        775             report_every=report_every,
        776             clip_grad_norm=False,
    --> 777             save_every=save_every,
        778         )
        779 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/common.py in fine_tune(self, train_dataloader, get_inputs, device, num_gpus, max_steps, global_step, max_grad_norm, gradient_accumulation_steps, optimizer, scheduler, fp16, amp, local_rank, verbose, seed, report_every, save_every, clip_grad_norm, validation_function)
        191                 disable=local_rank not in [-1, 0] or not verbose,
        192             )
    --> 193             for step, batch in enumerate(epoch_iterator):
        194                 inputs = get_inputs(batch, device, self.model_name)
        195                 outputs = self.model(**inputs)
    
    /usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
       1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
       1103 
    -> 1104         for obj in iterable:
       1105             yield obj
       1106             # Update and possibly print the progressbar.
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
        519             if self._sampler_iter is None:
        520                 self._reset()
    --> 521             data = self._next_data()
        522             self._num_yielded += 1
        523             if self._dataset_kind == _DatasetKind.Iterable and \
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
        559     def _next_data(self):
        560         index = self._next_index()  # may raise StopIteration
    --> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        562         if self._pin_memory:
        563             data = _utils.pin_memory.pin_memory(data)
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         45         else:
         46             data = self.dataset[possibly_batched_index]
    ---> 47         return self.collate_fn(data)
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate_fn(data)
        744             def collate_fn(data):
        745                 return self.processor.collate(
    --> 746                     data, block_size=self.max_pos_length, device=device
        747                 )
        748 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate(self, data, block_size, device, train_mode)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in <listcomp>(.0)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in encode_single(self, d, block_size, train_mode)
        539             + ["[SEP]"]
        540         )
    --> 541         src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        542         _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        543         segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in convert_tokens_to_ids(self, tokens)
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _convert_token_to_id_with_added_voc(self, token)
    
    TypeError: Can't convert 0 to PyString
    

    P.S. I tried to run this code using the Google Colab free GPU.

    Any help is welcome :)

    opened by ToonicTie 2
Releases (2.2.0)
  • 2.2.0 (Mar 30, 2020)

    Text Summarization

    In this release, we support both abstractive and extractive text summarization.

    New Model: UniLM

    UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can be fine-tuned on different types of labeled data for various NLP tasks, such as text classification and abstractive summarization.

    Supported Models

    • unilm-large-cased
    • unilm-base-cased

    For more info about UniLM, please refer to the following:

    Thanks to the UniLM team, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, for their great work and support for the integration.

    New Model: BERTSum

    BERTSum is an encoder architecture designed for text summarization. It can be used together with different decoders to support both extractive and abstractive summarization.

    Supported Models

    • bert-base-uncased (extractive and abstractive)
    • distilbert-base-uncased (extractive)

    Thanks to the original authors Yang Liu and Mirella Lapata for their great contribution.

    All model implementations support distributed training and multi-GPU inference. For abstractive summarization, we also support mixed-precision training and inference.
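    A rough usage sketch, pieced together from class names that appear elsewhere on this page (BertSumAbs, BertSumAbsProcessor); the constructor arguments and fit parameters shown are assumptions, so check the summarization notebooks for the actual API:

        # Hedged sketch only: argument names below are assumptions, not the verified API.
        from utils_nlp.models.transformers.abstractive_summarization_bertsum import (
            BertSumAbs,
            BertSumAbsProcessor,
        )

        processor = BertSumAbsProcessor()   # wraps tokenization and batch collation
        summarizer = BertSumAbs(processor)  # BERT encoder with an abstractive decoder

        # train_dataset would hold source/summary pairs prepared by the processor;
        # see the repository's summarization notebooks for the exact preparation steps.
        # summarizer.fit(train_dataset, num_gpus=1, batch_size=4, max_steps=1000)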

    Source code(tar.gz)
    Source code(zip)
  • v2.0.0 (Dec 4, 2019)

Owner
Microsoft (Open source projects and samples from Microsoft)