Natural Language Processing Best Practices & Examples

NLP Best Practices

In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, which has helped drive business adoption of artificial intelligence (AI) solutions. Researchers have been applying newer deep learning methods to NLP, and data scientists have moved from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that these tools can significantly reduce the "time to market" by simplifying the experience from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines, showcasing best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks, such as the GLUE and SQuAD leaderboards. The models can be used in a number of applications, ranging from simple text classification to sophisticated intelligent chatbots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist that do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If they are not applicable, or their accuracy is insufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions for common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity recognition, and more. These APIs work out of the box and require minimal machine learning expertise, but have limited customization capabilities.
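As a rough illustration (a minimal sketch, not official documentation), calling the Text Analytics v3.0 sentiment endpoint looks like this; the resource name and key are placeholders to replace with your own:

    import requests

    # Placeholders -- substitute your own Text Analytics resource and key.
    endpoint = "https://<your-resource>.cognitiveservices.azure.com"
    key = "<your-subscription-key>"

    payload = {"documents": [
        {"id": "1", "language": "en", "text": "The rooms were lovely and the staff was helpful."}
    ]}

    response = requests.post(
        endpoint + "/text/analytics/v3.0/sentiment",
        headers={"Ocp-Apim-Subscription-Key": key},
        json=payload,
    )
    for doc in response.json()["documents"]:
        print(doc["id"], doc["sentiment"], doc["confidenceScores"])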

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service that trains and deploys a model as a REST API given a user-provided training set. You can perform intent classification as well as named entity extraction by providing example utterances and labeling them. It supports active learning, so your model keeps learning and improving.

Target Audience

The target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions:

Scenarios

We aim to provide end-to-end examples of common tasks and scenarios, such as text classification and named entity recognition.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face, which allows users to easily load pretrained models and fine-tune them for different tasks.
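For illustration, a minimal sketch of loading a pretrained checkpoint with the Hugging Face transformers package and preparing it for fine-tuning on a two-class text classification task (the checkpoint name and label count here are arbitrary choices, not the repository's defaults):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Tokenize a small batch and run a forward pass; fine-tuning wraps this
    # in a standard PyTorch training loop with labels and an optimizer.
    inputs = tokenizer(
        ["The movie was surprisingly good.", "The plot made no sense."],
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    print(logits.shape)  # torch.Size([2, 2])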

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and fastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
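As a small illustration of this, a sketch (assuming the Hugging Face transformers package) that loads a multilingual BERT checkpoint and tokenizes text in two languages with the same vocabulary:

    from transformers import AutoTokenizer

    # bert-base-multilingual-cased ships one vocabulary covering 100+ languages.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    print(tokenizer.tokenize("Natural language is not a synonym for English"))
    print(tokenizer.tokenize("La langue naturelle n'est pas un synonyme d'anglais"))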

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
| --- | --- | --- | --- |
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine whether the text agrees with the hypothesis or not. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score given a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension. | English |
| Sentiment Analysis | Dependency Parser, GloVe | Provides an example of training and using Aspect-Based Sentiment Analysis with Azure ML and Intel NLP Architect. | English |
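To make the embeddings scenario above concrete, here is a minimal Word2Vec training sketch using gensim (assuming gensim 4.x, where the parameter is vector_size; the toy corpus is illustrative only):

    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenized sentences.
    sentences = [
        ["natural", "language", "processing"],
        ["word", "embeddings", "map", "words", "to", "dense", "vectors"],
        ["similar", "words", "receive", "similar", "vectors"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
    print(model.wv["words"].shape)            # (50,)
    print(model.wv.most_similar("words", topn=2))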

Getting Started

While solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond the bounds of the prebuilt cognitive services and you want to explore custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is used in notebooks across different scenarios to enhance the efficiency of developing Natural Language systems at scale and for various AI model development tasks.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the notebooks; an introduction and/or references for those are provided in the notebooks themselves.
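As a rough sketch of how a notebook typically connects to AzureML (assuming the azureml-sdk package and a workspace config file; the experiment name is a placeholder):

    from azureml.core import Workspace, Experiment

    # Reads workspace details from .azureml/config.json (or ./config.json),
    # which can be downloaded from the Azure portal.
    ws = Workspace.from_config()
    exp = Experiment(workspace=ws, name="nlp-example")  # placeholder name
    print(ws.name, ws.resource_group, ws.location)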

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
| --- | --- |
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

| Build | Branch | Status |
| --- | --- | --- |
| Linux CPU | master | Build Status |
| Linux CPU | staging | Build Status |
| Linux GPU | master | Build Status |
| Linux GPU | staging | Build Status |
Comments
  • [ASK] Remove 'repo_metrics' folder

    Description

    During a team discussion, we agreed that the files in the 'repo_metrics' folder should be removed from the NLP repo and maintained in a centralized way.

    Other Comments

    enhancement 
    opened by yijingchen 31
  • Fix broken data path and add git clone cell

    Description

    The notebook 'embedding_trainer.ipynb' cannot be run end to end. Fixed the related issues:

    • Added the !git clone http://github.com/stanfordnlp/glove command inside the Jupyter Notebook. I found the experience smoother this way. However, I don't know why the author decided to leave it out. Please ask the author to verify this in case there are other risks that I'm not aware of.

    • The command cd glove && make gives an error, so I didn't add a cell for !cd glove && make. The notebook runs fine without this command. Please check with the author to see if it is necessary to include it.

    • The import from utils_nlp.dataset import stsbenchmark uses an updated data path in the util file, but this notebook hadn't been updated accordingly. I modified the path in this notebook.

    • With these fixes, the test pipeline should be able to run this notebook end to end.

    Related Issues

    https://github.com/microsoft/nlp/issues/230

    Checklist:

    • [X] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [x] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by yijingchen 19
  • GenSen on AML deep dive notebook (sentence similarity)

    1. This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity, building one of the advanced models, GenSen, on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPUs.

    The notebook includes: data loading and preprocessing, training the GenSen model with distributed PyTorch with Horovod on AzureML, and tuning with HyperDrive. Evaluation and deployment will be added later. In addition, comparison results for training and tuning on AML vs. a VM will be added once this initial PR is merged into staging.

    2. Provide refactored GenSen code in utils_nlp to make the model reusable.

    We provide a distributed PyTorch with Horovod implementation of the paper along with pre-trained models, as well as code to evaluate these models on a variety of transfer learning benchmarks. This code is based on the GitHub codebase from Maluuba, but we have refactored the code in the following aspects:

    1. Support distributed PyTorch with Horovod
    2. Clean and refactor the original code into a more structured form
    3. Change the training file (train.py) from running indefinitely to stopping when the validation loss reaches a local minimum
    4. Update the code from Python 2.7 to 3+ and PyTorch from 0.2/0.3 to 1.0.1
    5. Add some necessary comments
    6. Add some code for training on the AzureML platform
    7. Fix the bug where setting the batch size to 1 raises an error during training
    opened by catherine667 16
  • [BUG] Cannot set up nlp_gpu environment

    Description

    The following errors show up while setting up the GPU environment:

        Collecting package metadata: done
        Solving environment: failed

        ResolvePackageNotFound: cudatoolkit==9.2


    How do we replicate the bug?

    Machine: Microsoft Azure Deep Learning Virtual Machine, Standard NC6
    Operating System: Windows
    Code:

        cd nlp
        python tools/generate_conda_file.py --gpu
        conda env create -n nlp_gpu -f nlp_gpu.yaml

    Expected behavior (i.e. solution)

    The installation should complete without errors.

    Other Comments

    I changed some package versions in the yaml file to this:

    • cudatoolkit>=9.2
    • tensorflow-gpu>=1.12.0

    The installation proceeds with the above configuration; however, another error occurs, shown below:

        ERROR: Command "'C:\Anaconda\envs\nlp_gpu\python.exe' -u -c 'import setuptools, tokenize;__file__='"'"'C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\adminyijing\AppData\Local\Temp\2\pip-record-su74tqwf\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\

    bug 
    opened by yijingchen 13
  • [FEATURE] Check that all AzureML notebooks are tested

    Description

    Need to add a .azureml folder to provide the common AzureML subscription for AzureML notebooks.

    How do we replicate the bug?

    Under the .azureml folder, there should be a config.json file that looks like this: {"Id": null, "Scope": "/subscriptions/[ID]/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/[Workspace Name]"}. This file can be downloaded from the workspace.
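    For illustration, a minimal sketch of creating such a file locally (the [ID] and [Workspace Name] placeholders are kept as-is):

        import json
        import os

        os.makedirs(".azureml", exist_ok=True)
        config = {
            "Id": None,
            "Scope": "/subscriptions/[ID]/resourceGroups/nlprg/providers/"
                     "Microsoft.MachineLearningServices/workspaces/[Workspace Name]",
        }
        with open(".azureml/config.json", "w") as f:
            json.dump(config, f)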

    Related notebooks: GenSen https://github.com/microsoft/nlp/pull/199 and BERT https://github.com/microsoft/nlp/pull/191 notebook testing

    related to https://github.com/microsoft/nlp/issues/143

    Expected behavior (i.e. solution)

    Other Comments

    #262

    bug release-blocker 
    opened by catherine667 10
  • Staging to master to add github metrics

    Description

    In order to start recording the metrics, we need to merge the metrics code to master.

    @irshaffe when this is in master, I need to activate it from DevOps so it is executed every day; it will store the metrics so you can start populating the Power BI dashboard. The original Power BI dashboard for Recommenders was done with Scott (@gramhagen).

    Related Issues

    #24

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 10
  • V chguan/add icml ex nlp code

    Description

    We have added the code of our ICML paper. The related files are:

    • interpreter.py and README.md files under utils_nlp\interpreter. The interpreter.py file is the main functional file we use; README.md is an instruction file for it.
    • explain_simple_model.ipynb and explain_BERT_model.ipynb files under scenarios\interpret_NLP_models, covering two scenarios of how to use interpreter.py.
    • test_interpreter.py under tests\unit. This file contains 6 unit tests for interpreter.py (which, on my machine, take about 2.25s to run).
    • example.png under the utils_nlp\interpreter folder, used by README.md, and regular.json under the scenarios\interpret_NLP_models folder, used by explain_BERT_model.ipynb. I know from other pull requests that files like these are not allowed to be merged. So, can anyone help me upload these two files somewhere? Thanks for your help in advance :)

    Related Issues

    Our issue is #62.

    Checklist:

    • My code follows the code style of this project, as detailed in our contribution guidelines.
    • I have added tests.
    • [ ] I have updated the documentation accordingly (so far I have only added a README.md to utils_nlp; what other .md files should I modify or add?).
    opened by Frozenmad 9
  • Transformers

    Description

    Related Issues

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by saidbleik 8
  • Integration tests

    Description

    Integration and smoke tests

    @bethz, @jainr In the code I would like to have the sequence:

    1. create conda env
    2. run smoke
    3. run integration
    4. remove conda

    Beth told me that there might be a more elegant way of doing this. Can you please offer some guidance?

    @saidbleik @sharatsc the scheduler is not working at the moment (you might have seen the emails to DevOps). As a temporary solution, I thought of running this pipeline every time there is a PR to master. Feel free to propose another idea.

    Related Issues

    #25

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 8
  • Hlu/bert ner utils

    5/31: Notebook is updated with new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


    5/29 updates: @saidbleik @miguelgfierro I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments.) Please ignore the bert_data_utils.py file for now, I need to update it for the new dataset. Some functions in the common_ner.py are from Said's sequence classification PR. I will merge this with the common.py once Said completes his PR.


    5/20 updates: @saidbleik @miguelgfierro I made another update based on our discussion last week.

    I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:

    a. Single-sentence data with label: (sentence_text, label)
    b. Single-sentence data without label: (sentence_text,)
    c. Two-sentence data with label: (sentence_1_text, sentence_2_text, label)
    d. Two-sentence data without label: (sentence_1_text, sentence_2_text)

    As you can see, a and d can be confusing unless we have different sets of code for single-sentence tasks and two-sentence tasks. I renamed InputExample to BertInputData and created a namedtuple version of it. Please take a look at bert_data_utils.py.

    I'm still keeping the tokenization step outside of the classifier, but I changed the tokenization utility function to output a TensorDataset instead of InputFeature. TensorDataset helps wrap multiple tensors without using InputFeature.
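    (For illustration, a minimal sketch of the two ideas above: a namedtuple input record plus a TensorDataset; the field names are hypothetical and may differ from the actual BertInputData in this PR.)

        from collections import namedtuple

        import torch
        from torch.utils.data import DataLoader, TensorDataset

        # Hypothetical field names, for illustration only.
        BertInputData = namedtuple("BertInputData", ["text_a", "text_b", "label"])
        example = BertInputData("A man is playing a guitar.", None, 1)

        # TensorDataset wraps aligned tensors without a custom container class.
        input_ids = torch.randint(0, 30522, (8, 64))
        attention_mask = torch.ones(8, 64, dtype=torch.long)
        labels = torch.randint(0, 2, (8,))
        loader = DataLoader(TensorDataset(input_ids, attention_mask, labels), batch_size=4)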

    I'm flexible with using or not using the configuration class.

    Let's seek more evidence to finalize these decisions as Miguel suggested.

    5/16 updates: @saidbleik @miguelgfierro I made another pass through the code. Three major changes:

    1. Consolidated some utility functions into the BertTokenClassifier class.
    2. Removed some unnecessary configurations.
    3. Added docstring.

    In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.

    • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized using a dictionary. Users can set parameters in a yaml file, then load the yaml file into a dictionary. I think this makes the code less verbose when we want to give users more control, and it also makes it easier for users to document how they ran their experiments.
      I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
    • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps users understand better what they are doing. We want to abstract things to improve reusability, but a sequence of smaller black boxes may help users understand the process better than one big black box.
    • Keep the InputExample and InputFeatures classes, and use the PyTorch DataLoader instead of a custom function to create batches. I think using standard data structures will make code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help the user understand how BERT works.

    I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

    I still need to refine some functions and improve the formatting, but I want to create this PR for people to review and comment. @miguelgfierro

    opened by hlums 8
  • [ASK] Improve user experience for long running notebooks

    Description

    Some notebooks take a long time to run. For external data scientists who want to try things out quickly and see how they work, this is not a pleasant experience. Here are some ideas for improvements:

    • Each notebook should add a note section describing the machine configuration (e.g., number of GPUs) and the estimated time to finish running the notebook, so that users won't be surprised.
    • Another idea is to set the defaults of the notebooks to run on smaller data with smaller parameters, and then add another section that guides users to change them for a larger experiment, so they know they will face a long running time.

    Notebook running time (Last update: 8/1/2019)

    Machine: Azure DLVM Standard_NC12 with 2 GPUs

    | Scenario | Notebook Name | CPU | GPU |
    | --- | --- | --- | --- |
    | entailment | entailment_xnli_multilingual | NA | ~20hrs |
    | name_entity_recognition | ner_wikigold_bert | ~37mins | ~6mins |
    | embeddings | embedding_trainer | ~5mins | ~5mins |
    | interpret_NLP_models | understand_models | ~4mins | ~2mins |
    | text_classification | tc_mnil_bert | ~8.2hrs | ~1.2hrs |

    enhancement 
    opened by yijingchen 7
  • Add `$schema` to `cgmanifest.json`

    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs like Visual Studio and Visual Studio Code. Most modern IDEs also provide code completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel case and pascal case properties. However, the JSON schema doesn't have a case-insensitive mode, so we suggest camel case, as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again we use prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.

    opened by JamieMagee 0
  • This repo is missing important files

    There are important files that all Microsoft projects should have which are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 1
  • Adding Microsoft SECURITY.MD

    Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light-up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • [ASK] How to run on GPU

    Description

    I was running the code for text classification (tc_mnli_transformers.ipynb) and it keeps running on my CPU instead of my GPU. How can I change that? It's taking way too long to train as a result. Please help.

    Other Comments

    opened by poko1 0
  • [ASK] transformers.abstractive_summarization_bertsum.py not importing transformers

    Description

    I ran the following code in Google Colab:

    !pip install --upgrade 
    !pip install -q git+https://github.com/microsoft/nlp-recipes.git
    !pip install jsonlines
    !pip install pyrouge
    !pip install scrapbook
    
    import os
    import shutil
    import sys
    from tempfile import TemporaryDirectory
    import torch
    import nltk
    from nltk import tokenize
    import pandas as pd
    import pprint
    import scrapbook as sb
    
    nlp_path = os.path.abspath("../../")
    if nlp_path not in sys.path:
        sys.path.insert(0, nlp_path)
    
    from utils_nlp import models
    from utils_nlp.models import transformers 
    from utils_nlp.models.transformers.abstractive_summarization_bertsum \
         import BertSumAbs, BertSumAbsProcessor
    

    It breaks on the last line, and I get the following error:

    /usr/local/lib/python3.7/dist-packages/utils_nlp/models/transformers/abstractive_summarization_bertsum.py in <module>()
         15 from torch.utils.data.distributed import DistributedSampler
         16 from tqdm import tqdm
    ---> 17 from transformers import AutoTokenizer, BertModel
         18 
         19 from utils_nlp.common.pytorch_utils import (
    
    ModuleNotFoundError: No module named 'transformers'
    

    In summary, the code in abstractive_summarization_bertsum.py doesn't resolve the transformers import, even though the file is located in the transformers folder. Is this something to be fixed on your side?

    opened by neqkir 1
  • [ASK] Error while running extractive_summarization_cnndm_transformer.ipynb

    When I run the code below:

        summarizer.fit(
            ext_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE,
            gradient_accumulation_steps=2,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
            clip_grad_norm=False,
            use_preprocessed_data=USE_PREPROCSSED_DATA
        )

    it gives me an error like this:

    Iteration:   0%|          | 0/199 [00:00<?, ?it/s]
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-40-343cf59f0aa4> in <module>()
         12             report_every=REPORT_EVERY,
         13             clip_grad_norm=False,
    ---> 14             use_preprocessed_data=USE_PREPROCSSED_DATA
         15         )
         16 
    
    11 frames
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in fit(self, train_dataset, num_gpus, gpu_ids, batch_size, local_rank, max_steps, warmup_steps, learning_rate, optimization_method, max_grad_norm, beta1, beta2, decay_method, gradient_accumulation_steps, report_every, verbose, seed, save_every, world_size, rank, use_preprocessed_data, **kwargs)
        775             report_every=report_every,
        776             clip_grad_norm=False,
    --> 777             save_every=save_every,
        778         )
        779 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/common.py in fine_tune(self, train_dataloader, get_inputs, device, num_gpus, max_steps, global_step, max_grad_norm, gradient_accumulation_steps, optimizer, scheduler, fp16, amp, local_rank, verbose, seed, report_every, save_every, clip_grad_norm, validation_function)
        191                 disable=local_rank not in [-1, 0] or not verbose,
        192             )
    --> 193             for step, batch in enumerate(epoch_iterator):
        194                 inputs = get_inputs(batch, device, self.model_name)
        195                 outputs = self.model(**inputs)
    
    /usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
       1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
       1103 
    -> 1104         for obj in iterable:
       1105             yield obj
       1106             # Update and possibly print the progressbar.
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
        519             if self._sampler_iter is None:
        520                 self._reset()
    --> 521             data = self._next_data()
        522             self._num_yielded += 1
        523             if self._dataset_kind == _DatasetKind.Iterable and \
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
        559     def _next_data(self):
        560         index = self._next_index()  # may raise StopIteration
    --> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        562         if self._pin_memory:
        563             data = _utils.pin_memory.pin_memory(data)
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         45         else:
         46             data = self.dataset[possibly_batched_index]
    ---> 47         return self.collate_fn(data)
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate_fn(data)
        744             def collate_fn(data):
        745                 return self.processor.collate(
    --> 746                     data, block_size=self.max_pos_length, device=device
        747                 )
        748 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate(self, data, block_size, device, train_mode)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in <listcomp>(.0)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in encode_single(self, d, block_size, train_mode)
        539             + ["[SEP]"]
        540         )
    --> 541         src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        542         _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        543         segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in convert_tokens_to_ids(self, tokens)
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _convert_token_to_id_with_added_voc(self, token)
    
    TypeError: Can't convert 0 to PyString
    

    P.S. I am trying to run this code using the free Google Colab GPU.

    Any help is welcome :)

    opened by ToonicTie 2
Releases (2.2.0)
  • 2.2.0 (Mar 30, 2020)

    Text Summarization

    In this release, we support both abstractive and extractive text summarization.

    New Model: UniLM

    UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can be fine-tuned on different types of labeled data for various NLP tasks, such as text classification and abstractive summarization.

    Supported Models

    • unilm-large-cased
    • unilm-base-cased

    For more info about UniLM, please refer to the UniLM repository listed in the References section above.

    Thanks to the UniLM team, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, for their great work and support for the integration.

    New Model: BERTSum

    BERTSum is an encoder architecture designed for text summarization. It can be used together with different decoders to support both extractive and abstractive summarization.

    Supported Models

    • bert-base-uncased (extractive and abstractive)
    • distilbert-base-uncased (extractive)

    Thanks to the original authors Yang Liu and Mirella Lapata for their great contribution.

    All model implementations support distributed training and multi-GPU inference. For abstractive summarization, we also support mixed-precision training and inference.

  • v2.0.0 (Dec 4, 2019)
