Data loaders and abstractions for text and NLP

Overview

torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
  • torchtext.datasets: Pre-built loaders for common NLP datasets

Note: we are currently redesigning the torchtext library to make it more compatible with PyTorch (e.g. torch.utils.data). Several datasets have been rewritten with the new abstractions in the torchtext.experimental folder. We have also opened an issue to discuss the new abstractions, and users are welcome to leave feedback there. These prototype building blocks and datasets in the experimental folder are available in the nightly release only. The nightly packages are accessible via pip and conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command:

pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

For more detailed instructions, please refer to Install PyTorch. Note that the new building blocks are still under development, and their APIs are not yet stable.

Installation

We recommend Anaconda as the Python package management system. Please refer to pytorch.org for details on installing PyTorch. The following table lists the corresponding torchtext versions and supported Python versions.

Version Compatibility

PyTorch version    torchtext version    Supported Python version
nightly build      master               3.6+
1.7                0.8                  3.6+
1.6                0.7                  3.6+
1.5                0.6                  3.5+
1.4                0.5                  2.7, 3.5+
0.4 and below      0.2.3                2.7, 3.5+

Using conda:

conda install -c pytorch torchtext

Using pip:

pip install torchtext

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses:

pip install sacremoses

For torchtext 0.5 and below, you also need to install sentencepiece:

conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake, and a C++11 compiler such as g++:

git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive

# Linux
python setup.py clean install

# OSX
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py clean install

# or ``python setup.py develop`` if you are making modifications.

Note

When building from source, make sure that you have the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using the nightly build of PyTorch, check out the environment it was built with, for both conda (here) and pip (here).

Documentation

Find the documentation here.

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that's in a "normal" format:

    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline:

    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb
  • Question classification: TREC
  • Entailment: SNLI, MultiNLI
  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank
  • Machine translation: abstract class + Multi30k, IWSLT, WMT14
  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking
  • Question answering: 20 QA bAbI tasks
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Others are planned or a work in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.

Experimental Code

We have re-written several datasets under torchtext.experimental.datasets:

  • Sentiment analysis: IMDb
  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

A new pattern is introduced in Release v0.5.0. Several other datasets are also in the new pattern:

  • Unsupervised learning dataset: Enwik9
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

Comments
  • sampler unable in BucketIterator

    I am unable to use XLA's distributed data sampler, or any multi-GPU training, with BucketIterator because it does not have a sampler feature:

    train_iterator, valid_iterator = BucketIterator.splits((train_data, test_data), batch_size=batch_size, sort_within_batch=True, sort_key=lambda x: len(x.word_token), device=device)

    So I am constrained to using only one GPU.

    I used BucketIterator because it gives good batches with minimal padding, but the limit on scaling is a constraint.
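
One way to combine length bucketing with per-worker sharding is to do the bucketing on example indices rather than inside the iterator. The sketch below is pure Python and only illustrates the idea; `bucketed_batches` is a hypothetical helper, and `num_replicas` and `rank` simply borrow the usual distributed-sampler terminology. Nothing here is a torchtext or XLA API.

```python
import random

def bucketed_batches(lengths, batch_size, num_replicas=1, rank=0, seed=0):
    """Return this replica's batches of example indices, grouped by similar length."""
    # Sort indices by example length so that neighbours need little padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches (not single examples); every replica uses the same
    # seed, so all replicas agree on the batch order before sharding.
    random.Random(seed).shuffle(batches)
    # Drop trailing batches so that each replica runs the same number of steps.
    usable = len(batches) - len(batches) % num_replicas
    return batches[rank:usable:num_replicas]

lengths = [5, 32, 7, 30, 6, 31, 8, 29]  # made-up sequence lengths
shard0 = bucketed_batches(lengths, batch_size=2, num_replicas=2, rank=0)
shard1 = bucketed_batches(lengths, batch_size=2, num_replicas=2, rank=1)
print(shard0, shard1)
```

The resulting index batches could, for instance, be passed as the `batch_sampler` argument of `torch.utils.data.DataLoader` together with a collate function that pads each batch, although that wiring is not shown here.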

    legacy 
    opened by StephennFernandes 33
  • Migrate datasets to build on top of torchdata datapipes

    🚀 Feature

    Motivation

    https://github.com/pytorch/data#why-composable-data-loading

    user-experience: TorchData datasets enable new functional API, auto-sharding, and snapshotting support out-of-the-box. They also enable standard flow-control like batching, collation, shuffling, bucketing, and mapping/transformation using user-defined functions and transforms (UDFs).

    Maintenance: By relying on TorchData, we no longer have to maintain low-level functionality like downloading, extracting, caching, file/stream parsing, etc.

    Reference examples: https://github.com/facebookexternal/torchdata/tree/main/examples/text

    TorchData: https://github.com/facebookexternal/torchdata

    Backlog of datasets

    • [x] AG_NEWS https://github.com/pytorch/text/pull/1498
    • [x] AmazonReviewFull https://github.com/pytorch/text/pull/1499
    • [x] AmazonReviewPolarity https://github.com/pytorch/text/pull/1490
    • [x] DBpedia https://github.com/pytorch/text/pull/1500
    • [x] SogouNews https://github.com/pytorch/text/pull/1503
    • [x] YelpReviewFull https://github.com/pytorch/text/pull/1507
    • [x] YelpReviewPolarity https://github.com/pytorch/text/pull/1509
    • [x] YahooAnswers https://github.com/pytorch/text/pull/1508
    • [x] CoNLL2000Chunking https://github.com/pytorch/text/pull/1515
    • [x] UDPOS https://github.com/pytorch/text/pull/1535
    • [x] IWSLT2016 https://github.com/pytorch/text/pull/1545
    • [x] IWSLT2017 https://github.com/pytorch/text/pull/1547
    • [x] Multi30K https://github.com/pytorch/text/pull/1536
    • [x] SQuAD1 https://github.com/pytorch/text/pull/1513
    • [x] SQuAD2 https://github.com/pytorch/text/pull/1514
    • [x] PennTreebank https://github.com/pytorch/text/pull/1511
    • [x] WikiText103 https://github.com/pytorch/text/pull/1518
    • [x] WikiText2 https://github.com/pytorch/text/pull/1519
    • [x] EnWik9 https://github.com/pytorch/text/pull/1512
    • [x] IMDB https://github.com/pytorch/text/pull/1531
    • [x] SST2 https://github.com/pytorch/text/pull/1538
    • [x] CC-100 https://github.com/pytorch/text/pull/1562

    Contributing

    Please leave a message below if you plan to work on particular dataset(s) to avoid duplication of efforts. Also please link to the corresponding PRs.

    cc: @Nayef211 , @abhinavarora , @erip , @ejguan , @VitalyFedyunin

    enhancement datasets feature request new datasets and building blocks 
    opened by parmeet 31
  • How might I use the tokenizers from the HuggingFace Transformers library

    ❓ Questions and Help

    Description

    TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?

    I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a Field object, and build a vocabulary without issue

    from torchtext import data
    from torchtext import datasets
    from transformers import AutoTokenizer
    
    path = 'path/to/med_nli/'
    
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    
    TEXT = data.Field(use_vocab=True, tokenize=tokenizer.tokenize)
    LABEL = data.LabelField()
    
    fields = {'sentence1': ('premise', TEXT),
              'sentence2': ('hypothesis', TEXT),
              'gold_label': ('label', LABEL)}
    
    train, valid, test = data.TabularDataset.splits(
        path=path, 
        train='mli_train_v1.jsonl',
        validation='mli_dev_v1.jsonl',
        test='mli_test_v1.jsonl',
        format='json', 
        fields=fields
    )
    
    train_iter, valid_iter, test_iter = data.BucketIterator.splits(
        (train, valid, test), batch_sizes=(16, 256, 256)
    )
    
    TEXT.build_vocab(train)
    LABEL.build_vocab(train)
    

    Note, I am using the MedNLI dataset but it appears to be formatted according to the SNLI dataset.

    But I am stuck on how to numericalize according to their tokenizer's vocabulary. So I tried to numericalize in the field with the tokenizer's encode method and set use_vocab=False.

    from torchtext import data
    from torchtext import datasets
    from transformers import AutoTokenizer
    
    path = 'path/to/med_nli/'
    
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    
    TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
    LABEL = data.LabelField()
    
    fields = {'sentence1': ('premise', TEXT),
              'sentence2': ('hypothesis', TEXT),
              'gold_label': ('label', LABEL)}
    
    train, valid, test = data.TabularDataset.splits(
        path=path, 
        train='mli_train_v1.jsonl',
        validation='mli_dev_v1.jsonl',
        test='mli_test_v1.jsonl',
        format='json', 
        fields=fields
    )
    
    train_iter, valid_iter, test_iter = data.BucketIterator.splits(
        (train, valid, test), batch_sizes=(16, 256, 256)
    )
    
    # TEXT.build_vocab(train)
    LABEL.build_vocab(train)
    

    But then I get strange issues when trying to access the batch,

    batch = next(iter(train_iter))
    print("Numericalize premises:\n", batch.premise)
    print("Numericalize hypotheses:\n", batch.hypothesis)
    print("Entailment labels:\n", batch.label)
    
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-55-9919119fad82> in <module>
    ----> 1 batch = next(iter(train_iter))
    
    ~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)
    
    ~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)
    
    ~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in process(self, batch, device)
    
    ~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in numericalize(self, arr, device)
    
    ValueError: too many dimensions 'str'
    

    Any suggestions on how to go about this?
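
A likely cause, assuming the field keeps its default string pad token `"<pad>"`, is that padding mixes that string into lists of integer ids, which then cannot be converted into an integer tensor. The standalone sketch below reproduces only that mixing step in plain Python; `pad_batch` and `is_all_int` are hypothetical helpers, and the ids are made-up stand-ins for the output of `tokenizer.encode`.

```python
def pad_batch(batch, pad_token):
    """Right-pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_token] * (longest - len(seq)) for seq in batch]

def is_all_int(batch):
    """True if every element could go into an integer tensor."""
    return all(isinstance(x, int) for seq in batch for x in seq)

# Illustrative ids standing in for already-encoded sentences.
encoded = [[101, 7592, 102], [101, 102]]

bad = pad_batch(encoded, "<pad>")  # string pad token mixed with integer ids
good = pad_batch(encoded, 0)       # integer pad id (0 is an assumption here)

print(is_all_int(bad), is_all_int(good))
```

If this is indeed the cause, giving the field an integer pad id, for example `pad_token=tokenizer.pad_token_id`, may resolve the error.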

    legacy new datasets and building blocks 
    opened by JohnGiorgi 21
  • Unable to compile torchtext v0.13

    🐛 Bug

    Seeing the following error while building torchtext:

    -- Found CUDA: /usr/local/cuda (found version "11.4")
    -- The CUDA compiler identification is unknown
    -- Detecting CUDA compiler ABI info
    
    CMake Error in <myhome>/work/build/temp.linux-ppc64le-3.9/CMakeFiles/CMakeTmp/CMakeLists.txt:
      CUDA_ARCHITECTURES is empty for target "cmTC_94ffb".
    
    
    CMake Error in <myhome>/work/build/temp.linux-ppc64le-3.9/CMakeFiles/CMakeTmp/CMakeLists.txt:
      CUDA_ARCHITECTURES is empty for target "cmTC_94ffb".
    
    
    CMake Error at <myhome>/_build_env/share/cmake-3.19/Modules/CMakeDetermineCompilerABI.cmake:48 (try_compile):
      Failed to generate test project build system.
    

    To Reproduce

    Build torchtext tag https://github.com/pytorch/text/releases/tag/v0.13.0-rc2

    Expected behavior

    It should build successfully.

    Environment

    OS: Red Hat Enterprise Linux 8.5 (Ootpa) (ppc64le)
    GCC version: Anaconda GCC 11.2
    Python version: 3.9.12 (main, Apr 5 2022, 07:09:29)
    CUDA runtime version: 11.4.152

    
    
    opened by cdeepali 20
  • torchtext iterator that tokenizes each line of words between the tokens `<sos>` and `<eos>`

    Hello,

    I generated a text file called openbookQA_train. The contents of this file are shown below:

    <sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>
    <sos> The sun is responsible for <mcoption> (B) children growing up and getting old <eos>
    <sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>
    <sos> The sun is responsible for <mcoption> (D) plants sprouting, blooming and wilting <eos>
    

    I am trying to use or define a torchtext Iterator to generate the input that I can pass into my Transformer.

    I want each sample in next(iter(openbookQA_train)).text to be a sequence of integers obtained by tokenizing one line, from <sos> to <eos> inclusive. If a sample contains fewer tokens than the bptt length, I want it to include all of the tokens between <sos> and <eos>, with the remaining slots filled with the <pad> token up to the bptt length.

    How can I achieve this objective?

    Thank you,
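
One possible approach, sketched here without torchtext, is to encode each line separately and right-pad it to the bptt length. The helpers `build_stoi` and `encode_line` are hypothetical, whitespace tokenization is assumed (so `<mcoption>` and `(A)` are each one token), and truncating lines longer than bptt is an extra assumption that the question does not specify.

```python
def build_stoi(lines, specials=("<pad>", "<unk>")):
    """Assign an integer id to every whitespace-separated token, specials first."""
    stoi = {tok: idx for idx, tok in enumerate(specials)}
    for line in lines:
        for tok in line.split():
            stoi.setdefault(tok, len(stoi))
    return stoi

def encode_line(line, stoi, bptt, pad="<pad>", unk="<unk>"):
    """Convert one line to exactly `bptt` ids, padding on the right."""
    ids = [stoi.get(tok, stoi[unk]) for tok in line.split()]
    ids = ids[:bptt]  # assumption: lines longer than bptt are truncated
    return ids + [stoi[pad]] * (bptt - len(ids))

lines = [
    "<sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>",
    "<sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>",
]
stoi = build_stoi(lines)
batch = [encode_line(line, stoi, bptt=16) for line in lines]
for row in batch:
    print(row)
```

Each row then has the same length, so the batch can be converted to a tensor directly.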

    opened by h56cho 20
  • Fix IWSLT2016 testing

    This serves as a patch to the newly-added IWSLT2016 mock testing which addresses two issues:

    1. Starting from the downloaded archive to test the extraction and cleaning pipeline more fully
    2. Adds missed test split testing path

    cc @parmeet

    cla signed ciflow/default 
    opened by erip 18
  • Customize torchtext.data.Dataset takes much time to generate dataset

    ❓ Questions and Help

    Description

    I wrote a customized data.Dataset for multilabel classification. When I processed the data, I found that generating the train and test sets with the customized dataset is very slow (about 1.5 s per example). I am wondering whether this is normal or whether something is wrong with my customized dataset.

    The customized data.Dataset for multilabel classification is as follows:

    class TextMultiLabelDataset(data.Dataset):
        def __init__(self, text, text_field, label_field, lbls=None, **kwargs):
            # torchtext Field objects
            fields = [('text', text_field), ('label', label_field)]
            # for l in lbl_cols:
            # fields.append((l, label_field))
    
            is_test = True if lbls is None else False
            if is_test:
                pass
            else:
                n_labels = len(lbls)
    
            examples = []
            for i, txt in enumerate(tqdm(text)):
                if not is_test:
                    l = lbls[i]
                else:
                    l = [0.0] * n_labels
    
                examples.append(data.Example.fromlist([txt, l], fields))
    
            super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)
    
    where text is a list of lists of strings taken from the documents, and lbls is a list of lists of binary labels (about 20,000 labels in total).
    

    examples of text:

    [["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]
    

    examples of lbls:

    [[1 1 1 1 0 0 0 1 0 ...], [1 0 1 0 1 1 1 1 ...], ...]
    
    new datasets and building blocks 
    opened by xdwang0726 18
  • ImportError undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

    I have been getting this error recently:

    ImportError: /home/user/miniconda/lib/python3.8/site-packages/torchtext/_torchtext.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

    opened by AK391 16
  • [Discussion] Saving the field object

    Usage: The field object is critical to checkpointing as it provides:

    • tokenization
    • padding
    • numericalize

    Having the ability to save the field object allows the user, given arbitrary text, to preprocess the text. The preprocessed text is then used with a checkpointed model. Then the output is predicted and interpreted without the output dictionary.

    Problem: torch.save is implemented with pickle. The field object accepts lambdas for tokenization, preprocessing, and postprocessing; a field that holds a lambda therefore cannot be pickled.
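
The pickling limitation can be seen with the standard library alone. In the sketch below, `whitespace_tokenize`, `lambda_tokenize`, and `can_pickle` are illustrative names rather than torchtext APIs. A function defined with `def` at module level is pickled by reference to its name, whereas a lambda has no importable name for pickle to record.

```python
import pickle

def whitespace_tokenize(text):
    """A named, module-level tokenizer; pickle stores it by reference."""
    return text.split()

# An anonymous function with the same behavior.
lambda_tokenize = lambda text: text.split()

def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print("named function:", can_pickle(whitespace_tokenize))
print("lambda:", can_pickle(lambda_tokenize))
```

Under this reading, replacing lambdas with named functions or small callable classes is one way to make an object that holds them picklable.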

    Key Points:

    • The vocab object needs to be pickled because the output of the model is uninterpretable without it.

    Discussion: What is the right abstraction here? Should the vocab object be saved and the field object discarded? Is it appropriate to have the field object and the vocab object closely bound?

    enhancement help wanted 
    opened by PetrochukM 16
  • NameError: name 'IterableWrapper' is not defined

    🐛 Bug

    Hello, I am trying to load torchtext datasets to reproduce a couple of the tutorials using the new PyTorch MPS support on Mac. After downloading torchdata I get the following error when trying to load any of the datasets in torchtext.datasets:

    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    Input In [51], in <cell line: 1>()
    ----> 1 data = torchtext.datasets.IMDB(
          2     '~/data',
          3     split='train'
          4 )
    
    File ~/opt/anaconda3/envs/main/lib/python3.10/site-packages/torchtext/data/datasets_utils.py:193, in _create_dataset_directory.<locals>.decorator.<locals>.wrapper(root, *args, **kwargs)
        191 if not os.path.exists(new_root):
        192     os.makedirs(new_root, exist_ok=True)
    --> 193 return fn(root=new_root, *args, **kwargs)
    
    File ~/opt/anaconda3/envs/main/lib/python3.10/site-packages/torchtext/data/datasets_utils.py:155, in _wrap_split_argument_with_fn.<locals>.new_fn(root, split, **kwargs)
        153 result = []
        154 for item in _check_default_set(split, splits, fn.__name__):
    --> 155     result.append(fn(root, item, **kwargs))
        156 return _wrap_datasets(tuple(result), split)
    
    File ~/opt/anaconda3/envs/main/lib/python3.10/site-packages/torchtext/datasets/imdb.py:86, in IMDB(root, split)
         81 if not is_module_available("torchdata"):
         82     raise ModuleNotFoundError(
         83         "Package `torchdata` not found. Please install following instructions at `https://github.com/pytorch/data`"
         84     )
    ---> 86 url_dp = IterableWrapper([URL])
         88 cache_compressed_dp = url_dp.on_disk_cache(
         89     filepath_fn=partial(_filepath_fn, root),
         90     hash_dict={_filepath_fn(root): MD5},
         91     hash_type="md5",
         92 )
         93 cache_compressed_dp = HttpReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)
    
    NameError: name 'IterableWrapper' is not defined
    

    My environment screen dump is below. Thanks.

    Collecting environment information...
    PyTorch version: 1.13.0.dev20220525
    Is debug build: False
    CUDA used to build PyTorch: None
    ROCM used to build PyTorch: N/A
    
    OS: macOS 12.3.1 (x86_64)
    GCC version: Could not collect
    Clang version: 13.1.6 (clang-1316.0.21.2.5)
    CMake version: Could not collect
    Libc version: N/A
    
    Python version: 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] (64-bit runtime)
    Python platform: macOS-10.16-x86_64-i386-64bit
    Is CUDA available: False
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] numpy==1.22.4
    [pip3] torch==1.13.0.dev20220525
    [pip3] torchaudio==0.12.0.dev20220525
    [pip3] torchdata==0.5.0.dev20220525
    [pip3] torchtext==0.13.0.dev20220525
    [pip3] torchvision==0.14.0.dev20220525
    [conda] numpy                     1.22.4                   pypi_0    pypi
    [conda] torch                     1.13.0.dev20220525          pypi_0    pypi
    [conda] torchaudio                0.12.0.dev20220525          pypi_0    pypi
    [conda] torchdata                 0.5.0.dev20220525          pypi_0    pypi
    [conda] torchtext                 0.13.0.dev20220525          pypi_0    pypi
    [conda] torchvision               0.14.0.dev20220525          pypi_0    pypi
    
    torchtext version is  0.13.0.dev20220525
    
    opened by rkingery 15
  • Add Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models

    Description

    This PR implements the GPT-2 BPE tokenizer that is used by RoBERTa models. The implemented tokenizer is an initial working version that is scriptable. This PR also adds test cases for the tokenizer.

    Testing

    pytest -k test_gpt2_bpe_tokenizer test/test_transforms.py

    Future Work

    1. Implement Caching
    2. Move implementation to C++
    3. Implement support for character indices with tokenization

    Follow Ups to this PR

    • [ ] Refactor Tokenizer organization in TorchText and set up tokenization specific constant and util functions.
    • [ ] Once integration testing is set up for the repo, create a new hand-crafted unit test and move "shipped BPE model" tests to integration testing
    • [ ] Set up time with @mthrok to have an in-depth discussion on the question of accepting file paths v/s file-like objects.
    cla signed ciflow/default 
    opened by abhinavarora 14
  • Saving and loading vocabulary

    ❓ Questions and Help

    Description

    I trained a classification model and used torchtext to create a vocabulary from a pre-trained model. My problem is that when saving the model, I didn't save the vocabulary TXT object. Now I can't run inference with the model because it can't find the vocabulary. Is there a way to recreate the vocabulary after training, or do I have to retrain the model? Thanks.

    opened by laleye 1
  • Add GPU tests

    🚀 Feature

    Add GPU tests

    Motivation

    Several bugs have come up related to mismatching devices for the T5 model. This is because we have no proper tests for running the model on a GPU or anywhere there might be more than one device. The first step would be to add GPU tests for running a T5Model.

    Additional context

    From @osalpekar on how to start this:

    I think we would need to add a new GitHub Actions workflow called test-linux-gpu.yml. Setting up the workflow should be the same as the existing CPU one; we'll just need to pass 3 more args (runner-type, gpu-arch-type, and gpu-arch-version) so that we run this on a GPU instance: https://github.com/pytorch/test-infra/wiki/Writing-generic-CI-jobs#simple-gpu. And in the script for the workflow, you can specify the pytest command for running those GPU tests.

    opened by joecummings 0
  • Change transforms in RoBERTa into classes

    Currently, transforms in the RobertaBundle are defined as anonymous lambda functions. These are not pickleable and cannot be imported for use anywhere else.

    Ex proposal:

    lambda: T.Sequential(
            T.SentencePieceTokenizer(urljoin(_TEXT_BUCKET, "xlmr.sentencepiece.bpe.model")),
            T.VocabTransform(load_state_dict_from_url(urljoin(_TEXT_BUCKET, "xlmr.vocab.pt"))),
            T.Truncate(510),
            T.AddToken(token=0, begin=True),
            T.AddToken(token=2, begin=False),
        ),
    

    -->

    class RobertaTransform:
        # T, urljoin, _TEXT_BUCKET and load_state_dict_from_url are the same
        # names used by the lambda version above.
        def __init__(self, truncate_length=510):
            self.transform = T.Sequential(
                T.SentencePieceTokenizer(urljoin(_TEXT_BUCKET, "xlmr.sentencepiece.bpe.model")),
                T.VocabTransform(load_state_dict_from_url(urljoin(_TEXT_BUCKET, "xlmr.vocab.pt"))),
                T.Truncate(truncate_length),
                T.AddToken(token=0, begin=True),
                T.AddToken(token=2, begin=False),
            )

        def __call__(self, text):
            # Return the result so the object can be used in place of the lambda.
            return self.transform(text)
    
    opened by joecummings 0
  • [Nova] Simplify Caller Workflows

    There's no need to have a matrix in the caller workflow. Let's just pass these inputs directly. We should do this for all caller workflows across all the repos as a general cleanup.

    cla signed 
    opened by osalpekar 0
Releases(v0.14.1)
  • v0.14.1(Dec 16, 2022)

  • v0.14.0(Oct 28, 2022)

    Highlights

    In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

    • Added CNN-DM dataset
    • Added support for RegexTokenizer
    • Added TorchArrow based examples for training RoBERTa model on SST2 classification dataset

    Datasets

    We increased the number of datasets in TorchText from 30 to 31 by adding the CNN-DM (paper) dataset. The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html

    Tokenizers

    TorchText has extended support for TorchScriptable tokenizers by adding a RegexTokenizer that enables splitting based on regular expressions. TorchScriptability support allows users to embed the RegexTokenizer natively in C++ without needing a Python runtime. As TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate regex tokenizers for deployment needs.
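
To make the idea of regex-based tokenization concrete, here is a small pure-Python sketch that applies a list of (pattern, replacement) pairs and then splits on whitespace. It uses the standard `re` module and is only an illustration, not TorchText's implementation; the function `regex_tokenize` and the example patterns are assumptions made for this sketch.

```python
import re

def regex_tokenize(text, patterns):
    """Apply each (pattern, replacement) pair in order, then split on whitespace."""
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text.split()

# Hypothetical patterns: isolate apostrophes, drop double quotes,
# and turn basic punctuation into spaces.
patterns = [(r"\'", " ' "), (r"\"", ""), (r"[.,!?]", " ")]

tokens = regex_tokenize("Hello, world! It's here.", patterns)
print(tokens)
```

The order of the pairs matters, since each replacement operates on the output of the previous one.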

    New Features

    Transforms, Tokenizers, Ops

    • Migrate RegexTokenizer from experimental/transforms.py to transforms.py (#1763)
    • Migrate MaskTransform from internal to experimental/transforms.py (#1775)
    • Graduate MaskTransform from prototype (#1882)

    Datasets

    • Add CNN-DM dataset to torchtext (#1789)
    • Resolve inconsistency in IMDB label output (#1914)
    • Cache CNNDM extraction and optimize reading in filenames (#1809)
    • Allow CNNDM to be imported from torchtext.datasets (#1884)

    Improvements

    Features

    • Convert TA transform module to prepoc function (#1854)
    • Use TA functional for adding tokens to the beginning and end of input (#1820)
    • Add TA Tensor creation operation to the benchmark (#1836)
    • Add never_split feature to BERTTokenizer (#1898)
    • Adding benchmarks for add tokens operator (#1807)
    • Add benchmark for roberta prepoc pipelines (#1684)
    • Adding Benchmark for TA ops (#1801)
    • Make BERT benchmark code more robust (#1871)
    • Define TORCHTEXT_API macro for visibility control (#1806)
    • Modify get_local_asset_path to take overwrite option and use it in BERTTokenizer (#1839)

    Testing

    • Add test to compare encoder inference on input with and without padding (#1770)
    • Add m1 tagged build for TorchText (#1776)
    • Refactor TorchText version handing and adding first version of M1 builds (#1773)
    • Fix test execution in torchtext (#1889)
    • Add torchdata to testing requirements in requirements.txt (#1874)
    • Add missing None type hint to tests (#1868)
    • Create pytest fixture to auto delete model checkpoints within integration tests (#1886)
    • Disable test_vocab_from_raw_text_file on Linux (#1901)

    Examples

    • Add libtorchtext cpp example (#1817)
    • Torcharrow based training using RoBERTa model and SST2 classification dataset (#1808)

    Documentation

    • Add Datasets contribution guidelines (#1798)
    • Correct typo in SST-2 tutorial (#1865)
    • Update doc theme to the latest (#1899)
    • Tutorial on using T5 model for text summarization (#1864)
    • Fix docstring type (#1867)

    Bug fixes

    • Fixing incorrect inputs to add eos and bos operators (#1810)
    • Add missing type hints (#1782)
    • Fix typo in nightly branch ref (#1783)
    • Sharing -> sharding (#1787)
    • Remove padding mask for input embeddings (#1799)
    • Fixed on_disk_cache issues (#1957)
    • Fix Multi30k dataset urls (#1816)
    • Add missing CMake file in tokenizer dir (#1908)
    • Fix OBO error for vocab files with empty lines (#1841)
    • Fixing build when CUDA enabled torch is installed (#1814)
    • Make comment paths dynamic (#1894)
    • Turn off mask checking for torchtext which is known to have a legal mask (#1906)
    • Fix push on release reference name (#1792)

    Dependencies

    • Remove future dep from windows (#1838)
    • Remove dependency on the torch::jit::script::Module for mobile builds (#1885)
    • Add Torchdata as a requirement and remove conditional imports of Torchdata (#1962)
    • Remove sphinx_rtd_theme from requirements.txt (#1837)
    • Fix Sphinx-gallery display and pin sphinx-related packages (#1907)

    Others

    • Resolve and remove TODO comments (#1912)
    • Refactor TorchText version handling and adding first version of M1 builds (#1773)
    • Update xcode version to 14.0 in CI (#1881)
    • CI: Use self hosted runners for build (#1851)
    • Move Spacy from Pip dependencies to Conda dependencies (#1890)
    • Update compatibility matrix for 0.13 release (#1802)
    • Update CircleCI Xcode image (#1818)
    • Avoid looping through the whole counter in bleu_score method (#1913)
    • Rename build_tools dir to tools dir (#1804)
    • Use setup-miniconda action for m1 build (#1897)
    • Making sure we build correctly against release branch (#1790)
    • Adding the conda builds for m1 (#1794)
    • Automatically initialize submodule (#1805)
    • Set MACOSX_DEPLOYMENT_TARGET=10.9 for binary job (#1835)
  • v0.13.1(Aug 5, 2022)

  • v0.13.0(Jun 28, 2022)

    Highlights

    In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

    • Added all 9 GLUE benchmark’s datasets (#1710): CoLA, MRPC, QQP, STS-B, SST-2, MNLI, QNLI, RTE, WNLI
    • Added support for BERTTokenizer
    • Created native C++ binaries using a CMake based build system (#1644)

    Datasets

    We increased the number of datasets in TorchText from 22 to 30 by adding the remaining 8 datasets from the GLUE benchmark (SST-2 was already supported). The complete list of GLUE datasets is as follows:

    • CoLA (paper): Single sentence binary classification acceptability task
    • SST-2 (paper): Single sentence binary classification sentiment task
    • MRPC (paper): Dual sentence binary classification paraphrase task
    • QQP: Dual sentence binary classification paraphrase task
    • STS-B (paper): Single sentence to float regression sentence similarity task
    • MNLI (paper): Sentence ternary classification NLI task
    • QNLI (paper): Sentence binary classification QA and NLI tasks
    • RTE (paper): Dual sentence binary classification NLI task
    • WNLI (paper): Dual sentence binary classification coreference and NLI tasks

    The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html

    Tokenizers

    TorchText has extended support for TorchScriptable tokenizers by adding the WordPiece tokenizer used in BERT. It is one of the most commonly used algorithms for splitting input text into sub-word units and was introduced in Japanese and Korean Voice Search (Schuster et al., 2012).

    TorchScriptability support allows users to embed the BERT text pre-processing natively in C++ without needing a Python runtime. As TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate BERT tokenizers for deployment needs.

    For usage details, please refer to the corresponding documentation.
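
    As a rough illustration of the sub-word splitting described above, the following pure-Python sketch applies greedy longest-match-first WordPiece splitting to a single word. The function name, the toy vocabulary, and the "##" continuation prefix are assumptions made for the example; this is not TorchText's C++ implementation, and BERTTokenizer's pre-tokenization steps (lower-casing, punctuation splitting) are omitted.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-word units."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: treat the whole word as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, hypothetical and far smaller than a real BERT vocabulary.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
tokens = wordpiece_tokenize("unaffable", toy_vocab)
```

    With this toy vocabulary, "unaffable" splits into "un", "##aff", "##able", while a word with no matching pieces collapses to the unknown token.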

    CMake Build System

    TorchText has migrated its build system for C++ extension and third party libraries to use CMake rather than PyTorch’s CppExtension module. This allows end-users to integrate TorchText C++ binaries in their applications without having a dependency on libpython thus allowing them to use TorchText operators in a non-Python environment.

    Refer to the GitHub issue for more details.

    Backward Incompatible Changes

    The RobertaModelBundle introduced in the 0.12 release, which gets pre-trained RoBERTa/XLM-R models and builds custom models with similar architecture, has been renamed to RobertaBundle (#1653).

    The default caching location (cache_dir) has been changed from os.path.expanduser("~/.TorchText/cache") to os.path.expanduser("~/.cache/torch/text"). Furthermore, the default root directory of datasets is now cache_dir/datasets (#1740). Users can control the default cache location via the TORCH_HOME environment variable (#1741).
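
    The new defaults can be pictured with the short sketch below. resolve_cache_dir is a hypothetical helper written for this note, not a TorchText function, and it assumes that TORCH_HOME, when set, simply replaces the "~/.cache/torch" prefix.

```python
import os


def resolve_cache_dir(env=None):
    """Hypothetical helper mirroring the documented defaults (not library code)."""
    env = os.environ if env is None else env
    # Assumption: TORCH_HOME, when set, replaces the "~/.cache/torch" prefix.
    torch_home = env.get("TORCH_HOME", os.path.expanduser("~/.cache/torch"))
    cache_dir = os.path.join(torch_home, "text")
    # Datasets live in a "datasets" sub-directory of the cache directory.
    datasets_root = os.path.join(cache_dir, "datasets")
    return cache_dir, datasets_root


# Example with an explicit, made-up TORCH_HOME value.
cache_dir, datasets_root = resolve_cache_dir({"TORCH_HOME": "/tmp/torch_home"})
```

    Passing the environment as an argument keeps the sketch easy to try out without touching the real process environment.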

    New Features

    Models

    • [fbsync] BetterTransformer support for TorchText (#1690) (#1694)
    • [fbsync] Killed to_better by having native load_from_state_dict and init (#1695)
    • [fbsync] Removed unneeded modules after using nn.Module for BetterTransformer (#1696)
    • [fbsync] Replaced TransformerEncoder in TorchText with better transformer (#1703)

    Transforms, Tokenizers, Ops

    • Added pad transform, string to int transform (#1683)
    • Added support for Scriptable BERT tokenizer (#1707)
    • Added support for batch input in BERT Tokenizer with perf benchmark (#1745)

    Datasets

    Support for GLUE benchmark’s datasets added:

    • CoLA (#1711)
    • MRPC (#1712)
    • QQP (#1713)
    • STS-B (#1714)
    • MNLI (#1715)
    • QNLI (#1717)
    • RTE (#1721)
    • WNLI (#1724)

    Note: SST2 was added previously (#1538)

    Others

    • Prepared datasets for new encoding kwarg. (#1616)
    • Added Shuffle and sharding datapipes to datasets (#1729)
    • For Datasets, refactored local functions to be global so that they can be pickled (#1726)
    • Updated TorchData DataPipe API usages (#1663)
    • Replaced lambda functions with regular functions in all datasets (#1718)

    CMake Build System

    • [CMake 1/3] Updated C++ includes to use imports relative to root directory (#1666)
    • [CMake 2/3] Added CMake Build to TorchText to create single _TorchText library (#1673)
    • [CMake 3/3] Split source files with Python dependency to separate library (#1660)

    Improvements

    Features

    • [BC-breaking] Renamed Roberta Bundle (#1635)
    • Modified CLIPTokenizer to either infer number of merges from encoder json or take it in constructor (#1622)
    • Provided option to return split tokens (#1698)
    • Updated dataset code to avoid creating multiple iterators from a DataPipe (#1708)

    Testing

    • Added unicode generation to IWSLT tests (followup to #1608) (#1642)
    • Added MacOS unit tests on CircleCI (#1672)
    • Added parameterized dataset pickling tests (#1732)
    • Added test to compare encoder inference on input with and without padding (#1770)
    • Added test for shuffle before shard (#1738)
    • Added more test coverage (#1653)
    • Enabled model testing in FBCode (#1720)
    • Fixed Windows builds with Python 3.10 by getting rid of ssize_t (#1627)
    • Built and tested py3.10 (#1625)
    • Making sure we build correctly against release branch (#1790)
    • Removed caching artifacts for datasets and fix it for vectors (#1674)
    • Installed torchdata from nightly release in CI (#1664)
    • Added m1 tagged build for TorchText (#1776)
    • Refactored TorchText version handling and adding first version of M1 builds (#1773)
    • Removed MACOSX_DEPLOYMENT_TARGET (#1728)

    Examples

    • Added data pipelines for Roberta pre-processing (#1637)
    • Updated sst2 tutorial to replace lambda usage (#1722)

    Documentation

    • Removed _add_docstring_header decorator from amazon review polarity (#1611)
    • Added missing quotation marks to CLIPTokenizer docs (#1610)
    • Updated README around installing LTS version (#1665)
    • Added contributing guidelines for third party and custom C++ operators (#1742)
    • Added recommendations regarding use of datapipes for multi-processing, shuffling, DDP, etc. (#1755)
    • Fixed roberta bundle example doc (#1648)
    • Updated doc conf (#1634)
    • Removed install instructions (#1641)
    • Updated README (#1652)
    • Updated requirements (#1675)
    • Fixed typo sharing -> sharding (#1787)
    • Fixed docs build (#1730)
    • Replaced git+git with git+https in requirements.txt (#1658)
    • Added header info for BERT tokenizer (#1754)
    • Fixed docstring for Tokenizers (#1739)
    • Fixed doc js initialization (#1736)
    • Added missing type hints (#1782)
    • Fixed SentencePiece Tokenizer doc-string (#1706)

    Bug fixes

    • Fixed missed mask arg in TorchText transformer (#1758)
    • Fixed bug in RTE and WNLI testing (#1759)
    • Fixed bug in QNLI dataset and corresponding test (#1760)
    • Fixed STSB and WikiTexts tests (#1737)
    • Fixed smoke tests for linux (#1687)
    • Removed redundant dataname in test_shuffle_shard_wrapper (#1733)
    • Fixed non-deterministic test failures for IWSLT (#1699)
    • Fixed typo in nightly branch ref (#1783)
    • Fixed windows utils test (#1761)
    • Fixed test utils (#1757)
    • Fixed pad transform test (#1688)
    • Resolved issues in #1653 + sanitize test names generated by nested_params (#1667)
    • Fixed mock tests due to change in datasets directory (#1749)
    • Deleted prints in test_qqp.py (#1734)
    • Fixed logger issue (#1656)

    Others

    • Pinned Jinja2 version to fix broken doc build (#1669)
    • Fixed formatting for all files using pre-commit (#1670)
    • Pinned setuptools to 58.0.4 on Windows (#1746)
    • Added post install script for pywin32 (#1748)
    • Pinned Utf8proc version (#1771)
    • Removed models from experimental (#1643)
    • Cleaned examples folder (#1647)
    • Cleaned stale code (#1654)
    • Took TORCH_HOME env variable into account while setting the cache dir (#1741)
    • Updated download hooks and datasets to import HttpReader and GDriveReader from download hooks (#1657)
    • Added Model benchmark (#1697)
    • Changed root directory for datasets (#1740)
    • Used _get_torch_home standard utility from torch hub (#1752)
    • Removed ticks (``) from the url under is_module_available (#1753)
    • Prepared repo for auto-formatters (#1546)
    • Fixed flake8 issues introduced from adding auto formatter (#1617)
  • v0.12.0(Mar 10, 2022)

    Highlights

    In this release, we have revamped the library to provide a more comprehensive experience for users to do NLP modeling using TorchText and PyTorch.

    • Migrated datasets to build on top of TorchData DataPipes
    • Added support for RoBERTa and XLM-RoBERTa pre-trained models
    • Added support for Scriptable tokenizers
    • Added support for composable transforms and functionals

    Datasets

    TorchText has modernized its datasets by migrating from older-style Iterable Datasets to TorchData’s DataPipes. TorchData is a library that provides modular/composable primitives, allowing users to load and transform data in performant data pipelines. These DataPipes work out-of-the-box with PyTorch DataLoader and would enable new functionalities like auto-sharding. Users can now easily do data manipulation and pre-processing using user-defined functions and transformations in a functional programming style. Datasets backed by DataPipes also enable standard flow-control like batching, collation, shuffling and bucketizing. Collectively, DataPipes provide a comprehensive experience for data preprocessing and tensorization needs in a pythonic and flexible way for model training.

    from functools import partial
    import torchtext.functional as F
    import torchtext.transforms as T
    from torch.hub import load_state_dict_from_url
    from torch.utils.data import DataLoader
    from torchtext.datasets import SST2
    
    # Tokenizer to split input text into tokens
    encoder_json_path = "https://download.pytorch.org/models/text/gpt2_bpe_encoder.json"
    vocab_bpe_path = "https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe"
    tokenizer = T.GPT2BPETokenizer(encoder_json_path, vocab_bpe_path)
    # vocabulary converting tokens to IDs
    vocab_path = "https://download.pytorch.org/models/text/roberta.vocab.pt"
    vocab = T.VocabTransform(load_state_dict_from_url(vocab_path))
    # Add BOS token to the beginning of sentence
    add_bos = T.AddToken(token=0, begin=True)
    # Add EOS token to the end of sentence
    add_eos = T.AddToken(token=2, begin=False)
    
    # Create SST2 dataset datapipe and apply pre-processing
    batch_size = 32
    train_dp = SST2(split="train")
    train_dp = train_dp.batch(batch_size).rows2columnar(["text", "label"])
    train_dp = train_dp.map(tokenizer, input_col="text", output_col="tokens")
    train_dp = train_dp.map(partial(F.truncate, max_seq_len=254), input_col="tokens")
    train_dp = train_dp.map(vocab, input_col="tokens")
    train_dp = train_dp.map(add_bos, input_col="tokens")
    train_dp = train_dp.map(add_eos, input_col="tokens")
    train_dp = train_dp.map(partial(F.to_tensor, padding_value=1), input_col="tokens")
    train_dp = train_dp.map(F.to_tensor, input_col="label")
    # create DataLoader
    dl = DataLoader(train_dp, batch_size=None)
    batch = next(iter(dl))
    model_input = batch["tokens"]
    target = batch["label"]
    

    TorchData is required in order to use these datasets. Please install it by following the instructions at https://github.com/pytorch/data

    Models

    We have added support for pre-trained RoBERTa and XLM-R models. The models are torchscriptable and hence can be employed for production use-cases. The modeling APIs let users attach custom task-specific heads with pre-trained encoders. The API also comes equipped with data pre-processing transforms to match the pre-trained weights and model configuration.

    import torch, torchtext
    from torchtext.functional import to_tensor
    xlmr_base = torchtext.models.XLMR_BASE_ENCODER
    model = xlmr_base.get_model()
    transform = xlmr_base.transform()
    input_batch = ["Hello world", "How are you!"]
    model_input = to_tensor(transform(input_batch), padding_value=1)
    output = model(model_input)
    output.shape  # torch.Size([2, 6, 768])
    
    # add classification head
    import torch.nn as nn
    class ClassificationHead(nn.Module):
        def __init__(self, input_dim, num_classes):
            super().__init__()
            self.output_layer = nn.Linear(input_dim, num_classes)
    
        def forward(self, features):
            #get features from cls token
            x = features[:, 0, :]
            return self.output_layer(x)
    
    binary_classifier = xlmr_base.get_model(head=ClassificationHead(input_dim=768, num_classes=2)) 
    output = binary_classifier(model_input)
    output.shape  # torch.Size([2, 2])
    

    Transforms and tokenizers

    We have revamped our transforms to provide composable building blocks to do text pre-processing. They support both batched and non-batched inputs. Furthermore, we have added support for a number of commonly used tokenizers including SentencePiece, GPT-2 BPE and CLIP.

    import torchtext.transforms as T
    from torch.hub import load_state_dict_from_url
    
    padding_idx = 1
    bos_idx = 0
    eos_idx = 2
    max_seq_len = 256
    xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
    xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"
    
    text_transform = T.Sequential(
        T.SentencePieceTokenizer(xlmr_spm_model_path),
        T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
        T.Truncate(max_seq_len - 2),
        T.AddToken(token=bos_idx, begin=True),
        T.AddToken(token=eos_idx, begin=False),
    )
    
    text_transform(["Hello World", "How are you"])
    
    
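    To make the role of Truncate(max_seq_len - 2) in the pipeline above concrete, here is a plain-Python sketch of the truncate-then-add-tokens steps on a list of token IDs. The helper functions are hypothetical stand-ins written for this note and do not call TorchText; they only mirror the idea that two positions are reserved for the BOS and EOS tokens.

```python
def truncate(ids, max_len):
    """Keep at most max_len token IDs."""
    return ids[:max_len]


def add_token(ids, token, begin):
    """Add a single token ID at the beginning or at the end of the sequence."""
    return [token] + ids if begin else ids + [token]


def preprocess(ids, max_seq_len=256, bos_idx=0, eos_idx=2):
    # Reserve two positions so the result never exceeds max_seq_len.
    ids = truncate(ids, max_seq_len - 2)
    ids = add_token(ids, bos_idx, begin=True)
    return add_token(ids, eos_idx, begin=False)


long_ids = list(range(10, 310))  # 300 stand-in token IDs
out = preprocess(long_ids)
```

    A 300-ID input is cut to 254 IDs before the two special tokens are added, so the final sequence has exactly max_seq_len entries; shorter inputs simply gain the two special tokens.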

    Tutorial

    We have added an end-to-end tutorial that performs SST-2 binary text classification with the pre-trained XLM-R base architecture and demonstrates the usage of the new datasets, transforms and models.

    Backward Incompatible changes

    We have removed the legacy folder in this release which provided access to legacy datasets and abstractions. For additional information, please refer to the corresponding github issue (#1422) and PR (#1437)

    New Features

    Models

    • Add XLMR Base and Large pre-trained models and corresponding transformations (#1407)
    • Added option to specify whether to load pre-trained weights (#1424)
    • Added Option for freezing encoder weights (#1428)
    • Enable optional return of all states in transformer encoder (#1430)
    • Added support for RobertaModel to accept model configuration (#1431)
    • Allow inferred scaling in MultiheadSelfAttention for head_dim != 64 (#1432)
    • Added attention mask to transformer encoder modules (#1435)
    • Added builder method in Model Bundler to facilitate model creation with user-defined configuration and checkpoint (#1442)
    • Cleaned up Model API (#1452)
    • Fixed bool attention mask in transformer encoder (#1454)
    • Removed xlmr transform class and instead used sequential for model transforms composition (#1482)
    • Added support for pre-trained Roberta encoder for base and large architecture (#1491)

    Transforms, Tokenizers, Ops

    • Added ToTensor and LabelToIndex transformations (#1415)
    • Added Truncate Transform (#1458)
    • Updated input annotation type to Any to support torch-scriptability during transform composability (#1453)
    • Added AddToken transform (#1463)
    • Added GPT-2 BPE pre-tokenizer operator leveraging re2 regex library (#1459)
    • Added Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models (#1462)
    • Migrated GPT-2 BPE tokenizer logic to C++ (#1469)
    • fix optionality of default arg in to_tensor (#1475)
    • added scriptable sequential transform (#1481)
    • Removed optionality of dtype in ToTensor (#1492)
    • Fixed max sequence length for xlmr transform (#1495)
    • add max_tokens kwarg to vocab factory (#1525)
    • Refactor vocab factory method to accept special tokens as a keyword argument (#1436)
    • Implemented ClipTokenizer that builds on top of GPT2BPETokenizer (#1541)

    Datasets

    Migration of datasets on top of datapipes

    • AG_NEWS (#1498)
    • AmazonReviewFull (#1499)
    • AmazonReviewPolarity (#1490)
    • DBpedia (#1500)
    • SogouNews (#1503)
    • YelpReviewFull (#1507)
    • YelpReviewPolarity (#1509)
    • YahooAnswers (#1508)
    • CoNLL2000Chunking (#1515)
    • UDPOS (#1535)
    • IWSLT2016 (#1545)
    • IWSLT2017 (#1547)
    • Multi30K (#1536)
    • SQuAD1 (#1513)
    • SQuAD2 (#1514)
    • PennTreebank (#1511)
    • WikiText103 (#1518)
    • WikiText2 (#1519)
    • EnWik9 (#1512)
    • IMDB (#1531)

    Newly added datasets

    • SST2 (#1538)
    • CC-100 (#1562)

    Misc

    • Fix split filter logic in AmazonReviewPolarity (#1505)
    • Use os.path.join for consistency (#1506)
    • Fixing dataset test failures due to incorrect caching mode in AG_NEWS (#1517)
    • Added caching for extraction datapipe for AmazonReviewPolarity (#1527)
    • Added caching for extraction datapipe for Yahoo (#1528)
    • Added caching for extraction datapipe for yelp full (#1529)
    • Added caching for extraction datapipe for yelp polarity (#1530)
    • Added caching for extraction datapipe for DBPedia (#1571)
    • Added caching for extraction datapipe for SogouNews and AmazonReviewFull (#1594)
    • Fixed issues with extraction caching (#1550, #1551, #1552)
    • Updating Conll2000Chunking dataset to be consistent with other datasets (#1590)
    • [BC-breaking] removed unnecessary split argument from datasets (#1591)

    Improvements

    Testing

    Revamp TorchText dataset testing to use mocked data

    • AG_NEWS (#1553)
    • AmazonReviewFull (#1561)
    • AmazonReviewPolarity (#1532)
    • DBpedia (#1566)
    • SogouNews (#1576)
    • YelpReviewFull (#1567)
    • YelpReviewPolarity (#1567)
    • YahooAnswers (#1577)
    • CoNLL2000Chunking (#1570)
    • UDPOS (#1569)
    • IWSLT2016 (#1563)
    • IWSLT2017 (#1598)
    • Multi30K (#1554)
    • SQuAD1 (#1574)
    • SQuAD2 (#1575)
    • PennTreebank (#1578)
    • WikiText103 (#1592)
    • WikiText2 (#1592)
    • EnWik9 (#1560)
    • IMDB (#1579)
    • SST2 (#1542)
    • CC-100 (#1583)

    Others

    • Fixed attention mask testing (#1439)
    • Fixed CircleCI download failures on windows for XLM-R unit tests (#1441)
    • Added unit tests for testing model training (#1449)
    • Parameterized XLMR and Roberta model integration tests (#1496)
    • Removed redundant get asset functions from parameterized_utils file (#1501)
    • Parameterize jit and non-jit model integration tests (#1502)
    • fixed cache logic to work with datapipes (#1522)
    • Convert get_mock_dataset fn in AmazonReviewPolarity to be private (#1543)
    • Removing unused TEST_MODELS_PARAMETERIZED_ARGS constant from model test (#1544)
    • Removed real dataset caching and testing in favor of mocked dataset testing (#1587)
    • fixed platform-dependent expectation for Multi30k mocked test. (#1593)
    • Fixed Conll2000Chunking Test (#1595)
    • Updated IWSLT testing to start from compressed file (#1596)
    • Used unicode strings to test utf-8 handling for all non-IWSLT dataset tests. (#1599)
    • Parameterize tests for similar datasets (#1600)

    Examples

    • non-distributed training example for SST-2 binary text classification data using XLM-Roberta model (#1468)

    Documentation

    Dataset Documentation

    • Updated docs for text classification and language modeling datasets (#1603)
    • Updated docs for Machine Translation, Sequence Tagging, Question Answer, Unsupervised Learning datasets (#1597)
    • Updated docs for CC100 and SST2 (#1604)
    • Update sphinx version, added rst files for models, transforms and functionals (#1434)
    • Removed experimental documentation (#1457)
    • Fix links in README (#1461)
    • Added sphinx based tutorial for SST-2 binary classification task using XLM-R model (#1468)
    • pointed to pytorch.org docs instead of outdated rtd link (#1480)
    • Added documentation describing XLM-R, the datasets it was trained on, and relevant license information (#1497)
    • Fixed CI doc build (#1504)
    • Remove example using next(...) from README (#1516)

    Misc

    • Hide symbols when building third party code (#1467)
    • Add .DS_Store files to gitignore (#1470)
    • Remove Python 3.6 support as it has reached EOL (#1484)
    • Added .gitattributes file to hide generated circleci files in PRs (#1485)
    • Switched to use FileOpener from FileLoader (#1488)
    • Update python_requires in setup.py to reflect support for non-EOL python versions (#1521)
    • Added auto-formatters (#1545)
    • fix typo in torchtext/vocab/vocab_factory.py (#1565)
    • Formatted datasets and tests (#1601, #1602)
  • v0.11.2(Jan 27, 2022)

  • v0.11.0-rc3(Oct 21, 2021)

    torchtext 0.11.0 Release Notes

    This is a relatively lightweight release while we are working on revamping the library. Users are encouraged to check various developments on the main branch.

    Improvements

    • Refactored C++ codebase to fix clang-tidy warnings and using emplace_back for improving performance (#1327)
    • Updated sentencepiece to v0.1.95 to make it compilable on M1 (#1336)
    • Up the priority of numpy array comparison in self.assertEqual (#1341)
    • Removed mentions of conda-forge as it is no longer necessary to build on python 3.9 (#1345)
    • Separated experimental tests to help remove them easily during release cycles (#1348)
    • Split the pybind and torchbind registration into a separate file and refactored Vocab modules to allow vocab to be used in a pure C++ environment (#1352)
    • Changed the default root directory for downloaded datasets to avoid dirtying the working directory (#1361)
    • Added method for logging module usage in fbcode (#1367)
    • Updated bug report file (#1377)
    • Renamed default branch to main (#1378)
    • Enabled torchtext extension to work seamlessly between fbcode and open-source (#1382)
    • Migrated CircleCI docker image (#1393)

    Docs

    • Fix tag build so that adding a tag will trigger a documentation build-and-upload (#1332)
    • Minor doc-string fix in Multi30K dataset (#1351)
    • Fixed example in doc-string of get_vec_by_tokens (#1383)
    • Updated docs to point to main instead of deprecated master branch (#1387)
    • Changed various README.md links to point to main instead of master branch (#1392)

    Bug fix

    • Fixed benchmark code that compares performance of vocab (#1339)
    • Fixed text classification example broken due to removal of experimental datasets (#1347)
    • Fixed issue in IMDB dataset that resulted in all samples being positive depending on directory path (#1354)
    • Fixed doc building (#1365)
  • v0.10.1(Sep 27, 2021)

  • v0.10.0(Jun 15, 2021)

    Highlights

    In this release, we introduce a new Vocab module that replaces the current Vocab class. The new Vocab provides common functional APIs for NLP workflows. This module is backed by an efficient C++ implementation that reduces look-up time by up to ~85% for batch look-up (refer to the summaries of #1248 and #1290 for further information on benchmarks), and provides support for TorchScript. We provide accompanying factory functions that can be used to build the Vocab object either from a Python ordered dictionary or from an iterator that yields lists of tokens.

    creating Vocab from text file

    import io
    from torchtext.vocab import build_vocab_from_iterator
    # generator that yields lists of tokens, one list per line of the file
    def yield_tokens(file_path):
        with io.open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line.strip().split()
    # get Vocab object
    vocab_obj = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
    

    creating Vocab through ordered dict

    from torchtext.vocab import vocab
    from collections import Counter, OrderedDict
    counter = Counter(["a", "a", "b", "b", "b"])
    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    ordered_dict = OrderedDict(sorted_by_freq_tuples)
    vocab_obj = vocab(ordered_dict)
    

    common API usage

    # look-up index
    vocab_obj["a"]
    
    # batch look-up indices
    vocab_obj.lookup_indices(["a","b"])
    # support forward API of PyTorch nn Modules
    vocab_obj(["a","b"])
    
    # batch look-up tokens
    vocab_obj.lookup_tokens([0,1])
    
    # set default index to return when token not found 
    vocab_obj.set_default_index(0)
    vocab_obj["out_of_vocabulary"] #prints 0
    

    Backward Incompatible changes

    • We have retired the old Vocab class into the legacy folder (#1289). Users relying on this class should be able to access it from torchtext.legacy. The Vocab module that replaces this class is not backward compatible. The most notable difference is that the Vectors object is not an attribute of the new Vocab object. We recommend that users use the build_vocab_from_iterator factory function to construct the new Vocab module, which provides similar initialization capabilities to the retired Vocab class.
    # retired Vocab class 
    from torchtext.legacy.vocab import Vocab as retired_vocab
    from collections import Counter
    tokens_list = ["a", "a", "b", "b", "b"]
    counter = Counter(tokens_list)
    vocab_obj = retired_vocab(counter, specials=["<unk>","<pad>"], specials_first=True)
    
    # new Vocab Module
    from torchtext.vocab import build_vocab_from_iterator
    vocab_obj = build_vocab_from_iterator([tokens_list], specials=["<unk>","<pad>"], specials_first=True)
    
    • Removed legacy batch from the torchtext.data package (#1307), which was kept around for backward compatibility reasons. Users can still access batch from the torchtext.legacy.data package.

    New Features

    • Introduced functional to convert Iterable-style to map-style datasets (#1299)
    from torchtext.datasets import IMDB
    from torchtext.data import to_map_style_dataset
    train_iter = IMDB(split='train')
    #convert iterator to map-style dataset
    train_dataset = to_map_style_dataset(train_iter)
    
    • Introduced functional to filter raw wikipedia XML dumps (#1292)
    from torchtext.data.functional import filter_wikipedia_xml
    from torchtext.datasets import EnWik9
    data_iter = EnWik9(split='train')
    # filter data according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
    filter_data_iter = filter_wikipedia_xml(data_iter)
    
    • Introduced Multi30k dataset (#1306)
    # Added datasets for http://www.statmt.org/wmt16/multimodal-task.html#task1
    from torchtext.datasets import Multi30k
    train_data, valid_data, test_data = Multi30k()
    next(train_data)
    # prints following 
    #('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n',
    # 'Two young, White males are outside near many bushes.\n')
    
    • Introduced new Vocab module and associated factory functions (#1289, #1297, #1302, #1304, #1308, #1309, #1310)
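
    The Iterable-style to map-style conversion listed above can be pictured with the following minimal sketch. MapStyleView and raw_samples are hypothetical names used only for illustration; the sketch assumes that a map-style view is obtained by materializing the iterator once so that samples can be indexed and counted, and it may differ from how to_map_style_dataset is actually implemented.

```python
class MapStyleView:
    """Illustrative map-style wrapper: supports len() and integer indexing."""

    def __init__(self, iterable):
        # Materialize the iterator once; this trades memory for random access.
        self._data = list(iterable)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]


def raw_samples():
    # Stand-in for an iterable-style dataset yielding (label, text) pairs.
    yield ("pos", "a fine film")
    yield ("neg", "a dull film")


dataset = MapStyleView(raw_samples())
second_sample = dataset[1]
```

    Random access and a known length are what samplers and shuffling in a standard PyTorch DataLoader rely on, which is the usual reason to convert an iterable-style dataset.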

    Improvements

    • Separated experimental and legacy tests into separate subfolders (#1285)
    • Stored md5 hash instead of raw text data for in-built datasets testing (#1261)
    • Cleaned up CircleCI cache handling and optimization of daily cache (#1236, #1238)
    • Fixed CircleCI caching issue when new dataset is added (#1314)
    • Organized datasets by names in root folder and moved common file reading functions into dataset_utils (#1233)
    • Added unit-test to verify raw datasets name property (#1234)
    • Fixed jinja2 environment autoescape to enable select extensions (#1277)
    • Added yaml.safe_load instead of yaml.load (#1278)
    • Added defusedxml to parse untrusted XML data (#1279)
    • Added CodeQL and Bandit security checks as GitHub Actions (#1266)
    • Added benchmark code to compare Vocab module with python dict for batch look-up time (#1290)

    Documentation

    • Fixing doc for nn modules (#1267)
    • Store artifacts of rendered docs so that rendered docs can be checked on each PR (#1288)
    • Add Google Analytics support (#1287)

    Bug Fix

    • Fixed import issue in text classification example (#1256)
    • Fixed and re-organized data pipeline example (#1250)

    Performance

    • used c10::string_view and fast-text dictionary inside C++ kernel of Vocab module (#1248)
  • v0.9.1-rc1(Mar 25, 2021)

  • v0.9.0-rc5(Mar 4, 2021)

    Highlights

    In this release, we’re updating torchtext’s datasets to be compatible with the PyTorch DataLoader, and deprecating torchtext’s own DataLoading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text #664. These new datasets are simple string-by-string iterators over the data, rather than the previously custom set of abstractions such as Field. The legacy Datasets and abstractions have been moved into a new legacy folder to ease the migration, and will remain there for two more releases. For guidance about migrating from the legacy abstractions to use modern PyTorch data utilities, please refer to our migration guide (link).

    The following raw text datasets are available as the replacement of the legacy datasets. Those datasets are iterators which yield the raw text data line-by-line. To apply those datasets in the NLP workflows, please refer to the end-to-end tutorial for the text classification task (link).

    • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
    • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
    • Sequence tagging: UDPOS, CoNLL2000Chunking
    • Translation: IWSLT2016, IWSLT2017
    • Question answer: SQuAD1, SQuAD2
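    A minimal sketch of consuming such a line-by-line iterator follows. It uses a hypothetical in-memory stand-in (`toy_raw_dataset`) rather than a real torchtext dataset, and it assumes each item is a `(label, text)` pair, as in the text classification datasets. Whitespace tokenization and the `<unk>` convention are likewise assumptions made for brevity.

```python
from collections import Counter


def toy_raw_dataset():
    """Hypothetical stand-in for a raw dataset: yields (label, text) pairs."""
    yield 1, "wall street rallies"
    yield 2, "team wins the final"
    yield 1, "markets close higher on wall street"


# Pass 1: count tokens to build a vocabulary (index 0 is reserved for unknowns).
counter = Counter()
for _label, line in toy_raw_dataset():
    counter.update(line.split())
# Most frequent tokens first; ties are broken alphabetically for determinism.
itos = ["<unk>"] + sorted(counter, key=lambda tok: (-counter[tok], tok))
stoi = {tok: idx for idx, tok in enumerate(itos)}

# Pass 2: turn each raw line into a list of token ids.
encoded = [
    (label, [stoi.get(tok, 0) for tok in line.split()])
    for label, line in toy_raw_dataset()
]
```

    Because the stand-in is a generator function, each pass creates a fresh iterator, which mirrors how a one-shot iterator has to be re-created before it can be traversed again.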

    This release also adds Python 3.9 support.

    Backwards Incompatible

    The current users of the legacy code will experience BC breakage as we have retired the legacy code (#1172, #1181, #1183). The legacy components are placed in torchtext.legacy.data folder as follows:

    • torchtext.data.Pipeline -> torchtext.legacy.data.Pipeline
    • torchtext.data.Batch -> torchtext.legacy.data.Batch
    • torchtext.data.Example -> torchtext.legacy.data.Example
    • torchtext.data.Field -> torchtext.legacy.data.Field
    • torchtext.data.Iterator -> torchtext.legacy.data.Iterator
    • torchtext.data.Dataset -> torchtext.legacy.data.Dataset

    This means, all features are still available, but within torchtext.legacy instead of torchtext.
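    For code that has to run against both older and newer torchtext versions, one possible approach is to try the new location first and fall back to the old one. The helper below is a hypothetical sketch (`import_first` is not part of torchtext); it only relies on the standard-library `importlib`.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first importable module that defines it."""
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # Module (or one of its parents) is not installed here.
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of {list(module_paths)}")


# Intended use (requires torchtext to be installed):
# Field = import_first("Field", ["torchtext.legacy.data", "torchtext.data"])
```

    The `hasattr` check matters because a module path can still be importable after a class has been moved out of it.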

    Table 1: Summary of the legacy datasets and the replacements in 0.9.0 release

    | Category | Legacy | 0.9.0 release |
    | -- | -- | -- |
    | Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2 |
    | | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103 |
    | | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank |
    | | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9 |
    | Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS |
    | | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews |
    | | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia |
    | | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity |
    | | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull |
    | | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers |
    | | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity |
    | | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull |
    | | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB |
    | | torchtext.legacy.datasets.SST | deferred |
    | | torchtext.legacy.datasets.TREC | deferred |
    | Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS |
    | | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking |
    | Translation | torchtext.legacy.datasets.WMT14 | deferred |
    | | torchtext.legacy.datasets.Multi30k | deferred |
    | | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017 |
    | Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred |
    | | torchtext.legacy.datasets.SNLI | deferred |
    | | torchtext.legacy.datasets.MultiNLI | deferred |
    | Question Answer | torchtext.legacy.datasets.BABI20 | deferred |

    Improvements

    • Enable importing metrics/utils/functional from torchtext.legacy.data (#1229)
    • Set up daily caching mechanism with Master job (#1219)
    • Reset the functions in datasets_utils.py as private (#1224)
    • Resolve the download folder for some raw datasets (#1213)
    • Store the hash of the extracted CoNLL2000Chunking files so the extraction step will be skipped if the extracted files are detected (#1204)
    • Fix the total number of lines in doc strings of the datasets (#1200)
    • Extend CI tests to cover all the datasets (#1197, #1201, #1171)
    • Document the number of lines in the dataset splits (#1196)
    • Add hashes to skip the slow extraction if the extracted files are available (#1195)
    • Use decorator to loop over the split argument in the datasets (#1194)
    • Remove offset option from torchtext.datasets, and move torchtext.datasets.common to torchtext.data.dataset_utils (#1188, #1145)
    • Remove the step to clean up the cache in test_iwslt() (#1192)
    • Split IWSLT dataset into IWSLT2016 and IWSLT2017 dataset and re-organize the parameters in the constructors (#1191, #1209)
    • Move the prototype datasets in torchtext.experimental.datasets.raw folder to torchtext.datasets folder (#1182, #1202, #1207, #1211, #1212)
    • Add a decorator add_docstring_header() to generate docstring (#1185)
    • Add EnWik9 dataset (#1184)
    • Avoid unnecessary downloads and extraction for some raw datasets, and add more logging (#1178)
    • Split raw datasets into individual files (#1156, #1173, #1174, #1175, #1176)
    • Extend the unittest coverage for all the raw datasets (#1157, #1149)
    • Define the relative path of the datasets in the download_from_url() func and skip unnecessary download if the downloaded files are detected (#1158, #1155)
    • Add MD5 and NUM_LINES as the meta information in the __init__ file of torchtext.datasets folder (#1155)
    • Standardize the text dataset doc strings and argument order. (#1151)
    • Report the “exceeds quota” error for the datasets using Google drive links (#1150)
    • Add support for the string-typed split values to the text datasets (#1147)
    • Re-name the argument from data_select to split in the dataset constructor (#1143)
    • Add Python 3.9 support across Linux, MacOS, and Windows platforms (#1139)
    • Switch to the new URL for the IWSLT dataset (#1115)
    • Extend the language shortcut in torchtext.data.utils.get_tokenizer func with the full name when Spacy tokenizers are loaded (#1140)
    • Fix broken CI tests due to spacy 3.0 release (#1138)
    • Pass an embedding layer to the constructor of the BertModel class in the BERT example (#1135)
    • Fix test warnings by switching to assertEqual() in PyTorch TestCase class (#1086)
    • Improve CircleCI tests and conda package (#1128, #1121, #1120, #1106)
    • Simplify TorchScript registration by adopting TORCH_LIBRARY_FRAGMENT macro (#1102)
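    Several items above rely on stored hashes to skip a slow extraction step when the extracted files are already present. The idea can be sketched with the standard library as follows; `file_md5` and `needs_extraction` are hypothetical helper names, not torchtext functions, and the real implementation may differ.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_extraction(extracted_path, expected_md5):
    """Return True unless the file exists and matches the stored hash."""
    if not os.path.isfile(extracted_path):
        return True
    return file_md5(extracted_path) != expected_md5
```

    Reading in chunks keeps memory use flat for large corpus files; the hash here serves as an integrity check, not as a security measure.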

    Bug Fixes

    • Fix the total number of returned lines in setup_iter() func in RawTextIterableDataset (#1142)

    Docs

    • Add number of classes to doc strings for text classification data (#1230)
    • Remove Lato font for pytorch/text website (#1227)
    • Add the migration tutorial (#1203, #1216, #1222)
    • Remove the legacy examples on pytorch/text website (#1206)
    • Update README file for 0.9.0 release (#1198)
    • Add CI check to detect undocumented parameters (#1167)
    • Add a static text link for the package version in the doc website (#1161)
    • Fix sphinx warnings and turn warnings into errors (#1163)
    • Add the text datasets to torchtext website (#1153)
    • Add the constructor document for IMDB and SST datasets (#1118)
    • Fix typos in the README file (#1089)
    • Rename "Arguments" to "Args" in the doc strings (#1110)
    • Build docs and push to gh-pages on nightly basis (#1105, #1111, #1112)
  • v0.8.1(Dec 10, 2020)

    Highlights

    Updated pinned PyTorch version to 1.7.1 and added Python 3.9 support.

    Improvement

    • Added Python 3.9 support #1088
    • Added certifi for Windows unittest envir #1077
    • Added setup version to pin torch dependency #1067

    Docs

    • Updated docs strings for torchtext.nn.InProjContainer #1083
    • Updated the doc strings for torchtext.nn.MultiheadAttentionContainer #1057
  • v0.8.0-rc2(Oct 27, 2020)

    This is a relatively light release while we are working on revamping the library. According to PyTorch feature classification changes, the new building blocks and datasets in the experimental folder are defined as Prototype and available in the nightly release only. Once the prototype building blocks are mature enough, we will release them together with all the relevant commits in a beta release. At the same time, users are encouraged to take a look at those building blocks and give us feedback. An easy way to send your feedback is to open an issue in the pytorch/text repo or comment on Issue #664. For details regarding the revamp execution, see Issue #985.

    The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command.

    pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
    

    For more detailed instructions, please refer to Install PyTorch. It should be noted that the new building blocks are still under development, and the APIs have not been solidified.

    The stable release branch here includes a few feature improvements and documentation updates. Compiled against the PyTorch 1.7.0 release, the stable release packages are available via Pip and Conda for Windows, Linux, and Mac.

    Improvements

    • Updated the BERT pipeline to improve question-answer task score #950
    • Fixed the order of the datasets used in the BERT example #1040
    • Skipped requests.get in download_from_url function if path exists #922
    • Used Ninja to build extensions and disable C++11 ABI when necessary for libtorch compatibility. #931
    • Removed SentencePiece from setup.py file. SentencePiece source code is now being used as the third-party library in torchtext #1055
    • Improved CircleCI setting for better engineering
      • Switched PyTorch binary location for CI unittests #1044
      • Parameterized UPLOAD_CHANNEL #1037
      • Installed binaries for the CI test directly from the CPU channel #1025, #981
      • Added dataclasses to dependencies for environment.yml #964
      • Bumped Xcode workers to 9.4.1 #951
      • Disabled glove tests due to URL breakage #920
      • Used the specific channel for the CI tests #907

    Docs

    • Added test and updated error message for load_sp_model function in torchtext.data.functional #984
    • Updated the README file in BERT example #899
    • Updated the legacy retirement message #1047
    • Updated index page to include links to PyTorch libraries and describe feature classification #1048
    • Cleaned up the doc strings #1049
    • Fixed clang-format version to what PyTorch uses #1052
    • Added OSX environment variables to the README file #1054
    • Updated README file for the prototype in the nightly release #1050

    Bug Fixes

    • Fixed the order of the datasets used in the BERT example #1040
  • v0.7.0-rc3(Jul 28, 2020)

    Highlights

    With the continued progress of PyTorch, some code in torchtext grew out of date with the SOTA PyTorch modules (for example torch.utils.data.DataLoader, torchscript). In 0.7.0 release, we’re taking big steps toward modernizing torchtext, and adding warning messages to these legacy components which will be retired in the October 0.8.0 release. We’re also introducing a host of new features, including:

    1. A generalized MultiheadAttentionContainer for flexible attention behavior
    2. Torchscript support for SentencePiece models
    3. An end-to-end BERT example pipeline, including pretrained weights and a question answering fine-tuning example
    4. The SQuAD1 and SQuAD2 question answering datasets
    5. Windows support

    Legacy code and issues

    For a period of time (ending around June of 2019), torchtext lacked active maintenance and grew out of date with the present SOTA research and PyTorch features. We’ve committed to bringing the library fully up to date, and identified a few core issues:

    • Several components and functionals were unclear and difficult to adopt. For example, the Field class coupled tokenization, vocabularies, splitting, batching and sampling, padding, and numericalization all together, and was opaque and confusing to users. We determined that these components should be divided into separate orthogonal building blocks. For example, it was difficult to use HuggingFace's tokenizers with the Field class (issue #609). Modular pipeline components would allow a third party tokenizer to be swapped into the pipeline easily.
    • torchtext’s datasets were incompatible with DataLoader and Sampler in torch.utils.data, or even duplicated that code (e.g. torchtext.data.Iterator, torchtext.data.Batch). Basic inconsistencies confused users. For example, many struggled to fix the data order while using Iterator (issue #828), whereas with DataLoader, users can simply set shuffle=False to fix the data order.
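    The idea of separate, orthogonal building blocks can be illustrated with a small sketch in which the tokenizer is just a callable passed into the pipeline. `make_pipeline` and the toy vocabulary are hypothetical and are not torchtext APIs; any function that maps a string to a list of tokens could be swapped in.

```python
def make_pipeline(tokenizer, stoi, unk_index=0):
    """Compose a tokenizer with a vocabulary lookup into one callable."""
    def pipeline(text):
        return [stoi.get(token, unk_index) for token in tokenizer(text)]
    return pipeline


# Toy vocabulary (assumed for illustration only).
stoi = {"<unk>": 0, "hello": 1, "world": 2, "h": 3}

# Swapping the tokenizer does not require touching the vocabulary step.
word_pipeline = make_pipeline(str.split, stoi)  # whitespace tokens
char_pipeline = make_pipeline(list, stoi)       # character tokens
```

    With this shape, a third-party tokenizer only has to expose a string-to-tokens callable in order to be plugged in.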

    We’ve addressed these issues in this release, and several legacy components are now ready to be retired:

    • torchtext.data.Batch (link)
    • torchtext.data.Field (link)
    • torchtext.data.Iterator (link)
    • torchtext.data.Example (link)

    In the 0.7.0 release, we add deprecation warnings; the components will finally be retired to the torchtext.legacy directory in the 0.8.0 release in October.

    New dataset abstraction

    Since the 0.4.0 release, we’ve been working on a new common interface for the torchtext datasets (inheriting from torch.utils.data.Dataset) to address the issues above, and completed it for this release. For standard usage, we’ve created a map-style dataset which materializes the text iterator. A default dataset processing pipeline, including tokenizer and vocabulary, is added to the map-style datasets to support one-command data loading.

    from torchtext.experimental.datasets import AG_NEWS
    train, test = AG_NEWS(ngrams=3)
    

    For those who want more flexibility, the raw text is still available as a torch.utils.data.IterableDataset by simply inserting .raw into the module path as follows.

    import torchtext.experimental.datasets.raw
    train, test = torchtext.experimental.datasets.raw.AG_NEWS()
    

    Instead of maintaining Batch and Iterator func in torchtext, the new dataset abstraction is fully compatible with torch.utils.data.DataLoader like below. collate_fn is used to process the data batch generated from DataLoader.

    from torch.utils.data import DataLoader
    def collate_fn(batch):
        texts, labels = [], []
        for label, txt in batch:
            texts.append(txt)
            labels.append(label)
        return texts, labels
    dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
    for idx, (texts, labels) in enumerate(dataloader):
        print(idx, texts, labels)
    

    With the new datasets, we worked together with the OSS community to re-write the legacy datasets in torchtext. Here is a brief summary of the progress:

    • Word language modeling datasets (WikiText2, WikiText103, PennTreeBank) #661, #774
    • Text classification datasets (AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull) #701, #775, #776
    • Sentiment analysis dataset (IMDb) #651
    • Translation datasets (Multi30k, IWSLT, WMT14) #751, #821, #851
    • Question-answer datasets (SQuAD1, SQuAD2) #773
    • Sequence tagging datasets (UDPOS, CoNLL2000Chunking) #805

    Those new datasets stay in torchtext.experimental.datasets directory. The old version of the datasets are still available in torchtext.datasets and the new datasets are opt-in. In 0.8.0 release, the old datasets will be moved to torchtext.legacy directory.

    To learn how to apply the new dataset abstraction with DataLoader and SOTA PyTorch compatibilities (like Distributed Data Parallel), we created a full example to use the new torchtext datasets (WikiText103, SQuAD1, etc) to train a BERT model. A pretrained BERT model is generated from masked language task and next sentence task. Then, the model is fine-tuned for the question-answer task. The example is available in torchtext repo (here).

    Backwards Incompatible Changes

    • Remove code specific to Python2 #732

    New Features

    • Refactor nn.MultiheadAttention as MultiheadAttentionContainer in torchtext #720, #839, #883
    • Pre-train BERT pipeline and fine-tune question-answer task #767
    • Experimental datasets in torchtext.experimental.datasets (See New Dataset Abstraction section above for the full list) #701, #773, #774, #775, #776, #805, #821, #851
    • Add Windows support for torchtext #772, #781, #789, #796, #807, #810, #829
    • Add torchscript support to SentencePiece #755, #771, #786, #798, #799

    Improvements

    • Integrates pytorch-probot into the repo #877
    • Switch to pytorch TestCase for built-in dataset #822
    • Switch experimental ngrams_func to data.utils.ngrams_iterator #813
    • Create root directory automatically for download_from_url if not exists #797
    • Add shebang line to suppress the lint warning #787
    • Switch to CircleCI and improve torchtext CI tests #744, #766, #768, #777, #783, #784, #794, #800, #801, #803, #809, #832, #837, #881, #888
    • Put sacremoses tokenizer test back #782
    • Update installation directions #763, #764, #769, #795
    • Add CCI cache for test data #748
    • Disable travis tests except for RUN_FLAKE8 #747
    • Disable Travis tests of which equivalent run on CCI #746
    • Use 'cpu' instead of None for Iterator #745
    • Remove the allow to fail statement in travis test #743
    • Add explicit test results to text classification datasets #738

    Docs

    • Bump nightlies to 0.8.0 #847
    • Update README.rst file #735, #817
    • Update the labels of docs in text classification datasets #734

    Bug Fixes

    None

    Deprecations

    Add deprecation warning to legacy code #863. The following legacy components are ready to be retired, including

    • torchtext.data.Batch (link)
    • torchtext.data.Field (link)
    • torchtext.data.Iterator (link)
    • torchtext.data.Example (link)
    • torchtext.datasets (link)

    In 0.7.0 release, we add deprecation warnings, and finally will retire them to the torchtext.legacy directory in the October 0.8.0 release.

  • 0.6.0(Apr 21, 2020)

    Highlights

    This release drops the Python2 support from torchtext. Some minor bug fixes and doc updates are included.

    We are continuously working on the new dataset abstraction. Users and developers are welcome to send feedback to issue #664. We also want to highlight pull request #701, where the latest dataset abstraction is applied to the text classification datasets.

    Backward compatibility

    • Unified tar and zip file handling within extract_archive function #692

    Docs

    • Updated the BLEU example in doc #729
    • Updated README file with conda installation #728
    • Allowed maximum sentence length to 120 in flake8 #719
    • Updated CODE_OF_CONDUCT.md file #702
    • Removed duplicate docs on torchtext website #697
    • Updated README file with a disclaimer for the new dataset abstraction #693
    • Updated docs in experimental language modeling dataset #682

    Bug Fixes

    • Sent out error message if SentencePiece is not installed. Fixed the SentencePiece dependency issue within conda package #733
    • Fixed a bug in experimental IMDB dataset to allow a custom vocab #683
  • 0.5.0(Jan 14, 2020)

    Highlights

    We simplify the current torchtext dataset library by leveraging existing utils (DataLoader, Sampler) in PyTorch core library, and we separate the tokenizer, vocabulary, and data processing functionals. Users will feel empowered to build data processing pipelines.

    [Experimental] New abstraction for torchtext dataset

    torchtext v0.5.0 release officially introduces a new abstraction for the datasets. Based on the feedback from users, the new abstraction will solve several issues existing in torchtext, including

    • Several components and functionals are unclear and difficult to adopt. For example, Field class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. The current Field class works like a "black box", and users are confused about what's going on within the class. Instead, those components should be divided into several basic building blocks. This is more consistent with PyTorch core library where users build models and pipelines with orthogonal components.
    • Incompatible with PyTorch core library, like DataLoader and Sampler in torch.utils.data. Some custom modules/functions in torchtext (e.g. Iterator, Batch, splits) should be replaced by the corresponding modules in torch.utils.data.

    We have re-written several datasets in torchtext.experimental.datasets, which are using the new abstraction. The old version of the datasets are still available in torchtext.datasets, and the new datasets are opt-in. We expect to replace the legacy datasets with the experimental ones in the future. Torchtext users are welcome to send feedback to issue [#664]

    • Re-write Sentiment Analysis dataset [#651] - IMDB
    • Re-write Language Modeling datasets [#624, #661], including - WikiText2 - WikiText103 - PennTreebank

    SentencePiece binding

    The SentencePiece binding provides an effective way to solve the open vocabulary problems in NLP tasks. The binding now supports two segmentation algorithms, byte-pair-encoding (BPE) and unigram language model. It trains subword models directly from raw text data, which are used to tokenize a corpus and convert it into PyTorch tensors [#597]

    Backward compatibility

    • Last release with the support of Python 2
    • Change the default ngrams value to 1 in text classification datasets [#663]
    • Temporarily remove a unit test test_get_tokenizer_moses from CI tests. Need to push it back after issue related to moses tokenizer is resolved. [#588]
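    To make the effect of the ngrams value concrete, here is a small pure-Python sketch of n-gram expansion. `ngrams_sketch` is a hypothetical illustration and is not claimed to match torchtext's own implementation line for line; with `ngrams=1` it yields only the original tokens.

```python
def ngrams_sketch(tokens, ngrams):
    """Yield each token, then every contiguous n-gram for n = 2..ngrams."""
    tokens = list(tokens)
    for token in tokens:
        yield token
    for n in range(2, ngrams + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])
```

    A larger ngrams value therefore grows the feature set (and the vocabulary), which is one reason a default of 1 is the more conservative choice.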

    We would like to thank the open source community, who continues to send pull requests for new features and bug-fixes.

    New Features

    • Add unsupervised learning dataset EnWik9, compressing the first 10^9 bytes of enwiki-20060303-pages-articles.xml [#610]
    • Several generators are created to build the pipeline for text preprocessing [#624, #610, #597].
    • Add Bilingual Evaluation Understudy (BLEU) metric for translation task in torchtext.data.metrics [#627]
    • Add Cross-Lingual NLI Corpus (XNLI) dataset [#613]

    Improvements

    • Improve download_from_url and extract_archive func. extract_archive func now supports .zip files. download_from_url func now explicitly gets the filename from the url instead of from the url header. This allows downloading from a non-Google Drive link [#602]
    • Add a legal disclaimer for torchtext datasets [#590]
    • Add installation command to Travis [#585]
    • Some improvements in the example torchtext/examples/text_classification [#580] [#578] [#576]
    • Fix and improve docs [#603] [#598] [#594] [#577] [#662]
    • Add Code of Conduct document [#638]
    • Add Contributing document [#637]

    Bug Fixes

    • Fix a backward compatibility issue in Vocab class. The old version of torchtext doesn’t have the unk_index attribute in Vocab. To avoid breaking BC, the __setstate__ function now checks whether the vocab object has an unk_index attribute [#591]
    • Resolve an overflow error by decreasing the maxInt value, which is used to check csv.field_size_limit in unicode_csv_reader [#584]
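    The overflow fix in the last item follows a common pattern: start from a very large limit and shrink it until `csv.field_size_limit` accepts it. The sketch below shows that pattern with the standard library only; `raise_csv_field_limit` is a hypothetical name, and the step size of 10 is an assumption.

```python
import csv
import sys


def raise_csv_field_limit():
    """Set the csv field size limit to the largest value the platform accepts."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long on this platform; shrink it.
            limit = limit // 10
```

    On platforms where a C long is 32 bits wide, the first attempt raises `OverflowError` and the loop settles on a smaller value.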
  • 0.4.0(Nov 27, 2019)

    Highlights

    Supervised learning baselines

    torchtext 0.4.0 includes several example scripts that showcase how to create data, build vocabularies, train, test and run inference for common supervised learning baselines. We further provide a tutorial to explain these examples in more detail.

    For an advanced application of these constructs see the iterable_train.py example.

    Community

    We would like to thank the open source community, who continues to send pull requests for new features and bug-fixes.

    Major New Features

    New Features

    Improvements

    • Added logging to download_from_url (#569)
    • Added fast, basic English sentence normalization to get_tokenizer (#569 #568)
    • Updated docs theme to pytorch_sphinx_theme (#573)
    • Refined Example.fromJSON() to support nested keys for parsing nested JSON datasets. (#563)
    • Added __len__ & get_vecs_by_tokens in 'Vectors' class to generate vector from a list of tokens (#561)
    • Added templates for torchtext users to bring up issues (#553 #574)
    • Added a new argument specials in Field.build_vocab to save the user-defined special tokens (#495)
    • Added a new argument is_target in RawField class to show whether the field is a target variable - False by default (#459). Adjusted is_target argument in LabelField to True to take it into effect (#450)
    • Added the option to serialize fields with torch.save or pickle.dump, allow tokenizers in different languages (#453)

    Bug Fixes

    • Allow caching from unverified SSL in CharNGram (#554)
    • Fix the wrong unk index by generating the unk_index according to the specials (#531)
    • Update Moses tokenizer link in README.rst file (#529)
    • Fix the url to load wiki.simple.vec (#525), fix the dead url to load fastText vectors (#521)
    • Fix UnicodeDecodeError for loading sequence tagging dataset (#506)
    • Fix collisions between oov words and in-vocab words caused by Issue #447 (#482)
    • Fix a mistake in the processing bar of Vectors class (#480)
    • Add the dependency to six under 'install_requires' in the setup.py file (PR #475 for Issue #465)
    • Fix a bug in Field class which causes overwriting the stop_words attribute (PR #458 for Issue #457)
    • Transpose the text and target tensors if the text field in BPTTIterator has 'batch_first' set to True (#462)
    • Add <unk> to default specials (#567)

    Backward Compatibility

    • Dropped support for python 2.7.9 (#552)
  • 0.3.1(Oct 12, 2018)

    Major changes:

    • Added bAbI dataset (#286)
    • Added MultiNLI dataset (#326)
    • Pytorch 0.4 compatibility + bugfixes (#299, #302)
    • Batch iteration now returns a tuple of (inputs), outputs by default without having to index attributes from Batch (#288)
    • [BREAKING] Iterator no longer repeats infinitely by default (now stops after epoch has completed) (#417)

    Minor changes:

    • Handle moses tokenizer being migrated from nltk (#361)
    • Vector loading made more efficient and flexible (#353)
    • Allow special tokens to be added to the end of the vocabulary (#400)
    • Allow filtering unknown words from examples (#413)

    Bugfixes:

    • Documentation (#382, #383, #393 #395, #410)
    • Create cache dir for pretrained embeddings if it doesn't exist (#301)
    • Various typos (#293, #369, #373, #344, #401, #404, #405, #418)
    • Dataset.split() not copying sort_key fixed (#279)
    • Various python 2.* vs python 3.* issues (#280)
    • Fix OOV token vector dimensionality (#308)
    • Lowercased type of TabularDataset (#315)
    • Fix splits method in various translation datasets (#377, #385, #392, #429)
    • Fix ParseTextField postprocessing (#386)
    • Fix SubwordVocab (#399)
    • Make NestedField GPU compatible and fix frequency saving (#409, #403)
    • Allow CSVreader params to be modified by user (#432)
    • Use tqdm progressbar in downloads (#425)
  • v0.2.3(Apr 9, 2018)

  • v0.2.1(Dec 28, 2017)

    This is a minor release; we have not included any breaking API changes but there are some new features that don't break existing APIs.

    We have always intended to support lazy datasets (specifically, those implemented as Python generators) but this version includes a bugfix that makes that support more useful. See a demo of it in action here.

    Datasets:

    • Added support for sequence tagging (e.g., NER/POS/chunking) datasets and wrapped the Universal Dependencies POS-tagged corpus (#157, thanks @sivareddyg!)

    Features:

    • Added pad_first keyword argument to Field constructors, allowing left-padding in addition to right-padding (#161, thanks @GregorySenay!)
    • Support loading word vectors from local folder (#168, thanks @ahhegazy!)
    • Support using list (character tokenization) in ReversibleField (#188)
    • Added hooks for Sphinx/RTD documentation (#179, thanks @keon and @EntilZha, whose preliminary version is available at torch-text.readthedocs.io)
    • Added support for torchtext.__version__ (#179, thanks @keon!)
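    The difference between left-padding and right-padding introduced by pad_first can be sketched in plain Python. The `pad` helper below is hypothetical and is not the Field implementation; truncating to the first `length` tokens is an assumption made for the example.

```python
def pad(tokens, length, pad_token="<pad>", pad_first=False):
    """Pad (or truncate) a token list to `length`, on the left or the right."""
    tokens = list(tokens)[:length]
    padding = [pad_token] * (length - len(tokens))
    return padding + tokens if pad_first else tokens + padding
```

    For example, `pad(["a", "b"], 4)` pads on the right, while `pad(["a", "b"], 4, pad_first=True)` places the padding before the tokens.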

    Bugfixes:

    • Fixed deprecated word vector usage in WT2 dataset (#166, thanks @keon!)
    • Fixed bug in word vector loading (#168, thanks @ahhegazy!)
    • Fixed bug in word vector aliases (#191, thanks @ryanleary!)
    • Fixed side effects of building a vocabulary (#193 + #181, thanks @donglixp!)
    • Fixed arithmetic mistake in language modeling dataset length calculation (#182, thanks @jihunchoi!)
    • Avoid materializing an otherwise-lazy dataset when using filter_pred (#194)
    • Fixed bug in raw float fields (#159)
    • Avoid providing a misleading len when using batch_size_fn (#192)
  • v0.2.0(Oct 20, 2017)

    Breaking changes:

    • By default, examples are now sorted within a batch by decreasing sequence length (#95, #139). This is required for use of PyTorch PackedSequences, and it can be flexibly overridden with a Dataset constructor flag.
    • The unknown token is now included as part of specials and can be overridden or removed in the Field constructor (part of #107).
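    The within-batch ordering described in the first item can be sketched without any torchtext code. `sort_batch_by_length` is a hypothetical helper that only shows the ordering; it does not reproduce the Dataset flag mentioned above.

```python
def sort_batch_by_length(batch):
    """Order the examples in a batch from longest to shortest sequence."""
    # sorted() is stable, so equally long examples keep their relative order.
    return sorted(batch, key=len, reverse=True)


batch = [["a"], ["a", "b", "c"], ["a", "b"]]
sorted_batch = sort_batch_by_length(batch)
lengths = [len(example) for example in sorted_batch]
```

    The resulting `lengths` list is non-increasing, which is the ordering that packed-sequence utilities have traditionally expected.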

    New features:

    • New word vector API with classes for GloVe and FastText; string descriptors are still accepted for backwards compatibility (#94, #102, #115, #120, thanks @nelson-liu and @bmccann!)
    • Reversible tokenization (#107). Introduces a new Field subclass, ReversibleField, with a .reverse method that detokenizes. All implementations of ReversibleField should guarantee that the tokenization+detokenization round-trip is idempotent; torchtext provides wrappers for the revtok tokenizer and subword segmenter that satisfy this property.
    • Skip header line in CSV/TSV loading (#146)
    • RawFields that represent any data type without processing (#147, thanks @kylegao91!)

    New datasets:

    • TREC (#92, thanks @bmccann!)
    • IMDb (#93, thanks @bmccann!)
    • Multi30k (#116, thanks @bmccann!)
    • IWSLT (#126, #128, thanks @bmccann!)
    • WMT14 (#138)

    Bugfixes:

    • Fix pretrained word vector loading (#99, thanks @matt-peters!)
    • Fix JSON loader silently ignoring requested columns not present in the file (#105, thanks @nelson-liu!)
    • Many fixes for Python 2, especially surrounding Unicode (#105, #112, #135, #153 thanks @nelson-liu!)
    • Fix Pipeline.__call__ behavior (#113, thanks @nelson-liu!)
    • Fix README example (#134, thanks @czhang99!)
    • Fix WikiText2 loader (#138)
    • Fix typo in MT loader (#142, thanks @sivareddyg!)
    • Fix Example.fromlist behavior on non-strings (#145)
    • Update test set URL for Multi30k (#149)
    • Fix SNLI data loader (#150, thanks @sivareddyg!)
    • Fix language modeling iterator (#151)
    • Remove transpose as a side effect of Field.reverse (#155)
  • v0.1.1(Aug 15, 2017)

    So that we can develop v0.2 on master, with refactored and extended word vectors (minimally breaking) and revtok support (reversible tokenizer with optional wordpieces; major feature but shouldn't break API).

NLP codes implemented with Pytorch (w/o library such as huggingface)

NLP_scratch NLP codes implemented with Pytorch (w/o library such as huggingface) scripts ├── models: Neural Network models ├── data: codes for dataloa

3 Dec 28, 2021
FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

FedNLP is a research-oriented benchmarking framework for advancing federated learning (FL) in natural language processing (NLP). It uses FedML repository as the git submodule. In other words, FedNLP

FedML-AI 216 Nov 27, 2022
Yet another Python binding for fastText

pyfasttext Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresea

Vincent Rasneur 230 Nov 16, 2022
Materials (slides, code, assignments) for the NYU class I teach on NLP and ML Systems (Master of Engineering).

FREE_7773 Repo containing material for the NYU class (Master of Engineering) I teach on NLP, ML Sys etc. For context on what the class is trying to ac

Jacopo Tagliabue 90 Dec 19, 2022
Code for lyric-section-to-comment generation based on huggingface transformers.

CommentGeneration Code for lyric-section-to-comment generation based on huggingface transformers. Migrate Guyu model and code (both 12-layers and 24-l

Yawei Sun 8 Sep 04, 2021
LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation Tasks | Datasets | LongLM | Baselines | Paper Introduction LOT is a ben

46 Dec 28, 2022
Long text token classification using LongFormer

Long text token classification using LongFormer

abhishek thakur 161 Aug 07, 2022
Telegram bot to auto post messages of one channel in another channel as soon as it is posted, without the forwarded tag.

Channel Auto-Post Bot This bot can send all new messages from one channel, directly to another channel (or group, just in case), without the forwarded

Aditya 128 Dec 29, 2022
COVID-19 Related NLP Papers

COVID-19 outbreak has become a global pandemic. NLP researchers are fighting the epidemic in their own way.

xcfeng 28 Oct 30, 2022
An open collection of annotated voices in Japanese language

Koniwa (声庭): An open collection of annotated voices in Japanese language. Overview: Koniwa (声庭) is freely usable, modifiable, and redistributable open speech and annota

Koniwa project 32 Dec 14, 2022
Blackstone is a spaCy model and library for processing long-form, unstructured legal text

Blackstone Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project f

ICLR&D 579 Jan 08, 2023
Chinese question generator, using the Delta Reading Comprehension Dataset (DRCD)

Transformer QG on DRCD. The model input integrates C and A into a new C' in the following form: C' = [c1, c2, ..., [HL], a1, ..., a

Philip 1 Oct 22, 2021
Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

Vikash Singh 5.3k Jan 01, 2023
This is a data-parallel project that runs on NLP tasks.

This is a data-parallel project that runs on NLP tasks.

2 Dec 12, 2021
Greek News Feed (Python script)

Greek News Feed (Python script). In 2017 I implemented a Python script to display the current n

Loren Kociko 1 Jun 14, 2022
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023
Python implementation of TextRank for phrase extraction and summarization of text documents

PyTextRank PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to: extract the top-ranked phrases from text document

derwen.ai 1.9k Jan 06, 2023
Blue Brain text mining toolbox for semantic search and structured information extraction

Blue Brain Search

The Blue Brain Project 29 Dec 01, 2022
Bpe algorithm can finetune tokenizer

bpe_algorithm_can_finetune_tokenizer: this is an implementation for https://github

张博 1 Feb 02, 2022
AIDynamicTextReader - A simple dynamic text reader based on Artificial intelligence

AI Dynamic Text Reader: This is a simple dynamic text reader based on Artificial

Md. Rakibul Islam 1 Jan 18, 2022