A natural language modeling framework based on PyTorch

Overview

PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces and abstractions for model components, and by using PyTorch’s capabilities of exporting models for inference via the optimized Caffe2 execution engine. We use PyText at Facebook to iterate quickly on new modeling ideas and then seamlessly ship them at scale.

Core PyText features:

Installing PyText

PyText requires Python 3.6.1 or above.

To get started on a Cloud VM, check out our guide.

Get the source code:

  $ git clone https://github.com/facebookresearch/pytext
  $ cd pytext

Create a virtualenv and install PyText:

  $ python3 -m venv pytext_venv
  $ source pytext_venv/bin/activate
  (pytext_venv) $ pip install pytext-nlp

Detailed instructions and more installation options can be found in our Documentation. If you encounter issues with missing dependencies during installation, please refer to OS Dependencies.

Train your first text classifier

For this first example, we'll train a CNN-based text classifier that classifies text utterances, using the examples in tests/data/train_data_tiny.tsv. The data and config files can be obtained either by cloning the repository or by downloading the files manually from GitHub.

  (pytext_venv) $ pytext train < demo/configs/docnn.json

By default, the model is saved to /tmp/model.pt.

Now you can export your model as a Caffe2 net:

  (pytext_venv) $ pytext export < demo/configs/docnn.json

You can use the exported Caffe2 model to predict the class of raw utterances like this:

  (pytext_venv) $ pytext --config-file demo/configs/docnn.json predict <<< '{"text": "create an alarm for 1:30 pm"}'
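
When predictions are scripted, it can help to build the JSON payload programmatically so that quotes inside the utterance are escaped correctly. The following is a minimal sketch, assuming the pytext executable from the steps above is on PATH. The helper names build_payload and predict are hypothetical and are not part of PyText.

```python
import json
import subprocess


def build_payload(text: str) -> str:
    """Serialize one utterance as the JSON object that the predict command reads from stdin."""
    return json.dumps({"text": text})


def predict(text: str, config_file: str = "demo/configs/docnn.json") -> str:
    """Hypothetical wrapper: run the CLI command shown above and return its raw stdout."""
    result = subprocess.run(
        ["pytext", "--config-file", config_file, "predict"],
        input=build_payload(text) + "\n",  # the shell here-string also appends a newline
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Calling predict("create an alarm for 1:30 pm") would then mirror the shell command above; the format of the returned text is whatever the CLI prints.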

More examples and tutorials can be found in Full Documentation.

Join the community

License

PyText is BSD-licensed, as found in the LICENSE file.

Comments
  • pytext: torch.quantization -> torch.ao.quantization

    Summary: This changes the imports in the caffe2/torch/ao/nn to include the new import locations.

    codemod -d pytext --extensions py 'torch.quantization' 'torch.ao.quantization'
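
The effect of the command above can be approximated in plain Python. This is only a sketch of the rewrite, not the codemod tool itself; the function names rewrite_source and rewrite_tree are hypothetical, and the word-boundary handling is an assumption about what the rename should match.

```python
import re
from pathlib import Path

# Match the old module path as a whole dotted name, so that names such as
# "mytorch.quantization" or "torch.quantization_utils" are left alone.
OLD_IMPORT = re.compile(r"\btorch\.quantization\b")
NEW_IMPORT = "torch.ao.quantization"


def rewrite_source(source: str) -> str:
    """Return the source text with the old import location replaced."""
    return OLD_IMPORT.sub(NEW_IMPORT, source)


def rewrite_tree(root: str) -> int:
    """Rewrite every .py file under root in place and return how many files changed."""
    changed = 0
    for path in Path(root).rglob("*.py"):
        original = path.read_text(encoding="utf-8")
        updated = rewrite_source(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

The rewrite is idempotent: the new path does not contain the old one, so running it twice changes nothing further.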
    

    Differential Revision: D31302214

    CLA Signed Merged fb-exported 
    opened by z-a-f 18
  • Integration

    Summary: As a first step toward integrating PET and PyText, this diff introduces a new trainer and converts the PyText train_loop into a PET state generator. The diff also includes:

    • create a unittest to make the E2E pass
    • create a flag to enable/disable PET in pytext
    • create util functions like intialize_coordinator

    Differential Revision: D19806903

    CLA Signed Merged fb-exported 
    opened by isunjin 12
  • Minor refactor and code rearrange on module.py

    Summary: Minor refactor and code rearrangement in module.py. The goal is to reuse the PyText embedding module methods for the PyText module.

    Differential Revision: D25997360

    CLA Signed Merged fb-exported 
    opened by mikekgfb 9
  • Convert matmuls to quantizable nn.Linear modules (#889)

    Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/889

    We are converting matmuls to quantizable nn.Linear modules in this diff. First, let's profile after the diff to see how low-level operations change.

    Reviewed By: jmp84, edunov, lly-zero-one, jhcross

    Differential Revision: D17964796

    CLA Signed Merged 
    opened by halilakin 9
  • Change model input tokens to optional

    Summary: The Byte LSTM does not need the tokens input, which it inherits from the LSTM language model. Making it optional and passing None allows the model to skip the vocab-building step and avoid reporting confusing OOV problems.

    Reviewed By: kmalik22

    Differential Revision: D18253210

    CLA Signed Merged 
    opened by FanW123 8
  • Id for all modules and torchscript compatibility workaround

    Summary: We need an ID for each module that is created for incremental decoding. This diff also includes a workaround to make sure our models work with the new scripting API.

    Differential Revision: D17636128

    CLA Signed Merged 
    opened by arbabu123 8
  • TypeError: __init__() got an unexpected keyword argument 'dtype'

    I followed the ATIS tutorial at https://pytext-pytext.readthedocs-hosted.com/en/latest/atis_tutorial.html. At step 3 (pytext train < sample_config.json), this error showed up:

    Traceback (most recent call last):
      File "/home/hiepph/miniconda3/bin/pytext", line 11, in <module>
        load_entry_point('pytext-nlp', 'console_scripts', 'pytext')()
      File "/home/hiepph/miniconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
        return self.main(*args, **kwargs)
      File "/home/hiepph/miniconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
        rv = self.invoke(ctx)
      File "/home/hiepph/miniconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/hiepph/miniconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/hiepph/miniconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
        return callback(*args, **kwargs)
      File "/home/hiepph/miniconda3/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/home/hiepph/src/pytext/pytext/main.py", line 232, in train
        train_model(config)
      File "/home/hiepph/src/pytext/pytext/workflow.py", line 55, in train_model
        task = prepare_task(config, dist_init_url, device_id, rank, world_size)
      File "/home/hiepph/src/pytext/pytext/workflow.py", line 78, in prepare_task
        return create_task(config.task)
      File "/home/hiepph/src/pytext/pytext/task/task.py", line 38, in create_task
        return create_component(ComponentType.TASK, task_config, metadata, model_state)
      File "/home/hiepph/src/pytext/pytext/config/component.py", line 142, in create_component
        return cls.from_config(config, *args, **kwargs)
      File "/home/hiepph/src/pytext/pytext/task/task.py", line 81, in from_config
        featurizer=featurizer,
      File "/home/hiepph/src/pytext/pytext/config/component.py", line 147, in create_data_handler
        ComponentType.DATA_HANDLER, data_handler_config, *args, **kwargs
      File "/home/hiepph/src/pytext/pytext/config/component.py", line 142, in create_component
        return cls.from_config(config, *args, **kwargs)
      File "/home/hiepph/src/pytext/pytext/data/joint_data_handler.py", line 76, in from_config
        DatasetFieldName.DOC_WEIGHT_FIELD: FloatField(),
      File "/home/hiepph/src/pytext/pytext/fields/field.py", line 288, in __init__
        unk_token=None,
      File "/home/hiepph/src/pytext/pytext/fields/field.py", line 54, in __init__
        super().__init__(*args, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'dtype'
    

    This doesn't happen with the "Train your first model" tutorial.
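
The report does not state a root cause. The last two frames of the traceback show only the mechanism: a field subclass forwards all of its keyword arguments to a parent __init__ that does not accept dtype. The sketch below reproduces that mechanism with hypothetical stand-in classes (LegacyField and this FloatField are illustrations, not the PyText or torchtext classes) and adds a small signature check, accepts_keyword, which is also a hypothetical helper.

```python
import inspect


class LegacyField:
    """Hypothetical base class whose __init__ has no `dtype` parameter."""

    def __init__(self, sequential=True, unk_token="<unk>"):
        self.sequential = sequential
        self.unk_token = unk_token


class FloatField(LegacyField):
    """Hypothetical subclass that forwards every argument to its parent."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


def accepts_keyword(cls, name: str) -> bool:
    """Report whether cls.__init__ takes `name` explicitly or through **kwargs."""
    params = inspect.signature(cls.__init__).parameters
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return name in params or takes_var_kwargs


def reproduce() -> str:
    """Trigger the same kind of TypeError and return its message."""
    try:
        FloatField(sequential=False, dtype="float", unk_token=None)
    except TypeError as err:
        return str(err)
    return ""
```

If the same pattern applies to the reported error, checking the signature of the installed base class with a helper like accepts_keyword would show whether it predates the dtype keyword.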

    opened by hiepph 8
  • Optional Resorting in VocabBuilder.

    Summary: Added the options to bypass resorting of vocabulary tokens in VocabBuilder when running add_to_file() and make_vocab()

    Differential Revision: D28578974

    CLA Signed Merged fb-exported 
    opened by I2304 7
  • Add support for hypothesis 5.x

    Update to support hypothesis 5.x primarily by overriding the now-enforced default deadline of 200ms where appropriate.

    Reviewed By: thatch

    Differential Revision: D20323893

    CLA Signed Merged fb-exported 
    opened by qwhelan 7
  • Add CompatibleTrainer, an adapter to use the Library with TaskTrainer

    Summary:

    1. We have some new components (datasets, transforms, models, etc.) created for the PyText library.
    2. These components are generic and can be reused in multiple trainers (e.g. Lightning).
    3. To reproduce the same results across the Lightning trainer and the old TaskTrainer, we need CompatibleTrainer as an adapter, which uses the new components with TaskTrainer and minimizes duplication of the full train loop logic in TaskTrainer.

    Differential Revision: D21435152

    CLA Signed Merged fb-exported 
    opened by hudeven 7
  • Dynamic Batch Scheduler Implementation

    Summary: This diff adds support of dynamic batch training in pytext. It creates a new batcher that computes the batch size depending on the current epoch.

    The diff implements two schedulers:

    • Linear: increases batch size linearly
    • Exponential: increases batch size exponentially

    API

    The dynamic batcher extends the pooling batcher, so most of the arguments are the same and remain consistent. It is important to note that dynamic batch sizes only affect the training batch size, not eval or test.

    The dynamic batcher holds a new configuration object, scheduler_config, which contains the information needed to compute dynamic batch sizes, namely:

    class SchedulerConfig(ModuleConfig):
      # the initial batch size used for training
      start_batch_size: int = 32
    
      # the final or max batch size to use, any scheduler should
      # not go over this batch size
      end_batch_size: int = 256
    
      # the number of epochs to increase the batch size over
      epoch_period: int = 10
    
      # the batch size is kept constant for `step_size` number of epochs
      step_size: int = 1
    

    Paper: https://arxiv.org/abs/1711.00489
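
To make the two schedules concrete, here is a minimal sketch of how a batch size could be derived from the current epoch using the four configuration fields above. The interpolation formulas, the 0-based epoch, and the function names linear_batch_size and exponential_batch_size are assumptions made for illustration; the diff's actual implementation may differ. The dataclass below is a stand-in for the ModuleConfig-based class.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    """Stand-in for the config above (plain dataclass instead of ModuleConfig)."""

    start_batch_size: int = 32
    end_batch_size: int = 256
    epoch_period: int = 10
    step_size: int = 1


def _progress(cfg: SchedulerConfig, epoch: int) -> float:
    """Fraction of the schedule completed, held constant within each step."""
    stepped_epoch = (epoch // cfg.step_size) * cfg.step_size
    return min(stepped_epoch / cfg.epoch_period, 1.0)


def linear_batch_size(cfg: SchedulerConfig, epoch: int) -> int:
    """Interpolate linearly from start_batch_size to end_batch_size."""
    span = cfg.end_batch_size - cfg.start_batch_size
    size = cfg.start_batch_size + _progress(cfg, epoch) * span
    return min(int(size), cfg.end_batch_size)


def exponential_batch_size(cfg: SchedulerConfig, epoch: int) -> int:
    """Interpolate geometrically from start_batch_size to end_batch_size."""
    ratio = cfg.end_batch_size / cfg.start_batch_size
    size = cfg.start_batch_size * ratio ** _progress(cfg, epoch)
    return min(int(size), cfg.end_batch_size)
```

Under these assumptions both schedules start at start_batch_size in epoch 0, reach end_batch_size after epoch_period epochs, and never exceed it afterwards.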

    Reviewed By: seayoung1112, ArmenAg

    Differential Revision: D18900677

    CLA Signed Merged fb-exported 
    opened by AkshatSh 7
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
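
The check described above (verify every member before extracting, raise otherwise) can be sketched as follows. This is an illustration of the idea, not the patch submitted in the pull request; the names is_within_directory and safe_extract are chosen here for the example, and the sketch only validates member paths, so links stored inside the archive would need separate handling.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if target resolves to a path inside directory."""
    abs_directory = os.path.realpath(directory)
    abs_target = os.path.realpath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
    """Extract all members only if none of them would land outside path."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in tar member: {member.name}")
    tar.extractall(path)
```

A caller would replace tar.extractall(dest) with safe_extract(tar, dest). Using os.path.commonpath rather than a string-prefix comparison avoids treating /tmp/xy as if it were inside /tmp/x.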

    opened by TrellixVulnTeam 0
  • Question on AUC-PR Hinge Loss

    In the implementation of "Scalable Learning of Non-Decomposable Objectives" at https://github.com/facebookresearch/pytext/blob/main/pytext/loss/loss.py ,

    the positive weight is 1 + lambda * (1 - precision) instead of (1 + lambda) * (1 - precision). The old implementation in the tensorflow repo (gone now) was also 1 + lambda * (1 - precision)

    To me, (1 + lambda) looks like it comes from the following equation in the paper (https://arxiv.org/pdf/1608.04802.pdf): [equation image not preserved]

    Dividing each side by N = |Y-| + |Y+| and multiplying by (1-precision), we get the first term (1+lambda)(1-precision)(loss+)/N:

    loss = (1+lambda)(loss+) + lambda ( precision/(1-precision) ) (loss-) - lambda (# positives)
    
    per-sample loss = (1+lambda)(loss+)/N + lambda ( precision/(1-precision) ) (loss-)/N - lambda (# positives)/N
    
    multiplied per-sample loss =
              (1+lambda)(1-precision)(loss+)/N + lambda * precision * (loss-)/N - lambda (1-precision) (# positives)/N
              ^^^^^^^^^^^^^^^^^^^^^^^
    

    Why is it 1 + lambda * (1-precision) ?

            hinge_loss = loss_utils.weighted_hinge_loss(
                labels.unsqueeze(-1),
                logits.unsqueeze(-1) - self.biases,
                positive_weights=1.0 + lambdas * (1.0 - self.precision_values),
                negative_weights=lambdas * self.precision_values,
            )
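
To make the difference between the two candidate weightings concrete, the sketch below evaluates both for a single example. It does not settle which form is correct. The weighted_hinge_loss here is a scalar stand-in that assumes the common form (positive weight times the hinge on positives plus negative weight times the hinge on negatives); it is not the loss_utils function, and the two positive_weight_* helper names are hypothetical.

```python
import math


def weighted_hinge_loss(label, logit, positive_weight=1.0, negative_weight=1.0):
    """Scalar weighted hinge loss for a label in {0, 1} (assumed form)."""
    positive_term = positive_weight * label * max(0.0, 1.0 - logit)
    negative_term = negative_weight * (1.0 - label) * max(0.0, 1.0 + logit)
    return positive_term + negative_term


def positive_weight_in_code(lam, precision):
    """Weight used in the quoted implementation: 1 + lambda * (1 - precision)."""
    return 1.0 + lam * (1.0 - precision)


def positive_weight_in_question(lam, precision):
    """Weight derived in the question: (1 + lambda) * (1 - precision)."""
    return (1.0 + lam) * (1.0 - precision)


# Example values chosen only for illustration.
lam, precision = 2.0, 0.9
w_code = positive_weight_in_code(lam, precision)
w_question = positive_weight_in_question(lam, precision)
loss_code = weighted_hinge_loss(1.0, 0.0, w_code, lam * precision)
loss_question = weighted_hinge_loss(1.0, 0.0, w_question, lam * precision)
```

With lambda = 2 and precision = 0.9, the two positive weights are about 1.2 and 0.3, so a positive example at logit 0 contributes roughly four times more loss under the implemented form than under the form derived in the question.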
    
    opened by elbaro 0
  • docs: Fix a few typos

    There are small typos in:

    • pytext/models/embeddings/int_weighted_multi_category_embedding.py
    • pytext/torchscript/batchutils.py
    • pytext/torchscript/module.py
    • pytext/torchscript/tokenizer/bpe.py

    Fixes:

    • Should read dictionary rather than doctionary.
    • Should read prioritizing rather than proiritizing.
    • Should read input rather than intput.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    CLA Signed 
    opened by timgates42 0
  • ImportError: cannot import name 'metanet_pb2' from partially initialized module 'caffe2.proto'

    Steps to reproduce

    1. git clone https://github.com/facebookresearch/Clinical-Trial-Parser.git
    2. install pytext by source https://pytext.readthedocs.io/en/master/installation.html#install-from-source
    3. go build ./...
    4. go test ./...
    5. pytext train < src/resources/config/ner.json

    Observed Results

    from caffe2.proto import caffe2_pb2
      File "/data/anaconda3/envs/pytext/lib/python3.8/site-packages/caffe2/proto/__init__.py", line 15, in <module>
        from caffe2.proto import caffe2_pb2, metanet_pb2, torch_pb2
    ImportError: cannot import name 'metanet_pb2' from partially initialized module 'caffe2.proto'

    • What happened? This could be a description, log output, etc.

    Expected Results

    I expect pytext to carry out the training.

    • What did you expect to happen? completed training

    Relevant Code

    in https://github.com/facebookresearch/Clinical-Trial-Parser.git

    // TODO(you): code here to reproduce the problem
    I have been trying to get this lib to work for three days. I have tried python 3.5, 3.6, 3.7, and 3.10.
    I have tried installing pytext using conda, pip, and build from source.
    
    googling ends up in dead trails where people continue to see the same problem, but previously raised tickets are closed.
    
    Could you indicate how to get past this step?
    
    opened by bhomass 1
  •  No module named 'pytorch'

    Steps to reproduce

    Installed both torch and torchtext from the source code

    • Torch version: '1.9.0a0+git854cc53'

    • TorchText version: '0.11.0a0+05cb992'

    • Install pytext from source

    cd /tmp 
    git clone https://github.com/facebookresearch/pytext.git
    cd pytext
    sudo python3 setup.py install
    
    • Import pytext
    import pytext
    

    Observed Results

    • What happened? This could be a description, log output, etc.
    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-1-0d876a9b1d3f> in <module>
    ----> 1 import pytext
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/__init__.py in <module>
        10 from caffe2.python import workspace
        11 from caffe2.python.predictor import predictor_exporter
    ---> 12 from pytext.data.sources.data_source import DataSource
        13 from pytext.task import load
        14 from pytext.task.new_task import NewTask
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/data/__init__.py in <module>
        10     NaturalBatchSampler,
        11 )
    ---> 12 from .data import Batcher, Data, PoolingBatcher, generator_iterator
        13 from .data_handler import BatchIterator, CommonMetadata, DataHandler
        14 from .disjoint_multitask_data import DisjointMultitaskData
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/data/data.py in <module>
        12 from pytext.utils.usage import log_class_usage
        13 
    ---> 14 from .sources import DataSource, RawExample, TSVDataSource
        15 from .sources.data_source import (
        16     GeneratorIterator,
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/data/sources/__init__.py in <module>
         2 # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
         3 
    ----> 4 from .conllu import CoNLLUNERDataSource
         5 from .data_source import DataSource, RawExample
         6 from .dense_retrieval import DenseRetrievalDataSource
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/data/sources/conllu.py in <module>
         5 from typing import Dict, List, Optional, Type
         6 
    ----> 7 from pytext.data.sources.data_source import RootDataSource, SafeFileWrapper
         8 
         9 
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/data/sources/data_source.py in <module>
        10 from pytext.data.utils import shard
        11 from pytext.utils.data import Slot, parse_slot_string
    ---> 12 from pytext.utils.file_io import PathManager
        13 
        14 
    
    /usr/local/lib/python3.6/dist-packages/pytext_nlp-0.3.3-py3.6.egg/pytext/utils/file_io.py in <module>
        10 # TODO: @stevenliu use PathManagerFactory after it's released to PyPI
        11 from iopath.common.file_io import HTTPURLHandler
    ---> 12 from pytorch.text.fb.utils import PATH_MANAGER as PathManager  # noqa
        13 
        14 
    
    ModuleNotFoundError: No module named 'pytorch'
    
    

    Expected Results

    • What did you expect to happen? No errors while importing pytext

    The module causing this error, pytorch.text.fb.utils, is not listed in requirements.txt. Is this module internal to FB? Any recommendation for working around this error?

    Thanks!

    opened by anjali-chadha 6
Releases(v0.3.3)
  • v0.3.3(Jun 8, 2020)

    New features

    • Add XLM-R document classification server + console (#1358)
    • MLP layer embed for float tensors and FloatListSeqTensorizer for List[List[float]] features. (#1374)
    • Add class_accuracy in MultiLabelSoftClassificationMetrics (#1371)
    • Add an option to skip test run after models have been trained (#1372)
    • Support DP in PyText (#1366)
    • Support torchscriptify in multi_label_classification_layer (#1350)
    • Add custom metric class for reporting Joint model metrics (#1339)
    • MultiLabel-MultiClass Model for Joint Sequence Tagging (#1335)
    • Scripted tokenizer support for DocModel (#1314)

    Bugfixes

    • Fixed metric reporter aggregation and output layer for the multi-label classification
    • Remove move_state_dict_to_gpu, which is causing CUDA OOM (#1367)
    • Fix Flow's default conversion of dict to AttrDict
    • Fix bug in ClassificationOutputLayer that pad_idx is never respected (#1347)
    • Serializing/Deserializing type Any: bugfix and simplification (#1344)
    • Fix RoBERTa Q&A Training Bug with multiple BoS tokens. (#1343)

    Other

    • Better error message for misconfigured data fields
    • Replace deprecated integer division with floor division operator
    • Add informative prints to assert statements (#1360)
    • TorchScript: Put dense tensor on the same device with other input tensors (#1361)
    • Update PyTorch + ONNX (#1340)
    • Update PyTorch + ONNX (#1340)- binary ONNX
    • Update PR Template (#1349)
    • Reduce memory request for pytext train operator
    • Add 'contrib' directory for experimental code (#1333)
  • v0.3.2(Apr 27, 2020)

    New features

    • Add Roberta model into BertPairwiseModel (#1336)
    • Support read file from http URL (#1317)
    • add a new PyText get_num_examples_from_batch function in model (#1319)
    • Add support for length label smoothing (#1308)
    • Add new metrics type for Masked Seq2Seq Joint Model (#1304)
    • Add mask generator and strategy (#1302)
    • Add separate logging for label loss and length loss (#1294)
    • Add tensorizer support for masking of target tokens (#1297)
    • Add length prediction and basic masked generator (#1290)
    • Add self attention option to conv_encoder and conv_decoder (#1291)
    • Entity Saliency modeling on PyText: EntitySalienceMetricReporter/EntitySalienceTask
    • In-batch negative training for BertPairwiseModel
    • Support embedding from decoder (#1284)
    • Add dense features to Roberta
    • Add projection layer to HuggingFace encoder (#1273)
    • add PyText Embedding TorchScript Wrapper
    • Add option to pad missing label in LabelListTensorizer (#1269)
    • Integrate PET and Introduce ElasticTrainer (#1266)
    • support PoolingType in DocNN. (#1259)
    • Added WordSeqEmbedding (#1255)
    • Open source Assistant NLU seq2seq model (#1236)
    • Support multi label classification
    • BART in decoupled model

    Bug fixes

    • Fix Incorrect State Dict Assumption (#1326)
    • Bug fix for "RoBERTaTensorizer object has no attribute is_input" (#1334)
    • Cast model output to cpu (#1329)
    • Fix OSS predict-py API (#1320)
    • Fix "calling median on empty tensor" issue in MR (#1322)
    • Add ScriptDoNothingTokenizer so that torchscriptification of SPM does not fail (#1316)
    • Fix creating generator every time (#1301)
    • fix dense feature for fp16
    • Avoid edge cases with quantization by setting a known seed (#1295)
    • Make torchscript predictions even on empty text / token inputs
    • fix dense feature TorchScript typing (#1281)
    • avoid zero division error in metrics reporter (#1271)
    • Fix contiguous issue in bilstm export (#1270)
    • fix debug file generation for multilabel classification (#1247)
    • Fix fp16 optimizer attribute name

    Other

    • Simplify contextual embedding dimension computation in PyText (#1331)
    • New Debug File for masked seq2seq
    • Move MockConfigLoader to OSS (#1324)
    • Pass in optimizer config instead of create_optimizer to trainer
    • Remove unnecessary torch.no_grad() block (#1323)
    • Fix Memory Issues in Metric Reporter for Classification Tasks over large Label Spaces
    • Add contextual embedding support to OS seq2seq model (#1299)
    • recover xlm_r tutorial notebook (#1305)
    • Enable controlling bias in MLP decoder
    • Migrate serving tutorial to TorchScript (#1310)
    • delete caffe2 export (#1307)
    • add whitelist for ONNX export
    • Use dynamic quantization api for BeamSearch (#1303)
    • Remove requirement that eos/bos be supplied for sequence export. (#1300)
    • Multicolumn support
    • Multicolumn support in torchscriptify
    • Add caching support to RawExample and batch predict API (#1298)
    • Add save-pytext-snapshot command to PyText cmdline (#1285)
    • Update with Whatsapp calling data + support dictionary features (#1293)
    • add arrange_caffe2_model_inputs in BaseModel (#1292)
    • Replace unit-tests on LMModel and FLLanguageModelingTask by LiteLMModel and FLLiteLMTask (#1296)
    • changes to make mbart work (#1911)
    • handle encoder and decoder embedding
    • Add tutorial for semantic parsing. (#1288)
    • Add new fb beam search with fused operator (#1287)
    • Move generator builder to constructor so that it can easily overridden. (#1286)
    • Torchscriptify ELTensorizer (#1282)
    • Torchscript export for Seq2Seq model (#1265)
    • Change Seq2Seq model from_config() to a more general api (#1280)
    • add max_seq_len to DocNN TorchScript model (#1279)
    • support XLM-R model Embedding in TorchScript (#1278)
    • Generic PyText Checkpoint Manager Interface (#1267)
    • Fix backward compatibility issue of pad_missing in LabelListTensorizer (#1277)
    • Update mean reduction in NLLLoss (#1272)
    • migrate pages.integrity.scam.docnn_models.xxx (#1275)
    • Unify model input for ByteTokensDocumentModel (#1274)
    • Torchscriptify TokenTensorizer
    • Allow dictionaries to overwrite entries with #fairseq:overwrite comment (#1073)
    • Make WordSeqEmbedding ONNX compatible
    • If the snapshot path provided is not valid, throw error (#1268)
    • support vocab filter by min count
    • Unify input for TorchScript Tensorizers and Models (#1256)
    • Torchscriptify XLM-R
    • Add class logging to task (#1264)
    • Add usage logging to exporter (#1262)
    • Add usage logging across models (#1263)
    • Usage logging on data classes (#1261)
    • GPT2 BPE add lower casing support (#1260)
    • FAISS Embedding Search Space [3/5]
    • Return len of tokens of each sequence in SeqTokenTensorizer (#1254)
    • Vocab Limited Pretrained Embedding [2/5] (#1248)
    • add Stage.OTHERS and allow TB to print to a separate prefix not in (TRAIN, TEST, EVAL) (#1258)
    • Add option to skip 2 stage tokenizer and bpe decode sequences in the debug file (#1257)
    • Add Testcase for Wordpiece Tokenizer (#1249)
    • modify accuracy calculation for multi-label classification (#1244)
    • Enable tests in pytext/config:pytext_all_config_test
    • Introduce Class Usage Logging (#1243)
    • Make PyText compatible with Any type (#1242)
    • Make dict_embedding Torchscript friendly (#1240)
    • Support MultipleData for export and kd generation
    • delete flaky/broken tests (#1238)
    • Add support for returning start & end indices.
  • 0.3.1(Jan 15, 2020)

    New features

    • Implement SquadQA tensorizer in TorchScript (#1211)
    • Add session data source for df (#1202)
    • Dynamic Batch Scheduler Implementation (#1200)
    • Implement loss aware sparsifier (#1204)
    • Ability to Fine-tune XLM-R for NER on CoNLL Datasets (#1201)
    • TorchScriptify Tokenizer after training (#1191)
    • Linear Layer only blockwise sparsifier (#478)
    • Adding performance graph to pytext models (#1192)
    • Enable inference on GPUs by moving tensors to specified device (#472)
    • Add support for learning from soft labels for Squad (MRC) models (#1188)
    • Create byte-aware model that can make byte predictions (#468)
    • Minimum Trust Lamb (#1186)
    • Allow model to take byte-level input and make byte-level prediction (#1187)
    • Scheduler with Warmup (#1184)
    • Implement LAMB optimizer (#1183)
    • CyclicLRScheduler (#1157)
    • PyText Entity Linking: ELTask and ELMetricReporter (#1165)

    Bug fixes

    • Don't upgrade if Tensorizer already given (#504)
    • avoid torchscriptify on a ScriptModule (#1214)
    • Make tensorboard robust to NaN and Inf in model params (#1206)
    • Fix CircleCI Test broken in D19027834 (#1205)
    • Fix small bug in pytext vocabulary (#401)
    • Fix CircleCI failure caused by black and regex (#1199)
    • Fix CircleCI (#1194)
    • Fix Circle CI Test broken by D18880705 (#1190)
    • fix weight load for new fairseq checkpoints (#1189)
    • Fix Hierarchical intent and slot filling demo is broken (#1012) (#1151)
    • Fix index error in dict embedding when exported to Caffe2 (#1182)
    • Fix zero loss tensor in SquadOutputLayer (#1181)
    • qa fix for ignore_impossible=False

    Other

    • Printing out error's underlying reason (#1227)
    • tidy file path in help text for invocation of docnn.json example (#1221)
    • PyText option to disable CUDA when testing. (#1223)
    • make augmented lstm compatible w other lstms (#1224)
    • Vocab recursive lookup (#1222)
    • Fix simple typo: valus -> value (#1219)
    • support using RoundRobin ProcessGroup in Distributed training (#1213)
    • Use PathManager for all I/O (#1198)
    • Make PathManager robust to API changes in fvcore (#1196)
    • Support for TVM training (BERT) (#1210)
    • Exit LM task if special token exists in text for ByteTensorizer (#1207)
    • Config adapter for pytext XLM (#1172)
    • Use TensorizerImpl for both training and inference for BERT, RoBERTa and XLM tensorizer (#1195)
    • Replace gluster paths with local file paths for NLG configs (#1197)
    • Make BERT Classification compatible with TSEs that return Encoded Layers.
    • implement BertTensorizerImpl and XLMTensorizerImpl (#1193)
    • Make is_input field of tensorizer configurable (#474)
    • BERTTensorizerBaseImpl to reimplement BERTTensorizerBase to be TorchScriptable (#1163)
    • Improve LogitsWorkflow to handle dumping of raw inputs and multiple output tensors (#683)
    • Accumulative blockwise pruning (#1170)
    • Patch for UnicodeDecodeError due to BPE. (#1179)
    • Add pre-loaded task as parameter to caffe2 batch prediction API
    • Specify CFLAGS to install fairseq in MacOS (#1175)
    • Resolve dependency conflict by specifying python-dateutil==2.8.0 (#1176)
    • Proper training behavior if setting do_eval=False (#1155)
    • Make DeepCNNRepresentation torchscriptable (#453)
  • 0.3.0(Nov 28, 2019)

    New Features

    RoBERTa and XLM-R

    • Integrate XLM-R into PyText (#1120)
    • Consolidate BERT, XLM and RoBERTa Tensorizers (#1119)
    • Add XLM-R for joint model (#1135)
    • Open source Roberta (#1032)
    • Simple Transformer module components for RoBERTa (#1043)
    • RoBERTa models for document classification (#933)
    • Enable MLM training for RobertaEncoder (#1126)
    • Standardize RoBERTa Tensorizer Vocab Creation (#1113)
    • Make RoBERTa usable in more tasks including QA (#1017)
    • RoBERTa-QA JIT (#1088)
    • Unify GPT2BPE Tokenizer (#1110)
    • Adding Google SentencePiece as a Tokenizer (#1106)

    TorchScript support

    • General torchscript module (#1134)
    • Support torchscriptify XLM-R (#1138)
    • Add support for torchscriptification of XLM intent slot models (#1167)
    • Script xlm tensorizer (#1118)
    • Refactor ScriptTensorizer with general tensorize API (#1117)
    • ScriptXLMTensorizer (#1123)
    • Add support for Torchscript export of IntentSlotOutputLayer and CRF (#1146)
    • Refactor ScriptTensorizer to support both text and tokens input (#1096)
    • Add torchscriptify API in tokenizer and tensorizer (#1055)
    • Add more stats in torchscript latency script (#1044)
    • Exported Roberta torchscript model include both traced_model and pre-processing logic (#1013)
    • Native Torchscript Wordpiece Tokenizer Op for BERTSquadQA, Torchscriptify BertSQUADQAModel (#879)
    • TorchScript-ify BERT training (#887)
    • Modify Return Signature of TorchScript BERT (#1058)
    • Implement BertTensorizer and RoBERTaTensorizer in TorchScript (#1053)

    Others

    • FairseqModelEnsemble class (#1116)
    • Inverse Sqrt Scheduler (#1150)
    • Lazy modules (#1039)
    • Adopt Fairseq MemoryEfficientFP16Optimizer in PyText (#910)
    • Add RAdam (#952)
    • Add AdamW (#945)
    • Unify FP16&FP32 API (#1006)
    • Add precision at recall metric (#1079)
    • Added PandasDataSource (#1098)
    • Support testing Caffe2 model (#1097)
    • Add contextual feature support to export for Seq2Seq models
    • Convert matmuls to quantizable nn.Linear modules (#1304)
    • PyTorch eager mode implementation (#1072)
    • Implement Blockwise Sparsification (#1050)
    • Support Fairseq FP16Optimizer (#1008)
    • Make FP16OptimizerApex wrapper on Apex/amp (#1007)
    • Remove vocab from cuda (#955)
    • Add dense input to XLMModel (#997)
    • Replace tensorboardX with torch.utils.tensorboard (#1003)
    • Add mentioning of mixed precision training support (#643)
    • Sparsification for CRF transition matrix (#982)
    • Add dense feature normalization to Char-LSTM TorchScript model. (#986)
    • Cosine similarity support for BERT pairwise model training (#967)
    • Combine training data from multiple sources (#953)
    • Support visualization of word embeddings in Tensorboard (#969)
    • Decouple decoder and output layer creation in BasePairwiseModel (#973)
    • Drop rows with insufficient columns in TSV data source (#954)
    • Add use_config_from_snapshot option (load config from snapshot or current task) (#970)
    • Add predict function for NewTask (#936)
    • Use create_module to create CharacterEmbedding (#920)
    • Add XLM based joint model
    • Add ConsistentXLMModel (#913)
    • Optimize Gelu module for caffe2 export (#918)
    • Save best model's sub-modules when enabled (#912)
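
    The inverse square root scheduler listed above (#1150) is commonly defined, for example in Fairseq, as a linear warmup followed by a decay proportional to the inverse square root of the update number. The following is a minimal sketch of that schedule under this common definition. The function name and its parameters are illustrative assumptions, not PyText's API.

```python
import math


def inverse_sqrt_lr(
    step: int,
    base_lr: float,
    warmup_steps: int,
    warmup_init_lr: float = 0.0,
) -> float:
    """Learning rate at a 1-indexed update `step` (hypothetical helper)."""
    if step < 1 or warmup_steps < 1:
        raise ValueError("step and warmup_steps must be >= 1")
    if step < warmup_steps:
        # Linear warmup from warmup_init_lr towards base_lr.
        return warmup_init_lr + (base_lr - warmup_init_lr) * step / warmup_steps
    # After warmup: decay proportionally to 1/sqrt(step), scaled so that
    # the rate equals base_lr exactly when step == warmup_steps.
    return base_lr * math.sqrt(warmup_steps) / math.sqrt(step)


if __name__ == "__main__":
    for s in (1, 50, 100, 400, 1600):
        print(s, inverse_sqrt_lr(s, base_lr=0.1, warmup_steps=100))
```

    With these settings the rate peaks at the end of warmup and halves each time the update count quadruples.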

    Documentation / Usability

    • XLM-R tutorial in notebook (#1159)
    • Update XLM-R OSS tutorial and add Google Colab link (#1168)
    • Update "raw_text" to "text" in tutorial (#1010)
    • Make tutorial more trivial (add git clone) (#1037)
    • Changes to make tutorial code simpler (#1002)
    • Fix datasource tutorial example (#998)
    • Handle long documents in squad qa datasource and models (#975)
    • Fix pytext tutorial syntax (#971)
    • Use torch.equal() instead of "==" in Custom Tensorizer tutorial (#939)
    • Remove and mock doc dependencies because readthedocs is OOM (#983)
    • Fix Circle CI build_docs error (#959)
    • Add OSS integration tests: DocNN (#1021)
    • Print model into the output log (#1127)
    • Migrate pytext/utils/torch.py logic into pytext/torchscript/ for long term maintainability (#1082)
    • Demo datasource fix + cleanup (#994)
    • Documentation on the config files and config-related commands (#984)
    • Config adapter old data handler helper (#943)
    • Nicer gen_config_impl (#944)

    Deprecated Features

    • Remove DocModel_Deprecated (#916)
    • Remove RNNGParser_Deprecated, SemanticParsingTask_Deprecated, SemanticParsingCppTask_Deprecate, RnngJitTask,
    • Remove QueryDocumentTask_Deprecated (#926)
    • Remove LMTask_Deprecated and LMLSTM_Deprecated (#882)
    • CompositionDataHandler to fb/deprecated (#963)
    • Delete deprecated Word Tagging tasks, models and data handlers (#910)

    Bug Fixes

    • Fix caffe2 predict (#1103)
    • Fix bug when tensorizer is not defined (#1169)
    • Fix multitask metric reporter for lr logging (#1164)
    • Fix broken gradients logging and add lr logging to tensorboard (#1158)
    • Minor fix in blockwise sparsifier (#1130)
    • Fix clip_grad_norm API (#1143)
    • Fix for roberta squad tensorizer (#1137)
    • Fix multilabel metric reporter (#1115)
    • Fixed prepare_input in tensorizer (#1102)
    • Fix unk bug in exported model (#1076)
    • Fp16 fixes for byte-lstm and distillation (#1059)
    • Fix TypeError in clip_grad_norm_: "if grad_norm > max_norm > 0" raised "'>' not supported between instances of 'float' and 'NoneType'" (#1054)
    • Fix context in multitask (#1040)
    • Fix regression in ensemble trainer caused by recent fp16 change (#1033)
    • ReadTheDocs OOM fix with CPU Torch (#1027)
    • Dimension mismatch after setting max sequence length (#1154)
    • Allow null learning rate (#1156)
    • Don't fail on 0 input (#1104)
    • Remove side effect during pickling PickleableGPT2BPEEncoder
    • Set onnx==1.5.0 to fix CircleCI build temporarily (#1014)
    • Complete training loop gracefully even if no timing is reported (#1128)
    • Propagate min_freq for vocab correctly (#907)
    • Fix gen-default-config with Model param (#917)
    • Fix torchscript export for PyText modules (#1125)
    • Fix label_weights in DocModel (#1081)
    • Fix label_weights in bert models (#1100)
    • Fix config issues with Python 3.7 (#1066)
    • Temporary fix for Fairseq dependency (#1026)
    • Fix MultipleData by making tensorizers able to initialize from multiple data sources (#972)
    • Fix bug in copy_unk (#964)
    • Division by Zero bug in MLM Metric Reporter (#968)
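
    The clip_grad_norm_ entry above (#1054) reports a TypeError from comparing a float with None. The following is a minimal sketch of the usual guard, assuming the clipping threshold is optional and None means "do not clip". The function is a hypothetical stand-in that works on plain floats rather than tensors, not PyText's code.

```python
import math
from typing import List, Optional


def clip_grad_norm(grads: List[float], max_norm: Optional[float]) -> float:
    """Scale `grads` in place so their L2 norm is at most `max_norm`.

    Returns the norm measured before clipping. Hypothetical helper:
    plain floats stand in for gradient tensors.
    """
    grad_norm = math.sqrt(sum(g * g for g in grads))
    # Test for None first: in Python 3, `grad_norm > None` raises
    # TypeError ('>' not supported between 'float' and 'NoneType').
    if max_norm is not None and grad_norm > max_norm > 0:
        scale = max_norm / grad_norm
        for i, g in enumerate(grads):
            grads[i] = g * scale
    return grad_norm


if __name__ == "__main__":
    grads = [3.0, 4.0]
    print(clip_grad_norm(grads, None), grads)  # threshold unset: no clipping
    print(clip_grad_norm(grads, 1.0), grads)   # rescaled to unit norm
```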
  • v0.2.2(Aug 15, 2019)

    Note: this is the last release with _Deprecated classes. Those classes will be removed in the next release.

    New Features:

    • DeepCNN Representation for word tagging
    • Combine KLDivergenceBCELoss with SoftHardBCELoss and F.cross_entropy() in CrossEntropyLoss (#689)
    • add dense feature support for doc model (#710)
    • add torchscript quantization support in pytext
    • pytext multi-label support (#731)
    • open source transformer representations (#736)
    • open source transformer based models - data, tensorizers and tokenizer (#708)
    • Create AlternatingRandomizedBatchSampler (#737)
    • open source MaskedLM and BERT models (#734)
    • Support bytes input in word tagging model OSS (#745)
    • open source extractive question answering models (#742)
    • torchscriptify for ensemble task
    • enabled lmlstm labels exporting (#767)
    • Enable dense features in ByteTokensDocumentModel (#763)
    • created bilstm dropout condition (#769)
    • enabled lmlstm caffe2 exporting (#766)
    • PolynomialDecayScheduler (#791)
    • removed bilstm dependence on seq_lengths (#776)
    • fp16 optimizer (#782)
    • Add Dense Feature Normalization to FloatListTensorizer and DocModel (#859)
    • Add Sparsifier component to PyText and L0-projection based sparsifier (#860)
    • implemented cnn pooling for doc classification (#872)
    • implemented bottleneck separable convolutions (#855)
    • Add eps to Adam (#858)
    • implemented mobile exporter (#785)
    • support starting training from saved checkpoint (#824)
    • implemented separable convolutions (#830)
    • implemented gelu activations (#829)
    • implemented causal convolutions (#811)
    • implemented dilation for convolutions (#810)
    • created weight norm option (#809)
    • Ordered Neuron LSTM (#854)
    • Add PersonalizedByteDocModel (#816)
    • CNN based language models (#827)
    • improve csv support in TSVDataSource (#777)
    • Change default batch sampler DisjointMultitaskData to RoundRobinBatchSampler (#802)
    • Support using serialized pretrained embedding file (#797)
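
    The RoundRobinBatchSampler entry above (#802) concerns how batches from several data sources are interleaved. The following is a minimal sketch of the round-robin idea, assuming one batch is taken from each source in turn and iteration stops as soon as any source runs out. The function name and the stopping rule are assumptions for illustration, not PyText's implementation.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def round_robin_batches(sources: Dict[str, Iterable[T]]) -> Iterator[Tuple[str, T]]:
    """Yield (source_name, batch) pairs, cycling through the sources in order."""
    iterators = {name: iter(batches) for name, batches in sources.items()}
    if not iterators:
        return  # nothing to sample from
    while True:
        for name, it in iterators.items():
            try:
                batch = next(it)
            except StopIteration:
                # Assumed stopping rule: end when the first source is exhausted.
                return
            yield name, batch


if __name__ == "__main__":
    data = {"task_a": [1, 2, 3], "task_b": [10, 20]}
    for name, batch in round_robin_batches(data):
        print(name, batch)
```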

    Documentation / Usability / Logging:

    • Fewer out-of-vocab print messages, with some stats (#697)
    • Echo epoch number to console while training (#712)
    • Separate timing for prediction and metric calculation. (#738)
    • multi-label soft metrics (#754)
    • changed lm metric reporting (#765)
    • fix data source tutorial (#762)
    • fix doc sphinx deprecation warning (#775)
    • Add the ability to pass parameter values to gen-default-config (#856)
    • Remove "pytext/" from paths in demo json config (#878)
    • New documentation about hacking pytext and dealing with github. (#862)
    • install_deps supports updates (#863)
    • Reduce number of PEP print (#861)
    • better error message for config with unknown component (#801)
    • Add Recall at Precision Thresholds to Config (#792)
    • implemented perplexity reductions for lm score reporting (#799)
    • adapt prediction workflow to new design (#746)

    Bug fixes:

    • block sharded tsv eval/test fix (#698)
    • Fix BoundaryPooling tracing (#713)
    • fixes LMLSTM weight tying bug (#704)
    • Fix duplicate entries in vocab (#721)
    • Bugfix for trainer not reporting eval results (#740)
    • Reintroduce metrics export in new task (#748)
    • fix open source tests (#750)
    • Fix missing init_tensorizers arg (#893)
    • Fix intent slot metric reporter not working with byte offset (#883)
    • Fix issue with some tensorizers still re-initializing vocab when loaded from saved state (#848)
    • fixed overflow error in lm reporting (#831)
    • fix BlockShardedTSVDataSource (#832)

    v0.2.1 (skipped because of packaging issues)

  • v0.2.0(Jun 15, 2019)

    Note: This release makes the new data handler API the default and deprecates Task and Model classes using the old data handler API. We recommend that you migrate your models to the new API as soon as possible. More details here: https://www.facebook.com/groups/pytext/permalink/1038962512978256/

    New Stuff

    • most tasks and models deprecated, replaced with better versions using the new data handler API
    • performance improvements in metric reporter
    • Add Multilingual TSV Data Source
    • LabelSmoothedCrossEntropyLoss
    • Support for pretrained word embedding in TokenTensorizer
    • option to use pretrained embedding
    • TorchScript export for document classification
    • Improve log in trainer
    • performance measurement: reporting tokens_per_second and updates_per_second
    • Implement DocumentReader from DrQA in PyText (StackedBidirectionalRNN)
    • improved and updated documentation
    • Implement SWA(SGD|ADAM) and Adagrad Optimizers
    • cache numerized data in memory
    • TorchScript BPE tokenization
    • CLI command to update configs
    • Visualize gradients with tensorboard

    Many bug fixes and code clean-ups

  • v0.1.5(Apr 16, 2019)

    Note: this is the last release in the 0.1.x series. The next release will deprecate the Task and Model base classes and make the improved API of the new data handler the default. You can start using it already by inheriting from NewTask. NewDocumentClassification and NewWordTaggingTask use this new API, and a first example is in the tutorial "Custom Data Format".

    New Stuff

    • add config adapter
      • PyText is very young and its API is still in flux, making the config files brittle
      • config files now have a version number reflecting the API at the time they were created
      • older versions can be loaded and internally transformed into newer versions
    • better metrics and reporting
      • better training time tracking
      • cool new visualization of model state in TensorBoard
      • pretty results in the terminal
    • improved distributed training
    • torchscript export
    • support for SQuAD dataset
    • add AugmentedLSTM
    • add dense features support
    • new plugin system: command line option --include to import custom user classes (see tutorial "Custom Data Format" for example)
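
    The config adapter described in the list above loads an older config and transforms it into the current version. The following is a minimal sketch of that idea, assuming each config carries an integer version field and each adapter upgrades exactly one version. The registry, the decorator, and the field changes are hypothetical illustrations, not PyText's actual code.

```python
from typing import Callable, Dict

LATEST_VERSION = 2
# Maps a source version to the function that upgrades it by one version.
ADAPTERS: Dict[int, Callable[[dict], dict]] = {}


def register_adapter(from_version: int):
    """Decorator that registers an upgrade step (hypothetical API)."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        ADAPTERS[from_version] = fn
        return fn
    return decorator


@register_adapter(from_version=0)
def v0_to_v1(config: dict) -> dict:
    # Hypothetical change: a field was renamed between versions 0 and 1.
    config = dict(config)
    config["learning_rate"] = config.pop("lr")
    return config


@register_adapter(from_version=1)
def v1_to_v2(config: dict) -> dict:
    # Hypothetical change: a new option with a default value was added.
    config = dict(config)
    config.setdefault("epochs", 10)
    return config


def upgrade_to_latest(config: dict) -> dict:
    """Apply upgrade steps one by one until the config is current."""
    version = config.get("version", 0)  # unversioned configs count as 0
    while version < LATEST_VERSION:
        config = ADAPTERS[version](config)
        version += 1
        config["version"] = version
    return config


if __name__ == "__main__":
    print(upgrade_to_latest({"lr": 0.1}))
```

    Chaining single-version steps means a new release only has to add one adapter, and configs of any age still load.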

    Many bug fixes and code clean-ups

  • v0.1.4(Feb 7, 2019)

    New Stuff

    • Used pip freeze to pin all dependency versions as part of the release, which should increase stability
    • Refactor Metric Reporters to reduce coupling
    • RNNG Improvements:
      • Support Pretrained embeddings in RNNG
      • Support GPU Training
      • More Test Coverage
      • Tensorboard Support
    • Added QueryDocumentPairwiseRankingModel
    • Distributed Training Improvements:
      • Sharded Data Loading to reduce memory consumption
      • Fix several issues with race conditions and unserializable state
    • Reduced GPU memory Consumption by skipping gradient computation on evaluation
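
    Sharded data loading, as listed above, lets each distributed worker hold only its own slice of the data rather than the full set. The following is a minimal sketch, assuming rows are assigned by index modulo the number of workers. The names rank and world_size follow the usual distributed-training convention, and the function itself is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard(rows: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Return the rows owned by worker `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # islice streams the input lazily, so a worker never keeps
    # the rows that belong to other workers in memory.
    return islice(rows, rank, None, world_size)


if __name__ == "__main__":
    for rank in range(3):
        print(rank, list(shard(range(10), rank, world_size=3)))
```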

    And lots of bug fixes

    Known Issues

    PyText doesn't work with the new ONNX v1.4.0, so we have pinned it to 1.3.0 for now

  • v0.1.0(Dec 14, 2018)
