NLP made easy

Overview

GluonNLP: Your Choice of Deep Learning for NLP

GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that help you load text data, process it, and train models.

See our documentation at https://nlp.gluon.ai/master/index.html.

Features

  • Easy-to-use Text Processing Tools and Modular APIs
  • Pretrained Model Zoo
  • Write Models with Numpy-like API
  • Fast Inference via Apache TVM (incubating) (Experimental)
  • AWS Integration via SageMaker

Installation

First of all, install the latest MXNet. You may use the following commands:

# Install the version with CUDA 10.1
python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0b20210121" -f https://dist.mxnet.io/python
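
The four commands above differ only in the package name. As a minimal sketch, the hypothetical helper below (its name and the lookup table are illustrative and not part of GluonNLP or MXNet) builds the matching pip command from a CUDA version, using only the package names and the version pin listed above.

```python
import shlex

# Package names copied from the commands above; None stands for the CPU-only build.
MXNET_PACKAGES = {
    "10.1": "mxnet-cu101",
    "10.2": "mxnet-cu102",
    "11": "mxnet-cu110",
    None: "mxnet",
}


def mxnet_install_command(cuda_version=None, min_version="2.0.0b20210121"):
    """Return the pip invocation as an argument list (hypothetical helper)."""
    if cuda_version not in MXNET_PACKAGES:
        raise ValueError(f"No MXNet package listed for CUDA {cuda_version!r}")
    requirement = f"{MXNET_PACKAGES[cuda_version]}>={min_version}"
    return [
        "python3", "-m", "pip", "install", "-U", "--pre",
        requirement, "-f", "https://dist.mxnet.io/python",
    ]


# Print a shell-quoted command for CUDA 10.2 without running it.
print(shlex.join(mxnet_install_command("10.2")))
```

Returning an argument list rather than a string keeps the requirement specifier intact when the command is later passed to `subprocess.run`.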

To install GluonNLP, use

python3 -m pip install -U -e .

# Also, you may install all the extra requirements via
python3 -m pip install -U -e ."[extras]"

If you do not have permission to install system-wide, you can install to your user folder instead:

python3 -m pip install -U -e . --user
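
After installing, it can help to confirm that both packages are visible to the interpreter you intend to use. The sketch below relies only on the standard library; `is_installed` is a hypothetical helper name, and the check locates the packages without importing them.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the two packages from the installation steps are visible.
for name in ("mxnet", "gluonnlp"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```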

For Windows users, we recommend using the Windows Subsystem for Linux.

Access the Command-line Toolkits

To support both engineers and researchers, we provide command-line toolkits for downloading and processing NLP datasets. For more details, refer to GluonNLP Datasets and GluonNLP Data Processing Tools.

# CLI for downloading / preparing the dataset
nlp_data help

# CLI for accessing some common data processing scripts
nlp_process help

# Also, you can use `python -m` to access the toolkits
python3 -m gluonnlp.cli.data help
python3 -m gluonnlp.cli.process help
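
The `python -m` form is convenient when calling the toolkits from a script. A minimal sketch, assuming GluonNLP is installed for the current interpreter; `toolkit_command` and `run_toolkit` are hypothetical helpers, and only the module paths shown above are used.

```python
import subprocess
import sys

# Module paths taken from the commands above.
TOOLKIT_MODULES = {
    "data": "gluonnlp.cli.data",
    "process": "gluonnlp.cli.process",
}


def toolkit_command(toolkit, *args):
    """Build the argument list for one of the command-line toolkits."""
    return [sys.executable, "-m", TOOLKIT_MODULES[toolkit], *args]


def run_toolkit(toolkit, *args):
    """Run the toolkit and return its exit code (non-zero if GluonNLP is missing)."""
    return subprocess.run(toolkit_command(toolkit, *args)).returncode


# Show the command that run_toolkit("data", "help") would execute.
print(toolkit_command("data", "help"))
```

Using `sys.executable` keeps the call in the same environment that GluonNLP was installed into.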

Run Unit Tests

See tests for instructions on running the unit tests.

Use Docker

You can use Docker to launch a JupyterLab development environment with GluonNLP installed.

# GPU Instance
docker pull gluonai/gluon-nlp:gpu-latest
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=2g gluonai/gluon-nlp:gpu-latest

# CPU Instance
docker pull gluonai/gluon-nlp:cpu-latest
docker run --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=2g gluonai/gluon-nlp:cpu-latest

For more details, you can refer to the guidance in tools/docker.

Comments
  • [AMP] Add AMP support to Machine Translation

    Description

    • Fix the horovod support and add the amp support to machine translation.
    • Support TN in training and inference
    • Update training results of SQuAD, transformer-base, transformer-large, transformer-t2t-big.
    • Add Deep Encoder, Shallow Decoder

    Checklist

    Essentials

    • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [x] Code is well-documented

    Changes

    • [x] Add AMP, tests

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here

    cc @dmlc/gluon-nlp-team

    opened by sxjscience 103
  • [Fix][Docker] Fix the docker image + Fix pretrain_corpus document.

    Description

    Since the horovod support has been fixed, improve our docker image. Now, the CI docker will depend on the base docker image, which supports:

    • horovod training
    • TVM

    Checklist

    Essentials

    • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [ ] Code is well-documented

    cc @dmlc/gluon-nlp-team

    opened by sxjscience 83
  • [TVM] Add TVM Support

    Description

    Add TVM test case + profiling after https://github.com/apache/incubator-tvm/pull/6699 is merged.

    • [x] Test case
    • [x] Profile

    Checklist

    Essentials

    • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage
    • [x] Code is well-documented

    cc @dmlc/gluon-nlp-team

    opened by sxjscience 58
  • [FEATURE] Add transformer inference code

    Description

    Add transformer inference code to make inference easy and to make it convenient to analyze transformer inference performance. @TaoLv @juliusshufan @pengzhao-intel

    Use the following command to run inference:

    python inference_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 2700 --scaled --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --model_parameter PATH/TO/valid_best.params

    This produces the following output:

    2019-08-19 22:03:57,600 - root - batch id=10, batch_bleu=26.0366
    2019-08-19 22:04:45,904 - root - batch id=20, batch_bleu=30.8409
    2019-08-19 22:05:26,991 - root - batch id=30, batch_bleu=25.3955
    2019-08-19 22:06:11,089 - root - batch id=40, batch_bleu=21.9322
    2019-08-19 22:06:58,313 - root - batch id=50, batch_bleu=29.7584
    2019-08-19 22:07:49,634 - root - batch id=60, batch_bleu=26.5373
    2019-08-19 22:08:33,846 - root - batch id=70, batch_bleu=23.2735
    2019-08-19 22:09:24,003 - root - batch id=80, batch_bleu=22.8065
    2019-08-19 22:10:03,324 - root - batch id=90, batch_bleu=26.0000
    2019-08-19 22:10:41,997 - root - batch id=100, batch_bleu=27.7887
    2019-08-19 22:11:26,346 - root - batch id=110, batch_bleu=22.6277
    2019-08-19 22:12:10,353 - root - batch id=120, batch_bleu=25.9580
    2019-08-19 22:12:47,614 - root - batch id=130, batch_bleu=22.6479
    2019-08-19 22:13:20,316 - root - batch id=140, batch_bleu=26.6224
    2019-08-19 22:13:54,895 - root - batch id=150, batch_bleu=30.2036
    2019-08-19 22:14:32,938 - root - batch id=160, batch_bleu=22.4694
    2019-08-19 22:15:09,624 - root - batch id=170, batch_bleu=26.4245
    2019-08-19 22:15:39,387 - root - batch id=180, batch_bleu=28.8940
    2019-08-19 22:16:11,217 - root - batch id=190, batch_bleu=26.2148
    2019-08-19 22:16:47,089 - root - batch id=200, batch_bleu=24.3723
    2019-08-19 22:17:22,472 - root - batch id=210, batch_bleu=27.1375
    2019-08-19 22:18:00,030 - root - batch id=220, batch_bleu=25.5695
    2019-08-19 22:18:32,847 - root - batch id=230, batch_bleu=25.9404
    2019-08-19 22:19:01,637 - root - batch id=240, batch_bleu=25.6699
    2019-08-19 22:19:29,690 - root - batch id=250, batch_bleu=22.1795
    2019-08-19 22:19:58,859 - root - batch id=260, batch_bleu=21.1670
    2019-08-19 22:20:28,113 - root - batch id=270, batch_bleu=24.0742
    2019-08-19 22:20:53,027 - root - batch id=280, batch_bleu=27.6126
    2019-08-19 22:21:20,014 - root - batch id=290, batch_bleu=25.6340
    2019-08-19 22:21:50,416 - root - batch id=300, batch_bleu=22.7178
    2019-08-19 22:22:14,171 - root - batch id=310, batch_bleu=30.1331
    2019-08-19 22:22:37,462 - root - batch id=320, batch_bleu=23.2388
    2019-08-19 22:23:01,075 - root - batch id=330, batch_bleu=27.9605
    2019-08-19 22:23:22,236 - root - batch id=340, batch_bleu=23.9418
    2019-08-19 22:23:40,851 - root - batch id=350, batch_bleu=22.2135
    2019-08-19 22:24:01,679 - root - batch id=360, batch_bleu=23.6225
    2019-08-19 22:24:15,178 - root - Inference at test dataset. inference bleu=26.0137, throughput=0.1236K wps
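
The per-batch scores in a log like this can be summarized with a few lines of Python. A minimal sketch, assuming the log format shown above; `parse_batch_bleu` is a hypothetical helper, and the three sample entries are copied from the output.

```python
import re

# Matches the "batch id=..., batch_bleu=..." part of each log entry.
BATCH_PATTERN = re.compile(r"batch id=(\d+), batch_bleu=([0-9.]+)")


def parse_batch_bleu(log_text):
    """Return (batch_id, bleu) pairs found anywhere in the log text."""
    return [(int(i), float(b)) for i, b in BATCH_PATTERN.findall(log_text)]


SAMPLE_LOG = """\
2019-08-19 22:03:57,600 - root - batch id=10, batch_bleu=26.0366
2019-08-19 22:04:45,904 - root - batch id=20, batch_bleu=30.8409
2019-08-19 22:05:26,991 - root - batch id=30, batch_bleu=25.3955
"""

scores = parse_batch_bleu(SAMPLE_LOG)
mean_bleu = sum(b for _, b in scores) / len(scores)
print(f"{len(scores)} batches, mean batch BLEU = {mean_bleu:.4f}")
```

Keep in mind that the mean of per-batch scores is not, in general, equal to the corpus-level BLEU reported on the final line of the log.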

    Checklist

    Essentials

    • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage
    • [x] Code is well-documented

    Changes

    • [ ] Feature1, tests, (and when applicable, API doc)
    • [ ] Feature2, tests, (and when applicable, API doc)

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here
    release focus 
    opened by pengxin99 43
  • [SCRIPT] XLNet finetuning scripts for glue

    Description

    XLNet finetuning scripts for glue

    Checklist

    Essentials

    • [ ] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [ ] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [ ] Code is well-documented

    Changes

    • [ ] Feature1, tests, (and when applicable, API doc)
    • [ ] Feature2, tests, (and when applicable, API doc)

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here

    cc @dmlc/gluon-nlp-team

    opened by zburning 40
  • [SCRIPT]QA Fine-tuning Example for BERT

    Description

    Add a QA fine-tuning example for BERT (#476), using the BERT tokenizer from #464.

    On SQuAD 1.1 with the bert_base uncased model, the dev set reaches an F1 of 88.52% and an EM of 80.98% (based on mxnet-cu90-1.5.0b20190216). With the bert_large uncased model, the dev set reaches an F1 of 90.97% and an EM of 84.04% (based on mxnet-cu90-1.5.0b20190216). With mxnet-cu90-1.5.0b20190112 and the bert_base uncased model, the dev set reaches an F1 of 88.45% and an EM of 81.21%.
    With mxnet-cu90-1.5.0b20190216, dropout uses the cuDNN implementation, so training finishes one hour sooner (base model, epochs=2). Log in https://github.com/dmlc/web-data/pull/161

    On SQuAD 2.0 with the bert_large uncased model and null_score_diff_threshold=-2.0, the dev set results are as follows (based on mxnet-cu90-1.5.0b20190216):

    {
      "exact": 77.958392992504,
      "f1": 81.02012658815627,
      "total": 11873,
      "HasAns_exact": 73.3974358974359,
      "HasAns_f1": 79.52968336389662,
      "HasAns_total": 5928,
      "NoAns_exact": 82.50630782169891,
      "NoAns_f1": 82.50630782169891,
      "NoAns_total": 5945
    }
    

    Log in https://github.com/dmlc/web-data/pull/164
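
The overall scores in the JSON above are the count-weighted averages of the HasAns and NoAns scores, which gives a quick consistency check on the numbers. A minimal sketch using only the reported values (the function name is illustrative):

```python
import json

# Dev set results copied from the SQuAD 2.0 report above.
REPORTED = json.loads("""
{
  "exact": 77.958392992504,
  "f1": 81.02012658815627,
  "total": 11873,
  "HasAns_exact": 73.3974358974359,
  "HasAns_f1": 79.52968336389662,
  "HasAns_total": 5928,
  "NoAns_exact": 82.50630782169891,
  "NoAns_f1": 82.50630782169891,
  "NoAns_total": 5945
}
""")


def weighted_score(metrics, key):
    """Combine the HasAns and NoAns scores, weighted by question counts."""
    has_ans = metrics[f"HasAns_{key}"] * metrics["HasAns_total"]
    no_ans = metrics[f"NoAns_{key}"] * metrics["NoAns_total"]
    return (has_ans + no_ans) / metrics["total"]


for key in ("exact", "f1"):
    print(key, round(weighted_score(REPORTED, key), 4), "reported:", round(REPORTED[key], 4))
```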

    The optimizer is Adam with lr=3e-5, beta1=0.9, beta2=0.999, epsilon=1e-08. (The original repo uses AdamW with lr=3e-5, wd=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-6.)
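
For readers unfamiliar with the distinction, the sketch below shows one scalar update step in the textbook formulation of Adam and of AdamW (decoupled weight decay). It is an illustration only: it does not mirror MXNet's optimizer or the original repo's code, and `adam_update` is a hypothetical name. The default hyperparameters are the Adam settings quoted above.

```python
import math


def adam_update(w, grad, m, v, t, lr=3e-5, beta1=0.9, beta2=0.999,
                eps=1e-8, wd=0.0, decoupled=False):
    """One scalar update step; returns the new (w, m, v).

    decoupled=False: weight decay is added to the gradient (Adam with L2).
    decoupled=True:  weight decay is applied directly to the weight (AdamW).
    """
    if wd and not decoupled:
        grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = m_hat / (math.sqrt(v_hat) + eps)
    if wd and decoupled:
        step = step + wd * w
    return w - lr * step, m, v


# First step (t=1) from the same starting point under both settings.
w_adam, _, _ = adam_update(1.0, 0.5, 0.0, 0.0, t=1)
w_adamw, _, _ = adam_update(1.0, 0.5, 0.0, 0.0, t=1, eps=1e-6, wd=0.01, decoupled=True)
print(w_adam, w_adamw)
```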

    Checklist

    Essentials

    • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [x] Code is well-documented

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here
    release focus 
    opened by fierceX 38
  • Numerous doc updates

    Description

    Numerous doc updates

    Checklist

    Essentials

    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage
    • [x] Code is well-documented

    Changes

    • [x] add --upgrade flag in installation for mxnet. (official answer from pypa people: https://pypi.org/help/#tls-deprecation) (might not be advisable for forcing an upgrade, szha@)
    • [x] maybe reduce doc nested level in gluon-nlp.mxnet.io (See pytorch doc http://pytorch.org/docs/stable/index.html) (szha@)
    • [x] get rid of unused namespace in doc (e.g. data for submodules that are already import *) (szha@)
    • [x] use api package namespace directly in API doc. API doc is for reference purpose (szha@)
    • [x] Link is using markdown format so it's not displaying properly http://gluon-nlp.mxnet.io/api/data.html#gluonnlp.data.transforms.NLTKMosesTokenizer (szha@)
    • [x] Separate public datasets from data API in API doc (szha@)
    • [x] for scripts, we should link to compressed archive for the whole folder. view source of script should link to the folder. (szha@)
    • [x] Examples should have download links for the ipynb. (szha@)
    opened by szha 36
  • Add nt-asgd for language model

    Description

    1. Add nt-asgd for language model
    2. Online update of nt-asgd

    Checklist

    Essentials

    • [ ] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [ ] Code is well-documented

    Changes

    • [ ] Feature1, tests, (and when applicable, API doc)
    • [ ] Feature2, tests, (and when applicable, API doc)

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here
    release focus 
    opened by cgraywang 35
  • [FEATURE] INT8 Quantization for BERT Sentence Classification and Question Answering

    Description

    Quantization solution for BERT SC and QA with Intel DLBoost.

    Main Code Changes:

    • [x] change the input order in the BERT SC dataloader to align with the input order in the symbolic model (data0=input_ids, data1=segment_ids, data2=valid_length)
    • [x] implement BertLayerCollector to support output clipping during calibration. By default, the max_range of the GeLU output is now clipped to 10 and the min_range of the layer_norm output to -50.
    • [x] add calibration pass and symbolblock inference pass in finetune_classification.py.
    • [x] add calibration pass and symbolblock inference pass in finetune_squad.py.
    • [x] Quantization Readme
    • [x] Document
    • [ ] accuracy to be remeasured
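
The clipping rule from the list above can be sketched as follows. This is not the BertLayerCollector implementation: `clip_calibration_range` and its `layer_kind` argument are hypothetical, and only the two default thresholds (10 for the GeLU output, -50 for the layer_norm output) come from the description.

```python
def clip_calibration_range(layer_kind, min_range, max_range,
                           gelu_max=10.0, layer_norm_min=-50.0):
    """Clip an observed (min, max) calibration range for one layer output.

    layer_kind is a hypothetical label: "gelu", "layer_norm", or anything else.
    """
    if layer_kind == "gelu":
        max_range = min(max_range, gelu_max)
    elif layer_kind == "layer_norm":
        min_range = max(min_range, layer_norm_min)
    return min_range, max_range


# Made-up observed ranges, used only to exercise the rule.
print(clip_calibration_range("gelu", -0.2, 37.5))
print(clip_calibration_range("layer_norm", -120.0, 8.0))
print(clip_calibration_range("dense", -3.0, 3.0))
```

Clipping outliers before choosing the quantization scale keeps a handful of extreme activations from stretching the INT8 range for every other value.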

    Dependency:

    https://github.com/apache/incubator-mxnet/pull/17161
    https://github.com/apache/incubator-mxnet/pull/17187
    https://github.com/dmlc/gluon-nlp/pull/1091
    https://github.com/dmlc/gluon-nlp/issues/1127
    https://github.com/dmlc/gluon-nlp/pull/1124
    ...

    FP32 and INT8 Accuracy:

    These will be remeasured on c5 once the pending PRs are ready.

    | Task  | maxLength | FP32 Accuracy | INT8 Accuracy | FP32 F1 | INT8 F1 |
    |-------|-----------|---------------|---------------|---------|---------|
    | SQUAD | 128       | 77.32         | 76.61         | 84.84   | 84.26   |
    | SQUAD | 384       | 80.86         | 80.56         | 88.31   | 88.14   |
    | MRPC  | 128       | 87.75         | 87.25         | 70.50   | 70.56   |

    @pengzhao-intel @TaoLv @eric-haibin-lin @szha

    Checklist

    Essentials

    • [ ] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [ ] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [ ] Code is well-documented

    Changes

    • [ ] Feature1, tests, (and when applicable, API doc)
    • [ ] Feature2, tests, (and when applicable, API doc)

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here
    release focus 
    opened by xinyu-intel 34
  • [SCRIPT] Reproducing GLUE score on 8 tasks

    Description

    [BERT] Reproducing GLUE score on 8 tasks

    • Add scripts for the RTE, QQP, QNLI, STS-B, CoLA, WNLI, and SST tasks, with a task-specific metric (MCC, accuracy, F1, Pearson correlation) for each.
    • Modify example tutorial.
    • Split trainer with bias and weight for simplicity.
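
The task list above implies a per-task metric. A minimal sketch, assuming the usual GLUE pairing (CoLA with MCC, STS-B with Pearson correlation, QQP with accuracy and F1, the remaining tasks with accuracy); the PR's actual assignment is not shown here, so the table and the function name below are illustrative. MCC is written out for binary labels because it is the least familiar of the four.

```python
import math

# Assumed pairing, following the common GLUE convention.
TASK_METRICS = {
    "CoLA": ("mcc",),
    "STS-B": ("pearson corr",),
    "QQP": ("accuracy", "F1"),
    "RTE": ("accuracy",),
    "QNLI": ("accuracy",),
    "WNLI": ("accuracy",),
    "SST": ("accuracy",),
}


def matthews_corrcoef(labels, preds):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(TASK_METRICS["CoLA"], matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```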

    Checklist

    Essentials

    • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage
    • [x] Code is well-documented

    Changes

    • [x] Feature1, tests, (and when applicable, API doc)
    • [x] Feature2, tests, (and when applicable, API doc)

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here
    release focus 
    opened by haven-jeon 32
  • Make BERT-GPU deploy compatible with MXNet 1.8

    Description

    Change the custom graph pass implementation to make it compatible with MXNet 1.8. Resolves issue https://github.com/dmlc/gluon-nlp/issues/1388

    Checklist

    Essentials

    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage
    • [x] Code is well-documented

    Changes

    • [x] Change custom graph pass to support both MXNet 1.7 & MXNet 1.8
    • [x] Change setup and deploy scripts accordingly
    • [x] Activate CUDA Graphs for MXNet 1.8 (> 30% speedup with small batch sizes)

    cc @dmlc/gluon-nlp-team, @samskalicky, @Kh4L

    opened by MoisesHer 30
  • update CI

    Description

    (Brief description on what this PR is about)

    Checklist

    Essentials

    • [ ] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
    • [ ] Changes are complete (i.e. I finished coding on this PR)
    • [ ] All changes have test coverage
    • [ ] Code is well-documented

    Changes

    • [ ] Feature1, tests, (and when applicable, API doc)
    • [ ] Feature2, tests, (and when applicable, API doc)

    Comments

    • If this change is a backward incompatible change, why must this change be made.
    • Interesting edge cases to note here

    cc @dmlc/gluon-nlp-team

    opened by barry-jin 0
  • (not a bug) question about bert `create_pretraining_data.tokenize_lines()`

    Description

    In the function scripts.pretraining.bert.create_pretraining_data.tokenize_lines()

    The code snippet:

    for line in lines:
        if not line:
            break
        line = line.strip()
        # Empty lines are used as document delimiters
        if not line:
            results.append([])
        else:
            #<OMITTED FOR BREVITY...>
    return results
    

    This suggests that empty or null lines (e.g. "" or None) break out of the for-loop, returning only the lines processed so far, whereas lines that are empty after stripping (e.g. " ") are used as document delimiters.

    Could someone shed light as to what the (empty line + break-from-loop) is meant to accomplish? Are empty/null lines used as delimiters?
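
The behavior in question can be reproduced in isolation. The sketch below is a stand-in, not the script's actual function: the omitted branch is replaced by a hypothetical whitespace split so that the example runs, and `tokenize_lines_sketch` is an illustrative name.

```python
def tokenize_lines_sketch(lines, tokenize=str.split):
    """Mimic the control flow of the quoted snippet with a placeholder tokenizer."""
    results = []
    for line in lines:
        # "" or None ends processing early.
        if not line:
            break
        line = line.strip()
        # Lines that are empty after stripping become document delimiters.
        if not line:
            results.append([])
        else:
            results.append(tokenize(line))  # placeholder for the omitted branch
    return results


# "d e" is never reached because the empty string stops the loop.
print(tokenize_lines_sketch(["a b", " ", "c", "", "d e"]))
```

Note that lines obtained by iterating over a text file keep their trailing newline, so a blank line in a file arrives as "\n" and takes the delimiter branch; an empty string is what `readline()` returns at end of file.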

    bug 
    opened by kiukchung 1
  • Wheel fails to build setup.py for gluonnlp

    Description

    gluonnlp failed to install on Windows 10 because setup.py did not exit successfully. I already have wheel installed for both pip and pip3.

    Error Message

    (venv) PS C:\Users\HarshadPrakashBhandw\PycharmProjects\demo> py -m pip install gluonnlp-0.10.0.tar.gz --no-cache-dir
    Processing c:\users\harshadprakashbhandw\pycharmprojects\demo\gluonnlp-0.10.0.tar.gz
    Preparing metadata (setup.py) ... done
    Requirement already satisfied: numpy>=1.16.0 in c:\users\harshadprakashbhandw\pycharmprojects\demo\venv\lib\site-packages (from gluonnlp==0.10.0) (1.22.4)
    Requirement already satisfied: cython in c:\users\harshadprakashbhandw\pycharmprojects\demo\venv\lib\site-packages (from gluonnlp==0.10.0) (0.29.30)
    Requirement already satisfied: packaging in c:\users\harshadprakashbhandw\pycharmprojects\demo\venv\lib\site-packages (from gluonnlp==0.10.0) (21.3)
    Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\users\harshadprakashbhandw\pycharmprojects\demo\venv\lib\site-packages (from packaging->gluonnlp==0.10.0) (3.0.9)
    Building wheels for collected packages: gluonnlp
    Building wheel for gluonnlp (setup.py) ... error
    error: subprocess-exited-with-error

    × python setup.py bdist_wheel did not run successfully. │ exit code: 1 ╰─> [140 lines of output] running bdist_wheel running build running build_py creating build creating build\lib.win-amd64-cpython-310 creating build\lib.win-amd64-cpython-310\gluonnlp copying src\gluonnlp\base.py -> build\lib.win-amd64-cpython-310\gluonnlp copying src\gluonnlp_constants.py -> build\lib.win-amd64-cpython-310\gluonnlp copying src\gluonnlp_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp creating build\lib.win-amd64-cpython-310\gluonnlp\calibration copying src\gluonnlp\calibration\collector.py -> build\lib.win-amd64-cpython-310\gluonnlp\calibration copying src\gluonnlp\calibration_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\calibration creating build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\baidu_ernie_data.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\candidate_sampler.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\classification.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\conll.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\dataloader.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\dataset.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\datasetloader.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\glue.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\intent_slot.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\question_answering.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\registry.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\sampler.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\sentiment.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying 
src\gluonnlp\data\stream.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\super_glue.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\transforms.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\translation.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\utils.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\word_embedding_evaluation.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data creating build\lib.win-amd64-cpython-310\gluonnlp\embedding copying src\gluonnlp\embedding\evaluation.py -> build\lib.win-amd64-cpython-310\gluonnlp\embedding copying src\gluonnlp\embedding\token_embedding.py -> build\lib.win-amd64-cpython-310\gluonnlp\embedding copying src\gluonnlp\embedding_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\embedding creating build\lib.win-amd64-cpython-310\gluonnlp\initializer copying src\gluonnlp\initializer\initializer.py -> build\lib.win-amd64-cpython-310\gluonnlp\initializer copying src\gluonnlp\initializer_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\initializer creating build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss\activation_regularizer.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss\label_smoothing.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss\loss.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss creating build\lib.win-amd64-cpython-310\gluonnlp\metric copying src\gluonnlp\metric\length_normalized_loss.py -> build\lib.win-amd64-cpython-310\gluonnlp\metric copying src\gluonnlp\metric\masked_accuracy.py -> build\lib.win-amd64-cpython-310\gluonnlp\metric copying src\gluonnlp\metric_init_.py -> 
build\lib.win-amd64-cpython-310\gluonnlp\metric creating build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\attention_cell.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\bert.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\bilm_encoder.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\block.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\convolutional_encoder.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\elmo.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\highway.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\info.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\language_model.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\lstmpcellwithclip.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\parameter.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\sampled_block.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\seq2seq_encoder_decoder.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\sequence_sampler.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\transformer.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\translation.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\utils.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\model creating build\lib.win-amd64-cpython-310\gluonnlp\optimizer copying src\gluonnlp\optimizer\bert_adam.py -> build\lib.win-amd64-cpython-310\gluonnlp\optimizer copying src\gluonnlp\optimizer_init_.py -> 
build\lib.win-amd64-cpython-310\gluonnlp\optimizer creating build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\files.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\parallel.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\parameter.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\seed.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\version.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils creating build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\bert.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\elmo.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\subwords.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\vocab.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab creating build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify\batchify.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify\embedding.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify\language_model.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify creating build\lib.win-amd64-cpython-310\gluonnlp\data\bert copying src\gluonnlp\data\bert\glue.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\bert copying src\gluonnlp\data\bert\squad.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\bert copying src\gluonnlp\data\bert_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\bert creating 
build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora\google_billion_word.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora\large_text_compression_benchmark.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora\wikitext.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora creating build\lib.win-amd64-cpython-310\gluonnlp\data\xlnet copying src\gluonnlp\data\xlnet\squad.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\xlnet copying src\gluonnlp\data\xlnet_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\xlnet creating build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train\cache.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train\embedding.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train\language_model.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train running egg_info writing src\gluonnlp.egg-info\PKG-INFO writing dependency_links to src\gluonnlp.egg-info\dependency_links.txt writing requirements to src\gluonnlp.egg-info\requires.txt writing top-level names to src\gluonnlp.egg-info\top_level.txt reading manifest file 'src\gluonnlp.egg-info\SOURCES.txt' reading manifest template 'MANIFEST.in' warning: no files found matching '.py' under directory 'gluonnlp' warning: no previously-included files matching '' found under directory 'tests' warning: no previously-included files matching '*' found under directory 'scripts' adding license file 'LICENSE' writing manifest file 'src\gluonnlp.egg-info\SOURCES.txt' copying src\gluonnlp\data\fast_bert_tokenizer.c -> build\lib.win-amd64-cpython-310\gluonnlp\data copying 
src\gluonnlp\data\fast_bert_tokenizer.pyx -> build\lib.win-amd64-cpython-310\gluonnlp\data running build_ext skipping 'src/gluonnlp/data\fast_bert_tokenizer.c' Cython extension (up-to-date) building 'gluonnlp.data.fast_bert_tokenizer' extension creating build\temp.win-amd64-cpython-310 creating build\temp.win-amd64-cpython-310\Release creating build\temp.win-amd64-cpython-310\Release\src creating build\temp.win-amd64-cpython-310\Release\src\gluonnlp creating build\temp.win-amd64-cpython-310\Release\src\gluonnlp\data "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\HarshadPrakashBhandw\PycharmProjects\demo\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x 64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Fi les (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-I C:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86 )\Windows Kits\10\include\10.0.19041.0\cppwinrt" /Tcsrc/gluonnlp/data\fast_bert_tokenizer.c /Fobuild\temp.win-amd64-cpython-310\Release\src/gluonnlp/d ata\fast_bert_tokenizer.obj fast_bert_tokenizer.c src/gluonnlp/data\fast_bert_tokenizer.c(4005): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data src/gluonnlp/data\fast_bert_tokenizer.c(13614): warning C4013: '_PyGen_Send' undefined; assuming extern returning int src/gluonnlp/data\fast_bert_tokenizer.c(13614): warning C4047: '=': 'PyObject *' differs in levels of indirection from 'int' 
src/gluonnlp/data\fast_bert_tokenizer.c(13619): warning C4047: '=': 'PyObject *' differs in levels of indirection from 'int' src/gluonnlp/data\fast_bert_tokenizer.c(13702): warning C4047: '=': 'PyObject *' differs in levels of indirection from 'int' "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\bin\HostX86\x64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\HarshadPrakashBhandw\PycharmProjects\demo\venv\libs "/LIBPATH:C:\Program Files\WindowsApps\P ythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\libs" "/LIBPATH:C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.1 0.1520.0_x64__qbz5n2kfra8p0" /LIBPATH:C:\Users\HarshadPrakashBhandw\PycharmProjects\demo\venv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft V isual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\lib\um\x64" "/LIBPATH:C:\Prog ram Files (x86)\Windows Kits\10\lib\10.0.19041.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.19041.0\um\x64" /EXPORT:PyInit_fa st_bert_tokenizer build\temp.win-amd64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.obj /OUT:build\lib.win-amd64-cpython-310\gluonnlp\data
    fast_bert_tokenizer.cp310-win_amd64.pyd /IMPLIB:build\temp.win-amd64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.cp310-win_amd64.lib
    Creating library build\temp.win-amd64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.cp310-win_amd64.lib and object build\temp.win-a md64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.cp310-win_amd64.exp fast_bert_tokenizer.obj : error LNK2001: unresolved external symbol _PyGen_Send build\lib.win-amd64-cpython-310\gluonnlp\data\fast_bert_tokenizer.cp310-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\bin\HostX86\x64\link.exe' fai led with exit code 1120 [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for gluonnlp
    Running setup.py clean for gluonnlp
    Failed to build gluonnlp
    Installing collected packages: gluonnlp
    Running setup.py install for gluonnlp ... error
    error: subprocess-exited-with-error

    × Running setup.py install for gluonnlp did not run successfully. │ exit code: 1 ╰─> [142 lines of output] running install C:\Users\HarshadPrakashBhandw\PycharmProjects\demo\venv\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools. warnings.warn( running build running build_py creating build creating build\lib.win-amd64-cpython-310 creating build\lib.win-amd64-cpython-310\gluonnlp copying src\gluonnlp\base.py -> build\lib.win-amd64-cpython-310\gluonnlp copying src\gluonnlp_constants.py -> build\lib.win-amd64-cpython-310\gluonnlp copying src\gluonnlp_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp creating build\lib.win-amd64-cpython-310\gluonnlp\calibration copying src\gluonnlp\calibration\collector.py -> build\lib.win-amd64-cpython-310\gluonnlp\calibration copying src\gluonnlp\calibration_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\calibration creating build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\baidu_ernie_data.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\candidate_sampler.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\classification.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\conll.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\dataloader.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\dataset.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\datasetloader.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\glue.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\intent_slot.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\question_answering.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\registry.py -> 
build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\sampler.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\sentiment.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\stream.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\super_glue.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\transforms.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\translation.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\utils.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\word_embedding_evaluation.py -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data creating build\lib.win-amd64-cpython-310\gluonnlp\embedding copying src\gluonnlp\embedding\evaluation.py -> build\lib.win-amd64-cpython-310\gluonnlp\embedding copying src\gluonnlp\embedding\token_embedding.py -> build\lib.win-amd64-cpython-310\gluonnlp\embedding copying src\gluonnlp\embedding_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\embedding creating build\lib.win-amd64-cpython-310\gluonnlp\initializer copying src\gluonnlp\initializer\initializer.py -> build\lib.win-amd64-cpython-310\gluonnlp\initializer copying src\gluonnlp\initializer_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\initializer creating build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss\activation_regularizer.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss\label_smoothing.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss\loss.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss copying src\gluonnlp\loss_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\loss creating build\lib.win-amd64-cpython-310\gluonnlp\metric copying src\gluonnlp\metric\length_normalized_loss.py -> 
build\lib.win-amd64-cpython-310\gluonnlp\metric copying src\gluonnlp\metric\masked_accuracy.py -> build\lib.win-amd64-cpython-310\gluonnlp\metric copying src\gluonnlp\metric_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\metric creating build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\attention_cell.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\bert.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\bilm_encoder.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\block.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\convolutional_encoder.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\elmo.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\highway.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\info.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\language_model.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\lstmpcellwithclip.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\parameter.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\sampled_block.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\seq2seq_encoder_decoder.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\sequence_sampler.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\transformer.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\translation.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model\utils.py -> build\lib.win-amd64-cpython-310\gluonnlp\model copying src\gluonnlp\model_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\model creating 
build\lib.win-amd64-cpython-310\gluonnlp\optimizer copying src\gluonnlp\optimizer\bert_adam.py -> build\lib.win-amd64-cpython-310\gluonnlp\optimizer copying src\gluonnlp\optimizer_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\optimizer creating build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\files.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\parallel.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\parameter.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\seed.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils\version.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils copying src\gluonnlp\utils_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\utils creating build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\bert.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\elmo.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\subwords.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab\vocab.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab copying src\gluonnlp\vocab_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\vocab creating build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify\batchify.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify\embedding.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify\language_model.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify copying src\gluonnlp\data\batchify_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\batchify creating build\lib.win-amd64-cpython-310\gluonnlp\data\bert copying src\gluonnlp\data\bert\glue.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\bert copying src\gluonnlp\data\bert\squad.py -> 
build\lib.win-amd64-cpython-310\gluonnlp\data\bert copying src\gluonnlp\data\bert_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\bert creating build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora\google_billion_word.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora\large_text_compression_benchmark.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora\wikitext.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora copying src\gluonnlp\data\corpora_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\corpora creating build\lib.win-amd64-cpython-310\gluonnlp\data\xlnet copying src\gluonnlp\data\xlnet\squad.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\xlnet copying src\gluonnlp\data\xlnet_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\data\xlnet creating build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train\cache.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train\embedding.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train\language_model.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train copying src\gluonnlp\model\train_init_.py -> build\lib.win-amd64-cpython-310\gluonnlp\model\train running egg_info writing src\gluonnlp.egg-info\PKG-INFO writing dependency_links to src\gluonnlp.egg-info\dependency_links.txt writing requirements to src\gluonnlp.egg-info\requires.txt writing top-level names to src\gluonnlp.egg-info\top_level.txt reading manifest file 'src\gluonnlp.egg-info\SOURCES.txt' reading manifest template 'MANIFEST.in' warning: no files found matching '.py' under directory 'gluonnlp' warning: no previously-included files matching '' found under directory 'tests' warning: no previously-included files matching '*' found under directory 'scripts' adding license file 'LICENSE' writing manifest file 
'src\gluonnlp.egg-info\SOURCES.txt' copying src\gluonnlp\data\fast_bert_tokenizer.c -> build\lib.win-amd64-cpython-310\gluonnlp\data copying src\gluonnlp\data\fast_bert_tokenizer.pyx -> build\lib.win-amd64-cpython-310\gluonnlp\data running build_ext skipping 'src/gluonnlp/data\fast_bert_tokenizer.c' Cython extension (up-to-date) building 'gluonnlp.data.fast_bert_tokenizer' extension creating build\temp.win-amd64-cpython-310 creating build\temp.win-amd64-cpython-310\Release creating build\temp.win-amd64-cpython-310\Release\src creating build\temp.win-amd64-cpython-310\Release\src\gluonnlp creating build\temp.win-amd64-cpython-310\Release\src\gluonnlp\data "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\HarshadPrakashBhandw\PycharmProjects\demo\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x 64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Fi les (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-I C:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86 )\Windows Kits\10\include\10.0.19041.0\cppwinrt" /Tcsrc/gluonnlp/data\fast_bert_tokenizer.c /Fobuild\temp.win-amd64-cpython-310\Release\src/gluonnlp/d ata\fast_bert_tokenizer.obj fast_bert_tokenizer.c src/gluonnlp/data\fast_bert_tokenizer.c(4005): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data src/gluonnlp/data\fast_bert_tokenizer.c(13614): warning C4013: '_PyGen_Send' undefined; assuming 
extern returning int src/gluonnlp/data\fast_bert_tokenizer.c(13614): warning C4047: '=': 'PyObject *' differs in levels of indirection from 'int' src/gluonnlp/data\fast_bert_tokenizer.c(13619): warning C4047: '=': 'PyObject *' differs in levels of indirection from 'int' src/gluonnlp/data\fast_bert_tokenizer.c(13702): warning C4047: '=': 'PyObject *' differs in levels of indirection from 'int' "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\bin\HostX86\x64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\HarshadPrakashBhandw\PycharmProjects\demo\venv\libs "/LIBPATH:C:\Program Files\WindowsApps\P ythonSoftwareFoundation.Python.3.10_3.10.1520.0_x64__qbz5n2kfra8p0\libs" "/LIBPATH:C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.1 isual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\lib\um\x64" "/LIBPATH:C:\Prog st_bert_tokenizer build\temp.win-amd64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.obj /OUT:build\lib.win-amd64-cpython-310\gluonnlp\data
    fast_bert_tokenizer.cp310-win_amd64.pyd /IMPLIB:build\temp.win-amd64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.cp310-win_amd64.lib
    md64-cpython-310\Release\src/gluonnlp/data\fast_bert_tokenizer.cp310-win_amd64.exp fast_bert_tokenizer.obj : error LNK2001: unresolved external symbol _PyGen_Send build\lib.win-amd64-cpython-310\gluonnlp\data\fast_bert_tokenizer.cp310-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.32.31326\bin\HostX86\x64\link.exe' fai led with exit code 1120 [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure

    × Encountered error while trying to install package.
    ╰─> gluonnlp

    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
    (venv) PS C:\Users\HarshadPrakashBhandw\PycharmProjects\demo> pip install wheel
    Requirement already satisfied: wheel in c:\users\harshadprakashbhandw\pycharmprojects\demo\venv\lib\site-packages (0.37.1)
    (venv) PS C:\Users\HarshadPrakashBhandw\PycharmProjects\demo> pip3 install wheel
    Requirement already satisfied: wheel in c:\users\harshadprakashbhandw\pycharmprojects\demo\venv\lib\site-packages (0.37.1)

    What have you tried to solve it?

    1. Installed the wheel package
    2. Updated the pip version
    bug 
    opened by hb0313 2
  • Add sorting of chunks to evaluation

    Description

    This change sorts the chunks before running evaluation in order to minimize padding and thereby improve performance.

    How the change works

    Every input feature has a unique qas_id, so it can be used as a sorting key. With sorting, the evaluation function proceeds as follows:

    1. sort input features by qas_id
    2. chunk data
    3. sort chunks by their length (to reduce padding to minimum)
    4. perform inference
    5. sort chunks and results by qas_id
    6. evaluate data

    Step 1 is performed so that the chunks and their inference results can easily be put back into the proper order in step 5 for the evaluation in step 6.
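The six steps above can be sketched in plain Python. This is a minimal illustration, not the code from the pull request: `make_chunks`, `padding_needed`, `run_sorted`, the `(qas_id, tokens)` tuple layout and the `infer` callback are all hypothetical names chosen for this example. Instead of re-sorting by qas_id in step 5, the sketch remembers each chunk's position after step 2 and writes results back to that position, which yields the same qas_id-sorted order.

```python
from typing import Callable, List, Sequence, Tuple

# (qas_id, token ids): a simplified, hypothetical stand-in for a real chunk feature.
Chunk = Tuple[str, List[int]]


def make_chunks(qas_id: str, tokens: Sequence[int], max_len: int, stride: int) -> List[Chunk]:
    """Split one feature into windows of at most max_len tokens, stepping by stride."""
    chunks, start = [], 0
    while True:
        chunks.append((qas_id, list(tokens[start:start + max_len])))
        if start + max_len >= len(tokens):
            return chunks
        start += stride


def padding_needed(lengths: Sequence[int], batch_size: int) -> int:
    """Count pad tokens when consecutive items are batched and padded to the batch maximum."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


def run_sorted(features: Sequence[Tuple[str, Sequence[int]]], max_len: int, stride: int,
               batch_size: int, infer: Callable[[List[List[int]]], list]) -> list:
    # Step 1: sort input features by qas_id.
    features = sorted(features, key=lambda f: f[0])
    # Step 2: chunk the data; list positions now follow qas_id order.
    chunks = [c for qid, toks in features for c in make_chunks(qid, toks, max_len, stride)]
    # Step 3: visit chunks in order of increasing length to reduce padding per batch.
    order = sorted(range(len(chunks)), key=lambda i: len(chunks[i][1]))
    results = [None] * len(chunks)
    # Step 4: run inference batch by batch.
    for i in range(0, len(order), batch_size):
        idx = order[i:i + batch_size]
        outputs = infer([chunks[j][1] for j in idx])
        # Step 5: write each result back to its chunk's qas_id-ordered position.
        for j, out in zip(idx, outputs):
            results[j] = out
    # The (qas_id, result) pairs are now aligned and ready for step 6 (evaluation).
    return [(chunks[j][0], results[j]) for j in range(len(chunks))]
```

`padding_needed` gives a rough feel for the effect: batching lengths `[5, 1, 5, 1]` in pairs needs 8 pad tokens as given and none once the lengths are sorted.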

    Performance

    Results for max_seq_length=128, doc_stride=32: no sort: image

    sorted: image

    Performance did not improve much because most of the chunks have the same length of 128, a consequence of the relatively small values of max_seq_length and doc_stride.

    Results for max_seq_length=512, doc_stride=128 (default values in run_squad.py script): no sort: image

    sorted: image

    As you can see, performance improved significantly (~20%) without any loss of accuracy.

    cc @dmlc/gluon-nlp-team

    opened by bartekkuncer 0
  • Add assert for doc_stride, max_seq_length and max_query_length

    Description

    This change adds an assert on the relation between doc_stride, max_seq_length and max_query_length (args.doc_stride <= args.max_seq_length - args.max_query_length - 3), because careless settings can cause data loss when chunking input features and ultimately lower accuracy significantly.

    Example

    Without the assert, when one sets max_seq_length to e.g. 128 and keeps the default value of 128 for doc_stride, the following happens for the input feature with qas_id == "572fe53104bcaa1900d76e6b" when running bash ~/gluon-nlp/scripts/question_answering/commands/run_squad2_uncased_bert_base.sh: image

    As you can see, we lose some of the context_tokens_ids (in the red rectangle): they are not included in any of the ChunkFeatures because doc_stride is too high relative to max_seq_length, and the user is not notified, not even with a simple warning. This can lead to a significant accuracy drop, as this kind of data loss happens for every input feature that does not fit entirely into a single chunk.

    This change introduces an assert that fires when data loss is possible and forces the user to set safe values for doc_stride, max_seq_length and max_query_length.
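The condition and the data loss it guards against can be illustrated with a small sketch. It is not the code from the pull request: `check_chunking_args` and `uncovered_tokens` are hypothetical helpers, and the value 64 used for max_query_length below is an assumption made for the example. The sketch treats max_seq_length - max_query_length - 3 as the worst-case number of context tokens per chunk; the constant 3 presumably reserves room for special tokens, which the description does not spell out.

```python
from typing import List


def check_chunking_args(doc_stride: int, max_seq_length: int, max_query_length: int) -> None:
    """Fail fast when the stride can skip context tokens (relation taken from the description)."""
    limit = max_seq_length - max_query_length - 3
    assert doc_stride <= limit, (
        f"doc_stride={doc_stride} exceeds max_seq_length - max_query_length - 3 = {limit}; "
        "some context tokens may not end up in any chunk"
    )


def uncovered_tokens(num_context_tokens: int, window: int, doc_stride: int) -> List[int]:
    """Return indices of context tokens that fall into no chunk.

    window is the number of context tokens a single chunk can hold (hypothetical parameter).
    """
    covered = set()
    start = 0
    while start < num_context_tokens:
        covered.update(range(start, min(start + window, num_context_tokens)))
        start += doc_stride
    return [i for i in range(num_context_tokens) if i not in covered]


# Worst-case context capacity for max_seq_length=128 and an assumed max_query_length=64.
window = 128 - 64 - 3
lost_with_large_stride = uncovered_tokens(200, window, doc_stride=128)
lost_with_small_stride = uncovered_tokens(200, window, doc_stride=32)
```

With these assumed numbers, a stride of 128 leaves gaps between consecutive windows, while a stride of 32 keeps every window overlapping the previous one, so `lost_with_small_stride` is empty.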

    Error message

    image

    Chunk from example above with doc_stride reduced to 32

    image

    As you can see, when the values of doc_stride, max_seq_length and max_query_length satisfy the inequality above, no data is lost during chunking and we avoid accuracy loss.

    cc @dmlc/gluon-nlp-team

    opened by bartekkuncer 0
  • Wrong ETA for max_seq_length != 512

    Description

    When max_seq_length is changed from 512 to another value, the ETA reported by the eval_validation function does not end at 0.

    Error Message

    image

    To Reproduce

    Change max_seq_length to e.g. 128 and run e.g. ~/gluon-nlp/scripts/question_answering/commands/run_squad2_uncased_bert_base.sh.

    bug 
    opened by bartekkuncer 0
Releases(v0.10.0)
  • v0.10.0(Aug 13, 2020)

    This release includes the following fixes:

    • [BUGFIX] remove wd from squad (#1223)
    • Fix deprecation warnings due to invalid escape sequences. (#1219)
    • Fix layer_norm_eps in BERTEncoder (#1214)
    • [BUGFIX] Fix vocab determinism in py35 (#1166) (#1167)

    As we prepare for the NumPy-based GluonNLP development, we are making the following adjustments to the branch usage:

    • master (old) -> v0.x: this branch will be used for maintenance of GluonNLP 0.x versions.
    • numpy -> master: the new master branch will be used for GluonNLP 1.0 onward with NumPy-compatible interface, based on the upcoming MXNet 2.0.
  • v0.9.2(Aug 13, 2020)

    This patch release includes the following bug fixes:

    • [BUGFIX] remove wd from squad (#1223)
    • Fix deprecation warnings due to invalid escape sequences. (#1219)
    • Fix layer_norm_eps in BERTEncoder (#1214)
  • v0.9.1(Mar 3, 2020)

    This release includes the bug fix for https://github.com/dmlc/gluon-nlp/pull/1158 (#1167). The bug affects determinism on Python 3.5: the order of special tokens in an instantiated vocabulary object is not deterministic. Users of Python 3.5 are strongly encouraged to upgrade to this version.

  • v0.9.0(Feb 10, 2020)

    News

    Models and Scripts in v0.9

    BERT

    INT8 quantization for BERT sentence classification and question answering (#1080). Also check out the blog post.

    Enhancements to the pretraining script (#1121, #1099) and faster tokenizer for BERT (#921, #1024) as well as multi-GPU support for SQuAD fine-tuning (#1079).

    Make BERT a HybridBlock (#877).

    XLNet

    The XLNet model introduced by Yang, Zhilin, et al. in "XLNet: Generalized Autoregressive Pretraining for Language Understanding". The model was converted from the original repository (#866).

    GluonNLP further provides scripts for finetuning XLNet on the GLUE (#995) and SQuAD (#1130) datasets that reproduce the authors' results. Check out the usage.

    DistilBERT

    The DistilBERT model introduced by Sanh, Victor, et al. in "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (#922).

    Transformer

    Add a separate Transformer inference script that simplifies inference and makes it convenient to analyze the performance of Transformer inference (#852).

    Korean BERT

    Pre-trained Korean BERT is available as part of GluonNLP (#1057)

    RoBERTa

    GluonNLP now provides scripts for finetuning RoBERTa (#931).

    GPT2

    GPT2 is now a HybridBlock, so the model can be exported for running from other MXNet language bindings (#1010).

    New Features

    • Add NamedTuple + Dict batchify (#959)
    • Add even_size option to split sampler (#1028)
    • Add length normalized metrics for machine translation tasks (#1095)
    • Add raw attention scores to the AttentionCell #951 (#964)
    • Add round_to feature to BERT & XLNet finetuning scripts (#1133)
    • Add stratified train_valid_split similar to sklearn.model_selection.train_test_split (#933)
    • Add SuperGlue dataset API (#858)
    • Add Multi Model Server deployment code example for developers (#1140)
    • Allow custom dropout, number of layers/units for BERT (#950)
    • Avoid race condition when downloading vocab (#1078)
    • Deprecate specifying Vocab padding, bos and eos_token as positional arguments (#945)
    • Fast multitensor adam optimizer (#1111)
    • Faster grad_global_norm for clipping (#1115)
    • Hybridizable AWDRNN/StandardRNN (#911)
    • Padding seq length to multiple of 8 in BERT model (#909)
    • Scripts for producing the figures that explain the bucketing strategy (#908)
    • Split up Seq2SeqDecoder in Seq2SeqDecoder and Seq2SeqOneStepDecoder (#976)
    • Switch CI to Python 3.5 and declare Python 3.5 support (#1009)
    • Try to use the new None feature in MXNet + Drop support for MXNet 1.5 (#967)
    • Use fused gelu operator (#1082)
    • Use softmax with length, and interleaved matmul for BERT (#1136)
    • Documentation of Model Conversion Scripts at https://gluon-nlp.mxnet.io/v0.9.x/model_zoo/conversion_tools/index.html (#922)
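Two items in the list above (the round_to option for the finetuning scripts and padding the sequence length to a multiple of 8) rest on the same idea of rounding a padded length up to a multiple. The following is a minimal sketch of that idea only; `round_up` and `pad_batch` are hypothetical helpers written for this example and are not GluonNLP's implementation or API.

```python
from typing import List, Sequence


def round_up(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple (e.g. 13 -> 16 for multiple=8)."""
    return ((length + multiple - 1) // multiple) * multiple


def pad_batch(seqs: Sequence[Sequence[int]], pad_id: int = 0, round_to: int = 8) -> List[List[int]]:
    """Pad every sequence to the batch maximum, rounded up to a multiple of round_to."""
    target = round_up(max(len(s) for s in seqs), round_to)
    return [list(s) + [pad_id] * (target - len(s)) for s in seqs]


# Both sequences end up with the same length, a multiple of 8.
batch = pad_batch([[1, 2, 3], [4]], pad_id=0, round_to=8)
```

Rounding lengths this way is commonly motivated by hardware efficiency; the release notes themselves do not state the reason.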

    Bug Fixes and code cleanup

    • Add version checker to all scripts (#930)
    • Add version checker to all tutorials (#934)
    • Add 'packaging' to requirements (#1143)
    • Adjust code owner (#923)
    • Avoid using dict for attention cell parameter creation (#1050)
    • Bump version in preparation for 0.9 release (#987)
    • Change SimVerb3500 URL to aclweb hosted version (#979)
    • Correct propagation of error codes in GluonNLP-py3-master-gpu-doc (#971)
    • Corrected np.random.randint upper limit in data.stream.py (#935)
    • Declare Python version requirement in setup.py (#927)
    • Declare more optional dependencies (#958)
    • Declare pytest seed marker in pytest.ini (#940)
    • Disable HybridBeamSearch (#1021)
    • Drop LAMB optimizer from GluonNLP in favor of MXNet version (#1116)
    • Drop unused compatibility helpers and fix doc (#928)
    • Fix #905 (#906)
    • Fix a SQuAD 2.0 evaluation bug (#907)
    • Fix argument analogy-max-vocab-size (#904)
    • Fix broken multi-head attention cell (#878)
    • Fix bugs in BERT export script (#944)
    • Fix chnsenticorp dataset download link (#873)
    • Fix file sampler for BERT (#977)
    • Fix index.rst and gpu flag in machine translation (#952)
    • Fix log in finetune_squad.py (#1001)
    • Fix parameter sharing of WeightDropParameter (#1083)
    • Fix scripts/question_answering/data_pipeline.py requiring optional package (#1013)
    • Fix the weight tie and weight sharing for AWDRNN (#1087)
    • Fix training command in Language Modeling index.rst (#1100)
    • Fix version check in train_gnmt.py and train_transformer.py (#1003)
    • Fix standard rnn weight sharing error (#1122)
    • Glue data preprocessing pipeline and bert & xlnet scripts (#1031)
    • Improve Vocab.__repr__ if reserved_tokens or unknown_token is None (#989)
    • Improve readability (#975)
    • Improve test robustness (#960)
    • Improve the readability of the training script. This fix replaces magic numbers with the name (#1006)
    • Make EmbeddingCenterContextBatchify returned dtype robust to empty sentences (#954)
    • Modify the log average loss (#1103)
    • Move ICSL script out of BERT folder (#1131)
    • Move NER script out of bert folder (#1090)
    • Move ParallelBigRNN into nlp.model namespace (#1118)
    • Move get_rnn_cell out of seq2seq_encoder_decoder (#1073)
    • Mxnet version check (#1063)
    • Refactor BERT with new data preprocessing (#1124)
    • Remove NLTKMosesTokenizer in favor of SacreMosesTokenizer (#942)
    • Remove extra dropout in BERT/RoBERTa (#1022)
    • Remove outdated comment (#943)
    • Remove padding warning (#916)
    • Replace unicode comma with ascii comma (#1056)
    • Split up inheritance structure of TransformerEncoder and BERTEncoder (#988)
    • Support int32 for sampled blocks (#1106)
    • Switch batch jobs to use G4dn.2x instance (#1041)
    • TransformerXL LayerNorm eps and XLNet pretrained model config (#1005)
    • Unify BERT horovod and kvstore pre-training script (#889)
    • Update README.rst (#884)
    • Update data_api.rst (#893)
    • Update embedding script (#1046)
    • Update fp16_utils.py (#1037)
    • Update index.rst (#876)
    • Update index.rst (#891)
    • Update navbar install (#983)
    • Update numba dependency in setup.py (#941)
    • Update outdated contributor list (#963)
    • Update prepare_clean_env.sh (#998)

    Documentation

    • Add comment to BERT notebook (#1026)
    • Add missing docs for nlp.utils (#936)
    • Add more documentation to XLNet scripts (#985)
    • Add section for "Clone the master branch for development" (#1075)
    • Add to toc tree depth to enable multiple level menu (#1108)
    • Cite source of pretrained parameters for bert_12_768_12 (#915)
    • Doc fix for vocab.subwords (#885)
    • Enhance vocab not found err msg (#917)
    • Fix command line examples for text classification (#874)
    • Fix math formula in docs (#920)
    • More detailed doc for CorpusBPTTBatchify (#888)
    • Release checklist (#890)
    • Remove non-existent arguments for BERT and Transformer (#946)
    • Remove py3 usage from the doc (#1077)
    • Update installation guide with selectors (#966)
    • Update mxnet version in installation doc (#1072)
    • Update pre-trained model link (#1117)
    • Update Installation instructions for source (#1146)

    Continuous Integration

    • Disable SimVerb test for 14 days (#953)
    • Disable horovod test temporarily (#1030)
    • Disable known bad mxnet nightly version (#997)
    • Enable integration tests on CPU (#957)
    • Enable testing warnings with pytest and update deprecated API invocations (#980)
    • Enable timestamp in CI (#925)
    • Enable type checks and inference with pytype (#1018)
    • Fix CI (#875)
    • Preserve stderr and stdout streams in doc CI stage for Cloudwatch (#882)
    • Remove skip_master feature (#1017)
    • Switch source of MXNet nightly build (#1058)
    • Test MXNet 1.6 pre-release as part of CI pipeline (#1023)
    • Update MXNet master version tested on CI (#1113)
    • Update numba (#1096)
    • Use Cuda 10.0 MXNet build (#991)
  • v0.8.3(Jan 14, 2020)

  • v0.8.2(Dec 21, 2019)

    This release covers a few fixes for the bugs reported:

    • Fixed argument passing in the bert/embedding.py script
    • Updated SimVerb3500 dataset URL to the aclweb hosted version
    • Removed multi-processing in the DataLoader in bert/pretraining_utils.py, which could cause a crash when horovod mpi is used for training
    • Before MXNet 1.6.0, Gluon Trainer assumed a deterministic parameter creation order for distributed training. The attention cell for BERT and Transformer had a non-deterministic parameter creation order in v0.8.1 and v0.8.0, which caused divergence during distributed training. This is now fixed.

    Note that starting with v0.8.2, the default branch of the gluon-nlp GitHub repository is the latest stable branch instead of the master branch under development.

  • v0.8.1(Aug 21, 2019)

    News

    Models and Scripts

    RoBERTa

    Transformer-XL

    Bug Fixes

    • Fixed hybridization for the BERT model (#877)
    • Change the variable model to bert_classifier (#828) thank you @LindenLiu
    • Revert "Add axis argument to squeeze()" (#857)
    • [BUGFIX] Remove incorrect vocab.padding_token requirement in CorpusBPTTBatchify
    • [BUGFIX] Fix Vocab with unknown_token remapped to != 0 via token_to_idx arg (#862)
    • [BUGFIX] Fix AMP in finetune_classifier.py (#848)
    • [BUGFIX] fix broken multi-head attention cell (#878) @ZiyueHuang
    • [FIX] fix chnsenticorp dataset download link (#873)
    • fix the usage of pad in bert (#850)

    Documentation

    • Clarify Bert does not require MXNet nightly anymore (#860)
    • [DOC] fix broken links (#833)
    • [DOC] Update BERT index.rst (#844)
    • [DOC] Add GluonCV/NLP archive (#823)
    • [DOC] add missing dataset document (#832)
    • [DOC] remove wrong tutorial header level (#826)
    • [DOC] Fix a typo in attention_cell's docstring (#841) thank you @shenfei
    • [DOC] Upgrade mxnet dependency to 1.5.0 and use Cuda 10.1 on CI (#842)
    • Remove Py2 icon from Readme. Add 3.7 (#856)
    • [DOC] Improve help message (#855) thank you @apeforest
    • Update index.rst (#853)
    • [DOC] Fix Machine Translation with Transformers example (#865)
    • update button style (#869)
    • [DOC] doc fix for vocab.subwords (#885) thank you @liusy182

    Continuous Integration

    • [CI] Support py3-master_gpu_doc CI run on arbitrary branches (#829)
    • Enforce AWS Batch jobName rules (#836)
    • dump linkcheck errors to comments (#827)
    • Enable Sphinx Autodoc typehints (#830)
    • [CI] Preserve stderr and stdout streams in doc CI stage for Cloudwatch
  • v0.8.0(Aug 8, 2019)

  • v0.7.1(Jul 17, 2019)

    News

    Models and Scripts

    BERT

    • a BERT BASE model pre-trained on a large corpus including OpenWebText Corpus, BooksCorpus, and English Wikipedia, which has comparable performance with the BERT large model from Google. The test score on GLUE Benchmark is reported below. Also improved usability of the BERT pre-training script: on-the-fly training data generation, sentencepiece, horovod, etc. (#799, #687, #806, #669, #665). Thank you @davisliang @vanyacohen @Skylion007

    | Source    | GluonNLP                                | google-research/bert        | google-research/bert        |
    |-----------|-----------------------------------------|-----------------------------|-----------------------------|
    | Model     | bert_12_768_12                          | bert_12_768_12              | bert_24_1024_16             |
    | Dataset   | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
    | SST-2     | 95.3                                    | 93.5                        | 94.9                        |
    | RTE       | 73.6                                    | 66.4                        | 70.1                        |
    | QQP       | 72.3                                    | 71.2                        | 72.1                        |
    | SQuAD 1.1 | 91.0/84.4                               | 88.5/80.8                   | 90.9/84.1                   |
    | STS-B     | 87.5                                    | 85.8                        | 86.5                        |
    | MNLI-m/mm | 85.3/84.9                               | 84.6/83.4                   | 86.7/85.9                   |

    GPT-2

    ESIM

    Data

    • Natural language understanding with datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682)
    • Sentiment analysis datasets: CR, MPQA (#663)
    • Intent classification and slot labeling datasets: ATIS and SNIPS (#816)

    New Features

    • [Feature] support save model / trainer states to S3 (#700)
    • [Feature] support load model/trainer states from s3 (#702)
    • [Feature] Add SentencePieceTokenizer for BERT (#669)
    • [FEATURE] Flexible vocabulary (#732)
    • [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
    • [Feature] add the List batchify function (#812) thanks @ThomasDelteil
    • [FEATURE] Add LAMB optimizer (#733)
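The LabelSmoothing entry in the list above refers to a standard regularization technique. As a rough illustration of the idea only (not GluonNLP's implementation), the sketch below builds a smoothed target distribution in plain Python. The function name `smooth_labels` and the choice to spread `epsilon` over the non-target classes are assumptions; another common variant spreads it over all classes.

```python
def smooth_labels(label, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for one class index.

    The true class keeps probability 1 - epsilon; the remaining
    epsilon is shared equally among the other classes.
    """
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == label else off_value
            for k in range(num_classes)]


# Example: class 1 out of 3 classes, with epsilon = 0.2
target = smooth_labels(1, 3, epsilon=0.2)
```

The resulting list still sums to one, so it can be used wherever a one-hot target would be used in a cross-entropy loss.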

    Bug Fixes

    • [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
    • [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
    • fix bert forward call parameter mismatch (#695) thanks @paperplanet
    • [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
    • Fix _get_rnn_cell (#648) thanks @MarisaKirisame
    • [BUGFIX] fix mrpc dataset idx (#708)
    • [bugfix] fix hybrid beam search sampler (#710)
    • [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
    • [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
    • [BUGFIX] Fix TokenEmbedding serialization with emb[emb.unknown_token] != 0 (#763)
    • [BUGFIX] Fix glue test result serialization (#773)
    • [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori

    API Changes

    • [API] Dropping support for wiki_multilingual and wiki_cn (#764)
    • [API] Remove get_bert_model from the public API list (#767)

    Enhancements

    • [FEATURE] offer load_w2v_binary method to load w2v binary file (#620)
    • [Script] Add inference function for BERT classification (#639) thanks @TaoLv
    • [SCRIPT] - Add static BERT base export script (for use with MXNet Module API) (#672)
    • [Enhancement] One script to export bert for classification/regression/QA (#705)
    • [enhancement] refactor bert finetuning script (#692)
    • [Enhancement] only use the best model for inference for bert classification (#716)
    • [Dataset] redistribute conll2004 (#719)
    • [Enhancement] add periodic evaluation for BERT pre-training (#720)
    • [FEATURE]add XNLI task (#717)
    • [refactor] Refactor BERT script folder (#744)
    • [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
    • [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
    • [Refactor] Refactor BERT SQuAD inference code (#758)
    • [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
    • [Dataset] Move MRPC dataset to API (#780)
    • [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
    • [DATASET] add LCQMC, ChnSentiCorp dataset (#774) thanks @paperplanet
    • [Improvement] Implement parser evaluation in Python (#772)
    • [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
    • [Enhancement] Mix precision support for BERT finetuning (#793)
    • Generate BERT training samples in compressed format (#651)
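Whole word masking, mentioned in the list above, masks every WordPiece of a selected word together rather than masking pieces independently. The sketch below illustrates the grouping step in plain Python. It is a simplified illustration under stated assumptions, not the GluonNLP script: the function name `whole_word_mask` is hypothetical, each word is selected independently with probability `mask_prob`, and special tokens and the usual keep/replace-with-random-token rules are not handled.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask WordPiece tokens so that all pieces of a chosen word are masked together."""
    # Group token positions into words: a piece that starts with "##"
    # continues the word that the previous piece belongs to.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    rng = random.Random(seed)
    masked = list(tokens)
    for word in words:
        # Decide once per word, then apply the decision to all its pieces.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


tokens = ["the", "phil", "##am", "##mon", "sang"]
masked = whole_word_mask(tokens, mask_prob=0.5, seed=1)
```

Whatever the random draw, the three pieces of "philammon" are either all masked or all left intact.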

    Minor Fixes

    • Various documentation fixes: #635, #637, #647, #656, #664, #667, #670, #676, #678, #681, #698, #704, #731, #745, #762, #771, #746, #778, #800, #810, #807, #814 thanks @rongruosong @crcrpar @mrchypark @xwind-h
    • Fix BERT multiprocessing data creation bug which causes unnecessary dispatching to single worker (#649)
    • [BUGFIX] Update BERT test and pre-train script (#661)
    • update url for ws353 (#701)
    • bump up version (#742)
    • [DOC] Update textCNN results (#737)
    • padding value warning (#747)
    • [TUTORIAL][DOC] Tutorial Updates (#802) thanks @faramarzmunshi

    Continuous Integration

    • skip failing tests in mxnet master (#685)
    • [CI] update nodes for CI (#686)
    • [CI] CI refactoring to speed up tests (#566)
    • [CI] fix codecov (#693)
    • use fixture for squad dataset tests (#699)
    • [CI] create zipped notebooks for link check (#712)
    • Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
    • [CI] set root in BERT tests (#738)
    • Fix conftest.py function_scope_seed (#748)
    • [CI] Fix links in contribute.rst (#752)
    • [CI] Update CI dependencies (#756)
    • Revert "[CI] Update CI dependencies (#756)" (#769)
    • [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
    • [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
    • [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
    • add license checker (#804)
    • enable timeout (#813)
    • Fix website build on master branch (#819)
  • v0.7.0(Jul 9, 2019)

    News

    Models and Scripts

    BERT

    • BERT model pre-trained on OpenWebText Corpus, BooksCorpus, and English Wikipedia. The test score on GLUE Benchmark is reported below. Also improved usability of the BERT pre-training script: on-the-fly training data generation, sentencepiece, horovod, etc. (#799, #687, #806, #669, #665). Thank you @davisliang

    | Source    | GluonNLP                                | google-research/bert        | google-research/bert        |
    |-----------|-----------------------------------------|-----------------------------|-----------------------------|
    | Model     | bert_12_768_12                          | bert_12_768_12              | bert_24_1024_16             |
    | Dataset   | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
    | SST-2     | 95.3                                    | 93.5                        | 94.9                        |
    | RTE       | 73.6                                    | 66.4                        | 70.1                        |
    | QQP       | 72.3                                    | 71.2                        | 72.1                        |
    | SQuAD 1.1 | 91.0/84.4                               | 88.5/80.8                   | 90.9/84.1                   |
    | STS-B     | 87.5                                    | 85.8                        | 86.5                        |
    | MNLI-m/mm | 85.3/84.9                               | 84.6/83.4                   | 86.7/85.9                   |

    GPT-2

    ESIM

    Data

    • Natural language understanding with datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682)
    • Sentiment analysis datasets: CR, MPQA (#663)
    • Intent classification and slot labeling datasets: ATIS and SNIPS (#816)

    New Features

    • [Feature] support save model / trainer states to S3 (#700)
    • [Feature] support load model/trainer states from s3 (#702)
    • [Feature] Add SentencePieceTokenizer for BERT (#669)
    • [FEATURE] Flexible vocabulary (#732)
    • [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
    • [Feature] add the List batchify function (#812) thanks @ThomasDelteil
    • [FEATURE] Add LAMB optimizer (#733)

    Bug Fixes

    • [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
    • [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
    • fix bert forward call parameter mismatch (#695) thanks @paperplanet
    • [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
    • Fix _get_rnn_cell (#648) thanks @MarisaKirisame
    • [BUGFIX] fix mrpc dataset idx (#708)
    • [bugfix] fix hybrid beam search sampler (#710)
    • [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
    • [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
    • [BUGFIX] Fix TokenEmbedding serialization with emb[emb.unknown_token] != 0 (#763)
    • [BUGFIX] Fix glue test result serialization (#773)
    • [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori

    API Changes

    • [API] Dropping support for wiki_multilingual and wiki_cn (#764)
    • [API] Remove get_bert_model from the public API list (#767)

    Enhancements

    • [FEATURE] offer load_w2v_binary method to load w2v binary file (#620)
    • [Script] Add inference function for BERT classification (#639) thanks @TaoLv
    • [SCRIPT] - Add static BERT base export script (for use with MXNet Module API) (#672)
    • [Enhancement] One script to export bert for classification/regression/QA (#705)
    • [enhancement] refactor bert finetuning script (#692)
    • [Enhancement] only use the best model for inference for bert classification (#716)
    • [Dataset] redistribute conll2004 (#719)
    • [Enhancement] add periodic evaluation for BERT pre-training (#720)
    • [FEATURE]add XNLI task (#717)
    • [refactor] Refactor BERT script folder (#744)
    • [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
    • [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
    • [Refactor] Refactor BERT SQuAD inference code (#758)
    • [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
    • [Dataset] Move MRPC dataset to API (#780)
    • [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
    • [DATASET] add LCQMC, ChnSentiCorp dataset (#774) thanks @paperplanet
    • [Improvement] Implement parser evaluation in Python (#772)
    • [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
    • [Enhancement] Mix precision support for BERT finetuning (#793)
    • Generate BERT training samples in compressed format (#651)

    Minor Fixes

    • Various documentation fixes: #635, #637, #647, #656, #664, #667, #670, #676, #678, #681, #698, #704, #731, #745, #762, #771, #746, #778, #800, #810, #807, #814 thanks @rongruosong @crcrpar @mrchypark @xwind-h
    • Fix BERT multiprocessing data creation bug which causes unnecessary dispatching to single worker (#649)
    • [BUGFIX] Update BERT test and pre-train script (#661)
    • update url for ws353 (#701)
    • bump up version (#742)
    • [DOC] Update textCNN results (#737)
    • padding value warning (#747)
    • [TUTORIAL][DOC] Tutorial Updates (#802) thanks @faramarzmunshi

    Continuous Integration

    • skip failing tests in mxnet master (#685)
    • [CI] update nodes for CI (#686)
    • [CI] CI refactoring to speed up tests (#566)
    • [CI] fix codecov (#693)
    • use fixture for squad dataset tests (#699)
    • [CI] create zipped notebooks for link check (#712)
    • Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
    • [CI] set root in BERT tests (#738)
    • Fix conftest.py function_scope_seed (#748)
    • [CI] Fix links in contribute.rst (#752)
    • [CI] Update CI dependencies (#756)
    • Revert "[CI] Update CI dependencies (#756)" (#769)
    • [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
    • [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
    • [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
    • add license checker (#804)
    • enable timeout (#813)
    • Fix website build on master branch (#819)
  • v0.6.0(Mar 18, 2019)

    News

    • A tutorial proposal for GluonNLP has been accepted at EMNLP 2019 (Hong Kong) and KDD 2019 (Anchorage).

    Models and Scripts

    • BERT pre-training on BooksCorpus and English Wikipedia with mixed precision and gradient accumulation on GPUs. Fine-tuning from the produced checkpoint gave the following results on the validation sets (#482, #505, #489). Thank you @haven-jeon

      | Dataset | MRPC   | SQuAD 1.1   | SST-2 | MNLI-mm |
      |:-------:|:------:|:-----------:|:-----:|:-------:|
      | Score   | 87.99% | 80.99/88.60 | 93%   | 83.6%   |
    • BERT fine-tuning on various sentence classification datasets with checkpoints converted from the official repository (#600, #571, #481). Thank you @kenjewu @haven-jeon

      | Dataset | MRPC  | RTE   | SST-2 | MNLI-m/mm      |
      |:-------:|:-----:|:-----:|:-----:|:--------------:|
      | Score   | 88.7% | 70.8% | 93%   | 84.55%, 84.66% |
    • BERT fine-tuning on question answering datasets with checkpoints converted from the official repository (#493). Thank you @fiercex

      | Dataset | SQuAD 1.1      | SQuAD 1.1       | SQuAD 2.0       |
      |:-------:|:--------------:|:---------------:|:---------------:|
      | Model   | bert_12_768_12 | bert_24_1024_16 | bert_24_1024_16 |
      | F1/EM   | 88.53/80.98    | 90.97/84.05     | 77.96/81.02     |
    • BERT model conversion scripts for checkpoints from the original TensorFlow repository, and more converted models (#456, #461, #449). Thank you @fiercex:

      • Multilingual Wikipedia (cased, BERT Base)
      • Chinese Wikipedia (cased, BERT Base)
      • Books Corpus & English Wikipedia (uncased, BERT Large)
    • Scripts and command line interface for BERT embedding of raw sentences (#587, #618). Thank you @imgarylai

    • Scripts for exporting BERT model for deployment (#624)

    New Features

    • [API] Add BERTVocab (#509) thanks @kenjewu
    • [API] Add Transforms for BERT (#526) thanks @kenjewu
    • [API] add data parallel for transformer (#387)
    • [FEATURE] Add squad2.0 Dataset (#551) thanks @fiercex
    • [FEATURE] Add NumpyDataset (#498)
    • [FEATURE] Add TruncNorm initializer for BERT (#548) thanks @Ishitori
    • [FEATURE] Add split sampler for distributed training (#494)
    • [FEATURE] Custom metric for masked accuracy (#503)
    • [FEATURE] Support custom sampler in SimpleDatasetStream (#507)
    • [FEATURE] clip gradient norm by parameter (#470)
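The "custom metric for masked accuracy" entry above describes an accuracy that counts only selected positions, as is needed when scoring masked language model predictions. The following is a minimal plain-Python sketch of that idea, not the GluonNLP metric class; the function name `masked_accuracy` and its list-based signature are assumptions made for illustration.

```python
def masked_accuracy(predictions, labels, mask):
    """Accuracy over positions where mask is truthy; other positions are ignored."""
    correct = 0
    total = 0
    for pred, label, keep in zip(predictions, labels, mask):
        if keep:
            total += 1
            correct += int(pred == label)
    # Return 0.0 when no position is selected, to avoid dividing by zero.
    return correct / total if total else 0.0


# Only the first two positions are scored; one of them is correct.
score = masked_accuracy([1, 2, 3, 4], [1, 0, 3, 0], [1, 1, 0, 0])
```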

    Bug Fixes

    • [BUGFIX] Fix Data Preprocessing for Translation Data (#568)
    • [FIX] fix parameter clip (#527)
    • [FIX] Fix divergence of the training of transformer (#543)
    • [FIX] Fix documentation and a bug in NCE Block (#558)
    • [FIX] Fix hashing single ngrams in NGramHashes (#450)
    • [FIX] Fix weight dying in BERTModel.decoder for BERT pre-training (#500)
    • [BUGFIX] Modifying the FastText Classification training for accurate mean pooling (#529) thanks @sravanbabuiitm

    API Changes

    • [API] BERT return intermediate encodings per layer (#606) thanks @Ishitori
    • [API] Better handle case when backoff is not possible in TokenEmbedding (#459)
    • [FIX] Rename wiki_cn/wiki_multilingual to wiki_cn_cased/wiki_multilingual_uncased (#594) thanks @kenjewu
    • [FIX] Update default value of BERTAdam epsilon to 1e-6 (#601)
    • [FIX] Fix BERT decoder API for masked language model prediction (#501)
    • [FIX] Remove bias correction term in BERTAdam (#499)

    Enhancements

    • [BUGFIX] use glove.840B.300d for NLI experiments (#567)
    • [API] Add debug option for parallel (#584)
    • [FEATURE] Skip dropout layer in Transformer when rate=0 (#597) thanks @TaoLv
    • [FEATURE] update sharded loader (#468)
    • [FIX] Update BERTLayerNorm Implementation (#485)
    • [TUTORIAL] Use FixedBucketSampler in BERT tutorial for better performance (#506) thanks @Ishitori
    • [API] Add Bert tokenizer to transforms.py (#464) thanks @fiercex
    • [FEATURE] Add data parallel to big rnn lm script (#564)

    Minor Fixes

    • Various documentation fixes: #484, #613, #614, #438, #448, #550, #563, #611, #605, #440, #554, #445, #556, #603, #483, #576, #610, #547, #458, #574, #510, #447, #465, #436, #622, #583 thanks @anuragsarkar97 @brettkoonce
    • [FIX] fix repeated unzipping in squad dataset (#553)
    • [FIX] web fixes (#453)
    • [FIX] Remove unused argument in fasttext_word_ngram.py (#486) thanks @kurtjanssensai
    • [FIX] Remove unused code (#528)
    • [FIX] Remove unused code in text_classification script (#442)
    • [MISC] Bump up version (#454)
    • [BUGFIX] fix pylint error (#549)
    • [FIX] Simplify the data preprocessing code for the sentiment analysis script (#462)
    • [FEATURE] BERT doc fixes and script usability enhancements (#444)
    • [FIX] Fix Py2 compatibility of machine_translation/dataprocessor.py (#541) thanks @ymjiang
    • [BUGFIX] Fix GluonNLP MXNet dependency (#555)
    • [BUGFIX] Fix Weight Drop and Test (#546)
    • [CI] Add version upper bound to doc.yml (#467)
    • [CI] speed up tests (#582)
    • [CI] upgrade mxnet to 1.4.0 (#617)
    • [FIX] Revert an unintended change (#525)
    • [BUGFIX] update paths and imports in bert scripts (#634)
  • v0.5.0(Nov 27, 2018)

    Highlights

    Models

    New Tutorials

    New Datasets

    • Sentiment Analysis
      • MR, a movie-review data set of 10,662 sentences labeled with respect to their overall sentiment polarity (positive or negative). (#391)
      • SST_1, an extension of the MR data set with fine-grained labels (#391)
      • SST_2, an extension of the MR data set with binary sentiment polarity labels (#391)
      • SUBJ, a data set of 10,000 sentences labeled with respect to their subjectivity status (subjective or objective). (#391)
      • TREC, a question classification data set. (#391)

    API Updates

    • Changed Vocab constructor from staticmethod to classmethod to handle inheritance (#386)
    • Added Transformer Encoder APIs (#409)
    • Added pre-trained ELMo model to model.get_model API (#227)
    • Added pre-trained BERT model to model.get_model API (#409)
    • Added unknown_lookup setter to TokenEmbedding (#429)
    • Added dtype support to EmbeddingCenterContextBatchify (#416)
    • Propagated exceptions from PrefetchingStream (#406)
    • Added sentencepiece tokenizer detokenizer (#380)
    • Added CSR format for variable length data in embedding training (#384)
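The last entry above refers to storing variable-length data in CSR (compressed sparse row) form. As a rough sketch of why that layout suits ragged data, the example below packs rows of different lengths into one flat list plus an offsets list. It is an illustration in plain Python with hypothetical helper names (`to_csr`, `csr_row`), and it is simplified: a full CSR matrix also stores a column index for each value.

```python
def to_csr(rows):
    """Pack variable-length rows into a flat `data` list plus `indptr` row offsets."""
    data = []
    indptr = [0]
    for row in rows:
        data.extend(row)
        # indptr[i + 1] marks where row i ends in `data`.
        indptr.append(len(data))
    return data, indptr


def csr_row(data, indptr, i):
    """Recover row i as a slice of the flat data."""
    return data[indptr[i]:indptr[i + 1]]


data, indptr = to_csr([[1, 2], [3], [4, 5, 6]])
second_row = csr_row(data, indptr, 1)
```

No padding is stored, so the memory cost grows with the total number of elements rather than with the number of rows times the longest row.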

    Fixes & Small Changes

    • Included output of nlp.embedding.list_sources() in API docs (#421)
    • Supported symlinks in examples and scripts (#403)
    • Fixed weight tying in GNMT and Transformer (#413)
    • Simplified transformer notebook (#400)
    • Fixed LazyTransformDataStream prefetching (#397)
    • Adopted src/gluonnlp folder layout (#390)
    • Fixed text8 archive file name for downloads from S3 (#388) Thanks @bkktimber!
    • Fixed ppl reporting for training on multi gpu in the language model notebook (#365). Thanks @ThomasDelteil!
    • Fixed a spelling mistake in QA script. (#379) Thanks @qyhfbqz!
  • v0.4.1(Oct 24, 2018)

    Highlights

    Models

    • Language Model
      • The Large Scale Word Language Model as introduced by Jozefowicz, Rafal, et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016) achieved test PPL 43.62 on GBW dataset (#179 #270 #277 #278 #286 #294)
      • The NT-ASGD based Language Model as introduced by Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018 achieved test PPL 65.62 on WikiText-2 dataset (#170)
    • Document Classification
      • The Classification Model as introduced by Joulin, Armand, et al. “Bag of tricks for efficient text classification” achieved validation accuracy 98 on Yelp review dataset (#258 #297)
    • Question Answering
      • The QANet as introduced by Yu, Adams Wei, et al. “QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension”. ICLR 2018 achieved F1 score 79.5 on SQuAD 1.1 dataset (#339) (coming soon to master branch)
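The test PPL figures quoted for the language models above are perplexities. As a reminder of the usual definition (the exponential of the average per-token negative log-likelihood), here is a minimal Python sketch. The function name `perplexity` is illustrative and not part of GluonNLP, and natural logarithms are assumed.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every reference token
# is as uncertain as a uniform choice among four options.
ppl = perplexity([math.log(0.25)] * 4)
```

Lower values are better: a perplexity of 43.62 roughly means the model is, on average, as uncertain as a uniform choice among about 44 tokens.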

    New Tutorials

    • Machine Translation
      • The Google NMT as introduced by Wu, Yonghui, et al. “Google's neural machine translation system: Bridging the gap between human and machine translation”. arXiv preprint arXiv:1609.08144 (2016) is introduced as part of the gluonnlp tutorial (#261)
      • The Transformer based Machine Translation by Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017 is introduced as part of the gluonnlp tutorial (#279)
    • Sentence Embedding

    New Datasets

    API updates

    • Added dataloader that allows multi-shard sampling (#237 #280 #285)
    • Simplified DataStream, added DatasetStream, refactored and extended PrefetchingStream (#235)
    • Unified BPTT batchify for dataset and stream (#246)
    • Added symbolic beam search (#233)
    • Added SequenceSampler (#272)
    • Refactored Transform APIs (#282)
    • Reorganized index of the repo and model zoo page (#357)

    Fixes & Small Changes

    • Fixed module name in batchify.py example (#239)
    • Improved imports structure (#248)
    • Added test for nmt scripts (#234)
    • Sped up batchify.Pad (#249)
    • Fixed LanguageModelDataset.bptt_batchify (#243)
    • Fixed weight drop and add tests (#268)
    • Fixed relative links that pypi doesn't handle (#293)
    • Updated notebook build logic (#309)
    • Added community link (#313)
    • Enabled running tests in parallel (#317)
    • Enabled word embedding scripts tests (#321)

  • v0.3.3(Jun 13, 2018)

    GluonNLP v0.3 contains many exciting new features. (depends on MXNet 1.3.0b20180725)

    Models

    • Language Models
      • The Cache Language Model as introduced by Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017 is introduced as part of gluonnlp.model.train (#110)
      • The Activation Regularizer and Temporal Activation Regularizer as introduced by Merity, S., et al. "Regularizing and optimizing LSTM language models". ICLR 2018 are introduced as part of gluonnlp.loss (#110)
    • Machine Translation
      • The Transformer Model as introduced by Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017 is introduced as part of the gluonnlp nmt scripts (#133)
    • Word embeddings
      • Trainable word embedding models are introduced as part of gluonnlp.model.train (#136)
        • Word2Vec by Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
        • FastText models by Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.
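The fastText models cited above represent a word through its character n-grams, which is what later makes embeddings for unknown words possible. The sketch below shows the n-gram extraction step as described in the cited paper: the word is wrapped in `<` and `>` boundary markers and all n-grams within a length range are collected. It is plain Python for illustration, not GluonNLP code; the name `char_ngrams` is hypothetical, and the paper's extra step of also keeping the whole wrapped word as its own unit is omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", min_n=3, max_n=3)
# trigrams == ['<wh', 'whe', 'her', 'ere', 're>']
```

A word vector can then be formed by summing the vectors of these n-grams, so a word never seen in training still shares n-grams with known words.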

    New Datasets

    • Machine Translation
    • Question Answering
      • Stanford Question Answering Dataset (SQuAD) Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383-2392). (#113)
    • Word Embeddings

    API changes

    • The download directory for datasets and other artifacts can now be specified via the MXNET_HOME environment variable. (#106)
    • TokenEmbedding class now exposes the Inverse Vocab as well (#123)
    • SortedSampler now supports use_average_length option (#135)
    • Add more strategies for bucket creation (#145)
    • Add tokenizer to bleu (#154)
    • Add Convolutional Encoder and Highway Layer (#129) (#186)
    • Add plain text of translation data. (#158)
    • Use Sherlock Holmes dataset instead of PTB for language model notebook (#174)
    • Add classes JiebaTokenizer and NLTKStanfordSegmenter for Chinese Word Segmentation (#164)
    • Allow toggling output and prompt in documentation website (#184)
    • Add shape assertion statements for better user experience to some attention cells (#201)
    • Add support for computation of word embeddings for unknown words in TokenEmbedding class (#185)
    • Distribute subword vectors for pretrained fastText embeddings enabling embeddings for unknown words (#185)
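The first entry in the list above says the download directory can be set through the MXNET_HOME environment variable. A minimal sketch of how such a lookup typically works is shown below. The helper name `data_home` is hypothetical, and the fallback of `~/.mxnet` is an assumption for illustration; consult the MXNet documentation for the actual default.

```python
import os


def data_home(environ=None):
    """Resolve the download directory from MXNET_HOME, with an assumed fallback."""
    environ = os.environ if environ is None else environ
    # Assumed default location when the variable is not set.
    default = os.path.join("~", ".mxnet")
    return os.path.expanduser(environ.get("MXNET_HOME", default))


# Passing a dict instead of os.environ makes the lookup easy to try out.
custom = data_home({"MXNET_HOME": "/data/mxnet-cache"})
```

In a shell, the equivalent is to export MXNET_HOME before starting Python so that datasets and other artifacts are stored under the chosen directory.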

    Fixes & Small Changes

    • Fixed bptt_batchify sometimes returning an invalid last batch (#120)
    • Fixed wrong PPL calculation in word language model script for multi-GPU (#150)
    • Fix split compound words and wmt16 results (#151)
    • Adapt pretrained word embeddings example notebook for nd.topk change in mxnet 1.3 (#153)
    • Fix beam search script (#175)
    • Fix small bugs in parser (#183)
    • TokenEmbedding: Skip lines with invalid bytes instead of crashing (#188)
    • Fix overly large memory use in TokenEmbedding serialization/deserialization if some tokens are overly large (e.g. 50k characters) (#187)
    • Remove duplicates in WordSim353 when combining segments (#192)

  • v0.2.0(May 4, 2018)

    Features

    GluonNLP provides its users with easy access to

    • State of the art models
    • Pre-trained word embeddings
    • Many public datasets for different tasks
    • Examples friendly to users that are new to the task
    • Reproducible training scripts

    Models

    Gluon NLP Toolkit supplies model definitions for common NLP tasks. These can be adapted to the user's requirements or taken as a blueprint for new developments. All of them are implemented using Gluon Blocks, allowing easy reuse as plug-and-play neural network building blocks.

    Data

    Gluon NLP Toolkit provides tools for building efficient data pipelines for NLP tasks by defining a Dataset class interface and utilities for transforming them. Several datasets are included by default and will be automatically downloaded when used.

    • Language modeling with WikiText
      • WikiText is a popular language modeling dataset from Salesforce. It is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
    • Sentiment Analysis with IMDB
      • IMDB is a popular dataset for binary sentiment classification. It provides a set of 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.
    • CoNLL datasets
      • These datasets include data for the shared tasks, such as part-of-speech (POS) tagging, chunking, named entity recognition (NER), semantic role labeling (SRL), etc.
      • We provide built-in support for CoNLL 2000 – 2002, 2004, as well as the Universal Dependencies dataset, which is used in the 2017 and 2018 competitions.
    • Word embedding evaluation datasets
      • There are a number of commonly used datasets for intrinsic evaluation for word embeddings. We provide commonly used datasets for the similarity and analogy evaluation tasks.

    Gluon NLP further ships with common data transformation functions, dataset samplers that determine how to iterate through datasets, and functions to generate data batches.
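To make the idea of a batch-generating function concrete, here is a minimal plain-Python sketch of padding variable-length sequences to a common length. It illustrates the general technique only and is not the toolkit's implementation; the name `pad_batch` and its return format are assumptions.

```python
def pad_batch(sequences, pad_val=0):
    """Pad token-id sequences to the longest length and return the original lengths."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


batch, lengths = pad_batch([[1, 2, 3], [4], [5, 6]])
# batch   == [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
# lengths == [3, 1, 2]
```

Returning the original lengths alongside the padded batch lets a model mask out the padding positions. Grouping sequences of similar length into the same batch, as bucketing samplers do, reduces the amount of padding.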

    A complete and up-to-date list of supplied datasets and utilities is available in the API documentation.

    Other features

    Examples and scripts

    The Gluon NLP toolkit also provides scripts that use its functionality for various tasks:

    • Word Embedding Evaluation
    • Beam Search Generator
    • Word language modeling
    • Sentiment Analysis through Fine-tuning, w/ Bucketing
    • Machine Translation
Owner
Distributed (Deep) Machine Learning Community
A Community of Awesome Machine Learning Projects