fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Overview

fastNLP


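fastNLP is a lightweight natural language processing (NLP) toolkit. Its goal is to make it quick to implement NLP tasks and to build complex models.

fastNLP has the following features:

  • A unified tabular data container that simplifies data preprocessing;
  • Built-in Loaders and Pipes for many datasets, removing the need to write preprocessing code;
  • Convenient NLP utilities, such as embedding loading (including ELMo and BERT) and caching of intermediate data;
  • Automatic download of some datasets and pretrained models;
  • A variety of neural network components and reproduced models (covering Chinese word segmentation, named entity recognition, syntactic parsing, text classification, text matching, coreference resolution, summarization, and other tasks);
  • A Trainer with many built-in Callback functions for experiment logging, exception catching, and more.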

Installation Guide

fastNLP depends on the following packages:

  • numpy>=1.14.2
  • torch>=1.0.0
  • tqdm>=4.28.1
  • nltk>=3.4.1
  • requests
  • spacy
  • prettytable>=0.7.2

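How to install torch may depend on your operating system and CUDA version; see the PyTorch website. Once the dependencies are installed, run the following commands in a terminal to complete the installation: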

pip install fastNLP
python -m spacy download en

fastNLP Tutorials

Chinese documentation and tutorials

Quick start

Detailed tutorials

Extended tutorials

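Built-in Components

Most neural networks used for NLP tasks can be viewed as a combination of word embeddings and two kinds of modules: encoders and decoders.

Taking text classification as an example, the figure below shows the model flow of a text classifier built from a BiLSTM with attention:

fastNLP's embeddings module has several built-in embeddings: static embeddings (GloVe, word2vec), contextual embeddings (ELMo, BERT), and character embeddings (CNN-based or LSTM-based CharEmbedding).

In addition, fastNLP's modules module provides many components for both kinds of modules, which help users quickly build the networks they need. The functions and common components of the two kinds of modules are as follows:

Type      Function                                                                      Examples
encoder   Encodes the input into vectors with representational power                   Embedding, RNN, CNN, Transformer, ...
decoder   Decodes vectors that carry some representation into the required output form  MLP, CRF, ...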
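
The embedding, encoder, and decoder split described above can be illustrated with a toy text classifier in plain Python. This is only a sketch of the decomposition and does not use fastNLP or any neural network library: the word vectors, the weights, and the function names (embed, encode, decode, classify) are all hypothetical, and mean pooling stands in for a real encoder such as a BiLSTM.

```python
import math

# Hypothetical 2-dimensional word vectors; a real model would load GloVe, BERT, etc.
EMBEDDINGS = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 0.5],
    "<unk>": [0.0, 0.0],
}
# Hypothetical decoder weights: one weight vector per class label.
WEIGHTS = {"positive": [2.0, 0.0], "negative": [-2.0, 0.0]}


def embed(tokens):
    """Embedding step: look up one vector per token (unknown tokens map to <unk>)."""
    return [EMBEDDINGS.get(tok, EMBEDDINGS["<unk>"]) for tok in tokens]


def encode(vectors):
    """Encoder step: mean-pool the token vectors into one sentence vector."""
    if not vectors:
        raise ValueError("need at least one token")
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def decode(sentence_vector):
    """Decoder step: linear scores followed by a softmax over the labels."""
    scores = {
        label: sum(w * x for w, x in zip(weights, sentence_vector))
        for label, weights in WEIGHTS.items()
    }
    total = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / total for label, s in scores.items()}


def classify(tokens):
    """Run the full pipeline and return the most probable label."""
    probs = decode(encode(embed(tokens)))
    return max(probs, key=probs.get)


print(classify(["good", "movie"]))
print(classify(["bad", "movie"]))
```

Swapping the mean-pooling function for a recurrent or Transformer encoder, or the linear decoder for a CRF, changes one stage without touching the others, which is the point of the decomposition.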

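Project Structure

The figure above shows fastNLP's general workflow. The project is organized as follows:

fastNLP              An open-source natural language processing library
fastNLP.core         Core functionality, including data processing components, the trainer, the tester, etc.
fastNLP.models       Complete neural network models
fastNLP.modules      Components for building neural network models
fastNLP.embeddings   Converts index sequences into vector sequences, including loading pretrained embeddings
fastNLP.io           Reading and writing, including data loading and preprocessing, model saving and loading, and automatic download of data and models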

In memory of @FengZiYjun. May his soul rest in peace. We will miss you very very much!

Comments
  • When will the complete star-transformer code be released? The experiments cannot be reproduced at all; the result on SST-5 is 6 points lower.

    To Reproduce: I used your star-transformer code and trained it with allennlp (GloVe 42B word vectors). The final result, shown in the figure, is 6 points below the result reported in the paper.

    Please explain, and please release the complete code, meaning the version that fully reproduces the reported results.

    [image]

    opened by michael-wzhu 10
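  • RuntimeError: CUDA error: device-side assert triggered

    Describe the bug: Loading a trained model with Predictor and running prediction raises the error shown in the first image. I have fixed this bug; see the project link below for details. Cause: after debugging, I found that the bug occurs when new data at prediction time contains characters that never appeared during training. The bert_embedding.py script reads the Vocab size from training time and initializes a vector of ones of that size for the mask prediction. That vector is therefore smaller than the actual size, where actual size = training-time Vocab size + number of new characters. See image 1 for the error and image 2 for the location of the bug and the fix.

    [image]

    [image]

    To Reproduce:
    1. Move test.txt, dev.txt, and train.txt into the data directory (a directory you create yourself).
    2. Run the fastNLP_trainer.py script.
    3. Run the fastNLP_predictor.py script.
    4. See the error.

    Project link: https://github.com/Chris-cbc/fastNLP_Bug_Report_And_Fix.git

    Expected behavior: [image] The image above shows the result after the bug is fixed.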

    Desktop

    • OS: windows10
    • Python Version: 3.6

    Additional context: Once the maintainers have confirmed this, please email me and @-mention my GitHub account so that I know how the bug was finally fixed.

    opened by Chris-cbc 9
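
The report above attributes the failure to token indices at prediction time that fall outside the range seen during training. One general way to keep indices in range is to map every unseen token to a reserved unknown-token index. The sketch below shows that idea in plain Python. It is not the fix from the linked repository and not fastNLP's Vocabulary API; build_vocab, to_indices, and the "<unk>" symbol are hypothetical names chosen for illustration.

```python
def build_vocab(train_tokens, unk="<unk>"):
    """Assign an index to every training token, reserving index 0 for unknown tokens."""
    vocab = {unk: 0}
    for tok in train_tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def to_indices(tokens, vocab, unk="<unk>"):
    """Map tokens to indices; tokens not seen in training fall back to the unknown index."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


vocab = build_vocab(list("abca"))         # training characters: a, b, c
indices = to_indices(list("abz"), vocab)  # "z" never appeared in training

# Every index stays inside [0, len(vocab)), so a table sized by the
# training vocabulary can be indexed safely.
assert all(0 <= i < len(vocab) for i in indices)
print(indices)
```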
  • Import error after installing fastNLP

    Installing fastNLP under Python 3.5 reports success, but import fastNLP fails with:

    File "D:\anaconda\lib\site-packages\fastNLP\core\instance.py", line 40
        f" type={(str(type(self.fields[field_name]))).split(s)[1]}" for field_name in self.fields) + "}"
        ^
    SyntaxError: invalid syntax

    Python 3.6 and Python 3.7 do not work either: installation completes, but the import raises the error.

    opened by lovelyvivi 8
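
The failing line in the report above is an f-string, and f-strings are only valid syntax from Python 3.6 onwards, so a SyntaxError at that position is expected on Python 3.5. For the failures reported on 3.6 and 3.7, one thing worth checking is which interpreter actually runs the import. The snippet below is a small diagnostic sketch; supports_fstrings is a hypothetical helper name, not part of fastNLP.

```python
import sys


def supports_fstrings(version_info=sys.version_info):
    """Return True if the given (major, minor, ...) version can parse f-strings."""
    return tuple(version_info[:2]) >= (3, 6)


# Show which interpreter is really in use and whether it can parse f-strings.
print(sys.executable)
print(sys.version.split()[0], supports_fstrings())
```

If the printed executable path points at an older installation than the one fastNLP was installed for, the SyntaxError comes from that interpreter rather than from the package.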
  • A new function for argparse

    We should provide an argument-parsing function so that we can support commands such as "python fastnlp.py --arg1 value1 --arg2 value2".

    If we do this, which arguments should we have?

    enhancement 
    opened by xuyige 8
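
The command-line form requested in the issue above can be sketched with the standard-library argparse module. The argument names --arg1 and --arg2 come from the issue text; their types and default values, and the build_parser function, are assumptions made only for this illustration.

```python
import argparse


def build_parser():
    """Build a parser for the command-line form quoted in the issue."""
    parser = argparse.ArgumentParser(prog="fastnlp.py")
    # Types and defaults below are placeholders, not fastNLP settings.
    parser.add_argument("--arg1", default="value1", help="a string option")
    parser.add_argument("--arg2", type=int, default=2, help="an integer option")
    return parser


# Passing an explicit list makes the example independent of sys.argv.
args = build_parser().parse_args(["--arg1", "hello", "--arg2", "5"])
defaults = build_parser().parse_args([])
print(args.arg1, args.arg2)
print(defaults.arg1, defaults.arg2)
```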
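  • RuntimeError: CUDA error: device-side assert triggered when running matching_esim.py

    Training on the CPU works without problems. At first I thought this was a PyTorch version issue, so I tried 1.2, 1.4, and 1.7. Both 1.2 and 1.4 fail: training reaches the second epoch and then raises RuntimeError: CUDA error: device-side assert triggered during testing. Version 1.7 reports that, because of the PyTorch version, out-of-vocabulary words must use the long type. The same problem appeared earlier in another script. I trained the bertMatching model successfully in April, but running it again now raises this error as well. Thanks to the project team.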

    opened by jwc19890114 7
  • Default value for train args.

    https://github.com/fastnlp/fastNLP/blob/8a87807274735046a48be8eb4b1ca10801875039/fastNLP/core/trainer.py#L42-L45

    Should we set some default value for train_args? Otherwise we will pass all these args every time, which is very redundant.

    opened by keezen 7
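
The suggestion in the issue above, giving the training arguments default values so callers only pass what they change, can be sketched as a merge of a defaults dictionary with user overrides. The argument names and values below are hypothetical and are not taken from fastNLP's Trainer.

```python
# Hypothetical defaults; the real train_args keys are not shown in the issue.
DEFAULT_TRAIN_ARGS = {"epochs": 10, "batch_size": 32, "validate": True}


def resolve_train_args(**overrides):
    """Return the defaults updated with the caller's overrides, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_TRAIN_ARGS)
    if unknown:
        raise TypeError("unknown training arguments: " + ", ".join(sorted(unknown)))
    return {**DEFAULT_TRAIN_ARGS, **overrides}


train_args = resolve_train_args(batch_size=64)
print(train_args)
```

Rejecting unknown names keeps a misspelled argument from being silently ignored.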
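  • Error in the example from the basic Trainer usage section

    While studying the Trainer section, I ran the code at the very beginning of that section, but the original example code raises an error: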

    TypeError: can't convert np.ndarray of type numpy.int32. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.
    

    I tried generating the tensors directly with torch in the data generation part:

    import torch
    from fastNLP import DataSet

    def generate_psedo_dataset(num_samples):
        # Random 0/1 features of shape (num_samples, 10)
        data = torch.randint(2, size=(num_samples, 10))
        print(data.shape)
        labels = []
        for n in range(num_samples):
            # The label is the parity of the row sum
            label = torch.sum(data[n]) % 2
            labels.append(label)
        labels = torch.stack(labels)
        dataset = DataSet({'x': data, 'label': labels})
        dataset.set_input('x')
        dataset.set_target('label')
        return dataset

    tr_dataset = generate_psedo_dataset(1000)
    dev_dataset = generate_psedo_dataset(100)
    

    But training then raises the following error:

    TypeError: issubclass() arg 1 must be a class
    

    Is my data generation wrong? How should the example code in the gitbook section be adjusted? torch: 1.2.0+cu92, fastNLP: 0.5.0

    opened by jwc19890114 6
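  • Updating BERT parameters when training a downstream application on top of BertEmbedding

    First of all, thank you for your code!

    I want to load a BertEmbedding model directly with from fastNLP.embeddings import BertEmbedding, build a vocab following your tutorial, and initialize self.embed = BertEmbedding(vocab, model_dir_or_name="en-base-uncased"). I then feed in the vocab index of each word in a sentence to obtain the corresponding embedding vectors, and build the downstream language application on top of them.

    In short, is it used in the same way as the nn.Embedding provided by PyTorch, and is there anything I should watch out for? I built a simple baseline this way, but it does not converge well.

    Any advice would be appreciated. Thank you!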

    opened by Reply1999 6
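  • A question in crf.py

    Hello, I noticed that line 263 of crf.py in the decoder is written as

    score = trans_score + emit_score[:seq_len - 1, :]

    Here trans_score has shape [seq_len-1, batch_size], and trans_score[0][0] is the transition score from character 0 to character 1 of sentence 0. emit_score[:seq_len - 1, :] also has shape [seq_len-1, batch_size], and emit_score[0, 0] is the emission score of character 0 of sentence 0. But shouldn't the transition score for character 0 of sentence 0 be the score from the start symbol to character 0? Why is this not written as

    score = trans_score + emit_score[1:, :]

    Thank you for your answer.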

    opened by tyistyler 5
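
Whether the pairing asked about above matters depends on what happens to the one emission term left out of the slice, which the quoted line does not show. The sketch below is plain Python with hypothetical function names and made-up scores, not fastNLP's crf.py. It assumes the leftover emission is added afterwards and leaves out start and end transition scores. Under those assumptions both pairings give the same path score, because the total is the sum of all emissions plus all transitions.

```python
def path_score_pair_current(emit, trans, tags):
    """Pair transition t -> t+1 with the emission at step t, then add the last emission."""
    n = len(tags)
    score = sum(trans[tags[t]][tags[t + 1]] + emit[t][tags[t]] for t in range(n - 1))
    return score + emit[n - 1][tags[n - 1]]


def path_score_pair_next(emit, trans, tags):
    """Pair transition t -> t+1 with the emission at step t+1, then add the first emission."""
    n = len(tags)
    score = sum(trans[tags[t]][tags[t + 1]] + emit[t + 1][tags[t + 1]] for t in range(n - 1))
    return score + emit[0][tags[0]]


# Made-up scores: 3 time steps, 2 tags.
emit = [[1.0, 2.0], [0.5, 1.5], [3.0, 0.25]]  # emit[t][tag]
trans = [[0.5, 1.0], [2.0, 4.0]]              # trans[from_tag][to_tag]
tags = [0, 1, 1]

score_a = path_score_pair_current(emit, trans, tags)
score_b = path_score_pair_next(emit, trans, tags)
print(score_a, score_b)
```

The per-step partial sums differ between the two versions, so the pairing would matter only if intermediate values were used on their own, for example with per-step masking.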
  • [fix] Fix FitlogCallback being unable to add dev_data when used with DistTrainer

    Description: FitlogCallback cannot add dev_data when used with DistTrainer. The main cause is that FitlogCallback checks self.trainer.dev_data, but DistTrainer has no dev_data attribute, so the call fails with an error.

    Main reason: Fix FitlogCallback being unable to add dev_data when used with DistTrainer.

    Checklist

    Please feel free to remove inapplicable items for your PR.

    • [x] The PR title starts with [$CATEGORY] (e.g. [bugfix] fix a bug, [new] add a feature, [test] modify tests, [rm] remove old code)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage. For changes to fastnlp/fastnlp/, test code must be provided in fastnlp/test/
    • [x] Code is well-documented. The API documentation is extracted from the comments
    • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change. If the change affects examples or tutorials, contact the core developers

    Changes: Fix FitlogCallback being unable to add dev_data when used with DistTrainer.

    • Added the kwargs, test_use_tqdm, dev_data, and metrics class variables to DistTrainer

    Mention: find someone to review your PR

    @people who have modified this file @core developers

    opened by ROGERDJQ 5
  • Improve the compatibility of "Trainer"

    Description: Delete DEFAULT_CHECK_BATCH_SIZE and use the input batch size instead.

    Main reason: DEFAULT_CHECK_BATCH_SIZE is unnecessary and may cause conflicts with the initialized model.

    Checklist

    Please feel free to remove inapplicable items for your PR.

    • [x] The PR title starts with [$CATEGORY] (e.g. [bugfix] fix a bug, [new] add a feature, [test] modify tests, [rm] remove old code)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage. For changes to fastnlp/fastnlp/, test code must be provided in fastnlp/test/
    • [x] Code is well-documented. The API documentation is extracted from the comments
    • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change. If the change affects examples or tutorials, contact the core developers

    Changes:

    • Removed DEFAULT_CHECK_BATCH_SIZE and replaced it with the preset batch_size

    Mention: find someone to review your PR

    @people who have modified this file @core developers

    opened by hendrydong 5
  • [bugfix] Update the rich version in requirements.txt and add a parameter to topk_saver

    Description: Update the rich version in requirements.txt and add a parameter to topk_saver.

    Main reason: Upgrade rich to fix an exception caused by an overly long remaining time. Add the use_timestamp_path parameter to topk_saver, which decides whether to skip creating the timestamp-named folder.

    Checklist

    Please feel free to remove inapplicable items for your PR.

    • [x] The PR title starts with [$CATEGORY] (e.g. [bugfix] fix a bug, [new] add a feature, [test] modify tests, [rm] remove old code)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage. For changes to fastnlp/fastnlp/, test code must be provided in fastnlp/test/
    • [x] Code is well-documented. The API documentation is extracted from the comments
    • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change. If the change affects examples or tutorials, contact the core developers

    Changes:

    • Upgraded rich from 11.2.0 to 12.6.0;
    • Added the use_timestamp_path parameter to topk_saver, which decides whether to skip creating the timestamp-named folder.
    opened by 00INDEX 0
  • Version 1.0.1: No module named 'fastNLP.embeddings.embedding'

    In fastNLP version 1.0.1, the following import

    from fastNLP.embeddings.embedding import TokenEmbedding
    

    raises:

    ModuleNotFoundError: No module named 'fastNLP.embeddings.embedding'
    
    opened by MrRace 1
  • Possible error in the documentation?

    https://github.com/fastnlp/fastNLP/blob/6f21084dafeeb937e137adcf33a0858dec921f8c/fastNLP/core/drivers/torch_driver/initialize_torch_driver.py#L36-L37 older ?

    opened by iamqiz 0
  • [Question][Suggestion] Why doesn't DataSet support List[Dict] data? Could it be supported the way huggingface's Dataset does?

    fastNLP does not support this:

    from fastNLP import DataSet
    ds=[{"name":"aa","age":21},{"name":"bb","age":22},{"name":"cc","age":19}]
    data_set = DataSet(ds)
    

    huggingface's Dataset supports it:

    from datasets import Dataset
    ds=[{"name":"aa","age":21},{"name":"bb","age":22},{"name":"cc","age":19}]
    dataset = Dataset.from_list(ds)
    

    I hope this can be supported.

    opened by iamqiz 1
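
Until such support exists, a list of dicts can be converted into a dict of lists before constructing the data container. The sketch below does this in plain Python; rows_to_columns is a hypothetical helper, not a fastNLP function. It assumes the DataSet constructor accepts a dict that maps field names to lists, as in the DataSet({'x': ..., 'label': ...}) call quoted in an earlier issue on this page.

```python
def rows_to_columns(rows):
    """Convert a list of dicts (one per record) into a dict of lists (one per field)."""
    if not rows:
        return {}
    fields = list(rows[0].keys())
    columns = {name: [] for name in fields}
    for i, row in enumerate(rows):
        # Every record must have exactly the same fields as the first one.
        if set(row.keys()) != set(fields):
            raise ValueError(f"row {i} has fields {sorted(row)}, expected {sorted(fields)}")
        for name in fields:
            columns[name].append(row[name])
    return columns


ds = [{"name": "aa", "age": 21}, {"name": "bb", "age": 22}, {"name": "cc", "age": 19}]
columns = rows_to_columns(ds)
print(columns)
# Under the assumption stated above, the result could then be passed on as DataSet(columns).
```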
Releases(v0.6.0)
Owner
fastNLP
A Chinese open-source natural language processing project started by the natural language processing (NLP) team at Fudan University
fastNLP