PyTorch impelementations of BERT-based Spelling Error Correction Models.

Overview

BertBasedCorrectionModels

基于BERT的文本纠错模型,使用PyTorch实现

数据准备

  1. http://nlp.ee.ncu.edu.tw/resource/csc.html下载SIGHAN数据集
  2. 解压上述数据集并将文件夹中所有 ''.sgml'' 文件复制至 datasets/csc/ 目录
  3. 复制 ''SIGHAN15_CSC_TestInput.txt'' 和 ''SIGHAN15_CSC_TestTruth.txt'' 至 datasets/csc/ 目录
  4. 下载 https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml 至 datasets/csc 目录
  5. 请确保以下文件在 datasets/csc 中
    train.sgml
    B1_training.sgml
    C1_training.sgml  
    SIGHAN15_CSC_A2_Training.sgml  
    SIGHAN15_CSC_B2_Training.sgml  
    SIGHAN15_CSC_TestInput.txt
    SIGHAN15_CSC_TestTruth.txt
    

环境准备

  1. 使用已有编码环境或通过 conda create -n python=3.7 创建一个新环境(推荐)
  2. 克隆本项目并进入项目根目录
  3. 安装所需依赖 pip install -r requirements.txt
  4. 如果出现报错 GLIBC 版本过低的问题(GLIBC 的版本更迭容易出事故,不推荐更新),openCC 改为安装较低版本(例如 1.1.0)
  5. 在当前终端将此目录加入环境变量 export PYTHONPATH=.

训练

运行以下命令以训练模型,首次运行会自动处理数据。

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

可选择不同配置文件以训练不同模型,目前支持以下配置文件:

  • train_bert4csc.yml
  • train_macbert4csc.yml
  • train_SoftMaskedBert.yml

如有其他需求,可根据需要自行调整配置文件中的参数。

实验结果

SoftMaskedBert

component sentence level acc p r f
Detection 0.5045 0.8252 0.8416 0.8333
Correction 0.8055 0.9395 0.8748 0.9060

Bert类

char level

MODEL p r f
BERT4CSC 0.9269 0.8651 0.8949
MACBERT4CSC 0.9380 0.8736 0.9047

sentence level

model acc p r f
BERT4CSC 0.7990 0.8482 0.7214 0.7797
MACBERT4CSC 0.8027 0.8525 0.7251 0.7836

推理

方法一,使用inference脚本:

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# 或给出line by line格式的文本地址
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

其中/ml/data/text.txt文本如下:

我今天很高心
你这个辣鸡模型只能做错别字纠正

方法二,直接调用

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)

方法三、导出bert权重,使用transformers或pycorrector调用

  1. 使用convert_to_pure_state_dict.py导出bert权重
  2. 后续步骤参考https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md

引用

如果你在研究中使用了本项目,请按如下格式引用:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

本源代码的授权协议为 Apache License 2.0,可免费用做商业用途。请在产品说明中附加本项目的链接和授权协议。本项目受版权法保护,侵权必究。

更新记录

20210618

  1. 修复数据处理的编码报错问题

20210518

  1. 将BERT4CSC检错任务改为使用FocalLoss
  2. 更新修改后的模型实验结果
  3. 降低数据处理时保留原文的概率

20210517

  1. 对BERT4CSC模型新增检错任务
  2. 新增基于LineByLine文件的inference

References

  1. Spelling Error Correction with Soft-Masked BERT
  2. http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  3. https://github.com/wdimmy/Automatic-Corpus-Generation
  4. transformers
  5. https://github.com/sunnyqiny/Confusionset-guided-Pointer-Networks-for-Chinese-Spelling-Check
  6. SoftMaskedBert-PyTorch
  7. Deep-Learning-Project-Template
  8. https://github.com/lonePatient/TorchBlocks
  9. https://github.com/shibing624/pycorrector
Owner
Heng Cai
NLPer
Heng Cai
MASS: Masked Sequence to Sequence Pre-training for Language Generation

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Microsoft 1.1k Dec 17, 2022
Training code of Spatial Time Memory Network. Semi-supervised video object segmentation.

Training-code-of-STM This repository fully reproduces Space-Time Memory Networks Performance on Davis17 val set&Weights backbone training stage traini

haochen wang 128 Dec 11, 2022
Write Python in Urdu - اردو میں کوڈ لکھیں

UrduPython Write simple Python in Urdu. How to Use Write Urdu code in سامپل۔پے The mappings are as following: "۔": ".", "،":

Saad A. Bazaz 26 Nov 27, 2022
The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021
Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

2 Jan 20, 2022
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.

Linear Transformers Are Secretly Fast Weight Programmers This repository contains the code accompanying the paper Linear Transformers Are Secretly Fas

Imanol Schlag 77 Dec 19, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
Repository for the paper: VoiceMe: Personalized voice generation in TTS

🗣 VoiceMe: Personalized voice generation in TTS Abstract Novel text-to-speech systems can generate entirely new voices that were not seen during trai

Pol van Rijn 80 Dec 29, 2022
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
Easy, fast, effective, and automatic g-code compression!

Getting to the meat of g-code. Easy, fast, effective, and automatic g-code compression! MeatPack nearly doubles the effective data rate of a standard

Scott Mudge 97 Nov 21, 2022
🤖 Basic Financial Chatbot with handoff ability built with Rasa

Financial Services Example Bot This is an example chatbot demonstrating how to build AI assistants for financial services and banking with Rasa. It in

Mohammad Javad Hossieni 4 Aug 10, 2022
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding This repository contains the official PyTorch implementation of th

Xiao Xu 26 Dec 14, 2022
⚖️ A Statutory Article Retrieval Dataset in French.

A Statutory Article Retrieval Dataset in French This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code

Maastricht Law & Tech Lab 19 Nov 17, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
Code for Editing Factual Knowledge in Language Models

KnowledgeEditor Code for Editing Factual Knowledge in Language Models (https://arxiv.org/abs/2104.08164). @inproceedings{decao2021editing, title={Ed

Nicola De Cao 86 Nov 28, 2022
Training open neural machine translation models

Train Opus-MT models This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Ma

Language Technology at the University of Helsinki 167 Jan 03, 2023
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Weaviate demo with the text2vec-openai module

Weaviate demo with the text2vec-openai module This repository contains an example of how to use the Weaviate text2vec-openai module. When using this d

SeMI Technologies 11 Nov 11, 2022
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022