Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

This repository contains the resources for our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", published in Findings of EMNLP 2020. You can read the camera-ready paper through ACL Anthology or the arXiv pre-print.

Revisiting Pre-trained Models for Chinese Natural Language Processing
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, Guoping Hu

For resources other than MacBERT, please visit the following repositories:

More resources by HFL: https://github.com/ymcui/HFL-Anthology

News

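2021/10/24 We propose the first pre-trained language model that specifically focuses on Chinese minority languages. See: https://github.com/ymcui/Chinese-Minority-PLM

2021/7/21 The book "Natural Language Processing: A Pre-trained Model Approach" (《自然语言处理:基于预训练模型的方法》), written by several scholars from HIT-SCIR, has been published. You are welcome to purchase it, and you can also take part in our book giveaway.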

[Nov 3, 2020] Pre-trained MacBERT models are available through direct Download or Quick Load. Use them as you would the original BERT, except that they cannot perform the original MLM.

[Sep 15, 2020] Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" is accepted to Findings of EMNLP as a long paper.

Guide

| Section | Description |
| --- | --- |
| Introduction | Introduction to MacBERT |
| Download | Download links for MacBERT |
| Quick Load | Learn how to quickly load our models through 🤗 Transformers |
| Results | Results on several Chinese NLP datasets |
| FAQ | Frequently Asked Questions |
| Citation | Citation |

Introduction

MacBERT is an improved BERT with a novel MLM as correction (Mac) pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.

Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, we propose to use similar words for masking. A similar word is obtained with the Synonyms toolkit (Wang and Hu, 2017), which is based on word2vec (Mikolov et al., 2013) similarity calculations. If an N-gram is selected for masking, we find a similar word for each of its words individually. In the rare case where no similar word exists, we fall back to random word replacement.

Here is an example of our pre-training task.

| | Example |
| --- | --- |
| Original Sentence | we use a language model to predict the probability of the next word. |
| MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
| Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
| N-gram masking | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
| MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
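The replacement step can be illustrated with a minimal sketch. It is not the authors' pre-training code: the `SIMILAR_WORDS` table, the `VOCAB` list, and the `mac_replace` function are hypothetical stand-ins (the paper obtains similar words from the Synonyms toolkit), and it works on whole words rather than WordPiece tokens.

```python
import random

# Hypothetical toy similarity table; the paper uses the Synonyms toolkit instead.
SIMILAR_WORDS = {
    "language": ["text"],
    "model": ["system"],
    "predict": ["calculate"],
    "probability": ["possibility"],
}
# Hypothetical vocabulary used only for the random-replacement fallback.
VOCAB = ["we", "use", "a", "to", "the", "of", "next", "word"]


def mac_replace(words, positions, rng):
    """Replace the words at `positions` with similar words.

    Returns the corrupted sequence and a dict mapping each replaced
    position to the original word the model should recover.
    """
    corrupted = list(words)
    targets = {}
    for pos in positions:
        original = words[pos]
        candidates = SIMILAR_WORDS.get(original)
        if candidates:
            corrupted[pos] = rng.choice(candidates)
        else:
            # Rare case: no similar word is known, so use a random word.
            corrupted[pos] = rng.choice(VOCAB)
        targets[pos] = original
    return corrupted, targets


sentence = "we use a language model to predict the probability of the next word".split()
# Positions of "language model", "predict" and "probability".
corrupted, targets = mac_replace(sentence, [3, 4, 6, 8], random.Random(0))
print(" ".join(corrupted))
```

The model is then trained to recover the words stored in `targets` from the corrupted input, so no artificial [MASK] token appears during pre-training.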

In addition to the new pre-training task, we also incorporate the following techniques.

  • Whole Word Masking (WWM)
  • N-gram masking
  • Sentence-Order Prediction (SOP)
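As a rough illustration of whole word masking, the sketch below groups WordPiece tokens into words and masks every piece of a selected word, reproducing the "Whole word masking" row of the example above. The function names are hypothetical, and the pieces `pre` and `##ba` are assumed, since the example only shows them masked. For Chinese text the word boundaries would come from a word segmenter rather than from `##` prefixes.

```python
def group_whole_words(tokens):
    """Group WordPiece token indices into words; '##' marks a continuation."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, word_indices, mask_token="[M]"):
    """Mask every piece of each selected word, not just a single piece."""
    groups = group_whole_words(tokens)
    masked = list(tokens)
    for w in word_indices:
        for i in groups[w]:
            masked[i] = mask_token
    return masked


# "pre" and "##ba" are assumed pieces; the example table shows them only as [M].
tokens = "we use a language model to pre ##di ##ct the pro ##ba ##bility of the next word .".split()
# Word indices 4, 6 and 8 correspond to "model", "predict" and "probability".
masked = whole_word_mask(tokens, [4, 6, 8])
print(" ".join(masked))
```

N-gram masking extends the same idea by selecting several consecutive words at once.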

Note that MacBERT can directly replace the original BERT, as there are no differences in the main neural architecture.

For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing

Download

We mainly provide pre-trained MacBERT models in TensorFlow 1.x.

  • MacBERT-large, Chinese: 24-layer, 1024-hidden, 16-heads, 324M parameters
  • MacBERT-base, Chinese: 12-layer, 768-hidden, 12-heads, 102M parameters

| Model | Google Drive | iFLYTEK Cloud | Size |
| --- | --- | --- | --- |
| MacBERT-large, Chinese | TensorFlow | TensorFlow (pw:3Yg3) | 1.2G |
| MacBERT-base, Chinese | TensorFlow | TensorFlow (pw:E2cP) | 383M |

PyTorch/TensorFlow2 Version

If you need these models in PyTorch/TensorFlow2, you have two options:

  1. Convert the TensorFlow checkpoint into PyTorch/TensorFlow2 using 🤗 Transformers.

  2. Download from https://huggingface.co/hfl

Steps for option 2: select one of the models on the page above → click "list all files in model" at the end of the model page → download the bin/json files from the pop-up window.

Quick Load

With Huggingface-Transformers, the models above can be loaded with the following code.

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")

**Notice: Please use BertTokenizer and BertModel for loading MacBERT models.**

The actual model and its MODEL_NAME are listed below.

| Original Model | MODEL_NAME |
| --- | --- |
| MacBERT-large | hfl/chinese-macbert-large |
| MacBERT-base | hfl/chinese-macbert-base |
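For a slightly fuller picture, here is a sketch of encoding one sentence with a loaded model. The `MODEL_NAMES` dictionary and the `encode_sentence` helper are hypothetical names introduced for illustration; the sketch assumes `transformers` and `torch` are installed and that the weights can be downloaded on first use.

```python
# Model names taken from the table above.
MODEL_NAMES = {
    "MacBERT-large": "hfl/chinese-macbert-large",
    "MacBERT-base": "hfl/chinese-macbert-base",
}


def encode_sentence(text, model_key="MacBERT-base"):
    """Return the last-layer hidden states for `text`.

    Imports are done lazily so this module loads without the heavy
    dependencies; the weights are downloaded on the first call.
    """
    import torch
    from transformers import BertModel, BertTokenizer

    name = MODEL_NAMES[model_key]
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertModel.from_pretrained(name)
    model.eval()  # disable dropout for deterministic inference
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (batch size, sequence length, hidden size)
    return outputs.last_hidden_state
```

Calling `encode_sentence("今天天气很好")` should return a tensor whose last dimension matches the hidden size listed in the Download section (768 for the base model).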

Results

We present the results of MacBERT on the following six tasks (please read our paper for other results).

To ensure the stability of the results, we run each experiment 10 times and report the maximum score, with the average score in brackets.
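The reporting format can be sketched as follows. The `summarize_runs` helper and the scores in `runs` are hypothetical and are not taken from our experiments.

```python
from statistics import mean


def summarize_runs(scores):
    """Format scores from repeated runs as 'max (average)' with one decimal."""
    return f"{max(scores):.1f} ({mean(scores):.1f})"


# Hypothetical accuracies from 10 fine-tuning runs with different random seeds.
runs = [79.1, 78.6, 79.3, 78.8, 78.9, 78.5, 79.0, 78.7, 78.4, 78.7]
print(summarize_runs(runs))
```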

CMRC 2018

The CMRC 2018 dataset was released by the Joint Laboratory of HIT and iFLYTEK Research. The model should answer questions based on a given passage, in the same format as SQuAD. Evaluation metrics: EM / F1

| Model | Development | Test | Challenge | #Params |
| --- | --- | --- | --- | --- |
| BERT-base | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) | 102M |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) | 102M |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) | 102M |
| RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) | 102M |
| ELECTRA-base | 68.4 (68.0) / 84.8 (84.6) | 73.1 (72.7) / 87.1 (86.9) | 22.6 (21.7) / 45.0 (43.8) | 102M |
| MacBERT-base | 68.5 (67.3) / 87.9 (87.1) | 73.2 (72.4) / 89.5 (89.2) | 30.2 (26.4) / 54.0 (52.2) | 102M |
| ELECTRA-large | 69.1 (68.2) / 85.2 (84.5) | 73.9 (72.8) / 87.1 (86.6) | 23.0 (21.6) / 44.2 (43.2) | 324M |
| RoBERTa-wwm-ext-large | 68.5 (67.6) / 88.4 (87.9) | 74.2 (72.4) / 90.6 (90.0) | 31.5 (30.1) / 60.1 (57.5) | 324M |
| MacBERT-large | 70.7 (68.6) / 88.9 (88.2) | 74.8 (73.2) / 90.7 (90.1) | 31.9 (29.6) / 60.2 (57.6) | 324M |
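For readers unfamiliar with the EM / F1 metrics, the following is a simplified character-level sketch in the style of SQuAD evaluation. It is an assumption-laden illustration, not the official CMRC 2018 evaluation script, which handles tokenization and answer normalization differently; the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """Return 1.0 if the predicted span equals the reference, else 0.0."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction, reference):
    """Character-overlap F1 between a predicted and a reference answer."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts characters shared by both answers.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


print(exact_match("北京", "北京"), char_f1("北京市", "北京"))
```

EM rewards only exact spans, while F1 gives partial credit when the predicted span overlaps the reference.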

DRCD

DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese. Evaluation metrics: EM / F1

| Model | Development | Test | #Params |
| --- | --- | --- | --- |
| BERT-base | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) | 102M |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) | 102M |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) | 102M |
| RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) | 102M |
| ELECTRA-base | 87.5 (87.0) / 92.5 (92.3) | 86.9 (86.6) / 91.8 (91.7) | 102M |
| MacBERT-base | 89.4 (89.2) / 94.3 (94.1) | 89.5 (88.7) / 93.8 (93.5) | 102M |
| ELECTRA-large | 88.8 (88.7) / 93.3 (93.2) | 88.8 (88.2) / 93.6 (93.2) | 324M |
| RoBERTa-wwm-ext-large | 89.6 (89.1) / 94.8 (94.4) | 89.6 (88.9) / 94.5 (94.1) | 324M |
| MacBERT-large | 91.2 (90.8) / 95.6 (95.3) | 91.7 (90.9) / 95.6 (95.3) | 324M |

XNLI

We use XNLI data for testing the NLI task. Evaluation metrics: Accuracy

| Model | Development | Test | #Params |
| --- | --- | --- | --- |
| BERT-base | 77.8 (77.4) | 77.8 (77.5) | 102M |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) | 102M |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) | 102M |
| RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) | 102M |
| ELECTRA-base | 77.9 (77.0) | 78.4 (77.8) | 102M |
| MacBERT-base | 80.3 (79.7) | 79.3 (78.8) | 102M |
| ELECTRA-large | 81.5 (80.8) | 81.0 (80.9) | 324M |
| RoBERTa-wwm-ext-large | 82.1 (81.3) | 81.2 (80.6) | 324M |
| MacBERT-large | 82.4 (81.8) | 81.3 (80.6) | 324M |

ChnSentiCorp

We use ChnSentiCorp data for testing sentiment analysis. Evaluation metrics: Accuracy

| Model | Development | Test | #Params |
| --- | --- | --- | --- |
| BERT-base | 94.7 (94.3) | 95.0 (94.7) | 102M |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) | 102M |
| BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) | 102M |
| RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) | 102M |
| ELECTRA-base | 93.8 (93.0) | 94.5 (93.5) | 102M |
| MacBERT-base | 95.2 (94.8) | 95.6 (94.9) | 102M |
| ELECTRA-large | 95.2 (94.6) | 95.3 (94.8) | 324M |
| RoBERTa-wwm-ext-large | 95.8 (94.9) | 95.8 (94.9) | 324M |
| MacBERT-large | 95.7 (95.0) | 95.9 (95.1) | 324M |

LCQMC

LCQMC is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metrics: Accuracy

| Model | Development | Test | #Params |
| --- | --- | --- | --- |
| BERT | 89.4 (88.4) | 86.9 (86.4) | 102M |
| BERT-wwm | 89.4 (89.2) | 87.0 (86.8) | 102M |
| BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) | 102M |
| RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) | 102M |
| ELECTRA-base | 90.2 (89.8) | 87.6 (87.3) | 102M |
| MacBERT-base | 89.5 (89.3) | 87.0 (86.5) | 102M |
| ELECTRA-large | 90.7 (90.4) | 87.3 (87.2) | 324M |
| RoBERTa-wwm-ext-large | 90.4 (90.0) | 87.0 (86.8) | 324M |
| MacBERT-large | 90.6 (90.3) | 87.6 (87.1) | 324M |

BQ Corpus

BQ Corpus is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metrics: Accuracy

| Model | Development | Test | #Params |
| --- | --- | --- | --- |
| BERT | 86.0 (85.5) | 84.8 (84.6) | 102M |
| BERT-wwm | 86.1 (85.6) | 85.2 (84.9) | 102M |
| BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) | 102M |
| RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) | 102M |
| ELECTRA-base | 84.8 (84.7) | 84.5 (84.0) | 102M |
| MacBERT-base | 86.0 (85.5) | 85.2 (84.9) | 102M |
| ELECTRA-large | 86.7 (86.2) | 85.1 (84.8) | 324M |
| RoBERTa-wwm-ext-large | 86.3 (85.7) | 85.8 (84.9) | 324M |
| MacBERT-large | 86.2 (85.7) | 85.6 (85.0) | 324M |

FAQ

Question 1: Do you have an English version of MacBERT?

A1: Sorry, we do not have an English version of pre-trained MacBERT.

Question 2: How to use MacBERT?

A2: Use it as if you are using original BERT in the fine-tuning stage (just replace the checkpoint and config files). Also, you can perform further pre-training on our checkpoint with MLM/NSP/SOP objectives.

Question 3: Could you provide pre-training code for MacBERT?

A3: Sorry, we cannot provide the source code at the moment. We may release it in the future, but there is no guarantee.

Question 4: How about releasing the pre-training data?

A4: We do not have the right to redistribute this data, and doing so could raise legal issues.

Question 5: Will you release pre-trained MacBERT on larger data?

A5: Currently, we have no plans for this.

Citation

If you find our resources or paper useful, please consider including the following citation in your paper.

@inproceedings{cui-etal-2020-revisiting,
    title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
    author = "Cui, Yiming  and
      Che, Wanxiang  and
      Liu, Ting  and
      Qin, Bing  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
    pages = "657--668",
}

Or:

@article{cui-etal-2021-pretrain,
  title={Pre-Training with Whole Word Masking for Chinese BERT},
  author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2021},
  url={https://ieeexplore.ieee.org/document/9599397},
  doi={10.1109/TASLP.2021.3124365},
 }

Acknowledgment

The first author would like to thank Google TensorFlow Research Cloud (TFRC) Program.

Issues

Before you submit an issue:

  • Please read the FAQ before you submit an issue.
  • Repetitive and irrelevant issues will be ignored and closed by stable-bot (the stale app on GitHub Marketplace). Thank you for your understanding and support.
  • We cannot accommodate EVERY request, so please bear in mind that there is no guarantee that your request will be met.
  • Always be polite when you submit an issue.