MPNet: Masked and Permuted Pre-training for Language Understanding

Related tags

Text Data & NLPMPNet
Overview

MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It solves the problems of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models now.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning for a variety of language understanding (GLUE, SQuAD, RACE, etc) tasks.

Installation

We implement MPNet and this pre-training toolkit based on the codebase of fairseq. The installation is as follow:

pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn

Pre-training MPNet

Our model is pre-trained with bert dictionary, you first need to pip install transformers to use bert tokenizer. We provide a script encode.py and a dictionary file dict.txt to tokenize your corpus. You can modify encode.py if you want to use other tokenizers (like roberta).

1) Preprocess data

We choose WikiText-103 as a demo. The running script is as follow:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

Then, we need to binarize data. The command of binarizing data is following:

fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

2) Pre-train MPNet

The below command is to train a MPNet model:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'

Notes: You can replace arch with mpnet_rel_base and add command --mask-whole-words --bpe bert to use relative position embedding and whole word mask.

Notes: You can specify --input-mode as mlm or plm to train masked language model or permutation language model.

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

from fairseq.models.masked_permutation_net import MPNet
mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)

Fine-tuning MPNet on down-streaming tasks

Acknowledgements

Our code is based on fairseq-0.8.0. Thanks for their contribution to the open-source commuity.

Reference

If you find this toolkit useful in your work, you can cite the corresponding papers listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}

Related Works

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

Babylon Health 1.2k Dec 17, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
Protein Language Model

ProteinLM We pretrain protein language model based on Megatron-LM framework, and then evaluate the pretrained model results on TAPE (Tasks Assessing P

THUDM 77 Dec 27, 2022
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
基于Transformer的单模型、多尺度的VAE模型

UniVAE 基于Transformer的单模型、多尺度的VAE模型 介绍 https://kexue.fm/archives/8475 依赖 需要大于0.10.6版本的bert4keras(当前还没有推到pypi上,可以直接从GitHub上clone最新版)。 引用 @misc{univae,

苏剑林(Jianlin Su) 49 Aug 24, 2022
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

18 Nov 28, 2022
Continuously update some NLP practice based on different tasks.

NLP_practice We will continuously update some NLP practice based on different tasks. prerequisites Software pytorch = 1.10 torchtext = 0.11.0 sklear

0 Jan 05, 2022
[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

New Benchmarks for Learning on Non-Homophilous Graphs Here are the codes and datasets accompanying the paper: New Benchmarks for Learning on Non-Homop

94 Dec 21, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
An Explainable Leaderboard for NLP

ExplainaBoard: An Explainable Leaderboard for NLP Introduction | Website | Download | Backend | Paper | Video | Bib Introduction ExplainaBoard is an i

NeuLab 319 Dec 20, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end Quick Start C

Eu-Bin KIM 94 Dec 08, 2022
基于“Seq2Seq+前缀树”的知识图谱问答

KgCLUE-bert4keras 基于“Seq2Seq+前缀树”的知识图谱问答 简介 博客:https://kexue.fm/archives/8802 环境 软件:bert4keras=0.10.8 硬件:目前的结果是用一张Titan RTX(24G)跑出来的。 运行 第一次运行的时候,会给知

苏剑林(Jianlin Su) 65 Dec 12, 2022
100+ Chinese Word Vectors 上百种预训练中文词向量

Chinese Word Vectors 中文词向量 中文 This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse),

embedding 10.4k Jan 09, 2023
Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5

NLP-Summarizer Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5 This project aimed to provide in

Samuel Sharkey 1 Feb 07, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

Li-am K. 0 Oct 25, 2022
FastFormers - highly efficient transformer models for NLU

FastFormers FastFormers provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Underst

Microsoft 678 Jan 05, 2023
A Python 3.6+ package to run .many files, where many programs written in many languages may exist in one file.

RunMany Intro | Installation | VSCode Extension | Usage | Syntax | Settings | About A tool to run many programs written in many languages from one fil

6 May 22, 2022