使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征,提升下游任务的表现。

Overview

Pretrain_Bert_with_MaskLM

Info

使用Mask LM预训练任务来预训练Bert模型。

基于pytorch框架,训练关于垂直领域语料的预训练语言模型,目的是提升下游任务的表现。

Pretraining Task

Mask Language Model,简称Mask LM,即基于Mask机制的预训练语言模型。

同时支持 原生的MaskLM任务和Whole Words Masking任务。默认使用Whole Words Masking

MaskLM

使用来自于Bert的mask机制,即对于每一个句子中的词(token):

  • 85%的概率,保留原词不变
  • 15%的概率,使用以下方式替换
    • 80%的概率,使用字符[MASK],替换当前token。
    • 10%的概率,使用词表随机抽取的token,替换当前token。
    • 10%的概率,保留原词不变。

Whole Words Masking

与MaskLM类似,但是在mask的步骤有些少不同。

在Bert类模型中,考虑到如果单独使用整个词作为词表的话,那词表就太大了。不利于模型对同类词的不同变种的特征学习,故采用了WordPiece的方式进行分词。

Whole Words Masking的方法在于,在进行mask操作时,对象变为分词前的整个词,而非子词。

Model

使用原生的Bert模型作为基准模型。

Datasets

项目里的数据集来自wikitext,分成两个文件训练集(train.txt)和测试集(test.txt)。

数据以行为单位存储。

若想要替换成自己的数据集,可以使用自己的数据集进行替换。(注意:如果是预训练中文模型,需要修改配置文件Config.py中的self.initial_pretrain_modelself.initial_pretrain_tokenizer,将值修改成 bert-base-chinese

自己的数据集不需要做mask机制处理,代码会处理。

Training Target

本项目目的在于基于现有的预训练模型参数,如google开源的bert-base-uncasedbert-base-chinese等,在垂直领域的数据语料上,再次进行预训练任务,由此提升bert的模型表征能力,换句话说,也就是提升下游任务的表现。

Environment

项目主要使用了Huggingface的datasetstransformers模块,支持CPU、单卡单机、单机多卡三种模式。

可通过以下命令安装依赖包

    pip install -r requirement.txt

主要包含的模块如下:

    python3.6
    torch==1.3.0
    tqdm==4.61.2
    transformers==4.6.1
    datasets==1.10.2
    numpy==1.19.5
    pandas==1.1.3

Get Start

单卡模式

直接运行以下命令

    python train.py

或修改Config.py文件中的变量self.cuda_visible_devices为单卡后,运行

    chmod 755 run.sh
    ./run.sh

多卡模式

如果你足够幸运,拥有了多张GPU卡,那么恭喜你,你可以进入起飞模式。 🚀 🚀

(1)使用torch的nn.parallel.DistributedDataParallel模块进行多卡训练。其中config.py文件中参数如下,默认可以不用修改。

  • self.cuda_visible_devices表示程序可见的GPU卡号,示例:1,2→可在GPU卡号为1和2上跑,亦可以改多张,如0,1,2,3
  • self.device在单卡模式,表示程序运行的卡号;在多卡模式下,表示master的主卡,默认会变成你指定卡号的第一张卡。若只有cpu,那么可修改为cpu
  • self.port表示多卡模式下,进程通信占用的端口号。(无需修改)
  • self.init_method表示多卡模式下进程的通讯地址。(无需修改)
  • self.world_size表示启动的进程数量(无需修改)。在torch==1.3.0版本下,只需指定一个进程。在1.9.0以上,需要与GPU数量相同。

(2)运行程序启动命令

    chmod 755 run.sh
    ./run.sh

Experiment

使用交叉熵(cross-entropy)作为损失函数,困惑度(perplexity)和Loss作为评价指标来进行训练,训练过程如下:

Reference

【Bert】https://arxiv.org/pdf/1810.04805.pdf

【transformers】https://github.com/huggingface/transformers

【datasets】https://huggingface.co/docs/datasets/quicktour.html

Owner
Desmond Ng
NLP Engineer
Desmond Ng
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Maix Speech AI lib, including ASR, chat, TTS etc.

Maix-Speech 中文 | English Brief Now only support Chinese, See 中文 Build Clone code by: git clone https://github.com/sipeed/Maix-Speech Compile x86x64 c

Sipeed 267 Dec 25, 2022
Official code for "Parser-Free Virtual Try-on via Distilling Appearance Flows", CVPR 2021

Parser-Free Virtual Try-on via Distilling Appearance Flows, CVPR 2021 Official code for CVPR 2021 paper 'Parser-Free Virtual Try-on via Distilling App

395 Jan 03, 2023
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
Partially offline multi-language translator built upon Huggingface transformers.

Translate Command-line interface to translation pipelines, powered by Huggingface transformers. This tool can download translation models, and then us

Richard Jarry 8 Oct 25, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
تولید اسم های رندوم فینگیلیش

karafs کرفس تولید اسم های رندوم فینگیلیش installation ➜ pip install karafs usage دو زبانه ➜ karafs -n 10 توت فرنگی بی ناموس toot farangi-ye bi_namoos

Vaheed NÆINI (9E) 36 Nov 24, 2022
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Ecommerce product title recognition package

revizor This package solves task of splitting product title string into components, like type, brand, model and article (or SKU or product code or you

Bureaucratic Labs 16 Mar 03, 2022
Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

MTFAA-Net Unofficial PyTorch implementation of Baidu's MTFAA-Net: "Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speec

Shimin Zhang 87 Dec 19, 2022
2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

繁體中文場景文字辨識 程式碼說明 組別:這就是我 成員:蔣明憲 唐碩謙 黃玥菱 林冠霆 蕭靖騰 目錄 環境套件 安裝方式 資料夾布局 前處理-製作偵測訓練註解檔 前處理-製作分類訓練樣本 part.py : 從 json 裁切出分類訓練樣本 Class.py : 將切出來的樣本按照文字分類到各資料夾

HuanyueTW 3 Jan 14, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

9 Jan 08, 2023
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

"Kötü söz sahibine aittir." -Anonim Nedir? sinkaf uygunsuz yorumların bulunmasını sağlayan bir python kütüphanesidir. Farkı nedir? Diğer algoritmalard

KaraGoz 4 Feb 18, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
null

CP-Cluster Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection, Instance Segme

Yichun Shen 41 Dec 08, 2022
Trex is a tool to match semantically similar functions based on transfer learning.

Trex is a tool to match semantically similar functions based on transfer learning.

62 Dec 28, 2022
用Resnet101+GPT搭建一个玩王者荣耀的AI

基于pytorch框架用resnet101加GPT搭建AI玩王者荣耀 本源码模型主要用了SamLynnEvans Transformer 的源码的解码部分。以及pytorch自带的预训练模型"resnet101-5d3b4d8f.pth"

冯泉荔 2.2k Jan 03, 2023
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.

Linear Transformers Are Secretly Fast Weight Programmers This repository contains the code accompanying the paper Linear Transformers Are Secretly Fas

Imanol Schlag 77 Dec 19, 2022