超轻量级bert的pytorch版本,大量中文注释,容易修改结构,持续更新

Overview

bert4pytorch

2021年8月27更新:

感谢大家的star,最近有小伙伴反映了一些小的bug,我也注意到了,奈何这个月工作上实在太忙,更新不及时,大约会在9月中旬集中更新一个只需要pip一下就完全可用的版本,然后会新添加一些关键注释。 再增加对抗训练的内容,更新一个完整的finetune案例。

背景

目前最流行的pytorch版本的bert框架,莫过于huggingface团队的Transformers项目,但是随着项目的越来越大,显得很重,对于初学者、有一定nlp基础的人来说,想看懂里面的代码逻辑,深入了解bert,有很大的难度。

另外,如果想修改Transformers的底层代码也是想当困难的,导致很难对模型进行魔改。

本项目把整个bert架构,浓缩在几个文件当中(主要修改自Transfomers开源项目),删除大量无关紧要的代码,新增了一些功能,比如:ema、warmup schedule,并且在核心部分,添加了大量中文注释,力求解答读者在使用过程中产生的一些疑惑。

此项目核心只有三个文件,modeling、tokenization、optimization。并且都在几百行内完成。结合大量的中文注释,分分钟透彻理解bert。

功能

现在已经实现

  • 加载bert、RoBERTa-wwm-ext的预训练权重进行fintune
  • 实现了带warmup的优化器
  • 实现了模型权重的指数滑动平均(ema)

未来将实现

  • albert、GPT、XLnet等网络架构
  • 实现对抗训练、conditional Layer Norm等功能(想法来自于苏神(苏剑林)的bert4keras开源项目,事实上,bert4pytorch就是受到了它的启发)
  • 添加大量的例子和中文注释,减轻学习难度

安装

pip install bert4pytorch==0.1.2

使用

  • 加载预训练模型
from bert4pytorch.modeling import BertModel, BertConfig
from bert4pytorch.tokenization import BertTokenizer
from bert4pytorch.optimization import AdamW, get_linear_schedule_with_warmup
import torch

model_path = "/model/pytorch_bert_pretrain_model"
config = BertConfig(model_path + "/config.json")

tokenizer = BertTokenizer(model_path + "/vocab.txt")
model = BertModel.from_pretrained(model_path, config)

input_ids, token_type_ids = tokenizer.encode("今天很开心")

input_ids = torch.tensor([input_ids])
token_type_ids = torch.tensor([token_type_ids])

model.eval()

outputs = model(input_ids, token_type_ids, output_all_encoded_layers=True)

## orther code
  • 带warmup的优化器实现
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
                if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5, correct_bias=False)

num_training_steps=train_batches * num_epoches
num_warmup_steps=num_training_steps * warmup_proportion
schedule = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

其他

最初整理这个项目,只是为了自己方便。这一段时间,经常逛苏剑林大佬的博客,里面的内容写得相当精辟,更加感叹的是, 苏神经常能闭门造车出一些还不错的trick,只能说,大佬牛逼。

所以本项目命名也雷同bert4keras,以感谢苏大佬无私的分享。

后来,慢慢萌生把学习中的小小成果开源出来,后期会渐渐补充例子,前期会借用苏神的bert4keras里面的例子,实现pytorch版本。如果有问题,欢迎讨论;如果本项目对您有用,请不吝star!

Owner
muqiu
muqiu
MEDIALpy: MEDIcal Abbreviations Lookup in Python

A small python package that allows the user to look up common medical abbreviations.

Aberystwyth Systems Biology 7 Nov 09, 2022
This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Ai-ChatBot-Python A chatbot is an intelligent system which can hold a conversation with a human using natural language in real time. Due to the rise o

1 Oct 30, 2021
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。

OlittleRer 运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。编程语言和工具包括Java、Python、Matlab、CPLEX、Gurobi、SCIP 等。 关注我们: 运筹小公众号 有问题可以直接在

运小筹 151 Dec 30, 2022
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023
This is an incredibly powerful calculator that is capable of many useful day-to-day functions.

Description 💻 This is an incredibly powerful calculator that is capable of many useful day-to-day functions. Such functions include solving basic ari

Jordan Leich 37 Nov 19, 2022
100+ Chinese Word Vectors 上百种预训练中文词向量

Chinese Word Vectors 中文词向量 中文 This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse),

embedding 10.4k Jan 09, 2023
Mirco Ravanelli 2.3k Dec 27, 2022
A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings MUSE is a Python library for multilingual word embeddings, whose goal is to provide the comm

Facebook Research 3k Jan 06, 2023
Code release for "COTR: Correspondence Transformer for Matching Across Images"

COTR: Correspondence Transformer for Matching Across Images This repository contains the inference code for COTR. We plan to release the training code

UBC Computer Vision Group 358 Dec 24, 2022
Pangu-Alpha for Transformers

Pangu-Alpha for Transformers Usage Download MindSpore FP32 weights for GPU from here to data/Pangu-alpha_2.6B.ckpt Activate MindSpore environment and

One 5 Oct 01, 2022
Dual languaged (rus+eng) tool for packing and unpacking archives of Silky Engine.

SilkyArcTool English Dual languaged (rus+eng) GUI tool for packing and unpacking archives of Silky Engine. It is not the same arc as used in Ai6WIN. I

Tester 5 Sep 15, 2022
🌐 Translation microservice powered by AI

Dot Translate 🌐 A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

Justin Terry 32 Nov 09, 2021
ProtFeat is protein feature extraction tool that utilizes POSSUM and iFeature.

Description: ProtFeat is designed to extract the protein features by employing POSSUM and iFeature python-based tools. ProtFeat includes a total of 39

GOKHAN OZSARI 5 Dec 16, 2022
Sploitus - Command line search tool for sploitus.com. Think searchsploit, but with more POCs

Sploitus Command line search tool for sploitus.com. Think searchsploit, but with

watchdog2000 5 Mar 07, 2022
customer care chatbot made with Rasa Open Source.

Customer Care Bot Customer care bot for ecomm company which can solve faq and chitchat with users, can contact directly to team. 🛠 Features Basic E-c

Dishant Gandhi 23 Oct 27, 2022
Long text token classification using LongFormer

Long text token classification using LongFormer

abhishek thakur 161 Aug 07, 2022