Pre-training BERT Masked Language Models (MLM)

This repository contains the code for pre-training a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task that was used to evaluate JuriBERT.

Our models can be found at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and downloaded upon request.

Instructions

To pre-train a new BERT model, you need the path to a dataset containing raw text. You can optionally specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
      --files /path/to/text \
      --model_path /path/to/save/model \
      --checkpoint /path/to/save/checkpoints \
      --epochs 30 \
      --hidden_layers 2 \
      --hidden_size 128 \
      --attention_heads 2 \
      --save_steps 10 \
      --save_limit 0 \
      --min_freq 0
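
For readers new to the masked language modeling objective, the following is a minimal sketch of BERT-style token masking as described in the original BERT paper: about 15% of tokens are selected, and of those, 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. It is not taken from pretrain.py, whose masking may differ. The names `mask_tokens`, `vocab`, and `mask_prob` are hypothetical and used only for illustration.

```python
import random


def mask_tokens(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=None):
    """Apply BERT-style masking to a list of string tokens.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    This is an illustrative sketch, not the repository's implementation.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position is selected: the model must predict `tok` here.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)         # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            # Not selected: the token is unchanged and ignored by the loss.
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "Paris est la capitale de la France .".split()
masked, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), seed=0)
print(masked)
print(labels)
```

Only the positions with a non-None label contribute to the training loss; all other positions are passed through untouched.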

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. You need to specify the columns containing the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
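
Before launching a fine-tuning run, it can help to confirm that the CSV file actually contains the two columns passed to --category and --text. The following is a small sketch using only the Python standard library. The helper `check_columns` is hypothetical and not part of this repository, and the column names are the placeholders from the command above.

```python
import csv
import io


def check_columns(csv_file, category_col, text_col):
    """Verify that an open CSV file has both required columns.

    Returns the number of data rows, or raises ValueError naming the
    missing columns. Hypothetical helper, not part of classification.py.
    """
    reader = csv.DictReader(csv_file)
    missing = {category_col, text_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return sum(1 for _ in reader)


# In-memory stand-in for a real file; the rows are made-up placeholders.
sample = io.StringIO(
    "category-column,text-column\n"
    "label_a,first example text\n"
    "label_b,second example text\n"
)
print(check_columns(sample, "category-column", "text-column"))

# With a real dataset, open the file the same way the csv module recommends:
# with open("/path/to/data.csv", newline="", encoding="utf-8") as f:
#     check_columns(f, "category-column", "text-column")
```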

You can use --help with either script to see all the available options.

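To test the masked language model, use the fill-mask pipeline from the transformers library:

from transformers import pipeline

# `tokenizer` must be the tokenizer the model was pre-trained with, loaded beforehand.
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

# Returns the most likely replacements for the <mask> token.
fill_mask("Paris est la capitale de la <mask>.")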

Owner

Stella Douka
