Super easy library for BERT based NLP models

Overview

Fast-Bert

License Apache 2.0 PyPI version Python 3.6, 3.7

New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder)

Supports LAMB optimizer for faster training. Please refer to https://arxiv.org/abs/1904.00962 for the paper on LAMB optimizer.

Supports BERT and XLNet for both Multi-Class and Multi-Label text classification.

Fast-Bert is a deep learning library that allows developers and data scientists to train and deploy BERT- and XLNet-based models for natural language processing tasks, beginning with text classification.

The work on FastBert is built on the solid foundations provided by the excellent Hugging Face BERT PyTorch library. Inspired by fast.ai, it strives to make cutting-edge deep learning technologies accessible to the vast community of machine learning practitioners.

With FastBert, you will be able to:

  1. Train (more precisely fine-tune) BERT, RoBERTa and XLNet text classification models on your custom dataset.

  2. Tune model hyper-parameters such as epochs, learning rate, batch size, optimiser schedule and more.

  3. Save and deploy trained model for inference (including on AWS Sagemaker).

Fast-Bert supports both multi-class and multi-label text classification for the model architectures listed below, and in due course it will support other NLU tasks such as Named Entity Recognition, Question Answering and custom corpus fine-tuning.

  1. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

  2. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.

  3. RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.

  4. DistilBERT (from HuggingFace), released together with the blogpost Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

Installation

This repo is tested on Python 3.6+.

With pip

Fast-Bert can be installed by pip as follows:

pip install fast-bert

From source

Clone the repository and run:

pip install [--editable] .

or

pip install git+https://github.com/kaushaltrivedi/fast-bert.git

You will also need to install NVIDIA Apex.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Usage

Text Classification

1. Create a DataBunch object

The databunch object takes training, validation and test CSV files and converts the data into the internal representation for BERT, RoBERTa, DistilBERT or XLNet. It also instantiates the correct data loaders based on the device profile, batch_size and max_seq_length.

from fast_bert.data_cls import BertDataBunch

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')

File format for train.csv and val.csv

index text label
0 Looking through the other comments, I'm amazed that there aren't any warnings to potential viewers of what they have to look forward to when renting this garbage. First off, I rented this thing with the understanding that it was a competently rendered Indiana Jones knock-off. neg
1 I've watched the first 17 episodes and this series is simply amazing! I haven't been this interested in an anime series since Neon Genesis Evangelion. This series is actually based off an h-game, which I'm not sure if it's been done before or not, I haven't played the game, but from what I've heard it follows it very well pos
2 This movie is nothing short of a dark, gritty masterpiece. I may be bias, as the Apartheid era is an area I've always felt for. pos

If the column names differ from the usual text and label, you will have to provide those names through the databunch's text_col and label_col parameters.

labels.csv will contain a list of all unique labels. In this case the file will contain:

pos
neg

For multi-label classification, labels.csv will contain all possible labels:

toxic
severe_toxic
obscene
threat
insult
identity_hate

The train.csv file will then contain one column for each label, with each column value being either 0 or 1. Don't forget to set multi_label=True for multi-label classification in BertDataBunch, as sketched after the label_col example below.

id text toxic severe_toxic obscene threat insult identity_hate
0 Why the edits made under my username Hardcore Metallica Fan were reverted? 0 0 0 0 0 0
0 I will mess you up 1 0 0 1 0 0

label_col will be a list of label column names. In this case it will be:

['toxic','severe_toxic','obscene','threat','insult','identity_hate']
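
For example, a sketch of the corresponding databunch call for this multi-label case (same data paths and tokenizer as the earlier example; only label_col and multi_label change):

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col=['toxic', 'severe_toxic', 'obscene',
                                     'threat', 'insult', 'identity_hate'],
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=True,
                          model_type='bert')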

Tokenizer

You can either create a tokenizer object and pass it to DataBunch or you can pass the model name as tokenizer and DataBunch will automatically download and instantiate an appropriate tokenizer object.

For example, to use the XLNet base cased model, set the tokenizer parameter to 'xlnet-base-cased'. DataBunch will automatically download and instantiate an XLNetTokenizer with the vocabulary for the xlnet-base-cased model.
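
A minimal sketch of the first option, passing an explicit tokenizer object instead of a model name (the use of transformers.AutoTokenizer here is an assumption; any compatible tokenizer instance should work, and DATA_PATH/LABEL_PATH are reused from the earlier example):

from transformers import AutoTokenizer

# Instantiate the tokenizer yourself...
xlnet_tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')

# ...and pass the object rather than the model name
databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer=xlnet_tokenizer,
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='xlnet')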

Model Type

Fast-Bert supports XLNet, RoBERTa and BERT based classification models. Set the model_type parameter to 'bert', 'roberta' or 'xlnet' in order to initiate the appropriate databunch object.

2. Create a Learner Object

BertLearner is the ‘learner’ object that holds everything together. It encapsulates the key logic for the lifecycle of the model, such as training, validation and inference.

The learner object takes the databunch created earlier as input, along with other parameters such as the location of one of the pretrained models, and the FP16 training, multi_gpu and multi_label options.

The learner class contains the logic for the training loop, validation loop, optimiser strategies and key metrics calculation. This helps developers focus on their custom use cases without worrying about these repetitive activities.

At the same time, the learner object is flexible enough to be customised, either through its parameters or by creating a subclass of BertLearner and redefining the relevant methods.
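
As a rough illustration of the subclassing option (the validate override below is only a sketch; the exact method names and signatures depend on your fast-bert version, so treat them as assumptions):

class MyBertLearner(BertLearner):
    # Hypothetical subclass: add a custom hook around validation.
    def validate(self, *args, **kwargs):
        results = super().validate(*args, **kwargs)
        # e.g. push the validation results to your own tracking system here
        self.logger.info("custom validation hook: {}".format(results))
        return results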

from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
import logging
import torch

logger = logging.getLogger()
device_cuda = torch.device("cuda")
metrics = [{'name': 'accuracy', 'function': accuracy}]

learner = BertLearner.from_pretrained_model(
						databunch,
						pretrained_path='bert-base-uncased',
						metrics=metrics,
						device=device_cuda,
						logger=logger,
						output_dir=OUTPUT_DIR,
						finetuned_wgts_path=None,
						warmup_steps=500,
						multi_gpu=True,
						is_fp16=True,
						multi_label=False,
						logging_steps=50)
parameter description
databunch Databunch object created earlier
pretrained_path Directory containing the pretrained model files, or the name of one of the pretrained models, e.g. bert-base-uncased, xlnet-large-cased, etc.
metrics List of metric functions that you want the model to calculate on the validation set, e.g. accuracy, fbeta, etc. (see the sketch after this table)
device torch.device of type cuda or cpu
logger logger object
output_dir Directory in which the model will save trained artefacts, tokenizer vocabulary and tensorboard files
finetuned_wgts_path location of the fine-tuned language model (experimental feature)
warmup_steps number of warmup steps for the scheduler
multi_gpu whether multiple GPUs are available, e.g. if running on an AWS p3.8xlarge instance
is_fp16 FP16 training
multi_label multi-label classification
logging_steps number of steps between each tensorboard metrics calculation. Set it to 0 to disable tensorboard logging. Setting this value too low will slow down training, as the model will be evaluated each time the metrics are logged
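
As an illustration of the metrics parameter, here is a hedged sketch of a custom metric added alongside the built-in accuracy. The (y_pred, y_true) signature mirrors the built-in metrics; depending on the fast-bert version the arguments may arrive as torch tensors rather than numpy arrays, so the sketch converts defensively:

import numpy as np

def error_rate(y_pred, y_true):
    # Move tensors to CPU and convert to numpy before comparing (assumption:
    # inputs are either numpy arrays or torch tensors).
    if hasattr(y_pred, "cpu"):
        y_pred = y_pred.cpu().numpy()
    if hasattr(y_true, "cpu"):
        y_true = y_true.cpu().numpy()
    return float(np.mean(np.argmax(y_pred, axis=1) != y_true))

metrics = [{'name': 'accuracy', 'function': accuracy},
           {'name': 'error_rate', 'function': error_rate}]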

3. Find the optimal learning rate

The learning rate is one of the most important hyperparameters for model training. We have incorporated the learning rate finder that was proposed by Leslie Smith and then built into the fastai library.

learner.lr_find(start_lr=1e-5,optimizer_type='lamb')

The code is heavily borrowed from David Silva's pytorch-lr-finder library.

Learning rate range test

4. Train the model

learner.fit(epochs=6,
			lr=6e-5,
			validate=True, 	# Evaluate the model after each epoch
			schedule_type="warmup_cosine",
			optimizer_type="lamb")

Fast-Bert now supports the LAMB optimizer. Due to the speed of training, we have set LAMB as the default optimizer. You can switch back to AdamW by setting optimizer_type to 'adamw', as sketched below.
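
For example, a sketch of the same fit call switched to AdamW:

learner.fit(epochs=6,
            lr=6e-5,
            validate=True,
            schedule_type="warmup_cosine",
            optimizer_type="adamw")  # use AdamW instead of the default LAMB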

5. Save trained model artifacts

learner.save_model()

Model artefacts will be persisted in the output_dir/'model_out' path provided to the learner object. The following files will be persisted:

File name description
pytorch_model.bin trained model weights
spiece.model sentence tokenizer vocabulary (for xlnet models)
vocab.txt wordpiece tokenizer vocabulary (for bert models)
special_tokens_map.json special tokens mappings
config.json model config
added_tokens.json list of new tokens

As the model artefacts are all stored in the same folder, you will be able to instantiate the learner object to run inference by pointing pretrained_path to this location.
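
A minimal sketch of that reload, assuming the databunch, metrics, logger and device objects from the earlier steps are still available:

MODEL_PATH = OUTPUT_DIR/'model_out'

learner = BertLearner.from_pretrained_model(
                        databunch,
                        pretrained_path=MODEL_PATH,  # folder written by learner.save_model()
                        metrics=metrics,
                        device=device_cuda,
                        logger=logger,
                        output_dir=OUTPUT_DIR,
                        finetuned_wgts_path=None,
                        warmup_steps=500,
                        multi_gpu=True,
                        is_fp16=True,
                        multi_label=False,
                        logging_steps=0)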

6. Model Inference

If you already have a Learner object with a trained model instantiated, just call the predict_batch method on the learner object with a list of texts:

texts = ['I really love the Netflix original movies',
		 'this movie is not worth watching']
predictions = learner.predict_batch(texts)

If you have a persisted trained model and just want to run inference on it, use the second approach, i.e. the predictor object.

from fast_bert.prediction import BertClassificationPredictor

MODEL_PATH = OUTPUT_DIR/'model_out'

predictor = BertClassificationPredictor(
				model_path=MODEL_PATH,
				label_path=LABEL_PATH, # location for labels.csv file
				multi_label=False,
				model_type='xlnet',
				do_lower_case=False,
				device=None) # set custom torch.device, defaults to cuda if available

# Single prediction
single_prediction = predictor.predict("just get me result for this text")

# Batch predictions
texts = [
	"this is the first text",
	"this is the second text"
	]

multiple_predictions = predictor.predict_batch(texts)

Language Model Fine-tuning

A useful approach to using BERT-based models on custom datasets is to first fine-tune the language model on the custom dataset, an approach also followed by fast.ai's ULMFiT. The idea is to start with a pre-trained model and further train it on the raw text of the custom dataset. We will use the masked LM task to fine-tune the language model.

This section describes how to use FastBert to fine-tune the language model.

1. Import the necessary libraries

The necessary objects are stored in the files with the '_lm' suffix.

# Language model Databunch
from fast_bert.data_lm import BertLMDataBunch
# Language model learner
from fast_bert.learner_lm import BertLMLearner

import torch
from pathlib import Path
from box import Box

2. Define parameters and setup datapaths

# Box is a nice wrapper to create an object from a json dict
args = Box({
    "seed": 42,
    "task_name": 'imdb_reviews_lm',
    "model_name": 'roberta-base',
    "model_type": 'roberta',
    "train_batch_size": 16,
    "learning_rate": 4e-5,
    "num_train_epochs": 20,
    "fp16": True,
    "fp16_opt_level": "O2",
    "warmup_steps": 1000,
    "logging_steps": 0,
    "max_seq_length": 512,
    "multi_gpu": True if torch.cuda.device_count() > 1 else False
})

DATA_PATH = Path('../lm_data/')
LOG_PATH = Path('../logs')
MODEL_PATH = Path('../lm_model_{}/'.format(args.model_type))

DATA_PATH.mkdir(exist_ok=True)
MODEL_PATH.mkdir(exist_ok=True)
LOG_PATH.mkdir(exist_ok=True)

3. Create DataBunch object

The BertLMDataBunch class contains a static method 'from_raw_corpus' that takes a list of raw texts and creates a DataBunch for the language model learner.

The method will first preprocess the text list by removing HTML tags, extra spaces and so on, and then create the files lm_train.txt and lm_val.txt. These files will be used for training and evaluating the language model fine-tuning task.

The next step is to featurize the texts. The text will be tokenized, numericalized and split into blocks of 512 tokens (including special tokens).

databunch_lm = BertLMDataBunch.from_raw_corpus(
					data_dir=DATA_PATH,
					text_list=texts,
					tokenizer=args.model_name,
					batch_size_per_gpu=args.train_batch_size,
					max_seq_length=args.max_seq_length,
                    multi_gpu=args.multi_gpu,
                    model_type=args.model_type,
                    logger=logger)

As this step can take some time depending on the size of your custom dataset's text, the featurized data will be cached in pickled files in the data_dir/lm_cache folder.

The next time around, instead of using the from_raw_corpus method, you can directly instantiate the DataBunch object as shown below:

databunch_lm = BertLMDataBunch(
						data_dir=DATA_PATH,
						tokenizer=args.model_name,
                        batch_size_per_gpu=args.train_batch_size,
                        max_seq_length=args.max_seq_length,
                        multi_gpu=args.multi_gpu,
                        model_type=args.model_type,
                        logger=logger)

4. Create the LM Learner object

BertLMLearner is the ‘learner’ object for the language model task. Like BertLearner, it encapsulates the key logic for the lifecycle of the model, such as training, validation and inference.

The learner object takes the databunch created earlier as input, along with other parameters such as the location of one of the pretrained models, and the FP16 training and multi_gpu options.

The learner class contains the logic for the training loop, validation loop and optimizer strategies. This helps developers focus on their custom use cases without worrying about these repetitive activities.

At the same time, the learner object is flexible enough to be customized, either through its parameters or by creating a subclass of BertLearner and redefining the relevant methods.

learner = BertLMLearner.from_pretrained_model(
							dataBunch=databunch_lm,
							pretrained_path=args.model_name,
							output_dir=MODEL_PATH,
							metrics=[],
							device=device,
							logger=logger,
							multi_gpu=args.multi_gpu,
							logging_steps=args.logging_steps,
							fp16_opt_level=args.fp16_opt_level)

5. Train the model

learner.fit(epochs=6,
			lr=6e-5,
			validate=True, 	# Evaluate the model after each epoch
			schedule_type="warmup_cosine",
			optimizer_type="lamb")

Fast-Bert now supports the LAMB optimizer. Due to the speed of training, we have set LAMB as the default optimizer. You can switch back to AdamW by setting optimizer_type to 'adamw'.

6. Save trained model artifacts

learner.save_model()

Model artefacts will be persisted in the output_dir/'model_out' path provided to the learner object. The following files will be persisted:

File name description
pytorch_model.bin trained model weights
spiece.model sentence tokenizer vocabulary (for xlnet models)
vocab.txt wordpiece tokenizer vocabulary (for bert models)
special_tokens_map.json special tokens mappings
config.json model config
added_tokens.json list of new tokens

The pytorch_model.bin file contains the fine-tuned weights; you can point the classification task learner object to this file through the finetuned_wgts_path parameter, as sketched below.
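
For example, a hedged sketch of passing the fine-tuned weights into the classification learner (this assumes the classification databunch, metrics, logger, device and OUTPUT_DIR objects from the Text Classification section, plus MODEL_PATH and args from the language model steps above; the databunch's model_type should match the fine-tuned model, 'roberta' in this example):

FINETUNED_PATH = MODEL_PATH/'model_out'/'pytorch_model.bin'

learner = BertLearner.from_pretrained_model(
                        databunch,                        # classification databunch, not databunch_lm
                        pretrained_path=args.model_name,  # e.g. 'roberta-base'
                        metrics=metrics,
                        device=device,
                        logger=logger,
                        output_dir=OUTPUT_DIR,
                        finetuned_wgts_path=FINETUNED_PATH,  # fine-tuned language model weights
                        warmup_steps=500,
                        multi_gpu=args.multi_gpu,
                        is_fp16=args.fp16,
                        multi_label=False,
                        logging_steps=50)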

Amazon Sagemaker Support

The purpose of this library is to let you train and deploy production grade models. As transformer models require expensive GPUs to train, I have added support for training and deploying models on AWS SageMaker.

The repository contains the docker image and code for building BERT based classification models in Amazon SageMaker.

Please refer to my blog post Train and Deploy the Mighty BERT based NLP models using FastBert and Amazon SageMaker, which provides a detailed explanation of using SageMaker with FastBert.

Citation

Please include a mention of this library and the HuggingFace pytorch-transformers library, along with a link to the present repository, if you use this work in a published or open-source project.

Also include my blogs on this topic:

Comments
  • learner.save_model gives KeyError while saving tokenizer/vocab file

    learner.save_model gives KeyError while saving tokenizer/vocab file

    I'm trying to run the multi-label classification model, and while saving the model it gives me an error on the vocab file. learner.save_model() gives the below error: image

    Is this because I have not specified some path, or because I'm not using a pretrained model path from local as in the sample notebook?

    My learner config is as below: image

    DataBunchConfig as below: image

    Any help appreciated. Thanks!

    opened by mohammedayub44 17
  • notebook not working out of the box

    notebook not working out of the box

    I'm trying to just get the included toxicity notebook to work from a fresh clone and am having some issues:

    1. Out of the box, the data & labels directory are pointing to the wrong place and the DataBunch is using filenames that are not part of the repo. These are fixed easily enough.

    2. It would help if there was a pointer to where to get the PyTorch pretrained model uncased_L-12_H-768_A-12. There is a Google download which will not work with the from_pretrained_model cell:

    FileNotFoundError: [Errno 2] No such file or directory: '../../bert/bert-models/uncased_L-12_H-768_A-12/pytorch_model.bin'
    

    I have been able to get past this step by using 'bert-base-uncased' instead of BERT_PRETRAINED_PATH as the model spec in the tokenizer and from_pretrained_model steps.

    3. Once I get everything loaded, RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.43 GiB total capacity; 6.91 GiB already allocated; 10.94 MiB free; 24.36 MiB cached)

    This is a standard 8G GPU compute engine instance on GCP. Advice on how to not run out of memory would help the tutorial a lot.

    opened by mschmill 17
  • Argmax unexpected key and Cant convert Cuda tensor to Numpy error

    Argmax unexpected key and Cant convert Cuda tensor to Numpy error

    Hi I am facing the issue below. I have installed fast-bert using pip and just copied the code from the readme. Any suggestions on how to fix?

    model/tensorboard
    Traceback (most recent call last):
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
        return bound(*args, **kwds)
    TypeError: argmax() got an unexpected keyword argument 'axis'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "train_bert.py", line 67, in <module>
        main()
      File "train_bert.py", line 62, in main
        optimizer_type="lamb")
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/fast_bert/learner_cls.py", line 406, in fit
        results = self.validate()
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/fast_bert/learner_cls.py", line 524, in validate
        all_logits, all_labels
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/fast_bert/metrics.py", line 15, in accuracy
        outputs = np.argmax(y_pred, axis=1)
      File "<__array_function__ internals>", line 6, in argmax
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1153, in argmax
        return _wrapfunc(a, 'argmax', axis=axis, out=out)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 70, in _wrapfunc
        return _wrapit(obj, method, *args, **kwds)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
        result = getattr(asarray(obj), method)(*args, **kwds)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
        return array(a, dtype, copy=False, order=order)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/torch/tensor.py", line 433, in __array__
        return self.numpy()
    TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
    
    
    opened by rishabhjoshi 8
  • BertDataBunch' object has no attribute 'model_type'

    BertDataBunch' object has no attribute 'model_type'

    I have been following the tutorials concerning Fast-Bert: https://pypi.org/project/fast-bert/ https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/92668

    My goal is to do binary text classification. Therefore, my label.csv has only two labels and I set multi_label to False.

    When executing BertLearner.from_pretrained_model, I am receiving the following error:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-240-ef4cead1d6f0> in <module>
         16                                             loss_scale = args['loss_scale'],
         17                                             multi_gpu = True,
    ---> 18                                             multi_label = False)
    
    ~/.local/lib/python3.6/site-packages/fast_bert/learner_cls.py in from_pretrained_model(dataBunch, pretrained_path, output_dir, metrics, device, logger, finetuned_wgts_path, multi_gpu, is_fp16, loss_scale, warmup_steps, fp16_opt_level, grad_accumulation_steps, multi_label, max_grad_norm, adam_epsilon, logging_steps, freeze_transformer_layers)
        131         model_state_dict = None
        132 
    --> 133         model_type = dataBunch.model_type
        134 
        135         if torch.cuda.is_available():
    
    AttributeError: 'BertDataBunch' object has no attribute 'model_type'
    

    What I have tried so far is including model_type = 'bert' to the BertDataBunch command. This has not helped so far. I am quite sure that my .csv's are in the right format, but of course, this could also be one source of the problem. PATH and imported modules should be fine.

    Attached you find my code:

    from pytorch_pretrained_bert.tokenization import BertTokenizer
    from fast_bert.data import BertDataBunch
    
    # Default args. If GPU runs out of memory while training, decrease training
    # batch size
    args = Box({
        "run_text": "tweet sentiment",
        "task_name": "Tweet Sentiment",
        "max_seq_length": 512,
        "do_lower_case": True,
        "train_batch_size": 8,
        "learning_rate": 6e-5,
        "num_train_epochs": 12.0,
        "warmup_proportion": 0.002,
        "local_rank": -1,
        "gradient_accumulation_steps": 1,
        "fp16": True,
        "loss_scale": 128
    })
    
    device = torch.device('cuda')
    
    # check if multiple GPUs are available
    if torch.cuda.device_count() > 1:
        multi_gpu = True
    else:
        multi_gpu = False
    
    # The tokenizer object is used to split the text into tokens used in training
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = args['do_lower_case'])
        
    # Databunch    
    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer = tokenizer,
                              train_file = 'X_train.csv', 
                              val_file = 'X_test.csv', 
                              label_file = 'label.csv',
                              text_col = 'text',
                              label_col = 'label',
                              bs = args['train_batch_size'], 
                              maxlen = args['max_seq_length'], 
                              multi_gpu = True, 
                              multi_label = False,
                              model_type = 'bert')
    
    databunch.save()
    num_labels = len(databunch.labels)
    num_labels
    
    # Set logger
    import logging
    import sys
    
    logfile = str(LOG_PATH/'log-{}-{}.txt'.format(run_start_time, args["run_text"]))
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
        datefmt='%m/%d/%Y %H:%M:%S',
        handlers=[
            logging.FileHandler(logfile),
            logging.StreamHandler(sys.stdout)
        ])
    
    logger = logging.getLogger()
    logger.info(args)
    

    When executing this field, the error happens:

    from fast_bert.learner_cls import BertLearner
    from fast_bert.metrics import accuracy
    
    # Choose the metrics used for the error function in training
    metrics = []
    metrics.append({'name': 'accuracy', 'function': accuracy})
    
    learner = BertLearner.from_pretrained_model(databunch, 
                                                pretrained_path = "bert-base-uncased", 
                                                metrics = metrics, 
                                                device = device,
                                                logger = logger, 
                                                output_dir = OUTPUT_DIR,
                                                finetuned_wgts_path = None, 
                                                is_fp16 = args['fp16'], 
                                                loss_scale = args['loss_scale'],
                                                multi_gpu = True,
                                                multi_label = False)
    

    Thank you for your help!

    opened by JRatschat 7
  • Binary text classification: The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1

    Binary text classification: The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1

    Hello,

    I'm working on binary text classification with CamemBert using fast-bert.

    When I run the code below

    from fast_bert.data_cls import BertDataBunch
    from fast_bert.learner_cls import BertLearner

    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer='camembert-base',
                              train_file='train.csv',
                              val_file='val.csv',
                              label_file='labels.csv',
                              text_col='text',
                              label_col='label',
                              batch_size_per_gpu=8,
                              max_seq_length=512,
                              multi_gpu=multi_gpu,
                              multi_label=False,
                              model_type='camembert-base')

    learner = BertLearner.from_pretrained_model(
                              databunch,
                              pretrained_path='camembert-base',  # '/content/drive/My Drive/model/model_out'
                              metrics=metrics,
                              device=device_cuda,
                              logger=logger,
                              output_dir=OUTPUT_DIR,
                              finetuned_wgts_path=None,  # WGTS_PATH
                              warmup_steps=300,
                              multi_gpu=multi_gpu,
                              is_fp16=True,
                              multi_label=False,
                              logging_steps=50)

    learner.fit(epochs=10,
                lr=9e-5,
                validate=True,
                schedule_type="warmup_cosine",
                optimizer_type="adamw")

    Everything works fine until training. I get this error message when I try to train my model:

    RuntimeError                              Traceback (most recent call last)
    <ipython-input> in ()
          3             validate=True,
          4             schedule_type="warmup_cosine",
    ----> 5             optimizer_type="adamw")

    /usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in fit(self, epochs, lr, validate, return_results, schedule_type, optimizer_type)
        421                 # Evaluate the model against validation set after every epoch
        422                 if validate:
    --> 423                     results = self.validate()
        424                     for key, value in results.items():
        425                         self.logger.info(

    /usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in validate(self, quiet, loss_only)
        515             for metric in self.metrics:
        516                 validation_scores[metric["name"]] = metric["function"](
    --> 517                     all_logits, all_labels
        518                 )
        519             results.update(validation_scores)

    /usr/local/lib/python3.6/dist-packages/fast_bert/metrics.py in fbeta(y_pred, y_true, thresh, beta, eps, sigmoid)
         56     y_pred = (y_pred > thresh).float()
         57     y_true = y_true.float()
    ---> 58     TP = (y_pred * y_true).sum(dim=1)
         59     prec = TP / (y_pred.sum(dim=1) + eps)
         60     rec = TP / (y_true.sum(dim=1) + eps)

    RuntimeError: The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1

    How can I fix this ?

    opened by NawelAr 6
  • RobertaTokenizer object has no attribute 'add_special_tokens_single_sentence'

    RobertaTokenizer object has no attribute 'add_special_tokens_single_sentence'

    In trying to test out the roberta model I received this error. My setup is the same as in the Fine Tune Model section of the readme.

    transformers==2.0.0 fast-bert==1.4.2

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-17-c876b1d42fd6> in <module>
          7     multi_gpu=args.multi_gpu,
          8     model_type=args.model_type,
    ----> 9     logger=logger)
    
    ~/.conda/envs/transclass/lib/python3.7/site-packages/fast_bert/data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
        152                                model_type=model_type,
        153                                logger=logger,
    --> 154                                clear_cache=clear_cache, no_cache=no_cache)
        155 
        156     def __init__(self, data_dir, tokenizer, train_file='lm_train.txt', val_file='lm_val.txt',
    
    ~/.conda/envs/transclass/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
        209             train_filepath = str(self.data_dir/train_file)
        210             train_dataset = TextDataset(self.tokenizer, train_filepath, cached_features_file,
    --> 211                                         self.logger, block_size=self.tokenizer.max_len_single_sentence)
        212 
        213             self.train_batch_size = self.batch_size_per_gpu * \
    
    ~/.conda/envs/transclass/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, tokenizer, file_path, cache_path, logger, block_size)
        104 
        105             while len(tokenized_text) >= block_size:  # Truncate in block of block_size
    --> 106                 self.examples.append(tokenizer.add_special_tokens_single_sentence(
        107                     tokenized_text[:block_size]))
        108                 tokenized_text = tokenized_text[block_size:]
    
    AttributeError: 'RobertaTokenizer' object has no attribute 'add_special_tokens_single_sentence'
    

    It appears that the RobertaTokenizer has attributes:

    add_special_tokens add_special_tokens_sequence_pair add_special_tokens_single_sequence add_tokens

    But not add_special_tokens_single_sentence.

    It seems this method is quite similar to add_special_tokens_single_sequence, and perhaps that is the intended method.

    opened by gphillips-ema 6
  • KeyError: 'distilroberta-base' | UnboundLocalError: local variable 'file_path' referenced before assignment

    KeyError: 'distilroberta-base' | UnboundLocalError: local variable 'file_path' referenced before assignment

    Step 23/23 : RUN python download_pretrained_models.py --location_dir ./pretrained_models/ --models bert-base-uncased roberta-base distilbert-base-uncased distilroberta-base
     ---> Running in ea0f4907e7f3
    Namespace(location_dir='./pretrained_models/', models=['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased', 'distilroberta-base'])
    model name is bert-base-uncased
    location is pretrained_models/bert-base-uncased
    file path is https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
    https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin
    https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json
    model name is roberta-base
    location is pretrained_models/roberta-base
    file path is https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json
    https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt
    https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin
    https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json
    model name is distilbert-base-uncased
    location is pretrained_models/distilbert-base-uncased
    file path is https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
    https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin
    https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json
    model name is distilroberta-base
    location is pretrained_models/distilroberta-base
    Traceback (most recent call last):
      File "download_pretrained_models.py", line 113, in download_pretrained_files
        file_path = PRETRAINED_VOCAB_FILES_MAP[model_name]
    KeyError: 'distilroberta-base'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "download_pretrained_models.py", line 203, in <module>
        main()
      File "download_pretrained_models.py", line 198, in main
        for item in args.models
      File "download_pretrained_models.py", line 198, in <listcomp>
        for item in args.models
      File "download_pretrained_models.py", line 130, in download_pretrained_files
        file_path, model_name
    UnboundLocalError: local variable 'file_path' referenced before assignment

    The command '/bin/sh -c python download_pretrained_models.py --location_dir ./pretrained_models/ --models bert-base-uncased roberta-base distilbert-base-uncased distilroberta-base' returned a non-zero code: 1
    Error response from daemon: No such image: fluent-sagemaker-fast-bert:1.0-gpu-py36
    The push refers to repository [182918221797.dkr.ecr.us-east-1.amazonaws.com/fluent-sagemaker-fast-bert]
    An image does not exist locally with the tag: 182918221797.dkr.ecr.us-east-1.amazonaws.com/fluent-sagemaker-fast-bert

    opened by emtropyml 5
  • KeyError: None of the keys are in index

    KeyError: None of the keys are in index

    I am getting this error ,

    KeyError: ("None of [Index(['CLASS_1', 'CLASS_2', 'CLASS_3', 'CLASS_4', 'CLASS_5', 'CLASS_6',\n       'CLASS_7', 'CLASS_8', 'CLASS_9', 'CLASS_10', 'CLASS_11', 'CLASS_12',\n       'CLASS_13', 'CLASS_14', 'CLASS_15', 'CLASS_16', 'CLASS_17', 'CLASS_E',\n       'CLASS_V'],\n      dtype='object')] are in the [index]", 'occurred at index 0')
    
    

    When i run the new_toxic_multilabel.ipynb from sample notebook, I am getting this error for command:

    databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name, train_file='train.csv', val_file='val.csv',
                              test_data='test.csv',
                              text_col="NOTES", label_col=label_cols,
                              batch_size_per_gpu=args['train_batch_size'], max_seq_length=args['max_seq_length'], 
                              multi_gpu=args.multi_gpu, multi_label=True, model_type=args.model_type)
    

    here is my label_col:

    label_cols = ['CLASS_1','CLASS_2','CLASS_3','CLASS_4','CLASS_5','CLASS_6','CLASS_7','CLASS_8','CLASS_9','CLASS_10','CLASS_11','CLASS_12','CLASS_13','CLASS_14','CLASS_15','CLASS_16','CLASS_17','CLASS_E','CLASS_V']
    

    my labels.csv contains the same classes, listed one after another:

    
    'CLASS_1'
    'CLASS_2'
    'CLASS_3'
    'CLASS_4'
    'CLASS_5'
    'CLASS_6'
    'CLASS_7'
    'CLASS_8'
    'CLASS_9'
    'CLASS_10'
    'CLASS_11'
    'CLASS_12'
    'CLASS_13'
    'CLASS_14'
    'CLASS_15'
    'CLASS_16'
    'CLASS_17'
    'CLASS_E'
    'CLASS_V'
    

    Here is the traceback:

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-13-c5a2ac3a5e99> in <module>
          3                           text_col="NOTES", label_col=label_cols,
          4                           batch_size_per_gpu=args['train_batch_size'], max_seq_length=args['max_seq_length'],
    ----> 5                           multi_gpu=args.multi_gpu, multi_label=True, model_type=args.model_type)
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in __init__(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
        352             if os.path.exists(cached_features_file) == False or self.no_cache == True:
        353                 train_examples = processor.get_train_examples(
    --> 354                     train_file, text_col=text_col, label_col=label_col)
        355 
        356             train_dataset = self.get_dataset_from_examples(
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in get_train_examples(self, filename, text_col, label_col, size)
        230             data_df = pd.read_csv(os.path.join(self.data_dir, filename))
        231 
    --> 232             return self._create_examples(data_df, "train", text_col=text_col, label_col=label_col)
        233         else:
        234             data_df = pd.read_csv(os.path.join(self.data_dir, filename))
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in _create_examples(self, df, set_type, text_col, label_col)
        286         else:
        287             return list(df.apply(lambda row: InputExample(guid=row.index, text_a=row[text_col],
    --> 288                                                           label=_get_labels(row, label_col)), axis=1))
        289 
        290 
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
       6926             kwds=kwds,
       6927         )
    -> 6928         return op.get_result()
       6929 
       6930     def applymap(self, func):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
        184             return self.apply_raw()
        185 
    --> 186         return self.apply_standard()
        187 
        188     def apply_empty_result(self):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
        290 
        291         # compute the result using the series generator
    --> 292         self.apply_series_generator()
        293 
        294         # wrap results
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
        319             try:
        320                 for i, v in enumerate(series_gen):
    --> 321                     results[i] = self.f(v)
        322                     keys.append(v.name)
        323             except Exception as e:
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in <lambda>(row)
        286         else:
        287             return list(df.apply(lambda row: InputExample(guid=row.index, text_a=row[text_col],
    --> 288                                                           label=_get_labels(row, label_col)), axis=1))
        289 
        290 
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in _get_labels(row, label_col)
        273         def _get_labels(row, label_col):
        274             if isinstance(label_col, list):
    --> 275                 return list(row[label_col])
        276             else:
        277                 # create one hot vector of labels
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
       1111             key = check_bool_indexer(self.index, key)
       1112 
    -> 1113         return self._get_with(key)
       1114 
       1115     def _get_with(self, key):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
       1153             # handle the dup indexing case (GH 4246)
       1154             if isinstance(key, (list, tuple)):
    -> 1155                 return self.loc[key]
       1156 
       1157             return self.reindex(key)
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
       1422 
       1423             maybe_callable = com.apply_if_callable(key, self.obj)
    -> 1424             return self._getitem_axis(maybe_callable, axis=axis)
       1425 
       1426     def _is_scalar_access(self, key: Tuple):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
       1837                     raise ValueError("Cannot index with multidimensional key")
       1838 
    -> 1839                 return self._getitem_iterable(key, axis=axis)
       1840 
       1841             # nested tuple slicing
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
       1131         else:
       1132             # A collection of keys
    -> 1133             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
       1134             return self.obj._reindex_with_indexers(
       1135                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
       1090 
       1091         self._validate_read_indexer(
    -> 1092             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
       1093         )
       1094         return keyarr, indexer
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
       1175                 raise KeyError(
       1176                     "None of [{key}] are in the [{axis}]".format(
    -> 1177                         key=key, axis=self.obj._get_axis_name(axis)
       1178                     )
       1179                 )
    
    KeyError: ("None of [Index(['CLASS_1', 'CLASS_2', 'CLASS_3', 'CLASS_4', 'CLASS_5', 'CLASS_6',\n       'CLASS_7', 'CLASS_8', 'CLASS_9', 'CLASS_10', 'CLASS_11', 'CLASS_12',\n       'CLASS_13', 'CLASS_14', 'CLASS_15', 'CLASS_16', 'CLASS_17', 'CLASS_E',\n       'CLASS_V'],\n      dtype='object')] are in the [index]", 'occurred at index 0')
    
    

    What is the issue?

    opened by adiv5 5
  • use_fast=True not working after upgrade to transformers v2.10.0

    use_fast=True not working after upgrade to transformers v2.10.0

    On upgrading to transformers==2.10.0, when instantiating a tokenizer, the vocabulary file is not saved after training. A TypeError is returned when trying to save the tokenizer after training (i.e. on calling data.tokenizer.save_pretrained(path) in learner_util.py).

    I've traced this to line 367 in data_cls.py: https://github.com/kaushaltrivedi/fast-bert/blob/77f09adc7bc2706e0c7e3b8cdd09cb6ddd66ae28/fast_bert/data_cls.py#L367

    if I comment out the use_fast argument, the tokenizer file can be saved correctly, i.e: tokenizer = AutoTokenizer.from_pretrained(tokenizer)#, use_fast=True)

    opened by lingdoc 4
  • The current BertClassificationPredictor has a bug in model_path parameter

    The current BertClassificationPredictor has a bug in model_path parameter

    The current BertClassificationPredictor has a bug in the model_path parameter when it tries to create a tokenizer from AutoTokenizer. It would be good to fix it, but also to offer an option to provide a custom tokenizer.

    opened by markovivl 4
  • ImportError: cannot import name 'ConstantLRSchedule'

    ImportError: cannot import name 'ConstantLRSchedule'

    Doing the following steps to install the fast-bert:

    1. pip install fast-bert
    2. git clone https://github.com/NVIDIA/apex
    3. cd apex
    4. pip install -v --no-cache-dir ./
    5. create train.py
    from fast_bert.data_cls import BertDataBunch
    
    DATA_PATH = 'data'
    LABEL_PATH = 'data'
    
    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer='bert-base-multilingual-uncased',
                              train_file='train.csv',
                              val_file='val.csv',
                              label_file='labels.csv',
                              text_col='text',
                              label_col='label',
                              batch_size_per_gpu=16,
                              max_seq_length=512,
                              multi_gpu=True,
                              multi_label=False,
                              model_type='bert')
    
    1. run it

    Getting the following error:

    Traceback (most recent call last):
      File "/home/kleysonr/.vscode/extensions/ms-python.python-2019.11.50794/pythonFiles/ptvsd_launcher.py", line 43, in <module>
        main(ptvsdArgs)
      File "/home/kleysonr/.vscode/extensions/ms-python.python-2019.11.50794/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
        run()
      File "/home/kleysonr/.vscode/extensions/ms-python.python-2019.11.50794/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
        runpy.run_path(target, run_name='__main__')
      File "/usr/lib/python3.6/runpy.py", line 263, in run_path
        pkg_name=pkg_name, script_name=fname)
      File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/data/dev/python/mestrado/aula9/train.py", line 1, in <module>
        from fast_bert.data_cls import BertDataBunch
      File "/home/kleysonr/.virtualenvs/fastai/lib/python3.6/site-packages/fast_bert/__init__.py", line 5, in <module>
        from .learner_cls import BertLearner
      File "/home/kleysonr/.virtualenvs/fastai/lib/python3.6/site-packages/fast_bert/learner_cls.py", line 3, in <module>
        from .learner_util import Learner
      File "/home/kleysonr/.virtualenvs/fastai/lib/python3.6/site-packages/fast_bert/learner_util.py", line 4, in <module>
        from transformers import (ConstantLRSchedule,
    ImportError: cannot import name 'ConstantLRSchedule'
    

    Installed python modules:

    $ pip freeze
    apex==0.1
    beautifulsoup4==4.8.1
    blis==0.4.1
    boto3==1.10.28
    botocore==1.13.28
    Bottleneck==1.3.1
    catalogue==0.0.8
    certifi==2019.11.28
    chardet==3.0.4
    Click==7.0
    cycler==0.10.0
    cymem==2.0.3
    dataclasses==0.7
    docutils==0.15.2
    fast-bert==1.4.4
    fastai==1.0.59
    fastprogress==0.1.22
    idna==2.8
    importlib-metadata==0.23
    jmespath==0.9.4
    joblib==0.14.0
    kiwisolver==1.1.0
    matplotlib==3.1.2
    more-itertools==7.2.0
    murmurhash==1.0.2
    numexpr==2.7.0
    numpy==1.17.4
    nvidia-ml-py3==7.352.0
    packaging==19.2
    pandas==0.25.3
    Pillow==6.2.1
    plac==1.1.3
    preshed==3.0.2
    protobuf==3.11.0
    pyparsing==2.4.5
    python-dateutil==2.8.1
    pytorch-lamb==1.0.0
    pytz==2019.3
    PyYAML==5.1.2
    regex==2019.11.1
    requests==2.22.0
    s3transfer==0.2.1
    sacremoses==0.0.35
    scikit-learn==0.21.3
    scipy==1.3.3
    sentencepiece==0.1.83
    six==1.13.0
    sklearn==0.0
    soupsieve==1.9.5
    spacy==2.2.3
    srsly==0.2.0
    tensorboardX==1.9
    thinc==7.3.1
    torch==1.3.1
    torchvision==0.4.2
    tqdm==4.39.0
    transformers==2.2.0
    urllib3==1.25.7
    wasabi==0.4.0
    zipp==0.6.0
    
    opened by kleysonr 4
  • Updated data.py and data_cls.py to work with xlsx data files

    Updated data.py and data_cls.py to work with xlsx data files

    This hotfix allows xlsx files as data files for training and evaluation. It simply checks whether xlsx is in the filename and uses the read_excel() import function from the pandas library. It may require openpyxl to be installed via pip or another package manager.

    Addresses #311 (possibly others), whereby imports via read_csv() can result in errors due to formatting problems.

    opened by lingdoc 0
  • Updated learner_util.py save_model() to work with an alternate path in string format

    Updated learner_util.py save_model() to work with an alternate path in string format

    Currently when a path string is provided to learner.save_model(), a directory is not created. This hotfix converts the string to a Path object so that a new directory can be created.

    opened by lingdoc 0
  • DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False

    DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False

    Hello, I am quite new on the topic, sorry if it's a false issue.

    When loading with BertDataBunch, I got this warning:

    lib/python3.9/site-packages/fast_bert/data_cls.py:231: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False.
      data_df = pd.read_csv(os.path.join(self.data_dir, filename))
    

    I already have this sort of issue with pandas in my code, but with BertDataBunch I can't find a way to set the dtype option. I installed fast-bert yesterday, so it should be the latest version.

    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                                  tokenizer='camembert-base',
                                  train_file='train_set.csv',
                                  val_file='val_set.csv',
                                  label_file='labels.txt',
                                  text_col='source_clean',
                                  label_col=['aaa', 'bbb', 'ccc','ddd', 'eee'],
                                  batch_size_per_gpu=16,
                                  max_seq_length=512,
                                  multi_gpu=False,
                                  multi_label=True,
                                  model_type='camembert-base')
    
    opened by mathieuchateau 2
  • TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'

    TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'

    learner.fit(epochs=1,
                lr=6e-5,
                validate=True,  # Evaluate the model after each epoch
                schedule_type="warmup_cosine",
                optimizer_type="lamb")

    Hi, following the official tutorial ("Language Model Fine-tuning"), I get the error shown in the screenshots while running the .fit function.

    opened by FirstGalacticEmpire 0
  • [BUG] AttributeError: 'RobertaTokenizer' object has no attribute 'max_len'

    [BUG] AttributeError: 'RobertaTokenizer' object has no attribute 'max_len'

    args = Box({
        "seed": 42,
        "task_name": 'Medical_language_modelling',
        "model_name": 'roberta-base',
        "model_type": 'roberta',
        "train_batch_size": 16,
        "learning_rate": 4e-5,
        "num_train_epochs": 20,
        "fp16": True,
        "fp16_opt_level": "O2",
        "warmup_steps": 1000,
        "logging_steps": 0,
        "max_seq_length": 512,
        "multi_gpu": True if torch.cuda.device_count() > 1 else False
    })

    databunch_lm = BertLMDataBunch.from_raw_corpus(
        data_dir=Path("./raw_text/"),
        text_list=list_of_files,
        tokenizer=args.model_name,
        batch_size_per_gpu=args.train_batch_size,
        max_seq_length=args.max_seq_length,
        multi_gpu=args.multi_gpu,
        model_type=args.model_type,
        logger=logger)

    When running the BertLMDataBunch.from_raw_corpus call I get the following error: "AttributeError: 'RobertaTokenizer' object has no attribute 'max_len'", which I suspect is due to an update that caused RobertaTokenizer to lose its max_len attribute.

    opened by FirstGalacticEmpire 0
  • [Suggestion] Pin requirement versions (specifically python-box)

    [Suggestion] Pin requirement versions (specifically python-box)

    Hello, I am the developer of python-box and see that it is a requirement in this repo and has not been version pinned. I suggest that you pin it to the max known compatible version in your requirements.txt and/or setup.py file(s):

    python-box[all]~=5.4  
    

    Or without extra dependencies

    python-box~=5.4
    

    Using ~=5.0 (or any minor version) will lock it to major version 5 and the minimum minor version specified. If you instead pin a bugfix release such as ~=5.4.0, it would lock it to the minor version 5.4.*.

    The next major release of Box is right around the corner, and while it has many improvements, I want to ensure you have a smooth transition by being able to test at your own leisure to ensure your standard user cases do not run into any issues. I am keeping track of major changes, so please check there as a quick overview of any differences.

    To test new changes, try out the release candidate:

    pip install python-box[all]~=6.0.0rc4
    opened by cdgriffith 0
Releases
  • v1.8.0(Jul 9, 2020)

  • v1.7.0(Apr 14, 2020)

    We have switched to Auto-model for Multi-class classification. This would let you train any pretrained model architecture for text classification.

  • v1.6.0(Dec 22, 2019)

    Now supports the initial version of Abstractive Summarisation inference, fast-bert style

    In a not-so-distant future release, you will be able to use your own custom language model, fine-tuned on a custom corpus, for the encoder model.

  • v1.5.1(Dec 14, 2019)

  • v1.5.0(Nov 28, 2019)

    Three new models have been added in v1.5.0

    • ALBERT (Pytorch) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
    • CamemBERT (Pytorch) (from Facebook AI Research, INRIA, and La Sorbonne Université), the first large-scale Transformer language model trained on French text. Released alongside the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot. It was added by @louismartin with the help of @julien-c.
    • DistilRoberta (Pytorch) from @VictorSanh, the third distilled model after DistilBERT and DistilGPT-2.