Training BERT with Compute/Time (Academic) Budget
This repository contains scripts for pre-training and finetuning BERT-like models with limited time and compute budget. The code is based on the work presented in the following paper:
Peter Izsak, Moshe Berchansky, Omer Levy, How to Train BERT with an Academic Budget - (to appear at EMNLP 2021).
Installation
The pre-training and finetuning scripts are based on Deepspeed and HuggingFace Transformers libraries.
Preliminary Installation
We recommend creating a virtual environment with python 3.6+, PyTorch and apex.
Installation Requirements
pip install -r requirements.txt
We suggest running Deepspeed's utility ds_report and verify Deepspeed components can be compiled (JIT).
Dataset
The dataset directory includes scripts to pre-process the datasets we used in our experiments (Wikipedia, Bookcorpus). See dedicated README for full details.
Pretraining
Pretraining script: run_pretraining.py
For all possible pretraining arguments see: python run_pretraining.py -h
We highly suggest reviewing the various training features we provide within the library.
Example for training with the best configuration presented in our paper (24-layers/1024H/time-based learning rate schedule/fp16):
deepspeed run_pretraining.py \
  --model_type bert-mlm --tokenizer_name bert-large-uncased \
  --hidden_act gelu \
  --hidden_size 1024 \
  --num_hidden_layers 24 \
  --num_attention_heads 16 \
  --intermediate_size 4096 \
  --hidden_dropout_prob 0.1 \
  --attention_probs_dropout_prob 0.1 \
  --encoder_ln_mode pre-ln \
  --lr 1e-3 \
  --train_batch_size 4096 \
  --train_micro_batch_size_per_gpu 32 \
  --lr_schedule time \
  --curve linear \
  --warmup_proportion 0.06 \
  --gradient_clipping 0.0 \
  --optimizer_type adamw \
  --weight_decay 0.01 \
  --adam_beta1 0.9 \
  --adam_beta2 0.98 \
  --adam_eps 1e-6 \
  --total_training_time 24.0 \
  --early_exit_time_marker 24.0 \
  --dataset_path <dataset path> \
  --output_dir /tmp/training-out \
  --print_steps 100 \
  --num_epochs_between_checkpoints 10000 \
  --job_name pretraining_experiment \
  --project_name budget-bert-pretraining \
  --validation_epochs 3 \
  --validation_epochs_begin 1 \
  --validation_epochs_end 1 \
  --validation_begin_proportion 0.05 \
  --validation_end_proportion 0.01 \
  --validation_micro_batch 16 \
  --deepspeed \
  --data_loader_type dist \
  --do_validation \
  --use_early_stopping \
  --early_stop_time 180 \
  --early_stop_eval_loss 6 \
  --seed 42 \
  --fp16
Time-based Training
Pretraining can be limited to a time-based value by defining --total_training_time=24.0 (24 hours for example).
Time-based Learning Rate Scheduling
The learning rate can be scheduled to change according to the configured total training time. The argument --total_training_time controls the total time assigned for the trainer to run, and must be specified in order to use time-based learning rate scheduling.
To select time-based learning rate scheduling, define --lr_schedule time, and define a shape for for the annealing curve (--curve=linear for example, as seen in the figure). The warmup phase of the learning rate is define by specifying a proportion (--warmup_proportion) which accounts for the time-budget proportion available in the training session (as defined by --total_training_time). For example, for a 24 hour training session, warmup_proportion=0.1 would account for 10% of 24 hours, that is, 2.4 hours (or 144 minutes) to reach peak learning rate. The learning rate will then be scheduled to reach 0 at the end of the time budget. We refer to the provided figure for an example.
Checkpoints and Finetune Checkpoints
There are 2 types of checkpoints that can be enabled:
- Training checkpoint - saves model weights, optimizer state and training args. Defined by --num_epochs_between_checkpoints.
- Finetuning checkpoint - saves model weights and configuration to be used for finetuning later on. Defined by --finetune_time_markers.
finetune_time_markers can be assigned multiple points in the training time-budget by providing a list of time markers of the overall training progress. For example --finetune_time_markers=0.5 will save a finetuning checkpoint when reaching 50% of training time budget. For multiple finetuning checkpoints, use commas without space 0.5,0.6,0.9.
Validation Scheduling
Enable validation while pre-training with --do_validation
Control the number of epochs between validation runs with --validation_epochs=
To control the amount of validation runs in the beginning and end (running more that validation_epochs) use validation_begin_proportion and validation_end_proportion to specify the proportion of time and, validation_epochs_begin and validation_epochs_end to control the custom values accordingly.
Mixed Precision Training
Mixed precision is supported by adding --fp16. Use --fp16_backend=ds to use Deepspeed's mixed precision backend and --fp16_backend=apex for apex (--fp16_opt controls optimization level).
Finetuning
Use run_glue.py to run finetuning for a saved checkpoint on GLUE tasks.
The finetuning script is identical to the one provided by Huggingface with the addition of our model.
For all possible pretraining arguments see: python run_glue.py -h
Example for finetuning on MRPC:
python run_glue.py \
  --model_name_or_path <path to model> \
  --task_name MRPC \
  --max_seq_length 128 \
  --output_dir /tmp/finetuning \
  --overwrite_output_dir \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --per_device_train_batch_size 32 --gradient_accumulation_steps 1 \
  --per_device_eval_batch_size 32 \
  --learning_rate 5e-5 \
  --weight_decay 0.01 \
  --eval_steps 50 --evaluation_strategy steps \
  --max_grad_norm 1.0 \
  --num_train_epochs 5 \
  --lr_scheduler_type polynomial \
  --warmup_steps 50
Generating Pretraining Commands
We provide a useful script for generating multiple (or single) pretraining commands by using python generate_training_commands.py.
python generate_training_commands.py -h
	--param_file PARAM_FILE Hyperparameter and configuration yaml
  	--job_name JOB_NAME   job name
 	--init_cmd INIT_CMD   initialization command (deepspeed or python directly)
A parameter yaml must be defined with 2 main keys: hyperparameters with argument values defined as a list of possible values, and default_parameters as default values. Each generated command will be a possible combination of the various arguments specified in the hyperparameters section.
Example:
hyperparameters:
  param1: [val1, val2]
  param2: [val1, val2]
default_parameters:
  param3: 0.0
will result in:
deepspeed run_pretraining.py --param1=val1 --param2=val1 --param3=0.0
deepspeed run_pretraining.py --param1=val1 --param2=val2 --param3=0.0
deepspeed run_pretraining.py --param1=val2 --param2=val1 --param3=0.0
deepspeed run_pretraining.py --param1=val2 --param2=val2 --param3=0.0
Citation
If you find this paper or this code useful, please cite this paper:
@article{izsak2021,
  author={Izsak, Peter and Berchansky, Moshe and Levy, Omer},
  title={How to Train BERT with an Academic Budget},
  journal={arXiv preprint arXiv:2104.07705},
  url = {https://arxiv.org/abs/2104.07705} 
  year={2021}
}
