Benchmark for Tuning Accuracy and Efficiency

Overview

This benchmark collects our efforts to train different tasks with Colossal-AI and reach SOTA results. We care about both validation accuracy and training speed, and we prefer larger batch sizes to take advantage of more GPU devices. For example, we trained a vision transformer with batch size 512 on CIFAR10 and 4096 on ImageNet1k, batch sizes that are rarely used in existing works. Some of the results obtained with 8x A100 GPUs are shown below.

Task	Model	Training Time	Top-1 Accuracy
CIFAR10	ViT-Lite-7/4	~ 16 min	~ 90.5%
ImageNet1k	ViT-S/16	~ 16.5 h	~ 74.5%

The train.py script in each task runs training with a specific configuration script from configs/, one per parallelism. Supported parallelisms are data parallel only (script name ends with vanilla), 1D (ends with 1d), 2D (ends with 2d), 2.5D (ends with 2p5d), and 3D (ends with 3d).
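The naming convention above can be sketched as a small lookup. This is only an illustration: the helper name, the example file names, and the mode labels on the right-hand side of the mapping are assumptions, not something train.py is known to contain.

```python
from pathlib import Path

# Hypothetical mapping from the config name suffix to a tensor-parallel
# mode label. The labels ("1d", "2.5d", ...) are assumed for illustration.
SUFFIX_TO_MODE = {
    "vanilla": None,   # data parallel only, no tensor parallelism
    "1d": "1d",
    "2d": "2d",
    "2p5d": "2.5d",
    "3d": "3d",
}


def mode_from_config(path):
    """Return the mode label implied by the suffix of a config script name."""
    stem = Path(path).stem  # file name without directory and extension
    for suffix, mode in SUFFIX_TO_MODE.items():
        if stem.endswith(suffix):
            return mode
    raise ValueError(f"unrecognized config name: {path}")


# Example file names below are made up.
print(mode_from_config("configs/vit_2p5d.py"))
print(mode_from_config("configs/vit_vanilla.py"))
```

None of the suffixes is a suffix of another, so the order of the checks does not matter.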

Each configuration script includes the following elements, taking the ImageNet1k task as an example:

from colossalai.amp import AMP_TYPE

TOTAL_BATCH_SIZE = 4096
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3

NUM_EPOCHS = 300
WARMUP_EPOCHS = 32

# data parallel only
TENSOR_PARALLEL_SIZE = 1    
TENSOR_PARALLEL_MODE = None

# parallelism setting
parallel = dict(
    pipeline=1,
    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
)

fp16 = dict(mode=AMP_TYPE.TORCH) # amp setting

gradient_accumulation = 2 # accumulate 2 steps for gradient update

BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation # actual batch size for dataloader

clip_grad_norm = 1.0 # clip gradient with norm 1.0

Upper-case variables are what train.py reads, and lower-case variables are what Colossal-AI needs to initialize training.
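To make the batch-size settings concrete, here is a minimal sketch of the arithmetic behind them. It assumes that the dataloader batch is split evenly across data-parallel replicas and that the data-parallel size is the world size divided by the tensor and pipeline sizes; the function name is hypothetical and how train.py actually divides the batch is not confirmed here.

```python
def per_step_batch_sizes(total_batch_size, gradient_accumulation, world_size,
                         tensor_size=1, pipeline_size=1):
    """Return (dataloader batch, data-parallel size, samples per replica).

    Assumption: the dataloader batch is divided evenly among the
    data-parallel replicas at every accumulation step.
    """
    # Same formula as BATCH_SIZE in the configuration script.
    dataloader_batch = total_batch_size // gradient_accumulation

    model_parallel = tensor_size * pipeline_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor_size * pipeline_size")
    data_parallel_size = world_size // model_parallel

    per_replica = dataloader_batch // data_parallel_size
    return dataloader_batch, data_parallel_size, per_replica


# Values from the ImageNet1k configuration, run on 8 GPUs with data parallel only.
print(per_step_batch_sizes(4096, 2, world_size=8))  # (2048, 8, 256)
```

Under these assumptions, each of the 8 GPUs processes 256 samples per step, and two accumulated steps give the total batch of 4096.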

Usage

To start training, use the following command to run each worker:

$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
                                        --rank=RANK \
                                        --local_rank=LOCAL_RANK \
                                        --host=MASTER_IP_ADDRESS \
                                        --port=MASTER_PORT \
                                        --config=CONFIG_FILE

Alternatively, we recommend starting training with torchrun:

$ DATA=/path/to/dataset torchrun --nproc_per_node=NUM_GPUS_PER_NODE \
                                 --nnodes=NUM_NODES \
                                 --node_rank=NODE_RANK \
                                 --master_addr=MASTER_IP_ADDRESS \
                                 --master_port=MASTER_PORT \
                                 train.py --config=CONFIG_FILE
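When launching workers by hand, every process needs its own RANK and LOCAL_RANK. The sketch below prints one command per worker, assuming each node has the same number of GPUs and global ranks are numbered node by node. The helper names, the address, the port, and the config path are placeholders for illustration; only the command-line flags come from the command shown earlier.

```python
def worker_ranks(num_nodes, gpus_per_node):
    """Yield (node_rank, local_rank, rank) for every worker process."""
    for node_rank in range(num_nodes):
        for local_rank in range(gpus_per_node):
            # Assumed convention: ranks are contiguous within a node.
            yield node_rank, local_rank, node_rank * gpus_per_node + local_rank


def manual_commands(num_nodes, gpus_per_node, host, port, config):
    """Build the per-worker train.py command lines as strings."""
    world_size = num_nodes * gpus_per_node
    return [
        f"python train.py --world_size={world_size} --rank={rank} "
        f"--local_rank={local_rank} --host={host} --port={port} "
        f"--config={config}"
        for _, local_rank, rank in worker_ranks(num_nodes, gpus_per_node)
    ]


# Placeholder values: 2 nodes with 4 GPUs each.
for cmd in manual_commands(2, 4, "10.0.0.1", 29500, "configs/example.py"):
    print(cmd)
```

With torchrun, this bookkeeping is handled by the launcher, which is one reason to prefer it.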

ColossalAI-Benchmark - Performance benchmarking with ColossalAI

Owner

HPC-AI Tech
