Simple tutorials on Pytorch DDP training

Overview

pytorch-distributed-training

Distribute Dataparallel (DDP) Training on Pytorch

Features

Good Notes

分享一些网上优质的笔记

TODO

  • 完成DP和DDP源码解读笔记(当前进度50%)
  • 修改代码细节, 复现实验结果

Quick start

想直接运行查看结果的可以执行以下命令, 注意一定要用--ip--port来指定主机的ip地址以及空闲的端口,否则可能无法运行

$ python dataparallel.py --gpu 0,1,2,3
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py
  • --ip=str, e.g --ip='10.24.82.10' 来指定主进程的ip地址

  • --port=int, e.g --port=23456 来指定启动端口号

  • --batch_size=int, e.g --batch_size=128 设定训练batch_size

  • distributed_gradient_accumulation.py

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py
  • --ip=str, e.g --ip='10.24.82.10' 来指定主进程的ip地址
  • --port=int, e.g --port=23456 来指定启动端口号
  • --grad_accu_steps=int, e.g --grad_accu_steps=4' 来指定gradient_step

Comparison

结果不够准确,GPU状态不同结果可能差异较大

默认情况下都使用SyncBatchNorm, 这会导致执行速度变慢一些,因为需要增加进程之间的通讯来计算BatchNorm, 但有利于保证准确率

Concepts

  • apex
  • DP: DataParallel
  • DDP: DistributedDataParallel

Environments

  • 4 × 2080Ti
model dataset training method time(seconds/epoch) Top-1 accuracy
resnet18 cifar100 DP 20s
resnet18 cifar100 DP+apex 18s
resnet18 cifar100 DDP 16s
resnet18 cifar100 DDP+apex 14.5s

Basic Concept

  • group: 表示进程组,默认情况下只有一个进程组。
  • world size: 全局进程个数
    • 比如16张卡单卡单进程: world size = 16
    • 8卡单进程: world size = 1
    • 只有当连接的进程数等于world size, 程序才会执行
  • rank: 进程序号,用于进程间通讯,表示进程优先级,rank=0表示主进程
  • local_rank: 进程内,GPU编号,非显示参数,由torch.distributed.launch内部指定,rank=3, local_rank=0 表示第3个进程的第1GPU

Usage 单机多卡

1. 获取当前进程的index

pytorch可以通过torch.distributed.lauch启动器,在命令行分布式地执行.py文件, 在执行的过程中会将当前进程的index通过参数传递给python

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
print(args.local_rank)

2. 定义 main_worker 函数

主要的训练流程都写在main_worker函数中,main_worker需要接受三个参数(最后一个参数optional):

def main_worker(local_rank, nprocs, args):
    training...
  • local_rank: 接受当前进程的rank值,在一机多卡的情况下对应使用的GPU号
  • nprocs: 进程数量
  • args: 自己定义的额外参数

main_worker,相当于你每个进程需要运行的函数(每个进程执行的函数内容是一致的,只不过传入的local_rank不一样)

3. main_worker函数中的整体流程

main_worker函数中完整的训练流程

import torch
import torch.distributed as dist
import torch.backends.cudnn as cudnn
def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank
    # 分布式初始化,对于每个进程来说,都需要进行初始化
    cudnn.benchmark = True
    dist.init_process_group(backend='nccl', init_method='tcp://ip:port', world_size=nprocs, rank=local_rank)
    # 模型、损失函数、优化器定义
    model = ...
    criterion = ...
    optimizer = ...
    # 设置进程对应使用的GPU
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # 使用分布式函数定义模型
    model = model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    
    # 数据集的定义,使用 DistributedSampler
    mini_batch_size = batch_size / nprocs # 手动划分 batch_size to mini-batch_size
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=..., 
                                              sampler=train_sampler)
    
    test_dataset = ...
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
    testloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=..., 
                                             sampler=test_sampler) 
    
    # 正常的 train 流程
    for epoch in range(300):
       model.train()
       for batch_idx, (images, target) in enumerate(trainloader):
          images = images.cuda(non_blocking=True)
          target = target.cuda(non_blocking=True)
          ...
          pred = model(images)
          loss = loss_function(pred, target)
          ...
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

4. 定义main函数

import argparse
import torch
parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size','--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # 执行 main_worker
    main_worker(args.local_rank, args.nprocs, args)

if __name__ == '__main__':
    main()

5. Command Line 启动

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
  • --ip=str, e.g --ip='10.24.82.10' 来指定主进程的ip地址
  • --port=int, e.g --port=23456 来指定启动端口号

参数说明:

  • --nnodes 表示机器的数量
  • --node_rank 表示当前的机器
  • --nproc_per_node 表示每台机器上的进程数量

参考 distributed.py

6. torch.multiprocessing

使用torch.multiprocessing来解决进程自发控制可能产生问题,这种方式比较稳定,推荐使用

import argparse
import torch
import torch.multiprocessing as mp

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size','--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # 将 main_worker 放入 mp.spawn 中
    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))

if __name__ == '__main__':
    main()

参考 distributed_mp.py 启动方式如下:

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
  • --ip=str, e.g --ip='10.24.82.10' 来指定主进程的ip地址
  • --port=int, e.g --port=23456 来指定启动端口号

Implemented Work

参考的文章如下(如果有文章没有引用,但是内容差不多的,可以提issue给我,我会补上,实在抱歉):

Owner
Ren Tianhe
Ren Tianhe
Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)

Vision Transformer Pytorch reimplementation of Google's repository for the ViT model that was released with the paper An Image is Worth 16x16 Words: T

Eunkwang Jeon 1.4k Dec 28, 2022
StyleGAN2 Webtoon / Anime Style Toonify

StyleGAN2 Webtoon / Anime Style Toonify Korea Webtoon or Japanese Anime Character Stylegan2 base high Quality 1024x1024 / 512x512 Generate and Transfe

121 Dec 21, 2022
This repository contains the implementation of the paper: Federated Distillation of Natural Language Understanding with Confident Sinkhorns

Federated Distillation of Natural Language Understanding with Confident Sinkhorns This repository provides an alternative method for ensembled distill

Deep Cognition and Language Research (DeCLaRe) Lab 11 Nov 16, 2022
Data for "Driving the Herd: Search Engines as Content Influencers" paper

herding_data Data for "Driving the Herd: Search Engines as Content Influencers" paper Dataset description The collection contains 2250 documents, 30 i

0 Aug 17, 2021
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

============================================================================================================ `MILA will stop developing Theano https:

9.6k Jan 06, 2023
[제 13회 투빅스 컨퍼런스] OK Mugle! - 장르부터 멜로디까지, Content-based Music Recommendation

Ok Mugle! 🎵 장르부터 멜로디까지, Content-based Music Recommendation 'Ok Mugle!'은 제13회 투빅스 컨퍼런스(2022.01.15)에서 진행한 음악 추천 프로젝트입니다. Description 📖 본 프로젝트에서는 Kakao

SeongBeomLEE 5 Oct 09, 2022
[NeurIPS 2021] "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators"

G-PATE This is the official code base for our NeurIPS 2021 paper: "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of T

AI Secure 14 Oct 12, 2022
A testcase generation tool for Persistent Memory Programs.

PMFuzz PMFuzz is a testcase generation tool to generate high-value tests cases for PM testing tools (XFDetector, PMDebugger, PMTest and Pmemcheck) If

Systems Research at ShiftLab 14 Jul 24, 2022
Winners of the Facebook Image Similarity Challenge

Winners of the Facebook Image Similarity Challenge

DrivenData 111 Jan 05, 2023
PyTorch 1.5 implementation for paper DECOR-GAN: 3D Shape Detailization by Conditional Refinement.

DECOR-GAN PyTorch 1.5 implementation for paper DECOR-GAN: 3D Shape Detailization by Conditional Refinement, Zhiqin Chen, Vladimir G. Kim, Matthew Fish

Zhiqin Chen 72 Dec 31, 2022
Gradient representations in ReLU networks as similarity functions

Gradient representations in ReLU networks as similarity functions by Dániel Rácz and Bálint Daróczy. This repo contains the python code related to our

1 Oct 08, 2021
Official implementation of "Motif-based Graph Self-Supervised Learning forMolecular Property Prediction"

Motif-based Graph Self-Supervised Learning for Molecular Property Prediction Official Pytorch implementation of NeurIPS'21 paper "Motif-based Graph Se

zaixi 71 Dec 20, 2022
[CVPR 2021 Oral] Variational Relational Point Completion Network

VRCNet: Variational Relational Point Completion Network This repository contains the PyTorch implementation of the paper: Variational Relational Point

PL 121 Dec 12, 2022
A machine learning benchmark of in-the-wild distribution shifts, with data loaders, evaluators, and default models.

WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

P-Lambda 437 Dec 30, 2022
OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network

Stock Price Prediction of Apple Inc. Using Recurrent Neural Network OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network Dataset:

Nouroz Rahman 410 Jan 05, 2023
Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

4 Mar 11, 2022
Table-Extractor 表格抽取

(t)able-(ex)tractor 本项目旨在实现pdf表格抽取。 Models 版面分析模块(Yolo) 表格结构抽取(ResNet + Transformer) 文字识别模块(CRNN + CTC Loss) Acknowledgements TableMaster attention-i

2 Jan 15, 2022
A library for using chemistry in your applications

Chemistry in python Resources Used The following items are not made by me! Click the words to go to the original source Periodic Tab Json - Used in -

Tech Penguin 28 Dec 17, 2021
TipToiDog - Tip Toi Dog With Python

TipToiDog Was ist dieses Projekt? Meine 5-jährige Tochter spielt sehr gerne das

1 Feb 07, 2022
⚡ H2G-Net for Semantic Segmentation of Histopathological Images

H2G-Net This repository contains the code relevant for the proposed design H2G-Net, which was introduced in the manuscript "Hybrid guiding: A multi-re

André Pedersen 8 Nov 24, 2022