TorchShard is a lightweight engine for slicing a PyTorch tensor into parallel shards

Overview

Documents | Projects | API References

TorchShard is a lightweight engine for slicing a PyTorch tensor into parallel shards. It can reduce GPU memory usage and scale up training when the model has massive linear layers (e.g., ViT, BERT, and GPT) or a huge number of classes (millions). It has the same API design as PyTorch.

Installation

pip install torchshard

More options in INSTALL.md.

Usage

import torch
import torchshard as ts

ts.init_process_group(group_size=2)                       # init parallel groups

m = torch.nn.Sequential(
    torch.nn.Linear(20, 30, bias=True),
    ts.nn.ParallelLinear(30, 30, bias=True, dim=None),    # equivalent to nn.Linear()
    ts.nn.ParallelLinear(30, 30, bias=True, dim=0),       # parallel in row dimension
    ts.nn.ParallelLinear(30, 30, bias=True, dim=1),       # parallel in column dimension
).cuda()

x = torch.randn(64, 20).cuda()                            # example input batch (added so the snippet runs)
y = torch.randint(0, 30, (64,)).cuda()                    # example target classes

x = m(x)                                                  # forward
loss = ts.nn.functional.parallel_cross_entropy(x, y)      # parallel loss function
loss.backward()                                           # backward

torch.save(
    ts.collect_state_dict(m, m.state_dict()), 'm.pt')     # save model state
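
To load the collected checkpoint back into a parallel model, the full state dict has to be re-sliced into per-rank shards. A minimal sketch, assuming a ts.relocate_state_dict helper as the counterpart of ts.collect_state_dict (check the exact name against the API References above):

state_dict = torch.load('m.pt')                           # full (collected) checkpoint
state_dict = ts.relocate_state_dict(m, state_dict)        # re-slice shards for this rank
m.load_state_dict(state_dict)                             # load model state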

Performance

The following figure shows training ResNet-50 on 8 NVIDIA TITAN-XP (12196 MiB) GPUs while scaling the number of classes from 1,000 → 1 Million. The input size is 224 x 224 and the batch size is 256. Parallelism is 8-way data parallel combined with 8-way model parallel.
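
In the large-class setting, the memory saving comes from sharding the final classifier: a column-parallel layer stores only num_classes / group_size output columns per rank, and the sharded logits feed parallel_cross_entropy as in the usage example above. A minimal sketch using only the layers shown above, not the project code; the feature dimension, batch size, and class count are illustrative:

import torch
import torchshard as ts

ts.init_process_group(group_size=8)                            # 8-way model parallel

NUM_CLASSES = 1_000_000                                        # each rank stores only a slice of the weight
head = ts.nn.ParallelLinear(2048, NUM_CLASSES, bias=True, dim=1).cuda()   # column-parallel classifier

feats = torch.randn(256, 2048).cuda()                          # backbone features for one batch
labels = torch.randint(0, NUM_CLASSES, (256,)).cuda()

logits = head(feats)                                           # forward through the sharded classifier
loss = ts.nn.functional.parallel_cross_entropy(logits, labels) # loss over sharded logits
loss.backward()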

The following figure shows training minGPT on 8 NVIDIA TITAN-XP (12196 MiB) GPUs while scaling the number of parameters from 10 Million → 808 Million. The input size is 32 x 32 and the batch size is 16. Parallelism is 1-way data parallel combined with 8-way model parallel.
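
In the parameter-scaling setting, the layers being sharded are the large projections inside the transformer blocks. Below is a minimal, Megatron-style sketch of a column-parallel then row-parallel MLP built only from the layers shown above; it is not the minGPT project code, the width is illustrative, and whether the intermediate activation needs extra handling depends on ParallelLinear's internals:

import torch
import torchshard as ts

ts.init_process_group(group_size=8)                        # 8-way model parallel

embed_dim = 1024                                           # illustrative model width
mlp = torch.nn.Sequential(
    ts.nn.ParallelLinear(embed_dim, 4 * embed_dim, bias=True, dim=1),   # column-parallel expansion
    torch.nn.GELU(),
    ts.nn.ParallelLinear(4 * embed_dim, embed_dim, bias=True, dim=0),   # row-parallel projection
).cuda()

h = torch.randn(16, embed_dim).cuda()                      # a batch of token features
out = mlp(h)                                               # each rank holds roughly 1/8 of the MLP weights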

Contributing

TorchShard welcomes your expertise and enthusiasm!

If you are interested in TorchShard, you are welcome to help:

  • polish code and develop new features
  • develop high-quality tutorials, projects, and advanced materials

Direct pull requests are welcome. Contact: kaiyuyue [at] umd.edu.

Citing TorchShard

If you find TorchShard helpful in your research and would like to cite it, please use the following BibTeX entry.

@misc{torchshard2021,
  author =       {Kaiyu Yue},
  title =        {TorchShard},
  howpublished = {\url{https://github.com/KaiyuYue/torchshard}},
  year =         {2021}
}
Comments
  • Future Planning on this project.

    Hello Kaiyu, I love this awesome project. The API design is elegant and simple, and the software is lightweight and user-friendly. My understanding is that this project implements a series of PyTorch wrappers for tensor slicing.

    1. I am curious about the future planning of this project.
    2. Is there some overlap in functionality between torchshard and the N-D parallelism proposed in ColossalAI?
    3. How is compatibility with ZeRO? According to the amp+zero example, the memory footprint changes only slightly after combining torchshard with ZeRO.
    opened by feifeibear 2
  • Which one is faster?

    Thanks for contributing this great lib. I have one question: which one is faster between dim=0 and dim=1? The documentation seems to contain only accuracy results.

    opened by NOBLES5E 2
  • 8-GPU test example raises an error.

    When I run the unit tests with two GPU devices, they pass with the command below: CUDA_VISIBLE_DEVICES=0,1 python3 -m unittest discover -v -s tests

    But when I run the unit tests with eight GPU devices, they raise ncclSystemError. Run command: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m unittest discover -v -s tests raises the error: RuntimeError: NCCL error in ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.

    Is it necessary for the unit tests to pass on eight GPU devices?

    opened by JiaquanYe 1
  • Error?

    Hi, thanks for the excellent work! When I install it from pip and run:

    import torchshard as ts
    ts.init_process_group(group_size=2) 
    

    the following AttributeError occurs:

    AttributeError: module 'torchshard' has no attribute 'init_process_group'
    
    opened by WangWenhao0716 1
  • Multi-node setting?

    https://github.com/KaiyuYue/torchshard/blob/89e21def180bf6063ceb2e312a61631173abc7e7/projects/minGPT/main.py#L150

    I have noticed that group_size is set to world_size in the examples, but in fact group_size can be set to other numbers, according to my understanding.

    https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18

    I have also found that get_world_size() returns the total number of processes.

    These two findings confuse me in a multi-node setting, say 2 nodes, each with 2 processes.

    If group_size is 2, then there are 2 distinct groups besides the default group (with overlap). However, using get_world_size() without specifying a group can make a layer be split into 4 parts, whereas 2 parts are expected in our case (see the generic sketch after these comments).

    Correct me if I am wrong.

    Good Issue 
    opened by GeneZC 1
  • Is it possible to collect the state dict on CPU?

    When I finish one epoch of training, the main_worker function calls ts.collect_state_dict(model, state_dict). But because of limited GPU resources, my machine raises Out of Memory when ts.collect_state_dict(model, state_dict) is called. I found that it gathers the state_dict on GPU; is there any way to gather it on CPU?

    Good Issue 
    opened by JiaquanYe 2
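
Regarding the multi-node comment above, the distinction between the default group and subgroups can be illustrated with plain torch.distributed. This is a generic sketch of the 2-node x 2-process layout described there, not TorchShard code:

import torch.distributed as dist

# Launched with 4 processes in total (2 nodes x 2 processes), e.g. via torchrun.
dist.init_process_group(backend='nccl')

dist.get_world_size()                  # 4: size of the default group (all processes)

# Two disjoint subgroups of size 2; every rank must call new_group for both.
g0 = dist.new_group(ranks=[0, 1])
g1 = dist.new_group(ranks=[2, 3])
my_group = g0 if dist.get_rank() < 2 else g1

dist.get_world_size(group=my_group)    # 2: only this value matches group_size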
Releases: v0.1

Owner: Kaiyu Yue