Introduction
DistML is a Ray extension library that supports large-scale distributed ML training on heterogeneous multi-node, multi-GPU clusters. The library is under active development, and more advanced training strategies and auto-parallelization features are being added.
DistML currently supports:
- Distributed training strategies:
  - Data parallelism
    - AllReduce strategy
    - Sharded parameter server strategy
    - BytePS strategy
  - Pipeline parallelism
    - Micro-batch pipeline parallelism
- DL Frameworks:
  - PyTorch
  - JAX
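
To give a feel for what the AllReduce data-parallel strategy does, here is a minimal simulation in plain Python. It is only a sketch of the general technique, not DistML's API: the function names (`allreduce_mean`, `sgd_step`) and the toy gradients are made up for illustration, and plain lists stand in for per-GPU tensors.

```python
# Illustrative simulation only. Plain Python lists stand in for per-GPU
# gradient tensors, and none of these names come from DistML.

def allreduce_mean(worker_grads):
    """Average gradients element-wise and hand every worker a copy."""
    num_workers = len(worker_grads)
    # "Reduce": element-wise sum of the gradients held by each worker.
    summed = [sum(values) for values in zip(*worker_grads)]
    mean = [s / num_workers for s in summed]
    # "All": every worker receives the same averaged gradient.
    return [list(mean) for _ in range(num_workers)]


def sgd_step(params, grad, lr):
    """Apply one plain SGD update to a single model replica."""
    return [p - lr * g for p, g in zip(params, grad)]


# Two hypothetical workers, each with gradients from its own data shard.
local_grads = [[1.0, 2.0], [3.0, 6.0]]
synced_grads = allreduce_mean(local_grads)

# Each worker holds an identical replica of the parameters.
replicas = [[0.0, 0.0], [0.0, 0.0]]
replicas = [sgd_step(p, g, lr=0.5) for p, g in zip(replicas, synced_grads)]

print(synced_grads)  # [[2.0, 4.0], [2.0, 4.0]]
print(replicas)      # [[-1.0, -2.0], [-1.0, -2.0]]
```

Because every worker applies the same averaged gradient, the replicas stay identical after each step. That is the property data parallelism relies on.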
 
Installation
Install Dependencies
Install the cupy build that matches your CUDA version by following https://docs.cupy.dev/en/stable/install.html.
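
After installing, a quick way to confirm that cupy is visible to your Python environment is a small standard-library check like the one below. The helper name `module_available` is hypothetical and not part of DistML or cupy. The check only confirms that the package can be located, not that it matches your CUDA version.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if module_available("cupy"):
    print("cupy found")
else:
    print("cupy not found; see https://docs.cupy.dev/en/stable/install.html")
```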
Install from source for dev
pip install -e .