Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Overview

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

LF AI & Data

Horovod is hosted by the LF AI & Data Foundation (LF AI & Data). If you are a company that is deeply committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and want to support the communities of open source projects in these domains, consider joining the LF AI & Data Foundation. For details about who's involved and how Horovod plays a role, read the Linux Foundation announcement.



Documentation


Why Horovod?

The primary motivation for this project is to make it easy to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. This has two aspects:

  1. How much modification does one have to make to a program to make it distributed, and how easy is it to run it?
  2. How much faster would it run in distributed mode?

Internally at Uber we found the MPI model to be much more straightforward and to require far fewer code changes than previous solutions such as Distributed TensorFlow with parameter servers. Once a training script has been written to scale with Horovod, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. See the Usage section for more details.

In addition to being easy to use, Horovod is fast. Below is a chart representing a benchmark performed on 128 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network:

512-GPU Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16. See Benchmarks to find out how to reproduce these numbers.

While installing MPI and NCCL themselves may seem like an extra hassle, it only needs to be done once by the team dealing with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at scale.

Install

To install Horovod:

  1. Install CMake

  2. If you've installed TensorFlow from PyPI, make sure that g++-4.8.5 or g++-4.9 or above is installed.

    If you've installed PyTorch from PyPI, make sure that g++-4.9 or above is installed.

    If you've installed either package from Conda, make sure that the gxx_linux-64 Conda package is installed.

  3. Install the horovod pip package.

    To run on CPUs:

    $ pip install horovod

    To run on GPUs with NCCL:

    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
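
    To require that Horovod is built with support for particular frameworks, you can also set framework flags at install time. For example (a sketch; see the Installation Guide for the complete list of build options):

    $ HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 pip install horovod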

For more details on installing Horovod with GPU support, read Horovod on GPU.

For the full list of Horovod installation options, read the Installation Guide.

If you want to use MPI, read Horovod with MPI.

If you want to use Conda, read Building a Conda environment with GPU support for Horovod.

If you want to use Docker, read Horovod in Docker.

To compile Horovod from source, follow the instructions in the Contributor Guide.

Concepts

Horovod's core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, and broadcast. See this page for more details.
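
As a quick illustration of these terms, the snippet below (a minimal sketch using the horovod.tensorflow module; the equivalent functions exist in each framework binding) prints the total number of workers, this worker's global rank, and its rank on the local host:

import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# size():       total number of Horovod processes
# rank():       unique id of this process across all hosts (0 .. size-1)
# local_rank(): unique id of this process within its own host
print('size: %d, rank: %d, local rank: %d'
      % (hvd.size(), hvd.rank(), hvd.local_rank()))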

Supported frameworks

See these pages for Horovod examples and best practices:

Usage

To use Horovod, make the following additions to your program:

  1. Run hvd.init() to initialize Horovod.

  2. Pin each GPU to a single process to avoid resource contention.

    With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

  3. Scale the learning rate by the number of workers.

    Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.

  4. Wrap the optimizer in hvd.DistributedOptimizer.

    The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

  5. Broadcast the initial variable states from rank 0 to all other processes.

    This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

  6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.

Example using TensorFlow v1 (see the examples directory for full training examples):

import tensorflow as tf
import horovod.tensorflow as hvd


# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)
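
The same steps map directly onto the other supported frameworks. Below is a rough PyTorch counterpart (a minimal sketch using the horovod.torch API; the toy linear model and random data stand in for a real model and data loader):

import torch
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin each GPU to a single process (one GPU per process)
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Build model... (toy placeholder model)
model = torch.nn.Linear(10, 1)
if torch.cuda.is_available():
    model.cuda()

# Scale the learning rate by the number of workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Broadcast initial parameters and optimizer state from rank 0 to all other processes
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Perform synchronous training on random stand-in data
loss_fn = torch.nn.MSELoss()
for step in range(100):
    data, target = torch.randn(32, 10), torch.randn(32, 1)
    if torch.cuda.is_available():
        data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    optimizer.step()

# Save checkpoints only on worker 0 to prevent other workers from corrupting them
if hvd.rank() == 0:
    torch.save(model.state_dict(), '/tmp/checkpoint.pt')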

Running Horovod

The example commands below show how to run distributed training. See Run Horovod for more details, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

  1. To run on a machine with 4 GPUs:

    $ horovodrun -np 4 -H localhost:4 python train.py
  2. To run on 4 machines with 4 GPUs each:

    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
  3. To run using Open MPI without the horovodrun wrapper, see Running Horovod with Open MPI.

  4. To run in Docker, see Horovod in Docker.

  5. To run in Kubernetes, see Kubeflow, MPI Operator, Helm Chart, FfDL, and Polyaxon.

  6. To run on Spark, see Horovod on Spark.

  7. To run on Ray, see Horovod on Ray.

  8. To run in Singularity, see Singularity.

  9. To run in a LSF HPC cluster (e.g. Summit), see LSF.

Gloo

Gloo is an open source collective communications library developed by Facebook.

Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.

For environments that support both MPI and Gloo, you can choose to use Gloo at runtime by passing the --gloo argument to horovodrun:

$ horovodrun --gloo -np 2 python train.py

mpi4py

Horovod supports mixing and matching Horovod collectives with other MPI libraries, such as mpi4py, provided that MPI was built with multi-threading support.

You can check for MPI multi-threading support by querying the hvd.mpi_threads_supported() function.

import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Verify that MPI multi-threading is supported.
assert hvd.mpi_threads_supported()

from mpi4py import MPI
assert hvd.size() == MPI.COMM_WORLD.Get_size()

You can also initialize Horovod with an mpi4py sub-communicator, in which case each sub-communicator will run an independent Horovod training.

from mpi4py import MPI
import horovod.tensorflow as hvd

# Split COMM_WORLD into subcommunicators
subcomm = MPI.COMM_WORLD.Split(color=MPI.COMM_WORLD.rank % 2,
                               key=MPI.COMM_WORLD.rank)

# Initialize Horovod
hvd.init(comm=subcomm)

print('COMM_WORLD rank: %d, Horovod rank: %d' % (MPI.COMM_WORLD.rank, hvd.rank()))

Inference

Learn how to optimize your model for inference and remove Horovod operations from the graph here.

Tensor Fusion

One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability to batch small allreduce operations, which results in improved performance. We call this batching feature Tensor Fusion.

See here for full details and tweaking instructions.
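
For example, the fusion buffer size can be adjusted through an environment variable when launching training (a sketch; HOROVOD_FUSION_THRESHOLD is given in bytes, and 64 MB here is only an illustrative value):

$ HOROVOD_FUSION_THRESHOLD=67108864 horovodrun -np 4 python train.py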

Horovod Timeline

Horovod has the ability to record the timeline of its activity, called Horovod Timeline.

Horovod Timeline

Use Horovod timeline to analyze Horovod performance. See here for full details and usage instructions.
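
For example, a timeline can be recorded by passing a file path to horovodrun (a minimal sketch; the resulting JSON file can be opened in a Chromium-based browser's tracing viewer):

$ horovodrun -np 4 --timeline-filename /path/to/timeline.json python train.py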

Automated Performance Tuning

Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve a good amount of trial and error. We provide a system to automate this performance optimization process called autotuning, which you can enable with a single command line argument to horovodrun.

See here for full details and usage instructions.
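
For example, autotuning can be enabled like this (a minimal sketch):

$ horovodrun -np 4 --autotune python train.py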

Guides

  1. Run distributed training in Microsoft Azure using Batch AI and Horovod.
  2. Distributed model training using Horovod.

Send us links to any user guides you want to publish on this site.

Troubleshooting

See Troubleshooting and submit a ticket if you can't find an answer.

Citation

Please cite Horovod in your publications if it helps your research:

@article{sergeev2018horovod,
  Author = {Alexander Sergeev and Mike Del Balso},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  Year = {2018}
}

Publications

1. Sergeev, A., Del Balso, M. (2017) Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow. Retrieved from https://eng.uber.com/horovod/

2. Sergeev, A. (2017) Horovod - Distributed TensorFlow Made Easy. Retrieved from https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy

3. Sergeev, A., Del Balso, M. (2018) Horovod: fast and easy distributed deep learning in TensorFlow. Retrieved from arXiv:1802.05799

References

The Horovod source code was based on the Baidu tensorflow-allreduce repository written by Andrew Gibiansky and Joel Hestness. Their original work is described in the article Bringing HPC Techniques to Deep Learning.

Getting Involved

Comments
  • Trying to install Horovod from a fresh conda environment (with tensorflow) and nothing seems to work

    Environment:

    1. Framework: (TensorFlow)
    2. Framework version:
    3. Horovod version: 0.19.5
    4. MPI version: -
    5. CUDA version: -
    6. NCCL version: -
    7. Python version: 3.6
    8. OS and version: Ubuntu
    9. GCC version: -

    Your question: Please ask your question here.

    Looked through all the available open questions. Currently trying to run go-explore (https://github.com/uber-research/go-explore/tree/master/policy_based) and I have only managed to make horovod work once for whatever reason.

    I need it built with tensorflow (aka horovod.tensorflow), and when I try to force the tensorflow flag during installation I get a 10-page log dump from which it is hard to discern what it actually needs.

    How do I get horovod running?

    I'm not sure what I'm doing wrong; I've tried everything else.

    question 
    opened by illumidas-agn 98
  • Meet error when run examples keras_mnist.py in horovod V0.9.10

    I ran the example keras_mnist.py in horovod/examples.

    mpirun --allow-run-as-root --mca btl_tcp_if_include eno1 -np 2 -x LD_LIBRARY_PATH -H 192.168.12.50:1,192.168.12.49:1 python /home/dyc/horovod/examples/keras_mnist.py

    But I got the warning message below, and it can't continue to run.

    WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: HorovodBroadcast_TFOptimizer_iterations_0 [ready ranks: 0], HorovodBroadcast_iterations_0 [ready ranks: 1]

    What's wrong with it? Could you please help me find a way to solve this warning? Thank you very much in advance.

    question 
    opened by dyoung23 73
  • horovod can't work in distributed.

    horovod (0.11.1) Keras (2.1.1) tensorflow-gpu (1.4.0) openmpi(3.0.0)

    It works by using the following cmd in an single machine:

    mpirun -np 1 \
        -H 192.168.2.243:1 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
        python3 keras_mnist_advanced.py
    
    x_train shape: (60000, 28, 28, 1)
    60000 train samples
    10000 test samples
    Epoch 1/24
    467/468 [============================>.] - ETA: 0s - loss: 0.6053 - acc: 0.8081mtk:21122:21129 [0] INFO NET : Using interface enp0s31f6:192.168.2.243<0>
    mtk:21122:21129 [0] INFO NET/IB : Using interface enp0s31f6 for sideband communication
    mtk:21122:21129 [0] INFO Using internal Network Socket
    mtk:21122:21129 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
    mtk:21122:21129 [0] INFO NET : Using interface enp0s31f6:192.168.2.243<0>
    mtk:21122:21129 [0] INFO NET/Socket : 1 interfaces found
    NCCL version 2.1.2+cuda8.0
    mtk:21122:21129 [0] INFO Using 256 threads
    mtk:21122:21129 [0] INFO Min Comp Cap 6
    mtk:21122:21129 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
    469/468 [==============================] - 8s 16ms/step - loss: 0.6027 - acc: 0.8088 - val_loss: 0.0706 - val_acc: 0.9776
    Epoch 2/24
    469/468 [==============================] - 7s 15ms/step - loss: 0.2589 - acc: 0.9224 - val_loss: 0.0469 - val_acc: 0.9854
    Epoch 3/24
    469/468 [==============================] - 7s 14ms/step - loss: 0.2044 - acc: 0.9385 - val_loss: 0.0376 - val_acc: 0.9892
    Epoch 4/24
    469/468 [==============================] - 7s 14ms/step - loss: 0.1818 - acc: 0.9460 - val_loss: 0.0362 - val_acc: 0.9880
    Epoch 5/24
    469/468 [==============================] - 7s 14ms/step - loss: 0.1584 - acc: 0.9520 - val_loss: 0.0291 - val_acc: 0.9909
    
    

    But it doesn't seem to work using the following cmd on two machines:

    
    (dp) [email protected]:~/Desktop/horovod/examples$ mpirun -np 2 \
    >     -H 192.168.2.243:1,192.168.3.246:1 \
    >     -bind-to none -map-by slot \
    >     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    >     python3 keras_mnist_advanced.py
    
    
    

    It seems the program hangs up and output nothing.

    enp0s31f6 Link encap:Ethernet  HWaddr 10:7b:44:16:20:8b  
              inet addr:192.168.2.243  Bcast:192.168.2.255  Mask:255.255.255.0
              inet6 addr: fe80::1de5:985:b555:96d1/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:58080420 errors:0 dropped:0 overruns:0 frame:0
              TX packets:69461209 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:64689742509 (64.6 GB)  TX bytes:87891226474 (87.8 GB)
              Interrupt:16 Memory:f7100000-f7120000 
    
    lo        Link encap:Local Loopback  
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:26940 errors:0 dropped:0 overruns:0 frame:0
              TX packets:26940 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:26761208 (26.7 MB)  TX bytes:26761208 (26.7 MB)
    

    https://github.com/uber/horovod/issues/106 said that the reason(mpirun will hang, no error, no output) is mpi don't known which card to use. So I try to use

    mpirun -np 2 \
        -H 192.168.2.243:1,192.168.3.246:1 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
        -mca btl_tcp_if_include enp0s31f6 \
        python3 keras_mnist_advanced.py
    
    

    But it still doesn't work (mpirun still hangs, no error, no output, no threads run). Is something wrong? Can someone help me?

    question 
    opened by BIGBALLON 63
  • WARNING: One or more tensors were submitted to be reduced, gathered

    Hi, all

    I got a warning like this. I believe it slows down the GPU calculation. I am using 4 nodes and 4 GPUs. Any suggestions will be highly welcome!

    Thank you!

    WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: HorovodAllreduce_gradients_197_model_0_0_f1_l_last_add_1_grad_Reshape_1_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_last_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_last_mul_2_grad_Reshape_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_2_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0] HorovodAllreduce_gradients_197_model_0_0_f1_l_1_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0] HorovodAllreduce_gradients_197_AddN_2_0 [missing ranks: 0]

    question wontfix 
    opened by winwinJJiang 55
  • Add Pytorch Lightning spark estimator

    TorchEstimator(model) can be either a traditional PyTorch module or a PyTorch Lightning LightningModule.

    When PyTorch Lightning is used, the optimizer and loss params must be omitted.

    Checklist before submitting

    • [x] Did you read the contributor guide?
    • [x] Did you update the docs?
    • [x] Did you write any tests to validate this change?
    • [x] Did you update the CHANGELOG, if this change affects users?

    Description

    Fixes # (issue).

    Review process to land

    1. All tests and other checks must succeed.
    2. At least one member of the technical steering committee must review and approve.
    3. If any member of the technical steering committee requests changes, they must be addressed.
    opened by irasit 53
  • Add GitHub Workflow CI

    A GitHub Workflow for running the CPU test pipeline.

    CPU tests finish in 1:30 on GitHub, GPU tests finish in 1:15 on Buildkite.

    The workflow file is generated by Python script .github/gen-workflow-ci.py based on the output of .buildkite/gen-pipeline.sh, so that we implement test logic only in one place.

    These are the implemented features:

    • builds all CPU images and runs appropriate tests
    • builds GPU images but does not run any tests
    • tests for GPU images are run on Buildkite after all above builds and tests succeed
    • tests three images on macOS (replaces TravisCI)
    • images where all frameworks are nightly versions do not prevent the Buildkite tests
    • PRs without code changes (e.g. docs only) do not run the full build and test cycle (that is what the Buildkite pipeline does)
    • test results are downloaded from Buildkite and together with GitHub test results posted to the pull request
    • build and test steps have timeouts and are retried three times
    • workflow detects when its yaml file is not up-to-date
    • earlier workflow runs on the same branch are cancelled
    • triggering the Buildkite builds and publishing test results for fork PRs (can only be tested after merging to master)
    • daily master build schedule
    • Docker images are built on PR (not pushed), as well as on master and version tags (and pushed to Dockerhub), but only if the build and test jobs succeed
    • Docs and example sync workflows integrated into the CI workflow

    Limitations:

    • Buildkite build always has author "Travis Addair" and build title "GPU Tests triggered by GitHub": we may use a service account here.
    • ONECCL image does not build and is excluded (af2f3bbf): https://github.com/horovod/horovod/runs/2221724507?check_suite_focus=true (#2846)
    • GitHub only provides 20 consecutive jobs, so concurrent workflows may block new runs. There can be two concurrent CI workflows, which allows two more jobs.

    Missing:

    • disable Buildkite triggered by git pushes (with this merged into master the buildkite pipeline is empty)
    opened by EnricoMi 51
  • Some error while running pytorch_mnist.py

    machine configuration: OS: CentOS 7, Python: 3.6, PyTorch: 0.4.0, NCCL: 2.2.13, Open MPI: 3.1.2, GPUs: 8 Titan Xp per machine

    my run shell script: mpirun -np 8
    -H 10.141.202.75:8
    -bind-to none -map-by slot
    -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_HCA=mlx4_0 -x NCCL_IB_DISABLE=1
    -mca pml ob1
    python examples/pytorch_mnist.py

    ERROR I have encountered:

    1. I have tested tensorflow_mnist.py and keras_mnist.py, even on two machines with 16 GPUs, and my run shell script is just the same as above (with -np 16 and another machine with 8 GPUs added). Both TensorFlow and Keras run successfully (I have tested many, many times and have no problems).

    2. BUT when I tested PyTorch (pytorch_mnist.py) with Horovod, I encountered a VERY FATAL ERROR; no matter whether on 8 GPUs with a single machine or 16 GPUs with two machines, the SAME errors appeared!

    ERROR log are as follows: A process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.

    The process that invoked fork was:

    Local host: [[6927,1],4] (PID 26036)

    If you are absolutely sure that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0.

    [nmyjs-202-75:26027] [[6927,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501 *** Error in `python': munmap_chunk(): invalid pointer: 0x0000558febe409e0 *** ======= Backtrace: ========= /usr/lib64/libc.so.6(+0x7ab54)[0x7f3678d3fb54] /usr/lib64/libcuda.so.1(+0x1e3835)[0x7f3664928835] /usr/lib64/libcuda.so.1(+0x248aaf)[0x7f366498daaf] /usr/lib64/libcuda.so.1(+0x1e5180)[0x7f366492a180] /usr/lib64/libpthread.so.0(+0x7e25)[0x7f367908fe25] /usr/lib64/libc.so.6(clone+0x6d)[0x7f3678dbd34d] ======= Memory map: ======== 200000000-200200000 rw-s 00000000 00:05 60924 /dev/nvidiactl 200200000-200400000 ---p 00000000 00:00 0 200400000-200404000 rw-s 00000000 00:05 60924 /dev/nvidiactl 200404000-200600000 ---p 00000000 00:00 0 200600000-200a00000 rw-s 00000000 00:05 60924 /dev/nvidiactl 200a00000-201800000 ---p 00000000 00:00 0 201800000-201804000 rw-s 00000000 00:05 60924 /dev/nvidiactl 201804000-201a00000 ---p 00000000 00:00 0 201a00000-201e00000 rw-s 00000000 00:05 60924 /dev/nvidiactl 201e00000-202c00000 ---p 00000000 00:00 0 202c00000-202c04000 rw-s 00000000 00:05 60924 /dev/nvidiactl 202c04000-202e00000 ---p 00000000 00:00 0 202e00000-203200000 rw-s 00000000 00:05 60924 /dev/nvidiactl 203200000-204000000 ---p 00000000 00:00 0 204000000-204004000 rw-s 00000000 00:05 60924 /dev/nvidiactl 204004000-204200000 ---p 00000000 00:00 0 204200000-204600000 rw-s 00000000 00:05 60924 /dev/nvidiactl 204600000-205400000 ---p 00000000 00:00 0 205400000-205404000 rw-s 00000000 00:05 60924 /dev/nvidiactl 205404000-205600000 ---p 00000000 00:00 0 205600000-205a00000 rw-s 00000000 00:05 60924 /dev/nvidiactl 205a00000-206800000 ---p 00000000 00:00 0 206800000-206804000 rw-s 00000000 00:05 60924 /dev/nvidiactl 206804000-206a00000 ---p 00000000 00:00 0 206a00000-206e00000 rw-s 00000000 00:05 60924 /dev/nvidiactl 206e00000-207c00000 ---p 00000000 00:00 0 207c00000-207c04000 rw-s 00000000 00:05 60924 /dev/nvidiactl 207c04000-207e00000 ---p 00000000 00:00 0 207e00000-208200000 rw-s 00000000 00:05 60924 /dev/nvidiactl 208200000-209000000 ---p 00000000 00:00 0 209000000-209004000 rw-s 00000000 00:05 60924 /dev/nvidiactl 209004000-209200000 ---p 00000000 00:00 0 209200000-209600000 rw-s 00000000 00:05 60924 /dev/nvidiactl 209600000-20a400000 ---p 00000000 00:00 0 20a400000-20a404000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20a404000-20a600000 ---p 00000000 00:00 0 20a600000-20aa00000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20aa00000-20aa04000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20aa04000-20ac00000 ---p 00000000 00:00 0 20ac00000-20b000000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20b000000-20b004000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20b004000-20b200000 ---p 00000000 00:00 0 20b200000-20b600000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20b600000-20b604000 rw-s 00000000 00:05 60924 /dev/nvidiactl 20b604000-20b800000 ---p 00000000 00:00 0 20b800000-20bc00000 rw-s 00000000 00:05 60924 [nmyjs-202-75:26039] *** Process received signal *** [nmyjs-202-75:26039] Signal: Aborted (6) [nmyjs-202-75:26039] Signal code: (-6) [nmyjs-202-75:26037] *** Process received signal *** [nmyjs-202-75:26037] Signal: Aborted (6) [nmyjs-202-75:26037] Signal code: (-6) [nmyjs-202-75:26037] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f36790975e0] [nmyjs-202-75:26037] [ 1] [nmyjs-202-75:26039] [ 0] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f3678cfa1f7] [nmyjs-202-75:26037] [ 2] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f0e8d18f5e0] [nmyjs-202-75:26039] [ 1] /usr/lib64/libc.so.6(abort+0x148)[0x7f3678cfb8e8] 
[nmyjs-202-75:26037] [ 3] /usr/lib64/libc.so.6(+0x74f47)[0x7f3678d39f47] [nmyjs-202-75:26037] [ 4] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f0e8cdf21f7] [nmyjs-202-75:26039] [ 2] /usr/lib64/libc.so.6(+0x7ab54)[0x7f3678d3fb54] [nmyjs-202-75:26037] [ 5] /usr/lib64/libcuda.so.1(+0x1e3835)[0x7f3664928835] [nmyjs-202-75:26037] [ 6] /usr/lib64/libc.so.6(abort+0x148)[0x7f0e8cdf38e8] [nmyjs-202-75:26039] [ 3] /usr/lib64/libcuda.so.1(+0x248aaf)[0x7f366498daaf] [nmyjs-202-75:26037] [ 7] /usr/lib64/libcuda.so.1(+0x1e5180)[0x7f366492a180] [nmyjs-202-75:26037] /usr/lib64/libc.so.6(+0x74f47)[0x7f0e8ce31f47] [nmyjs-202-75:26039] [ 4] [ 8] /usr/lib64/libpthread.so.0(+0x7e25)[0x7f367908fe25] [nmyjs-202-75:26037] [ 9] /usr/lib64/libc.so.6(clone+0x6d)[0x7f3678dbd34d] [nmyjs-202-75:26037] *** End of error message *** /usr/lib64/libc.so.6(+0x7ab54)[0x7f0e8ce37b54] [nmyjs-202-75:26039] [ 5] /usr/lib64/libcuda.so.1(+0x1e3835)[0x7f0e78a20835] [nmyjs-202-75:26039] [ 6] /usr/lib64/libcuda.so.1(+0x248aaf)[0x7f0e78a85aaf] [nmyjs-202-75:26039] [ 7] /usr/lib64/libcuda.so.1(+0x1e5180)[0x7f0e78a22180] [nmyjs-202-75:26039] [ 8] /usr/lib64/libpthread.so.0(+0x7e25)[0x7f0e8d187e25] [nmyjs-202-75:26039] [ 9] /usr/lib64/libc.so.6(clone+0x6d)[0x7f0e8ceb534d] [nmyjs-202-75:26039] *** End of error message ***

    Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


    mpirun noticed that process rank 5 with PID 0 on node nmyjs-202-75 exited on signal 6 (Aborted).

    [nmyjs-202-75:26027] 6 more processes have sent help message help-opal-runtime.txt / opal_init:warn-fork [nmyjs-202-75:26027] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

    So why does just PyTorch with Horovod have problems, while TensorFlow and Keras can run OK?

    question wontfix 
    opened by NO2-yh 51
  • Horovod support for MXNet framework

    This is a PR for adding Horovod support for distributed training with the MXNet deep learning framework. The core Allreduce and Broadcast functionality passes unit tests. @ctcyang worked on the design and implementation of the core functionality. @yuxihu and @apeforest worked on addressing reviewers’ comments, simplifying and improving code logic, fixing build issues, implementing error handling mechanisms, and improving two examples using MXNet with Horovod for distributed training. We can consistently train ResNet-50 to top-1 accuracy between 75% and 77% with different training settings.

    Our performance results are here showing throughput (scaling efficiency) with and without hierarchical allreduce (HA) on ResNet-50 with float32:

    # gpus | Without HA |   With HA
    ---------------------------------
       8   |  3072  (NA)|  3078  (NA)
      16   |  6027 (98%)|  5859 (95%)
      32   | 12030 (98%)| 11675 (95%)
      64   | 22346 (83%)| 23166 (94%)
     128   | 40938 (84%)| 45972 (93%)
     256   | 64998 (66%)| 89858 (91%)
    

    We have completed and merged the PR to make necessary changes on the MXNet for Horovod support. You can see the PR here: https://github.com/apache/incubator-mxnet/pull/12666

    Design document

    opened by ctcyang 50
  • Reproduce the example benchmarks

    Will you be making the benchmark programs available in the examples section (Resnet, Inception, VGGNet)? That would enable us to reproduce your numbers. Thanks.

    question 
    opened by jimdowling 50
  • Running Pytorch with Horovod

    I am trying to run the resnet50 example with PyTorch and Horovod on a cluster. I used the following command in a slurm script:

    mpirun -np 2 -npernode 1 -x NCCL_DEBUG=INFO python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/

    I am trying to run it on two nodes, each having one GPU.

    I am getting the following message:

    Train Epoch #1: 0%| | 0/20019 [00:00<?, ?it/s][node03:62076] *** Process received signal *** [node03:62076] Signal: Segmentation fault (11) [node03:62076] Signal code: Address not mapped (1) [node03:62076] Failing at address: 0x55d9ad6308a8 [node03:62076] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fd3c65f35e0] [node03:62076] [ 1] /usr/lib64/libc.so.6(+0x7d4a6)[0x7fd3c5b954a6] [node03:62076] [ 2] /usr/lib64/libc.so.6(__libc_malloc+0x4c)[0x7fd3c5b9810c] [node03:62076] [ 3] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyThread_allocate_lock+0x16)[0x7fd3c690f9f6] [node03:62076] [ 4] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyThread_ReInitTLS+0x1f)[0x7fd3c690feef] [node03:62076] [ 5] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyOS_AfterFork+0x45)[0x7fd3c6916025] [node03:62076] [ 6] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x11b1b9)[0x7fd3c691b1b9] [node03:62076] [ 7] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x774c)[0x7fd3c68e26fc] [node03:62076] [ 8] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9] [node03:62076] [ 9] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a] [node03:62076] [10] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] [11] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x5763d)[0x7fd3c685763d] [node03:62076] [12] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] [13] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0xa1584)[0x7fd3c68a1584] [node03:62076] [14] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x9de3b)[0x7fd3c689de3b] [node03:62076] [15] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] [16] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9)[0x7fd3c68deb69] [node03:62076] [17] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x7fee)[0x7fd3c68e2f9e] [node03:62076] [18] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9] [node03:62076] [19] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a] [node03:62076] [20] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] [21] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x5763d)[0x7fd3c685763d] [node03:62076] [22] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] [23] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0xa1584)[0x7fd3c68a1584] [node03:62076] [24] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x9de3b)[0x7fd3c689de3b] [node03:62076] [25] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] [26] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9)[0x7fd3c68deb69] [node03:62076] [27] 
/home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9] [node03:62076] [28] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a] [node03:62076] [29] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3] [node03:62076] *** End of error message *** [node02:20957] *** Process received signal *** [node02:20957] Signal: Segmentation fault (11)

    Exception in thread Thread-2 (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 120, in _worker_manager_loop File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 376, in get File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'loads' Exception in thread QueueFeederThread (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed <type 'exceptions.TypeError'>: 'NoneType' object is not callable Exception in thread QueueFeederThread (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed <type 'exceptions.TypeError'>: 'NoneType' object is not callable Exception in thread QueueFeederThread (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed <type 'exceptions.TypeError'>: 'NoneType' object is not callable

    I had a very good experience with Horovod and TF. I am trying PyTorch now, and used the same script that I used to run AlexNet in TF on 256 GPUs. Do I need special flags for PyTorch + Horovod?

    question wontfix 
    opened by abidmalikwaterloo 49
  • Performance tuning parameters for Horovod-TensorFlow benchmarks

    Hi all, can you recommend a set of tuning parameters to get the highest performance (throughput in images/sec) using the Horovod-TensorFlow benchmarks? I got the result below and was wondering if there is room for more improvement:

    mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 --mca plm_rsh_args "-p 50000" -x python tf_cnn_benchmarks.py --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --model=resnet50 --batch_size=128 --device=gpu --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --display_every=1000 --use_fp16=False --local_parameter_device=gpu --variable_update=horovod

    total images/sec: 2047.95

    question 
    opened by vilmara 49
  • TF/Keras 2.11 isn’t currently working with `KerasEstimator` in horovod 0.26.1 even using legacy optimizer

    Environment:

    1. Framework: keras
    2. Framework version: 2.11
    3. Horovod version: 0.26.1
    4. MPI version: 4.1.4
    5. CUDA version: 11.0.3-1
    6. NCCL version: 2.10.3-1
    7. Python version: 3.8

    Bug report: With keras=2.11 and horovod 0.26.1, horovod.spark.keras.KerasEstimator doesn't work even when using the legacy optimizer. It fails with the following error message:

    Traceback (most recent call last):
    [1,2]<stderr>:  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    [1,2]<stderr>:    return _run_code(code, main_globals, None,
    [1,2]<stderr>:  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    [1,2]<stderr>:    exec(code, run_globals)
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module>
    [1,2]<stderr>:    main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main
    [1,2]<stderr>:    task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/task/__init__.py", line 61, in task_exec
    [1,2]<stderr>:    result = fn(*args, **kwargs)
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 136, in train
    [1,2]<stderr>:    model = deserialize_keras_model(
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 299, in deserialize_keras_model
    [1,2]<stderr>:    return load_model_fn(f)
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 137, in <lambda>
    [1,2]<stderr>:    serialized_model, lambda x: hvd.load_model(x))
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/tensorflow/keras/__init__.py", line 274, in load_model
    [1,2]<stderr>:    return _impl.load_model(keras, wrap_optimizer, _OPTIMIZER_MODULES, filepath, custom_optimizers, custom_objects)
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/_keras/__init__.py", line 272, in load_model
    [1,2]<stderr>:    return keras.models.load_model(filepath, custom_objects=horovod_objects)
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    [1,2]<stderr>:    raise e.with_traceback(filtered_tb) from None
    [1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/tensorflow/keras/__init__.py", line 273, in <lambda>
    [1,2]<stderr>:    return lambda **kwargs: DistributedOptimizer(cls(**kwargs), compression=compression)
    [1,2]<stderr>:ValueError[1,2]<stderr>:: decay is deprecated in the new Keras optimizer, pleasecheck the docstring for valid arguments, or use the legacy optimizer, e.g., tf.keras.optimizers.legacy.Adadelta.
    

    We found this PR seems to solve the issue, and if we install horovod from master it works. Given this, could we make a patch release that includes the linked PR?

    bug 
    opened by wenfeiy-db 0
  • CI: Build pipelines for macOS fail

    It looks like pyenv fails to build Python 3.7.7 because it can't find the clang++ compiler (but the log is incomplete, there may be more).

    Installing Python-3.7.7...
    patching file Misc/NEWS.d/next/macOS/2020-06-24-13-51-57.bpo-41100.mcHdc5.rst
    patching file configure
    Hunk #1 succeeded at 3374 (offset -52 lines).
    patching file configure.ac
    Hunk #1 succeeded at 490 (offset -20 lines).
    python-build: use tcl-tk from homebrew
    python-build: use readline from homebrew
    python-build: use zlib from xcode sdk
    
    BUILD FAILED (OS X 12.6.2 using python-build 20180424)
    
    Inspect or clean up the working tree at /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/python-build.20230106010328.4626
    Results logged to /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/python-build.20230106010328.4626.log
    
    Last 10 log lines:
    checking for --with-cxx-main=<compiler>... no
    checking for clang++... no
    configure:
    
      By default, distutils will build C++ extension modules with "clang++".
      If this is not intended, then set CXX on the configure command line.
      
    checking for the platform triplet based on compiler characteristics... darwin
    configure: error: internal configure error for the platform triplet, please file a bug report
    make: *** No targets specified and no makefile found.  Stop.
    

    Similar comments in https://github.com/pyenv/pyenv/issues/2143

    bug 
    opened by maxhgerlach 0
  • TF 2.11.0 (mixed_float16): 'LossScaleOptimizerV3' object has no attribute

    Environment:

    1. Framework: TensorFlow
    2. Framework version:2.11.0
    3. Horovod version:0.26.1
    4. MPI version:4.1.4
    5. CUDA version:11.6
    6. NCCL version:2.11.4-1
    7. Python version:3.8
    8. OS and version: Ubuntu 20.04

    Bug report: When I run training in TensorFlow 2.11.0 with mixed_float16 and Horovod, I get the following error message:

    [1,0]<stderr>:Traceback (most recent call last):
    [1,0]<stderr>:  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    [1,0]<stderr>:    return _run_code(code, main_globals, None,
    [1,0]<stderr>:  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    [1,0]<stderr>:    exec(code, run_globals)
    [1,0]<stderr>:  File "/home/bruno/erx-ai/src/erxai/tf_train/tf_train.py", line 920, in <module>
    [1,0]<stderr>:    main(sys.argv[1:])
    [1,0]<stderr>:  File "/home/bruno/erx-ai/src/erxai/tf_train/tf_train.py", line 899, in main
    [1,0]<stderr>:    tf_train_semantic.run_train()
    [1,0]<stderr>:  File "/home/bruno/erx-ai/src/erxai/tf_train/tf_train.py", line 625, in run_train
    [1,0]<stderr>:    self.model.fit(
    [1,0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    [1,0]<stderr>:    raise e.with_traceback(filtered_tb) from None
    [1,0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/_keras/callbacks.py", line 53, in on_batch_end
    [1,0]<stderr>:    hvd.broadcast_variables(self.model.optimizer.variables(),
    [1,0]<stderr>:AttributeError: 'LossScaleOptimizerV3' object has no attribute '_variables'
    
    bug 
    opened by RicoOscar 1
  • Unable to use GPU on 2nd machine

    Hi, I have set up Horovod on a k8s cluster with 2 GPU nodes using spark-operator. I have executed the mnist example using TensorFlow, and it ran successfully on both nodes (utilizing the GPUs on both nodes). However, when I use KerasEstimator on Spark, the training executes successfully but I think only one GPU is getting used.

    I am following this example: https://docs.databricks.com/_static/notebooks/deep-learning/horovod-spark-estimator-keras.html

    here are the logs:

    [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Bootstrap : Using eth0:10.84.52.31<0> [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/IB : No device found. [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.52.31<0> [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Using network Socket [1,0]:NCCL version 2.11.4+cuda11.4 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Bootstrap : Using eth0:10.84.179.52<0> [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/IB : No device found. [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.179.52<0> [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Using network Socket [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00/02 : 0 1 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01/02 : 0 1 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [receive] via NET/Socket/0 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [receive] via NET/Socket/0 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [send] via NET/Socket/0 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [send] via NET/Socket/0 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [receive] via NET/Socket/0 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [receive] via NET/Socket/0 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [send] via NET/Socket/0 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [send] via NET/Socket/0 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all rings [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all trees [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO 2 coll channels, 2 
p2p channels, 1 p2p channels per peer [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO comm 0x7fd2247488e0 rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE [1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Launch mode Parallel [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all rings [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all trees [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer [1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO comm 0x7fad647478a0 rank 1 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE [1,0]: [1,1]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0086s vs on_train_batch_end time: 0.0658s). Check your callbacks. [1,0]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0053s vs on_train_batch_end time: 0.0687s). Check your callbacks. 1/4851 [..............................] - ETA: 8:35:39 - loss: 1.0356 - accuracy: 0.4844[1,0]:  9/4851 [..............................] - ETA: 30s - loss: 0.9629 - accuracy: 0.4219 [1,0]:
    17/4851 [..............................] - ETA: 31s - loss: 0.9131 - accuracy: 0.4265[1,0]:
    24/4851 [..............................] - ETA: 33s - loss: 0.8747 - accuracy: 0.4421[1,0]:
    31/4851 [..............................] - ETA: 34s - loss: 0.8364 - accuracy: 0.4768[1,0]:
    39/4851 [..............................] - ETA: 34s - loss: 0.7905 - accuracy: 0.5445[1,0]:
    48/4851 [..............................] - ETA: 32s - loss: 0.7389 - accuracy: 0.6286[1,0]:
    56/4851 [..............................] - ETA: 32s - loss: 0.6957 - accuracy: 0.6816[1,0]:
    64/4851 [..............................] - ETA: 32s - loss: 0.6540 - accuracy: 0.7214[1,0]:
    71/4851 [..............................] - ETA: 32s - loss: 0.6205 - accuracy: 0.7489[1,0]:
    79/4851 [..............................] - ETA: 32s - loss: 0.5844 - accuracy: 0.7743[1,0]:
    87/4851 [..............................] - ETA: 32s - loss: 0.5504 - accuracy: 0.7951[1,0]:
    95/4851 [..............................] - ETA: 32s - loss: 0.5194 - accuracy: 0.8123[1,0]:
    103/4851 [..............................] - ETA: 32s - loss: 0.4912 - accuracy: 0.8269[1,0]:
    112/4851 [..............................] - ETA: 31s - loss: 0.4623 - accuracy: 0.8408[1,0]:
    121/4851 [..............................] - ETA: 31s - loss: 0.4364 - accuracy: 0.8525[1,0]:
    131/4851 [..............................] - ETA: 30s - loss: 0.4106 - accuracy: 0.8637[1,0]:
    140/4851 [..............................] - ETA: 30s - loss: 0.3886 - accuracy: 0.8724[1,0]:
    148/4851 [..............................] - ETA: 30s - loss: 0.3706 - accuracy: 0.8793[1,0]:
    156/4851 [..............................] - ETA: 30s - loss: 0.3542 - accuracy: 0.8855[1,0]:
    164/4851 [>.............................] - ETA: 30s - loss: 0.3388 - accuracy: 0.8911[1,0]:
    172/4851 [>.............................] - ETA: 30s - loss: 0.3246 - accuracy: 0.8962[1,0]:
    180/4851 [>.............................] - ETA: 30s - loss: 0.3116 - accuracy: 0.9008[1,0]:
    188/4851 [>.............................] - ETA: 30s - loss: 0.2994 - accuracy: 0.9050[1,0]:
    196/4851 [>.............................] - ETA: 30s - loss: 0.2882 - accuracy: 0.9089[1,0]:
    204/4851 [>.............................] - ETA: 30s - loss: 0.2778 - accuracy: 0.9125[1,0]:
    212/4851 [>.............................] - ETA: 30s - loss: 0.2680 - accuracy: 0.9158[1,0]:
    220/4851 [>.............................] - ETA: 30s - loss: 0.2588 - accuracy: 0.9188[1,0]:
    [truncated Keras training progress output from the reported run]

    bug 
    opened by obaid1922 0
  • Support sample weight col in Keras estimator

    Support sample weight col in Keras estimator

    Is your feature request related to a problem? Please describe. It seems we add the sample_weight_col in the data loading of the remote function, but I do not see the trainer actually using it in the fit function here.

    Describe the solution you'd like. According to this link, the proper way of doing this would be model.fit(x_train, y_train, sample_weight=sample_weight, batch_size=64, epochs=5).

    enhancement 
    opened by irasit 0
  • CI: Docker image build for horovod-nvtabular runs into time out

    CI: Docker image build for horovod-nvtabular runs into time out

    Ex.: https://github.com/horovod/horovod/actions/runs/3656223581/jobs/6181962659

    #12 151.5 Solving environment: ...working... 
    Error: The action has timed out.
    

    If Docker time stamps in that log are to be believed, then it seems to hang there at the Conda environment stage for > 50 minutes.

    bug 
    opened by maxhgerlach 0
Releases(v0.26.1)
  • v0.26.1(Oct 14, 2022)

  • v0.26.0(Oct 13, 2022)

    Added

    • Spark Estimator: Added support for custom data loaders in KerasEstimator. (#3603)
    • Spark Estimator: Added NVTabular data loader for KerasEstimator. (#3603)
    • Spark Estimator: Added gradient accumulation support to Spark torch estimator. (#3681)
    • TensorFlow: Added register_local_var functionality to distributed optimizers and local gradient aggregators. (#3695)
    • TensorFlow: Added support for local variables for BroadcastGlobalVariablesCallback. (#3703)
    • Enabled use of native ncclAvg op for NCCL allreduces. (#3646)
    • Added support for additional reduction operations for allreduce (min, max, product). (#3660)
    • Added 2D torus allreduce using NCCL. (#3608)
    • Added support for Petastorm reader level parallel shuffling. (#3665)
    • Added random seed support for Lightning datamodule to generate reproducible data loading outputs. (#3665)
    • Added support for int8 and uint8 allreduce and grouped_allreduce in TensorFlow. (#3649)
    • Added support for batched memory copies in GPUAllgather. (#3590)
    • Added support for batched memory copies in GPUReducescatter. (#3621)
    • Added hvd.grouped_allgather() and hvd.grouped_reducescatter() operations. (#3594)
    • Added warning messages if output tensor memory allocations fail. (#3594)
    • Added register_local_source and use_generic_names functionality to DistributedGradientTape. (#3628)
    • Added PartialDistributedGradientTape() API for model parallel use cases. (#3643)
    • Spark/Lightning: Added reader_worker_count and reader_pool_type. (#3612)
    • Spark/Lightning: Added transformation_edit_fields and transformation_removed_fields param for EstimatorParams. (#3651)
    • TensorFlow: Added doc string for hvd.grouped_allreduce(). (#3594)
    • ROCm: Enabled alltoall. (#3654)

    Changed

    • Default Petastorm reader pool is changed from process to thread for lower memory usage. (#3665)
    • Keras: Support only legacy optimizers in Keras 2.11+. (#3725)
    • Gloo: When negotiating, use gather rather than allgather. (#3633)
    • Use packaging.version instead of distutils version classes. (#3700)

    Deprecated

    • Deprecated field shuffle_buffer_size from EstimatorParams. Use shuffle to enable or disable shuffling. (#3665)

    Removed

    • Build: Removed std::regex use for better cxxabi11 compatibility. (#3584)

    Fixed

    • TensorFlow: Fixed the optimizer iteration increments when backward_passes_per_step > 1. (#3631)
    • Fixed FuseResponses() on BATCHED_D2D_PADDING edge cases for Reducescatter and/or ROCm. (#3621)
    • PyTorch: Fixed Reducescatter functions to raise HorovodInternalError rather than RuntimeError. (#3594)
    • PyTorch on GPUs without GPU operations: Fixed grouped allreduce to set CPU device in tensor table. (#3594)
    • Fixed race condition in PyTorch allocation handling. (#3639)
    • Build: Fixed finding nvcc (if not in $PATH) with older versions of CMake. (#3682)
    • Fixed reducescatter() and grouped_reducescatter() to raise clean exceptions for scalar inputs. (#3699)
    • Updated Eigen submodule to fix build on macOS with aarch64. (#3619)
    • Build: Correctly select files in torch/ directory to be hipified. (#3588)
    • Build: Modify regex match for CUDA|ROCm in FindPytorch.cmake. (#3593)
    • Build: Fixed ROCm-specific build failure. (#3630)
  • v0.25.0(Jun 21, 2022)

    Added

    • Added hvd.reducescatter() operation with implementations in NCCL, MPI, and Gloo. (#3299, #3574)
    • Added AMD GPU XLA Op Implementation. (#3486)
    • Added Horovod job to spin up distributed TensorFlow Data Service. (#3525)
    • Spark: Expose random seed as an optional parameter. (#3517)
    • Add Helm Chart. (#3546)
    • Elastic: Add elastic run API. (#3503)
    • Spark Estimator: Expose random seed for model training reproducibility. (#3517)
    • Spark Estimator: Add option whether to use GPUs at all. (#3526)
    • Spark Estimator: Expose parameter to set start method for multiprocessing. (#3580)

    Changed

    • MXNet: Updated allreduce functions to newer op API. (#3299)
    • TensorFlow: Make TensorFlow output allocations asynchronous when using NCCL backend. (#3464)
    • TensorFlow: Clear the locally accumulated gradient by assigning zeros_like so that it is correctly reset instead of growing without bound. (#3505)
    • Make HorovodVersionMismatchError subclass ImportError instead of just a standard Exception. (#3549)
    • Elastic: Catch any exception to prevent the discovery thread from silently dying. (#3436)
    • Horovodrun: Exit check_build (--check-build) via sys.exit to flush stdout. (#3272)
    • Spark: Use env to set environment vars in remote shell. (#3489)
    • Build: Avoid redundant ptx generation for maximum specified compute capability. (#3509)

    Deprecated

    • MXNet: Deprecated average argument of allreduce functions. (#3299)
    • Public and internal APIs: deprecated use of np, min_np, and max_np; use num_proc, min_num_proc, and max_num_proc, respectively, instead. (#3409)
    • Horovodrun: Providing multiple NICs as a comma-separated string via --network-interface is deprecated; use --network-interface multiple times or --network-interfaces instead. (#3506)
    • horovod.run: The network_interface argument with a comma-separated string is deprecated; use network_interfaces with Iterable[str] instead. (#3506)

    Fixed

    • Fallback to NCCL shared lib if static one is not found. (#3500)
    • Spark/Lightning: Added missing transform_spec for Petastorm datamodule. (#3543)
    • Spark/Lightning: Fixed PTL Spark example with checkpoint usage by calling save_hyperparameters(). (#3527)
    • Elastic: Fixed empty hostname returned from HostDiscoveryScript. (#3490)
    • TensorFlow 2.9: Fixed build for API change related to tensorflow_accelerator_device_info. (#3513)
    • TensorFlow 2.10: Bumped build partially to C++17. (#3558)
    • TensorFlow: Fixed gradient update timing in TF AggregationHelperEager. (#3496)
    • TensorFlow: Fixed resource NotFoundError in TF AggregationHelper. (#3499)
  • v0.24.3(Apr 21, 2022)

  • v0.24.2(Mar 10, 2022)

  • v0.24.1(Mar 3, 2022)

  • v0.24.0(Mar 2, 2022)

    Added

    • Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. (#3190)
    • TensorFlow: Added in-place broadcasting of variables. (#3128)
    • Elastic: Added support for resurrecting blacklisted hosts. (#3319)
    • MXNet: Added support for MXNet async dependency engine. (#3242, #2963)
    • Spark/Lightning: Added history to lightning estimator. (#3214)

    Changed

    • Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
    • Moved released Docker image horovod and horovod-cpu to Ubuntu 20.04 and Python 3.8. (#3393)
    • Spark Estimator: Don't shuffle row groups if shuffling is disabled for the training data. (#3369)
    • Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
    • Elastic: Improved handling NCCL errors under elastic scenario. (#3112)
    • Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
    • Make checkpoint name optional so that user can save to h5 format. (#3411)

    Deprecated

    • Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)

    Removed

    • Spark: Removed h5py<3 constraint as this is not needed anymore for Tensorflow >2.5.0. (#3301)

    Fixed

    • Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
    • PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
    • PyTorch: Fixed finalization of ProcessSetTable. (#3351)
    • Fixed remote trainers to point to the correct shared lib path. (#3258)
    • Fixed imports from tensorflow.python.keras with tensorflow 2.6.0+. (#3403)
    • Fixed Adasum communicator init logic. (#3379)
    • Lightning: Fixed resume logger. (#3375)
    • Fixed the checkpoint directory structure for pytorch and pytorch lightning. (#3362)
    • Fixed possible integer overflow in multiplication. (#3368)
    • Fixed the pytorch_lightning_mnist.py example. (#3245, #3290)
    • Fixed barrier segmentation fault. (#3313)
    • Fixed hvd.barrier() tensor queue management. (#3300)
    • Fixed PyArrow "list index out of range" IndexError. (#3274)
    • Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
    • Spark/Lightning: Fixed setting limit_train_batches and limit_val_batches. (#3237)
    • Elastic: Fixed ElasticSampler and hvd.elastic.state losing some indices of processed samples when nodes dropped. (#3143)
    • Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
    • Ray: Fixed RayExecutor to fail when num_workers=0 and num_hosts=None. (#3210)
    • Spark/Lightning: Fixed checkpoint callback dirpath typo. (#3204)
  • v0.23.0(Oct 6, 2021)

    Added

    • Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet; a brief usage sketch follows this list. (#2839, #3042, #3043, #3054, #3083, #3090)

    • Added XLA support for Allreduce via tf.function(jit_compile=True). (#3053)

    • Added fused buffer scaling and unpack/pack kernels on GPU. (#2973)

    • Added support for NCCL on CUDA 11.4. (#3182)

    • Added fp16 compression for MXNet. (#2987)

    • Added terminate_on_nan flag to Spark Lightning estimator. (#3088)

    • Added barrier() API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. #3139

    • Added params for customizing Tensorboard callback. (#3153)

    • Added hvd.cross_rank() for keras. (#3008)

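    To make the process-set bullet above concrete, here is a minimal sketch (TensorFlow shown; PyTorch and MXNet mirror the same names). It assumes a four-process job such as horovodrun -np 4; the even_set/odd_set names are illustrative only:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Two process sets covering the even and odd ranks of a 4-process job.
    even_set = hvd.ProcessSet([0, 2])
    odd_set = hvd.ProcessSet([1, 3])

    # Register the sets when initializing Horovod.
    hvd.init(process_sets=[even_set, odd_set])

    tensor = tf.ones([2, 2])
    if hvd.rank() % 2 == 0:
        # Only the even ranks take part in this collective.
        reduced = hvd.allreduce(tensor, process_set=even_set)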

    Changed

    • Implemented more asynchronous dependency handling on GPU. (#2963)

    • Ray: RayExecutor will now use the current placement group instead of always creating a new one. (#3134)

    • Lightning: turned off shuffling for validation dataset. (#2974)

    • Extended hvd.join() to return the last rank that joined. (#3097)

    Removed

    • Spark/Keras: remove bare Keras support. (#3191)

    Fixed

    • Fix Horovod develop/editable install mode and incremental builds. (#3074)

    • Estimator/Lightning: use lightning datamodule. (#3084)

    • Fix Horovod Spark StringType and numpy type mapping issue. (#3146)

    • Fixed error in Keras LearningRateScheduler. (#3135)

    • Fixed bug in Lightning Profiler on Ray. (#3122)

    • Fixed torch op lazy release to prevent OOM in elastic training. (#3110)

    • Lightning: Fixed usage of the checkpoint callback. (#3186)

    • Fixed MPICH support to use Intel MPI's implementation. (#3148)

    • Fixed race condition in PyTorch async dataloader. (#3120)

    • Keras: Fixed learning rate scheduler. (#3142, #3135)

  • v0.22.1(Jun 10, 2021)

    Added

    • Estimator: added support for loading data from S3, GCS, ADLS, and other remote filesystems. (#2927)

    • Estimator: added custom Spark data loader interface. (#2938)

    • LightningEstimator: added support to supply a logger and associated parameter to control the frequency of logging. (#2926)

    • Estimator: added check to ensure all ranks have the same device type. (#2942)

    Changed

    • Changed behavior to use TensorBoardLogger only as a fallback when no logger is supplied. (#2926)

    • Ray: disabled capturing child tasks in placement group. (#2920)

    Fixed

    • Fixed hvd.tensorflow.keras.Compression, accidentally removed in v0.22.0. (#2945)

    • TorchEstimator: fixed usage of validation_steps in place of validation_steps_per_epoch. (#2918)

    • TensorFlow: fixed C++ API for TF v2.6.0. (#2932)

    • PyTorch: fixed sparse_allreduce_async for PyTorch v0.10.0. (#2965)

  • v0.22.0(May 19, 2021)

    Added

    • Added pytorch_lightning spark estimator which enables training pytorch_lightning models. (#2713)

    • Added NVTX tracing hooks for profiling with Nsight Systems. (#2723)

    • Added a generic num_workers API for RayExecutor (#2870)

    • Supports Ray Client without code changes. (#2882)

    • Supports in-memory cache option for Keras Estimator. (#2896)

    • Added FP16 support for GPU tensor in mxnet. (#2915)

    • Added response caching for allgather operations. (#2872)

    • Estimator: add petastorm reader_pool_type into constructor (#2903)

    Changed

    • Changed alltoall to return the received splits as a second return value if non-uniform splits are sent. (#2631)

    • Changed RayExecutor to use Ray Placement Groups for worker colocation. (#2824)

    • Changed in-memory dataloader usage for Torch Estimator with the petastorm v0.11.0 release. (#2896)

    Fixed

    • Changed RayExecutor to use Ray node ID to enable multi-container:single-host setups. (#2883)

    • Support sparse gradients aggregation in TF1 Keras. (#2879)

    • Respect global_step parameter for LegacyOptimizers when aggregating gradients. (#2879)

    • Fixed compatibility with PyTorch 1.9.0. (#2829)

  • v0.21.0(Nov 23, 2020)

    Detailed Changes

    Added

    • Added support for backward_passes_per_step > 1 for TF Keras graph mode. (#2346)

    • Added support for backward_passes_per_step > 1 for TF Keras eager execution. (#2371)

    • Added support for backward_passes_per_step > 1 for TF LegacyOptimizer in graph mode. (#2401)

    • Added grouped allreduce to enable more efficient tensor fusion and deterministic training. (#2453)

    • Add support for specifying op and compression in horovod.tensorflow.keras.allreduce(). (#2423)

    • Added support for batched D2D memcopy kernel on GPU. (#2435)

    • Added schema inference in Spark Estimator without sampling. (#2373)

    • Added Store.create("dbfs:/") mapping to DBFSLocalStore("/dbfs/..."). (#2376)

    Changed

    • Changed Keras callbacks to require parameter initial_lr of LearningRateScheduleCallback and LearningRateWarmupCallback. (#2459)

    • Changed default cycle time from 5ms to 1ms and fusion threshold from 64MB to 128MB. (#2468)

    Fixed

    • Fixed support for TensorFlow v2.4.0. (#2381)

    • Fixed averaging using the CUDA half2 implementation for one-element half buffers. (#2375)

    • Fixed HOROVOD_THREAD_AFFINITY when using oneCCL. (#2350)

    • Added timeout to SSH check in horovodrun to prevent hanging. (#2448)

    • Added HOROVOD_GLOO_TIMEOUT_SECONDS value to error messages. (#2436)

    • Fixed race condition in dynamic timeline API. (#2341)

    • Fixed --log-hide-timestamp to apply to driver logs with Gloo. (#2388)

  • v0.20.3(Oct 1, 2020)

  • v0.20.2(Sep 26, 2020)

  • v0.20.1(Sep 25, 2020)

    Detailed Changes

    Added

    • Added Databricks storage DBFSLocalStore and support for GPU-aware scheduling to horovod.spark Estimator. (#2234)

    • Added ElasticSampler and PyTorch Elastic ImageNet example. (#2297)

    • Added ability to dynamically start and stop timeline programmatically. (#2215)

    • Added support for Gloo on macOS. (#2254)

    • Exposed name argument to TensorFlow allreduce operation. (#2325)

    • Added option to strip outer name scope from Horovod ops in TensorFlow. (#2328)

    Fixed

    • Fixed usage of VERBOSE=1 when setting custom MAKEFLAGS. (#2239)

    • Fixed bugs in Keras Elastic Callback classes. (#2289)

    • Fixed RelWithDebInfo build and made it the default with -O3 optimizations. (#2305)

    • Fixed usage of tf.cond in TensorFlow alltoall gradient. (#2327)

    • Fixed allreduce averaging for TF IndexedSlices in ROCm path. (#2279)

    • Included stdexcept to handle certain compilers/frameworks that don't include it already. (#2238)

    • Fixed Debug builds by setting compiler options based on CMake build type. (#2263)

    • Skipped launching zero-sized send/recvs for NCCLAlltoall. (#2273)

    • Fixed missing run in tf keras elastic mode. (#2272)

    • Fixed loss function in TensorFlow2 elastic synthetic benchmark. (#2265)

    • Fixed usage of HOROVOD_MIXED_INSTALL env var in alltoall tests. (#2266)

    • Removed keras requirement from Ray example. (#2262)

  • v0.20.0(Sep 4, 2020)

    Elastic Horovod API + Spark Auto-Scaling (#1849, #1956)

    Elastic training enables Horovod to scale up and down the number of workers dynamically at runtime, without requiring a restart or resuming from checkpoints saved to durable storage. With elastic training, workers can come and go from the Horovod job without interrupting the training process.

    Support for auto-scaling can be added to any existing Horovod script with just a few modifications:

    1. Decorate retryable functions with @hvd.elastic.run.
    2. Track state that needs to be kept in sync across workers in a hvd.elastic.State object.
    3. Perform all Horovod collective operations (allreduce, allgather, broadcast, etc.) inside the retryable functions.

    Here's an example for PyTorch:

    import torch
    import torch.nn.functional as F
    import torch.optim as optim
    import horovod.torch as hvd
    
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())
    
    # model, dataset, lr, and args are placeholders for your own training setup
    model = ...
    dataset = ...
    
    @hvd.elastic.run
    def train(state):
        for state.epoch in range(state.epoch, args.epochs + 1):
            dataset.set_epoch(state.epoch)
            dataset.set_batch_idx(state.batch_idx)
            for state.batch_idx, (data, target) in enumerate(dataset):
                state.optimizer.zero_grad()
                output = state.model(data)
                loss = F.nll_loss(output, target)
                loss.backward()
                state.optimizer.step()
                state.commit()
    
    optimizer = optim.SGD(model.parameters(), lr * hvd.size())
    optimizer = hvd.DistributedOptimizer(optimizer)
    
    def on_state_reset():
        # adjust learning rate on reset
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr * hvd.size()
    
    state = hvd.elastic.TorchState(model, optimizer, epoch=1, batch_idx=0)
    state.register_reset_callbacks([on_state_reset])
    train(state)
    

    Run using horovodrun by specifying the minimum and maximum number of worker processes, as well as a "host discovery script" that will be used to find available workers to add at runtime:

    $ horovodrun -np 8 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.sh python train.py
    

    Elastic Horovod is supported natively with Spark auto-scaling using the hvd.spark.run_elastic API.
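
    A minimal sketch of that entry point, assuming the train(state) function from the example above is constructed inside a per-executor main(), and using this release's min_np/max_np keyword names (renamed min_num_proc/max_num_proc in later releases):

    import horovod.spark
    
    def main():
        # Build the model, optimizer, and hvd.elastic.TorchState on each Spark
        # executor, then call the @hvd.elastic.run-decorated train(state).
        ...
    
    # Mirrors `horovodrun -np 8 --min-np 4 --max-np 12`, but scheduled on Spark.
    results = horovod.spark.run_elastic(main, num_proc=8, min_np=4, max_np=12)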

    For more details, see Elastic Horovod.

    Horovod on Ray (#2218)

    Ray is a distributed execution framework that makes it easy to provision and scale distributed applications, and can now be used to execute Horovod jobs without needing to coordinate the workers by hand:

    import ray
    import horovod.torch as hvd
    from horovod.ray import RayExecutor
    
    # Start the Ray cluster or attach to an existing Ray cluster
    ray.init()
    
    # Start num_hosts * num_slots actors on the cluster
    num_hosts, num_slots = 2, 4  # adjust to the size of your cluster
    setting = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(
        setting, num_hosts=num_hosts, num_slots=num_slots, use_gpu=True)
    
    # Launch the Ray actors on each machine
    # This will launch `num_slots` actors on each machine
    executor.start()
    
    # Using the stateless `run` method, a function can take in any args or kwargs
    def train_fn():
        hvd.init()
        # Train the model on each worker here
        ...
    
    # Execute the function on all workers at once
    results = executor.run(train_fn)
    
    executor.shutdown()
    

    Horovod now also integrates with Ray Tune to scale up your hyperparameter search jobs. Check out the example here.

    For more details, see Horovod on Ray.

    All-to-All Operation (#2143)

    The all-to-all collective can be described as a combination of a scatter and gather, where each worker will scatter a tensor to each worker, while also gathering scattered data from other workers. This type of collective communication can arise in model-parallel training strategies.

    The hvd.alltoall function takes the form hvd.alltoall(tensor, splits=None), where tensor is a multi-dimensional tensor of data to be scattered and splits is an optional 1D tensor of integers with length equal to the number of workers, describing how to split and distribute tensor. splits is applied along the first dimension of tensor. If splits is not provided, an equal splitting is assumed, where the first dimension is divided by the number of workers.

    The implementation supports TensorFlow, PyTorch, and MXNet using the MPI backend, the CUDA-aware MPI backend via HOROVOD_GPU_ALLTOALL=MPI, and the NCCL backend via HOROVOD_GPU_ALLTOALL=NCCL / HOROVOD_GPU_OPERATIONS=NCCL.
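
    As an illustrative sketch, here is the PyTorch variant with explicit splits for an assumed two-process job (note that since v0.22.0, passing non-uniform splits also returns the received splits as a second value):

    import torch
    import horovod.torch as hvd
    
    hvd.init()  # run with: horovodrun -np 2 python alltoall_example.py
    
    # Each worker contributes 3 rows: send 1 row to rank 0 and 2 rows to rank 1.
    tensor = torch.arange(6, dtype=torch.float32).reshape(3, 2) + 10 * hvd.rank()
    splits = torch.tensor([1, 2], dtype=torch.int32)
    
    # `received` concatenates, along the first dimension, the rows that every
    # worker directed at this rank.
    received = hvd.alltoall(tensor, splits=splits)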

    Gradient Predivide Factor (#1949)

    We've added a gradient_predivide_factor parameter in the DistributedOptimizer, the purpose of which is to enable splitting the averaging before and after the allreduce. This can be useful in managing the numerical range for mixed precision computations.

    The gradient_predivide_factor is applied as follows:

            If op == Average, gradient_predivide_factor splits the averaging
            before and after the sum. Gradients are scaled by
            1.0 / gradient_predivide_factor before the sum and
            gradient_predivide_factor / size after the sum. 
    

    To facilitate this, additional arguments (prescale_factor and postscale_factor) have been added to the basic hvd.allreduce functions, enabling the definition of multiplicative factors to scale the tensors before and after the allreduce respectively. For efficiency, the pre- and post-scaling is implemented in the Horovod backend on the fused tensor buffer, rather than through framework-level operations. For GPU, this required a CUDA kernel implementation to scale the GPU buffer, which in turn required adding compilation of CUDA code to the current build infrastructure.

    As an additional general benefit from these changes, gradient averaging in the optimizer can now be carried out within the Horovod backend on the fused tensor buffer using the postscale_factor argument, rather than on a tensor by tensor basis at the framework level, decreasing the overhead of each allreduce call.
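
    A hedged sketch of these knobs with the PyTorch API (the other frameworks expose the same parameter names):

    import torch
    import horovod.torch as hvd
    
    hvd.init()
    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    
    # Split the averaging around the allreduce: gradients are scaled by
    # 1 / gradient_predivide_factor before the sum and by
    # gradient_predivide_factor / size afterwards.
    opt = hvd.DistributedOptimizer(
        opt,
        named_parameters=model.named_parameters(),
        gradient_predivide_factor=2.0)
    
    # The same machinery is exposed on the low-level op.
    t = torch.ones(4)
    summed = hvd.allreduce(t, op=hvd.Sum, prescale_factor=0.5, postscale_factor=2.0)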

    CMake Build System (#2009)

    CMake, previously used to compile the optional Gloo controller, is now required to install Horovod. This change introduces a number of exciting benefits for Horovod developers and users:

    • Much faster installation times through a parallel task build
    • Incremental builds (almost instantaneous build when developing and making small changes at a time)
    • Separation of the build config phase from the build phase (less overhead for repeated builds)
    • Reuse find_package modules provided by CMake for MPI, CUDA, etc. to better handle a range of environment configurations
    • Libraries can be built outside of the python build process (no longer requiring setup.py)
    • Flexibility for the build system (make, ninja, IDEs, etc.)

    Detailed Changes

    Added

    • Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. (#1849)

    • Added Elastic Horovod support for Spark auto-scaling. (#1956)

    • Added All-to-All operation for TensorFlow, PyTorch, and MXNet. (#2143)

    • Added support for gradient_predivide_factor and averaging in Horovod backend. (#1949)

    • Added NCCL implementation of the allgather operation. (#1952)

    • Added HOROVOD_GPU_OPERATIONS installation variable to simplify enabling NCCL support for all GPU operations. (#1960)

    • Added TensorFlow implementation of SyncBatchNormalization layer. (#2075)

    • Added hvd.is_initialized() method. (#2020)

    • Added hvd.allgather_object function for TensorFlow, PyTorch, and MXNet. (#2166)

    • Added hvd.broadcast_object function for MXNet. (#2122)

    • Added label_shapes parameter to KerasEstimator and TorchEstimator. (#2140)

    • Added optional modelCheckPoint callback to KerasEstimator params. (#2124)

    • Added ssh_identity_file argument to horovodrun. (#2201)

    • Added support for horovodrun on kubeflow/mpi-job. (#2199)

    • Added Ray integration. (#2218)

    Changed

    • Moved horovod.run.runner.run to horovod.run. (#2099)

    • HOROVOD_THREAD_AFFINITY accepts multiple values, one for every Horovod rank (#2131)

    • Migrated build system for native libraries to CMake (#2009)

    Deprecated

    • HOROVOD_CCL_BGT_AFFINITY is deprecated. Use HOROVOD_THREAD_AFFINITY instead. (#2131)

    Removed

    • Dropped support for Python 2. (#1954)

    • Dropped support for TensorFlow < 1.15. (#2169)

    • Dropped support for PyTorch < 1.2. (#2086)

    Fixed

    • Fixed MXNet allgather implementation to correctly handle resizing the output buffer. (#2092)

    • Fixed Keras Spark Estimator incompatibility with TensorFlow 1.15 due to tf.autograph. (#2069)

    • Fixed API compatibility with PyTorch 1.6. (#2051)

    • Fixed Keras API compatibility with TensorFlow 2.4.0. (#2178)

    • Fixed allgather gradient for TensorFlow 2 in cases where the tensor shape is not known during graph construction. (#2121)

    • Fixed running using Gloo with an imbalanced number of workers per host. (#2212)

  • v0.19.5(Jun 24, 2020)

  • v0.19.4(May 28, 2020)

    Fixed

    • Fixed Sync Batch Norm when using PyTorch 1.5. (#1980)
    • Fixed compatibility with mixed precision Keras policy in TensorFlow 2.2. (#1992)
  • v0.19.3(May 22, 2020)

  • v0.19.2(May 13, 2020)

    Highlights

    • Added Platform LSF and jsrun support to horovodrun. (#1805)
    • Added support for running Horovod on Spark with Gloo in place of MPI. (#1807)
    • Added synchronous batch normalization for horovod.torch API. (#1923)

    Additional changes

    • Added support for providing a set of inclusive NICs to horovodrun. (#1808)
    • Added optional initial_lr parameter to LearningRateScheduleCallback, deprecated implicit initialization. (#1933)
    • Changed Spark Estimators to use Petastorm BatchDataLoader. (#1879)
    • Changed Spark Estimators to use Petastorm's make_reader API. (#1804)
    • Improved latency of background thread loop. (#1880)
    • Enabled setting Horovod background thread affinity with all frameworks. (#1881)
    • Added verbose parameter to SparkBackend. (#1922)
    • Use parameter names when scheduling broadcasts in MXNet broadcast_parameters. (#1894)
    • Added metadata cache when calling fit_on_parquet. (#1826)
    • Added optional local version to package version. (#1925)

    Bugfixes

    • Fixed module resolution for tf.keras optimizers when calling hvd.load_model. (#1935)
    • Modified safe_shell_exec to use multiprocessing spawn instead of fork to prevent deadlocks. (#1915)
    • Fixed multiprocessing to support Python 3.8. (#1904)
    • Added extra preprocessor guard for FMA optimization. (#1835)
    • Fixed exception in KerasEstimator when num_proc is larger than 4. (#1945)
    • Fixed memory leaks. (#1845)
    • Fixed a bug with sample weight in TorchEstimator. (#1790)
    • Removed torchvision from pytorch extra. (#1899)
  • v0.19.1(May 13, 2020)

    TensorFlow

    • Added _aggregate_gradients in DistributedOptimizer to support Keras in TensorFlow 2.2. (#1784)

    PyTorch

    • Removed uses of deprecated PyTorch C++ API to support PyTorch 1.6. (#1731)

    Changes to horovodrun

    • Added process binding arguments to horovodrun. (#1767)
    • Added --tcp flag to horovodrun for TCP only communication. (#1744)

    Changes to installer

    • Added HOROVOD_BUILD_ARCH_FLAGS to specify architecture-specific compiler flags. (#1751)
    • Added Python extras to enforce that Horovod is installed after other frameworks. (#1785)

    API changes

    • Added mpi_args to horovod.run.run. (#1787)
    • Added support for data transformation before train and validation in TorchEstimator (#1750)

    Bugs

    • Fixed bug in cache dump. (#1739)
    • Fixed root rank output handling in MXNet out-of-place broadcast. (#1740)
    • Fixed data_type_to_str for SparseVector and DenseVector. (#1780)
  • v0.19.0(Jan 14, 2020)

    In version 0.19.0, Horovod adds tighter integration with Apache Spark, including a new high-level Horovod Spark Estimator framework and support for accelerator-aware task-level scheduling in the upcoming Spark 3.0 release. This release also contains experimental new features including a join operation for PyTorch and the ability to launch Horovod jobs programmatically from environments like notebooks using a new interactive run mode.

    Horovod Spark Estimators (#1554)

    To bridge the gap between large-scale data processing in Spark and large-scale deep learning training with Horovod, we’re introducing a new open source API called Horovod Spark Estimators.

    With Horovod Spark Estimators, you can train your deep neural network directly on your existing Spark DataFrame, leveraging Horovod’s ability to scale to hundreds of workers in parallel without any specialized code for distributed training:

    from tensorflow import keras
    import tensorflow as tf
    import horovod.spark.keras as hvd
    from horovod.spark.common.store import HDFSStore
    
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(8, input_dim=2))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(1))
    model.add(keras.layers.Activation('sigmoid'))
    
    # NOTE: unscaled learning rate
    optimizer = keras.optimizers.SGD(lr=0.1)
    loss = 'binary_crossentropy'
    
    store = HDFSStore('/user/username/experiments')
    keras_estimator = hvd.KerasEstimator(
        num_proc=4,
        store=store,
        model=model,
        optimizer=optimizer,
        loss=loss,
        feature_cols=['features'],
        label_cols=['y'],
        batch_size=32,
        epochs=10)
    
    
    keras_model = keras_estimator.fit(train_df) \
        .setOutputCols(['predict'])
    predict_df = keras_model.transform(test_df)
    

    Horovod Spark Estimators provide a single abstraction — the Estimator — which hides the complexity of gluing Spark DataFrames to a deep learning training script, reading data into a format interpretable by the training framework, and distributing the training using Horovod. The user only needs to provide a model written in the deep learning framework of their choice, and the Estimator will do the work of fitting it to the DataFrame.

    After training, the Estimator returns a Transformer representation of the trained model. The model transformer can be used like any Spark ML transformer to make predictions on an input DataFrame, writing them as new columns in the output DataFrame.

    Estimators can be used to track experiment history through model checkpointing, hot start retraining, and metric logging (for Tensorboard) using the Estimator Store abstraction. Stores persist all training artifacts including intermediate representations of the training data. Horovod natively supports stores for HDFS and local filesystems.
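
    A brief sketch of selecting a store by path prefix via Store.create; the HDFS URL shown is a placeholder:

    from horovod.spark.common.store import Store
    
    # Store.create dispatches on the path prefix (HDFS here); a plain local
    # path would instead yield a local-filesystem store.
    store = Store.create('hdfs://namenode:8020/user/username/experiments')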

    Horovod Spark Estimators are available for Keras (both tf.keras and standalone Keras) and PyTorch, with more frameworks (including pure TensorFlow) coming soon.

    Spark 3.0 task-level GPU scheduling (#1584)

    Apache Spark 3.0 introduces a new accelerator-aware scheduling capability, allowing a production ETL job running on CPUs to hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline, breaking down the barriers between ETL and continuous model training.

    Horovod users can now request GPU resources directly from their Spark application, without assuming which tasks should map to which GPUs:

    import horovod.spark
    
    def train():
        from horovod.spark.task import get_available_devices
        import horovod.tensorflow.keras as hvd
        import tensorflow as tf
        from tensorflow.keras import backend as K
    
        hvd.init()
        
        # Pin this task to the GPU that Spark assigned to it (TF 1.x-style session)
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = get_available_devices()[0]
        K.set_session(tf.Session(config=config))
    
        ...
    
    horovod.spark.run(train)
    

    Check out the keras_spark3_rossmann.py script for a complete example.

    Spark 3.0 is currently in preview release, with the full release forthcoming.

    Join Operation for PyTorch (#1058)

    The ability for different workers to train on a different number of batches in each epoch has been one of the most requested features for Horovod. This problem frequently arises when a dataset doesn’t evenly split among all workers, requiring the user to truncate any extra examples or risk deadlock during training.

    With the new join operation, users no longer need to worry about how evenly their dataset divides when training. Just add a join step at the end of each epoch, and Horovod will train on any extra batches without causing the waiting workers to deadlock:

    for epoch in range(epochs):
        for batch in dataset:
            ...
        hvd.join(device=hvd.local_rank())
    

    The join operation is currently supported in Horovod for PyTorch, with support for TensorFlow and Apache MXNet coming soon.

    Interactive Run Mode (#1307)

    With horovod.spark.run, Horovod was made to support launching training jobs programmatically by defining Python functions that were executed on Spark Executors. With Horovod Interactive Run Mode, we created a similar API that can launch training jobs on any visible hosts, similar to the command-line horovodrun tool:

    from horovod.run import run as hvdrun
    
    def train():
        import horovod.tensorflow as hvd
        hvd.init()
        ...
    
    results = hvdrun(train, np=2)
    

    Interactive mode supports most of the functionality provided by horovodrun. See the API for a complete reference.

    Bug Fixes and Improvements

    • Added NCCL implementation of hvd.broadcast when building with HOROVOD_GPU_BROADCAST=NCCL (#1579).
    • Fixed hvd.allgather to work with CUDA tensors when building with HOROVOD_GPU_ALLGATHER=MPI (#1480).
    • Fixed a crash bug in MXNet caused by early free of tensor object (#1639).
    • Added experimental implementation for the Adasum gradient aggregation method from Microsoft (full support coming in v0.20.0) (#1485).
    • Added support for Intel oneCCL to replace MLSL (#1566).
    • Added FP16 support in IBM DDL (#1465).
    • Improved support for running Horovod on Spark with YARN (#1525).
    • Added support for TensorFlow 2.0 learning rate schedules with tf.keras (#1588).
    • Added support for broadcasting Python objects with PyTorch (#1609).
    • Added thread pool for CUDA finalizer threads (#1562).
    • Fixed host file usage and parsing within horovodrun (#1607).