Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Last update: Jan 08, 2023

Overview

Hivemind: decentralized deep learning in PyTorch

Hivemind is a PyTorch library to train large neural networks across the Internet. Its intended usage is training a single Transformer model on hundreds of computers from different universities, companies, and volunteers.

Key Features

Train neural networks of arbitrary size: parts of their layers are distributed across the participants.
Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized network.
Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too long to respond.
Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to synchronize across the entire network.

To learn more about the ideas behind this library, see https://learning-at-home.github.io or read the NeurIPS 2020 paper.

Installation

Before installing hivemind, make sure that your environment has Python 3.7+ and PyTorch 1.6.0 or newer.
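A quick way to check these requirements before installing is sketched below. The helper name `meets_minimum` is an assumption made for illustration; it compares only the leading numeric components of a version string, such as the one reported by `torch.__version__`.

```python
import re
import sys


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare the leading numeric components of `version` against `minimum`."""
    # Keep only the leading "X.Y.Z" part, dropping suffixes such as "+cu117".
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return False
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad short versions ("1.6") with zeros so that they compare as "1.6.0".
    parts += (0,) * (len(minimum) - len(parts))
    return parts >= minimum


# Python 3.7+ is required.
print("Python OK:", sys.version_info >= (3, 7))

# For PyTorch, pass torch.__version__ instead of a literal string.
print("PyTorch OK:", meets_minimum("1.13.1+cu117", (1, 6, 0)))
```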

To start using this library, you can either install it with the pip package manager or build it from source. Since the release cycle is not yet established, we recommend installing hivemind from source to keep up with the latest bugfixes and improvements.

With pip

If your versions of Python and PyTorch match the requirements, you can install hivemind from pip:

pip install hivemind

From source

To install hivemind from source, simply clone the repository and install it:

git clone https://github.com/learning-at-home/hivemind.git
cd hivemind
pip install .

If you would like to verify that your installation is working properly, you can install with pip install -e .[dev] instead. Then, you can run the tests with pytest tests/.

Documentation

Quickstart: install hivemind, set up a server and train experts
Documentation & guides are available at learning-at-home.readthedocs.io

Contributing

Hivemind is currently in active development, and we welcome all contributions. Everything, from bug fixes and documentation improvements to entirely new features, is equally appreciated.

If you want to contribute to hivemind but don't know where to start, take a look at the unresolved issues. Open a new issue or join our chat room if you want to discuss new functionality or report a possible bug. Bug fixes are always welcome, but new features should preferably be discussed with the maintainers beforehand.

If you want to start contributing to the source code of hivemind, please see the contributing guidelines first. To learn more about other ways to contribute, read our guide.

Citation

If you found hivemind useful for your experiments, you can cite the paper that inspired it:

@inproceedings{ryabinin2020crowdsourced,
 author = {Ryabinin, Max and Gusev, Anton},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 pages = {3659--3672},
 publisher = {Curran Associates, Inc.},
 title = {Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts},
 url = {https://proceedings.neurips.cc/paper/2020/file/25ddc0f8c9d3e22e03d3076f98d83cb2-Paper.pdf},
 volume = {33},
 year = {2020}
}

The initial implementation of hivemind used for the paper is available at mryab/learning-at-home.

In the documentation, we list several related projects and acknowledgements.

Comments

Fine-tuning BERT on GLUE with hivemind
Describe the bug

While running the hivemind ALBERT experiment, we have one monitor peer and two worker peers. One of the nodes is working fine.

But the other peer is stuck at downloading parameters from a peer. I guess the reason is the speed of training: if the first node trains too fast, the other node cannot join and gets stuck in the parameter download. Can we limit the training speed or force the first node to wait for others to join?

To Reproduce

If applicable, please create a minimal script that reproduces the problem for you. It would be great to include script outputs as well.

If we change ALBERT to BERT in the example, each iteration becomes faster, and then the new worker cannot join the training.

Environment Please list:

python version (e.g. 3.8.1); 3.8

hivemind.version; 1.1.0.dev0

discussion
opened by elricwan 32
Convert hivemind.server to libp2p backend
#242 In this PR we are trying to get rid of gRPC in the MoE module of hivemind. I am opening this PR as a draft now so that the review can start and proceed piece by piece, because some core things are done and some results have been achieved (I believe). There is still work to be done (it is listed at the end of this message).

What is currently done:

RemoteExperts are able to communicate through libp2p

Some throughput optimizations were made: we achieved ~2 GiB/sec on the ffn_forward benchmark using a GTX3060. This is done by enabling balancing across multiple handlers of the same protocol in p2pd, so that messages can be packed and unpacked in parallel on forward/backward. Results are at the end of the message.

Examples from hivemind tutorial are working

tests/test_dht_experts.py are passing

What are topics to discuss:

[x] The current implementation of RemoteExpert is quite heavy compared to the gRPC version. The gRPC version was in fact a few string fields containing endpoints, nothing more. The current version contains a heavy object, a connection to the p2p daemon, which is not serializable. This is probably not the best decision.

[x] The current moe/server/Server will probably not work if it has no DHT inside; however, the old API documentation says that DHT is not mandatory for a Server instance. I see two ways: make DHT mandatory, or create a P2P instance inside the Server. With the second option, a DHT inside the Server might not be useful anymore.

What is yet to be done:

[x] Separate RemoteExpertInfo with endpoints from RemoteExpert and make function to make expert from its info

[x] Separate thread and queue for async actions from _RemoteModuleCall

[x] tests/test_moe.py and tests/test_training.py. And probably some other tests.

[x] Rewrite RemoteMixtureOfExpert to the P2P (as RemoteExpert)

[x] Refactoring of Server instance after discussion mentioned above

[x] Wildly benchmark every possible scenario

[x] Fix documentation for tutorials

[x] Add tests for scenarios that might not be covered

~~After this PR is done we have to discuss dht.replicate_p2p() because it is not fork-safe and not in an obvious way~~

Current benchmarks results:

Current benchmarks were performed with a GTX3060 and --preset ffn_forward. Each experiment was performed at least 5 times, and the results below are averages. I can provide more detailed data on demand. It is also worth mentioning that these results can change during the review process.

| Branch | Batch size | Number of handlers | Throughput |
| ---------------- | -------------- | -------------------------- | -------------------- |
| server-p2p | 1024 | 1 | ~838 MiB/sec |
| server-p2p | 1024 | 5 | ~2085 MiB/sec |
| server-p2p | 1024 | 10 | ~2055 MiB/sec |
| master | 1024 | *default | ~1526 MiB/sec |
| master | 2048 | *default | ~2248 MiB/sec |

*default in the Number of handlers column: TL;DR, it is 64. The value comes from the formula max(1, num_handlers or num_clients // 2), where num_handlers = None and num_clients = 128.
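The default above can be reproduced directly in Python. This is only an illustration of the quoted formula: the function name `effective_num_handlers` is hypothetical and is not taken from the benchmark script.

```python
def effective_num_handlers(num_handlers=None, num_clients=128):
    # `//` binds tighter than `or`, so this reads as:
    # num_handlers or (num_clients // 2)
    return max(1, num_handlers or num_clients // 2)


print(effective_num_handlers())         # default: None and 128 clients
print(effective_num_handlers(5))        # an explicit value wins over the fallback
print(effective_num_handlers(None, 1))  # the max(1, ...) guard applies here
```

Because `or` treats `None` (and `0`) as false, the fallback `num_clients // 2` is used whenever no handler count is given.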

What can be done after merging this

Things discovered during review. They are not blocking this PR, but it is better to do them.

[ ] Get rid of multiaddrs everywhere. P2P daemon should be able to communicate using PeerID only

[ ] In some places, CPU-bound work happens inside an async task. It is better to move it into thread executors, for example in hivemind/moe/client/expert.py forward/backward.

[ ] currently hivemind.Server does not check that inputs are correct. If user sends malformed inputs, it may OOM the server. We should check for that in some future PR. See #3

[ ] If a client sends a tensor of shape [0, 123], it will be split into zero messages and the uid will not be passed. The server will receive uid=None and fail with a cryptic KeyError(None). We should either forbid this on the client side or ensure that zero-element tensors are serialized into a stream with a first empty message.

[ ] Test load balancing for unary handlers on python side
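The zero-element case from the list above can be illustrated without any hivemind code. In this sketch, plain bytes stand in for a serialized tensor, and both function names are hypothetical; it only shows why naive chunking drops the uid and how a guaranteed first message avoids that.

```python
def split_into_chunks(payload: bytes, chunk_size: int):
    """Naive splitting: an empty payload produces no chunks at all."""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def split_with_header(uid: str, payload: bytes, chunk_size: int):
    """Always emit at least one message so that the uid is never lost."""
    chunks = split_into_chunks(payload, chunk_size) or [b""]
    # Only the first message carries the uid; later ones carry data only.
    return [(uid if i == 0 else None, chunk) for i, chunk in enumerate(chunks)]


print(split_into_chunks(b"", 4))              # no messages, so no uid is sent
print(split_with_header("expert.0", b"", 4))  # one empty message that still has the uid
```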

server mixture-of-experts p2p
opened by GreenFatGuy 12

Averaging is extremely slow in some setups

Error log from client-mode peer:

[2021/08/16 22:54:50.049][INFO][optim.collaborative.step:229] Beginning global optimizer step 0
[2021/08/16 22:54:50.253][INFO][optim.collaborative.fetch_collaboration_state:444] Collaboration accumulated 3696 samples from 2 peers; ETA 0.00 seconds (refresh in 0.50s.)
[2021/08/16 22:54:50.445][INFO][optim.collaborative.fetch_collaboration_state:444] Collaboration accumulated 3696 samples from 2 peers; ETA 0.00 seconds (refresh in 0.50s.)
/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:87: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
[2021/08/16 22:56:57.350][ERROR][averaging.averager._run_allreduce:426] 
Traceback (most recent call last):
  File "/content/hivemind/hivemind/averaging/averager.py", line 416, in _run_allreduce
    averaging_outputs = [output async for output in allreduce]
  File "/content/hivemind/hivemind/averaging/averager.py", line 416, in <listcomp>
    averaging_outputs = [output async for output in allreduce]
  File "/content/hivemind/hivemind/averaging/allreduce.py", line 132, in run
    async for averaged_tensor_delta in self.tensor_part_container.iterate_output_tensors():
  File "/content/hivemind/hivemind/averaging/partition.py", line 134, in iterate_output_tensors
    await self._output_part_available[peer_index].wait()
  File "/usr/lib/python3.7/asyncio/locks.py", line 293, in wait
    await fut
concurrent.futures._base.CancelledError
[2021/08/16 22:56:57.352][INFO][optim.collaborative.step:250] Skipped averaging: averaging round failed with TimeoutError().
[2021/08/16 22:56:57.368][INFO][optim.collaborative.step:266] Optimizer step: done!

Error log from regular peer:

[2021/08/16 22:55:12.202][INFO][optim.collaborative.step:229] Beginning global optimizer step 0
[2021/08/16 22:55:12.212][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 3856 samples from 2 peers; ETA 0.00 seconds (refresh in 0.50s.)
[2021/08/16 22:56:57.266][ERROR][averaging.averager._run_allreduce:426]
Traceback (most recent call last):
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in _run_allreduce
    averaging_outputs = [output async for output in allreduce]
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in <listcomp>
    averaging_outputs = [output async for output in allreduce]
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/allreduce.py", line 132, in run
    async for averaged_tensor_delta in self.tensor_part_container.iterate_output_tensors():
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/partition.py", line 134, in iterate_output_tensors
    await self._output_part_available[peer_index].wait()
  File "/home/jheuristic/anaconda3/envs/TPU/lib/python3.9/asyncio/locks.py", line 226, in wait
    await fut
asyncio.exceptions.CancelledError
[2021/08/16 22:56:57.268][ERROR][averaging.averager._step:365] Averager caught MatchmakingException('Unable to run All-Reduce: ')
Traceback (most recent call last):
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in _run_allreduce
    averaging_outputs = [output async for output in allreduce]
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in <listcomp>
    averaging_outputs = [output async for output in allreduce]
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/allreduce.py", line 132, in run
    async for averaged_tensor_delta in self.tensor_part_container.iterate_output_tensors():
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/partition.py", line 134, in iterate_output_tensors
    await self._output_part_available[peer_index].wait()
  File "/home/jheuristic/anaconda3/envs/TPU/lib/python3.9/asyncio/locks.py", line 226, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 348, in _step
    await asyncio.wait_for(
  File "/home/jheuristic/anaconda3/envs/TPU/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
  File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 427, in _run_allreduce
    raise MatchmakingException(f"Unable to run All-Reduce: {e}")
hivemind.averaging.matchmaking.MatchmakingException: Unable to run All-Reduce:
[2021/08/16 22:56:57.270][INFO][optim.collaborative.step:250] Skipped averaging: averaging round failed with MatchmakingException('Unable to run All-Reduce: ').
[2021/08/16 22:56:57.301][INFO][optim.collaborative.step:266] Optimizer step: done!

bug averaging

opened by yhn112 10

[BUG] Deadlock when 'Downloading parameters' takes too much time

Describe the bug

While running the hivemind ALBERT experiment, we have one monitor peer and two worker peers. One of the nodes is working fine.

But the other peer is stuck at downloading parameters from a peer. The peer log is:

[2021/11/01 07:21:50.962][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
[2021/11/01 07:28:17.871][INFO][averaging.averager._load_state_from_peers:597] Finished downloading state from QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez


/opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1347: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.

  nn.utils.clip_grad_norm_(

[2021/11/01 07:28:18.759][INFO][__main__.on_step_end:153] Step 0

[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:154] Your current contribution: 0 samples

[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:155] Performance: 0.002546124167199564 samples per second.

[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:157] Local loss: 11.4107

[2021/11/01 07:28:18.986][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 81 samples from 1 peers; ETA 36.99 seconds (refresh in 9.25s.)

[2021/11/01 07:28:19.004][INFO][optim.collaborative.step:208] Peer is out of sync.

[2021/11/01 07:28:20.243][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmXpVXnAY6L7WqeW4pzstGK18S1LySDonPmrxQka3GztJa

To Reproduce

The monitor script:

python run_training_monitor.py --host_maddrs '/ip4/0.0.0.0/tcp/38888' --experiment_prefix albert --wandb_project albert

The worker peer script:

python run_trainer.py --experiment_prefix albert --host_maddrs '/ip4/0.0.0.0/tcp/39997' --initial_peers [INITIAL_PEERS_FROM_MONITOR] --seed 42 --logging_first_step --logging_steps 100 --output_dir /train --overwrite_output_dir --logging_dir /train --target_batch_size 1024 --averaging_expiration 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1

Environment

I was running this experiment in a Docker container.

Please list:

python version 3.9.7
hivemind.version; 0.10.0
Please copy and paste the output from pytorch [environment collection script]

Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.0
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.21.3                   pypi_0    pypi
[conda] numpy-base                1.21.2           py39h79a1101_0  
[conda] pytorch                   1.10.0          py3.9_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     1.10.0                   pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.10.0                   pypi_0    pypi
[conda] torchvision               0.11.1                   pypi_0    pypi

Considering the file transfer speed, I tested the bandwidth with iperf:

------------------------------------------------------------
Client connecting to 10.8.0.4, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.5 port 39674 connected with 10.8.0.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.4 sec  21.0 MBytes  16.9 Mbits/sec
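To relate the measured bandwidth to the download time in the log above, the arithmetic can be sketched as follows. The size of the state that was actually transferred is not stated in this report, so the result is only an upper bound on what the link could have carried between the "Downloading parameters" and "Finished downloading state" lines. The function names are hypothetical.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Difference between two timestamps in the log format used above."""
    fmt = "%Y/%m/%d %H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def megabytes_transferable(seconds: float, mbits_per_sec: float) -> float:
    """Decimal megabytes that fit through the link: 1 byte = 8 bits."""
    return seconds * mbits_per_sec / 8


download_time = elapsed_seconds("2021/11/01 07:21:50.962", "2021/11/01 07:28:17.871")
print(f"download took {download_time:.1f} s")
print(f"at 16.9 Mbit/s the link carries at most {megabytes_transferable(download_time, 16.9):.0f} MB")
```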

bug

opened by finger92 7

Gating function averaging

In our preliminary experiments, all peers have independent gating functions and we can only synchronize them manually. It would be great to implement some sort of builtin averaging mechanism.

For instance, every T seconds, assemble peers into groups at random, then perform all-reduce within each group. In case of failure, rollback and repeat T seconds later.
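The proposed scheme can be simulated in a few lines. The following is a toy sketch, not hivemind code: each peer's gating parameters are a plain list of floats, the "all-reduce" is an in-process mean, and the names `make_groups` and `average_round` are hypothetical. A failed group simply keeps its old values, which models the rollback.

```python
import random


def make_groups(peer_ids, group_size, rng):
    """Shuffle peers and cut the shuffled list into groups of at most `group_size`."""
    shuffled = list(peer_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]


def average_round(params, group_size, rng, fail=lambda group: False):
    """One round; `params` maps a peer id to a list of floats."""
    new_params = dict(params)
    for group in make_groups(params.keys(), group_size, rng):
        if fail(group):
            continue  # rollback: this group keeps its old values until the next round
        dim = len(params[group[0]])
        mean = [sum(params[p][k] for p in group) / len(group) for k in range(dim)]
        for p in group:
            new_params[p] = list(mean)
    return new_params


rng = random.Random(0)
params = {"a": [0.0], "b": [2.0], "c": [4.0], "d": [6.0]}
for _ in range(5):  # in a real system, each round would run every T seconds
    params = average_round(params, group_size=2, rng=rng)
print(params)
```

Averaging within groups preserves the global mean, so repeated rounds with fresh random groups move all peers toward the same value without any network-wide synchronization.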
enhancement help wanted

opened by justheuristic 7
GPU lost

Hi there,

In some experiments, I face a situation where one GPU is lost during training, and I have to restart the work. Have you ever encountered this issue? Thank you.

opened by elricwan 6
Convert hivemind.Server/RemoteModuleCall/RemoteCallMany to libp2p backend
[depends on #238 to be merged ] After we've implemented P2P transport with nat traversal, we should switch the main components to libp2p backend to take advantage of this new transport.

One of three main components is hivemind.server.Server and its counterpart hivemind.client.RemoteExpert

On a client side, hivemind creates a RemoteExpert pytorch module that calls experts via _RemoteModuleCall (and _RemoteCallMany for DMoE)

A server receives incoming connections with several ConnectionHandler processes running in parallel. These processes run gRPC servers and hence should be switched to libp2p.

Checklist

[x] find some way to attach several processes to one RPC (as in server/connection_handler.py)

[ ] make sure it passes tests/test_moe.py

[ ] make sure it passes tests/test_training.py

[x] tune performance in tests/benchmark_throughput.py

enhancement server
opened by justheuristic 6
[BUG] Loss did not decrease in Albert example after 125000 max step.
Describe the bug

I ran the ALBERT example with WikiText data. I used one peer with default settings (target_batch_size=4096, train_batch_size=4, max_step=125000, lr=0.00176), but the loss did not decrease during training: it started at 11 and finished at 11.

Jan 15 10:30:14.734 [INFO] Step #1 loss = 11.04938
Jan 15 10:32:14.842 [INFO] Step #2 loss = 11.05589
Jan 15 10:34:14.975 [INFO] Step #3 loss = 11.06803
Jan 15 10:36:15.093 [INFO] Step #4 loss = 11.06271
Jan 15 10:38:15.228 [INFO] Step #5 loss = 11.06433
Jan 15 10:40:15.337 [INFO] Step #6 loss = 11.05447
Jan 15 10:41:45.401 [INFO] Step #7 loss = 11.06115
Jan 15 10:43:45.541 [INFO] Step #8 loss = 11.06025
..........
Jan 15 18:09:13.117 [INFO] Step #238 loss = 11.05597
Jan 15 18:11:13.233 [INFO] Step #239 loss = 11.06724
Jan 15 18:13:13.369 [INFO] Step #240 loss = 11.06289
Jan 15 18:15:13.494 [INFO] Step #241 loss = 11.05922
Jan 15 18:16:43.577 [INFO] Step #242 loss = 11.05226
Jan 15 18:18:43.691 [INFO] Step #243 loss = 11.05418
Jan 15 18:20:43.843 [INFO] Step #244 loss = 11.05638
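With the settings reported above (one peer, target_batch_size=4096, per-device batch size 4), each logged step needs many local batches, which fits the gap of roughly two minutes between log lines. The arithmetic can be sketched as follows; the function name is hypothetical, and a single GPU per peer with no gradient accumulation is assumed.

```python
import math


def local_batches_per_global_step(target_batch_size, per_device_batch_size,
                                  num_peers=1, grad_accum_steps=1):
    """Number of local batches each peer processes before one global step."""
    samples_per_round = per_device_batch_size * grad_accum_steps * num_peers
    return math.ceil(target_batch_size / samples_per_round)


print(local_batches_per_global_step(4096, 4))               # one peer
print(local_batches_per_global_step(4096, 4, num_peers=8))  # eight such peers
```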

To Reproduce

Run the scripts in the ALBERT example. For the monitor, I run:

python run_training_monitor.py \
  --experiment_prefix albert_experiment \
  --wandb_project albert_wandb

For trainer, I run:

IP=/ip4/192.168.0.188/tcp/45731/p2p/QmSRerwCPUSreHhwMuTLHoVHqTfWuT8J57w3sXFZtU8ECo

WANDB_DISABLED=true CUDA_VISIBLE_DEVICES=0 python run_trainer.py \
  --experiment_prefix albert_experiment \
  --initial_peers $IP \
  --logging_first_step \
  --output_dir ./outputs \
  --overwrite_output_dir \
  --logging_dir ./logs \
  --dataset_path="/home/protago/Xiangpeng/hivemind/examples/albert/data/albert_tokenized_wikitext" \
  --per_device_train_batch_size 4 \
  --learning_rate 0.00176 \
  --num_train_epochs=5 \
  --save_steps=60000

Environment Please list:

python version (e.g. 3.8.1); 3.8

hivemind.version; 1.1.0

Please copy and paste the output from pytorch environment collection script

If the script doesn't work, please report pytorch and numpy versions manually. We also encourage you to include any additional information that you believe can help us solve the issue.
bug
opened by elricwan 5
Delayed Parameter Update when step(wait=False)

Is your feature request related to a problem? Please describe.

Eh, this could be a question. I'm trying to use TrainingAverager with step(wait=False). That requires data_lock and use_old_local_tensor=True follows.

When use_old_local_tensor=True, is it correct to simply add the weight difference between the local model and the all-reduced model to the new model parameters? The gradients calculated from the old model parameters are being added to the new model parameters. That doesn't seem quite right.

Describe the solution you'd like

https://arxiv.org/abs/2101.06840 proposes Delayed Parameter Update. Parameter update is delayed by one step. Apparently, it makes little difference in the training curve if DPU is applied after 40 iterations in BERT-large training.

I think that to implement DPU, you simply have to copy the averaged tensor back to the model at the beginning of step().

Describe alternatives you've considered

I understand that if the weight difference is not added back, the local steps taken before the asynchronous all-reduce completes are wasted. Not only does this defeat the purpose of asynchronous all-reduce (if local updates are going to be wasted until the async round completes, why not just go sync), but it also skips over input data, which could hurt training.
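The two update rules discussed in this request can be compared on plain numbers. This is a sketch under stated assumptions: parameters are lists of floats rather than tensors, and the names `apply_delta` and `overwrite` are hypothetical. `apply_delta` follows the "add the weight difference" rule described in the question, while `overwrite` is the plain copy-back that the request describes for DPU.

```python
def apply_delta(current, snapshot, averaged):
    """Add (averaged - snapshot) to the current parameters.

    snapshot: parameters that were sent into all-reduce
    averaged: result of all-reduce
    current:  parameters now, after extra local steps taken during averaging
    """
    return [c + (a - s) for c, s, a in zip(current, snapshot, averaged)]


def overwrite(current, snapshot, averaged):
    """Copy the averaged parameters back; local steps made meanwhile are dropped."""
    return list(averaged)


snapshot = [1.0]   # value when averaging started
averaged = [0.5]   # value agreed on by the group
current = [0.75]   # local value after one more local step (moved by -0.25)

print(apply_delta(current, snapshot, averaged))  # local progress is kept
print(overwrite(current, snapshot, averaged))    # local progress is discarded
```

The delta rule keeps the local step but applies it on top of parameters it was not computed for, which is the concern raised in the question.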
enhancement help wanted

opened by bgyoon 5
Set default DHT num_workers = 4

This change seems to speed up (a) DHT get requests by 3.6x and (b) DHT creation by 1.2x (probably due to speeding up the communication with initial nodes).

benchmark_dht.py

nora, this PR, max_workers = 8:

nora, master (fb4813347a18a01d2c780232a5f86266bbd49d26, see #318), max_workers = 1:

opened by borzunov 5
Tutorial: ALBERT-large collaborative training
Let's implement a basic example for collaborative training with ALBERT

core training code ( @leshanbog )

[x] implement basic training scripts (run_first_peer/run_trainer) based on mryab/collaborative-training

[x] achieve exact match with old training code

[x] test fault tolerance against common network failures

update metric logging code (@yhn112 )

[x] tune first peer's wandb to avoid crashes (or restart on crashes)

[x] use the same DHT key prefix for metrics and averaging (aka self.prefix)

[x] make the prototype pep8-compliant

add basic security layer: (@borzunov )

[x] protect value types in:

[x] progress

[x] averaging

[x] metrics

[x] ensure that the averager validates reasonable min/max values (i.e. for batches_processed)

[x] make sure it supports DataParallel on peers

[x] make sure it will save/load scheduler state dict correctly with optimizer

add full description in README.md
opened by justheuristic 5

[BUG] Unable to train a bfloat16-compressed model

Describe the bug

Jan 04 22:30:14.302 [INFO] test-run-1112b accumulated 10 samples for epoch #0 from 2 peers. ETA 0.00 sec (refresh in 0.50 sec)
Jan 04 22:30:14.476 [INFO] Beginning optimizer step #0
Jan 04 22:31:26.924 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
    torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_P' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_Q' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:35:47.094 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
    torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_P' did not finish.
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_Q' did not finish.
Jan 04 22:35:47.095 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:40:07.221 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead

To Reproduce

git clone https://github.com/the-beee/naifu-diffusion
cd naifu-diffusion
pip install -r requirements.txt
python trainer.py

Please update config/distributed.yaml to include the peer's address in the hivemind section before starting the second peer.

Environment

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.8.6
[pip3] torch==1.13.1
[pip3] torch-ema==0.3
[pip3] torchmetrics==0.11.0
[pip3] torchvision==0.14.1
[pip3] hivemind==1.1.4
[conda] Could not collect

bug

opened by the-beee 0

Add Codespell to CI, fix typos

This PR applies Codespell to the repo and attempts to fix most of the typos found by this tool; the rest are debatable. It also adds Codespell to CI to prevent (or at least highlight) future typos; you can see that it works by navigating the PR diff or the diff for this commit.

opened by mryab 2
Mismatched protobuf versions in sub-dependencies
When installing hivemind (as a dependency of petals) using pipenv, pipenv failed to resolve a valid version for protobuf:

Could not find a version that matches protobuf<4.0.0,<4.0dev,<5.0dev,>=3.12.2,>=3.20.3,>=4.21.6

Here's the trimmed dependency graph for hivemind to show the conflicts:

- hivemind [required: ==1.1.3, installed: 1.1.3]
  - grpcio-tools [required: >=1.33.2, installed: 1.51.1]
    - protobuf [required: <5.0dev,>=4.21.6, installed: 3.20.3]
  - protobuf [required: <4.0.0,>=3.12.2, installed: 3.20.3]

I haven't tested if this causes any actual issues, but it looks risky.
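The conflict can be checked by hand: the two protobuf ranges in the graph above do not intersect. The sketch below uses simple tuple comparison rather than a real resolver, simplifies the upper bound "5.0dev" to "5.0", and uses hypothetical helper names.

```python
def parse(version: str):
    """Turn '3.20.3' into (3, 20, 3) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, bounds):
    """True if lower <= version < upper."""
    lower, upper = bounds
    return parse(lower) <= parse(version) < parse(upper)


def ranges_overlap(a, b):
    """Two half-open ranges overlap iff the larger lower bound is below the smaller upper bound."""
    lower = max(parse(a[0]), parse(b[0]))
    upper = min(parse(a[1]), parse(b[1]))
    return lower < upper


hivemind_range = ("3.12.2", "4.0.0")    # protobuf<4.0.0,>=3.12.2
grpcio_tools_range = ("4.21.6", "5.0")  # protobuf<5.0dev,>=4.21.6

print(ranges_overlap(hivemind_range, grpcio_tools_range))  # no common version exists
print(satisfies("3.20.3", hivemind_range))                 # the installed version fits hivemind
print(satisfies("3.20.3", grpcio_tools_range))             # but not grpcio-tools 1.51.1
```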
opened by briansemrau 2
[BUG] Cyclic references in TaskPool

Found in https://github.com/bigscience-workshop/petals/pull/150/files by @borzunov

TL;DR: ModuleBackends contain TaskPools as properties, but the TaskPools refer to the ModuleBackend's instance methods (e.g. self.forward).

This is harmless for run_server, but will potentially cause memory leaks if server is deleted and recreated.
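The pattern can be reproduced with simplified stand-in classes. These are not the hivemind classes, only a sketch of the same reference structure. One possible way to break the cycle, shown here as an assumption rather than the fix adopted upstream, is to hold the bound method through `weakref.WeakMethod`. The check relies on CPython reference counting.

```python
import gc
import weakref


class TaskPool:
    def __init__(self, process_func):
        self.process_func = process_func  # strong reference to a bound method


class WeakTaskPool:
    def __init__(self, process_func):
        # WeakMethod does not keep the method's owner alive.
        self._process_ref = weakref.WeakMethod(process_func)

    def process(self, *args):
        func = self._process_ref()
        if func is None:
            raise ReferenceError("the owner of this pool was garbage-collected")
        return func(*args)


class Backend:
    def __init__(self, pool_cls):
        # backend -> pool -> bound method -> backend: a cycle if pool_cls is TaskPool
        self.forward_pool = pool_cls(self.forward)

    def forward(self, x):
        return x * 2


def freed_without_gc(pool_cls):
    """Return True if deleting the backend frees it immediately (no cycle)."""
    gc.disable()  # rule out the cycle collector; only reference counting remains
    try:
        backend = Backend(pool_cls)
        probe = weakref.ref(backend)
        del backend
        return probe() is None
    finally:
        gc.enable()
        gc.collect()


print(freed_without_gc(TaskPool))      # the cycle keeps the backend alive
print(freed_without_gc(WeakTaskPool))  # no cycle, so it is freed at once
```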
bug

opened by justheuristic 0

Read {run_id}_progress from DHT manually throws exceptions

Hi,

I can't seem to be able to read the training information (like here) out of the DHT that was created by hivemind.

I can connect to the DHT and run the following:

> dht.store("key", "value", expiration=get_dht_time() + 600)
> dht.get("key")
ValueWithExpiration(value='value', expiration_time=1670845892.2483625)

However, when training with hivemind, I can't seem to get the data; two consecutive calls to get show two different behaviors.

Only the second call shows some actual training progress data, but not complete (1 out of 4 peers) and not in a way that allows me to access it compared to the documentation.

It seems that there is some issue with the get call being run asynchronously and not being able to decode the returning LocalTrainingProgress.

How does the tutorial data get/store differ from what hivemind does with the LocalTrainingProgress?

First call to get

>>> dht.get("hivemind-123_progress")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py", line 173, in get
    return future if return_future else future.result()
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/utils/mpfuture.py", line 257, in result
    return super().result(timeout)
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
msgpack.exceptions.ExtraData: unpack(b) received extra data.

Second call to get

>>> dht.get("hivemind-123_progress")
Dec 12 12:43:20.841 [ERROR] [asyncio._run:129] Task exception was never retrieved
future: <Task finished name='Task-13381' coro=<DHT._get() done, defined at /home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py:175> exception=ExtraData({'peer_id': b"\x12 W\xb23\xa4\x85\xd0\xfa\xad\n[t\xec\xc7\xfe'\xed\x1d\x94\x03\n\xf6\x11e\xf4\xe3j,\xf7\xae\xd5h\xca", 'epoch': 24, 'samples_accumulated': 0, 'samples_per_second': 10.078083213276257, 'time': 1670842945.1815588, 'client_mode': False}, b'[signature:P3NGbBDc4ujJwy2afKJSEXD/lsM1s7icix+h5LoxGk1K6ZFvq5vaf7vs4mokUm0TmYbeGMq85DV1M3nr/+lrVg/WGAtC3moq9iiigaKiNnhszcZPx1ls+UOoIbZXGh35kdIzCIr2qsV9GxheuPaohErMoEzxN+kAytZ+wEtxoxEgOCAXEdOGVmee0Dx6eIQVzs96d7aIEpucNLGRu8ylOvgjcZNOu+MMyqVTom3R6yvl8RRTh3Dj/0cS7a0ajo+osIx7ENIadL8Zh8Vqmw+evLR2dZhAULYhN/wq1C/8dNYZzM1C2spbjG9hMYlD33RUhmD0gE+rWP0OKHA7vUPtSA==]')>
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py", line 177, in _get
    result = await self._node.get(key, latest=latest, **kwargs)
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 543, in get
    result = await self.get_many([key], **kwargs)
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 565, in get_many
    results_by_id = await self.get_many_by_id(key_ids, sufficient_expiration_time, **kwargs)
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 620, in get_many_by_id
    search_results[key_id].add_candidate(self.protocol.storage.get(key_id), source_node_id=self.node_id)
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 844, in add_candidate
    self.finish_search()
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 873, in finish_search
    self.serializer.loads(value_bytes), item_expiration_time
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/utils/serializer.py", line 72, in loads
    return msgpack.loads(buf, ext_hook=cls._decode_ext_types, raw=False)
  File "msgpack/_unpacker.pyx", line 201, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py", line 173, in get
    return future if return_future else future.result()
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/utils/mpfuture.py", line 257, in result
    return super().result(timeout)
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
msgpack.exceptions.ExtraData: unpack(b) received extra data.
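The ExtraData payload in the second traceback contains a decoded progress dictionary followed by leftover bytes that start with `[signature:`. One possible reading, which is an assumption rather than a confirmed diagnosis, is that the progress records are stored with a signature suffix and the reading side tried to deserialize the value without removing it. The sketch below reproduces that failure mode with the standard-library json module standing in for msgpack; `sign`, `strip_signature`, and the dummy signature are hypothetical helpers, not hivemind functions.

```python
import json

SIGNATURE_PREFIX = b"[signature:"  # marker visible in the traceback above


def sign(payload: bytes, signature: bytes) -> bytes:
    """Hypothetical writer: append a signature suffix to the serialized value."""
    return payload + SIGNATURE_PREFIX + signature + b"]"


def strip_signature(record: bytes) -> bytes:
    """Hypothetical reader step: drop the suffix before deserializing."""
    index = record.rfind(SIGNATURE_PREFIX)
    return record if index == -1 else record[:index]


progress = {"epoch": 24, "samples_accumulated": 0, "client_mode": False}
record = sign(json.dumps(progress).encode(), b"dummy-signature")

# A reader that is unaware of the suffix fails, much like msgpack's ExtraData.
try:
    json.loads(record)
    naive_error = None
except json.JSONDecodeError as error:
    naive_error = error.msg

# A reader that strips the suffix first recovers the original dictionary.
decoded = json.loads(strip_signature(record))
```

If this reading is right, the manual reader would need to handle signed records the same way the training peers do before the value reaches the deserializer.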

opened by cirquit 1

Releases(1.1.4)

1.1.4(Dec 2, 2022)
What's Changed

Update p2pd to v0.3.13 by @borzunov in https://github.com/learning-at-home/hivemind/pull/527

Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.3...1.1.4
1.1.3(Nov 29, 2022)
What's Changed

Update moe.md by @cirquit in https://github.com/learning-at-home/hivemind/pull/516

Fix "unable to open shared memory" while using MPFuture by @borzunov in https://github.com/learning-at-home/hivemind/pull/517

Fix MPFuture failing outside inference mode by @borzunov in https://github.com/learning-at-home/hivemind/pull/521

Bump torch to >=1.9.0 by @borzunov in https://github.com/learning-at-home/hivemind/pull/522

Fix P2PDaemon's idle timeout by @borzunov in https://github.com/learning-at-home/hivemind/pull/523

Support torch.bfloat16 in hivemind.compression by @borzunov in https://github.com/learning-at-home/hivemind/pull/524

Remove stale PeerIDs in hivemind-dht's routing table by @borzunov in https://github.com/learning-at-home/hivemind/pull/525

New Contributors

@cirquit made their first contribution in https://github.com/learning-at-home/hivemind/pull/516

Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.2...1.1.3
1.1.2(Oct 19, 2022)
What's Changed

Forbid protobuf 4.x in requirements by @justheuristic in https://github.com/learning-at-home/hivemind/pull/508

Check if identity is already taken by @borzunov in https://github.com/learning-at-home/hivemind/pull/511

Add Petals to "Example Use Cases" by @borzunov in https://github.com/learning-at-home/hivemind/pull/512

Follow up #501 and #511 with minor fixes by @borzunov in https://github.com/learning-at-home/hivemind/pull/513

Update bitsandbytes, relax its version constraint by @mryab in https://github.com/learning-at-home/hivemind/pull/510

Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.1...1.1.2
1.1.1(Sep 13, 2022)
What's Changed

Handle errors in Runtime by @justheuristic in https://github.com/learning-at-home/hivemind/pull/489

metadata type changed to bytes by @GreenFatGuy in https://github.com/learning-at-home/hivemind/pull/491

fix: Parameter Averaging quickstart clarification by @IAL32 in https://github.com/learning-at-home/hivemind/pull/492

Make DHT ignore SIGINT by @dbaranchuk in https://github.com/learning-at-home/hivemind/pull/493

Update README with latest projects and publications by @mryab in https://github.com/learning-at-home/hivemind/pull/494

Add links to "Example Use Cases" by @borzunov in https://github.com/learning-at-home/hivemind/pull/497

Support bfloat16 for autograd by @dbaranchuk in https://github.com/learning-at-home/hivemind/pull/499

Remove libp2p handlers when ConnectionHandler, DHT, and DecentralizedAverager are shut down by @borzunov in https://github.com/learning-at-home/hivemind/pull/501

Fix PyTorch warning suppression by @borzunov in https://github.com/learning-at-home/hivemind/pull/502

Fix a potential deadlock in await_asynchronously with nested locks by @justheuristic in https://github.com/learning-at-home/hivemind/pull/503

Require TaskPoolBase to implement load_batch_to_runtime by @justheuristic in https://github.com/learning-at-home/hivemind/pull/506

Change runtime.py to choose tasks with lowest (instead of highest) priority by @justheuristic in https://github.com/learning-at-home/hivemind/pull/505

Add support for quantization with bitsandbytes by @mryab in https://github.com/learning-at-home/hivemind/pull/490

New Contributors

@IAL32 made their first contribution in https://github.com/learning-at-home/hivemind/pull/492

@dbaranchuk made their first contribution in https://github.com/learning-at-home/hivemind/pull/493

Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.0...1.1.1
1.1.0(Jun 20, 2022)
Release highlights

Starting from this release, all components of hivemind.moe use libp2p for communication. This brings the same benefits that averaging and DHT gained earlier (simplified NAT traversal, better performance, etc.) and marks the end of gRPC usage in hivemind. The user API is mostly the same: if you were using abstractions like RemoteMixtureOfExperts, your code should not need to change, although cross-release training is not possible.

If you need another way to reduce the network footprint during training with hivemind.Optimizer, you can now use PowerSGD for gradient averaging. This method decreases the communication costs by factorizing the gradients of the model and aggregating the factorized versions. To enable this method in your code, pass grad_averager_factory=partial(PowerSGDGradientAverager, averager_rank=RANK) when creating an instance of Optimizer. Here, RANK denotes the factorization rank; lower values give higher compression at the cost of the reconstruction quality.
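The trade-off described above (a lower rank gives higher compression but a coarser reconstruction) can be illustrated with a minimal pure-Python sketch of low-rank factorization using a single power-iteration step. This is not hivemind's PowerSGDGradientAverager; the helper names (`matmul`, `transpose`, `orthonormalize`, `low_rank_round_trip`) and the toy matrix are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthonormalize(p):
    """Gram-Schmidt on the columns of p."""
    basis = []
    for col in transpose(p):
        for b in basis:
            dot = sum(x * y for x, y in zip(col, b))
            col = [x - dot * y for x, y in zip(col, b)]
        norm = sum(x * x for x in col) ** 0.5
        basis.append([x / norm for x in col] if norm > 1e-12 else col)
    return transpose(basis)


def low_rank_round_trip(grad, rank, seed=0):
    """Factorize an n x m matrix into P (n x rank) and Q (m x rank), then rebuild it."""
    rng = random.Random(seed)
    n, m = len(grad), len(grad[0])
    q = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(m)]
    p = orthonormalize(matmul(grad, q))   # n x rank, orthonormal columns
    q = matmul(transpose(grad), p)        # m x rank
    approx = matmul(p, transpose(q))      # projection of grad onto span(P)
    ratio = rank * (n + m) / (n * m)      # numbers sent vs. full matrix
    return approx, ratio


# Toy rank-1 "gradient": a rank-1 factorization reconstructs it exactly.
grad = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
approx, ratio = low_rank_round_trip(grad, rank=1)
```

For an n x m matrix, the two factors hold rank * (n + m) numbers instead of n * m, which is where the communication savings come from; matrices that are not low-rank are only approximated.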

Similarly to hivemind-server, it is now possible to launch a dedicated DHT instance with a command-line tool. The tool, available via hivemind-dht, can be used to quickly create a lightweight peer that is used mostly for connecting others to the DHT (for example, on a publicly available server) or for DHT metadata replication.

Previously, restarting a libp2p instance required generating a new P2P identity, which resulted in a new multiaddress. Thus, it was difficult to use the same command to connect to a peer in case of repeated launches, which is often the case during debugging. Now, you can store the persistent peer identity of a peer in a file and reuse it between launches: this is done by specifying the --identity_path argument, available both in the ALBERT example and CLI tools of hivemind.

Deprecations

The parameters quic, use_relay_hop, and use_relay_discovery of hivemind.P2P are deprecated since our update of the libp2p dependency in the p2p daemon. They will be removed in the 1.2.0 release of hivemind.

What's Changed

Pin pytest version in requirements-dev, use file_descriptor in tests by @justheuristic in https://github.com/learning-at-home/hivemind/pull/454

Pin isort version, bump black by @mryab in https://github.com/learning-at-home/hivemind/pull/456

Clean compression/__init__.py by @borzunov in https://github.com/learning-at-home/hivemind/pull/460

Do not use offload_optimizer with local_updates by default by @foksly in https://github.com/learning-at-home/hivemind/pull/462

Add PowerSGD for compressed gradient averaging by @artek0chumak in https://github.com/learning-at-home/hivemind/pull/432

Bump Black to 22.3.0, pin Golang version by @mryab in https://github.com/learning-at-home/hivemind/pull/466

use_local_updates in optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/468

Update p2pd to v0.3.8 (and libp2p to v0.17.0) by @borzunov in https://github.com/learning-at-home/hivemind/pull/469

Generate new private key if identity file doesn't exist by @borzunov in https://github.com/learning-at-home/hivemind/pull/473

Convert hivemind.server to libp2p backend by @GreenFatGuy in https://github.com/learning-at-home/hivemind/pull/470

Implement a CLI for hivemind.DHT by @mryab in https://github.com/learning-at-home/hivemind/pull/465

Use PeerID exclusively to address MoE experts by @justheuristic in https://github.com/learning-at-home/hivemind/pull/479

Remove deprecated code in hivemind.optim and hivemind.averaging before the 1.1.0 release by @mryab in https://github.com/learning-at-home/hivemind/pull/480

Fix shape validation in GradientAverager by @mryab in https://github.com/learning-at-home/hivemind/pull/481

Change expiration time in declare_experts, fix update_period discrepancy by @justheuristic in https://github.com/learning-at-home/hivemind/pull/482

Add identity_path option for MoE.Server runners by @GreenFatGuy in https://github.com/learning-at-home/hivemind/pull/484

Simplify ExpertBackend interface by @justheuristic in https://github.com/learning-at-home/hivemind/pull/483

Clean up imports, remove unused utils by @mryab in https://github.com/learning-at-home/hivemind/pull/486

finish renaming experts -> module_backends in ConnectionHandler by @justheuristic in https://github.com/learning-at-home/hivemind/pull/487

Remove gRPC services and grpcio requirement by @mryab in https://github.com/learning-at-home/hivemind/pull/485

New Contributors

@GreenFatGuy made their first contribution in https://github.com/learning-at-home/hivemind/pull/470

Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.0.1...1.1.0
1.0.1(Feb 7, 2022)
What's Changed

Improve user-friendliness and fix misc errors in Optimizer, Averager and P2P by @justheuristic @pr-Mais @borzunov @mrseeker @mryab in https://github.com/learning-at-home/hivemind/pull/428

Skip gradient averaging if there are no other peers by @justheuristic @soodoshll @borzunov in https://github.com/learning-at-home/hivemind/pull/440

Move hivemind.Server from __init__, streamline imports by @mryab in https://github.com/learning-at-home/hivemind/pull/441

Change make_empty to make_zeros for TensorDescriptor by @mryab in https://github.com/learning-at-home/hivemind/pull/442

Fix offloaded optimizer with single peer by @justheuristic @elricwan @borzunov in https://github.com/learning-at-home/hivemind/pull/450

Fix "too many open files" issue by @yhn112 in https://github.com/learning-at-home/hivemind/pull/444

Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.0.0...1.0.1
1.0.0(Dec 20, 2021)
What's Changed

Fix averager speed for TCP connections by @borzunov in https://github.com/learning-at-home/hivemind/pull/373

Fix "Too many open files" and load state freezing by @justheuristic in https://github.com/learning-at-home/hivemind/pull/371

Prefetch while reading rpc_aggregate_part() outputs by @borzunov in https://github.com/learning-at-home/hivemind/pull/370

Use ModeClient in libp2p DHT in case of --client_mode by @borzunov in https://github.com/learning-at-home/hivemind/pull/374

Integrate p2pd logs and outputs into hivemind logging by @borzunov in https://github.com/learning-at-home/hivemind/pull/375

Split compression strategies into separate classes by @justheuristic in https://github.com/learning-at-home/hivemind/pull/366

Implement colored logs by @borzunov in https://github.com/learning-at-home/hivemind/pull/377

Parametrize max message size for persistent connections by @deniskamazur in https://github.com/learning-at-home/hivemind/pull/376

Make log handlers configurable, shorten entries by @borzunov in https://github.com/learning-at-home/hivemind/pull/378

Enable log handler in benchmarks and run_server by @borzunov in https://github.com/learning-at-home/hivemind/pull/380

Fix step_tolerance in CollaborativeOptimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/383

Fix pickle vulnerability by @deniskamazur in https://github.com/learning-at-home/hivemind/pull/386

Remove arguments with default values from example instructions by @borzunov in https://github.com/learning-at-home/hivemind/pull/388

Implement weight as part of the allreduce protocol, not matchmaking by @justheuristic in https://github.com/learning-at-home/hivemind/pull/384

Support different AMP & buffer configurations in one experiment, fix minor bugs by @justheuristic in https://github.com/learning-at-home/hivemind/pull/389

Fix codecov_in_develop_mode with pip>=21.2 by @justheuristic in https://github.com/learning-at-home/hivemind/pull/393

Fix minor issues in documentation by @borzunov in https://github.com/learning-at-home/hivemind/pull/392

Apply averager updates asynchronously by @justheuristic in https://github.com/learning-at-home/hivemind/pull/395

Fix schema typing by @justheuristic in https://github.com/learning-at-home/hivemind/pull/396

backport PerformanceEMA from server_side_averaging by @justheuristic in https://github.com/learning-at-home/hivemind/pull/397

Add an option to pre-schedule averaging by @justheuristic in https://github.com/learning-at-home/hivemind/pull/398

Move DHT to dht/dht.py, update DHT figure by @justheuristic in https://github.com/learning-at-home/hivemind/pull/399

[hotfix] replace StepControl.can_modify with began_allreduce by @justheuristic in https://github.com/learning-at-home/hivemind/pull/402

move PerformanceEMA to utils, TrainingAverager to optim, update utils by @justheuristic in https://github.com/learning-at-home/hivemind/pull/405

Add GradientAverager with support for delayed averaging by @justheuristic in https://github.com/learning-at-home/hivemind/pull/404

[hivemind.Optimizer] TrainingStateAverager by @justheuristic in https://github.com/learning-at-home/hivemind/pull/407

Catch OSError in MPFuture by @artek0chumak in https://github.com/learning-at-home/hivemind/pull/409

[hivemind.Optimizer] ProgressTracker by @justheuristic in https://github.com/learning-at-home/hivemind/pull/408

Fix minor bugs in GradientAverager by @justheuristic in https://github.com/learning-at-home/hivemind/pull/410

Make target group size optional by @justheuristic in https://github.com/learning-at-home/hivemind/pull/412

Prepare GradScaler for hivemind.Optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/413

Patch recursive cancel in StepControl by @justheuristic in https://github.com/learning-at-home/hivemind/pull/411

Replace the invalid link to discord by @artek0chumak in https://github.com/learning-at-home/hivemind/pull/414

Implement state sharing priority by @justheuristic in https://github.com/learning-at-home/hivemind/pull/415

Implement core functionality of hivemind.Optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/403

DHT Benchmark with asynchronous w/r by @MuXauJl11110 in https://github.com/learning-at-home/hivemind/pull/406

Hotfix: load_state_from_peers with offload_optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/417

Improve Optimizer docs, update quickstart to use Optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/416

Quickstart: typos and references by @justheuristic in https://github.com/learning-at-home/hivemind/pull/420

Remove trailing dots in log messages and errors by @borzunov in https://github.com/learning-at-home/hivemind/pull/419

Do not log caller for INFO messages by @borzunov in https://github.com/learning-at-home/hivemind/pull/418

Improve hivemind.optim.experimental and averager stability by @borzunov in https://github.com/learning-at-home/hivemind/pull/421

Add minor tweaks learned from the NeurIPS demo run by @justheuristic in https://github.com/learning-at-home/hivemind/pull/422

Improve All-Reduce fault-tolerance by @justheuristic in https://github.com/learning-at-home/hivemind/pull/423

Fix All-Reduce fault-tolerance: catch Exception instead of BaseException by @justheuristic in https://github.com/learning-at-home/hivemind/pull/424

Fix Task was destroyed but is pending (put items) by @justheuristic in https://github.com/learning-at-home/hivemind/pull/427

Use hivemind.Optimizer in examples/albert by @mryab in https://github.com/learning-at-home/hivemind/pull/426

New Contributors

@artek0chumak made their first contribution in https://github.com/learning-at-home/hivemind/pull/409

@MuXauJl11110 made their first contribution in https://github.com/learning-at-home/hivemind/pull/406

Full Changelog: https://github.com/learning-at-home/hivemind/compare/0.10.0...1.0.0
0.10.0(Aug 26, 2021)
This release contains the following new features and bugfixes:

Fix deadlocks in DecentralizedAverager and MPFuture (#331) (@borzunov @justheuristic)

Resolve deadlock in MPFuture (#337) (@justheuristic @borzunov @yhn112)

Convert averager to libp2p backend (#323) (@borzunov @mryab)

Refactor naming and serialization for PeerIDs (#339) (@borzunov)

Set default DHT num_workers = 4 (#342) (@borzunov @deniskamazur @justheuristic @mryab)

Fix typo in dht.md (#345) (@justheuristic)

Fix some warnings related to asyncio (#346) (@borzunov)

Speed up P2P client creation (#343) (@deniskamazur @borzunov)

Propagate startup errors from DHT and averager processes (#347) (@borzunov)

Add less comparator for PeerID (#353) (@deniskamazur @borzunov)

Fix minor asyncio issues in averager (#356) (@borzunov @justheuristic)

Optimize unary handlers with persistent connections to P2P daemon (#328) (@deniskamazur)

Fix import error breaking AllReduceRunner._send_error_to_peer() (#360) (@borzunov)

Fix logger warning in P2P (#361) (@borzunov)

Disable QUIC (#355) (@borzunov)

Disable elasticity for averaging, add error handling (#362) (@justheuristic @mryab)

Improve Matchmaking finalizers (#357) (@borzunov)

Allow to specify P2P identity file (#363) (@borzunov)

Fix loglevel for a message in _read_from_persistent_conn() (#364) (@borzunov)

0.9.10(Jul 16, 2021)
This release contains the following features and bugfixes:

Add p2pd to package_data (#287) (@mryab)

Add per-tensor compression, make All-Reduce faster and more flexible (#272) (@justheuristic @mponty @mryab @yhn112 @borzunov)

Fix race condition while reserving ports in P2P (#299) (@borzunov)

Add graceful shutdown to DHT and Averager (#301) (@justheuristic @mryab)

Make checkpointing optional in example (#303) (@yhn112)

Refactor MPFuture to use a single pipe/thread per process (#298) (@justheuristic @borzunov @mryab @yhn112)

Split hivemind.client into hivemind.averaging and hivemind.moe (#304) (@mryab)

Update readthedocs with hivemind.optim (#288) (@yhn112 @justheuristic)

Minor fixes in examples/albert (#308) (@yhn112)

Upload the model with push_to_hub in examples (#297) (@leshanbog @mryab @justheuristic)

Account for multi-gpu devices in examples/albert (#309) (@justheuristic)

Convert DHT to libp2p backend (#296) (@borzunov @skobellev)

Simplify argument parsing, update docs in ALBERT example (#315) (@mryab @justheuristic @yhn112)

Improve P2P handler throughput and interface (#316) (@borzunov)

Remove shared memory from MPFuture, fix minor bugs (#317) (@justheuristic @borzunov @mryab)

Implement protobuf-based stream handlers over libp2p backend (#318) (@borzunov)

Refactor for v0.9.10 and fix example (#319) (@justheuristic @borzunov)

Update quickstart tutorials and acknowledgements (#307) (@justheuristic @yhn112 @borzunov @mryab)

0.9.9(Jun 22, 2021)
This release contains the following improvements and bugfixes:

Add relay options to P2P (#268) (@deniskamazur)

Add packaging to requirements (#269) (@deniskamazur)

Disable p2pd compilation by default (#270) (@yhn112 @justheuristic)

Measure testing coverage on pull request (#271) (@yhn112)

Update p2pd md5 checksum (#273) (@deniskamazur)

Use logging in benchmarks, fix libp2p-related issues (#280) (@justheuristic)

Add BibTeX reference for the library to README (#283) (@mryab)

Fix Codecov (#282) (@yhn112)

Remove use of packaging module (#284) (@borzunov)

Support auxiliary peers in CollaborativeOptimizer (#279) (@yhn112 @justheuristic @mryab)

0.9.8(Jun 7, 2021)
This release contains the following improvements and bugfixes:

Implement combining validators (#249) (@borzunov)

Decentralized adaptive optimizers (#243) (@nevec)

Add nltk to ALBERT example's requirements (#251) (@borzunov)

Protect training progress and metrics with signatures and DHT schema validation (#250) (@borzunov)

Add state checkpointing and uploading in coordinator (#241) (@leshanbog @mryab)

Fix random freezes in averager.step, improve error handling (#254) (@justheuristic @yhn112 @borzunov @mryab)

Fix device in Switch-MoE, overhaul Server architecture (#256) (@mryab)

Log more stats for user, move performance stats to examples (#257) (@yhn112)

Implement authorization for a moderated Hivemind network (#255) (@borzunov)

Improve error handling, remove deprecated functionality (#261) (@justheuristic @mryab)

Log correct loss in examples/albert/run_first_peer.py (#265) (@borzunov)

Fixed nan when compressing the tensor of zeros (#266) (@Vsevolod-pl)

Support auxiliary participants in AllReduceProtocol (#260) (@foksly)

Log collaboration step to Wandb, store metrics only if peer is synchronized (#267) (@borzunov @yhn112 @justheuristic)

Add initial support for connecting via libp2p (#238) (@MaximKsh @deniskamazur @skobellev @leshanbog @borzunov @mryab @yhn112)

0.9.7(Apr 27, 2021)
This release contains the following improvements and bugfixes:

Add RSA signature protection for DHT records (#187) (@borzunov)

Improve Runtime exception handling (#207) (@mryab)

Implement basic decentralized optimizers (#210) (@justheuristic, @mryab)

Add gradient clipping support to ExpertBackend (#214) (@mryab)

Convert SerializerBase to an abstract class (#212) (@mryab)

Replace FeedforwardBlock with a correct implementation (#211) (@mryab)

Disentangle DecentralizedAverager components, add averaging weights (#217) (@justheuristic @mryab)

Add CollaborativeOptimizer, TrainingAverager (#215) (@leshanbog @nevec @mryab)

Move compression-related code to hivemind.utils.compression (#213) (@mryab)

Prevent DecentralizedSGD from accidentally skipping a fraction of training batches (#218) (@ploshkin)

Add uniform compression (#202) (@mponty)

Add gradient buffers to CollaborativeOptimizer (#220) (@justheuristic)

Improve zero_grad behavior in CollaborativeOptimizer (#221) (@justheuristic)

Reset gradient buffers when synchronizing with peers (#222) (@justheuristic)

Add tool for custom user experts (#189) (@romakail @justheuristic)

Delta gradients transmission (#225) (@Vsevolod-pl)

Statistics averaging (#229) (@nevec)

Ensure version-consistent result rounding in load_balance_peers (#230) (@justheuristic @mryab)

Add Switch Transformers-like RemoteMixtureOfExperts (#228) (@mryab)

Add example for collaborative ALBERT training (#226) (@leshanbog @yhn112 @nevec @mryab)

Fix loss metric calculation (#240) (@yhn112)

Add DHT schema validator (#227) (@borzunov)

Fix server hanging in certain cases when connection is lost (#247) (@justheuristic)

Add Dockerfile, refactor tests (#245) (@mryab)

Fix incorrect data types/values in RemoteSwitchMixtureOfExperts (#246) (@mryab)

0.9.6(Apr 2, 2021)
This release adds several new features:

Client-only averaging in AllReduce (#176)

Expert learning rate scheduling (#196)

Quantile compression (#182)

Also, this release contains the following fixes and improvements:

Fix scalar deserialization (#190)

Extract expert-specific methods from DHT (#192)

0.9.5(Mar 5, 2021)
This release fixes several known bugs and security vulnerabilities:

Copytree implementation for py37 compatibility (#162)

Remove pickle.loads in Averager (#160)

Support edge cases for DHT key/subkey/value (#167)

Fix the remaining tests for py37 (#166)

Move Averager metadata serialization out of user scope (#168)

Handle edge cases in DecentralizedAverager (#171)

Fix a typo in quickstart.md (#174)

Serialize DHTID source with msgpack (#172)

Move CLI server launch script to hivemind/hivemind_cli (#173)

0.9.0(Feb 28, 2021)
Implement DecentralizedAverager for averaging model parameters & statistics across DHT peers (#119 #123 #134 #140 #141)

Accelerate RemoteMixtureOfExperts beam search with new key structure (#97 #101 #103 #109)

Implement lossy compression algorithms for tensors (#102 #106 #112)

Detect anomalies in RemoteMixtureOfExperts (#132)

Configure gRPC channels for long-term stability (#129 #131)

Load expert checkpoints on server startup (#138)

Support attention mask in example TransformerEncoder layer (#126)

Add the contribution guide (#156)

Bugfixes:

Fix wrong getattr in hivemind.Server (#122)

Enhancements:

Support python3.9 and torch1.7 (#142)

Blacklist nonresponsive peers with exponential backoff (#114)

Reuse grpc channels between calls (#120)

Verify DHT peer accessibility and local clock (#137)

Improve logging, remove duplicate log entries (#135)

Improve test coverage (#116)

0.8.2(Aug 28, 2020)
Remove name property from all asyncio tasks (compatibility with python3.7)

0.8.1(Aug 27, 2020)
Minor update:

you can now create a minimalistic hivemind server via ./script/run_server.py @Vsevolod-pl

./script/run_server.py can now sample experts from a pre-defined grid, e.g. expert.[0:256].[0:256]

added quickstart tutorial @justheuristic

v0.8.0(Aug 23, 2020)
Speed up tests, shutdown threads in server via threading.Event

Compile protobuf in setup.py

Update circleci pipelines

Update RTD pipeline

Refactor custom build_ext into install and develop

v0.7.1(Aug 16, 2020)


