FedTorch is an open-source Python package for distributed and federated training of machine learning models using PyTorch distributed API

Overview

FedTorch Logo

FedTorch is an open-source Python package for distributed and federated training of machine learning models using PyTorch distributed API. Various algorithms for federated learning and local SGD are implemented for benchmarking and research, including our own proposed methods:

And other common algorithms such as:

We are actively trying to expand the library to include more training algorithms as well.

NEWS

Recent updates to the package:

Installation

First you need to clone the repo into your computer:

git clone https://github.com/MLOPTPSU/FedTorch.git

The PyPi package will be added soon.

This package is built based on PyTorch Distributed API. Hence, it could be run with any supported distributed backend of GLOO, MPI, and NCCL. Among these three, MPI backend since it can be used for both CPU and CUDA runnings, is the main backend we use for developement. Unfortunately installing the built version of PyTorch does not support MPI backend for distributed training and needed to be built from source with a version of MPI installed that supports CUDA as well. However, do not worry since we got you covered. We provide a docker file that can create an image with all dependencies needed for FedTorch. The Dockerfile can be found here, where you can edit based on your need to create your customized image. In addition, since building this docker image might take a lot of time, we provide different versions that we built before along with this repository in the packages section.

For instance, you can pull one of the images that is built with CUDA 10.2 and OpenMPI 4.0.1 with CUDA support and PyTorch 1.6.0, using the following command:

docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi

The docker images can be used for cloud services such as Azure Machine Learning API for running FedTorch. The instructions for running on cloud services will be added in near future.

Get Started

Running different trainings with different settings is easy in FedTorch. The only thing you need to take care of is to set the correct parameters for the experiment. The list of all parameters used in this package is in parameters.py file. For different algorithms we will provide examples, so the relevant parameters can be set correctly. YAML support will be added in future for each distinct training. The parameters can be parsed from the input of the commandline using the following method:

from fedtorch.parameters import get_args

args = get_args()

When the parameters are set, we need to setup the nodes using those parameters and start the training. First, we need to setup the distributed backend. For instance, we can use MPI backend as:

import torch.distributed as dist

dist.init_process_group('mpi')

This will initialize the backend for distributed training. Next, we need to setup each node based on the parameters and create the graph of nodes. Then we need to initialize nodes and load their data. For this, we can use the node object in the FedTorch:

from fedtorch.nodes import Client

client = Client(args, dist.get_rank())
# Initialize the node
client.initialize()
# Initialize the dataset if not downloaded
client.initialize_dataset()
# Load the dataset
client.load_local_dataset()
# Generate auxiliary models and params for training
client.gen_aux_models()

Then, we need to call the appropriate training method and run the training. For instance, if the parameters are set for a FedAvg training, we can run:

from fedtorch.comms.trainings.federated import train_and_validate_federated

train_and_validate_federated(client)

Different distributed and federated algorithms in this package can be run using this procedure, and for simplicity, we provide main.py file, where can be used for running those algorithms following the same procedure. To run this file, we should run it using mpi and define number of clients (processes) that will run the same file for training using:

mpirun -np {NUM_CLIENTS} python main.py {args:values}

where {args:values} should be filled with appropriate parameters needed for the training. Next we provide a file to automatically build this command for mpi running for various situations.

Running Examples

To make the process easier, we have provided a run_mpi.py file that covers most of parameters needed for running different training algrithms. We first get into the details of different parameters and then provide some examples for running.

Dataset Parameters

For setting up the dataset there are some parameters involved. The main parameters are:

  • --data : Defines the dataset name for training.
  • --batch_size : Defines the size of the batch in the training.
  • --partition_data : Defines whether the data should be partitioned or each client access to the whole dataset.
  • --reshuffle_per_epoch : This can be set True for distributed training to have iid data accross clients and faster convergence. This is not inline with Federated Learning settings.
  • --iid_data : If set True, the data is randomly distributed accross clients. If is set to False, either the dataset itself is non-iid by default (like the EMNIST dataset) or it can be manullay distributed to be non-iid (like the MNIST dataset using parameter num_class_per_client). The default is True in the package, but in the run_mpi.py file the default is False.
  • --num_class_per_client: If the parameter iid is set to False, we can distribute the data heterogeneously by attributing certain number of classes to each client. For instance if setting --num_class_per_client 2, then each client will only has access to two randomly selected classes' data in the entire training process.

Federated Learning Parameters

To run the training using federated learning setups some main parameters are:

  • --federated : If set to True the training will be in one of federated learning setups. If not, it will be in a distributed mode using local SGD and with periodic averaging (that could be set using --local_step) and possibly reshuffling after each epoch. The default is False.
  • --federated_type : defines the type of fderated learning algorithm we want to use. The default is fedavg.
  • --federated_sync_type : It could be either epoch or local_step and it will be used to determine when to synchronize the models. If set to epoch, then the parameter --num_epochs_per_comm should be set as well. If set to local_step, then the parameter --local_steps should be set. The default is epoch.
  • --num_comms : Defines the number of communication rounds needed for trainings. This is only for federated learning, while in normal distirbuted mode the number of total iterations should be set either by --num_epochs or --num_iterations, and hence the --stop_criteria should be either epoch or iteration.
  • --online_client_rate : Defines the ratio of clients that are online and active during each round of communication. This is only for federated learning. The default value is 1.0, which means all clients will be active.

Learning Rate Schedule

Different learning rate schedules can be set using their corresponding parameters. The main parameter is --lr_schedule_scheme, which defines the scheme for learning rate scheduling. For more information about different learning rate schedulers, please see learning.py file.

Examples

Now we provide some simple examples for running some of the training algorithms on a single node with multiple processes using mpi. To do so, we first need to run the docker container with installed dependencies.

docker run --rm -it --mount type=bind,source="{path/to/FedTorch}",target=/FedTorch docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi
cd /FedTorch

This will run the container and will mount the FedTorch repo to it. The {path/to/FedTorch} should be replaced with your local path to the FedTorch repo directory. Now we can run the training on it.

FedAvg and FedGATE

Now, we can run the FedAvg algorithm for training an MLP model using MNIST data by the following command.

python run_mpi.py -f -ft fedavg -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2

This will run the training on 10 nodes with initial learning rate of 0.1, the batch size of 50, for 20 communication rounds each with 10 local steps of SGD. The dataset is distributed hetergeneously with each client has access to only 2 classes of data from the MNIST dataset.

Changing -ft fedavg to -ft fedgate will run the same training using the FedGATE algorithm. To run the FedCOMGATE algorithm we need to add -q to the parameter to enable quantization as well. Hence the command will be:

python run_mpi.py -f -ft fedgate -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -q

APFL

To run APFL algorithm a simple command will be:

python run_mpi.py -f -ft apfl -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp

where -pa 0.5 sets the alpha parameter of the APFL algorithm. The last parameter -fp will turn on the fed_personal parameter, which evaluate the personalized or the localized model using a local validation dataset. This will be mostly used for personalization algorithms such as APFL.

DRFA

To run a DRFA training we can use the following command:

python run_mpi.py -f -fd -ft fedavg -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -dg 0.1 

where -dg 0.1 sets the gamma parameter in the DRFA algorithm. Note that DRFA is a framework that can be run using any federated learning aggergator such as FedAvg or FedGATE. Hence the parameter -fd will enable DRFA training and -ft will define the federated type to be used for aggregation.

References

For this repository there are several different references used for each training procedure. If you use this repository in your research, please cite the following paper:

@article{haddadpour2020federated,
  title={Federated learning with compression: Unified analysis and sharp guarantees},
  author={Haddadpour, Farzin and Kamani, Mohammad Mahdi and Mokhtari, Aryan and Mahdavi, Mehrdad},
  journal={arXiv preprint arXiv:2007.01154},
  year={2020}
}

Our other papers developed using this repository should be cited using the following bibitems:

@inproceedings{haddadpour2019local,
  title={Local sgd with periodic averaging: Tighter analysis and adaptive synchronization},
  author={Haddadpour, Farzin and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad and Cadambe, Viveck},
  booktitle={Advances in Neural Information Processing Systems},
  pages={11082--11094},
  year={2019}
}
@article{deng2020distributionally,
  title={Distributionally Robust Federated Averaging},
  author={Deng, Yuyang and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}
@article{deng2020adaptive,
  title={Adaptive Personalized Federated Learning},
  author={Deng, Yuyang and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad},
  journal={arXiv preprint arXiv:2003.13461},
  year={2020}
}

Acknowledgement and Disclaimer

This repository is developed, mainly by MM. Kamani, based on our group's research on distributed and federated learning algorithms. We also developed the works of other groups' proposed methods using FedTorch for a better comparison. However, this repo is not the official code for those methods other than our group's. Some parts of the initial stages of this repository were based on a forked repo of Local SGD code from Tao Lin, which is not public now.

Comments
  • Errors when running the code using Docker

    Errors when running the code using Docker

    Hi, I followed your README instructions to run your algorithm but I encountered multiple errors along the way:

    • I tried to pull the image you provided on This issue #3 and run the following command: python run_mpi.py -f -ft apfl -n 10 -d cifar10 -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp -oc I keep getting warnings about my RTX 2080ti card which I think is not been recognized.
    warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    Traceback (most recent call last):
      File "main.py", line 49, in <module>
        main(args)
      File "main.py", line 21, in main
        client.initialize()
      File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
        init_config(self.args)
      File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
        torch.cuda.set_device(args.graph.device)
      File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
    THCudaCheck FAIL file=/workspace/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
    /usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:125: UserWarning:
    GeForce RTX 2080 Ti with CUDA capability sm_75 is not compatible with the current PyTorch installation.
    The current PyTorch install supports CUDA capabilities sm_35 sm_37 sm_52 sm_60 sm_61 sm_70 compute_70.
    If you want to use the GeForce RTX 2080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
    
      warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    Traceback (most recent call last):
      File "main.py", line 49, in <module>
        main(args)
      File "main.py", line 21, in main
        client.initialize()
      File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
        init_config(self.args)
      File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
        torch.cuda.set_device(args.graph.device)
      File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
    

    Regarding these warnings the code does not collapse but not even one log regarding the training procedure is printed.

    image

    In addition when I run nvidia-smi I see the following gpu utilization:

    image The utilization percentage across all 3 GPUs remains zero.

    • As second option I tried to build the image given the Dockerfile located in the docker folder. Yet again I encountered the following run error:
    Traceback (most recent call last):
      File "main.py", line 4, in <module>
        import torch.distributed as dist
      File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 189, in <module>
        from torch._C import *
    RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
    
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:
    
      Process name: [[51835,1],1]
      Exit code:    1
    --------------------------------------------------------------------------
    

    It looks like there is a problem with PyTorch installation so I tried something simple:

    image

    Upgrading Numpy's version didn't help me either. Can you please help and explain how to run your code?

    Thank you!

    opened by AvivSham 5
  • A question about APFL algorithm

    A question about APFL algorithm

    Hi, I read your paper and am interested in the algorithm of APFL. I have a question about the code. It seems that the code of APFL is a little different from the algorithm written on the paper. In the paper, each client maintains 3 models and first update the local version of global model, then update the local model and finally mix them to get the new personalized model. However, in this code, it just maintains 2 models and first update the local version of global model, then get the grad of personalized model by mixing the output of local version of global model and personalized model.

    The code: `

    inference and get current performance.

                        client.optimizer.zero_grad()
                        loss, performance = inference(client.model, client.criterion, client.metrics, _input, _target)
    
                        # compute gradient and do local SGD step.
                        loss.backward()
                        client.optimizer.step(
                            apply_lr=True,
                            apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
                        )
                        
                        client.optimizer.zero_grad()
                        client.optimizer_personal.zero_grad()
                        loss_personal, performance_personal = inference_personal(client.model_personal, client.model, 
                                                                                 client.args.fed_personal_alpha, client.criterion, 
                                                                                 client.metrics, _input, _target)
    
                        # compute gradient and do local SGD step.
                        loss_personal.backward()
                        client.optimizer_personal.step(
                            apply_lr=True,
                            apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
                        )
    

    ` Are they the same? Looking forward to your reply. Thank you!

    opened by BrightHaozi 3
  • a question about fedprox

    a question about fedprox

    in this page "https://github.com/MLOPTPSU/FedTorch/blob/main/fedtorch/comms/trainings/federated/main.py". line 123 -> 129, code is below.

    elif client.args.federated_type == 'fedprox':
        # Adding proximal gradients and loss for fedprox
        for client_param, server_param in zip(client.model.parameters(), client.model_server.parameters()):
            if client.args.graph.rank == 0:
                print("distance norm for prox is:{}".format(torch.norm(client_param.data - server_param.data )))
            loss += client.args.fedprox_mu /2 * torch.norm(client_param.data - server_param.data)
            client_param.grad.data += client.args.fedprox_mu * (client_param.data - server_param.data)
    

    client.args.fedprox_mu /2 * torch.norm(client_param.data - server_param.data) may mean $\frac{\mu}{2} \left\|w-w_t\right\|$. But, $\frac{\mu}{2} \left\|w-w_t\right\|^2$ is used in fedprox.

    I'm not sure what I said above is true. Thank you very much for your kind consideration.

    opened by bird-two 2
  • Could you release the code in

    Could you release the code in "Federated Learning with Compression: Unified Analysis and Sharp Guarantees"?

    Hi,

    I just read the paper of "Federated Learning with Compression: Unified Analysis and Sharp Guarantees", it looks very interesting While searching for the code I was directed to this repo, however, I didn't found the implementation of FedCOM, so I would like to ask if you would release it?

    Thanks!

    opened by Dzhange 1
  • Does BrokenPipeError Matters?

    Does BrokenPipeError Matters?

    Hi, I am very interested in your proposed method. It is really nice of you to share such a helpful and clear repo. I follow your repo and use your docker images to run the code, but I keep getting some errors as follows after some random starting round of the traning.

    Traceback (most recent call last): File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed send_bytes(obj) File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes self._send(header + buf) File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

    After all, I still can get a result, but I wonder whether the brokenpipeerror matters, am I still getting the correct results? Hope to hear your response soon.

    opened by RongDai430 1
  • No basic auth credentials

    No basic auth credentials

    I read your paper and am interested in your proposed method. It is really nice of you to share such a helpful and clear repo. I pulled one of the images built with CUDA 10.2 and OpenMPI 4.0.1 with CUDA support and PyTorch 1.6.0, using the following command: $ docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi

    However it reported an error "Error response from daemon: Head https://docker.pkg.github.com/v2/mloptpsu/fedtorch/fedtorch/manifests/cuda10.2-mpi: no basic auth credentials".
    How can I solve this error?

    opened by Sybil-Huang 1
  • Adding EMNIST full dataset

    Adding EMNIST full dataset

    Adding EMNIST full dataset in addition to its digits_only version, which was implemented before. Now, we have emnist and emnist_full datasets. The emnist dataset has 3383 clients with 10 classes, but the emnist_full has 3400 clients with 62 classes. Resolve the Batch Norm issue for training with 1 sample in the batch.

    opened by mmkamani7 1
  • setup problem in docker

    setup problem in docker

    Hello,when I ran docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi in my docker,there's a problem err do you have any ideas to solve it? Thanks.

    opened by b856741 0
  • A question about drawing the plots

    A question about drawing the plots

    Hi, Recently I've read your paper "Distributionally Robust Federated Averaging" carefully and I'm trying to reproduce the experiments in it. However, I don't know how to get the plots from the codes, I think modifying main.py and run_mpi.py would help but it seems really complicated. Could you please give me some instructions? Thanks!

    opened by Zhg9300 2
  • Models not saved for Federated Centered

    Models not saved for Federated Centered

    Hi While running main_centered.py the model checkpoints not saved. It is not Implemented?

    Is there anything else missing for main_centered.py?

    Regards, Ahmed

    enhancement 
    opened by aldahdooh 1
Releases(v0.1.1)
Owner
Machine Learning and Optimization Lab @PennState
This is the GitHub repository of the Machine Learning and Optimization lab at Penn State University.
Machine Learning and Optimization Lab @PennState
ESTDepth: Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks (CVPR 2021)

ESTDepth: Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks (CVPR 2021) Project Page | Video | Paper | Data We present a novel metho

65 Nov 28, 2022
A PyTorch-based library for semi-supervised learning

News If you want to join TorchSSL team, please e-mail Yidong Wang ([email protected]<

1k Jan 06, 2023
An open-source Deep Learning Engine for Healthcare that aims to treat & prevent major diseases

AlphaCare Background AlphaCare is a work-in-progress, open-source Deep Learning Engine for Healthcare that aims to treat and prevent major diseases. T

Siraj Raval 44 Nov 05, 2022
A no-BS, dead-simple training visualizer for tf-keras

A no-BS, dead-simple training visualizer for tf-keras TrainingDashboard Plot inter-epoch and intra-epoch loss and metrics within a jupyter notebook wi

Vibhu Agrawal 3 May 28, 2021
A PyTorch implementation of "Graph Classification Using Structural Attention" (KDD 2018).

GAM ⠀⠀ A PyTorch implementation of Graph Classification Using Structural Attention (KDD 2018). Abstract Graph classification is a problem with practic

Benedek Rozemberczki 259 Dec 05, 2022
Utility tools for the "Divide and Remaster" dataset, introduced as part of the Cocktail Fork problem paper

Divide and Remaster Utility Tools Utility tools for the "Divide and Remaster" dataset, introduced as part of the Cocktail Fork problem paper The DnR d

Darius Petermann 46 Dec 11, 2022
Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR)

Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR) This is the official implementation of our paper Personalized Tran

Yongchun Zhu 81 Dec 29, 2022
Contains code for Deep Kernelized Dense Geometric Matching

DKM - Deep Kernelized Dense Geometric Matching Contains code for Deep Kernelized Dense Geometric Matching We provide pretrained models and code for ev

Johan Edstedt 83 Dec 23, 2022
2021-AIAC-QQ-Browser-Hyperparameter-Optimization-Rank6

2021-AIAC-QQ-Browser-Hyperparameter-Optimization-Rank6

Aigege 8 Mar 31, 2022
retweet 4 satoshi ⚡️

rt4sat retweet 4 satoshi This bot is the codebase for https://twitter.com/rt4sat please feel free to create an issue if you saw any bugs basically thi

6 Sep 30, 2022
This repo tries to recognize faces in the dataset you created

YÜZ TANIMA SİSTEMİ Bu repo oluşturacağınız yüz verisetlerini tanımaya çalışan ma

Mehdi KOŞACA 2 Dec 30, 2021
Source code for "MusCaps: Generating Captions for Music Audio" (IJCNN 2021)

MusCaps: Generating Captions for Music Audio Ilaria Manco1 2, Emmanouil Benetos1, Elio Quinton2, Gyorgy Fazekas1 1 Queen Mary University of London, 2

Ilaria Manco 57 Dec 07, 2022
Resources complimenting the Machine Learning Course led in the Faculty of mathematics and informatics part of Sofia University.

Machine Learning and Data Mining, Summer 2021-2022 How to learn data science and machine learning? Programming. Learn Python. Basic Statistics. Take a

Simeon Hristov 8 Oct 04, 2022
This is a clean and robust Pytorch implementation of DQN and Double DQN.

DQN/DDQN-Pytorch This is a clean and robust Pytorch implementation of DQN and Double DQN. Here is the training curve: All the experiments are trained

XinJingHao 15 Dec 27, 2022
Semantic code search implementation using Tensorflow framework and the source code data from the CodeSearchNet project

Semantic Code Search Semantic code search implementation using Tensorflow framework and the source code data from the CodeSearchNet project. The model

Chen Wu 24 Nov 29, 2022
PyTorch implementation(s) of various ResNet models from Twitch streams.

pytorch-resnet-twitch PyTorch implementation(s) of various ResNet models from Twitch streams. Status: ResNet50 currently not working. Will update in n

Daniel Bourke 3 Jan 11, 2022
EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation

EFENet EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation Code is a bit messy now. I woud clean up soon. For training the EF

Yaping Zhao 19 Nov 05, 2022
Quantized models with python

quantized-network download .pth files to qmodels/: googlenet : https://download.

adreamxcj 2 Dec 28, 2021
The 1st Place Solution of the Facebook AI Image Similarity Challenge (ISC21) : Descriptor Track.

ISC21-Descriptor-Track-1st The 1st Place Solution of the Facebook AI Image Similarity Challenge (ISC21) : Descriptor Track. You can check our solution

lyakaap 75 Jan 08, 2023
Code for KDD'20 "An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph"

Heterogeneous INteract and aggreGatE (GraphHINGE) This is a pytorch implementation of GraphHINGE model. This is the experiment code in the following w

Jinjiarui 69 Nov 24, 2022