A fast MoE impl for PyTorch

Overview

Release note | 中文文档 | Slack workspace

Introduction

An easy-to-use and efficient system to support the Mixture of Experts (MoE) model for PyTorch.

Installation

Prerequisites

PyTorch with CUDA is required. The repository is currently tested with PyTorch v1.8.0 and CUDA 10, and is designed to be compatible with older versions.

If the distributed expert feature is enabled, NCCL with P2P communication support (typically version >= 2.7.5) is required.

Installing

FastMoE contains a set of customized PyTorch operators, including both C and Python components. Run python setup.py install to install FastMoE and start using it for training.

The distributed expert feature is disabled by default. If you want to enable it, pass environment variable USE_NCCL=1 to the setup script.

Note that an extra NCCL developer package is needed, and it has to be consistent with the NCCL version of your PyTorch, which can be inspected by running torch.cuda.nccl.version(). The official PyTorch docker image is recommended, as its environment is well set up. Otherwise, you can use the download link of all NCCL versions to get the NCCL package that suits you.
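
For example, the NCCL version that PyTorch was built against can be checked from Python before building, as a minimal sketch:

import torch

# Prints the NCCL version PyTorch was compiled with. An integer such as 2708
# corresponds to NCCL 2.7.8; newer PyTorch releases may return a tuple instead.
print(torch.cuda.nccl.version())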

Usage

FMoEfy a Transformer model

Transformer is currently one of the most popular model architectures to be extended with MoE. Using FastMoE, a Transformer-based model can be extended into an MoE model by a one-key plugin, as shown below.

For example, when using Megatron-LM, the following lines can easily scale the MLP layers up to multiple experts.

model = ...

from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=<number of experts per worker>)

train(model, ...)

A detailed tutorial to moefy Megatron-LM can be found here.

Using FastMoE as a PyTorch module

An example MoE Transformer model can be seen in the Transformer-XL example. The easiest way is to replace the MLP layers with the FMoE layers, as sketched below.
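
As a rough sketch of such a replacement (a minimal example assuming the FMoETransformerMLP interface that also appears in the issues later on this page; argument values here are illustrative):

import torch
from fmoe import FMoETransformerMLP

# A single MoE feed-forward block with 4 local experts,
# model dimension 512 and per-expert hidden dimension 2048.
moe_mlp = FMoETransformerMLP(num_expert=4, d_model=512, d_hidden=2048, world_size=1).cuda()

x = torch.rand(16, 512).cuda()  # a batch of 16 tokens of width d_model
y = moe_mlp(x)                  # output keeps the input shape
print(y.shape)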

Using FastMoE in Parallel

FastMoE supports both data parallel and model parallel.

Data Parallel

In FastMoE's data parallel mode, both the gate and the experts are replicated on each worker. The following figure shows the forward pass of a 3-expert MoE with 2-way data parallel.

For data parallel, no extra coding is needed. FastMoE works seamlessly with PyTorch's DataParallel or DistributedDataParallel. The only drawback of data parallel is that the number of experts is constrained by each worker's memory.
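
For instance, a minimal sketch of wrapping an FMoE-based layer with PyTorch's DistributedDataParallel (the launcher and local_rank handling below are assumptions that depend on your setup):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from fmoe import FMoETransformerMLP

# Assumes the script is started by a distributed launcher such as torchrun.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Every worker holds a full replica of the gate and all 4 experts.
model = FMoETransformerMLP(num_expert=4, d_model=512, d_hidden=2048, world_size=1).cuda()
model = DistributedDataParallel(model, device_ids=[local_rank])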

Model Parallel

In FastMoE's model parallel mode, the gate network is still replicated on each worker but experts are placed separately across workers. Thus, by introducing additional communication cost, FastMoE enjoys a large expert pool whose size is proportional to the number of workers.

The following figure shows the forward pass of a 6-expert MoE with 2-way model parallel. Note that experts 1-3 are located in worker 1 while experts 4-6 are located in worker 2.

FastMoE's model parallel mode requires sophisticated parallel strategies that neither PyTorch nor Megatron-LM provides. The fmoe.DistributedGroupedDataParallel module is introduced to replace PyTorch's DDP module, as sketched below.
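
A minimal sketch of the model parallel setup, following the interfaces used in the issues below (num_expert is the number of experts hosted on each worker, so the global pool holds num_expert * world_size experts):

import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Each worker hosts 2 local experts; tokens routed to a remote expert
# are exchanged between workers by FastMoE.
model = FMoETransformerMLP(num_expert=2, d_model=512, d_hidden=2048,
                           world_size=dist.get_world_size()).cuda()

# fmoe's replacement for PyTorch DDP: replicated (non-expert) parameters
# are synchronized, while each worker's experts stay local.
model = DistributedGroupedDataParallel(model)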

Troubleshootings / Discussion

If you have any problem using FastMoE, or you are interested in getting involved in developing FastMoE, feel free to join our Slack channel.

Comments
  • The program hangs in the forward function when using model parallel in Megatron-LM

    Thanks for your work!!! I love it very much!! I met a problem; hope you can help me. Thanks a lot!

    • I use pretrain_bert_distributed_with_mp.sh in Megatron-LM to train a model
    • However, when I use fmoefy, the program hangs in the forward function when using model parallel
    • Before using fmoefy, the program runs well in Megatron-LM's model parallel mode
    • I updated the Megatron-LM (v2.2) code according to the instructions in https://github.com/laekov/fastmoe/blob/master/examples/megatron/fmoefy-v2.2.patch

    Platform

    • v100 , single node ,8gpu
    • pytorch:1.8.0
    • cuda11.1
    • cudnn8

    Update

    • If pipeline-model-parallel-size is set to 1, the program runs well (with tensor-model-parallel-size > 1)
    • If pipeline-model-parallel-size > 1, the program hangs
    opened by seanM29 14
  • data parallel with fmoe

    Test scenarios: (1) 1 GPU, num_experts=8, batch_size=8, expert_dp_num="none", dp_rank=0, world_size=1; (2) 4 GPUs, num_experts=2, batch_size=2, expert_dp_num="none", dp_rank=[0,1,2,3], world_size=4

    In these two scenarios, with the same learning rate, the per-step loss should be roughly the same, but in practice the loss in the second scenario drops noticeably more slowly than in the first. What could be the reason?

    opened by hclearner 12
  • Bias improvement #15

    The bias term is now accepted in MOELinear.apply, as requested in #15.

    The solution is 2-dimensional and each block handles 32 columns. Rows don't matter: only one block is launched vertically, and it sums all the following ones. Thanks @laekov for the suggestion.

    I added different tensor types for the numerical stability tests. I had to set a different precision for the half tensors.

    Let me know if anything isn't clear

    opened by TiagoMAntunes 11
  • nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed

    Describe the bug 'nccl.h' file is not found or ncclUnhandledCudaError: Call to CUDA function failed

    To Reproduce Steps to reproduce the behavior:

    1. USE_NCCL=1 python setup.py install

    Logs

    running install
    running bdist_egg
    running egg_info
    writing fastmoe.egg-info/PKG-INFO
    writing dependency_links to fastmoe.egg-info/dependency_links.txt
    writing top-level names to fastmoe.egg-info/top_level.txt
    reading manifest file 'fastmoe.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file 'fastmoe.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    running build_ext
    building 'fmoe_cuda' extension
    Emitting ninja build file /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    [1/7] c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
    FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o 
    c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    In file included from /home/xinglinpan/fastmoe-master/cuda/global_exchange.h:1:0,
                     from /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp:1:
    /home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
    compilation terminated.
    [2/7] /usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
    FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o 
    /usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
    In file included from /home/xinglinpan/fastmoe-master/cuda/balancing.cuh:1:0,
                     from /home/xinglinpan/fastmoe-master/cuda/balancing.cu:2:
    /home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
    compilation terminated.
    

    Try to fix

    1. Download nccl_2.7.8-1+cuda10.2_x86_64
    2. Set environment variables as mentioned
    3. USE_NCCL=1 python setup.py install
    Installed /home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg
    Processing dependencies for fastmoe==1.0.0
    Finished processing dependencies for fastmoe==1.0.0
    
    1. cd test && pytest test_ddp.py
    Traceback (most recent call last):
      File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
        locals()[sys.argv[1]](**args)
      File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
        torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
      File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
        work = default_pg.allgather([tensor_list], [tensor])
    RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
    ncclUnhandledCudaError: Call to CUDA function failed.
    Traceback (most recent call last):
      File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
        locals()[sys.argv[1]](**args)
      File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
        torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
      File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
        work = default_pg.allgather([tensor_list], [tensor])
    RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
    ncclUnhandledCudaError: Call to CUDA function failed.
    

    Platform

    • Device: GeForce RTX 2080Ti
    • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    • CUDA version: 10.2
    • NCCL version: 2.7.8-1
    • PyTorch version: 1.9.1
    • Python Version: 3.8

    Additional context

    >>> torch.cuda.nccl.version()
    2708
    

    Could some necessary environment variables be lost during the subprocess.Popen call?

    https://github.com/laekov/fastmoe/blob/670e1407eb1f674a47c45c78567d9217e062caab/tests/test_ddp.py#L44

    opened by Fragile-azalea 9
  • Can't find ProcessGroupNCCL.hpp

    Hi all. I've installed fmoe without the USE_NCCL option successfully. However, when I turned this option on, I got the following error:

    cuda/moe.cpp:112:37: fatal error: c10d/ProcessGroupNCCL.hpp: No such file or directory.

    Environment: PyTorch 1.3, CUDA 10.0, Linux

    Looking forward to your advice.

    opened by zjujh1995 9
  • How to use a convolution operator as the expert?

    Hi, I am trying to train a convolution-backbone network with MoE. I have run into two difficulties. The first is that the current API does not seem directly usable: the FMoE class requires the hidden dimension as a parameter, but a convolution layer does not define a hidden dimension explicitly.

    Second, I found that the FMoE class cannot accept tensors with more than 2 dimensions. Therefore, I guess I cannot directly pass the image (with shape N, C, H, W) into the layer? My code snippet is

    from fmoe.layers import FMoE
    import torch
    from fmoe.gates import NaiveGate, SwitchGate

    N = 3
    num_expert = 2
    hidden_size = 5
    out_feature = 4

    layer = torch.nn.Linear(in_features=hidden_size, out_features=out_feature).to("cuda")
    layer.weight = torch.nn.Parameter(torch.ones_like(layer.weight))
    my_moe = FMoE(num_expert=num_expert, d_model=hidden_size, top_k=1, expert=layer, gate=SwitchGate).to("cuda")
    inputs = torch.rand((N, 1, hidden_size)).to("cuda")
    print(my_moe(inputs))
    
    

    Here I use the linear layer as the expert just to test the input dimension. The error information is

    Traceback (most recent call last):
      File "/home/zyli/fastmoe/try.py", line 15, in <module>
        print(my_moe(inputs))
      File "/home/zyli/anaconda3/envs/QMoE/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/zyli/fastmoe/fmoe/layers.py", line 241, in forward
        experts=self.experts
      File "/home/zyli/fastmoe/fmoe/layers.py", line 78, in _fmoe_general_global_forward
        outp = tree.map_structure(gather_func, x)
      File "/home/zyli/anaconda3/envs/QMoE/lib/python3.7/site-packages/tree/__init__.py", line 430, in map_structure
        [func(*args) for args in zip(*map(flatten, structures))])
      File "/home/zyli/anaconda3/envs/QMoE/lib/python3.7/site-packages/tree/__init__.py", line 430, in <listcomp>
        [func(*args) for args in zip(*map(flatten, structures))])
      File "/home/zyli/fastmoe/fmoe/layers.py", line 75, in gather_func
        world_size,
      File "/home/zyli/fastmoe/fmoe/functions.py", line 171, in forward
        maybe_overlap=False)
      File "/home/zyli/fastmoe/fmoe/functions.py", line 89, in _local_gather
        inp_buf.index_copy_(0, pos, inp)
    IndexError: index_copy_(): When source and destination are not scalars, their dimensionality must match. Source dimensionality (3), destination dimensionality (2)
    

    One possible solution I can think of is to first apply im2col to the input so that the convolution is transformed into a matrix multiplication, but this incurs obvious overhead. Alternatively, I would need to modify the implementation of the FMoE class. Neither option is elegant, so is there any better way to do this?
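
    As a rough sketch of the flattening idea (treating every spatial position as a token, which amounts to 1x1-convolution experts; a real convolution would additionally need an unfold/im2col step, and the FMoE layer is only indicated as a comment):

    import torch

    # Flatten an image batch so that each spatial position becomes a 2-D "token",
    # matching the (num_tokens, d_model) input shape that FMoE expects.
    N, C, H, W = 3, 5, 7, 7
    x = torch.rand(N, C, H, W)

    tokens = x.permute(0, 2, 3, 1).reshape(-1, C)    # (N*H*W, C), i.e. d_model = C
    # out = my_moe(tokens)                           # hypothetical FMoE layer with d_model=C
    out = tokens                                     # placeholder so the sketch runs standalone
    y = out.reshape(N, H, W, C).permute(0, 3, 1, 2)  # back to (N, C, H, W)
    print(y.shape)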

    opened by hobbitlzy 8
  • python setup.py install error with ["ninja", "-v"]

    Describe the bug An error occurs with ["ninja", "-v"] in the last step [7/7] of the installation.

    To Reproduce Steps to reproduce the behavior: [USE_NCCL=1] python setup.py install

    Expected behavior compiled successfully

    Logs

    [7/7] /usr/local/cuda-10.2/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/louislau/adaptChildSpeech/fastmoe/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o.d -I/home/louislau/anaconda3/envs/espnet/lib/python3.8/site-packages/torch/include -I/home/louislau/anaconda3/envs/espnet/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/louislau/anaconda3/envs/espnet/lib/python3.8/site-packages/torch/include/TH -I/home/louislau/anaconda3/envs/espnet/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda-10.2/include -I/home/louislau/anaconda3/envs/espnet/include/python3.8 -c -c /home/louislau/adaptChildSpeech/fastmoe/cuda/parallel_linear.cu -o /home/louislau/adaptChildSpeech/fastmoe/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/louislau/anaconda3/envs/espnet/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1667, in _run_ninja_build
        subprocess.run(
      File "/home/louislau/anaconda3/envs/espnet/lib/python3.8/subprocess.py", line 516, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Platform

    • Device: NVIDIA GTX 1080 Ti
    • OS: Ubuntu 16.04.7 LTS
    • CUDA version: 10.2
    • NCCL version: 2.7.8-1
    • PyTorch version: 1.8.0

    Additional context This error occurs both with and without USE_NCCL=1, and with both pytorch==1.10.1 and pytorch==1.8.0.

    Any idea on this compilation problem? Thanks in advance.

    opened by louislau1129 8
  • Installation error

    Describe the bug Failed to build fastmoe in the docker images that megatron provides. https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

    To Reproduce Steps to reproduce the behavior: USE_NCCL=1 python setup.py install
    Expected behavior Installed successfully.
    Logs
    FAILED: /root/paddlejob/toyer_switch/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o
    error: no matching function for call to ‘HackNCCLGroup::broadcastUniqueNCCLID(ncclUniqueId*)’
    91 | broadcastUniqueNCCLID(&ncclID);

    Platform

    • Device: NVIDIA V100
    • OS:Ubuntu
    • CUDA version: 11.1
    • NCCL version: 2.8.3
    bug 
    opened by youth123 7
  • fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o: No such file or directory

    I tried to install fastmoe in an environment with CUDA 11.1 + PyTorch 1.8 + NCCL 2.8.3 (the recommended environment for Megatron 2.2). However, I ran into the following error when running the setup script:

    [email protected]:~/data/fastmoe-master# python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing fastmoe.egg-info/PKG-INFO
    writing dependency_links to fastmoe.egg-info/dependency_links.txt
    writing top-level names to fastmoe.egg-info/top_level.txt
    reading manifest file 'fastmoe.egg-info/SOURCES.txt'
    writing manifest file 'fastmoe.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    running build_ext
    building 'fmoe_cuda' extension
    Emitting ninja build file /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    1.10.1
    g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/stream_manager.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/local_exchange.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fmoe_cuda.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fastermoe/smart_schedule.o -L/opt/conda/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lnccl -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/fmoe_cuda.cpython-38-x86_64-linux-gnu.so
    g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o: No such file or directory
    error: command 'g++' failed with exit status 1

    There is no global_exchange.o under the directory of fastmoe-master/build/temp.linux-x86_64-3.8/cuda/. Do you know how to fix this?

    opened by Irenehere 6
  • Question about how to use DistributedGroupedDataParallel

    Hello authors, I would like to use FastMoE to try to improve ViT results on the CIFAR-10 task. I have 2 GPUs, and I would like each FFN to have 4 experts, with 2 experts on GPU0 and the other two on GPU1, while the rest of the model uses data parallelism. However, I found that the results got worse after this change, so I ran on 1 GPU with twice the batch size to simulate the 2-GPU setting, and the results improved. I would like to ask whether my understanding of num_expert, world_size, and DistributedGroupedDataParallel is correct, and whether these two experiments should produce similar results. Thanks.

    Describe the bug

    • GPU=2, Batch_Size=256 gives the baseline on ViT-CIFAR10.
    • GPU=2, Batch_Size=256 with FMoE parameters num_expert=2, world_size=2, gate=GShardGate gives results weaker than the baseline (purple curve).
    • GPU=1, Batch_Size=512 with FMoE parameters num_expert=4, world_size=1, gate=GShardGate gives results stronger than the baseline (pink curve).
    • With all other parameters unchanged, should the two settings produce similar results? (figure omitted)

    Relevant code

    class _ExpertFF(FMoE):
        def __init__(self,
                     num_expert=32,
                     d_model=1024,
                     world_size=1,
                     top_k=2,
                     gate=GShardGate,
                     expert=None):
            super().__init__(num_expert, d_model, world_size,
                             top_k=top_k, gate=gate, expert=expert)
            self.mark_parallel_comm()
    
        def forward(self, inp: Tensor):
            b, p, h = inp.shape
            inp = inp.view((-1, h))
            oup = super().forward(inp)
            oup = oup.view((b, p, -1))
            return oup
    
    def expert_fn(dim):
          return FeedForward(dim, mlp_dim, dropout=dropout)
    
    _ExpertFF(4, dim, 1, expert=expert_fn) # when #GPU=1
    _ExpertFF(2, dim, 2, expert=expert_fn) # when #GPU=2
    

    Distributed initialization code

    from fmoe.distributed import DistributedGroupedDataParallel as DDP
    model = ViT(image_size=32,
                    patch_size=4,
                    num_classes=10,
                    dim=512,
                    depth=6,
                    heads=8,
                    mlp_dim=512,
                    dropout=0.1,
                    emb_dropout=0.1).to(rank)
    model = DDP(model) # no group argument is passed in
    
    opened by Fragile-azalea 6
  • data parallel and model parallel at the same time

    data parallel and model parallel at the same time

    Describe the bug I am working on a multi-GPU, multi-node application. I want to use model parallelism within each node and data parallelism across nodes. I followed the suggestion in https://github.com/laekov/fastmoe/issues/105. However, I cannot make it work, and I am not sure what value moe_group should take on each worker. Could you please help me by looking at my mini reproducing script? It works when I only do model parallel (group_world_size=4), but it fails when I mix model parallel and data parallel (group_world_size=2).

    To Reproduce cmd

    python -m torch.distributed.launch --nproc_per_node=4 tools/test_moe_grouped_dist/mini_reproduce_group_report.py --group_world_size 4 # works
    
    python -m torch.distributed.launch --nproc_per_node=4 tools/test_moe_grouped_dist/mini_reproduce_group_report.py --group_world_size 2 # not work
    

    code

    import argparse
    import torch
    from torch.distributed import Backend
    
    import fmoe
    from fmoe import FMoETransformerMLP
    
    
    def create_model(num_expert, moe_world_size, moe_group):
        # create model architecture
        model = FMoETransformerMLP(num_expert, d_model=16, d_hidden=16, world_size=moe_world_size, moe_group=moe_group)
    
        return model
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int)
        parser.add_argument("--group_world_size", type=int)
        args = parser.parse_args()
    
        # if args.local_rank != 0:
        #     def print_pass(*args):
        #         pass
        #     builtins.print = print_pass
    
        print("distributing")
        local_rank = args.local_rank
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(backend=Backend.NCCL,
                                             init_method="env://")
    
        group_world_size = args.group_world_size
        rank = torch.distributed.get_rank()
        group_rank = rank // group_world_size
        inner_group_rank = rank % group_world_size
        group_size = torch.distributed.get_world_size() // group_world_size
        print("group_size is {}".format(group_size))
    
        moe_comm_group_list = [i + group_world_size * group_rank for i in range(group_world_size)]
        moe_comm_group = torch.distributed.new_group(moe_comm_group_list)
        print("rank {}, moe_comm_group list is {}".format(rank, moe_comm_group_list))
    
        # moe_comm_group = None
        model = create_model(num_expert=4 // group_world_size, moe_world_size=group_world_size, moe_group=moe_comm_group)
        device = torch.device("cuda:{}".format(args.local_rank))
        model.to(device)
    
        x = torch.rand([4, 16, 5, 5]).cuda()
    
        # set model_moe
        # moe_sync_group = None
        moe_sync_group_list = [inner_group_rank + group_size * i for i in range(group_size)]
        print("rank {}, moe_sync_group list is {}".format(rank, moe_sync_group_list))
        moe_sync_group = torch.distributed.new_group(moe_sync_group_list)
        model = fmoe.DistributedGroupedDataParallel(model, device_ids=[local_rank], output_device=local_rank,
                                                    moe_sync_group=moe_sync_group)
        model._sync_params()
    
        y = model(x)
    
        y.sum().backward()
        model.allreduce_params()
    
        # print("x is {}".format(x))
        print("y is {}".format(y))
        # print("model.experts.htoh4.weight.grad is {}".format(model.module.experts.htoh4.weight.grad))
    

    log (group_world_size=2)

    distributing                                                                                                                                            [630/1826]
    distributing
    distributing
    distributing
    group_size is 2
    group_size is 2
    group_size is 2
    group_size is 2
    rank 1, moe_comm_group list is [0, 1]
    rank 0, moe_comm_group list is [0, 1]
    rank 3, moe_comm_group list is [2, 3]
    rank 2, moe_comm_group list is [2, 3]
    rank 2, moe_sync_group list is [0, 2]
    rank 1, moe_sync_group list is [1, 3]
    rank 0, moe_sync_group list is [0, 2]
    rank 3, moe_sync_group list is [1, 3]
    NCCL Error at /home/t-xiaochen/envs/fastmoe/cuda/global_exchange.cpp:121 value 2
    Killing subprocess 80153
    Killing subprocess 80154
    Killing subprocess 80155
    Killing subprocess 80156
    Traceback (most recent call last):
      File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
        main()
      File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/home/t-xiaochen/.conda/envs/moe/bin/python', '-u', 'tools/test_moe_grouped_dist/mini_reproduce_group_report.py', '--loc
    al_rank=3', '--group_world_size', '2']' returned non-zero exit status 255.
    
    opened by geekJZY 6
  • Adding Expert Prototyping to FastMoE

    Hi, thanks for providing an end-to-end PyTorch training framework for MoE models. We have recently implemented MoE in TensorFlow and found that categorizing experts into different groups can improve model quality. More details can be found in our paper https://arxiv.org/abs/2105.15082. I wonder if it is possible to add this feature, as FastMoE really facilitates research on sparse expert models.

    Generally, this strategy categorizes experts into different groups, each of which has its own gating function for routing. It is compatible with conventional routing methods like Switch or top-2 routing, as you can set the group number to 1. We find that increasing the value of k in top-k can improve model performance, and that k top-1 routing can achieve a similar effect. It is also possible to try more complex strategies, say k top-k' or so.

    We have a code snippet in the appendix, which may be helpful.

    enhancement 
    opened by JustinLin610 1
  • Adaptation guidelines for Megatron v2.4

    Hi developers,

    It seems that the current patch for v2.2 no longer works directly on v2.4. I tried to migrate the code line by line, but here is the error log at runtime:

    Traceback (most recent call last):
      File "/root/Megatron/pretrain_gpt.py", line 189, in <module>
        args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
      File "/root/Megatron/megatron/training.py", line 124, in pretrain
        model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      File "/root/Megatron/megatron/training.py", line 323, in setup_model_and_optimizer
        model = get_model(model_provider_func)
      File "/root/Megatron/megatron/training.py", line 269, in get_model
        for model_module in model]
      File "/root/Megatron/megatron/training.py", line 269, in <listcomp>
        for model_module in model]
    TypeError: __init__() takes 2 positional arguments but 4 were given
    

    Is there any guideline for me to fmoefy megatron-v2.4? Thanks.

    good first issue 
    opened by ymjiang 5
Releases(v1.0.0)
  • v1.0.0(Apr 2, 2022)

    FasterMoE

    • The new performance-boosting features from the PPoPP'22 paper FasterMoE, detailed in the documentation.
      • Expert Shadowing.
      • Smart Scheduling.
      • Topology-aware gate.

    Bug fixes

    • Transformer-XL examples.
    • Compatibility to PyTorch versions.
    • Megatron-LM documents.
    • GShardGate.
  • v0.3.0(Nov 8, 2021)

    FMoE core

    • Previous mp_group is renamed to slice_group, indicating that all workers in the group receive the same input batch, and process a slice of the input. mp_group will be deprecated in our next release.
    • ROCm supported.
    • FMoELinear is moved to a stand-alone file.

    Grouped data parallel

    • Support any group name by their relative tag name.

    Load balancing

    • A brand new balancing strategy - SWIPE. Contributed by authors of a (currently unpublished) paper.
    • A property has_loss is added to each gate, in order to identify whether balance loss should be collected.

    Megatron-LM support

    • Experts are partitioned by tensor model parallelism in mp_group, instead of expert parallelism.
    • Support arbitrary customized gate in MegatronMLP.
    • Move the patches to a stand-alone file.

    Tests

    • Move util functions into test_ddp.py.
  • v0.2.1(Aug 23, 2021)

    Load balancing

    • Fix gradient for balance loss.

    Misc

    • Typos.
    • Update benchmark interface.
    • Remove some redundant code for performance improvement.
    • Enable USE_NCCL by default.
    • Compatibility for PyTorch <1.8.0 and >=1.8.0.

    Megatron adaption

    • Patch for numerical correctness of gradient clipping.
    • Support for pipeline parallelism.
  • v0.2.0(May 31, 2021)

    Load balancing

    • A brand new gate module with capacity-related utilities.
    • GShard's and Switch Transformer's balance strategies are implemented as integrated gates.
    • Balance loss is enabled.
    • Balance monitor is provided.

    Checkpointing

    • MoE models can be loaded and saved by fmoe's checkpointing module.

    Performance

    • FP16 training performance is improved.

    Misc

    • CUDA code directory is reconstructed.
    • More tests are added.
  • v0.1.2(Mar 13, 2021)

    Compilation

    • Remove dependency on the CUDA examples repository.

    Distributed

    • Fix a bug related to PyTorch v1.8.0. FastMoE can now operate on multiple GPUs on multiple nodes with PyTorch v1.8.0.

    Misc

    • Fix tons of typos.
    • Format the code.
  • v0.1.1(Mar 1, 2021)

Owner
Rick Ho