Tutel MoE: An Optimized Mixture-of-Experts Implementation

Overview

Project Tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation.

  • Supported Framework: Pytorch
  • Supported GPUs: CUDA(fp32 + fp16), ROCm(fp32)

How to setup Tutel MoE for Pytorch:

* Install Online:

        $ python3 -m pip install --user --upgrade git+https://github.com/microsoft/[email protected]

* Build from Source:

        $ git clone https://github.com/microsoft/tutel
        $ python3 ./tutel/setup.py install --user

How to use Tutel-optimized MoE in Pytorch:

* Tutel MoE Example:

        moe_layer = MOELayer('Top2Gate', model_dim, experts={
            'count_per_node': 2,
            'type': 'ffn', 'hidden_size_per_expert': 1024, 'activation_fn': lambda x: F.relu(x), ..
        })
        y = moe_layer(x)

* Usage of MOELayer Args:

        gate             : the string type of MOE gate, e.g: Top1Gate, Top2Gate, Top3Gate, Top4Gate
        model_dim        : the number of channels for MOE's input tensor
        experts          : a dict-type config for builtin expert network, or a torch.nn.Module-type custom expert network
        fp32_gate        : option of enabling mixed precision for gate network
        scan_expert_func : allow users to specify a lambda function to iterate each experts param, e.g. `scan_expert_func = lambda name, param: setattr(param, 'expert', True)`
        result_func      : allow users to specify a lambda function to format the MoE output and aux_loss, e.g. `result_func = lambda output: (output, output.l_aux)`
        group            : specify the explicit communication group of all_to_all
        seeds            : a tuple containing a pair of int to specify manual seed of (shared params, local params)

* Usage of dict-type Experts Config:

        count_per_node   : the number of local experts per device (by default, the value is 1 if not specified)
        type             : available built-in experts implementation, e.g: ffn
        hidden_size_per_expert : the hidden size between two linear layers for each expert (used for type == 'ffn' only)
        activation_fn    : the custom-defined activation function between two linear layers (used for type == 'ffn' only)

* Running MoE Hello World Model by torch.distributed.all_reduce:

        $ python3 -m torch.distributed.launch --nproc_per_node=1 ./examples/helloworld.py

* Running MoE Hello World Model by torch.nn.parallel.DistributedDataParallel (requires torch >= 1.8.0):

        $ python3 -m torch.distributed.launch --nproc_per_node=1 ./examples/helloworld_ddp.py

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Comments
  • question about how to set data Parallelism

    question about how to set data Parallelism

    Thanks for your contributions. Tutel is definitely a great work! But I have difficulty applying this Tutel to a new framework. If I have 8GPUs, And I want to set the number of experts to 4 while there are 2 local experts on a device (GPU). Thus the GPUs will be divided into 4 groups, which each has 2 GPUs and contain 4 experts. I think the GPU0, GPU2, GPU4, and GPU6 have the experts with the same parameters. How can I implement this setting?

    opened by Lechatelia 13
  • 100x slower when using 4nodes than 1node to run the helloworld_ddp example

    100x slower when using 4nodes than 1node to run the helloworld_ddp example

    Hello, I meet a problem that it is 100x slower when using 2node than 1node to run the helloworld_ddp example. I compile tutel with cuda11.3, pytorch 1.11 and nccl 2.9.9 on a nvidia-a100 GPU cluster with 100G IB. When I run tutel.examples.helloworld_ddp on a single node with 8 gpus and batch size 16, the speed meets the results in your table(step_time = 0.012315). But when I test with 4nodes, the step time becomes about 1 second, which is about 100x slower. Other multi-node tasks can normally run on my cluster, so I think maybe something is wrong with the environment when I build the project. It will be very nice if you can share the detailed environment information, such as the pytorch version, cuda version, g++ version, etc. Thanks.

    libnccl issue 
    opened by a157801 12
  • The output of nccl_all_to_all_scatter_async may be incomplete when num_local_experts>1.

    The output of nccl_all_to_all_scatter_async may be incomplete when num_local_experts>1.

    Describe the bug The output of nccl_all_to_all_scatter_async may be incomplete.

    To Reproduce Steps to reproduce the behavior:

    on host0(master): SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=1 on host1: SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=1

    Log The value of https://github.com/microsoft/tutel/blob/2c0cad3a742ecf4c0b0a989d6db629fcc2022bc0/tutel/impls/moe_layer.py#L244 tensor([[[ 1.5410, -0.2934], [-1.0845, -1.3986]], [[ 1.5410, -0.2934], [ 0.4033, 0.8380]], [[-2.1788, 0.5684], [-1.0845, -1.3986]], [[ 0.4033, 0.8380], [-2.1788, 0.5684]]], device='cuda:0')

    The value of https://github.com/microsoft/tutel/blob/2c0cad3a742ecf4c0b0a989d6db629fcc2022bc0/tutel/impls/moe_layer.py#L253 tensor([[[ 1.5410, -0.2934], [-1.0845, -1.3986]], [[ 1.5410, -0.2934], [ 0.4033, 0.8380]], [[-2.1788, 0.5684], [-1.0845, -1.3986]], [[ 0.4033, 0.8380], [-2.1788, 0.5684]]], device='cuda:0')

    This is the result I expect. However, when on host0(master): SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=2 on host1: SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=2

    The value of https://github.com/microsoft/tutel/blob/2c0cad3a742ecf4c0b0a989d6db629fcc2022bc0/tutel/impls/moe_layer.py#L244 tensor([[[ 1.5410, -0.2934], [-1.0845, -1.3986]], [[ 1.5410, -0.2934], [ 0.4033, 0.8380]], [[-2.1788, 0.5684], [-1.0845, -1.3986]], [[ 0.4033, 0.8380], [-2.1788, 0.5684]]], device='cuda:0')

    The value of https://github.com/microsoft/tutel/blob/2c0cad3a742ecf4c0b0a989d6db629fcc2022bc0/tutel/impls/moe_layer.py#L249 tensor([[[ 0.0000, 0.0000], [ 0.0000, 0.0000]], [[ 1.5410, -0.2934], [ 0.4033, 0.8380]], [[ 0.0000, 0.0000], [ 0.0000, 0.0000]], [[ 0.4033, 0.8380], [-2.1788, 0.5684]]], device='cuda:0')

    It seems incomplete.

    The possible code is: https://github.com/microsoft/tutel/blob/2c0cad3a742ecf4c0b0a989d6db629fcc2022bc0/tutel/custom/custom_kernel.cpp#L472-L489

    It looks like the NCCL group keeps only the last send-recv pair in each peer. There is no same problem when num_local_experts=1.

    wontfix 
    opened by Fragile-azalea 11
  • Cannot Import JIT optimized kernels?

    Cannot Import JIT optimized kernels?

    I used TUTEL for a while and it works greatly fine. But today I updated my environment and reinstall tutel, I found it crashed during importing module. Do you have any idea on why this happen? Thanks!

    >>> from tutel import moe as tutel_moe
    Traceback (most recent call last):
      File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 8, in <module>
        import tutel_custom_kernel
    ImportError: /mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/moe.py", line 6, in <module>
        from .jit_kernels.gating import fast_cumsum_sub_one
      File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/jit_kernels/gating.py", line 7, in <module>
        from ..impls.jit_compiler import tutel_custom_kernel
      File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 10, in <module>
        raise Exception("Cannot import JIT optimized kernels. Did you forget to install Custom Kernel Extension?")
    Exception: Cannot import JIT optimized kernels. Did you forget to install Custom Kernel Extension?
    
    opened by Luodian 11
  • Cannot compile tutel kernels and got runtime error

    Cannot compile tutel kernels and got runtime error

    I have installed tutel on my machine and have set up the related environment variables, such as the $CUDA_HOME and $CFLAGS. However, when I try to run examples/hello_world.py, I got the following error:

    [E custom_kernel.cpp:124] default_program(1): catastrophic error: cannot open source file "cuda_runtime.h"

    1 catastrophic error detected in the compilation of "default_program". Compilation terminated. Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)

    File "/private/home/hyhuang/.local/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 26, in func tutel_custom_kernel.invoke(inputs, ctx) RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-pcbbciia/tutel/custom/custom_kernel.cpp":40, please report a bug to PyTorch. CHECK_EQ fails.

    I am using PyTorch 1.10.1 + CUDA 11.3. Is there any other parameter I should fix to use tutel?

    opened by hyhuang00 10
  • module 'tutel_custom_kernel' has no attribute 'inject_source'

    module 'tutel_custom_kernel' has no attribute 'inject_source'

    My cuda version is 11.4, python version is 3.6.5 Following the requirement, my torch and torchvision versions are torch==1.10.0+cu113 and torchvision==0.11.1+cu113. Then I run git clone https://github.com/microsoft/tutel --branch v0.1.x python ./tutel/setup.py install --user then run the tutorial: python ./tutel/examples/helloworld.py --batch_size=16 but meet the following error:

    Traceback (most recent call last):
      File "./tutel/examples/helloworld.py", line 118, in <module>
        output = model(x)
      File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "./tutel/examples/helloworld.py", line 85, in forward
        result = self._moe_layer(input)
      File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 424, in forward
        result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
      File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 73, in apply_on_expert_fn
        critical_data, l_loss = extract_critical(gates, self.top_k, self.capacity_factor, self.fp32_gate, self.batch_prioritized_routing)
      File "/home/fanj/tutel/tutel/impls/fast_dispatch.py", line 163, in extract_critical
        locations1 = compute_location(masks_se[0])
      File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
        return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
      File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 68, in get_cumsum_kernel
        ''')
      File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 31, in generate_kernel
        return JitCompiler.create_raw(template)
      File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 21, in create_raw
        __ctx__ = tutel_custom_kernel.inject_source(source)
    AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'
    

    Do you know how to solve this problem? Thank you very much!

    opened by LisaWang0306 10
  • New Tutel checkpoint loading is incompatible with old models

    New Tutel checkpoint loading is incompatible with old models

    Hi, I have been using Swin-MoE pre-trained models with Tutel. However, after the recent update in Tutel library in model loading format, the pre-trained model has different dict structure than the current required expert model resulting in loading error. Can you please create compatible versions of these released pre-trained models? or release any script to do so? Any help would be highly appreciated.

    opened by jinga-lala 7
  • fast_cumsum_sub_one fails when the module is wrapped by ORTModule

    fast_cumsum_sub_one fails when the module is wrapped by ORTModule

    As title, I got the following errors when my module is wrapped with ORTModule

    [E custom_kernel.cpp:123] default_program(14): error: identifier "tensor" is undefined
    
    1 error detected in the compilation of "default_program".
     Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)
    /tmp/torch-tutel-o0geuH.cu(14): error: identifier "tensor" is undefined
    
    1 error detected in the compilation of "/tmp/torch-tutel-o0geuH.cu".
    /opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_training_manager.py:224: UserWarning: Fast path enabled - skipping checks. Rebuild graph: True, Execution agent: True, Device check: True
      f" Device check: {self._skip_check.is_set(_SkipCheck.SKIP_CHECK_DEVICE)}", UserWarning)
    RuntimeError: There was an error while exporting the PyTorch model to ONNX:
    
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
        raise exception
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 389, in _get_exported_model
        **self._export_extra_kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/__init__.py", line 280, in export
        custom_opsets, enable_onnx_checker, use_external_data_format)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 94, in export
        use_external_data_format=use_external_data_format)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 695, in _export
        dynamic_axes=dynamic_axes)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 459, in _model_to_graph
        _retain_param_name)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 422, in _create_jit_graph
        graph, torch_out = _trace_and_get_graph_from_model(model, args)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 373, in _trace_and_get_graph_from_model
        torch.jit._get_trace_graph(model, args, strict=False, _force_outplace=False, _return_inputs_states=True)
      File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 1160, in _get_trace_graph
        outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 132, in forward
        self._force_outplace,
      File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 118, in wrapper
        outs.append(self.inner(*trace_inputs))
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_io.py", line 430, in forward
        return _transform_output_to_flat_tuple(self._original_module(*new_args, **new_kwargs))
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
        result = self.forward(*input, **kwargs)
      File "test_ort.py", line 17, in forward
        x_cumsum = fast_cumsum_sub_one(x, dim=0)
      File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
        return get_cumsum_kernel(data.size(0), data.size(1))(data)
      File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
        base_kernel(mask1.to(torch.int32).contiguous(), locations1)
      File "/opt/conda/lib/python3.7/site-packages/tutel/impls/jit_compiler.py", line 26, in func
        tutel_custom_kernel.invoke(inputs, __ctx__)
    RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-qjgbz25n/tutel/custom/custom_kernel.cpp":39, please report a bug to PyTorch. CHECK_EQ fails.
    
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "test_ort.py", line 24, in <module>
        output = cumsum_module(input)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
        return ortmodule._torch_module.forward(*inputs, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 289, in _forward
        torch_module_ort.is_training()).forward(*inputs, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 292, in forward
        log_level=self._debug_options.logging.log_level)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
        raise exception
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 231, in forward
        build_gradient_graph = self._export_model(*inputs, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 322, in _export_model
        schema, *inputs, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 392, in _get_exported_model
        RuntimeError(f'There was an error while exporting the PyTorch model to ONNX: '
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_fallback_exceptions.py", line 72, in wrap_exception
        raise new_exception(raised_exception) from raised_exception
    onnxruntime.training.ortmodule._fallback_exceptions.ORTModuleONNXModelException: There was an error while exporting the PyTorch model to ONNX:
    
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
        raise exception
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 389, in _get_exported_model
        **self._export_extra_kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/__init__.py", line 280, in export
        custom_opsets, enable_onnx_checker, use_external_data_format)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 94, in export
        use_external_data_format=use_external_data_format)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 695, in _export
        dynamic_axes=dynamic_axes)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 459, in _model_to_graph
        _retain_param_name)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 422, in _create_jit_graph
        graph, torch_out = _trace_and_get_graph_from_model(model, args)
      File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 373, in _trace_and_get_graph_from_model
        torch.jit._get_trace_graph(model, args, strict=False, _force_outplace=False, _return_inputs_states=True)
      File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 1160, in _get_trace_graph
        outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 132, in forward
        self._force_outplace,
      File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 118, in wrapper
        outs.append(self.inner(*trace_inputs))
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_io.py", line 430, in forward
        return _transform_output_to_flat_tuple(self._original_module(*new_args, **new_kwargs))
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
        result = self.forward(*input, **kwargs)
      File "test_ort.py", line 17, in forward
        x_cumsum = fast_cumsum_sub_one(x, dim=0)
      File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
        return get_cumsum_kernel(data.size(0), data.size(1))(data)
      File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
        base_kernel(mask1.to(torch.int32).contiguous(), locations1)
      File "/opt/conda/lib/python3.7/site-packages/tutel/impls/jit_compiler.py", line 26, in func
        tutel_custom_kernel.invoke(inputs, __ctx__)
    RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-qjgbz25n/tutel/custom/custom_kernel.cpp":39, please report a bug to PyTorch. CHECK_EQ fails.
    

    To reproduce the problem, please try the following code, thanks.

    from torch_ort import ORTModule
    from onnxruntime.training import ortmodule
    ortmodule.ONNX_OPSET_VERSION=12
    from onnxruntime.training.ortmodule._custom_autograd_function import enable_custom_autograd_support
    enable_custom_autograd_support()
    
    from tutel.jit_kernels.gating import fast_cumsum_sub_one
    
    import torch
    
    class CumsumModule(torch.nn.Module):
        def __init__(self):
            super(CumsumModule, self).__init__()
            self.param = torch.nn.Parameter(torch.ones(5, 5))
        def forward(self, x):
            x = x + self.param
            x_cumsum = fast_cumsum_sub_one(x, dim=0)
            return x_cumsum
    
    
    input = torch.randint(0, 5, (5, 5), device='cuda:0')
    cumsum_module = CumsumModule().to(device='cuda:0')
    cumsum_module = ORTModule(cumsum_module)
    output = cumsum_module(input)
    
    opened by foreveronehundred 7
  • Is simple_all_reduce also required for capacity_factor > 0 cases?

    Is simple_all_reduce also required for capacity_factor > 0 cases?

    My code seems to hang when unbalanced workloads exist in two different GPUs(i.e. scores.size(0) is unequal in different GPUs such as, at the end of a dataset). It further leads to inequality in the capacity of Line 178 in different GPUs. Is simple_all_reduce also required for capacity_factor > 0 cases?

    https://github.com/microsoft/tutel/blob/ceba363909a673203a356a71f0b1a6a9113a6845/tutel/impls/fast_dispatch.py#L177-L183

    bug 
    opened by Fragile-azalea 6
  • Error met when using multi nodes

    Error met when using multi nodes

    Dear contributors, I meet an error with tutel's moe layer. The error occurred when I run the tutel/examples/helloworld_ddp.py in the torch distributed mode with more than one GPU node (i.e.: 16 GPUs on 2 machines). However, It is fine when I run this script with 8 GPUs or less.

    The error log is following:

    
    [Benchmark] world_size = 16, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 65536, num_local_experts = 2, topK = 1, device = `cuda:0`
    Traceback (most recent call last):
      File "tutel/examples/helloworld_ddp.py", line 154, in <module>
        output = model(x)
      File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "tutel/examples/helloworld_ddp.py", line 119, in forward
        result = self._moe_layer(input)
      File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 387, in forward
        result_output, l_aux = self.gate.apply_on_expert_fn(reshaped_input, self.expert_fn, self.group)
      File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 103, in apply_on_expert_fn
        locations1 = self.compute_location(masks_se[0])
      File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 81, in fast_cumsum_sub_one
        return get_cumsum_kernel(data.size(0), data.size(1))(data)
      File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
        base_kernel(mask1.to(torch.int32).contiguous(), locations1)
      File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/jit_compiler.py", line 39, in func
        tutel_custom_kernel.invoke_with_source(inputs, __ctx__, no_nvrtc, source)
    RuntimeError: (0) == (cuModuleGetFunction(&gm.hFunc, gm.hMod, std::string(pos, tail - pos).c_str())) INTERNAL ASSERT FAILED at "/mnt/lustre/zhujinguo/codes/tutel/tutel/custom/custom_kernel.cpp":208, please report a bug to PyTorch. CHECK_EQ fails.
    
    

    Also, I use tutel moe layer in another project, where the same thing happened.

    opened by Lechatelia 5
  • What is the purpose of the

    What is the purpose of the "use_2dh" option?

    Hi Tutel authors, thank you for this great framework.

    I have a question about commit 901a65cb54c386c8f75bd15e87cf221ff2463d99. What is the purpose of the use_2dh option? And what problem does PrimAllToAll2D intend to solve? It would be great if you can provide more context. Thanks.

    question 
    opened by ymjiang 4
  • How the experts' gradients are handled under data parallelism?

    How the experts' gradients are handled under data parallelism?

    When count_per_node is set to negative, one expert should be paralleled on multiple GPUs like ZERO, with each GPU holding a slice of the expert's parameters. There are also all_gathers performed within the ffn_zero_group in the forward of the expert.

    My question is how these gradients and parameter update of the expert is handled in Tutel under DP. Examples seem to indicate that there are no extra efforts for users to manually handle it. However, I cannot find the corresponding implementation in either moe_layer or TutelDistributedOptimizer.

    Any help will be appreciated!

    opened by yzs981130 0
  • RuntimeError: No such operator tutel_ops::cumsum

    RuntimeError: No such operator tutel_ops::cumsum

    Hello, thanks for providing such a great work. However, I cannot use tutel successfully. I have followed the library installation steps:

    * Install Pytorch for NVIDIA CUDA >= 11.3:
            $ python3 -m pip install --user torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
           
    
    * Install Tutel Online:
    
            $ python3 -m pip uninstall tutel -y
            $ python3 -m pip install --user --upgrade git+https://github.com/microsoft/[email protected]
            $ python3 ./tutel/setup.py install --user
    

    But when I try the followed test:

    * Quick Test on Single-GPU:
    
            $ python3 -m tutel.examples.helloworld --batch_size=16               # Test Tutel-optimized MoE + manual distribution
    

    The followed error is reported:

    Traceback (most recent call last):
      File "/root/miniconda3/envs/widenet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/root/miniconda3/envs/widenet/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/root/tutel-main/tutel/examples/helloworld.py", line 120, in <module>
        output = model(x)
      File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/root/tutel-main/tutel/examples/helloworld.py", line 85, in forward
        result = self._moe_layer(input)
      File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/root/tutel-main/tutel/impls/moe_layer.py", line 267, in forward
        logits_dtype, (crit, l_aux) = routing()
      File "/root/tutel-main/tutel/impls/moe_layer.py", line 261, in routing
        inequivalent_tokens = inequivalent_tokens,
      File "/root/tutel-main/tutel/impls/fast_dispatch.py", line 158, in extract_critical
        locations1 = compute_location(masks_se[0])
      File "/root/tutel-main/tutel/jit_kernels/gating.py", line 22, in fast_cumsum_sub_one
        return torch.ops.tutel_ops.cumsum(data)
      File "/root/.local/lib/python3.7/site-packages/torch/_ops.py", line 63, in __getattr__
        op = torch._C._jit_get_operation(qualified_op_name)
    RuntimeError: No such operator tutel_ops::cumsum
    
    opened by sharkdrop 4
  • [installation errors] fatal error: nccl.h: No such file or directory

    [installation errors] fatal error: nccl.h: No such file or directory

    Hello, thanks for providing such a great work. However, I cannot install tutel successfully during the compilation. I have exported the lib path of nccl_2.7.8-1-cuda10.1/include/nccl.h into LD_LIBRARY_PATH. But the error logs seem that it still cannot find the NCCL path. Do you have any idea on how to solve this error? Thanks!

    running install_lib running build_py running build_ext building 'tutel_custom_kernel' extension Emitting ninja build file /mnt/lustre/chengguangliang/zhouqianyu/segdgformer/cvpr_2023/tutel-main/build/temp.linux-x86_64-3.7/build.ninja... Compiling objects... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [1/1] /mnt/cache/share/gcc/gcc-7.3.0/bin/g++ -MMD -MF /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/ tutel/custom/custom_kernel.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/mnt/lustre/chengguangliang/miniconda3/lib/python3 .7/site-packages/torch/include -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/che ngguangliang/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/TH C -I/mnt/lustre/share/cuda-10.1/include -I/mnt/lustre/chengguangliang/miniconda3/include/python3.7m -c -c /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/tutel/custom/custom_kernel.cpp -o /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64- 3.7/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_N CCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENS ION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 FAILED: /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/custom/custom_kernel.o /mnt/cache/share/gcc/gcc-7.3.0/bin/g++ -MMD -MF /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/ custom/custom_kernel.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/sit e-packages/torch/include -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/chengguan gliang/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/THC -I/m nt/lustre/share/cuda-10.1/include -I/mnt/lustre/chengguangliang/miniconda3/include/python3.7m -c -c /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/tutel/custom/custom_kernel.cpp -o /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tu tel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -D TORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NA ME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory #include <nccl.h> ^~~~~~~~ compilation terminated. ninja: build stopped: subcommand failed. Try installing without NCCL extension..

    My machine details are as follows:

    nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2019 NVIDIA Corporation Built on Wed_Apr_24_19:10:27_PDT_2019 Cuda compilation tools, release 10.1, V10.1.168

    I used pytorch1.8.1 with cuda 10.1. I wonder whether tutal can be installed with cuda10.1?

    I used the following commands for the installation:

    export PATH=/mnt/lustre/share/gcc/gcc-5.4/bin/:$PATH export LD_LIBRARY_PATH=/mnt/lustre/share/polaris/dep/nccl_2.7.8-1-cuda10.1/include/nccl.h:$LD_LIBRARY_PATH export LD_LIBRARY_PATH=/mnt/lustre/share/gcc/gmp-4.3.2/lib:/mnt/lustre/share/gcc/mpfr-2.4.2/lib:/mnt/lustre/share/gcc/mpc-0.8.1/lib:$LD_LIBRARY_PATH export TORCH_CUDA_ARCH_LIST='3.5;5.0+PTX;6.0;7.0'

    python setup.py install --user

    opened by qianyuzqy 1
  • Multi-nodes training is much more slower than single node

    Multi-nodes training is much more slower than single node

    hi, when I train models using tutel, I find that, in each step, multi-nodes training will need much more step time (if n nodes, it will take around n times of training time of 1-node) than single node training. Thus multi-node training will take even more time than single-node training to finish one epoch. Any debugging suggestions with this issue? Thanks!!!

    opened by YingqingHe 1
  • NCCL Asynchronous update timeout crash with Tutel MoE

    NCCL Asynchronous update timeout crash with Tutel MoE

    Hi, I am using Tutel library with MMAction framework to replicate Swin-v2 MoE performance described in the paper. However, I am facing this error when I try to train MoE in DDP setting. Can someone please help me in resolving this error? Alternatively, can you release the object detection code that was used in the Tutel paper.

    E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.
    
    [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
    
    terminate called after throwing an instance of 'std::runtime_error'
    
      what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.
    
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6056 closing signal SIGTERM
    
    opened by jinga-lala 5
  • My code seems to hang when skip_remainder_batch=False.

    My code seems to hang when skip_remainder_batch=False.

    Describe the bug Hi, Authors. My code seems to hang when skip_remainder_batch=False.

    To Reproduce Steps to reproduce the behavior:

    git clone https://github.com/microsoft/tutel --branch main
    python3 -m pip uninstall tutel -y
    python3 ./tutel/setup.py
    
    cd ./tutel/tutel/examples/fairseq_moe
    git clone https://github.com/facebookresearch/fairseq --branch main
    cd fairseq/ && git checkout b5e7b250913120409b872a940fbafec4d43c7b13
    # This patch is an example to train Fairseq MoE transformers.
    # Note that the current patch only works for `legacy_ddp` backend, and `--checkpoint-activations` must be disabled.
    git apply ../fairseq_patch.diff
    python3 -m pip install omegaconf==2.0.5 hydra-core==1.0.7
    python3 -m pip install --no-deps --editable .
    
    #fix bug in https://github.com/facebookresearch/fairseq/blob/main/fairseq/tasks/translation.py#L441-L442
    #get dataset followed by https://github.com/facebookresearch/fairseq/tree/main/examples/translation
    
    
     CUDA_VISIBLE_DEVICES=0,1  MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 fairseq-train  fairseq/data-bin/iwslt14.tokenized.de-en     --arch transformer_iwslt_de_en --share-decoder-input-output-embed     --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0     --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000     --dropout 0.3 --weight-decay 0.0001     --criterion label_smoothed_cross_entropy --label-smoothing 0.1     --max-tokens 4096 --eval-bleu     --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'     --eval-bleu-detok moses     --eval-bleu-remove-bpe  --eval-bleu-print-samples     --best-checkpoint-metric bleu --maximize-best-checkpoint-metric  --ddp-backend legacy_ddp --max-update 100000
    

    Logs

    2022-08-09 10:51:01 | INFO | fairseq.utils | rank   0: capabilities =  7.5  ; total memory = 10.761 GB ;[0/1773]
    NVIDIA GeForce RTX 2080 Ti
    2022-08-09 10:51:01 | INFO | fairseq.utils | rank   1: capabilities =  7.5  ; total memory = 10.761 GB ; name =
    NVIDIA GeForce RTX 2080 Ti
    2022-08-09 10:51:01 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers**********
    *************
    2022-08-09 10:51:01 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
    2022-08-09 10:51:01 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = Non
    e
    2022-08-09 10:51:01 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
    2022-08-09 10:51:01 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
    2022-08-09 10:51:01 | INFO | fairseq.trainer | loading train data for epoch 1
    2022-08-09 10:51:01 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: fairseq/data-bin/iwslt14.to
    kenized.de-en/train.de-en.de
    2022-08-09 10:51:01 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: fairseq/data-bin/iwslt14.to
    kenized.de-en/train.de-en.en
    2022-08-09 10:51:01 | INFO | fairseq.tasks.translation | fairseq/data-bin/iwslt14.tokenized.de-en train de-en 16
    0239 examples
    2022-08-09 10:51:01 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --am
    p
    2022-08-09 10:51:01 | INFO | fairseq.data.iterators | grouped total_num_itrs = 551
    epoch 001:   0%|                                                          | 0/551 [00:00<?, ?it/s]2022-08-09 10:
    51:01 | INFO | fairseq.trainer | begin training epoch 1
    2022-08-09 10:51:01 | INFO | fairseq_cli.train | Start iterating over samples
    /home/xinglinpan/tutel/tutel/examples/fairseq_moe/fairseq/fairseq/utils.py:374: UserWarning: amp_C fused kernels
     unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
      warnings.warn(
    /home/xinglinpan/tutel/tutel/examples/fairseq_moe/fairseq/fairseq/utils.py:374: UserWarning: amp_C fused kernels
     unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
      warnings.warn(
    epoch 001: 100%|▉| 550/551 [02:12<00:00,  4.54it/s, loss=9.244, nll_loss=8.59, ppl=385.3, wps=31462022-08-09 10:
    53:14 | INFO | fairseq_cli.train | begin validation on "valid" subset
                                                                                                     2022-08-09 10:5
    3:19 | INFO | fairseq.tasks.translation | example hypothesis: they don't don't don't don't don't don't don't't't
    't
    2022-08-09 10:53:19 | INFO | fairseq.tasks.translation | example reference: they're just not moving.
    

    The possible problem is that not all devices are provided with data in the last iteration on the valid, so alltoall is always pending other processes. If SKIP_MOE=1, there is no this phenomenon.

    application patch 
    opened by Fragile-azalea 7
Releases(v0.2.0)
  • v0.2.0(Aug 11, 2022)

    What's New in v0.2.0:

    1. Support Windows Python3 + Torch Installation;
    2. Add examples to enable Tutel MoE in Fairseq;
    3. Refactor MoE Layer implementation, letting all features (e.g. top-X, overlap, parallel_type, capacity, ..) be able to change at different forward interations;
    4. New features: load_importance_loss, cosine router, inequivalent_tokens;
    5. Extend capacity_factor value that includes zero value and negative values for smarter capacity estimation;
    6. Add tutel.checkpoint conversion tools to reformat checkpoint files, making it able to use existing checkpoints to train/infer with a different world size.
    How to Setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.2.0.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(Feb 26, 2022)

    What's New in v0.1.5:

    1. Add 2D hierarchical a2a algorithm used for extremely-large scaling;
    2. Support different parallel_type for MoE computation: data, model, auto;
    3. Combine different expert granularities (e.g. normal, sharded experts, megatron dense ffn) into same programming interface & style;
    4. New features: is_postscore to specify whether gating scores are weighed during encoding or decoding;
    5. Enhance existing features: JIT compiler, a2a overlap with 2D.
    How to Setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.5.tar.gz
    

    Contributors: @abuccts, @yzygitzh, @ghostplant, @EricWangCN

    Source code(tar.gz)
    Source code(zip)
  • v0.1.4(Feb 9, 2022)

    What's New in v0.1.4:

    1. Enhance communication features: a2a overlap with computation, support different granularity of group creation, etc.
    2. Add single-thread CPU implementation for correctness check & reference;
    3. Refine JIT compiler interface for flexible usability: jit::inject_source && jit::jit_execute;
    4. Enhance examples: fp64 support, cuda amp, checkpointing, etc.
    5. Support execution inside torch.distributed.pipeline.
    How to Setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.4.tar.gz
    

    Contributors: @yzygitzh, @ghostplant, @EricWangCN

    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Dec 29, 2021)

    What's New in v0.1.3:

    1. Add Tutel Launcher Support based on Open MPI;
    2. Support Establishing Data Model Parallel in Initialization;
    3. Support Single Expert Evenly Sharded on Multiple GPUs;
    4. Support List of Gates and Forwarding MoE Layer with Specified Gating Index;
    5. Fix NVRTC Compatibility when Enabling USE_NVRTC=1;
    6. Other Implementation Enhancements & Correctness Checking;
    How to Setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.3.tar.gz
    

    Contributors: @ghostplant, @EricWangCN, @guoshzhao.

    Source code(tar.gz)
    Source code(zip)
  • v0.1.2(Nov 16, 2021)

    What's New in v0.1.2:

    1. General-purpose top-k gating with {'type': 'top', 'k': 2};
    2. Add Megatron-ML Tensor Parallel as gating type;
    3. Add deepspeed-based & megatron-based helloworld example for fair comparison;
    4. Add torch.bfloat16 datatype support for single-GPU;
    How to Setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.2.tar.gz
    

    Contributors: @ghostplant, @EricWangCN, @foreveronehundred.

    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(Oct 10, 2021)

    What's New in v0.1.1:

    1. Enable fp16 support for AMDGPU.
    2. Using NVRTC for JIT compilation if available.
    3. Add new system_init interface for initializing NUMA settings in distributed GPUs.
    4. Extend more gating types: Top3Gate & Top4Gate.
    5. Allow high level to change capacity value in Tutel fast dispatcher.
    6. Add custom AllToAll extension for old Pytorch version without builtin AllToAll operator support.
    How to Setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.1.tar.gz
    

    Contributors: @jspark1105 , @ngoyal2707 , @guoshzhao, @ghostplant .

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Sep 14, 2021)

    The first version of Tutel for efficient MoE implementation.

    How to setup:
    python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.0.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python

MNE-Python MNE-Python software is an open-source Python package for exploring, visualizing, and analyzing human neurophysiological data such as MEG, E

MNE tools for MEG and EEG data analysis 2.1k Dec 28, 2022
AI drive app that can help user become beautiful.

爱美丽 Beauty 简体中文 Features Beauty is an AI drive app that can help user become beautiful. it contain those functions: face score cheek face beauty repor

Starved Midnight 1 Jan 30, 2022
Subgraph Based Learning of Contextual Embedding

SLiCE Self-Supervised Learning of Contextual Embeddings for Link Prediction in Heterogeneous Networks Dataset details: We use four public benchmark da

Pacific Northwest National Laboratory 27 Dec 01, 2022
This is the implementation of the paper "Self-supervised Outdoor Scene Relighting"

Self-supervised Outdoor Scene Relighting This is the implementation of the paper "Self-supervised Outdoor Scene Relighting". The model is implemented

Ye Yu 24 Dec 17, 2022
The deployment framework aims to provide a simple, lightweight, fast integrated, pipelined deployment framework that ensures reliability, high concurrency and scalability of services.

savior是一个能够进行快速集成算法模块并支持高性能部署的轻量开发框架。能够帮助将团队进行快速想法验证(PoC),避免重复的去github上找模型然后复现模型;能够帮助团队将功能进行流程拆解,很方便的提高分布式执行效率;能够有效减少代码冗余,减少不必要负担。

Tao Luo 125 Dec 22, 2022
Full body anonymization - Realistic Full-Body Anonymization with Surface-Guided GANs

Realistic Full-Body Anonymization with Surface-Guided GANs This is the official

Håkon Hukkelås 30 Nov 18, 2022
A Peer-to-peer Platform for Secure, Privacy-preserving, Decentralized Data Science

PyGrid is a peer-to-peer network of data owners and data scientists who can collectively train AI models using PySyft. PyGrid is also the central serv

OpenMined 615 Jan 03, 2023
Single-Stage Instance Shadow Detection with Bidirectional Relation Learning (CVPR 2021 Oral)

Single-Stage Instance Shadow Detection with Bidirectional Relation Learning (CVPR 2021 Oral) Tianyu Wang*, Xiaowei Hu*, Chi-Wing Fu, and Pheng-Ann Hen

Steve Wong 51 Oct 20, 2022
基于Paddle框架的fcanet复现

fcanet-Paddle 基于Paddle框架的fcanet复现 fcanet 本项目基于paddlepaddle框架复现fcanet,并参加百度第三届论文复现赛,将在2021年5月15日比赛完后提供AIStudio链接~敬请期待 参考项目: frazerlin-fcanet 数据准备 本项目已挂

QuanHao Guo 7 Mar 07, 2022
disentanglement_lib is an open-source library for research on learning disentangled representations.

disentanglement_lib disentanglement_lib is an open-source library for research on learning disentangled representation. It supports a variety of diffe

Google Research 1.3k Dec 28, 2022
CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation

CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [arxiv] This is the official repository for CDTrans: Cross-domain Transformer for

238 Dec 22, 2022
Code for the Image similarity challenge.

ISC 2021 This repository contains code for the Image Similarity Challenge 2021. Getting started The docs subdirectory has step-by-step instructions on

Facebook Research 173 Dec 12, 2022
Markov Attention Models

Introduction This repo contains code for reproducing the results in the paper Graphical Models with Attention for Context-Specific Independence and an

Vicarious 0 Dec 09, 2021
Official implementation for "QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation" (CVPR 2022)

QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation (CVPR2022) https://arxiv.org/abs/2203.08483 Unpaired image-to-image (I2I

Xueqi Hu 50 Dec 16, 2022
PyTorch implementation of ENet

PyTorch-ENet PyTorch (v1.1.0) implementation of ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation, ported from the lua-torc

David Silva 333 Dec 29, 2022
Pytorch implementation of MLP-Mixer with loading pre-trained models.

MLP-Mixer-Pytorch PyTorch implementation of MLP-Mixer: An all-MLP Architecture for Vision with the function of loading official ImageNet pre-trained p

Qiushi Yang 2 Sep 29, 2022
Learning Efficient Online 3D Bin Packing on Packing Configuration Trees

Learning Efficient Online 3D Bin Packing on Packing Configuration Trees This repository is being continuously updated, please stay tuned! Any code con

86 Dec 28, 2022
A code repository associated with the paper A Benchmark for Rough Sketch Cleanup by Chuan Yan, David Vanderhaeghe, and Yotam Gingold from SIGGRAPH Asia 2020.

A Benchmark for Rough Sketch Cleanup This is the code repository associated with the paper A Benchmark for Rough Sketch Cleanup by Chuan Yan, David Va

33 Dec 18, 2022
Medical-Image-Triage-and-Classification-System-Based-on-COVID-19-CT-and-X-ray-Scan-Dataset

Medical-Image-Triage-and-Classification-System-Based-on-COVID-19-CT-and-X-ray-Sc

2 Dec 26, 2021
A "gym" style toolkit for building lightweight Neural Architecture Search systems

A "gym" style toolkit for building lightweight Neural Architecture Search systems

Jack Turner 12 Nov 05, 2022