A distributed deep learning framework that supports flexible parallelization strategies.

Related tags: Deep Learning, FlexFlow

Overview

FlexFlow

FlexFlow is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies. FlexFlow provides a drop-in replacement for TensorFlow Keras and PyTorch: running existing Keras and PyTorch programs in FlexFlow requires only a few lines of changes.

Install FlexFlow

To install FlexFlow from source code, please read the instructions. If you would like to quickly try FlexFlow, we also provide prebuilt docker images with all dependencies pre-installed. You can also use conda to install the FlexFlow Python package (coming soon).

TensorFlow Keras Support

Users can use FlexFlow to accelerate the training of existing TensorFlow Keras models by simply changing the following import statements.

from flexflow.keras.models import Model, Sequential
from flexflow.keras.layers import Input, Dense, Conv2D, ...
from flexflow.keras.callbacks import Callback, ...

FlexFlow uses a Python function called top_level_task() as the entry point of a program and automatically parallelizes DNN training across all GPUs on all compute nodes. For example, the following code snippet parallelizes AlexNet training on the CIFAR10 dataset in FlexFlow.

## Imports assumed for this snippet; flexflow.keras mirrors the Keras module layout.
from flexflow.keras.models import Sequential
from flexflow.keras.layers import Conv2D, MaxPooling2D, Activation
from flexflow.keras.datasets import cifar10

def top_level_task():
  model = Sequential()
  model.add(Conv2D(filters=64, input_shape=(3,229,229), kernel_size=(11,11), strides=(4,4), padding=(2,2), activation="relu"))
  model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding="valid"))
  model.add(Conv2D(filters=192, kernel_size=(5,5), strides=(1,1), padding=(2,2), activation="relu"))
  ## More lines for model construction
  model.add(Activation("softmax"))
  ## Model compilation
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  ## Model training
  (x_train, y_train) = cifar10.load_data()
  model.fit(x_train, y_train, epochs=30)

if __name__ == "__main__":
  top_level_task()

During model compilation (i.e., model.compile in Keras), FlexFlow can autotune the parallelization performance by searching for efficient strategies on the given parallel machine. Next, model.fit performs DNN training on all available GPUs (potentially across multiple nodes) using the best discovered strategy. As a result, users don't need to manually design and optimize the device assignments.

More FlexFlow Keras examples: see the keras examples folder.

PyTorch Support

Users can also use FlexFlow to optimize the parallelization performance of existing PyTorch models in two steps. First, a PyTorch model can be exported to the FlexFlow model format using flexflow.torch.fx.torch_to_flexflow.

import torch
import flexflow.torch.fx as fx

model = MyPyTorchModule()
fx.torch_to_flexflow(model, "mymodel.ff")

Second, a FlexFlow program can directly import a previously saved PyTorch model and autotune the parallelization performance for a given parallel machine.

from flexflow.pytorch.model import PyTorchModel

def top_level_task():
  ## ffmodel and input_tensor are assumed to have been created beforehand via the FlexFlow
  ## core API (e.g., ffconfig = FFConfig(); ffmodel = FFModel(ffconfig); input_tensor = ffmodel.create_tensor(...)).
  torch_model = PyTorchModel("mymodel.ff")
  output_tensor = torch_model.apply(ffmodel, input_tensor)
  ## Model compilation
  ffmodel.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  ## Model training
  (x_train, y_train) = cifar10.load_data()
  ffmodel.fit(x_train, y_train, epochs=30)

More FlexFlow PyTorch examples: see the pytorch examples folder.

ONNX Support

Similar to the PyTorch front-end, FlexFlow also supports training existing ONNX models by loading the models using flexflow.onnx.model.ONNXModel.
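
A minimal sketch of the ONNX path is shown below, mirroring the PyTorch example above. The tensor dimensions are illustrative, and the apply() call is assumed to mirror PyTorchModel.apply(); the ONNX front-end may instead expect a dict mapping ONNX input names to tensors, so check the ONNX examples folder for the exact interface.

from flexflow.core import FFConfig, FFModel, DataType
from flexflow.onnx.model import ONNXModel

def top_level_task():
  ## Build an FFModel from the FlexFlow runtime configuration.
  ffconfig = FFConfig()
  ffmodel = FFModel(ffconfig)
  ## A 4-D input tensor (batch, channels, height, width); the dimensions are illustrative.
  input_tensor = ffmodel.create_tensor([64, 3, 229, 229], DataType.DT_FLOAT)
  ## Load the saved ONNX model and apply it to the input tensor (signature assumed, see above).
  onnx_model = ONNXModel("mymodel.onnx")
  output_tensor = onnx_model.apply(ffmodel, input_tensor)
  ## Compile and train as in the Keras/PyTorch examples.
  ffmodel.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])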

More FlexFlow ONNX examples: see the ONNX examples folder.

C++ Interface

For users who prefer to program in C/C++, FlexFlow provides a C++ programming interface that is equivalent to its Python APIs.

More FlexFlow C++ examples: see the C++ examples folder.

Command-Line Flags

In addition to setting runtime configurations in a FlexFlow Python/C++ program, the FlexFlow runtime also accepts command-line arguments for various runtime parameters (an illustrative invocation combining several of these flags appears after the lists below):

FlexFlow training flags:

  • -e or --epochs: number of total epochs to run (default: 1)
  • -b or --batch-size: global batch size in each iteration (default: 64)
  • -p or --print-freq: print frequency (default: 10)
  • -d or --dataset: path to the training dataset. If not set, synthetic data is used to conduct training.

Legion runtime flags:

  • -ll:gpu: number of GPU processors to use on each node (default: 0)
  • -ll:fsize: size of device memory on each GPU (in MB)
  • -ll:zsize: size of zero-copy memory (pinned DRAM with direct GPU access) on each node (in MB). This is used for prefetching training images from disk.
  • -ll:cpu: number of data loading workers (default: 4)
  • -ll:util: number of utility threads to create per process (default: 1)
  • -ll:bgwork: number of background worker threads to create per process (default: 1)

Performance auto-tuning flags:

  • --search-budget or --budget: the number of iterations for the MCMC search (default: 0)
  • --search-alpha or --alpha: a hyper-parameter for the search procedure (default: 0.05)
  • --export-strategy or --export: path to export the best discovered strategy (default: None)
  • --import-strategy or --import: path to import a previously saved strategy (default: None)
  • --enable-parameter-parallel: allow FlexFlow to explore parameter parallelism for performance auto-tuning. (By default FlexFlow only considers data and model parallelism.)
  • --enable-attribute-parallel: allow FlexFlow to explore attribute parallelism for performance auto-tuning. (By default FlexFlow only considers data and model parallelism.)

For more performance-tuning flags, see performance autotuning.
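
For example, the following illustrative command (the example script, memory sizes, and search budget are placeholders) trains the native AlexNet example on 4 GPUs for 10 epochs with a global batch size of 256 and exports the best discovered strategy:

./flexflow_python examples/python/native/alexnet.py -ll:py 1 -ll:gpu 4 -ll:fsize 12000 -ll:zsize 8192 -b 256 --epochs 10 --search-budget 100000 --export-strategy best_strategy.txt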

Contributing

Please let us know if you encounter any bugs or have any suggestions by submitting an issue.

We welcome all contributions to FlexFlow, from bug fixes to new features and extensions.

Please subscribe to the FlexFlow users mailing list for updates.

Citations

The Team

FlexFlow is developed and maintained by teams at CMU, Facebook, Los Alamos National Lab, MIT, and Stanford (alphabetically).

License

FlexFlow uses Apache License 2.0.

Comments
  • LEGION WARNING

    LEGION WARNING "failed to memoize the trace" / low GPU throughput

    I followed the compilation instructions of #231 and added the flag --python-data-loader-type 2, as a FlexFlow warning hinted me to do.

    Testing on a machine with 8xV100 cards, it runs, but I get the warning below roughly every 1,000 data points (images) when training AlexNet, and the throughput drops to about 60% of the expected numbers.

    [Metrics] accuracy: 14.315257% (1246 / 8704) sparse_categorical_crossentropy: 2.292534
    [Metrics] accuracy: 14.330358% (1284 / 8960) sparse_categorical_crossentropy: 2.292337
    [Metrics] accuracy: 14.257812% (1314 / 9216) sparse_categorical_crossentropy: 2.292219
    [Metrics] accuracy: 14.210304% (1346 / 9472) sparse_categorical_crossentropy: 2.292063
    [Metrics] accuracy: 14.113898% (1373 / 9728) sparse_categorical_crossentropy: 2.291939
    [Metrics] accuracy: 14.002404% (1398 / 9984) sparse_categorical_crossentropy: 2.291959
    [0 - 7f33c0063700]   20.357225 {4}{runtime}: [warning 1097] LEGION WARNING: WARNING: The runtime has failed to memoize the trace more than 5 times, due to the absence of a replayable template. It is highly likely that trace 201 will not be memoized for the rest of execution. The most recent template was not replayable for the following reason: precondition not subsumed by postcondition. Please change the mapper to stop making memoization requests. (from file /workdisk/FlexFlowMaster/FlexFlow/deps/legion/runtime/legion/legion_trace.cc:2025)
    For more information see:
    http://legion.stanford.edu/messages/warning_code.html#warning_code_1097

    epochs 2, ELAPSED TIME = 6.5928s, THROUGHPUT = 3033.59 samples/s
    

    This is the AlexNet strategy file that I tested with: MCMC_strategy.txt. Batch size was set to 256.

    opened by roman955b 27
  • Error occurs when executing with the PyTorch interface

    Error occurs when executing with the PyTorch interface

    I've successfully compiled the entire project from source code, and everything goes well with the scripts in the python/native/ folder. However, when running with PyTorch as the front end like this:

    python/flexflow_python examples/python/pytorch/resnet.py -ll:py 1 -ll:gpu 1 -ll:fsize 24000 -ll:zsize 15000 --epochs 10

    The error occurs: [0 - 7ff6a4e9da40] 2.538880 {5}{gpu}: ERROR: The binary was compiled for the wrong GPU architecture. Update the 'GPU_ARCH' flag at the top of runtime/runtime.mk to match/include your current GPU architecture (70).

    I tried to export GPU_ARC=70, but it doesn't work. How can I fix the bug? Is there any instruction?

    opened by MiZhangWhuer 25
  • How to generate a strategy for a DNN?

    How to generate a strategy for a DNN?

    It seems that I haven't figured out how to use FlexFlow. How should a strategy be used in training, and how do I generate one? Can you provide a tutorial? (Should I use the simulator in the scripts?)

    In the strategy folder, the strategies do not work and an error occurs (as follows):

    strategies.size() = 12
    workSpaceSize (1024 MB)
    Floating point exception (core dumped)

    However, with the default strategy (data parallelism), it runs successfully:

    ......
    forwardAlgo(7) time(0.67) bwdFilterAlgo(5) time(0.77) bwdDataAlgo(5) time(0.66)
    init pool (input): n(64) c(256) h(13) w(13)
    init pool (output): n(64) c(256) h(6) w(6)
    init linear (input): in_dim(9216) out_dim(4096) batch_size(64)
    init linear (input): in_dim(4096) out_dim(4096) batch_size(64)
    init linear (input): in_dim(4096) out_dim(1000) batch_size(64)
    ELAPSED TIME = 7.2131s, THROUGHPUT = 1135.71 samples/s

    Thanks.

    enhancement 
    opened by Orion-wyc 15
  • ERROR: The binary was compiled for the wrong GPU architecture. Update the 'GPU_ARCH' flag at the top of runtime/runtime.mk to match/include your current GPU architecture (80).

    ERROR: The binary was compiled for the wrong GPU architecture. Update the 'GPU_ARCH' flag at the top of runtime/runtime.mk to match/include your current GPU architecture (80).

    I installed FlexFlow from the source code. When testing FlexFlow with the PyTorch interface, errors occurred. Our environment is PyTorch 1.5, CUDA 11.0, and cuDNN 8.0. After installing FlexFlow, we ran into the following error when running ./flexflow_python $FF_HOME/examples/python/pytorch/mnist_mlp_torch.py -ll:py 1 -ll:gpu 4 -ll:fsize 2048 -ll:zsize 12192.

    [0 - 7f7911e7bfc0] 4098.849694 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849696 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849698 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849699 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849701 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849703 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849704 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4098.849706 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077353 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077362 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077366 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077369 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077372 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077374 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077377 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4100.077379 {4}{gpu}: duplicate registration of function _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
    [0 - 7f7911e7bfc0] 4103.231768 {5}{gpu}: ERROR: The binary was compiled for the wrong GPU architecture. Update the 'GPU_ARCH' flag at the top of runtime/runtime.mk to match/include your current GPU architecture (80).

    I tried specifying FF_CUDA_ARCH=80 in the config/config.linux file. I also modified runtime.mk in the deps/legion/runtime folder and set GPU_ARCH = ampere. However, neither of these methods worked; I still run into the above error.

    Is there any method to solve the above errors? Thanks in advance.

    opened by TonyTangYu 14
  • Runtime Hang after finishing all tasks

    Runtime Hang after finishing all tasks

    To reproduce

    /home/ubuntu/FlexFlow//python/flexflow_python /home/ubuntu/FlexFlow//examples/python/keras/seq_mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize 14048 -ll:zsize 12192 -b 64 --only-data-parallel
    

    The execution hangs after `real end top-level task` is printed. The issue seems to be related to some of the recent changes to the Python part.

    bug 
    opened by jiazhihao 12
  • Build for hip gpu backends

    Build for hip gpu backends

    Issue: https://github.com/flexflow/FlexFlow/issues/345

    Testing the build

    FF_GPU_BACKEND=hip_rocm ./docker/build.sh
    

    Current status

    hip_rocm builds e2e with a few changes to legion and flexflow source

    Small source modifications for build

    Misc small changes to the source to get the build working; these should be OK to merge.

    Move tools to top level directory

    We glob for files under src to get the source files for the flexflow target. Moving tools to the top-level directory prevents the tools' source files from accidentally being added to the flexflow target source files.

    Changed substitution_to_dot from cuda_add_executable to add_executable. When building with hip_rocm, we don't have CUDA available and shouldn't need it for substitution_to_dot, since that target does not depend on CUDA.

    Remaining:

    • [x] hip_rocm backend builds
    • [x] cuda backend builds
    • [x] Do changes to legion source have to be merged in
    • [x] Is switching from the miopen.h to the miopen/miopen.h header acceptable?
    • [x] Fix out of date hip kernels to match kernel headers
    • [x] Feedback on hip kernel changes
    • [x] update Dockerfile to conditionally install hip dependencies
    • [x] Document Docker build changes

    Misc

    Additional note on the Legion change: I also don't know whether the const_cast used to remove the volatile qualifier is sound in that context. I mainly added it to get Legion compiling with the changed build config.

    opened by williamberman 12
  • fail to allocate future buffer for task

    fail to allocate future buffer for task

    I am trying to run the following script from the test.sh file:

    ./flexflow_python $FF_HOME/examples/python/native/alexnet.py -ll:py 1 -ll:gpu 4 -ll:fsize 8000 -ll:cpu 28 -ll:zsize 12192 --epochs 40

    and get the following error:

    [0 - 7fd3f4340700] 16.466406 {5}{runtime}: [error 576] LEGION ERROR: Failed to allocate eager future buffer for task flexflow_top_level_task (UID 1) because Visible to all processors on a node memory 1e00000000000000 is full. This is an eager allocation so you must adjust the percentage of this mememory dedicated for eager allocations with '-lg:eager_alloc_percentage' flag on the command line. (from file /usr/FlexFlow/deps/legion/runtime/legion/runtime.cc:9971)
    For more information see: http://legion.stanford.edu/messages/error_code.html#error_code_576

    I tried playing with the -lg flag

    opened by DavidPeleg6 12
  • Configuring incomplete, errors occurred!

    Configuring incomplete, errors occurred!

    When I installed FlexFlow following this tutorial, https://github.com/flexflow/FlexFlow/blob/master/INSTALL.md, I got the following error:

    CMake Error: The following variables are used in this project, but they are set to NOTFOUND. Please set them or make sure they are set and tested correctly in the CMake files:
    CUDA_CUDA_LIBRARY (ADVANCED)
        linked by target "RealmRuntime" in directory /data/FlexFlow/deps/legion/runtime
    -- Configuring incomplete, errors occurred!

    I have already exported the LG_RT_DIR variable, so what is the problem and what should I do? Thanks.

    opened by Tron-x 12
  • Can't ffcompile examples/cpp/ResNet

    Can't ffcompile examples/cpp/ResNet

    I get the following error when I try to compile ResNet:

    ./ffcompile.sh examples/cpp/ResNet/
    Use the FlexFlow protoc
    make: *** No rule to make target '../../src/runtime/model.cc', needed by '../../src/runtime/model.cc.o'. Stop.

    opened by bhetherman 12
  • FlexFlow Performance of Alexnet drastically drops in Multi-GPU

    FlexFlow Performance of Alexnet drastically drops in Multi-GPU

    Environment: V100-PCIE (CUDA)

    FlexFlow-AlexNet (1GPU):

    • data-parallel: 1172.39 samples/s
    • optimized: 2005.06 samples/s

    FlexFlow-AlexNet (2GPU):

    • data-parallel: 840.98 samples/s
    • optimized: 2000.77 samples/s

    FlexFlow-AlexNet (4GPU):

    • data-parallel: 331.25 samples/s
    • optimized: 1444.97 samples/s

    Tensorflow-AlexNet (1GPU):

    • data-parallel: 3079.8 samples/s

    Tensorflow-AlexNet (2GPU):

    • data-parallel: 3414.8 samples/s

    Tensorflow-AlexNet (4GPU):

    • data-parallel: 3210.6 samples/s

    How to fix it?

    opened by ghostplant 11
  • The strategy returned by auto-tuning is slower than the default data parallel strategy

    The strategy returned by auto-tuning is slower than the default data parallel strategy

    Testing Environment

    AWS p3.8xlarge, Deep Learning AMI (Ubuntu 18.04) Version 43.0, 4 x V100 GPUs. FlexFlow : 6b06996 (master branch, Jun 28, 2021)

    Model definition

    The model is an MLP with 8 dense layers. The file is placed at $FF_HOME/examples/python/native/mlp_2304.py: https://github.com/merrymercy/FlexFlow/blob/d55288e983418b2c7eaf3987d987782970640ca5/examples/python/native/mlp_2304.py

    Auto-tuning

    ./flexflow_python $FF_HOME/examples/python/native/mlp_2304.py -ll:py 1 -ll:gpu 4 -ll:fsize 14000 -ll:zsize 8192 --batch-size 16384 --export mlp_stra.txt --search-budget 500000 --enable-parameter-parallel -enable-attribute-parallel --search-alpha 0.05
    ...
    iteration(0) current_strategy(318.7493) best_strategy(318.7493)
    ...
    iteration(8000) current_strategy(316.6185) best_strategy(316.6185)
    ...
    iteration(500000) current_strategy(316.6185) best_strategy(316.6185)
    =========== Best Discovered Strategy ==========
    [Dense_100] num_dims(2) dims[4,1] device_ids[0,1,2,3]
    [Dense_101] num_dims(2) dims[4,1] device_ids[0,1,2,3]
    [Dense_102] num_dims(2) dims[4,1] device_ids[0,1,2,3]
    [Dense_103] num_dims(2) dims[4,1] device_ids[0,1,2,3]
    [Dense_104] num_dims(2) dims[4,1] device_ids[0,1,2,3]
    [Dense_105] num_dims(2) dims[4,1] device_ids[0,1,2,3]
    [Dense_106] num_dims(2) dims[1,4] device_ids[0,1,2,3]
    [Dense_107] num_dims(2) dims[1,4] device_ids[0,1,2,3]
    ============= MCMC Search Finished ============
    ...
    

    Benchmark

    1. Use the tuned strategy
    ./flexflow_python $FF_HOME/examples/python/native/mlp_2304.py -ll:py 1 -ll:gpu 4 -ll:fsize 14000 -ll:zsize 8192 --batch-size 16384 --import mlp_stra.txt
    ...
    Time: 0.484 second
    
    2. Use the default data-parallel strategy
    ./flexflow_python $FF_HOME/examples/python/native/mlp_2304.py -ll:py 1 -ll:gpu 4 -ll:fsize 14000 -ll:zsize 8192 --batch-size 16384
    ...
    Time: 0.333 second
    

    Questions

    I have the following observations.

    • O1: The tuned strategy is slower than the default data-parallel strategy.
    • O2: The execution time estimated by the simulator is not very accurate, at least for the returned strategy.
    • O3: The MCMC search does not improve the execution time much, even in the simulator. It only reduces the estimated execution time from 318.7493 to 316.7213. In addition, the estimated execution time stops decreasing after 8,000 iterations; adding more search iterations does not improve the result.

    Are O1 and O2 due to a wrong configuration of the simulator or an inefficient implementation of the runtime? For example, the NVLink-based GPU-to-GPU connection on p3.8xlarge is not symmetric. How can I configure the simulator to fit the real topology better? Is O3 the expected behavior of the MCMC search?

    opened by merrymercy 10
  • Implement ReduceSum

    Implement ReduceSum

    Description of changes:

    Related Issues:

    Linked Issues:

    • Issue #

    Issues closed by this PR:

    • Closes #479

    Before merging:

    • [ ] Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules?
    opened by jiazhihao 0
  • [MOE] - Update code for Unity compatibility

    [MOE] - Update code for Unity compatibility

    Description of changes:

    This PR aims to remove dead code from the MoE-related files and make the example/operators work again after the Unity merge. A good amount of the code is extracted from the inference branch.

    TODO:

    • [ ] Fix dataloader bug (seg fault)
    • [ ] Fix replica dimension hack
    • [ ] Implement measure_operator_cost for aggregate/aggregate_spec, group_by, and topk operators (we currently have placeholder functions in aggregate.cc, aggregate_spec.cc, group_by.cc, topk.cc)

    Related Issues:

    Linked Issues:

    • Issue #

    Issues closed by this PR:

    • Closes #

    Before merging:

    • [ ] Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules?
    opened by gabrieleoliaro 0
  • [Python] - Fix issue related to path of `flexflow_native_python` library

    [Python] - Fix issue related to path of `flexflow_native_python` library

    Description of changes:

    In this PR, we fix some issues related to the path of the flexflow_native_python.so library, to make it always possible to run FlexFlow using the native python interpreter. We also add checks to CI to ensure that both the flexflow_python and native python interpreter work properly in all setups. In particular:

    • if you build/install FlexFlow with pip, you should be able to import the flexflow.core module and use the python/flexflow_python interpreter without exporting any special flag, as well as the native python interpreter with just the FF_USE_NATIVE_PYTHON=1 flag (no PYTHONPATH needed)
    • if you build FlexFlow with Cmake WITHOUT installing, you should be able to import the flexflow.core module by setting PYTHONPATH="${FF_HOME}/python"; use the python/flexflow_python interpreter without any flag, and use the native python interpreter with the PYTHONPATH="${FF_HOME}/python:${FF_HOME}/build/python" and FF_USE_NATIVE_PYTHON=1 flags
    • if you build FlexFlow with Cmake AND install it, you should be able to import the flexflow.core module and use the python/flexflow_python interpreter without exporting any special flag, as well as the native python interpreter with just the FF_USE_NATIVE_PYTHON=1 flag (no PYTHONPATH needed)

    Related Issues:

    Linked Issues:

    • Issue #

    Issues closed by this PR:

    • Closes #

    Before merging:

    • [ ] Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules?
    opened by gabrieleoliaro 0
  • [BatchMatmul] Refactor kernel functions

    [BatchMatmul] Refactor kernel functions

    Description of changes:

    This PR moves the kernel functions from batch_matmul.h and batch_matmul.cc into separate batch_matmul_kernels.h, batch_matmul_kernels.cu, and batch_matmul_kernels.cpp files.

    Related Issues:

    Linked Issues:

    • Issue #303

    Issues closed by this PR:

    • Closes #438

    Before merging:

    • [x] Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules? n/a
    opened by virena 0
  • Refactor config.linux into python and fix conda cuda build issues

    Refactor config.linux into python and fix conda cuda build issues

    Description of changes:

    Translate config.linux and config.inc into Python and add a fix for building with conda-supplied cuda (essentially, the nvidia cuda-toolkit places libcuda.so under lib/stubs instead of lib64/stubs).

    Related Issues:

    Linked Issues:

    • Issue #

    Issues closed by this PR:

    • Closes #

    Before merging:

    • [ ] Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules?
    opened by lockshaw 3
Releases(r22.07)
  • r22.07(Aug 1, 2022)

    This is the last stable release of FlexFlow before the Unity merge. Unity enables joint optimization of algebraic transformations and parallelization, and generally achieves better performance and scalability than the original FlexFlow. The Unity merge introduces the following major changes to FlexFlow.

    • With Unity, we now use parallel computation graphs (PCGs) to represent a DNN model. PCG is a unified representation of distributed DNN training that simultaneously expresses computation, parallelism, and data movement. A detailed description of PCG is available here.

    • We add support for Unity's additional forms of parallelism, including reduction parallelism and other operator-specific parallelization strategies.

    • We replace FlexFlow's MCMC search with a three-layer hierarchical search algorithm, which jointly optimizes algebraic transformations and parallelization and achieves better performance and scalability than the MCMC search.

    Starting from this release, Unity's changes will be available in the master branch of the FlexFlow repository.

  • r22.05(Jun 8, 2022)

    This is a stable release of FlexFlow in preparation for the Unity merge.

    Frontend support:

    • FlexFlow now supports training HuggingFace models using the PyTorch fx interface. An example of training HuggingFace MT5 in FlexFlow is available at https://github.com/flexflow/FlexFlow/tree/master/examples/python/pytorch/mt5

    PyTorch Alignment:

    • Added unit tests for aligning FlexFlow's operators with PyTorch's. For each operator, the unit test checks whether FlexFlow and PyTorch return identical activations/gradients when given the same inputs. More details on the PyTorch alignment are available at https://github.com/flexflow/FlexFlow/tree/master/align

    Documentation:

    • Initial documentation support added: https://github.com/flexflow/FlexFlow/tree/master/docs

    Operators:

    • Multiple bug fixes for FlexFlow operators

    Broadcast:

    • FlexFlow now supports broadcasting for a subset of operators, including elementwise unary and elementwise binary operators. The broadcasting semantics are identical to NumPy's.
  • r21.09(Oct 6, 2021)

    Frontend Supports

    • PyBind11 is now the default Python frontend in FlexFlow.

    Control Replication

    Distributed training

    • FlexFlow now uses NCCL AllReduce for gradient synchronization by default. To switch to a distributed parameter server, set FF_USE_NCCL=OFF in cmake.

    Distributed inference

    • Passing comp_node = CompMode::INFERENCE as an additional argument to model.compile will run a DNN model in inference mode.
    • Various bug fixes and performance improvements for distributed inference in FlexFlow.

    Operators

    • Additional operators include AggregateSpec and Multi-Head Attention.

    Machine Model

    • FlexFlow now supports a new machine model for more precisely modeling network topology and simulating traffic at the granularity of individual packets.
  • r21.03(Apr 2, 2021)

    • Build
      • FlexFlow now uses the CMake build by default; the Makefiles will be deprecated soon.
    • Frontend Supports
      • In addition to CFFI, FlexFlow now also supports a Python interface via PyBind11. To use PyBind11, please set FF_USE_PYBIND = ON in cmake.
    • Distributed inference
      • FlexFlow supports automated performance tuning for both distributed training and inference. For optimizing and performing distributed inference, simply pass comp_node = CompMode::INFERENCE as an additional argument to model.compile. An example can be found at https://github.com/flexflow/FlexFlow/blob/master/examples/python/native/bert_proxy_native.py.
    • Runtime
      • FlexFlow now supports gradient updates via either a Parameter Server or NCCL AllReduce. To enable NCCL, please set FF_USE_NCCL = ON in cmake.
    • Operators
      • New operators including Aggregate, Multi-head Attention, Scalar Multiply, Scalar Add, Scalar Sub, Scalar Divide and Top-K.
      • Conv2D now supports group convolutions.
    • Examples
      • Unit tests of all operators have been added to the tests/ops folder.
  • r20.12(Jan 4, 2021)

    • Build
      • FlexFlow now supports both Makefile and CMake builds. More details are available in the instructions.
    • Frontend Supports
      • PyTorch. FlexFlow now supports training existing PyTorch models with minimal changes to the source code. To run PyTorch models in FlexFlow, users can first export a model to the ONNX format using torch.onnx and then load an ONNX model in FlexFlow for distributed training. More examples: https://github.com/flexflow/FlexFlow/tree/master/examples/python/pytorch
      • ONNX. FlexFlow supports training existing ONNX models through flexflow.onnx.model. More examples: https://github.com/flexflow/FlexFlow/tree/master/examples/python/onnx
      • TensorFlow Keras. Similar to the PyTorch support, flexflow.keras enables distributed training of existing TensorFlow Keras models. See this bootcamp talk for more details.
    • Parallelization Optimizer
      • Integrated the parallelization optimizer into the FlexFlow runtime. Users can now use the --search-budget and --search-alpha flags to control the parallelization optimizer when searching for optimized strategies. See this post for the usage of the optimizer.
    • Examples
      • More PyTorch, ONNX, TensorFlow Keras examples have been added to the /examples/python folder.
      • Updated the cpp examples to use the new runtime interface.
    • Mapper
      • Implemented a new mapper with improved runtime performance.
    • Legion
      • Updated the Legion version with improved runtime performance
  • v1.1.1(Feb 14, 2019)

    This is v1.1.1 pre-release for SysML19 Artifact Evaluation. Follow the instructions to build FlexFlow and use the script run_experiments.sh to run all experiments.

  • v1.1(Feb 11, 2019)

    This is v1.1 pre-release for SysML19 Artifact Evaluation. Follow the instructions to build FlexFlow and use the script run_experiments.sh to run all experiments.

  • v1.0(Jan 26, 2019)

    This is a pre-release for SysML19 Artifact Evaluation. Follow the instructions to build FlexFlow and use the script run_experiments.sh to run all experiments.
