Large scale embeddings on a single machine.

Marius

Marius is a system under active development for training embeddings for large-scale graphs on a single machine.

Training on large-scale graphs requires moving large amounts of data to get embedding parameters from storage to the computational device. Marius is designed to reduce these data movement overheads using:

  • Pipelined training and IO
  • Partition caching and buffer-aware data orderings
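As a rough illustration of the first idea, the sketch below overlaps a "load" stage with a "train" stage using a bounded queue. It is a toy producer/consumer example in plain Python, not Marius's implementation; the function names, the placeholder batches, and the queue depth are all assumptions made for illustration.

```python
import queue
import threading


def load_batches(num_batches, out_queue):
    """Producer stage: stands in for reading embedding parameters from storage."""
    for batch_id in range(num_batches):
        batch = [batch_id * 10 + i for i in range(3)]  # placeholder "parameters"
        out_queue.put(batch)  # blocks while the bounded queue is full
    out_queue.put(None)  # sentinel: no more batches


def train(in_queue, results):
    """Consumer stage: stands in for the compute step on the device."""
    while True:
        batch = in_queue.get()
        if batch is None:
            break
        results.append(sum(batch))  # placeholder "training" work


def run_pipeline(num_batches=4, queue_depth=2):
    """Run both stages concurrently so loading overlaps with compute."""
    batches = queue.Queue(maxsize=queue_depth)
    results = []
    loader = threading.Thread(target=load_batches, args=(num_batches, batches))
    trainer = threading.Thread(target=train, args=(batches, results))
    loader.start()
    trainer.start()
    loader.join()
    trainer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

The bounded queue is the key design choice in this sketch: it lets the loader run ahead of the trainer by at most `queue_depth` batches, which caps memory use while still hiding load latency behind compute.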

Details on how Marius works can be found in our OSDI '21 paper. Experiment scripts and configurations for the paper are in the osdi2021 branch.

Requirements

(Other versions may work, but are untested)

  • Ubuntu 18.04 or macOS 10.15
  • CUDA 10.1 or 10.2 (if using GPU training)
  • cuDNN 7 (if using GPU training)
  • PyTorch >= 1.7
  • Python >= 3.6
  • pip >= 21
  • GCC >= 9 (on Linux) or Clang 12.0 (on macOS)
  • cmake >= 3.12
  • make >= 3.8
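Before building, it can help to check the minimum versions listed above programmatically. The following is a minimal sketch, not part of Marius; the helper names `parse_version`, `meets_minimum`, and `report` are hypothetical. It compares versions numerically, since a plain string comparison would wrongly rank "1.10" below "1.7".

```python
import re
import shutil
import sys


def parse_version(text):
    """Turn '1.7.1+cu101' into (1, 7, 1), ignoring local suffixes such as '+cu101'."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically."""
    return parse_version(installed) >= parse_version(minimum)


def report():
    """Print a short summary for the interpreter and two of the build tools."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    status = "ok" if meets_minimum(python_version, "3.6") else "too old"
    print("python", python_version, status)
    for tool in ("cmake", "make"):
        print(tool, "found" if shutil.which(tool) else "not found on PATH")


if __name__ == "__main__":
    report()
```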

Installation from source with Pip

  1. Install the latest version of PyTorch for your CUDA version: https://pytorch.org/get-started/locally/

  2. Clone the repository: git clone https://github.com/marius-team/marius.git

  3. Build and install Marius: cd marius; python3 -m pip install .

Full script (without torch install)

git clone https://github.com/marius-team/marius.git
cd marius
python3 -m pip install .

Training a graph

Training embeddings on a graph requires three steps.

  1. Define a configuration file. This example will use the config already defined in examples/training/configs/fb15k_gpu.ini

    See docs/configuration.rst for full details on the configuration options.

  2. Preprocess the dataset: marius_preprocess output_dir/ --dataset fb15k

    This command will download the freebase15k dataset and preprocess it for training, storing files in output_dir/. If a different output directory is used, the configuration file's path options will need to be updated accordingly.

  3. Run the training executable with the config file: marius_train examples/training/configs/fb15k_gpu.ini

The output of the first epoch should be similar to the following.

[info] [03/18/21 01:33:18.778] Metadata initialized
[info] [03/18/21 01:33:18.778] Training set initialized
[info] [03/18/21 01:33:18.779] Evaluation set initialized
[info] [03/18/21 01:33:18.779] Preprocessing Complete: 2.605s
[info] [03/18/21 01:33:18.791] ################ Starting training epoch 1 ################
[info] [03/18/21 01:33:18.836] Total Edges Processed: 40000, Percent Complete: 0.082
[info] [03/18/21 01:33:18.862] Total Edges Processed: 80000, Percent Complete: 0.163
[info] [03/18/21 01:33:18.892] Total Edges Processed: 120000, Percent Complete: 0.245
[info] [03/18/21 01:33:18.918] Total Edges Processed: 160000, Percent Complete: 0.327
[info] [03/18/21 01:33:18.944] Total Edges Processed: 200000, Percent Complete: 0.408
[info] [03/18/21 01:33:18.970] Total Edges Processed: 240000, Percent Complete: 0.490
[info] [03/18/21 01:33:18.996] Total Edges Processed: 280000, Percent Complete: 0.571
[info] [03/18/21 01:33:19.021] Total Edges Processed: 320000, Percent Complete: 0.653
[info] [03/18/21 01:33:19.046] Total Edges Processed: 360000, Percent Complete: 0.735
[info] [03/18/21 01:33:19.071] Total Edges Processed: 400000, Percent Complete: 0.816
[info] [03/18/21 01:33:19.096] Total Edges Processed: 440000, Percent Complete: 0.898
[info] [03/18/21 01:33:19.122] Total Edges Processed: 480000, Percent Complete: 0.980
[info] [03/18/21 01:33:19.130] ################ Finished training epoch 1 ################
[info] [03/18/21 01:33:19.130] Epoch Runtime (Before shuffle/sync): 339ms
[info] [03/18/21 01:33:19.130] Edges per Second (Before shuffle/sync): 1425197.8
[info] [03/18/21 01:33:19.130] Edges Shuffled
[info] [03/18/21 01:33:19.130] Epoch Runtime (Including shuffle/sync): 339ms
[info] [03/18/21 01:33:19.130] Edges per Second (Including shuffle/sync): 1425197.8
[info] [03/18/21 01:33:19.148] Starting evaluating
[info] [03/18/21 01:33:19.254] Pipeline flush complete
[info] [03/18/21 01:33:19.271] Num Eval Edges: 50000
[info] [03/18/21 01:33:19.271] Num Eval Batches: 50
[info] [03/18/21 01:33:19.271] Auc: 0.973, Avg Ranks: 24.477, MRR: 0.491, Hits@1: 0.357, Hits@5: 0.651, Hits@10: 0.733, Hits@20: 0.806, Hits@50: 0.895, Hits@100: 0.943
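
The throughput lines in this log can be sanity-checked with a little arithmetic: edges per second times epoch runtime should give roughly the number of training edges. The sketch below parses two lines copied from the log above and performs that multiplication. The regular expressions and the function name `implied_edges` are illustrative assumptions, not part of Marius.

```python
import re

# Two lines copied verbatim from the sample log above.
SAMPLE_LOG = """\
[info] [03/18/21 01:33:19.130] Epoch Runtime (Before shuffle/sync): 339ms
[info] [03/18/21 01:33:19.130] Edges per Second (Before shuffle/sync): 1425197.8
"""

RUNTIME = re.compile(r"Epoch Runtime \(Before shuffle/sync\): (\d+)ms")
RATE = re.compile(r"Edges per Second \(Before shuffle/sync\): ([\d.]+)")


def implied_edges(log_text):
    """Estimate edges processed in the epoch as rate * runtime."""
    runtime_ms = int(RUNTIME.search(log_text).group(1))
    rate = float(RATE.search(log_text).group(1))
    return rate * runtime_ms / 1000.0


if __name__ == "__main__":
    print(round(implied_edges(SAMPLE_LOG)))
```

The result is about 483142 edges, which matches the fb15k training-set size reported in the preprocessing logs quoted in the comments further down this page. The match is only approximate because the logged runtime is rounded to whole milliseconds.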

To train using CPUs only, use the examples/training/configs/fb15k_cpu.ini configuration file instead.
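
Step 2 above notes that the configuration file's path options must be updated when a different output directory is used. One way to do that is with Python's standard `configparser`, as sketched below. The `[path]` section and the `train_edges` key in the example fragment are placeholders chosen for illustration and are not copied from fb15k_gpu.ini; see docs/configuration.rst for the real option names.

```python
import configparser
import io

# Hypothetical config fragment: the [path] section and its key name are
# placeholders for illustration, not copied from fb15k_gpu.ini.
EXAMPLE_CONFIG = """\
[general]
device=CPU

[path]
train_edges=output_dir/train_edges.pt
"""


def retarget_paths(config_text, old_dir, new_dir):
    """Rewrite every option value that starts with old_dir to start with new_dir."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(config_text)
    for section in parser.sections():
        for key, value in parser.items(section):
            if value.startswith(old_dir):
                parser.set(section, key, new_dir + value[len(old_dir):])
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(retarget_paths(EXAMPLE_CONFIG, "output_dir/", "my_data/"))
```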

Using the Python API

Sample Code

Below is a sample Python script that trains embeddings on fb15k for a single epoch.

import marius as m
from marius.tools import preprocess

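def fb15k_example():
    # Download and preprocess fb15k into output_dir/.
    preprocess.fb15k(output_dir="output_dir/")

    # Load the CPU training configuration.
    config_path = "examples/training/configs/fb15k_cpu.ini"
    config = m.parseConfig(config_path)

    # Build the training and evaluation datasets described by the config.
    train_set, eval_set = m.initializeDatasets(config)

    # Create the model from the configured encoder and decoder.
    model = m.initializeModel(config.model.encoder_model, config.model.decoder_model)

    trainer = m.SynchronousTrainer(train_set, model)
    evaluator = m.SynchronousEvaluator(eval_set, model)

    # Train for a single epoch, then evaluate the resulting embeddings.
    trainer.train(1)
    evaluator.evaluate(True)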


if __name__ == "__main__":
    fb15k_example()
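
The evaluation log shown earlier reports ranking metrics such as Avg Ranks and MRR. As a reminder of what those numbers mean, the sketch below computes them from a list of ranks using their conventional definitions, along with hits@k (the fraction of ranks at or below k). This is a standalone illustration and makes no claim about how Marius computes them internally; the function name and the sample ranks are made up.

```python
def ranking_metrics(ranks, ks=(1, 10)):
    """Compute average rank, mean reciprocal rank, and hits@k from 1-based ranks."""
    n = len(ranks)
    metrics = {
        "avg_rank": sum(ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics["hits@%d" % k] = sum(1 for r in ranks if r <= k) / n
    return metrics


if __name__ == "__main__":
    # Made-up ranks of the true edge among candidates for four test edges.
    print(ranking_metrics([1, 2, 4, 20]))
```

For the made-up ranks above, the average rank is 6.75, the MRR is about 0.45, hits@1 is 0.25, and hits@10 is 0.75.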

Marius in Docker

Marius can be deployed within a Docker container. A sample Ubuntu Dockerfile, located at examples/docker/dockerfile, comes with the dependencies needed for GPU training preinstalled.

Building and running the container

Build an image with the name marius and the tag example:
docker build -t marius:example -f examples/docker/dockerfile examples/docker

Create and start a new container instance named gaius with:
docker run --name gaius -itd marius:example

Run docker ps to verify that the container is running.

Start a bash session inside the container:
docker exec -it gaius bash

Sample Dockerfile

See examples/docker/dockerfile

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
RUN apt update

RUN apt install -y g++ \
         make \
         wget \
         unzip \
         vim \
         git \
         python3-pip

# install gcc-9
RUN apt install -y software-properties-common
RUN add-apt-repository -y ppa:ubuntu-toolchain-r/test
RUN apt update
RUN apt install -y gcc-9 g++-9
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 9
RUN update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 9

# install cmake 3.20
RUN wget https://github.com/Kitware/CMake/releases/download/v3.20.0/cmake-3.20.0-linux-x86_64.sh
RUN mkdir /opt/cmake
RUN sh cmake-3.20.0-linux-x86_64.sh --skip-license --prefix=/opt/cmake/
RUN ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake

# install pytorch
RUN python3 -m pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Citing Marius

Arxiv Version:

@misc{mohoney2021marius,
      title={Marius: Learning Massive Graph Embeddings on a Single Machine}, 
      author={Jason Mohoney and Roger Waleffe and Yiheng Xu and Theodoros Rekatsinas and Shivaram Venkataraman},
      year={2021},
      eprint={2101.08358},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

OSDI Version:

@inproceedings {273733,
                author = {Jason Mohoney and Roger Waleffe and Henry Xu and Theodoros Rekatsinas and Shivaram Venkataraman},
                title = {Marius: Learning Massive Graph Embeddings on a Single Machine},
                booktitle = {15th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 21)},
                year = {2021},
                isbn = {978-1-939133-22-9},
                pages = {533--549},
                url = {https://www.usenix.org/conference/osdi21/presentation/mohoney},
                publisher = {{USENIX} Association},
                month = jul,
}
Comments
  • ERROR during the pip installation process

    I was trying to install Marius with pip, but the following error keeps popping up:

    ERROR: Could not find a version that satisfies the requirement torch (from marius==0.0.2) (from versions: none)
    ERROR: No matching distribution found for torch (from marius==0.0.2)

    I tried running pip install torchvision==0.1.8 on the command line, and it reported Successfully installed torch-1.11.0 torchvision-0.1.8. But when I ran pip3 install . again, the same error appeared. How can I solve this so that I can proceed? Thank you.
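
One thing worth ruling out in a situation like this is that `pip` and `pip3` belong to different Python interpreters, so a package installed with one is not visible to the other. This is only a possible cause, not a confirmed diagnosis of the issue above. The sketch below, with a hypothetical helper name, reports which interpreter is running and whether it can see a given package. Running it with the same interpreter used for the install (for example via python3) shows what that interpreter sees.

```python
import importlib.util
import sys


def describe_environment(package="torch"):
    """Report the running interpreter and whether it can locate the package."""
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "python_version": "%d.%d" % sys.version_info[:2],
        "package_found": spec is not None,
        "package_location": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment("torch").items():
        print(key, "=", value)
```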

    question 
    opened by lwwlwwl 17
  • 'test_edges.bin' and 'validation_edges.bin' are not created when preprocessing ogbn_products

    Hi, I want to run marius with ogbn-products dataset.

    I executed the following command: marius_preprocess --dataset ogbn_products --output_dir datasets/ogbn_products

    The command ran without problems, but only 'train_edges.bin' was created in ogbn_products/edges; 'test_edges.bin' and 'validation_edges.bin' are missing. How can I get them?

    Thanks a lot.
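
A quick way to confirm which of the files named in this report are present after preprocessing is a small directory check. This is a sketch using only the standard library. The file names come from the report above, while the helper name and the example directory are assumptions.

```python
from pathlib import Path

# File names taken from the issue text above.
EXPECTED = ("train_edges.bin", "validation_edges.bin", "test_edges.bin")


def missing_files(edges_dir, expected=EXPECTED):
    """Return the expected file names that are not present in edges_dir."""
    directory = Path(edges_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # Example directory; adjust to wherever marius_preprocess wrote its output.
    print(missing_files("datasets/ogbn_products/edges"))
```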

    question 
    opened by qhtjrmin 13
  • Training Wikidata embedding

    I'm trying to create embeddings for Wikidata using this config file:

    [general]
    device=CPU
    num_train=611058458
    num_nodes=91580024
    num_valid=612283
    num_test=612283
    experiment_name=wikidata
    num_relations=1390
    ...

    However, I am getting the error:

    ValueError: cannot create std::vector larger than max_size()

    Looking for any workaround, thanks

    question 
    opened by dlauc 13
  • Very high memory allocation during preprocessing

    Problem description

    When preprocessing my data set in the .tsv format to prepare it for training and splitting into four partitions, I receive an out-of-memory error.

    I used the command: marius_preprocess --output_dir /mount/marius_preprocessed/works/ --edges /mount/data/works.tsv --dataset_split 0.8 0.1 0.1 --columns 0 1 2 --num_partitions 4

    However, during preprocessing I encounter the error:

    unique_nodes = np.unique(np.concatenate([unique_src.astype(str), unique_dst.astype(str)]))
    numpy.core._exceptions.MemoryError: Unable to allocate 612GiB for an array with shape (167416627, ).

    The input file is 46GB in size and contains about 1 billion lines (triples). The instance I'm using has 500GB of memory. The length of the array appears to equal the number of unique entities in the input file.

    The error occurs after the edge remapping step has started. Changing the number of partitions did not help. I am running the tool in a Docker container, but the behavior without a container was similar.

    I understand that Marius was built to generate embeddings efficiently on lower-capacity machines during training. Is there any way to reduce the resource needs during preprocessing as well? Are there any modifications I could make to the .tsv file that would help the preprocessing?
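
The size of the failed allocation can be unpacked with some arithmetic. NumPy's fixed-width Unicode string dtype stores 4 bytes per character and pads every element to the length of the longest string, so dividing the requested bytes by the element count gives the implied string width. The sketch below does that division for the numbers in the error message. It also shows one generic way to avoid wide string arrays, by mapping labels to integer IDs with a dictionary. That second part is an illustrative workaround under the assumption that the labels can be streamed, not a description of how marius_preprocess works.

```python
GIB = 2 ** 30


def bytes_per_element(total_gib, num_elements):
    """Bytes requested per array element."""
    return total_gib * GIB / num_elements


def implied_max_chars(total_gib, num_elements, bytes_per_char=4):
    """String width implied by a fixed-width array with 4 bytes per character."""
    return bytes_per_element(total_gib, num_elements) / bytes_per_char


def build_id_map(labels):
    """Assign consecutive integer IDs to labels in first-seen order."""
    ids = {}
    for label in labels:
        ids.setdefault(label, len(ids))
    return ids


if __name__ == "__main__":
    # Numbers taken from the MemoryError above.
    print(round(bytes_per_element(612, 167416627)))
    print(round(implied_max_chars(612, 167416627)))
    print(build_id_map(["a", "b", "a", "c"]))
```

With the numbers from the error, each element takes roughly 3925 bytes, which corresponds to strings about 981 characters wide. If that reading is right, a single very long node label in the input would be enough to inflate every element of the array.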

    Expected behavior

    Preprocessing of raw input (.nt or .tsv files) into ready-to-train partitions, with resource requirements comparable to those of embedding training.

    Environment

    $ marius_env_info
    cmake:
      version: 3.20.0
    cpu_info:
      num_cpus: 32
      total_memory: 503GB
    cuda:
      version: '11.1'
    gpu_info: []
    marius:
      bindings_installed: true
      install_path: /usr/local/lib/python3.6/dist-packages/marius
      version: 0.0.2
    openmp:
      version: '201511'
    operating_system:
      platform: Linux-4.19.0-13-amd64-x86_64-with-Ubuntu-18.04-bionic
    pybind:
      PYBIND11_BUILD_ABI: _cxxabi1011
      PYBIND11_COMPILER_TYPE: _gcc
      PYBIND11_STDLIB: _libstdcpp
    python:
      compiler: GCC 8.4.0
      deps:
        numpy_version: 1.19.5
        omegaconf_version: 2.2.3
        pandas_version: 1.1.5
        pip_version: 9.0.1
        pyspark_version: 3.2.2
        pytest_version: 7.0.1
        torch_version: 1.9.1+cu111
        tox_version: 3.26.0
      version: 3.6.9
    pytorch:
      install_path: /usr/local/lib/python3.6/dist-packages/torch
      version: 1.9.1+cu111
    

    Thank you!

    bug 
    opened by johankit 12
  • marius_preprocess triggers program aborted

    Describe the bug

    Running marius_preprocess or importing preprocess triggers the following error.

    free(): invalid pointer
    Aborted
    

    To Reproduce

    Steps to reproduce the behavior:

    1. Run the given example 'marius_preprocess output_dir/ --dataset fb15k' OR
    2. 'from marius.tools import preprocess' in Python

    Environment

    gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), Python 3.8.5

    bug 
    opened by VeritasYin 7
  • CUDA error: device-side assert triggered when trying to execute example scripts

    Describe the bug

    I successfully installed the program and it passed test/cpp/end_to_end. However, when I tried to execute examples/training/scripts/fb15k_gpu.sh (and some other GPU-enabled configs), it triggered an assertion failure in nll_loss_backward_reduce_cuda_kernel_2d.

    To Reproduce

    Steps to reproduce the behavior:

    1. Execute bash examples/training/scripts/fb15k_gpu.sh.
    2. The marius_preprocess step runs without any problems.
    3. When marius_train reaches the backward pass for the first batch of the first epoch, the following error occurs:
    $ bash examples/training/scripts/fb15k_gpu.sh
    fb15k
    Downloading fb15k.tgz to output_dir/fb15k.tgz
    Extracting
    Extraction completed
    Detected delimiter: ~   ~
    Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
    Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
    Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
    Number of instance per file:[483142, 50000, 59071]
    Number of nodes: 14951
    Number of edges: 592213
    Number of relations: 1345
    Delimiter: ~    ~
    ['/home/nfp/.local/bin/marius_train', 'examples/training/configs/fb15k_gpu.ini']
    [info] [10/28/21 22:12:59.865] Start preprocessing
    [debug] [10/28/21 22:12:59.866] Initializing Model
    [debug] [10/28/21 22:12:59.866] Empty Encoder
    [debug] [10/28/21 22:12:59.866] DistMult Decoder
    [debug] [10/28/21 22:12:59.867] data/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/embeddings/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/relations/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/train/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/evaluation/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/test/ directory already exists
    [debug] [10/28/21 22:12:59.880] Edges: DeviceMemory storage initialized
    [debug] [10/28/21 22:12:59.894] Edges shuffled
    [debug] [10/28/21 22:12:59.894] Edge storage initialized. Train: 483142, Valid: 50000, Test: 59071
    [debug] [10/28/21 22:13:00.004] Node embeddings: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.004] Node embeddings state: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.004] Node embeddings initialized: 14951
    [debug] [10/28/21 22:13:00.014] Relation embeddings: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.014] Relation embeddings state: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.014] Relation embeddings initialized: 1345
    [debug] [10/28/21 22:13:00.014] Getting batches from edge list
    [info] [10/28/21 22:13:00.014] Training set initialized
    [debug] [10/28/21 22:13:00.014] Getting batches from edge list
    [debug] [10/28/21 22:13:00.014] Batches initialized
    [info] [10/28/21 22:13:00.015] Evaluation set initialized
    [info] [10/28/21 22:13:00.015] Preprocessing Complete: 0.149s
    [debug] [10/28/21 22:13:00.032] Loaded training set
    [info] [10/28/21 22:13:00.032] ################ Starting training epoch 1 ################
    [trace] [10/28/21 22:13:00.032] Starting Batch. ID 0, Starting Index 0, Batch Size 10000 
    [trace] [10/28/21 22:13:00.034] Batch: 0 Accumulated 11109 unique embeddings
    [trace] [10/28/21 22:13:00.034] Batch: 0 Accumulated 640 unique relations
    [trace] [10/28/21 22:13:00.034] Batch: 0 Indices sent to device
    [trace] [10/28/21 22:13:00.034] Batch: 0 Node Embeddings read
    [trace] [10/28/21 22:13:00.034] Batch: 0 Node State read
    [trace] [10/28/21 22:13:00.034] Batch: 0 Relation Embeddings read
    [trace] [10/28/21 22:13:00.034] Batch: 0 Relation State read
    [trace] [10/28/21 22:13:00.035] Batch: 0 prepared for compute
    [debug] [10/28/21 22:13:00.040] Loss: 124804.266, Regularization loss: 0.012812799
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [17,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [22,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [23,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
    Traceback (most recent call last):
      File "/home/nfp/.local/bin/marius_train", line 8, in <module>
        sys.exit(main())
      File "/home/nfp/.local/lib/python3.6/site-packages/marius/console_scripts/marius_train.py", line 8, in main
        m.marius_train(len(sys.argv), sys.argv)
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Exception raised from launch_unrolled_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:132 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f95645bcd62 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
    frame #1: void at::native::gpu_kernel_impl<at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > const&) + 0xb37 (0x7f95665b2f27 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #2: void at::native::gpu_kernel<at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > const&) + 0x113 (0x7f95665bf333 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #3: void at::native::opmath_gpu_kernel_with_scalars<float, float, float, at::native::AddFunctor<float> >(at::TensorIteratorBase&, at::native::AddFunctor<float> const&) + 0xa9 (0x7f95665bf4c9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #4: <unknown function> + 0xe5d953 (0x7f9566592953 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #5: at::native::add_kernel_cuda(at::TensorIteratorBase&, c10::Scalar const&) + 0x15 (0x7f95665930a5 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #6: <unknown function> + 0xe5e0cf (0x7f95665930cf in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #7: at::native::structured_sub_out::impl(at::Tensor const&, at::Tensor const&, c10::Scalar const&, at::Tensor const&) + 0x40 (0x7f95a9f1ef00 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #8: <unknown function> + 0x25e52ab (0x7f9567d1a2ab in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #9: <unknown function> + 0x25e5372 (0x7f9567d1a372 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #10: at::_ops::sub_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0xb9 (0x7f95aa55d3f9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #11: <unknown function> + 0x34be046 (0x7f95ac03c046 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #12: <unknown function> + 0x34be655 (0x7f95ac03c655 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #13: at::_ops::sub_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x13f (0x7f95aa5b5b2f in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #14: <unknown function> + 0x3f299b0 (0x7f95acaa79b0 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #15: torch::autograd::generated::LogsumexpBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1dc (0x7f95abd1447c in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #16: <unknown function> + 0x3896817 (0x7f95ac414817 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x145b (0x7f95ac40fa7b in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #18: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x57a (0x7f95ac4107aa in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #19: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f95ac4081c9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #20: <unknown function> + 0xc71f (0x7f962b3ad71f in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
    frame #21: <unknown function> + 0x76db (0x7f962d01f6db in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #22: clone + 0x3f (0x7f962d35871f in /lib/x86_64-linux-gnu/libc.so.6)
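
The assertion repeated in the log above, `t >= 0 && t < n_classes`, states that every class target index must lie in the range from 0 up to, but not including, the number of classes. The sketch below expresses that same condition as a plain Python check that lists the offending targets. It illustrates the condition only and does not diagnose why it fails in this run. As the error message itself suggests, setting CUDA_LAUNCH_BLOCKING=1 can help locate the failing call.

```python
def out_of_range_targets(targets, n_classes):
    """Return (position, value) pairs for targets outside [0, n_classes)."""
    return [(i, t) for i, t in enumerate(targets) if not 0 <= t < n_classes]


if __name__ == "__main__":
    # Made-up targets for a 3-class problem; 3 and -1 violate the condition.
    print(out_of_range_targets([0, 3, -1, 2], n_classes=3))
```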
    

    Expected behavior

    The program works well for CPU configs:

    $ bash examples/training/scripts/fb15k_cpu.sh
    fb15k
    Downloading fb15k.tgz to output_dir/fb15k.tgz
    Extracting
    Extraction completed
    Detected delimiter: ~   ~
    Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
    Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
    Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
    Number of instance per file:[483142, 50000, 59071]
    Number of nodes: 14951
    Number of edges: 592213
    Number of relations: 1345
    Delimiter: ~    ~
    ['/home/nfp/.local/bin/marius_train', 'examples/training/configs/fb15k_cpu.ini']
    [info] [10/28/21 22:19:07.259] Start preprocessing
    [info] [10/28/21 22:19:08.397] Training set initialized
    [info] [10/28/21 22:19:08.397] Evaluation set initialized
    [info] [10/28/21 22:19:08.397] Preprocessing Complete: 1.137s
    [info] [10/28/21 22:19:08.410] ################ Starting training epoch 1 ################
    [info] [10/28/21 22:19:08.904] Total Edges Processed: 50000, Percent Complete: 0.099
    [info] [10/28/21 22:19:09.252] Total Edges Processed: 95000, Percent Complete: 0.198
    [info] [10/28/21 22:19:09.700] Total Edges Processed: 152000, Percent Complete: 0.298
    [info] [10/28/21 22:19:09.998] Total Edges Processed: 190000, Percent Complete: 0.397
    [info] [10/28/21 22:19:10.418] Total Edges Processed: 237000, Percent Complete: 0.496
    [info] [10/28/21 22:19:10.809] Total Edges Processed: 286000, Percent Complete: 0.595
    [info] [10/28/21 22:19:11.211] Total Edges Processed: 336000, Percent Complete: 0.694
    [info] [10/28/21 22:19:11.567] Total Edges Processed: 383000, Percent Complete: 0.793
    [info] [10/28/21 22:19:11.958] Total Edges Processed: 432000, Percent Complete: 0.893
    [info] [10/28/21 22:19:12.320] Total Edges Processed: 478000, Percent Complete: 0.992
    [info] [10/28/21 22:19:12.357] ################ Finished training epoch 1 ################
    [info] [10/28/21 22:19:12.357] Epoch Runtime (Before shuffle/sync): 3946ms
    [info] [10/28/21 22:19:12.357] Edges per Second (Before shuffle/sync): 122438.414
    [info] [10/28/21 22:19:12.358] Pipeline flush complete
    [info] [10/28/21 22:19:12.374] Edges Shuffled
    [info] [10/28/21 22:19:12.374] Epoch Runtime (Including shuffle/sync): 3963ms
    [info] [10/28/21 22:19:12.374] Edges per Second (Including shuffle/sync): 121913.195
    [info] [10/28/21 22:19:12.389] Starting evaluating
    [info] [10/28/21 22:19:12.709] Pipeline flush complete
    [info] [10/28/21 22:19:15.909] Num Eval Edges: 50000
    [info] [10/28/21 22:19:15.909] Num Eval Batches: 50
    [info] [10/28/21 22:19:15.909] Auc: 0.941, Avg Ranks: 40.139, MRR: 0.336, Hits@1: 0.212, Hits@5: 0.476, Hits@10: 0.600, Hits@20: 0.707, Hits@50: 0.827, Hits@100: 0.895
    [info] [10/28/21 22:19:15.920] Evaluation complete: 3531ms
    [info] [10/28/21 22:19:15.931] ################ Starting training epoch 2 ################
    [info] [10/28/21 22:19:16.361] Total Edges Processed: 46000, Percent Complete: 0.099
    [info] [10/28/21 22:19:16.900] Total Edges Processed: 97000, Percent Complete: 0.198
    [info] [10/28/21 22:19:17.424] Total Edges Processed: 156000, Percent Complete: 0.298
    [info] [10/28/21 22:19:17.697] Total Edges Processed: 189000, Percent Complete: 0.397
    [info] [10/28/21 22:19:18.078] Total Edges Processed: 238000, Percent Complete: 0.496
    [info] [10/28/21 22:19:18.466] Total Edges Processed: 288000, Percent Complete: 0.595
    [info] [10/28/21 22:19:18.825] Total Edges Processed: 336000, Percent Complete: 0.694
    [info] [10/28/21 22:19:19.160] Total Edges Processed: 381000, Percent Complete: 0.793
    [info] [10/28/21 22:19:19.584] Total Edges Processed: 436000, Percent Complete: 0.893
    [info] [10/28/21 22:19:19.909] Total Edges Processed: 481000, Percent Complete: 0.992
    [info] [10/28/21 22:19:19.928] ################ Finished training epoch 2 ################
    [info] [10/28/21 22:19:19.928] Epoch Runtime (Before shuffle/sync): 3997ms
    [info] [10/28/21 22:19:19.928] Edges per Second (Before shuffle/sync): 120876.16
    [info] [10/28/21 22:19:19.929] Pipeline flush complete
    [info] [10/28/21 22:19:19.947] Edges Shuffled
    [info] [10/28/21 22:19:19.948] Epoch Runtime (Including shuffle/sync): 4016ms
    [info] [10/28/21 22:19:19.948] Edges per Second (Including shuffle/sync): 120304.29
    [info] [10/28/21 22:19:19.961] Starting evaluating
    [info] [10/28/21 22:19:20.246] Pipeline flush complete
    [info] [10/28/21 22:19:20.255] Num Eval Edges: 50000
    [info] [10/28/21 22:19:20.255] Num Eval Batches: 50
    [info] [10/28/21 22:19:20.255] Auc: 0.972, Avg Ranks: 21.458, MRR: 0.431, Hits@1: 0.294, Hits@5: 0.595, Hits@10: 0.719, Hits@20: 0.812, Hits@50: 0.906, Hits@100: 0.949
    [info] [10/28/21 22:19:20.271] Evaluation complete: 309ms
    [info] [10/28/21 22:19:20.282] ################ Starting training epoch 3 ################
    [info] [10/28/21 22:19:20.694] Total Edges Processed: 47000, Percent Complete: 0.099
    [info] [10/28/21 22:19:21.042] Total Edges Processed: 95000, Percent Complete: 0.198
    [info] [10/28/21 22:19:21.425] Total Edges Processed: 143000, Percent Complete: 0.298
    [info] [10/28/21 22:19:21.872] Total Edges Processed: 203000, Percent Complete: 0.397
    ^C[info] [10/28/21 22:19:22.195] Total Edges Processed: 244000, Percent Complete: 0.496
    [info] [10/28/21 22:19:22.561] Total Edges Processed: 288000, Percent Complete: 0.595
    [info] [10/28/21 22:19:22.971] Total Edges Processed: 342000, Percent Complete: 0.694
    [info] [10/28/21 22:19:23.266] Total Edges Processed: 380000, Percent Complete: 0.793
    [info] [10/28/21 22:19:23.747] Total Edges Processed: 438000, Percent Complete: 0.893
    [info] [10/28/21 22:19:24.101] Total Edges Processed: 479142, Percent Complete: 0.992
    ...
    

    Environment: I tried on 2 machines and got the same error. Platform: Linux (Ubuntu 18.04 LTS). Python version: 3.6.9. PyTorch version: 1.10.0+cu102; 1.10.0+cu113.

    bug 
    opened by IronySuzumiya 5
  • README example not working

    Describe the bug

    Traceback (most recent call last):
      File "/Users/cthoyt/dev/marius/test.py", line 20, in <module>
        fb15k_example()
      File "/Users/cthoyt/dev/marius/test.py", line 8, in fb15k_example
        train_set, eval_set = m.initializeDatasets(config)
    RuntimeError: filesystem error: in copy_file: No such file or directory [training_data/marius/edges/train/edges.bin] [output_dir/train_edges.pt]
    

    To Reproduce

    I took the example from the README verbatim, apart from fixing the config path.

    import marius as m
    
    def fb15k_example():
        config_path = "/Users/cthoyt/dev/marius/examples/training/configs/kinships_cpu.ini"
        config = m.parseConfig(config_path)
    
        train_set, eval_set = m.initializeDatasets(config)
    
        model = m.initializeModel(config.model.encoder_model, config.model.decoder_model)
    
        trainer = m.SynchronousTrainer(train_set, model)
        evaluator = m.SynchronousEvaluator(eval_set, model)
    
        trainer.train(1)
        evaluator.evaluate(True)
    
    
    if __name__ == "__main__":
        fb15k_example()
    
    

    Expected behavior A clear and concise description of what you expected to happen.

    Environment: macOS 11.2.3 Big Sur, Python 3.9.2, pip-installed from the latest Marius source.
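The traceback above reports a failed file copy but does not say whether the source file or the destination directory is missing. A minimal sketch of a pre-flight check follows; the helper name `copy_preconditions` is hypothetical and is not part of Marius, and the two paths are copied from the traceback.

```python
from pathlib import Path
from typing import List


def copy_preconditions(src: str, dst: str) -> List[str]:
    """Return human-readable problems that would make copying src to dst fail."""
    problems = []
    if not Path(src).is_file():
        problems.append(f"source file not found: {src}")
    if not Path(dst).parent.is_dir():
        problems.append(f"destination directory not found: {Path(dst).parent}")
    return problems


if __name__ == "__main__":
    # Both paths are taken from the RuntimeError message in the traceback.
    for problem in copy_preconditions(
        "training_data/marius/edges/train/edges.bin",
        "output_dir/train_edges.pt",
    ):
        print(problem)
```

Running this from the same working directory as the failing script would show which side of the copy is absent, for example because preprocessing has not been run yet or wrote to a different directory.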

    bug 
    opened by cthoyt 5
  • [Question] Large GPU Memory Usage & Early Exit of MariusGNN-Eurosys23

    Hi, thank you for this excellent work!

    I am trying to reproduce some of the results with a 2080Ti (11GB) but seem to be running into a GPU memory problem. Specifically, when I ran python3 experiment_manager/run_experiment.py --experiment papers100m with the default config of papers100M, training exits abnormally fast with no error:

    ==== ogbn_papers100m already preprocessed =====
    =========================================
    Running: marius 
    Configuration: experiment_manager/system_comparisons/configs/ogbn_papers100m/marius_gs.yaml
    Saving results to: results/ogbn_papers100m/marius_gs
    [2022-12-22 16:26:25.906] [info] [marius.cpp:29] Start initialization
    [2022-12-22 16:31:01.955] [info] [marius.cpp:66] Initialization Complete: 276.048s
    [2022-12-22 16:32:21.671] [info] [trainer.cpp:41] ################ Starting training epoch 1 ################
    Complete. Total runtime: 366.0947s
    

    But after I modified the config to use a smaller hidden dimension (16 instead of 256) and a smaller training batch size (600 instead of 1000), the system ran normally:

    ==== ogbn_papers100m already preprocessed =====
    =========================================
    Running: marius 
    Configuration: experiment_manager/system_comparisons/configs/ogbn_papers100m/marius_gs.yaml
    Saving results to: results/ogbn_papers100m/marius_gs
    Overwriting previous experiment.
    [2022-12-22 16:22:29.642] [info] [marius.cpp:29] Start initialization
    [2022-12-22 16:27:13.260] [info] [marius.cpp:66] Initialization Complete: 283.617s
    [2022-12-22 16:28:12.311] [info] [trainer.cpp:41] ################ Starting training epoch 1 ################
    [2022-12-22 16:28:23.558] [info] [reporting.cpp:167] Nodes processed: [121200/1207179], 10.039936%
    [2022-12-22 16:28:34.565] [info] [reporting.cpp:167] Nodes processed: [242400/1207179], 20.079872%
    [2022-12-22 16:28:43.379] [info] [reporting.cpp:167] Nodes processed: [363600/1207179], 30.119808%
    [2022-12-22 16:28:51.657] [info] [reporting.cpp:167] Nodes processed: [484800/1207179], 40.159744%
    [2022-12-22 16:28:58.793] [info] [reporting.cpp:167] Nodes processed: [606000/1207179], 50.199680%
    ....
    

    So I suspect that the abnormally early Complete message actually indicates a GPU OOM here?

    If so, MariusGNN seems to use significantly more GPU memory than DGL, since with DGL I can easily scale the batch size to over 8000 with the same fanouts, hidden dimension, and GPU. Is this observation consistent with MariusGNN's internal design?

    I would be very grateful if you could help explain. Thank you!

    question 
    opened by CSLabor 4
  • Change argument variable output_directory to data_directory

    What is the documentation lacking? Please describe. The variable output_directory in the input arguments is overloaded. The directory is used to include both the input and output data for a data set. Imprecise naming leads to wrong use of the system.

    Describe the improvement you'd like Rename the variable to data_directory instead of output_directory

    bug documentation enhancement 
    opened by thodrek 4
  • Could I run C++ code for Marius?

    Hi, I built the executables (marius_train and marius_eval) using CMakeLists.txt. However, when I run the executable as in the example on GitHub, an error occurs.

    $ ./marius_train examples/configuration/fb15k_237.yaml

    Result: Aborted (core dumped)

    Are the executables built through CMake not working at the moment? Or should the input be provided differently from when running Marius through Python?

    Thanks

    question 
    opened by qhtjrmin 3
  • Marius++ code/example request

    What is the documentation lacking? Please describe. A code example accompanying the Marius++ paper

    Describe the improvement you'd like A code example accompanying the Marius++ paper

    Additional context Thank you for releasing this amazing repo! Have you released the code/examples to accompany the Marius++ paper? It would be great to be able to run the Marius++ code to better understand the system. Thank you.

    documentation 
    opened by 99snowleopards 3
  • Error during parquet embedding export

    Describe the bug When trying to export my generated embeddings using the command marius_postprocess, I receive an error that kills the export process.

    The exact command I am using is:

    marius_postprocess --model_dir /mount_ws/02_distmult --format parquet --output_dir /mount_ws/parquet_export/02_distmult_parquet

    Which gives the following error after a while:

    Traceback (most recent call last):
      File "/usr/local/bin/marius_postprocess", line 11, in <module>
        load_entry_point('marius==0.0.2', 'console_scripts', 'marius_postprocess')()
      File "/usr/local/lib/python3.6/dist-packages/marius/tools/marius_postprocess.py", line 61, in main
        exporter.export(output_dir)
      File "/usr/local/lib/python3.6/dist-packages/marius/tools/postprocess/in_memory_exporter.py", line 176, in export
        self.export_node_embeddings(output_dir)
      File "/usr/local/lib/python3.6/dist-packages/marius/tools/postprocess/in_memory_exporter.py", line 83, in export_node_embeddings
        self.overwrite,
      File "/usr/local/lib/python3.6/dist-packages/marius/tools/postprocess/in_memory_exporter.py", line 37, in save_df
        output_df.to_parquet(output_path)
      File "/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py", line 199, in wrapper
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2372, in to_parquet
        **kwargs,
      File "/usr/local/lib/python3.6/dist-packages/pandas/io/parquet.py", line 276, in to_parquet
        **kwargs,
      File "/usr/local/lib/python3.6/dist-packages/pandas/io/parquet.py", line 199, in write
        **kwargs,
      File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 951, in write
        partition_cols=partition_on)
      File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 750, in make_metadata
        object_encoding=oencoding, times=times)
      File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 116, in find_type
        object_encoding = infer_object_encoding(data)
      File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 322, in infer_object_encoding
        for i in head if i):
      File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 322, in <genexpr>
        for i in head if i):
    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    

    How can I interpret this error? Running with the .csv file format works, but it seems to produce very large files, and I presume .parquet is more efficient.
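The last frames of the traceback show fastparquet evaluating `if i` on each value of an object column. When each value is a multi-element array (as an embedding vector would be), that truth test is ambiguous, which matches the ValueError text. One possible workaround, offered as an assumption and not as the maintainers' fix, is to expand each vector into one scalar column per dimension before building the DataFrame. The helper name `expand_embeddings` and the column names `id` and `emb_<n>` are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def expand_embeddings(
    rows: Iterable[Tuple[int, Sequence[float]]], prefix: str = "emb_"
) -> List[Dict[str, object]]:
    """Turn (node_id, vector) pairs into flat records with one scalar column per dimension."""
    records = []
    for node_id, vector in rows:
        record: Dict[str, object] = {"id": node_id}
        for dim, value in enumerate(vector):
            record[f"{prefix}{dim}"] = float(value)
        records.append(record)
    return records


if __name__ == "__main__":
    # Two toy 3-dimensional embeddings; real vectors would come from the trained model.
    for record in expand_embeddings([(0, [0.1, 0.2, 0.3]), (1, [0.4, 0.5, 0.6])]):
        print(record)
```

A list of such records contains only scalar values, so a DataFrame built from it has no array-valued object column for the writer to inspect.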

    The packages installed are:

    antlr4-python3-runtime (4.9.3)
    asn1crypto (0.24.0)
    cramjam (2.3.2)
    cryptography (2.1.4)
    dataclasses (0.8)
    fastparquet (0.7.2)
    fsspec (2022.1.0)
    GPUtil (1.4.0)
    idna (2.6)
    importlib-metadata (4.8.3)
    keyring (10.6.0)
    keyrings.alt (3.0)
    marius (0.0.2)
    numpy (1.19.5)
    omegaconf (2.2.3)
    pandas (1.1.5)
    pip (9.0.1)
    psutil (5.9.2)
    py4j (0.10.9.5)
    pycrypto (2.6.1)
    pygobject (3.26.1)
    pyspark (3.2.2)
    python-apt (1.6.5+ubuntu0.7)
    python-dateutil (2.8.2)
    pytz (2022.4)
    pyxdg (0.25)
    PyYAML (6.0)
    SecretStorage (2.3.1)
    setuptools (39.0.1)
    six (1.11.0)
    thrift (0.16.0)
    torch (1.9.1+cu111)
    typing-extensions (4.1.1)
    unattended-upgrades (0.1)
    wheel (0.30.0)
    zipp (3.6.0)
    

    Environment marius_env_info output:

    cmake:
      version: 3.20.0
    cpu_info:
      num_cpus: 96
      total_memory: 377GB
    cuda:
      version: '11.1'
    gpu_info:
      - memory: 40GB
        name: NVIDIA A100-PCIE-40GB
    marius:
      bindings_installed: true
      install_path: /usr/local/lib/python3.6/dist-packages/marius
      version: 0.0.2
    openmp:
      version: '201511'
    operating_system:
      platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-Ubuntu-18.04-bionic
    pybind:
      PYBIND11_BUILD_ABI: _cxxabi1011
      PYBIND11_COMPILER_TYPE: _gcc
      PYBIND11_STDLIB: _libstdcpp
    python:
      compiler: GCC 8.4.0
      deps:
        numpy_version: 1.19.5
        omegaconf_version: 2.2.3
        pandas_version: 1.1.5
        pip_version: 9.0.1
        pyspark_version: 3.2.2
        pytest_version: 7.0.1
        torch_version: 1.9.1+cu111
        tox_version: 3.28.0
      version: 3.6.9
    pytorch:
      install_path: /usr/local/lib/python3.6/dist-packages/torch
      version: 1.9.1+cu111
    

    I'd be glad for any help - Thank you!

    bug 
    opened by johankit 1
  • Marius Script Compiler

    • Add marius_mpic tool to compile marius script
    • Add test cases to handle error scenarios
    • Generate code in mpic_gen directory
    • Add github workflow to test

    See examples/mpic for example usage

    opened by pao214 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher Kasimir Schulz.
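The check described in this pull request (verify that every member stays inside the target directory, otherwise raise) can be sketched as follows. This is an illustrative version under stated assumptions, not the actual patch: the helper names `is_within_directory` and `safe_extract` are hypothetical, and the sketch only validates member names, so it does not cover every symlink-based trick.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if target resolves to a path inside directory."""
    abs_directory = os.path.realpath(directory)
    abs_target = os.path.realpath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
    """Extract the archive only if every member would land inside path."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in tar member: {member.name}")
    # All members were validated above, so extraction stays inside path.
    tar.extractall(path)
```

A caller would replace `tar.extractall(dest)` with `safe_extract(tar, dest)`. On Python 3.12 and later, `extractall` also accepts a `filter` argument (for example `filter="data"`) that rejects such members in the standard library itself.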

    opened by TrellixVulnTeam 0
  • Pyspark preprocessor outputs to s3

    The preprocessor now writes processed edge and node data to S3, but the data is split into many files, which need to be combined.

    The following call fails when the files are small:

    s3_obj.merge(output_filename, files_list)
    

    It throws the error EntityTooSmall.

    Once we have a single file, we can look into converting it to binary. Alternatively, we can define a custom writer that outputs in binary format without the intermediate CSV files.
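S3 multipart uploads require every part except the last to be at least 5 MiB, and S3 reports EntityTooSmall otherwise, which is consistent with the merge failing on small files. A minimal sketch of one way to plan around that limit follows: group consecutive files so that each group reaches the minimum, then combine each group client-side into a single part. The function name `group_into_parts` is hypothetical and the client-side combining step is an assumption, not something the preprocessor is known to do.

```python
from typing import List, Sequence, Tuple

# S3's minimum size for every multipart part except the last one.
MIN_PART_BYTES = 5 * 1024 * 1024


def group_into_parts(
    files: Sequence[Tuple[str, int]], min_part_bytes: int = MIN_PART_BYTES
) -> List[List[str]]:
    """Group consecutive (name, size) files so each group except the last reaches the minimum."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= min_part_bytes:
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)  # the final part is allowed to be smaller
    return groups
```

File order is preserved, so concatenating the groups in sequence yields the same byte stream as concatenating the original files.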

    opened by basavaraj29 0
  • Spark preprocessor optimization

    • removing id assignment for edges
    • using zipWithIndex instead of repartition(1) and windowing
    • partitionBy([src_bucket, dst_bucket])

    todo:

    • custom binary writer to eliminate intermediate csv
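The id-assignment and bucketing steps listed above can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the preprocessor's code: it gives each node a contiguous integer id (the role zipWithIndex plays on an RDD) and groups edges by a (src_bucket, dst_bucket) key. The function name `bucket_edges`, the `num_partitions` parameter, and the rule `bucket = id // partition_size` are assumptions made for illustration.

```python
from collections import defaultdict
from math import ceil
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]
Bucket = Tuple[int, int]


def bucket_edges(edges: List[Edge], num_partitions: int) -> Dict[Bucket, List[Tuple[int, int]]]:
    """Map nodes to contiguous ids and group edges by (src_bucket, dst_bucket)."""
    # Assign ids in first-seen order; no separate id is assigned to edges.
    node_ids: Dict[Hashable, int] = {}
    for src, dst in edges:
        for node in (src, dst):
            node_ids.setdefault(node, len(node_ids))

    # Assumed rule: equally sized, contiguous id ranges per node partition.
    partition_size = ceil(len(node_ids) / num_partitions)
    buckets: Dict[Bucket, List[Tuple[int, int]]] = defaultdict(list)
    for src, dst in edges:
        s, d = node_ids[src], node_ids[dst]
        buckets[(s // partition_size, d // partition_size)].append((s, d))
    return dict(buckets)
```

With four nodes and `num_partitions=2`, nodes 0 and 1 fall in partition 0 and nodes 2 and 3 in partition 1, so an edge from node 1 to node 2 lands in bucket (0, 1).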
    opened by basavaraj29 3
  • About mini-batch training and edge bucket

    Is mini-batch training performed within each edge bucket? In your paper, does each bucket perform 4 (bound) mini-batch training steps? Is my understanding correct? Thanks a lot!

    question 
    opened by YijianLiu 1
Releases (v0.0.1)
  • v0.0.1 (Sep 26, 2022)

    This release contains the initial artifact for the paper MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks to be published at EuroSys 2023. The artifact contains the necessary code to reproduce experiments reported in the paper.

Owner
Marius
Graph Learning at Scale