End-to-End Referring Video Object Segmentation with Multimodal Transformers


This repo contains the official implementation of the paper:


End-to-End Referring Video Object Segmentation with Multimodal Transformers


How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files) 

JHMDB-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up for the competition in order to get access to the dataset. The registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
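Before moving on, it can help to confirm that the datasets landed where the trees above say they should. The following is a minimal sketch and not part of the MTTR codebase: `EXPECTED_LAYOUT` and `find_missing_paths` are hypothetical names, and the listed entries are only the fixed (non-wildcard) paths copied from the trees above.

```python
from pathlib import Path

# Fixed (non-wildcard) paths copied from the directory trees above.
# The mapping name and its structure are illustrative, not part of MTTR.
EXPECTED_LAYOUT = {
    "a2d_sentences": [
        "Release/videoset.csv",
        "Release/CLIPS320",
        "text_annotations/a2d_annotation.txt",
        "text_annotations/a2d_missed_videos.txt",
        "text_annotations/a2d_annotation_with_instances",
    ],
    "jhmdb_sentences": [
        "Rename_Images",
        "puppet_mask",
        "jhmdb_annotation.txt",
    ],
    "refer_youtube_vos": [
        "train/JPEGImages",
        "train/Annotations",
        "valid/JPEGImages",
        "meta_expressions/train/meta_expressions.json",
        "meta_expressions/valid/meta_expressions.json",
    ],
}


def find_missing_paths(repo_root, dataset):
    """Return the expected paths of `dataset` that do not exist under `repo_root`."""
    dataset_root = Path(repo_root) / dataset
    return [
        str(dataset_root / rel)
        for rel in EXPECTED_LAYOUT[dataset]
        if not (dataset_root / rel).exists()
    ]


if __name__ == "__main__":
    # Assumes the script is run from inside the cloned MTTR directory.
    for name in EXPECTED_LAYOUT:
        missing = find_missing_paths(".", name)
        status = "OK" if not missing else f"missing {len(missing)} path(s)"
        print(f"{name}: {status}")
        for path in missing:
            print(f"  - {path}")
```

The check only looks at directory and file names; it does not validate the contents of the videos, frames, or annotation files.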

Environment Installation

The code was tested in a Conda environment on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

  • PyTorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that the command above does not pin a cudatoolkit version; you might have to add one that matches your system's CUDA version.

  • Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

  • COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

  • Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy
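Once the environment is set up, a quick sanity check of the pinned versions can save debugging time later. The sketch below is an assumption-laden helper rather than anything shipped with MTTR: `PINNED_VERSIONS` and `find_version_mismatches` are hypothetical names, and it relies on the installed packages exposing standard metadata under the distribution names `torch`, `torchvision` and `transformers`.

```python
from importlib import metadata

# Versions pinned in the installation steps above.
# Keys are distribution names ("torch" is the distribution name of PyTorch).
PINNED_VERSIONS = {
    "torch": "1.10.0",
    "torchvision": "0.11.1",
    "transformers": "4.11.3",
}


def find_version_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that deviates.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "1.10.0+cu113" still count as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_version_mismatches(PINNED_VERSIONS)
    if not problems:
        print("All pinned packages match.")
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found}")
```

The `get_version` parameter exists only so the lookup can be swapped out (for example in a test); by default the real installed metadata is queried.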

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that to run the code, the path of the relevant .yaml config file must be supplied using the -c parameter.

| Command | Description |
|---------|-------------|
| -c | path to dataset configuration file |
| -rm | running mode (train/eval) |
| -ws | window size |
| -bs | training batch size per GPU |
| -ebs | eval batch size per GPU (if not provided, training batch size is used) |
| -ng | number of GPUs to run on |
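To make the semantics of these flags concrete, here is a minimal argparse sketch that accepts the same short options. It is not the parser used in main.py: the `dest` names, the absence of defaults, and the `parse_args` wrapper are all assumptions made for illustration. The only behaviour taken from the table is that -c must be supplied and that the eval batch size falls back to the training batch size.

```python
import argparse


def build_parser():
    """Build a parser exposing the same short flags as the table above."""
    parser = argparse.ArgumentParser(description="Illustrative MTTR-style CLI")
    parser.add_argument("-c", dest="config_path", required=True,
                        help="path to dataset configuration file")
    parser.add_argument("-rm", dest="running_mode", choices=["train", "eval"],
                        help="running mode (train/eval)")
    parser.add_argument("-ws", dest="window_size", type=int,
                        help="window size")
    parser.add_argument("-bs", dest="batch_size", type=int,
                        help="training batch size per GPU")
    parser.add_argument("-ebs", dest="eval_batch_size", type=int,
                        help="eval batch size per GPU")
    parser.add_argument("-ng", dest="num_gpus", type=int,
                        help="number of GPUs to run on")
    return parser


def parse_args(argv=None):
    """Parse `argv` and apply the documented -ebs fallback."""
    args = build_parser().parse_args(argv)
    # If -ebs is omitted, the training batch size is used for evaluation too.
    if args.eval_batch_size is None:
        args.eval_batch_size = args.batch_size
    return args


if __name__ == "__main__":
    example = parse_args(["-c", "configs/a2d_sentences.yaml", "-rm", "eval",
                          "-ws", "10", "-bs", "3", "-ng", "2"])
    print(example)
```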

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them on GPUs with less memory by decreasing the batch size (-bs parameter).

A2D-Sentences

| Window Size | Command | Checkpoint File | mAP Result |
|-------------|---------|-----------------|------------|
| 10 | python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 | Link | 46.1 |
| 8 | python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 | Link | 44.7 |
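When evaluating several checkpoints in a row, it can be convenient to assemble the commands above programmatically. The sketch below is only one possible way to do this: `build_eval_command` and `run_eval` are hypothetical helpers, and CHECKPOINT_PATH remains a placeholder to be replaced with the location of a downloaded checkpoint.

```python
import subprocess
import sys


def build_eval_command(config, window_size, batch_size, checkpoint, num_gpus):
    """Assemble the evaluation command shown above as an argument list."""
    return [
        sys.executable, "main.py",
        "-rm", "eval",
        "-c", config,
        "-ws", str(window_size),
        "-bs", str(batch_size),
        "-ckpt", checkpoint,
        "-ng", str(num_gpus),
    ]


def run_eval(command, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        # Must be run from the root of the cloned MTTR repo.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for window_size in (10, 8):
        run_eval(build_eval_command(
            "configs/a2d_sentences.yaml", window_size, 3, "CHECKPOINT_PATH", 2))
```

Passing an argument list to `subprocess.run` avoids shell quoting issues with paths that contain spaces; the default `dry_run=True` only prints the command.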

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines, we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is loaded automatically during evaluation with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.
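The idea of sampling once and reusing the stored result can be sketched as follows. This is not the repo's implementation: the function names, the JSON schema (video id mapped to a list of frame indices), and the choice of drawing distinct indices uniformly at random are all assumptions made for illustration, and the format of the real jhmdb_sentences_samples_metadata.json may differ.

```python
import json
import random
from pathlib import Path


def sample_frame_indices(num_frames, rng, num_samples=3):
    """Draw `num_samples` distinct frame indices uniformly at random, sorted."""
    return sorted(rng.sample(range(num_frames), num_samples))


def load_or_create_samples(metadata_path, video_lengths, seed, generate_new=False):
    """Reuse stored samples when available; otherwise sample and store them.

    `video_lengths` maps a video id to its number of frames.
    """
    metadata_path = Path(metadata_path)
    if metadata_path.exists() and not generate_new:
        return json.loads(metadata_path.read_text())
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    samples = {
        video_id: sample_frame_indices(length, rng)
        for video_id, length in sorted(video_lengths.items())
    }
    metadata_path.write_text(json.dumps(samples, indent=2))
    return samples
```

In this sketch the stored file takes precedence over the seed, so changing the seed alone has no effect until `generate_new` is also set. This mirrors why both parameters are mentioned together above.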

Window Size Command Checkpoint File mAP Result
10 python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 39.2
8 python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 36.6

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of the Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip needs to be uploaded to the competition server for evaluation. For your convenience, we supply this zip file here as well.

| Window Size | Command | Checkpoint File | Output Zip | J&F Result |
|-------------|---------|-----------------|------------|------------|
| 12 | python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8 | Link | Link | 55.32 |
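Before uploading, it may be worth taking a quick look at what the generated archive contains. The helper below is a small sketch using only the standard library; `summarize_submission` is a hypothetical name, and the function makes no assumption about the internal folder layout of the zip, which is not described here.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_submission(zip_path):
    """Count the files in a zip archive, grouped by file extension."""
    with zipfile.ZipFile(zip_path) as archive:
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    return Counter(PurePosixPath(name).suffix.lower() for name in names)
```

For example, calling `summarize_submission` on the path of the generated submission_epoch_0.zip returns a counter keyed by extension, which makes an empty or truncated archive easy to spot.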

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in the Video Swin Transformer's original repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.
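The placement step can also be scripted. The following sketch copies a downloaded file into the location named above; `place_pretrained_weights` is a hypothetical helper, and the location of the downloaded file on disk is for the caller to supply.

```python
import shutil
from pathlib import Path

# Target location, relative to the root of the cloned repo, as stated above.
WEIGHTS_RELATIVE_PATH = (
    Path("pretrained_swin_transformer")
    / "swin_tiny_patch244_window877_kinetics400_1k.pth"
)


def place_pretrained_weights(downloaded_file, repo_root):
    """Copy the downloaded weights into the expected location and return it."""
    target = Path(repo_root) / WEIGHTS_RELATIVE_PATH
    # Create pretrained_swin_transformer/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(downloaded_file, target)
    return target
```

The helper copies rather than moves the file, so the original download stays in place in case the copy needs to be repeated.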

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.

A2D-Sentences

  • The command for the following configuration was tested on 2 A6000 48GB GPUs:

| Window Size | Command |
|-------------|---------|
| 10 | python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2 |

  • The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:

| Window Size | Command |
|-------------|---------|
| 8 | python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3 |

Refer-YouTube-VOS

  • The command for the following configuration was tested on 4 A6000 48GB GPUs:

| Window Size | Command |
|-------------|---------|
| 12 | python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4 |

  • The command for the following configuration was tested on 8 RTX 3090 24GB GPUs:

| Window Size | Command |
|-------------|---------|
| 8 | python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8 |

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments).
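When adapting the training commands above to different hardware, one quantity worth keeping an eye on is the total number of windows processed per step. The arithmetic below is a sketch under the assumption that each GPU processes its own batch of -bs windows per step, so that the total is simply the per-GPU batch size times the number of GPUs; the helper name and the tuple layout are illustrative.

```python
def total_batch_size(batch_size_per_gpu, num_gpus):
    """Total windows per step, assuming each GPU handles its own batch."""
    return batch_size_per_gpu * num_gpus


# (dataset, window size, -bs, -ng), copied from the training commands above.
TRAINING_CONFIGS = [
    ("a2d_sentences", 10, 3, 2),
    ("a2d_sentences", 8, 2, 3),
    ("refer_youtube_vos", 12, 1, 4),
    ("refer_youtube_vos", 8, 1, 8),
]

for dataset, window_size, batch_size, num_gpus in TRAINING_CONFIGS:
    total = total_batch_size(batch_size, num_gpus)
    print(f"{dataset} (ws={window_size}): {batch_size} x {num_gpus} = {total}")
```

Under this assumption both A2D-Sentences configurations use a total of 6 windows per step, while the two Refer-YouTube-VOS configurations use 4 and 8 respectively.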

JHMDB-Sentences

As explained in our paper, JHMDB-Sentences is used exclusively for evaluation, so training on this dataset is not supported at this time.
