[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Overview

Grounded Situation Recognition with Transformers

Paper | Model Checkpoint

  • This is the official PyTorch implementation of Grounded Situation Recognition with Transformers (BMVC 2021).
  • GSRTR (Grounded Situation Recognition TRansformer) achieves state of the art in all evaluation metrics on the SWiG benchmark.
  • This repository contains instructions, code and model checkpoint.

Overview

Grounded Situation Recognition (GSR) is the task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by capturing high-level semantic feature of an image effectively, and allows the model to flexibly deal with the complicated and image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark.

model

GSRTR mainly consists of two components: Transformer Encoder for verb prediction, and Transformer Decoder for grounded noun prediction. For details, please see Grounded Situation Recognition with Transformers by Junhyeong Cho, Youngseok Yoon, Hyeonjun Lee and Suha Kwak.

Environment Setup

We provide instructions for environment setup.

# Clone this repository and navigate into the repository
git clone https://github.com/jhcho99/gsrtr.git    
cd gsrtr                                          

# Create a conda environment, activate the environment and install PyTorch via conda
conda create --name gsrtr python=3.9              
conda activate gsrtr                             
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge 

# Install requirements via pip
pip install -r requirements.txt                   

SWiG Dataset

Annotations are given in JSON format, and annotation files are under "SWiG/SWiG_jsons/" directory. Images can be downloaded here. Please download the images and store them in "SWiG/images_512/" directory.

SWiG_Image In the SWiG dataset, each image is associated with Verb, Frame and Groundings.

A) Verb: each image is paired with a verb. In the annotation file, "verb" denotes the salient action for an image.

B) Frame: a frame denotes the set of semantic roles for a verb. For example, the frame for verb "Catching" denotes the set of semantic roles "Agent", "Caught Item", "Tool" and "Place". In the annotation file, "frames" show the set of semantic roles for a verb, and noun annotations for each role. There are three noun annotations for each role, which are given by three different annotators.

C) Groundings: each grounding is described in [x1, y1, x2, y2] format. In the annotation file, "bb" denotes groundings for roles. Note that nouns can be labeled without groundings, e.g., in the case of occluded objects. When there is no grounding for a role, [-1, -1, -1, -1] is given.

# an example of annotation for an image

"catching_175.jpg": {
    "verb": "catching",
    "height": 512, 
    "width": 910,
    "bb": {"tool": [-1, -1, -1, -1], 
           "caughtitem": [444, 169, 671, 317], 
           "place": [-1, -1, -1, -1], 
           "agent": [270, 112, 909, 389]},
    "frames": [{"tool": "n05282433", "caughtitem": "n02190166", "place": "n03991062", "agent": "n00017222"}, 
               {"tool": "n05302499", "caughtitem": "n02190166", "place": "n03990474", "agent": "n00017222"}, 
               {"tool": "n07655505", "caughtitem": "n13152742", "place": "n00017222", "agent": "n02190166"}]
    }

In imsitu_space.json file, there is additional information for verb and noun.

# an example of additional verb information

"catching": {
    "framenet": "Getting", 
    "abstract": "an AGENT catches a CAUGHTITEM with a TOOL at a PLACE", 
    "def": "capture a sought out item", 
    "order": ["agent", "caughtitem", "tool", "place"], 
    "roles": {"tool": {"framenet": "manner", "def": "The object used to do the catch action"}, 
              "caughtitem": {"framenet": "theme", "def": "The entity being caught"}, 
              "place": {"framenet": "place", "def": "The location where the catch event is happening"}, 
              "agent": {"framenet": "recipient", "def": "The entity doing the catch action"}}
    }
# an example of additional noun information

"n00017222": {
    "gloss": ["plant", "flora", "plant life"], 
    "def": "(botany) a living organism lacking the power of locomotion"
    }

Additional Details

  • All images should be under "SWiG/images_512/" directory.
  • train.json file is for train set.
  • dev.json file is for development set.
  • test.json file is for test set.

Training

To train GSRTR on a single node with 4 gpus for 40 epochs, run:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
           --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
           --num_workers 4 --enc_layers 6 --dec_layers 6 --dropout 0.15 --hidden_dim 512 \
           --output_dir gsrtr

To train GSRTR on a Slurm cluster with submitit using 4 TITAN Xp gpus for 40 epochs, run:

python run_with_submitit.py --ngpus 4 --nodes 1 --job_dir gsrtr \
        --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
        --num_workers 4 --enc_layers 6 --dec_layers 6 --dropout 0.15 --hidden_dim 512 \
        --partition titanxp
  • A single epoch takes about 30 minutes. 40 epoch training takes around 20 hours on a single machine with 4 TITAN Xp gpus.
  • We use AdamW optimizer with learning rate 10-4 (10-5 for backbone), weight decay 10-4 and β = (0.9, 0.999).
  • Random Color Jittering, Random Gray Scaling, Random Scaling and Random Horizontal Flipping are used for augmentation.

Inference

To run an inference on a custom image, run:

python inference.py --image_path inference/filename.jpg \
                    --saved_model gsrtr_checkpoint.pth \
                    --output_dir inference
  • Model checkpoint can be downloaded here.

Here is an example of inference result: inference_result

Acknowledgements

Our code is modified and adapted from these amazing repositories:

Contact

Junhyeong Cho ([email protected])

Citation

If you find our work useful for your research, please cite our paper:

@InProceedings{cho2021gsrtr,
    title={Grounded Situation Recognition with Transformers},
    author={Junhyeong Cho and Youngseok Yoon and Hyeonjun Lee and Suha Kwak},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2021}
}

License

GSRTR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Owner
Junhyeong Cho
Student at POSTECH | Studied at Stanford, UIUC and UC Berkeley
Junhyeong Cho
Solving Zero-Shot Learning in Named Entity Recognition with Common Sense Knowledge

Zero-Shot Learning in Named Entity Recognition with Common Sense Knowledge Associated code for the paper Zero-Shot Learning in Named Entity Recognitio

Søren Hougaard Mulvad 13 Dec 25, 2022
Solving reinforcement learning tasks which require language and vision

Multimodal Reinforcement Learning JAX implementations of the following multimodal reinforcement learning approaches. Dual-coding Episodic Memory from

Henry Prior 31 Feb 26, 2022
StarGAN2 for practice

StarGAN2 for practice This version of StarGAN2 (coined as 'Post-modern Style Transfer') is intended mostly for fellow artists, who rarely look at scie

vadim epstein 87 Sep 24, 2022
FACIAL: Synthesizing Dynamic Talking Face With Implicit Attribute Learning. ICCV, 2021.

FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning PyTorch implementation for the paper: FACIAL: Synthesizing Dynamic Talking

226 Jan 08, 2023
[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Fudan Zhang Vision Group 897 Jan 05, 2023
Unsupervised Learning of Video Representations using LSTMs

Unsupervised Learning of Video Representations using LSTMs Code for paper Unsupervised Learning of Video Representations using LSTMs by Nitish Srivast

Elman Mansimov 341 Dec 20, 2022
[NeurIPS 2021] Well-tuned Simple Nets Excel on Tabular Datasets

[NeurIPS 2021] Well-tuned Simple Nets Excel on Tabular Datasets Introduction This repo contains the source code accompanying the paper: Well-tuned Sim

52 Jan 04, 2023
Code for the CIKM 2019 paper "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting".

Dual Self-Attention Network for Multivariate Time Series Forecasting 20.10.26 Update: Due to the difficulty of installation and code maintenance cause

Kyon Huang 223 Dec 16, 2022
To Design and Implement Logistic Regression to Classify Between Benign and Malignant Cancer Types

To Design and Implement Logistic Regression to Classify Between Benign and Malignant Cancer Types, from a Database Taken From Dr. Wolberg reports his Clinic Cases.

Astitva Veer Garg 1 Jul 31, 2022
Differential Privacy for Heterogeneous Federated Learning : Utility & Privacy tradeoffs

Differential Privacy for Heterogeneous Federated Learning : Utility & Privacy tradeoffs In this work, we propose an algorithm DP-SCAFFOLD(-warm), whic

19 Nov 10, 2022
Some experiments with tennis player aging curves using Hilbert space GPs in PyMC. Only experimental for now.

NOTE: This is still being developed! Setup notes This document uses Jeff Sackmann's tennis data. You can obtain it as follows: git clone https://githu

Martin Ingram 1 Jan 20, 2022
Dynamic Bottleneck for Robust Self-Supervised Exploration

Dynamic Bottleneck Introduction This is a TensorFlow based implementation for our paper on "Dynamic Bottleneck for Robust Self-Supervised Exploration"

Bai Chenjia 4 Nov 14, 2022
Vrcwatch - Supply the local time to VRChat as Avatar Parameters through OSC

English: README-EN.md VRCWatch VRCWatch は、VRChat 内のアバター向けに現在時刻を送信するためのプログラムです。 使

Kosaki Mezumona 17 Nov 30, 2022
Official Code for ICML 2021 paper "Revisiting Point Cloud Shape Classification with a Simple and Effective Baseline"

Revisiting Point Cloud Shape Classification with a Simple and Effective Baseline Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, Jia Deng Internati

Princeton Vision & Learning Lab 115 Jan 04, 2023
Semi-Supervised Semantic Segmentation with Cross-Consistency Training (CCT)

Semi-Supervised Semantic Segmentation with Cross-Consistency Training (CCT) Paper, Project Page This repo contains the official implementation of CVPR

Yassine 344 Dec 29, 2022
Working demo of the Multi-class and Anomaly classification model using the CLIP feature space

👁️ Hindsight AI: Crime Classification With Clip About For Educational Purposes Only This is a recursive neural net trained to classify specific crime

Miles Tweed 2 Jun 05, 2022
Tensorflow 2 implementation of the paper: Learning and Evaluating Representations for Deep One-class Classification published at ICLR 2021

Deep Representation One-class Classification (DROC). This is not an officially supported Google product. Tensorflow 2 implementation of the paper: Lea

Google Research 137 Dec 23, 2022
SOTA model in CIFAR10

A PyTorch Implementation of CIFAR Tricks 调研了CIFAR10数据集上各种trick,数据增强,正则化方法,并进行了实现。目前项目告一段落,如果有更好的想法,或者希望一起维护这个项目可以提issue或者在我的主页找到我的联系方式。 0. Requirement

PJDong 58 Dec 21, 2022
a pytorch implementation of auto-punctuation learned character by character

Learning Auto-Punctuation by Reading Engadget Articles Link to Other of my work 🌟 Deep Learning Notes: A collection of my notes going from basic mult

Ge Yang 137 Nov 09, 2022
Code for Environment Dynamics Decomposition (ED2).

ED2 Code for Environment Dynamics Decomposition (ED2). Installation Follow the installation in MBPO and Dreamer. Usage First follow the SD2 method for

0 Aug 10, 2021