[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Overview

Grounded Situation Recognition with Transformers

Paper | Model Checkpoint

  • This is the official PyTorch implementation of Grounded Situation Recognition with Transformers (BMVC 2021).
  • GSRTR (Grounded Situation Recognition TRansformer) achieves state of the art in all evaluation metrics on the SWiG benchmark.
  • This repository contains instructions, code and model checkpoint.

Overview

Grounded Situation Recognition (GSR) is the task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by capturing high-level semantic feature of an image effectively, and allows the model to flexibly deal with the complicated and image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark.

model

GSRTR mainly consists of two components: Transformer Encoder for verb prediction, and Transformer Decoder for grounded noun prediction. For details, please see Grounded Situation Recognition with Transformers by Junhyeong Cho, Youngseok Yoon, Hyeonjun Lee and Suha Kwak.

Environment Setup

We provide instructions for environment setup.

# Clone this repository and navigate into the repository
git clone https://github.com/jhcho99/gsrtr.git    
cd gsrtr                                          

# Create a conda environment, activate the environment and install PyTorch via conda
conda create --name gsrtr python=3.9              
conda activate gsrtr                             
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge 

# Install requirements via pip
pip install -r requirements.txt                   

SWiG Dataset

Annotations are given in JSON format, and annotation files are under "SWiG/SWiG_jsons/" directory. Images can be downloaded here. Please download the images and store them in "SWiG/images_512/" directory.

SWiG_Image In the SWiG dataset, each image is associated with Verb, Frame and Groundings.

A) Verb: each image is paired with a verb. In the annotation file, "verb" denotes the salient action for an image.

B) Frame: a frame denotes the set of semantic roles for a verb. For example, the frame for verb "Catching" denotes the set of semantic roles "Agent", "Caught Item", "Tool" and "Place". In the annotation file, "frames" show the set of semantic roles for a verb, and noun annotations for each role. There are three noun annotations for each role, which are given by three different annotators.

C) Groundings: each grounding is described in [x1, y1, x2, y2] format. In the annotation file, "bb" denotes groundings for roles. Note that nouns can be labeled without groundings, e.g., in the case of occluded objects. When there is no grounding for a role, [-1, -1, -1, -1] is given.

# an example of annotation for an image

"catching_175.jpg": {
    "verb": "catching",
    "height": 512, 
    "width": 910,
    "bb": {"tool": [-1, -1, -1, -1], 
           "caughtitem": [444, 169, 671, 317], 
           "place": [-1, -1, -1, -1], 
           "agent": [270, 112, 909, 389]},
    "frames": [{"tool": "n05282433", "caughtitem": "n02190166", "place": "n03991062", "agent": "n00017222"}, 
               {"tool": "n05302499", "caughtitem": "n02190166", "place": "n03990474", "agent": "n00017222"}, 
               {"tool": "n07655505", "caughtitem": "n13152742", "place": "n00017222", "agent": "n02190166"}]
    }

In imsitu_space.json file, there is additional information for verb and noun.

# an example of additional verb information

"catching": {
    "framenet": "Getting", 
    "abstract": "an AGENT catches a CAUGHTITEM with a TOOL at a PLACE", 
    "def": "capture a sought out item", 
    "order": ["agent", "caughtitem", "tool", "place"], 
    "roles": {"tool": {"framenet": "manner", "def": "The object used to do the catch action"}, 
              "caughtitem": {"framenet": "theme", "def": "The entity being caught"}, 
              "place": {"framenet": "place", "def": "The location where the catch event is happening"}, 
              "agent": {"framenet": "recipient", "def": "The entity doing the catch action"}}
    }
# an example of additional noun information

"n00017222": {
    "gloss": ["plant", "flora", "plant life"], 
    "def": "(botany) a living organism lacking the power of locomotion"
    }

Additional Details

  • All images should be under "SWiG/images_512/" directory.
  • train.json file is for train set.
  • dev.json file is for development set.
  • test.json file is for test set.

Training

To train GSRTR on a single node with 4 gpus for 40 epochs, run:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
           --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
           --num_workers 4 --enc_layers 6 --dec_layers 6 --dropout 0.15 --hidden_dim 512 \
           --output_dir gsrtr

To train GSRTR on a Slurm cluster with submitit using 4 TITAN Xp gpus for 40 epochs, run:

python run_with_submitit.py --ngpus 4 --nodes 1 --job_dir gsrtr \
        --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
        --num_workers 4 --enc_layers 6 --dec_layers 6 --dropout 0.15 --hidden_dim 512 \
        --partition titanxp
  • A single epoch takes about 30 minutes. 40 epoch training takes around 20 hours on a single machine with 4 TITAN Xp gpus.
  • We use AdamW optimizer with learning rate 10-4 (10-5 for backbone), weight decay 10-4 and β = (0.9, 0.999).
  • Random Color Jittering, Random Gray Scaling, Random Scaling and Random Horizontal Flipping are used for augmentation.

Inference

To run an inference on a custom image, run:

python inference.py --image_path inference/filename.jpg \
                    --saved_model gsrtr_checkpoint.pth \
                    --output_dir inference
  • Model checkpoint can be downloaded here.

Here is an example of inference result: inference_result

Acknowledgements

Our code is modified and adapted from these amazing repositories:

Contact

Junhyeong Cho ([email protected])

Citation

If you find our work useful for your research, please cite our paper:

@InProceedings{cho2021gsrtr,
    title={Grounded Situation Recognition with Transformers},
    author={Junhyeong Cho and Youngseok Yoon and Hyeonjun Lee and Suha Kwak},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2021}
}

License

GSRTR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Owner
Junhyeong Cho
Student at POSTECH | Studied at Stanford, UIUC and UC Berkeley
Junhyeong Cho
A python program to block out your face

Readme This is a small program I threw together in about 6 hours to block out your face. It probably doesn't work very well, so be warned. By default,

1 Oct 17, 2021
Optical character recognition for Japanese text, with the main focus being Japanese manga

Manga OCR Optical character recognition for Japanese text, with the main focus being Japanese manga. It uses a custom end-to-end model built with Tran

Maciej Budyś 327 Jan 01, 2023
Controlling the computer volume with your hands // OpenCV

HandsControll-AI Controlling the computer volume with your hands // OpenCV Step 1 git clone https://github.com/Hayk-21/HandsControll-AI.git pip instal

Hayk 1 Nov 04, 2021
This is a Computer vision package that makes its easy to run Image processing and AI functions. At the core it uses OpenCV and Mediapipe libraries.

CVZone This is a Computer vision package that makes its easy to run Image processing and AI functions. At the core it uses OpenCV and Mediapipe librar

CVZone 648 Dec 30, 2022
Textboxes : Image Text Detection Model : python package (tensorflow)

shinTB Abstract A python package for use Textboxes : Image Text Detection Model implemented by tensorflow, cv2 Textboxes Paper Review in Korean (My Bl

Jayne Shin (신재인) 91 Dec 15, 2022
Detect and fix skew in images containing text

Alyn Skew detection and correction in images containing text Image with skew Image after deskew Install and use via pip! Recommended way(using virtual

Kakul 230 Dec 21, 2022
Deep LearningImage Captcha 2

滑动验证码深度学习识别 本项目使用深度学习 YOLOV3 模型来识别滑动验证码缺口,基于 https://github.com/eriklindernoren/PyTorch-YOLOv3 修改。 只需要几百张缺口标注图片即可训练出精度高的识别模型,识别效果样例: 克隆项目 运行命令: git cl

Python3WebSpider 117 Dec 28, 2022
The world's simplest facial recognition api for Python and the command line

Face Recognition You can also read a translated version of this file in Chinese 简体中文版 or in Korean 한국어 or in Japanese 日本語. Recognize and manipulate fa

Adam Geitgey 47k Jan 07, 2023
OpenGait is a flexible and extensible gait recognition project

A flexible and extensible framework for gait recognition. You can focus on designing your own models and comparing with state-of-the-arts easily with the help of OpenGait.

Shiqi Yu 335 Dec 22, 2022
零样本学习测评基准,中文版

ZeroCLUE 零样本学习测评基准,中文版 零样本学习是AI识别方法之一。 简单来说就是识别从未见过的数据类别,即训练的分类器不仅仅能够识别出训练集中已有的数据类别, 还可以对于来自未见过的类别的数据进行区分。 这是一个很有用的功能,使得计算机能够具有知识迁移的能力,并无需任何训练数据, 很符合现

CLUE benchmark 27 Dec 10, 2022
Vietnamese Language Detection and Recognition

Table of Content Introduction (Khôi viết) Dataset (đổi link thui thành 3k5 ảnh mình) Getting Started (An Viết) Requirements Usage Example Training & E

6 May 27, 2022
deployment of a hybrid model for automatic weapon detection/ anomaly detection for surveillance applications

Automatic Weapon Detection Deployment of a hybrid model for automatic weapon detection/ anomaly detection for surveillance applications. Loved the pro

Janhavi 4 Mar 04, 2022
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

EasyOCR Ready-to-use OCR with 80+ languages supported including Chinese, Japanese, Korean and Thai. What's new 1 February 2021 - Version 1.2.3 Add set

Jaided AI 16.7k Jan 03, 2023
Demo processor to illustrate OCR-D Python API

ocrd_vandalize/ Demo processor to illustrate the OCR-D/core Python API Description :TODO: write docs :) Installation From PyPI pip3 install ocrd_vanda

Konstantin Baierer 5 May 05, 2022
A fastai/PyTorch package for unpaired image-to-image translation.

Unpaired image-to-image translation A fastai/PyTorch package for unpaired image-to-image translation currently with CycleGAN implementation. This is a

Tanishq Abraham 120 Dec 02, 2022
Handwritten Text Recognition (HTR) system implemented with TensorFlow.

Handwritten Text Recognition with TensorFlow Update 2021: more robust model, faster dataloader, word beam search decoder also available for Windows Up

Harald Scheidl 1.5k Jan 07, 2023
Camelot: PDF Table Extraction for Humans

Camelot: PDF Table Extraction for Humans Camelot is a Python library that makes it easy for anyone to extract tables from PDF files! Note: You can als

Atlan Technologies Pvt Ltd 3.3k Dec 31, 2022
A facial recognition device is a device that takes an image or a video of a human face and compares it to another image faces in a database.

A facial recognition device is a device that takes an image or a video of a human face and compares it to another image faces in a database. The structure, shape and proportions of the faces are comp

Pavankumar Khot 4 Mar 19, 2022
A pkg stiching around view images(4-6cameras) to generate bird's eye view.

AVP-BEV-OPEN Please check our new work AVP_SLAM_SIM A pkg stiching around view images(4-6cameras) to generate bird's eye view! View Demo · Report Bug

Xinliang Zhong 37 Dec 01, 2022
Single Shot Text Detector with Regional Attention

Single Shot Text Detector with Regional Attention Introduction SSTD is initially described in our ICCV 2017 spotlight paper. A third-party implementat

Pan He 215 Dec 07, 2022