[CVPR 2021] Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Introduction

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CiDEr@0.5IoU improvement).

Please also check out the project website here.

For additional detail, please see the Scan2Cap paper:
"Scan2Cap: Context-aware Dense Captioning in RGB-D Scans"
by Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang
from Technical University of Munich and Simon Fraser University.

Data

ScanRefer

If you would like access to the ScanRefer dataset, please fill out this form. Once your request is accepted, you will receive an email with the download link.

Note: In addition to the language annotations in the ScanRefer dataset, you also need access to the original ScanNet dataset. Please refer to the ScanNet Instructions for more details.

Download the dataset by simply executing the wget command:

wget <download_link>

Scan2CAD

Learning the relative object orientations in the relational graph requires the CAD model alignment annotations in Scan2CAD, so please refer to the Scan2CAD official release (you need ~8MB on your disk). Once the data is downloaded, extract the zip file under data/ and change the path to the Scan2CAD annotations (CONF.PATH.SCAN2CAD) in lib/config.py. As Scan2CAD doesn't cover all instances in ScanRefer, please download the mapping file and place it under CONF.PATH.SCAN2CAD. Parse the raw Scan2CAD annotations with the following command:

python scripts/Scan2CAD_to_ScanNet.py

Setup

Please execute the following command to install PyTorch 1.8:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

Install the necessary packages listed in requirements.txt:

pip install -r requirements.txt

And don't forget to refer to PyTorch Geometric to install the graph support.

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install

Before moving on to the next step, please don't forget to set CONF.PATH.BASE in lib/config.py to the project root path.
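As a rough illustration of what this step amounts to, the sketch below builds a small path configuration rooted at a project directory. It is not the contents of lib/config.py: only the names CONF.PATH.BASE and CONF.PATH.SCAN2CAD come from this README, while the `make_conf` helper, the `DATA` entry, the `Scan2CAD_dataset` folder name, and the example root path are hypothetical.

```python
import os
from types import SimpleNamespace


def make_conf(base):
    """Build a minimal path config rooted at `base` (hypothetical helper)."""
    conf = SimpleNamespace()
    conf.PATH = SimpleNamespace()
    # Project root: the value this README asks you to set.
    conf.PATH.BASE = base
    # Assumed: data lives in a data/ folder directly under the root.
    conf.PATH.DATA = os.path.join(conf.PATH.BASE, "data")
    # Assumed folder name for the extracted Scan2CAD annotations.
    conf.PATH.SCAN2CAD = os.path.join(conf.PATH.DATA, "Scan2CAD_dataset")
    return conf


# Placeholder path; replace it with the location of your own checkout.
CONF = make_conf("/path/to/Scan2Cap")
print(CONF.PATH.BASE)
print(CONF.PATH.DATA)
```

Deriving every other path from a single root is what makes one edit to the base path sufficient in a layout like this.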

Data preparation

  1. Download the ScanRefer dataset and unzip it under data/. You might want to run python scripts/organize_scanrefer.py to organize the data a bit.
  2. Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
  3. Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (Please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.
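A quick way to sanity-check this layout is to list the sub-folders whose names look like ScanNet scene IDs. The sketch below is a standalone helper rather than part of the repository; the function name `list_scene_dirs` is hypothetical, and the pattern assumes scene IDs of the form `scene` followed by four digits, an underscore, and two digits, as in scene0000_00.

```python
import os
import re

# Scene IDs such as scene0000_00: four digits, underscore, two digits.
SCENE_PATTERN = re.compile(r"^scene\d{4}_\d{2}$")


def list_scene_dirs(scans_root):
    """Return the sorted names of sub-folders that look like scene IDs."""
    if not os.path.isdir(scans_root):
        return []
    return sorted(
        name
        for name in os.listdir(scans_root)
        if SCENE_PATTERN.match(name)
        and os.path.isdir(os.path.join(scans_root, name))
    )


if __name__ == "__main__":
    scenes = list_scene_dirs(os.path.join("data", "scannet", "scans"))
    print(f"found {len(scenes)} scene folders")
```

Run it from the project root; a count of zero usually means scans/ was placed or linked at the wrong level.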

  4. Pre-process ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB of free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py

After this step, you can check if the processed scene data is valid by running:

python visualize.py --scene_id scene0000_00
  5. (Optional) Pre-process the multiview features from ENet.

    a. Download the ENet pretrained weights (1.4MB) and put them under data/.

    b. Download and decompress the extracted ScanNet frames (~13GB).

    c. Change the data paths in config.py marked with TODO accordingly.

    d. Extract the ENet features:

    python scripts/compute_multiview_features.py

    e. Project ENet features from ScanNet frames to point clouds; you need ~36GB to store the generated HDF5 database:

    python scripts/project_multiview_features.py --maxpool

    You can check if the projections make sense by projecting the semantic labels from the images to the target point cloud:

    python scripts/project_multiview_labels.py --scene_id scene0000_00 --maxpool
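Since the preparation steps above quote sizable disk requirements, it may help to check free space before starting. The sketch below is a standalone helper, not part of the repository: the sizes are the approximate figures quoted in the steps above, and the names `REQUIRED_GB`, `free_gb`, and `has_room` are hypothetical.

```python
import shutil

# Approximate sizes quoted in the preparation steps, in GB.
REQUIRED_GB = {
    "preprocessed ScanNet data": 3.8,
    "extracted ScanNet frames": 13,
    "projected multiview features (HDF5)": 36,
}


def free_gb(path="."):
    """Free space in GB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(required_gb, path="."):
    """True if at least `required_gb` GB are free at `path`."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    total = sum(REQUIRED_GB.values())
    status = "enough" if has_room(total) else "NOT enough"
    print(f"need ~{total:.1f} GB, {free_gb():.1f} GB free: {status}")
```

The total covers only the three items listed in the dictionary; downloads such as the ScanNet scans themselves are not included.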

Usage

End-to-End training for 3D dense captioning

Run the following script to start the end-to-end training of the Scan2Cap model using the multiview features and normals. For more training options, please run scripts/train.py -h:

python scripts/train.py --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50

The trained model as well as the intermediate results will be dumped into outputs/. To evaluate the model (@0.5IoU), please run the following script and change the <output_folder> accordingly. Note that the arguments must match the ones used for training:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_caption --min_iou 0.5
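For readers unfamiliar with the IoU threshold, the following sketch shows the standard intersection-over-union computation for two axis-aligned 3D boxes. It is an illustration only, not the repository's evaluation code: the corner format `(xmin, ymin, zmin, xmax, ymax, zmax)` and the function names are assumptions made for this example.

```python
def box3d_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box3d_iou(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            # The boxes do not overlap along this axis.
            return 0.0
        inter *= hi - lo
    union = box3d_volume(box_a) + box3d_volume(box_b) - inter
    return inter / union


# Two 2x2x2 boxes shifted by 1 along x overlap in a 1x2x2 region.
iou = box3d_iou((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2))
print(round(iou, 4))
```

Here the overlap has volume 4 and the union has volume 12, so the IoU is 1/3, which would fall below a 0.5 threshold.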

To evaluate the detection performance:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection

You can even evaluate the pretrained object detection backbone:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection --eval_pretrained

If you want to visualize the results, please run this script to generate bounding boxes and descriptions for the scene <scene_id> under outputs/:

python scripts/visualize.py --folder <output_folder> --scene_id <scene_id> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10

Note that you need to run python scripts/export_scannet_axis_aligned_mesh.py first to generate axis-aligned ScanNet mesh files.

3D dense captioning with ground truth bounding boxes

To experiment with the captioning performance using ground truth bounding boxes, you need to extract the box features with a pre-trained extractor. The pretrained ones are already in pretrained/, but if you want to train a new one from scratch, run the following script:

python scripts/train_maskvotenet.py --batch_size 8 --epoch 200 --lr 1e-3 --wd 0 --use_multiview --use_normal

The pretrained model will be stored under outputs/. Before we proceed, you need to move the output folder to pretrained/ and change the name of the folder to XYZ_MULTIVIEW_NORMAL_MASKS_VOTENET. The name must reflect the features used while training, e.g. MULTIVIEW -> --use_multiview.
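The naming convention can be sketched as a small helper that assembles the folder name from the feature flags. This is an illustration only: the function `extractor_folder_name` is hypothetical, the NORMAL -> --use_normal mapping is assumed by analogy with the MULTIVIEW example above, and tags for any other feature options are not covered.

```python
def extractor_folder_name(use_multiview=False, use_normal=False):
    """Assemble a folder name from feature flags (hypothetical helper)."""
    parts = ["XYZ"]
    if use_multiview:
        # Stated in the README: MULTIVIEW corresponds to --use_multiview.
        parts.append("MULTIVIEW")
    if use_normal:
        # Assumed by analogy: NORMAL corresponds to --use_normal.
        parts.append("NORMAL")
    parts.extend(["MASKS", "VOTENET"])
    return "_".join(parts)


print(extractor_folder_name(use_multiview=True, use_normal=True))
```

With both flags set, this reproduces the folder name given above for the extractor trained with --use_multiview --use_normal.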

After that, let's run the following script to extract the features for the ground truth bounding boxes. Note that the feature options must match the ones in the previous steps:

python scripts/extract_gt_features.py --batch_size 16 --epoch 100 --use_multiview --use_normal --train --val

The extracted features will be stored as an HDF5 database under /gt_ _features. You need ~610MB of space on your disk.

Now the box features are ready, so we're good to go! Next step: run the following command to start training the dense captioning pipeline with the extracted ground truth box features:

python scripts/train_pretrained.py --mode gt --batch_size 32 --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

For evaluating the model, run the following command:

python scripts/eval_pretrained.py --folder <output_folder> --mode gt --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

3D dense captioning with pre-trained VoteNet bounding boxes

If you would like to play around with the pre-trained VoteNet bounding boxes, you can directly use the pre-trained VoteNet in pretrained/. After picking the model you like, run the following command to extract the bounding boxes and associated box features:

python scripts/extract_votenet_features.py --batch_size 16 --epoch 100 --use_multiview --use_normal --train --val

Now the box features are ready. Next step: run the following command to start training the dense captioning pipeline with the extracted VoteNet boxes:

python scripts/train_pretrained.py --mode votenet --batch_size 32 --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

For evaluating the model, run the following command:

python scripts/eval_pretrained.py --folder <output_folder> --mode votenet --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

Experiments on ReferIt3D

Yes, of course you can use the ReferIt3D dataset for training and evaluation. Simply download the ReferIt3D dataset and unzip it under data/, then run the following command to convert it to the ScanRefer format:

python scripts/organize_referit3d.py

Then you can specify the dataset you would like to use by adding --dataset ReferIt3D in the aforementioned steps. Have fun!

2D Experiments

Please refer to Scan2Cap-2D for more information.

Citation

If you found our work helpful, please kindly cite our paper via:

@inproceedings{chen2021scan2cap,
  title={Scan2Cap: Context-aware Dense Captioning in RGB-D Scans},
  author={Chen, Zhenyu and Gholami, Ali and Nie{\ss}ner, Matthias and Chang, Angel X},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3193--3203},
  year={2021}
}

License

Scan2Cap is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Copyright (c) 2021 Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang
