TadTR: End-to-end Temporal Action Detection with Transformer

By Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai.

This repo holds the code for TadTR, described in the paper End-to-end temporal action detection with Transformer published in IEEE Transactions on Image Processing (TIP) 2022.

We have also explored fully end-to-end training from RGB images with TadTR. See our CVPR 2022 work E2E-TAD.

Introduction

TadTR is an end-to-end Temporal Action Detection TRansformer. It has the following advantages over previous methods:

Simple. It adopts a set-prediction pipeline and achieves TAD with a single network. It does not require a separate proposal generation stage.
Flexible. It removes hand-crafted design such as anchor setting and NMS.
Sparse. It produces very sparse detections (e.g. 10 on ActivityNet), thus requiring lower computation cost.
Strong. As a self-contained temporal action detector, TadTR achieves state-of-the-art performance on HACS and THUMOS14. It is also much stronger than concurrent Transformer-based methods such as RTD-Net and AGT.

Updates

[2023.2.19] Fix a bug a loss caculation (issue #21). Thank @zachpvin for raising this issue!

[2022.8.7] Add support for training/testing on THUMOS14!

[2022.7.4] Glad to share that this paper will appear in IEEE Transactions on Image Processing (TIP). Although I am still busy with my thesis, I will try to make the code accessible soon. Thanks for your patience.

[2022.6] Update the technical report of this work on arxiv (now v3).

[2022.3] Our new work E2E-TAD based on TadTR is accepted to CVPR 2022. It supports fully end-to-end training from RGB images.

[2021.9.15] Update the performance on THUMOS14.

[2021.9.1] Add demo code.

[2021.7] Our revised paper was submitted to IEEE Transactions on Image Processing.

[2021.6] Our revised paper was uploaded to arxiv.

[2021.1.21] Our paper was submitted to IJCAI 2021.

TODOs

add model code
add inference code
add training code
support training/inference with video input. See E2E-TAD

Main Results

HACS Segments

Method	Feature	mAP@0.5	mAP@0.75	mAP@0.95	Avg. mAP
TadTR	I3D RGB	47.14	32.11	10.94	32.09

THUMOS14

Method	Feature	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7	Avg. mAP
TadTR	I3D 2stream	74.8	69.1	60.1	46.6	32.8	56.7

ActivityNet-1.3

Method	Feature	mAP@0.5	mAP@0.75	mAP@0.95	Avg. mAP
TadTR	TSN 2stream	51.29	34.99	9.49	34.64
TadTR	TSP	53.62	37.52	10.56	36.75

Install

Requirements

Linux or Windows
Python>=3.7
(Optional) CUDA>=9.2, GCC>=5.4
PyTorch>=1.5.1, torchvision>=0.6.1 (following instructions here)
Other requirements
```
pip install -r requirements.txt
```

Compiling CUDA extensions (Optional)

The RoIAlign operator is implemented with CUDA extension. If your machine does have a NVIDIA GPU with CUDA support, you can run this step. Otherwise, please set disable_cuda=True in opts.py.

cd model/ops;

# If you have multiple installations of CUDA Toolkits, you'd better add a prefix
# CUDA_HOME=<your_cuda_toolkit_path> to specify the correct version. 
python setup.py build_ext --inplace

Run a quick test

python demo.py

1.Data Preparation

Currently we only support thumos14.

THUMOS14

Download all data from [BaiduDrive(code: adTR)] or [OneDrive].

Features: Download the I3D features I3D_2stream_Pth.tar. It was originally provided by the authors of P-GCN. I have concatenated the RGB and Flow features (drop the tail of the longer one if the lengths are inconsistent) and converted the data to float32 precision to save space.
Annotations: The annotations of action instances and the meta information of feature files. Both are in JSON format (th14_annotations_with_fps_duration.json and th14_i3d2s_ft_info.json).
Pre-trained Reference Models: Our pretrained model that use I3D features thumos14_i3d2s_tadtr_reference.pth. This model corresponds to the config file configs/thumos14_i3d2s_tadtr.yml.

After downloading is finished, extract the archived feature files inplace by cd data;tar -xf I3D_2stream_Pth.tar. Then put the features, annotations, the model under the data/thumos14 directory. We expect the following structure in root folder.

- data
  - thumos14
    - I3D_2stream_Pth
     - xxxxx
     - xxxxx
    - th14_annotations_with_fps_duration.json
    - th14_i3d2s_ft_info.json
    - thumos14_tadtr_reference.pth

2.Testing Pre-trained Models

Run

python main.py --cfg CFG_PATH --eval --resume CKPT_PATH

CFG_PATH is the path to the YAML-format config file that defines the experimental setting. For example, configs/thumos14_i3d2s_tadtr.yml. CKPT_PATH is the path of the pre-trained model. Alternatively, you can execute the Shell script bash scripts/test_reference_models.sh thumos14 for simplity.

3.Training by Yourself

Run the following command

python main.py --cfg CFG_PATH

This codebase supports running on both CPU and GPU.

To run on CPU: please add --device cpu to the above command. Also, you need to set disable_cuda=True in opts.py. The CPU mode does not support actionness regression and the detection performance is lower.
To run on GPU: since the model is very lightweight, just one GPU is enough. You may specify the GPU device ID (e.g., 0) to use by the adding the prefix CUDA_VISIBLE_DEVICES=ID before the above command. To run on multiple GPUs, please refer to scripts/run_parallel.sh.

During training, our code will automatically perform testing every N epochs (N is the test_interval in opts.py). Training takes 6~10 minutes on THUMOS14 if you use a modern GPU (e.g. TITAN Xp). You can also monitor the training process with Tensorboard (need to set cfg.tensorboard=True in opts.py). The tensorboard record and the checkpoint will be saved at output_dir (can be modified in config file).

After training is done, you can also test your trained model by running

python main.py --cfg CFG_PATH --eval

It will automatically use the best model checkpoint. If you want to manually specify the model checkpoint, run

python main.py --cfg CFG_PATH --eval --resume CKPT_PATH

Note that the performance of the model trained by your own may be different from the reference model, even though all seeds are fixed. The reason is that TadTR uses the grid_sample operator, whoses gradient computation involves the non-deterministic AtomicAdd operator. Please refer to ref1 ref2 ref3(Chinese) for details.

Acknowledgement

The code is based on the DETR and Deformable DETR. We also borrow the implementation of the RoIAlign1D from G-TAD. Thanks for their great works.

Citing

@article{liu2022end,
  title={End-to-end Temporal Action Detection with Transformer},
  author={Liu, Xiaolong and Wang, Qimeng and Hu, Yao and Tang, Xu and Zhang, Shiwei and Bai, Song and Bai, Xiang},
  journal={IEEE Transactions on Image Processing (TIP)},
  year={2022}
}

Contact

For questions and suggestions, please contact Xiaolong Liu by email ("liuxl at hust dot edu dot cn").

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
Evaluation		Evaluation
configs		configs
data_intro		data_intro
datasets		datasets
docs		docs
models		models
scripts		scripts
util		util
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
demo.py		demo.py
engine.py		engine.py
main.py		main.py
opts.py		opts.py
requirements.txt		requirements.txt

License

xlliu7/TadTR

Folders and files

Latest commit

History

Repository files navigation