TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Last update: Dec 21, 2022

Related tags

Deep Learning TSP

Overview

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

[Paper] [Project Website]

This repository holds the source code, pretrained models, and pre-extracted features for the TSP method.

Please cite this work if you find TSP useful for your research.

@article{alwassel2020tsp,
  title={TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks},
  author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
  journal={arXiv preprint arXiv:2011.11479},
  year={2020}
}

Pre-extracted TSP Features

We provide pre-extracted features for ActivityNet v1.3 and THUMOS14 videos. The feature files are saved in H5 format, where we map each video-name to a features tensor of size N x 512, where N is the number of features and 512 is the feature size. Use h5py python package to read the feature files. Not familiar with H5 files or h5py? here is a quick start guide.

For ActivityNet v1.3 dataset

Download: [train subset] [valid subset] [test subset]

Details: The features are extracted from the R(2+1)D-34 encoder pretrained with TSP on ActivityNet (released model) using clips of 16 frames at a frame rate of 15 fps and a stride of 16 frames (i.e., non-overlapping clips). This gives one feature vector per 16/15 ~= 1.067 seconds.

For THUMOS14 dataset

Download: [valid subset] [test subset]

Details: The features are extracted from the R(2+1)D-34 encoder pretrained with TSP on THUMOS14 (released model) using clips of 16 frames at a frame rate of 15 fps and a stride of 1 frame (i.e., dense overlapping clips). This gives one feature vector per 1/15 ~= 0.067 seconds.

Setup

Clone this repository and create the conda environment.

git clone https://github.com/HumamAlwassel/TSP.git
cd TSP
conda env create -f environment.yml
conda activate tsp

Data Preprocessing

Follow the instructions here to download and preprocess the input data.

Training

We provide training scripts for the TSP models and the TAC baselines here.

Feature Extraction

You can extract features from released pretrained models or from local checkpoints using the scripts here.

Acknowledgment: Our source code borrows implementation ideas from pytorch/vision and facebookresearch/VMZ repositories.

Comments

LOSS does not decrease during training

My data set is small, 1500 videos, all under 10 seconds in length. The current training results of this model are as follows:

The experimental Settings adopted are: Batch_size=32,FACTOR=2. Is such a situation normal? If it is abnormal, what should be done?

opened by ZChengLong578 5
H5 files generated about GVF features

Hi, @HumamAlwassel Thanks for your excellent work and for sharing the code. When I was training my dataset, I read your explanation on GVF feature generation. Do I need to combine .pkl files generated by the training set and valid set into .h5 files when I go to step 3?

opened by ZChengLong578 5
The LOSS value is too large and does not decrease

Hi, @HumamAlwassel, I'm sorry to bother you again. I did it without or very little background (no action). Now I have added more background (no Action), but the LOSS value is very large and does not decrease. The specific situation is shown in the following figure: Here are the files for the training set and validation set: What can I do to solve this problem?

opened by ZChengLong578 3
Use the pretraining model to train other datasets

Hi, @HumamAlwassel After downloading the pre-training model as you said, I overwrote the value of epoch to 0. The following changes were then made in the code: I would like you to take a look, is the change I made in the code correct? Or should I replace the initial tac-on-kinetics Pretrained weights with this instead of using it in the resume?

opened by ZChengLong578 2
Inference unseen video using pretrained model

Hi @HumamAlwassel, Thanks for your excellent work. I really appreciated it. I've trained your work on my own dataset. However, I am thinking about how to use trained model to inference unseen videos. Could you give me some examples that export result of a video such as action label and its start or end time.

Best regards,

opened by t2kien 2
Data sampling problems

Hi, @HumamAlwassel I'm sorry to trouble you again. The duration of my dataset action was short and many partitions were removed, as shown below: However, after observation, I find that it does not seem to be the problem with the length of the video. Actions with a length of 0-1.5 seconds are in the video, but actions with a length of 1.5-3 seconds are not in the video. Why is this?

opened by ZChengLong578 2
$RuntimeError(f'<UntrimmedVideoDataset>: got clip of length {vframes.shape[0]} != {self.clip_length}.'$

RuntimeError(f': got clip of length {vframes.shape[0]} != {self.clip_length}.'

Traceback (most recent call last): File "train.py", line 290, in <module> main(args) File "train.py", line 260, in main train_one_epoch(model=model, criterion=criterion, optimizer=optimizer, lr_scheduler=lr_scheduler, File "train.py", line 63, in train_one_epoch for sample in metric_logger.log_every(data_loader, print_freq, header, device=device): File "/media/bruce/2T/projects/TSP/train/../common/utils.py", line 137, in log_every for obj in iterable: File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__ data = self._next_data() File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data return self._process_data(data) File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data data.reraise() File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/_utils.py", line 394, in reraise raise self.exc_type(msg) RuntimeError: Caught RuntimeError in DataLoader worker process 0. Original Traceback (most recent call last): File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop data = fetcher.fetch(index) File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp> data = [self.dataset[idx] for idx in possibly_batched_index] File "/media/bruce/2T/projects/TSP/train/untrimmed_video_dataset.py", line 86, in __getitem__ raise RuntimeError(f'<UntrimmedVideoDataset>: got clip of length {vframes.shape[0]} != {self.clip_length}.' RuntimeError: <UntrimmedVideoDataset>: got clip of length 15 != 16.filename=/mnt/nas/bruce14t/THUMOS14/valid/video_validation_0000420.mp4, clip_t_start=526.7160991305855, clip_t_end=527.7827657972522, fps=30.0, t_start=498.2, t_end=546.9

I am very impressed by your wonderful work. When I try to reproduce the bash train_tsp_on_thumos14.sh for the THUMOS14 dataset, I got the above data loading issue. The calculation of the start and end of input clips seems not to work well for all the clips (code Line 74-78 of train/untrimmed_video_dataset.py). Could you provide some help with it? Thank you very much in advance.

opened by bruceyo 2
How do I calculate mean and std for a new dataset?

Thanks for your inspiring code with detailed explanations! I have learnt a lot from that and now I'm trying to do some experiments in another dataset. But some implementation details confuse me.

I notice that in the dataset transform part, there is a normalizing step. normalize = T.Normalize(mean=[0.43216, 0.394666, 0.37645], std=[0.22803, 0.22145, 0.216989])

So how do I calculate the mean and std for a new dataset? Should I extract frames from videos first, then calculate mean & std inside all the frames in all videos for each RGB channel?

opened by xjtupanda 1
$Similar to issue #11 getting RuntimeError(f'<UntrimmedVideoDataset>: got clip of length {vframes.shape[0]} != {self.clip_length}.'$
Similar to issue #11 getting RuntimeError(f': got clip of length {vframes.shape[0]} != {self.clip_length}.'
I am working with ActivityNet-v1.3 data converted to grayscale.

I followed the preprocessing step highlighted here.

However, I am still facing this issue similar to #11 , wanted to check if I am missing something or if there are any known fixes.

Example from the log:

RuntimeError: <UntrimmedVideoDataset>: got clip of length 15 != 16.filename=~/ActivityNet/grayscale_split/train/v_bNuRrXSjJl0.mp4, clip_t_start=227.63093165194988, clip_t_end=228.69759831861654, fps=30.0, t_start=219.1265882558503, t_end=228.7

RuntimeError: <UntrimmedVideoDataset>: got clip of length 13 != 16.filename=~/ActivityNet/grayscale_split/train/v_nTNkGOtp7aQ.mp4, clip_t_start=33.341372258903775, clip_t_end=34.408038925570445, fps=30.0, t_start=25.58139772698908, t_end=34.53333333333333

RuntimeError: <UntrimmedVideoDataset>: got clip of length 1 != 16.filename=~/ActivityNet/grayscale_split/train/v_7Iy7Cjv2SAE.mp4, clip_t_start=190.79558490339477, clip_t_end=191.86225157006143, fps=30.0, t_start=131.42849249141963, t_end=195.0

Also, is there a recommended way to skip these files instead of raising the issue while training. The above issues came for different runs and at different epochs.
opened by vc-30 1
Accuracy don't increase

Thank you for your reply! I used the above code to train my data set and found that the accuracy rate has not changed much and has remained around 3. Here is the output of the training: Do you know what caused it?

opened by ZChengLong578 1
question about pretrain-model

Hi, thank you for your excellent work. I have a problem with your model. It is extracted TSP Features in ActivityNet. When the objects present in my video are not in ActivityNet, the model fails to recognize. As an example, ActivityNet's animals are only dogs and horses, but when my video is a cat, I run into trouble. I'm guessing because the model hasn't seen cats, one of my solution is to use ImageNet-22k pretrained weights and then do extracted TSP Features in ActivityNet. I don't know if my thinking is right. If it is correct, could you please update your code about using ImageNet-22k pretrained weights? Thank you very much for your excellent work.

opened by qt2139 1

Releases(thumos14_features)

thumos14_features(Apr 14, 2021)

Pre-extracted TSP features for THUMOS14
Source code(tar.gz)
Source code(zip)
r2plus1d_34-tsp_on_thumos14-test_features.h5(1212.17 MB)
r2plus1d_34-tsp_on_thumos14-valid_features.h5(1098.67 MB)
model_weights(Apr 14, 2021)

This a release of TSP pretrained models as well as baseline models.
Source code(tar.gz)
Source code(zip)
r2plus1d_18-tac_on_activitynet-backbone_lr_0.0001-fc_lr_0.004-epoch_5-9f56941a.pth(239.74 MB)
r2plus1d_18-tac_on_kinetics-76ce975c.pth(120.31 MB)
r2plus1d_18-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_6-22835b73.pth(239.76 MB)
r2plus1d_34-tac_on_activitynet-backbone_lr_0.0001-fc_lr_0.002-epoch_5-98ccac94.pth(485.49 MB)
r2plus1d_34-tac_on_kinetics-0547130e.pth(243.26 MB)
r2plus1d_34-tac_on_thumos14-backbone_lr_0.00001-fc_lr_0.002-epoch_3-54b5c8aa.pth(484.79 MB)
r2plus1d_34-tsp_on_activitynet-avg_gvf-backbone_lr_0.0001-fc_lr_0.004-epoch_5-8b74eaa2.pth(485.51 MB)
r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth(485.51 MB)
r2plus1d_34-tsp_on_activitynet-no_gvf-backbone_lr_0.0001-fc_lr_0.004-epoch_5-fb38fdd2.pth(485.50 MB)
r2plus1d_34-tsp_on_thumos14-max_gvf-backbone_lr_0.0001-fc_lr_0.004-epoch_4-e6a30b2f.pth(484.81 MB)
r3d_18-tac_on_activitynet-backbone_lr_0.001-fc_lr_0.01-epoch_5-31fd6e95.pth(253.89 MB)
r3d_18-tac_on_kinetics-dcd952c6.pth(127.36 MB)
r3d_18-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_6-85584422.pth(253.91 MB)
activitynet_features(Apr 14, 2021)

Pre-extracted TSP features for Activitynet v1.3
Source code(tar.gz)
Source code(zip)
r2plus1d_34-tsp_on_activitynet-test_features.h5(962.38 MB)
r2plus1d_34-tsp_on_activitynet-train_features.h5(1969.45 MB)
r2plus1d_34-tsp_on_activitynet-valid_features.h5(975.15 MB)

Owner

Humam Alwassel

PhD Student, Computer Vision Researcher, and Deep Learning "Hacker".

GitHub Repository http://humamalwassel.com/publication/tsp/

Implementation of the paper: "SinGAN: Learning a Generative Model from a Single Natural Image"

SinGAN This is an unofficial implementation of SinGAN from someone who's been sitting right next to SinGAN's creator for almost five years. Please ref

35 Nov 10, 2022

This project is based on RIFE and aims to make RIFE more practical for users by adding various features and design new models

CPM 项目描述 CPM（Chinese Pretrained Models）模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型，参数量分别为109M、334M、2.6B，用户需申请与通过审核，方可下载。由于原项目需要考虑大模型的训练和使用，需要安装较为复杂

190 Jan 08, 2023

ByteTrack with ReID module following the paradigm of FairMOT, tracking strategy is borrowed from FairMOT/JDE.

ByteTrack_ReID ByteTrack is the SOTA tracker in MOT benchmarks with strong detector YOLOX and a simple association strategy only based on motion infor

46 Dec 29, 2022

[NeurIPS'21] "AugMax: Adversarial Composition of Random Augmentations for Robust Training" by Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Animashree Anandkumar, and Zhangyang Wang.

112 Nov 07, 2022

Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space"

MotionCLIP Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space". Please visit our webpage for mor

173 Dec 26, 2022

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors In this paper, we propose a novel local descriptor-based fra

80 Dec 15, 2022

meProp: Sparsified Back Propagation for Accelerated Deep Learning (ICML 2017)

meProp The codes were used for the paper meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting (ICML 2017) [pdf]

107 Nov 18, 2022

Predicting the duration of arrival delays for commercial flights.

Flight Delay Prediction Our objective is to predict arrival delays of commercial flights. According to the US Department of Transportation, about 21%

1 Jan 11, 2022

Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

BI-RADS BERT Implementation of BI-RADS-BERT & The Advantages of Section Tokenization. This implementation could be used on other radiology in house co

1 May 17, 2022

RCDNet: A Model-driven Deep Neural Network for Single Image Rain Removal (CVPR2020)

RCDNet: A Model-driven Deep Neural Network for Single Image Rain Removal (CVPR2020) Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng [PDF] [Supplementary M

6 Sep 27, 2022

MicroNet: Improving Image Recognition with Extremely Low FLOPs (ICCV 2021)

MicroNet: Improving Image Recognition with Extremely Low FLOPs (ICCV 2021) A pytorch implementation of MicroNet. If you use this code in your research

293 Dec 28, 2022

ONNX-GLPDepth - Python scripts for performing monocular depth estimation using the GLPDepth model in ONNX

18 Nov 06, 2022

Implementation of Monocular Direct Sparse Localization in a Prior 3D Surfel Map (DSL)

DSL Project page: https://sites.google.com/view/dsl-ram-lab/ Monocular Direct Sparse Localization in a Prior 3D Surfel Map Authors: Haoyang Ye, Huaiya

93 Nov 30, 2022

Attention-driven Robot Manipulation (ARM) which includes Q-attention

Attention-driven Robotic Manipulation (ARM) This codebase is home to: Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation I

84 Dec 29, 2022

WORD: Revisiting Organs Segmentation in the Whole Abdominal Region

WORD: Revisiting Organs Segmentation in the Whole Abdominal Region (Paper and DataSet). [New] Note that all the emails about the download permission o

71 Dec 22, 2022

Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).

37 Dec 30, 2022

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

WECHSEL Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. arXiv: https://arx

45 Dec 29, 2022

Jupyter notebooks for using & learning Keras

deep-learning-with-keras-notebooks 這個github的repository主要是個人在學習Keras的一些記錄及練習。希望在學習過程中發現到一些好的資訊與範例也可以對想要學習使用 Keras來解決問題的同好，或是對深度學習有興趣的在學學生可以有一些方便理解與上手範例

2.1k Dec 27, 2022

Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".

multilingual-mrc-isdg Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph". This r

5 Dec 07, 2022

Official implementation for paper: Feature-Style Encoder for Style-Based GAN Inversion

Feature-Style Encoder for Style-Based GAN Inversion Official implementation for paper: Feature-Style Encoder for Style-Based GAN Inversion. Code will

63 Jan 03, 2023

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Related tags

Overview

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Pre-extracted TSP Features

For ActivityNet v1.3 dataset

For THUMOS14 dataset

Setup

Data Preprocessing

Training

Feature Extraction

Comments

Releases(thumos14_features)

thumos14_features(Apr 14, 2021)

model_weights(Apr 14, 2021)

activitynet_features(Apr 14, 2021)

Owner

Humam Alwassel

Implementation of the paper: "SinGAN: Learning a Generative Model from a Single Natural Image"

This project is based on RIFE and aims to make RIFE more practical for users by adding various features and design new models

ByteTrack with ReID module following the paradigm of FairMOT, tracking strategy is borrowed from FairMOT/JDE.

[NeurIPS'21] "AugMax: Adversarial Composition of Random Augmentations for Robust Training" by Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Animashree Anandkumar, and Zhangyang Wang.

Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space"

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors

meProp: Sparsified Back Propagation for Accelerated Deep Learning (ICML 2017)

Predicting the duration of arrival delays for commercial flights.

Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

RCDNet: A Model-driven Deep Neural Network for Single Image Rain Removal (CVPR2020)

MicroNet: Improving Image Recognition with Extremely Low FLOPs (ICCV 2021)

ONNX-GLPDepth - Python scripts for performing monocular depth estimation using the GLPDepth model in ONNX

Implementation of Monocular Direct Sparse Localization in a Prior 3D Surfel Map (DSL)

Attention-driven Robot Manipulation (ARM) which includes Q-attention

WORD: Revisiting Organs Segmentation in the Whole Abdominal Region

Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Jupyter notebooks for using & learning Keras

Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".

Official implementation for paper: Feature-Style Encoder for Style-Based GAN Inversion