Compressed Video Action Recognition 
Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl.
 In CVPR, 2018. [Project Page]
Overview
This is a PyTorch reimplementation of CoViAR (the original paper uses MXNet). The code currently supports UCF-101 and HMDB-51; Charades support is coming soon. This is a work in progress, and suggestions are welcome.
Results
This code produces results comparable to or better than those reported in the original paper:
 HMDB-51: 52% (I-frame), 40% (motion vector), 43% (residuals), 59.2% (CoViAR).
 UCF-101: 87% (I-frame), 70% (motion vector), 80% (residuals), 90.5% (CoViAR).
 (Average of 3 splits, without optical flow.)
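The CoViAR numbers above combine the three streams. As a rough illustration of how such a late fusion can work, here is a minimal sketch that assumes the combination is a weighted sum of per-stream class scores. The `fuse_scores` and `accuracy` helpers, the weight names `w_mv` and `w_res`, and their default values are hypothetical and are not taken from this repository.

```python
import numpy as np


def fuse_scores(iframe, mv, residual, w_mv=1.0, w_res=1.0):
    """Fuse per-stream class scores with a weighted sum (hypothetical helper).

    Each input has shape (num_videos, num_classes). The weights are
    illustrative assumptions, not values used by this repository.
    Returns the predicted class index for every video.
    """
    iframe, mv, residual = (
        np.asarray(x, dtype=np.float64) for x in (iframe, mv, residual)
    )
    if not (iframe.shape == mv.shape == residual.shape):
        raise ValueError("all streams must have the same score shape")
    fused = iframe + w_mv * mv + w_res * residual
    return fused.argmax(axis=1)


def accuracy(predictions, labels):
    """Fraction of videos whose predicted class matches the label."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))
```

In practice the weights would be tuned on held-out data; setting a weight to zero drops that stream from the fusion.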
Data loader
We provide a Python data loader that takes a compressed video directly and returns its compressed representation (I-frames, motion vectors, and residuals) as a NumPy array. The model can thus be trained without extracting and storing every representation as image files.
In our experiments, the loader is fast enough that it does not slow down GPU training. Please see GETTING_STARTED.md for details and instructions.
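To make the idea concrete, here is a minimal sketch of a map-style dataset that decodes a representation on demand instead of reading pre-extracted image files. Everything in it is an assumption for illustration: `fake_decode` is a stand-in that returns zeros rather than the loader shipped with this code, and the array shapes (3 channels for I-frames and residuals, 2 for motion vectors) and the class name `CompressedVideoDataset` are hypothetical. GETTING_STARTED.md documents the actual interface.

```python
import numpy as np

REPRESENTATIONS = ("iframe", "mv", "residual")


def fake_decode(video_path, gop_index, representation, height=4, width=6):
    """Hypothetical stand-in for a compressed-video decoder.

    Returns an all-zero (height, width, channels) array; the channel counts
    and dtypes below are assumptions made for this sketch.
    """
    channels = 2 if representation == "mv" else 3
    dtype = np.uint8 if representation == "iframe" else np.int32
    return np.zeros((height, width, channels), dtype=dtype)


class CompressedVideoDataset:
    """Map-style dataset that decodes one representation per video on access."""

    def __init__(self, video_paths, labels, representation, decode=fake_decode):
        if representation not in REPRESENTATIONS:
            raise ValueError(f"unknown representation: {representation!r}")
        if len(video_paths) != len(labels):
            raise ValueError("video_paths and labels must have the same length")
        self.video_paths = list(video_paths)
        self.labels = list(labels)
        self.representation = representation
        self.decode = decode

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, index):
        # Decode on the fly from the first group of pictures; nothing is
        # written to disk.
        frame = self.decode(self.video_paths[index], 0, self.representation)
        # Convert height-width-channel to channel-first float32 for a network.
        frame = np.transpose(frame, (2, 0, 1)).astype(np.float32)
        return frame, self.labels[index]
```

Because the class only implements `__len__` and `__getitem__`, it can be handed to a batching data loader such as PyTorch's `DataLoader`, with a real decoder passed in through the `decode` argument.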
Using CoViAR
Please see GETTING_STARTED.md for instructions for training and inference.
Citation
If you find this model useful for your research, please cite it using the following BibTeX entry.
@inproceedings{wu2018coviar,
  title={Compressed Video Action Recognition},
  author={Wu, Chao-Yuan and Zaheer, Manzil and Hu, Hexiang and Manmatha, R and Smola, Alexander J and Kr{\"a}henb{\"u}hl, Philipp},
  booktitle={CVPR},
  year={2018}
}
Acknowledgment
This implementation borrows heavily from tsn-pytorch by yjxiong. Parts of the data loader implementation are adapted from this tutorial and the FFmpeg extract_mv example.
