Welcome! Stars are appreciated.
- 2021.12.12: Recent papers (from 2021)
- Contributions are welcome if any information is missing.
Introduction
Referring video object segmentation (RVOS) aims to segment an object in a video given a natural language expression.
Unlike conventional video object segmentation, this task exploits a different type of supervision, language expressions, to identify and segment the object referred to by the given expression in a video. A detailed explanation of the task can be found in the following paper.
Seonguk Seo, Joon-Young Lee, Bohyung Han, "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark", European Conference on Computer Vision (ECCV), 2020: https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123600205.pdf
Impressive Works Related to Referring Video Object Segmentation (RVOS)
- ReferFormer: https://arxiv.org/pdf/2201.00487.pdf
- PMINet: https://youtube-vos.org/assets/challenge/2021/reports/RVOS_2_Ding.pdf
- CMPC-V [PAMI 2021]: https://github.com/spyflying/CMPC-Refseg
  (paper: Cross-modal progressive comprehension for referring segmentation: https://arxiv.org/abs/2105.07175)
- URVOS [ECCV 2020]: https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123600205.pdf
Benchmark
The 3rd Large-scale Video Object Segmentation - Track 3: Referring Video Object Segmentation
Datasets
- YouTube-VOS:
# use the raw URL so wget fetches the script itself, not the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_YTVOS_w_refer.py
python down_YTVOS_w_refer.py
Folder structure:
${current_path}/
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
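Once the data is in place, the language queries can be read from meta_expressions.json. Below is a minimal sketch of a loader, assuming the usual Refer-YouTube-VOS layout (a top-level "videos" dict whose entries carry an "expressions" dict of {"exp": <text>} records) — verify this structure against your downloaded file.

```python
import json

def load_expressions(meta_path):
    """Collect (video_id, expression_id, text) triples from meta_expressions.json.

    Assumes the Refer-YouTube-VOS layout: {"videos": {<video_id>:
    {"expressions": {<exp_id>: {"exp": <text>}, ...}, ...}}}.
    """
    with open(meta_path) as f:
        meta = json.load(f)
    triples = []
    for video_id, video in meta["videos"].items():
        for exp_id, entry in video["expressions"].items():
            triples.append((video_id, exp_id, entry["exp"]))
    return triples
```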
- A2D-Sentences:
REPO: https://web.eecs.umich.edu/~jjcorso/r/a2d/
Paper: https://arxiv.org/abs/1803.07485
Citation:
@misc{gavrilyuk2018actor,
title={Actor and Action Video Segmentation from a Sentence},
author={Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},
year={2018},
eprint={1803.07485},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
License: The dataset may not be republished in any form without the written consent of the authors.
Downloads from the dataset page: README, Dataset and Annotation (version 1.0, 1.9 GB, tar.bz), Evaluation Toolkit (version 1.0, tar.bz).
mkdir a2d_sentences
cd a2d_sentences
wget https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz
tar jxvf A2D_main_1_0.tar.bz
mkdir text_annotations
cd text_annotations
wget https://kgavrilyuk.github.io/actor_action/a2d_annotation.txt
wget https://kgavrilyuk.github.io/actor_action/a2d_missed_videos.txt
# use the raw URL so wget fetches the script itself, not the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_a2d_annotation_with_instances.py
python down_a2d_annotation_with_instances.py
unzip a2d_annotation_with_instances.zip
# rm a2d_annotation_with_instances.zip  (optional cleanup)
cd ..
cd ..
Folder structure:
${current_path}/
└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4 (video files)
    └── text_annotations/
        ├── a2d_annotation.txt (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotation files)
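The sentence queries live in a2d_annotation.txt. A minimal loader could look like the sketch below, assuming the file is comma-separated with a `video_id,instance_id,query` header row (this layout is an assumption — check it against your downloaded copy).

```python
import csv

def load_a2d_queries(path):
    """Parse a2d_annotation.txt into (video_id, instance_id, query) triples.

    Assumes a comma-separated file with a 'video_id,instance_id,query'
    header row; verify against the downloaded file.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [(row["video_id"], row["instance_id"], row["query"])
                for row in reader]
```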
Citation:
@inproceedings{YaXuCaCVPR2017,
author = {Yan, Y. and Xu, C. and Cai, D. and {\bf Corso}, {\bf J. J.}},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking},
year = {2017}
}
@inproceedings{XuCoCVPR2016,
author = {Xu, C. and {\bf Corso}, {\bf J. J.}},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Actor-Action Semantic Segmentation with Grouping-Process Models},
year = {2016}
}
@inproceedings{XuHsXiCVPR2015,
author = {Xu, C. and Hsieh, S.-H. and Xiong, C. and {\bf Corso}, {\bf J. J.}},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
poster = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D_poster.pdf},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Can Humans Fly? {Action} Understanding with Multiple Classes of Actors},
url = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D.pdf},
year = {2015}
}
- J-HMDB: http://jhmdb.is.tue.mpg.de/
Downloading script:
mkdir jhmdb_sentences
cd jhmdb_sentences
wget http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz
wget https://kgavrilyuk.github.io/actor_action/jhmdb_annotation.txt
wget http://files.is.tue.mpg.de/jhmdb/puppet_mask.zip
tar -xzvf Rename_Images.tar.gz
unzip puppet_mask.zip
cd ..
Folder structure:
${current_path}/
└── jhmdb_sentences/
    ├── Rename_Images/ (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/ (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt (text annotations)
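The puppet masks are MATLAB .mat files. A minimal sketch for loading one as a binary mask follows, assuming the silhouette is stored under the key 'part_mask' — that key name is an assumption, so inspect `loadmat(...).keys()` on your files if loading fails.

```python
import numpy as np
from scipy.io import loadmat

def load_puppet_mask(mat_path):
    """Load a J-HMDB puppet mask as a binary uint8 array.

    The 'part_mask' key is an assumption; inspect the file's keys
    if it is not present.
    """
    mask = loadmat(mat_path)["part_mask"]
    # treat any non-zero label as foreground
    return (mask > 0).astype(np.uint8)
```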
Citation:
@inproceedings{Jhuang:ICCV:2013,
title = {Towards understanding action recognition},
author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
booktitle = {International Conf. on Computer Vision (ICCV)},
month = dec,
pages = {3192-3199},
year = {2013}
}
- Refer-DAVIS16/17: https://arxiv.org/pdf/1803.08006.pdf