BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting

Overview

BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting

Updated on December 10, 2021 (released the full dataset: 2,021 videos)

Updated on June 06, 2021 (Added evaluation metric)

Released on May 26, 2021

Description

YouTube Demo | Homepage | Downloads (Google Drive) | Downloads (Baidu Drive) (password: go10) | Paper

We present a new large-scale benchmark dataset named Bilingual, Open World Video Text (BOVText), the first large-scale multilingual benchmark for video text spotting in a variety of scenarios. All data are collected from KuaiShou and YouTube.

BOVText has three main features:

  • Large-Scale: we provide 2,000+ videos with more than 1,750,000 frames, four times larger than the largest existing dataset for text in videos.
  • Open Scenario: BOVText covers 30+ open categories spanning a wide variety of scenarios, e.g., life vlogs, sports news, autonomous driving, cartoons. Besides, caption text and scene text are tagged separately because they carry different meanings in a video: the former conveys theme information, while the latter describes the scene.
  • Bilingual: BOVText provides bilingual (Chinese and English) text annotations to promote communication across multiple cultures.

Tasks and Metrics

The proposed BOVText supports four tasks (text detection, recognition, tracking, and spotting), with the evaluation focused mainly on the last two:

  • Video Frames Detection.
  • Video Frames Recognition.
  • Video Text Tracking.
  • End-to-End Text Spotting in Videos.

MOTP (Multiple Object Tracking Precision) [1,2], MOTA (Multiple Object Tracking Accuracy) [1,2], and IDF1 [3,4] are the three main metrics used to evaluate the text tracking task (Task 3) for BOVText. In particular, we make use of the publicly available py-motmetrics library (https://github.com/cheind/py-motmetrics) to implement the evaluation.
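As a rough illustration of how these metrics are computed, the sketch below feeds a single toy frame into py-motmetrics; the IDs and the 1-IoU distance matrix here are placeholders, and the official evaluation.py defines its own matching for quadrilateral text boxes.

    import motmetrics as mm
    import numpy as np

    acc = mm.MOTAccumulator(auto_id=True)

    # One toy frame: two ground-truth tracks, two predictions.
    # Distances are 1 - IoU; np.nan marks pairs that may not be matched.
    gt_ids = ["a", "b"]
    pred_ids = [1, 2]
    dist = [[0.1, np.nan],
            [np.nan, 0.3]]
    acc.update(gt_ids, pred_ids, dist)

    mh = mm.metrics.create()
    summary = mh.compute(acc, metrics=["mota", "motp", "idp", "idr", "idf1"], name="video")
    print(mm.io.render_summary(summary))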

Word recognition evaluation is case-insensitive and accent-insensitive. The transcription '###' (or "#1") is special: it marks text areas that are unreadable. Such areas are not taken into account during the evaluation: a method will not be penalised if it does not detect these words, while a method that detects them will not get any better score.
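For concreteness, the snippet below shows one way this normalization and filtering could be implemented; it is a minimal sketch under the rules stated above, not the official scoring code.

    import unicodedata

    def normalize_transcription(text):
        # Case-insensitive: lowercase; accent-insensitive: drop combining
        # marks after NFKD decomposition.
        decomposed = unicodedata.normalize("NFKD", text.lower())
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    def is_ignored(text):
        # '###' (or '#1') marks an unreadable text area excluded from scoring.
        return text in ("###", "#1")

    assert normalize_transcription("Café") == normalize_transcription("cafe")
    assert is_ignored("###")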

Task 3 for Text Tracking Evaluation

The objective of this task is to obtain the locations of words in the video in terms of their affine bounding boxes. The task requires that words are both localised correctly in every frame and tracked correctly over the video sequence. Please output one JSON file per video, organized as follows:

Output
.
├── Cls10_Program_Cls10_Program_video11.json
├── Cls10_Program_Cls10_Program_video12.json
├── Cls10_Program_Cls10_Program_video13.json
├── Cls10_Program_Cls10_Program_video14.json
├── Cls10_Program_Cls10_Program_video15.json
├── Cls10_Program_Cls10_Program_video16.json
├── Cls11_Movie_Cls11_Movie_video17.json
├── Cls11_Movie_Cls11_Movie_video18.json
├── Cls11_Movie_Cls11_Movie_video19.json
├── Cls11_Movie_Cls11_Movie_video20.json
├── Cls11_Movie_Cls11_Movie_video21.json
└── ...
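As a minimal sketch (not part of the official tooling), the prediction files above could be written like this; the results dict and its contents are hypothetical placeholders.

    import json
    import os

    # Hypothetical mapping: video name -> per-frame predictions.
    results = {"Cls10_Program_Cls10_Program_video11": {"frame_1": []}}

    os.makedirs("output", exist_ok=True)
    for video_name, frames in results.items():
        # One JSON file per video, named after the video itself.
        path = os.path.join("output", video_name + ".json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(frames, f, ensure_ascii=False, indent=2)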


Then cd to Evaluation_Protocol/Task1_VideoTextTracking and run the following script:

python evaluation.py --groundtruths ./Test/Annotation --tests ./output

Task 4 for Text Spotting Evaluation

Please output the JSON files in the same format as in Task 3.

Then cd to Evaluation_Protocol/Task2_VideoTextSpotting and run the following script:

python evaluation.py --groundtruths ./Test/Annotation --tests ./output

Ground Truth (GT) Format and Downloads

We create a single JSON file for each video in the dataset to store the ground truth in a structured format. Within each file, frames are keyed following the naming convention frame_[frame_id], where frame_id refers to the index of the frame in the video.

In a JSON file, each frame_[frame_id] key corresponds to a list, where each entry in the list corresponds to one word in the frame and gives its bounding box coordinates, transcription, text type (caption or scene text), language, and tracking ID, in the following format:

{
    "frame_1": [
        {
            "points": [x1, y1, x2, y2, x3, y3, x4, y4],
            "tracking ID": "1",
            "transcription": "###",
            "category": "title/caption/scene text",
            "language": "Chinese/English",
            "ID_transcription": "complete words for the whole trajectory"
        },

        ...

        {
            "points": [x1, y1, x2, y2, x3, y3, x4, y4],
            "tracking ID": "#",
            "transcription": "###",
            "category": "title/caption/scene text",
            "language": "Chinese/English",
            "ID_transcription": "complete words for the whole trajectory"
        }
    ],

    "frame_2": [
        {
            "points": [x1, y1, x2, y2, x3, y3, x4, y4],
            "tracking ID": "1",
            "transcription": "###",
            "category": "title/caption/scene text",
            "language": "Chinese/English",
            "ID_transcription": "complete words for the whole trajectory"
        },

        ...

        {
            "points": [x1, y1, x2, y2, x3, y3, x4, y4],
            "tracking ID": "#",
            "transcription": "###",
            "category": "title/caption/scene text",
            "language": "Chinese/English",
            "ID_transcription": "complete words for the whole trajectory"
        }
    ],

    ...
}
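For reference, here is a small sketch (not part of the official tooling) that loads one ground-truth file and walks every readable word; the file name is taken from the example tree above.

    import json

    with open("Cls10_Program_Cls10_Program_video11.json", encoding="utf-8") as f:
        gt = json.load(f)

    for frame_key, words in gt.items():            # e.g. "frame_1", "frame_2", ...
        for word in words:
            if word["transcription"] == "###":     # unreadable area, ignored in evaluation
                continue
            points = word["points"]                # [x1, y1, x2, y2, x3, y3, x4, y4]
            track_id = word["tracking ID"]
            print(frame_key, track_id, word["transcription"], points)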

Downloads

Training data and the test set can be found at Downloads (Google Drive) or Downloads (Baidu Drive) (password: go10).

Ranking Table

Important announcement: we expanded the dataset from 1,850 videos to 2,021 videos, which causes a performance difference between the arXiv paper and the NeurIPS version. Where the two disagree, please refer to the latest arXiv paper.

                 Text Tracking Performance /%       End-to-End Video Text Spotting /%
Method           MOTA   MOTP   IDP    IDR    IDF1   MOTA   MOTP   IDP    IDR    IDF1   Published at
EAST+CRNN       -21.6   75.8   29.9   26.5   28.1  -79.3   76.3    6.8    6.9    6.8   -
TransVTSpotter   68.2   82.1   71.0   59.7   64.7   -1.4   82.0   43.6   38.4   40.8   -

Maintenance Plan and Goal

The authors will play an active role in the video text field and will maintain the dataset at least until 2023. The maintenance plan is as follows:

  • Merge and release the whole dataset after further review. (around November 2021)
  • Update the evaluation guidance and script code for the four tasks (detection, tracking, recognition, and spotting). (around November 2021)
  • Host a competition around our work for promotion and publicity. (around March 2022)

More video-and-language tasks will be supported in our dataset:

  • Text-based Video Retrieval [5] (around March 2022)
  • Text-based Video Captioning [6] (around September 2022)
  • Text-based VQA [7][8] (TBD)

TodoList

  • update evaluation metric
  • update data and annotation link
  • update evaluation guidance
  • update baseline (TransVTSpotter)
  • ...

Citation

@article{wu2021opentext,
  title={A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer},
  author={Wu, Weijia and Zhang, Debing and Cai, Yuanqiang and Wang, Sibo and Li, Jiahong and Li, Zhuang and Tang, Yejun and Zhou, Hong},
  journal={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}

Organization

Affiliations: Zhejiang University, MMU of Kuaishou Technology

Authors: Weijia Wu(Zhejiang University), Debing Zhang(Kuaishou Technology)

Feedback

Suggestions and opinions about this dataset (both positive and negative) are greatly welcome. Please contact the authors by sending an email to [email protected].

License and Copyright

The project is open source under the CC BY 4.0 license (see the LICENSE file).

The dataset is intended for research use only; commercial use is not allowed.

The videos were partially downloaded from YouTube and some may be subject to copyright. We do not own the copyright of those videos and provide them for non-commercial research purposes only. For each video from YouTube, while we tried to identify videos that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each video, and you should verify the license for each video yourself.

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 4.0 License.

References

[1] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., & Leal-Taixe, L. (2019). CVPR19 Tracking and Detection Challenge: How crowded can it get?. arXiv preprint arXiv:1906.04567.

[2] Bernardin, K. & Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing, 2008(1):1-10, 2008.

[3] Ristani, E., Solera, F., Zou, R., Cucchiara, R. & Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In ECCV workshop on Benchmarking Multi-Target Tracking, 2016.

[4] Li, Y., Huang, C. & Nevatia, R. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[5] Anand Mishra, Karteek Alahari, and CV Jawahar. Image retrieval using textual cues. In Proceedings of the IEEE International Conference on Computer Vision, pages 3040–3047, 2013.

[6] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In European Conference on Computer Vision, pages 742–758. Springer, 2020.

[7] Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar, "DocVQA: A Dataset for VQA on Document Images", arXiv:2007.00398 [cs.CV], WACV 2021

[8] Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, C.V. Jawahar, "Document Visual Question Answering Challenge 2020", arXiv:2008.08899 [cs.CV], DAS 2020
