TableBank: A Benchmark Dataset for Table Detection and Recognition

Overview

TableBank

TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet, contains 417K high-quality labeled tables.

News

  • We release an official split for the train/val/test datasets and re-train both of the Table Detection and Table Structure Recognition models using Detectron2 and OpenNMT tools. The benchmark results, the MODEL ZOO, and the download link of TableBank have been updated.
  • A new benchmark dataset DocBank (Paper, Repo) is now available for document layout analysis
  • Our data can only be used for research purpose
  • Our paper has been accepted in LREC 2020

Introduction

To address the need for a standard open domain table benchmark dataset, we propose a novel weak supervision approach to automatically create the TableBank, which is orders of magnitude larger than existing human labeled datasets for table analysis. Distinct from traditional weakly supervised training set, our approach can obtain not only large scale but also high quality training data.

Nowadays, there are a great number of electronic documents on the web such as Microsoft Word (.docx) and Latex (.tex) files. These online documents contain mark-up tags for tables in their source code by nature. Intuitively, we can manipulate these source code by adding bounding box using the mark-up language within each document. For Word documents, the internal Office XML code can be modified where the borderline of each table is identified. For Latex documents, the tex code can be also modified where bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official fillings, research papers etc, which is tremendously beneficial for large-scale table analysis tasks.

The TableBank dataset totally consists of 417,234 high quality labeled tables as well as their original documents in a variety of domains.

Statistics of TableBank

Based on the number of tables

Task Word Latex Word+Latex
Table detection 163,417 253,817 417,234
Table structure recognition 56,866 88,597 145,463

Based on the number of images

Task Word Latex Word+Latex
Table detection 78,399 200,183 278,582
Table structure recognition 56,866 88,597 145,463

Statistics on Train/Val/Test sets of Table Detection

Source Train Val Test
Latex 187199 7265 5719
Word 73383 2735 2281
Total 260582 10000 8000

Statistics on Train/Val/Test sets of Table Structure Recognition

Source Train Val Test
Latex 79486 6075 3036
Word 50977 3925 1964
Total 130463 10000 5000

License

TableBank is released under the Attribution-NonCommercial-NoDerivs License. You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may not use the material for commercial purposes. If you remix, transform, or build upon the material, you may not distribute the modified material.

Task Definition

Table Detection

Table detection aims to locate tables using bounding boxes in a document. Given a document page in the image format, generating several bounding box that represents the location of tables in this page.

Table Structure Recognition

Table structure recognition aims to identify the row and column layout structure for the tables especially in non-digital document formats such as scanned images. Given a table in the image format, generating an HTML tag sequence that represents the arrangement of rows and columns as well as the type of table cells.

Baselines

To verify the effectiveness of Table-Bank, we build several strong baselines using the state-of-the-art models with end-to-end deep neural networks. The table detection model is based on the Faster R-CNN [Ren et al., 2015] architecture with different settings. The table structure recognition model is based on the encoder-decoder framework for image-to-text.

Data and Metrics

To evaluate table detection, we sample 18,000 document images from Word and Latex documents, where 10,000 images for validation and 8,000 images for testing. Each sampled image contains at least one table. Meanwhile, we also evaluate our model on the ICDAR 2013 dataset to verify the effectiveness of TableBank. To evaluate table structure recognition, we sample 15,000 table images from Word and Latex documents, where 10,000 images for validation and 5,000 images for testing. For table detection, we calculate the precision, recall and F1 in the way described in our paper, where the metrics for all documents are computed by summing up the area of overlap, prediction and ground truth. For table structure recognition, we use the 4-gram BLEU score as the evaluation metric with a single reference.

Table Detection

We use the open-source framework Detectron2 [Wu et al., 2019] to train models on the TableBank. Detectron2 is a high-quality and high-performance codebase for object detection research, which supports many state-of-the-art algorithms. In this task, we use the Faster R-CNN algorithm with the ResNeXt [Xie et al., 2016] as the backbone network architecture, where the parameters are pre-trained on the ImageNet dataset. All baselines are trained using 4 V100 NVIDIA GPUs using data-parallel sync SGD with a minibatch size of 20 images. For other parameters, we use the default values in Detectron2. During testing, the confidence threshold of generating bounding boxes is set to 90%.

Models Word Latex Word+Latex
Precision Recall F1 Precision Recall F1 Precision Recall F1
X101(Word) 0.9352 0.9398 0.9375 0.9905 0.5851 0.7356 0.9579 0.7474 0.8397
X152(Word) 0.9418 0.9415 0.9416 0.9912 0.6882 0.8124 0.9641 0.8041 0.8769
X101(Latex) 0.8453 0.9335 0.8872 0.9819 0.9799 0.9809 0.9159 0.9587 0.9368
X152(Latex) 0.8476 0.9264 0.8853 0.9816 0.9814 0.9815 0.9173 0.9562 0.9364
X101(Word+Latex) 0.9178 0.9363 0.9270 0.9827 0.9784 0.9806 0.9526 0.9592 0.9559
X152(Word+Latex) 0.9229 0.9266 0.9247 0.9837 0.9752 0.9795 0.9557 0.9530 0.9543

Table Structure Recognition

For table structure recognition, we use the open-source framework OpenNMT [Klein et al., 2017] to train the image-to-text model. OpenNMT is mainly designed for neural machine translation, which supports many encoder-decoder frameworks. In this task, we train our model using the image-to-text method in OpenNMT. The model is also trained using 4 V100 NVIDIA GPUs with the learning rate of 1 and batch size of 24. For other parameters, we use the default values in OpenNMT.

Models Word Latex Word+Latex
Image-to-Text (Word) 59.18 69.76 65.75
Image-to-Text (Latex) 51.45 71.63 63.08
Image-to-Text (Word+Latex) 69.93 77.94 74.54

Model Zoo

The trained models are available for download in the TableBank Model Zoo.

Get Data and Leaderboard

**Please DO NOT re-distribute our data.**

If you use the corpus in published work, please cite it referring to the "Paper and Citation" Section.

The annotations and original document pictures of the TableBank dataset can be download from the TableBank dataset homepage.

Paper and Citation

https://arxiv.org/abs/1903.01949

@misc{li2019tablebank,
    title={TableBank: A Benchmark Dataset for Table Detection and Recognition},
    author={Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou and Zhoujun Li},
    year={2019},
    eprint={1903.01949},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

References

  • [Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • [Gilani et al., 2017] A. Gilani, S. R. Qasim, I. Malik, and F. Shafait. Table detection using deep learning. In Proc. of ICDAR 2017, volume 01, pages 771–776, Nov 2017.
  • [Wu et al., 2019] Y Wu, A Kirillov, F Massa, WY Lo, R Girshick. Detectron2[J]. 2019.
  • [Xie et al., 2016] Saining Xie, Ross B. Girshick, Piotr Doll´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.
  • [Klein et al., 2017] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. Open-NMT: Open-source toolkit for neural machine translation. In Proc. of ACL, 2017.]
Fully-automated scripts for collecting AI-related papers

AI-Paper-Collector Web demo: https://ai-paper-collector.vercel.app/ (recommended) Colab notebook: here Motivation Fully-automated scripts for collecti

772 Dec 30, 2022
Source code of RRPN ---- Arbitrary-Oriented Scene Text Detection via Rotation Proposals

Paper source Arbitrary-Oriented Scene Text Detection via Rotation Proposals https://arxiv.org/abs/1703.01086 News We update RRPN in pytorch 1.0! View

428 Nov 22, 2022
Rotational region detection based on Faster-RCNN.

R2CNN_Faster_RCNN_Tensorflow Abstract This is a tensorflow re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detecti

UCAS-Det 581 Nov 22, 2022
Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

Total-Text-Dataset (Official site) Updated on April 29, 2020 (Detection leaderboard is updated - highlighted E2E methods. Thank you shine-lcy.) Update

Chee Seng Chan 671 Dec 27, 2022
2 telegram-bots: for image recognition and for text generation

💻 📱 Telegram_Bots 🔎 & 📖 2 telegram-bots: for image recognition and for text generation. About Image recognition bot: User sends a photo and bot de

Marina Polukoshko 1 Jan 27, 2022
Demo for the paper "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation"

Streaming speaker diarization Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation by Juan Manuel Coria, Hervé

Juanma Coria 185 Jan 01, 2023
This repo contains a script that allows us to find range of colors in images using openCV, and then convert them into geo vectors.

Vectorizing color range This repo contains a script that allows us to find range of colors in images using openCV, and then convert them into geo vect

Development Seed 9 Jul 27, 2022
👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

Quick Info this library tries to solve language detection of very short words and phrases, even shorter than tweets makes use of both statistical and

Peter M. Stahl 532 Dec 28, 2022
Developed an AI-based system to control the mouse cursor using Python and OpenCV with the real-time camera.

Developed an AI-based system to control the mouse cursor using Python and OpenCV with the real-time camera. Fingertip location is mapped to RGB images to control the mouse cursor.

Ravi Sharma 71 Dec 20, 2022
QuanTaichi: A Compiler for Quantized Simulations (SIGGRAPH 2021)

QuanTaichi: A Compiler for Quantized Simulations (SIGGRAPH 2021) Yuanming Hu, Jiafeng Liu, Xuanda Yang, Mingkuan Xu, Ye Kuang, Weiwei Xu, Qiang Dai, W

Taichi Developers 119 Dec 02, 2022
Hand Detection and Finger Detection on Live Feed

Hand-Detection-On-Live-Feed Hand Detection and Finger Detection on Live Feed Getting Started Install the dependencies $ git clone https://github.com/c

Chauhan Mahaveer 2 Jan 02, 2022
This project is basically to draw lines with your hand, using python, opencv, mediapipe.

Paint Opencv 📷 This project is basically to draw lines with your hand, using python, opencv, mediapipe. Screenshoots 📱 Tools ⚙️ Python Opencv Mediap

Williams Ismael Bobadilla Torres 3 Nov 17, 2021
Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.

Deskew by Marek Mauder https://galfar.vevb.net/deskew https://github.com/galfar/deskew v1.30 2019-06-07 Overview Deskew is a command line tool for des

Marek Mauder 127 Dec 03, 2022
This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

pdf-scraper-with-ocr With this tool I am aiming to facilitate the work of those who need to scrape PDFs either by hand or using tools that doesn't imp

Jacobo José Guijarro Villalba 75 Oct 21, 2022
Autonomous Driving project for Euro Truck Simulator 2

hope-autonomous-driving Autonomous Driving project for Euro Truck Simulator 2 Video: How is it working ? In this video, the program processes the imag

Umut Görkem Kocabaş 36 Nov 06, 2022
"Very simple but works well" Computer Vision based ID verification solution provided by LibraX.

ID Verification by LibraX.ai This is the first free Identity verification in the market. LibraX.ai is an identity verification platform for developers

LibraX.ai 46 Dec 06, 2022
Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-tools About About the code Installation System-wide with pip System-wide from source virtualenv Available Programs hocr-check -- check the hOCR f

OCRopus 285 Dec 08, 2022
Pure Javascript OCR for more than 100 Languages 📖🎉🖥

Version 2 is now available and under development in the master branch, read a story about v2: Why I refactor tesseract.js v2? Check the support/1.x br

Project Naptha 29.2k Jan 05, 2023
This tool will help you convert your text to handwriting xD

So your teacher asked you to upload written assignments? Hate writing assigments? This tool will help you convert your text to handwriting xD

Saurabh Daware 4.2k Jan 07, 2023
A selectional auto-encoder approach for document image binarization

The code of this repository was used for the following publication. If you find this code useful please cite our paper: @article{Gallego2019, title =

Javier Gallego 89 Nov 18, 2022