Overview

Benchmarking-Chinese-Text-Recognition

This repository contains datasets and baselines for benchmarking Chinese text recognition. Please see the corresponding paper for more details regarding the datasets, baselines, the empirical study, etc.

Highlights

🌟 All datasets are transformed to lmdb format for convenient usage.

🌟 The experimental results of all baselines are available at link with format (index [pred] [gt]).

🌟 The code and trained weights of TransOCR (one of the baselines) are available at link for direct use.

Updates

Jan 3, 2022: This repo is made publicly available. The corresponding paper is available at arXiv.

Nov 26, 2021: We upload the lmdb datasets publicly to Google Drive and BaiduCloud.

Download

  • The lmdb scene, web and document datasets are available in BaiduCloud (psw:v2rm) and GoogleDrive.

  • For the handwriting setting, please first download it at SCUT-HCCDoc and divide it into training, validation, and testing sets following link.

  • We also collected the HWDB2.0-2.2 and ICDAR2013 handwriting datasets from CASIA and the ICDAR2013 competition for further research. The datasets are available at BaiduCloud (psw:lfaq) and GoogleDrive.

Datasets

The image demonstrates the four datasets used in our benchmark, including the Scene, Web, Document, and Handwriting datasets, each of which is introduced next.

Scene Dataset

We first collect the publicly available scene datasets, including RCTW, ReCTS, LSVT, ArT, and CTW, resulting in 636,455 samples, which are randomly shuffled and then divided at a ratio of 8:1:1 to construct the training, validation, and testing datasets. Details of each scene dataset are introduced as follows:

  • RCTW [1] provides 12,263 annotated Chinese text images from natural scenes. We derive 44,420 text lines from the training set and use them in our benchmark. The testing set of RCTW is not used as the text labels are not available.
  • ReCTS [2] provides 25,000 annotated street-view Chinese text images, mainly derived from natural signboards. We only adopt the training set and crop 107,657 text samples in total for our benchmark.
  • LSVT [3] is a large scale Chinese and English scene text dataset, providing 50,000 full-labeled (polygon boxes and text labels) and 400,000 partial-labeled (only one text instance each image) samples. We only utilize the full-labeled training set and crop 243,063 text line images for our benchmark.
  • ArT [4] contains text samples captured in natural scenes with various text layouts (e.g., rotated text and curved texts). Here we obtain 49,951 cropped text images from the training set, and use them in our benchmark.
  • CTW [5] contains 30,000 annotated street view images with rich diversity, including planar, raised, and poorly-illuminated text images. It provides not only character boxes and labels, but also character attributes such as background complexity and appearance. Here we crop 191,364 text lines from both the training and testing sets.

We combine all the subdatasets, resulting in 636,455 text samples. We randomly shuffle these samples and split them at a ratio of 8:1:1, leading to 509,164 samples for training, 63,645 samples for validation, and 63,646 samples for testing.
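The shuffle-and-split step can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `split_sizes` and `shuffle_and_split`, the seed, and the rounding rule (each split takes its integer share of the samples still remaining, and the last split takes the rest) are assumptions. That rounding rule does reproduce the 509,164 / 63,645 / 63,646 counts reported above.

```python
import random


def split_sizes(n, ratios=(8, 1, 1)):
    """Return integer split sizes for n samples.

    Assumed rounding rule: each split takes the floor of its share of
    the samples that are still unassigned; the last split takes the rest.
    """
    sizes = []
    remaining_n = n
    remaining_ratio = sum(ratios)
    for r in ratios[:-1]:
        size = remaining_n * r // remaining_ratio
        sizes.append(size)
        remaining_n -= size
        remaining_ratio -= r
    sizes.append(remaining_n)
    return sizes


def shuffle_and_split(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle a copy of samples (hypothetical seed) and cut it into splits."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    splits = []
    start = 0
    for size in split_sizes(len(shuffled), ratios):
        splits.append(shuffled[start:start + size])
        start += size
    return splits


if __name__ == "__main__":
    print(split_sizes(636455))  # [509164, 63645, 63646]
    train, val, test = shuffle_and_split(range(100))
    print(len(train), len(val), len(test))  # 80 10 10
```

Using integer arithmetic avoids floating-point rounding surprises when the sample count is not divisible by the sum of the ratios.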

Web Dataset

To collect the web dataset, we utilize MTWI [6], which contains 20,000 Chinese and English web text images from 17 different categories on the Taobao website. The text samples appear in various scenes, typography, and designs. We derive 140,589 text images from the training set, and manually divide them at a ratio of 8:1:1, resulting in 112,471 samples for training, 14,059 samples for validation, and 14,059 samples for testing.

Document Dataset

We use the public repository Text Render [7] to generate document-style synthetic text images. More specifically, we uniformly sample the text length from 1 to 15. The corpus comes from wiki, films, amazon, and baike. The dataset contains 500,000 samples in total and is randomly divided into training, validation, and testing sets at a ratio of 8:1:1 (400,000 / 50,000 / 50,000).
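The uniform length sampling described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: `sample_text` and the toy corpus are hypothetical, and this is neither the Text Render API nor necessarily the exact procedure used to build the dataset.

```python
import random


def sample_text(corpus, rng, min_len=1, max_len=15):
    """Pick a random substring whose length is uniform in [min_len, max_len].

    corpus is assumed to be a non-empty string; the length is capped at
    the corpus size so that the slice is always valid.
    """
    length = rng.randint(min_len, max_len)  # both ends inclusive
    length = min(length, len(corpus))
    start = rng.randrange(len(corpus) - length + 1)
    return corpus[start:start + length]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, for repeatability only
    toy_corpus = "复旦大学视觉智能实验室中文文本识别基准数据集与基线方法"
    for _ in range(3):
        print(sample_text(toy_corpus, rng))
```

Each sampled string would then be passed to a renderer to produce one synthetic text-line image.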

Handwriting Dataset

We collect the handwriting dataset based on SCUT-HCCDoc [8], which captures Chinese handwritten images with cameras in unconstrained environments. Following the official settings, we derive 93,254 text lines for training and 23,389 for testing. To pursue more rigorous research, we manually split the original training set into two sets at a ratio of 4:1, resulting in 74,603 samples for training and 18,651 samples for validation. For convenience, we continue to use the original 23,389 samples for testing.

Overall, the amount of text samples for each dataset is shown as follows:

Setting      Dataset       Sample Size    Setting        Dataset       Sample Size
Scene        Training      509,164        Web            Training      112,471
             Validation    63,645                        Validation    14,059
             Testing       63,646                        Testing       14,059
Document     Training      400,000        Handwriting    Training      74,603
             Validation    50,000                        Validation    18,651
             Testing       50,000                        Testing       23,389

Baselines

We manually select seven representative methods as baselines, which are introduced as follows.

  • CRNN [9] is a typical CTC-based method that is widely used in academia and industry. It first sends the text image to a CNN to extract the image features, then adopts a two-layer LSTM to encode the sequential features. Finally, the output of the LSTM is fed to a CTC (Connectionist Temporal Classification) decoder to maximize the probability of all the paths towards the ground truth.

  • ASTER [10] is a typical rectification-based method aiming at tackling irregular text images. It introduces a Spatial Transformer Network (STN) to rectify the given text image into a more recognizable appearance. Then the rectified text image is sent to a CNN and a two-layer LSTM to extract the features. In particular, ASTER takes advantage of the attention mechanism to predict the final text sequence.

  • MORAN [11] is a representative rectification-based method. It first adopts a multi-object rectification network (MORN) to predict rectified pixel offsets in a weak supervision way (distinct from ASTER that utilizes STN). The output pixel offsets are further used for generating the rectified image, which is further sent to the attention-based decoder (ASRN) for text recognition.

  • SAR [12] is a representative method that takes advantage of 2-D feature maps for more robust decoding. In particular, it is mainly proposed to tackle irregular texts. On one hand, SAR adopts more powerful residual blocks in the CNN encoder for learning stronger image representation. On the other hand, different from CRNN, ASTER, and MORAN compressing the given image into a 1-D feature map, SAR adopts 2-D attention on the spatial dimension of the feature maps for decoding, resulting in a stronger performance in curved and oblique texts.

  • SRN [13] is a representative semantics-based method that utilizes self-attention modules to correct the errors of predictions. It proposes a parallel visual attention module followed by a self-attention network to capture the global semantic features through multi-way parallel transmission, resulting in significant performance improvement towards the recognition of irregular texts.

  • SEED [14] is a representative semantics-based method. It introduces a semantics module to extract a global semantic embedding and utilizes it to initialize the first hidden state of the decoder. Specifically, while inheriting the structure of ASTER, the decoder of SEED takes in the semantic embedding to provide a prior for the recognition process, thus showing superiority in recognizing low-quality text images.

  • TransOCR [15] is one of the representative Transformer-based methods. It was originally designed to provide text priors for the super-resolution task. It employs ResNet-34 as the encoder and self-attention modules as the decoder. Distinct from RNN-based decoders, the self-attention modules are more efficient at capturing the semantic features of the given text images.
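To make the CTC-based decoding mentioned for CRNN more concrete, here is a minimal sketch of greedy (best-path) CTC decoding at inference time. It is an illustration rather than the code used for the baselines: the function name `ctc_greedy_decode`, the blank index 0, and the toy character set are assumptions. The input is the per-frame argmax over class scores; consecutive repeats are merged and blanks are dropped.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame label sequence into the final label sequence.

    frame_ids: iterable of class indices, one per time step (argmax output).
    blank: index of the CTC blank symbol (assumed to be 0 here).
    """
    decoded = []
    prev = None
    for idx in frame_ids:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


if __name__ == "__main__":
    charset = ["<blank>", "复", "旦"]  # hypothetical toy character set
    frames = [0, 1, 1, 0, 2, 2, 0]
    print("".join(charset[i] for i in ctc_greedy_decode(frames)))  # 复旦
```

A blank between two identical labels keeps both, which is how CTC represents repeated characters.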

Here are the results of the baselines on four datasets. ACC / NED follow the percentage format and decimal format, respectively. Please click the hyperlinks to see the detailed experimental results, following the format of (index [pred] [gt]).

Baseline         Year    Dataset (ACC / NED)
                         Scene           Web             Document        Handwriting
CRNN [9]         2016    53.4 / 0.734    54.5 / 0.736    97.5 / 0.994    46.4 / 0.840
ASTER [10]       2018    54.5 / 0.695    52.3 / 0.689    93.1 / 0.989    38.9 / 0.720
MORAN [11]       2019    51.8 / 0.686    49.9 / 0.682    95.8 / 0.991    39.7 / 0.761
SAR [12]         2019    62.5 / 0.785    54.3 / 0.725    93.8 / 0.987    31.4 / 0.655
SRN [13]         2020    60.1 / 0.778    52.3 / 0.706    96.7 / 0.995    18.0 / 0.512
SEED [14]        2020    49.6 / 0.661    46.3 / 0.637    93.7 / 0.990    32.1 / 0.674
TransOCR [15]    2021    63.3 / 0.802    62.3 / 0.787    96.9 / 0.994    53.4 / 0.849
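For readers who want to recompute ACC and NED from (pred, gt) pairs, the following sketch shows one way to do it. It rests on assumptions: ACC is taken as the percentage of exactly matching predictions, and NED as 1 minus the edit distance divided by the longer string's length, averaged over samples. The paper gives the exact definitions; `edit_distance` and `evaluate` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(pairs):
    """Return (ACC in percent, NED in [0, 1]) for a list of (pred, gt)."""
    correct = 0
    ned_sum = 0.0
    for pred, gt in pairs:
        correct += int(pred == gt)
        max_len = max(len(pred), len(gt))
        # Two empty strings are treated as a perfect match.
        ned_sum += 1.0 if max_len == 0 else 1.0 - edit_distance(pred, gt) / max_len
    n = len(pairs)
    return 100.0 * correct / n, ned_sum / n


if __name__ == "__main__":
    acc, ned = evaluate([("复旦大学", "复旦大学"), ("复旦大字", "复旦大学")])
    print(f"{acc:.1f} / {ned:.3f}")  # 50.0 / 0.875
```

Under this definition a higher NED is better, which is consistent with the table reporting values close to 1 on the Document dataset.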

References

Datasets

[1] Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). ICDAR, 2017.

[2] Zhang R, Zhou Y, Jiang Q, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. ICDAR, 2019.

[3] Sun Y, Ni Z, Chng C K, et al. ICDAR 2019 competition on large-scale street view text with partial labeling-RRC-LSVT. ICDAR, 2019.

[4] Chng C K, Liu Y, Sun Y, et al. ICDAR2019 robust reading challenge on arbitrary-shaped text-RRC-ArT. ICDAR, 2019.

[5] Yuan T L, Zhu Z, Xu K, et al. A large Chinese text dataset in the wild. Journal of Computer Science and Technology, 2019.

[6] He M, Liu Y, Yang Z, et al. ICPR2018 contest on robust reading for multi-type web images. ICPR, 2018.

[7] text_render: https://github.com/Sanster/text_renderer

[8] Zhang H, Liang L, Jin L. SCUT-HCCDoc: A new benchmark dataset of handwritten Chinese text in unconstrained camera-captured documents. Pattern Recognition, 2020.

Methods

[9] Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI, 2016.

[10] Shi B, Yang M, Wang X, et al. Aster: An attentional scene text recognizer with flexible rectification. TPAMI, 2018.

[11] Luo C, Jin L, Sun Z. Moran: A multi-object rectified attention network for scene text recognition. PR, 2019.

[12] Li H, Wang P, Shen C, et al. Show, attend and read: A simple and strong baseline for irregular text recognition. AAAI, 2019.

[13] Yu D, Li X, Zhang C, et al. Towards accurate scene text recognition with semantic reasoning networks. CVPR, 2020.

[14] Qiao Z, Zhou Y, Yang D, et al. Seed: Semantics enhanced encoder-decoder framework for scene text recognition. CVPR, 2020.

[15] Chen J, Li B, Xue X. Scene Text Telescope: Text-Focused Scene Image Super-Resolution. CVPR, 2021.

Citation

Please consider citing this paper if you find it useful in your research. The bibtex-format citations of all relevant datasets and baselines are at link.

to be filled

Acknowledgements

We sincerely thank the researchers who collected the subdatasets for Chinese text recognition. We would also like to thank Teng Fu, Nanxing Meng, Ke Niu and Yingjie Geng for their feedback on this benchmark.

Copyright

The team includes Jingye Chen, Haiyang Yu, Jianqi Ma, Mengnan Guan, Xixi Xu, Xiaocong Wang, and Shaobo Qu, advised by Prof. Bin Li and Prof. Xiangyang Xue.

Copyright © 2021 Fudan-FudanVI. All Rights Reserved.
