Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Overview

Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh

Paper: https://arxiv.org/abs/2111.13327

Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but data labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text recognition. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking.

We generated over 20 million synthetic text images and collected over 7,000 manually labeled text images, TC-STR 7k-word, as a benchmark. Experimental results show that a text recognition model can achieve much better accuracy either by training from scratch with our generated synthetic data or by further fine-tuning with TC-STR 7k-word.

Synthetic Dataset: TCSynth

Inspired by MJSynth, SynthText and Belval/TextRecognitionDataGenerator, we propose a framework for generating scene text images for Traditional Chinese. To produce synthetic text images similar to real-world ones, we apply a variety of rendering mechanisms, including word sampling, character spacing, font types/sizes, text coloring, text stroking, text skewing/distorting, background rendering, text location, and noise.

Figure: synthetic text generation pipeline (synth_text_pipeline)

The TCSynth dataset includes 21,535,590 synthetic text images.

The TCSynth-VAL dataset includes 6,000 synthetic text images for validation.

LMDB Format

After untarring,

TCSynth/
├── data.mdb
└── lock.mdb

Our LMDB data structure follows the clovaai/deep-text-recognition-benchmark repository. The value stored under the key 'num-samples'.encode() is the total number of text images. Image indexes start from 1. Given an index, the binary data of the image and its label can be queried with the keys 'image-%09d'.encode() % index and 'label-%09d'.encode() % index, respectively. The implementation details are in the class LmdbConnector in lmdb_tools/lmdb_connector.py.
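The key layout described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the repository's LmdbConnector: the helper names image_key, label_key and read_sample are hypothetical, the lmdb package is assumed to be installed, and labels and the sample count are assumed to be stored as UTF-8 text, as in clovaai/deep-text-recognition-benchmark.

```python
def image_key(index: int) -> bytes:
    """Build the key of the encoded image for a 1-based sample index."""
    return 'image-%09d'.encode() % index


def label_key(index: int) -> bytes:
    """Build the key of the label for a 1-based sample index."""
    return 'label-%09d'.encode() % index


def read_sample(lmdb_dir: str, index: int):
    """Return (image_bytes, label) for one sample (hypothetical helper).

    lmdb_dir is the folder that contains data.mdb and lock.mdb.
    """
    # Third-party dependency, imported here so the key helpers above
    # stay usable without it.
    import lmdb

    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    try:
        with env.begin(write=False) as txn:
            # Assumption: the count is stored as a decimal string.
            num_samples = int(txn.get('num-samples'.encode()).decode('utf-8'))
            if not 1 <= index <= num_samples:
                raise IndexError(f'index must be in [1, {num_samples}]')
            image_bytes = txn.get(image_key(index))
            # Assumption: labels are UTF-8 encoded text.
            label = txn.get(label_key(index)).decode('utf-8')
    finally:
        env.close()
    return image_bytes, label
```

For example, read_sample('TCSynth', 1) would return the raw bytes of the first image and its label; the bytes can then be decoded with any image library.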

We also provide several tools for manipulating LMDB datasets in lmdb_tools. Install the dependencies before using them (tested with Python 3.6):

pip install -r lmdb_tools/requirements.txt
  • Insert images into LMDB
python lmdb_tools/prepare_lmdb.py \
  --input_dir IMG_FOLDER \
  --gt_file GT \
  --output_dir LMDB_FOLDER
  • Insert images into LMDB (asynchronous version)
python lmdb_tools/prepare_lmdb_async.py \
  --input_dir IMG_FOLDER \
  --gt_file GT \
  --output_dir LMDB_FOLDER \
  --workers WORKERS
  • Extract images from LMDB (asynchronous version; converts LMDB format to raw format)
python lmdb_tools/extract_to_files.py \
  --input_lmdb LMDB_FOLDER \
  --output_dir IMG_FOLDER \
  --workers WORKERS

Raw Format

After untarring,

TCSynth_raw/
├── labels.txt
├── 0000/
│   ├── 00000001.jpg
│   ├── 00000002.jpg
│   ├── 00000003.jpg
│   └── ...
├── 0001/
├── 0002/
└── ...

Each line of labels.txt has the format {imagepath}\t{label}\n, for example:

0000/00000001.jpg 㒓
...
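A labels file in this tab-separated format could be read along the following lines. This is a minimal sketch; the function name parse_label_lines is hypothetical and not part of the repository, and the file is assumed to be UTF-8 encoded.

```python
from typing import Iterable, Iterator, Tuple


def parse_label_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (image_path, label) pairs from '{imagepath}\\t{label}\\n' lines."""
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the label is kept intact.
        image_path, label = line.split('\t', 1)
        yield image_path, label


# Small in-memory example using the line shown above.
sample_lines = ['0000/00000001.jpg\t㒓\n']
pairs = list(parse_label_lines(sample_lines))
```

With the real dataset, the same function can be fed an open file, for example open('TCSynth_raw/labels.txt', encoding='utf-8'), and iterated lazily, which avoids loading all labels into memory at once.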

Labeled Data: TC-STR 7k-word

Our TC-STR 7k-word dataset consists of 7,543 cropped text images produced from about 1,554 images collected through Google image search. To increase the diversity of the collected scene text images, we searched for images under different scenarios and query keywords. Since the collected scene text images are intended for evaluating text recognition performance, we manually cropped the text from the collected images and assigned a label to each cropped text box.

Figure: TC-STR 7k-word examples (TC-STR_demo)

The TC-STR 7k-word dataset includes a training set of 3,837 text images and a testing set of 3,706 text images.

After untarring,

TC-STR/
├── train_labels.txt
├── test_labels.txt
└── images/
    ├── xxx_1.jpg
    ├── xxx_2.jpg
    ├── xxx_3.jpg
    └── ...

Each line of xxx_labels.txt has the format {imagepath}\t{label}\n, for example:

images/billboard_00000_010_雜貨鋪.jpg 雜貨鋪
images/sign_02616_999_民生路.png 民生路
...

Citation

Please consider citing this work in your publications if it helps your research.

@article{chen2021traditional,
  title={Traditional Chinese Synthetic Datasets Verified with Labeled Data for Scene Text Recognition},
  author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
  journal={arXiv preprint arXiv:2111.13327},
  year={2021}
}