Generate text line images for training deep learning OCR model (e.g. CRNN)

Overview

Text Renderer

Generate text line images for training deep learning OCR model (e.g. CRNN). example

  • Modular design. You can easily add different components: Corpus, Effect, Layout.
  • Integrate with imgaug, see imgaug_example for usage.
  • Support render multi corpus on image with different effects. Layout is responsible for the layout between multiple corpora
  • Support apply effects on different stages of rendering process corpus_effects, layout_effects, render_effects.
  • Generate vertical text.
  • Support generate lmdb dataset which compatible with PaddleOCR, see Dataset
  • A web font viewer.
  • Corpus sampler: helpful to perform character balance

Documentation

Run Example

Run following command to generate images using example data:

git clone https://github.com/oh-my-ocr/text_renderer
cd text_renderer
python3 setup.py develop
pip3 install -r docker/requirements.txt
python3 main.py \
    --config example_data/example.py \
    --dataset img \
    --num_processes 2 \
    --log_period 10

The data is generated in the example_data/output directory. A labels.json file contains all annotations in follow format:

{
  "labels": {
    "000000000": "test",
    "000000001": "text2"
  },
  "sizes": {
    "000000000": [
      120,
      32 
    ],
    "000000001": [
      128,
      32 
    ]
  },
  "num-samples": 2
}

You can also use --dataset lmdb to store image in lmdb file, lmdb file contains follow keys:

  • num-samples
  • image-000000000
  • label-000000000
  • size-000000000

You can check config file example_data/example.py to learn how to use text_renderer, or follow the Quick Start to learn how to setup configuration

Quick Start

Prepare file resources

  • Font files: .ttf.otf.ttc
  • Background images of any size, either from your business scenario or from publicly available datasets (COCO, VOC)
  • Corpus: text_renderer offers a wide variety of text sampling methods, to use these methods, you need to consider the preparation of the corpus from two perspectives:
  1. The corpus must be in the target language for which you want to perform OCR recognition
  2. The corpus should meets your actual business needs, such as education field, medical field, etc.
  • Charset file [Optional but recommend]: OCR models in real-world scenarios (e.g. CRNN) usually support only a limited character set, so it's better to filter out characters outside the character set during data generation. You can do this by setting the chars_file parameter

You can download pre-prepared file resources for this Quick Start from here:

Save these resource files in the same directory:

workspace
├── bg
│ └── background.png
├── corpus
│ └── eng_text.txt
└── font
    └── simsun.ttf

Create config file

Create a config.py file in workspace directory. One configuration file must have a configs variable, it's a list of GeneratorCfg.

The complete configuration file is as follows:

import os
from pathlib import Path

from text_renderer.effect import *
from text_renderer.corpus import *
from text_renderer.config import (
    RenderCfg,
    NormPerspectiveTransformCfg,
    GeneratorCfg,
    SimpleTextColorCfg,
)

CURRENT_DIR = Path(os.path.abspath(os.path.dirname(__file__)))


def story_data():
    return GeneratorCfg(
        num_image=10,
        save_dir=CURRENT_DIR / "output",
        render_cfg=RenderCfg(
            bg_dir=CURRENT_DIR / "bg",
            height=32,
            perspective_transform=NormPerspectiveTransformCfg(20, 20, 1.5),
            corpus=WordCorpus(
                WordCorpusCfg(
                    text_paths=[CURRENT_DIR / "corpus" / "eng_text.txt"],
                    font_dir=CURRENT_DIR / "font",
                    font_size=(20, 30),
                    num_word=(2, 3),
                ),
            ),
            corpus_effects=Effects(Line(0.9, thickness=(2, 5))),
            gray=False,
            text_color_cfg=SimpleTextColorCfg(),
        ),
    )


configs = [story_data()]

In the above configuration we have done the following things:

  1. Specify the location of the resource file
  2. Specified text sampling method: 2 or 3 words are randomly selected from the corpus
  3. Configured some effects for generation
  4. Specifies font-related parameters: font_size, font_dir

Run

Run main.py, it only has 4 arguments:

  • config:Python config file path
  • dataset: Dataset format img or lmdb
  • num_processes: Number of processes used
  • log_period: Period of log printing. (0, 100)

All Effect/Layout Examples

Find all effect/layout config example at link

  • bg_and_text_mask: Three images of the same width are merged together horizontally, it can be used to train GAN model like EraseNet
Name Example
0 bg_and_text_mask bg_and_text_mask.jpg
1 char_spacing_compact char_spacing_compact.jpg
2 char_spacing_large char_spacing_large.jpg
3 color_image color_image.jpg
4 curve curve.jpg
5 dropout_horizontal dropout_horizontal.jpg
6 dropout_rand dropout_rand.jpg
7 dropout_vertical dropout_vertical.jpg
8 emboss emboss.jpg
9 extra_text_line_layout extra_text_line_layout.jpg
10 line_bottom line_bottom.jpg
11 line_bottom_left line_bottom_left.jpg
12 line_bottom_right line_bottom_right.jpg
13 line_horizontal_middle line_horizontal_middle.jpg
14 line_left line_left.jpg
15 line_right line_right.jpg
16 line_top line_top.jpg
17 line_top_left line_top_left.jpg
18 line_top_right line_top_right.jpg
19 line_vertical_middle line_vertical_middle.jpg
20 padding padding.jpg
21 perspective_transform perspective_transform.jpg
22 same_line_layout_different_font_size same_line_layout_different_font_size.jpg
23 vertical_text vertical_text.jpg

Contribution

  • Corpus: Feel free to contribute more corpus generators to the project, It does not necessarily need to be a generic corpus generator, but can also be a business-specific generator, such as generating ID numbers

Run in Docker

Build image

docker build -f docker/Dockerfile -t text_renderer .

Config file is provided by CONFIG environment. In example.py file, data is generated in example_data/output directory, so we map this directory to the host.

docker run --rm \
-v `pwd`/example_data/docker_output/:/app/example_data/output \
--env CONFIG=/app/example_data/example.py \
--env DATASET=img \
--env NUM_PROCESSES=2 \
--env LOG_PERIOD=10 \
text_renderer

Font Viewer

Start font viewer

streamlit run tools/font_viewer.py -- web /path/to/fonts_dir

image

Build docs

cd docs
make html
open _build/html/index.html

Citing text_renderer

If you use text_renderer in your research, please consider use the following BibTeX entry.

@misc{text_renderer,
  author =       {oh-my-ocr},
  title =        {text_renderer},
  howpublished = {\url{https://github.com/oh-my-ocr/text_renderer}},
  year =         {2021}
}
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
Twewy-discord-chatbot - Build a Discord AI Chatbot that Speaks like Your Favorite Character

Build a Discord AI Chatbot that Speaks like Your Favorite Character! This is a Discord AI Chatbot that uses the Microsoft DialoGPT conversational mode

Lynn Zheng 231 Dec 30, 2022
A Fast Command Analyser based on Dict and Pydantic

Alconna Alconna 隶属于ArcletProject, 在Cesloi内有内置 Alconna 是 Cesloi-CommandAnalysis 的高级版,支持解析消息链 一般情况下请当作简易的消息链解析器/命令解析器 文档 暂时的文档 Example from arclet.alcon

19 Jan 03, 2023
Plugin repository for Macast

Macast-plugins Plugin repository for Macast. How to use third-party player plugin Download Macast from GitHub Release. Download the plugin you want fr

109 Jan 04, 2023
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
German Text-To-Speech Engine using Tacotron and Griffin-Lim

jotts JoTTS is a German text-to-speech engine using tacotron and griffin-lim. The synthesizer model has been trained on my voice using Tacotron1. Due

padmalcom 6 Aug 28, 2022
News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

NLP T5 Project proposal Topic Modeling and Clustering of News-Articles-and-Essays Students: Nasser Alshehri Abdullah Bushnag Abdulrhman Alqurashi OVER

2 Jan 18, 2022
The Easy-to-use Dialogue Response Selection Toolkit for Researchers

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

GMFTBY 32 Nov 13, 2022
A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk.

Simple-Vosk A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk. Check out the official Vosk G

2 Jun 19, 2022
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

1.4k Dec 30, 2022
Code for the paper "Language Models are Unsupervised Multitask Learners"

Status: Archive (code is provided as-is, no updates expected) gpt-2 Code and models from the paper "Language Models are Unsupervised Multitask Learner

OpenAI 16.1k Jan 08, 2023
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 09, 2022
DaCy: The State of the Art Danish NLP pipeline using SpaCy

DaCy: A SpaCy NLP Pipeline for Danish DaCy is a Danish preprocessing pipeline trained in SpaCy. At the time of writing it has achieved State-of-the-Ar

Kenneth Enevoldsen 71 Jan 06, 2023
Scene Text Retrieval via Joint Text Detection and Similarity Learning

This is the code of "Scene Text Retrieval via Joint Text Detection and Similarity Learning". For more details, please refer to our CVPR2021 paper.

79 Nov 29, 2022
Prithivida 690 Jan 04, 2023
Programme de chiffrement et de déchiffrement inverse d'un message en python3.

Chiffrement Inverse En Python3 Programme de chiffrement et de déchiffrement inverse d'un message en python3. Explication du chiffrement inverse avec c

Malik Makkes 2 Mar 26, 2022
AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

凌逆战 75 Dec 05, 2022
The RWKV Language Model

RWKV-LM We propose the RWKV language model, with alternating time-mix and channel-mix layers: The R, K, V are generated by linear transforms of input,

PENG Bo 877 Jan 05, 2023