Generate text line images for training deep learning OCR models (e.g. CRNN)

Overview

Text Renderer

Generate text line images for training deep learning OCR models (e.g. CRNN).

  • Modular design. You can easily add different components: Corpus, Effect, Layout.
  • Integrates with imgaug; see imgaug_example for usage.
  • Supports rendering multiple corpora on one image with different effects. Layout is responsible for arranging the corpora relative to each other.
  • Supports applying effects at different stages of the rendering process: corpus_effects, layout_effects, render_effects.
  • Generates vertical text.
  • Supports generating an lmdb dataset that is compatible with PaddleOCR; see Dataset.
  • A web font viewer.
  • Corpus sampler: helpful for balancing character frequencies.

Documentation

Run Example

Run the following commands to generate images using the example data:

git clone https://github.com/oh-my-ocr/text_renderer
cd text_renderer
python3 setup.py develop
pip3 install -r docker/requirements.txt
python3 main.py \
    --config example_data/example.py \
    --dataset img \
    --num_processes 2 \
    --log_period 10

The data is generated in the example_data/output directory. A labels.json file contains all annotations in the following format:

{
  "labels": {
    "000000000": "test",
    "000000001": "text2"
  },
  "sizes": {
    "000000000": [
      120,
      32 
    ],
    "000000001": [
      128,
      32 
    ]
  },
  "num-samples": 2
}
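
The annotation file can be read back with the standard library alone. The following is a minimal sketch, assuming labels.json sits directly in the output directory and has exactly the three top-level keys shown above; load_labels is a hypothetical helper name, not part of text_renderer, and the two numbers in each size entry are passed through without interpretation.

```python
import json
from pathlib import Path


def load_labels(output_dir):
    """Return a sorted list of (image_id, text, size) tuples from labels.json."""
    path = Path(output_dir) / "labels.json"
    data = json.loads(path.read_text(encoding="utf-8"))
    labels = data["labels"]
    sizes = data["sizes"]
    # Sanity check: num-samples should match the number of label entries.
    if data["num-samples"] != len(labels):
        raise ValueError("num-samples does not match the number of labels")
    return [
        (image_id, text, tuple(sizes[image_id]))
        for image_id, text in sorted(labels.items())
    ]
```

Calling load_labels("example_data/output") after a run should yield one tuple per generated image.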

You can also use --dataset lmdb to store images in an lmdb file. The lmdb file contains the following keys:

  • num-samples
  • image-000000000
  • label-000000000
  • size-000000000
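
The key names above follow a fixed pattern, so they can be built from a sample index. This is a rough sketch, assuming keys are ASCII bytes with a zero-based, nine-digit index as in the list above; sample_keys and read_sample are hypothetical helpers, and the values are returned as raw bytes because their encoding is not described here.

```python
def sample_keys(index):
    """Build the three per-sample keys for a zero-based sample index."""
    suffix = f"{index:09d}"  # nine digits, matching "image-000000000"
    return {
        name: f"{name}-{suffix}".encode("ascii")
        for name in ("image", "label", "size")
    }


def read_sample(txn, index):
    """Fetch the raw bytes stored for one sample.

    `txn` is any object with a get(key) method, such as an lmdb read
    transaction. Missing keys come back as None.
    """
    return {name: txn.get(key) for name, key in sample_keys(index).items()}
```

With the lmdb Python package installed, txn could come from lmdb.open(path, readonly=True).begin(); the total count is stored under the num-samples key.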

You can check the config file example_data/example.py to learn how to use text_renderer, or follow the Quick Start to learn how to set up a configuration.

Quick Start

Prepare file resources

  • Font files: .ttf, .otf, .ttc
  • Background images of any size, either from your business scenario or from publicly available datasets (COCO, VOC)
  • Corpus: text_renderer offers a wide variety of text sampling methods. To use them, prepare the corpus with two points in mind:
  1. The corpus must be in the language you want the OCR model to recognize
  2. The corpus should meet your actual business needs, such as the education or medical field
  • Charset file [optional but recommended]: OCR models in real-world scenarios (e.g. CRNN) usually support only a limited character set, so it is better to filter out characters outside that set during data generation. You can do this by setting the chars_file parameter
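
To make the charset idea concrete, here is a minimal sketch of filtering text against a character set. It illustrates the concept only and is not text_renderer's implementation; load_charset and filter_text are hypothetical names, and the charset file layout (characters listed line by line) is an assumption.

```python
def load_charset(lines):
    """Collect allowed characters from an iterable of lines."""
    charset = set()
    for line in lines:
        # Strip only line endings so that a line holding a space is kept.
        charset.update(line.rstrip("\r\n"))
    return charset


def filter_text(text, charset):
    """Drop every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in charset)
```

For example, with a charset of "a", "b" and a space, filter_text("a cab!", charset) keeps only the supported characters.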

You can download pre-prepared file resources for this Quick Start from here:

Save these resource files in the same directory:

workspace
├── bg
│ └── background.png
├── corpus
│ └── eng_text.txt
└── font
    └── simsun.ttf
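
Before writing the config, it can help to confirm that the layout above is in place. The sketch below is a hypothetical helper (check_workspace is not part of text_renderer) that only checks for the three directories shown and for at least one font file with a supported extension.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def check_workspace(root):
    """Return a list of problems found in the workspace layout (empty if none)."""
    root = Path(root)
    problems = []
    for name in ("bg", "corpus", "font"):
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}")
    font_dir = root / "font"
    if font_dir.is_dir() and not any(
        p.suffix.lower() in FONT_SUFFIXES for p in font_dir.iterdir()
    ):
        problems.append("no font files (.ttf/.otf/.ttc) in font/")
    return problems
```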

Create config file

Create a config.py file in the workspace directory. A configuration file must define a configs variable, which is a list of GeneratorCfg objects.

The complete configuration file is as follows:

import os
from pathlib import Path

from text_renderer.effect import *
from text_renderer.corpus import *
from text_renderer.config import (
    RenderCfg,
    NormPerspectiveTransformCfg,
    GeneratorCfg,
    SimpleTextColorCfg,
)

CURRENT_DIR = Path(os.path.abspath(os.path.dirname(__file__)))


def story_data():
    return GeneratorCfg(
        num_image=10,
        save_dir=CURRENT_DIR / "output",
        render_cfg=RenderCfg(
            bg_dir=CURRENT_DIR / "bg",
            height=32,
            perspective_transform=NormPerspectiveTransformCfg(20, 20, 1.5),
            corpus=WordCorpus(
                WordCorpusCfg(
                    text_paths=[CURRENT_DIR / "corpus" / "eng_text.txt"],
                    font_dir=CURRENT_DIR / "font",
                    font_size=(20, 30),
                    num_word=(2, 3),
                ),
            ),
            corpus_effects=Effects(Line(0.9, thickness=(2, 5))),
            gray=False,
            text_color_cfg=SimpleTextColorCfg(),
        ),
    )


configs = [story_data()]

In the configuration above we have done the following:

  1. Specified the locations of the resource files
  2. Specified the text sampling method: 2 or 3 words are randomly selected from the corpus
  3. Configured some effects for generation
  4. Specified font-related parameters: font_size, font_dir
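
The effect of num_word=(2, 3) can be pictured with a small standalone sketch. This is not the WordCorpus implementation: sample_words is a hypothetical function, and taking a run of consecutive words from a random start position is an assumption made for illustration.

```python
import random


def sample_words(words, num_word=(2, 3), rng=None):
    """Pick a random run of consecutive words whose length lies in num_word (inclusive)."""
    rng = rng or random.Random()
    low, high = num_word
    if len(words) < low:
        raise ValueError("corpus has fewer words than the minimum requested")
    # Never ask for more words than the corpus holds.
    count = rng.randint(low, min(high, len(words)))
    start = rng.randint(0, len(words) - count)
    return " ".join(words[start:start + count])
```

Passing a seeded random.Random instance as rng makes the sampled text reproducible.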

Run

Run main.py, which takes only 4 arguments:

  • config: Path to the Python config file
  • dataset: Dataset format, img or lmdb
  • num_processes: Number of processes used
  • log_period: Period of log printing, in the range (0, 100)
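
When generation is driven from another script, the four arguments can be assembled programmatically. The sketch below uses only the flags listed above; build_command is a hypothetical helper, and the defaults simply mirror the values used in the Run Example section.

```python
import sys


def build_command(config, dataset="img", num_processes=2, log_period=10):
    """Build the argument list for running main.py with the four documented flags."""
    if dataset not in ("img", "lmdb"):
        raise ValueError("dataset must be 'img' or 'lmdb'")
    return [
        sys.executable, "main.py",
        "--config", str(config),
        "--dataset", dataset,
        "--num_processes", str(num_processes),
        "--log_period", str(log_period),
    ]
```

From the repository root, the result can be handed to subprocess.run(build_command("workspace/config.py"), check=True).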

All Effect/Layout Examples

Find all effect/layout config examples at link

  • bg_and_text_mask: Three images of the same width are merged horizontally. The result can be used to train GAN models such as EraseNet
#   Name                                  Example
0   bg_and_text_mask                      bg_and_text_mask.jpg
1   char_spacing_compact                  char_spacing_compact.jpg
2   char_spacing_large                    char_spacing_large.jpg
3   color_image                           color_image.jpg
4   curve                                 curve.jpg
5   dropout_horizontal                    dropout_horizontal.jpg
6   dropout_rand                          dropout_rand.jpg
7   dropout_vertical                      dropout_vertical.jpg
8   emboss                                emboss.jpg
9   extra_text_line_layout                extra_text_line_layout.jpg
10  line_bottom                           line_bottom.jpg
11  line_bottom_left                      line_bottom_left.jpg
12  line_bottom_right                     line_bottom_right.jpg
13  line_horizontal_middle                line_horizontal_middle.jpg
14  line_left                             line_left.jpg
15  line_right                            line_right.jpg
16  line_top                              line_top.jpg
17  line_top_left                         line_top_left.jpg
18  line_top_right                        line_top_right.jpg
19  line_vertical_middle                  line_vertical_middle.jpg
20  padding                               padding.jpg
21  perspective_transform                 perspective_transform.jpg
22  same_line_layout_different_font_size  same_line_layout_different_font_size.jpg
23  vertical_text                         vertical_text.jpg
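
Because a bg_and_text_mask sample is three equal-width images merged horizontally, it can be cut back into its parts with simple arithmetic. This is a minimal sketch: split_boxes is a hypothetical helper, and which of the three parts holds which content is not stated here, so the boxes are returned in left-to-right order without labels.

```python
def split_boxes(width, height, parts=3):
    """Return (left, upper, right, lower) crop boxes for equal-width horizontal parts."""
    if width % parts != 0:
        raise ValueError("width is not divisible by the number of parts")
    part_width = width // parts
    return [
        (i * part_width, 0, (i + 1) * part_width, height)
        for i in range(parts)
    ]
```

The tuples use the (left, upper, right, lower) order that Pillow's Image.crop expects, so each box can be passed to it directly.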

Contribution

  • Corpus: Feel free to contribute more corpus generators to the project. A contribution does not need to be a generic corpus generator; it can also be business-specific, such as one that generates ID numbers

Run in Docker

Build image

docker build -f docker/Dockerfile -t text_renderer .

The config file is specified through the CONFIG environment variable. In example.py, data is generated in the example_data/output directory, so we map this directory to the host.

docker run --rm \
-v `pwd`/example_data/docker_output/:/app/example_data/output \
--env CONFIG=/app/example_data/example.py \
--env DATASET=img \
--env NUM_PROCESSES=2 \
--env LOG_PERIOD=10 \
text_renderer

Font Viewer

Start font viewer

streamlit run tools/font_viewer.py -- web /path/to/fonts_dir


Build docs

cd docs
make html
open _build/html/index.html

Citing text_renderer

If you use text_renderer in your research, please consider using the following BibTeX entry.

@misc{text_renderer,
  author =       {oh-my-ocr},
  title =        {text_renderer},
  howpublished = {\url{https://github.com/oh-my-ocr/text_renderer}},
  year =         {2021}
}