Generate text line images for training deep learning OCR model (e.g. CRNN)

Overview

Text Renderer

Text Renderer generates text line images for training deep learning OCR models (e.g. CRNN).

  • Modular design. You can easily add different components: Corpus, Effect, Layout.
  • Integrates with imgaug; see imgaug_example for usage.
  • Supports rendering multiple corpora on one image with different effects. Layout is responsible for arranging the corpora relative to each other.
  • Supports applying effects at different stages of the rendering process: corpus_effects, layout_effects, render_effects.
  • Generates vertical text.
  • Supports generating an lmdb dataset that is compatible with PaddleOCR; see Dataset.
  • Includes a web font viewer.
  • Corpus sampler: helps with character balancing.

Documentation

Run Example

Run the following commands to generate images using the example data:

git clone https://github.com/oh-my-ocr/text_renderer
cd text_renderer
python3 setup.py develop
pip3 install -r docker/requirements.txt
python3 main.py \
    --config example_data/example.py \
    --dataset img \
    --num_processes 2 \
    --log_period 10

The data is generated in the example_data/output directory. A labels.json file contains all annotations in the following format:

{
  "labels": {
    "000000000": "test",
    "000000001": "text2"
  },
  "sizes": {
    "000000000": [
      120,
      32 
    ],
    "000000001": [
      128,
      32 
    ]
  },
  "num-samples": 2
}
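
A minimal sketch of how these annotations could be consumed, assuming the file is plain JSON exactly as shown above. The load_annotations helper is hypothetical and not part of text_renderer. Each size pair is returned as stored, without interpreting which value is the width and which is the height.

```python
import json
from pathlib import Path


def load_annotations(labels_path):
    """Return a list of (image_id, text, size) tuples from a labels.json file."""
    data = json.loads(Path(labels_path).read_text(encoding="utf-8"))
    labels = data["labels"]
    sizes = data["sizes"]
    # "num-samples" is expected to match the number of label entries.
    if data["num-samples"] != len(labels):
        raise ValueError("num-samples does not match the number of labels")
    # Sort by id so the samples come back in generation order.
    return [
        (image_id, labels[image_id], tuple(sizes[image_id]))
        for image_id in sorted(labels)
    ]
```

Calling load_annotations("example_data/output/labels.json") would return one tuple per generated sample; the exact location of labels.json inside the output directory is an assumption here.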

You can also use --dataset lmdb to store the images in an lmdb file. The lmdb file contains the following keys:

  • num-samples
  • image-000000000
  • label-000000000
  • size-000000000
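
The key layout above can be sketched as follows. This is an illustration only: sample_keys and read_sample are hypothetical helpers, the nine-digit zero padding is inferred from the keys listed above, and the values are returned as raw bytes because their encoding is not described here.

```python
def sample_keys(index):
    """Build the three per-sample keys for a given sample index."""
    name = f"{index:09d}"  # nine-digit, zero-padded, as in the key list above
    return {
        "image": f"image-{name}".encode(),
        "label": f"label-{name}".encode(),
        "size": f"size-{name}".encode(),
    }


def read_sample(get, index):
    """Fetch the raw values of one sample through a get(key) -> bytes callable."""
    return {field: get(key) for field, key in sample_keys(index).items()}
```

With the lmdb Python package, the get method of a read transaction could be passed as the get argument; a plain dict's get method works the same way for testing.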

You can check the config file example_data/example.py to learn how to use text_renderer, or follow the Quick Start to learn how to set up a configuration.

Quick Start

Prepare file resources

  • Font files: .ttf, .otf, .ttc
  • Background images of any size, either from your business scenario or from publicly available datasets (COCO, VOC)
  • Corpus: text_renderer offers a wide variety of text sampling methods. To use them, consider two things when preparing the corpus:
  1. The corpus must be in the target language of your OCR model.
  2. The corpus should match your actual business needs, such as the education or medical field.
  • Charset file [optional but recommended]: OCR models in real-world scenarios (e.g. CRNN) usually support only a limited character set, so it is better to filter out characters outside that set during data generation. You can do this by setting the chars_file parameter.
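
To illustrate what charset filtering means, here is a rough sketch. It is not text_renderer's implementation: load_charset and filter_text are hypothetical helpers, and the one-character-per-line file layout is an assumption.

```python
from pathlib import Path


def load_charset(chars_file):
    """Read a charset file, assuming one character per line."""
    lines = Path(chars_file).read_text(encoding="utf-8").splitlines()
    # Keep only lines that hold exactly one character (a single space counts).
    return {line for line in lines if len(line) == 1}


def filter_text(text, charset):
    """Drop every character that is not in the charset."""
    return "".join(ch for ch in text if ch in charset)
```

For example, filter_text("héllo", set("helo")) returns "hllo", because "é" is outside the supported set.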

You can download pre-prepared file resources for this Quick Start from here:

Save these resource files in the same directory:

workspace
├── bg
│ └── background.png
├── corpus
│ └── eng_text.txt
└── font
    └── simsun.ttf
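
Before writing the config, it can help to confirm that the layout above is in place. The check below is an optional sketch; missing_resources is a hypothetical helper and only looks at the three sub-directories shown in the tree.

```python
from pathlib import Path

# Sub-directories of the workspace shown above.
EXPECTED_DIRS = ("bg", "corpus", "font")


def missing_resources(workspace):
    """Return the expected sub-directories that are absent or empty."""
    root = Path(workspace)
    missing = []
    for name in EXPECTED_DIRS:
        folder = root / name
        # A folder counts as missing if it does not exist or holds no files.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

An empty list from missing_resources("workspace") means every resource directory exists and contains at least one entry.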

Create config file

Create a config.py file in the workspace directory. A configuration file must define a configs variable, which is a list of GeneratorCfg objects.

The complete configuration file is as follows:

import os
from pathlib import Path

from text_renderer.effect import *
from text_renderer.corpus import *
from text_renderer.config import (
    RenderCfg,
    NormPerspectiveTransformCfg,
    GeneratorCfg,
    SimpleTextColorCfg,
)

CURRENT_DIR = Path(os.path.abspath(os.path.dirname(__file__)))


def story_data():
    return GeneratorCfg(
        num_image=10,
        save_dir=CURRENT_DIR / "output",
        render_cfg=RenderCfg(
            bg_dir=CURRENT_DIR / "bg",
            height=32,
            perspective_transform=NormPerspectiveTransformCfg(20, 20, 1.5),
            corpus=WordCorpus(
                WordCorpusCfg(
                    text_paths=[CURRENT_DIR / "corpus" / "eng_text.txt"],
                    font_dir=CURRENT_DIR / "font",
                    font_size=(20, 30),
                    num_word=(2, 3),
                ),
            ),
            corpus_effects=Effects(Line(0.9, thickness=(2, 5))),
            gray=False,
            text_color_cfg=SimpleTextColorCfg(),
        ),
    )


configs = [story_data()]

The configuration above does the following:

  1. Specifies the location of the resource files
  2. Specifies the text sampling method: 2 or 3 words are randomly selected from the corpus
  3. Configures some effects for generation
  4. Specifies font-related parameters: font_size, font_dir
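
The effect of num_word=(2, 3) can be pictured with the following sketch. It is not the WordCorpus implementation: sample_words is a hypothetical helper, and taking a run of consecutive words joined by spaces is an assumption made for illustration.

```python
import random


def sample_words(words, num_word=(2, 3), rng=random):
    """Pick a run of consecutive words whose length is drawn from num_word."""
    low, high = num_word
    count = rng.randint(low, high)  # inclusive on both ends
    count = min(count, len(words))  # never ask for more words than exist
    start = rng.randint(0, len(words) - count)
    return " ".join(words[start:start + count])
```

Passing a seeded random.Random instance as rng makes the sampled text reproducible, which is convenient when debugging a config.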

Run

Run main.py. It takes only 4 arguments:

  • config: Python config file path
  • dataset: Dataset format, img or lmdb
  • num_processes: Number of processes used
  • log_period: Period of log printing, in the range (0, 100)
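
For reference, the four arguments could be modeled with argparse as below. This is a sketch, not the parser in main.py: the default values and the float type of log_period are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the four documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the documented arguments")
    parser.add_argument("--config", required=True, help="Python config file path")
    parser.add_argument("--dataset", choices=["img", "lmdb"], default="img",
                        help="Dataset format")
    # Defaults below are illustrative assumptions, not values from main.py.
    parser.add_argument("--num_processes", type=int, default=2,
                        help="Number of processes used")
    parser.add_argument("--log_period", type=float, default=10,
                        help="Period of log printing, in the range (0, 100)")
    return parser
```

For example, build_parser().parse_args(["--config", "config.py", "--dataset", "lmdb"]) yields a namespace whose dataset attribute is "lmdb".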

All Effect/Layout Examples

Find all effect/layout config examples at link

  • bg_and_text_mask: Three images of the same width are merged horizontally. The result can be used to train GAN models such as EraseNet.

#   Name                                  Example
0   bg_and_text_mask                      bg_and_text_mask.jpg
1   char_spacing_compact                  char_spacing_compact.jpg
2   char_spacing_large                    char_spacing_large.jpg
3   color_image                           color_image.jpg
4   curve                                 curve.jpg
5   dropout_horizontal                    dropout_horizontal.jpg
6   dropout_rand                          dropout_rand.jpg
7   dropout_vertical                      dropout_vertical.jpg
8   emboss                                emboss.jpg
9   extra_text_line_layout                extra_text_line_layout.jpg
10  line_bottom                           line_bottom.jpg
11  line_bottom_left                      line_bottom_left.jpg
12  line_bottom_right                     line_bottom_right.jpg
13  line_horizontal_middle                line_horizontal_middle.jpg
14  line_left                             line_left.jpg
15  line_right                            line_right.jpg
16  line_top                              line_top.jpg
17  line_top_left                         line_top_left.jpg
18  line_top_right                        line_top_right.jpg
19  line_vertical_middle                  line_vertical_middle.jpg
20  padding                               padding.jpg
21  perspective_transform                 perspective_transform.jpg
22  same_line_layout_different_font_size  same_line_layout_different_font_size.jpg
23  vertical_text                         vertical_text.jpg
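
Since a bg_and_text_mask sample holds three equal-width images side by side, a training pipeline usually needs to split it back into panels. The sketch below uses plain nested lists so it has no dependencies; split_panels is a hypothetical helper, and which panel holds which content is not specified here, so the panels are returned in left-to-right order only.

```python
def split_panels(rows, num_panels=3):
    """Split an image, given as a list of pixel rows, into equal-width panels."""
    width = len(rows[0])
    if width % num_panels != 0:
        raise ValueError("image width is not divisible by the number of panels")
    step = width // num_panels
    # Slice the same column range out of every row, one range per panel.
    return [
        [row[i * step:(i + 1) * step] for row in rows]
        for i in range(num_panels)
    ]
```

With an array library the same split is a column slice per panel; the list version only shows the indexing.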

Contribution

  • Corpus: Feel free to contribute more corpus generators to the project. They do not need to be generic; business-specific generators, such as one that generates ID numbers, are also welcome.

Run in Docker

Build image

docker build -f docker/Dockerfile -t text_renderer .

The config file is specified through the CONFIG environment variable. In example.py, data is generated in the example_data/output directory, so we map this directory to the host.

docker run --rm \
-v `pwd`/example_data/docker_output/:/app/example_data/output \
--env CONFIG=/app/example_data/example.py \
--env DATASET=img \
--env NUM_PROCESSES=2 \
--env LOG_PERIOD=10 \
text_renderer
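
The four environment variables correspond to the four main.py arguments. One way such a mapping could look is sketched below; this is not the image's actual entrypoint, build_command is a hypothetical helper, and the fallback values are assumptions.

```python
import os


def build_command(env=None):
    """Translate the documented environment variables into a main.py command line."""
    env = os.environ if env is None else env
    return [
        "python3", "main.py",
        "--config", env["CONFIG"],  # required, no fallback
        # Fallback values below are illustrative assumptions.
        "--dataset", env.get("DATASET", "img"),
        "--num_processes", str(env.get("NUM_PROCESSES", "2")),
        "--log_period", str(env.get("LOG_PERIOD", "10")),
    ]
```

The returned list can be passed to subprocess.run without shell quoting concerns.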

Font Viewer

Start font viewer

streamlit run tools/font_viewer.py -- web /path/to/fonts_dir

Build docs

cd docs
make html
open _build/html/index.html

Citing text_renderer

If you use text_renderer in your research, please consider using the following BibTeX entry.

@misc{text_renderer,
  author =       {oh-my-ocr},
  title =        {text_renderer},
  howpublished = {\url{https://github.com/oh-my-ocr/text_renderer}},
  year =         {2021}
}