Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition

Overview

This repository contains the code for the paper: STN-OCR: A single Neural Network for Text Detection and Text Recognition

Please note that we refined our approach and released new source code. You can find the code here

Please use the new code if you want to experiment with FSNS-like data and our approach. It should also be easy to redo the text recognition experiments with the new code, although we did not release any code for that.

Structure of the repository

The folder datasets contains code related to the datasets used in the paper. datasets/svhn contains several scripts that can be used to create the SVHN-based ground truth files used in the experiments reported in section 4.2 of the paper; please see the readme in this folder for how to use the scripts. datasets/fsns contains scripts that can be used to first download the FSNS dataset, second extract the images from the downloaded files, and third restructure the contained ground truth (gt) files.

The folder mxnet contains all code used for training our networks.

Installation

In order to use the code you will need the following software environment:

  1. Install python3 (the code might work with python2, too, but this is untested)
  2. Consider using a virtualenv
  3. Install all requirements with pip install -r requirements.txt
  4. Clone and install warp-ctc from here
  5. Go into the folder mxnet/metrics/ctc and run python setup.py build_ext --inplace
  6. Clone the mxnet repository
  7. Check out the tag v0.9.3
  8. Add the warpctc plugin to the project by enabling it in the file config.mk
  9. Compile mxnet
  10. Install the python bindings of mxnet
  11. You should be ready to go!

Training

You can use this code to train models for three different tasks.

SVHN House Number Recognition

The file train_svhn.py is the entry point for training a network using our purpose-built SVHN datasets. As it stands, the file is ready to train a network capable of finding a single house number placed randomly on an image.

Example: centered_image

In order to do this, you need to follow these steps:

  1. Download the datasets

  2. Locate the folder generated/centered

  3. open train.csv and adapt the paths of all images to the path on your machine (do the same with valid.csv)

  4. make sure to prepare your environment as described in installation

  5. start the training by issuing the following command:

    python train_svhn.py <path to train.csv> <path to valid.csv> --gpus <gpu id you want to use> --log-dir <where to save the logs> -b <batch size you want to use> --lr 1e-5 --zoom 0.5 --char-map datasets/svhn/svhn_char_map.json

  6. Wait and enjoy.
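
Step 3 above (adapting the image paths) can be automated. The following is a minimal sketch, not part of the repository: it assumes that train.csv and valid.csv are tab-separated with the image path in the first column, and the function name rebase_image_paths as well as the old_root/new_root arguments are hypothetical.

```python
def rebase_image_paths(src_csv, dst_csv, old_root, new_root):
    """Copy src_csv to dst_csv, replacing the leading old_root of each image path.

    Assumption: every line is tab-separated and the first column is the
    image path. Lines whose path does not start with old_root are copied
    unchanged.
    """
    with open(src_csv) as src, open(dst_csv, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            # Split off the first column only; the labels stay untouched.
            path, sep, labels = line.partition("\t")
            if path.startswith(old_root):
                path = new_root + path[len(old_root):]
            dst.write(path + sep + labels + "\n")
```

Called as rebase_image_paths("train.csv", "train_local.csv", "<old prefix>", "<prefix on your machine>"), this writes a new file and leaves the downloaded one intact.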

If you want to do experiments on more challenging images, you might need to update some parts of the code in train_svhn.py. The parts you might want to update are located around line 40 in this file. Here you can change the maximum number of house numbers in the image (num_timesteps), the maximum number of characters per house number (labels_per_timestep), the number of RNN layers to use for predicting the localization (num_rnn_layers), and whether to use a BLSTM for predicting the localization (use_blstm).

A considerably more challenging dataset is contained in the folder medium_two_digits, or medium, in the datasets folder. Example: 2_digits_more_challenge

If you want to follow our experiments with svhn numbers placed in a regular grid you'll need to do the following:

  1. Download the datasets
  2. Locate the folder generated/easy
  3. open train.csv and adapt the paths of all images to the path on your machine (do the same with valid.csv)
  4. set num_timesteps and labels_per_timestep to 4 in train_svhn.py
  5. start the training using the following command: python train_svhn.py <path to train.csv> <path to valid.csv> --gpus <gpu id you want to use> --log-dir <where to save the logs> -b <batch size you want to use> --lr 1e-5
  6. If you are lucky it will work ;)

Text Recognition

Following our text recognition experiments might be a little difficult, because we cannot offer the entire dataset we used. It is, however, possible to perform the experiments based on the Synth-90k dataset provided by Jaderberg et al. here. After downloading and extracting this file you'll need to adapt the groundtruth file provided with this dataset to the format used by our code. Our format is quite simple: a csv file with tab-separated values. The first column is the absolute path to the image, and the rest of the line contains the labels corresponding to this image.
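
As an illustration of this format, here is a minimal sketch that writes such a groundtruth file. It is not taken from the repository: the function names and the char_to_label dictionary (character to numerical label) are hypothetical, and the mapping you use has to agree with the char map you pass to the training script.

```python
import os


def format_gt_line(image_path, labels):
    """Return one groundtruth line: absolute image path, then the labels,
    all separated by tabs."""
    fields = [os.path.abspath(image_path)] + [str(label) for label in labels]
    return "\t".join(fields)


def write_ground_truth(samples, out_path, char_to_label):
    """Write (image_path, word) pairs to out_path in the tab-separated format.

    char_to_label is a hypothetical dict mapping each character to its
    numerical label; a character missing from it raises a KeyError.
    """
    with open(out_path, "w") as gt_file:
        for image_path, word in samples:
            labels = [char_to_label[char] for char in word]
            gt_file.write(format_gt_line(image_path, labels) + "\n")
```

How the (image_path, word) pairs are obtained from the Synth-90k annotation files is left open here.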

To train the network you can use the train_text_recognition.py script. You can start this script in a similar manner to the train_svhn.py script.

FSNS

In order to redo our experiments on the FSNS dataset you need to perform the following steps:

  1. Download the fsns dataset using the download_fsns.py script located in datasets/fsns

  2. Extract the individual images using the tfrecord_to_image.py script located in datasets/fsns/tfrecord_utils (you will need to install tensorflow for doing that)

  3. Use the transform_gt.py script to transform the original FSNS groundtruth, which is based on a single line, into a groundtruth containing labels for each word individually. A possible usage of the transform_gt.py script could look like this:

    python transform_gt.py <path to original gt> datasets/fsns/fsns_char_map.json <path to gt that shall be generated>

  4. Because MXNet expects the blank label to be 0 for the training with CTC Loss, you have to use the swap_classes.py script in datasets/fsns and swap the class for space and blank in the gt, by issuing:

    python swap_classes.py <original gt> <swapped gt> 0 133

  5. After performing these steps you should be able to run the training by issuing:

    python train_fsns.py <path to generated train gt> <path to generated validation gt> --char-map datasets/fsns/fsns_char_map.json --blank-label 0
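
To make the class swap in step 4 concrete, the following sketch shows what swapping two classes means for a single groundtruth line. It is not the implementation of swap_classes.py; it assumes a tab-separated line with the image path in the first column and numerical labels in the remaining columns, and the function name swap_labels is hypothetical.

```python
def swap_labels(line, class_a, class_b):
    """Return the line with every label class_a replaced by class_b and
    vice versa; the image path and all other labels stay unchanged.

    Assumption: the line is tab-separated, path first, integer labels after.
    """
    path, *labels = line.rstrip("\n").split("\t")
    swapped = []
    for label in labels:
        value = int(label)
        if value == class_a:
            value = class_b
        elif value == class_b:
            value = class_a
        swapped.append(str(value))
    return "\t".join([path] + swapped)
```

With class_a=0 and class_b=133, as in the command of step 4, every 0 becomes 133 and every 133 becomes 0, so the blank label ends up as class 0.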

Observing the Training Progress

We've added a script that makes it possible to see how well the network performs at every step of the training. This progress is normally plotted to disk for each iteration and can later be used to create animations of the training progress (you can use the create_gif.py and create_video.py scripts located in mxnet/utils for this purpose). Besides plotting to disk, it is also possible to watch the progress directly while the training is running. To do so, follow these steps:

  1. start the show_progress.py script in mxnet/utils

  2. start the training with the following additional command line params:

    --send-bboxes --ip <localhost, or remote ip if you are working on a remote machine> --port <the port the show_progress.py script is running on (default is 1337)>

  3. enjoy!

This tool is especially helpful in determining whether the network is learning anything or not. We recommend that you always use this tool while training.

Evaluation

If you want to evaluate already trained models you can use the evaluation scripts provided in the mxnet folder. For evaluating a model you need to do the following:

  1. train or download a model

  2. choose the correct evaluation script and adapt it, if necessary (take care if you have changed the number of timesteps or the number of RNN layers)

  3. Get the dataset you want to evaluate the model on and adapt the groundtruth file to fit the format expected by our software. The expected format is a tab-separated csv file in which each line looks like this: <absolute path to image> \t <numerical labels, each label separated from the next by \t>

  4. run the chosen evaluation script like so

    python eval_<type>_model.py <path to model dir>/<prefix of model file> <number of epoch to test> <path to evaluation gt> <path to char map>

You can use eval_svhn_model.py for evaluating a model trained with CTC on the original svhn dataset, the eval_text_recognition_model.py script for evaluating a model trained for text recognition, and the eval_fsns_model.py for evaluating a model trained on the FSNS dataset.
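
Before running an evaluation it can help to check that the groundtruth file really follows the format from step 3. The sketch below is a hypothetical helper, not part of the repository; it only relies on the format as described above (absolute image path, then tab-separated numerical labels).

```python
import os


def parse_gt_line(line):
    """Split one groundtruth line into (image_path, list of integer labels).

    Raises ValueError if the path is not absolute or if a label is not an
    integer.
    """
    path, *labels = line.rstrip("\n").split("\t")
    if not os.path.isabs(path):
        raise ValueError("image path is not absolute: " + path)
    return path, [int(label) for label in labels]


def check_gt_file(gt_path):
    """Parse every non-empty line of gt_path and return the number of samples."""
    num_samples = 0
    with open(gt_path) as gt_file:
        for line in gt_file:
            if line.strip():
                parse_gt_line(line)
                num_samples += 1
    return num_samples
```

A ValueError from check_gt_file points at a line that the evaluation scripts would most likely not be able to read either.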

License

This code is licensed under the GPLv3 license. Please see further details in LICENSE.md.

Citation

If you are using this code, please cite the following publication:

@article{bartz2017stn,
  title={STN-OCR: A single Neural Network for Text Detection and Text Recognition},
  author={Bartz, Christian and Yang, Haojin and Meinel, Christoph},
  journal={arXiv preprint arXiv:1707.08831},
  year={2017}
}

A short note on code quality

The code contains a large number of workarounds for MXNet, as we were not able to find an easier way to do what we wanted to do. If you know a better way, please let us know, as we would like the code to be easier to understand than it is now.
