Awesome OCR

This list links to software tools, libraries, and literature related to Optical Character Recognition (OCR).

Contributions are welcome, as is feedback.

Software

OCR engines

  • tesseract - The definitive open source OCR engine, Apache 2.0
  • EasyOCR - OCR engine built on PyTorch by JaidedAI, Apache 2.0
  • ocropus - OCR engine based on LSTM, Apache 2.0
  • ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
  • kraken - Ocropus fork with sane defaults
  • gocr - OCR engine under the GNU General Public License, led by Joerg Schulenburg
  • Ocrad - The GNU OCR, GPL
  • ocular - Machine-learning OCR for historic documents
  • SwiftOCR - fast and simple OCR library written in Swift
  • attention-ocr - OCR engine using visual attention mechanisms
  • RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
  • simple-ocr-opencv and its fork - A simple Pythonic OCR engine using OpenCV and NumPy
  • Calamari - OCR Engine based on OCRopy and Kraken

Older and possibly abandoned OCR engines

  • Clara OCR - Open source OCR in C, GPL
  • Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
  • Eye - an experimental Java OCR (image-to-text) application
  • kognition - An omnifont OCR software for KDE
  • OCRchie - Modular Optical Character Recognition Software
  • ocre - o.c.r. easy
  • xplab - A GTK 2 tool for pattern matching
  • hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article), GPL

OCR file formats

hOCR

  • hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
  • hocr-spec - hOCR 1.2 specification
  • ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
  • hocr-parser - hOCR Specification Python Parser
  • hOCRTools - hOCR to ALTO conversion XSLT
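
As a rough illustration of what the hOCR tools above operate on: hOCR is HTML in which recognized words are typically marked up as elements with class `ocrx_word`, with a `bbox x0 y0 x1 y1` property in the `title` attribute. The sketch below uses only the Python standard library. `HocrWordParser`, `parse_bbox`, `extract_words`, and the `SAMPLE` markup are names made up for this example and are not part of hocr-tools or hocr-parser; real hOCR output may need more careful handling (for example, spans nested inside a word).

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for this example.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 109 92 173 116; x_wconf 93'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # Properties in the title attribute are separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for elements with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

For real documents, the dedicated tools listed above are likely a better choice than a hand-rolled parser like this one.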

ALTO XML

TEI

  • TEI-OCR - TEI customization for OCR generated layout and content information
  • TEI SIG on Libraries - Best Practices for TEI in Libraries
  • GDZ - METS/TEI-based GDZ document format

PAGE XML

  • PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
  • omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
  • py-pagexml - Python library for handling PAGE XML and OPF files.

OCR CLI

  • OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
  • Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
  • Ocrocis - Project manager interface for Ocropy, see also external project homepage
  • tesseract-recognize - Tesseract-based tool that outputs results in PAGE XML format (Docker image).

OCR GUI

  • moz-hocr-editor - Firefox add-on for editing hOCR files (discontinued)
  • qt-box-editor - Qt4 editor of tesseract-ocr box files.
  • ocr-gt-tools - Client-Server application for editing OCR ground truth.
  • Paperwork - Using scanners and OCR to grep paper documents the easy way.
  • Paperless - Scan, index, and archive all of your paper documents.
  • gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
  • VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
  • PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
  • OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
  • PRImA PAGE Viewer - Java-based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
  • LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
  • archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
  • nw-page-editor - Simple app for visual editing of PAGE XML files. Provides desktop and Docker-based server versions.

OCR Preprocessing

OCR as a Service

OCR evaluation

OCR libraries by programming language

Go

  • gosseract - Golang OCR library, wrapping Tesseract-ocr.

Java

  • Tess4J - Java Native Access bindings to Tesseract.
  • tess-two - Tools for compiling Tesseract on Android and Java API.

.NET

Object Pascal

PHP

Python

  • pytesseract - A Python wrapper for Google Tesseract.
  • pyocr - A Python wrapper for Tesseract and Cuneiform.
  • ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
  • tesserocr - A Python wrapper for the tesseract-ocr API

JavaScript

  • ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
  • gocr.js - JavaScript port (Emscripten) of gocr
  • ocrad.js - JavaScript port (Emscripten) of ocrad
  • tesseract.js - JavaScript port (Emscripten) of Tesseract
  • node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
  • node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.

Ruby

  • rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
  • ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
  • ocr_space - API wrapper for the free OCR service ocr.space. Includes a CLI.

Rust

  • tesseract.rs - Rust bindings for tesseract OCR.
  • leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.

R

Swift

  • Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
  • SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.

OCR training tools

  • glyph-miner - A system for extracting glyphs from early typeset prints
  • ocrodeg - Document image degradation for OCR data augmentation

Datasets

Ground Truth

  • Rescribe - Transcriptions of Caroline Minuscule Manuscripts, PDM 1.0

Literature

OCR-related publication and link lists

Blog Posts and Tutorials

OCR Showcases

  • abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
  • cvOCR - An OCR system for recognizing résumé or CV text, implemented in Python and C and based on tesseract
  • MathOCR - A printed scientific document recognition system, pre-alpha

Academic articles

2011 and before

2012

2013

2014

2015

2016

2017

2018

Owner
Konstantin Baierer