Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

Overview

doc2text


Signup for Announcements doc2text Example

doc2text extracts higher quality text by fixing common scan errors

Developing text corpora can be a massive pain in the butt. Much of the text data we are interested in as scientists are locked away in pdfs that are poorly scanned. These scans can be off kilter, poor resolution, have a hand in them... and if you OCR these scans without fixing these errors, the OCR doesn't turn out so well. doc2text was created to help researchers fix these errors and extract the highest quality text from their pdfs as possible.

doc2text is super duper alpha atm

doc2text is developed and tested on Ubuntu 16.04 LTS Xenial Xerus. We do not pretend to serve all operating systems at the moment because that would be irresponsible. Please use this software with a huge grain of salt. We are currently working on:

  • Increasing the responsiveness of the text block identifier.
  • Optimizing the binarization for tesseract detection.
  • Identifying text in multiple columns (right now, treats as one big column).
  • Handling tables.
  • Many other optimizations.

Support and Contributions

If you have feedback or would like to contribute, please, please submit a pull request or contact me at joseph dot sutherland at columbia dot edu.

Installation

To install the doc2text package, simply:

pip install doc2text

doc2text relies on the OpenCV, tesseract, and PythonMagick libraries. To execute the quick-install script, which installs OpenCV, tesseract, and PythonMagick:

curl https://raw.githubusercontent.com/jlsutherland/doc2text/master/install_deps.sh | bash

Manual installation

To install OpenCV manually:

sudo apt-get install -y build-essential
sudo apt-get install -y cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install -y python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev
git clone https://github.com/opencv/opencv.git opencv
git clone https://github.com/opencv/opencv_contrib.git opencv_contrib
cd opencv
git checkout 3.1.0
cd ../opencv_contrib
git checkout 3.1.0
cd ../opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D INSTALL_PYTHON_EXAMPLES=ON -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules -D BUILD_EXAMPLES=ON ..
make -j4
sudo make install
sudo ldconfig

To install tesseract manually:

sudo apt-get install tesseract-ocr

To install PythonMagick manually:

sudo apt-get install python-pythonmagick

Example usage

import doc2text

# Initialize the class.
doc = doc2text.Document()

# You can pass the lang (as 3 letters code) to the class to improve accuracy
# On ubuntu it requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()

Big thanks

doc2text would be nothing without the open-source contributions of:

Owner
Joe Sutherland
Head of Data Science at Search Discovery. @Columbia, @WUSTL grad. @WhiteHouse, @OFA alum.
Joe Sutherland
OpenCVを用いたカメラキャリブレーションのサンプルです。2021/06/21時点でPython実装のある3種類(通常カメラ向け、魚眼レンズ向け(fisheyeモジュール)、全方位カメラ向け(omnidirモジュール))について用意しています。

OpenCV-CameraCalibration-Example FishEyeCameraCalibration.mp4 OpenCVを用いたカメラキャリブレーションのサンプルです 2021/06/21時点でPython実装のある以下3種類について用意しています。 通常カメラ向け 魚眼レンズ向け(

KazuhitoTakahashi 34 Nov 17, 2022
Text page dewarping using a "cubic sheet" model

page_dewarp Page dewarping and thresholding using a "cubic sheet" model - see full writeup at https://mzucker.github.io/2016/08/15/page-dewarping.html

Matt Zucker 1.2k Dec 29, 2022
Pre-Recognize Library - library with algorithms for improving OCR quality.

PRLib - Pre-Recognition Library. The main aim of the library - prepare image for recogntion. Image processing can really help to improve recognition q

Alex 80 Dec 30, 2022
Here use convulation with sobel filter from scratch in opencv python .

Here use convulation with sobel filter from scratch in opencv python .

Tamzid hasan 2 Nov 11, 2021
This is a pytorch re-implementation of EAST: An Efficient and Accurate Scene Text Detector.

EAST: An Efficient and Accurate Scene Text Detector Description: This version will be updated soon, please pay attention to this work. The motivation

Dejia Song 544 Dec 20, 2022
CNN+LSTM+CTC based OCR implemented using tensorflow.

CNN_LSTM_CTC_Tensorflow CNN+LSTM+CTC based OCR(Optical Character Recognition) implemented using tensorflow. Note: there is No restriction on the numbe

Watson Yang 356 Dec 08, 2022
Memory tests solver with using OpenCV

Human Benchmark project This project is OpenCV based programs which are puzzle solvers for 7 different games for https://humanbenchmark.com/. made as

Bahadır Araz 24 Dec 27, 2022
Controlling the computer volume with your hands // OpenCV

HandsControll-AI Controlling the computer volume with your hands // OpenCV Step 1 git clone https://github.com/Hayk-21/HandsControll-AI.git pip instal

Hayk 1 Nov 04, 2021
📷 Face Recognition using Haar-Cascade Classifier, OpenCV, and Python

Face-Recognition-System Face Recognition using Haar-Cascade Classifier, OpenCV and Python. This project is based on face detection and face recognitio

1 Jan 10, 2022
Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

An Image is Worth 16x16 Words, What is a Video Worth? paper Official PyTorch Implementation Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor DAMO Academy, Al

213 Nov 12, 2022
Motion Detection Squid Game with OpenCV Python

*Motion Detection Squid Game with OpenCV Python i am newbie in python. In this project I made a simple game to follow the trend about the red light gr

Nayan 17 Nov 22, 2022
Fusion 360 Add-in that creates a pair of toothed curves that can be used to split a body and create two pieces that slide and lock together.

Fusion-360-Add-In-PuzzleSpline Fusion 360 Add-in that creates a pair of toothed curves that can be used to split a body and create two pieces that sli

Michiel van Wessem 1 Nov 15, 2021
Primary QPDF source code and documentation

QPDF QPDF is a command-line tool and C++ library that performs content-preserving transformations on PDF files. It supports linearization, encryption,

QPDF 2.2k Jan 04, 2023
Driver Drowsiness Detection with OpenCV & Dlib

In this project, we have built a driver drowsiness detection system that will detect if the eyes of the driver are close for too long and infer if the driver is sleepy or inactive.

Mansi Mishra 4 Oct 26, 2022
Repositório para registro de estudo da biblioteca opencv (Python)

OpenCV (Python) Objetivo do Repositório: Registrar avanços no estudo da biblioteca opencv. O repositório estará aberto a qualquer pessoa e há tambem u

1 Jun 14, 2022
This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

pdf-scraper-with-ocr With this tool I am aiming to facilitate the work of those who need to scrape PDFs either by hand or using tools that doesn't imp

Jacobo José Guijarro Villalba 75 Oct 21, 2022
list all open dataset about ocr.

ocr-open-dataset list all open dataset about ocr. printed dataset year Born-Digital Images (Web and Email) 2011-2015 COCO-Text 2017 Text Extraction fr

hongbomin 95 Nov 24, 2022
The Open Source Framework for Machine Vision

SimpleCV Quick Links: About Installation [Docker] (#docker) Ubuntu Virtual Environment Arch Linux Fedora MacOS Windows Raspberry Pi SimpleCV Shell Vid

Sight Machine 2.6k Dec 31, 2022
document image degradation

ocrodeg The ocrodeg package is a small Python library implementing document image degradation for data augmentation for handwriting recognition and OC

NVIDIA Research Projects 134 Nov 18, 2022
Let's explore how we can extract text from forms

Form Segmentation Let's explore how we can extract text from any forms / scanned pages. Objectives The goal is to find an algorithm that can extract t

Philip Doxakis 42 Jun 05, 2022