~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.

Last update: Dec 06, 2022

Overview

cosc428-structor

I had an open-ended Computer Vision assignment to complete, and an out-of-copyright book that I wanted to turn into an ebook. Conventional OCR engines like Tesseract weren't able to accurately recognise the page structure, which led to many transcription errors. If I could tell Tesseract to ignore certain regions (like images or repeated headers), then I could greatly reduce the number of errors in the resulting ebook. Thus: for my assignment, I wrote a program that takes an image and uses computer vision magick to determine the page's structure. So far, my program can detect and locate:

lines of text,
paragraphs,
section titles,
images and their associated captions,
boilerplate like page numbers, and
chapter titles.

Ain't it grand?

Dependencies

The project is written in Python 2.7.3 and uses the cv2 library to interact with OpenCV. It also uses numpy for some of the mathematical operations. On Windows, the best way to get these dependencies is to install the Python(x,y) suite (https://code.google.com/p/pythonxy/), which combines Python with a customisable set of scientific computing libraries.

Program Structure

The program's root is main.py, but this simply iterates through images in a folder and constructs a Page instance from each image. Thus, the real work happens in page.py.
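As a rough illustration of that driver loop, here is a minimal sketch. It is not the actual main.py: the Page class below is a hypothetical stand-in for the one in page.py, and the names list_images, process_folder, show_steps and IMAGE_EXTENSIONS are assumptions made for this example.

```python
import os

# Assumed set of image extensions; the real program may accept others.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.tif', '.tiff')


class Page(object):
    """Hypothetical stand-in for the Page class in page.py."""

    def __init__(self, path, show_steps=False):
        self.path = path
        self.show_steps = show_steps


def list_images(folder):
    """Return image paths in `folder`, sorted so pages are processed in order."""
    names = sorted(os.listdir(folder))
    return [os.path.join(folder, name) for name in names
            if name.lower().endswith(IMAGE_EXTENSIONS)]


def process_folder(folder, show_steps=False):
    """Build one Page per image; the analysis itself belongs in the constructor."""
    return [Page(path, show_steps) for path in list_images(folder)]
```

Calling process_folder('./images') would then yield one Page object per image file, skipping anything that is not an image.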

page.py contains a few utility methods and the Page class. The constructor calls the appropriate methods in order to determine the logical structure of the page. This structure is stored in three objects: self.margin, self.content, and self.boilerplate (which contains such non-content text objects as the page number and header).

The getBuildingBlocks method is responsible for finding words, grouping words into textual lines, discarding marginal noise, and fitting a Margin instance around the remaining lines. Most of these tasks are performed by calling other functions.
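The steps after word detection can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in page.py: the Box type, the vertical-centre grouping rule, the min_height noise threshold, and every function name below are hypothetical, and word detection itself (which would use OpenCV) is left out.

```python
from collections import namedtuple

# Axis-aligned box in pixel coordinates; (x, y) is the top-left corner.
Box = namedtuple('Box', 'x y w h')


def union(boxes):
    """Smallest box that encloses every box in `boxes`."""
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.w for b in boxes)
    bottom = max(b.y + b.h for b in boxes)
    return Box(left, top, right - left, bottom - top)


def group_into_lines(words):
    """Group word boxes into text lines.

    A word joins the current line when its vertical centre falls inside
    that line's vertical extent; otherwise it starts a new line.
    """
    lines = []
    for word in sorted(words, key=lambda b: (b.y, b.x)):
        centre = word.y + word.h / 2.0
        if lines:
            extent = union(lines[-1])
            if extent.y <= centre <= extent.y + extent.h:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Order the words in each line from left to right.
    return [sorted(line, key=lambda b: b.x) for line in lines]


def drop_marginal_noise(lines, min_height=5):
    """Discard lines shorter than `min_height` pixels (an assumed heuristic)."""
    return [line for line in lines if union(line).h >= min_height]


def fit_margin(lines):
    """Bounding box around all remaining lines."""
    return union([union(line) for line in lines])


words = [Box(10, 10, 30, 12), Box(50, 11, 40, 12),
         Box(10, 40, 35, 12), Box(55, 41, 20, 12),
         Box(2, 80, 3, 2)]  # the last box is a small speck
lines = drop_marginal_noise(group_into_lines(words))
margin = fit_margin(lines)
```

With the sample boxes, the speck ends up on a line of its own and is filtered out before the margin is fitted, so it does not stretch the margin towards the page edge.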

The self.content object is found by passing the set of lines to the Content() constructor. This uses a state machine to group lines into figures, paragraphs, section titles, etc. The Content class, along with a class for each content type, is found in content.py.
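To give a feel for the state-machine idea, here is a deliberately small sketch. It is not the logic in content.py: the Line features (indented, centred), the two states, and the transition rules are all assumptions chosen for illustration, and figures and captions are omitted.

```python
from collections import namedtuple

# Hypothetical per-line features; the real program works on detected text lines.
Line = namedtuple('Line', 'text indented centred')


def group_lines(lines):
    """Return a list of (kind, [text, ...]) blocks in reading order.

    Assumed rules: centred lines form titles, an indented line opens a new
    paragraph, and any other line continues the current paragraph.
    """
    blocks = []
    state = 'start'
    for line in lines:
        if line.centred:
            new_state = 'title'
            starts_block = state != 'title'
        else:
            new_state = 'paragraph'
            # Leaving a title, or meeting an indented line, opens a paragraph.
            starts_block = state != 'paragraph' or line.indented
        if starts_block:
            blocks.append((new_state, []))
        blocks[-1][1].append(line.text)
        state = new_state
    return blocks


sample = [
    Line('CHAPTER I', False, True),
    Line('It was a dark', True, False),
    Line('and stormy night.', False, False),
    Line('The rain fell.', True, False),
]
blocks = group_lines(sample)
```

For the sample input this produces one title block followed by two paragraph blocks. Keeping the current state in a single variable makes it straightforward to add further states, such as one for figure captions.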

The other files can generally be ignored when trying to understand the program; they are largely just convenience classes which represent page elements (such as points, geometric lines, words, text lines, and boxes), as well as supporting tools such as the Stopwatch.

How to Run the Code

Run main.py using the python interpreter. This will process each page in ./images, and for each page a series of 'snapshot' images will be displayed in order to illustrate the algorithm. To show only the final result for each image, set showSteps in main.py to False.

Comments

The getBuildingBlocks

Hello. I have recently been working on a document layout analysis task, and the description in README.md matches it very closely. However, when I run the code as described under "How to Run the Code" in README.md, the output only shows some red lines on each pair of words, and there is no result for the detection and location of "lines of text", "paragraphs", "section titles", etc. I would like to know what has happened to the code. Thank you very much.

opened by lvbohui 3

Releases(v1.0)

v1.0(Nov 7, 2013)

This is the version that I used to write the first draft of my conference paper.

Owner

Chad Oliver
