Deep learning based page layout analysis

Overview

Deep Learning Based Page Layout Analysis

This is a Python implementation of a page layout analysis tool. The goal of page layout analysis is to segment page images into different regions and recognize their classes. A semantic segmentation model is trained to predict a pixel-wise probability map, and a simple post-processing procedure is used to generate the final detection bounding boxes along with their labels and confidence scores. Here is what this code can do:

visual result

Requirements

This repository is mostly written in Python, so Python is essential to run the code. For now, only Python 2 is supported (Python 2.7 is tested); some minor modifications are needed to run this code on Python 3.

The core of this repository is DeepLab_v2, an image semantic segmentation model. We use a TensorFlow implementation of DeepLab_v2, DeepLab-ResNet-TensorFlow written by DrSleep, so TensorFlow needs to be installed before running this code. We use TensorFlow v1.4, and slight changes may be needed if you are using another version of TensorFlow. We also need all the requirements of DeepLab-ResNet-TensorFlow.

  • Cython>=0.19.2
  • numpy>=1.7.1
  • matplotlib>=1.3.1
  • Pillow>=2.3.0
  • six>=1.1.0

scikit-image is also required for image processing.

  • scikit-image>=0.13.1

To install all the required Python packages, you can just run

pip install -r requirements.txt

or, to upgrade already installed packages in place, run

pip install -U -r requirements.txt

Usage

The code is packaged both as a Python function and as a Python module with a main function; both produce exactly the same final detection results. To simplify usage, all parameters are fixed except the number of classes and the visualization flag. One can easily extend the function to accept any parameter that needs to be altered.

First, save all the images in a folder; all the images should be in 'jpg' format. Then, an output directory needs to be specified to save all the output predictions, which include the down-sampled images, probability maps, visualization results, and a JSON file with the final detection results. The output directory does not have to exist before running the code (if it does not, we will create one for you). Finally, you can run this code either by calling the function from a bash terminal or by importing the module in Python; either way will do.

Function

PageAnalyze.py is a good place to start. Just run this script and magic will happen.

python PageAnalyze.py --img_dir=./test/test_images \
                      --out_dir=./test/test_outputs \
                      --num_class=2 \
                      --save=True

Module

example.py shows the easiest way to import the module and call the main function.

import sys
# Add the path of the module and the Python scripts.
sys.path.append('utils/')

# Import the module.
import page_analyze

# The image directory containing all the images that need to be processed.
# All the images should be in '.jpg' format.
IMAGE_DIR = 'test/test_images/'

# The output directory for saving the output masks and results.
OUTPUT_DIR = 'test/test_outputs/'

# Number of classes: 2 or 4.
# 2 --- text / non-text: trained on CNKI data.
# 4 --- figure / table / equation / text: trained on POD data (beta).
CLASS_NUM = 2

# Call the main function of the module.
page_analyze.great_function(image_dir=IMAGE_DIR,
                            output_dir=OUTPUT_DIR,
                            class_num=CLASS_NUM,
                            save=True)

# The final detection results are saved in a single JSON file.
import json
with open(OUTPUT_DIR + 'results.json', 'r') as f:
    RESULTS = json.load(f)

Output

If the visualization flag is set to True, the visualization results will be saved at output_dir/predictions/. To save running time, this flag defaults to False.

The final results are encoded into a single JSON file at output_dir/results.json. Here is an example of the JSON file.

{
	"3005": 
	{
		"confs": [0.5356797385620915, 0.913904087544255, 0.7526443014705883, 0.9095219564478454, 0.8951748322262306, 0.6817004971002486, 0.9001002744772497, 0.9337936032277651, 0.8377339456847807, 0.7026428593008852, 0.8779250071028856, 0.8281628004780167, 0.8653182372135079, 0.7315979265269327, 0.5775715633847122, 0.6177185356901381], 
		"labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2], 
		"bboxs": [[130, 219, 158, 477], [395, 347, 484, 1990], [543, 725, 714, 1605], [800, 257, 1068, 2082], [1137, 1209, 2007, 2168], [1175, 185, 1230, 429], [1268, 171, 1910, 1123], [1897, 164, 2316, 1123], [2055, 1209, 2986, 2165], [2364, 175, 2567, 1120], [2691, 171, 2942, 1123], [3038, 1213, 3272, 2165], [3052, 168, 3261, 1123], [2594, 563, 2694, 749], [2608, 1055, 2684, 1113], [2979, 1464, 3048, 2161]]
	}, 
	
	"3004": 
	{
		"confs": [0.630786120259585, 0.7335585214477949, 0.8283411346491016, 0.7394772139625081, 0.6203790052606408, 0.7930634196637656, 0.9062854529210855, 0.8209901351845086, 0.9105478240273018, 0.6283956118923438, 0.9496875863021265, 0.8075845380525092, 0.9290070507407969, 0.899121940386255, 0.9245526964953498], 
		"labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2], 
		"bboxs": [[996, 1297, 1054, 1581], [1037, 212, 1201, 1026], [1102, 1208, 1259, 2115], [1807, 143, 2163, 1102], [1988, 1293, 2043, 1574], [2094, 1317, 2245, 2016], [2272, 1191, 2385, 2142], [2437, 1191, 2491, 1742], [2529, 1180, 3265, 2149], [2728, 164, 2820, 1067], [2875, 140, 3258, 1112], [219, 164, 1026, 2135], [1211, 191, 1776, 1037], [1287, 1235, 1981, 2105], [2197, 277, 2683, 985]]
	}
}
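
For reference, here is a minimal sketch of how to consume the results file. It only assumes the structure shown above: each top-level key is an image name, and confs, labels, and bboxs are parallel lists; the 0.8 confidence cutoff is just an example.

import json

# Load the final detection results produced by the pipeline.
with open('test/test_outputs/results.json', 'r') as f:
    results = json.load(f)

# Iterate over images; confs, labels, and bboxs are parallel lists.
for image_name, detections in results.items():
    zipped = zip(detections['confs'], detections['labels'], detections['bboxs'])
    for conf, label, bbox in zipped:
        # Keep only confident detections, e.g. above 0.8.
        if conf >= 0.8:
            print('%s: class %d at %s (conf %.3f)' % (image_name, label, bbox, conf))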

Descriptions

The pipeline of our model contains two stages: a semantic segmentation stage and a post-processing stage. We adopt DeepLab_v2 to perform semantic segmentation, and use the TensorFlow implementation DeepLab-ResNet-TensorFlow to train our segmentation model. For the post-processing stage, we obtain the bounding box locations along with their confidence scores and labels by analyzing the connected components of the probability map generated by the segmentation model, as sketched below.
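
The post-processing code itself is not listed in this README, but the connected component idea can be sketched with scikit-image (already in the requirements). This is a minimal sketch under assumptions: a single-class probability map, a hypothetical 0.5 binarization threshold, and the mean in-region probability taken as the confidence score; the boxes_from_prob_map name is illustrative and the actual procedure may differ.

import numpy as np
from skimage import measure

def boxes_from_prob_map(prob_map, threshold=0.5):
    """Sketch: turn one class's probability map into scored boxes."""
    # Binarize the probability map at the (assumed) threshold.
    binary = prob_map > threshold
    # Label the 8-connected components of the binary mask.
    labeled = measure.label(binary, connectivity=2)
    boxes, confs = [], []
    for region in measure.regionprops(labeled):
        # region.bbox is (min_row, min_col, max_row, max_col).
        boxes.append(list(region.bbox))
        # Use the mean probability inside the region as the confidence.
        rr, cc = region.coords[:, 0], region.coords[:, 1]
        confs.append(float(prob_map[rr, cc].mean()))
    return boxes, confs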

Training

In order to train the segmentation model, we prepared pixel-wise mask labels for the CNKI dataset and the POD dataset.

  • The CNKI dataset contains a total of 14,503 images with text / non-text annotations; most of them are in Chinese.

  • The POD dataset contains 11,333 images, all in English, with four-class labeling: figure / table / equation / text.

Note that the CNKI dataset is noisy because it was annotated automatically by software, while the POD dataset is much cleaner. Also, text is annotated as whole regions in the CNKI dataset, whereas each text line is labeled individually in the POD dataset. Here are examples from the CNKI and POD datasets.

data examples

We train two DeepLab models separately, one on the CNKI dataset and one on the POD dataset. We initialize the network with the model pre-trained on MSCOCO. Each model was trained for 200k steps with batch size 5 and randomly scaled 321*321 inputs. It took roughly 2 days for a model to converge on a single GTX 1080 GPU. During training, all other hyperparameters were left at their default values in DeepLab-ResNet-TensorFlow. The mAPs on the training sets of the two datasets are 0.909 and 0.837.
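
For reference, a training run with DrSleep's DeepLab-ResNet-TensorFlow would look roughly like the command below. This is a sketch, not our exact command: the data paths, list file, and checkpoint path are placeholders, and the flag names follow that repository's train.py, so they should be verified against the version you use.

python train.py --data-dir=/path/to/cnki_dataset \
                --data-list=/path/to/train_list.txt \
                --num-classes=2 \
                --batch-size=5 \
                --num-steps=200000 \
                --input-size=321,321 \
                --random-scale \
                --restore-from=/path/to/mscoco_pretrained_model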

This repository does not contain the training code because we want to keep the repository at an acceptable size (otherwise we would need to pack the data as well). But we do provide the trained models, model_cnki and model_pod, in the models folder for inference.

Testing

At inference time, there are four steps, each written as a Python module in the utils folder. Both the function and the module call the page_analyze module, which invokes these four modules in turn.

  • Configuration module generates the image list file and configuration dictionary.

  • Pre-processing module re-scales (down-samples) large images and dumps the scale factors to a JSON file. Due to GPU memory limits, we down-sample any image whose height exceeds 1000 pixels to a height of 1000 while keeping the aspect ratio (see the sketch after this list).

  • Segmentation module is the core module of this code. It sets up a FIFO queue for all the input images and feeds them into the deep neural network. The probability maps generated by the DeepLab model will be saved in the output directory.

  • Post-processing module reads the original images and the generated probability masks to get the bounding box locations along with their labels and confidence scores. If the save flag is set to True, the detection results will be drawn on the original images and saved in the output directory. The final results are written to the single JSON file mentioned above.
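
As a rough illustration of the pre-processing rule above, here is a minimal sketch using Pillow (already in the requirements). The 1000-pixel height cap comes from the description above; the downsample name and the returned scale factor are illustrative.

from PIL import Image

MAX_HEIGHT = 1000  # height cap dictated by GPU memory

def downsample(path):
    """Sketch: cap image height at MAX_HEIGHT, keeping the aspect ratio."""
    img = Image.open(path)
    w, h = img.size
    if h <= MAX_HEIGHT:
        return img, 1.0  # no re-scaling needed
    scale = MAX_HEIGHT / float(h)
    # Resize with antialiasing; the scale factor is dumped to JSON
    # so that boxes can be mapped back to the original resolution.
    resized = img.resize((int(round(w * scale)), MAX_HEIGHT), Image.ANTIALIAS)
    return resized, scale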

Here is a simple demo of the detection pipeline.

pipeline demo

Running Time

We conducted a simple running time analysis on our server (two 14-core E5-2680 v4 CPUs and two 8 GB GTX 1080 GPUs). The configuration module takes almost no time, and the running time of the remaining modules is linear in the number of images. So we ran the code on 51 test images and computed the average time per image for each module. On the GPU side we only use one GTX 1080, and on the CPU side we use multi-processing.

             pre_process      inference   post_process
CPU time     0.53s / 0.05s    3.98s       0.07s / 0.10s
GPU time     -                0.27s       -

  • For pre_process, a 3000*2000 image is quite large for deep learning, so we have to down-sample the input image (due to GPU memory limits), and this is the most time-consuming part. When we took the already down-sampled images as inputs and ran the code again, it took only 0.05s per image, because the inputs no longer need re-scaling.

  • For inference, this is a feed-forward procedure through the deep neural network, so the gap between CPU and GPU time is enormous. Note that we only use one GTX 1080, so it should be at least twice as fast on a more powerful GPU such as a Titan X.

  • For post_process, connected component analysis is usually time-consuming, but it is surprisingly fast in this case. Also, there is a slight difference between 2-class and 4-class classification: 0.07s versus 0.10s.

In general, it currently takes about one second to process one image. But if the input images are smaller, it is likely we can achieve 4 to 5 FPS, that is, 4 or 5 images per second, with the help of a good GPU of course.

Problems

We analyzed the weaknesses of this algorithm on the 51 test images, and the main problems come from the post-processing procedure. Since the DeepLab model achieves 0.909 mAP, there is not much room for improvement on the deep learning side. We categorize the problems into three types.

problem example

  • Fragmentary text regions (especially in captions). This is because the CNKI data annotates all captions as text regions, and these extremely small text regions lie very close to non-text regions (such as a figure or a table), which can harm the training of the deep neural network. So at inference time, the network may predict fragmented text regions on the captions, causing bad results.

  • Inseparable non-text regions (causing overlap between regions). There are only two classes (text / non-text) in the CNKI dataset, so the network cannot tell the difference between figures and tables. Sometimes, distinct but adjacent table and figure regions may be predicted together as one non-text region, which can overlap with other regions (and that is very bad for the recognition procedure afterwards).

  • Poor results on 4-class classification (different data distribution). The 4-class model is trained on the POD dataset, whose distribution differs from the CNKI dataset in language, layout, and text region granularity, so some bad results on the CNKI test set are inevitable when using the model trained on POD data. (We have already beaten the second place in the POD competition by training the figure / table / equation model and using essentially the same post-processing procedure.)

Todo

  • Improve the post-processing procedure to get better results.

  • Modify the code to run on Python 3.

Statements

  • Sorry, we cannot make the source code public yet.

  • For more details, please refer to our paper:

@inproceedings{li2018deeplayout,
  title={DeepLayout: A Semantic Segmentation Approach to Page Layout Analysis},
  author={Li, Yixin and Zou, Yajun and Ma, Jinwen},
  booktitle={International Conference on Intelligent Computing},
  pages={266--277},
  year={2018},
  organization={Springer}
}
