Convert tables stored as images to an usable .csv file

Overview

Convert an image of numbers to a .csv file

This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for Python to process the given image and Tesseract for number recognition.

Output Example

The repository includes:

  • the source code of image2csv.py,
  • the tools.py file where useful functions are implemented,
  • the grid_detector.py file to perform automatic grid detection,
  • a folder with some files used for test.

The code is not well documented nor fully efficient as I'm a beginner in programming, and this project is a way for me to improve my skills, in particular in Python programming.

How to use the program

First of all, the user must install the needed packages:

$ pip install -r requirements.txt   

as well as Tesseract.

Then, in a python terminal, use the command line:

$ python image2csv.py --image path/to/image

There are a few optionnal arguments:

  • --path path/to/output/csv/file
  • --grid [False]/True
  • --visualization [y]/n
  • --method [fast]/denoize

and one can find their usage using the command line:

$ python image2csv.py --help

By default, the program will try to detect a grid automatically. This detection uses OpenCV's Hough transformation and Canny detection, so the user can tweak a few parameters for better processing in the grid_detector.py file.

When then program is running with manual grid detection, the user has to interact with it via its mouse and the terminal :

  1. the image is opened in a window for the user to draw a rectangle around the first (top left) number. As this rectangle is used as a base to create a grid afterward, keep in mind that all the numbers should fit into the box.
  2. A new window is opened showing the image with the drawn rectangle. Press any key to close and continue.
  3. Based on the drawn rectangle, a grid is created to extract each number one by one. This grid is controlled by the user via two "offset" values. The user has to enter those values in the terminal, then the image is opened in a window with the created grid. Press any key to close and continue. If the numbers does not fit into the grid, the user can change the offset values and repeat this step. When the grid matches the user's expectations, he can set both of the offset values to 0 to continue.
  4. The numbers are extracted from the image and the results are shown in the terminal. (be carefoul though, the indicated number of errors represents the number of errors encountered by Tesseract, but Tesseract can identify a wrong number which will not be counted as an error !)
  5. The .csv file is created with the numbers identified by Tesseract. If Tesseract finds an error, it will show up on the .csv file as an infinite value.

Hypothesis and limits

For the program to run correctly, the input image must verify some hypothesis (just a few simple ones):

  • for manual selection, the line and row width must be constants, as the build grid is just a repetition of the initial rectangle with offsets;
  • to use automatic grid detection, a full and clear grid, with external borders, must be visible;
  • it is recommended to have a good input image resolution, to control the offsets more easily.

At last, this program is not perfect (I know you thought so, with its smooth workflow and simple hypothesis, sorry to disappoint...) and does not work with decimal numbers... But does a great job on negatives ! Also the user must be careful with the slashed zero which seems to be identified by Tesseract as a six.

Credits

For image pre-processing in the tool.py file I used a useful function implemented by @Nitish9711 for his Automatic-Number-plate-detection (https://github.com/Nitish9711/Automatic-Number-plate-detection.git).

Owner
Beginning in the programming world with the help of @29jm, holy builder of the very special SnowflakeOS. Student at the École Centrale de Lille (FR).
CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner.

CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner. It is aimed to integrate this tool with several more features including providing a U

Ravi Prakash 3 Jun 27, 2021
Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video.

Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video. You can chose the cha

2 Jul 22, 2022
Py-price-monitoring - A Python price monitor

A Python price monitor This project was focused on Brazil, so the monitoring is

Samuel 1 Jan 04, 2022
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Amazon Web Services - Labs 3.3k Jan 04, 2023
CINECA molecular dynamics tutorial set

High Performance Molecular Dynamics Logging into CINECA's computer systems To logon to the M100 system use the following command from an SSH client ss

J. W. Dell 0 Mar 13, 2022
Transform-Invariant Non-Negative Matrix Factorization

Transform-Invariant Non-Negative Matrix Factorization A comprehensive Python package for Non-Negative Matrix Factorization (NMF) with a focus on learn

EMD Group 6 Jul 01, 2022
Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance companies

Insurance-Fraud-Claims Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance com

1 Jan 27, 2022
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
Manage large and heterogeneous data spaces on the file system.

signac - simple data management The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproduc

Glotzer Group 109 Dec 14, 2022
Employee Turnover Analysis

Employee Turnover Analysis Submission to the DataCamp competition "Can you help reduce employee turnover?"

Jannik Wiedenhaupt 1 Feb 13, 2022
Creating a statistical model to predict 10 year treasury yields

Predicting 10-Year Treasury Yields Intitially, I wanted to see if the volatility in the stock market, represented by the VIX index (data source), had

10 Oct 27, 2021
Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Damien Farrell 81 Dec 26, 2022
A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

Unnikrishnan 2 Dec 12, 2021
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
Very useful and necessary functions that simplify working with data

Additional-function-for-pandas Very useful and necessary functions that simplify working with data random_fill_nan(module_name, nan) - Replaces all sp

Alexander Goldian 2 Dec 02, 2021
Python package for processing UC module spectral data.

UC Module Python Package How To Install clone repo. cd UC-module pip install . How to Use uc.module.UC(measurment=str, dark=str, reference=str, heade

Nicolai Haaber Junge 1 Oct 20, 2021
Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Aryan Raj 7 Sep 04, 2022
Desafio proposto pela IGTI em seu bootcamp de Cloud Data Engineer

Desafio Modulo 4 - Cloud Data Engineer Bootcamp - IGTI Objetivos Criar infraestrutura como código Utuilizando um cluster Kubernetes na Azure Ingestão

Otacilio Filho 4 Jan 23, 2022