Camelot: PDF Table Extraction for Humans
Camelot is a Python library that can help you extract tables from PDFs!
Note: You can also check out Excalibur, the web interface to Camelot!
Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
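| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings |                 |                 |                |
|------------|-----------|---------------|----------------------|-----------------|-----------------|----------------|
|            |           |               | Improved Speed       | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2     | 3.30      | 1.3           | 5.9%                 | 9.5%            | 29.2%           | 17.4%          |
| 2145_1     | 0.68      | 11.2          | 2.4%                 | 0.1%            | 9.5%            | 2.7%           |
| 4234_1     | 0.59      | 58.7          | 8.5%                 | 1.3%            | 8.5%            | 3.3%           |
| 2032_2     | 0.17      | 57.8          | 21.7%                | 0.3%            | 2.7%            | 1.2%           |
| 4171_1     | 0.07      | 173.9         | 58.1%                | 1.6%            | 2.1%            | 0.5%           |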
Camelot also comes packaged with a command-line interface!
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions here.
Why Camelot?
- Configurability: Camelot gives you control over the table extraction process with tweakable settings.
- Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
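As a rough illustration of the metrics point, the sketch below keeps only the tables whose parsing report passes two thresholds. The helper name keep_good_tables and the threshold values are assumptions made for this example and are not part of Camelot; only the parsing_report keys ('accuracy', 'whitespace') come from the example output shown earlier. Stand-in objects are used so that the snippet runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing report passes both thresholds.

    `tables` is any iterable of objects exposing a `parsing_report` dict
    with 'accuracy' and 'whitespace' keys. The default thresholds are
    illustrative assumptions, not recommended values.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-in objects so the sketch runs without a PDF. With Camelot installed,
# pass the result of camelot.read_pdf('foo.pdf') instead.
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.50, "whitespace": 63.00, "order": 2, "page": 1}),
]
good = keep_good_tables(sample)
print(len(good))  # prints 1
```

Each surviving table can then be handed to pandas through its df attribute or exported with one of the to_* methods, so poorly parsed tables never reach the downstream workflow.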
See comparison with similar libraries and tools.
Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.
Installation
Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
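Whichever installation method you choose, one possible way to confirm that the package landed in the active environment is to query the installed distribution metadata. This is a minimal sketch that uses only the standard library (Python 3.8+); the helper name installed_version is an assumption made for this example, while camelot-py is the distribution name used in the install commands above.

```python
from importlib import metadata


def installed_version(dist_name="camelot-py"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("camelot-py is not installed in this environment")
else:
    print(f"camelot-py {version} is installed")
```

Note that this only checks the Python package. It does not verify that the external dependencies (tk and ghostscript) are present.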
Documentation
The documentation is available at http://camelot-py.readthedocs.io/.
Wrappers
- camelot-php provides a PHP wrapper on Camelot.
Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License; see the LICENSE file for details.

