Camelot: PDF Table Extraction for Humans
Camelot is a Python library that can help you extract tables from PDFs!
Note: You can also check out Excalibur, the web interface to Camelot!
Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
|---|---|---|---|---|---|---|
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
Camelot also comes packaged with a command-line interface!
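Because `tables[0].df` is an ordinary pandas DataFrame, the usual pandas cleanup steps apply to it. The sketch below is illustrative only: it builds a small stand-in DataFrame from a few values in the table above instead of reading a PDF, assumes the cells arrive as text with the header row already promoted to column names, and uses a hypothetical helper named `to_numeric_columns`.

```python
import pandas as pd

# Stand-in for tables[0].df: three rows from the table above, kept as strings.
# Assumption: extracted cells are text and headers are already column names.
raw = pd.DataFrame(
    {
        "Cycle Name": ["2012_2", "2145_1", "4234_1"],
        "KI (1/km)": ["3.30", "0.68", "0.59"],
        "Improved Speed": ["5.9%", "2.4%", "8.5%"],
    }
)


def to_numeric_columns(df, plain_cols, percent_cols):
    """Return a copy of df with the named text columns converted to floats.

    Hypothetical helper for illustration; it is not part of Camelot.
    """
    out = df.copy()
    for col in plain_cols:
        out[col] = pd.to_numeric(out[col])
    for col in percent_cols:
        # Strip the trailing "%" and keep the value in percentage points.
        out[col] = pd.to_numeric(out[col].str.rstrip("%"))
    return out


clean = to_numeric_columns(raw, ["KI (1/km)"], ["Improved Speed"])
print(clean.dtypes)
```

With a real extraction you would pass `tables[0].df` in place of `raw`, after checking which rows hold the headers.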
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions here.
Why Camelot?
- Configurability: Camelot gives you control over the table extraction process with tweakable settings.
- Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
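The Metrics point above can be sketched as a small filter over each table's `parsing_report`. This is a minimal illustration, not Camelot's own API: `keep_good_tables` is a hypothetical helper, the thresholds are arbitrary assumptions, and the tables are stand-in objects so the snippet runs without a PDF. The first report copies the values from the example above; the second is made up to show a rejected table.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing_report passes both thresholds.

    Hypothetical helper; the default thresholds are illustrative assumptions.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if (
            report["accuracy"] >= min_accuracy
            and report["whitespace"] <= max_whitespace
        ):
            kept.append(table)
    return kept


# Stand-ins for extracted tables: anything with a parsing_report dict works.
candidates = [
    SimpleNamespace(
        parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}
    ),
    # Made-up report representing a poorly detected table.
    SimpleNamespace(
        parsing_report={"accuracy": 41.5, "whitespace": 63.0, "order": 2, "page": 1}
    ),
]

good = keep_good_tables(candidates)
print(len(good))
```

In practice you would pass the result of `camelot.read_pdf('foo.pdf')` instead of `candidates`, assuming the returned collection can be iterated table by table, and tune the thresholds against your own documents.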
See comparison with similar libraries and tools.
Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.
Installation
Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
Documentation
The documentation is available at http://camelot-py.readthedocs.io/.
Wrappers
- camelot-php provides a PHP wrapper on Camelot.
Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License; see the LICENSE file for details.