A Python library for extracting text from PDFs while preserving the formatting of the PDF content.

Overview

Multilingual PDF to Text

Install Package from PyPI

  1. Install it using pip.
pip install multilingual-pdf2text

The library uses Tesseract, which can be installed by following these instructions:

Tesseract Installation
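
Before running an extraction, it can help to confirm that the Tesseract binary is reachable. The following is a minimal sketch, not part of this library: the helper name `tesseract_available` is hypothetical, and it assumes the executable is called `tesseract` and is expected on the system PATH.

```python
import shutil


def tesseract_available(binary_name: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH.

    Hypothetical helper for illustration; the default name "tesseract"
    is an assumption about how the binary is installed on your system.
    """
    return shutil.which(binary_name) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; install it before running the extraction")
```

If the check fails even though Tesseract is installed, the install directory is probably missing from PATH.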

Example Usage

  1. Use it in your code:
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract, configured with its path and OCR language
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
    )
    # Run the extraction and print the result
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
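
The example above uses a hard-coded absolute path, so a small wrapper that checks the path first may be convenient. This is only a sketch: `extract_text` is a hypothetical helper name, the default language `eng` is an assumption, and the only library calls used are the `Document` and `PDF2Text` calls shown in the example.

```python
from pathlib import Path


def extract_text(pdf_path: str, language: str = "eng"):
    """Hypothetical helper: validate the path, then run the extraction.

    Raises FileNotFoundError if pdf_path does not point to a file.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")

    # Imported lazily so the path check works even if the package
    # is not installed yet.
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document

    document = Document(document_path=str(path), language=language)
    return PDF2Text(document=document).extract()


if __name__ == "__main__":
    # "example.pdf" is a placeholder; point it at a real file.
    print(extract_text("example.pdf", language="spa"))
```

Failing early on a missing file gives a clearer error than letting the OCR pipeline fail later.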

Tesseract supports the following languages. Pass the code (for example, spa) as the language argument of Document:
Code Language

  • afr Afrikaans
  • amh Amharic
  • ara Arabic
  • asm Assamese
  • aze Azerbaijani
  • aze_cyrl Azerbaijani - Cyrillic
  • bel Belarusian
  • ben Bengali
  • bod Tibetan
  • bos Bosnian
  • bul Bulgarian
  • cat Catalan; Valencian
  • ceb Cebuano
  • ces Czech
  • chi_sim Chinese - Simplified
  • chi_tra Chinese - Traditional
  • chr Cherokee
  • cym Welsh
  • dan Danish
  • deu German
  • dzo Dzongkha
  • ell Greek, Modern (1453-)
  • eng English
  • enm English, Middle (1100-1500)
  • epo Esperanto
  • est Estonian
  • eus Basque
  • fas Persian
  • fin Finnish
  • fra French
  • frk German Fraktur
  • frm French, Middle (ca. 1400-1600)
  • gle Irish
  • glg Galician
  • grc Greek, Ancient (-1453)
  • guj Gujarati
  • hat Haitian; Haitian Creole
  • heb Hebrew
  • hin Hindi
  • hrv Croatian
  • hun Hungarian
  • iku Inuktitut
  • ind Indonesian
  • isl Icelandic
  • ita Italian
  • ita_old Italian - Old
  • jav Javanese
  • jpn Japanese
  • kan Kannada
  • kat Georgian
  • kat_old Georgian - Old
  • kaz Kazakh
  • khm Central Khmer
  • kir Kirghiz; Kyrgyz
  • kor Korean
  • kur Kurdish
  • lao Lao
  • lat Latin
  • lav Latvian
  • lit Lithuanian
  • mal Malayalam
  • mar Marathi
  • mkd Macedonian
  • mlt Maltese
  • msa Malay
  • mya Burmese
  • nep Nepali
  • nld Dutch; Flemish
  • nor Norwegian
  • ori Oriya
  • pan Panjabi; Punjabi
  • pol Polish
  • por Portuguese
  • pus Pushto; Pashto
  • ron Romanian; Moldavian; Moldovan
  • rus Russian
  • san Sanskrit
  • sin Sinhala; Sinhalese
  • slk Slovak
  • slv Slovenian
  • spa Spanish; Castilian
  • spa_old Spanish; Castilian - Old
  • sqi Albanian
  • srp Serbian
  • srp_latn Serbian - Latin
  • swa Swahili
  • swe Swedish
  • syr Syriac
  • tam Tamil
  • tel Telugu
  • tgk Tajik
  • tgl Tagalog
  • tha Thai
  • tir Tigrinya
  • tur Turkish
  • uig Uighur; Uyghur
  • ukr Ukrainian
  • urd Urdu
  • uzb Uzbek
  • uzb_cyrl Uzbek - Cyrillic
  • vie Vietnamese
  • yid Yiddish
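
A language in the list above is only usable if its trained data is installed locally. The sketch below, which is not part of this library, lists the installed codes by calling the Tesseract command-line flag `--list-langs`. The helper names `parse_list_langs` and `installed_languages` are hypothetical, and the parser assumes the output is a header line ending in a colon followed by one code per line.

```python
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes a header line ending with ':' followed by one code per line.
    """
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        codes.append(line)
    return codes


def installed_languages(binary: str = "tesseract") -> List[str]:
    """Run the Tesseract CLI and return the language codes it reports."""
    result = subprocess.run(
        [binary, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some older Tesseract versions print the list to stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("spa installed:", "spa" in langs)
```

Checking the code against this list before building a Document avoids an OCR failure caused by missing language data.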
Owner
Shahrukh Khan
CS Grad Student @ Saarland University