A Python library for extracting text from PDFs without losing the formatting of the PDF content.

Last update: Nov 07, 2022

Overview

Multilingual PDF to Text

Install Package from PyPI

Install it using pip.

pip install multilingual-pdf2text

The library uses Tesseract, which can be installed by following these instructions:

Tesseract Installation
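
Before running the library, it can help to confirm that Tesseract is actually reachable. The following is a minimal sketch, not part of this package: the helper names `find_tesseract` and `installed_languages` are hypothetical, and it assumes the executable is called `tesseract` and is on your PATH. It relies only on the standard library and on Tesseract's own `--list-langs` flag.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def installed_languages():
    """Return the language codes reported by `tesseract --list-langs`.

    Returns an empty list when Tesseract is not installed.
    """
    exe = find_tesseract()
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Depending on the Tesseract version, the list goes to stdout or stderr.
    output = result.stdout or result.stderr
    # Skip the header line ("List of available languages ...").
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("List of")
    ]


if __name__ == "__main__":
    if find_tesseract() is None:
        print("Tesseract was not found on PATH; install it first.")
    else:
        print("Installed languages:", installed_languages())
```

If the language you plan to use is missing from the output, its trained data file most likely needs to be installed separately.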

Example Usage

Use it in your code

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract: the path to the PDF and the
    # Tesseract language code of its text ('spa' = Spanish)
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
    )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
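
The same calls can be applied to several files. The sketch below is an illustration under assumptions, not a documented feature: `collect_pdfs` and `extract_folder` are hypothetical helpers, every PDF in the folder is assumed to share one language, and the return value of `extract()` is stored as-is because its exact shape is not described here. Only `Document`, `PDF2Text` and `extract()` come from the example above.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_folder(folder, language="eng"):
    """Run the extraction from the example above on every PDF in `folder`.

    Returns a dict mapping each file name to whatever extract() returned.
    """
    # Imported here so collect_pdfs works even if the package is not installed.
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document

    results = {}
    for pdf_path in collect_pdfs(folder):
        document = Document(document_path=str(pdf_path), language=language)
        results[pdf_path.name] = PDF2Text(document=document).extract()
    return results
```

Calling `extract_folder("some/folder", language="spa")` would then process each PDF in turn; the folder path here is only a placeholder.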

Tesseract supports the following languages:
Code Language

afr Afrikaans
amh Amharic
ara Arabic
asm Assamese
aze Azerbaijani
aze_cyrl Azerbaijani - Cyrillic
bel Belarusian
ben Bengali
bod Tibetan
bos Bosnian
bul Bulgarian
cat Catalan; Valencian
ceb Cebuano
ces Czech
chi_sim Chinese - Simplified
chi_tra Chinese - Traditional
chr Cherokee
cym Welsh
dan Danish
deu German
dzo Dzongkha
ell Greek, Modern (1453-)
eng English
enm English, Middle (1100-1500)
epo Esperanto
est Estonian
eus Basque
fas Persian
fin Finnish
fra French
frk German Fraktur
frm French, Middle (ca. 1400-1600)
gle Irish
glg Galician
grc Greek, Ancient (-1453)
guj Gujarati
hat Haitian; Haitian Creole
heb Hebrew
hin Hindi
hrv Croatian
hun Hungarian
iku Inuktitut
ind Indonesian
isl Icelandic
ita Italian
ita_old Italian - Old
jav Javanese
jpn Japanese
kan Kannada
kat Georgian
kat_old Georgian - Old
kaz Kazakh
khm Central Khmer
kir Kirghiz; Kyrgyz
kor Korean
kur Kurdish
lao Lao
lat Latin
lav Latvian
lit Lithuanian
mal Malayalam
mar Marathi
mkd Macedonian
mlt Maltese
msa Malay
mya Burmese
nep Nepali
nld Dutch; Flemish
nor Norwegian
ori Oriya
pan Panjabi; Punjabi
pol Polish
por Portuguese
pus Pushto; Pashto
ron Romanian; Moldavian; Moldovan
rus Russian
san Sanskrit
sin Sinhala; Sinhalese
slk Slovak
slv Slovenian
spa Spanish; Castilian
spa_old Spanish; Castilian - Old
sqi Albanian
srp Serbian
srp_latn Serbian - Latin
swa Swahili
swe Swedish
syr Syriac
tam Tamil
tel Telugu
tgk Tajik
tgl Tagalog
tha Thai
tir Tigrinya
tur Turkish
uig Uighur; Uyghur
ukr Ukrainian
urd Urdu
uzb Uzbek
uzb_cyrl Uzbek - Cyrillic
vie Vietnamese
yid Yiddish
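
A mistyped language code is an easy error to make, so it can be worth checking the code before building a `Document`. The sketch below is hypothetical and not part of the library: `KNOWN_CODES` holds only a small subset of the table above, and `validate_language` is an illustrative helper. Tesseract itself accepts several codes joined with `+` (for example `eng+spa`); whether this library passes such a value through unchanged is an assumption you should verify.

```python
# A small subset of the codes from the table above; extend it as needed.
KNOWN_CODES = {
    "afr", "ara", "chi_sim", "chi_tra", "deu", "eng",
    "fra", "hin", "ita", "jpn", "rus", "spa", "spa_old", "urd",
}


def validate_language(language):
    """Return `language` unchanged if every '+'-separated part is a known code.

    Raises ValueError naming the unknown parts otherwise.
    """
    parts = language.split("+")
    unknown = [part for part in parts if part not in KNOWN_CODES]
    if unknown:
        raise ValueError(
            "Unknown Tesseract language code(s): " + ", ".join(unknown)
        )
    return language


if __name__ == "__main__":
    print(validate_language("spa"))
    try:
        validate_language("spanish")
    except ValueError as error:
        print(error)
```

The validated string can then be passed as the `language` argument shown in the example usage.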

Owner

Shahrukh Khan

Python lib for Simple PDF text extraction

Camelot is a Python library that can help you extract tables from PDFs!

pikepdf is a Python library for reading and writing PDF files.

Extracts tables from a PDF and outputs the data in a JSON-like format

Compare-pdf - A Flask driven restful API for comparing two PDF files

Excalibur: A web interface to extract tabular data from PDFs

A simple Python script to convert multiple images (well technically also a single image) into a pdf.

DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

CLI tool to generate pdf invoices written in python

A Python tool to generate a static HTML file that represents the internal structure of a PDF file

PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

WeasyPrint is a smart solution helping web developers to create PDF documents.

Zen-Knit is a formal (PDF) and informal (HTML) report generator for data analysts and data scientists who want to use Python.

Program that locks/unlocks pdf files🐍

Processes a PDF to make it compatible with PDF/X and with grayscale printers.

minipdf is a package for creating simple, single-page PDF documents.