A Python library for extracting text from PDFs without losing the formatting of the PDF content.

Overview


Install Package from PyPI

  1. Install it using pip.
pip install multilingual-pdf2text

The library uses Tesseract, which can be installed by following these instructions:

Tesseract Installation
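
Before running the library, it can help to confirm that the Tesseract binary is reachable. The following is a minimal sketch, assuming the executable is named `tesseract` and is expected on the PATH. The helper name `find_tesseract` is hypothetical and is not part of multilingual-pdf2text.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it before using the library.")
    else:
        print(f"Tesseract found at {path}")
```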

Example Usage

  1. Use it in your code.
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract, together with its configuration.
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa',  # Tesseract language code, here Spanish
    )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
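
To process several PDFs in one go, the usage above can be wrapped in a small helper. This is a sketch only: it assumes `Document` and `PDF2Text` behave exactly as in the example above, and it makes no assumption about the type that `extract()` returns. The names `extract_with_library` and `extract_many` are hypothetical and are not provided by the library.

```python
from typing import Any, Callable, Dict, Iterable


def extract_with_library(document_path: str, language: str) -> Any:
    """Run one extraction the same way the usage example does."""
    # Imported lazily so this module can be loaded even when the package is missing.
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document

    pdf_document = Document(document_path=document_path, language=language)
    return PDF2Text(document=pdf_document).extract()


def extract_many(
    paths: Iterable[str],
    language: str,
    extract_one: Callable[[str, str], Any] = extract_with_library,
) -> Dict[str, Any]:
    """Map each PDF path to whatever the extractor returns for it."""
    # extract_one is injectable so the loop can be tested without Tesseract installed.
    return {path: extract_one(path, language) for path in paths}
```

Calling `extract_many(["a.pdf", "b.pdf"], "spa")` would run the library on each file in turn; passing a different `extract_one` callable lets the loop be exercised without the OCR backend.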

Tesseract supports the following languages, listed as code followed by language name:

  • afr Afrikaans
  • amh Amharic
  • ara Arabic
  • asm Assamese
  • aze Azerbaijani
  • aze_cyrl Azerbaijani - Cyrillic
  • bel Belarusian
  • ben Bengali
  • bod Tibetan
  • bos Bosnian
  • bul Bulgarian
  • cat Catalan; Valencian
  • ceb Cebuano
  • ces Czech
  • chi_sim Chinese - Simplified
  • chi_tra Chinese - Traditional
  • chr Cherokee
  • cym Welsh
  • dan Danish
  • deu German
  • dzo Dzongkha
  • ell Greek, Modern (1453-)
  • eng English
  • enm English, Middle (1100-1500)
  • epo Esperanto
  • est Estonian
  • eus Basque
  • fas Persian
  • fin Finnish
  • fra French
  • frk German Fraktur
  • frm French, Middle (ca. 1400-1600)
  • gle Irish
  • glg Galician
  • grc Greek, Ancient (-1453)
  • guj Gujarati
  • hat Haitian; Haitian Creole
  • heb Hebrew
  • hin Hindi
  • hrv Croatian
  • hun Hungarian
  • iku Inuktitut
  • ind Indonesian
  • isl Icelandic
  • ita Italian
  • ita_old Italian - Old
  • jav Javanese
  • jpn Japanese
  • kan Kannada
  • kat Georgian
  • kat_old Georgian - Old
  • kaz Kazakh
  • khm Central Khmer
  • kir Kirghiz; Kyrgyz
  • kor Korean
  • kur Kurdish
  • lao Lao
  • lat Latin
  • lav Latvian
  • lit Lithuanian
  • mal Malayalam
  • mar Marathi
  • mkd Macedonian
  • mlt Maltese
  • msa Malay
  • mya Burmese
  • nep Nepali
  • nld Dutch; Flemish
  • nor Norwegian
  • ori Oriya
  • pan Panjabi; Punjabi
  • pol Polish
  • por Portuguese
  • pus Pushto; Pashto
  • ron Romanian; Moldavian; Moldovan
  • rus Russian
  • san Sanskrit
  • sin Sinhala; Sinhalese
  • slk Slovak
  • slv Slovenian
  • spa Spanish; Castilian
  • spa_old Spanish; Castilian - Old
  • sqi Albanian
  • srp Serbian
  • srp_latn Serbian - Latin
  • swa Swahili
  • swe Swedish
  • syr Syriac
  • tam Tamil
  • tel Telugu
  • tgk Tajik
  • tgl Tagalog
  • tha Thai
  • tir Tigrinya
  • tur Turkish
  • uig Uighur; Uyghur
  • ukr Ukrainian
  • urd Urdu
  • uzb Uzbek
  • uzb_cyrl Uzbek - Cyrillic
  • vie Vietnamese
  • yid Yiddish
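
A language from the list above is only usable if its trained data is installed locally. The Tesseract command line can report this with `tesseract --list-langs`. The sketch below parses that output, assuming it consists of a header line beginning with "List of available languages" followed by one code per line. The function names are hypothetical, and `installed_languages` raises `FileNotFoundError` if the binary is not on the PATH.

```python
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        codes.append(line)
    return codes


def installed_languages(executable: str = "tesseract") -> List[str]:
    """Ask the local Tesseract binary which language codes it has data for."""
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions write the list to stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)
```

Checking `'spa' in installed_languages()` before building a `Document` with `language='spa'` gives an early, readable failure instead of an OCR error later on.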
Owner
Shahrukh Khan
CS Grad Student @ Saarland University