A modern pure-Python library for reading PDF files

Last update: Apr 06, 2022

pdf

A modern pure-Python library for reading PDF files.

The goal is a modern interface for handling PDF files that is internally consistent and follows typical Python conventions.

The library should be Python-only (hence no C extensions) but allow the backend to be swapped. This is similar in concept to matplotlib backends and Keras backends.

The default backend could be PyPDF2.

Other possible backends are PyMuPDF (using MuPDF) and PikePDF (using QPDF).
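The swappable-backend idea can be illustrated with a small sketch. Everything in it (`Backend`, `InMemoryBackend`, `Document`, and their methods) is hypothetical and chosen only for illustration; it is not the library's actual API. The in-memory backend serves canned pages so the example runs without a PDF or a third-party package.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Backend(ABC):
    """Hypothetical backend interface; not the library's actual API."""

    @abstractmethod
    def page_count(self, path: str) -> int:
        """Return the number of pages in the document at `path`."""

    @abstractmethod
    def page_text(self, path: str, index: int) -> str:
        """Return the text of the page at zero-based `index`."""


class InMemoryBackend(Backend):
    """Toy backend serving canned pages, so the sketch needs no real PDF."""

    def __init__(self, pages_by_path: Dict[str, List[str]]) -> None:
        self._pages = pages_by_path

    def page_count(self, path: str) -> int:
        return len(self._pages[path])

    def page_text(self, path: str, index: int) -> str:
        return self._pages[path][index]


class Document:
    """Front end that delegates all work to whichever backend it is given."""

    def __init__(self, path: str, backend: Backend) -> None:
        self._path = path
        self._backend = backend

    def __len__(self) -> int:
        return self._backend.page_count(self._path)

    def __getitem__(self, index: int) -> str:
        return self._backend.page_text(self._path, index)


backend = InMemoryBackend({"demo.pdf": ["first page", "second page"]})
doc = Document("demo.pdf", backend)
print(len(doc), doc[1])
```

With this shape, a backend built on PyPDF2, PyMuPDF, or PikePDF would only need to implement the two abstract methods, and the front end would stay unchanged.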

WARNING: This library is UNSTABLE at the moment! Expect many changes!

Installation

pip install pdffile

Usage

Retrieve Metadata

>>> import pdf

>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1

>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    other={
         '/CreationDate': "D:20220403180542+02'00'",
         '/ModDate': "D:20220403180542+02'00'",
         '/Trapped': '/False',
         '/PTEX.Fullbanner': 'This is pdfTeX, V...'})
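The raw `/CreationDate` and `/ModDate` values above use the PDF date-string form `D:YYYYMMDDHHmmSS` followed by a UTC offset. The following is a minimal sketch of how such a string could be turned into a `datetime`. The helper name `parse_pdf_date` is hypothetical and this is not necessarily how the library does it; the sketch also handles only the full 14-digit form, not the shorter variants the PDF specification allows.

```python
import re
from datetime import datetime, timedelta, timezone

# Full form only: D:YYYYMMDDHHmmSS plus an optional offset such as +02'00' or Z.
_PDF_DATE = re.compile(
    r"^D:(?P<stamp>\d{14})"
    r"(?:(?P<sign>[+\-Z])(?:(?P<hh>\d{2})'?(?:(?P<mm>\d{2})'?)?)?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Parse a full-length PDF date string (hypothetical helper)."""
    match = _PDF_DATE.match(raw)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {raw!r}")
    naive = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    sign = match["sign"]
    if sign is None:
        return naive  # no offset present, so the result stays naive
    if sign == "Z":
        return naive.replace(tzinfo=timezone.utc)
    offset = timedelta(hours=int(match["hh"] or 0), minutes=int(match["mm"] or 0))
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))


parsed = parse_pdf_date("D:20220403180542+02'00'")
print(parsed.isoformat())
```

Unlike the `creation_date` shown in the output above, which carries no time zone, this sketch keeps the offset from the raw string on the returned value.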

Encrypted PDFs

If the PDF is encrypted, pass its password when opening it:

doc = pdf.PdfFile(pdf_path, password=password)

All subsequent operations then work as described in the other sections.

Get Outline

>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]
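One thing an outline like this could be used for is finding the section that covers a given page. The sketch below assumes the entries are sorted by page and uses a stand-in `Links` named tuple that only mirrors the two fields visible in the output above; the real class may differ, and `entry_for_page` is a hypothetical helper.

```python
from typing import List, NamedTuple, Optional


class Links(NamedTuple):
    """Stand-in with the fields shown in the output; not the library's class."""

    page: int
    text: str


def entry_for_page(outline: List[Links], page: int) -> Optional[Links]:
    """Return the last outline entry starting on or before `page`.

    Assumes `outline` is sorted by page number.
    """
    current = None
    for entry in outline:
        if entry.page > page:
            break
        current = entry
    return current


outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]
print(entry_for_page(outline, 42))
```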

Extract Text

>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'

Alternatively, you can use doc.text to get the text of all pages.
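The extracted text above keeps line-end hyphenation from the PDF layout (for example `ataki-\nmata`). If that gets in the way, a small post-processing step can rejoin such words. This is a minimal sketch using only the standard library; `join_hyphenated_lines` is a hypothetical helper, not part of the library, and it will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re


def join_hyphenated_lines(text: str) -> str:
    """Remove a hyphen plus newline when it sits between two letters."""
    # Lookbehind: a letter before the hyphen; lookahead: a lowercase letter
    # at the start of the next line, which suggests a split word.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


sample = "noseataki-\nmatasanctusest\nLoremipsum"
print(join_hyphenated_lines(sample))
```

Ordinary line breaks are left untouched, so the result can still be split into lines afterwards.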
