PyMuPDF is a Python binding with support for MuPDF

Overview

PyMuPDF 1.18.14

Release date: June 1, 2021

On PyPI since August 2016: Downloads

Authors

Introduction

PyMuPDF (current version 1.18.14) is a Python binding with support for MuPDF (current version 1.18.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.

With PyMuPDF you can access files with extensions like ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.

In partnership with Artifex, PyMuPDF is now also available for commercial licensing. This agreement has no impact on use cases that comply with the open-source AGPL license. Please see the "License and Copyright" section below for additional information.

Usage and Documentation

For all supported document types (i.e. including images) you can

  • decrypt the document
  • access meta information, links and bookmarks
  • render pages in raster formats (PNG and some others), or the vector format SVG
  • search for text
  • extract text and images
  • convert to other formats: PDF, (X)HTML, XML, JSON, text
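As a rough illustration of the operations listed above, the following sketch opens a document, prints its metadata, bookmarks, links and text, and renders the first page to a PNG file. It assumes a PyMuPDF release that offers the snake_case method names (`get_text`, `get_pixmap` and so on). The helper names `png_name` and `dump_first_page`, and any file name passed in, are made up for this example.

```python
from pathlib import Path


def png_name(source: str, page_number: int) -> str:
    """Build an output file name such as 'report-p0.png' for a rendered page."""
    return f"{Path(source).stem}-p{page_number}.png"


def dump_first_page(source: str) -> str:
    """Print basic facts about a document and render its first page to PNG."""
    import fitz  # PyMuPDF; imported here so png_name() works without it

    doc = fitz.open(source)          # PDF, XPS, EPUB, image files, ...
    print(doc.metadata)              # dict with keys such as 'title', 'author'
    print(doc.get_toc())             # bookmarks as [level, title, page] lists
    page = doc.load_page(0)
    print(page.get_links())          # list of link dictionaries
    print(page.get_text("text"))     # plain text of the page
    target = png_name(source, page.number)
    page.get_pixmap().save(target)   # output format follows the extension
    doc.close()
    return target
```

Calling `dump_first_page("input.pdf")` would write `input-p0.png` into the current working directory. An encrypted document would need to be authenticated first, which this sketch leaves out.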

To some degree, PyMuPDF can therefore be used as an image converter: it can read a range of input formats and can produce Portable Network Graphics (PNG), Portable Anymaps (PNM, etc.), Portable Arbitrary Maps (PAM), Adobe PostScript and Adobe Photoshop documents, making the use of other graphics packages obsolete in these cases. But interfacing with e.g. PIL/Pillow for image input and output is easy as well.
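The image-conversion use case might look like the sketch below. It assumes `fitz.Pixmap(filename)` for reading and `Pixmap.save()` for writing, as offered by recent PyMuPDF releases. The names `target_path` and `convert_image`, and the `NO_ALPHA` set (formats taken here to reject an alpha channel), are assumptions of this example and not part of the library.

```python
from pathlib import Path

# Output formats named in the paragraph above, keyed by file extension.
OUTPUT_SUFFIXES = {".png", ".pnm", ".pam", ".ps", ".psd"}
# Assumption of this sketch: these formats cannot store transparency.
NO_ALPHA = {".pnm", ".ps"}


def target_path(source: str, suffix: str = ".png") -> Path:
    """Return the output path for `source`, rejecting formats not listed above."""
    suffix = suffix.lower()
    if suffix not in OUTPUT_SUFFIXES:
        raise ValueError(f"unsupported output format: {suffix}")
    return Path(source).with_suffix(suffix)


def convert_image(source: str, suffix: str = ".png") -> Path:
    """Read an image file and write it back in another format."""
    import fitz  # PyMuPDF; imported here so target_path() works without it

    target = target_path(source, suffix)
    pix = fitz.Pixmap(source)                 # PNG, JPEG, BMP, TIFF, ...
    if pix.n - pix.alpha >= 4:                # CMYK: convert to RGB first
        pix = fitz.Pixmap(fitz.csRGB, pix)
    if pix.alpha and target.suffix in NO_ALPHA:
        pix = fitz.Pixmap(pix, 0)             # copy without the alpha channel
    pix.save(str(target))                     # format follows the extension
    return target
```

For example, `convert_image("scan.tiff")` would be expected to write `scan.png` next to the input file.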

For PDF documents, there is a plethora of additional features: they can be created, joined or split up. Pages can be inserted, deleted, re-arranged or modified in many ways (including annotations and form fields).
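Joining and splitting could be sketched as follows, assuming the snake_case names `Document.insert_pdf()` and `Document.page_count`. The function names, the `chunk` parameter and the output naming scheme are choices made for this example only.

```python
def split_ranges(page_count: int, chunk: int) -> list:
    """Return inclusive (first, last) page ranges of at most `chunk` pages."""
    return [
        (start, min(start + chunk, page_count) - 1)
        for start in range(0, page_count, chunk)
    ]


def join_pdfs(sources: list, target: str) -> None:
    """Append all pages of each source PDF to a new document."""
    import fitz  # PyMuPDF; imported here so split_ranges() works without it

    merged = fitz.open()                  # no argument: new, empty PDF
    for name in sources:
        with fitz.open(name) as src:
            merged.insert_pdf(src)        # copies every page of src
    merged.save(target)
    merged.close()


def split_pdf(source: str, chunk: int, prefix: str = "part") -> None:
    """Write consecutive page ranges of a PDF to separate files."""
    import fitz

    with fitz.open(source) as src:
        ranges = split_ranges(src.page_count, chunk)
        for index, (first, last) in enumerate(ranges):
            part = fitz.open()
            part.insert_pdf(src, from_page=first, to_page=last)
            part.save(f"{prefix}-{index}.pdf")
            part.close()
```

Page numbers are zero-based here, so a five-page file split with `chunk=2` yields the ranges (0, 1), (2, 3) and (4, 4).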

  • Images and fonts can be extracted or inserted.

    You may want to have a look at this cool GUI example script, which lets you insert, delete, replace or re-position images under your visual control.

    Since v1.18.8 there is a new experimental Document method subset_fonts(), which automatically builds subsets based on the usage of all eligible fonts in the document. Especially for new documents, this can lead to significant file size reductions. The method was developed in cooperation with our user @cuteufo - again thanks a lot for the contribution.

  • Embedded files are fully supported.

  • PDFs can be reformatted to support double-sided printing, posterizing, applying logos or watermarks.

  • Password protection is fully supported: decryption, encryption, encryption method selection, permission level and user / owner password setting.

  • Support of the PDF Optional Content concept for images, text and drawings.

  • Low-level PDF structures can be accessed and modified.

  • PyMuPDF can also be used as a module in the command line using "python -m fitz ...". This is a versatile utility, which we will further develop going forward. It currently supports PDF document

    • encryption / decryption / optimization
    • creating sub-documents
    • document joining
    • image / font extraction
    • full support of embedded files.
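Two items from the list above, sketched under assumptions: saving a password-protected copy through the `encryption`, `owner_pw`, `user_pw` and `permissions` arguments of `Document.save()`, and invoking the command-line module. Only the `-h` option is used for the command line, on the assumption that the utility prints its usage text for it. The function names and the chosen permission set are illustrative.

```python
import subprocess
import sys


def fitz_cli(*args: str) -> list:
    """Build the command line for the `python -m fitz ...` utility."""
    return [sys.executable, "-m", "fitz", *args]


def cli_help() -> str:
    """Run the utility with -h and return whatever it prints."""
    result = subprocess.run(
        fitz_cli("-h"), stdout=subprocess.PIPE, universal_newlines=True
    )
    return result.stdout


def save_protected(source: str, target: str, owner_pw: str, user_pw: str) -> None:
    """Save an AES-256 encrypted copy that allows printing and copying."""
    import fitz  # PyMuPDF; imported here so fitz_cli() works without it

    permissions = int(
        fitz.PDF_PERM_ACCESSIBILITY | fitz.PDF_PERM_PRINT | fitz.PDF_PERM_COPY
    )
    doc = fitz.open(source)
    doc.save(
        target,
        encryption=fitz.PDF_ENCRYPT_AES_256,
        owner_pw=owner_pw,
        user_pw=user_pw,
        permissions=permissions,
    )
    doc.close()


def open_protected(path: str, password: str):
    """Open a document and authenticate if it asks for a password."""
    import fitz

    doc = fitz.open(path)
    if doc.needs_pass and not doc.authenticate(password):
        doc.close()
        raise ValueError("password rejected")
    return doc
```

The experimental `subset_fonts()` method mentioned above would be called on the open document before `save()`. It is left out here because it may depend on additional packages being installed.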

Have a look at the basic demos, the examples (which contain complete, working programs), and the recipes section of our Wiki sidebar, which contains more than a dozen how-to guides.

Our documentation, written using Sphinx, is available in various formats from the following sources. It currently is a combination of a reference guide and a user manual. For a quick start look at the tutorial and the recipes chapters.

  • You can view it online at Read the Docs. This site also provides download options for PDF.
  • The search function on Read the Docs does not work for me currently. If you want a working searchable local version, please download a zipped HTML version from here.
  • Find a Windows help file here.

Installation

For Windows, Linux and Mac OSX platforms, there are wheels in the download section of PyPI. This includes Python 64bit versions 3.6 through 3.9. For Windows only, 32bit versions are available too. Since version 1.18.14 there also exist wheels for the Linux ARM architecture - look for platform tag manylinux2014_aarch64.
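The availability rules in this paragraph can be captured in a small check. This is only a reading of the text above for version 1.18.14, not an official compatibility table. The function names and the machine strings compared against (`x86_64`, `amd64`, `x86`, `aarch64`) are assumptions about what `platform.machine()` reports.

```python
import platform
import sys

# Assumed spellings of 64-bit Intel/AMD as reported by platform.machine().
INTEL_64 = {"x86_64", "amd64"}


def wheel_expected(py_version, system: str, machine: str, bits: int) -> bool:
    """Apply the wheel rules stated above for PyMuPDF 1.18.14."""
    if not (3, 6) <= tuple(py_version[:2]) <= (3, 9):
        return False                      # wheels cover Python 3.6 to 3.9
    machine = machine.lower()
    if system == "Windows":               # 32-bit and 64-bit builds
        return bits in (32, 64) and machine in INTEL_64 | {"x86"}
    if system == "Linux":                 # 64-bit, including ARM (aarch64)
        return bits == 64 and machine in INTEL_64 | {"aarch64"}
    if system == "Darwin":                # Mac OSX, 64-bit
        return bits == 64 and machine in INTEL_64
    return False


def wheel_expected_here() -> bool:
    """Run the check for the interpreter executing this code."""
    bits = 64 if sys.maxsize > 2**32 else 32
    return wheel_expected(
        sys.version_info, platform.system(), platform.machine(), bits
    )
```

If the check returns `False`, the build-from-source route described next is the fallback.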

If your platform is not supported with one of our wheels, you need to generate PyMuPDF yourself as follows. This requires the development version of Python.

Before you can do that, you must first build MuPDF. For most platforms, the MuPDF sources contain prepared procedures for achieving this. Please observe the following general steps:

  • Be sure to download the official MuPDF source release from here. Do not use MuPDF's GitHub repo. It contains their development source for future versions.

  • This repo's fitz folder contains one or more files whose names start with a single underscore "_". These files contain configuration data and potentially other fixes. Copy-rename each of them to their correct target location within the downloaded MuPDF source. Currently, these files are:

    • Optional: copy the fitz configuration file _config.h to mupdf/include/mupdf/fitz/config.h, replacing the existing file. It contains configuration data, e.g. which fonts to support. If you omit this change, the binary extension module will be over 30 MB (compared to around 11 MB). Functionality is not affected.

    • Now MuPDF can be generated.

  • Please note that you will need the interface generator SWIG when building PyMuPDF from the sources of this repository (please refer to issue #312 for some background on this).

    • PyMuPDF wheels are being generated using SWIG v4.0.2.
  • If you do not use SWIG, please download the sources from PyPI - they contain sources pre-processed by SWIG, so installation should work like any other Python extension generation on your system.
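The copy-rename step for the underscore-prefixed files could be scripted roughly as follows. The mapping contains only the one file named above. The function name `apply_overrides` and both directory arguments are placeholders for wherever this repository and the MuPDF source release were unpacked.

```python
import shutil
from pathlib import Path

# File in this repo's fitz folder -> location inside the MuPDF source tree.
OVERRIDES = {"_config.h": "include/mupdf/fitz/config.h"}


def apply_overrides(fitz_dir, mupdf_dir) -> list:
    """Copy each available override file to its target, replacing the original."""
    copied = []
    for name, relative in OVERRIDES.items():
        source = Path(fitz_dir) / name
        if not source.is_file():
            continue                      # the override is optional
        target = Path(mupdf_dir) / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)   # overwrites an existing config.h
        copied.append(target)
    return copied
```

The returned list shows which files were replaced, which makes it easy to confirm the step ran before MuPDF is generated.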

Once this is done, adjust directories in setup.py and run python setup.py install.

The following sections contain further comments for some platforms.

Ubuntu

Our users (thanks to @gileadslostson and @jbarlow83!) have documented their MuPDF installation experiences from sources in this Wiki page.

OSX

First, install the MuPDF headers and libraries, which are provided by mupdf-tools: brew install mupdf-tools.

Then you might need to export ARCHFLAGS='-arch x86_64', since libmupdf.a is for x86_64 only.

Finally, please double check setup.py before building. Update include_dirs and library_dirs if necessary.

MS Windows

If you are looking to make your own binary, consult this Wiki page. It explains how to use Visual Studio for generating MuPDF in quite some detail.

Earlier Versions

Earlier versions are available in the releases directory.

License and Copyright

In order to comply with MuPDF’s dual licensing model, PyMuPDF has entered into an agreement with Artifex, which has the right to sublicense PyMuPDF to third parties.

PyMuPDF and MuPDF are now available under both open-source AGPL and commercial license agreements.

Please read the full text of the AGPL license agreement (which is also included here in file COPYING) to ensure that your use case complies with the guidelines of this license. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

Artifex is the exclusive commercial licensing agent for MuPDF.

Artifex, the Artifex logo, MuPDF, and the MuPDF logo are registered trademarks of Artifex Software Inc. © 2021 Artifex Software, Inc. All rights reserved.

Contact

Please use the Discussions menu for questions, comments, or asking others for help, and submit issues here. If you wish, you can also contact me directly via [email protected].

Comments
  • Wrong Handling of Reference Count of "None" Object

    I'm iterating all xrefs found in the pdf to determine their "content":

    import fitz  # PyMuPDF
    
    fileName = "eos6d-mk2-im2-en1.pdf"  # the sample file attached to this report
    document = fitz.Document(fileName)
    nonImageXrefs = []
    imageXrefs = []
    
    # xref 0 is reserved, so iteration starts at 1
    allXrefsLength = document.xref_length()
    for xref in range(1, allXrefsLength):
        if document.xref_get_key(xref, "Subtype")[1] == "/Image":
            if document.extract_image(xref):
                imageXrefs.append(xref)
        else:
            # the crash reported below occurs in this call
            rawData = document.xref_stream_raw(xref)
            if rawData is None or len(rawData) == 0:
                print("xref {0} is neither image nor deflatable stream".format(xref))
            else:
                nonImageXrefs.append(xref)
    

    When there are lots of such calls, I get the following error:

    Fatal Python error: none_dealloc: deallocating None
    Python runtime state: initialized
    
    Current thread 0x00002b44 (most recent call first):
      File "C:\Program Files\Python\lib\pdfUtils.py", line 592 in optimizeWithPyMuPdf
      File "C:\Users\Alex\PycharmProjects\pdfOptimizer\pdf_opt.py", line 8 in <module>
    
    Extension modules: fitz._fitz, zopfli.zopfli, PIL._imaging (total: 3)
    
    Process finished with exit code -1073740791 (0xC0000409)
    

    Line 592 is rawData = document.xref_stream_raw(xref)

    This happens at a random place in the xref list, but the counter is usually between 11000 and 13000.

    I'm using Windows 10, Python 3.10 x64, PyMuPDF 1.21.1 installed by pip.

    A sample file is attached (eos6d-mk2-im2-en1.pdf), but as far as I can see the problem is not caused by a specific file.

    bug Fixed in next release 
    opened by AlexMatiash 2
  • Replace image throws an error

    Describe the bug (mandatory)

    Using the replace_image method on the Page object fails with an error for a missing method on the Document object.

    To Reproduce (mandatory)

    >>> fitz_doc = fitz.open("/Users/ashah/GoogleDrive/YearbookCreatorInput/Test_School.pdf")
    >>> page6 = fitz_doc.load_page(7)
    >>> page6.get_images()
    [(112, 0, 1985, 1600, 8, 'ICCBased', '', 'Im55', 'DCTDecode'), (113, 0, 1800, 1200, 8, 'ICCBased', '', 'Im56', 'DCTDecode'), (114, 0, 2100, 1402, 8, 'ICCBased', '', 'Im57', 'DCTDecode'), (115, 0, 808, 1436, 8, 'ICCBased', '', 'Im58', 'DCTDecode'), (90, 0, 1800, 1200, 8, 'ICCBased', '', 'Im48', 'DCTDecode'), (95, 0, 1200, 1800, 8, 'ICCBased', '', 'Im53', 'DCTDecode'), (117, 121, 1767, 1144, 8, 'ICCBased', '', 'Im59', 'FlateDecode'), (92, 0, 1200, 1800, 8, 'ICCBased', '', 'Im50', 'DCTDecode'), (118, 122, 1365, 1365, 8, 'ICCBased', '', 'Im60', 'FlateDecode'), (119, 123, 924, 1159, 8, 'ICCBased', '', 'Im61', 'FlateDecode')]
    >>> page6.replace_image(95, filename='/Users/ashah/GoogleDrive/Test_School/blank.png')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.10/site-packages/fitz/utils.py", line 255, in replace_image
        if not doc.is_image(xref):
    AttributeError: 'Document' object has no attribute 'is_image'
    

    Your configuration (mandatory)

    Output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__):

    3.10.6 (main, Aug 11 2022, 13:49:25) [Clang 13.1.6 (clang-1316.0.21.2.5)] darwin

    PyMuPDF 1.21.1: Python bindings for the MuPDF 1.21.1 library. Version date: 2022-12-13 00:00:01. Built for Python 3.10 on darwin (64-bit).

    bug Fixed in next release 
    opened by foranuj 1
  • Failed to read JPX header when trying to get blocks

    Describe the bug (mandatory)

    When I'm trying to get blocks from some pdfs, the following error occurs: RuntimeError: Failed to read JPX header. The same error occurs when I'm trying to get the pixmap with the function get_pixmap.

    It works if I call page.get_text() without the "blocks" or "dict" argument.

    PDFs with this error have the following attributes:

    • Producer: GPL Ghostscript 9.23
    • PDF Version: 1.5

    If I edit the PDF file with any online tool, for example https://www.sejda.com/pdf-editor, the attributes change and the error disappears.

    To Reproduce (mandatory)

    PDF file - test_get_blocks.pdf

    import fitz
    
    with fitz.open("test_get_blocks.pdf") as doc:
        for page in doc:
            print(page.get_text("blocks"))
    

    Traceback

    Traceback (most recent call last):
      File "/home/johni/Projects/pdf-to-txt/main.py", line 5, in <module>
        print(page.get_text("dict"))
      File "/home/johni/.pyenv/versions/3.9.15/lib/python3.9/site-packages/fitz/utils.py", line 808, in get_text
        tp = page.get_textpage(clip=clip, flags=flags)
      File "/home/johni/.pyenv/versions/3.9.15/lib/python3.9/site-packages/fitz/fitz.py", line 5675, in get_textpage
        textpage = self._get_textpage(clip, flags=flags, matrix=matrix)
      File "/home/johni/.pyenv/versions/3.9.15/lib/python3.9/site-packages/fitz/fitz.py", line 5661, in _get_textpage
        val = _fitz.Page__get_textpage(self, clip, flags, matrix)
    RuntimeError: Failed to read JPX header
    

    Notebook to reproduce the error

    Your configuration (mandatory)

    • Operating system Ubuntu 22.04.1 LTS
    • Python version 3.9.15
    • PyMuPDF version 1.21.1
    upstream bug 
    opened by johnidm 4
  • 1.21.1: test_color_count fails

    Describe the bug (mandatory)

    test_color_count fails

    To Reproduce (mandatory)

      export PYMUPDF_SETUP_MUPDF_BUILD=""
      python -m build --wheel --no-isolation
    
      local _site_packages=$(python -c "import site; print(site.getsitepackages()[0])")
      local _test_dir="test_dir"
    
      cd $_name-$pkgver
      mkdir -vp $_test_dir
      # install to test dir for testing
      python -m installer --destdir="$_test_dir" dist/*.whl
    
      export PYTHONPATH="$_test_dir/$_site_packages:$PYTHONPATH"
      # disable broken test: https://github.com/pymupdf/PyMuPDF/issues/2040
      pytest -vv -c /dev/null tests/ -k 'not test_textbox3'
    
    =================================== FAILURES ===================================
    _______________________________ test_color_count _______________________________
    
        def test_color_count():
            pm = fitz.Pixmap(imgfile)
    >       assert pm.color_count() == 40624
    E       assert 39912 == 40624
    E        +  where 39912 = <bound method Pixmap.color_count of Pixmap(DeviceRGB, IRect(0, 0, 439, 501), 0)>()
    E        +    where <bound method Pixmap.color_count of Pixmap(DeviceRGB, IRect(0, 0, 439, 501), 0)> = Pixmap(DeviceRGB, IRect(0, 0, 439, 501), 0).color_count
    
    tests/test_pixmap.py:94: AssertionError
    =============================== warnings summary ===============================
    ../../../../usr/lib/python3.10/site-packages/_pytest/cacheprovider.py:433
      /usr/lib/python3.10/site-packages/_pytest/cacheprovider.py:433: PytestCacheWarning: could not create cache path /dev/.pytest_cache/v/cache/nodeids
        config.cache.set("cache/nodeids", sorted(self.cached_nodeids))
    
    ../../../../usr/lib/python3.10/site-packages/_pytest/cacheprovider.py:387
      /usr/lib/python3.10/site-packages/_pytest/cacheprovider.py:387: PytestCacheWarning: could not create cache path /dev/.pytest_cache/v/cache/lastfailed
        config.cache.set("cache/lastfailed", self.lastfailed)
    
    ../../../../usr/lib/python3.10/site-packages/_pytest/stepwise.py:52
      /usr/lib/python3.10/site-packages/_pytest/stepwise.py:52: PytestCacheWarning: could not create cache path /dev/.pytest_cache/v/cache/stepwise
        session.config.cache.set(STEPWISE_CACHE_DIR, [])
    
    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    =========================== short test summary info ============================
    FAILED ../../../../dev/test_pixmap.py::test_color_count - assert 39912 == 40624
    ====== 1 failed, 95 passed, 1 skipped, 1 deselected, 3 warnings in 1.65s =======
    

    python-pymupdf-1.21.1-1-x86_64-build.log python-pymupdf-1.21.1-1-x86_64-check.log

    Expected behavior (optional)

    All tests pass.

    Screenshots (optional)

    n/a

    Your configuration (mandatory)

    • Arch Linux
    • Python 3.10.8
    • PyMuPDF 1.21.1 from tarball

    Additional context (optional)

    n/a

    opened by dvzrv 2
  • Redaction removing more text than expected

    Describe the bug (mandatory)

    When applying a redaction on a document, the following word is removed as well.

    To Reproduce (mandatory)

    Example PDF file: test_doc.pdf

    Run this script:

    import fitz
    doc = fitz.open("test_doc.pdf")
    page = doc[0]
    areas = page.search_for("{sig}")
    rect = areas[0]
    page.add_redact_annot(rect)
    page.apply_redactions()
    doc.saveIncr()
    doc.close()
    

    The searched word "{sig}" is removed (as expected). The word "Vertrag" on the top right is removed as well (unexpected).

    Expected behavior (optional)

    Searched string should be removed. No other change should be made.

    Screenshots (optional)

    [Two screenshots: the page before and after running the script]

    Your configuration (mandatory)

    • OS independent: happens on Windows 11 as well as Debian 11
    • Python 3.10.8
    • PyMuPDF 1.21.0, installed via pip

    Thank you!

    upstream bug 
    opened by seb-bau 3
  • Image in pdf changes color after applying redactions

    Description

    Image in a PDF file changes color after applying redactions.

    To Reproduce

    Execute the following Python script to reproduce the issue. The script uses this PDF file: image_issue.pdf.

    import os
    import fitz
    
    script_path = os.path.abspath(__file__)
    script_folder = os.path.dirname(script_path)
    doc = fitz.open(os.path.join(script_folder, 'image_issue.pdf'))
    
    page = doc.load_page(0)
    
    rx=135.123
    ry=123.56878
    rw=69.8409
    rh=9.46397
    
    x0 = rx
    y0 = ry
    x1 = rx + rw
    y1 = ry + rh
        
    rect = fitz.Rect(x0, y0, x1, y1)
    
    font = fitz.Font("Helvetica")
    fill_color=(0,0,0)
    page.add_redact_annot(
        quad=rect,
        #text="null",
        fontname=font.name,
        fontsize=12,
        align=fitz.TEXT_ALIGN_CENTER,
        fill=fill_color,
        text_color=(1,1,1),
    )
    
    page.apply_redactions()
    
    doc.save(os.path.join(script_folder, 'image_issue_redacted.pdf'))
    

    Note that I am using the default images=2 (blank out overlapping image parts) when calling apply_redactions(). Using images=0 (ignore) or images=1 (remove complete overlapping image) is not desirable for my use case.

    Expected behavior

    The color of the image in the pdf file should not change after applying redactions.

    Screenshots

    Here's a screenshot of the problem: [image]

    Your configuration

    • Operating system Ubuntu 22.04.1 LTS
    • Python version 3.8.14
    • PyMuPDF version 1.20.2
    upstream bug 
    opened by ot-ksrinivasan 7