PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

Overview

PDFSanitizer

Renders possibly malicious PDF files and outputs harmless PDF files

To do this, the PDF files are rendered and converted to images using PyMuPDF. The images are then saved to a new PDF file using img2pdf. This ensures no visual data is lost, but any scripts/external references/flash files are removed.

Instalation:

git clone https://github.com/lacioffi/PDFSanitizer
cd PDFSanitizer
pip install -r requirements.txt 

Usage:

pdfsanitizer.py 
    
    

   

This project uses the following libraries:

PyMuPDF - By Jorj X. McKie (@JorjMcKie)

img2pdf - By Johannes Schauer Marin Rodrigues ([email protected])

Special thanks to @CoolerVoid for helping with the sandboxing part <3

Security Considerations

This script will render possibly malicious PDFs to generate the sanitized file using MuPDF, which has had exploits in the past: https://www.cvedetails.com/vulnerability-list/vendor_id-10846/product_id-20840/Artifex-Mupdf.html Therefore, assume the machine this runs on will be pwned eventually.

To mitigate this risk, I recommend two techinques:

Running the script on a Sandbox

Firejail

This sandbox will restrict all network access, unnecessary syscalls and reading/writing from unexpected files/folders. I recommend this method.

Install it using:

sudo apt-get install firejail

The profile is already included in this repo, but it assumes you're running from ~/PDFSanitizer. Setup and use it by running the following commands:

cd ~
git clone https://github.com/lacioffi/PDFSanitizer
cd ~/PDFSanitizer/ 
firejail --profile=pdfsanitizer.profile python3 ~/PDFSanitizer/pdfsanitizer.py file.pdf ~/PDFSanitizer/out

Plase note that the input file must be located in ~/PDFSanitizer. The output folder can have any name, but it must already exist and also be inside ~/PDFSanitizer.

If you want to run PDFSanitizer from another folder, change the following line in "pdfsanitizer.profile":

whitelist ${HOME}/PDFSanitizer/

To wherever you're running the program from.

If you want the output folder or the input file to be outside of PDFSanitizer's folder, simply add a line in "pdfsanitizer.profile":

whitelist /full/path/to/output/folder
whitelist /full/path/to/file.pdf

Again, please note that the output folder must already exist.

CloudFlare Sandbox

This sandbox will restrict all network access and unnecessary syscalls, but will not restrict reading/writing to arbitrary files/folders.

Download and build it from here: https://github.com/cloudflare/sandbox Take the generated "libsandbox.so" file, put it in this PDFSanitizer's folder and run the following command:

LD_PRELOAD=./libsandbox.so SECCOMP_SYSCALL_ALLOW="read:write:lseek:close:openat:brk:stat:munmap:fstat:getdents64:ioctl:rt_sigaction:mmap:mprotect:pread64:lstat:dup:mremap:futex:getegid:getuid:getgid:geteuid:sigaltstack:rt_sigprocmask:access:uname:fcntl:getcwd:readlink:sysinfo:arch_prctl:gettid:set_tid_address:set_robust_list:prlimit64:getrandom:exit_group" python3 pdfsanitizer.py file.pdf ./output

Running the script on an isolated environment

For maximum security, run this script on an isolated, ephemeral instance or even a serverless environment. Block all network communications, maybe kill the instance after the job is done, and only allow reading from an input folder/bucket and writing to an output folder/bucket.

I didn't try this method, but I believe you can do it with some Cloud Majyks.

To-do

This method removes EVERYTHING from the PDF. It would be nice to at least keep the text copy-pasteable.

A simple Python script to convert multiple images (well technically also a single image) into a pdf.

PythonImage2PDF A simple Python script to convert multiple images into a single PDF-document. Created basically for only my own needs for converting m

Joona Gynther 1 Jun 28, 2022
Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 02, 2022
borb is a library for reading, creating and manipulating PDF files in python.

borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 01, 2023
PyMuPDF is a Python binding with support for MuPDF

PyMuPDF is a Python binding with support for MuPDF (current version 1.18.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, I

PyMuPDF 1.9k Jan 03, 2023
A Python tool to generate a static HTML file that represents the internal structure of a PDF file

PDFSyntax A Python tool to generate a static HTML file that represents the internal structure of a PDF file At some point the low-level functions deve

Martin D. 394 Dec 30, 2022
Table automatically extraction from PDF Document

PDF Table Extractor Table automatically extraction from PDF Document Our Icon 📌 Name : PDF Table Extractor 📌 Authors : Minku Koo Jiyong Park 📌 Deve

1 Jan 10, 2022
Python lib for Simple PDF text extraction

Python lib for Simple PDF text extraction

Jason Alan Palmer 651 Jan 01, 2023
Zen-Knit is a formal (PDF), informal (HTML) report generator for data analyst and data scientist who wants to use python.

About Zen-Knit: Zen-Knit is a formal (PDF), informal (HTML) report generator for data analyst and data scientist who wants to use python. Inspired fro

Zen Reportz 27 Jul 13, 2022
A backend for mdbook in Python for generating PDF based on Chrome DevTools Protocol.

mdbook-pdf A backend for mdbook written in Python for generating PDF based on Chrome DevTools Protocol. Python library dependency Usage Put mdbook-pdf

Hollow Man 49 Dec 27, 2022
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
Mipdfcompressor - 💕A simple pdf size compressing telegram robot

Pdf Compressor Telegram Bot A simple pdf size compressing telegram robot. Useful for digital documentation. Mandatory Variables API_HASH - Your A

Madhavan Mi 1 Feb 14, 2022
A tool for certificate PDF generation.

certificate-pdf-generator 获奖证书PDF批量生成工具 | a Tool for certificate PDF generation. ⚠️ 下载前请注意 本项目使用了LFS来存储PDF等大文件。在克隆或下载本仓库前,请先使用apt等包管理器安装git-lfs包。如果已经克

Wanghao Xu 4 Nov 28, 2022
Simple pdf editor while preserving structure and format.

SIMPdf Simple pdf editor while preserving structure and format.

Shashwat Singh 242 Jan 04, 2023
Extract the table in the PDF,outputs the data similar to the json format

extract the table in the PDF,outputs the data similar to the json format

3 Nov 25, 2021
pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

1.6k Jan 03, 2023
Generate a preview image for a PDF.

PDF ➡️ Preview A simple tool to save me time on Illustrator. Generates a preview image for a PDF file. Useful for sneak peeks to academic publications

David Chuan-En Lin 51 Sep 22, 2022
Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

Will Fantom 1 Feb 09, 2022
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files

Matthew Stamy 5k Jan 04, 2023
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
Svg2pdfgen - Svg To PDF gen with python

Svg2pdfgen - Svg To PDF gen with python

Robert Urbańczyk 3 May 30, 2022