OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Last update: Jan 08, 2023

Overview

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output

See the release notes for details on the latest changes.

Main features

Generates a searchable PDF/A file from a regular PDF
Places OCR text accurately below the image to ease copy / paste
Keeps the exact resolution of the original embedded images
When possible, inserts OCR information as a "lossless" operation without disrupting any other content
Optimizes PDF images, often producing files smaller than the input file
If requested, deskews and/or cleans the image before performing OCR
Validates input and output files
Distributes work across all available CPU cores
Uses Tesseract OCR engine to recognize more than 100 languages
Scales properly to handle files with thousands of pages
Battle-tested on millions of PDFs

For details: please consult the documentation.

Motivation

I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:

Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
Or they did not handle accents and multilingual characters
Or they changed the resolution of the embedded images
Or they generated ridiculously large PDF files
Or they crashed when trying to OCR
Or they did not produce valid PDF files
On top of that, none of them produced PDF/A files (a format dedicated to long-term storage)

...so I decided to develop my own tool.

Installation

Linux, Windows, macOS and FreeBSD are supported. Docker images are also available, for both x64 and ARM.

Operating system	Install command
Debian, Ubuntu	`apt install ocrmypdf`
Windows Subsystem for Linux	`apt install ocrmypdf`
Fedora	`dnf install ocrmypdf`
macOS	`brew install ocrmypdf`
LinuxBrew	`brew install ocrmypdf`
FreeBSD	`pkg install py37-ocrmypdf`
Conda	`conda install ocrmypdf`

For everyone else, see our documentation for installation steps.

Languages

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.

OCRmyPDF supports Tesseract 4.0 and the beta versions of Tesseract 5.0. It will automatically use whichever version it finds first on the PATH environment variable. On Windows, if PATH does not provide a Tesseract binary, we use the highest version number that is installed according to the Windows Registry.
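
The PATH rule described above can be checked with Python's standard library. This is a minimal sketch, not OCRmyPDF's own lookup code: first_on_path is a hypothetical helper name, and the Windows Registry fallback is not covered.

```python
import shutil


def first_on_path(program="tesseract", path=None):
    """Return the full path of the first executable named `program`.

    `path` is a PATH-style string; when it is None, the PATH environment
    variable is searched. Returns None when nothing is found.
    """
    return shutil.which(program, path=path)


if __name__ == "__main__":
    # Shows which Tesseract binary a PATH search finds first, if any.
    print(first_on_path("tesseract"))
```

If two Tesseract installations are present, reordering the directories in PATH changes which one this search returns.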

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for quick response.

Requirements

In addition to the required Python version (3.7+), OCRmyPDF requires external program installations of Ghostscript and Tesseract OCR. OCRmyPDF is pure Python, and runs on pretty much everything: Linux, macOS, Windows and FreeBSD.

Press & Media

Going paperless with OCRmyPDF
Converting a scanned document into a compressed searchable PDF with redactions
c't 1-2014, page 59: Detailed presentation of OCRmyPDF v1.0 in the leading German IT magazine c't
heise Open Source, 09/2014: Texterkennung mit OCRmyPDF
heise Durchsuchbare PDF-Dokumente mit OCRmyPDF erstellen
Excellent Utilities: OCRmyPDF
LinuxUser Texterkennung mit OCRmyPDF und Scanbd automatisieren

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.

License

The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as noted in those files and the debian/copyright file. Most files in misc/ use the MIT license, and the documentation and test files are generally licensed under Creative Commons ShareAlike 4.0 (CC-BY-SA 4.0).

Disclaimer

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Comments

Improve user experience for Windows 10
Hi

Describe the issue I've managed to run OCRmyPDF.exe on Windows 10 without wsl.

To Reproduce I've made fork and added some quick fixes in this commit: https://github.com/dibu28/OCRmyPDF/commit/543088e79e8649e968d02d8fd268123255607dc1

Fixes are:

1. In leptonica.py, the library name is liblept-5 instead of lept.

2. In ghostscript.py:
2.1) The executable name is gswin64c.exe instead of gs.
2.2) NamedTemporaryFile doesn't work properly: gs could not modify the temp file and failed with an access denied error. As a temporary workaround I'm adding "_1" to the temp file name and then removing the file. There could be a better way.

3. In _pipeline.py and helpers.py, symlinking to the temp folder on Windows requires admin privileges, so instead of symlinking I'm just copying files.

4. In _sync.py, os.path.samefile raises an error: "OSError: [WinError 1] Incorrect function: 'nul'"

After those changes and installing the dependencies, it started to work from the command line like this: OCRmyPDF.exe input.pdf output.pdf

Dependencies and binaries I'm using:
https://www.python.org/ftp/python/3.7.5/python-3.7.5-amd64.exe
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe
https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs950/gs950w64.exe
https://github.com/qpdf/qpdf/releases/download/release-qpdf-9.0.2/qpdf-9.0.2-bin-msvc64.zip

Add paths to PATH variable:
set PATH=%PATH%;C:\Program Files\Tesseract-OCR;
set PATH=%PATH%;C:\Program Files\gs\gs9.50\bin;
set PATH=%PATH%;C:\qpdf\qpdf-9.0.2-bin-msvc64\qpdf-9.0.2\bin;

python setup.py build
OCRmyPDF.exe input.pdf output.pdf

Expected behavior Can we add some workarounds using conditions based on os type?
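
The kind of OS-conditional workaround asked about here could look roughly like the following. It is a minimal sketch based only on the fixes listed in this issue, not the code that OCRmyPDF adopted; ghostscript_executable and link_or_copy are hypothetical helper names.

```python
import os
import shutil
import sys


def ghostscript_executable(platform=sys.platform):
    """Return the Ghostscript program name to look for on this platform."""
    # The issue reports gswin64c.exe on 64-bit Windows and gs elsewhere.
    return "gswin64c.exe" if platform == "win32" else "gs"


def link_or_copy(src, dst):
    """Symlink src to dst, falling back to a plain copy.

    On Windows, creating symlinks may require elevated privileges, so a
    failed os.symlink call is treated as a signal to copy the file instead.
    """
    try:
        os.symlink(src, dst)
    except (OSError, NotImplementedError):
        shutil.copyfile(src, dst)
```

Trying the symlink first and copying only on failure keeps the behavior unchanged on Linux and macOS, at the cost of extra disk I/O on systems where the fallback is taken.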

System:

OS: Windows 10

OCRmyPDF Version: v9.0.5

Additional context
enhancement
opened by dibu28 57
OSError: cannot load library 'C:\Program Files\Tesseract-OCR\liblept-5.dll': error 0x7e

OSError: cannot load library 'C:\Program Files\Tesseract-OCR\liblept-5.dll': error 0x7e

As in #631, I am getting the same error, except that instead of 0x7f I am getting 0x7e.
I am using Python 3.9.2 (64-bit), Windows 10 (64-bit), and OCRmyPDF 12.5.0. I can't solve the problem the way #631 was solved, that is, by changing leptonica.py to open zlib.dll before liblept-5.dll.

When I run ocrmypdf --help or ocrmypdf --version, it displays the same OSError.

Does anyone know what to do? @jbarlow83

opened by meet1919 28
Add interword space option to HOCR pdf renderer
This pull request adds a new advanced option --interword-spaces to OCRmyPDF to allow the hocr renderer to produce PDF output compatible with PDF.js and potentially other viewers that have difficulty detecting phrases, lines, and paragraphs in separately placed text layers. This new switch is a workaround for limitations of the PDF.js viewer described in https://github.com/jbarlow83/OCRmyPDF/issues/133.

Background

OCRmyPDF justifiably prioritizes the accurate placement of words on the text layer as individual glyphs. Most PDF viewers have heuristics that allow them to identify paragraphs, lines, and phrases while searching and to insert the correct inter-word spacing when copying and pasting. PDF.js has over 80 issues flagged with 4-text-selection and there have been a number of pull requests to address the issue that have apparently gotten bogged down with edge cases, performance concerns, and perhaps the inherent challenges of a pure Javascript and HTML approach to PDF rendering.

Strategy

The goal of this pull request is to add an unobtrusive option to OCRmyPDF to allow it to produce PDF.js compatible output for those that must support PDF.js as a business requirement. Specifically, this PR follows the code conventions by adding an advanced option --interword-spaces to the options parser and ensures this option is available to the hocrtransform.py renderer. When set to true, the HOCR renderer will add an additional space at the end of each text element before drawing it on the text layer. This option does not apply to other pdf renderers in OCRmyPDF, is turned off by default, and issues a warning if used without the --pdf-renderer hocr option also set.
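
The effect of the option can be illustrated with a small sketch. This is not the code in hocrtransform.py: text_for_drawing is a hypothetical helper that only mimics the "append a space to each text element" behavior described above.

```python
def text_for_drawing(words, interword_spaces=False):
    """Return the strings that would be drawn for a sequence of hOCR words.

    With interword_spaces enabled, each word gets a trailing space, so a
    viewer that simply concatenates text elements still sees word breaks.
    """
    suffix = " " if interword_spaces else ""
    return [word + suffix for word in words]


if __name__ == "__main__":
    words = ["scanned", "PDF", "files"]
    # A naive viewer that joins elements without its own spacing heuristics:
    print("".join(text_for_drawing(words)))
    print("".join(text_for_drawing(words, interword_spaces=True)))
```

The trailing space also follows the last word of a line, which is harmless for search and copy-paste in the viewers tested in this PR.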

Documentation

This PR added a new section to the advanced documentation for the new option, a note on the 'hocr' renderer description about the option in the same file, and a note that this is available in the introduction where there is a relevant discussion of PDF as a layout format dependent on the viewer to interpret the structure of the document in terms of words, sentences, and paragraphs.

Testing

We confirmed that existing tests that exercised this code continue to pass. We encountered some seemingly preexisting failures in other tests. We explored the option of adding additional tests to confirm that the warning is provided and the output is as expected, and would welcome guidance as to where that test should be placed and how best to combine it with RENDERERS tests in test_main.py or the more specific test_hocrtransform.py.

Sample PDF Output

The following file was processed with this option set to true. When loaded into the latest PDF.js viewer, multi-word search and copy and paste are improved over the standard HOCR output:

Input PDF: https://github.com/logikcull/OCRmyPDF/blob/master/tests/resources/linn.pdf

# original command
ocrmypdf --output-type pdf --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.pdf

Output PDF: https://www.dropbox.com/s/2ugzxldqsvy8q6x/output.linn.hocr.pdf

Behavior when loaded into latest PDF.js viewer -- note that you have to remove spaces to find multiple words. Selecting and pasting the text also has spaces removed:

# command with new --interword-spaces option
ocrmypdf --output-type pdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.interword.pdf

Output PDF: https://www.dropbox.com/s/tukp6ftpjebe1gh/output.linn.hocr.interword.pdf

Behavior when loaded into latest PDF.js viewer -- note that you can find multiple words separated by spaces. Copy and paste is also improved:

Testing in Adobe Reader and Chrome's native PDF viewer showed that files rendered with the new option continued to perform as well or better when searching and copying and pasting. Apple Preview handled neither output file particularly well so we think there has at least been no harm done.

Alternative Approaches

If this approach of adding a new option with a warning if used without hocr is too disruptive, we could also consider contributing a new pipeline task for a fifth renderer titled 'hocr-sloppy-text' or something similar that runs a nearly identical version of hocrtransform.py with the space suffix turned on by default. This approach has the serious downside of repeating complex code, but the upside of leaving the existing hocr rendered behavior 100% unchanged and opening the way in the future for other "sloppy-text" fixes required to produce PDFs for simpler viewers like PDF.js.

Related Issues

OCRMyPDF:

133: Some hints that Tesseract upgrades might provide some relief, but underlying conclusion was that PDF.js has a naive implementation of text selection and word boundaries (https://github.com/jbarlow83/OCRmyPDF/issues/133).

Tesseract:

1235 December 2017: https://github.com/tesseract-ocr/tesseract/issues/1235 includes good explanation of reason for space detection issues: "Known problem. Root cause is PDF spec which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics."

699 https://github.com/tesseract-ocr/tesseract/issues/699#issuecomment-277486345

382 https://github.com/tesseract-ocr/tesseract/issues/382

337 https://github.com/tesseract-ocr/tesseract/issues/337

PDF.js:

7310: Super helpful discussion of HTML divs: https://github.com/mozilla/pdf.js/issues/7310

6657: https://github.com/mozilla/pdf.js/issues/6657

Related PR not merged: https://github.com/mozilla/pdf.js/pull/5783

Dozens of text selection issues: https://github.com/mozilla/pdf.js/issues?q=is%3Aopen+is%3Aissue+label%3A4-text-selection
opened by cforcey 28

NixOS packaging issues

Hi there

I'm currently trying to write a package file for OCRmyPDF for NixOS. I think I'm already pretty far along, but now I'm stuck on an error that I have no idea how to fix, as it doesn't seem to give any indication of where the problem actually occurs.

Anyway, I do get this error when it's trying to build OCRmyPDF:

building path(s) ‘/nix/store/kdpr7qaz85lrls5mwqyvgrfi5v811i5q-ORCmyPDF-5.4.3’
unpacking sources
unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
source root is source
setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
patching sources
configuring
building
Skipping external program tests because of --force
Traceback (most recent call last):
  File "nix_run_setup.py", line 8, in <module>
    exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
  File "setup.py", line 245, in <module>
    zip_safe=False)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
    _Distribution.__init__(self, attrs)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
    self.finalize_options()
  File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
    ep.load()(self, ep.name, value)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
    add_cffi_module(dist, cffi_module)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
    execfile(build_file_name, mod_vars)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
    src = f.read()
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
builder for ‘/nix/store/jsnfzz199dy49viv14l1is2i1d2r3lq9-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/niq3y1rw30sqx5gp5jwrd273hlv6xhb2-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed

The current nix expression that I use to try to build it looks like:

{ lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:

with python3.pkgs;

let

  ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
  img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};

in

buildPythonApplication rec {
  version = "5.4.3";
  name = "ORCmyPDF-${version}";

  src = fetchFromGitHub {
    owner = "jbarlow83";
    repo = "OCRmyPDF";
    rev = version;
    sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
  };

  postPatch = ''
    substituteInPlace requirements.txt \
      --replace "ruffus == 2.6.3" "ruffus" \
      --replace "Pillow == 4.3.0" "Pillow" \
      --replace "reportlab == 3.4.0" "reportlab" \
      --replace "PyPDF2 == 1.26.0" "PyPDF2" \
      --replace "img2pdf == 0.2.4" "img2pdf" \
      --replace "cffi == 1.11.2" "cffi"
    substituteInPlace test_requirements.txt \
      --replace "pytest >= 3.0" "pytest"
    export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
  '';

  buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];

  propagatedBuildInputs = [
    ruffus
    pillow
    reportlab
    pypdf2
    img2pdf
    cffi
    unpaper
    ghostscript
    tesseract
    qpdf
  ];

  meta = {
    homepage = https://github.com/jbarlow83/OCRmyPDF;
    description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
    license = lib.licenses.mit;
    maintainers = with lib.maintainers; [ hyper_ch ];
  };
}

I understand that there seems to be a problem with one of the files but I can't figure out where the problem actually occurs.
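
The last frames of the traceback show a file being read with the ascii codec and failing on byte 0xc2 at position 2. The sketch below reproduces that class of error in isolation. The sample bytes are an assumption chosen for illustration (a comment line starting with a copyright sign), not the contents of any file in the OCRmyPDF source tree, and first_non_ascii is a hypothetical helper.

```python
def first_non_ascii(data: bytes):
    """Return (position, byte) of the first byte that is not ASCII, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


if __name__ == "__main__":
    # Hypothetical file contents: "# " followed by U+00A9, encoded as UTF-8.
    sample = "# \u00a9 example header\n".encode("utf-8")
    print(first_non_ascii(sample))
    # Decoding the same bytes as UTF-8 succeeds.
    print(sample.decode("utf-8"))
```

Running a helper like this over the files that the build reads can narrow down which one contains non-ASCII bytes. Whether the default encoding in the build environment is the root cause is not established by the traceback alone.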

opened by sjau 26

ocrmypdf 11.4.4 failed to build on apple silicon

Describe the bug ocrmypdf 11.4.4 failed to build on apple silicon

build error message (run log url):

==> /opt/homebrew/Cellar/ocrmypdf/11.4.4/bin/ocrmypdf -f -q --deskew /opt/homebrew/Library/Homebrew/test/support/fixtures/test.pdf ocr.pdf
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/bin/ocrmypdf", line 5, in <module>
    from ocrmypdf.__main__ import run
  File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/libexec/lib/python3.9/site-packages/ocrmypdf/__init__.py", line 10, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/libexec/lib/python3.9/site-packages/ocrmypdf/leptonica.py", line 174, in <module>
    def _stderr_handler(cstr):
MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks

To Reproduce pip installation and run on darwin arm64 system?

Expected behavior build successfully

System (please complete the following information):

OS: OSX darwin arm64
Python version: python 3.9
OCRmyPDF version: Ocrmypdf 11.4.4

Additional context relates to https://github.com/Homebrew/homebrew-core/pull/68159

bug

opened by chenrui333 25

dependency problem reportlab - although installed...
Issue by andreasotto Tue Nov 4 10:44:25 2014 Originally opened as https://github.com/fritz-hh/OCRmyPDF/issues/99

# ./OCRmyPDF.sh /home/ao/Leerungstermine189973.PDF /home/ao/test.pdf
Please install the python library reportlab. Exiting...
# apt-get install python-reportlab
python-reportlab ist schon die neueste Version.

.. already installed.

Debian 6 squeeze
opened by OCRmyPDF-issuebot 25
Using Ubuntu Snap as packaging format

I took the liberty of creating a snap application recipe "snapcraft.yaml" which enables snapcraft's build platform to build a working snap application for ocrmypdf.

Take a look here: https://github.com/alexanderlanganke/ocrmypdf-snap

While building, it pulls in the application using pip so that it always uses the most recent version. This may make it easier for users to access ocrmypdf.

So far I am getting the application to build and run, but I am running into a missing dependency at runtime. I believe I need to adjust the path for one or two libraries.

I have also registered this snap (private for now) on snapcraft.

If you are interested, and I get it working, I would offer to maintain this snap for you or pass it on to you if you wish to do it yourself. Credit for the application will of course go to you! Snapcraft pulls from github so you basically need to get it working once and never touch it again. It will rebuild whenever you push to the linked repository (version bump for example).

opened by alexanderlanganke 23

[13.4.2] lossy compression of pngs into jpegs when it shouldn't

It might be just the older version, but ocrmypdf 12.7.2 seems to compress uncompressed pngs into (lossy) jpegs:

$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%

I believe it should be running the image through pngquant instead at optimize level 1.
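
A check like the one above can be automated by parsing the pdfimages -list table. The following is a minimal sketch that assumes the whitespace-separated column layout shown in this report; parse_pdfimages_list and jpeg_images are hypothetical helper names.

```python
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%
"""


def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into one dict per image row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) == {"-"}:
            continue  # separator line made of dashes
        fields = ln.split()
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def jpeg_images(rows):
    """Return the rows whose encoding column reports jpeg."""
    return [row for row in rows if row["enc"] == "jpeg"]


if __name__ == "__main__":
    for row in jpeg_images(parse_pdfimages_list(SAMPLE)):
        print(f"page {row['page']}: {row['width']}x{row['height']} stored as JPEG")
```

Because the header splits "object ID" into two tokens, the object number and generation number land under the keys "object" and "ID" respectively.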

Btw, it's probably not even worth mentioning since, looking at the changelog, I'm fairly certain you've already sorted it out in recent ocrmypdf versions, but small pdfs with small pngs grow instead of shrinking / remaining the same:

$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ img2pdf ./Example.png -o ./Example.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7799
Example.pdf           3906
Example.png           2335

Though this might also be the pdf format changing to the archival specs...

As a side note, if compute time isn't a factor, I personally found 'optipng -o7' to produce smaller pngs than pngquant and 'jpegrescan -i -t -v' to produce the smallest jpeg, even compared to MozJPEG despite the author saying otherwise oddly enough.

p.s. forgot to mention the png-to-jpeg bug also happens with some compressed pngs but I haven't bothered trying to replicate this since I believe it should never try to convert bitmap images to jpegs to begin with.

opened by RamKromberg 21

Anaconda - Successful Install but not working

Describe the bug (*update: 2022-04-22): Reorder sentences

What's the problem? I tried installing ocrmypdf using Conda on Windows, and the installation looks successful. I tried running tesseract on test.jpg, and it works fine:

(ocrmypdf) C:\Users\Denz\Downloads>tesseract test.jpg test

But whenever I run a test pdf, it doesn't output the OCR text. Here is the error log:

(ocrmypdf) C:\Users\Denz\Downloads>ocrmypdf --force-ocr NeedOCR2.pdf output.pdf
Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.11page/s]
    1 page already has text! - rasterizing text and running OCR anyway
    1 [tesseract] read_params_file: Can't open pdf
    1 [tesseract] read_params_file: Can't open txt
OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:07<00:00,  7.41s/page]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.20s/page]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: -0.0%
Image optimization did not improve the file - optimizations will not be used
Output file is a PDF/A-2B (as expected)
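
The two read_params_file lines in the log above suggest that Tesseract could not open config files named pdf and txt. The sketch below checks for them. It assumes, without confirmation from this report, that the configs are expected in a configs directory inside the tessdata directory; missing_tesseract_configs is a hypothetical helper and not part of OCRmyPDF.

```python
from pathlib import Path


def missing_tesseract_configs(tessdata_dir, names=("pdf", "txt")):
    """Return the config names with no file under <tessdata_dir>/configs."""
    configs = Path(tessdata_dir) / "configs"
    return [name for name in names if not (configs / name).is_file()]


if __name__ == "__main__":
    import os

    # TESSDATA_PREFIX is only consulted if it is set in the environment.
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        print(missing_tesseract_configs(prefix))
```

An empty list means both files are present at the assumed location; a non-empty list names the configs that the installation appears to lack.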

my Environment Packages inside Conda

(ocrmypdf) C:\Users\Denz\Downloads>conda list
# packages in environment at C:\Users\Denz\anaconda3\envs\ocrmypdf:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h8ffe710_4    conda-forge
ca-certificates           2021.10.8            h5b45459_0    conda-forge
cffi                      1.15.0                   pypi_0    pypi
chardet                   4.0.0                    pypi_0    pypi
colorama                  0.4.4                    pypi_0    pypi
coloredlogs               15.0.1                   pypi_0    pypi
cryptography              36.0.2                   pypi_0    pypi
ghostscript               9.54.0               h0e60522_2    conda-forge
humanfriendly             10.0                     pypi_0    pypi
img2pdf                   0.4.3                    pypi_0    pypi
jbig                      2.1               h8d14728_2003    conda-forge
jpeg                      9e                   h8ffe710_0    conda-forge
leptonica                 1.78.0               h688788b_4    conda-forge
lerc                      3.0                  h0e60522_0    conda-forge
libarchive                3.5.2                habf0b7a_1    conda-forge
libdeflate                1.10                 h8ffe710_0    conda-forge
libffi                    3.4.2                h8ffe710_5    conda-forge
libiconv                  1.16                 he774522_0    conda-forge
libpng                    1.6.37               h1d00b33_2    conda-forge
libtiff                   4.3.0                hc4061b1_3    conda-forge
libwebp                   1.2.2                h57928b3_0    conda-forge
libwebp-base              1.2.2                h8ffe710_1    conda-forge
libxml2                   2.9.12               hf5bbc77_2    conda-forge
libzlib                   1.2.11            h8ffe710_1014    conda-forge
lxml                      4.8.0                    pypi_0    pypi
lz4-c                     1.9.3                h8ffe710_1    conda-forge
lzo                       2.10              he774522_1000    conda-forge
ocrmypdf                  13.4.1                   pypi_0    pypi
openjpeg                  2.4.0                hb211442_1    conda-forge
openssl                   3.0.2                h8ffe710_1    conda-forge
packaging                 21.3                     pypi_0    pypi
pdfminer-six              20211012                 pypi_0    pypi
pikepdf                   5.1.1                    pypi_0    pypi
pillow                    9.0.1                    pypi_0    pypi
pip                       22.0.4             pyhd8ed1ab_0    conda-forge
pluggy                    1.0.0                    pypi_0    pypi
pngquant                  1.0.7                    pypi_0    pypi
pycparser                 2.21                     pypi_0    pypi
pyparsing                 3.0.7                    pypi_0    pypi
pyreadline3               3.4.1                    pypi_0    pypi
python                    3.10.4          hcf16a7b_0_cpython    conda-forge
python_abi                3.10                    2_cp310    conda-forge
reportlab                 3.6.9                    pypi_0    pypi
setuptools                61.3.0          py310h5588dad_0    conda-forge
sqlite                    3.37.1               h8ffe710_0    conda-forge
tesseract                 5.0.1                h17c68af_0    conda-forge
tk                        8.6.12               h8ffe710_0    conda-forge
tqdm                      4.63.1                   pypi_0    pypi
tzdata                    2022a                h191b570_0    conda-forge
ucrt                      10.0.20348.0         h57928b3_0    conda-forge
vc                        14.2                 hb210afc_6    conda-forge
vs2015_runtime            14.29.30037          h902a5da_6    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h62dcd97_1    conda-forge
zlib                      1.2.11            h8ffe710_1014    conda-forge
zstd                      1.5.2                h6255e5f_0    conda-forge

System (please complete the following information):

OS: Windows 10
Python version: 3.10
OCRmyPDF version: 13.4.1

Installation Installed via Pip

Additional context I believe the issue linked below is a similar problem, but the fix there was done on Linux. I don't know how to fix it under conda.

Here are the before & after files NeedOCR2.pdf output.pdf

https://github.com/ocrmypdf/OCRmyPDF/issues/773

third party issue

opened by denzchoe 21

Segmentation fault when using pipes

Describe the bug When running ocrmypdf through podman/docker I sometimes (#864) experience segmentation faults and the container hangs indefinitely. The output file is empty.

To Reproduce The following command reproduces the failure. Because the behavior of ocrmypdf is non-deterministic, it might take a while, or even several runs of the loop, to reproduce.

for i in $(seq 0 100); do
    podman run --rm -i ocrmypdf --verbose -rcd --jbig2-lossy -l deu - - <tmp.pdf >out.pdf
done
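
Because the failure shows up as a hang with an empty output file, a reproduction loop with a timeout can make it easier to catch. The following is a minimal sketch, not part of the original report: first_failing_attempt is a hypothetical helper, and the timeout value is an arbitrary assumption.

```python
import subprocess


def first_failing_attempt(cmd, input_bytes, attempts=100, timeout=600):
    """Run cmd repeatedly, feeding input_bytes on stdin.

    Returns (attempt_index, reason) for the first failure, or None if every
    attempt succeeded. A failure is a timeout (possible hang), a non-zero
    exit status, or empty standard output.
    """
    for attempt in range(attempts):
        try:
            result = subprocess.run(
                cmd,
                input=input_bytes,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return attempt, "timeout"
        if result.returncode != 0:
            return attempt, f"exit status {result.returncode}"
        if not result.stdout:
            return attempt, "empty output"
    return None
```

For example, first_failing_attempt(["podman", "run", "--rm", "-i", "ocrmypdf", "-", "-"], open("tmp.pdf", "rb").read()) mirrors the loop above with the options omitted. On a timeout, subprocess.run kills only the client process it started, so a hung container may still need to be removed by hand.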

All of the options can be omitted and the issue is reproducible. The resulting log is:

ocrmypdf 12.6.0.post6+g42713b77.d20211012
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['unpaper', '--version']
Found unpaper 6.1
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/stdin, /tmp/ocrmypdf.io.yzr1_6f6/origin.pdf)
Using Tesseract OpenMP thread limit 3
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
    1 Rotating output by 0
    1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/ocrmypdf.io.yzr1_6f6/000001_rasterize_preview.jpg', 'stdout']
    1 page is facing ⇧, confidence 7.23 - no change
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
    1 Rotating output by 0
    1 Running: ['unpaper', '-v', '--dpi', '150.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpmqv67lqw/input.pnm', '/tmp/tmpmqv67lqw/output.pgm']
    1 stdout/stderr = [image2 @ 0x55a80053afc0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55a80053afc0] Encoder did not produce proper pts, making some up.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmpmqv67lqw/input.pnm -> /tmp/tmpmqv67lqw/output.pgm
input-file for sheet 1: /tmp/tmpmqv67lqw/input.pnm
output-file for sheet 1: /tmp/tmpmqv67lqw/output.pgm
sheet size: 1232x1718
...
noise-filter ... deleted 47 clusters.
blur-filter... deleted 0 pixels.
writing output.

    1 resolution (150.01239999999999, 150.01239999999999)
    1 convert
    1 PIL format = PNG
    1 imgformat = PNG
    1 input dpi = 150 x 150
    1 rotation = 0°
    1 input colorspace = L
    1 width x height = 1232px x 1718px
    1 read_images() embeds a PNG
    1 convert done
    1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr.png', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr_tess', 'pdf', 'txt']
    1 Emplacement update
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/graft_layers.pdf, /tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf', '/tmp/ocrmypdf.io.yzr1_6f6/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/optimize.opt.pdf, /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf)
/tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf -> -
Output sent to stdout

dmesg yields:

[21719.464718] conmon[91767]: segfault at 111d000 ip 00007fcf434cf980 sp 00007ffc7f66d4e8 error 4 in libc.so.6[7fcf43380000+176000]
[21719.464741] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed

(Always the same location in libc)

When I replaced >out.pdf with tee out.pdf, I could at some point see strange characters being emitted after %%EOF (?); most of the time, however, it hangs before that.

Example file The example file is attached in encrypted form. tmp.pdf.gpg.zip

Expected behavior The output file should be correct and the tool should not hang.

System

OS: Fedora 35
OCRmyPDF Version: 12.6.0.post6+g42713b77.d20211012, but reproducible just as well with jbarlow83/ocrmypdf:v13.2.0, jbarlow83/ocrmypdf:v13.1.1 and jbarlow83/ocrmypdf:v13.1.0
How did you install ocrmypdf? podman pull jbarlow83/ocrmypdf

third party issue

opened by Fulguritus 20

White glyphs when selecting ocr-text in Evince

Problem in the Evince PDF reader:

It only happens when selecting text. Is this a display failure, or are fonts missing? Otherwise the OCR text is correct. Similar to #178?

opened by robinrosenstock 20
Feature request: Ask user what likely-incorrect words are

OCRmyPDF is great as it is. It is an excellent tool for OCRing PDFs without any human involvement.

However, if a human is available, their involvement could be put to good use.

Problem Tesseract+OCRmyPDF doesn't recognize every word correctly, which matters when accurate text is required. Correcting errors in an output PDF is more difficult than correcting plain text output.

Proposed solution Tesseract generates a low confidence value for words whose glyphs it has difficulty working out. I understand OCRmyPDF checks all words against a dictionary for the selected language as part of its existing process.

A word that is likely wrong could have the part of the image containing its sentence presented to the user (with the word marked by a red box). The user would be asked what the word is (like a CAPTCHA), and the OCR results would be amended. If the word the user provides isn't in the dictionary, they should be asked whether they want to add it.

This would happen in parallel with the main processing. Sentences containing words identified for checking would cumulatively fill the screen, waiting for human response.

The proposed functionality would obviously not be default, and should have appropriate user settings for adjustment.

Describe alternatives you've considered Get OCRmyPDF to output hOCR and PDF files simultaneously, then go through the pages manually using gImageReader. It would work, but more slowly than the proposed method.
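
The selection step in this proposal (finding the words worth showing to a human) could be sketched as follows. This is only an illustration, not anything OCRmyPDF implements: it assumes hOCR input in which each word is a span with class ocrx_word whose title attribute carries bbox and an integer x_wconf value. The names WordCollector and low_confidence_words and the threshold of 60 are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed hOCR title format: "bbox x0 y0 x1 y1; x_wconf NN"
WCONF_RE = re.compile(r"x_wconf\s+(\d+)")
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordCollector(HTMLParser):
    """Collect text, confidence and bounding box for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            conf = WCONF_RE.search(title)
            bbox = BBOX_RE.search(title)
            self._current = {
                "text": "",
                "conf": int(conf.group(1)) if conf else None,
                "bbox": tuple(map(int, bbox.groups())) if bbox else None,
            }

    def handle_data(self, data):
        # Text inside nested tags such as <strong> is collected as well.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def low_confidence_words(hocr, threshold=60):
    """Return the words whose confidence is below the (hypothetical) threshold."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return [w for w in parser.words
            if w["conf"] is not None and w["conf"] < threshold]
```

Each returned entry carries the bounding box, so a review tool could crop that region of the page image and ask the user to confirm or correct the word.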

opened by mattention 0
Is it possible to capture Tesseract messages and suggestions either as exceptions or exit codes?

Is your feature request related to a problem? Please describe. Sometimes when running OCR jobs with redo_ocr, I can see certain suggestions like rescanning the file with force_ocr from OCRmyPDF and similar observations about the quality of text from Tesseract. Is it possible to somehow capture these messages, so that I can programmatically filter those files out and rerun OCR with the recommended parameters?

Describe the solution you'd like A status code or custom exception to catch and retry the running job.

Describe alternatives you've considered Filtering stdout and looking for said keywords.

Example file N/A

Additional context N/A
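
The stdout-filtering alternative mentioned above could look roughly like this. It is a minimal sketch that runs the command line tool as a subprocess and scans the captured output for caller-supplied keywords. The function names find_hints and run_and_collect are hypothetical, and the keywords are placeholders, since the exact wording of the messages is not specified here.

```python
import subprocess


def find_hints(log_text, keywords):
    """Return the log lines that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in lowered)]


def run_and_collect(args, keywords):
    """Run a command, capture stdout and stderr, return (exit_code, hint_lines)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.returncode, find_hints(proc.stdout + "\n" + proc.stderr, keywords)
```

A call such as run_and_collect(["ocrmypdf", "--redo-ocr", "in.pdf", "out.pdf"], ["force-ocr"]) would return the exit code along with any matching lines, which a batch script could use to decide which files to rerun. Matching on message text is fragile, which is the motivation for requesting a dedicated exit code or exception.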

opened by sergeyyurkov1 0
[BUG] `--deskew` not compatible with blank pages or with tesseract_timeout = 0
Describe the bug The --deskew option is not behaving as expected on OCRmyPDF 13.7.0. I am experiencing two issues related to deskew.

Issue 1: Deskew not working on blank pages

I'm using the following options --output-type=pdf --tesseract-timeout=30 on this blank_image.pdf. When I run the OCRmyPDF command above, I get a SubprocessOutputError. I see that the issue is referenced here: https://github.com/ocrmypdf/OCRmyPDF/issues/868, but I don't think the bug fix covered all scenarios.

Issue 2: Deskew not working with tesseract_timeout=0

I want to deskew PDFs without running OCR on them, as mentioned in the docs here. However, when --tesseract-timeout=0, the document is not being deskewed because OCR is not being run. If I change --tesseract-timeout to a different integer, it successfully deskews. Here is a skewed PDF that can be used to reproduce the issue: skewed_text.pdf

To Reproduce Issue 1: Use blank_image.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=30 blank_image.pdf result.pdf. Issue 2: Use skewed_text.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=0 skewed_text.pdf result.pdf.

Expected behavior I expect that blank pages do not completely block the ocrmypdf command from running. It should handle the error gracefully and skip deskewing that specific page. I also expect that with --tesseract-timeout=0 the page can be deskewed without having OCR applied.
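
Until the behavior described above is available, a caller could approximate it at the document level. The sketch below is a hypothetical workaround, not part of OCRmyPDF: it first runs the command from Issue 1 with --deskew and, if that exits with a nonzero status, retries the whole document without it. The helper names build_args and ocr_with_deskew_fallback are made up for illustration, and the fallback applies to the entire file rather than to a single page.

```python
import subprocess


def build_args(input_pdf, output_pdf, deskew=True, timeout=30):
    """Build the ocrmypdf command line used in the reproduction steps."""
    args = ["ocrmypdf"]
    if deskew:
        args.append("--deskew")
    args += ["--output-type=pdf", f"--tesseract-timeout={timeout}"]
    return args + [input_pdf, output_pdf]


def ocr_with_deskew_fallback(input_pdf, output_pdf, run=subprocess.run):
    """Try with --deskew first; on failure, retry the document without it."""
    first = run(build_args(input_pdf, output_pdf, deskew=True))
    if first.returncode == 0:
        return "deskewed"
    second = run(build_args(input_pdf, output_pdf, deskew=False))
    return "no-deskew" if second.returncode == 0 else "failed"
```

The run parameter exists only so the logic can be exercised without invoking the real tool. This does not help with Issue 2, where the command succeeds but no deskewing takes place.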

Screenshots Deskew with 0 second timeout: Deskew with 30 second timeout:

System

OS: MacOS Ventura 13.0.1

OCRmyPDF version: 13.7.0

Installation brew install ocrmypdf
opened by deexpabada 0
Spaces in Japanese

Hi all! I wonder if it is possible to do OCR having all spaces completely ignored in the outcome? Languages like Japanese do not really use any spaces (even after commas or periods), but currently OCRmyPDF seems to find spaces between almost every character, which is very problematic when you want to search for sentences/words in the document, or google translate parts of it... Thank you in advance!
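
One possible workaround, when the goal is searching or translating text extracted from the PDF, is to strip the spurious spaces after extraction. The sketch below is an assumption-laden illustration: it operates on a plain string and does not alter the text layer inside the PDF, and the function name strip_cjk_spaces and the chosen Unicode ranges are hypothetical choices. It removes spaces only when both neighbors are Japanese or CJK characters, so spaces around Latin words are kept.

```python
import re

# CJK punctuation (such as 、 and 。), hiragana, katakana, CJK unified ideographs
_CJK = r"\u3001-\u303f\u3040-\u30ff\u4e00-\u9fff"

# One or more spaces, tabs or ideographic spaces between two CJK characters
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace that sits directly between two CJK characters."""
    return _SPACE_BETWEEN_CJK.sub("", text)
```

Line breaks are left untouched, so paragraph structure in the extracted text is preserved.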

opened by KajiyaOokami 3
Ignore Digital Signed Documents

Hi,

Is there a way to ignore digitally signed documents? And were there any changes recently? I would swear that a year ago digitally signed documents would just throw an error.

Thanks.

opened by flaviobrunopereira 0
Draw/Blanking on wrong spot

tesadasgfdgdf.pdf

Settings: {"redo_ocr":true,"language":"deu+eng","clean":true}

The rectangle is always too low, and that's why the output is completely wrong.

Can you please look into it? I tried everything, but the result is still the same. If I change the font to something else and switch back, everything is right.

opened by emre1e 0

Releases(v4.0)

v4.0(Feb 17, 2016)
Automatic page rotation (-r) is now available. It ignores any prior rotation information on PDFs and sets rotation based on the dominant orientation of detectable text. This feature is fairly reliable, but some false positives occur, especially if there is not much text to work with. (#4)

Deskewing is now performed using Leptonica instead of unpaper. Leptonica is faster and more reliable at image deskewing than unpaper.

Source code(tar.gz)
Source code(zip)
v3.2(Feb 5, 2016)

See release notes
Source code(tar.gz)
Source code(zip)
v3.1.1(Jan 10, 2016)
Fix error that affected page size calculations in most documents

Source code(tar.gz)
Source code(zip)
v3.1(Dec 4, 2015)
Default output format is now PDF/A-2b instead of PDF/A-1b

Python 3.5 and OS X El Capitan are now supported platforms - no changes were needed to implement support

Improved some error messages related to missing input files

Fixed issue #20 - uppercase .PDF extension not accepted

Fixed an issue where OCRmyPDF failed to detect that certain pages contained previously OCR'ed text, such as OCR text produced by Tesseract 3.04

Inserts /Creator tag into PDFs so that errors can be traced back to this project

Added new option --pdf-renderer=auto, to let OCRmyPDF pick the best PDF renderer. Currently it always chooses the 'hocrtransform' renderer but that behavior may change.

Set up Travis CI automatic integration testing

Source code(tar.gz)
Source code(zip)
v3.0(Sep 14, 2015)

See release notes for details
Source code(tar.gz)
Source code(zip)

Performing the following operations using python on PDF.

Simple pdf editor while preserving structure and format.

Camelot is a Python library that can help you extract tables from PDFs!

Auto Convert PDFs to png files in python

An application which enables the users to perform simple yet intriguing PDF operations

A Python tool to generate a static HTML file that represents the internal structure of a PDF file

Convert PDF to AudioBook and Audio Speech to PDF

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Scans pdfs for links written in plaintext and checks if they are active or returns an error code.

A python library for extracting text from PDFs without losing the formatting of the PDF content.

A bot for PDF for doing Many Things....

Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files

borb is a library for reading, creating and manipulating PDF files in python.

Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

minipdf is a package for creating simple, single-page PDF documents.

pdf_sprinkles: sprinkles text in your PDFs

Merge multiple PDF files into one.

Svg2pdfgen - Svg To PDF gen with python

Python script that split PDF files.