Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Overview

hocr-tools

About

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

There is a Public Specification for the hOCR Format.

About the code

Each command line program is self-contained; if you have Python 2.7 or Python 3 with the required packages installed, it should just work. (Unfortunately, that means some code duplication; we may revisit this issue in later revisions.)

Installation

System-wide with pip

You can install hocr-tools along with its dependencies from PyPI:

sudo pip install hocr-tools

System-wide from source

On a Debian/Ubuntu system, install the dependencies from packages:

sudo apt-get install python-lxml python-reportlab python-pil \
  python-beautifulsoup python-numpy python-scipy python-matplotlib

Or, to fetch the dependencies from PyPI (the "cheese shop"):

sudo pip install -r requirements.txt  # basic

Then install the dist:

sudo python setup.py install

virtualenv

Once

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Subsequently

source venv/bin/activate
./hocr-...

Available Programs

Included command line programs:

hocr-check

hocr-check file.html

Perform consistency checks on the hOCR file.

hocr-combine

hocr-combine file1.html [file2.html ...]

Combine the OCR pages contained in each HTML file into a single document. The document metadata is taken from the first file.

hocr-cut

hocr-cut [-h] [-d] [file.html]

Cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated cleanly, e.g. for cutting double pages or double columns.

hocr-eval-lines

hocr-eval-lines [-v] true-lines.txt hocr-actual.html

Evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in hocr-actual.html agree (most ASCII output from OCR systems satisfies this requirement).

hocr-eval-geom

hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual

Compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
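
The following is a minimal sketch of the kind of bounding-box overlap test such a comparison could rest on. It is not the code used by hocr-eval-geom: the helper names, the (x0, y0, x1, y1) box layout, and the way the threshold is applied are all assumptions made for illustration.

```python
def box_area(box):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0) * max(y1 - y0, 0)


def overlap_fraction(box, other):
    """Fraction of `box` that is covered by `other` (0.0 to 1.0)."""
    intersection = (
        max(box[0], other[0]),
        max(box[1], other[1]),
        min(box[2], other[2]),
        min(box[3], other[3]),
    )
    area = box_area(box)
    return box_area(intersection) / area if area else 0.0


def count_overlaps(box, others, threshold):
    """Number of boxes in `others` covering more than `threshold` of `box`."""
    return sum(1 for other in others if overlap_fraction(box, other) > threshold)
```

Under these assumptions, a ground-truth line that is significantly overlapped by two or more actual lines would be a candidate for oversegmentation, and the reverse situation a candidate for undersegmentation.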

hocr-eval

hocr-eval hocr-true.html hocr-actual.html

Evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.

It works by aligning segmentation components geometrically, and for each segmentation component that can be aligned, computing the string edit distance of the text the segmentation component contains.
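
The string edit distance mentioned above can be sketched as a standard Levenshtein distance with unit costs. This is an illustrative implementation, not necessarily the one used in hocr-eval; the function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit costs."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Only two rows of the dynamic programming table are kept at a time, so memory use grows with the length of one string rather than the product of both.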

hocr-extract-g1000

Extract lines from Google 1000 book sample

hocr-extract-images

hocr-extract-images [-b BASENAME] [-p PATTERN] [-e ELEMENT] [-P PADDING] [file]

Extract the images and texts within all the ocr_line elements within the hOCR file. The BASENAME is the image directory, the default pattern is line-%03d.png, the default element is ocr_line and there is no extra padding by default.

hocr-lines

hocr-lines [FILE]

Extract the text within all the ocr_line elements within the hOCR file given by FILE. If called without any file, hocr-lines reads hOCR data from stdin.
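
As a rough illustration of what extracting line text means, the sketch below collects the text of every element whose class includes ocr_line. It assumes the hOCR input is well-formed XHTML so that the standard library XML parser can read it; the sample document and the name `extract_lines` are made up for this example and are not taken from hocr-lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 100 50">
<span class="ocr_line" title="bbox 0 0 100 20"><span class="ocrx_word">Hello</span> <span class="ocrx_word">world</span></span>
<span class="ocr_line" title="bbox 0 25 100 45">second line</span>
</div></body></html>"""


def extract_lines(hocr_text):
    """Return the whitespace-normalized text of each ocr_line element."""
    root = ET.fromstring(hocr_text)
    lines = []
    for element in root.iter():
        if "ocr_line" in element.get("class", "").split():
            text = "".join(element.itertext())
            lines.append(" ".join(text.split()))
    return lines


for line in extract_lines(SAMPLE):
    print(line)
```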

hocr-merge-dc

hocr-merge-dc dc.xml hocr.html > hocr-new.html

Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.

hocr-pdf

hocr-pdf <imgdir> > out.pdf
hocr-pdf --savefile out.pdf <imgdir>

Create a searchable PDF from a pile of hOCR and JPEG files. Each JPEG must have a corresponding hOCR file with the same base name and the respective file extension. All of these files should lie in one directory, which is passed as the argument to the command. For example, hocr-pdf . > out.pdf runs the command in the current directory and saves the output as out.pdf; alternatively, hocr-pdf . --savefile out.pdf avoids routing the output through the terminal.
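
The naming requirement can be checked before running the tool. The sketch below pairs image and hOCR files by base name; the `.hocr` extension and the function name `pair_pages` are assumptions for illustration, and this is not how hocr-pdf itself discovers files.

```python
import os


def pair_pages(filenames):
    """Pair JPEG and hOCR files that share the same base name."""
    images = {}
    hocrs = {}
    for name in filenames:
        stem, extension = os.path.splitext(name)
        extension = extension.lower()
        if extension in (".jpg", ".jpeg"):
            images[stem] = name
        elif extension == ".hocr":  # assumed hOCR file extension
            hocrs[stem] = name
    # Keep only pages for which both files exist, in sorted order.
    return [(images[stem], hocrs[stem]) for stem in sorted(images) if stem in hocrs]


print(pair_pages(["page002.jpg", "page001.hocr", "page001.jpg", "notes.txt"]))
```

An image without a matching hOCR file (page002.jpg above) is simply left out of the result, which makes missing pairs easy to spot.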

hocr-split

hocr-split file.html pattern

Split a multipage hOCR file into hOCR files containing one page each. The pattern should be something like "base-%03d.html".
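
A pattern such as "base-%03d.html" is a printf-style format string in which %03d is replaced by a zero-padded page number. The short example below shows the expansion in Python; whether hocr-split starts counting at 0 or 1 is not stated here, so the starting index is an assumption.

```python
pattern = "base-%03d.html"

# Expand the pattern for the first three pages (numbering from 1 is assumed).
names = [pattern % page_number for page_number in range(1, 4)]
print(names)
```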

hocr-wordfreq

hocr-wordfreq [-h] [-i] [-n MAX] [-s] [-y] [file.html]

Outputs a list of the most frequent words in an hOCR file with their number of occurrences. If called without any file, hocr-wordfreq reads hOCR data (for example from hocr-combine) from stdin.

By default, the first 10 words are shown, but any number can be requested with -n. Use -i to ignore case and -s to split on spaces only, which leaves punctuation attached to the words. The -y option tries to dehyphenate the text (rejoin words that were split with a hyphen at a line break) before analysis.
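
The options can be pictured with a small word counter. This is a sketch under assumptions, not the hocr-wordfreq implementation: the function name, the parameter names, and the exact tokenization rules (here, \w+ for the default mode and whitespace splitting for the -s mode) are chosen for illustration only.

```python
import re
from collections import Counter


def word_frequencies(lines, max_words=10, ignore_case=False,
                     spaces_only=False, dehyphenate=False):
    """Return the most frequent words as (word, count) pairs."""
    text = "\n".join(lines)
    if dehyphenate:
        # Rejoin words split by a hyphen at the end of a line.
        text = re.sub(r"-\n", "", text)
    if ignore_case:
        text = text.lower()
    if spaces_only:
        words = text.split()              # punctuation stays attached
    else:
        words = re.findall(r"\w+", text)  # punctuation is dropped
    return Counter(words).most_common(max_words)


lines = ["The cat, the dog", "the hyphen-", "ated word"]
print(word_frequencies(lines, max_words=3, ignore_case=True, dehyphenate=True))
```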

Unit tests

The unit tests are written using the tsht framework.

Running the full test suite:

./test/tsht

Running a single test:

./test/tsht <path-to/unit-test.tsht>

e.g.

./test/tsht test/hocr-pdf/test-hocr-pdf.tsht

Writing a test

Please see the documentation in the tsht repository and take a look at the existing unit tests.

  1. Create a new directory under ./test
  2. Copy any test assets (images, hOCR files...) to this directory
  3. Create a file <name-of-your-test>.tsht starting from this template:
#!/usr/bin/env tsht

# adjust to the number of your tests
plan 1

# write your tests here
exec_ok "hocr-foo" "-x" "foo"

# remove any temporary files
# rm some-generated-file
Comments
  • Release management

    For one, having users use specific versions makes debugging easier.

    The tools could be uploaded to PyPI, so users can install it with pip install hocr-tools, or included in distros like Debian.

    Possible course of action:

    • Release a 0.2 version soon (i.e. tag a git commit v0.2) to have a starting point
    • Consider reorganizing the module (issue #42)
    • Make the tools compatible with PyPI
    • Try to adhere to semantic versioning

    The CLI of the tools has not changed or at least not much over the last years. However, this could (and should) change in the future, possibly breaking backwards compatibility if it cannot be avoided.

    opened by kba 14
  • hocr-combine file counter

    Hi, I got the following error message: hocr-combine: error: argument files: can't open 'hocr_out/9341474.html': [Errno 24] Too many open files: 'hocr_out/9341474.html'

    I have 1308 hOCR files that are to be merged with hocr-combine. The file 9341474.html is the 1022nd hOCR file. As a workaround, I merged the first 1022 files and then added the rest.

    Since it is not unusual to have more than 1000 files, I would suggest raising the limit on the number of files.

    bug 
    opened by tboenig 11
  • Error while using hocr-pdf file

    When I run the command below, I get a character encoding error. Please help.

    hocr-pdf . > out.pdf
    Traceback (most recent call last):
      File "C:\Python36\Scripts\hocr-pdf.py", line 143, in <module>
        export_pdf(args.imgdir, 300)
      File "C:\Python36\Scripts\hocr-pdf.py", line 70, in export_pdf
        pdf.save()
      File "c:\python36\lib\site-packages\reportlab\pdfgen\canvas.py", line 1237, in save
        self._doc.SaveToFile(self._filename, self)
      File "c:\python36\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 224, in SaveToFile
        f.write(data)
      File "C:\Python36\Scripts\hocr-pdf.py", line 47, in write
        sys.stdout.write(data)
      File "c:\python36\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined>
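
    The traceback suggests that binary PDF data is being written to a text-mode stdout, which then tries to encode it with cp1252. A possible workaround, sketched here as an assumption rather than as the fix adopted by hocr-pdf, is to write to the underlying binary buffer instead. The function name `write_pdf_bytes` is hypothetical.

```python
import io
import sys


def write_pdf_bytes(data, stream=None):
    """Write raw bytes to a stream, bypassing any text encoding layer."""
    if stream is None:
        stream = sys.stdout
    # Text streams such as sys.stdout expose the binary layer as .buffer;
    # binary streams are used directly.
    binary = getattr(stream, "buffer", stream)
    binary.write(data)


# Demonstration with a cp1252 text wrapper around an in-memory buffer.
raw = io.BytesIO()
text_stream = io.TextIOWrapper(raw, encoding="cp1252")
write_pdf_bytes(b"%PDF-1.4 \x81\x8d\x8f", text_stream)
```

    Using the --savefile option described above avoids stdout entirely.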
    
    opened by shekarnode 11
  • hocr-check complains assert doc.xpath("//meta[@name='ocr-id']")!=[]

    Can be reproduced with both tesseract and gImageReader hOCR files. https://github.com/manisandro/gImageReader/issues/101

    Does the script end with this error or is it still checking the other issues?

    bug 
    opened by CharlesNepote 11
  • Convert Google Cloud Vision OCR output to hocr.

    I have a question.

    I am trying to use the Google Cloud Vision API for OCR.

    https://cloud.google.com/vision/

    The OCR output includes the positions of the text.

    I want to convert the Google OCR output to hOCR format. Do you have any ideas?

    question 
    opened by dinosauria123 9
  • Py3 prints

    The tsht tests work for me locally with Python 2 as well as Python 3, and Travis also agrees. However, let us check this carefully and test some more examples.

    opened by zuphilip 8
  • Switch to argparse module

    Everything works except passing input via stdin; see the second test in all environments: https://travis-ci.org/UB-Mannheim/hocr-tools/builds/166182215

    I don't know how to change that. Any ideas?

    opened by zuphilip 7
  • Japanese support, again

    Thank you for making hocr-pdf. I was able to convert many old Japanese scans to searchable PDFs.

    After a recent update of hocr-pdf, the Japanese text in the PDF file is completely broken. It looks like €•‚ƒ „... ( †‡ˆ‰ Š‹Œ•Ž• (the original Japanese text is 光学文字認識(こうがくもじにんしき)). Latin letters are not broken.

    The hOCR file was made by my gcv2hocr, and the Japanese characters in it are correctly readable.

    If you know something about this issue, please let me know. page001.jpg page001.hocr.txt out0.pdf

    opened by dinosauria123 6
  • AttributeError: 'NoneType' object has no attribute 'group' (negative bbox attr)

    	Message: Error from hocr-pdf.exe: [10920] Failed to execute script hocr-pdf
    	Traceback (most recent call last):
    	  File "hocr-pdf", line 163, in 
    	  File "hocr-pdf", line 69, in export_pdf
    	  File "hocr-pdf", line 81, in add_text_layer
    	AttributeError: 'NoneType' object has no attribute 'group'
    	StackTrace: coldfusion.runtime.CustomException: Error from hocr-pdf.exe: [10920] Failed to execute script hocr-pdf
    	Traceback (most recent call last):
    	  File "hocr-pdf", line 163, in 
    	  File "hocr-pdf", line 69, in export_pdf
    	  File "hocr-pdf", line 81, in add_text_layer
    

    Looks to have to do with this line not matching negative bounding boxes: https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf#L76

    I hit this issue on a 60 page PDF around page 14 or so. I don't know why I got negative bounding boxes, but they do occur and caused my code to fail. Updating the above-mentioned line to match negative numbers fixes this issue and allows my PDF to be created correctly.
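
    A pattern that tolerates negative coordinates could look like the sketch below. The exact expression used in hocr-pdf is not reproduced here; `BBOX_PATTERN` and `parse_bbox` are hypothetical names, and the optional minus sign is the only point being illustrated.

```python
import re

# Four whitespace-separated integers after "bbox", each optionally negative.
BBOX_PATTERN = re.compile(r"bbox((?:\s+-?\d+){4})")


def parse_bbox(title):
    """Return [x0, y0, x1, y1] from an hOCR title attribute, or None."""
    match = BBOX_PATTERN.search(title)
    if match is None:
        # Checking for None avoids the AttributeError on match.group().
        return None
    return [int(value) for value in match.group(1).split()]


print(parse_bbox("bbox -3 10 200 45; x_wconf 91"))
```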

    opened by skylord123 6
  • hocr-pdf: issue with search and copy/paste in macOS Preview.app

    Preview.app is the default PDF reader on macOS. When using hocr-pdf to generate a PDF file from an image and an hOCR file, the generated PDF works well for search and copy/paste in Acrobat, PDF.js, and others, but not in Preview. You cannot search in Preview, and although you can select text and copy/paste it into another document, the pasted text is just blank characters.

    Does anyone know of a specific reason for this?

    opened by joao 6
  • Extend hocr-pdf to work also with lines

    If there are no ocrx_word's present in the hocr output then we switch to use the ocr_line's instead. This is especially needed for the current output of ocropy.

    See also #106.

    enhancement 
    opened by zuphilip 6
  • Font "Invisible" is of dubious copyright

    Solved by #178.

    Here's the diff of the two:

    diff --git a/tmp/invisible.ttx b/Mienai/Mienai.ttx
    index a906b36..2c7fb14 100644
    --- a/tmp/invisible.ttx
    +++ b/Mienai/Mienai.ttx
    @@ -4,20 +4,20 @@
       <GlyphOrder>
         <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
         <GlyphID id="0" name=".notdef"/>
    -    <GlyphID id="1" name=".null"/>
    -    <GlyphID id="2" name="nonmarkingreturn"/>
    +    <GlyphID id="1" name="uni0000"/>
    +    <GlyphID id="2" name="glyph00002"/>
       </GlyphOrder>
     
       <head>
         <!-- Most of this table will be recalculated by the compiler -->
         <tableVersion value="1.0"/>
         <fontRevision value="1.0"/>
    -    <checkSumAdjustment value="0xeef53dd6"/>
    +    <checkSumAdjustment value="0xb140c804"/>
         <magicNumber value="0x5f0f3cf5"/>
         <flags value="00000000 00001011"/>
         <unitsPerEm value="2048"/>
    -    <created value="Sat Aug 18 06:01:18 2012"/>
    -    <modified value="Sat Aug 18 06:01:18 2012"/>
    +    <created value="Wed Aug 17 19:22:44 2022"/>
    +    <modified value="Wed Aug 17 19:41:54 2022"/>
         <xMin value="0"/>
         <yMin value="0"/>
         <xMax value="0"/>
    @@ -103,10 +103,10 @@
         <ulUnicodeRange2 value="01010000 00000000 01111000 11111011"/>
         <ulUnicodeRange3 value="00000000 00000000 00000000 00000000"/>
         <ulUnicodeRange4 value="00000000 00000000 00000000 00000000"/>
    -    <achVendID value="HL  "/>
    +    <achVendID value="MFEK"/>
         <fsSelection value="00000000 01000000"/>
    -    <usFirstCharIndex value="65535"/>
    -    <usLastCharIndex value="0"/>
    +    <usFirstCharIndex value="0"/>
    +    <usLastCharIndex value="65535"/>
         <sTypoAscender value="1491"/>
         <sTypoDescender value="-431"/>
         <sTypoLineGap value="307"/>
    @@ -118,23 +118,27 @@
     
       <hmtx>
         <mtx name=".notdef" width="2048" lsb="0"/>
    -    <mtx name=".null" width="2048" lsb="0"/>
    -    <mtx name="nonmarkingreturn" width="2048" lsb="0"/>
    +    <mtx name="glyph00002" width="2048" lsb="0"/>
    +    <mtx name="uni0000" width="2048" lsb="0"/>
       </hmtx>
     
       <cmap>
         <tableVersion version="0"/>
    -    <cmap_format_4 platformID="0" platEncID="3" language="0">
    -    </cmap_format_4>
    +    <cmap_format_13 platformID="0" platEncID="6" format="13" reserved="0" length="40" language="0" nGroups="2">
    +      <map code="0x0" name="uni0000"/><!-- ???? -->
    +      <map code="0x10ffffff" name="uni0000"/><!-- ???? -->
    +    </cmap_format_13>
         <cmap_format_0 platformID="1" platEncID="0" language="0">
    -      <map code="0x0" name=".null"/>
    -      <map code="0x8" name=".null"/>
    -      <map code="0x9" name="nonmarkingreturn"/>
    -      <map code="0xd" name="nonmarkingreturn"/>
    -      <map code="0x1d" name=".null"/>
    +      <map code="0x0" name="glyph00002"/>
    +      <map code="0x8" name="glyph00002"/>
    +      <map code="0x9" name="uni0000"/>
    +      <map code="0xd" name="uni0000"/>
    +      <map code="0x1d" name="glyph00002"/>
         </cmap_format_0>
    -    <cmap_format_4 platformID="3" platEncID="1" language="0">
    -    </cmap_format_4>
    +    <cmap_format_13 platformID="3" platEncID="10" format="13" reserved="0" length="40" language="0" nGroups="2">
    +      <map code="0x0" name="uni0000"/><!-- ???? -->
    +      <map code="0x10ffffff" name="uni0000"/><!-- ???? -->
    +    </cmap_format_13>
       </cmap>
     
       <loca>
    @@ -148,182 +152,58 @@
     
         <TTGlyph name=".notdef"/><!-- contains no outline data -->
     
    -    <TTGlyph name=".null"/><!-- contains no outline data -->
    +    <TTGlyph name="glyph00002"/><!-- contains no outline data -->
     
    -    <TTGlyph name="nonmarkingreturn"/><!-- contains no outline data -->
    +    <TTGlyph name="uni0000"/><!-- contains no outline data -->
     
       </glyf>
     
       <name>
    -    <namerecord nameID="0" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      Typeface © (your company). 2005. All Rights Reserved
    +    <namerecord nameID="0" platformID="0" platEncID="4" langID="0x409">
    +      見えない by Fredrick R. Brennan (20220817)
         </namerecord>
    -    <namerecord nameID="1" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      invisible
    +    <namerecord nameID="1" platformID="0" platEncID="4" langID="0x409">
    +      Mienai
         </namerecord>
    -    <namerecord nameID="2" platformID="1" platEncID="0" langID="0x0" unicode="True">
    +    <namerecord nameID="2" platformID="0" platEncID="4" langID="0x409">
           Regular
         </namerecord>
    -    <namerecord nameID="3" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      invisible:Version 1.00
    +    <namerecord nameID="3" platformID="0" platEncID="4" langID="0x409">
    +      Mienai:MFEK:20220817
         </namerecord>
    -    <namerecord nameID="4" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      invisible
    +    <namerecord nameID="4" platformID="0" platEncID="4" langID="0x409">
    +      Mienai
         </namerecord>
    -    <namerecord nameID="5" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      Version 1.00 September 13, 2005, initial release
    +    <namerecord nameID="5" platformID="0" platEncID="4" langID="0x409">
    +      Version 1.00;20220817 Fredrick R. Brennan CC0;made with MFEKmetadata and fontTools
         </namerecord>
    -    <namerecord nameID="6" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      invisible
    +    <namerecord nameID="6" platformID="0" platEncID="4" langID="0x409">
    +      Mienai
         </namerecord>
    -    <namerecord nameID="10" platformID="1" platEncID="0" langID="0x0" unicode="True">
    -      This font was created using Font Creator 5.0 from High-Logic.com
    +    <namerecord nameID="7" platformID="0" platEncID="4" langID="0x409">
    +      Not trademarked.
         </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x403">
    -      Normal
    +    <namerecord nameID="8" platformID="0" platEncID="4" langID="0x409">
    +      Modular Font Editor K Foundation, Inc.
         </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x405">
    -      obyčejné
    +    <namerecord nameID="9" platformID="0" platEncID="4" langID="0x409">
    +      Fredrick R. Brennan
         </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x406">
    -      normal
    +    <namerecord nameID="11" platformID="0" platEncID="4" langID="0x409">
    +      https://copypaste.wtf
         </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x407">
    -      Standard
    +    <namerecord nameID="13" platformID="0" platEncID="4" langID="0x409">
    +      This font is public domain, released under the terms of the Creative Commons Zero License. available at &lt;https://creativecommons.org/publicdomain/zero/1.0/legalcode&gt;.
         </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x408">
    -      Κανονικά
    -    </namerecord>
    -    <namerecord nameID="0" platformID="3" platEncID="1" langID="0x409">
    -      Typeface © (your company). 2005. All Rights Reserved
    -    </namerecord>
    -    <namerecord nameID="1" platformID="3" platEncID="1" langID="0x409">
    -      invisible
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x409">
    -      Regular
    -    </namerecord>
    -    <namerecord nameID="3" platformID="3" platEncID="1" langID="0x409">
    -      invisible:Version 1.00
    -    </namerecord>
    -    <namerecord nameID="4" platformID="3" platEncID="1" langID="0x409">
    -      invisible
    -    </namerecord>
    -    <namerecord nameID="5" platformID="3" platEncID="1" langID="0x409">
    -      Version 1.00 September 13, 2005, initial release
    -    </namerecord>
    -    <namerecord nameID="6" platformID="3" platEncID="1" langID="0x409">
    -      invisible
    -    </namerecord>
    -    <namerecord nameID="10" platformID="3" platEncID="1" langID="0x409">
    -      This font was created using Font Creator 5.0 from High-Logic.com
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x40a">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x40b">
    -      Normaali
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x40c">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x40e">
    -      Normál
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x410">
    -      Normale
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x413">
    -      Standaard
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x414">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x415">
    -      Normalny
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x416">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x419">
    -      Обычный
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x41b">
    -      Normálne
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x41d">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x41f">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x424">
    -      Navadno
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x42d">
    -      Arrunta
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x80a">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0x816">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0xc0a">
    -      Normal
    -    </namerecord>
    -    <namerecord nameID="2" platformID="3" platEncID="1" langID="0xc0c">
    -      Normal
    +    <namerecord nameID="14" platformID="0" platEncID="4" langID="0x409">
    +      https://creativecommons.org/publicdomain/zero/1.0/legalcode
         </namerecord>
       </name>
     
    -  <post>
    -    <formatType value="2.0"/>
    -    <italicAngle value="0.0"/>
    -    <underlinePosition value="-217"/>
    -    <underlineThickness value="150"/>
    -    <isFixedPitch value="0"/>
    -    <minMemType42 value="0"/>
    -    <maxMemType42 value="0"/>
    -    <minMemType1 value="0"/>
    -    <maxMemType1 value="0"/>
    -    <psNames>
    -      <!-- This file uses unique glyph names based on the information
    -           found in the 'post' table. Since these names might not be unique,
    -           we have to invent artificial names in case of clashes. In order to
    -           be able to retain the original information, we need a name to
    -           ps name mapping for those cases where they differ. That's what
    -           you see below.
    -            -->
    -    </psNames>
    -    <extraNames>
    -      <!-- following are the name that are not taken from the standard Mac glyph order -->
    -    </extraNames>
    -  </post>
    -
    -  <gasp>
    -    <gaspRange rangeMaxPPEM="65535" rangeGaspBehavior="2"/>
    -  </gasp>
    -
    -  <FFTM>
    -    <!-- FontForge's timestamp, font source creation and modification dates -->
    -    <version value="1"/>
    -    <FFTimeStamp value="Mon Sep 14 17:32:09 2009"/>
    -    <sourceCreated value="Tue Sep 13 14:30:08 2005"/>
    -    <sourceModified value="Sat Aug 18 06:01:05 2012"/>
    -  </FFTM>
    -
    -  <GDEF>
    -    <Version value="0x00010000"/>
    -    <GlyphClassDef>
    -      <ClassDef glyph=".null" class="1"/>
    -      <ClassDef glyph="nonmarkingreturn" class="1"/>
    -    </GlyphClassDef>
    -    <LigCaretList>
    -      <Coverage>
    -      </Coverage>
    -      <!-- LigGlyphCount=0 -->
    -    </LigCaretList>
    -  </GDEF>
    +  <vmtx>
    +    <mtx name=".notdef" height="2048" tsb="0"/>
    +    <mtx name="glyph00002" height="2048" tsb="0"/>
    +    <mtx name="uni0000" height="2048" tsb="0"/>
    +  </vmtx>
     
     </ttFont>
    
    opened by ctrlcctrlv 0
  • Replace copyright dubious "Invisible" font with MFEK/Mienai.ttf

    This is an important change I think.

    The existing font is of unclear copyright status and has many unnecessary tables (FFTM, GDEF).

    I made Mienai just for this project, I think it's better to have a font users can actually examine.

    Edit: Also my font includes a vertical metrics table so will work in vertical Japanese texts.

    opened by ctrlcctrlv 0
  • Bilingual Text Encoding is not Working for Kannada-English Output Hocr File

    I am facing issues with hocr-pdf conversion when English and Kannada text is encoded into the text layer of the PDF file.

    I have a image below in kannada language (https://drive.google.com/file/d/11P2XMFWjmc0S6rzfOX58UtZZJkG2StNI/view?usp=sharing)

    following is the corresponding output hocr of the file https://drive.google.com/file/d/1wm-40rCN_rSE4cqT499kZAjAs5y6A3xl/view?usp=sharing

    following is output of the gcv ocr for the particular file in JSON OCR Output in JSON

    The output of hocr-pdf conversion is as follows Hocr-PDF output

    As you can see, searching for English words highlights them, but for Kannada text the output file generated by hocr-pdf gives gibberish results.

    Any guidance in this regards is appreciated

    opened by vaibhavsanil 0
  • a proposition to help hocr-tools become ZE best

    It would simplify people's life A LOT, if you could write a version of hocr-pdf that does everything on its own: create the hOCR for all of a pdf's pages, merge them, then merge the resulting file with the pdf. and VOILÀ, no loss in the conversion, no mess, no fuss... Perhaps allowing for changing the engine too.

    question 
    opened by evanescente-ondine 1
  • hocr-pdf: change encoding from latin1 to utf-8

    Also what will happen if we go ahead and change the encoding from 'latin-1' to 'utf-8' would that help if we are dealing with lets say Arabic Typescript.

    Possibly, I have never used hocr-pdf with non-latin texts - what happens when you do?

    Originally posted by @UBISOFT-1 in https://github.com/ocropus/hocr-tools/issues/170#issuecomment-992315882

    opened by kba 3
  • decodestring() deprecated in hocr-pdf, use decodebytes()

    /home/muneeb/.local/bin/hocr-pdf:134: DeprecationWarning: decodestring() is a deprecated alias since Python 3.1, use decodebytes()
      uncompressed = bytearray(zlib.decompress(base64.decodestring(font)))
    

    In the file we need to use the decodebytes() function instead.
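
    Following the DeprecationWarning quoted above, the call can be switched to base64.decodebytes(), which is the non-deprecated name in Python 3. The sketch below mirrors the line from the warning; the helper name `unpack_font` and the sample payload are made up for illustration.

```python
import base64
import zlib


def unpack_font(encoded):
    """Decode a base64-encoded, zlib-compressed font blob."""
    return bytearray(zlib.decompress(base64.decodebytes(encoded)))


# Round trip with a small stand-in payload instead of a real font.
payload = base64.encodebytes(zlib.compress(b"font bytes"))
print(unpack_font(payload))
```

    Note that base64.decodebytes() does not exist in Python 2, so code that must run on both versions would need a fallback.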

    opened by UBISOFT-1 3
Releases(v1.3.0)
  • v1.3.0(Mar 2, 2019)

    • Add new script hocr-cut for cutting a page #108
    • Add --savefile argument to hocr-pdf #125 #126
    • Reformat code according to PEP8 and several other cleanup and documentation work

    See details https://github.com/tmbdev/hocr-tools/compare/v1.2.0...v1.3.0

  • v1.2.0(Mar 29, 2017)

    • hocr-wordfreq: word frequency counter #93 #96 #98 #99 #100 #104
    • Switch to argparse module #82 #97
    • Delete numpy dependency, rewrite edit dist algo #88
    • Extend hocr-pdf to work also with lines #107

    See details: https://github.com/tmbdev/hocr-tools/compare/v1.1.1...v1.2.0

  • v1.1.1(Oct 23, 2016)

    • Fix hocr-combine: Delete the function call to importNode, which does not exist in etree and no longer seems necessary.
    • Fix hocr-eval: The function get_text in this file failed in Python 3; it now uses the same code for this function as the other tools.
    • Fix hocr-lines: It was outputting byte strings in Python 3.
    • Add tests for hocr-combine, hocr-eval, hocr-eval-geom, hocr-lines

    See details: https://github.com/tmbdev/hocr-tools/compare/v1.1.0...v1.1.1

  • v1.1.0(Sep 27, 2016)

    The hocr-tools are now compatible with Python 2 as well as Python 3!

    • Change print statements according to Python 3 and use from __future__ import print_function
    • Fix hocr-eval-lines, add tests
    • Start code cleaning according to PEP 8 coding styles
    • Add Dockerfile for consistent local testing
    • Load from filename not stream

    See details: https://github.com/tmbdev/hocr-tools/compare/v1.0.1...v1.1.0

  • v1.0.1(Sep 20, 2016)

    Fixed bugs

    • hocr-split: Duplicate content in <html> https://github.com/tmbdev/hocr-tools/issues/58
    • hocr-pdf: ocr_line does not have to be a span (e.g. also a div is possible) https://github.com/tmbdev/hocr-tools/pull/57
    • hocr-check: Fix containment checks and metadata checks, add tests https://github.com/tmbdev/hocr-tools/pull/52 https://github.com/tmbdev/hocr-tools/pull/61 https://github.com/tmbdev/hocr-tools/pull/62

    Ongoing work

    • Check handling of non ASCII characters in hOCR files https://github.com/tmbdev/hocr-tools/issues/53
    • Make hocr-tools fit for Python 3 https://github.com/tmbdev/hocr-tools/issues/37

    See details: https://github.com/tmbdev/hocr-tools/compare/v1.0.0...v1.0.1

  • v1.0.0(Sep 1, 2016)

    We are now starting to publish releases on GitHub and PyPI, and v1.0.0 marks the beginning of this activity. However, we have also retrospectively tagged some important older points in the history with version numbers starting with 0.

  • v0.3.2(Sep 1, 2016)

  • v0.3.1(Sep 1, 2016)

  • v0.3.0(Sep 1, 2016)

  • v0.2.2(Sep 1, 2016)

  • v0.2.1(Sep 1, 2016)

  • v0.2.0(Sep 1, 2016)

  • v0.1.0(Sep 1, 2016)

Owner
OCRopus
The OCRopus OCR System and Related Software

MSU Video Group 27 Dec 17, 2022
With the virtual keyboard, you can write on the real time images by combining the thumb and index fingers on the letter you want.

Virtual Keyboard With the virtual keyboard, you can write on the real time images by combining the thumb and index fingers on the letter you want. At

Güldeniz Bektaş 5 Jan 23, 2022
Slice a single image into multiple pieces and create a dataset from them

OpenCV Image to Dataset Converter Slice a single image of Persian digits into mu

Meysam Parvizi 14 Dec 29, 2022
Text-to-Image generation

Generate vivid Images for Any (Chinese) text CogView is a pretrained (4B-param) transformer for text-to-image generation in general domain. Read our p

THUDM 1.3k Jan 05, 2023
Um simples projeto para fazer o reconhecimento do captcha usado pelo jogo bombcrypto

CaptchaSolver - LEIA ISSO 😓 Para iniciar o codigo: pip install -r requirements.txt python captcha_solver.py Se você deseja pegar ver o resultado das

Kawanderson 50 Mar 21, 2022
Scene text recognition

AttentionOCR for Arbitrary-Shaped Scene Text Recognition Introduction This is the ranked No.1 tensorflow based scene text spotting algorithm on ICDAR2

777 Jan 09, 2023
An Implementation of the FOTS: Fast Oriented Text Spotting with a Unified Network

FOTS: Fast Oriented Text Spotting with a Unified Network Introduction This is a pytorch re-implementation of FOTS: Fast Oriented Text Spotting with a

GeorgeJoe 171 Aug 04, 2022
An organized collection of tutorials and projects created for aspriring computer vision students.

A repository created with the purpose of teaching students in BME lab 308A- Hanoi University of Science and Technology

Givralnguyen 5 Nov 24, 2021
Code release for Hu et al., Learning to Segment Every Thing. in CVPR, 2018.

Learning to Segment Every Thing This repository contains the code for the following paper: R. Hu, P. Dollár, K. He, T. Darrell, R. Girshick, Learning

Ronghang Hu 417 Oct 03, 2022