Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Related tags

Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that would do a first pass on OCR on these docs. That code is in the pdf_to_image.py script. I'd welcome improvement to the code, especially in image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the image via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with either of these approaches.

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow on analysis.

And, other/better options exist. For a comprehensive, contained analysis, these other options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Related tags

Overview

Quick and Dirty OCR of Facebook Papers

Other (Better) Options

Want to help?

Owner

Bill Fitzgerald

Indonesian ID Card OCR using tesseract OCR

第一届西安交通大学人工智能实践大赛（2018AI实践大赛--图片文字识别）第一名；仅采用densenet识别图中文字

轻量级公式 OCR 小工具：一键识别各类公式图片，并转换为 LaTeX 格式

Official code for :rocket: Unsupervised Change Detection of Extreme Events Using ML On-Board :rocket:

Code for the paper "DewarpNet: Single-Image Document Unwarping With Stacked 3D and 2D Regression Networks" (ICCV '19)

Generate text images for training deep learning ocr model

TensorFlow Implementation of FOTS, Fast Oriented Text Spotting with a Unified Network.

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Single Shot Text Detector with Regional Attention

Captcha Recognition

Slice a single image into multiple pieces and create a dataset from them

Open Source Differentiable Computer Vision Library for PyTorch

AdvancedEAST is an algorithm used for Scene image text detect, which is primarily based on EAST, and the significant improvement was also made, which make long text predictions more accurate.https://github.com/huoyijie/raspberrypi-car

Recognizing cropped text in natural images.

OCR, Object Detection, Number Plate, Real Time

Vietnamese Language Detection and Recognition

OCR system for Arabic language that converts images of typed text to machine-encoded text.

Dataset and Code for ICCV 2021 paper "Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme"

Augmenting Anchors by the Detector Itself