Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal, transformer-based architecture for Visual Document Understanding (VDU)

Last update: Jan 06, 2023

Overview

DocFormer - PyTorch

This repository is a PyTorch implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal, transformer-based architecture for Visual Document Understanding (VDU). 📄 📄 📄

DocFormer is a multi-modal, transformer-based architecture for Visual Document Understanding (VDU). It is pre-trained in an unsupervised fashion on carefully designed tasks that encourage multi-modal interaction. DocFormer combines text, vision, and spatial features using a novel multi-modal self-attention layer. It also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text tokens with visual tokens and vice versa. DocFormer is evaluated on four datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models four times its size (in number of parameters).
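The idea of sharing spatial information across modalities can be illustrated with a toy, pure-Python sketch. It is a simplification and not the paper's exact formulation or this repository's code: it assumes a single attention head, omits relative position terms, and reuses the same toy spatial features for both modalities. All function and variable names (`attention_weights`, `w_q_spatial`, and so on) are hypothetical.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def project(vector, matrix):
    """Multiply a row vector by a (len(vector) x cols) matrix."""
    cols = len(matrix[0])
    return [sum(v * matrix[r][c] for r, v in enumerate(vector)) for c in range(cols)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, spatial, w_q, w_k, w_q_spatial, w_k_spatial):
    """Attention weights from a content score plus a spatial score."""
    q = [project(x, w_q) for x in features]
    k = [project(x, w_k) for x in features]
    q_s = [project(s, w_q_spatial) for s in spatial]
    k_s = [project(s, w_k_spatial) for s in spatial]
    scale = math.sqrt(len(q[0]))
    rows = []
    for i in range(len(features)):
        scores = [(dot(q[i], k[j]) + dot(q_s[i], k_s[j])) / scale
                  for j in range(len(features))]
        rows.append(softmax(scores))
    return rows


rng = random.Random(0)
dim, num_tokens = 4, 3

# Modality-specific content projections (hypothetical names).
w_q_text, w_k_text = random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)
w_q_vis, w_k_vis = random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)
# Spatial projections shared by both modalities (the assumption being illustrated).
w_q_spatial, w_k_spatial = random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)

text = random_matrix(num_tokens, dim, rng)     # toy text features
visual = random_matrix(num_tokens, dim, rng)   # toy visual features
spatial = random_matrix(num_tokens, dim, rng)  # toy spatial features, reused for both

text_attn = attention_weights(text, spatial, w_q_text, w_k_text, w_q_spatial, w_k_spatial)
visual_attn = attention_weights(visual, spatial, w_q_vis, w_k_vis, w_q_spatial, w_k_spatial)
```

Because both calls receive the same `w_q_spatial` and `w_k_spatial`, the spatial contribution to the scores is identical for text and visual tokens, while the content projections stay separate per modality.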

The authors did not release an official implementation.

Install

There may be issues importing pytesseract. To avoid them, first install the Python package and the Tesseract OCR engine:

pip install pytesseract
sudo apt install tesseract-ocr

Then install DocFormer:

pip install git+https://github.com/shabie/docformer
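To confirm that both OCR dependencies are visible before running the model, a small check like the following may help. It is a minimal sketch using only the standard library; the function name `check_ocr_dependencies` is hypothetical, and the check assumes the OCR engine is exposed as a `tesseract` executable on `PATH`.

```python
import importlib.util
import shutil


def check_ocr_dependencies():
    """Report whether the pytesseract package and the tesseract binary can be found."""
    return {
        # True if `import pytesseract` would find an installed package
        "pytesseract_package": importlib.util.find_spec("pytesseract") is not None,
        # True if an executable named `tesseract` is on PATH
        "tesseract_binary": shutil.which("tesseract") is not None,
    }


status = check_ocr_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, rerunning the corresponding install command above is the first thing to try.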

Usage

from docformer import modeling, dataset
from transformers import BertTokenizerFast


config = {
  "coordinate_size": 96,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "image_feature_pool_shape": [7, 7, 256],
  "intermediate_ff_size_factor": 4,
  "max_2d_position_embeddings": 1000,
  "max_position_embeddings": 512,
  "max_relative_positions": 8,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "shape_size": 96,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12,
}

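# Path to a document image (placeholder; replace with a real file)
fp = "filepath/to/the/image.tif"

# Build the model inputs from the image using the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer)

# Extract visual (v_bar) and text (t_bar) features with their spatial
# counterparts (v_bar_s, t_bar_s), then pass them through the encoder
feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)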
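Some of the config values above are related to one another. The sketch below shows the sizes they would imply under standard transformer conventions. It assumes that `hidden_size` is split evenly across `num_attention_heads` and that `intermediate_ff_size_factor` multiplies `hidden_size` to give the feed-forward width; neither assumption has been checked against this repository's code, and `derived_sizes` is a hypothetical helper.

```python
def derived_sizes(config):
    """Compute per-head and feed-forward sizes implied by the config (assumed conventions)."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "head_dim": hidden // heads,
        "feed_forward_size": hidden * config["intermediate_ff_size_factor"],
    }


# Only the keys needed for this check, copied from the usage example
sizes = derived_sizes({
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_ff_size_factor": 4,
})
print(sizes)  # {'head_dim': 64, 'feed_forward_size': 3072}
```

A check like this can catch an inconsistent configuration before any weights are allocated.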

License

MIT

Maintainers

Contribute

Citations

@InProceedings{Appalaraju_2021_ICCV,
    author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
    title     = {DocFormer: End-to-End Transformer for Document Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {993-1003}
}
