Simple and understandable swin-transformer OCR project

Last update: Dec 31, 2022

swin-transformer-ocr

Overview

Simple and understandable swin-transformer OCR project. The model in this repository relies heavily on high-level open-source projects such as timm and x_transformers. The training procedure is also easy to follow thanks to the readability of pytorch-lightning.

The model encodes the input image into a context vector using the 'shifted-window' mechanism of the swin-transformer, then decodes that vector with a standard auto-regressive transformer.

If you are not familiar with the structure of a transformer OCR model, transformer-ocr may be easier to understand because it uses a traditional convolutional network (ResNet-v2) for the encoder.

Performance

On a private Korean handwritten text dataset, the exact-match accuracy is 97.6%.
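For readers unfamiliar with the metric, here is a minimal sketch of how exact-match accuracy is commonly computed: a prediction counts as correct only if the whole predicted string equals the label. The function name `exact_match_accuracy` is hypothetical, and the sketch assumes plain string equality with no normalization; the repository's own evaluation code may differ.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly.

    Assumption: strings are compared as-is (no case folding or whitespace
    stripping); the project's own metric may normalize differently.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # A sample is correct only when the entire string matches.
    correct = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return correct / len(targets)
```

Under this definition a single wrong character makes the whole sample count as an error, so the metric is stricter than character-level accuracy.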

Data

./dataset/
├─ preprocessed_image/
│  ├─ cropped_image_0.jpg
│  ├─ cropped_image_1.jpg
│  ├─ ...
├─ train.txt
└─ val.txt

# in train.txt
cropped_image_0.jpg\tHello World.
cropped_image_1.jpg\tvision-transformer-ocr
...

Preprocess the data first. Crop each image to a word-level or sentence-level region, and put all the cropped images in a single directory. Provide the ground truth in a txt file: each line contains the image file name and its label, separated by \t.
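The label-file format described above can be read with a few lines of Python. This is only an illustrative sketch, not the loader used in this repository: the function name `load_labels` is hypothetical, and it assumes the `\t` shown in the example stands for a real tab character and that the file is UTF-8 encoded.

```python
from pathlib import Path


def load_labels(label_file, image_dir):
    """Return a list of (image_path, text) pairs from a tab-separated file.

    Assumptions: one sample per line, a real tab between file name and
    label, UTF-8 encoding. Blank lines are skipped.
    """
    samples = []
    for line in Path(label_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split on the first tab only, so labels may contain tabs themselves.
        name, text = line.split("\t", 1)
        samples.append((Path(image_dir) / name, text))
    return samples
```

For example, `load_labels("dataset/train.txt", "dataset/preprocessed_image")` would yield pairs such as `(Path("dataset/preprocessed_image/cropped_image_0.jpg"), "Hello World.")`.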

Configuration

The settings/ directory contains default.yaml, where you can set almost every hyper-parameter. Copy it and edit the copy for each experiment version. I recommend running with the default settings first, before changing anything.

Train

python run.py --version 0 --setting settings/default.yaml --num_workers 16 --batch_size 128

You can check your training log with tensorboard.

tensorboard --logdir tb_logs --bind_all

Predict

When your model finishes training, you can use your model for prediction.

python predict.py --setting <your_setting.yaml> --target <image_or_directory> --tokenizer <your_tokenizer_pkl> --checkpoint <saved_checkpoint>

Exporting to ONNX

You can export your model to ONNX format. This is straightforward with pytorch-lightning; see the pytorch-lightning documentation on ONNX export.

Citations

@misc{liu-2021,
    title         = {Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
    author        = {Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng Zhang and Stephen Lin and Baining Guo},
    year          = {2021},
    eprint        = {2103.14030},
    archivePrefix = {arXiv}
}

Owner

Ha YongWook
