A PyTorch implementation of "From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network" (ICCV2021)

Overview

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

The official code of VisionLAN (ICCV2021). VisionLAN successfully achieves the transformation from two-step to one-step recognition (from Two to One), which adaptively considers both visual and linguistic information in a unified structure without the need of extra language model.

ToDo List

  • Release code
  • Document for Installation
  • Trained models
  • Document for testing and training
  • Evaluation
  • re-organize and clean the parameters

Updates

2021/10/9 We upload the code, datasets, and trained models.
2021/10/9 Fix a bug in cfs_LF_1.py.

Requirements

Python2.7
Colour
LMDB
Pillow
opencv-python
torch==1.3.0
torchvision==0.4.1
editdistance
matplotlib==2.2.5

Step-by-step install

pip install -r requirements.txt

Data preparing

Training sets

SynthText We use the tool to crop images from original SynthText dataset, and convert images into LMDB dataset.

MJSynth We use tool to convert images into LMDB dataset. (We only use training set in this implementation)

We have upload these LMDB datasets in RuiKe (password:x6si).

Testing sets

Evaluation datasets, LMDB datasets can be downloaded from BaiduYun (password:fjyy) or RuiKe

IIIT5K Words (IIIT5K)
ICDAR 2013 (IC13)
Street View Text (SVT)
ICDAR 2015 (IC15)
Street View Text-Perspective (SVTP)
CUTE80 (CUTE)

The structure of data directory is

datasets
├── evaluation
│   ├── Sumof6benchmarks
│   ├── CUTE
│   ├── IC13
│   ├── IC15
│   ├── IIIT5K
│   ├── SVT
│   └── SVTP
└── train
    ├── MJSynth
    └── SynthText

Evaluation

Results on 6 benchmarks

Methods IIIT5K IC13 SVT IC15 SVTP CUTE
Paper 95.8 95.7 91.7 83.7 86.0 88.5
This implementation 95.9 96.3 90.7 84.1 85.3 88.9

Download our trained model in BaiduYun (password: e3kj) or RuiKe (password: cxqi), and put it in output/LA/final.pth.

CUDA_VISIBLE_DEVICES=0 python eval.py

Visualize character-wise mask map

Examples of the visualization of mask_c: image

   CUDA_VISIBLE_DEVICES=0 python visualize.py

You can modify the 'mask_id' in cfgs/cfgs_visualize to change the mask position for visualization.

Results on OST datasets

Occlusion Scene Text (OST) dataset is proposed to reflect the ability for recognizing cases with missing visual cues. This dataset is collected from 6 benchmarks (IC13, IC15, IIIT5K, SVT, SVTP and CT) containing 4832 images. Images in this dataset are manually occluded in weak or heavy degree. Weak and heavy degrees mean that we occlude the character using one or two lines. For each image, we randomly choose one degree to only cover one character.

Examples of images in OST dataset: image image

Methods Average Weak Heavy
Paper 60.3 70.3 50.3
This implementation 60.3 70.8 49.8

The LMDB dataset is available in BaiduYun (password:yrrj) or RuiKe (password: vmzr)

Training

4 2080Ti GPUs are used in this implementation.

Language-free (LF) process

Step 1: We first train the vision model without MLM. (Our trained LF_1 model(BaiduYun) (password:avs5) or RuiKe (password:qwzn))

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_1.py

Step 2: We finetune the MLM with vision model (Our trained LF_2 model(BaiduYun) (password:04jg) or RuiKe (password:v67q))

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_2.py

Language-aware (LA) process

Use the mask map to guide the linguistic learning in the vision model.

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LA.py

Tip: In LA process, model with loss (Loss VisionLAN) higher than 0.3 and the training accuracy (Accuracy) lower than 91.0 after the first 200 training iters obains better performance.

Improvement

  1. Mask id randomly generated according to the max length can not well adapt to the occlusion of long text. Thus, evenly sampled mask id can further improve the performance of MLM.
  2. Heavier vision model is able to capture more robust linguistic information in our later experiments.

Citation

If you find our method useful for your reserach, please cite

 @article{wang2021two,
  title={From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network},
  author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
  journal={ICCV},
  year={2021}
}

Feedback

Suggestions and discussions are greatly welcome. Please contact the authors by sending email to [email protected]

SegTransVAE: Hybrid CNN - Transformer with Regularization for medical image segmentation

SegTransVAE: Hybrid CNN - Transformer with Regularization for medical image segmentation This repo is the official implementation for SegTransVAE. Seg

Nguyen Truong Hai 4 Aug 04, 2022
EXplainable Artificial Intelligence (XAI)

EXplainable Artificial Intelligence (XAI) This repository includes the codes for different projects on eXplainable Artificial Intelligence (XAI) by th

4 Nov 28, 2022
End-to-end Temporal Action Detection with Transformer. [Under review]

TadTR: End-to-end Temporal Action Detection with Transformer By Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, Xiang Bai. This repo holds the c

Xiaolong Liu 105 Dec 25, 2022
LinkNet - This repository contains our Torch7 implementation of the network developed by us at e-Lab.

LinkNet This repository contains our Torch7 implementation of the network developed by us at e-Lab. You can go to our blogpost or read the article Lin

e-Lab 158 Nov 11, 2022
Repository for the COLING 2020 paper "Explainable Automated Fact-Checking: A Survey."

Explainable Fact Checking: A Survey This repository and the accompanying webpage contain resources for the paper "Explainable Fact Checking: A Survey"

Neema Kotonya 42 Nov 17, 2022
Offcial implementation of "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction, ICCV-2021".

HF2-VAD Offcial implementation of "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Predictio

76 Dec 21, 2022
A simple interface for editing natural photos with generative neural networks.

Neural Photo Editor A simple interface for editing natural photos with generative neural networks. This repository contains code for the paper "Neural

Andy Brock 2.1k Dec 29, 2022
A Structured Self-attentive Sentence Embedding

Structured Self-attentive sentence embeddings Implementation for the paper A Structured Self-Attentive Sentence Embedding, which was published in ICLR

Kaushal Shetty 488 Nov 28, 2022
[ICCV 2021] Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain

Amplitude-Phase Recombination (ICCV'21) Official PyTorch implementation of "Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neur

Guangyao Chen 53 Oct 05, 2022
COVID-Net Open Source Initiative

The COVID-Net models provided here are intended to be used as reference models that can be built upon and enhanced as new data becomes available

Linda Wang 1.1k Dec 26, 2022
Fast methods to work with hydro- and topography data in pure Python.

PyFlwDir Intro PyFlwDir contains a series of methods to work with gridded DEM and flow direction datasets, which are key to many workflows in many ear

Deltares 27 Dec 07, 2022
Weakly Supervised Segmentation by Tensorflow.

Weakly Supervised Segmentation by Tensorflow. Implements semantic segmentation in Simple Does It: Weakly Supervised Instance and Semantic Segmentation, by Khoreva et al. (CVPR 2017).

CHENG-YOU LU 52 Dec 27, 2022
Code for the paper Task Agnostic Morphology Evolution.

Task-Agnostic Morphology Optimization This repository contains code for the paper Task-Agnostic Morphology Evolution by Donald (Joey) Hejna, Pieter Ab

Joey Hejna 18 Aug 04, 2022
AI grand challenge 2020 Repo (Speech Recognition Track)

KorBERT를 활용한 한국어 텍스트 기반 위협 상황인지(2020 인공지능 그랜드 챌린지) 본 프로젝트는 ETRI에서 제공된 한국어 korBERT 모델을 활용하여 폭력 기반 한국어 텍스트를 분류하는 다양한 분류 모델들을 제공합니다. 본 개발자들이 참여한 2020 인공지

Young-Seok Choi 23 Jan 25, 2022
The code for our paper Semi-Supervised Learning with Multi-Head Co-Training

Semi-Supervised Learning with Multi-Head Co-Training (PyTorch) Abstract Co-training, extended from self-training, is one of the frameworks for semi-su

cmc 6 Dec 04, 2022
Keras Image Embeddings using Contrastive Loss

Image to Embedding projection in vector space. Implementation in keras and tensorflow of batch all triplet loss for one-shot/few-shot learning.

Shravan Anand K 5 Mar 21, 2022
Distributing reference energies for SMIRNOFF implementations

Warning: This code is currently experimental and under active development. Is it not yet suitable for distribution or use as reference implementation.

Open Force Field Initiative 1 Dec 07, 2021
Accommodating supervised learning algorithms for the historical prices of the world's favorite cryptocurrency and boosting it through LightGBM.

Accommodating supervised learning algorithms for the historical prices of the world's favorite cryptocurrency and boosting it through LightGBM.

1 Nov 27, 2021
Hyperopt for solving CIFAR-100 with a convolutional neural network (CNN) built with Keras and TensorFlow, GPU backend

Hyperopt for solving CIFAR-100 with a convolutional neural network (CNN) built with Keras and TensorFlow, GPU backend This project acts as both a tuto

Guillaume Chevalier 103 Jul 22, 2022
Kaggle Feedback Prize - Evaluating Student Writing 15th solution

Kaggle Feedback Prize - Evaluating Student Writing 15th solution First of all, I would like to thank the excellent notebooks and discussions from http

Lingyuan Zhang 6 Mar 24, 2022