Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
The official code of ABINet (CVPR 2021, Oral).
ABINet uses a vision model and an explicit language model to recognize text in the wild, which are trained in end-to-end way. The language model (BCN) achieves bidirectional language representation in simulating cloze test, additionally utilizing iterative correction strategy.
Runtime Environment
-
We provide a pre-built docker image using the Dockerfile from
docker/Dockerfile -
Running in Docker
$ [email protected]:FangShancheng/ABINet.git $ docker run --gpus all --rm -ti --ipc=host -v $(pwd)/ABINet:/app fangshancheng/fastai:torch1.1 /bin/bash -
(Untested) Or using the dependencies
pip install -r requirements.txt
Datasets
-
Training datasets
- MJSynth (MJ):
- Use
tools/create_lmdb_dataset.pyto convert images into LMDB dataset - LMDB dataset BaiduNetdisk(passwd:n23k)
- Use
- SynthText (ST):
- Use
tools/crop_by_word_bb.pyto crop images from original SynthText dataset, and convert images into LMDB dataset bytools/create_lmdb_dataset.py - LMDB dataset BaiduNetdisk(passwd:n23k)
- Use
- WikiText103, which is only used for pre-trainig language models:
- Use
notebooks/prepare_wikitext103.ipynbto convert text into CSV format. - CSV dataset BaiduNetdisk(passwd:dk01)
- Use
- MJSynth (MJ):
-
Evaluation datasets, LMDB datasets can be downloaded from BaiduNetdisk(passwd:1dbv), GoogleDrive.
- ICDAR 2013 (IC13)
- ICDAR 2015 (IC15)
- IIIT5K Words (IIIT)
- Street View Text (SVT)
- Street View Text-Perspective (SVTP)
- CUTE80 (CUTE)
-
The structure of
datadirectory isdata ├── charset_36.txt ├── evaluation │ ├── CUTE80 │ ├── IC13_857 │ ├── IC15_1811 │ ├── IIIT5k_3000 │ ├── SVT │ └── SVTP ├── training │ ├── MJ │ │ ├── MJ_test │ │ ├── MJ_train │ │ └── MJ_valid │ └── ST ├── WikiText-103.csv └── WikiText-103_eval_d1.csv
Pretrained Models
Get the pretrained models from BaiduNetdisk(passwd:kwck), GoogleDrive. Performances of the pretrained models are summaried as follows:
| Model | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
|---|---|---|---|---|---|---|---|
| ABINet-SV | 97.1 | 92.7 | 95.2 | 84.0 | 86.7 | 88.5 | 91.4 |
| ABINet-LV | 97.0 | 93.4 | 96.4 | 85.9 | 89.5 | 89.2 | 92.7 |
Training
- Pre-train vision model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_vision_model.yaml - Pre-train language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml - Train ABINet
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_abinet.yaml
Note:
- You can set the
checkpointpath for vision and language models separately for specific pretrained model, or set toNoneto train from scratch
Evaluation
CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only
Additional flags:
--checkpoint /path/to/checkpointset the path of evaluation model--test_root /path/to/datasetset the path of evaluation dataset--model_eval [alignment|vision]which sub-model to evaluate--image_onlydisable dumping visualization of attention masks
Visualization
Successful and failure cases on low-quality images:
Citation
If you find our method useful for your reserach, please cite
@article{fang2021read,
title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2021}
}
License
This project is only free for academic research purposes, licensed under the 2-clause BSD License - see the LICENSE file for details.
Feel free to contact [email protected] if you have any questions.

