Scene-Text-Detection-and-Recognition (Pytorch)

1. Proposed Method

The models

Our model comprises two parts: scene text detection and scene text recognition. The two models are described as follows:

  • Scene Text Detection
    We employ YoloV5 [1] to detect ROIs (regions of interest) in an image and Resnet50 [2] to implement the ROI transformation algorithm, which moves the coordinates detected by YoloV5 to a location that fits the text more tightly. YoloV5 detects every ROI that might contain a string, while the ROI transformation makes each bbox fit the string region better. The visualization result is illustrated below: the dark green bbox is the ROI detected by YoloV5, and the red bbox is the ROI after ROI transformation.

  • Scene Text Recognition
    We employ ViT [3] to recognize the string in each bbox detected by YoloV5, since our task is to recognize whole strings rather than single characters. Transformer-based models achieve state-of-the-art performance in Natural Language Processing (NLP), and the attention mechanism lets the model attend to the characters that need to be output at each step. The model architecture is demonstrated below.
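
The two-stage flow described above can be sketched as follows. This is a minimal illustration rather than the repository's code: `detect`, `refine`, and `recognize` are hypothetical callables standing in for YoloV5, the ROI transformation network, and ViT, and the box formats are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed formats (not taken from the repository):
Box = Tuple[int, int, int, int]        # axis-aligned (x_min, y_min, x_max, y_max)
Quad = Tuple[Tuple[int, int], ...]     # four (x, y) corners, clockwise from top-left


def read_scene_text(
    image: object,
    detect: Callable[[object], Sequence[Box]],
    refine: Callable[[object, Box], Quad],
    recognize: Callable[[object, Quad], str],
) -> List[Tuple[Quad, str]]:
    """Run detection, ROI refinement, and recognition in sequence."""
    results = []
    for box in detect(image):           # stage 1: coarse ROIs (YoloV5 in this README)
        quad = refine(image, box)       # stage 2: tighten the box around the text
        text = recognize(image, quad)   # stage 3: read the string inside the box
        results.append((quad, text))
    return results


def box_to_quad(image: object, box: Box) -> Quad:
    """Trivial stand-in for `refine`: return the box corners unchanged."""
    x0, y0, x1, y1 = box
    return ((x0, y0), (x1, y0), (x1, y1), (x0, y1))
```

Passing the models in as callables keeps the sketch independent of any particular framework; a real pipeline would wrap the trained networks behind these three functions.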

The whole training process is shown in the figure below.

Data augmentation

  • Random Scale Resize
    We found that the images in the public dataset vary in size, and a small image that is resized to a large one loses most of its image features. To make the model robust to this, we apply the random scale resize algorithm in the training phase to obtain low-resolution images from high-resolution ones. The visualization results are demonstrated as follows.
    Figure panels: original image; 72x72 --> 224x224; 96x96 --> 224x224; 121x121 --> 224x224; 146x146 --> 224x224; 196x196 --> 224x224
  • ColorJitter
    In the training phase, the model's input is an RGB image. To make the model more robust, we apply the ColorJitter algorithm so that the model sees images with different brightness, contrast, saturation, and hue values. This method is also widely used in image classification. The visualization results are demonstrated as follows.
    Figure panels: input image; brightness=0.5; contrast=0.5; saturation=0.5; hue=0.5
  • Random Rotation
    After inspecting the training data, we found that most of the training images are square-shaped (see the original image), while some of the testing images are slightly skewed. Therefore, we apply the random rotation algorithm to help the model generalize better. The visualization results are demonstrated as follows.
    Figure panels: original image; random rotation; random horizontal flip; both
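
The random scale resize step can be sketched in plain Python as below. This is only an illustration under stated assumptions: the candidate sizes are the ones named in the figure panels, sampling from a fixed set is an assumption, and the nearest-neighbour `resize_nearest` helper is a hypothetical stand-in for whatever image library the training code actually uses.

```python
import random
from typing import List, Optional, Sequence

Image = List[List[int]]  # toy grayscale image stored as rows of pixel values


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize; a stand-in for a real library resize."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def random_scale_resize(
    img: Image,
    low_sizes: Sequence[int] = (72, 96, 121, 146, 196),  # sizes from the figure panels
    out_size: int = 224,
    rng: Optional[random.Random] = None,
) -> Image:
    """Downscale to a randomly chosen low resolution, then upscale to out_size."""
    chooser = rng if rng is not None else random
    side = chooser.choice(list(low_sizes))
    low_res = resize_nearest(img, side, side)            # simulate a small source image
    return resize_nearest(low_res, out_size, out_size)   # back to the model input size
```

Because the intermediate image has at most `side * side` distinct pixels, the output carries the same loss of detail that a genuinely small input would have after upscaling.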

2. Demo

  • Predicted results
    Before recognizing the strings in the bboxes detected by YoloV5, we filter out every bbox smaller than 45*45, because the image resolution of such a bbox is too low to recognize the string correctly.
    Columns: Input image | Scene Text detection | Scene Text recognition (the recognized strings for each example image are listed below)
驗車
委託汽車代檢
元力汽車公司
新竹區監理所
3c配件
玻璃貼
專業包膜
台灣大哥大
myfone
新店中正
加盟門市
西門町

排骨酥麵
非常感謝
tvbs食尚玩家
蘋果日報
壹週刊
財訊
錢櫃雜誌
聯合報
飛碟電台
等報導
排骨酥專賣店
西門町

排骨酥麵
排骨酥麵
嘉義店
永晟
電動工具行
492913338
  • Attention maps in ViT
    We also visualize the attention maps in ViT to check whether the model focuses on the correct location in the image. The visualization results are demonstrated as follows.
    Figure panels: original image; attention map
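
The size filter mentioned under "Predicted results" can be sketched as follows. The README does not say whether "size" means area or side length, so this sketch assumes area of the enclosing axis-aligned rectangle; `bbox_extent` and `filter_small_boxes` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bbox_extent(quad: Sequence[Point]) -> Tuple[float, float]:
    """Width and height of the axis-aligned rectangle enclosing the corners."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return max(xs) - min(xs), max(ys) - min(ys)


def filter_small_boxes(
    quads: Sequence[Sequence[Point]], min_side: int = 45
) -> List[Sequence[Point]]:
    """Keep boxes whose enclosing area is at least min_side * min_side."""
    kept = []
    for quad in quads:
        width, height = bbox_extent(quad)
        # Assumption: "size less than 45*45" is read as an area comparison.
        if width * height >= min_side * min_side:
            kept.append(quad)
    return kept
```

If the intended rule is instead "both sides at least 45 pixels", only the comparison inside the loop needs to change.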

3. Competition Results

  • Public Scores
    We conducted extensive experiments; the results are shown below. They show the improvement from adding each module at each stage. At first, we employed only YoloV5 to detect all the ROIs in the images, and the detection result was not good enough. We also compare ViT with and without data augmentation; the results show that our data augmentation is effective for this task (final score 0.79466 in the sixth row versus 0.73802 in the last row). In addition, we filter out bboxes smaller than 45*45, since their resolution is too low to recognize the strings correctly.
    | Models (Detection / Recognition) | Final score | Precision | Recall |
    |---|---|---|---|
    | YoloV5(L) / ViT(aug) | 0.60926 | 0.7794 | 0.9084 |
    | YoloV5(L) + ROI_transformation(Resnet50) / ViT(aug) | 0.73148 | 0.9261 | 0.9017 |
    | YoloV5(L) + ROI_transformation(Resnet50) + reduce overlap bbox / ViT(aug) | 0.78254 | 0.9324 | 0.9072 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) | 0.78527 | 0.9324 | 0.9072 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(40 * 40) | 0.79373 | 0.9333 | 0.9029 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(45 * 45) | 0.79466 | 0.9335 | 0.9011 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(50 * 50) | 0.79431 | 0.9338 | 0.8991 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(no aug) + filter bbox(45 * 45) | 0.73802 | 0.9335 | 0.9011 |
  • Private Scores
    | Models (Detection / Recognition) | Final score | Precision | Recall |
    |---|---|---|---|
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(40 * 40) | 0.7828 | 0.9328 | 0.8919 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(45 * 45) | 0.7833 | 0.9323 | 0.8968 |
    | YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(50 * 50) | 0.7830 | 0.9325 | 0.8944 |
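
The tables list a "reduce overlap bbox" step without describing it. One common way to remove overlapping detections is a greedy filter on intersection-over-union (IoU), sketched below. This is an assumption for illustration only, not the authors' rule; the function names and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reduce_overlap(
    boxes: List[Tuple[Box, float]], iou_threshold: float = 0.5
) -> List[Tuple[Box, float]]:
    """Greedily keep the highest-scoring boxes that do not overlap a kept box."""
    kept: List[Tuple[Box, float]] = []
    for box, score in sorted(boxes, key=lambda item: item[1], reverse=True):
        # Drop the box if it overlaps any already-kept box too strongly.
        if all(iou(box, other) <= iou_threshold for other, _ in kept):
            kept.append((box, score))
    return kept
```

Each input item pairs a box with its detection confidence, so the most confident box in a cluster of overlapping detections is the one that survives.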

4. Computer Equipment

  • System: Windows 10, Ubuntu 20.04

  • Pytorch version: Pytorch 1.7 or higher

  • Python version: Python 3.6

  • Testing:
    CPU: AMD Ryzen 7 4800H with Radeon Graphics
    RAM: 32GB
    GPU: NVIDIA GeForce GTX 1660Ti 6GB

  • Training:
    CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
    RAM: 256GB
    GPU: NVIDIA GeForce RTX 3090 24GB * 2

5. Getting Started

  • Clone this repo to your local machine
$ git clone https://github.com/come880412/Scene-Text-Detection-and-Recognition.git
$ cd Scene-Text-Detection-and-Recognition

Download pretrained models

  • Scene Text Detection
    Please download the pretrained models from Scene_Text_Detection. There are three folders: "ROI_transformation", "yolo_models", and "yolo_weight". First, put the weights in "ROI_transformation" into ./Scene_Text_Detection/Tranform_card/models/. Second, put all the models in "yolo_models" into ./Scene_Text_Detection/yolov5-master/. Finally, put the weight in "yolo_weight" into ./Scene_Text_Detection/yolov5-master/runs/train/expl/weights/.

  • Scene Text Recognition
    Please download the pretrained models from Scene_Text_Recognition. There are two files in this folder: "best_accuracy.pth" and "character.txt". Put both files into ./Scene_Text_Recogtion/saved_models/.

Inference

  • First download the pretrained models, then change directory to ./Scene_Text_Detection/yolov5-master/
$ python Text_detection.py
  • The results are saved in '../output/'. The folder "example" contains the images with the bboxes detected by YoloV5 and after ROI transformation. The file "example.csv" records the bbox coordinates, listed clockwise starting from the upper-left corner as (x1, y1), (x2, y2), (x3, y3), and (x4, y4). The file "exmaple_45.csv" is the predicted result.
  • If you would like to visualize the bboxes detected by YoloV5, you can use the function public_crop() in the script ../../data_process.py to extract the bboxes from the images.
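
To post-process the saved coordinates, a file like "example.csv" could be read as sketched below. The column layout is an assumption (image name followed by x1, y1, ..., x4, y4); check the actual file before relying on it. `read_boxes` is a hypothetical helper, not part of the repository.

```python
import csv
import io
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[int, int]


def read_boxes(lines: Iterable[str]) -> Iterator[Tuple[str, List[Point]]]:
    """Yield (image_name, [(x1, y1), ..., (x4, y4)]) for each CSV row."""
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        # Assumed layout: name, x1, y1, x2, y2, x3, y3, x4, y4
        name, coords = row[0], [int(value) for value in row[1:9]]
        # Pair up the flat list into four corners, clockwise from the upper left.
        yield name, list(zip(coords[0::2], coords[1::2]))


# In-memory sample row, used here instead of a real output file.
sample = io.StringIO("img_1,10,20,110,20,110,70,10,70\n")
for name, corners in read_boxes(sample):
    print(name, corners)
```

For a real run, pass an open file handle instead of the in-memory sample, for example `open("../output/example.csv", newline="")`.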

Training

  • First download the official dataset and put the data in the path '../dataset/'. After that, use the following script to convert the original data to the training format.
$ python data_process.py
  • Scene_Text_Detection
    There are two models for the scene text detection task: ROI transformation and YoloV5. Use the following scripts to train these two models.
$ cd ./Scene_Text_Detection/yolov5-master # YoloV5
$ python train.py

$ cd ../Tranform_card/ # ROI Transformation
$ python Trainer.py
  • Scene_Text_Recognition
$ cd ./Scene_Text_Recogtion # ViT for text recognition
$ python train.py

References

[1] https://github.com/ultralytics/yolov5
[2] https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
[3] https://github.com/roatienza/deep-text-recognition-benchmark
[4] https://www.pyimagesearch.com/2014/08/25/4-point-opencv-getperspective-transform-example/
[5] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132-7141).

Owner
Gi-Luen Huang