The tutorial is a collection of many other resources and my own notes

Overview
# TOC

Before reading
the tutorial is a collection of many other resources and my own notes. Note that the ref if any in the tutorial means the whole passage. And part to be referred if any means the part has been summarized or detailed by me. Feel free to click the [the part to be referred] to read the original.

CTC_pytorch

1. Why we need CTC? ---> looking back on history

Feel free to skip it if you already know the purpose of CTC coming into being.

1.1. About CRNN

We need to learn CRNN because in the context we need an output to be a sequence.

ref: the overview from CRNN to CTC !! highly recommended !!

part to be referred

multi-digit sequence recognition

  • Characted-based
  • word-based
  • sequence-to-sequence
  • CRNN = CNN + RNN
    • CNN --> relationship between pixel
    • (the small fonts) Specifially, each feature vec of a feature seq is generated from left to right on the feature maps. That means the i-th feature vec is the concatenation of the columns of all the maps. So the shape of the tensor can be reshaped as e.g. (batch_size, 32, 256)

image1



1.2. from Cross Entropy Loss to CTC Loss

Usually, CE is applied to compute loss as the following way. And gt(also target) can be encoded as a stable matrix or vector.

image2

However, in OCR or audio recognition, each target input/gt has various forms. e.g. "I like to play piano" can be unpredictable in handwriting.

image3

Some stroke is longer than expected. Others are short.
Assume that the above example is encoded as number sequence [5, 3, 8, 3, 0].

image4

  • Tips: blank(the blue box symbol here) is introduced because we allow the model to predict a blank label due to unsureness or the end comes, which is similar with human when we are not pretty sure to make a good prediction. ref:lihongyi lecture starting from 3:45

Therefore, we see that this is an one-to-many question where e.g. "I like to play piano" has many target forms. But we not just have one sequence. We might also have other sequence e.g. "I love you", "Not only you but also I like apple" etc, none of which have a same sentence length. And this is what cross entropy cannot achieve in one batch. But now we can encode all sequences/sentences into a new sequence with a max length of all sequences.

e.g.
"I love you" --> len = 10
"How are you" --> len = 11
"what's your name" --> len = 16

In this context the input_length should be >= 16.

For dealing with the expanded targets, CTC is introduced by using the ideas of (1) HMM forward algorithm and (2) dynamic programing.

2. Details about CTC

2.1. intuition: forward algorithm

image5

image6

Tips: the reason we have - inserted between each two token is because, for each moment/horizontal(Note) position we allow the model to predict a blank representing unsureness.

Note that moment is for audio recognition analogue. horizontal position is for OCR analogue.



2.2. implementation: forward algorithm with dynamic programming

the complete code is CTC.py

given 3 samples, they are
"orange" :[15, 18, 1, 14, 7, 5]    len = 6
"apple" :[1, 16, 16, 12, 5]    len = 5
"watermelon" :[[23, 1, 20, 5, 18, 13, 5, 12, 15, 14]  len = 10

{0:blank, 1:A, 2:B, ... 26:Z}

2.2.1. dummy input ---> what the input looks like

# ------------ a dummy input ----------------
log_probs = torch.randn(15, 3, 27).log_softmax(2).detach().requires_grad_()# 15:input_length  3:batchsize  27:num of token(class)
# targets = torch.randint(0, 27, (3, 10), dtype=torch.long)
targets = torch.tensor([[15, 18, 1,  14, 7, 5,  0, 0,  0,  0],
                        [1,  16, 16, 12, 5, 0,  0, 0,  0,  0],
                        [23, 1,  20, 5, 18, 13, 5, 12, 15, 14]]
                        )

# assume that the prediction vary within 15 input_length.But the target length is still the true length.
""" 
e.g. [a,0,0,0,p,0,p,p,p, ...l,e] is one of the prediction
 """
input_lengths = torch.full((3,), 15, dtype=torch.long)
target_lengths = torch.tensor([6,5,10], dtype = torch.long)



2.2.2. expand the target ---> what the target matrix look like

Recall that one target can be encoded in many different forms. So we introduce a targets mat to represent it as follows.

"-d-o-g-" ">
target_prime = targets.new_full((2 * target_length + 1,), blank) # create a targets_prime full of zero

target_prime[1::2] = targets[i, :target_length] # equivalent to insert blanks in targets. e.g. targets = "dog" --> "-d-o-g-"

Now we got target_prime(also expanded target) for e.g. "apple"
target_prime is
tensor([ 0, 1, 0, 16, 0, 16, 0, 12, 0, 5, 0]) which is visualized as the red part(also t1)

image7

Note that the t8 is only for illustration. In the example, the width of target matrix should be 15(input_length).

probs = log_probs[:input_length, i].exp()

Then we convert original inputs from log-space like this, referring to "In practice, the above recursion ..." in original paper https://www.cs.toronto.edu/~graves/icml_2006.pdf

2.3. Alpha Matrix

image8

# alpha matrix init at t1 indicated by purple boxes.
alpha_col = log_probs.new_zeros((target_length * 2 + 1,))
alpha_col[0] = probs[0, blank] # refers to green box
alpha_col[1] = probs[0, target_prime[1]]
  • blank is the index of blank(here it's 0)
  • target_prime[1] refers to the 1-st index of the token. e.g. "apple": "a", "orange": "o"

2.4. Dynamic programming based on 3 conditions

refer to the details in CTC.py

reference:

Owner
手写AI
手写AI
Source Code for 'Practical Python Projects' (video) by Sunil Gupta

Apress Source Code This repository accompanies %Practical Python Projects by Sunil Gupta (Apress, 2021). Download the files as a zip using the green b

Apress 2 Jun 01, 2022
Build AGNOS, the operating system for your comma three

agnos-builder This is the tool to build AGNOS, our Ubuntu based OS. AGNOS runs on the comma three devkit. NOTE: the edk2_tici and agnos-firmare submod

comma.ai 21 Dec 24, 2022
API spec validator and OpenAPI document generator for Python web frameworks.

API spec validator and OpenAPI document generator for Python web frameworks.

1001001 249 Dec 22, 2022
A next-generation curated knowledge sharing platform for data scientists and other technical professions.

Knowledge Repo The Knowledge Repo project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using

Airbnb 5.2k Dec 27, 2022
ACPOA plugin creation helper

ACPOA Plugin What is ACPOA ACPOA is the acronym for "Application Core for Plugin Oriented Applications". It's a tool to create flexible and extendable

Leikt Sol'Reihin 1 Oct 20, 2021
Fast syllable estimation library based on pattern matching.

Syllables: A fast syllable estimator for Python Syllables is a fast, simple syllable estimator for Python. It's intended for use in places where speed

ProseGrinder 26 Dec 14, 2022
Searches a document for hash tags. Support multiple natural languages. Works in various contexts.

ht-getter Searches a document for hash tags. Supports multiple natural languages. Works in various contexts. This package uses a non-regex approach an

Rairye 1 Mar 01, 2022
A plugin to introduce a generic API for Decompiler support in GEF

decomp2gef A plugin to introduce a generic API for Decompiler support in GEF. Like GEF, the plugin is battery-included and requires no external depend

Zion 379 Jan 08, 2023
Minimal reproducible example for `mkdocstrings` Python handler issue

Minimal reproducible example for `mkdocstrings` Python handler issue

Hayden Richards 0 Feb 17, 2022
Highlight Translator can help you translate the words quickly and accurately.

Highlight Translator can help you translate the words quickly and accurately. By only highlighting, copying, or screenshoting the content you want to translate anywhere on your computer (ex. PDF, PPT

Coolshan 48 Dec 21, 2022
Use Brainf*ck with python!

Brainfudge Run Brainf*ck code with python! Classes Interpreter(array_len): encapsulate all functions into class __init__(self, array_len: int=30000) -

1 Dec 14, 2021
SamrSearch - SamrSearch can get user info and group info with MS-SAMR

SamrSearch SamrSearch can get user info and group info with MS-SAMR.like net use

knight 10 Oct 06, 2022
Explorative Data Analysis Guidelines

Explorative Data Analysis Get data into a usable format! Find out if the following predictive modeling phase will be successful! Combine everything in

Florian Rohrer 18 Dec 26, 2022
A markdown wiki and dashboarding system for Datasette

datasette-notebook A markdown wiki and dashboarding system for Datasette This is an experimental alpha and everything about it is likely to change. In

Simon Willison 19 Apr 20, 2022
100 Days of Code Learning program to keep a habit of coding daily and learn things at your own pace with help from our remote community.

100 Days of Code Learning program to keep a habit of coding daily and learn things at your own pace with help from our remote community.

Git Commit Show by Invide 41 Dec 30, 2022
30 days of Python programming challenge is a step-by-step guide to learn the Python programming language in 30 days

30 days of Python programming challenge is a step-by-step guide to learn the Python programming language in 30 days. This challenge may take more than100 days, follow your own pace.

Asabeneh 17.7k Jan 07, 2023
the project for the most brutal and effective language learning technique

- "The project for the most brutal and effective language learning technique" (c) Alex Kay The langflow project was created especially for language le

Alexander Kaigorodov 7 Dec 26, 2021
Zero configuration Airflow plugin that let you manage your DAG files.

simple-dag-editor SimpleDagEditor is a zero configuration plugin for Apache Airflow. It provides a file managing interface that points to your dag_fol

30 Jul 20, 2022
🧙 A simple, typed and monad-based Result type for Python.

meiga 🧙 A simple, typed and monad-based Result type for Python. Table of Contents Installation 💻 Getting Started 📈 Example Features Result Function

Alice Biometrics 31 Jan 08, 2023