A Roundup of 7 Visual MLPs (Part 1)

2022-07-19 05:47:00 【byzy】

If the MSA (multi-head self-attention) part of a vision Transformer is removed, can the performance stay at the same level? Put differently, is it feasible to solve vision tasks with MLPs alone? These questions are what motivate visual MLPs.

1. EANet (External Attention)

Paper: https://arxiv.org/pdf/2105.02358.pdf

$\begin{aligned} A&=\textup{Norm}(FM_k^T)\\ F_{out}&=AM_v \end{aligned}$

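where $M_k$ and $M_v$ are learnable parameters that are independent of the input. Norm is a double normalization, applied along the columns and the rows in turn (a softmax over each column, followed by a sum-to-one normalization over each row):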

$\tilde{\alpha}_{i,j}=(FM^T_k)_{i,j},\quad \hat{\alpha}_{i,j}=\frac{\exp(\tilde{\alpha}_{i,j})}{\sum_k \exp(\tilde{\alpha}_{k,j})},\quad \alpha_{i,j}=\frac{\hat{\alpha}_{i,j}}{\sum_k \hat{\alpha}_{i,k}}$
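The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the shapes (`N` tokens, feature width `d`, `S` memory slots) and the random initialization are assumptions made for the example, and the max-subtraction inside the softmax is only a numerical-stability detail.

```python
import numpy as np

def double_norm(scores):
    """Double normalization: softmax over the first index (each column),
    then sum-to-one normalization over the second index (each row)."""
    # Subtracting the column-wise max does not change the softmax result.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    a_hat = e / e.sum(axis=0, keepdims=True)
    return a_hat / a_hat.sum(axis=1, keepdims=True)

def external_attention(F, M_k, M_v):
    """A = Norm(F M_k^T), F_out = A M_v."""
    A = double_norm(F @ M_k.T)   # (N, S)
    return A @ M_v, A            # (N, d), (N, S)

# Assumed toy sizes: N tokens, d features, S external memory slots.
rng = np.random.default_rng(0)
N, d, S = 16, 8, 4
F = rng.standard_normal((N, d))
M_k = rng.standard_normal((S, d))   # learnable in practice, input-independent
M_v = rng.standard_normal((S, d))   # learnable in practice, input-independent
F_out, A = external_attention(F, M_k, M_v)
```

Because $M_k$ and $M_v$ do not depend on the input, the cost grows linearly with the number of tokens `N` for a fixed `S`.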

2. MLP-Mixer

Paper: https://arxiv.org/pdf/2105.01601.pdf

Mixer Layer

Each MLP here has two layers, with a GELU activation function.

Network structure

The image is divided into non-overlapping patches, and each patch is projected to dimension $C$, giving $X\in\mathbb{R}^{S\times C}$ as the input to the Mixer. The Mixer contains 2 MLPs: the first acts on the columns (all columns share parameters), and the second acts on the rows (all rows share parameters).

The Mixer formulas ($S$ is the number of patches):

$\begin{aligned} U_{\ast ,i}&=X_{\ast,i}+W_2\sigma(W_1\textup{LayerNorm}(X)_{\ast,i}),\quad i=1,\cdots,C\\ Y_{j,\ast}&=U_{j,\ast}+W_4\sigma(W_3\textup{LayerNorm}(U)_{j,\ast}),\quad j=1,\cdots,S \end{aligned}$

Mixer does not use a position embedding, because the token-mixing MLP is sensitive to the order of the input tokens and can therefore learn positional information on its own.
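The two Mixer formulas can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the hidden widths `D_S` and `D_C`, the tanh approximation of GELU, and a LayerNorm over the channel axis without learnable scale and shift are choices made for the example, not details given in the text above.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed here for simplicity)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channels; no learnable scale/shift (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mixer_layer(X, W1, W2, W3, W4):
    """X: (S, C). W1: (D_S, S), W2: (S, D_S), W3: (D_C, C), W4: (C, D_C)."""
    # Token mixing: the same MLP is applied to every column X[:, i].
    U = X + W2 @ gelu(W1 @ layer_norm(X))
    # Channel mixing: the same MLP is applied to every row U[j, :].
    Y = U + gelu(layer_norm(U) @ W3.T) @ W4.T
    return Y

# Assumed toy sizes: S patches, C channels, hidden widths D_S and D_C.
rng = np.random.default_rng(0)
S, C, D_S, D_C = 9, 8, 12, 16
X = rng.standard_normal((S, C))
W1 = rng.standard_normal((D_S, S)) * 0.1
W2 = rng.standard_normal((S, D_S)) * 0.1
W3 = rng.standard_normal((D_C, C)) * 0.1
W4 = rng.standard_normal((C, D_C)) * 0.1
Y = mixer_layer(X, W1, W2, W3, W4)
```

Note that `W1` has shape `(D_S, S)`, so the token-mixing weights are tied to a fixed number of patches.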

3. CycleMLP

Paper: https://arxiv.org/pdf/2107.10224.pdf

This is essentially an improvement on the MLP-Mixer described above.

The main problems of conventional MLPs: (1) an MLP over the spatial dimension cannot adapt to different input sizes; (2) an MLP over the channel dimension cannot capture spatial interactions.

Model structure

Patch Embedding

A window of size 7 (stride 4) divides the image into overlapping patches. A linear layer then maps each patch to a high-dimensional feature.

Between stages there is a transition part that reduces the number of tokens and increases the channel dimension.
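The overlapping patch embedding (window 7, stride 4, followed by a linear projection) can be sketched as below. This is an illustrative sketch only: it assumes no padding at the image border, a toy 32×32×3 input, and a hypothetical projection matrix `W_proj` with output width `D`; none of these values come from the text above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def overlapping_patch_embed(img, W_proj, window=7, stride=4):
    """img: (H, W, C_in); W_proj: (C_in * window * window, D).
    Returns tokens of shape (n_h * n_w, D) and the patch grid (n_h, n_w)."""
    # All window x window views over the two spatial axes: (H-6, W-6, C_in, 7, 7).
    windows = sliding_window_view(img, (window, window), axis=(0, 1))
    # Keep every `stride`-th window, so neighbouring patches overlap by 3 pixels.
    windows = windows[::stride, ::stride]
    n_h, n_w = windows.shape[:2]
    patches = windows.reshape(n_h * n_w, -1)   # flatten each patch
    return patches @ W_proj, (n_h, n_w)        # linear projection to D features

# Assumed toy input and projection width.
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
D = 16
W_proj = rng.standard_normal((3 * 7 * 7, D)) * 0.01
tokens, grid = overlapping_patch_embed(img, W_proj)
```

Without padding, a side of length $H$ yields $\lfloor (H-7)/4 \rfloor + 1$ patches, so the token count scales with the input size instead of being fixed.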

CycleMLP block

The Channel MLP consists of 2 linear layers (channel FC) plus a GELU. Channel FC is independent of the input image size, but its receptive field is only 1 pixel.

Compared with a conventional MLP, CycleMLP uses Cycle FC layers, which lets MLP-like models handle input images of different sizes. The Cycle FC layer uses 3 parallel Cycle FC operators.

Cycle FC output ($S_P$ is the receptive field size):

$Y_{i,j}=\sum_{c=0}^{C_i-1}F^T_{j,c}\cdot X_{i+(c\,\%\,S_P),\,c}$

Pseudo-kernel

The region obtained by projecting the sampling points onto the spatial plane.
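The Cycle FC output formula above can be sketched for a 1D token sequence as follows. This is a minimal illustration, not the authors' implementation: the zero padding at the end of the sequence (needed when $i + c\,\%\,S_P$ runs past the last token) and the toy sizes are assumptions made for the example.

```python
import numpy as np

def cycle_fc_1d(X, F, S_P):
    """X: (N, C_in) tokens, F: (C_in, C_out) weights, S_P: receptive field size.
    Computes Y[i, j] = sum_c F[c, j] * X[i + c % S_P, c]."""
    N, C_in = X.shape
    offsets = np.arange(C_in) % S_P                  # per-channel spatial offset
    # Assumption: positions past the end of the sequence read zeros.
    X_pad = np.vstack([X, np.zeros((S_P - 1, C_in))])
    rows = np.arange(N)[:, None] + offsets[None, :]  # (N, C_in) source positions
    cols = np.arange(C_in)[None, :]
    X_shift = X_pad[rows, cols]                      # channel c taken from token i + c % S_P
    return X_shift @ F                               # (N, C_out)

# Assumed toy sizes.
rng = np.random.default_rng(0)
N, C_in, C_out = 10, 6, 4
X = rng.standard_normal((N, C_in))
F = rng.standard_normal((C_in, C_out))
Y = cycle_fc_1d(X, F, S_P=3)
```

With `S_P=1` every offset is zero and the operator reduces to an ordinary channel FC, `X @ F`. The weight shape depends only on the channel counts, which is why the layer accepts any number of tokens.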

4. gMLP

Paper: https://arxiv.org/pdf/2105.08050.pdf

gMLP (g stands for gating) consists of a stack of identical blocks. Each block is as follows:

$\begin{aligned} Z&=\sigma(XU)\\ \tilde{Z}&=s(Z)\\ Y&=\tilde{Z}V \end{aligned}$

where $\sigma$ is the activation function, $s(\cdot)$ captures spatial interactions (when $s(Z)=Z$, the block is an ordinary two-layer MLP), and $\odot$ denotes element-wise multiplication. The model does not need a position embedding, because positional information can be captured by $s(\cdot)$.

The simplest choice for capturing spatial interactions is a linear layer:

$f_{W,b}(Z)=WZ+b,s(Z)=Z\odot f_{W,b}(Z)$

Here $s(\cdot)$ is called the SGU (spatial gating unit). It is somewhat like SE (see the third item in "A Roundup of 5 Kinds of 2D Attention"), except that the pooling is replaced by a linear layer.
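The block equations and this simplest form of the SGU can be sketched as follows. It is a minimal NumPy illustration that follows only the three equations and the gating formula above: the tanh approximation of GELU as $\sigma$, the per-token bias shape `(n, 1)`, and the toy sizes `n`, `d`, `d_ffn` are assumptions made for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, assumed as the activation sigma
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def spatial_gating_unit(Z, W, b):
    """s(Z) = Z * f(Z) with f(Z) = W Z + b.
    Z: (n, d_ffn), W: (n, n) mixes tokens, b: (n, 1) per-token bias (assumed shape)."""
    return Z * (W @ Z + b)

def gmlp_block(X, U, V, W, b):
    """Z = sigma(X U), Z_tilde = s(Z), Y = Z_tilde V."""
    Z = gelu(X @ U)                        # (n, d_ffn)
    Z_tilde = spatial_gating_unit(Z, W, b)
    return Z_tilde @ V                     # (n, d)

# Assumed toy sizes: n tokens, model width d, hidden width d_ffn.
rng = np.random.default_rng(0)
n, d, d_ffn = 6, 8, 16
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, d_ffn)) * 0.1
V = rng.standard_normal((d_ffn, d)) * 0.1
W = rng.standard_normal((n, n)) * 0.1
b = np.ones((n, 1))
Y = gmlp_block(X, U, V, W, b)
```

Setting `W` to zero and `b` to one makes the gate equal to 1 everywhere, so $s(Z)=Z$ and the block falls back to the ordinary two-layer MLP mentioned above.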

An equally effective approach is to split $Z$ along the channel dimension into two parts, $Z_1$ and $Z_2$, then