Seq2seq and attention model learning notes
2022-07-26 08:19:00 【I am I】
Sequence to Sequence (seq2seq) and Attention (link to the original text)
seq2seq (sequence-to-sequence) models map an input sequence to an output sequence, solving the problem that the input and output may have different lengths. They are now widely used for content generation tasks such as machine translation, question answering, and image captioning. The lengths of the input and output sequences are variable!

Encoder-decoder framework:

Conditional language models (CLM)
In an ordinary language model, the input and output are the same kind of data (text). In a CLM, the conditioning information can come from other sources, for example images, another language, or speech.

The general workflow is as follows:
Feed the source tokens and the previously generated target tokens into the network;
Get a vector representation of the context (the source and the previous target tokens) from the network decoder;
From this vector, predict a probability distribution for the next token.

The vector produced by the decoder is passed through a linear layer and then a softmax function to obtain the probability distribution over the next token.
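A minimal PyTorch sketch of this projection step (the sizes, tensors, and variable names here are invented for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
hidden_size, vocab_size = 512, 10000

decoder_state = torch.randn(1, hidden_size)    # vector produced by the decoder at this step
to_vocab = nn.Linear(hidden_size, vocab_size)  # linear layer mapping to vocabulary logits

logits = to_vocab(decoder_state)               # one unnormalized score per vocabulary token
probs = torch.softmax(logits, dim=-1)          # probability distribution over the next token
next_token = probs.argmax(dim=-1)              # e.g. a greedy choice of the next token
```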
The simplest model: an encoder and a decoder built from two RNNs
The simplest encoder-decoder model consists of two RNNs (LSTMs): one for the encoder and one for the decoder. The encoder RNN reads the source sentence, and its final state is used as the initial state of the decoder RNN. The hope is that this final encoder state "encodes" all the information about the source, so that the decoder can generate the target sentence from this vector alone.
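Below is a minimal PyTorch sketch of such a two-RNN model; the class name, vocabulary sizes, embedding size, and hidden size are illustrative assumptions, not the setup used in any particular paper:

```python
import torch
import torch.nn as nn

class SimpleSeq2Seq(nn.Module):
    """Minimal sketch of the two-RNN encoder-decoder described above."""
    def __init__(self, src_vocab, tgt_vocab, emb_size=256, hidden_size=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_size)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the source sentence; its final (h, c) state "encodes" the source.
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # The decoder starts from the encoder's final state and reads the previous target tokens.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        # Each decoder state is projected to logits over the target vocabulary.
        return self.out(dec_out)
```

During training, the decoder input would be the target sequence shifted by one position, and the returned logits would be compared with the reference tokens using the cross-entropy loss described below.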
This model can be modified in different ways: for example, the encoder and decoder can have several layers. Such a multi-layer model is used in the paper Sequence to Sequence Learning with Neural Networks, one of the first attempts to solve sequence-to-sequence tasks with neural networks.
In the same paper, the authors took the last encoder state and visualized it for several examples. Interestingly, the representations of sentences with similar meaning but different structure are very similar!
Cross-entropy loss
The standard loss function is the cross-entropy loss. Let the target (one-hot) distribution be $p^*$ and the predicted distribution be $p$:

$$Loss(p^*, p) = -p^* \log(p) = -\sum_{i=1}^{|V|} p_i^* \log(p_i)$$

Since only one of the $p_i^*$ is non-zero (the one corresponding to the correct token $y_t$), this simplifies to

$$Loss(p^*, p) = -\log(p_{y_t}) = -\log p(y_t \mid y_{<t}, x)$$
At every step, we maximize the probability the model assigns to the correct token (the original text shows an illustration of a single time step).

For the whole example, the loss is $-\sum_{t=1}^{n} \log p(y_t \mid y_{<t}, x)$. The original text also illustrates the training process (the illustration is for an RNN model, but the model can be different).
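A small sketch of this sequence loss in PyTorch (shapes and values are invented for illustration; `F.cross_entropy` applies the log-softmax to the logits internally):

```python
import torch
import torch.nn.functional as F

# Hypothetical sequence length and vocabulary size.
n, vocab_size = 7, 10000
logits = torch.randn(n, vocab_size)           # one row of logits per target position
targets = torch.randint(0, vocab_size, (n,))  # the reference tokens y_1..y_n

# Summing the per-position cross-entropy gives -sum_t log p(y_t | y_<t, x).
loss = F.cross_entropy(logits, targets, reduction="sum")
```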

Attention
The Problem of Fixed Encoder Representation
In the models we have seen so far, the encoder compresses the entire source sentence into a single vector. This can be very hard: the number of possible source meanings is infinite. When the encoder is forced to put all the information into one vector (e.g. of size 512), it is likely to forget something.
It is not only hard for the encoder to put all the information into a single vector; it is also hard for the decoder. The decoder sees only one representation of the source, yet at each generation step different parts of the source may be more useful than others. In the current setup, the decoder has to extract the relevant information from the same fixed representation, which is not easy.

An attention mechanism is a part of the neural network that, at each decoder step, decides which parts of the source are most important. In this setting, the encoder does not have to compress the whole source into a single vector: it provides representations for all source tokens (for example, all RNN states instead of only the last one).

The computation scheme is as follows: at each decoder step, attention scores are computed between the current decoder state and every encoder state; a softmax turns these scores into attention weights; and the weighted sum of the encoder states (the attention output) is used together with the decoder state to make the prediction at this step.
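A minimal sketch of one attention step for a single example, using the dot product as the scoring function (all names and sizes here are assumptions for illustration):

```python
import torch

# Hypothetical source length and hidden size.
src_len, hidden_size = 10, 512
encoder_states = torch.randn(src_len, hidden_size)  # one representation per source token
decoder_state = torch.randn(hidden_size)             # current decoder state

scores = encoder_states @ decoder_state   # attention scores (dot product here)
weights = torch.softmax(scores, dim=0)    # attention weights, sum to 1
context = weights @ encoder_states        # weighted sum of encoder states (attention output)
# `context` is then combined with the decoder state to predict the next token.
```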

How are the attention scores computed? There are several options; the most popular are listed here (a code sketch of all three follows the list):
Dot product, the simplest method;
Bilinear function (also called "Luong attention"), from the paper Effective Approaches to Attention-based Neural Machine Translation;
Multi-layer perceptron (also called "Bahdanau attention"), the method proposed in the original attention paper.
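The three scoring functions could be sketched as follows, for a single decoder state and a single encoder state with invented sizes (real implementations compute scores for all source positions at once):

```python
import torch
import torch.nn as nn

hidden = 512
h_t = torch.randn(hidden)  # decoder state at the current step
h_s = torch.randn(hidden)  # one encoder state (a score is computed per source token)

# 1. Dot product: the simplest score.
score_dot = h_t @ h_s

# 2. Bilinear / "Luong attention": a learned matrix W between the two states.
W = nn.Linear(hidden, hidden, bias=False)
score_bilinear = h_t @ W(h_s)

# 3. Multi-layer perceptron / "Bahdanau attention": a small feed-forward net over both states.
W1 = nn.Linear(hidden, hidden, bias=False)
W2 = nn.Linear(hidden, hidden, bias=False)
v = nn.Linear(hidden, 1, bias=False)
score_mlp = v(torch.tanh(W1(h_t) + W2(h_s)))
```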