ViLT Vision-and-Language Transformer Without Convolution or Region Supervision
2022-07-19 01:50:00 【Chubby Zhu】
Published at ICML 2021; often described as the simplest multimodal model.
Paper title: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Research motivation: Existing VLP (vision-and-language pre-training) methods rely heavily on the image feature extraction process, most involving region supervision (such as object detection) and convolutional architectures (such as ResNet). On the one hand, this requires heavy computation, hurting speed and efficiency; on the other hand, expressive power is limited, capped by the visual embedder and its predefined visual vocabulary.
Main contributions: ViLT is currently the multimodal Transformer with the smallest parameter count. ViLT uses a pre-trained ViT to initialize the interaction Transformer, so the interaction layers can process visual features directly, with no need for an additional visual encoder (such as Faster R-CNN).
Research idea: Propose a Vision-and-Language Transformer (ViLT) that handles the two modalities in a unified way. It differs from previous VLP models in its shallow, convolution-free embedding of pixel-level inputs: removing the deep embedder used only for the visual input significantly reduces model size and running time.
Research method: The visual embedder (VE) is designed to be as lightweight as the textual embedder (TE), so the bulk of the computation concentrates on modality interaction, as shown in Figure 2(d). The paper proposes a taxonomy of vision-and-language models based on two points: 1) whether the two modalities have a comparable level of dedicated parameters and/or computation; 2) whether the two modalities interact in a deep network. Combining these points yields the four prototypes in Figure 2.

Research process:
The pre-training model uses three pre-training tasks.
Image Text Matching
With probability 0.5, the image in a text-image pair is randomly replaced with a different image. An ITM head (a single linear layer) then maps the output feature of the pooled token to binary logits, used to judge whether the image and text match.
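As a minimal sketch of the step above (the hidden size and the random toy inputs are assumptions for illustration, not values from the paper), the ITM head is just one linear projection from the pooled feature to two logits:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim = 768  # assumed ViT-B hidden size


def itm_head(pooled_feature, W, b):
    """Single linear layer mapping the pooled output feature to two
    logits: index 0 = mismatch, index 1 = match."""
    return pooled_feature @ W + b


# Toy pooled feature and randomly initialized head weights.
pooled = rng.standard_normal(hidden_dim)
W = rng.standard_normal((hidden_dim, 2)) * 0.02
b = np.zeros(2)

logits = itm_head(pooled, W, b)
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over {mismatch, match}
```

In training, these logits would go into a standard binary cross-entropy loss against the match/mismatch label.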
Meanwhile, following UNITER, ViLT adds a word patch alignment (WPA) objective that computes an alignment score between the textual subset and the visual subset (this part requires reading the UNITER paper).
The idea is to compute the correlation between word embeddings and image-patch visual embeddings (no formula is given here; presumably the details are left to the paper itself).
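Purely as an illustration of "correlation between word and patch embeddings" (the actual WPA objective reportedly solves an optimal-transport problem following UNITER, which this sketch does not implement), a naive alignment score could average each word's best cosine similarity over the patches:

```python
import numpy as np

rng = np.random.default_rng(0)


def naive_alignment_score(word_emb, patch_emb):
    """Toy alignment: cosine similarity between every (word, patch)
    pair, then average each word's best-matching patch similarity."""
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    sim = w @ p.T                  # (num_words, num_patches)
    return sim.max(axis=1).mean()  # scalar score in [-1, 1]


words = rng.standard_normal((6, 768))      # 6 toy word embeddings
patches = rng.standard_normal((196, 768))  # 14x14 toy patch embeddings
score = naive_alignment_score(words, patches)
```

All shapes and inputs here are made up for the demo; the point is only the pairwise-similarity structure of the score.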
Masked Language Modeling
15% of the word tokens are randomly masked and predicted from the joint vision-and-text representation.
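The masking step can be sketched as below (a simplified version: BERT's full recipe additionally keeps or randomly replaces some selected tokens instead of always writing `[MASK]`, which is omitted here):

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; labels hold the original token at masked
    positions and None elsewhere (ignored in the loss)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = ["a", "gi", "##raf", "##fe", "eats", "leaves"]
masked, labels = mask_tokens(tokens)
```

The model then predicts each `[MASK]` position from the joint vision-text representation, with the loss computed only where `labels` is not `None`.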
Whole Word Masking
This is a technique for masking word tokens whose purpose is to force the model to make full use of information from the other modality.
The pre-training task works as follows. Take the word "giraffe" as an example:
- The tokenizer splits it into 3 pieces: ["gi", "##raf", "##fe"].
- If not all of the pieces are masked, e.g. ["gi", "[MASK]", "##fe"], the model may predict the masked "##raf" from the two nearby language tokens ["gi", "##fe"] alone, without using the contextual information in the image.
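The example above can be sketched as follows: wordpieces are grouped back into whole words (a piece starting with "##" continues the previous word), and a selected word is masked as a unit, so no partial pieces like "gi"/"##fe" survive as textual hints:

```python
import random

MASK = "[MASK]"


def whole_word_mask(pieces, mask_prob=0.15, seed=0):
    """Group wordpieces into whole words, then mask every piece of a
    selected word together (all-or-nothing per word)."""
    rng = random.Random(seed)
    words = []  # each entry: list of piece indices forming one word
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(pieces)
    for group in words:
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = MASK
    return masked


pieces = ["a", "gi", "##raf", "##fe", "eats", "leaves"]
```

With this grouping, ["gi", "##raf", "##fe"] is either fully masked or fully visible, never the partially masked case the post warns about.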
Experimental results:
Datasets: Microsoft COCO, Visual Genome, SBU Captions, Google Conceptual Captions (pre-training uses all four datasets).

Ablation study, mainly verifying the effect of two techniques used in pre-training and one used in fine-tuning:
- Whole word masking
- Masked patch prediction
- Image augmentation: RandAugment

The experiments reflect the importance of whole word masking in pre-training and of image augmentation in fine-tuning.
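For orientation, RandAugment has only two hyperparameters: the number of ops sampled per image and a single shared magnitude. The sketch below shows that sampling scheme only; the op names are illustrative stand-ins (the real implementation draws from a fixed set of image transforms, and ViLT reportedly excludes color inversion and cutout from it):

```python
import random

# Illustrative op names; the real RandAugment uses a fixed list of
# ~14 image transforms (rotate, shear, color, posterize, ...).
OPS = ["identity", "rotate", "shear_x", "shear_y", "translate_x",
       "translate_y", "color", "contrast", "brightness", "sharpness",
       "posterize", "solarize", "equalize", "auto_contrast"]


def rand_augment_policy(num_ops=2, magnitude=9, seed=None):
    """Sample `num_ops` transforms uniformly at random (with
    replacement); each is applied at the shared global `magnitude`."""
    rng = random.Random(seed)
    return [(rng.choice(OPS), magnitude) for _ in range(num_ops)]


policy = rand_augment_policy(num_ops=2, magnitude=9, seed=0)
```

In practice one would use a library implementation (e.g. torchvision's `RandAugment`) rather than this toy sampler.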
Summary: The proposed method greatly improves efficiency while showing comparable performance: it is up to 60x faster than region-feature methods and 4x faster than grid-feature methods, with similar or even better performance on downstream tasks.
ViLT is competitive with models equipped with convolutional visual embedders (such as Faster R-CNN and ResNet). Future VLP work should pay more attention to the modality interaction inside the Transformer module.
Although ViLT-B/32 is striking, what it really demonstrates is that an efficient VLP model without convolution or region supervision can still be competitive.
Scalability: As papers on large-scale Transformers have shown, pre-trained Transformers perform well given an appropriate amount of data. This observation paves the way for larger ViLT variants (such as ViLT-L and ViLT-H). Training larger models is left to future work, since aligned vision-and-language datasets are still scarce.
Masked modeling of visual inputs: The authors believe that the alternating or simultaneous clustering methods used in unsupervised visual research could be applied here, and encourage future work to design more sophisticated masking objectives for the visual modality without using region supervision.
Augmentation strategies: Previous work on contrastive visual representation learning has shown that, compared with simpler augmentation strategies, Gaussian blur (which RandAugment does not use) brings significant gains in downstream performance. Exploring appropriate augmentation strategies for textual and visual inputs would be a valuable addition.