Vision transformers (ViTs) have found only limited practical use in processing images

Last update: Sep 10, 2022

Related tags

Overview

CXV

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reason for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce its GPU usage. Inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for class token and positional embeddings used by the ViTs. CXV outperforms other architectures, token mixers (eg ConvMixer, FNet and MLP Mixer), transformer models (eg ViT, CCT, CvT and hybrid Xformers), and ResNets for image classification in scenarios with limited data and GPU resources.

Models:

CNV - Convolutional Nyströmformer for Vision
CPV - Convolutional Performer for Vision
CLTV - Convolutional Linear Transformer for Vision

Vision transformers (ViTs) have found only limited practical use in processing images

Related tags

Overview

CXV

Convolutional Xformers for Vision

Owner

Cloudwalker

👐OpenHands : Making Sign Language Recognition Accessible (WiP 🚧👷‍♂️🏗)

Code for `BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery`, Neurips 2021

Learning from Synthetic Data with Fine-grained Attributes for Person Re-Identification

Code for MarioNette: Self-Supervised Sprite Learning, in NeurIPS 2021

Measures input lag without dedicated hardware, performing motion detection on recorded or live video

A simple pygame dino game which can also be trained and played by a NEAT KI

PyTorch CZSL framework containing GQA, the open-world setting, and the CGE and CompCos methods.

An implementation of DeepMind's Relational Recurrent Neural Networks in PyTorch.

Self-Supervised CNN-GCN Autoencoder

arxiv-sanity, but very lite, simply providing the core value proposition of the ability to tag arxiv papers of interest and have the program recommend similar papers.

Fuse radar and camera for detection

Visual dialog agents with pre-trained vision-and-language encoders.

An implementation of Equivariant e2 convolutional kernals into a convolutional self attention network, applied to radio astronomy data.

A Fast Monotone Rotating Shallow Water model

Framework to build and train RL algorithms

Code repository for the paper: Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild (ICCV 2021)

PED: DETR for Crowd Pedestrian Detection

List of awesome things around semantic segmentation 🎉

FaceAPI: AI-powered Face Detection & Rotation Tracking, Face Description & Recognition, Age & Gender & Emotion Prediction for Browser and NodeJS using TensorFlow/JS

:boar: :bear: Deep Learning based Python Library for Stock Market Prediction and Modelling