Vision Transformer Segmentation Network

This PyTorch implementation of ViT produces an output of the same size as the input in a simple, straightforward way: it applies the inverse of the patch rearrange operation to all the predicted outputs. This enables convolution-free multi-class segmentation.
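The inverse rearrange idea can be illustrated with a minimal sketch. It uses NumPy rather than PyTorch or einops to stay dependency-light, and the helper names `to_patches` and `from_patches` are hypothetical, not taken from this repository. It assumes the usual ViT patch layout (each patch flattened in row, column, channel order) and that the network predicts one vector of patch_size * patch_size * num_classes values per patch.

```python
import numpy as np

def to_patches(img, p):
    """Split a (C, H, W) image into (N, p*p*C) flattened patches."""
    c, h, w = img.shape
    gh, gw = h // p, w // p                # patch grid height and width
    x = img.reshape(c, gh, p, gw, p)       # c, gh, p1, gw, p2
    x = x.transpose(1, 3, 2, 4, 0)         # gh, gw, p1, p2, c
    return x.reshape(gh * gw, p * p * c)

def from_patches(tokens, p, grid, c):
    """Inverse of to_patches: (N, p*p*C) tokens back to a (C, H, W) map."""
    gh, gw = grid
    x = tokens.reshape(gh, gw, p, p, c)    # gh, gw, p1, p2, c
    x = x.transpose(4, 0, 2, 1, 3)         # c, gh, p1, gw, p2
    return x.reshape(c, gh * p, gw * p)

# Round trip with the default sizes: 112x112 image, 7x7 patches, 1 channel.
img = np.arange(112 * 112, dtype=np.float32).reshape(1, 112, 112)
tokens = to_patches(img, 7)                # shape (256, 49)
restored = from_patches(tokens, 7, (16, 16), 1)
print(tokens.shape, np.array_equal(restored, img))
```

For segmentation, the same `from_patches` call with `c` set to the number of classes would turn per-patch predictions into a full-resolution map of shape (num_classes, H, W). The same reshape and transpose steps apply to PyTorch tensors via `reshape` and `permute`.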

Most of the code is taken from https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py

Default Architecture Parameters:

model = ViTSeg( image_size=112, 
                channels=1,
                patch_size=7, 
                num_classes=1, 
                dim=768, 
                depth=6, 
                heads=12, 
                mlp_dim=2048, 
                learned_pos=False, 
                use_token=False)

image_size: An integer or a tuple defining the size of the input image (accepting arbitrary image sizes would require some code changes)
channels: An integer defining the number of channels in the input image
patch_size: An integer or a tuple defining the size of the patches
num_classes: An integer defining the number of channels in the output
dim: An integer defining the size of the embedding dimension
depth: An integer defining the number of transformer layers
heads: An integer defining the number of attention heads in the transformer layers
mlp_dim: An integer defining the size of the MLP in the transformer layers
learned_pos: A boolean which, if true, switches from fixed positional encodings to learned positional encodings
use_token: A boolean which, if true, adds a CLS token to the input and output
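As a rough sanity check on how these parameters interact, the sketch below computes the patch grid and token sequence length for a given configuration. The function names `patch_grid` and `token_shapes` are hypothetical and not part of this repository. The sketch assumes the image size must be divisible by the patch size, that the CLS token adds one element to the sequence, and that each output patch holds patch_height * patch_width * num_classes values.

```python
def _pair(value):
    """Accept an int or a (height, width) tuple, as the parameters above do."""
    return (value, value) if isinstance(value, int) else tuple(value)

def patch_grid(image_size, patch_size):
    """Return the number of patches along height and width."""
    ih, iw = _pair(image_size)
    ph, pw = _pair(patch_size)
    if ih % ph or iw % pw:
        raise ValueError("image size must be divisible by patch size")
    return ih // ph, iw // pw

def token_shapes(image_size=112, patch_size=7, channels=1,
                 num_classes=1, use_token=False):
    """Summarise the sequence layout implied by a configuration."""
    gh, gw = patch_grid(image_size, patch_size)
    ph, pw = _pair(patch_size)
    num_patches = gh * gw
    return {
        "num_patches": num_patches,
        # Assumption: the CLS token adds exactly one sequence element.
        "seq_len": num_patches + (1 if use_token else 0),
        "patch_dim": channels * ph * pw,         # flattened input patch
        # Assumption: one value per pixel and class in each output patch.
        "out_patch_dim": num_classes * ph * pw,
    }

print(token_shapes())
```

With the defaults listed above, this gives a 16 x 16 grid, so 256 patches of 49 values each.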

Citation

If you find this repository useful, please consider citing it:

@article{reynaud2021vitseg,
  title={ViTSeg-https://github.com/HReynaud/ViTSeg}, 
  url={https://github.com/HReynaud/ViTSeg},  
  Author={Reynaud, Hadrien}, 
  Year={2021}
}

A simple approach to enable dense segmentation with ViT.

