Reproduction of the Vision Transformer (ViT) in TensorFlow 2, covering training from scratch and fine-tuning.

Last update: Dec 27, 2022

Overview

Vision Transformer (ViT) in TensorFlow 2

TensorFlow 2 implementation of the Vision Transformer (ViT).

This repository reproduces two papers: "An image is worth 16x16 words: Transformers for image recognition at scale" and "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers".

Limitations.

Due to memory limitations, only the ti/16, s/16, and b/16 models were tested.
Due to memory limitations, the batch size was 2048 for s/16 and 1024 for b/16 (the paper uses 4096).
Due to limited computational resources, results were reproduced only on ImageNet-1k.

All experimental results and graphs are publicly available on Wandb.

https://docs.google.com/spreadsheets/d/1j0lFlaMuqccFiHj3eQVpZYIbSoXY6Pz6oEW76x7g25M/edit?usp=sharing
upstream: https://wandb.ai/justhungryman/vit
downstream: https://wandb.ai/justhungryman/vit-downstream/
If the TPU was stopped during an experiment, the run was resumed, so some experiment names appear more than once with different start epochs.

Model weights

Since this is a personal project, it is hard to train on large datasets such as ImageNet-21k. For a pretrained model with good performance, see the official repo. If you really need the weights from this repository, contact me.

Install dependencies

pip install -r requirements

All experiments were run on a TPU v3-8 with the support of TRC, but you can also run them on a GPU. Set the environment options in conf/config.yaml and conf/downstream.yaml:

  # TPU options
  env:
    mode: tpu
    gcp_project: {your_project}
    tpu_name: node-1
    tpu_zone: europe-west4-a
    mixed_precision: True
  # GPU options
  # env:
  #   mode: gpu
  #   mixed_precision: True
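
As a rough illustration of what the `env` options above typically control, the sketch below maps them to a `tf.distribute` strategy and a Keras mixed-precision policy. The function names `policy_name` and `build_strategy` are hypothetical and are not taken from this repository; its actual setup code may differ.

```python
from typing import Optional


def policy_name(mode: str, mixed_precision: bool) -> str:
    """Pick a Keras mixed-precision policy name for the given device mode."""
    if not mixed_precision:
        return "float32"
    # bfloat16 is the usual choice on TPU, float16 on GPU.
    return "mixed_bfloat16" if mode == "tpu" else "mixed_float16"


def build_strategy(mode: str,
                   tpu_name: Optional[str] = None,
                   tpu_zone: Optional[str] = None,
                   gcp_project: Optional[str] = None,
                   mixed_precision: bool = True):
    """Return a tf.distribute strategy matching the `env` options (assumed mapping)."""
    # Imported lazily so policy_name() can be used without TensorFlow installed.
    import tensorflow as tf

    if mode == "tpu":
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
            tpu=tpu_name, zone=tpu_zone, project=gcp_project)
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
    else:
        # On GPU, mirror variables across all visible devices.
        strategy = tf.distribute.MirroredStrategy()

    tf.keras.mixed_precision.set_global_policy(policy_name(mode, mixed_precision))
    return strategy
```

With a strategy in hand, the model and optimizer would be created inside `strategy.scope()`.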

Train from scratch

python run.py experiment=vit-s16-aug_light1-bs_2048-wd_0.1-do_0.1-dp_0.1-lr_1e-3 base.project_name=vit-s16-aug_light1-bs_2048-wd_0.1-do_0.1-dp_0.1-lr_1e-3 base.save_dir={your_save_dir} base.env.gcp_project={your_gcp_project} base.env.tpu_name={your_tpu_name} base.debug=False
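
The experiment name in the command above appears to encode the run's hyperparameters as hyphen-separated `key_value` fields. The sketch below splits such a name into a dictionary. The reading of the keys (`bs` as batch size, `wd` as weight decay, `do` as dropout, `dp` as drop path, `lr` as learning rate) is an assumption based on the name alone, and `parse_experiment_name` is a hypothetical helper, not part of this repository.

```python
import re


def parse_experiment_name(name: str) -> dict:
    """Split a run name such as 'vit-s16-bs_2048-lr_1e-3' into its fields."""
    # Split only at hyphens that start a new `key_value` field, so the hyphen
    # inside an exponent like `1e-3` (and in the `vit-s16` prefix) is kept.
    model, *fields = re.split(r"-(?=[a-z]+_)", name)
    parsed = {"model": model}
    for field in fields:
        key, _, value = field.partition("_")
        parsed[key] = value
    return parsed


name = "vit-s16-aug_light1-bs_2048-wd_0.1-do_0.1-dp_0.1-lr_1e-3"
config = parse_experiment_name(name)
print(config)
```

Values are kept as strings; convert them as needed, for example `float(config["lr"])`.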

Downstream

python run.py --config-name=downstream experiment=downstream-imagenet-ti16_384 base.pretrained={your_checkpoint} base.project_name={your_project_name} base.save_dir={your_save_dir} base.env.gcp_project={your_gcp_project} base.env.tpu_name={your_tpu_name} base.debug=False
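
The `_384` suffix in the downstream experiment name presumably refers to a 384x384 fine-tuning resolution. Assuming that reading, and assuming a class token as in the original ViT paper, the sketch below shows how the input sequence length of a /16 model grows with resolution, which is one reason fine-tuning at a higher resolution needs more memory. `num_tokens` is a hypothetical helper written for this illustration.

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Sequence length of a ViT: one token per patch plus the class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


for size in (224, 384):
    print(size, num_tokens(size))
# 224 -> 197 tokens, 384 -> 577 tokens
```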

Board

To track metrics, you can use wandb or tensorboard (default: wandb). You can switch between them in conf/callbacks/{filename.yaml}.

modules:
  - type: MonitorCallback
  - type: TerminateOnNaN
  - type: ProgbarLogger
    params:
      count_mode: steps
  - type: ModelCheckpoint
    params:
      filepath: ???
      save_weights_only: True
  - type: Wandb
    project: vit
    nested_dict: False
    hide_config: True
    params: 
      monitor: val_loss
      save_model: False
  # - type: TensorBoard
  #   params:
  #     log_dir: ???
  #     histogram_freq: 1

TRC

This open-source project was supported by the TPU Research Cloud (TRC) program.

Thank you for providing the TPU.

Citations

@article{dosovitskiy2020image,
  title={An image is worth 16x16 words: Transformers for image recognition at scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}

@article{steiner2021train,
  title={How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers},
  author={Steiner, Andreas and Kolesnikov, Alexander and Zhai, Xiaohua and Wightman, Ross and Uszkoreit, Jakob and Beyer, Lucas},
  journal={arXiv preprint arXiv:2106.10270},
  year={2021}
}

Owner

sungjun lee
