Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Last update: Dec 23, 2022

Overview

Transformer in Transformer

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch.

Install

$ pip install transformer-in-transformer

Usage

import torch
from transformer_in_transformer import TNT

tnt = TNT(
    image_size = 256,       # size of image
    patch_dim = 512,        # dimension of patch token
    pixel_dim = 24,         # dimension of pixel token
    patch_size = 16,        # patch size
    pixel_size = 4,         # pixel size
    depth = 6,              # depth
    num_classes = 1000,     # output number of classes
    attn_dropout = 0.1,     # attention dropout
    ff_dropout = 0.1        # feedforward dropout
)

img = torch.randn(2, 3, 256, 256)
logits = tnt(img) # (2, 1000)

Citations

@misc{han2021transformer,
    title   = {Transformer in Transformer}, 
    author  = {Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
    year    = {2021},
    eprint  = {2103.00112},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Only works if pixel_size**2 == patch_size?

Hi, is this only supposed to work if

pixel_size**2 == patch_size

?. When setting the patch_size to any number that doesn't fulfill the equation this error occurs:

--> 146         pixels += rearrange(self.pixel_pos_emb, 'n d -> () n d')
    147 
    148         for pixel_attn, pixel_ff, pixel_to_patch_residual, patch_attn, patch_ff in self.layers:

RuntimeError: The size of tensor a (4) must match the size of tensor b (64) at non-singleton dimension 1

The error came when running

tnt = TNT(
    image_size = 128,       # size of image
    patch_dim = 256,        # dimension of patch token
    pixel_dim = 24,         # dimension of pixel token
    patch_size = 16,        # patch size
    pixel_size = 2,         # pixel size
    depth = 6,              # depth
    heads = 1,
    num_classes = 2,     # output number of classes
    attn_dropout = 0.1,     # attention dropout
    ff_dropout = 0.1        # feedforward dropout,
)
img = torch.randn(2, 3, 128, 128)
logits = tnt(img)

Since I am completely new to einops its quite hard for me to debug :D Thanks

opened by PhilippMarquardt 1

Not sure what is wrong!

RuntimeError Traceback (most recent call last) in 14 15 img = torch.randn(1, 3, 256, 256) ---> 16 logits = tnt(img) # (2, 1000)

~/opt/anaconda3/envs/ml/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs) 1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks 1109 or _global_forward_hooks or _global_forward_pre_hooks): -> 1110 return forward_call(*input, **kwargs) 1111 # Do not call functions when jit is used 1112 full_backward_hooks, non_full_backward_hooks = [], []

~/opt/anaconda3/envs/ml/lib/python3.8/site-packages/transformer_in_transformer/tnt.py in forward(self, x) 159 patches = repeat(self.patch_tokens[:(n + 1)], 'n d -> b n d', b = b) 160 --> 161 patches += rearrange(self.patch_pos_emb[:(n + 1)], 'n d -> () n d') 162 pixels += rearrange(self.pixel_pos_emb, 'n d -> () n d') 163

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

opened by RisabBiswas 0

patch_tokens vs patch_pos_emb

Hi!

I'm trying to understand your TNT implementation and one thing that got me a bit confused is why there are 2 parameters patch_tokens and patch_pos_emb which seems to have the same purpose - to encode patch position. Isn't one of them redundant?

self.patch_tokens = nn.Parameter(torch.randn(num_patch_tokens + 1, patch_dim))
self.patch_pos_emb = nn.Parameter(torch.randn(num_patch_tokens + 1, patch_dim))
...
patches = repeat(self.patch_tokens[:(n + 1)], 'n d -> b n d', b = b)
patches += rearrange(self.patch_pos_emb[:(n + 1)], 'n d -> () n d')

opened by stas-sl 0

Inconsistent model params with MindSpore src code
There's no function or readme description of TNT-S/TNT-B model in this codebase. Something like :

def tnt_b(num_class): return TNT(img_size=384, patch_size=16, num_channels=3, embedding_dim=640, num_heads=10, num_layers=12, hidden_dim=640*4, stride=4, num_class=num_class)

And heads number of inner block should be 4.... https://github.com/lucidrains/transformer-in-transformer/blob/main/transformer_in_transformer/tnt.py#L135

Wondering if anyone reproduce the paper reported results with this codebase??
opened by WongChen 0
Why the loss become NaN?

It is a great project. I am very interested in Transformer in Transformer model. I had use your model to train on Vehicle-1M dataset. Vehicle-1M is a fine graied visual classification dataset. When I use this model the loss become NaN after some batch iteration. I had decrease the learning rate of AdamOptimizer and clipping the graident torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2) . But the loss still will become NaN sometimes. It seems that gradients are not big but they are in the same direction for many iterations. How to solve it?

opened by yt7589 3

Releases(0.1.2)

0.1.2(Dec 27, 2021)

Source code(tar.gz)
Source code(zip)
0.1.1(Mar 23, 2021)

Source code(tar.gz)
Source code(zip)
0.1.0(Mar 21, 2021)

Source code(tar.gz)
Source code(zip)
0.0.9(Mar 18, 2021)

Source code(tar.gz)
Source code(zip)
0.0.8(Mar 10, 2021)

Source code(tar.gz)
Source code(zip)
0.0.7(Mar 9, 2021)

Source code(tar.gz)
Source code(zip)
0.0.6(Mar 4, 2021)

Source code(tar.gz)
Source code(zip)
0.0.5(Mar 4, 2021)

Source code(tar.gz)
Source code(zip)
0.0.4(Mar 4, 2021)

Source code(tar.gz)
Source code(zip)
0.0.3(Mar 3, 2021)

Source code(tar.gz)
Source code(zip)
0.0.2(Mar 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.1(Mar 2, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Phil Wang

Working with Attention. It's all we need.

GitHub Repository

Evaluating AlexNet features at various depths

Linear Separability Evaluation This repo provides the scripts to test a learned AlexNet's feature representation performance at the five different con

32 Dec 30, 2022

Nest - A flexible tool for building and sharing deep learning modules

Nest - A flexible tool for building and sharing deep learning modules Nest is a flexible deep learning module manager, which aims at encouraging code

41 Oct 10, 2022

A method to perform unsupervised cross-region adaptation of crop classifiers trained with satellite image time series.

TimeMatch Official source code of TimeMatch: Unsupervised Cross-region Adaptation by Temporal Shift Estimation by Joachim Nyborg, Charlotte Pelletier,

17 Nov 01, 2022

[ICML 2021] “ Self-Damaging Contrastive Learning”, Ziyu Jiang, Tianlong Chen, Bobak Mortazavi, Zhangyang Wang

Self-Damaging Contrastive Learning Introduction The recent breakthrough achieved by contrastive learning accelerates the pace for deploying unsupervis

51 Dec 29, 2022

The materials used in the SaxonJS tutorial presented at Declarative Amsterdam, 2021

SaxonJS-Tutorial-2021, version 1.0.4 Last updated on 4 November, 2021. Table of contents Background Prerequisites Starting a web server Running a Java

11 Oct 23, 2022

code release for USENIX'22 paper `On the Security Risks of AutoML`

This project is a minimized runnable project cut from trojanzoo, which contains more datasets, models, attacks and defenses. This repo will not be mai

5 Apr 19, 2022

NasirKhusraw - The TSP solved using genetic algorithm and show TSP path overlaid on a map of the Iran provinces & their capitals.

Nasir Khusraw : Travelling Salesman Problem The TSP solved using genetic algorithm. This project show TSP path overlaid on a map of the Iran provinces

2 Sep 01, 2022

Locally Differentially Private Distributed Deep Learning via Knowledge Distillation (LDP-DL)

Locally Differentially Private Distributed Deep Learning via Knowledge Distillation (LDP-DL) A preprint version of our paper: Link here This is a samp

3 Jan 08, 2023

PyTorch IPFS Dataset

PyTorch IPFS Dataset IPFSDataset(Dataset) See the jupyter notepad to see how it works and how it interacts with a standard pytorch DataLoader You need

2 Apr 13, 2022

Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators

Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators This is our Pytorch implementation for t

12 Jul 22, 2022

The LaTeX and Python code for generating the paper, experiments' results and visualizations reported in each paper is available (whenever possible) in the paper's directory

This repository contains the software implementation of most algorithms used or developed in my research. The LaTeX and Python code for generating the

3 Jan 03, 2023

The Power of Scale for Parameter-Efficient Prompt Tuning

The Power of Scale for Parameter-Efficient Prompt Tuning Implementation of soft embeddings from https://arxiv.org/abs/2104.08691v1 using Pytorch and H

208 Dec 30, 2022

🛰️ List of earth observation companies and job sites

Earth Observation Companies & Jobs source Portals & Jobs Geospatial Geospatial jobs newsletter: ~biweekly newsletter with geospatial jobs by Ali Ahmad

64 Dec 27, 2022

This repository accompanies the ACM TOIS paper "What can I cook with these ingredients?" - Understanding cooking-related information needs in conversational search

In this repository you find data that has been gathered when conducting in-situ experiments in a conversational cooking setting. These data include tr

6 Sep 22, 2022

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR)

3k Dec 31, 2022

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Related tags

Overview

Transformer in Transformer

Install

Usage

Citations

Comments

Only works if pixel_size**2 == patch_size?

Not sure what is wrong!

patch_tokens vs patch_pos_emb

Inconsistent model params with MindSpore src code

Why the loss become NaN?

Releases(0.1.2)

0.1.2(Dec 27, 2021)

0.1.1(Mar 23, 2021)

0.1.0(Mar 21, 2021)

0.0.9(Mar 18, 2021)

0.0.8(Mar 10, 2021)

0.0.7(Mar 9, 2021)

0.0.6(Mar 4, 2021)

0.0.5(Mar 4, 2021)

0.0.4(Mar 4, 2021)

0.0.3(Mar 3, 2021)

0.0.2(Mar 2, 2021)

0.0.1(Mar 2, 2021)