2D Human Pose estimation using transformers. Implementation in Pytorch

Overview

PE-former: Pose Estimation Transformer

Vision transformer architectures perform very well for image classification tasks. Efforts to solve more challenging vision tasks with transformers rely on convolutional backbones for feature extraction.

POTR is a pure transformer architecture (no CNN backbone) for 2D body pose estimation. It uses an encoder-decoder architecture with a vision transformer as an encoder and a transformer decoder (derived from DETR).

You can use the code in this repository to train and evaluate different POTR configurations on the COCO dataset.

Model

POTR is based on building blocks derived from recent SOTA models. As shown in the figure there are two major components: A Visual Transformer encoder, and a Transformer decoder.

model

The input image is initially converted into tokens following the ViT paradigm. A position embedding is used to help retain the patch-location information. The tokens and the position embedding are used as input to transformer encoder. The transformed tokens are used as the memory input of the transformer decoder. The inputs of the decoder are M learned queries. For each query the network will produce a joint prediction. The output tokens from the transformer decoder are passed through two heads (FFNs).

  • The first is a classification head used to predict the joint type (i.e class) of each query.
  • The second is a regression head that predicts the normalized coordinates (in the range [0,1]) of the joint in the input image.

Predictions that do not correspond to joints are mapped to a "no object" class.

Acknowledgements

The code in this repository is based on the following:

Thank you!

Preparing

Create a python venv and install all the dependencies:

python -m venv pyenv
source pyenv/bin/activate
pip install -r requirements.txt

Training

Here are some CLI examples using the lit_main.py script.

Training POTR with a deit_small encoder, patch size of 16x16 pixels and input resolution 192x256:

python lit_main.py --vit_arch deit_deit_small --patch_size 16 --batch_size 42 --input_size 192 256 --hidden_dim 384 --vit_dim 384 --gpus 1 --num_workers 24

POTR with Xcit_small_p16 encoder:

 python lit_main.py --vit_arch xcit_small_12_p16 --batch_size 42 --input_size 288 384 --hidden_dim 384 --vit_dim 384 --gpus 1 --num_workers 24   --vit_weights https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_384_dist.pth

POTR with the ViT as Backbone (VAB) configuration:

 python lit_main.py --vit_as_backbone --vit_arch resnet50 --batch_size 42 --input_size 192 256 --hidden_dim 384 --vit_dim 384 --gpus 1 --position_embedding learned_nocls --num_workers 16 --num_queries 100 --dim_feedforward 1536 --accumulate_grad_batches 1

Baseline that uses a resnet50 (pretrained with dino) as an encoder:

 python lit_main.py --vit_arch resnet50 --patch_size 16 --batch_size 42 --input_size 192 256 --hidden_dim 384 --vit_dim 384 --gpus 1 --num_workers 24 --vit_weights https://dl.fbaipublicfiles.com/dino/dino_resnet50_pretrain/dino_resnet50_pretrain.pth --position_embedding learned_nocls

Check the lit_main.py cli arguments for a complete list.

python lit_main.py --help

Evaluation

Evaluate a trained model using the evaluate.py script.

For example to evaluate POTR with an xcit_small_12_p8 encoder:

python evaluate.py --vit_arch xcit_small_12_p8 --patch_size 8 --batch_size 42 --input_size 192 256 --hidden_dim 384 --vit_dim 384  --position_embedding enc_xcit --num_workers 16 --num_queries 100 --dim_feedforward 1536 --init_weights paper_experiments/xcit_small12_p8_dino_192_256_paper/checkpoints/checkpoint-epoch\=065-AP\=0.736.ckpt --use_det_bbox

Evaluate POTR with a deit_small encoder:

 python evaluate.py --vit_arch deit_deit_small --patch_size 16 --batch_size 42 --input_size 192 256 --hidden_dim 384 --vit_dim 384 --num_workers 24 --init_weights lightning_logs/version_0/checkpoints/checkpoint-epoch\=074-AP\=0.622.ckpt  --use_det_bbox

Set the argument of --init_weights to your model's checkpoint.

Model Zoo

name input params AP AR url
POTR-Deit-dino-p8 192x256 36.4M 70.6 78.1 model
POTR-Xcit-p16 288x384 40.6M 70.2 77.4 model
POTR-Xcit-dino-p16 288x384 40.6M 70.7 77.9 model
POTR-Xcit-dino-p8 192x256 40.5M 71.6 78.7 model
POTR-Xcit-dino-p8 288x384 40.5M 72.6 79.4 model

Check the experiments folder for configuration files and evaluation results.

All trained models and tensorboard training logs can be downloaded from this drive folder.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Owner
Panteleris Paschalis
Panteleris Paschalis
Lingvo is a framework for building neural networks in Tensorflow, particularly sequence models.

Lingvo is a framework for building neural networks in Tensorflow, particularly sequence models.

2.7k Jan 05, 2023
“袋鼯麻麻——智能购物平台”能够精准地定位识别每一个商品

“袋鼯麻麻——智能购物平台”能够精准地定位识别每一个商品,并且能够返回完整地购物清单及顾客应付的实际商品总价格,极大地降低零售行业实际运营过程中巨大的人力成本,提升零售行业无人化、自动化、智能化水平。

thomas-yanxin 192 Jan 05, 2023
Hyperbolic Hierarchical Clustering.

Hyperbolic Hierarchical Clustering (HypHC) This code is the official PyTorch implementation of the NeurIPS 2020 paper: From Trees to Continuous Embedd

HazyResearch 154 Dec 15, 2022
The official repository for BaMBNet

BaMBNet-Pytorch Paper

Junjun Jiang 18 Dec 04, 2022
Title: Heart-Failure-Classification

This Notebook is based off an open source dataset available on where I have created models to classify patients who can potentially witness heart failure on the basis of various parameters. The best

Akarsh Singh 2 Sep 13, 2022
MVGCN: a novel multi-view graph convolutional network (MVGCN) framework for link prediction in biomedical bipartite networks.

MVGCN MVGCN: a novel multi-view graph convolutional network (MVGCN) framework for link prediction in biomedical bipartite networks. Developer: Fu Hait

13 Dec 01, 2022
Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT This repository is the official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ArXiv If

International Business Machines 168 Dec 29, 2022
This is the code for CVPR 2021 oral paper: Jigsaw Clustering for Unsupervised Visual Representation Learning

JigsawClustering Jigsaw Clustering for Unsupervised Visual Representation Learning Pengguang Chen, Shu Liu, Jiaya Jia Introduction This project provid

DV Lab 73 Sep 18, 2022
OrienMask: Real-time Instance Segmentation with Discriminative Orientation Maps

OrienMask This repository implements the framework OrienMask for real-time instance segmentation. It achieves 34.8 mask AP on COCO test-dev at the spe

45 Dec 13, 2022
An implementation of the AdaOPS (Adaptive Online Packing-based Search), which is an online POMDP Solver used to solve problems defined with the POMDPs.jl generative interface.

AdaOPS An implementation of the AdaOPS (Adaptive Online Packing-guided Search), which is an online POMDP Solver used to solve problems defined with th

9 Oct 05, 2022
Using Tensorflow Object Detection API to detect Waymo open dataset

Waymo-2D-Object-Detection Using Tensorflow Object Detection API to detect Waymo open dataset Result CenterNet Training Loss SSD ResNet Training Loss C

76 Dec 12, 2022
A library for building and serving multi-node distributed faiss indices.

About Distributed faiss index service. A lightweight library that lets you work with FAISS indexes which don't fit into a single server memory. It fol

Meta Research 170 Dec 30, 2022
This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach.

PlyTitle_Generation This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach. The paper has been accepted by

SeungHeonDoh 6 Jan 03, 2022
[TIP2020] Adaptive Graph Representation Learning for Video Person Re-identification

Introduction This is the PyTorch implementation for Adaptive Graph Representation Learning for Video Person Re-identification. Get started git clone h

WuYiming 41 Dec 12, 2022
Best Practices on Recommendation Systems

Recommenders What's New (February 4, 2021) We have a new relase Recommenders 2021.2! It comes with lots of bug fixes, optimizations and 3 new algorith

Microsoft 14.8k Jan 03, 2023
Investigating automatic navigation towards standard US views integrating MARL with the virtual US environment developed in CT2US simulation

AutomaticUSnavigation Investigating automatic navigation towards standard US views integrating MARL with the virtual US environment developed in CT2US

Cesare Magnetti 6 Dec 05, 2022
Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CVPR 2021)

Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CAC) Xin Lai*, Zhuotao Tian*, Li Jiang, Shu Liu, Hengshuang Zhao, Li

Jia Research Lab 137 Dec 14, 2022
Spatial Single-Cell Analysis Toolkit

Single-Cell Image Analysis Package Scimap is a scalable toolkit for analyzing spatial molecular data. The underlying framework is generalizable to spa

Laboratory of Systems Pharmacology @ Harvard 30 Nov 08, 2022
Implementation of ToeplitzLDA for spatiotemporal stationary time series data.

Code for the ToeplitzLDA classifier proposed in here. The classifier conforms sklearn and can be used as a drop-in replacement for other LDA classifiers. For in-depth usage refer to the learning from

Jan Sosulski 5 Nov 07, 2022
Object Detection Projekt in GKI WS2021/22

tfObjectDetection Object Detection Projekt with tensorflow in GKI WS2021/22 Docker Container: docker run -it --name --gpus all -v path/to/project:p

Tim Eggers 1 Jul 18, 2022