Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)

Taming Visually Guided Sound Generation

• [Project Page] • [ArXiv] • [Poster] • Open In Colab

Generated Samples Using our Model

Listen to the samples on our project page.

Overview

We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to form a novel sound given a set of visual cues as a prime.

The codebook is trained on spectrograms, similarly to VQGAN (an upgraded VQVAE). We refer to it as the Spectrogram VQGAN.

Spectrogram VQGAN

Once the spectrogram codebook is trained, we can train a transformer (a variant of GPT-2) to autoregressively sample the codebook entries as tokens conditioned on a set of visual features.

Vision-based Conditional Cross-modal Autoregressive Sampler

This approach allows training a spectrogram generation model which produces long, relevant, and high-fidelity sounds while supporting tens of data classes.

Environment Preparation

During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.8 and CUDA 11.

Start by cloning this repo

git clone https://github.com/v-iashin/SpecVQGAN.git

Next, install the environment. For your convenience, we provide both conda and docker environments.

Conda

conda env create -f conda_env.yml

Test your environment

conda activate specvqgan
python -c "import torch; print(torch.cuda.is_available())"
# True

Docker

Download the image from Docker Hub and test if CUDA is available:

docker run \
    --mount type=bind,source=/absolute/path/to/SpecVQGAN/,destination=/home/ubuntu/SpecVQGAN/ \
    --mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SpecVQGAN/logs/ \
    --mount type=bind,source=/absolute/path/to/vggsound/features/,destination=/home/ubuntu/SpecVQGAN/data/vggsound/ \
    --shm-size 8G \
    -it --gpus '"device=0"' \
    iashin/specvqgan:latest \
    python
>>> import torch; print(torch.cuda.is_available())
# True

or build it yourself

docker build - < Dockerfile --tag specvqgan
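If you built the image yourself, the same CUDA check can be run against the local tag; this is just a sketch reusing the flags from the command above, with the bind mounts omitted for brevity:

docker run -it --gpus '"device=0"' specvqgan \
    python -c "import torch; print(torch.cuda.is_available())"
# True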

Data

In this project, we used VAS and VGGSound datasets. VAS can be downloaded directly using the link provided in the RegNet repository. For VGGSound, however, one might need to retrieve videos directly from YouTube.

Download

The scripts will download features, check the md5 sum, unpack, and do a clean-up for each part of the dataset:

cd ./data
# 24GB
bash ./download_vas_features.sh
# 420GB (+ 420GB if you also need ResNet50 Features)
bash ./download_vggsound_features.sh

The unpacked features are going to be saved in ./data/downloaded_features/*. Move them to ./data/vas and ./data/vggsound so that the folder structure matches the structure of the demo files. By default, the scripts download BN Inception features; to download ResNet50 features, uncomment the corresponding lines in the ./download_*_features.sh scripts.
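For instance, one possible way to arrange the unpacked features is sketched below; the dataset-named subfolders under ./data/downloaded_features/ are an assumption, so adjust the source paths to whatever the download scripts actually produced:

mkdir -p ./data/vas ./data/vggsound
# assumed layout: one subfolder per dataset inside ./data/downloaded_features/
mv ./data/downloaded_features/vas/* ./data/vas/
mv ./data/downloaded_features/vggsound/* ./data/vggsound/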

If you wish to download the parts manually, use the following URL templates:

  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vas/*.tar
  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vggsound/*.tar

Also, make sure to check the md5 sums provided in ./data/md5sum_vas.md5 and ./data/md5sum_vggsound.md5 along with the file names.
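If the .md5 files follow the standard md5sum format (hash followed by file name), the manually downloaded parts can be verified as sketched below; this assumes the .tar parts sit next to the .md5 files in ./data and is not part of the provided scripts:

cd ./data
md5sum -c ./md5sum_vas.md5
md5sum -c ./md5sum_vggsound.md5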

Note that we distribute the features for the VGGSound dataset in 64 parts. Each part holds ~3k clips and can be used independently as a subset of the whole dataset (the parts are not class-stratified, though).

Extract Features Manually

For BN Inception features, we employ the same procedure as RegNet.

For ResNet50 features, we rely on the video_features repository and use these commands:

# VAS (few hours on three 2080Ti)
strings=("dog" "fireworks" "drum" "baby" "gun" "sneeze" "cough" "hammer")
for class in "${strings[@]}"; do
    python main.py \
        --feature_type resnet50 \
        --device_ids 0 1 2 \
        --batch_size 86 \
        --extraction_fps 21.5 \
        --file_with_video_paths ./paths_to_mp4_${class}.txt \
        --output_path ./data/vas/features/${class}/feature_resnet50_dim2048_21.5fps \
        --on_extraction save_pickle
done

# VGGSound (6 days on three 2080Ti)
python main.py \
    --feature_type resnet50 \
    --device_ids 0 1 2 \
    --batch_size 86 \
    --extraction_fps 21.5 \
    --file_with_video_paths ./paths_to_mp4s.txt \
    --output_path ./data/vggsound/feature_resnet50_dim2048_21.5fps \
    --on_extraction save_pickle

Similarly to BN Inception, we need to "tile" (cycle) a video if it is shorter than 10 s. For ResNet50, we achieve this by tiling the resulting frame-level features up to 215 along the temporal dimension, e.g. as follows:

import pickle
import numpy as np

# `path` points to the pickled ResNet50 features of one video; `new_path` is where the tiled copy is saved
feats = pickle.load(open(path, 'rb')).astype(np.float32)
# repeat the features enough times to cover at least 215 frames, then trim to exactly 215
reps = 1 + (215 // feats.shape[0])
feats = np.tile(feats, (reps, 1))[:215, :]
with open(new_path, 'wb') as file:
    pickle.dump(feats, file)

Pretrained Models

Unpack the pre-trained models to ./logs/ directory.

Codebooks

| Trained on | Evaluated on | FID ↓ | Avg. MKL ↓ | Link / MD5SUM |
|---|---|---|---|---|
| VGGSound | VGGSound | 1.0 | 0.8 | 7ea229427297b5d220fb1c80db32dbc5 |
| VAS | VAS | 6.0 | 1.0 | 0024ad3705c5e58a11779d3d9e97cc8a |

Run Sampling Tool to see the reconstruction results for available data.

Transformers

The setting (a): the transformer is trained on VGGSound to sample from the VGGSound codebook:

| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|
| No Feats | – | 13.5 | 9.7 | 7.7 | b1f9bb63d831611479249031a1203371 |
| 1 Feat | BN Inception | 8.6 | 7.7 | 7.7 | f2fe41dab17e232bd94c6d119a807fee |
| 1 Feat | ResNet50 | 11.5* | 7.3* | 7.7 | 27a61d4b74a72578d13579333ed056f6 |
| 5 Feats | BN Inception | 9.4 | 7.0 | 7.9 | b082d894b741f0d7a1af9c2732bad70f |
| 5 Feats | ResNet50 | 11.3* | 7.0* | 7.9 | f4d7105811589d441b69f00d7d0b8dc8 |
| 212 Feats | BN Inception | 9.6 | 6.8 | 11.8 | 79895ac08303b1536809cad1ec9a7502 |
| 212 Feats | ResNet50 | 10.5* | 6.9* | 11.8 | b222cc0e7aeb419f533d5806a08669fe |

* – calculated on 1 sample per video on the test set instead of 10 samples per video as for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 10 samples per video, one might expect the starred values to improve a bit (~+0.1).

The setting (b): the transformer is trained on VAS to sample from the VGGSound codebook:

| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|
| No Feats | – | 33.7 | 9.6 | 7.7 | e6b0b5be1f8ac551700f49d29cda50d7 |
| 1 Feat | BN Inception | 38.6 | 7.3 | 7.7 | a98a124d6b3613923f28adfacba3890c |
| 1 Feat | ResNet50 | 26.5* | 6.7* | 7.7 | 37cd48f06d74176fa8d0f27303841d94 |
| 5 Feats | BN Inception | 29.1 | 6.9 | 7.9 | 38da002f900fb81275b73e158e919e16 |
| 5 Feats | ResNet50 | 22.3* | 6.5* | 7.9 | 7b6951a33771ef527f1c1b1f99b7595e |
| 212 Feats | BN Inception | 20.5 | 6.0 | 11.8 | 1c4e56077d737677eac524383e6d98d3 |
| 212 Feats | ResNet50 | 20.8* | 6.2* | 11.8 | 6e553ea44c8bc7a3310961f74e7974ea |

* – calculated on 10 samples per video on the validation set instead of 100 samples per video as for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 100 samples per video, one might expect the starred values to improve a bit (~+0.1).

The setting (c): the transformer is trained on VAS to sample from the VAS codebook:

| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|
| No Feats | – | 28.7 | 9.2 | 7.6 | ea4945802094f826061483e7b9892839 |
| 1 Feat | BN Inception | 25.1 | 6.6 | 7.6 | 8a3adf60baa049a79ae62e2e95014ff7 |
| 1 Feat | ResNet50 | 25.1* | 6.3* | 7.6 | a7a1342030653945e97f68a8112ed54a |
| 5 Feats | BN Inception | 24.8 | 6.2 | 7.8 | 4e1b24207780eff26a387dd9317d054d |
| 5 Feats | ResNet50 | 20.9* | 6.1* | 7.8 | 78b8d42be19dd1b0a346b1f512967302 |
| 212 Feats | BN Inception | 25.4 | 5.9 | 11.6 | 4542632b3c5bfbf827ea7868cedd4634 |
| 212 Feats | ResNet50 | 22.6* | 5.8* | 11.6 | dc2b5cbd28ad98d2f9ca4329e8aa0f64 |

* – calculated on 10 samples per video on the validation set instead of 100 samples per video as for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 100 samples per video, one might expect the starred values to improve a bit (~+0.1).

A transformer can also be trained to generate a spectrogram given a specific class. We also provide pre-trained models for all three settings:

| Setting | Codebook | Sampling for | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|---|
| (a) | VGGSound | VGGSound | 7.8 | 5.0 | 7.7 | 98a3788ab973f1c3cc02e2e41ad253bc |
| (b) | VGGSound | VAS | 39.6 | 6.7 | 7.7 | 16a816a270f09a76bfd97fe0006c704b |
| (c) | VAS | VAS | 23.9 | 5.5 | 7.6 | 412b01be179c2b8b02dfa0c0b49b9a0f |

VGGish-ish, Melception, and MelGAN

These will be downloaded automatically during the first run. However, if you need them separately, here are the checkpoints:

  • VGGish-ish (1.54GB, 197040c524a07ccacf7715d7080a80bd) + Normalization Parameters (in /specvqgan/modules/losses/vggishish/data/)
  • Melception (0.27GB, a71a41041e945b457c7d3d814bbcf72d) + Normalization Parameters (in /specvqgan/modules/losses/vggishish/data/)
  • MelGAN

The reference performance of VGGish-ish and Melception:

| Model | Top-1 Acc | Top-5 Acc | mAP | mAUC |
|---|---|---|---|---|
| VGGish-ish | 34.70 | 63.71 | 36.63 | 95.70 |
| Melception | 44.49 | 73.79 | 47.58 | 96.66 |

Run Sampling Tool to see Melception and MelGAN in action.

Training

The training is done in two stages. First, a spectrogram codebook should be trained. Second, a transformer is trained to sample from the codebook. The first and the second stages can be trained on the same or separate datasets as long as the process of spectrogram extraction is the same.

Training a Spectrogram Codebook

To train a spectrogram codebook, we tried two datasets: VAS and VGGSound. We ran our experiments on a relatively expensive hardware setup with four 40GB NVidia A100 GPUs, but the models can also be trained on one 12GB NVidia 2080Ti with a smaller batch size. When training on four 40GB NVidia A100s, change the arguments to --gpus 0,1,2,3 and data.params.batch_size=8 for the codebook and =16 for the transformer. The training will hang a bit at steps 0, 2, 4, 8, ... because of the logging. If the folders with features and spectrograms are located elsewhere, the paths can be specified in the data.params.spec_dir_path, data.params.rgb_feats_dir_path, and data.params.flow_feats_dir_path arguments, but keep the same format as in the config file, e.g. notice the * in the path, which globs class folders.

# VAS Codebook
# mind the comma after `0,`
python train.py --base configs/vas_codebook.yaml -t True --gpus 0,
# or
# VGGSound codebook
python train.py --base configs/vggsound_codebook.yaml -t True --gpus 0,
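For reference, a multi-GPU variant of the codebook training on the four-A100 setup mentioned above might look as follows; the config and flags are taken from the examples above, so adjust them to your hardware:

# VGGSound codebook on four GPUs with the batch size suggested above
python train.py --base configs/vggsound_codebook.yaml -t True --gpus 0,1,2,3 \
    data.params.batch_size=8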

Training a Transformer

A transformer (GPT-2) is trained to sample from the spectrogram codebook given a set of frame-level visual features.

VAS Transformer

# with the VAS codebook
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
# or with the VGGSound codebook which has 1024 codes
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.transformer_config.params.GPT_config.vocab_size=1024 \
    model.params.first_stage_config.params.n_embed=1024 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-05-19T22-16-54_vggsound_codebook/checkpoints/epoch_39.ckpt

VGGSound Transformer

python train.py --base configs/vggsound_transformer.yaml -t True --gpus 0, \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-05-19T22-16-54_vggsound_codebook/checkpoints/epoch_39.ckpt

Controlling the Condition Size

The size of the visual condition is controlled by two arguments in the config file. The feat_sample_size is the number of visual features resampled equidistantly from all available features (212), and block_size is the attention span. Make sure to use block_size = 53 * 5 + feat_sample_size. For instance, for feat_sample_size=212 the block_size=477. However, the longer the condition, the more memory it requires and the slower the sampling. By default, the configs use feat_sample_size=212 for VAS and 5 for VGGSound. Feel free to tweak it to your liking/application, for example:

python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.transformer_config.params.GPT_config.block_size=318 \
    data.params.feat_sampler_cfg.params.feat_sample_size=53 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
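As a quick sanity check of the block_size arithmetic (53 * 5 spectrogram tokens plus the length of the visual condition), the values used in this README can be reproduced in the shell:

echo $(( 53 * 5 + 212 ))  # 477
echo $(( 53 * 5 + 53 ))   # 318
echo $(( 53 * 5 + 1 ))    # 266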

The No Feats setting (without visual conditioning) is trained similarly to the settings with visual conditioning, except that the condition is replaced with random vectors. The optimal approach here is to use replace_feats_with_random=true along with feat_sample_size=1, for example (VAS):

python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    data.params.replace_feats_with_random=true \
    model.params.transformer_config.params.GPT_config.block_size=266 \
    data.params.feat_sampler_cfg.params.feat_sample_size=1 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt

Training VGGish-ish and Melception

We include all necessary files for training both vggishish and melception in ./specvqgan/modules/losses/vggishish. Run it on a 12GB GPU as

cd ./specvqgan/modules/losses/vggishish
# vggish-ish
python train_vggishish.py config=./configs/vggish.yaml device='cuda:0'
# melception
python train_melception.py config=./configs/melception.yaml device='cuda:1'

Training MelGAN

To train the vocoder, use this command:

cd ./vocoder
python scripts/train.py \
    --save_path ./logs/`date +"%Y-%m-%dT%H-%M-%S"` \
    --data_path /path/to/melspec_10s_22050hz \
    --batch_size 64

Evaluation

The evaluation is done in two steps. First, the samples are generated for each video. Second, the evaluation script is run. The sampling procedure supports multi-GPU, multi-node parallelization. We provide a multi-GPU command which can easily be applied to a multi-node setup by setting --master_addr to your main machine's address and --node_rank to each worker's id (also see the sbatch script in ./evaluation/sbatch_sample.sh if you have a SLURM cluster at your disposal):

# Sample
python -m torch.distributed.launch \
    --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=62374 \
    --use_env \
        evaluation/generate_samples.py \
        sampler.config_sampler=evaluation/configs/sampler.yaml \
        sampler.model_logdir=$EXPERIMENT_PATH \
        sampler.splits=$SPLITS \
        sampler.samples_per_video=$SAMPLES_PER_VIDEO \
        sampler.batch_size=$SAMPLER_BATCHSIZE \
        sampler.top_k=$TOP_K \
        data.params.spec_dir_path=$SPEC_DIR_PATH \
        data.params.rgb_feats_dir_path=$RGB_FEATS_DIR_PATH \
        data.params.flow_feats_dir_path=$FLOW_FEATS_DIR_PATH \
        sampler.now=$NOW
# Evaluate
python -m torch.distributed.launch \
    --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=62374 \
    --use_env \
    evaluate.py \
        config=./evaluation/configs/eval_melception_${DATASET,,}.yaml \
        input2.path_to_exp=$EXPERIMENT_PATH \
        patch.specs_dir=$SPEC_DIR_PATH \
        patch.spec_dir_path=$SPEC_DIR_PATH \
        patch.rgb_feats_dir_path=$RGB_FEATS_DIR_PATH \
        patch.flow_feats_dir_path=$FLOW_FEATS_DIR_PATH \
        input1.params.root=$EXPERIMENT_PATH/samples_$NOW/$SAMPLES_FOLDER

The variables for the VAS dataset:

EXPERIMENT_PATH="./logs/<folder-name-of-vas-transformer-or-codebook>"
SPEC_DIR_PATH="./data/vas/features/*/melspec_10s_22050hz/"
RGB_FEATS_DIR_PATH="./data/vas/features/*/feature_rgb_bninception_dim1024_21.5fps/"
FLOW_FEATS_DIR_PATH="./data/vas/features/*/feature_flow_bninception_dim1024_21.5fps/"
SAMPLES_FOLDER="VAS_validation"
SPLITS="\"[validation, ]\""
SAMPLER_BATCHSIZE=4
SAMPLES_PER_VIDEO=10
TOP_K=64 # use TOP_K=512 when evaluating a VAS transformer trained with a VGGSound codebook
NOW=`date +"%Y-%m-%dT%H-%M-%S"`

The variables for the VGGSound dataset:

EXPERIMENT_PATH="./logs/<folder-name-of-vggsound-transformer-or-codebook>"
SPEC_DIR_PATH="./data/vggsound/melspec_10s_22050hz/"
RGB_FEATS_DIR_PATH="./data/vggsound/feature_rgb_bninception_dim1024_21.5fps/"
FLOW_FEATS_DIR_PATH="./data/vggsound/feature_flow_bninception_dim1024_21.5fps/"
SAMPLES_FOLDER="VGGSound_test"
SPLITS="\"[test, ]\""
SAMPLER_BATCHSIZE=32
SAMPLES_PER_VIDEO=1
TOP_K=512
NOW=`date +"%Y-%m-%dT%H-%M-%S"`

Sampling Tool

For interactive sampling, we rely on the Streamlit library. To start the streamlit server locally, run

# mind the trailing `--`
streamlit run --server.port 5555 ./sample_visualization.py --
# go to `localhost:5555` in your browser

or Open In Colab.

Alternatively, we provide a similar notebook in ./generation_demo.ipynb to play with the demo on a local machine.

The Neural Audio Codec Demo

While the Spectrogram VQGAN was never designed to be a neural audio codec, it happens to be highly effective for this task. We can employ our Spectrogram VQGAN pre-trained on an open-domain dataset as a neural audio codec without any changes.

If you wish to apply the SpecVQGAN for audio compression for arbitrary audio, please see our Google Colab demo: Open In Colab.

Integrated into Hugging Face Spaces with Gradio. See the demo: Hugging Face Spaces

Alternatively, we provide a similar notebook in ./neural_audio_codec_demo.ipynb to play with the demo on a local machine.

Citation

Our paper was accepted as an oral presentation at BMVC 2021. Please use this BibTeX entry if you would like to cite our work:

@InProceedings{SpecVQGAN_Iashin_2021,
  title={Taming Visually Guided Sound Generation},
  author={Iashin, Vladimir and Rahtu, Esa},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2021}
}

Acknowledgments

Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.

We also acknowledge the following codebases:
