Vector Quantized Diffusion Model for Text-to-Image Synthesis

Overview

Vector Quantized Diffusion Model for Text-to-Image Synthesis

Due to company policy, I have to set microsoft/VQ-Diffusion to private for now, so I provide the same code here.

Overview

This is the official repo for the paper: Vector Quantized Diffusion Model for Text-to-Image Synthesis.

VQ-Diffusion is based on a VQ-VAE whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). It produces significantly better text-to-image generation results when compared with Autoregressive models with similar numbers of parameters. Compared with previous GAN-based methods, VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin.

Framework

Requirements

We suggest to use the docker. Also, you may run:

bash install_req.sh

Data Preparing

Microsoft COCO

│MSCOCO_Caption/
├──annotations/
│  ├── captions_train2014.json
│  ├── captions_val2014.json
├──train2014/
│  ├── train2014/
│  │   ├── COCO_train2014_000000000009.jpg
│  │   ├── ......
├──val2014/
│  ├── val2014/
│  │   ├── COCO_val2014_000000000042.jpg
│  │   ├── ......

CUB-200

│CUB-200/
├──images/
│  ├── 001.Black_footed_Albatross/
│  ├── 002.Laysan_Albatross
│  ├── ......
├──text/
│  ├── text/
│  │   ├── 001.Black_footed_Albatross/
│  │   ├── 002.Laysan_Albatross
│  │   ├── ......
├──train/
│  ├── filenames.pickle
├──test/
│  ├── filenames.pickle

ImageNet

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

Pretrained Model

We release four text-to-image pretrained model, trained on Conceptual Caption, MSCOCO, CUB200, and LAION-human datasets. Also, we release the ImageNet pretrained model, and provide the CLIP pretrained model for convenient. These should be put under OUTPUT/pretrained_model/ . These pretrained model file may be large because they are training checkpoints, which contains gradient information, optimizer information, ema model and others.

Besides, we provide the VQVAE models on FFHQ, OpenImages, and imagenet datasets, these model are from Taming Transformer, we provide them here for convenient. Please put them under OUTPUT/pretrained_model/taming_dvae/ .

Inference

To generate image from given text:

from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/human_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_condition("a beautiful smiling woman",truncation_rate=0.85, save_root="RESULT",batch_size=4)
VQ_Diffusion_model.inference_generate_sample_with_condition("a woman in yellow dress",truncation_rate=0.85, save_root="RESULT",batch_size=4,fast=2) # for fast inference

You may change human_pretrained.pth to other pretrained model to test different text.

To generate image from given ImageNet class label:

from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_class(407,truncation_rate=0.86, save_root="RESULT",batch_size=4)

Training

First, change the data_root to correct path in configs/coco.yaml or other configs.

Train Text2Image generation on MSCOCO dataset:

python running_command/run_train_coco.py

Train Text2Image generation on CUB200 dataset:

python running_command/run_train_cub.py

Train conditional generation on ImageNet dataset:

python running_command/run_train_imagenet.py

Train unconditional generation on FFHQ dataset:

python running_command/run_train_ffhq.py

Cite VQ-Diffusion

if you find our code helpful for your research, please consider citing:

@article{gu2021vector,
  title={Vector Quantized Diffusion Model for Text-to-Image Synthesis},
  author={Gu, Shuyang and Chen, Dong and Bao, Jianmin and Wen, Fang and Zhang, Bo and Chen, Dongdong and Yuan, Lu and Guo, Baining},
  journal={arXiv preprint arXiv:2111.14822},
  year={2021}
}

Acknowledgement

Thanks to everyone who makes their code and models available. In particular,

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VQ-Diffusion, please submit a GitHub issue. For other communications related to VQ-Diffusion, please contact Shuyang Gu ([email protected]) or Dong Chen ([email protected]).

Owner
Shuyang Gu
Shuyang Gu
Official PyTorch Implementation of GAN-Supervised Dense Visual Alignment

GAN-Supervised Dense Visual Alignment — Official PyTorch Implementation Paper | Project Page | Video This repo contains training, evaluation and visua

944 Jan 07, 2023
Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods

Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods Introduction Graph Neural Networks (GNNs) have demonstrated

37 Dec 15, 2022
A 2D Visual Localization Framework based on Essential Matrices [ICRA2020]

A 2D Visual Localization Framework based on Essential Matrices This repository provides implementation of our paper accepted at ICRA: To Learn or Not

Qunjie Zhou 27 Nov 07, 2022
This is project is the implementation of the DeepShift: Towards Multiplication-Less Neural Networks paper

DeepShift This is project is the implementation of the DeepShift: Towards Multiplication-Less Neural Networks paper, that aims to replace multiplicati

Mostafa Elhoushi 88 Dec 23, 2022
Cross-modal Retrieval using Transformer Encoder Reasoning Networks (TERN). With use of Metric Learning and FAISS for fast similarity search on GPU

Cross-modal Retrieval using Transformer Encoder Reasoning Networks This project reimplements the idea from "Transformer Reasoning Network for Image-Te

Minh-Khoi Pham 5 Nov 05, 2022
[ICLR 2021, Spotlight] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

Large Scale Image Completion via Co-Modulated Generative Adversarial Networks, ICLR 2021 (Spotlight) Demo | Paper [NEW!] Time to play with our interac

Shengyu Zhao 373 Jan 02, 2023
Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

merlot_reserve Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound" MERLOT Reserve (in submission) is a mo

Rowan Zellers 92 Dec 11, 2022
Repo for the Tutorials of Day1-Day3 of the Nordic Probabilistic AI School 2021 (https://probabilistic.ai/)

ProbAI 2021 - Probabilistic Programming and Variational Inference Tutorial with Pryo Day 1 (June 14) Slides Notebook: students_PPLs_Intro Notebook: so

PGM-Lab 46 Nov 01, 2022
Codes and pretrained weights for winning submission of 2021 Brain Tumor Segmentation (BraTS) Challenge

Winning submission to the 2021 Brain Tumor Segmentation Challenge This repo contains the codes and pretrained weights for the winning submission to th

94 Dec 28, 2022
Convnext-tf - Unofficial tensorflow keras implementation of ConvNeXt

ConvNeXt Tensorflow This is unofficial tensorflow keras implementation of ConvNe

29 Oct 06, 2022
Parameter Efficient Deep Probabilistic Forecasting

PEDPF Parameter Efficient Deep Probabilistic Forecasting (PEDPF) is a repository containing code to run experiments for several deep learning based pr

Olivier Sprangers 10 Jun 13, 2022
Software & Hardware to do multi color printing with Sharpies

3D Print Colorizer is a combination of 3D printed parts and a Cura plugin which allows anyone with an Ender 3 like 3D printer to produce multi colored

343 Jan 06, 2023
CNN designed for pansharpening

PROGRESSIVE BAND-SEPARATED CONVOLUTIONAL NEURAL NETWORK FOR MULTISPECTRAL PANSHARPENING This repository contains main code for the paper PROGRESSIVE B

SerendipitysX 3 Dec 29, 2021
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain Mingchen Zhuge*, Dehong Gao*, Deng-Ping Fan#, Linbo Jin, Ben Chen, Haoming Zhou, Minghui

250 Jan 08, 2023
Code for Mesh Convolution Using a Learned Kernel Basis

Mesh Convolution This repository contains the implementation (in PyTorch) of the paper FULLY CONVOLUTIONAL MESH AUTOENCODER USING EFFICIENT SPATIALLY

Yi_Zhou 35 Jan 03, 2023
This is an implementation of PIFuhd based on Pytorch

Open-PIFuhd This is a unofficial implementation of PIFuhd PIFuHD: Multi-Level Pixel-Aligned Implicit Function forHigh-Resolution 3D Human Digitization

Lingteng Qiu 235 Dec 19, 2022
🌳 A Python-inspired implementation of the Optimum-Path Forest classifier.

OPFython: A Python-Inspired Optimum-Path Forest Classifier Welcome to OPFython. Note that this implementation relies purely on the standard LibOPF. Th

Gustavo Rosa 30 Jan 04, 2023
[CVPR 2021] Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach This is the repo to host the dataset TextSeg and code for TexRNe

SHI Lab 174 Dec 19, 2022
Small-bets - Ergodic Experiment With Python

Ergodic Experiment Based on this video. Run this experiment with this command: p

Michael Brant 3 Jan 11, 2022
Stable Neural ODE with Lyapunov-Stable Equilibrium Points for Defending Against Adversarial Attacks

Stable Neural ODE with Lyapunov-Stable Equilibrium Points for Defending Against Adversarial Attacks Stable Neural ODE with Lyapunov-Stable Equilibrium

Kang Qiyu 8 Dec 12, 2022