RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP

Related tags

Deep Learningru-dolph
Overview

[Paper] [Хабр] [Model Card] [Colab] [Kaggle]

RuDOLPH 🦌 🎄 ☃️

One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP


Russian Diffusion On Language Picture Hyper-modality (RuDOLPH) is a fast and light text-image-text transformer (350M GPT-3) designed for a quick and easy fine-tuning setup for the solution of various tasks: from generating images by text description and image classification to visual question answering and more. This model demonstrates the power of Hyper-modality Transformers.

(!!!) Hyper-modality means generalized multi-modal, e.g., model that consists of two multi-modal parts: text-2-image and image-2-text becomes text and image hyper-modality model

Sparse Attention Mask

row - col - row - [last] conv

Models

Installing

pip install rudolph==0.0.1rc8

Usage

Fine-Tuning example by @Alex Wortega Open In Colab

Init models

from rudalle import get_tokenizer, get_vae
from rudalle.utils import seed_everything
from rudalle.image_prompts import ImagePrompts

from rudolph.model import get_rudolph_model
from rudolph.pipelines import zs_clf, generate_codebooks, self_reranking_by_image, self_reranking_by_text, show, generate_captions, generate_texts
from rudolph import utils

device = 'cuda'
model = get_rudolph_model('350M', fp16=True, device=device)
model.to(device);
tokenizer = get_tokenizer()
vae = get_vae(dwt=False).to(device)

Setup for Fast Image Generation

text = 'старинный будильник многоугольной формы'
bs, images_num = 48, 48
top_k, top_p = 512, 0.9
with torch.no_grad():
    codebooks = generate_codebooks(text, tokenizer, model, top_k=top_k, images_num=images_num, top_p=top_p, bs=bs)
    ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=bs)
    images = vae.decode(codebooks[ppl_text.argsort()[:9]])
images = torchvision.utils.make_grid(images, nrow=3)
img = torchvision.transforms.functional.to_pil_image(images)
img

Text Generation

generate_texts(
    tokenizer,
    model,
    template='красивый пейзаж ',
    top_k=32, top_p=0.8, texts_num=32, bs=32, seed=42
)[:8]

[{'text': 'красивый пейзаж и деревья в горах с синим небом и облаками в солнечный день. карпаты украина', 'ppl': 155.72},
 {'text': 'красивый пейзаж с горным озером и красивым пейзажем на восходе солнца', 'ppl': 195.81},
 {'text': 'красивый пейзаж с горными вершинами и чистым небом', 'ppl': 219.57},
 {'text': 'красивый пейзаж с горами в тумане, покрывающими горы', 'ppl': 221.36},
 {'text': 'красивый пейзаж и водопад в национальном парке пхутта в таиланде', 'ppl': 248.82},
 {'text': 'красивый пейзаж с голубым небом и белым облаком', 'ppl': 260.76},
 {'text': 'красивый пейзаж с рекой, горы и голубое небо', 'ppl': 273.1},
 {'text': 'красивый пейзаж с зелеными деревьями и голубым небом', 'ppl': 286.22}]

Image Generation + Self Reranking

text = 'красивый пейзаж с озером и лесом на заднем плане'
images_num, bs = 256, 32
seed_everything(42)
codebooks = []
for top_k, top_p, images_num in [
    (2048, 0.975, images_num),
    (1536, 0.975, images_num),
    (1024, 0.975, images_num),
]:
    codebooks.append(generate_codebooks(text, tokenizer, model, top_k=top_k, images_num=images_num, top_p=top_p, bs=bs))

codebooks = torch.cat(codebooks)

ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=bs)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

text = 'зимнее время года'

ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=32)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

text = 'ночное время суток'

ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=32)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

Image Prompt (like Inpainting)

text = 'лодка с алыми парусами'

images_num = 1024
bs = 32

borders = {'up': 6, 'left': 4, 'right': 6, 'down': 2}
image_prompts = ImagePrompts(pil_img, borders, vae, device, crop_first=True)

seed_everything(42)
codebooks = []
for top_k, top_p, images_num in [
    (1024, 0.99, images_num),
]:
    codebooks.append(
        generate_codebooks(text, tokenizer, model, top_k=top_k, images_num=images_num, top_p=top_p, bs=bs, image_prompts=image_prompts)
    )

codebooks = torch.cat(codebooks)

ppl_text, ppl_image = self_reranking_by_text(
    text,
    codebooks,
    tokenizer,
    model,
    bs=bs,
)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

Diffusion (TODO, see Colab)

Image Captioning + Self Reranking

texts = generate_captions(pil_img, tokenizer, model, vae, template='на картинке ', top_k=16, captions_num=128, bs=32, top_p=0.6, temperature=0.8, seed=43, limit_eos=False)
ppl_text, ppl_image = self_reranking_by_image(texts, pil_img, tokenizer, model, vae, bs=32, seed=42)
for idx in ppl_image.argsort()[:8]:
    print(f'-{texts[idx]}')

-на картинке изображено - каяк с плавающей на нем женщиной
-на картинке - лодка с призраками
-на картинке корабль « », вид с воздуха
-на картинке лодка с парусом и 3d эффектом, вид с воздуха
-на картинке лодка с привидениями, вид сверху
-на картинке подводная лодка «акула», вид с воздуха
-на картинке изображено - надувная лодка с жестким дном
-на картинке с сайта esquire, изображен маленький красный корабль

-на картинке собака с длинными ушами, вид спереди
-на картинке собака с большими ушами и с длинными лапами, вид спереди
-на картинке собака с большими ушами и мордой собаки, вид спереди
-на картинке собака с белой гривой, вид спереди собака с коричневым цветом
-на картинке собака с большими ушами и собака с большими ушами, вид спереди
-на картинке собака с большими ушами и коричневым мехом, вид спереди
-на картинке собака с белой гривой, вид спереди собака с белой гривой
-на картинке собака с большими ушами и длинными ушами, вид спереди

-на картинке изображен жилой комплекс «арбат»
-на картинке видно здание с окнами в центре города
-на картинке изображен жилой дом с видом на улицу
-на картинке виднеется здание в центре города
-на картинке изображен вид на жилой комплекс, вид с улицы
-на картинке видна башня банка сбербанка
-на картинке изображен фасад здания с окнами в центре города
-на картинке виднеется здание с балконом

-на картинке мотоцикл иж юпитер вариант с мотором от иж юпитер, вид сзади
-на картинке мотоцикл с мотором и мотором с мотором от мотоцикла, вид сбоку
-на картинке изображен мотоцикл с кузовом из фильма «бэтмен против супермена», вид спереди
-на картинке велосипед с велосипедом в гараже, вид спереди
-на картинке мотоцикл с мотоциклом «мотоцикл» вид сзади, вид спереди
-на картинке велосипед с корзиной для покупок, вид сзади
-на картинке велосипед с мотором от мотоцикла иж юпитер вариант 2 варианта, вид сбоку
-на картинке мотоцикл с мотоциклом « », вид спереди

Zero-Shot Image Classification using PPL

import base64
import requests
from PIL import Image
from io import BytesIO

bs4_urls = requests.get('https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/pipelines/cats_vs_dogs_bs4.json').json()

f, ax = plt.subplots(2,4, figsize=(12,6))

for i, bs4_url in enumerate(bs4_urls):
    pil_img = Image.open(BytesIO(base64.b64decode(bs4_url)))
    
    classes = ['кошка', 'собака']
    preds = zs_clf(
        pil_img, 
        classes,
        model, 
        tokenizer,
        vae,
        template = '{}', 
    )
    ax[i//4, i%4].imshow(pil_img)
    ax[i//4, i%4].set_title(preds['class'])

Linear Probe (TODO, see Colab)

Authors:

Drawing Drawing

Citation

@article{shonenkov2022ruDolph,
  title         = {RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP},
  author        = {Alex Shonenkov and Michael Konstantinov},
  year          = {2022},
  eprint        = {...},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
@misc{github2022ruDolph,
  title         = {RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP},
  author        = {Alex Shonenkov and Michael Konstantinov},
  year          = {2022},
  howpublished  = {\url{https://github.com/sberbank-ai/ru-dolph}},
}

Supported by

Owner
AI Forever
Creating ML for the future. AI projects you already know. We are non-profit organization with members from all over the world.
AI Forever
FSL-Mate: A collection of resources for few-shot learning (FSL).

FSL-Mate is a collection of resources for few-shot learning (FSL). In particular, FSL-Mate currently contains FewShotPapers: a paper list which tracks

Yaqing Wang 1.5k Jan 08, 2023
Effective Use of Transformer Networks for Entity Tracking

Effective Use of Transformer Networks for Entity Tracking (EMNLP19) This is a PyTorch implementation of our EMNLP paper on the effectiveness of pre-tr

5 Nov 06, 2021
Relative Positional Encoding for Transformers with Linear Complexity

Stochastic Positional Encoding (SPE) This is the source code repository for the ICML 2021 paper Relative Positional Encoding for Transformers with Lin

Antoine Liutkus 48 Nov 16, 2022
A package related to building quasi-fibration symmetries

qf A package related to building quasi-fibration symmetries. If you'd like to learn more about how it works, see the brief explanation and References

Paolo Boldi 1 Dec 01, 2021
Code for the Higgs Boson Machine Learning Challenge organised by CERN & EPFL

A method to solve the Higgs boson challenge using Least Squares - Novae This project is the Project 1 of EPFL CS-433 Machine Learning. The project is

Giacomo Orsi 1 Nov 09, 2021
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021]

piglet PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021] This repo contains code and data for PIGLeT. If you like

Rowan Zellers 51 Oct 08, 2022
Demo project for real time anomaly detection using kafka and python

kafkaml-anomaly-detection Project for real time anomaly detection using kafka and python It's assumed that zookeeper and kafka are running in the loca

Rodrigo Arenas 36 Dec 12, 2022
Multimodal Descriptions of Social Concepts: Automatic Modeling and Detection of (Highly Abstract) Social Concepts evoked by Art Images

MUSCO - Multimodal Descriptions of Social Concepts Automatic Modeling of (Highly Abstract) Social Concepts evoked by Art Images This project aims to i

0 Aug 22, 2021
DA2Lite is an automated model compression toolkit for PyTorch.

DA2Lite (Deep Architecture to Lite) is a toolkit to compress and accelerate deep network models. ⭐ Star us on GitHub — it helps!! Frameworks & Librari

Sinhan Kang 7 Mar 22, 2022
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

WECHSEL Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. arXiv: https://arx

Institute of Computational Perception 45 Dec 29, 2022
Playable Video Generation

Playable Video Generation Playable Video Generation Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, Elisa Ricci Paper: ArX

Willi Menapace 136 Dec 31, 2022
A Library for Modelling Probabilistic Hierarchical Graphical Models in PyTorch

A Library for Modelling Probabilistic Hierarchical Graphical Models in PyTorch

Korbinian Pöppel 47 Nov 28, 2022
Code release for Local Light Field Fusion at SIGGRAPH 2019

Local Light Field Fusion Project | Video | Paper Tensorflow implementation for novel view synthesis from sparse input images. Local Light Field Fusion

1.1k Dec 27, 2022
Bayesian Neural Networks in PyTorch

We present the new scheme to compute Monte Carlo estimator in Bayesian VI settings with almost no memory cost in GPU, regardles of the number of sampl

Jurijs Nazarovs 7 May 03, 2022
Official repository for the CVPR 2021 paper "Learning Feature Aggregation for Deep 3D Morphable Models"

Deep3DMM Official repository for the CVPR 2021 paper Learning Feature Aggregation for Deep 3D Morphable Models. Requirements This code is tested on Py

38 Dec 27, 2022
Attentional Focus Modulates Automatic Finger‑tapping Movements

"Attentional Focus Modulates Automatic Finger‑tapping Movements", in Scientific Reports

Xingxun Jiang 1 Dec 02, 2021
prior-based-losses-for-medical-image-segmentation

Repository for papers: Benchmark: Effect of Prior-based Losses on Segmentation Performance: A Benchmark Midl: A Surprisingly Effective Perimeter-based

Rosana EL JURDI 9 Sep 07, 2022
A clean and scalable template to kickstart your deep learning project 🚀 ⚡ 🔥

Lightning-Hydra-Template A clean and scalable template to kickstart your deep learning project 🚀 ⚡ 🔥 Click on Use this template to initialize new re

Hyunsoo Cho 1 Dec 20, 2021
PyTorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision.

PyTorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision @misc{CV2018, author = {Donny You ( Donny You 40 Sep 14, 2022

[AAAI 2022] Separate Contrastive Learning for Organs-at-Risk and Gross-Tumor-Volume Segmentation with Limited Annotation

A paper Introduction This is an official release of the paper Separate Contrastive Learning for Organs-at-Risk and Gross-Tumor-Volume Segmentation wit

Jiacheng Wang 14 Dec 08, 2022