Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. ICCV 2021.

Overview

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Download links and PyTorch implementation of "Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision", ICCV 2021.

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, Noah Snavely ICCV 2021

Project Page | Paper

drawing

The WikiScenes Dataset

  1. Image and Textual Descriptions: WikiScenes contains 63K images with captions of 99 cathedrals. We provide two versions for download:

    • Low-res version used in our experiments (maximum width set to 200[px], aspect ratio fixed): (1.9GB .zip file)
    • Higher-res version (maximum longer dimension set to 1200[px], aspect ratio fixed): (19.4GB .zip file)

    Licenses for the images are provided here: (LicenseInfo.json file)

    Data Structure

    WikiScenes is organized recursively, following the tree structure in Wikimedia. Each semantic category (e.g. cathedral) contains the following recursive structure:

    ----0 (e.g., "milano cathedral duomo milan milano italy italia")
    --------0 (e.g., "Exterior of the Duomo (Milan)")
    ----------------0 (e.g., "Duomo (Milan) in art - exterior")
    ----------------1
    ----------------...
    ----------------K0-0
    ----------------category.json
    ----------------pictures (contains all pictures in current hierarchy level)
    --------1
    --------...
    --------K0
    --------category.json
    --------pictures (contains all pictures in current hierarchy level)
    ----1
    ----2
    ----...
    ----N
    ----category.json
    

    category.json is a dictionary of the following format:

    {
        "max_index": SUB-DIR-NUMBER
        "pairs" :    {
                        CATEGORY-NAME: SUB-DIR-NAME
                    }
        "pictures" : {
                        PICTURE-NAME: {
                                            "caption": CAPTION-DATA,
                                            "url": URL-DATA,
                                            "properties": PROPERTIES
                                    }
                    }
    }
    

    where:

    1. SUB-DIR-NUMBER is the total number of subcategories
    2. CATEGORY-NAME is the name of the category (e.g., "milano cathedral duomo milan milano italy italia")
    3. SUB-DIR-NAME is the name of the sub-folder (e.g., "0")
    4. PICTURE-NAME is the name of the jpg file located within the pictures folder
    5. CAPTION-DATA contains the caption and URL contains the url from which the image was scraped.
    6. PROPERTIES is a list of properties pre-computed for the image-caption pair (e.g. estimated language of caption).
  2. Keypoint correspondences: We also provide keypoint correspondences between pixels of images from the same landmark: (982MB .zip file)

    Data Structure

     {
         "image_id" : {
                         "kp_id": (x, y),
                     }
     }
    

    where:

    1. image_id is the id of each image.
    2. kp_id is the id of keypoints, which is unique across the whole dataset.
    3. (x, y) the location of the keypoint in this image.
  3. COLMAP reconstructions: We provide the full 3D models used for computing keypoint correspondences: (1GB .zip file)

    To view these models, download and install COLMAP. The reconstructions are organized by landmarks. Each landmark folder contains all the reconstructions associated with that landmark. Each reconstruction contains 3 files:

    1. points3d.txt that contains one line of data for each 3D point associated with the reconstruction. The format for each point is: POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX).
    2. images.txt that contains two lines of data for each image associated with the reconstruction. The format of the first line is: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME. The format of the second line is: POINTS2D[] as (X, Y, POINT3D_ID)
    3. cameras.txt that contains one line of data for each camera associated with the reconstruction according to the following format: CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]

    Please refer to COLMAP's tutorial for further instructions on how to view these reconstructions.

  4. Companion datasets for additional landmark categories: We provide download links for additional category types:

    Synagogues

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

    Mosques

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

Reproducing Results

  1. Minimum requirements. This project was originally developed with Python 3.6, PyTorch 1.0 and CUDA 9.0. The training requires at least one Titan X GPU (12Gb memory) .

  2. Setup your Python environment. Clone the repository and install the dependencies:

    conda create -n <environment_name> --file requirements.txt -c conda-forge/label/cf202003
    conda activate <environment_name>
    conda install scikit-learn=0.21
    pip install opencv-python
    
  3. Download the dataset. Download the data as detailed above, unzip and place as follows: Image and textual descriptions in <project>/data/ and the correspondence file in <project>.

  4. Download pre-trained models. Download the initial weights (pre-trained on ImageNet) for the backbone model and place in <project>/models/weights/.

    Backbone Initial Weights Comments
    ResNet50 resnet50-19c8e357.pth PyTorch official model
  5. Train on the WikiScenes dataset. See instructions below. Note that the first run always takes longer for pre-processing. Some computations are cached afterwards.

Training, Inference and Evaluation

The directory launch contains template bash scripts for training, inference and evaluation.

Training. For each run, you need to specify the names of two variables, bash EXP and bash RUN_ID. Running bash EXP=wiki RUN_ID=v01 ./launch/run_wikiscenes_resnet50.sh will create a directory ./logs/wikiscenes_corr/wiki/ with tensorboard events and saved snapshots in ./snapshots/wikiscenes_corr/wiki/v01.

Inference.

If you want to do inference with our pre-trained model, please make a directory and put the model there.

    mkdir -p ./snapshots/wikiscenes_corr/final/ours

Download our validation set, and unzip it.

    unzip val_seg.zip

run sh ./launch/infer_val_wikiscenes.sh to predict masks. You can find the predicted masks in ./logs/masks.

If you want to evaluate you own models, you will also need to specify:

  • EXP and RUN_ID you used for training;
  • OUTPUT_DIR the path where to save the masks;
  • SNAPSHOT specifies the model suffix in the format e000Xs0.000;

Evaluation. To compute IoU of the masks, run sh ./launch/eval_seg.sh.

Pre-trained model

For testing, we provide our pre-trained ResNet50 model:

Backbone Link
ResNet50 model_enc_e024Xs-0.800.pth (157M)

Datasheet

We provide a datasheet for our dataset here.

License

The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images—sometimes with restrictions, e.g. requiring attribution and with copyleft. We provide full license text and attribution for all images, make no modifications to any, and release these images under their original licenses. The associated captions are provided as a part of unstructured text in Wikimedia Commons, with rights to the original writers under the CC BY-SA 3.0 license. We modify these (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset under a CC BY-NC-SA 4.0 license.

Citation

@inproceedings{Wu2021Towers,
 title={Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision},
 author={Wu, Xiaoshi and Averbuch-Elor, Hadar and Sun, Jin and Snavely, Noah},
 booktitle={ICCV},
 year={2021}
}

Acknowledgement

Our code is based on the implementation of Single-Stage Semantic Segmentation from Image Labels

Owner
Blakey Wu
Blakey Wu
Code Repository for The Kaggle Book, Published by Packt Publishing

The Kaggle Book Data analysis and machine learning for competitive data science Code Repository for The Kaggle Book, Published by Packt Publishing "Lu

Packt 1.6k Jan 07, 2023
Automates Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning :rocket:

MLJAR Automated Machine Learning Documentation: https://supervised.mljar.com/ Source Code: https://github.com/mljar/mljar-supervised Table of Contents

MLJAR 2.4k Dec 31, 2022
This repository contains all data used for writing a research paper Multiple Object Trackers in OpenCV: A Benchmark, presented in ISIE 2021 conference in Kyoto, Japan.

OpenCV-Multiple-Object-Tracking Python is version 3.6.7 to install opencv: pip uninstall opecv-python pip uninstall opencv-contrib-python pip install

6 Dec 19, 2021
Deep Learning Package based on TensorFlow

White-Box-Layer is a Python module for deep learning built on top of TensorFlow and is distributed under the MIT license. The project was started in M

YeongHyeon Park 7 Dec 27, 2021
Bravia core script for python

Bravia-Core-Script You need to have a mandatory account If this L3 does not work, try another L3. enjoy

5 Dec 26, 2021
RGB-D Local Implicit Function for Depth Completion of Transparent Objects

RGB-D Local Implicit Function for Depth Completion of Transparent Objects [Project Page] [Paper] Overview This repository maintains the official imple

NVIDIA Research Projects 43 Dec 12, 2022
Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks

Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks Abstract Facial expression recognition in video

Bogireddy Sai Prasanna Teja Reddy 103 Dec 29, 2022
PyTorch-centric library for evaluating and enhancing the robustness of AI technologies

Responsible AI Toolbox A library that provides high-quality, PyTorch-centric tools for evaluating and enhancing both the robustness and the explainabi

24 Dec 22, 2022
PenguinSpeciesPredictionML - Basic model to predict Penguin species based on beak size and sex.

Penguin Species Prediction (ML) 🐧 👨🏽‍💻 What? 💻 This project is a basic model using sklearn methods to predict Penguin species based on beak size

Tucker Paron 0 Jan 08, 2022
🔮 Execution time predictions for deep neural network training iterations across different GPUs.

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training Habitat is a tool that predicts a deep neural network's

Geoffrey Yu 44 Dec 27, 2022
Generate fine-tuning samples & Fine-tuning the model & Generate samples by transferring Note On

UPMT Generate fine-tuning samples & Fine-tuning the model & Generate samples by transferring Note On See main.py as an example: from model import PopM

7 Sep 01, 2022
Alex Pashevich 62 Dec 24, 2022
Replication of Pix2Seq with Pretrained Model

Pretrained-Pix2Seq We provide the pre-trained model of Pix2Seq. This version contains new data augmentation. The model is trained for 300 epochs and c

peng gao 51 Nov 22, 2022
Lipstick ain't enough: Beyond Color-Matching for In-the-Wild Makeup Transfer (CVPR 2021)

Table of Content Introduction Datasets Getting Started Requirements Usage Example Training & Evaluation CPM: Color-Pattern Makeup Transfer CPM is a ho

VinAI Research 248 Dec 13, 2022
Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder

ASEGAN: Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder 中文版简介 Readme with English Version 介绍 基于SEGAN模型的改进版本,使用自主设计的非

Nitin 53 Nov 17, 2022
TensorFlow 2 implementation of the Yahoo Open-NSFW model

TensorFlow 2 implementation of the Yahoo Open-NSFW model

Bosco Yung 101 Jan 01, 2023
Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery (ICCV 2021)

Change is Everywhere Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery by Zhuo Zheng, Ailong Ma, Liangpei Zhang and Yanfei

Zhuo Zheng 125 Dec 13, 2022
HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.

HTSeq DEVS: https://github.com/htseq/htseq DOCS: https://htseq.readthedocs.io A Python library to facilitate programmatic analysis of data from high-t

HTSeq 57 Dec 20, 2022
A PyTorch implementation of "Signed Graph Convolutional Network" (ICDM 2018).

SGCN ⠀ A PyTorch implementation of Signed Graph Convolutional Network (ICDM 2018). Abstract Due to the fact much of today's data can be represented as

Benedek Rozemberczki 251 Nov 30, 2022
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps[AAAI2021]

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps Here is the code for ssbassline model. We also provide OCR results/features/mode

ZephyrZhuQi 51 Nov 18, 2022