X-VLM: Multi-Grained Vision Language Pre-Training

Last update: Dec 23, 2022

Overview

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

Jan 2022: released the official PyTorch implementation and X-VLM-base checkpoints
Dec 2021: X-VLM-base (4M) achieves new SoTA
Nov 2021: released the preprint on arXiv

Hiring

We are looking for interns at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

- Support several backbones
  - vision encoder: deit / clip-vit / swin-transformer
  - text encoder: bert / roberta
- Support apex O1 / O2 for pre-training
- Read from and write to HDFS
- Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.

Requirements

Install python3 environment

pip3 install -r requirements.txt

- Download raw images from the corresponding websites
- Download the json files we provide, which contain image read paths and captions and/or bbox annotations
- If running pre-training scripts:
  - install Apex
  - download pre-trained models for parameter initialization
    - image encoder: swin-transformer-base
    - text encoder: bert-base
Organize these files like this (% is for pre-training only):

X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg
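
Before launching a run, it can help to confirm that the files are where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` and the two path lists are assumptions made for illustration, copied from the tree above (entries marked % are treated as pre-training only).

```python
from pathlib import Path

# Paths copied from the layout above. These lists are illustrative, not
# something the repository itself defines.
FINETUNE_PATHS = [
    "data/finetune",
    "images/coco/train2014",
    "images/coco/val2014",
    "images/coco/test2015",
    "images/visualgenome/image",
    "images/nlvr2",
]
# Entries marked "%" in the layout: needed for pre-training only.
PRETRAIN_ONLY_PATHS = [
    "data/pretrain_4m",
    "data/swin_base_patch4_window7_224_22k.pth",
    "data/bert-base-uncased/config.json",
    "data/bert-base-uncased/pytorch_model.bin",
    "data/bert-base-uncased/vocab.txt",
    "images/sbu",
    "images/cc-3m",
]


def missing_paths(root, pretrain=False):
    """Return the expected relative paths that do not exist under `root`."""
    expected = FINETUNE_PATHS + (PRETRAIN_ONLY_PATHS if pretrain else [])
    return [p for p in expected if not (Path(root) / p).exists()]


if __name__ == "__main__":
    # Hypothetical usage: run from the directory that contains X-VLM/.
    for path in missing_paths("X-VLM", pretrain=True):
        print("missing:", path)
```

The check only tests that each path exists; it does not verify file contents or image counts.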

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details.

Data

We are organizing the data and the scripts, which will be released in Vision-Language-Data in March. Feel free to prepare your own datasets by referring to the code in dataset/pretrain_dataset.py.

Checkpoints

X-VLM-base (4M)
X-VLM-base 14M, WIP
X-VLM-large 14M, WIP

Finetune

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  # if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results.

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th"

See run.py for fine-tuning on other tasks (Retrieval, NLVR2, RefCOCO). We added some Python assertions to help you run the code correctly. The fine-tuning scripts are based on ALBEF. We thank the authors for open-sourcing their code.
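
When scripting several fine-tuning or evaluation runs, the commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this README (`--task`, `--dist`, `--evaluate`, `--output_dir`, `--output_hdfs`, `--checkpoint`); the helper `build_run_command` is hypothetical and is not part of run.py.

```python
import shlex


def build_run_command(task, output_dir, checkpoint, dist="1",
                      evaluate=False, output_hdfs=None):
    """Assemble the argument list for run.py from the flags shown above."""
    cmd = ["python3", "run.py", "--task", task, "--dist", dist]
    if evaluate:
        cmd.append("--evaluate")
    cmd += ["--output_dir", output_dir]
    if output_hdfs is not None:
        # Only passed when temporary results should be saved to HDFS.
        cmd += ["--output_hdfs", output_hdfs]
    cmd += ["--checkpoint", checkpoint]
    return cmd


if __name__ == "__main__":
    train = build_run_command(
        "vqa", "output/vqa", "4m_base_model_state_step_199999.th")
    # Print the command as a single shell-quoted line.
    print(shlex.join(train))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a single string avoids shell-quoting problems with paths.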

Data

download json files

Checkpoints and Logs

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-bbox
Note that fine-tuning configs are given in "X-VLM/configs/*.yaml"
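
To see which fine-tuning configs are available locally, the directory mentioned above can be listed. This is a small sketch, assuming the repository is checked out as `X-VLM/` in the current directory; the helper `list_configs` is hypothetical.

```python
from pathlib import Path


def list_configs(config_dir="X-VLM/configs"):
    """Return the sorted names of the YAML files in `config_dir`."""
    return sorted(p.name for p in Path(config_dir).glob("*.yaml"))


if __name__ == "__main__":
    for name in list_configs():
        print(name)
```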

Citation

If you use this code, please consider citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues or help using this code, please submit a GitHub issue.
