CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

Last update: Mar 10, 2022

Related tags

Overview

CLIP-Indonesian

CLIP (Radford et al., 2021) is a multimodal model that can connect images and text by training a vision encoder and a text encoder jointly to project the representation of images and the corresponding text into the same embedding space. The expected outcome is the text embeddings and image embeddings are located near each other.

This repository hosts the code for CLIP-Indonesian, which is a CLIP multimodal model trained on Indonesian data.

For the image encoder, we use VIT, more specifically openai/clip-vit-base-patch32. Meanwhile, for the text encoder, we experimented with two models: IndoBERT Large (indobenchmark/indobert-base-p2) and Indonesian RoBERTa Base (flax-community/indonesian-roberta-base).

Most of the CLIP script is based on HybridCLIP and clip-italian.

Still a work in progress so may not give the best result (yet) :)

clip-indonesian was presented in PyCon ID 2021. You can view the slide deck here.

Dataset

More details about the dataset used can be found here.

Results

The results of the training can be accessed here.

Demo

References

Bianchi, F., Attanasio, G., Pisoni, R., Terragni, S., Sarti, G., Lakshmi, S. (2021). Contrastive Language-Image Pre-training for the Italian Language arXiv preprint arXiv:2108.08688.

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.

Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., Li, X., Lim, Z. Y., ... & Purwarianti, A. (2020). IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. arXiv preprint arXiv:2009.05387.

Hybrid CLIP by the HuggingFace team

Indonesian Roberta Base by Wilson Wongso, Steven Limcorn, Samsul Rahmadani, and Chew Kok Wah

Indonesian Translated Datasets by Samsul Rahmadani

Acknowledgment

All training was done on a TPUv3-8 VM sponsored by TPU Research Cloud.

CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

Related tags

Overview

CLIP-Indonesian

Dataset

Results

Demo

Links

References

Acknowledgment

Owner

Galuh

Approaches to modeling terrain and maps in python

A Pytorch loader for MVTecAD dataset.

[NeurIPS 2021] Low-Rank Subspaces in GANs

Pytorch-Swin-Unet-V2 - a modified version of Swin Unet based on Swin Transfomer V2

Source code for Zalo AI 2021 submission

A customisable game where you have to quickly click on black tiles in order of appearance while avoiding clicking on white squares.

商品推荐系统

Image Matching Evaluation

Person Re-identification

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

A PyTorch Reimplementation of TecoGAN: Temporally Coherent GAN for Video Super-Resolution

PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Housing Price Prediction

TVNet: Temporal Voting Network for Action Localization

💊 A 3D Generative Model for Structure-Based Drug Design (NeurIPS 2021)

CSE-519---Project - Job Title Analysis (Project for CSE 519 - Data Science Fundamentals)

Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains

This Jupyter notebook shows one way to implement a simple first-order low-pass filter on sampled data in discrete time.

Neural Scene Flow Prior (NeurIPS 2021 spotlight)

Final project code: Implementing MAE with downscaled encoders and datasets, for ESE546 FA21 at University of Pennsylvania