M4 (MLP-Mixer based Multi-modal image-text retrieval)


Image:

The original image is split into non-overlapping 16 x 16 patches, then reshaped to (batch, h x w, patch x patch x channel), where h and w are the number of patches along each spatial axis.
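A minimal sketch of this patching step, assuming PyTorch tensors in (batch, channel, H, W) layout (the helper name is hypothetical, not from the repo):

```python
import torch

def patchify(images, patch=16):
    """Split images into non-overlapping patches and flatten each patch.
    images: (batch, channel, H, W) -> (batch, (H/patch)*(W/patch), patch*patch*channel)
    """
    b, c, h, w = images.shape
    # (b, c, h/p, p, w/p, p) -> (b, h/p, w/p, p, p, c) -> (b, n_patches, p*p*c)
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 3, 5, 1)
    return x.reshape(b, (h // patch) * (w // patch), patch * patch * c)

# A 224x224x3 chest X-ray gives (b, 14*14, 16*16*3) = (b, 196, 768)
patches = patchify(torch.randn(2, 3, 224, 224))  # -> (2, 196, 768)
```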

Text:

The original text is tokenized and embedded with a BERT-based approach (bert-base-uncased).
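A minimal sketch of the tokenization step using the Hugging Face tokenizer for bert-base-uncased (the report string is a placeholder; the max length of 128 follows the input spec below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
report = "No acute cardiopulmonary abnormality."  # placeholder report text
enc = tokenizer(report, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")
print(enc["input_ids"].shape)  # (1, 128); sequence starts with [CLS] and ends with [SEP] + padding
```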

Data processing:

During training, we randomly sample 50% of the reports to build matched and unmatched image-text sets. Matched and unmatched sets are determined with label information from the CheXpert labeler; a pair is considered unmatched when the randomly sampled report is not identical to the original one.
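A sketch of this sampling step (the function name and signature are hypothetical, and the CheXpert-labeler check is simplified to an exact-text comparison):

```python
import random

def build_pairs(images, reports, swap_prob=0.5):
    """Hypothetical helper: pair each image with its own report or a randomly
    sampled one, and label the pair as matched (1) or unmatched (0)."""
    pairs = []
    for i, img in enumerate(images):
        if random.random() < swap_prob:
            j = random.randrange(len(reports))
            # The swapped pair is unmatched only if the sampled report
            # is not identical to the original one.
            pairs.append((img, reports[j], int(reports[j] == reports[i])))
        else:
            pairs.append((img, reports[i], 1))
    return pairs
```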

The Mixer-based approach is trained efficiently, with xxxx throughput and xxx accuracy.

Experimental settings

batch size: 256, epochs: 50

Chest X-ray image-report retrieval

Model spec: patch size: 16, embedding dim: 768

Input spec: image size: 224x224x3 -> patch grid: (224/16) x (224/16), text max length: 128 tokens

Input embedding: [CLS], text tokens, [SEP], image patches. Output: matched or unmatched.
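A hedged sketch of how the joint input sequence and the matched/unmatched head could be wired up (the module name, the learned [CLS]/[SEP] slots, and the backbone argument are assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

class M4Input(nn.Module):
    """Hypothetical module: [CLS] + text embeddings + [SEP] + image patch embeddings
    in a shared 768-dim space, followed by a binary matched/unmatched head."""
    def __init__(self, vocab_size=30522, dim=768, patch_dim=16 * 16 * 3):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, dim)   # token ids -> 768-dim
        self.img_embed = nn.Linear(patch_dim, dim)       # flattened patches -> 768-dim
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned [CLS] slot
        self.sep = nn.Parameter(torch.zeros(1, 1, dim))  # learned [SEP] slot
        self.head = nn.Linear(dim, 2)                    # matched vs. unmatched

    def forward(self, token_ids, patches, backbone):
        b = token_ids.size(0)
        seq = torch.cat([self.cls.expand(b, -1, -1),
                         self.txt_embed(token_ids),      # (b, 128, 768)
                         self.sep.expand(b, -1, -1),
                         self.img_embed(patches)],       # (b, 196, 768)
                        dim=1)
        feats = backbone(seq)                            # e.g. an MLP-Mixer over the sequence
        return self.head(feats[:, 0])                    # classify from the [CLS] position

# Shape check with a stand-in backbone:
logits = M4Input()(torch.zeros(2, 128, dtype=torch.long),
                   torch.randn(2, 196, 768), nn.Identity())
print(logits.shape)  # torch.Size([2, 2])
```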

Results

(Results figure)

About

Final project in KAIST AI class.
