M4 (MLP-Mixer based Multi-modal image-text retrieval)


Image:

The original image is split into non-overlapping 16 x 16 patches, then reshaped to (batch, h x w, patch x patch x channel), where h and w are the number of patches along each spatial axis.
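A minimal sketch of this patching step, assuming PyTorch tensors in (batch, channel, H, W) layout (the helper name is hypothetical, not from the repo):

```python
import torch

def patchify(images, patch=16):
    """Split images into non-overlapping patches and flatten each patch.
    images: (batch, channel, H, W) -> (batch, (H/patch)*(W/patch), patch*patch*channel)
    """
    b, c, h, w = images.shape
    # (b, c, h/p, p, w/p, p) -> (b, h/p, w/p, p, p, c) -> (b, n_patches, p*p*c)
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 3, 5, 1)
    return x.reshape(b, (h // patch) * (w // patch), patch * patch * c)

# A 224x224x3 chest X-ray gives (b, 14*14, 16*16*3) = (b, 196, 768)
patches = patchify(torch.randn(2, 3, 224, 224))  # -> (2, 196, 768)
```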

Text:

The original text is tokenized and embedded with a BERT-based approach (bert-base-uncased).
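A minimal sketch of the tokenization step using the Hugging Face tokenizer for bert-base-uncased (the report string is a placeholder; the max length of 128 follows the input spec below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
report = "No acute cardiopulmonary abnormality."  # placeholder report text
enc = tokenizer(report, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")
print(enc["input_ids"].shape)  # (1, 128); sequence starts with [CLS] and ends with [SEP] + padding
```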

Data processing:

During training, we randomly sample 50% of the reports to build matched and unmatched image-text sets. Matched and unmatched sets are determined with label information from the CheXpert labeler; a pair is considered unmatched when the randomly sampled report is not identical to the original one.
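A sketch of this sampling step (the function name and signature are hypothetical, and the CheXpert-labeler check is simplified to an exact-text comparison):

```python
import random

def build_pairs(images, reports, swap_prob=0.5):
    """Hypothetical helper: pair each image with its own report or a randomly
    sampled one, and label the pair as matched (1) or unmatched (0)."""
    pairs = []
    for i, img in enumerate(images):
        if random.random() < swap_prob:
            j = random.randrange(len(reports))
            # The swapped pair is unmatched only if the sampled report
            # is not identical to the original one.
            pairs.append((img, reports[j], int(reports[j] == reports[i])))
        else:
            pairs.append((img, reports[i], 1))
    return pairs
```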

The Mixer-based approach is trained efficiently, with xxxx throughput and xxx accuracy.

Experimental settings

batch size: 256, epochs: 50

Chest X-ray image-report retrieval

Model spec: patch size: 16, embedding dim: 768

Input spec: image size: 224x224x3 -> patch grid: (224/16) x (224/16), text max length: 128 tokens

Input embedding: [CLS], text tokens, [SEP], image patches. Output: matched or unmatched.
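A hedged sketch of how the joint input sequence and the matched/unmatched head could be wired up (the module name, the learned [CLS]/[SEP] slots, and the backbone argument are assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

class M4Input(nn.Module):
    """Hypothetical module: [CLS] + text embeddings + [SEP] + image patch embeddings
    in a shared 768-dim space, followed by a binary matched/unmatched head."""
    def __init__(self, vocab_size=30522, dim=768, patch_dim=16 * 16 * 3):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, dim)   # token ids -> 768-dim
        self.img_embed = nn.Linear(patch_dim, dim)       # flattened patches -> 768-dim
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned [CLS] slot
        self.sep = nn.Parameter(torch.zeros(1, 1, dim))  # learned [SEP] slot
        self.head = nn.Linear(dim, 2)                    # matched vs. unmatched

    def forward(self, token_ids, patches, backbone):
        b = token_ids.size(0)
        seq = torch.cat([self.cls.expand(b, -1, -1),
                         self.txt_embed(token_ids),      # (b, 128, 768)
                         self.sep.expand(b, -1, -1),
                         self.img_embed(patches)],       # (b, 196, 768)
                        dim=1)
        feats = backbone(seq)                            # e.g. an MLP-Mixer over the sequence
        return self.head(feats[:, 0])                    # classify from the [CLS] position

# Shape check with a stand-in backbone:
logits = M4Input()(torch.zeros(2, 128, dtype=torch.long),
                   torch.randn(2, 196, 768), nn.Identity())
print(logits.shape)  # torch.Size([2, 2])
```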

Results

(Results figure)

About

Final project in KAIST AI class.
