Repository for Multimodal AutoML Benchmark

Last update: Nov 24, 2022

Overview

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Repository for the NeurIPS 2021 Datasets and Benchmarks Track submission "Benchmarking Multimodal AutoML for Tabular Data with Text Fields" (Link, Full Paper with Appendix). An earlier version of the paper, "Multimodal AutoML on Structured Tables with Text Fields" (Link), was accepted as an oral presentation at the ICML 2021 AutoML workshop. Because the benchmark has since been updated with more datasets, the version used in the workshop paper is archived in the icml_workshop branch.

This benchmark contains a diverse collection of tabular datasets. Each dataset contains numeric/categorical as well as text columns. The goal is to evaluate the performance of (automated) ML systems for supervised learning (classification and regression) with such multimodal data. The folder multimodal_text_benchmark/scripts/benchmark/ provides Python scripts to run different variants of the AutoGluon and H2O AutoML tools on the benchmark.

Datasets used in the Benchmark

Here's a brief summary of the datasets in our benchmark. Each dataset is described in greater detail in the multimodal_text_benchmark/ folder.

ID	key	#Train	#Test	Task	Metric	Prediction Target
prod	product_sentiment_machine_hack	5,091	1,273	multiclass	accuracy	sentiment related to product
salary	data_scientist_salary	15,841	3,961	multiclass	accuracy	salary range in data scientist job listings
airbnb	melbourne_airbnb	18,316	4,579	multiclass	accuracy	price of Airbnb listing
channel	news_channel	20,284	5,071	multiclass	accuracy	category of news article
wine	wine_reviews	84,123	21,031	multiclass	accuracy	variety of wine
imdb	imdb_genre_prediction	800	200	binary	roc_auc	whether film is a drama
fake	fake_job_postings2	12,725	3,182	binary	roc_auc	whether job postings are fake
kick	kick_starter_funding	86,052	21,626	binary	roc_auc	will Kickstarter get funding
jigsaw	jigsaw_unintended_bias100K	100,000	25,000	binary	roc_auc	whether comments are toxic
qaa	google_qa_answer_type_reason_explanation	4,863	1,216	regression	r2	type of answer
qaq	google_qa_question_type_reason_explanation	4,863	1,216	regression	r2	type of question
book	bookprice_prediction	4,989	1,248	regression	r2	price of books
jc	jc_penney_products	10,860	2,715	regression	r2	price of JC Penney products
cloth	women_clothing_review	18,788	4,698	regression	r2	review score
ae	ae_price_prediction	22,662	5,666	regression	r2	American-Eagle item prices
pop	news_popularity2	24,007	6,002	regression	r2	news article popularity online
house	california_house_price	24,007	6,002	regression	r2	sale price of houses in California
mercari	mercari_price_suggestion100K	100,000	25,000	regression	r2	price of Mercari products
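
The Metric column pairs each task type with one score: accuracy for multiclass, roc_auc for binary, and r2 for regression. As a rough illustration of what those three numbers measure, here is a minimal pure-Python sketch. The function names and the `METRIC_FOR_TASK` mapping are our own and are not part of the benchmark code, which may well compute its scores through a library instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Labels are assumed to be 0/1; tied scores count as half a win.
    This pairwise form is O(n_pos * n_neg), fine for illustration only.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical lookup mirroring the Task and Metric columns of the table.
METRIC_FOR_TASK = {"multiclass": accuracy, "binary": roc_auc, "regression": r2}

if __name__ == "__main__":
    print(METRIC_FOR_TASK["multiclass"](["a", "b", "c", "a"], ["a", "b", "a", "a"]))
    print(METRIC_FOR_TASK["binary"]([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(METRIC_FOR_TASK["regression"]([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))
```

For all three metrics higher is better, which makes scores comparable in direction (though not in scale) across task types.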

License

The versions of datasets in this benchmark are released under the CC BY-NC-SA license. Note that the datasets in this benchmark are modified versions of previously publicly-available original copies and we do not own any of the datasets in the benchmark. Any data from this benchmark which has previously been published elsewhere falls under the original license from which the data originated. Please refer to the licenses of each original source linked in the multimodal_text_benchmark/README.md.

Install the Benchmark Suite

cd multimodal_text_benchmark
# Install the benchmarking suite
python3 -m pip install -U -e .

To quickly check the installation, run the dataset tests from the tests folder:

cd multimodal_text_benchmark/tests
python3 -m pytest test_datasets.py

To work with one of the datasets, use the following code:

from auto_mm_bench.datasets import dataset_registry

print(dataset_registry.list_keys())  # list of all dataset names
dataset_name = 'product_sentiment_machine_hack'

train_dataset = dataset_registry.create(dataset_name, 'train')
test_dataset = dataset_registry.create(dataset_name, 'test')
print(train_dataset.data)
print(test_dataset.data)

To access all datasets that comprise the benchmark:

from auto_mm_bench.datasets import create_dataset, TEXT_BENCHMARK_ALIAS_MAPPING

# The mapping's values are the full dataset keys; load each dataset in turn
for dataset_name in TEXT_BENCHMARK_ALIAS_MAPPING.values():
    print(dataset_name)
    dataset = create_dataset(dataset_name)
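
The loop above reads the dataset keys from the values of `TEXT_BENCHMARK_ALIAS_MAPPING`, which suggests the mapping goes from a short alias to the full key. Assuming the aliases are the IDs in the dataset table (an assumption, not something confirmed here), a lookup could be sketched as follows. `ALIAS_TO_KEY` and `resolve` are hypothetical stand-ins that do not import the real package.

```python
# Hypothetical stand-in for TEXT_BENCHMARK_ALIAS_MAPPING, copied from the
# ID and key columns of the dataset table (first three rows only).
ALIAS_TO_KEY = {
    "prod": "product_sentiment_machine_hack",
    "salary": "data_scientist_salary",
    "airbnb": "melbourne_airbnb",
}


def resolve(name):
    """Return the full dataset key for a short ID.

    Full keys are passed through unchanged; anything else raises KeyError.
    """
    if name in ALIAS_TO_KEY:
        return ALIAS_TO_KEY[name]
    if name in ALIAS_TO_KEY.values():
        return name
    raise KeyError(f"unknown dataset: {name!r}")


if __name__ == "__main__":
    for name in ("prod", "melbourne_airbnb"):
        print(name, "->", resolve(name))
```

Accepting both forms lets scripts use the short IDs from the table while still working with the full keys that the registry expects.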

Run Experiments

Go to multimodal_text_benchmark/scripts/benchmark to see how to run some baseline ML methods over the benchmark.

References

BibTeX entry of the ICML Workshop Version:

@article{agmultimodaltext,
  title={Multimodal AutoML on Structured Tables with Text Fields},
  author={Shi, Xingjian and Mueller, Jonas and Erickson, Nick and Li, Mu and Smola, Alexander},
  journal={8th ICML Workshop on Automated Machine Learning (AutoML)},
  year={2021}
}

Owner

Xingjian Shi
