Research on Tabular Deep Learning (Python package & papers)

Last update: Dec 30, 2022

Overview

Research on Tabular Deep Learning

For paper implementations, see the section "Papers and projects".

rtdl is a PyTorch-based package providing a user-friendly API for the main models and concepts from our papers. See the documentation.

Press "Watch" to stay up to date with new papers and releases!

Feel free to report issues and post questions/feedback/ideas.

Papers and projects

Name	Location	Comment
On Embeddings for Numerical Features in Tabular Deep Learning	link	arXiv 2022
Revisiting Deep Learning Models for Tabular Data	link	NeurIPS 2021
`rtdl`	link	Python package

Comments

Fix MLP.make_baseline() return type

Return object of type cls, not MLP, in MLP.make_baseline(). Otherwise, child classes inheriting from MLP constructed using the .make_baseline() method always have type MLP (instead of the type of the child class).

opened by jpgard 6
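
The behaviour described in this pull request can be illustrated with a minimal sketch. The classes `BaseModel` and `ChildModel` and their arguments are hypothetical stand-ins, not the rtdl implementation; they only show why a factory classmethod should call `cls(...)` rather than a hard-coded class name.

```python
class BaseModel:
    """Hypothetical stand-in for a model class with a factory method."""

    def __init__(self, d_in, d_out):
        self.d_in = d_in
        self.d_out = d_out

    @classmethod
    def make_hardcoded(cls, d_in, d_out):
        # Problematic: always builds BaseModel, even when called on a subclass.
        return BaseModel(d_in, d_out)

    @classmethod
    def make_baseline(cls, d_in, d_out):
        # Preferred: builds an instance of whichever class the method was called on.
        return cls(d_in, d_out)


class ChildModel(BaseModel):
    pass


hardcoded = ChildModel.make_hardcoded(8, 1)
baseline = ChildModel.make_baseline(8, 1)
print(type(hardcoded).__name__)  # BaseModel
print(type(baseline).__name__)   # ChildModel
```

With the `cls(...)` variant, subclasses inherit a working factory without overriding it.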
Is it possible to provide a scikit-learn interface?

This project is interesting and I want to use it as the baseline algorithm for my paper. However, it seems that I need to take several steps in order to make a prediction. Consequently, is it possible to provide a scikit-learn interface for making a convenient comparison between different algorithms?

opened by hengzhe-zhang 5
Broken link to the documentation of zero

Hi! I am trying to understand how to use the Python package zero, which is used in the rtdl example. However, the link in one of the code comments is no longer available.

Here is the invalid link: https://yura52.github.io/zero/0.0.4/reference/api/zero.improve_reproducibility.html

Is there any other documentation? Thank you!

Regards.

opened by WuZheng326 4
embedding of categorical variables

Hi Yury,

Thank you for your excellent work. I ran into a problem when handling categorical features. Do I need to pre-train the embedding layer as part of data preprocessing, or should I simply attach the embedding layer to the model and train it together with the model?

opened by lhq12 3
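
A common approach, which this thread itself does not confirm, is the second option: treat the embedding table as an ordinary trainable layer and learn it end to end with the rest of the model. The sketch below uses plain `torch.nn.Embedding`; the class `TinyTabularModel` and all sizes are hypothetical and are not part of rtdl.

```python
import torch
import torch.nn as nn


class TinyTabularModel(nn.Module):
    """Hypothetical model: one categorical feature plus a few numerical ones."""

    def __init__(self, cardinality, d_embedding, n_num_features):
        super().__init__()
        # The embedding table is an ordinary trainable parameter of the model.
        self.embedding = nn.Embedding(cardinality, d_embedding)
        self.head = nn.Linear(d_embedding + n_num_features, 1)

    def forward(self, x_num, x_cat):
        # x_cat holds integer category indices in [0, cardinality).
        x = torch.cat([x_num, self.embedding(x_cat)], dim=1)
        return self.head(x).squeeze(1)


torch.manual_seed(0)
model = TinyTabularModel(cardinality=5, d_embedding=4, n_num_features=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x_num = torch.randn(16, 3)
x_cat = torch.randint(0, 5, (16,))
y = torch.randn(16)

before = model.embedding.weight.detach().clone()
loss = nn.functional.mse_loss(model(x_num, x_cat), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The embedding weights change after one step: they are learned with the model.
embedding_updated = not torch.equal(before, model.embedding.weight.detach())
print(embedding_updated)
```

Only the integer encoding of categories has to be prepared during preprocessing; no separate pre-training step is needed in this setup.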
Add ⭐️Weights & Biases⭐️ Logging

This PR aims to add basic Weights and Biases Metric Logging by appending to the existing codebase with minimal changes while supporting Checkpoint uploads as Weights and Biases Artifacts.

Wherever needed, I have used the existing Weights and Biases integrations viz. LightGBM and XGBoost.

I have validated the performance of all the proposed runs by running 150+ runs, which can be viewed on this project page and in detail in an accompanying blog post.

opened by SauravMaheshkar 3
Bugs in piecewise-linear encoding
Here, indices = as_tensor(values) must be changed to this:

indices = as_tensor(indices)

Here, np.array(d_encoding) must be changed to this:

torch.tensor(d_encoding).to(indices)

Here, the argument dtype=X.dtype is missing for np.array

Here, .to(X) is missing

Here, it must be:

is_last_bin = bin_indices + 1 == as_tensor(list(map(len, bin_edges)))
opened by Yura52 2
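
For readers unfamiliar with the technique these fixes concern, here is a stdlib-only sketch of what a piecewise-linear encoding computes for one scalar feature. It is not the rtdl code: the function name and the choice to clip every component to [0, 1] are assumptions of this sketch, and the library may handle values outside the outermost bin edges differently.

```python
def piecewise_linear_encode(x, bin_edges):
    """Encode scalar x using sorted bin edges b_0 < b_1 < ... < b_T.

    Component t is 1.0 for bins entirely below x, 0.0 for bins entirely
    above x, and the fractional position of x inside its own bin.
    Assumption of this sketch: every component is clipped to [0, 1].
    """
    encoding = []
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        ratio = (x - left) / (right - left)
        encoding.append(min(1.0, max(0.0, ratio)))
    return encoding


edges = [0.0, 1.0, 2.0, 4.0]
print(piecewise_linear_encode(0.5, edges))  # [0.5, 0.0, 0.0]
print(piecewise_linear_encode(3.0, edges))  # [1.0, 1.0, 0.5]
```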

LGBMRegressor on California Housing dataset is 0.68 >> 0.46

I use the sample code to prepare the dataset:

device = 'cpu'
dataset = sklearn.datasets.fetch_california_housing()
task_type = 'regression'

X_all = dataset['data'].astype('float32')
y_all = dataset['target'].astype('float32')
n_classes = None

X = {}
y = {}
X['train'], X['test'], y['train'], y['test'] = sklearn.model_selection.train_test_split(
    X_all, y_all, train_size=0.8
)
X['train'], X['val'], y['train'], y['val'] = sklearn.model_selection.train_test_split(
    X['train'], y['train'], train_size=0.8
)

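# not the best way to preprocess features, but enough for the demonstration
preprocess = sklearn.preprocessing.StandardScaler().fit(X['train'])
# use transform (not fit_transform) so that the validation and test parts
# are scaled with the statistics of the training part
X = {
    k: torch.tensor(preprocess.transform(v), device=device)
    for k, v in X.items()
}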
y = {k: torch.tensor(v, device=device) for k, v in y.items()}

# !!! CRUCIAL for neural networks when solving regression problems !!!
y_mean = y['train'].mean().item()
y_std = y['train'].std().item()
y = {k: (v - y_mean) / y_std for k, v in y.items()}

y = {k: v.float() for k, v in y.items()}

Then I train an LGBMRegressor with the default hyperparameters:

model = lgb.LGBMRegressor()
model.fit(X['train'], y['train'])

But when I evaluate on the test fold, the RMSE is 0.68:

>>> test_pred = model.predict(X['test'])
>>> test_pred = torch.from_numpy(test_pred)
>>> rmse = torch.nn.functional.mse_loss(
...     test_pred.view(-1), y['test'].view(-1)) ** 0.5 * y_std
>>> print(f'Test RMSE: {rmse:.2f}.')
Test RMSE: 0.68.

Even using the model from rtdl gives me 0.56 RMSE:

(epoch) 57 (batch) 0 (loss) 0.1885
(epoch) 57 (batch) 10 (loss) 0.1315
(epoch) 57 (batch) 20 (loss) 0.1735
(epoch) 57 (batch) 30 (loss) 0.1197
(epoch) 57 (batch) 40 (loss) 0.1952
(epoch) 57 (batch) 50 (loss) 0.1167
Epoch 057 | Validation score: 0.7334 | Test score: 0.5612 <<< BEST VALIDATION EPOCH

Is there anything I am missing? How can I reproduce the performance reported in your paper? Thanks!

opened by fingertap 2
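
The multiplication by y_std in the evaluation snippet of this issue converts an RMSE computed on standardized targets back to the original target units. The stdlib-only sketch below checks that identity on made-up numbers; the data and helper names are illustrative assumptions, not values from the issue.

```python
import math
import statistics

# Made-up targets and predictions in the original units (illustrative only).
y_true = [1.0, 2.0, 4.0, 5.0]
y_pred = [1.5, 1.5, 4.5, 4.0]

y_mean = statistics.mean(y_true)
y_std = statistics.stdev(y_true)


def rmse(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


def standardize(values):
    return [(v - y_mean) / y_std for v in values]


rmse_original = rmse(y_true, y_pred)
rmse_standardized = rmse(standardize(y_true), standardize(y_pred))

# Shifting by y_mean cancels in the difference; dividing by y_std scales the
# error, so multiplying by y_std recovers the RMSE in the original units.
print(math.isclose(rmse_standardized * y_std, rmse_original))  # True
```

When comparing against published numbers, it is worth confirming that both sides report the metric in the same units and on the same split.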

Regression results about the RTDL models.

Hi, thank you for the great implementation of the tab-transformer. However, when I use your example notebook to run a simple regression on sin(x), neither the baseline model nor the FTTransformer gives good results. I do not understand why and would like to know the reason.

Here is the link

opened by linkedlist771 1
typos in CatEmbeddings
link. The variable cardinalities_and_dimensions does not exist

link. The condition looks broken. Solution: simplify it and remove the word "spec" from the error message.
opened by Yura52 0
Running error, prenormalization is not a class variable

The code crashes at this line because prenormalization is not an attribute of self:

https://github.com/Yura52/rtdl/blob/b130dd2e596c17109bef825bc9c8608e1ae617cc/rtdl/nn/_backbones.py#L627

opened by zahar-chikishev 0
Typos?

Hello,

I am trying to use PiecewiseLinearEncoder(). I think I found a few typos. Please check my work.

I first ran into an issue in piecewise_linear_encoding where I got the error in line 618 saying "RuntimeError: The size of tensor a (3688) must match the size of tensor b (32) at non-singleton dimension 1"

I dug into the code and found that PiecewiseLinearEncoder passes the positional arguments indices and ratios to piecewise_linear_encoding in the opposite order from what that function expects.

Additionally, piecewise_linear_encoding appears to contain bin_edges = as_tensor(bin_ratios), whereas as_tensor(bin_edges) would make more sense.

Can you please check this out? Much appreciated.

opened by jdefriel 1
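
One general way to guard against swapped positional arguments of the kind reported in this issue is to make the parameters keyword-only. The function below is a hypothetical illustration and not the signature of rtdl's piecewise_linear_encoding.

```python
def encode(*, indices, ratios):
    """Hypothetical helper: the bare * makes both parameters keyword-only."""
    return [i + r for i, r in zip(indices, ratios)]


# Call sites must name each argument, so the order cannot be mixed up.
result = encode(ratios=[0.25, 0.5], indices=[1, 2])
print(result)  # [1.25, 2.5]

try:
    encode([1, 2], [0.25, 0.5])  # a positional call is rejected
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
print(positional_rejected)  # True
```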

How to resume training?

I ran your model in Colab for a few hours before Google terminated the session. I used pickle.dump/load to store the trained model. The loaded model can make predictions, but it does not seem to be able to resume training.

    if progress.success:
        print(' <<< BEST VALIDATION EPOCH', end='')
        with open(mydrive + jobname, 'wb') as filehandler:
            dump((model, y_std, y_mean), filehandler)
        # up to this point we could see that the result was improving

    with open(mydrive + jobname, 'rb') as filehandler:
        model, y_std, y_mean = load(filehandler)
    pred = model(batch, None)  # this seems to work
    for epoch in range(1, n_epochs + 1):
        for iteration, batch_idx in enumerate(train_loader):
            model.train()
            optimizer.zero_grad()
            x_batch = X['train'][batch_idx]
            y_batch = y['train'][batch_idx]
            loss = loss_fn(apply_model(x_batch).squeeze(1), y_batch)
            loss.backward()
            optimizer.step()
            if iteration % report_frequency == 0:
                print(f'(epoch) {epoch} (batch) {iteration} (loss) {loss.item():.4f}')
            # no improvement any more, even when the model was dumped
            # immediately after it was created

What is the right way to store the model so that I can resume training?

opened by jerronl 0
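
A commonly recommended PyTorch pattern for the question above is to checkpoint the state_dict of both the model and the optimizer instead of pickling the model object. One possible cause of the reported symptom (an assumption; the thread does not confirm it) is that the optimizer still refers to the parameters of the model created before loading, so its steps never touch the loaded model. The tiny model, the helper names and the file path below are hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model_and_optimizer():
    # Hypothetical tiny model; any nn.Module works the same way.
    model = nn.Linear(4, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    return model, optimizer


def save_checkpoint(path, model, optimizer, epoch):
    torch.save(
        {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path)
    # Load weights into the existing objects so the optimizer keeps pointing
    # at the parameters of the model that will actually be trained.
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    return checkpoint['epoch']


path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')
model, optimizer = make_model_and_optimizer()
save_checkpoint(path, model, optimizer, epoch=57)

# Later, in a fresh session: rebuild the objects first, then restore state.
resumed_model, resumed_optimizer = make_model_and_optimizer()
start_epoch = load_checkpoint(path, resumed_model, resumed_optimizer) + 1
weights_match = torch.equal(model.weight, resumed_model.weight)
print(start_epoch, weights_match)
```

Scalars such as y_mean and y_std can be stored in the same dictionary so that a single file restores everything needed to continue.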

A scikit-learn interface for RTDL package.

Hello! I have written a scikit-learn interface for the RTDL package (https://github.com/hengzhe-zhang/scikit-rtdl). I rely on skorch to avoid coding errors, and I set the default parameters based on those presented in your paper. I hope you like it!

opened by hengzhe-zhang 1

Releases(v0.0.13)

v0.0.13(Mar 16, 2022)
This is a technical release, no changes in API. Also, check out our new paper "On Embeddings for Numerical Features in Tabular Deep Learning".

Changes

minor documentation fix

moved paper implementations to separate repositories

v0.0.12(Mar 10, 2022)

This is a technical update which was required for the release of our new paper "On Embeddings for Numerical Features in Tabular Deep Learning". This version does not bring any functional changes to the rtdl library compared to the previous version (v0.0.10).
v0.0.10(Feb 28, 2022)
New features

rtdl.data.get_category_sizes

Project

some changes in the documentation reflecting the new structure of the repository

v0.0.9(Nov 7, 2021)
This is a hot-fix release after the big 0.0.8 release (see the release notes for 0.0.8):

revert the breaking change in NumericalFeatureTokenizer accidentally introduced in 0.0.8

minor documentation refinements

v0.0.8(Nov 6, 2021)
This release focuses on improving the documentation.

Documentation

The following models and classes are now documented:

MLP

ResNet

FTTransformer

MultiheadAttention

NumericalFeatureTokenizer

CategoricalFeatureTokenizer

FeatureTokenizer

CLSToken

Usability has been greatly improved:

signatures are now highlighted

added the "copy" button to code blocks

permalink buttons (signature anchors) are now visible

Bug fixes

MultiheadAttention: fix the crash when bias=False

Dependencies

numpy >= 1.18

torch >= 1.7

Project

added spell checking for documentation

sphinx was updated to 4.2.0

flit was updated to 3.4.0

v0.0.7(Oct 10, 2021)
API changes

remove FlatEmbedding

Project

remove bin, lib and output from the PyPI package

v0.0.6(Aug 26, 2021)

New features

CLSToken (old name: "AppendCLSToken"): add expand method for easy construction of batches of [CLS]-tokens

Bug fixes

FTTransformer: the make_baseline method now properly constructs an instance

API changes

FTTransformer: the ffn_d_intermidiate argument was renamed to a more conventional ffn_d_hidden

FTTransformer: the normalization argument was split into three arguments: attention_normalization, ffn_normalization, head_normalization

ResNet: the d_intermidiate argument was renamed to a more conventional d_hidden

AppendCLSToken: renamed to CLSToken

Documentation improvements

CLSToken

MLP.make_baseline

Project

add tests with CUDA

remove the .vscode directory from the repository

v0.0.5(Jul 20, 2021)
API Changes:

MLP.make_baseline is now more user-friendly and accepts a single d_layers argument instead of four (d_first, d_intermidiate, d_last, n_blocks)

v0.0.4(Jul 11, 2021)
Fixes

make CategoricalFeatureTokenizer compatible with .to(device)

v0.0.3(Jul 2, 2021)
API Changes

ResNet & ResNet.Block: the d parameter was renamed to d_main

Fixes

minor fix in the comments in examples/rtdl.ipynb

Project

add tests that validate that the models in rtdl are literally the same as in the implementation of the paper


Owner

Yura Gorishniy

GitHub Repository https://Yura52.github.io/rtdl
