Learned model to estimate number of distinct values (NDV) of a population using a small sample.

Last update: Nov 21, 2022

Overview

Learned NDV estimator

Learned model to estimate the number of distinct values (NDV) of a population from a small sample. The model approximates the maximum likelihood estimate of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper, "Learning to be a Statistician: Learned Estimator for Number of Distinct Values", for more details.

How to use

Install the package

pip install estndv
Import and create an instance

   from estndv import ndvEstimator
   estimator = ndvEstimator()

Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV with:

ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)
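
As a quick plausibility check on the returned value: the true NDV can never be smaller than the number of distinct values already seen in the sample, nor larger than N. The sketch below clamps an estimate to that range. clamp_ndv is a hypothetical helper, not part of estndv, and whether the package already applies such bounds is not stated here.

```python
def clamp_ndv(ndv, sample, population_size):
    """Clamp an NDV estimate to [distinct values in sample, population size].

    Hypothetical helper for illustration; not part of the estndv package.
    """
    seen_distinct = len(set(sample))  # lower bound: values we have already observed
    return min(max(ndv, seen_distinct), population_size)


S = [1, 1, 1, 3, 5, 5, 12]
print(clamp_ndv(2.0, S, 100000))   # raised to the 4 distinct values in S
print(clamp_ndv(50.5, S, 100000))  # already in range, returned unchanged
```
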
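If you already have the sample profile, e.g. f=[2,1,1] (in the sample above, two values appear once, one value appears twice, and one value appears three times), you can estimate the population NDV with: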

ndv = estimator.profile_predict(f=[2,1,1], N=100000)
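
If you only have the raw sample, the profile can be derived from it. The sketch below assumes the profile is the frequency-of-frequencies vector, i.e. f[j-1] is the number of distinct values that occur exactly j times in the sample, which matches the example S and f above. sample_profile is a hypothetical helper, not part of estndv.

```python
from collections import Counter


def sample_profile(sample):
    """Return f, where f[j-1] is the number of distinct values seen exactly j times.

    Hypothetical helper for illustration; not part of the estndv package.
    """
    value_counts = Counter(sample)                    # value -> occurrences in the sample
    count_of_counts = Counter(value_counts.values())  # occurrences -> number of such values
    max_freq = max(count_of_counts, default=0)
    return [count_of_counts.get(j, 0) for j in range(1, max_freq + 1)]


print(sample_profile([1, 1, 1, 3, 5, 5, 12]))  # [2, 1, 1]
```
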
If you have multiple samples or profiles from multiple populations, you can estimate the population NDV for all of them in one batch with estimator.sample_predict_batch() or estimator.profile_predict_batch().
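
The argument names of the batch methods are not shown above, so the following sketch does not call them. It simply loops over sample_predict with the S and N keywords used earlier, which should give the same results one sample at a time, though likely more slowly than a true batch call. estimate_many is a hypothetical helper, not part of estndv.

```python
def estimate_many(estimator, samples, population_sizes):
    """Estimate NDV for several (sample, population size) pairs, one at a time.

    `estimator` is any object exposing sample_predict(S=..., N=...),
    such as the ndvEstimator instance created above.
    """
    if len(samples) != len(population_sizes):
        raise ValueError("need exactly one population size per sample")
    return [
        estimator.sample_predict(S=sample, N=size)
        for sample, size in zip(samples, population_sizes)
    ]


# Usage, assuming estndv is installed:
#   from estndv import ndvEstimator
#   estimates = estimate_many(ndvEstimator(), [[1, 1, 1, 3, 5, 5, 12], [4, 4, 9]], [100000, 500])
```
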

How to train the NDV estimator

You can use our package from PyPI directly on your datasets, as the pre-trained model is workload-agnostic. If you still want to train the model from scratch, do the following:

Go to the model_training folder

cd model_training
Install requirements

pip install -r requirements.txt
Generate training data. (This uses a lot of memory.)

python training_data_generation.py
Train model

python model_training.py
Save the trained PyTorch model parameters as a NumPy file; this generates the file model_paras.npy

python torch2npy.py
Test with your model parameters by specifying a path to your model_paras.npy

estimator = ndvEstimator(para_path="<your path to model_paras.npy>")

Citation

If you use our work or find it useful, please cite our paper:

@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272--284},
   numpages = {13}
}
