Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

Introduction

t-distributed Stochastic Neighbor Embedding (t-SNE) is a highly successful method for dimensionality reduction and visualization of high dimensional datasets. A popular implementation of t-SNE uses the Barnes-Hut algorithm to approximate the gradient at each iteration of gradient descent. We accelerated this implementation as follows:

  • Computation of the N-body Simulation: Instead of approximating the N-body simulation using Barnes-Hut, we interpolate onto an equispaced grid and use the FFT to perform the convolution, dramatically reducing the time to compute the gradient at each iteration of gradient descent. See this post for some intuition on how it works; a small self-contained sketch of the idea follows this list.
  • Computation of Input Similarities: Instead of computing nearest neighbors using vantage-point trees, we approximate nearest neighbors using the Annoy library. The neighbor lookups are multithreaded to take advantage of machines with multiple cores. Using "near" neighbors as opposed to strictly "nearest" neighbors is faster, but also has a smoothing effect, which can be useful for embedding some datasets (see Linderman et al. (2017)). If subtle detail is required (e.g. in identifying small clusters), then use vantage-point trees (which are also multithreaded in this implementation).
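
Here is that sketch: a minimal NumPy illustration of the idea, not the FIt-SNE implementation itself. Once point "charges" are interpolated onto an equispaced grid, evaluating sums of a translation-invariant kernel such as the Cauchy kernel 1/(1+d^2) over the grid reduces to a Toeplitz matrix-vector product, which can be computed with FFTs via circulant embedding instead of a direct O(M^2) sum.

import numpy as np

# Minimal 1D illustration: FFT-based evaluation of out[i] = sum_j K(x_i - x_j) * w_j
# on an equispaced grid, versus the direct quadratic-cost sum.
M = 64                                    # number of grid points
x = np.linspace(0.0, 1.0, M)              # equispaced grid
w = np.random.default_rng(0).random(M)    # "charges" interpolated onto the grid

def kernel(d):
    return 1.0 / (1.0 + d ** 2)           # Cauchy kernel (t-SNE repulsion, df = 1)

# Direct O(M^2) evaluation
direct = kernel(x[:, None] - x[None, :]) @ w

# FFT evaluation: embed the symmetric Toeplitz kernel matrix in a circulant one
first_col = kernel(x - x[0])                                  # K(0), K(h), ..., K((M-1)h)
circ = np.concatenate([first_col, [0.0], first_col[:0:-1]])   # circulant generator, length 2M
w_pad = np.concatenate([w, np.zeros(M)])
fft_eval = np.fft.irfft(np.fft.rfft(circ) * np.fft.rfft(w_pad))[:M]

print(np.allclose(direct, fft_eval))      # True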

Check out our paper or preprint for more details and some benchmarks.

Features

Additionally, this implementation includes the following features:

  • Early exaggeration: In Linderman and Steinerberger (2018), we showed that appropriately choosing the early exaggeration coefficient can lead to improved embedding of Swiss rolls and other synthetic datasets. Early exaggeration is built into all t-SNE implementations; here we highlight its importance as a parameter.
  • Late exaggeration: Increasing the exaggeration coefficient late in the optimization process can improve separation of the clusters. Kobak and Berens (2019) suggest starting late exaggeration immediately following early exaggeration.
  • Initialization: Custom initialization can be provided from Python/Matlab/R. As suggested by Kobak and Berens (2019), initializing t-SNE with the first two principal components (scaled to have standard deviation 0.0001) results in an embedding which often preserves the global structure more effectively than the default random initialization. See there for other initialization tricks.
  • Variable degrees of freedom: Kobak et al. (2019) show that decreasing the degree of freedom (df) of the t-distribution (resulting in heavier tails) reveals fine structure that is not visible in standard t-SNE.
  • Perplexity combination: The perplexity parameter determines the width of the Gaussian kernel, such that small perplexity values uncover local structure while larger values reveal global structure. Kobak and Berens (2019) show that using a combination of several perplexity values, resulting in a multi-scale embedding, can be useful.
  • All optimization parameters can be controlled from Python/Matlab/R. For example, Belkina et al. (2019) highlight the importance of increasing the learning rate when embedding large data sets. (A usage sketch combining these options follows this list.)
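
Here is that sketch: a hedged example of calling the Python wrapper with these options. Parameter names follow fast_tsne.py, but the values are purely illustrative, and the exact argument set (for example, how a list of perplexities interacts with the single perplexity argument) should be checked against the wrapper's docstring.

import numpy as np
from fast_tsne import fast_tsne

X50 = np.random.randn(10000, 50)                   # e.g. data reduced to 50 PCs
PCAinit = X50[:, :2] / np.std(X50[:, 0]) * 0.0001  # scaled PCA initialization

Z = fast_tsne(
    X50,
    initialization=PCAinit,            # custom initialization (here: PCA-based)
    learning_rate=X50.shape[0] / 12,   # larger learning rate for large data sets
    early_exag_coeff=12,               # early exaggeration coefficient
    start_late_exag_iter=250,          # start late exaggeration when early exaggeration ends
    late_exag_coeff=4,                 # late exaggeration coefficient (illustrative value)
    df=0.5,                            # heavier tails than standard t-SNE (df < 1)
    perplexity_list=[30, 300],         # multi-scale perplexity combination
)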

Installation

R, Matlab, and Python wrappers are fast_tsne.R, fast_tsne.m, and fast_tsne.py respectively. Each of these wrappers can be used after installing FFTW and compiling the C++ code, as below. Gioele La Manno implemented a Python (Cython) wrapper, which is available on PyPI here.

Note: If you update to a new version of FIt-SNE using git pull, be sure to recompile.

OSX and Linux

The only prerequisite is FFTW, which can be downloaded and installed from the website. Then, from the root directory compile the code as:

g++ -std=c++11 -O3  src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp  -o bin/fast_tsne -pthread -lfftw3 -lm -Wno-address-of-packed-member

See here for instructions in case one does not have sudo rights (one can install FFTW in the home directory and provide its path to g++).

Check out examples/ for usage. The Python demo notebook walks through most of the available options using the MNIST data set.

Windows

A Windows binary is available here. Please extract to the bin/ folder, and you should be all set.

If you would like to compile it yourself, see below. The code has currently been tested with MS Visual Studio 2015 (i.e., MS Visual Studio Version 14).

  1. First open the provided FItSNE solution (FItSNE.sln) using MS Visual Studio and build the Release configuration.
  2. Copy the binary file (e.g. x64/Release/FItSNE.exe) generated by the build process to the bin/ folder.
  3. For Windows, we have added all dependencies, including the FFTW library, which is distributed under the GNU General Public License. For the binary to find the FFTW DLLs, you need to either add src/winlibs/fftw/ to your PATH, or copy the DLLs into the bin/ directory.

As of this commit, only the R wrapper properly calls the Windows executable. The Python and Matlab wrappers can be trivially changed to call it (just changing bin/fast_tsne to bin/FItSNE.exe in the code), and will be changed in future commits.

Many thanks to Josef Spidlen for this Windows implementation!

Acknowledgements and References

We are grateful to members of the community who have contributed to improving FIt-SNE, especially Dmitry Kobak, Pavlin Poličar, and Josef Spidlen.

If you use FIt-SNE, please cite:

George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. (2019). Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods. (link)

Our implementation is derived from the Barnes-Hut implementation:

Laurens van der Maaten (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245. (link)

Comments
  • Changing the defaults in the Python wrapper

    Changing the defaults in the Python wrapper

    1. Learning rate is set to max(200, N/early_exag_coeff) by default.
    2. Iteration number is set to 750 by default (250+500).
    3. Initialization is set to PCA (via ARPACK) by default.
    4. N-body algorithm is set to FFT for N>=8000 and to BH for N<8000 by default. UPDATE: I TOOK THIS OUT!
    5. Late exaggeration start is set to the early exagg end by default (if late exagg coeff is provided).
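
    In wrapper pseudocode, items 1, 2 and 5 amount to roughly the following (a sketch with illustrative names, not the actual diff):

    def resolve_defaults(n, early_exag_coeff=12, stop_early_exag_iter=250,
                         learning_rate=None, max_iter=None,
                         late_exag_coeff=-1, start_late_exag_iter=None):
        if learning_rate is None:
            learning_rate = max(200, n / early_exag_coeff)    # item 1
        if max_iter is None:
            max_iter = 750                                    # item 2: 250 + 500
        if late_exag_coeff > 0 and start_late_exag_iter is None:
            start_late_exag_iter = stop_early_exag_iter       # item 5
        return learning_rate, max_iter, start_late_exag_iter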

    I updated the example notebook too.

    This fixes issues #88 #89 #90.

    UPDATE: also implements multithreaded Barnes-Hut!

    opened by dkobak 53
  • Default learning rate

    Default learning rate

    Embedding a large dataset (let's say n=1mln) with FIt-SNE using default parameters will yield a horrible result. Now that we understand it pretty well and now that there are papers published suggesting a very easy fix, why don't we change the default learning rate? What's the benefit of keeping it set to 200 as it was in the original t-SNE implementation?

    My suggestion: if n>=10000 and if the learning rate is not explicitly set, then the wrapper sets it to n/12. The cutoff can be smaller than 10000 but in my experience smaller data sets work fine with learning rate 200, and 10000 is a nice round number.
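
    In wrapper terms, the rule I have in mind is roughly this (names are illustrative):

    def pick_learning_rate(n, learning_rate=None):
        if learning_rate is not None:          # user set it explicitly
            return learning_rate
        return n / 12 if n >= 10000 else 200   # proposed default for large data sets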

    @pavlin-policar I suggest to adopt the same convention in openTSNE. What do you guys think?

    opened by dkobak 38
  • Runtime of the cost computation

    Runtime of the cost computation

    While running the code on the n=1.3mln dataset, I noticed that the CPU usage looks like this:

    [screenshot from 2018-09-20 15-09-04: CPU usage during the run]

    There are exactly 50 short valleys (each valley takes ~1s) with short periods of 100% CPU for all cores (also ~1s) -- this I assume corresponds to the attractive force computation (parallel on all cores) and the repulsive force computation (currently not parallel). Together they take ~2s per iteration, i.e. ~100s per 50 iterations. But after 50 iterations there is a long ~30s period when something happens on one core. WHAT IS IT?!

    The only thing that is done every 50 iterations is the cost computation. Does the cost computation really take that long? Can it be accelerated? If not, and if it really takes 25% of the computation time, then I think it should be optional and switched off by default (at least for large datasets).

    opened by dkobak 30
  • Barnes-Hut by default for small N

    Barnes-Hut by default for small N

    For small-ish sample sizes, Barnes-Hut is noticeably faster than FI. Given that FIt-SNE implements both, why not default to BH whenever N is small enough? This could be implemented in the wrappers, e.g. in Python by nbody_algo='auto' by default.
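
    Concretely, the dispatch in the wrapper could look something like this (a sketch; the 8000 cutoff is just the value from the earlier defaults discussion, not a recommendation):

    def resolve_nbody_algo(n, nbody_algo='auto', cutoff=8000):
        # Choose the gradient approximation based on sample size.
        if nbody_algo != 'auto':
            return nbody_algo
        return 'FFT' if n >= cutoff else 'Barnes-Hut'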

    @pavlin-policar What do you think about it? openTSNE also implements multicore BH (right?), so it could similarly use negative_gradient_method='auto' by default.

    Not sure what a good cutoff would be, but maybe George knows.

    opened by dkobak 22
  • Refactor interpolation code

    Refactor interpolation code

    For the past month or two, I've been playing around with this method, and in order to better understand how it works, I've rewritten most of the interpolation code, as it was quite confusing.

    Description of changes

    My primary goal here was to make the code understandable to anybody following your paper.

    The first thing I did was run formatting on the project, as it really was quite inconsistent and that really bugs me :) Unfortunately, this makes the entire PR quite large, having touched every file in the project.

    The files I did change and that should be reviewed are nbodyfft (I rewrote most of this) and tsne (I didn't really do much here, just changed a few things in the parts that relate to the FFT bit).

    I noticed that there was a lot of seemingly needless memory rearranging via iarr, ysort and the like. After a bit of head scratching, I found that the code basically ensures that points belonging to the same box (or interval) are rearranged to be close to each other in memory. This really isn't necessary, because we can compute the correct interval index on the fly, saving a lot of memory allocation and rearranging. I removed all of that code and computed the indices on the fly. IMO this makes the function a whole lot clearer and more understandable.

    The indices are a bit complicated to compute e.g.

    int idx = (box_i * n_interpolation_points + interp_i) * (n_boxes * n_interpolation_points) +
              (box_j * n_interpolation_points) + interp_j;
    

    but if you think about it a bit, it is easier to get your head around than the previous rearranging. I just took a piece of paper and drew a grid to figure these out and it's really quite easy. This also has the added benefit of being a bit faster than the previous version, since we're not wasting time copying memory around.
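
    As a sanity check, here is the same arithmetic in a few lines of Python (the sizes are made up), compared against NumPy's row-major flattening:

    import numpy as np

    n_boxes, n_interp = 3, 4                      # illustrative sizes (n_interp = n_interpolation_points)
    side = n_boxes * n_interp                     # grid points per dimension
    for box_i in range(n_boxes):
        for interp_i in range(n_interp):
            for box_j in range(n_boxes):
                for interp_j in range(n_interp):
                    row = box_i * n_interp + interp_i
                    col = box_j * n_interp + interp_j
                    idx = row * side + col        # same formula as in the snippet above
                    assert idx == np.ravel_multi_index((row, col), (side, side))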

    Other than that, I made the code more C++ish (under the guidance of IntelliJ's Clang-tidy warnings; I am by no means good at C++, so change things at will) and added comments that I thought would make the code clearer to somebody following your paper. I also removed a lot of commented-out code that was presumably there for debugging, as it was taking up a lot of space and making everything very unreadable.

    I also added a CMakeLists.txt as per the recommendation of intelliJ. I think that's really it, but as I have been working on this for a while, I may have forgotten to mention some small thing.

    End note

    I was fairly careful that the results I got were always still correct, but one can never be too careful. If you find any bugs or issues, I'll be sure to correct them.

    Also, I'd ask that you make sure my comments are in fact correct, since understanding the code and making it understandable for anybody going through it in the future was my primary goal here. A misleading comment is worse than no comment at all. The same goes for variable names.

    As requested, I rebased my branch on top of the latest master so you guys wouldn't have to deal with the many merge conflicts. I haven't had the chance to test the functionality you guys added in the past two months. It should work, but it should be tested.

    opened by pavlin-policar 18
  • Precomputed distance matrix (take 2)

    Precomputed distance matrix (take 2)

    Hi, I had previously asked about using FIt-SNE with a precomputed distance matrix and I was pointed to an earlier version which did support it. Thanks for that, but I now want something else.

    I see that there is now a "degrees of freedom" parameter that makes the heavy tails heavier, and I'm interested in using that if possible; I wonder if I can do that with a precomputed distance matrix.

    Thanks for any help!

    opened by sg-s 16
  • unable to reproduce example

    unable to reproduce example

    While trying to reproduce your example code in Atom, the result of the first figure looks nothing like yours. I don't know where I might be changing things. I'm doing this on Windows. Here is what my code looks like:

    import numpy as np
    import sys
    import os
    import pylab as plt
    sys.path.append('path to /FIt-SNE')
    from fast_tsne import fast_tsne

    from keras.datasets import mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(60000, 784).astype('float64') / 255
    x_test = x_test.reshape(10000, 784).astype('float64') / 255
    X = np.concatenate((x_train, x_test))
    y = np.concatenate((y_train, y_test))
    print(X.shape)
    X = X - X.mean(axis=0)
    U, s, V = np.linalg.svd(X, full_matrices=False)
    X50 = np.dot(U, np.diag(s))[:, :50]

    print(X50.shape)

    PCAinit = X50[:, :2] / np.std(X50[:, 0]) * 0.0001
    col = np.array(['#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a'])

    Z = fast_tsne(X50, perplexity=50, seed=42)

    print(Z.shape)

    plt.figure(figsize=(5, 5))
    plt.scatter(Z[:, 0], Z[:, 1], c=col[y], s=1)
    plt.tight_layout()
    plt.show()

    [figure_1: the resulting embedding, which does not match the example]

    opened by adpala 15
  • How to make clusters more compact?

    How to make clusters more compact?

    Dear all,

    first of all, thanks for the great tool! I am currently developing upon it and it has been really nice to use so far.

    When comparing it to "classical" t-SNE visualizations, I found the classic visualization (i.e., the 2D embedded coordinates) to be somewhat more compact. Put differently, the clusters were more strongly separated. This, in turn, is relevant as I would like to cluster the 2D embedded data.

    Could you please let me know if, and if so how, this could be approached using FIt-SNE? I checked the documentation and it seems that the late exaggeration parameter might be useful here. However, I have no intuition about what would be a good coefficient to start with or at which iteration to start (a hypothetical call illustrating the parameters I mean is shown below).
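
    import numpy as np
    from fast_tsne import fast_tsne

    X = np.random.randn(5000, 50)   # placeholder for my data
    Z = fast_tsne(X, late_exag_coeff=2.5, start_late_exag_iter=600)  # placeholder values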

    Your thoughts and input are greatly appreciated!

    Best,

    Cedric

    P.S. I am currently using v.1.1.0 and the python wrapper.

    opened by claczny 14
  • Conda package / new release?

    Conda package / new release?

    Hi, I'm wondering if you'd be interested in vendoring this package through Conda? I can help with that if needed.

    Also, it appears that there have been many commits since the last release. It'd be nice to create a new release to reflect the newest changes. Thanks!

    opened by raivivek 13
  • Modifications to allow for compilation under Windows using MS Visual Studio

    Modifications to allow for compilation under Windows using MS Visual Studio

    This now compiles on OSX using g++ -std=c++11 -O3 src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp -o bin/fast_tsne -pthread -lfftw3 -lm as well as under Windows by opening and rebuilding the provided solution with MS Visual Studio (tested with MS Visual Studio 2015). I have not tested compilation under Linux, but I assume it should work just fine.

    opened by jspidlen 13
  • Variable degree of freedom is not supported for 1D

    Variable degree of freedom is not supported for 1D

    Just realized that our variable degree of freedom is not supported for 1D visualizations. Let me tell you why it could be cool to implement when we talk later today :-)

    By the way, here is the 1D MNIST t-SNE that I am getting with my default settings fast_tsne(X50, learning_rate=X50.shape[0]/12, map_dims=1, initialization=PCAinit[:,0]), where X50 is a 70000x50 matrix after PCA:

    [mnist-1d: the resulting 1D MNIST embedding]

    It's beautiful, but the 4s and 9s (salmon/violet) get tangled up. I played a bit with the parameters but couldn't make them separate fully. It seems 1D optimisation is harder than 2D (which is no surprise).

    opened by dkobak 11
  • compile fitsne error

    compile fitsne error

    I used the command from the main page to compile:

    g++ -std=c++11 -O3  src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp  -o bin/fast_tsne -pthread -lfftw3 -lm
    

    but I got a warning, and the program stopped:

    In file included from src/tsne.cpp:42:0:
    src/annoylib.h:84:17: note: #pragma message: Using no AVX instructions
     #pragma message "Using no AVX instructions"
                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    src/tsne.cpp: In member function 'void TSNE::computeGradient(double*, unsigned int*, unsigned int*, double*, double*, int, int, double*, double, unsigned int)':
    src/tsne.cpp:633:89: warning: format '%d' expects argument of type 'int', but argument 3 has type 'long unsigned int' [-Wformat=]
                             pos_f[i * 2 + 1], neg_f[i * 2] / sum_Q, neg_f[i * 2 + 1] / sum_Q);
    

    What is wrong with the program? Thanks.

    opened by zyh3826 0
  • FFT supports only 2 components

    FFT supports only 2 components

    Hi!

    Are you planning to improve the algorithm to include more than 2 dimensions? Would that be possible? Any idea how 3 t-SNE components / 3 dimensions could be made to work with FIt-SNE?

    opened by marvelocity 0
  • Integer Overflow? Large datasets

    Integer Overflow? Large datasets

    Hi all,

    Thanks very much for being patient with me - it's very much appreciated. I've attempted to make some of the relevant changes; sadly, however, these haven't fixed the issue. Any further help is very much appreciated!

    opened by TonyX26 0
  • Memory Allocation Failed - Large Datasets

    Memory Allocation Failed - Large Datasets

    Hi all,

    I've been trying to run FIt-SNE on an FCS file with 20 million events. Unfortunately, despite allocating 1.5TB of memory, an error still arises (below). This does not occur when running the same file downsampled to 2 or 5 million cells. I have just been trying to run 20 iterations, just to identify the problem, but it never manages to get there...

    Has anyone encountered this error before? I've attached the error file, the output, and my script.

    Thanks!

    =============== t-SNE v1.2.1 ===============
    fast_tsne data_path: <path> 
    fast_tsne result_path: <path>
    fast_tsne nthreads: 96
    Read the following parameters:
    	 n 19113296 by d 17 dataset, theta 0.500000,
    	 perplexity 50.000000, no_dims 2, max_iter 20,
    	 stop_lying_iter 250, mom_switch_iter 250,
    	 momentum 0.500000, final_momentum 0.800000,
    	 learning_rate 1592774.666667, max_step_norm 5.000000,
    	 K -1, sigma -30.000000, nbody_algo 2,
    	 knn_algo 1, early_exag_coeff 12.000000,
    	 no_momentum_during_exag 0, n_trees 50, search_k 7500,
    	 start_late_exag_iter -1, late_exag_coeff 1.000000
    	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 1.000000
    Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
    Read the initialization successfully.
    Will use momentum during exaggeration phase
    Computing input similarities...
    Using perplexity, so normalizing input data (to prevent numerical problems)
    Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
    Using ANNOY for knn search, with parameters: n_trees 50 and search_k 7500
    Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896
    Memory allocation failed!
    
    Resource Usage on 2021-08-05 16:59:31:
    Job Id:             job_ID
    Project:            ##
    Exit Status:        1
    Service Units:      6.20
    NCPUs Requested:    48                     NCPUs Used: 48              
                                               CPU Time Used: 00:02:26                                   
       Memory Requested:   1.46TB                Memory Used: 37.03GB         
       Walltime requested: 20:00:00            Walltime Used: 00:02:35        
       JobFS requested:    30.0GB                 JobFS used: 15.18KB         
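
    For what it's worth, the negative N*K value in the output is consistent with the product overflowing a signed 32-bit integer; a quick check of the arithmetic (plain Python, just to illustrate):

    N, K = 19113296, 150
    product = N * K                              # 2866994400, which exceeds 2**31 - 1 = 2147483647
    wrapped = (product + 2**31) % 2**32 - 2**31  # simulate signed 32-bit wraparound
    print(product, wrapped)                      # 2866994400 -1427972896 (matches the log)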
    

    Error file:

    FIt-SNE R wrapper loading.
    FIt-SNE root directory was set to <directory>
    Using rsvd() to compute the top PCs for initialization.
    Error in fftRtsne(dsobject_s[, -c(1, 19:24)], perplexity = 50, max_iter = 20) : 
      tsne call failed
    Execution halted
    

    Script:

    library(flowCore)
    
    ## Sourcing FITSNE 
    fast_tsne_path  <- "<path>/fast_tsne" 
    source(paste0(fast_tsne_path,".R"))
    
    ## Loading in File
    object <- exprs(read.FCS("<file>.fcs"))
    
    ## Running FIt-SNE 
    tsne_object <-fftRtsne(object[,-c(1, 19:24)],perplexity = 50, max_iter = 20)
    export_obj <- cbind(object, tSNEX = tsne_object[,1], tSNEY = tsne_object[,2],fast_tsne_path=fast_tsne_path)
    
    ## Saving Object
    saveRDS(export_obj, "fitSNE_alltube_simple20.rds")
    
    
    opened by TonyX26 7
  • Could not run in Google Colab

    Could not run in Google Colab

    Hello,

    I am trying to run fast_tsne.py in Google Colab due to the inadequacy of my system. When I try to run the code, I get the error below:

    FileNotFoundError: [Errno 2] No such file or directory: '/content/bin/fast_tsne': '/content/bin/fast_tsne'

    I tried to change the directory and fix the code, but I could not fix it. I would be very glad if you could help me with this issue.

    Kind regards.

    opened by sabotaha 1
Releases(v1.2.1)
  • v1.2.1(Apr 19, 2020)

  • v1.2.0(Mar 30, 2020)

    Several changes to default FIt-SNE settings to make it more suitable for embedding large datasets. See this recent paper by Dmitry Kobak and Philipp Berens for more details.

    Major changes to default values:
    • Learning rate increased from the fixed value of 200 to max(200, N/early_exag_coeff).
    • Iteration number decreased from 1000 to 750.
    • Initialization is set to PCA (computed via fast SVD implementations like ARPACK).

    Minor changes:
    • Late exaggeration start is set to the end of early exaggeration (if a late exaggeration coefficient is provided).
    • Maximum step size is limited to 5 (solves a problem encountered when the learning rate is set too high and attractive forces cause a small number of points to overshoot).

    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Feb 8, 2019)

    • Decreasing the degree of freedom (df) of the t-distribution reveals fine structure that is not visible in standard t-SNE. This PR adds a df parameter for that purpose. Preprint will be forthcoming.
    • Added documentation to Python and R wrappers
    • Added License
    • Binary checks if the wrapper version matches the binary version
    Source code(tar.gz)
    Source code(zip)
    FItSNE-Windows-1.1.0.zip(1.00 MB)
  • v1.0.0(Oct 28, 2018)

Owner
Kluger Lab