MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data

Related tags

Machine LearningMCML
Overview

MCML

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use for single-cell datasets though the method can use any matrix as input.

MCML modules include the MCML and bMCML algorithms for dimensionality reduction, and MCML tools include functions for quantitative analysis of inter- and intra- distances between labeled groups and nearest neighbor metrics in the latent or ambient space. The modules are autoencoder-based neural networks with label-aware cost functions for weight optimization.

Briefly, MCML adapts the Neighborhood Component Analysis algorithm to utilize mutliple classes of labels for each observation (cell) to embed observations of the same labels close to each other. This essentially optimizes the latent space for k-Nearest Neighbors (KNN) classification.

bMCML demonstrates targeted reconstruction error, which optimizes for recapitulation of intra-label distances (the pairwise distances between cells within the same label).

tools include functions for inter- and intra-label distance calculations as well as metrics on the labels of n the k nearest neighbors of each observation. These can be performed on any latent or ambient space (matrix) input.

Requirements

You need Python 3.6 or later to run MCML. You can have multiple Python versions (2.x and 3.x) installed on the same system without problems.

In Ubuntu, Mint and Debian you can install Python 3 like this:

$ sudo apt-get install python3 python3-pip

For other Linux distributions, macOS and Windows, packages are available at

https://www.python.org/getit/

Quick start

MCML can be installed using pip:

$ python3 -m pip install -U MCML

If you want to run the latest version of the code, you can install from git:

$ python3 -m pip install -U git+git://github.com/pachterlab/MCML.git

Examples

Example data download:

$ wget --quiet https://caltech.box.com/shared/static/i66kelel9ouep3yw8bn2duudkqey190j
$ mv i66kelel9ouep3yw8bn2duudkqey190j mat.mtx
$ wget --quiet https://caltech.box.com/shared/static/dcmr36vmsxgcwneh0attqt0z6qm6vpg6
$ mv dcmr36vmsxgcwneh0attqt0z6qm6vpg6 metadata.csv

Extract matrix (obs x features) and labels for each obs:

>>> import pandas as pd
>>> import scipy.io as sio
>>> import numpy as np

>>> mat = sio.mmread('mat.mtx') #Is a centered and scaled matrix (scaling input is optional)
>>> mat.shape
(3850, 1999)

>>> meta = pd.read_csv('metadata.csv')
>>> meta.head()
 Unnamed: 0          sample_name  smartseq_cluster_id  smartseq_cluster  ... n_genes percent_mito pass_count_filter  pass_mito_filter
0  SM-GE4R2_S062_E1-50  SM-GE4R2_S062_E1-50                   46   Nr5a1_9|11 Rorb  ...    9772          0.0              True              True
1  SM-GE4SI_S356_E1-50  SM-GE4SI_S356_E1-50                   46   Nr5a1_9|11 Rorb  ...    8253          0.0              True              True
2  SM-GE4SI_S172_E1-50  SM-GE4SI_S172_E1-50                   46   Nr5a1_9|11 Rorb  ...    9394          0.0              True              True
3   LS-15034_S07_E1-50   LS-15034_S07_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10643          0.0              True              True
4   LS-15034_S28_E1-50   LS-15034_S28_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10550          0.0              True              True

>>> cellTypes = list(meta.smartseq_cluster)
>>> sexLabels = list(meta.sex_label)
>>> len(sexLabels)
3850



To run the MCML algorithm for dimensionality reduction (Python 3):

>>> from MCML.modules import MCML, bMCML

>>> mcml = MCML(n_latent = 50, epochs = 100) #Initialize MCML class

>>> latentMCML = mcml.fit(mat, np.array([cellTypes,sexLabels]) , fracNCA = 0.8 , silent = True) #Run MCML
>>> latentMCML.shape
(3850, 50)

This incorporates both the cell type and sex labels into the latent space construction. Use plotLosses() to view the loss function components over the training epochs.

>>> mcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs



To run the bMCML algorithm for dimensionality reduction (Python 3):

>>> bmcml = bMCML(n_latent = 50, epochs = 100) #Initialize bMCML class


>>> latentbMCML = bmcml.fit(mat, np.array(cellTypes), np.array(sexLabels), silent=True) #Run bMCML
>>> latentbMCML.shape
(3850, 50)

>>> bmcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs

bMCML is optimizing for the intra-distances of the sex labels i.e. the pairwise distances of cells in each sex for each cell type.

For both bMCML and MCML objects, fit() can be replaced with trainTest() to train the algorithms on a subset of the full data and apply the learned weights to the remaining test data. This offers a method assessing overfitting.



To use the metrics available in tools:

>>> from MCML import tools as tl

#Pairwise distances between centroids of cells in each label
>>> cDists = tl.getCentroidDists(mat, np.array(cellTypes)) 
>>> len(cDists)
784

#Avg pairwise distances between cells of *both* sexes, for each cell type
>>> interDists = tl.getInterVar(mat, np.array(cellTypes), np.array(sexLabels))  
>>> len(interDists)
27

#Avg pairwise distances between cells of the *same* sex, for each cell type
>>> intraDists = tl.getIntraVar(mat, np.array(cellTypes), np.array(sexLabels)) 
>>> len(intraDists)
53

#Fraction of neighbors for each cell with same label as cell itself (also returns which labels neighbors have)
>>> neighbor_fracs, which_labels = tl.frac_unique_neighbors(mat, np.array(cellTypes), metric = 1,neighbors = 30)

#Get nearest neighbors for any embedding
>>> orig_neigh = tl.getNeighbors(mat, n_neigh = 15, p=1)
>>> latent_neigh = tl.getNeighbors(latentMCML, n_neigh = 15, p=1)

#Get Jaccard distance between latent and ambient nearest neighbors
>>> jac_dists = tl.getJaccard(orig_neigh, latent_neigh)
>>>len(jac_dists)
3850



To see further details of all inputs and outputs for all functions use:

>>> help(MCML)
>>> help(bMCML)
>>> help(tl)

License

MCML is licensed under the terms of the BSD License (see the file LICENSE).

Owner
Pachter Lab
Pachter Lab
The Emergence of Individuality

The Emergence of Individuality

16 Jul 20, 2022
#30DaysOfStreamlit is a 30-day social challenge for you to build and deploy Streamlit apps.

30 Days Of Streamlit 🎈 This is the official repo of #30DaysOfStreamlit — a 30-day social challenge for you to learn, build and deploy Streamlit apps.

Streamlit 53 Jan 02, 2023
Toolss - Automatic installer of hacking tools (ONLY FOR TERMUKS!)

Tools Автоматический установщик хакерских утилит (ТОЛЬКО ДЛЯ ТЕРМУКС!) Оригиналь

14 Jan 05, 2023
Optimal Randomized Canonical Correlation Analysis

ORCCA Optimal Randomized Canonical Correlation Analysis This project is for the python version of ORCCA algorithm. It depends on Numpy for matrix calc

Yinsong Wang 1 Nov 21, 2021
OptaPy is an AI constraint solver for Python to optimize planning and scheduling problems.

OptaPy is an AI constraint solver for Python to optimize the Vehicle Routing Problem, Employee Rostering, Maintenance Scheduling, Task Assignment, School Timetabling, Cloud Optimization, Conference S

OptaPy 208 Dec 27, 2022
Adaptive: parallel active learning of mathematical functions

adaptive Adaptive: parallel active learning of mathematical functions. adaptive is an open-source Python library designed to make adaptive parallel fu

741 Dec 27, 2022
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Spark Python Notebooks This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, fro

Jose A Dianes 1.5k Jan 02, 2023
learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your

BDFD 6 Nov 05, 2022
Dragonfly is an open source python library for scalable Bayesian optimisation.

Dragonfly is an open source python library for scalable Bayesian optimisation. Bayesian optimisation is used for optimising black-box functions whose

744 Jan 02, 2023
A collection of Scikit-Learn compatible time series transformers and tools.

tsfeast A collection of Scikit-Learn compatible time series transformers and tools. Installation Create a virtual environment and install: From PyPi p

Chris Santiago 0 Mar 30, 2022
A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile

matrixprofile-ts matrixprofile-ts is a Python 2 and 3 library for evaluating time series data using the Matrix Profile algorithms developed by the Keo

Target 696 Dec 26, 2022
This handbook accompanies the course: Machine Learning with Hung-Yi Lee

This handbook accompanies the course: Machine Learning with Hung-Yi Lee

RenChu Wang 472 Dec 31, 2022
Winning solution for the Galaxy Challenge on Kaggle

Winning solution for the Galaxy Challenge on Kaggle

Sander Dieleman 483 Jan 02, 2023
database for artificial intelligence/machine learning data

AIDB v0.0.1 database for artificial intelligence/machine learning data Overview aidb is a database designed for large dataset for machine learning pro

Aarush Gupta 1 Oct 24, 2021
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 07, 2023
MaD GUI is a basis for graphical annotation and computational analysis of time series data.

MaD GUI Machine Learning and Data Analytics Graphical User Interface MaD GUI is a basis for graphical annotation and computational analysis of time se

Machine Learning and Data Analytics Lab FAU 10 Dec 19, 2022
使用数学和计算机知识投机倒把

偷鸡不成项目集锦 坦率地讲,涉及金融市场的好策略如果公开,必然导致使用的人多,最后策略变差。所以这个仓库只收集我目前失败了的案例。 加密货币组合套利 中国体育彩票预测 我赚不上钱的项目,也许可以帮助更有能力的人去赚钱。

Roy 28 Dec 29, 2022
Educational python for Neural Networks, written in pure Python/NumPy.

Educational python for Neural Networks, written in pure Python/NumPy.

127 Oct 27, 2022
Machine Learning Course with Python:

A Machine Learning Course with Python Table of Contents Download Free Deep Learning Resource Guide Slack Group Introduction Motivation Machine Learnin

Instill AI 6.9k Jan 03, 2023
icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

icepickle It's a cooler way to store simple linear models. The goal of icepickle is to allow a safe way to serialize and deserialize linear scikit-lea

vincent d warmerdam 24 Dec 09, 2022