MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data

Related tags

Machine LearningMCML
Overview

MCML

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use for single-cell datasets though the method can use any matrix as input.

MCML modules include the MCML and bMCML algorithms for dimensionality reduction, and MCML tools include functions for quantitative analysis of inter- and intra- distances between labeled groups and nearest neighbor metrics in the latent or ambient space. The modules are autoencoder-based neural networks with label-aware cost functions for weight optimization.

Briefly, MCML adapts the Neighborhood Component Analysis algorithm to utilize mutliple classes of labels for each observation (cell) to embed observations of the same labels close to each other. This essentially optimizes the latent space for k-Nearest Neighbors (KNN) classification.

bMCML demonstrates targeted reconstruction error, which optimizes for recapitulation of intra-label distances (the pairwise distances between cells within the same label).

tools include functions for inter- and intra-label distance calculations as well as metrics on the labels of n the k nearest neighbors of each observation. These can be performed on any latent or ambient space (matrix) input.

Requirements

You need Python 3.6 or later to run MCML. You can have multiple Python versions (2.x and 3.x) installed on the same system without problems.

In Ubuntu, Mint and Debian you can install Python 3 like this:

$ sudo apt-get install python3 python3-pip

For other Linux distributions, macOS and Windows, packages are available at

https://www.python.org/getit/

Quick start

MCML can be installed using pip:

$ python3 -m pip install -U MCML

If you want to run the latest version of the code, you can install from git:

$ python3 -m pip install -U git+git://github.com/pachterlab/MCML.git

Examples

Example data download:

$ wget --quiet https://caltech.box.com/shared/static/i66kelel9ouep3yw8bn2duudkqey190j
$ mv i66kelel9ouep3yw8bn2duudkqey190j mat.mtx
$ wget --quiet https://caltech.box.com/shared/static/dcmr36vmsxgcwneh0attqt0z6qm6vpg6
$ mv dcmr36vmsxgcwneh0attqt0z6qm6vpg6 metadata.csv

Extract matrix (obs x features) and labels for each obs:

>>> import pandas as pd
>>> import scipy.io as sio
>>> import numpy as np

>>> mat = sio.mmread('mat.mtx') #Is a centered and scaled matrix (scaling input is optional)
>>> mat.shape
(3850, 1999)

>>> meta = pd.read_csv('metadata.csv')
>>> meta.head()
 Unnamed: 0          sample_name  smartseq_cluster_id  smartseq_cluster  ... n_genes percent_mito pass_count_filter  pass_mito_filter
0  SM-GE4R2_S062_E1-50  SM-GE4R2_S062_E1-50                   46   Nr5a1_9|11 Rorb  ...    9772          0.0              True              True
1  SM-GE4SI_S356_E1-50  SM-GE4SI_S356_E1-50                   46   Nr5a1_9|11 Rorb  ...    8253          0.0              True              True
2  SM-GE4SI_S172_E1-50  SM-GE4SI_S172_E1-50                   46   Nr5a1_9|11 Rorb  ...    9394          0.0              True              True
3   LS-15034_S07_E1-50   LS-15034_S07_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10643          0.0              True              True
4   LS-15034_S28_E1-50   LS-15034_S28_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10550          0.0              True              True

>>> cellTypes = list(meta.smartseq_cluster)
>>> sexLabels = list(meta.sex_label)
>>> len(sexLabels)
3850



To run the MCML algorithm for dimensionality reduction (Python 3):

>>> from MCML.modules import MCML, bMCML

>>> mcml = MCML(n_latent = 50, epochs = 100) #Initialize MCML class

>>> latentMCML = mcml.fit(mat, np.array([cellTypes,sexLabels]) , fracNCA = 0.8 , silent = True) #Run MCML
>>> latentMCML.shape
(3850, 50)

This incorporates both the cell type and sex labels into the latent space construction. Use plotLosses() to view the loss function components over the training epochs.

>>> mcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs



To run the bMCML algorithm for dimensionality reduction (Python 3):

>>> bmcml = bMCML(n_latent = 50, epochs = 100) #Initialize bMCML class


>>> latentbMCML = bmcml.fit(mat, np.array(cellTypes), np.array(sexLabels), silent=True) #Run bMCML
>>> latentbMCML.shape
(3850, 50)

>>> bmcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs

bMCML is optimizing for the intra-distances of the sex labels i.e. the pairwise distances of cells in each sex for each cell type.

For both bMCML and MCML objects, fit() can be replaced with trainTest() to train the algorithms on a subset of the full data and apply the learned weights to the remaining test data. This offers a method assessing overfitting.



To use the metrics available in tools:

>>> from MCML import tools as tl

#Pairwise distances between centroids of cells in each label
>>> cDists = tl.getCentroidDists(mat, np.array(cellTypes)) 
>>> len(cDists)
784

#Avg pairwise distances between cells of *both* sexes, for each cell type
>>> interDists = tl.getInterVar(mat, np.array(cellTypes), np.array(sexLabels))  
>>> len(interDists)
27

#Avg pairwise distances between cells of the *same* sex, for each cell type
>>> intraDists = tl.getIntraVar(mat, np.array(cellTypes), np.array(sexLabels)) 
>>> len(intraDists)
53

#Fraction of neighbors for each cell with same label as cell itself (also returns which labels neighbors have)
>>> neighbor_fracs, which_labels = tl.frac_unique_neighbors(mat, np.array(cellTypes), metric = 1,neighbors = 30)

#Get nearest neighbors for any embedding
>>> orig_neigh = tl.getNeighbors(mat, n_neigh = 15, p=1)
>>> latent_neigh = tl.getNeighbors(latentMCML, n_neigh = 15, p=1)

#Get Jaccard distance between latent and ambient nearest neighbors
>>> jac_dists = tl.getJaccard(orig_neigh, latent_neigh)
>>>len(jac_dists)
3850



To see further details of all inputs and outputs for all functions use:

>>> help(MCML)
>>> help(bMCML)
>>> help(tl)

License

MCML is licensed under the terms of the BSD License (see the file LICENSE).

Owner
Pachter Lab
Pachter Lab
A machine learning model for Covid case prediction

CovidcasePrediction A machine learning model for Covid case prediction Problem Statement Using regression algorithms we can able to track the active c

VijayAadhithya2019rit 1 Feb 02, 2022
This is a curated list of medical data for machine learning

Medical Data for Machine Learning This is a curated list of medical data for machine learning. This list is provided for informational purposes only,

Andrew L. Beam 5.4k Dec 26, 2022
healthy and lesion models for learning based on the joint estimation of stochasticity and volatility

health-lesion-stovol healthy and lesion models for learning based on the joint estimation of stochasticity and volatility Reference please cite this p

5 Nov 01, 2022
Price forecasting of SGB and IRFC Bonds and comparing there returns

Project_Bonds Project Title : Price forecasting of SGB and IRFC Bonds and comparing there returns. Introduction of the Project The 2008-09 global fina

Tishya S 1 Oct 28, 2021
Crunchdao - Python API for the Crunchdao machine learning tournament

Python API for the Crunchdao machine learning tournament Interact with the Crunc

3 Jan 19, 2022
Spark development environment for k8s

Local Spark Dev Env with Docker Development environment for k8s. Using the spark-operator image to ensure it will be the same environment. Start conta

Otacilio Filho 18 Jan 04, 2022
LibTraffic is a unified, flexible and comprehensive traffic prediction library based on PyTorch

LibTraffic is a unified, flexible and comprehensive traffic prediction library, which provides researchers with a credibly experimental tool and a convenient development framework. Our library is imp

432 Jan 05, 2023
A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Dec 29, 2022
Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )

implementation of machine learning Algorithms such as decision tree and random forest and xgboost on darasets then compare results for each and implement ant colony and genetic algorithms on tsp map,

Mohamadreza Rezaei 1 Jan 19, 2022
Climin is a Python package for optimization, heavily biased to machine learning scenarios

climin climin is a Python package for optimization, heavily biased to machine learning scenarios distributed under the BSD 3-clause license. It works

Biomimetic Robotics and Machine Learning at Technische Universität München 177 Sep 02, 2022
Python factor analysis library (PCA, CA, MCA, MFA, FAMD)

Prince is a library for doing factor analysis. This includes a variety of methods including principal component analysis (PCA) and correspondence anal

Max Halford 915 Dec 31, 2022
Python module for data science and machine learning users.

dsnk-distributions package dsnk distribution is a Python module for data science and machine learning that was created with the goal of reducing calcu

Emmanuel ASIFIWE 1 Nov 23, 2021
Simulate & classify transient absorption spectroscopy (TAS) spectral features for bulk semiconducting materials (Post-DFT)

PyTASER PyTASER is a Python (3.9+) library and set of command-line tools for classifying spectral features in bulk materials, post-DFT. The goal of th

Materials Design Group 4 Dec 27, 2022
Machine Learning from Scratch

Machine Learning from Scratch Author: Shengxuan Wang From: Oregon State University Content: Building Machine Learning model from Scratch, without usin

ShawnWang 0 Jul 05, 2022
A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

Davis E. King 11.6k Jan 02, 2023
fMRIprep Pipeline To Machine Learning

fMRIprep Pipeline To Machine Learning(Demo) 所有配置均在config.py文件下定义 前置环境(lilab) 各个节点均安装docker,并有fmripre的镜像 可以使用conda中的base环境(相应的第三份包之后更新) 1. fmriprep scr

Alien 3 Mar 08, 2022
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

A unified Data Analytics and AI platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray What is Analytics Zoo? Analytics Zo

2.5k Dec 28, 2022
Cool Python features for machine learning that I used to be too afraid to use. Will be updated as I have more time / learn more.

python-is-cool A gentle guide to the Python features that I didn't know existed or was too afraid to use. This will be updated as I learn more and bec

Chip Huyen 3.3k Jan 05, 2023
PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn PySpark + Scikit-learn = Sparkit-learn GitHub: https://github.com/lensacom/sparkit-learn About Sparkit-learn aims to provide scikit-lear

Lensa 1.1k Jan 04, 2023
Course files for "Ocean/Atmosphere Time Series Analysis"

time-series This package contains all necessary files for the course Ocean/Atmosphere Time Series Analysis, an introduction to data and time series an

Jonathan Lilly 107 Nov 29, 2022