To be a next-generation DL-based phenotype prediction from genome mutations.

Overview
Sequence -----------+--> 3D_structure --> 3D_module --+                                      +--> ?
|                   |                                 |                                      +--> ?
|                   |                                 +--> Joint_module --> Hierarchical_CLF +--> ?
|                   |                                 |                                      +--> ?
+-> NLP_embeddings -+-------> Embedding_module -------+                                      +--> ?

ClynMut: Predicting the Clynical Relevance of Genome Mutations (wip)

To be a next-generation DL-based phenotype prediction from genome mutations. Will use sota NLP and structural techniques.

Planned modules will likely be:

  • 3D learning module
  • NLP embeddings
  • Joint module + Hierarchical classification

The main idea is for the model to learn the prediction in an end-to-end fashion.

Install

$ pip install clynmut

Example Usage:

import torch
from clynmut import *

hier_graph = {"class": "all", 
              "children": [
                {"class": "effect_1", "children": [
                  {"class": "effect_12", "children": []},
                  {"class": "effect_13", "children": []}
                ]},
                {"class": "effect_2", "children": []},
                {"class": "effect_3", "children": []},
              ]}

model = MutPredict(
    seq_embedd_dim = 512,
    struct_embedd_dim = 256, 
    seq_reason_dim = 512, 
    struct_reason_dim = 256,
    hier_graph = hier_graph,
    dropout = 0.0,
    use_msa = False,
    device = None)

seqs = ["AFTQRWHDLKEIMNIDALTWER",
        "GHITSMNWILWVYGFLE"]

pred_dicts = model(seqs, pred_format="dict")

Important topics:

3D structure learning

There are a couple architectures that can be used here. I've been working on 2 of them, which are likely to be used here:

Hierarchical classification

  • A simple custom helper class has been developed for it.

Testing

$ python setup.py test

Datasets:

This package will use the awesome work by Jonathan King at this repository.

To install

$ pip install git+https://github.com/jonathanking/sidechainnet.git

Or

$ git clone https://github.com/jonathanking/sidechainnet.git
$ cd sidechainnet && pip install -e .

Citations:

@article{pejaver_urresti_lugo-martinez_pagel_lin_nam_mort_cooper_sebat_iakoucheva et al._2020,
    title={Inferring the molecular and phenotypic impact of amino acid variants with MutPred2},
    volume={11},
    DOI={10.1038/s41467-020-19669-x},
    number={1},
    journal={Nature Communications},
    author={Pejaver, Vikas and Urresti, Jorge and Lugo-Martinez, Jose and Pagel, Kymberleigh A. and Lin, Guan Ning and Nam, Hyun-Jun and Mort, Matthew and Cooper, David N. and Sebat, Jonathan and Iakoucheva, Lilia M. et al.},
    year={2020}
@article{rehmat_farooq_kumar_ul hussain_naveed_2020, 
    title={Predicting the pathogenicity of protein coding mutations using Natural Language Processing},
    DOI={10.1109/embc44109.2020.9175781},
    journal={2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)},
    author={Rehmat, Naeem and Farooq, Hammad and Kumar, Sanjay and ul Hussain, Sibt and Naveed, Hammad},
    year={2020}
@article{pagel_antaki_lian_mort_cooper_sebat_iakoucheva_mooney_radivojac_2019,
    title={Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome},
    volume={15},
    DOI={10.1371/journal.pcbi.1007112},
    number={6},
    journal={PLOS Computational Biology},
    author={Pagel, Kymberleigh A. and Antaki, Danny and Lian, AoJie and Mort, Matthew and Cooper, David N. and Sebat, Jonathan and Iakoucheva, Lilia M. and Mooney, Sean D. and Radivojac, Predrag},
    year={2019},
    pages={e1007112}
You might also like...
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow.  This is part of the CASL project: http://casl-project.ai/
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

Python generation script for BitBirds

BitBirds generation script Intro This is published under MIT license, which means you can do whatever you want with it - entirely at your own risk. Pl

TTS is a library for advanced Text-to-Speech generation.
TTS is a library for advanced Text-to-Speech generation.

TTS is a library for advanced Text-to-Speech generation. It's built on the latest research, was designed to achieve the best trade-off among ease-of-training, speed and quality. TTS comes with pretrained models, tools for measuring dataset quality and already used in 20+ languages for products and research projects.

Comments
  • TO DO LIST

    TO DO LIST

    • [x] Add embeddings functionality
    • [ ] Add 3d structure module (likely-to-be GVP/... based)
    • [x] Add classifier
    • [x] Hierarchical classification helper based on differentiability
    • [x] End-to-end code
    • [ ] data collection
    • [ ] data formatting
    • [ ] Run featurization for all data points (esm1b + af2 structs)
    • [ ] Perform a sample training
    • [ ] Perform sample evaluation
    • [ ] Iterate - improve
    • [ ] ...
    • [ ] idk, will see as we go
    opened by hypnopump 0
Releases(0.0.2)
Owner
Eric Alcaide
For he today that sheds his blood with me; Shall be my brother
Eric Alcaide
Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written form.

Neural G2P to portuguese language Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written for

fluz 11 Nov 16, 2022
TLA - Twitter Linguistic Analysis

TLA - Twitter Linguistic Analysis Tool for linguistic analysis of communities TLA is built using PyTorch, Transformers and several other State-of-the-

Tushar Sarkar 47 Aug 14, 2022
Nateve compiler developed with python.

Adam Adam is a Nateve Programming Language compiler developed using Python. Nateve Nateve is a new general domain programming language open source ins

Nateve 7 Jan 15, 2022
A fast and easy implementation of Transformer with PyTorch.

FasySeq FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which

宁羽 7 Jul 18, 2022
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Sonnet finder Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet. Usage This is a Python scrip

Marcel Bollmann 11 Sep 25, 2022
DiY Oxygen Concentrator based on the OxiKit

M19O2 DiY Oxygen Concentrator based on / inspired by the OxiKit, OpenOx, Marut, RepRap and Project Apollo platforms. About Read about the project on H

Maker's Asylum 62 Dec 22, 2022
A framework for implementing federated learning

This is partly the reproduction of the paper of [Privacy-Preserving Federated Learning in Fog Computing](DOI: 10.1109/JIOT.2020.2987958. 2020)

DavidChen 46 Sep 23, 2022
Residual2Vec: Debiasing graph embedding using random graphs

Residual2Vec: Debiasing graph embedding using random graphs This repository contains the code for S. Kojaku, J. Yoon, I. Constantino, and Y.-Y. Ahn, R

SADAMORI KOJAKU 5 Oct 12, 2022
Sequence modeling benchmarks and temporal convolutional networks

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN) This repository contains the experiments done in the work An Empirical Evaluati

CMU Locus Lab 3.5k Jan 03, 2023
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

Wenlong Huang 114 Dec 29, 2022
Materials (slides, code, assignments) for the NYU class I teach on NLP and ML Systems (Master of Engineering).

FREE_7773 Repo containing material for the NYU class (Master of Engineering) I teach on NLP, ML Sys etc. For context on what the class is trying to ac

Jacopo Tagliabue 90 Dec 19, 2022
A cross platform OCR Library based on PaddleOCR & OnnxRuntime

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

RapidOCR Team 767 Jan 09, 2023
A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex)

CodeJ A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex) Install requirements pip install -r

TheProtagonist 1 Dec 06, 2021
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Rebiber: A tool for normalizing bibtex with official info. We often cite papers using their arXiv versions without noting that they are already PUBLIS

(Bill) Yuchen Lin 2k Jan 01, 2023
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022
chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

abhishek thakur 33 Dec 18, 2022
NL. The natural language programming language.

NL A Natural-Language programming language. Built using Codex. A few examples are inside the nl_projects directory. How it works Write any code in pur

2 Jan 17, 2022
GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates

GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates Vibhor Agarwal, Sagar Joglekar, Anthony P. Young an

Vibhor Agarwal 2 Jun 30, 2022