Codebase of deep learning models for inferring stability of mRNA molecules

Last update: Dec 29, 2022

Overview

Kaggle OpenVaccine Models

This codebase contains deep learning models for inferring the stability of mRNA molecules. It corresponds to the Kaggle Open Vaccine Challenge and the accompanying manuscript "Predictive models of RNA degradation through dual crowdsourcing", Wayment-Steele et al. (2021) (full citation when available).

Models contained here are:

"Nullrecurrent": A reconstruction of the winning solution by Jiayang Gao. A link to the original notebooks is provided below.

"DegScore-XGBoost": A model based on the original DegScore model and XGBoost.

NB on other historic names for models

The Nullrecurrent model was called the "OV" model in some instances, and the .h5 model files for the Nullrecurrent model are labeled "ov".
The DegScore-XGBoost model was called the "BT" model in Eterna analysis.

Organization

scripts: Python scripts to perform inference.

notebooks: Python notebooks to perform inference.

model_files: Stores the .h5 model files used at inference time.

data: Data corresponding to Kaggle challenge and to subsequent tests on mRNAs.

data/Kaggle_RYOS_data

This directory contains the training and test sets in .csv and .json form.

Kaggle_RYOS_trainset_prediction_output_Sep2021.txt contains predictions from the Nullrecurrent code in this repository.

Model MCRMSEs were evaluated by uploading submissions to the Kaggle competition website at https://www.kaggle.com/c/stanford-covid-vaccine.
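For local sanity checks, MCRMSE (mean column-wise root mean squared error) can be sketched as below. This is a minimal illustration, not the Kaggle scoring code: the function name mcrmse, the array layout (rows are nucleotides, columns are degradation conditions), and the handling of NaN entries are assumptions, and the competition's choice of scored positions and columns is not reproduced here.

```python
import numpy as np


def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE; NaN entries are ignored (an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_error = (y_true - y_pred) ** 2
    # RMSE for each column (one column per degradation condition).
    rmse_per_column = np.sqrt(np.nanmean(squared_error, axis=0))
    # Average the per-column RMSE values into a single score.
    return float(np.mean(rmse_per_column))


# Toy data: two positions, two conditions.
y_true = np.array([[0.0, 1.0], [2.0, 3.0]])
y_pred = np.array([[0.0, 2.0], [2.0, 5.0]])
score = mcrmse(y_true, y_pred)
```

The official scores for this repository still come from submitting to the competition website.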

data/mRNA_233x_data

This directory contains the original data and the scripts to reproduce the model analysis from the manuscript.

Because the original prediction formats all differ slightly, the reformat_*.py scripts read in each original format and rewrite it in two forms for each prediction, "FULL" and "PCR", in the directory formatted_predictions.

"FULL" contains per-nucleotide predictions for all nucleotides. "PCR" contains the same predictions with the regions outside the RT-PCR sequencing set to NaN.
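The relationship between the two forms can be sketched as follows. This is an illustration only, not the code in the reformat_*.py scripts: the function name mask_outside_window and the start/end indices are hypothetical, and the real boundaries of the RT-PCR region come from the data.

```python
import numpy as np


def mask_outside_window(full_pred, start, end):
    """Return a copy of full_pred with positions outside [start, end) set to NaN.

    start and end are hypothetical 0-based indices of the sequenced region.
    """
    full = np.asarray(full_pred, dtype=float)
    masked = np.full_like(full, np.nan)
    masked[start:end] = full[start:end]
    return masked


# Toy "FULL" prediction for a five-nucleotide construct.
full = [0.1, 0.2, 0.3, 0.4, 0.5]
pcr = mask_outside_window(full, 1, 4)
```

Keeping the masked array the same length as the full one means positions still line up with the sequence when the two forms are compared.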

Running python collate_predictions.py reads in all the data and writes all_predictions_233x.csv.

RegenerateFigure5.ipynb reproduces the final scatterplot comparisons.

posthoc_code_predictions contains predictions from the Nullrecurrent model code contained in this repository. To generate these predictions, use the sequence file in the mRNA_233x_data folder and run the following command(s):

python scripts/nullrecurrent_inference.py -d deg_Mg_pH10 -i 233_sequences.txt -o 233x_nullrecurrent_output_Oct2021_deg_Mg_50C.txt

etc.
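Assuming the command is repeated once per degradation condition (the four -d options listed under Usage), the full set of commands could be generated with a sketch like the one below. The function name build_commands and the output naming pattern are hypothetical and are not taken from the repository.

```python
import shlex

# The four -d options accepted by nullrecurrent_inference.py (see Usage).
CONDITIONS = ["deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]


def build_commands(input_file, output_prefix):
    """Build one inference command per degradation condition."""
    commands = []
    for deg in CONDITIONS:
        # Hypothetical naming: tag each output file with its condition.
        output_file = f"{output_prefix}_{deg}.txt"
        commands.append([
            "python", "scripts/nullrecurrent_inference.py",
            "-d", deg, "-i", input_file, "-o", output_file,
        ])
    return commands


# Print the commands rather than running them.
for cmd in build_commands("233_sequences.txt", "233x_nullrecurrent_output"):
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run once the dependencies described below are installed.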

Dependencies

Install via pip install -r requirements.txt or conda install --file requirements.txt.

EternaFold, ViennaRNA, and Arnie are not pip-installable; see Setup below.

Setup

Install git-lfs (best to do before git-cloning this KaggleOpenVaccine repo).
Install EternaFold (the nullrecurrent model uses this), available for free noncommercial use here.
Install ViennaRNA (the DegScore-XGBoost model uses this), available here.
Git clone Arnie, which wraps EternaFold in python and allows RNA thermodynamic calculations across many packages. Follow instructions here to link EternaFold to it.
Add the path to this repository as KOV_PATH (so that the scripts can find the stored model files):

export KOV_PATH='/path/to/KaggleOpenVaccine'
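A script could resolve the model directory from this variable roughly as follows. This is a sketch under assumptions, not the repository's own lookup code: the helper name model_dir is hypothetical, and only the model_files directory name is taken from the Organization section above.

```python
import os
from pathlib import Path


def model_dir(env=os.environ):
    """Return the model_files directory under KOV_PATH (hypothetical helper)."""
    root = env.get("KOV_PATH")
    if not root:
        # Fail early with a clear message instead of a confusing file error.
        raise RuntimeError("KOV_PATH is not set; export it as described above")
    return Path(root) / "model_files"


# Example with an explicit mapping instead of the real environment.
example_dir = model_dir({"KOV_PATH": "/path/to/KaggleOpenVaccine"})
```

Checking the variable up front makes a missing export easier to diagnose than a failed .h5 file load later on.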

Usage

To run the nullrecurrent winning solution on one construct, given in example.txt:

CGC

Run

python scripts/nullrecurrent_inference.py [-d deg] -i example.txt -o predict.txt

where deg is one of the following options:

deg_Mg_pH10
deg_pH10
deg_Mg_50C
deg_50C

Similarly, for the DegScore-XGBoost model:

python scripts/degscore-xgboost_inference.py -i example.txt -o predict.txt

This writes a text file of output predictions to predict.txt:

(Nullrecurrent output)

2.1289976365, 2.650808962, 2.1869660805000004

(DegScore-XGBoost output)

0.2697107, 0.37091506, 0.48528114
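To load such a file for downstream analysis, a minimal parser might look like the sketch below. It assumes, based only on the example output above, that each line holds comma-separated per-nucleotide values for one construct; the function name parse_predictions is hypothetical.

```python
def parse_predictions(text):
    """Parse prediction text into one list of floats per non-empty line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([float(value) for value in line.split(",")])
    return rows


# The Nullrecurrent example output for the three-nucleotide construct CGC.
example = "2.1289976365, 2.650808962, 2.1869660805000004\n"
rows = parse_predictions(example)
```

With a real file, the same function can be applied to the contents of predict.txt read as a string.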

A note on energy model versions

The predictions in the Kaggle competition and for the manuscript were performed with EternaFold parameters and CONTRAfold-SE code. The currently available EternaFold code will result in slightly different values. For more on the difference, see the EternaFold README.

Individual Kaggle Solutions

This code is based on the winning solution for the Open Vaccine Kaggle Competition Challenge. The competition can be found here:

https://www.kaggle.com/c/stanford-covid-vaccine/overview

This code is also the supplementary material for the Kaggle Competition Solution Paper. The individual Kaggle writeups for the top solutions that have been featured in that paper can be found in the following table:

Team Name	Team Members	Rank	Link to the solution
Nullrecurrent	Jiayang Gao	1	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189620
Kazuki ** 2	Kazuki Onodera, Kazuki Fujikawa	2	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189709
Striderl	Hanfei Mao	3	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189574
FromTheWheel & Dyed & StoneShop	Gilles Vandewiele, Michele Tinti, Bram Steenwinckel	4	https://www.kaggle.com/group16/covid-19-mrna-4th-place-solution
tito	Takuya Ito	5	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189691
nyanp	Taiga Noumi	6	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189241
One architecture	Shujun He	7	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189564
ishikei	Keiichiro Ishi	8	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/190314
Keep going to be GM	Youhan Lee	9	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189845
Social Distancing Please	Fatih Öztürk, Anthony Chiu, Emin Ozturk	11	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189571
The Machine	Karim Amer, Mohamed Fares	13	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189585

Releases (v1.0)

v1.0 (Sep 30, 2022)

Release to accompany Wayment-Steele et al. (2022) "Deep learning models for predicting RNA degradation via dual crowdsourcing".

Owner

Eternagame
