Tandem Mass Spectrum Prediction with Graph Transformers

Overview

MassFormer

This is the original implementation of MassFormer, a graph transformer for small molecule MS/MS prediction. See the preprint on arXiv.

Setting Up Environment

We recommend using conda. Three conda YAML files are provided in the env/ directory (cpu.yml, cu101.yml, cu102.yml), one for each PyTorch installation option (CPU-only, CUDA 10.1, CUDA 10.2). They can be easily modified to support other versions of CUDA.

To set up an environment, run the command conda env create -f ${CONDA_YAML}, where ${CONDA_YAML} is the path to the desired yaml file.
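As a small illustration of the command above, the following sketch maps an installation target to one of the three provided env files and builds the matching conda command. The function name conda_create_command and the keys "cpu", "10.1", and "10.2" are made up for this example; only the file names and the conda env create -f invocation come from this README.

```python
from __future__ import annotations

from pathlib import Path

# File names are the ones listed in env/; the dictionary keys are labels
# invented for this sketch, not something the repository defines.
ENV_FILES = {"cpu": "cpu.yml", "10.1": "cu101.yml", "10.2": "cu102.yml"}


def conda_create_command(target: str, env_dir: str = "env") -> list[str]:
    """Return the `conda env create` command for the requested target."""
    if target not in ENV_FILES:
        raise ValueError(f"no env file for {target!r}; choose from {sorted(ENV_FILES)}")
    yaml_path = (Path(env_dir) / ENV_FILES[target]).as_posix()
    return ["conda", "env", "create", "-f", yaml_path]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(conda_create_command("10.2")))
```

The returned list can be passed to subprocess.run if you prefer to create the environment from a script.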

Downloading NIST Data

Note: this step requires a Windows system or virtual machine.

The NIST 2020 LC-MS/MS dataset can be purchased from an authorized distributor. The spectra and associated compounds can be exported to MSP/MOL format using the included lib2nist software. There is a single MSP file which contains all of the mass spectra, and multiple MOL files which include the molecular structure information for each spectrum (linked by ID). We've included a screenshot describing the lib2nist export settings.


There is a minor bug in the export software that sometimes results in errors when parsing the MOL files. To fix this bug, run the script python mol_fix.py ${MOL_DIR}, where ${MOL_DIR} is a path to the NIST export directory with MOL files.

Downloading MassBank Data

The MassBank of North America (MB-NA) data is in MSP format, with the chemical information provided in the form of a SMILES string (as opposed to a MOL file). It can be downloaded from the MassBank website, under the tab "LC-MS/MS Spectra".
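For readers unfamiliar with MSP files, here is a minimal reader sketch. It is not the parser used by parse_and_export.py. It assumes records are separated by blank lines, header lines have the form "Key: value", and each peak sits on its own line as "m/z intensity"; real exports may deviate from this (for example, several peaks per line). The sample text is invented for illustration.

```python
from __future__ import annotations


def parse_msp(text: str) -> list[dict]:
    """Parse MSP-formatted text into a list of records.

    Each record has a "meta" dict of header fields and a "peaks" list of
    (m/z, intensity) tuples. Assumes blank-line-separated records and one
    peak per line.
    """
    records = []
    meta, peaks = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if meta or peaks:
                records.append({"meta": meta, "peaks": peaks})
                meta, peaks = {}, []
            continue
        if ":" in line and not line[0].isdigit():
            # Header line such as "Name: ..." or "Num Peaks: ...".
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
        else:
            # Peak line: the first two tokens are m/z and intensity.
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))
    if meta or peaks:
        records.append({"meta": meta, "peaks": peaks})
    return records


# Invented sample data, only to show the expected layout.
SAMPLE = """Name: Example compound
PrecursorMZ: 195.0877
Num Peaks: 2
138.0662 100.0
110.0713 45.5

Name: Another compound
Num Peaks: 1
91.0542 999
"""

if __name__ == "__main__":
    for record in parse_msp(SAMPLE):
        print(record["meta"]["Name"], len(record["peaks"]))
```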

Exporting and Preparing Data

We recommend creating a directory called data/ and placing the downloaded and uncompressed data into a folder data/raw/.

To parse both of the datasets, run parse_and_export.py. Then, to prepare the data for model training, run prepare_data.py. By default the processed data will end up in data/proc/.
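The directory layout described above can be created with a few lines of Python. This is a convenience sketch, not part of the repository: the helper name make_data_dirs is made up, and creating data/proc/ in advance is an assumption (prepare_data.py may well create it itself).

```python
import tempfile
from pathlib import Path


def make_data_dirs(root: str = ".") -> dict:
    """Create data/raw/ and data/proc/ under `root` and return their paths."""
    data = Path(root) / "data"
    dirs = {"raw": data / "raw", "proc": data / "proc"}
    for path in dirs.values():
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    # Demonstrate in a throwaway directory instead of the working tree.
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in make_data_dirs(tmp).items():
            print(name, path.is_dir())
```

After placing the uncompressed downloads in data/raw/, run parse_and_export.py first and prepare_data.py second, as described above.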

Setting Up Weights and Biases

Our implementation uses Weights and Biases (W&B) for logging and visualization. For full functionality, you must set up a free W&B account.

Training Models

A default config file is provided in "config/template.yml". This trains a MassFormer model on the NIST HCD spectra. Our experiments used systems with 32GB RAM, 1 Nvidia RTX 2080 (11GB VRAM), and 6 CPU cores.

The config/ directory has a template config file template.yml and 8 files corresponding to the experiments from the paper. The template config can be modified to train models of your choosing.

To train the template model on CPU only, without W&B, run python runner.py -w False -d -1

To train a template model with W&B on CUDA device 0, run python runner.py -w True -d 0

Reproducing Tables

To reproduce a model from one of the experiments in Table 2 or Table 3 from the paper, run python runner.py -w True -d 0 -c ${CONFIG_YAML} -n 5 -i ${RUN_ID}, where ${CONFIG_YAML} refers to a specific yaml file in the config/ directory and ${RUN_ID} refers to an arbitrary but unique integer ID.
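When reproducing several experiments, it can help to generate the commands programmatically so that each run gets a distinct ID. The sketch below only assembles the command line shown above; the helper name runner_command and the config file names in the demo are hypothetical, and the value passed to -n is copied as-is from the command above.

```python
from __future__ import annotations


def runner_command(config_yaml: str, run_id: int, device: int = 0,
                   use_wandb: bool = True, n: int = 5) -> list[str]:
    """Build the runner.py invocation using the flags shown in this README."""
    return [
        "python", "runner.py",
        "-w", str(use_wandb),   # "True" or "False", as in the examples above
        "-d", str(device),      # -1 for CPU, otherwise a CUDA device index
        "-c", config_yaml,
        "-n", str(n),
        "-i", str(run_id),
    ]


if __name__ == "__main__":
    # Hypothetical config names; substitute real files from config/.
    configs = ["config/example_a.yml", "config/example_b.yml"]
    for run_id, config in enumerate(configs, start=1):
        print(" ".join(runner_command(config, run_id)))
```

Each list can be passed to subprocess.run(cmd, check=True) to launch the run from the repository root.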

Reproducing Visualizations

The explain.py script can be used to reproduce the visualizations in the paper, but requires a trained model saved on W&B (i.e. by running a script from the previous section).

To reproduce a visualization from Figures 2, 3, 4, or 5, run python explain.py ${WANDB_RUN_ID} --wandb_mode=online, where ${WANDB_RUN_ID} is the unique W&B run ID of the desired model's completed training run. The figures will be uploaded as PNG files to W&B.

Reproducing Sweeps

The W&B sweep config files that were used to select model hyperparameters can be found in the sweeps/ directory. They can be initialized using wandb sweep ${PATH_TO_SWEEP}.

Owner
Röst Lab
Röst lab at U of T -- join us at https://gitter.im/Roestlab/Lobby