Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

Overview

(intron Interrogator and Classifier)

intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), using a genome and annotation or the sequences themselves. Alternatively, intronIC can be used to simply extract all intron sequences without classification (using -s).

Installation

via pip

If you have (or can get) pip, running it on this repo is the easiest way to install the most recent version of intronIC (if you have multiple versions of Python installed, be sure to use the appropriate Python 3 version e.g. python3 in the following commands):

python3 -m pip install git+https://github.com/glarue/intronIC

Alternatively, you can get the last stable version published to PyPI:

python3 -m pip install intronIC

If successful, intronIC should now be callable from the command-line.

To upgrade to the latest version from a previous one, include --upgrade in either of the previous pip commands, e.g.

python3 -m pip install git+https://github.com/glarue/intronIC --upgrade

via git clone

Otherwise, you can simply clone this repository to your local machine using git:

git clone https://github.com/glarue/intronIC.git
cd intronIC/intronIC

If you clone the repo, you may also wish to add intronIC/intronIC to your system PATH (how best to do this depends on your platform).

See the wiki for more detailed information about configuration/run options.

Dependencies

To install dependencies separately using pip, do

python3 -m pip install numpy scipy matplotlib 'scikit-learn>=0.22' biogl
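Because the command above pins scikit-learn to 0.22 or newer, it can be worth confirming what is actually installed before a run. The following is a minimal sketch and not part of intronIC: the helper names `parse_version`, `meets_minimum` and `check_dependencies` are hypothetical, and the version comparison is deliberately simplistic (leading numeric components only).

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn '0.24.1' into (0, 24, 1); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad to equal length so that (0, 22) compares equal to (0, 22, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies():
    """Print the installed version (if any) of each package in the pip command."""
    for package in ("numpy", "scipy", "matplotlib", "scikit-learn", "biogl"):
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            print(f"{package}: not installed")
            continue
        note = ""
        if package == "scikit-learn" and not meets_minimum(installed, "0.22"):
            note = " (older than the required 0.22)"
        print(f"{package}: {installed}{note}")
```

Calling `check_dependencies()` in the same Python environment that will run intronIC prints one line per package.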

intronIC was built and tested on Linux, but should run on Windows or Mac OSes without too much trouble (I say that now...).

Useful arguments

The required arguments for any classification run include a name (-n; see note below), along with either of the following:

  • Genome (-g) and annotation/BED (-a, -b) files

    —OR—

  • Intron sequences file (-q) (see Training-data-and-PWMs for formatting information, which matches the reference sequence format)

By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:

  • -p parallel processes, which can significantly reduce runtime

  • -f cds use only CDS features to identify introns (by default, uses both CDS and exon features)

  • --no_nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries

  • -i include introns from multiple isoforms of the same gene (default: longest isoform only)
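When intronIC is driven from a script, the flags listed above can be assembled programmatically. This is only a sketch under stated assumptions: `build_command` is a hypothetical helper rather than part of intronIC, it assumes that -p takes an integer process count, and it only builds the argument list without running anything. The file names are those of the chromosome 19 test data.

```python
import shlex


def build_command(name, genome=None, annotation=None, sequences=None,
                  processes=None, cds_only=False,
                  exclude_noncanonical=False, all_isoforms=False):
    """Assemble an intronIC argument list from the flags described above."""
    if sequences is None and (genome is None or annotation is None):
        raise ValueError(
            "supply either a genome and annotation, or an intron sequences file"
        )
    cmd = ["intronIC", "-n", name]
    if sequences is not None:
        cmd += ["-q", sequences]
    else:
        cmd += ["-g", genome, "-a", annotation]
    if processes is not None:
        cmd += ["-p", str(processes)]  # assumption: -p takes a process count
    if cds_only:
        cmd += ["-f", "cds"]
    if exclude_noncanonical:
        cmd.append("--no_nc")
    if all_isoforms:
        cmd.append("-i")
    return cmd


cmd = build_command(
    "homo_sapiens",
    genome="Homo_sapiens.Chr19.Ensembl_91.fa.gz",
    annotation="Homo_sapiens.Chr19.Ensembl_91.gff3.gz",
    processes=4,
    exclude_noncanonical=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once intronIC is installed and the input files are present.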

Running on test data

  • If you have installed via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.

  • If you have cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. Replace intronIC with ../intronIC.py in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; information can be cross-referenced by using the intron label (usually the first column of the file). By default, U12-type introns are those with probability scores >90% or, equivalently (depending on the output file), relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:

HomSap-gene:[email protected]:ENST00000614285-intron_1(47);[c:-1]      10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT...AATAC|TCC       CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC       50719   transcript:ENST00000614285      gene:ENSG00000141837    1       47      3.9     2       u12     cds

To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

awk '($2!="." && $2>0)' homo_sapiens.meta.iic
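The same filter can be written in Python when the results feed into further analysis. This is a rough equivalent of the awk one-liner, not code from intronIC: it assumes whitespace-separated columns with the relative score in the second column and "." marking rows without a score, as the awk condition implies. The function name `u12_rows` and the three demo rows are made up for illustration.

```python
def u12_rows(lines):
    """Yield the split fields of rows whose relative score (column 2) is > 0."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[1] == ".":
            continue  # no score available for this row
        try:
            score = float(fields[1])
        except ValueError:
            continue  # second column is not numeric
        if score > 0:
            yield fields


# Hypothetical rows, shortened to three columns for the demonstration.
demo = [
    "intron_a\t10.0\tAT-AC",
    "intron_b\t-35.2\tGT-AG",
    "intron_c\t.\tGT-AG",
]
kept = [fields[0] for fields in u12_rows(demo)]
print(kept)
```

To apply it to real output, open the file and pass the handle in place of `demo`, e.g. `with open("homo_sapiens.meta.iic") as handle: rows = list(u12_rows(handle))`.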

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), add the -s flag:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s

See the rest of the wiki for more details about output files, etc.

A note on the -n (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer to have it leave whatever argument you supply to -n unmodified, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
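The naming behavior described above can be illustrated with a short sketch. It is a guess at the rule rather than intronIC's actual code: it assumes the tag is built from the first three letters of the genus and of the species, which matches the "HomSap" example but is not stated explicitly here. The function name `name_to_tag` is hypothetical, and returning single-word names unchanged is likewise an assumption.

```python
import re


def name_to_tag(name):
    """Sketch of the tag rule: 'homo_sapiens' -> 'HomSap' (assumed 3+3 letters)."""
    # Split on runs of non-alphanumeric characters and drop empty pieces.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    if len(parts) < 2:
        return name  # assumption: names without two parts are left as-is
    genus, species = parts[0], parts[1]
    # Anything after the second part (e.g. 'v2') is ignored.
    return genus[:3].capitalize() + species[:3].capitalize()


for example in ("homo_sapiens", "homo.sapiens", "homo_sapiens.v2"):
    print(example, "->", name_to_tag(example))
```

All three inputs map to the same tag, which is why a suffix such as ".v2" changes the output file names but not the intron labels.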

Resource usage

For genomes with a large number of annotated introns, memory usage can be on the order of gigabytes. This should rarely be a problem even for most modern personal computers, however. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.

For many non-model genomes, intronIC should run fairly quickly (e.g. tens of minutes). For human and other very well annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing). Runtime scales roughly linearly with the total number of annotated introns, and can be reduced by using parallel processes via -p.

See the rest of the wiki for more detailed instructions.

Cite

If you find this tool useful, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett, Comprehensive database and evolutionary dynamics of U12-type introns, Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078, https://doi.org/10.1093/nar/gkaa464

About

intronIC was written to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from annotated intron sequences. Minor introns usually represent ~0.5% (at most) of a given genome's introns, and contain distinct splicing motifs which make them amenable to bioinformatic identification.

Earlier minor intron resources (U12DB, SpliceRack, ERISdb, etc.), while important contributions to the field, are static by design. As such, these databases fail to reflect the dramatic increase in available genome sequences and annotation quality of the last decade.

In addition, other published identification methods employ a certain amount of heuristic fuzziness in defining the classification criteria of their U12-type scoring systems (i.e., how "U12-like" does an intron need to look before being called a U12-type intron). intronIC relegates this decision to the well-established support-vector machine (SVM) classification method, which produces an easy-to-interpret "probability of being U12-type" score for each intron.

Furthermore, intronIC provides researchers the opportunity to tailor the underlying training data/position-weight matrices, should they have species-specific data to take advantage of.

Finally, intronIC performs a fair amount of bookkeeping during the intron collection process, resulting in (potentially) useful metadata about each intron, including parent gene/transcript, ordinal index and phase. This information is (as far as I'm aware) otherwise somewhat non-trivial to acquire.

Comments
  • [BUG] intronIC not working for example data and own data

    I am trying to run intronIC with the example data and my own data, and it is not working.

    COMMAND:

    intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens (same as wiki)

    FEEDBACK:

    [#] Starting intronIC [v1.1.1] run on [homo_sapiens (HomSap)]
    [#] Run command: [/home/rocesv/anaconda3/envs/Seidr/bin/intronIC -g /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens]
    [#] Using [cds,exon] features to define introns
    [#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
    [#] [38681] introns with redundant coordinates excluded
    [#] [8178] introns omitted from scoring based on the following criteria:
    [#] * short (<30 nt): 66
    [#] * ambiguous nucleotides in scoring regions: 0
    [#] * non-canonical boundaries: 0
    [#] * overlapping coordinates: 0
    [#] * not in longest isoform: 8112
    [#] Most common non-canonical splice sites:
    [#] * AT-AG (16/328, 4.88%)
    [#] * GT-TG (12/328, 3.66%)
    [#] * GG-AG (12/328, 3.66%)
    [#] * GA-AG (11/328, 3.35%)
    [#] * AG-AG (10/328, 3.05%)
    [#] [24] ([15] unique, [9] redundant) putatively misannotated U12-type introns corrected in [homo_sapiens.annotation.iic]
    [#] [12074] introns included in scoring analysis
    [#] Scoring introns using the following regions: [five, bp]
    [#] Raw scores calculated for [20690] U2 and [387] U12 reference introns
    [#] Raw scores calculated for [12074] experimental introns
    [#] Training set score vectors constructed: [20690] U2, [387] U12
    [#] Training SVM using reference data
    Starting optimization round 1/5
    Traceback (most recent call last):
      File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module>
        sys.exit(main())
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main
        finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores
        model, model_performance = optimize_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm
        search_model, performance = train_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm
        model = GridSearchCV(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
        return f(*args, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'iid'

    In both cases I have the same problem and the log is similar. Any idea? I am very interested in using this amazing tool.

    Thank you in advance :)

    PS: Running in a conda env with Python 3.9 (WSL 2, Ubuntu 20.04, Windows 10 Pro)

    opened by RocesV 2
Releases(v1.3.7)
  • v1.3.7(Jun 10, 2022)

  • v1.3.6(Jun 10, 2022)

    • Deal with edge-case issue where a gene feature has children exon/CDS features in a direct parent-child relationship. Previously, this would bypass the recursive search for introns used by get_introns() due to an early exit, resulting in preferential inclusion of introns whose Parent attribute was the gene itself rather than a child transcript.
    • Remove old code/fix whitespace
    • Update __version__ paradigm
    • Remove Physarum-specific branch-point PWM code
  • v1.3.2(Oct 20, 2021)

    Misc. minor changes not affecting functionality.

    Switch to limiting master (soon to be main) to point releases, with development code contained to dev.

  • v1.3.0(Jul 23, 2021)

    • Changes default scoring behavior to include all (5', BPS and 3') regions, instead of the previous default of just 5' and BPS. The 3' region typically contains less differentiation between U2- and U12-type introns, but may help reduce FP and FN classifier calls in edge cases. Of course, it's possible that it could also introduce FPs and/or FNs, although in my experience using all three seems to be more conservative than not.
    • Misc. minor internal changes.
  • v1.2.0(Feb 8, 2021)

    intronIC v1.2.0

    • Fix GridSearchCV regression with newer versions of scikit-learn (>=v0.22) (see issue #1)
    • Due to scikit-learn's inversion of a default flag in GridSearchCV, intronIC must now require scikit-learn to be at least v0.22
    • This fix breaks compatibility with scikit-learn versions <v0.22
  • v1.1.1(Dec 5, 2020)

    intronIC v1.1.1

    • Replace parent-child hierarchical clustering of annotation features with simpler, directed graph-based approach
    • Fix occasional issues where parent genes of CDS/exon features weren't correctly identified
  • v1.1.0(Oct 31, 2020)

    A number of changes to the underlying data in this release - the default PWMs have been changed to a slightly less-stringent set, which should leave most results relatively unchanged and deals with some edge-cases where the original PWMs were overly penalizing for certain base positions due to being built from low-N samples. Other changes include:

    • Default 3'SS region shortened to [-6, 4]
    • By default, the human U2-type BPS PWM is used instead of the on-the-fly version. A per-run PWM can be generated using --generate_u2_bps_pwm
    • z-scores in the output have been adjusted to correspond to the entire dataset (previously, they were based on the training set only)
    • Non-canonical introns by default now use whatever PWM is closest to their terminal dinucleotides if one is obvious (e.g. for AT-TC introns, this would be the AT-AC PWM; for AT-AG introns, GT-AG and AT-AC are equally close in terms of edit distance). Otherwise, the terminal dinucleotides will be ignored and the best PWM will be selected based on the geometric mean of the component scores from each PWM. This can be reverted to the old behavior using --no_ignore_nc_dnts
  • v1.0.14(Oct 30, 2020)

    intronIC v1.0.14

    • Uses human U2-type BPS PWM (data from Pineda 2018) by default. To restore the previous paradigm wherein U2-type BPS PWMs are generated on-the-fly using the best match to U12-type BPS motifs in likely U2-type introns, pass --generate_u2_bps_pwm.
  • v1.0.13(Sep 6, 2020)

    intronIC v1.0.13

    • Add best U2-type BPS to meta.iic output file. Previously, only the best U12-type BPS sequence was reported. In certain cases, it may be useful to know which U2-type sequence was used in determining the BPS log-ratio score.
    • Reduce formatting stringency for custom PWMs. This should reduce headaches if folks are adding their own PWMs by ignoring case, etc.
    • Add clause to terminate multiprocessing pool processes on forced exit. There were cases I'd noticed in my own usage when force-exiting (e.g. via ctrl-c) where zombie processes would persist. Wrapping the whole thing in a try/except/finally seems to eliminate the issue (limited testing).
  • v1.0.12(Aug 21, 2020)

Owner
Graham Larue
PhD candidate in bioinformatics and molecular evolution at UC Merced