Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

Last update: Jul 26, 2022

Overview

(intron Interrogator and Classifier)

intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), using a genome and annotation or the sequences themselves. Alternatively, intronIC can be used to simply extract all intron sequences without classification (using -s).

Installation

via `pip`

If you have (or can get) pip, running it on this repo is the easiest way to install the most recent version of intronIC (if you have multiple versions of Python installed, be sure to use the appropriate Python 3 version e.g. python3 in the following commands):

python3 -m pip install git+https://github.com/glarue/intronIC

Alternatively, you can get the last stable version published to PyPI:

python3 -m pip install intronIC

If successful, intronIC should now be callable from the command-line.

To upgrade to the latest version from a previous one, include --upgrade in either of the previous pip commands, e.g.

python3 -m pip install git+https://github.com/glarue/intronIC --upgrade

via `git clone`

Otherwise, you can simply clone this repository to your local machine using git:

git clone https://github.com/glarue/intronIC.git
cd intronIC/intronIC

If you clone the repo, you may also wish to add intronIC/intronIC to your system PATH (how best to do this depends on your platform).

See the wiki for more detailed information about configuration/run options.

Dependencies

Python >=3.3
numpy & scipy
scikit-learn >=0.22
biogl
matplotlib (optional, required for plotting)

To install dependencies separately using pip, do

python3 -m pip install numpy scipy matplotlib 'scikit-learn>=0.22' biogl
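To confirm that an existing environment satisfies the scikit-learn minimum listed above, something like the following sketch may help. The helper names (`version_tuple`, `meets_minimum`) are hypothetical and not part of intronIC, and the comparison only looks at the leading numeric components of each version string, so pre-release suffixes are ignored.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component (e.g. "post1")
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad to equal length so that "0.22" compares equal to "0.22.0"
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        import sklearn
    except ImportError:
        print("scikit-learn is not installed")
    else:
        ok = meets_minimum(sklearn.__version__, "0.22")
        print("scikit-learn", sklearn.__version__, "OK" if ok else "too old (need >=0.22)")
```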

intronIC was built and tested on Linux, but should run on Windows or Mac OSes without too much trouble (I say that now...).

Useful arguments

The required arguments for any classification run include a name (-n; see note below), along with either of the following:

Genome (-g) and annotation/BED (-a, -b) files

—OR—
Intron sequences file (-q) (see Training-data-and-PWMs for formatting information, which matches the reference sequence format)

By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:

-p parallel processes, which can significantly reduce runtime
-f cds use only CDS features to identify introns (by default, uses both CDS and exon features)
--no_nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries
-i include introns from multiple isoforms of the same gene (default: longest isoform only)
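If you drive intronIC from a script, the arguments above can be assembled programmatically. The following is a minimal sketch, not part of intronIC: the function `build_intronic_command` and its parameter names are hypothetical, it only uses the flags described in this section, and it assumes that `-p` takes an integer number of processes.

```python
import shlex


def build_intronic_command(name, genome=None, annotation=None, sequences=None,
                           processes=None, cds_only=False,
                           exclude_noncanonical=False, all_isoforms=False):
    """Assemble an intronIC argument list from the options described above."""
    if sequences is None and (genome is None or annotation is None):
        raise ValueError("supply either genome + annotation or an intron sequences file")
    cmd = ["intronIC", "-n", name]
    if sequences is not None:
        cmd += ["-q", sequences]                 # intron sequences file
    else:
        cmd += ["-g", genome, "-a", annotation]  # genome + annotation
    if processes is not None:
        cmd += ["-p", str(processes)]            # parallel processes (assumed integer)
    if cds_only:
        cmd += ["-f", "cds"]                     # CDS features only
    if exclude_noncanonical:
        cmd.append("--no_nc")                    # drop non-canonical introns
    if all_isoforms:
        cmd.append("-i")                         # include multiple isoforms
    return cmd


if __name__ == "__main__":
    cmd = build_intronic_command(
        "homo_sapiens",
        genome="Homo_sapiens.Chr19.Ensembl_91.fa.gz",
        annotation="Homo_sapiens.Chr19.Ensembl_91.gff3.gz",
        processes=4,
    )
    # Print a shell-safe version of the command for inspection
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with unusual file names.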

Running on test data

If you have installed via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.
If you have cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. Replace intronIC with ../intronIC.py in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; information can be cross-referenced by using the intron label (usually the first column of the file). U12-type introns are those (by default) with probability scores >90%, or equivalently (depending on the output file) relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:

HomSap-gene:[email protected]:ENST00000614285-intron_1(47);[c:-1]      10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT...AATAC|TCC       CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC       50719   transcript:ENST00000614285      gene:ENSG00000141837    1       47      3.9     2       u12     cds

To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

awk '($2!="." && $2>0)' homo_sapiens.meta.iic
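The same filter can be written in Python if you want to post-process the rows further. This is a sketch under the assumption that meta.iic is tab-separated with the relative score in the second column, as described above; the function name `u12_rows` and the sample rows in the demo are made up for illustration.

```python
import io


def u12_rows(handle, score_column=1):
    """Yield rows (as lists of fields) whose relative score is > 0."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= score_column:
            continue  # too few columns to contain a score
        score = fields[score_column]
        if score == ".":
            continue  # mirrors the $2 != "." test in the awk command
        try:
            value = float(score)
        except ValueError:
            continue  # non-numeric entry, e.g. a header line
        if value > 0:
            yield fields


if __name__ == "__main__":
    # Made-up rows, not real intronIC output
    sample = io.StringIO(
        "intron_a\t10.0\tAT-AC\n"
        "intron_b\t-35.2\tGT-AG\n"
        "intron_c\t.\tGT-AG\n"
    )
    for row in u12_rows(sample):
        print(row[0], row[1])
```

For a real run, replace the `io.StringIO` object with `open("homo_sapiens.meta.iic")`.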

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), add the -s flag:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s

See the rest of the wiki for more details about output files, etc.

A note on the `-n` (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer to have it leave whatever argument you supply to -n unmodified, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
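The naming behavior described above can be illustrated with a short sketch. This is not intronIC's actual code: the function `species_tag` is hypothetical, and the rule of taking the first three letters of the genus and species is an assumption inferred from the single "HomSap" example given here.

```python
import re


def species_tag(name, keep_name=False):
    """Approximate the tag used to label intron IDs for a given -n value."""
    if keep_name:
        # Models the --na flag: leave the supplied name unmodified
        return name
    # Split on runs of non-alphanumeric characters and keep the first two
    # parts, ignoring anything after the second separator
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p][:2]
    # Assumed abbreviation rule: first three letters of each part, capitalized
    return "".join(p[:3].capitalize() for p in parts)


if __name__ == "__main__":
    for name in ("homo_sapiens", "homo.sapiens", "homo_sapiens.v2"):
        print(name, "->", species_tag(name))
```

Under these assumptions all three names map to the same tag, which is why adding a suffix such as ".v2" changes only the output file names.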

Resource usage

For genomes with a large number of annotated introns, memory usage can be on the order of gigabytes. This should rarely be a problem even for most modern personal computers, however. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.

For many non-model genomes, intronIC should run fairly quickly (e.g. tens of minutes). For human and other very well annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing); run time scales relatively linearly with the total number of annotated introns, and can be improved by using parallel processes via -p.

See the rest of the wiki for more detailed instructions.

Cite

If you find this tool useful, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett, Comprehensive database and evolutionary dynamics of U12-type introns, Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078, https://doi.org/10.1093/nar/gkaa464

About

intronIC was written to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from annotated intron sequences. Minor introns usually represent ~0.5% (at most) of a given genome's introns, and contain distinct splicing motifs which make them amenable to bioinformatic identification.

Earlier minor intron resources (U12DB, SpliceRack, ERISdb, etc.), while important contributions to the field, are static by design. As such, these databases fail to reflect the dramatic increase in available genome sequences and annotation quality of the last decade.

In addition, other published identification methods employ a certain amount of heuristic fuzziness in defining the classification criteria of their U12-type scoring systems (i.e., how "U12-like" does an intron need to look before being called a U12-type intron). intronIC delegates this decision to the well-established support-vector machine (SVM) classification method, which produces an easy-to-interpret "probability of being U12-type" score for each intron.

Furthermore, intronIC provides researchers the opportunity to tailor the underlying training data/position-weight matrices, should they have species-specific data to take advantage of.

Finally, intronIC performs a fair amount of bookkeeping during the intron collection process, resulting in (potentially) useful metadata about each intron including parent gene/transcript, ordinal index and phase, information which (as far as I'm aware) is otherwise somewhat non-trivial to acquire.


Comments

[BUG] intronIC not working for example data and own data

I am trying to run intronIC with example/own data and it is not working.

COMMAND:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens (same as wiki)

FEEDBACK:

[#] Starting intronIC [v1.1.1] run on [homo_sapiens (HomSap)]
[#] Run command: [/home/rocesv/anaconda3/envs/Seidr/bin/intronIC -g /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens]
[#] Using [cds,exon] features to define introns
[#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with redundant coordinates excluded
[#] [8178] introns omitted from scoring based on the following criteria:
[#] * short (<30 nt): 66
[#] * ambiguous nucleotides in scoring regions: 0
[#] * non-canonical boundaries: 0
[#] * overlapping coordinates: 0
[#] * not in longest isoform: 8112
[#] Most common non-canonical splice sites:
[#] * AT-AG (16/328, 4.88%)
[#] * GT-TG (12/328, 3.66%)
[#] * GG-AG (12/328, 3.66%)
[#] * GA-AG (11/328, 3.35%)
[#] * AG-AG (10/328, 3.05%)
[#] [24] ([15] unique, [9] redundant) putatively misannotated U12-type introns corrected in [homo_sapiens.annotation.iic]
[#] [12074] introns included in scoring analysis
[#] Scoring introns using the following regions: [five, bp]
[#] Raw scores calculated for [20690] U2 and [387] U12 reference introns
[#] Raw scores calculated for [12074] experimental introns
[#] Training set score vectors constructed: [20690] U2, [387] U12
[#] Training SVM using reference data
Starting optimization round 1/5
Traceback (most recent call last):
  File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module>
    sys.exit(main())
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main
    finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores
    model, model_performance = optimize_svm(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm
    search_model, performance = train_svm(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm
    model = GridSearchCV(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'iid'

PROBLEM TRACEBACK:

Starting optimization round 1/5
Traceback (most recent call last):
  File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module>
    sys.exit(main())
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main
    finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores
    model, model_performance = optimize_svm(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm
    search_model, performance = train_svm(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm
    model = GridSearchCV(
  File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'iid'

In both cases I have the same problem and the log is similar. Any idea? I am very interested in using this amazing tool.

Thank you in advance :)

PS: Running in a conda env with Python 3.9 (WSL 2, Ubuntu 20.04, Windows 10 Pro)

opened by RocesV 2

Releases(v1.3.7)

v1.3.7(Jun 10, 2022)

Fix annoying issues with generating a version number under different installation scenarios.
v1.3.6(Jun 10, 2022)
Deal with edge-case issue where a gene feature has children exon/CDS features in a direct parent-child relationship. Previously, this would bypass the recursive search for introns used by get_introns() due to an early exit, resulting in preferential inclusion of introns whose Parent attribute was the gene itself rather than a child transcript.

Remove old code/fix whitespace

Update __version__ paradigm

Remove Physarum-specific branch-point PWM code

v1.3.2(Oct 20, 2021)

Misc. minor changes not affecting functionality.

Switch to limiting master (soon to be main) to point releases, with development code contained to dev.
v1.3.0(Jul 23, 2021)
Changes default scoring behavior to include all (5', BPS and 3') regions, instead of the previous default of just 5' and BPS. The 3' region typically contains less differentiation between U2- and U12-type introns, but may help reduce FP and FN classifier calls in edge cases. Of course, it's also possible that it could introduce FPs and/or FNs, although in my experience using all three seems to be more conservative than not.

Misc. minor internal changes.

v1.2.0(Feb 8, 2021)
intronIC v1.2.0

Fix GridSearchCV regression with newer versions of scikit-learn (>v0.22) (see issue #1)

Due to scikit-learn's inversion of a default flag in GridSearchCV, intronIC must now require scikit-learn to be at least v0.22

This fix breaks compatibility with scikit-learn versions <v0.22

v1.1.1(Dec 5, 2020)
intronIC v1.1.1

Replace parent-child hierarchical clustering of annotation features with simpler, directed graph-based approach

Fix occasional issues where parent genes of CDS/exon features weren't correctly identified

v1.1.0(Oct 31, 2020)
A number of changes to the underlying data in this release - the default PWMs have been changed to a slightly less-stringent set, which should leave most results relatively unchanged and deals with some edge-cases where the original PWMs were overly penalizing for certain base positions due to being built from low-N samples. Other changes include:

Default 3'SS region shortened to [-6, 4]

By default, the human U2-type BPS PWM is used instead of the on-the-fly version. A per-run PWM can be generated using --generate_u2_bps_pwm

z-scores in the output have been adjusted to correspond to the entire dataset (previously, they were based on the training set only)

Non-canonical introns by default now use whatever PWM is closest to their terminal dinucleotides if one is obvious (e.g. for AT-TC introns, this would be the AT-AC PWM; for AT-AG introns, GT-AG and AT-AC are equally close in terms of edit distance). Otherwise, the terminal dinucleotides will be ignored and the best PWM will be selected based on the geometric mean of the component scores from each PWM. This can be reverted to the old behavior using --no_ignore_nc_dnts
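The "closest PWM" selection for non-canonical introns described above can be sketched as follows. This is an illustration rather than intronIC's implementation: the candidate set `CANONICAL` is an assumption, the function names are hypothetical, and Hamming distance (substitutions only) stands in for edit distance since all boundary strings have the same length. A return value of `None` represents the tie case, where the release note says the terminal dinucleotides are ignored and the PWM is chosen from the component scores instead.

```python
# Assumed set of canonical boundary types that have their own PWMs
CANONICAL = ("GT-AG", "GC-AG", "AT-AC")


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_boundary_type(dnts, candidates=CANONICAL):
    """Return the uniquely closest canonical type, or None if there is a tie."""
    dnts = dnts.upper()
    distances = {c: hamming(dnts, c) for c in candidates}
    best = min(distances.values())
    winners = [c for c, d in distances.items() if d == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    for boundary in ("AT-TC", "AT-AG"):
        print(boundary, "->", closest_boundary_type(boundary))
```

With this candidate set, AT-TC resolves to AT-AC, while AT-AG is equally close to GT-AG and AT-AC and so yields no unique match, matching the two examples in the release note.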

v1.0.14(Oct 30, 2020)
intronIC v1.0.14

Uses human U2-type BPS PWM (data from Pineda 2018) by default. To restore the previous paradigm wherein U2-type BPS PWMs are generated on-the-fly using the best match to U12-type BPS motifs in likely U2-type introns, pass --generate_u2_bps_pwm.

v1.0.13(Sep 6, 2020)
intronIC v1.0.13

Add best U2-type BPS to meta.iic output file. Previously, only the best U12-type BPS sequence was reported. In certain cases, it may be useful to know which U2-type sequence was used in determining the BPS log-ratio score.

Reduce formatting stringency for custom PWMs. This should reduce headaches if folks are adding their own PWMs by ignoring case, etc.

Add clause to terminate multiprocessing pool processes on forced exit. There were cases I'd noticed in my own usage when force-exiting (e.g. via ctrl-c) where zombie processes would persist. Wrapping the whole thing in a try/except/finally seems to eliminate the issue (limited testing).

v1.0.12(Aug 21, 2020)

Manual release to trigger Zenodo archive. No functional difference between this and previous version (v1.0.11).

Owner

Graham Larue

PhD candidate in bioinformatics and molecular evolution at UC Merced

