AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures.

Overview

AptaMAT

Purpose

AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures. The method compares the matrices representing the two secondary structures under analysis, which are comparable to dotplots. The dot-bracket notation of each structure is converted into a half binary matrix whose width equals the structure's length. Each matrix cell (i,j) is filled with '1' if the nucleotide in position i is paired with the nucleotide in position j, and with '0' otherwise.
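The conversion described above can be sketched as follows. This is a minimal illustration, not the code of AptaMat.py: the function name `dotbracket_to_matrix` is hypothetical, only the plain `(`, `)` and `.` symbols are handled, and plain Python lists are used instead of NumPy arrays to keep the sketch dependency-free.

```python
def dotbracket_to_matrix(structure):
    """Return an n x n list of lists with 1 at (i, j) when i pairs with j (i < j)."""
    n = len(structure)
    matrix = [[0] * n for _ in range(n)]
    open_positions = []  # stack of indices of unmatched '('
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {j}")
            i = open_positions.pop()
            matrix[i][j] = 1  # only the upper half (i < j) is ever filled
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {j}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return matrix


if __name__ == "__main__":
    for row in dotbracket_to_matrix("(((...)))"):
        print("".join(str(cell) for cell in row))
```

Because an opening bracket always precedes its closing partner, every '1' lands above the diagonal, which is why the matrix is only half filled.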

The difference between the matrices is calculated by applying the Manhattan distance from each point in the template matrix to all the points of the compared matrix. The calculation is then repeated from the compared matrix to the template matrix so that all differences are handled. Both results are summed and divided by the sum of all the points in both matrices.
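One possible reading of this calculation is sketched below. It rests on two assumptions that the text does not confirm: each point is scored by its Manhattan distance to the nearest point of the other matrix, and the total is divided by the number of base pairs in both structures. The names `base_pairs`, `nearest_sum` and `sketch_distance` are hypothetical, and the values returned are not expected to match the scores printed by AptaMat.py, whose exact normalization is not spelled out here.

```python
def base_pairs(structure):
    """List the (i, j) coordinates of the '1' cells for a plain dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return pairs


def nearest_sum(source, target):
    """Sum, over the source points, of the Manhattan distance to the nearest target point."""
    return sum(
        min(abs(i - k) + abs(j - l) for k, l in target)
        for i, j in source
    )


def sketch_distance(template, compared):
    """Symmetric point-set distance, normalized by the total number of points (assumption)."""
    a, b = base_pairs(template), base_pairs(compared)
    if not a or not b:
        raise ValueError("both structures need at least one base pair in this sketch")
    # Template -> compared, then compared -> template, so no difference is missed.
    return (nearest_sum(a, b) + nearest_sum(b, a)) / (len(a) + len(b))


if __name__ == "__main__":
    print(sketch_distance("(((...)))", "((.....))"))
```

Running the calculation in both directions matters: a base pair present only in the compared structure contributes nothing to the template-to-compared pass.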

Dependencies

AptaMat is written for Python 3.8+.

Two Python modules are needed: NumPy and SciPy.

These can be installed by typing either of the following in the command prompt:

./setup

or

pip install numpy
pip install scipy

Use of Anaconda is highly recommended.

Usage

AptaMat is a flexible Python script which can take several arguments:

  • -structures, followed by secondary structures written in dot-bracket format
  • -files, followed by the path to formatted files containing one or several secondary structures in dot-bracket format

The structures and files functions are independent in the script and cannot be called at the same time.

usage: AptaMat.py [-h] [-structures STRUCTURES [STRUCTURES ...]] [-files FILES [FILES ...]]

The structures argument takes secondary structures formatted as strings. The first input structure is the template structure for the comparison. The following inputs are the compared structures. There is no limit on the number of inputs. Quotes are necessary.

usage: AptaMat.py -structures [-h] "struct_1" "struct_2" ["struct_n" ...]

The files argument must be a formatted file. Multiple files can be parsed. The first structure encountered during the parsing is used as the template structure. The others are the compared structures.

usage: AptaMat.py -files [-h] struct_file_1 [struct_file_n ...]

The input must be a text file containing at least the secondary structures. It may also contain additional information such as a title, a sequence or a structure index. If several files are provided, the function parses them one by one and always takes the first structure encountered as the template structure. Files must be formatted as follows:

>5HRU
TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
--Template--
((((.........(((((.....)))))))))
--Compared--
.........(((.(((((.....))))).)))
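A reader for this layout could look like the sketch below. It is an illustration under assumptions, not the parser used by AptaMat.py: the function name `parse_structures` is hypothetical, a structure line is taken to be any line made only of `(`, `)` and `.`, a line starting with `>` is taken as the title, and every other line (sequence, `--Template--`, `--Compared--`) is ignored. Extended dot-bracket symbols are not handled.

```python
import re

# Assumption: a structure line contains nothing but plain dot-bracket symbols.
STRUCTURE_LINE = re.compile(r"^[().]+$")


def parse_structures(text):
    """Return (title, template, compared_list) from the content of one input file."""
    title = None
    structures = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith(">"):
            title = line[1:].strip()
        elif STRUCTURE_LINE.match(line):
            structures.append(line)
    if not structures:
        raise ValueError("no dot-bracket structure found")
    # The first structure encountered is the template, the others are compared.
    return title, structures[0], structures[1:]


if __name__ == "__main__":
    example = """>5HRU
TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
--Template--
((((.........(((((.....)))))))))
--Compared--
.........(((.(((((.....))))).)))
"""
    print(parse_structures(example))
```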

Examples

structures function

First, a simple example with two structures:

$ AptaMat.py -structures "(((...)))" "((.....))"
 (((...)))
 ((.....))
> AptaMat : 0.08

Then, it is possible to input several structures:

$ AptaMat.py -structures "(((...)))" "((.....))" ".(.....)." "(.......)"
 (((...)))
 ((.....))
> AptaMat : 0.08

 (((...)))
 .(.....).
> AptaMat : 0.2

 (((...)))
 (.......)
> AptaMat : 0.3

files function

Taking the above file example:

$ AptaMat.py -files example.fa
5HRU
Template - Compared
 ((((.........(((((.....)))))))))
 .........(((.(((((.....))))).)))
> AptaMat : 0.1134453781512605

Note

Compared structures must have the same length as the template structure.

For the moment, the script does not check whether a base pair can actually exist according to the literature. Be careful with the input sequence and its associated base pairing.

The script accepts the extended dot-bracket notation, which is useful for comparing pseudoknots or tetrads. However, the resulting distance might not be accurate.

Comments
  • Allow comparison with not folded secondary structure

    Users may want to run a quantitative analysis, for example in a pipeline, that assigns a distance between unfolded and folded oligonucleotides. Several solutions can be considered:

    • Give a default distance value to unfolded vs folded structures (worst solution).
    • Set the distance to the maximum number of observable base pairs: len(structure)//2. Several issues could arise from this:
      • How should this be handled with enhancement #7? Take the longest structure? The shortest?
      • It would give an abnormally high distance value that stays constant even when different foldings are compared with the same unfolded structure. Since ranking is our main advantage over other algorithms, failing to rank here is not good.
    • For each point in the matrix showing folding, assign the Manhattan distance to the farthest theoretical point in the structure, plus 1. This may give a large distance between the two structures whatever their size, and the + 1 prevents a tie with an actually folded structure that has a point at the same coordinates as the farthest theoretical point. Moreover, different foldings compared with the same unfolded structure can obtain different scores.
    enhancement 
    opened by GitHuBinet 0
  • Different length support and optimal alignment

    Allow the alignment of structures of different lengths. This would surely need an optimal structure alignment that makes the AptaMat distance the lowest for a shared motif. Maybe the missing bases should be considered in the score calculation.

    enhancement 
    opened by GitHuBinet 0
  • Is the algorithm time consuming ?

    Considering the expected structure size (fewer than 100 nucleotides), the calculation runs quite fast. However, the calculation can theoretically take time when the structure is larger, with a complexity around log(n^2). Improvements can be considered, as this time complexity is linked to the double pass over the dot-bracket input.

    • [ ] Think about the possibility of improving this bracket search.
    • [ ] Study the .ct notation for ssNA secondary structure (see in ".ct notation" enhancement)
    • [x] #6
    • [ ] Test the algorithm with this new feature
    question 
    opened by GEC-git 0
  • G-quadruplex/pseudoknot comprehension

    Add features for G-quadruplex and pseudoknot comprehension. This kind of secondary structure requires the extended dot-bracket notation. https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/rna_structure_notations.html

    The '([{<' & string.ascii_uppercase symbols are already included, but some doubt remains about the comparison accuracy because no tests have been done on this kind of secondary structure.

    • [ ] Run some trials on G-quadruplexes & pseudoknots and conclude about the comparison reliability. /!\ The complexity comes from the G-quadruplex structures. The tetrads can form base pairs in many different ways, and some secondary structure notations can be similar. Here is an example with the same interacting guanines: GGTTGGTGTGGTTGG ([..[)...(]..]) ((..)(...)(..))
    • [x] #5
    enhancement invalid 
    opened by GEC-git 0
Releases(v0.9-pre-release)
  • v0.9-pre-release(Oct 28, 2022)

    Pre-release content

    https://github.com/GEC-git/AptaMat

    • Create LICENSE by @GEC-git in https://github.com/GEC-git/AptaMat/pull/2
    • main script AptaMat.py
    • README.MD edited and published
    • Beta AptaMat logo edited and published

    Contributors

    • @GEC-git contributed in https://github.com/GEC-git/AptaMat
    • @GitHuBinet contributed in https://github.com/GEC-git/AptaMat

    Full Changelog: https://github.com/GEC-git/AptaMat/commits/v0.9-pre-release

Owner
GEC UTC
We are the "Genie Enzymatique et Cellulaire" CNRS UMR 7025 research unit.