DNA sequence classification by Deep Neural Network

Overview

DNA sequence classification by Deep Neural Network: Project Overview

  • worked on the DNA sequence classification problem where the input is the DNA sequence and the output class states whether a certain histone protein is present on the sequence or not.
  • used one of the datasets from 12 different datasets that we have collected. The name of the dataset is H3K4me2
  • To represent a sequence, we have utilized k-mer representation
  • For the sequence embedding we have used one-hot encoding
  • Different word embedding models: Word2Vec, BERT, Keras Embedding layer, Bi-LSTM, and CNN

Bioinformatics Project - B.Sc. in Computer Science and Engineering (CSE)

Created by: - Md. Tarek Hasan, Mohammed Jawwadul Islam, Md Fahad Al Rafi, Arifa Akter, Sumayra Islam

Date of Completion: - Fall 2021 Trimester (Nov 2021 - Jan 2022)

Linkedin of Jawwadul

Linkedin of Tarek

Linkedin of Fahad

Linkedin of Arifa

Linkedin of Sumayra

Code and Resources Used

  • Python Version: 3.7.11
  • Packages: numpy, pandas, keras, tensorflow, sklearn
  • Dataset from: Nguyen who is one the authors of the paper titled “DNA sequence classification by convolutional neural network”

Features of the Dataset

DNA sequences wrapped around histone proteins are the subject of datasets

  • For our experiment, we selected one of the datasets entitled H3K4me2.
  • H3K4me2 has 30683 DNA sequences whose 18143 samples fall under the positive class, the rest of the samples fall under the negative class, and it makes the problem binary class classification.
  • The ratio of the positive-negative class is around (59:41)%.
  • The class label represents the presence of H3K4me2 histone proteins in the sequences.
  • The base length of the sequences is 500.

Data Preprocessing

  • The datasets were gathered in.txt format. We discovered that the dataset contains id, sequence, and class label during the Exploratory Data Analysis phase of our work.
  • We dropped the id column from the dataset because it is the only trait that all of the samples share.
  • Except for two samples, H3K4me2 includes 36799 DNA sequences, the majority of which are 500 bases long. Those two sequences have lengths of 310 and 290, respectively. To begin, we employed the zero-padding strategy to tackle the problem. However, because there are only two examples of varying lengths, we dropped those two samples from the dataset later for experiments, as these samples may cause noise.
  • we have used the K-mer sequence representation technique to represent a DNA sequence, we have used the K-mer sequence representation technique
  • For sequence emdedding after applying the 3-mer representation technique, we have experimented using different embedding techniques. The first three embedding methods are named SequenceEmbedding1D, SequenceEmbedding2D, SequenceEmbedding2D_V2, Word2Vec and BERT.
    • SequenceEmbedding1D is the one-dimensional representation of a single DNA sequence which is basically the one-hot encoding.
    • SequenceEmbedding2D is the two-dimensional representation of a single DNA sequence where the first row is the one-hot encoding of a sequence after applying 3-mer representation. The second row is the one-hot encoding of a left-rotated sequence after applying 3-mer representation.
    • the third row of SequenceEmbedding2D_V2 is the one-hot encoding of a right-rotated sequence after applying 3-mer representation.
    • Word2Vec and BERT are the word embedding techniques for language modeling.

Deep Learning Models

After the completion of sequence embedding, we have used deep learning models for the classification task. We have used two different deep learning models for this purpose, one is Convolutional Neural Network (CNN) and the other is Bidirectional Long Short-Term Memory (Bi-LSTM).

Experimental Analysis

After the data cleaning phase, we had 36797 samples. We have used 80% of the whole dataset for training and the rest of the samples for testing. The dataset has been split using train_test_split from sklearn.model_selection stratifying by the class label. We have utilized 10% of the training data for validation purposes. For the first five experiments we have used batch training as it was throwing an exception of resource exhaustion.

The evaluation metrics we used for our experiments are accuracy, precision, recall, f1-score, and Matthews Correlation Coefficient (MCC) score. The minimum value of accuracy, precision, recall, f1-score can be 0 and the maximum value can be 1. The minimum value of the MCC score can be -1 and the maximum value can be 1.

image

Discussion

MCC score 0 indicates the model's randomized predictions. The recall score indicates how well the classifier can find all positive samples. We can say that the model's ability to classify all positive samples has been at an all-time high over the last five experiments. The highest MCC score we received was 0.1573, indicating that the model is very near to predicting in a randomized approach. We attain a maximum accuracy of 60.27%, which is much lower than the state-of-the-art result of 71.77%. To improve the score, we need to emphasize more on the sequence embedding approach. Furthermore, we can experiment with various deep learning techniques.

Owner
Mohammed Jawwadul Islam Fida
CSE student. Founding Vice President of Students' International Affairs Society at CIAC, UIU
Mohammed Jawwadul Islam Fida
Python wrapper class for OpenVINO Model Server. User can submit inference request to OVMS with just a few lines of code

Python wrapper class for OpenVINO Model Server. User can submit inference request to OVMS with just a few lines of code.

Yasunori Shimura 7 Jul 27, 2022
An example project demonstrating how the Autonomous Learning Library can be used to build new reinforcement learning agents.

About This repository shows how Autonomous Learning Library can be used to build new reinforcement learning agents. In particular, it contains a model

Chris Nota 5 Aug 30, 2022
Object-Centric Learning with Slot Attention

Slot Attention This is a re-implementation of "Object-Centric Learning with Slot Attention" in PyTorch (https://arxiv.org/abs/2006.15055). Requirement

Untitled AI 72 Jan 02, 2023
Pytorch implementation of Feature Pyramid Network (FPN) for Object Detection

fpn.pytorch Pytorch implementation of Feature Pyramid Network (FPN) for Object Detection Introduction This project inherits the property of our pytorc

Jianwei Yang 912 Dec 21, 2022
The code of paper 'Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection'

Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection Pytorch implemetation of paper 'Learning to Aggregate and Personalize

Tencent YouTu Research 136 Dec 29, 2022
A command line simple note taking app

Why yet another note taking program? note was designed with a very specific target in mind: me, and my 2354 scraps of paper. It runs from the command

64 Nov 20, 2022
A High-Level Fusion Scheme for Circular Quantities published at the 20th International Conference on Advanced Robotics

Monte Carlo Simulation to the Paper A High-Level Fusion Scheme for Circular Quantities published at the 20th International Conference on Advanced Robotics

Sören Kohnert 0 Dec 06, 2021
LowRankModels.jl is a julia package for modeling and fitting generalized low rank models.

LowRankModels.jl LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models (GLRMs). GLRMs model a data array by a low r

Madeleine Udell 183 Dec 17, 2022
Implementing Graph Convolutional Networks and Information Retrieval Mechanisms using pure Python and NumPy

Implementing Graph Convolutional Networks and Information Retrieval Mechanisms using pure Python and NumPy

Noah Getz 3 Jun 22, 2022
A PyTorch Lightning Callback for pushing models to the Hugging Face Hub 🤗⚡️

hf-hub-lightning A callback for pushing lightning models to the Hugging Face Hub. Note: I made this package for myself, mostly...if folks seem to be i

Nathan Raw 27 Dec 14, 2022
Rule Based Classification Project For Python

Rule-Based-Classification-Project (ENG) Business Problem: A game company wants to create new level-based customer definitions (personas) by using some

Deniz Can OĞUZ 4 Oct 29, 2022
code associated with ACL 2021 DExperts paper

DExperts Hi! This repository contains code for the paper DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts to appear at

Alisa Liu 68 Dec 15, 2022
Can we visualize a large scientific data set with a surrogate model? We're building a GAN for the Earth's Mantle Convection data set to see if we can!

EarthGAN - Earth Mantle Surrogate Modeling Can a surrogate model of the Earth’s Mantle Convection data set be built such that it can be readily run in

Tim 0 Dec 09, 2021
Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation

Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation Introduction ACoSP is an online pruning algorithm that compr

Merantix 8 Dec 07, 2022
Measure WWjj polarization fraction

WlWl Polarization Measure WWjj polarization fraction Paper: arXiv:2109.09924 Notice: This code can only be used for the inference process, if you want

4 Apr 10, 2022
ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection This repository contains implementation of the

Visual Understanding Lab @ Samsung AI Center Moscow 190 Dec 30, 2022
This repository contains the implementation of the paper: "Towards Frequency-Based Explanation for Robust CNN"

RobustFreqCNN About This repository contains the implementation of the paper "Towards Frequency-Based Explanation for Robust CNN" arxiv. It primarly d

Sarosij Bose 2 Jan 23, 2022
StyleTransfer - Open source style transfer project, based on VGG19

StyleTransfer - Open source style transfer project, based on VGG19

Patrick martins de lima 9 Dec 13, 2021
CLOOB training (JAX) and inference (JAX and PyTorch)

cloob-training Pretrained models There are two pretrained CLOOB models in this repo at the moment, a 16 epoch and a 32 epoch ViT-B/16 checkpoint train

Katherine Crowson 64 Nov 27, 2022
This is the pytorch code for the paper Curious Representation Learning for Embodied Intelligence.

Curious Representation Learning for Embodied Intelligence This is the pytorch code for the paper Curious Representation Learning for Embodied Intellig

19 Oct 19, 2022