Authors: Ada Madejska, MCDB, UCSB (contact: [email protected]) Nick Noll, UCSB This pipeline takes error-prone Nanopore reads and tries to increase the percentage identity of the results of identifying species with BLAST. The reads in fastq format are put through the pipeline which includes the following steps. 1. Quality control - very short and very long reads (reads that highly deviate from the usual length of the 16S sequence) are dropped. 2. Kmer frequency matrix - make a kmer frequency matrix based on the reads from the quality control step. The value of k can be changed (k=5 or 6 is recommended) 3. UMAP projection and HDBSCAN clustering - the kmer frequency matrix is used to create a UMAP projection. The default parameters for UMAP and HDBSCAN functions have been chosen based on mock dataset but can be changed. 4. Refinement - based on our tests on mock datasets, sometimes reads from different species can cluster together. To prevent that, we include a refinement step based on MSA of Clustal Omega on each cluster. The alignment outputs a guide tree which is used for dividing the cluster into smaller subclusters. The distance threshold can be changed to suit each dataset. 5. Consensus making - lastly, based on the defined clusters, the last step creates a consensus sequence based on majority calling. The direction of the reads is fixed using minimap2, the alignment is performed by MAFFT, and the consensus is created using em_cons. The reads are run through BLASTN to check for identity of each cluster. Software Dependencies: To successfully run the pipeline, certain software need to be installed. 1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2) 2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/) 3. EM_CONS - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html) 4. NCBIN - for identification of the consensus sequences in the database (https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required) 5. CLUSTALO - for the refinement step (http://www.clustal.org/omega/) Specifications: This pipeline runs in python3.8.10 and julia v"1.4.1". The following Python libraries are also required: BioPython hdbscan matplotlib pandas sklearn umap Following Julia packages are required: Pkg DataFrames CSV
A pipeline that creates consensus sequences from a Nanopore reads. I
Overview
Mining the Stack Overflow Developer Survey
Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m
Random dataframe and database table generator
Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien
๐งช Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.
๐งช๐ ๐. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python a
wikirepo is a Python package that provides a framework to easily source and leverage standardized Wikidata information
Python based Wikidata framework for easy dataframe extraction wikirepo is a Python package that provides a framework to easily source and leverage sta
Stitch together Nanopore tiled amplicon data without polishing a reference
Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri
High Dimensional Portfolio Selection with Cardinality Constraints
High-Dimensional Portfolio Selecton with Cardinality Constraints This repo contains code for perform proximal gradient descent to solve sample average
Projects that implement various aspects of Data Engineering.
DATAWAREHOUSE ON AWS The purpose of this project is to build a datawarehouse to accomodate data of active user activity for music streaming applicatio
Making the DAEN information accessible.
The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica
Visions provides an extensible suite of tools to support common data analysis operations
Visions And these visions of data types, they kept us up past the dawn. Visions provides an extensible suite of tools to support common data analysis
Analytical view of olist e-commerce in Brazil
Analysis of E-Commerce Public Dataset by Olist The objective of this project is to propose an analytical view of olist e-commerce in Brazil. For this
Python Package for DataHerb: create, search, and load datasets.
The Python Package for DataHerb A DataHerb Core Service to Create and Load Datasets.
My first Python project is a simple Mad Libs program.
Python CLI Mad Libs Game My first Python project is a simple Mad Libs program. Mad Libs is a phrasal template word game created by Leonard Stern and R
Python package for analyzing sensor-collected human motion data
Python package for analyzing sensor-collected human motion data
Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.
EMGDecomp Package for decomposing EMG signals into motor unit firings, created for Formento et al 2021. Based heavily on Negro et al, 2016. Supports G
Pipetools enables function composition similar to using Unix pipes.
Pipetools Complete documentation pipetools enables function composition similar to using Unix pipes. It allows forward-composition and piping of arbit
A collection of learning outcomes data analysis using Python and SQL, from DQLab.
Data Analyst with PYTHON Data Analyst berperan dalam menghasilkan analisa data serta mempresentasikan insight untuk membantu proses pengambilan keputu
A meta plugin for processing timelapse data timepoint by timepoint in napari
napari-time-slicer A meta plugin for processing timelapse data timepoint by timepoint. It enables a list of napari plugins to process 2D+t or 3D+t dat
This mini project showcase how to build and debug Apache Spark application using Python
Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark a
A pipeline that creates consensus sequences from a Nanopore reads. I
A pipeline that creates consensus sequences from a Nanopore reads. It clusters reads that are similar to each other and creates a consensus that is then identified using BLAST.
Intake is a lightweight package for finding, investigating, loading and disseminating data.
Intake: A general interface for loading data Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps