SARS-CoV-2 Recombinant Finder for fasta sequences

Overview

Sc2rf - SARS-CoV-2 Recombinant Finder

Pronounced: Scarf

What's this?

Sc2rf can search genome sequences of SARS-CoV-2 for potential recombinants - new virus lineages that have (partial) genes from more than one parent lineage.

Is it already usable?

This is a very young project, started on March 5th, 2022. As such, proceed with care. Results may be wrong or misleading, and with every update, anything can still change a lot.

Anyway, I'm happy that scientists are already seeing benefits from Sc2rf and using it to prepare lineage proposals for cov-lineages/pango-designation.

Though I already have a lot of ideas and plans for Sc2rf (see the bottom of this document), I'm very open to suggestions and feature requests. Please write an issue, start a discussion or get in touch via mail or Twitter!

Example output

Screenshot of the terminal output of Sc2rf

Requirements and Installation

You need at least Python 3.6, and you need to install the requirements first. You might use something like python3 -m pip install -r requirements.txt to do that. There's a setup.py which you should probably ignore, since it's a work in progress and does not work as intended yet.

Also, you need a terminal which supports ANSI control sequences to display colored text. On Linux, macOS, etc. it should probably work.

On Windows, color support is tricky. On a recent version of Windows 10, it should work, but if it doesn't, install Windows Terminal from GitHub or the Microsoft Store and run Sc2rf from there.
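If you are unsure whether your terminal handles ANSI control sequences, a quick check along these lines might help. This is only a minimal sketch and not part of Sc2rf; the helper name colorize is made up for illustration. If the three lines show up in color, your terminal should be fine; if you see stray characters like "[31m" instead, it is not interpreting the sequences.

```python
# Hypothetical helper for illustration only; it is not part of Sc2rf.
RESET = "\033[0m"  # ANSI sequence that resets all text attributes


def colorize(text: str, color_code: int) -> str:
    """Wrap text in an ANSI SGR foreground color sequence (e.g. 31 = red)."""
    return f"\033[{color_code}m{text}{RESET}"


if __name__ == "__main__":
    # Standard ANSI foreground codes: 31 = red, 32 = green, 34 = blue.
    for name, code in [("red", 31), ("green", 32), ("blue", 34)]:
        print(colorize(f"This line should appear in {name}", code))
```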

Basic Usage

Start with a .fasta file with one or more sequences which might contain recombinants. Your sequences have to be aligned to the reference.fasta. If they are not, you will get an error message like:

Sequence hCoV-19/Phantasialand/EFWEFWD not properly aligned, length is 29718 instead of 29903.

(For historical reasons, I have always used Nextclade to get aligned sequences, but you might also use Nextalign or any other tool. Installing them is easy on Linux or macOS, but not on Windows. You can also use a web-based tool like MAFFT.)

Then call:

sc2rf.py <your_filename.fasta>
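If you want to find out in advance which sequences would trigger the alignment error, a small pre-check could look roughly like the following. This is a sketch under assumptions, not how Sc2rf itself reads its input: the function names read_fasta, find_misaligned and check_file are invented here, and the expected length of 29903 is simply taken from the error message above.

```python
# Hypothetical pre-check for illustration only; it is not part of Sc2rf.
REFERENCE_LENGTH = 29903  # length quoted in the error message above


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header starts; emit the previous record first.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_misaligned(lines, expected=REFERENCE_LENGTH):
    """Return (name, length) for every sequence whose length differs from expected."""
    return [(n, len(s)) for n, s in read_fasta(lines) if len(s) != expected]


def check_file(path, expected=REFERENCE_LENGTH):
    """Print one line per sequence in the file that has an unexpected length."""
    with open(path) as handle:
        for name, length in find_misaligned(handle, expected):
            print(f"{name}: length is {length} instead of {expected}")
```

Calling check_file("your_filename.fasta") prints nothing if every sequence has the expected length. Note that a matching length is only a necessary condition; it does not prove that a sequence is correctly aligned.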

If you just need some fasta files for testing, you can search the pango-lineage proposals for recombinant issues with fasta-files, or take some files from my shared-sequences repository, which might not contain any actual recombinants, but hundreds of sequences that look like they were!

No output / some sequences not shown

By default, a lot of filters are active to show only the likely recombinants, so that you can input tens of thousands of sequences and just get output for the interesting ones. If you want, you can disable all filters like this, which is only recommended for small input files with fewer than 100 sequences:

sc2rf.py --parents 1-35 --breakpoints 0-100 \
--unique 1 --max-ambiguous 10000 <your_filename.fasta>

or even

sc2rf.py --parents 1-35 --breakpoints 0-100 \
--unique 1 --max-ambiguous 10000 --force-all-parents \
--clades all <your_filename.fasta>

The meaning of these parameters is described below.

Advanced Usage

You can execute sc2rf.py -h to get exactly this help message:

usage: sc2rf.py [-h] [--primers [PRIMER ...]]
                [--primer-intervals [INTERVAL ...]]
                [--parents INTERVAL] [--breakpoints INTERVAL]
                [--clades [CLADES ...]] [--unique NUM]
                [--max-intermission-length NUM]
                [--max-intermission-count NUM]
                [--max-name-length NUM] [--max-ambiguous NUM]
                [--force-all-parents]
                [--select-sequences INTERVAL]
                [--enable-deletions] [--show-private-mutations]
                [--rebuild-examples] [--mutation-threshold NUM]
                [--add-spaces [NUM]] [--sort-by-id [NUM]]
                [--verbose] [--ansi] [--hide-progress]
                [--csvfile CSVFILE]
                [input ...]

Analyse SARS-CoV-2 sequences for potential, unknown recombinant
variants.

positional arguments:
  input                 input sequence(s) to test, as aligned
                        .fasta file(s) (default: None)

optional arguments:
  -h, --help            show this help message and exit

  --primers [PRIMER ...]
                        Filenames of primer set(s) to visualize.
                        The .bed formats for ARTIC and EasySeq
                        are recognized and supported. (default:
                        None)

  --primer-intervals [INTERVAL ...]
                        Coordinate intervals in which to
                        visualize primers. (default: None)

  --parents INTERVAL, -p INTERVAL
                        Allowed number of potential parents of a
                        recombinant. (default: 2-4)

  --breakpoints INTERVAL, -b INTERVAL
                        Allowed number of breakpoints in a
                        recombinant. (default: 1-4)

  --clades [CLADES ...], -c [CLADES ...]
                        List of variants which are considered as
                        potential parents. Use Nextstrain clades
                        (like "21B"), or Pango Lineages (like
                        "B.1.617.1") or both. Also accepts "all".
                        (default: ['20I', '20H', '20J', '21I',
                        '21J', 'BA.1', 'BA.2', 'BA.3'])

  --unique NUM, -u NUM  Minimum of substitutions in a sample
                        which are unique to a potential parent
                        clade, so that the clade will be
                        considered. (default: 2)

  --max-intermission-length NUM, -l NUM
                        The maximum length of an intermission in
                        consecutive substitutions. Intermissions
                        are stretches to be ignored when counting
                        breakpoints. (default: 2)

  --max-intermission-count NUM, -i NUM
                        The maximum number of intermissions which
                        will be ignored. Surplus intermissions
                        count towards the number of breakpoints.
                        (default: 8)

  --max-name-length NUM, -n NUM
                        Only show up to NUM characters of sample
                        names. (default: 30)

  --max-ambiguous NUM, -a NUM
                        Maximum number of ambiguous nucs in a
                        sample before it gets ignored. (default:
                        50)

  --force-all-parents, -f
                        Force to consider all clades as potential
                        parents for all sequences. Only useful
                        for debugging.

  --select-sequences INTERVAL, -s INTERVAL
                        Use only a specific range of input
                        sequences. DOES NOT YET WORK WITH
                        MULTIPLE INPUT FILES. (default: 0-999999)

  --enable-deletions, -d
                        Include deletions in lineage comparision.

  --show-private-mutations
                        Display mutations which are not in any of
                        the potential parental clades.

  --rebuild-examples, -r
                        Rebuild the mutations in examples by
                        querying cov-spectrum.org.

  --mutation-threshold NUM, -t NUM
                        Consider mutations with a prevalence of
                        at least NUM as mandatory for a clade
                        (range 0.05 - 1.0, default: 0.75).

  --add-spaces [NUM]    Add spaces between every N colums, which
                        makes it easier to keep your eye at a
                        fixed place. (default without flag: 0,
                        default with flag: 5)

  --sort-by-id [NUM]    Sort the input sequences by the ID. If
                        you provide NUM, only the first NUM
                        characters are considered. Useful if this
                        correlates with meaning full meta
                        information, e.g. the sequencing lab.
                        (default without flag: 0, default with
                        flag: 999)

  --verbose, -v         Print some more information, mostly
                        useful for debugging.

  --ansi                Use only ASCII characters to be
                        compatible with ansilove.

  --hide-progress       Don't show progress bars during long
                        task.

  --csvfile CSVFILE     Path to write results in CSV format.
                        (default: None)

An Interval can be a single number ("3"), a closed interval
("2-5" ) or an open one ("4-" or "-7"). The limits are inclusive.
Only positive numbers are supported.
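
To make the Interval notation above more concrete, here is a minimal sketch of how such strings could be parsed and checked. It only follows the description in the help text and is not necessarily how Sc2rf implements it; the names parse_interval and in_interval are hypothetical, and an open end is represented as None.

```python
# Hypothetical illustration of the Interval notation; not Sc2rf's own code.
from typing import Optional, Tuple

Interval = Tuple[Optional[int], Optional[int]]


def parse_interval(text: str) -> Interval:
    """Parse "3", "2-5", "4-" or "-7" into (low, high); None means open-ended."""
    text = text.strip()
    if "-" not in text:
        value = int(text)  # a single number is an interval of width one
        return value, value
    low, _, high = text.partition("-")
    return (int(low) if low else None, int(high) if high else None)


def in_interval(value: int, interval: Interval) -> bool:
    """Check membership; both limits are inclusive."""
    low, high = interval
    return (low is None or value >= low) and (high is None or value <= high)
```

For example, parse_interval("2-5") gives (2, 5), and in_interval(5, (2, 5)) is True because the limits are inclusive.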

Interpreting the output

To be written...

There already is a short Twitter thread which explains the basics.

Source material attribution

  • virus_properties.json contains data from LAPIS / cov-spectrum which uses data from NCBI GenBank, prepared and hosted by Nextstrain, see blog post.
  • reference.fasta is taken from Nextstrain's nextclade_data, see NCBI for attribution.
  • mapping.csv is a modified version of the table on the covariants homepage by Nextstrain.
  • Example output / screenshot based on Sequences published by the German Robert-Koch-Institut.
  • Primers:
    • ARTIC primers CC-BY-4.0 by the ARTICnetwork project
    • EasySeq primers by Coolen, J. P., Wolters, F., Tostmann, A., van Groningen, L. F., Bleeker-Rovers, C. P., Tan, E. C., ... & Melchers, W. J. Removed until I understand the format of the .bed file. There will be an issue soon.
    • midnight primers CC-BY-4.0 by Silander, Olin K, Massey University

The initial version of this program was written in cooperation with @flauschzelle.

TODO / IDEAS / PLANS

  • Move these TODOs into actual issues
  • add disclaimer and link to pango-designation
  • provide a sample file (maybe both .fasta and .csv, as long as the csv step is still needed)
  • accept aligned fasta
    • as input file
    • as piped stream
  • If we still accept csv/ssv input, autodetect the delimiter either by file name or by analysing the first line
  • find a way to handle already designated recombinant lineages
  • Output structured results
    • csv
    • html?
    • fasta of all sequences that match the criteria, which enables efficient multi-pass strategies
  • filter sequences
    • by ID
    • by metadata
  • take metadata csv
  • document the output in README
  • check / fix --enable-deletions
  • adjustable threshold for mutation prevalence
  • new color mode (with background color and monochrome text on top)
  • new bar mode (with colored lines beneath each sequence, one for each example sequence, and "intermissions" shown in the color of the "surrounding" lineage, but not as bright)
  • interactive mode, for filtering, reordering, etc.
  • sort sequences within each block
  • re-think this whole "intermission" concept
  • select a single sequence and let the tool refine the choice of parental sequences, not just focusing on commonly known lineages (going up and down in the tree)
  • use more common terms to describe things (needs feedback from people with actual experience in the field)
Owner
Lena Schimmel