Analysis of Antarctica sequencing samples contaminated with SARS-CoV-2

Overview

Analysis of SARS-CoV-2 reads in sequencing of 2018-2019 Antarctica samples in PRJNA692319

The samples analyzed here are described in this pre-print by Istvan Csabai and co-workers, which reports SARS-CoV-2 reads in Antarctica samples that were sequenced in China. I was originally alerted to the pre-print by Carl Zimmer on Dec-23-2021. Istvan Csabai and co-workers subsequently posted a second pre-print that also analyzes the host reads.

Repeating key parts of the analysis

The code in this repo independently repeats some of the analyses.

To run the analysis, build the conda environment in environment.yml and then run the analysis using Snakefile. To do this on the Hutch cluster, use run.bash:

sbatch -c 16 run.bash

The results are placed in the ./results/ subdirectory. Most of the results files are not tracked due to file-size limitations, but the following key files are tracked:

  • results/alignment_counts.csv gives the number of reads aligning to SARS-CoV-2 for each sample. This confirms that three accessions (SRR13441704, SRR13441705, and SRR13441708) have most of the SARS-CoV-2 reads, although a few other samples also have some.
  • results/variant_analysis.csv reports all variants found in the samples relative to Wuhan-Hu-1.
  • results/variant_analysis_to_outgroup.csv reports the variants found in the samples that represent mutations from Wuhan-Hu-1 towards the two closest bat coronavirus relatives, RaTG13 and BANAL-20-52. Note that some of the reads contain three key mutations relative to Wuhan-Hu-1 (C8782T, C18060T, and T28144C) that move the sequence closer to the bat coronavirus relatives. These mutations define one of the two plausible progenitors for all currently known human SARS-CoV-2 sequences (see Kumar et al. (2021) and Bloom (2021)).
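The idea of a mutation "towards" an outgroup can be illustrated with a small sketch. The Python below is not the code from this repo's Snakefile. The names parse_substitution, moves_toward_outgroup, and outgroup_bases are hypothetical, and the three outgroup bases are filled in only from the statement above that these mutations move the sequence closer to the bat coronavirus relatives, not read from an alignment.

```python
import re
from typing import Dict, NamedTuple


class Substitution(NamedTuple):
    """A nucleotide substitution such as C8782T (reference base, 1-based site, new base)."""
    ref: str
    site: int
    alt: str


def parse_substitution(label: str) -> Substitution:
    """Parse a label like 'C8782T' into its reference base, site, and new base."""
    match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", label)
    if match is None:
        raise ValueError(f"not a nucleotide substitution: {label!r}")
    return Substitution(match.group(1), int(match.group(2)), match.group(3))


def moves_toward_outgroup(label: str, outgroup_bases: Dict[int, str]) -> bool:
    """Return True if the new base equals the outgroup base at that site.

    `outgroup_bases` maps a site in Wuhan-Hu-1 coordinates to the base the
    outgroup carries there. Sites missing from the mapping count as False.
    """
    sub = parse_substitution(label)
    return outgroup_bases.get(sub.site) == sub.alt


# Hypothetical input: outgroup bases at the three sites named in the text,
# assumed from the text's description rather than taken from an alignment.
outgroup_bases = {8782: "T", 18060: "T", 28144: "C"}

for label in ["C8782T", "C18060T", "T28144C"]:
    print(label, moves_toward_outgroup(label, outgroup_bases))
```

In a real analysis the outgroup bases would come from an alignment of RaTG13 and BANAL-20-52 to Wuhan-Hu-1; the dictionary here only stands in for that step.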

Archived links after initially hearing about pre-print

I archived the following links on Dec-23-2021 after hearing about the pre-print from Carl Zimmer:

Deletion of some samples from SRA

On Jan-3-2022, I received an e-mail from one of the pre-print authors, Istvan Csabai, saying that three of the samples (apparently the ones with the most SARS-CoV-2 reads) had been removed from the SRA. He also noted that bioRxiv had refused to publish their pre-print without explanation; the file he attached indicates the submission ID was BIORXIV-2021-472446v1. I confirmed that three of the accessions had indeed been removed from the SRA as shown in the following archived links:

I also e-mailed Richard Sever at bioRxiv to ask why the pre-print was rejected, and explained I had repeated and validated the key findings. Richard Sever said he could not give details about the pre-print review process, but that in the future the authors could appeal if they thought the rejection was unfounded.

Details from Istvan Csabai

On Jan-4-2022, I chatted with Istvan Csabai. He had contacted the authors of the PRJNA692319 samples, and shared with me their reply to him. The authors had prepped the samples in early 2019, and submitted them to Sangon Biotech for sequencing in December, getting the results back in early January.

Second pre-print from Csabai and restoration of deleted files

Istvan Csabai then worked on a second pre-print that analyzed host reads and made various findings, including co-contamination with African green monkey (Vero?) and human DNA. He sent me pre-print drafts on Jan-16-2022 and on Jan-24-2022, and I provided comments on both drafts and agreed to be listed in the Acknowledgments.

On Feb-3-2022, Istvan Csabai told me that the second pre-print had also been rejected by bioRxiv. Because I had previously contacted Richard Sever when I heard the first pre-print was rejected, I suggested Istvan could CC me on an e-mail to Richard Sever appealing the rejection, which he did. Unfortunately, Richard Sever declined the appeal, so instead Istvan posted the pre-print on Research Square.

At that point on Feb-3-2022, I also re-checked the three deleted accessions (SRR13441704, SRR13441705, and SRR13441708). To my surprise, all three were publicly available again. Here are archived links demonstrating that they were again available:

I confirmed that the restored accessions were identical to the deleted ones.
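A check of this kind can be sketched as a checksum comparison. The sketch below is one possible approach and not necessarily how the comparison was actually done; it assumes local copies of both downloads exist, and the function names sha256_of_file and files_identical are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequencing files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Return True if the two files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Calling files_identical on the copy of an accession downloaded before deletion and the copy downloaded after restoration would return True only if the two files match byte for byte. Files that differ only in compression or container format would need to be compared at the level of the reads instead.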

Inquiry to authors of PRJNA692319

On Feb-8-2022, I e-mailed the Chinese authors of the paper to ask about the sample deletion and restoration. They e-mailed back almost immediately. They confirmed what they had told Istvan: they had sequenced the samples with Sangon Biotech (Shanghai) after extracting the DNA from their samples in December 2019. They suspect that the contamination of the samples happened at Sangon Biotech. They deleted the three most contaminated samples from the Sequence Read Archive. They do not know why the samples were then "un-deleted."

Owner
Jesse Bloom
I research the evolution of viruses and proteins.