A tool for batch processing large fasta files and an accompanying metadata table for upload to repositories via an API

Overview

Fasta Uploader

The Python script fasta_uploader.py breaks large fasta files (e.g. 500 MB) and their related (one-to-one) tab-delimited sample contextual data into smaller batches of 1000 records, or another specified number of records, which can then be uploaded to a given sequence repository if an API endpoint is selected. Currently there is one option for the API interface: VirusSeq.

This tool is developed by the SFU Centre for Infectious Disease Epidemiology and One Health in conjunction with VirusSeq and it works well with DataHarmonizer!

Authors: Damion Dooley, Nithu Sara John

Details

Given a fasta file and a sample metadata file with a column that matches the fasta file record identifiers, the tool breaks both into corresponding sets of smaller batches of records, which are submitted to an API for processing.
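
The pairing and batching described above can be sketched roughly as follows. This is an illustrative, standard-library-only sketch and not the tool's own code (the tool itself requires Biopython). The names read_fasta and make_batches are hypothetical, and the sketch assumes the record identifier is the first whitespace-delimited token of each fasta header line.

```python
import csv
import io
from itertools import islice


def read_fasta(handle):
    """Yield (record_id, header, sequence) tuples from a fasta text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Assumption: the record ID is the first whitespace-delimited header token.
    tokens = header.split()
    return (tokens[0] if tokens else "", header, "".join(chunks))


def make_batches(fasta_handle, metadata_handle, key_field,
                 batch_size=1000, delimiter="\t"):
    """Yield (records, metadata_rows) pairs of at most batch_size records."""
    reader = csv.DictReader(metadata_handle, delimiter=delimiter)
    # Index metadata rows by the column that matches the fasta record IDs.
    rows = {row[key_field]: row for row in reader}
    records = read_fasta(fasta_handle)
    while True:
        chunk = list(islice(records, batch_size))
        if not chunk:
            break
        # Keep only metadata rows whose key matches a record in this batch.
        yield chunk, [rows[rid] for rid, _, _ in chunk if rid in rows]


if __name__ == "__main__":
    fasta = io.StringIO(">s1 first\nACGT\nAC\n>s2\nGG\n>s3\nTT\n")
    meta = io.StringIO("fasta header name\tcountry\ns1\tCA\ns2\tCA\ns3\tCA\n")
    for number, (recs, rows) in enumerate(
            make_batches(fasta, meta, "fasta header name", batch_size=2)):
        print(number, [r[0] for r in recs], len(rows))
```

A metadata file that lacks the key column raises a KeyError in this sketch; how the real script reports that case is described in step 2 below.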

Processing has three steps:

  1. Construct batches of files. Since only two files are read and parsed in one go, processing of them is reliable after that point, so no further error reporting is needed during the batch file generation process.

    1. Importantly, if rerunning fasta_uploader.py, this step will be skipped unless the -r, --reset parameter is given. Currently input files are still required in this case.
  2. If an API option is included, submit each *.queued.fasta batch to the API, wait for it to finish or error out (capturing the error report), and proceed to the next batch.

    1. Some types of error trigger sudden death, i.e. sys.exit(), because they would also occur in subsequent API batch calls. For example, missing tabular data column names will trigger an exit. Once resolved, rerun with -r to force regeneration of output files.
    2. There is an option to submit just one of the batches, e.g. the first one, via the "-n 0" parameter. This allows error debugging of just the first batch. Once error patterns are determined, fixes that also apply to the remaining source contextual data can be made there, the first batch can be removed from the source fasta and contextual data files, and the whole batching can be redone using -r (reset) or by manually deleting the output files and rerunning.
  3. The processing status of existing API requests is reported from the API server end. Some requests may be queued by the API server, others may have been processed successfully, and others may have line-by-line errors in field content. fasta_uploader.py converts the latter into new [output file batch.#].queued.fasta and [output batch.#].queued.tsv files, which can be edited and then submitted back to the API by rerunning the program with the same command line parameters.
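
The control flow of step 2 can be sketched as below. This is a hedged illustration rather than the script's actual code: submit_batches, FatalSubmissionError and the caller-supplied submit function are all hypothetical names, and no real API call is made. It shows only the distinction between errors that stop the whole run and errors that are collected per batch, plus the single-batch selection that "-n" provides.

```python
class FatalSubmissionError(Exception):
    """An error that would recur in every later batch, e.g. a missing column."""


def submit_batches(batches, submit, only_batch=None):
    """Submit batches in order and return {batch_number: [error messages]}.

    `submit` is a hypothetical caller-supplied function that takes one batch
    and returns a list of error messages (empty on success).
    """
    report = {}
    for number, batch in enumerate(batches):
        # Mirror the "-n" option: optionally process a single batch only.
        if only_batch is not None and number != only_batch:
            continue
        try:
            errors = submit(batch)
        except FatalSubmissionError as exc:
            # Equivalent to sys.exit(): later batches would fail the same way.
            raise SystemExit(f"Batch {number}: {exc}")
        report[number] = errors
    return report


if __name__ == "__main__":
    def fake_submit(batch):
        # Stand-in for an API call: flag any record named "bad".
        return [f"{item}: invalid field" for item in batch if item == "bad"]

    print(submit_batches([["s1", "s2"], ["bad"]], fake_submit))
    print(submit_batches([["s1", "s2"], ["bad"]], fake_submit, only_batch=0))
```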

Requires Biopython and Requests modules

  • "pip install biopython"
  • "pip install requests"

Usage

Run the command in a folder containing the appropriate input files; output files can be generated there too. Rerun it in the same folder to incrementally fix any submission errors and then restart submission.

python fasta_uploader.py [options]

Options:

-h, --help
  Show this help message and exit.
-f FASTA_FILE, --fasta=FASTA_FILE
  Provide a fasta file name.
-m METADATA_FILE, --metadata=METADATA_FILE
  Provide a comma-delimited (.csv) or tab-delimited (.tsv) sample contextual data file name.
-b BATCH, --batch=BATCH
  Provide the number of fasta records to include in each batch. Default is 1000.
-o OUTPUT_FILE, --output=OUTPUT_FILE
  Provide an output file name/path.
-k KEY_FIELD, --key=KEY_FIELD
  Provide the metadata field name to match to fasta record identifier.
-n BATCH_NUMBER, --number=BATCH_NUMBER
  Submit only the given batch number to the API instead of all batches.

Parameters involved in optional API call:

-a API, --api=API
  Provide the target API to send data to.  A batch submission job will be initiated for it. Default is "VirusSeq_Portal".
-u API_TOKEN, --user=API_TOKEN
  An API user token is required for API access.
-d, --dev
  Test against a development server rather than the live one.  Provide an API endpoint URL.
-s, --short
  Report up to the given number of fasta record related errors for each batch submission.  Useful for taking care of repeated errors first, based on their first instance.
-r, --reset
  Regenerate all batch files and begin the API resubmission process even if batch files already exist under the given output file pattern.
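
To show how these options combine on a command line, here is a minimal argparse sketch covering a subset of them. It is an assumption-laden illustration, not the parser used by fasta_uploader.py: the destination names are hypothetical, -d and -s are left out because their argument forms are not fully specified above, and only the documented default batch size is reproduced.

```python
import argparse


def build_parser():
    """Build an illustrative parser for a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="fasta_uploader.py")
    parser.add_argument("-f", "--fasta", dest="fasta_file",
                        help="fasta file name")
    parser.add_argument("-m", "--metadata", dest="metadata_file",
                        help="comma or tab delimited contextual data file")
    parser.add_argument("-b", "--batch", type=int, default=1000,
                        help="number of fasta records per batch")
    parser.add_argument("-o", "--output", dest="output_file",
                        help="output file name/path")
    parser.add_argument("-k", "--key", dest="key_field",
                        help="metadata field matching the fasta record ID")
    parser.add_argument("-n", "--number", dest="batch_number", type=int,
                        help="submit only this batch number")
    parser.add_argument("-a", "--api", help="target API")
    parser.add_argument("-u", "--user", dest="api_token",
                        help="API user token")
    parser.add_argument("-r", "--reset", action="store_true",
                        help="regenerate all batch files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-f", "consensus_final.fasta", "-m", "final set 1.csv",
         "-k", "fasta header name", "-n", "0"])
    print(args.fasta_file, args.batch, args.batch_number, args.reset)
```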

For example:

python fasta_uploader.py -f "consensus_final.fasta" -m "final set 1.csv" -k "fasta header name" -a VirusSeq_Portal -u ENTER_API_KEY_HERE

This will convert consensus_final.fasta and related final set 1.csv contextual data records into batches of 1000 records by default, and will begin submitting each batch to the VirusSeq portal.

python fasta_uploader.py -f "consensus_final.fasta" -m "final set 1.csv" -n 0 -k "fasta header name" -a VirusSeq_Portal -u ENTER_API_KEY_HERE

Like the above, but only the first batch is submitted so that one can see any errors and, if they apply to all batches, fix them in the original "final set 1.csv" file. Once batch 0 is fixed, all its records can be removed from the consensus_final.fasta and final set 1.csv files, and the whole job can be resubmitted.
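
Removing the records of an already-fixed batch from the source files can be done by hand or with a small helper. The sketch below is one possible way, using only the standard library; drop_records is a hypothetical function that is not part of fasta_uploader.py, and it assumes the record identifier is the first whitespace-delimited token of each fasta header and that the metadata file is comma-delimited.

```python
import csv
import io


def drop_records(fasta_text, metadata_text, key_field, ids_to_drop,
                 delimiter=","):
    """Return (fasta_text, metadata_text) without the records in ids_to_drop."""
    kept_lines, keep = [], True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumption: the record ID is the first token of the header line.
            tokens = line[1:].split()
            keep = not (tokens and tokens[0] in ids_to_drop)
        if keep:
            kept_lines.append(line)

    reader = csv.DictReader(io.StringIO(metadata_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    # Keep only the metadata rows whose key is not being dropped.
    writer.writerows(row for row in reader
                     if row[key_field] not in ids_to_drop)
    return "".join(line + "\n" for line in kept_lines), out.getvalue()


if __name__ == "__main__":
    fasta = ">s1\nAC\nGT\n>s2 second\nGG\n"
    meta = "fasta header name,country\ns1,CA\ns2,CA\n"
    new_fasta, new_meta = drop_records(fasta, meta, "fasta header name", {"s1"})
    print(new_fasta, new_meta, sep="")
```

Working on copies of the input files is advisable, since the originals are still needed if the batching has to be regenerated.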

Owner
Centre for Infectious Disease and One Health
Hsiao Laboratory at Simon Fraser University