Extract an archive file (zip file or tar file) stored on AWS S3

Overview

S3 Extract

Extract an archive file (zip file or tar file) stored on AWS S3.

Details

Downloads archive from S3 into memory, then extract and re-upload to given destination.

The following S3 information is expected to be given as Environment Variables:

  • AWS_ENDPOINT_URL
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • CERT_PATH (optional)
  • CERT_DL_URL (optional)
  • SIGNATURE_VERSION (optional, defaults to "s3v4")
  • REGION_NAME (optional, defaults to "us-east-1")

Additionally, these information needed for ClearML remote execution can also be given as Env Variable (optional, args will override env var):

  • DEFAULT_DOCKER_IMG
  • DEFAULT_QUEUE

For those not familiar, environment variables can be set through various ways, some being:

  • export = in terminal
  • can be set in ~/.bashrc as well for more permanence

Iteratively extracting a folder of zips/tars is supported as well, through --src-is-dir flag.

Usage

usage: run.py [-h] [--src-is-dir] [--dst-bucket DST_BUCKET] [--verbose] [--remote] [--clml-proj CLML_PROJ] [--clml-task-name CLML_TASK_NAME] [--clml-task-type CLML_TASK_TYPE]
              [--docker-img DOCKER_IMG] [--queue QUEUE]
              src_bucket src_path dst_path

positional arguments:
  src_bucket            Source bucket
  src_path              Source path
  dst_path              Destination path

optional arguments:
  -h, --help            show this help message and exit
  --src-is-dir          Flag to indicate that given src path is a directory. Will iteratively extract any files in it ending with .zip or .tar.
  --dst-bucket DST_BUCKET
                        Destination bucket (optional), will default to Source bucket.
  --verbose             print out current upload filename as it progresses
  --remote              use clearml to remotely run job
  --clml-proj CLML_PROJ
                        ClearML Project Name
  --clml-task-name CLML_TASK_NAME
                        ClearML Task Name
  --clml-task-type CLML_TASK_TYPE
                        ClearML Task Type, e.g. training, testing, inference, etc
  --docker-img DOCKER_IMG
                        Base docker image to pull for ClearML remote execution
  --queue QUEUE         ClearML remote execution queue

Example usage:

python run.py my-bucket dataset/coco/images.tar dataset/coco/ --verbose --remote --clml-proj coco --clml-task-name coco_extraction --docker-img ubuntu/20.04 --queue 1xGPU
Owner
Evan
Evan
Python library and shell utilities to monitor filesystem events.

Watchdog Python API and shell utilities to monitor file system events. Works on 3.6+. If you want to use Python 2.6, you should stick with watchdog

Yesudeep Mangalapilly 5.6k Jan 04, 2023
MetaMove is written in Python3 and aims at easing batch renaming operations based on file meta data.

MetaMove MetaMove is written in Python3 and aims at easing batch renaming operations based on file meta data. MetaMove abuses eval combined with f-str

Jan Philippi 2 Dec 28, 2021
A Python library that provides basic functions to read / write Aseprite format files

A Python library that provides basic functions to read / write Aseprite format files

Joe Trewin 1 Jan 13, 2022
Simple archive format designed for quickly reading some files without extracting the entire archive

Simple archive format designed for quickly reading some files without extracting the entire archive

Jarred Sumner 336 Dec 30, 2022
Utils for streaming large files (S3, HDFS, gzip, bz2...)

smart_open ā€” utils for streaming large files in Python What? smart_open is a Python 3 library for efficient streaming of very large files from/to stor

RARE Technologies 2.7k Jan 06, 2023
Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

laojunjun 13 Nov 23, 2022
Python's Filesystem abstraction layer

PyFilesystem2 Python's Filesystem abstraction layer. Documentation Wiki API Documentation GitHub Repository Blog Introduction Think of PyFilesystem's

pyFilesystem 1.8k Jan 02, 2023
pydicom - Read, modify and write DICOM files with python code

pydicom is a pure Python package for working with DICOM files. It lets you read, modify and write DICOM data in an easy "pythonic" way.

DICOM in Python 1.5k Jan 04, 2023
Sheet Data Image/PDF-to-CSV Converter

Sheet Data Image/PDF-to-CSV Converter

Quy Truong 5 Nov 22, 2021
LightCSV - This CSV reader is implemented in just pure Python.

LightCSV Simple light CSV reader This CSV reader is implemented in just pure Python. It allows to specify a separator, a quote char and column titles

Jose Rodriguez 6 Mar 05, 2022
QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions.

QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions. It aims at facilitating code deobfuscation. The algorithm is greybox approach combining both a blackbox I/

Quarkslab 103 Dec 30, 2022
Get Your TXT File Length !.

TXTLen Get Your TXT File Length !. Hi šŸ‘‹ , I'm Alireza A Python Developer Boy šŸ”­ Iā€™m currently working on my C# projects šŸŒ± Iā€™m currently Learning CSh

Alireza Hasanzadeh 1 Jan 06, 2022
Kartothek - a Python library to manage large amounts of tabular data in a blob store

Kartothek - a Python library to manage (create, read, update, delete) large amounts of tabular data in a blob store

15 Dec 25, 2022
Python virtual filesystem for SQLite to read from and write to S3

Python virtual filesystem for SQLite to read from and write to S3

Department for International Trade 70 Jan 04, 2023
Simple, convenient and cross-platform file date changing library. šŸ“šŸ“…

Simple, convenient and cross-platform file date changing library.

kubinka0505 15 Dec 18, 2022
Organize the files into the relevant sub-folders

This program can be used to organize files in a directory by their file extension. And move duplicate files to a duplicates folder.

Thushara Thiwanka 2 Dec 15, 2021
dotsend is a web application which helps you to upload your large files and share file via link

dotsend is a web application which helps you to upload your large files and share file via link

Devocoe 0 Dec 03, 2022
Python package to read and display segregated file names present in a directory based on type of the file

tpyfilestructure Python package to read and display segregated file names present in a directory based on type of the file. Installation You can insta

Tharun Kumar T 2 Nov 28, 2021
Object-oriented file system path manipulation

path (aka path pie, formerly path.py) implements path objects as first-class entities, allowing common operations on files to be invoked on those path

Jason R. Coombs 1k Dec 28, 2022
Extract the windows major and minor build numbers from an ISO file, and automatically sort the iso files.

WindowsBuildFromISO Extract the windows major and minor build numbers from an ISO file, and automatically sort the iso files. Features Parse multiple

Podalirius 9 Nov 09, 2022