Python script for transferring data between three drives in two separate stages

Last update: Nov 10, 2021

Waterlock

Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs hash verification and persistently tracks data transfer progress using SQLite.

I am not responsible for any lost data. This was an evening coding project. Use at your own discretion.

Use Case & Features

The use case Waterlock was designed for is moving files from one computer (e.g. your home server) to an intermediary drive (e.g. a portable hard drive), and then from that drive to another computer (e.g. an offsite backup server).

It will fill the intermediary drive with as many files as it can, aside from a user-configurable amount of reserve-space.
It computes a BLAKE2 checksum with every file copy and compares it to the initial hash value stored in the SQLite database to ensure that the data has not been corrupted.
It uses a SQLite database to track what data has been moved. As a result, you can incrementally move data from one location to another with minimal user input.
Every time Waterlock is run on the source location, it will check for any files that have been modified (based on timestamp, not hash). Modified files have their hash and modification timestamp updated in the database and are marked as unmoved so that they are transferred again. Note that Waterlock does not version files. Nevertheless, silently corrupted files should theoretically not be transferred over unless their modification timestamp has also changed.
Every time Waterlock is run on the source location, it will check for any files that were previously moved to the intermediary drive but did not reach the destination. If these files are no longer on the intermediary drive (due to accidental deletion, for instance), Waterlock will move them to the intermediary drive again.
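
The features above can be sketched roughly as follows. This is a minimal illustration and not Waterlock's actual code: the `files` table layout, the function names (`blake2_hex`, `scan_source`, `fill_middle`), and the choice of BLAKE2b over BLAKE2s are all assumptions made for the example. It shows timestamp-based change detection, a hash check after every copy, and skipping files that would eat into the reserved space.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def blake2_hex(path, chunk_size=1024 * 1024):
    """Return the BLAKE2b hex digest of a file, read in chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_source(conn, source_dir):
    """Record new files and flag files whose mtime changed as unmoved."""
    # Hypothetical schema: one row per file, keyed by its relative path.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, hash TEXT, mtime REAL, moved INTEGER)"
    )
    for path in sorted(Path(source_dir).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(source_dir).as_posix()
        mtime = path.stat().st_mtime
        row = conn.execute(
            "SELECT mtime FROM files WHERE path = ?", (rel,)
        ).fetchone()
        if row is None or row[0] != mtime:
            # New or modified (by timestamp): store a fresh hash, mark unmoved.
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, 0)",
                (rel, blake2_hex(path), mtime),
            )
    conn.commit()


def fill_middle(conn, source_dir, middle_dir, reserved_bytes):
    """Copy unmoved files while keeping reserved_bytes free; return copied paths."""
    copied = []
    rows = conn.execute(
        "SELECT path, hash FROM files WHERE moved = 0 ORDER BY path"
    ).fetchall()
    for rel, expected in rows:
        src = Path(source_dir) / rel
        dst = Path(middle_dir) / rel
        free = shutil.disk_usage(middle_dir).free
        if free - src.stat().st_size < reserved_bytes:
            continue  # would eat into the reserve; a smaller file may still fit
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if blake2_hex(dst) != expected:
            dst.unlink()  # corrupted copy: discard it and leave the row unmoved
            continue
        conn.execute("UPDATE files SET moved = 1 WHERE path = ?", (rel,))
        copied.append(rel)
    conn.commit()
    return copied
```

Because the progress lives in the database rather than in memory, a run in this sketch can be interrupted and repeated: files already marked as moved are skipped, and anything left unmoved is picked up next time.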

Example Use Case: I use Waterlock to transfer files that are too large to send over the network to an offsite backup location at a relative's house. Each time I visit, I run the script on my home server to load the external drive, then run it again on the offsite backup server.

Usage

Change the settings at the top of the script, using absolute file paths. While relative paths may work, they are more error-prone due to string formatting issues. Store the script on the intermediary drive itself and run it from there. It will automatically create waterlock.db and a cargo folder where the data will be stored. Note that Waterlock does not delete data from the intermediary drive after the final transfer to the destination.

python waterlock.py

If you are familiar with Python, you can also fully verify all the files on the middle or destination drives to ensure that their hashes match what is stored in the database. This is done using two additional methods of the Waterlock class, verify_middle() and verify_destination(). The code to verify files on the destination would be as follows:

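if __name__ == "__main__":
    # source_directory, end_directory and reserved_space are taken to be
    # the settings defined at the top of waterlock.py.
    wl = Waterlock(
        source_directory=source_directory,
        end_directory=end_directory,
        reserved_space=reserved_space,
    )
    wl.start()
    # Check the files on the destination against the hashes in the database.
    wl.verify_destination()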
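
For readers curious what such a verification pass involves, here is a rough sketch. It is not the body of verify_middle() or verify_destination(): the `files` table with `path` and `hash` columns, the function name `verify_directory`, and the use of BLAKE2b are assumptions for illustration only. The idea is simply to re-hash every tracked file under a folder and report anything missing or mismatched.

```python
import hashlib
import sqlite3
from pathlib import Path


def verify_directory(conn, directory, chunk_size=1024 * 1024):
    """Return (path, problem) pairs for tracked files that fail verification."""
    problems = []
    # Hypothetical schema: files(path TEXT, hash TEXT), paths relative to directory.
    rows = conn.execute("SELECT path, hash FROM files ORDER BY path").fetchall()
    for rel, expected in rows:
        target = Path(directory) / rel
        if not target.is_file():
            problems.append((rel, "missing"))
            continue
        digest = hashlib.blake2b()
        with open(target, "rb") as handle:
            # Read in chunks so large files never have to fit in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            problems.append((rel, "hash mismatch"))
    return problems
```

An empty list means every tracked file was found and matched its stored hash. A full pass reads every byte on the drive, so it can take a while on large transfers.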

Why 'Waterlock'?

It is named Waterlock after marine locks used to move ships through waterways of different water levels in multiple stages.

Owner

David Swanlund
