Fileson - JSON File database tools

Overview

Fileson is a set of Python scripts to create JSON file databases and use them to do various things, like compare differences between two databases. There are a few key files:

  • fileson.py contains Fileson class to read, manipulate and write Fileson databases. Relies on logdict.py, a logging-enabled hashmap.
  • fileson_util.py is a command-line toolkit to create Fileson databases and do useful things with them
  • fileson_backup.py contains helper logic for creating crypto keys, encryption/decryption, upload/download from S3, and most importantly, backup/restore functionality.
  • fileson_tool.py is a config-based interface to simple backups.

API documentation (everything very much subject to change) is available at https://fileson.readthedocs.io/en/latest/

Quickstart to backup

If you are not that interested in the details of this library, set up your backup process in a few straightforward steps:

Prerequisites (S3 and boto3)

  1. Sign up for AWS and create an S3 bucket.
  2. Create a new identity that has privileges for writing to that bucket. Yes, you will need to google 'grant identity access to s3 bucket' for how to do this.
  3. Use something like S3 Browser to check you can upload to your bucket with your newly created credentials.
  4. Get boto3 for Python and configure the credentials. Maybe even do a test with the S3 sample code (boto3 quickstart documentation is excellent)

Using the fileson_tool.py

  1. Edit the included fileson.ini (and create an encryption key if you want encrypted backups, see the comments inside the ini file)
  2. Run python3 fileson_tool.py scan to create the .fson files for your backup entries.
  3. Run python3 fileson_tool.py backup to back everything up. This can take a long time, so maybe use -e entryname to do it one entry at a time.
  4. Repeat from (2) whenever you want to update the backup!

The backup process should tolerate interruptions with ctrl-c and carry on later where it left off (it logs every upload and flushes the log to disk after every file).

Tip: You may want to have the fileson.ini in a separate directory and run the scan and backup commands from there, so you have a nice folder to (also) back up to your cloud -- encrypted and name-obfuscated backup files are of little use without the .fson and .log files!

Create a Fileson database

$ python3 fileson_util.py scan files.fson ~/mydir

Fileson databases are essentially log files with JSON objects per row, containing directory and file information (name, modified date, size) for ~/mydir and some additional metadata for each scan (changes to entries are appended to the end).
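To make the "log file with one JSON object per row" idea concrete, here is a minimal sketch of a scan that appends such rows. It is not Fileson's actual on-disk format: the function names scan_rows and append_rows are hypothetical, and the field names are borrowed from the diff output shown later in this README.

```python
import json
import os
import time

def scan_rows(root):
    """Yield one dict per directory and file under root (illustrative fields only)."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        # None stands for the directory itself, followed by its files
        for name in [None] + sorted(filenames):
            full = dirpath if name is None else os.path.join(dirpath, name)
            st = os.stat(full)
            row = {
                "path": os.path.relpath(full, root),
                "modified_gmt": time.strftime("%Y-%m-%d %H:%M:%S",
                                              time.gmtime(st.st_mtime)),
            }
            if name is not None:
                row["size"] = st.st_size  # only files carry a size here
            yield row

def append_rows(db_path, rows):
    """Append rows to the log, one JSON object per line; older lines are kept."""
    with open(db_path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Because rows are only ever appended, earlier states of the database stay readable, which is what makes versioning possible.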

To calculate an SHA1 checksum for the files as well:

$ python3 fileson_util.py scan files.fson ~/mydir -c sha1

Calculating full SHA1 checksums is somewhat slow (around 1 GB/s on a modern M.2 SSD and 150 MB/s on a mechanical drive), so you can use -c sha1fast to hash only the beginning of each file. It will differentiate most cases quite well.
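The trade-off between a full and a partial checksum can be sketched with hashlib. This is an illustration only: sha1_file and its limit parameter are hypothetical names, and how many bytes sha1fast actually reads (or whether it mixes in other data such as file size) is not stated here.

```python
import hashlib

def sha1_file(path, limit=None, chunk_size=1 << 20):
    """Return the hex SHA1 of a file; with limit set, hash only the first `limit` bytes."""
    h = hashlib.sha1()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            # Never read past the limit when one is given
            n = chunk_size if remaining is None else min(chunk_size, remaining)
            block = f.read(n)
            if not block:
                break  # end of file
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.hexdigest()
```

A partial hash cannot tell apart two files that share the same leading bytes, so it is best treated as a quick filter rather than proof of identity.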

Fileson databases are versioned. Once a database exists, repeated calls to fileson_util.py scan will update the database, keeping track of the changes. You can then use this information to view changes between given runs, etc.

Normally SHA1 checksums are carried over if the previous version had a file with the same name, size and modification time. For a stricter version, you can use -s or --strict to require a full path match. Note that this means calculating a new checksum for all moved files.
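The carry-over rule can be sketched as a dictionary lookup. This is a minimal illustration under assumed row fields (path, size, modified_gmt, sha1), not Fileson's own code; carry_over_checksums is a hypothetical name.

```python
def carry_over_checksums(previous, current, strict=False):
    """Copy 'sha1' from matching previous rows; return the rows that still need hashing."""
    def key(row):
        # Non-strict: match on bare file name; strict: match on the full path
        name = row["path"] if strict else row["path"].rsplit("/", 1)[-1]
        return (name, row["size"], row["modified_gmt"])

    known = {key(r): r["sha1"] for r in previous if "sha1" in r}
    todo = []
    for row in current:
        sha1 = known.get(key(row))
        if sha1 is None:
            todo.append(row)    # no match: checksum must be recalculated
        else:
            row["sha1"] = sha1  # match: reuse the earlier checksum
    return todo
```

With strict=True a file that was moved to another directory no longer matches its old entry, which is why moved files get rehashed in strict mode.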

Duplicate detection

Once you have a Fileson database ready, you can do fun things like see if you have any duplicates in your folder (the cryptic string before each group of duplicates identifies the checksum collision, whether it is based on size or sha1):

$ python3 fileson_util.py duplicates pics.fson

1afc8e06e081b772eadd6a981a83f67077e2ef10
2009/2009-03-07/DSC_3962-2.NEF
2009/2009-03-07/DSC_3962.NEF

Many folders tend to have a lot of small files in common (including empty files), for example source code with git repositories, and that is OK. To ignore those, you can use for example -m 1M to only show duplicates that have a minimum size of 1 MB.
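Duplicate detection with a minimum size boils down to grouping entries by checksum. The sketch below assumes rows shaped like dicts with path, size and sha1 keys; find_duplicates is a hypothetical helper, and the minimum size is given in bytes rather than parsed from a string like 1M.

```python
from collections import defaultdict

def find_duplicates(rows, min_size=0):
    """Group file paths by checksum and keep groups with more than one member."""
    groups = defaultdict(list)
    for row in rows:
        # Skip directories (no checksum) and files below the size threshold
        if "sha1" in row and row.get("size", 0) >= min_size:
            groups[row["sha1"]].append(row["path"])
    return {sha1: sorted(paths) for sha1, paths in groups.items() if len(paths) > 1}
```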

You can skip database creation and give a directory to the command as well:

$ python3 fileson_util.py duplicates /mnt/d/SomeFolder -m 1M -c sha1fast

Change detection

Once you have a Fileson database or two, you can compare them with fileson_util.py diff. Like the duplicates command, one or both arguments can be a directory. Note that two databases with different checksum types will essentially differ on all files.

$ python3 fileson_util.py diff myfiles-2010.fson myfiles-2020.fson \
  myfiles-2010-2020.delta

The myfiles-2010-2020.delta now contains a row per difference between the two databases/directories -- files that exist only in origin, only in target, or have changed.

Let's say you move some.zip around a bit (JSON formatted for clarity):

$ python3 fileson_util.py scan files.fson ~/mydir -c sha1
$ mv ~/mydir/some.zip ~/mydir/subdir/newName.zip
$ python3 fileson_util.py diff files.fson ~/mydir -c sha1 -p
{"path": ".", "src": {"modified_gmt": "2021-02-28 19:42:05"},
    "dest": {"modified_gmt": "2021-02-28 19:42:26"}}
{"path": "some.zip", "src": {"size": 0, "modified_gmt": "2021-02-23 21:57:25"},
    "dest": null}
{"path": "subdir", "src": {"modified_gmt": "2021-02-28 19:42:05"},
    "dest": {"modified_gmt": "2021-02-28 19:42:26"}}
{"path": "subdir/newName.zip", "src": null,
    "dest": {"size": 0, "modified_gmt": "2021-02-23 21:57:25"}}

Doing an incremental backup would involve grabbing the deltas which have src set to null. With SHA1 checksums, you could also only upload the new file if the file blob has not been uploaded before (keeping a separate Fileson object log of backed up files).
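Picking the upload candidates out of a delta can be sketched as a filter over its rows. This assumes the delta is stored as one compact JSON object per line and that directories are the entries without a size field, as in the output above; new_paths is a hypothetical helper, not part of Fileson.

```python
import json

def new_paths(delta_lines):
    """Return paths of files that exist only in the target (src is null)."""
    paths = []
    for line in delta_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        delta = json.loads(line)
        dest = delta["dest"] or {}
        # src null means "not in origin"; a size field marks a file, not a directory
        if delta["src"] is None and "size" in dest:
            paths.append(delta["path"])
    return paths
```

For the move example above this yields only subdir/newName.zip; the removed some.zip and the touched directories are left out.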

Loading Fileson databases has a special syntax similar to git: use db.fson~1 to get the previous version or db.fson~3 to go back 3 steps. This makes printing out changes after a scan a breeze. Instead of the fileson_util.py diff invocation above, you could update the db and see what changed:

$ python3 fileson_util.py scan files.fson
$ python3 fileson_util.py diff files.fson~1 files.fson -p
[ same output as the above diff ]

Note that you did not have to specify checksum type or directory, as it is detected automatically from the Fileson DB.
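The ~N suffix can be separated from the file name with a small parser. This is only a sketch of the idea; split_revision is a hypothetical name and Fileson's own parsing may differ.

```python
import re

def split_revision(spec):
    """Split 'db.fson~3' into ('db.fson', 3); a plain name means 0 steps back."""
    m = re.fullmatch(r"(.+?)~(\d+)", spec)
    if m:
        return m.group(1), int(m.group(2))
    return spec, 0
```

Requiring at least one character before the tilde and digits after it keeps home-relative paths such as ~/mydir from being misread as revisions.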

Use Fileson for simple backups to local or cloud

Fileson contains a robust set of utilities to make backups locally or into S3, either unencrypted or with secure AES256 encryption. For S3 you need to have boto3 client configured first.

Encryption

Encryption is done with a 256-bit key that you can generate easily:

$ python3 fileson_backup.py keygen password salt > my.key

Now my.key contains a 64-character hex key (256 bits) derived from the given password and salt with PBKDF2 (1 million iterations by default), suitable for AES256. You can use the key to encrypt and decrypt data.

$ python3 fileson_backup.py encrypt some.txt some.enc my.key
$ python3 fileson_backup.py decrypt some.enc some2.txt my.key
$ diff some.txt some2.txt
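Deriving a 256-bit key from a password and salt with PBKDF2 can be sketched with the standard library. The hash function (HMAC-SHA256) is an assumption made for this example, as is the name derive_key; check fileson_backup.py for what keygen actually does.

```python
import hashlib

def derive_key(password, salt, iterations=1_000_000):
    """Derive a 32-byte key with PBKDF2 and return it as 64 hex characters."""
    key = hashlib.pbkdf2_hmac(
        "sha256",                 # assumed hash; not confirmed by this README
        password.encode("utf-8"),
        salt.encode("utf-8"),
        iterations,
        dklen=32,                 # 32 bytes = 256 bits
    )
    return key.hex()
```

The high iteration count is what makes guessing the password expensive, so the same password, salt and iteration count are needed to recreate a lost key file.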

Uploading to S3 and downloading

A simple upload/download client is also provided:

$ python3 fileson_backup.py upload some.txt s3://mybucket/objpath
$ python3 fileson_backup.py download s3://mybucket/objpath some2.txt
$ diff some.txt some2.txt

Just add -k my.key to encrypt/decrypt files on the fly with upload and download.
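For reference, an unencrypted upload and download with boto3 can be sketched as below. The helper names are hypothetical and this is not fileson_backup.py's implementation; it only uses the boto3 S3 client's upload_file and download_file calls and assumes credentials are already configured.

```python
from urllib.parse import urlparse

def split_s3_url(url):
    """Split 's3://bucket/key/path' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")

def upload(local_path, s3_url):
    """Upload a local file to the given s3:// location."""
    import boto3  # third-party; imported lazily so URL parsing works without it
    bucket, key = split_s3_url(s3_url)
    boto3.client("s3").upload_file(local_path, bucket, key)

def download(s3_url, local_path):
    """Download the object at the given s3:// location to a local file."""
    import boto3
    bucket, key = split_s3_url(s3_url)
    boto3.client("s3").download_file(bucket, key, local_path)
```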

Backing up a Fileson-scanned directory

Once you have a Fileson database at hand, you can do a backup run. Certain considerations:

  1. The base path of files is taken from the Fileson DB, so if you used a relative path when scanning, the backup command needs to be run in the same directory.
  2. To avoid backing up the same files over and over, the second argument is a backup logfile, essentially recording SHA1 hashes and locations of files backed up.
  3. You need to specify either a local directory or an S3 path.

The backup log is essentially a Fileson DB for your backup location, and it is written line-by-line as the backup progresses. So if the backup process gets interrupted, you can just rerun the backup command and it should resume with the next item that was not yet backed up.
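The resume behaviour follows from writing one log line per completed file and flushing it immediately. Below is a minimal sketch of that pattern with hypothetical names (load_done, backup, copy_file) and a simplified log layout; the real backup log holds more information.

```python
import json
import os

def load_done(log_path):
    """Return the set of checksums already recorded in the backup log."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["sha1"])
    return done

def backup(rows, log_path, copy_file):
    """Copy every row whose checksum is not logged yet; return how many were copied."""
    done = load_done(log_path)
    copied = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for row in rows:
            if row["sha1"] in done:
                continue  # already backed up, possibly under another path
            copy_file(row)  # caller-supplied: local copy or S3 upload
            log.write(json.dumps({"sha1": row["sha1"], "path": row["path"]}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the entry survives an interruption
            done.add(row["sha1"])
            copied += 1
    return copied
```

Since the log line is written only after the copy succeeds, an interrupted run leaves at most one file to be copied again.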

Here is an example of simple backup to a local folder:

$ python3 fileson_util.py scan db.fson ~/mydir -c sha1
$ python3 fileson_backup.py backup db.fson db_backup.log /mnt/backup

That's it. Once files change, re-run scan to update changes and then backup to upload any added objects.

Note: Support for removing files that no longer exist in db.fson from backup location is not yet done.

Owner
Joonas Pihlajamaa