Lazydata: Scalable data dependencies for Python projects

Overview

lazydata is a minimalist library for including data dependencies into Python projects.

Problem: Keeping all data files in git (e.g. via git-lfs) results in a bloated repository copy that takes ages to pull. Keeping data out of git, on the other hand, lets code and data drift out of sync, which is a disaster waiting to happen.

Solution: lazydata only stores references to data files in git, and syncs data files on-demand when they are needed.

Why: The semantics of code and data are different - code needs to be versioned to merge it, and data just needs to be kept in sync. lazydata achieves exactly this in a minimal way.

Benefits:

  • Keeps your git repository clean with just code, while enabling seamless access to any number of linked data files
  • Data consistency assured using file hashes and automatic versioning
  • Choose your own remote storage backend: AWS S3 or (coming soon) a directory over SSH

lazydata is primarily designed for machine learning and data science projects. See this Medium post for more.

Getting started

In this section we'll show how to use lazydata on an example project.

Installation

Install with pip (requires Python 3.5+):

$ pip install lazydata

Add to your project

To enable lazydata, run in project root:

$ lazydata init 

This will initialise lazydata.yml, which will hold the list of files managed by lazydata.
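
The freshly initialised lazydata.yml starts without any tracked files; as you call track() it fills up with a files list like the one shown in the tracking section below. Roughly (a sketch only; the field names follow the later example and may differ between versions):

files:
  - path: <relative path of the data file>
    hash: <hash of the file contents>
    usage: <script that called track()>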

Tracking a file

To start tracking a file, use track("<path_to_file>") in your code:

my_script.py

from lazydata import track

# store the file when loading  
import pandas as pd
df = pd.read_csv(track("data/my_big_table.csv"))

print("Data shape:" + df.shape)

Running the script the first time will start tracking the file:

$ python my_script.py
## lazydata: Tracking a new file data/my_big_table.csv
## Data shape: (10000,100)

The file is now tracked, backed up in your local lazydata cache in ~/.lazydata, and added to lazydata.yml:

files:
  - path: data/my_big_table.csv
    hash: 2C94697198875B6E...
    usage: my_script.py

If you re-run the script without modifying the data file, lazydata will just quickly check that the data file hasn't changed and won't do anything else.

If you modify the data file and re-run the script, this will add another entry to the yml file with the new hash of the data file, i.e. data files are automatically versioned. If you don't want to keep past versions, simply remove them from the yml.
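
For example, after one edit to the file the yml might carry two entries for the same path, one per version (the second hash is a placeholder):

files:
  - path: data/my_big_table.csv
    hash: 2C94697198875B6E...
    usage: my_script.py
  - path: data/my_big_table.csv
    hash: <new hash of the modified file>
    usage: my_script.py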

And you are done! This data file is now tracked and linked to your local repository.

Sharing your tracked files

To access your tracked files from multiple machines, add a remote storage backend where they can be uploaded. To use S3 as a remote storage backend, run:

$ lazydata add-remote s3://mybucket/lazydata

This will configure the S3 backend and also add it to lazydata.yml for future reference.
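
In lazydata.yml this typically shows up as a single remote entry pointing at the bucket, along these lines (the exact key name is an assumption):

remote: s3://mybucket/lazydata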

You can now git commit and push your my_script.py and lazydata.yml files as you normally would.

To copy the stored data files to S3 use:

$ lazydata push

When your collaborator pulls the latest version of the git repository, they will get the script and the lazydata.yml file as usual.

Data files will be downloaded when your collaborator runs my_script.py and the track("my_big_table.csv") is executed:

$ python my_script.py
## lazydata: Downloading stored file my_big_table.csv ...
## Data shape: (10000,100)

To get the data files without running the code, you can also use the command line utility:

# download just this file
$ lazydata pull my_big_table.csv

# download everything used in this script
$ lazydata pull my_script.py

# download everything stored in the data/ directory and subdirs
$ lazydata pull data/

# download the latest version of all data files
$ lazydata pull

Because lazydata.yml is tracked by git you can safely make and switch git branches.

Data dependency scenarios

You can achieve multiple data dependency scenarios by putting lazydata.track() into different parts of the code:

  • Jupyter notebook data dependencies by using tracking in notebooks
  • Data pipeline output tracking by tracking saved files
  • Class-level data dependencies by tracking files in __init__(self) (see the sketch after this list)
  • Module-level data dependencies by tracking files in __init__.py
  • Package-level data dependencies by tracking files in setup.py
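
For example, a class-level data dependency can resolve its tracked file in the constructor, so the download and hash check happen when the object is created (a minimal sketch; the class name and file path are made up for illustration):

from lazydata import track

class SentimentModel:
    def __init__(self):
        # downloads the file on first use, verifies the hash on later runs
        self.weights_path = track("data/sentiment_weights.pkl")

    def load_weights(self):
        with open(self.weights_path, "rb") as f:
            return f.read()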

Coming soon...

  • Examine stored file provenance and properties
  • Faceting multiple files into portable datasets
  • Storing data coming from databases and APIs
  • More remote storage options

Stay in touch

This is an early (but stable) beta release. To find out about new releases, subscribe to our new releases mailing list.

Contributing

The library is licensed under the Apache-2 licence. All contributions are welcome!

Comments
  • lazydata command not recognizable on Windows

    Under "Add to your project", the README says to enable lazydata by running in the project root:

    $ lazydata init

    This resulted in:

    'lazydata' is not recognized as an internal or external command, operable program or batch file.

    on Windows 10.

    opened by lastmeta 6
  • Adding support for custom endpoint

    Useful for users that do not want to rely on Amazon S3 while using this package.

    I'm running a Minio Server for storage, which mimics an S3 container.

    boto3 supports custom endpoints (see its API docs) that it will hit via the S3 API.
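
    For reference, pointing boto3 at a Minio server just means overriding endpoint_url when the client or resource is created; a sketch, not lazydata's current code (the credentials and URL are placeholders):

    import boto3

    # any S3-compatible server (such as a local Minio instance) can be targeted by
    # overriding endpoint_url; everything else goes through the normal S3 API
    s3 = boto3.resource(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="minio-access-key",
        aws_secret_access_key="minio-secret-key",
    )
    s3.Bucket("lazydata").upload_file("data/my_big_table.csv", "my_big_table.csv")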

    It would be good to add tests for this behaviour, along with tests for pulls and pushes under normal behaviour. Using mocks, maybe?

    EDIT: I also corrected a line which caused the package not to work for python 3.5, see commit 0d4f8fc

    opened by zbitouzakaria 6
  • corrupted lazydata.yml if application crashes

    I noticed that when a Python application that is tracking some data crashes at some point, it can leave behind a corrupted yaml file. This is not uncommon in the exploratory phase of building an ML model, where code can crash because of memory issues and the like. It would be great if the yaml file handle were closed after each track call to ensure the file does not get corrupted! Thanks! I really like this project!
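
    A common way to make this crash-safe is to write the updated yaml to a temporary file and atomically swap it into place, roughly like this (a sketch of the general pattern, not lazydata's current implementation):

    import os
    import tempfile

    def atomic_write(path, text):
        # write to a temp file in the same directory, then atomically replace the
        # original, so a crash mid-write never leaves a half-written lazydata.yml
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        try:
            with os.fdopen(fd, "w") as f:
                f.write(text)
            os.replace(tmp_path, path)  # atomic rename (Python 3.3+)
        except BaseException:
            os.remove(tmp_path)
            raise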

    opened by rmanak 4
  • All local file revisions hardlink to the latest revision

    I tested this out by creating and tracking a single file through multiple revisions.

    Let's say we have a big_file.csv whose content looks like this:

    a, b, c
    1, 2, 3
    

    We first track it using this script:

    from lazydata import track
    
    # store the file when loading  
    import pandas as pd
    df = pd.read_csv(track("big_file.csv"))
    
    print("Data shape:" + str(df.shape))
    

    Change the file content multiple times, for example:

    a, b, c
    1, 2, 3
    4, 5, 6
    

    And keep executing the script between the multiple revisions:

    (dev3.5)  ~/test_lazydata > python my_script.py 
    LAZYDATA: Tracking new file `big_file.csv`
    Data shape:(1, 3)
    (dev3.5)  ~/test_lazydata > vim big_file.csv  # changing file
    (dev3.5)  ~/test_lazydata > python my_script.py
    LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
    Data shape:(2, 3)
    (dev3.5)  ~/test_lazydata > vim big_file.csv  # changing file
    (dev3.5)  ~/test_lazydata > python my_script.py
    LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
    Data shape:(3, 3)
    (dev3.5)  ~/test_lazydata > vim big_file.csv  # changing file
    (dev3.5)  ~/test_lazydata > python my_script.py
    LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
    Data shape:(4, 3)
    

    A simple ls afterwards points to the mistake:

    (dev3.5)  ~/test_lazydata > ls -lah
    total 20
    drwxrwxr-x  2 zakaria zakaria 4096 sept.  5 16:14 .
    drwxr-xr-x 56 zakaria zakaria 4096 sept.  5 16:14 ..
    -rw-rw-r--  5 zakaria zakaria   44 sept.  5 16:14 big_file.csv
    -rw-rw-r--  1 zakaria zakaria  482 sept.  5 16:14 lazydata.yml
    -rw-rw-r--  1 zakaria zakaria  158 sept.  5 16:12 my_script.py
    

    Notice the number of hardlinks to big_file.csv. There should only be one. What is happening is that all the revisions point to the same file.

    You can also check ~/.lazydata/data directly for the content of the different files. It's all the same.
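
    The duplication is easy to confirm from Python: every cached revision reports the same inode, i.e. they are hardlinks to one and the same file (an illustrative check; adjust the path to your cache layout):

    import os
    from pathlib import Path

    cache = Path.home() / ".lazydata" / "data"
    for p in sorted(cache.iterdir()):
        st = os.stat(p)
        # identical st_ino values mean the entries are hardlinks to the same file
        print(p.name, "inode:", st.st_ino, "links:", st.st_nlink)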

    opened by zbitouzakaria 3
  • SyntaxError: invalid syntax

    Using Python 2.7.6 and the example script you provide, with a file outside of the github repo (file exists and file path is correct, I checked):

    from lazydata import track
    
    with open(track("/home/lg390/tmp/data/some_data_file.txt"), "r") as f:
        print(f.read())
    

    I get

    -> % python sample_script.py
    Traceback (most recent call last):
      File "sample_script.py", line 1, in <module>
        from lazydata import track
      File "/usr/local/lib/python2.7/dist-packages/lazydata/__init__.py", line 1, in <module>
        from .tracker import track
      File "/usr/local/lib/python2.7/dist-packages/lazydata/tracker.py", line 11
        def track(path:str) -> str:
                      ^
    SyntaxError: invalid syntax
    
    opened by LaurentGatto 3
  • Add http link remote backend

    I have a project https://github.com/rominf/profanityfilter that could benefit from lazydata. I think it would be cool to move the bad-word dictionaries out of the repository and track them alongside the hunspell dictionaries with lazydata. The problem is that I want these files to be accessible by end users, which means I don't want them to be stored in AWS. Instead, I would like them to be downloaded via an HTTP link.

    opened by rominf 2
  • Comparison with DVC

    Hello!

    First of all thank you for your contribution to the community! I’ve just found out about this and it seems to be a nice project that is growing!

    You are probably familiar with dvc (https://github.com/iterative/dvc).

    I’ve been investigating it in order to include it in my ML pipeline. Can you explain briefly how/if Lazydata differs from dvc? And any advantages and disadvantages? I understand that there may be some functionalities that maybe are not yet implemented purely due to time constraints or similar. I’m more interested in knowing if there are any differences in terms of paradigm.

    PS: if you have a different channel for this kind of question, please let me know.

    Thank you very much!

    opened by silverdna 2
  • Publish a release on PyPI

    I've made a PR #13 that you've accepted. Unfortunately, I cannot use the library the easy way because you didn't upload the latest changes on PyPI.

    I propose to implement #20 first.

    opened by rominf 0
  • Move backends requirements to extras_require

    If the package has optional features that require their own dependencies, you can use extras_require.

    I propose to make use of extras_require for all backends that require dependencies, to minimize the number of installed packages. For example, I do not use s3, but all 11 packages are installed, 6 of which are needed only for s3.
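
    For illustration, the setup.py change would look roughly like this (a sketch; the actual split of dependencies is up to the maintainers):

    from setuptools import setup, find_packages

    setup(
        name="lazydata",
        packages=find_packages(),
        install_requires=["pyyaml"],        # core dependencies only
        extras_require={"s3": ["boto3"]},   # installed via `pip install lazydata[s3]`
    )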

    opened by rominf 0
  • Azure integration

    Here's a start on the Azure integration. I haven't written tests yet, but let me know what you think. Also, sorry for some of the style changes; I have an autoformatter on (black). Let me know if you want me to turn that off.

    Ref #18

    opened by avril-affine 3
  • Implementing multiple backends by re-using snakemake.remote or pyfilesystem2

    Would it be possible to wrap the classes implementing snakemake.remote.AbstractRemoteObject into the lazydata.remote.RemoteStorage class?

    This would allow implementing the following remote storage providers in one go (https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html):

    • Amazon Simple Storage Service (AWS S3): snakemake.remote.S3
    • Google Cloud Storage (GS): snakemake.remote.GS
    • File transfer over SSH (SFTP): snakemake.remote.SFTP
    • Read-only web (HTTP[S]): snakemake.remote.HTTP
    • File transfer protocol (FTP): snakemake.remote.FTP
    • Dropbox: snakemake.remote.dropbox
    • XRootD: snakemake.remote.XRootD
    • GenBank / NCBI Entrez: snakemake.remote.NCBI
    • WebDAV: snakemake.remote.webdav
    • GFAL: snakemake.remote.gfal
    • GridFTP: snakemake.remote.gridftp
    • iRODS: snakemake.remote.iRODS
    • EGA: snakemake.remote.EGA

    Pyfilesystem2

    Another alternative would be to write a wrapper around pyfilesystem2: https://github.com/PyFilesystem/pyfilesystem2. It supports the following filesystems: https://www.pyfilesystem.org/page/index-of-filesystems/
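
    As a rough illustration of why this is attractive, a pyfilesystem2-based backend would mostly reduce to opening a filesystem URL and copying files; a sketch, not an existing lazydata API (the bucket URL is a placeholder, and s3:// URLs need the fs-s3fs plugin):

    import fs  # pyfilesystem2
    from fs.copy import copy_file

    local = fs.open_fs(".")
    # swapping the URL for "ssh://...", "ftp://..." and so on would reuse the
    # exact same upload code
    remote = fs.open_fs("s3://mybucket/lazydata")
    copy_file(local, "data/my_big_table.csv", remote, "my_big_table.csv")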

    Builtin

    • FTPFS File Transfer Protocol.
    • ...

    Official

    Filesystems in the PyFilesystem organisation on GitHub.

    • S3FS Amazon S3 Filesystem.
    • WebDavFS WebDav Filesystem.

    Third Party

    • fs.archive Enhanced archive filesystems.
    • fs.dropboxfs Dropbox Filesystem.
    • fs-gcsfs Google Cloud Storage Filesystem.
    • fs.googledrivefs Google Drive Filesystem.
    • fs.onedrivefs Microsoft OneDrive Filesystem.
    • fs.smbfs A filesystem running over the SMB protocol.
    • fs.sshfs A filesystem running over the SSH protocol.
    • fs.youtube A filesystem for accessing YouTube Videos and Playlists.
    • fs.dnla A filesystem for accessing DLNA servers.
    opened by Avsecz 3
  • lazydata track - tracking files produced by other CLI tools

    First, thanks for the amazing package. Exactly what I was looking for!

    It would be great to also have a command lazydata track <file1> <file2> ..., which would run lazydata.track() on the specified files. That way, the user can use CLI tools outside of python while still easily tracking the produced files.
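
    Until such a command exists, a workaround is to call the documented track() function from the shell (a one-liner shown purely as an illustration; the file path is made up):

    # track a file produced by some other CLI tool
    $ python -c "from lazydata import track; track('output/model.bin')"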

    opened by Avsecz 2
Releases (1.0.19)