Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview

  • Free software: Apache 2.0 license

What is Hangar?

Hangar is based on the belief that too much time is spent collecting, managing, and creating home-brewed version control systems for data. At its core, Hangar is designed to solve many of the same problems faced by traditional code version control systems (e.g. Git), just adapted for numerical data:

  • Time travel through the historical evolution of a dataset
  • Zero-cost Branching to enable exploratory analysis and collaboration
  • Cheap Merging to build datasets over time (with multiple collaborators)
  • Completely abstracted organization and management of data files on disk
  • Ability to only retrieve a small portion of the data (as needed) while still maintaining complete historical record
  • Ability to push and pull changes directly to collaborators or a central server (i.e. a truly distributed version control system)

The ability of version control systems to perform these tasks for codebases is largely taken for granted by almost every developer today; however, we are in fact standing on the shoulders of giants, with decades of engineering behind these phenomenally useful tools. Now that a new era of "Data-Defined software" is taking hold, there is a strong need for analogous version control systems designed to handle numerical data at large scale... Welcome to Hangar!

The Hangar Workflow:

   Checkout Branch
          |
          ▼
 Create/Access Data
          |
          ▼
Add/Remove/Update Samples
          |
          ▼
       Commit
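
As a rough sketch, the workflow above might look like this in Python. The calls used here (Repository, init, checkout(write=True), add_ndarray_column, columns, commit, close) are the ones that appear in the bug reports further down this page for Hangar 0.5.x; other releases use different names (datasets, arraysets). The repository path, column name, and user details are placeholders, and the imports are kept inside the function so the definition itself does not require Hangar to be installed.

```python
def hangar_workflow(repo_path):
    """Checkout -> create/access data -> add samples -> commit.

    Assumes `hangar` (0.5.x) and `numpy` are installed and that
    `repo_path` is an existing, empty directory.
    """
    import numpy as np
    from hangar import Repository

    repo = Repository(path=repo_path)
    repo.init(user_name='Your Name', user_email='you@example.com')

    # Checkout Branch: open a write-enabled checkout.
    co = repo.checkout(write=True)
    try:
        # Create/Access Data: define a column from a prototype array,
        # then look it up the same way the bug reports on this page do.
        co.add_ndarray_column('x', prototype=np.array([1]))
        col = co.columns['x']

        # Add/Remove/Update Samples: write inside the column's context manager.
        with col:
            for i in range(10):
                col[i] = np.array([i])

        # Commit: record the staged changes in the history.
        return co.commit('add ten samples to column x')
    finally:
        co.close()
```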

Log Style Output:

*   5254ec (master) : merge commit combining training updates and new validation samples
|\
| * 650361 (add-validation-data) : Add validation labels and image data in isolated branch
* | 5f15b4 : Add some metadata for later reference and add new training samples received after initial import
|/
*   baddba : Initial commit adding training images and labels

Learn more about what Hangar is all about at https://hangar-py.readthedocs.io/

Installation

Hangar is in early alpha development release!

pip install hangar

Documentation

https://hangar-py.readthedocs.io/

Development

To run all the tests, run:

tox

Note, to combine the coverage data from all the tox environments run:

Windows:

set PYTEST_ADDOPTS=--cov-append
tox

Other:

PYTEST_ADDOPTS=--cov-append tox

Comments
  • Dataloaders for PyTorch & Tensorflow

    Dataloaders for PyTorch & Tensorflow

    Motivation and Context

    PyTorch DataLoader for loading data from hangar directly into PyTorch

    If it fixes an open issue, please link to the issue here:

    #13

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by hhsecond 27
  • API Redesign

    API Redesign

    Motivation and Context

    Why is this change required? What problem does it solve?:

    To simplify user interface with arraysets and provide some concept of a dataset as a view across arraysets.

    NOTE: This initial PR is a proof of concept only, and will require extensive discussion before the final design is agreed upon

    If it fixes an open issue, please link to the issue here:

    related to #79 and many conversations on the Hangar Users Slack Channel

    Description

    Describe your changes in detail:

    Added a CheckoutIndexer class, which is inherited by ReaderCheckout and WriterCheckout, to enable the following API (originally proposed by @lantiga and @elistevens).

    dset = repo.checkout(write=True)
    # get an arrayset of the dataset (i.e. a "column" of the dataset?)
    aset = dset['foo']
    
    # get a specific array from 'foo' (returns a named tuple)
    arr = dset['foo', '1']
    # set it too
    dset['foo', '1'] = arr
    
    # get data from dset (returns a named tuple)
    subarr = dset['foo', '1']
    # and set into it
    dset['foo', '1'] = subarr + 1
    
    # get a sample of a dataset across 'foo' and 'bar' (returns a named tuple)
    sample = dset[('foo', 'bar'), '1']
    
    # get a sample of all arraysets in the checkout (returns a named tuple)
    sample = dset[:, '1']
    sample = dset[..., '1']
    
    # get multiple samples
    sample_ids = ['1', '2', '3']
    batch = dset[('foo', 'bar'), sample_ids]
    batch = dset[:, sample_ids]
    batch = dset[..., sample_ids]
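
The indexing convention above can be illustrated with a small stand-in class. This is only a sketch of how __getitem__ might dispatch on the (arrayset selector, sample key) pair; ToyCheckout and its in-memory dicts are hypothetical and are not the CheckoutIndexer implementation from the PR.

```python
from collections import namedtuple


class ToyCheckout:
    """Hypothetical stand-in for a checkout; not Hangar's CheckoutIndexer."""

    def __init__(self, arraysets):
        # Layout: {arrayset_name: {sample_key: value}}
        self._asets = arraysets

    def _names(self, selector):
        # `:` and `...` both mean "every arrayset in the checkout".
        if selector is Ellipsis or isinstance(selector, slice):
            return list(self._asets)
        if isinstance(selector, str):
            return [selector]
        return list(selector)

    def _row(self, names, key):
        # One named tuple per sample, with one field per selected arrayset.
        Row = namedtuple('Row', names)
        return Row(*(self._asets[n][key] for n in names))

    def __getitem__(self, index):
        if isinstance(index, str):            # dset['foo'] -> whole arrayset
            return self._asets[index]
        selector, keys = index
        names = self._names(selector)
        if isinstance(keys, (list, tuple)):   # several samples -> list of rows
            return [self._row(names, k) for k in keys]
        return self._row(names, keys)

    def __setitem__(self, index, value):
        name, key = index                     # dset['foo', '1'] = value
        self._asets[name][key] = value


dset = ToyCheckout({'foo': {'1': 1, '2': 2}, 'bar': {'1': 10, '2': 20}})
dset['foo', '1'] = dset['foo', '1'].foo + 1
sample = dset[..., '1']
batch = dset[('foo', 'bar'), ['1', '2']]
```

Python passes `dset[:, '1']` to __getitem__ as the tuple `(slice(None), '1')` and `dset[..., '1']` as `(Ellipsis, '1')`, which is why one method can serve every form shown in the proposal.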
    

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by rlizzo 14
  • Rename datasets to datacell

    Rename datasets to datacell

    Motivation and Context

    Why is this change required? What problem does it solve?:

    The name Datasets no longer fits its description as a container of tensor/array data.

    To make this clearer, the term datasets has been replaced by datacells, since this more accurately describes the ability for a single sample in a dataset to be made up of individual pieces spread across datacells.

    Description

    Describe your changes in detail:

    The new explanation is in full HERE, but for the sake of brevity, this diagram illustrates the "unique" relationship that a dataset has to samples and datacells:

       A Dataset is thought of as containing Samples, but is actually defined by
        Datacells, which store parts of fully defined Samples in structures
           common across the full aggregation of Samples in the Dataset
    
       _____________________________________
             S1     |    S2    |     S3     |  <------------------------|
       --------------------------------------                           |
           image    |  image   |   image    |  <- Datacell 1  <--|      |
         filename   | filename |  filename  |  <- Datacell 2  <--|-- Dataset
           label    |  label   |   label    |  <- Datacell 3  <--|
         annotation |    -     | annotation |  <- Datacell 4  <--|
    
    
       If a sample does not have a piece of data, lack of info in the Datacell
             makes no difference in any way to the larger picture.
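
The relationship in the diagram can be mimicked with plain dictionaries. This is a toy model only: the datacell names come from the diagram, while the values and the helpers assemble_sample and sample_names are made up for illustration and say nothing about how Hangar stores data.

```python
# Each datacell maps sample names to one piece of a sample.
datacells = {
    'image':      {'S1': 'img-1', 'S2': 'img-2', 'S3': 'img-3'},
    'filename':   {'S1': 'a.png', 'S2': 'b.png', 'S3': 'c.png'},
    'label':      {'S1': 0, 'S2': 1, 'S3': 0},
    'annotation': {'S1': 'cat', 'S3': 'dog'},   # S2 has no annotation
}


def assemble_sample(cells, sample_name):
    """Collect every piece of one sample; missing pieces are simply absent."""
    return {
        name: cell[sample_name]
        for name, cell in cells.items()
        if sample_name in cell
    }


def sample_names(cells):
    """The dataset's samples are the union of keys across all datacells."""
    names = set()
    for cell in cells.values():
        names.update(cell)
    return sorted(names)


s2 = assemble_sample(datacells, 'S2')
```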
    

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.

    Please give this a review @lantiga @hhsecond

    Awaiting Review 
    opened by rlizzo 14
  • Arrayset Subsamples

    Arrayset Subsamples

    Motivation and Context

    Why is this change required? What problem does it solve?:

    This is a large PR which started with the motivation of allowing arraysets to contain subsamples under a common key. Though minimal work was needed for the technical implementation (with essentially no changes made to the hangar core record parsing, history traversal, or tensor storage backends), integrating the API into the current model proved difficult and required some major refactoring of what were previously known as the ArraysetDataReader and ArraysetDataWriter classes.

    Description

    Describe your changes in detail:

    Rather than try to combine every possible API method needed by flat and nested arrayset access into a frankenstein monster class, each access convention implements its own API class methods (fully independent from one another). The appropriate constructors are selected based on the contains_subsamples argument in init_arrayset(). The argument is recorded in the schema so the correct type can be identified in subsequent checkouts.

    I'm working on putting together a summary of the API. That will follow shortly.

    At the moment, about half the tests for the new nested sample container are missing, and I need to re-evaluate some implementation details for how backend file handles are dealt with.

    Screenshots (if appropriate):

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    enhancement WIP Awaiting Review 
    opened by rlizzo 13
  • Hangar Real World Quick Start Tutorial

    Hangar Real World Quick Start Tutorial

    Motivation and Context

    Why is this change required? What problem does it solve?:

    New tutorial covering only the basic stuff for version 0.5 release, such as Repository creation and initialization, adding data to columns and committing changes.

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by alessiamarcolini 11
  • [BUG REPORT] Multiprocess pytorch dataloaders

    [BUG REPORT] Multiprocess pytorch dataloaders

    Describe the bug A clear and concise description of what the bug is.

    The pytorch dataloader cannot currently be run with multiprocess workers:

    >>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
    >>> loader = DataLoader(torch_dset, batch_size=16, num_workers=2)
    >>> for batch in loader:
    ...     train_model(batch)
    Exception: Cannot pickle `hangar.dataloaders.TorchDataset.BatchTuple`
    

    This is because the BatchTuple wrappers passed to hangar.dataloaders.TorchDataset are dynamically defined namedtuple classes, whose definition is not appropriately scoped for a forked subprocess to introspect its name/contents upon pickling.

    https://github.com/tensorwerk/hangar-py/blob/e2c7a89ccb9ddb379e8a3fa8f20dae20fcfb6345/src/hangar/dataloaders/torchloader.py#L63-L74

    As the structure needs to be dynamically defined based on user arguments, we cannot just place the BatchTuple definition in the main body of the module.
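
The underlying pickling failure can be reproduced with the standard library alone, without PyTorch or Hangar. The sketch below is a simplified illustration: make_wrapper is a hypothetical stand-in for the factory code linked above, and the field names are invented.

```python
import pickle
from collections import namedtuple


def make_wrapper(field_names):
    # The class exists only as a return value; no module-level attribute
    # named 'BatchTuple' is ever created, so pickle cannot look it up.
    return namedtuple('BatchTuple', field_names)


Wrapper = make_wrapper(['image', 'label'])
batch = Wrapper(image=[1, 2, 3], label=0)

try:
    pickle.dumps(batch)
    pickled_ok = True
except (pickle.PicklingError, AttributeError) as err:
    pickled_ok = False
    print(f'cannot pickle: {err}')
```

pickle stores classes by reference (module name plus class name), so an instance can only be serialized if its class can be found again under that name in its module.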

    Two possible solutions:

    @lantiga and @hhsecond, let me know what you prefer, or any other solutions you might have.

    1) keep the current return type exactly the same, but add the definition of BatchTuple to globals() before it is passed to TorchLoader

            wrapper = namedtuple('BatchTuple', field_names=field_names)
        else:
            wrapper = namedtuple('BatchTuple', field_names=gasets.arrayset_names, rename=True)
    
        globals()['BatchTuple'] = wrapper
        return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
    

    This works, but manually modifying the global scope is generally bad practice.

    2) return a dict of field_names and tensors instead of a namedtuple

            wrapper = tuple(field_names)
        else:
            wrapper = tuple(gasets.arrayset_names)
    
        return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
    

    And in TorchDataset replace: https://github.com/tensorwerk/hangar-py/blob/e2c7a89ccb9ddb379e8a3fa8f20dae20fcfb6345/src/hangar/dataloaders/torchloader.py#L135 with

        return dict(zip(self.wrapper, out))
    

    This still works and does not modify globals(), but it changes the output of the function to something "not quite as nice" as a namedtuple.
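
A minimal, standard-library sketch of option 2: zipping field names with the fetched values gives a plain dict, which pickles without any class lookup. The names field_names and wrap_sample and the sample values are illustrative assumptions, not code from torchloader.py.

```python
import pickle

# Plays the role of `wrapper` in option 2: just a tuple of field names.
field_names = ('image', 'label')


def wrap_sample(out):
    """Pair each field name with the value fetched for it."""
    return dict(zip(field_names, out))


sample = wrap_sample(([1, 2, 3], 0))

# A dict of built-in types round-trips through pickle unchanged.
restored = pickle.loads(pickle.dumps(sample))
```

The trade-off is the one noted above: fields are read as sample['image'] rather than sample.image.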

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [ ] Unexpected Behavior, Exceptions or Error Thrown
    • [x] Performance Bottleneck
    Bug: Priority 3 
    opened by rlizzo 10
  • Plugins revamp

    Plugins revamp

    Motivation and Context

    Revamping the plugin system to make it actually pluggable for different data types etc.

    Description

    • Introducing a new io module. This will help users use the import/export functionality of plugins programmatically if they don't want to interact with the low-level hangar APIs
    • We potentially can move other modules like dataset inside io module
    • Test cases are work in progress
    • Docs are work in progress

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    Awaiting Review 
    opened by hhsecond 10
  • [BUG REPORT] Commit inside context manager throws RuntimeError

    [BUG REPORT] Commit inside context manager throws RuntimeError

    Describe the bug: If we try to commit inside the context manager (before __exit__()), hangar throws a RuntimeError saying "No changes made in the staging area. Cannot commit." We should allow the user to commit inside the context manager, IMO, but probably with a warning about the performance hit.

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    import numpy as np
    
    from hangar import Repository
    repo = Repository(path='myhangarrepo')
    repo.init(user_name='Sherin Thomas', user_email='[email protected]', remove_old=True)
    
    # generate data
    data = []
    for i in range(1000):
        data.append(np.random.rand(28, 28))
    data = np.array(data)
    
    co = repo.checkout(write=True)
    data_dset = co.datasets.init_dataset('mnist_data', prototype=data[0])
    co.commit('datasets init')
    co.close()
    co = repo.checkout(write=True)
    data_dset = co.datasets['mnist_data']
    
    with data_dset:
        for i in range(len(data)):
            sample_name = str(i)
            data_dset[sample_name] = data[i]
            co.commit('dataset curation: stage 1')  # this throws error
    co.close()
    

    Expected behavior: It should not break the program; instead, it should raise a warning about the performance hit.

    Bug: Priority 2 PR In Progress 
    opened by hhsecond 9
  • Dataloaders for PyTorch

    Dataloaders for PyTorch

    @rlizzo I was thinking of an API like hangar.dataloaders.pytorch (let me know if you have another structure in mind). Basically, the idea is to load data in batches, synchronously or asynchronously, and enable the features of the PyTorch DataLoader. My plan is to have a Dataset class and a DataLoader class, which is essentially the way PyTorch's data loader works.

    enhancement Resolved 
    opened by hhsecond 9
  • [BUG REPORT] New repo creation is unfriendly

    [BUG REPORT] New repo creation is unfriendly

    Describe the bug

    The message HANGAR RUNTIME WARNING: no repository exists at /some/path/__hangar, please use init_repo function makes me think my script is doing something wrong, even though the next thing that I do is call repo.init().

    Additionally, if the path does not exist, there should be an option to have it be created.

    A single default param exists=True flag could handle both cases, creating the directory and suppressing the init warning when set to False
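
One way the proposed flag could behave is sketched below. locate_repo_dir is a hypothetical helper written for this illustration, not part of Hangar; only the __hangar directory name and the idea of a default exists=True parameter come from the report.

```python
import tempfile
import warnings
from pathlib import Path


def locate_repo_dir(path, exists=True):
    """Hypothetical helper illustrating the proposed `exists` flag."""
    repo_dir = Path(path) / '__hangar'
    if exists:
        # Caller expects a repository: warn if nothing is there.
        if not repo_dir.is_dir():
            warnings.warn(f'no repository exists at {repo_dir}', RuntimeWarning)
    else:
        # Caller is about to initialize one: create the path, stay quiet.
        Path(path).mkdir(parents=True, exist_ok=True)
    return repo_dir


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'new' / 'repo'
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        locate_repo_dir(target, exists=False)   # creates the path silently
        created = target.is_dir()
        quiet = len(caught) == 0
        locate_repo_dir(target)                 # default: warns, no __hangar yet
        warned = len(caught) == 1
```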

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck
    Bug: Priority 2 
    opened by elistevens 8
  • Import export in CLI

    Import export in CLI

    Motivation and Context

    Why is this change required? What problem does it solve?:

    Introducing import-export utility over the command line

    If it fixes an open issue, please link to the issue here:

    This PR won't solve issue #72 completely, but it starts with image import and export. Also, the release is going to be experimental and the APIs are subject to change.

    Description

    Describe your changes in detail:

    • Moving CLI as a module
    • Introducing import option using click
    • Introducing export option using click
    • Base class for the introduction of plugin system

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by hhsecond 7
  • WIP Switch To mkdocs

    WIP Switch To mkdocs

    Motivation and Context

    Why is this change required? What problem does it solve?:

    better documentation generator.

    Does not work at the current moment (unhappy with API plugin for numpy style docstrings).

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [ ] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [ ] I have read the CONTRIBUTING document.
    • [ ] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by rlizzo 3
  • [FEATURE REQUEST] Read checkout to be able to read data from staging area

    [FEATURE REQUEST] Read checkout to be able to read data from staging area

    Is your feature request related to a problem? Please describe. Currently, a read checkout cannot read data from the staging area. I was wondering what the technical difficulties would be for reading data from the staging area.

    Describe the solution you'd like: repo.checkout(stage=True) or something similar could give access to the staging area. I think the major obstacle to implementing this is the fact that the staging area could change and might even make the data unavailable. Maybe we could have a flag that changes when the data changes, so that on a read from a stage checkout it could let the reading process know that the data changed (not the specific information, just that the data changed) and invalidate the checkout?
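
The "flag that changes when the data changes" idea could be sketched as a version counter that readers compare against on every access. Everything below (StagingArea, StageReader, StaleCheckoutError) is hypothetical and only illustrates the invalidation logic, not anything in Hangar.

```python
class StaleCheckoutError(RuntimeError):
    """Raised when the staging area changed after the reader was opened."""


class StagingArea:
    def __init__(self):
        self._data = {}
        self.version = 0          # bumped on every write

    def write(self, key, value):
        self._data[key] = value
        self.version += 1


class StageReader:
    def __init__(self, stage):
        self._stage = stage
        self._version = stage.version   # snapshot of the flag at open time

    def read(self, key):
        # The reader learns only *that* something changed, not what.
        if self._stage.version != self._version:
            raise StaleCheckoutError('staging area changed; reopen the checkout')
        return self._stage._data[key]


stage = StagingArea()
stage.write('a', 1)
reader = StageReader(stage)
first = reader.read('a')

stage.write('a', 2)               # a writer modifies the staging area
try:
    reader.read('a')
    invalidated = False
except StaleCheckoutError:
    invalidated = True
```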

    enhancement 
    opened by hhsecond 1
  • [QUESTION & DOCS]: hangar versus DVID?

    [QUESTION & DOCS]: hangar versus DVID?

    Executive Summary: How does the approach of hangar compare with DVID? I am looking at solutions for managing really large datasets and stored ML models. The DVID project seems like it's doing something similar to hangar?

    I don't know enough about devops to be able to determine what kind of solution I could or should choose or what one is buying into when they choose one.

    question documentation 
    opened by kurtsansom 0
  • [BUG REPORT] Diff status always returns CLEAN inside CM

    [BUG REPORT] Diff status always returns CLEAN inside CM

    Describe the bug Diff status always returns CLEAN inside CM

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    from hangar import Repository
    import numpy as np
    
    
    repo = Repository('.')
    repo.init(user_name='me', user_email='[email protected]', remove_old=True)
    co = repo.checkout(write=True)
    co.add_ndarray_column('x', prototype=np.array([1]))
    co.commit('added columns')
    co.close()
    
    co = repo.checkout(write=True)
    x = co.columns['x']
    
    
    with x:
        for i in range(10):
            x[i] = np.array([i])
            print(co.diff.status())  # this should return DIRTY but returns CLEAN
    print(co.diff.status())  # this returns DIRTY as expected
    co.commit('adding file')
    
    
    Bug: Awaiting Priority Assignment 
    opened by hhsecond 0
  • [BUG REPORT] Transaction registers closes early

    [BUG REPORT] Transaction registers closes early

    Describe the bug: With multiple columns, if we open only some of them in a context manager and still try to write to other columns inside that context manager, transactions get False. A sample script to reproduce is given below.

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    from hangar import Repository
    import numpy as np
    
    
    repo = Repository('.')
    repo.init(user_name='me', user_email='[email protected]', remove_old=True)
    co = repo.checkout(write=True)
    co.add_ndarray_column('x', prototype=np.array([1]))
    co.add_ndarray_column('y', prototype=np.array([1]))
    co.commit('added columns')
    co.close()
    
    co = repo.checkout(write=True)
    x = co.columns['x']
    y = co.columns['y']
    
    
    with x:  # note that we are opening only `x` in the CM
        for i in range(10):
            y[i] = np.array([i])  # but we are trying to update `y` column
            x[i] = np.array([i])
    co.commit('adding file')
    co.close()
    

    Desktop (please complete the following information):

    • OS: Ubuntu 19.10
    • Python: 3.7
    • Hangar: 0.5.1.dev0 (master, at the time of writing)
    Bug: Priority 1 
    opened by hhsecond 2
  • Commit Level Metadata

    Commit Level Metadata

    Maybe mention that this metadata is commit level and will not be part of the history.

    Originally posted by @hhsecond in https://github.com/tensorwerk/hangar-py/pull/180

    enhancement 
    opened by rlizzo 0
Releases(v0.5.2)
  • v0.5.2(May 8, 2020)

    v0.5.2 (2020-05-08)

    New Features

    • New column data type supporting arbitrary bytes data. (#198) @rlizzo

    Improvements

    • str typed columns can now accept data containing any unicode code-point. In prior releases data containing any non-ascii character could not be written to this column type. (#198) @rlizzo

    Bug Fixes

    • Fixed issue where str and (newly added) bytes column data could not be fetched / pushed between a local client repository and remote server. (#198) @rlizzo
  • v0.5.1(Apr 6, 2020)

  • v0.5.0(Apr 4, 2020)

    v0.5.0 (2020-04-04)

    Improvements

    • Python 3.8 is now fully supported. (#193) @rlizzo
    • Major backend overhaul which defines column layouts and data types in the same interchangeable / extensible manner as storage backends. This will allow rapid development of new layouts and data type support as new use cases are discovered by the community. (#184) @rlizzo
    • Column and backend classes are now fully serializable (pickleable) for read-only checkouts. (#180) @rlizzo
    • Modularized internal structure of API classes to easily allow new column layouts / data types to be added in the future. (#180) @rlizzo
    • Improved type / value checking of manual specification for column backend and backend_options. (#180) @rlizzo
    • Standardized column data access API to follow python standard library dict methods API. (#180) @rlizzo
    • Memory usage of arrayset checkouts has been reduced by ~70% by using C-structs for allocating sample record locating info. (#179) @rlizzo
    • Read times from the HDF5_00 and HDF5_01 backend have been reduced by 33-38% (or more for arraysets with many samples) by eliminating redundant computation of chunked storage B-Tree. (#179) @rlizzo
    • Commit times and checkout times have been reduced by 11-18% by optimizing record parsing and memory allocation. (#179) @rlizzo

    New Features

    • Added str type column with the same behavior as the ndarray column (supporting both single-level and nested layouts), replacing the functionality of the removed metadata container. (#184) @rlizzo
    • New backend based on LMDB has been added (specifier of lmdb_30). (#184) @rlizzo
    • Added .diff() method to Repository class to enable diffing changes between any pair of commits / branches without needing to open the diff base in a checkout. (#183) @rlizzo
    • New CLI command hangar diff which reports a summary view of changes made between any pair of commits / branches. (#183) @rlizzo
    • Added .log() method to Checkout objects so graphical commit graph or machine readable commit details / DAG can be queried when operating on a particular commit. (#183) @rlizzo
    • "string" type columns now supported alongside "ndarray" column type. (#180) @rlizzo
    • New "column" API, which replaces "arrayset" name. (#180) @rlizzo
    • Arraysets can now contain "nested subsamples" under a common sample key. (#179) @rlizzo
    • New API to add and remove samples from an arrayset. (#179) @rlizzo
    • Added repo.size_nbytes and repo.size_human to report disk usage of a repository on disk. (#174) @rlizzo
    • Added method to traverse the entire repository history and cryptographically verify integrity. (#173) @rlizzo
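
    The integrity check in the last item can be pictured with a toy hash chain. This is only a sketch of the general idea of walking a history and recomputing digests; the record layout and the names commit_digest, build_history and verify_history are hypothetical and are not Hangar's actual implementation or API.

```python
import hashlib
import json


def commit_digest(parent, payload):
    """Hash a commit's parent digest together with its payload."""
    record = json.dumps({"parent": parent, "payload": payload}, sort_keys=True)
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def build_history(payloads):
    """Build a linear chain of commits, each linked to its parent by digest."""
    history, parent = [], ""
    for payload in payloads:
        digest = commit_digest(parent, payload)
        history.append({"digest": digest, "parent": parent, "payload": payload})
        parent = digest
    return history


def verify_history(history):
    """Recompute every digest and check each parent link; True if all match."""
    expected_parent = ""
    for commit in history:
        if commit["parent"] != expected_parent:
            return False
        if commit_digest(commit["parent"], commit["payload"]) != commit["digest"]:
            return False
        expected_parent = commit["digest"]
    return True


history = build_history(["add images", "add labels", "fix sample 12"])
intact = verify_history(history)

# Altering any stored payload makes the recomputed digest disagree.
tampered = [dict(c) for c in history]
tampered[1]["payload"] = "add labels (edited)"
corrupted = verify_history(tampered)
```

    Because each digest covers the parent digest, a change to any earlier commit is detectable from every later one.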

    Changes

    • Argument syntax of the __getitem__() and get() methods of the ReaderCheckout and WriterCheckout classes has changed. The new format supports handling arbitrary arguments specific to retrieval of data from any column type. (#183) @rlizzo

    Removed

    • metadata container for str typed data has been completely removed. It is replaced by a highly extensible and much more user-friendly str typed column. (#184) @rlizzo
    • __setitem__() method in WriterCheckout objects. Writing data to columns via a checkout object is no longer supported. (#183) @rlizzo

    Bug Fixes

    • Backend data stores no longer use file symlinks, improving compatibility with some types of file systems. (#171) @rlizzo
    • All arrayset types ("flat" and "nested subsamples") and backend readers can now be pickled -- for parallel processing -- in a read-only checkout. (#179) @rlizzo

    Breaking changes

    • New backend record serialization format is incompatible with repositories written in version 0.4 or earlier.
    • New arrayset API is incompatible with Hangar API in version 0.4 or earlier.
  • v0.5.0dev3(Apr 4, 2020)

  • v0.5.0dev2(Apr 4, 2020)

  • v0.4.0(Nov 26, 2019)

    Release Notes

    New Features

    • Added ability to delete branch names/pointers from a local repository via both API and CLI. #128 @rlizzo
    • Added local keyword arg to arrayset key/value iterators to return only locally available samples #131 @rlizzo
    • Ability to change the backend storage format and options applied to an arrayset after initialization. #133 @rlizzo
    • Added blosc compression to HDF5 backend by default on PyPi installations. #146 @rlizzo
    • Added Benchmarking Suite to Test for Performance Regressions in PRs. #155 @rlizzo
    • Added new backend optimized to increase speeds for fixed size arrayset access. #160 @rlizzo

    Improvements

    • Removed msgpack and pyyaml dependencies. Cleaned up and improved remote client/server code. #130 @rlizzo
    • Multiprocess Torch DataLoaders allowed on Linux and MacOS. #144 @rlizzo
    • Added CLI options commit, checkout, arrayset create, & arrayset remove. #150 @rlizzo
    • Plugin system revamp. #134 @hhsecond
    • Documentation Improvements and Typo-Fixes. #156 @alessiamarcolini
    • Removed implicit removal of arrayset schema from checkout if every sample was removed from arrayset. This could potentially result in dangling accessors which may or may not self-destruct (as expected) in certain edge-cases. #159 @rlizzo
    • Added type codes to hash digests so that calculation function can be updated in the future without breaking repos written in previous Hangar versions. #165 @rlizzo
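
    The last item describes a common versioning pattern: store a short code naming the hash function next to the digest, so old digests stay verifiable after the default function changes. A minimal sketch follows, assuming a hypothetical "code=hexdigest" layout and a hypothetical HASHERS registry; the actual codes and hash functions Hangar uses are not stated here.

```python
import hashlib

# Hypothetical registry mapping a one-character type code to a hash function.
HASHERS = {
    "0": lambda data: hashlib.blake2b(data, digest_size=20).hexdigest(),
    "1": lambda data: hashlib.sha256(data).hexdigest(),
}


def make_digest(data: bytes, type_code: str = "0") -> str:
    """Return a digest string prefixed with the code of the function used."""
    return f"{type_code}={HASHERS[type_code](data)}"


def check_digest(data: bytes, digest: str) -> bool:
    """Verify data against a digest using whichever function its code names."""
    type_code, _, _ = digest.partition("=")
    return make_digest(data, type_code) == digest


old_style = make_digest(b"sample bytes", "0")
new_style = make_digest(b"sample bytes", "1")
```

    Both old_style and new_style verify with the same check_digest call, which is what lets a newer version read records written by an older one.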

    Bug Fixes

    • Programmatic access to repository log contents now returns branch heads alongside other log info. #125 @rlizzo
    • Fixed minor bug in types of values allowed for Arrayset names vs Sample names. #151 @rlizzo
    • Fixed issue where using checkout object to access a sample in multiple arraysets would try to create a namedtuple instance with invalid field names. Now incompatible field names are automatically renamed with their positional index. #161 @rlizzo
    • Explicitly raise error if commit argument is set while checking out a repository with write=True. #166 @rlizzo
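
    The namedtuple fix in #161 matches the behavior of the standard library's rename flag, shown below. Whether Hangar relies on this flag internally is an assumption; the column names are made up for illustration.

```python
from collections import namedtuple

# "class" is a Python keyword and "1st-label" is not a valid identifier,
# so neither can be used as a namedtuple field name directly.
column_names = ["images", "class", "1st-label"]

# rename=True replaces each invalid name with "_<positional index>".
Sample = namedtuple("Sample", column_names, rename=True)
fields = Sample._fields  # ('images', '_1', '_2')

row = Sample("img-data", "cat", "tabby")
```

    Without rename=True the same call raises ValueError, which corresponds to the failure described in the fix.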

    Breaking changes

    • New commit reference serialization format is incompatible with repositories written in version 0.3.0 or earlier.
  • v0.4.0b0(Oct 19, 2019)

  • v0.3.0(Sep 10, 2019)

    New Features

    • API addition allowing reading and writing arrayset data from a checkout object directly. (#115) @rlizzo
    • Data importers, exporters, and viewers via CLI for common file formats. Includes plugin system for easy extensibility in the future. (#103) (@rlizzo, @hhsecond)

    Improvements

    • Added tutorial on working with remote data. (#113) @rlizzo
    • Added Tutorial on Tensorflow and PyTorch Dataloaders. (#117) @hhsecond
    • Large performance improvement to diff/merge algorithm (~30x previous). (#112) @rlizzo
    • New commit hash algorithm which is much more reproducible in the long term. (#120) @rlizzo
    • HDF5 backend updated to increase speed of reading/writing variable sized dataset compressed chunks (#120) @rlizzo

    Bug Fixes

    • Fixed ML Dataloaders errors for a number of edge cases surrounding partial-remote data and non-common keys. (#110) (@hhsecond, @rlizzo)

    Breaking changes

    • New commit hash algorithm is incompatible with repositories written in version 0.2.0 or earlier
  • v0.2.0(Aug 9, 2019)

    See changelog for full details

    New Features

    • Numpy memory-mapped array file backend added.
    • Remote server data backend added.
    • Selection heuristics to determine appropriate backend from arrayset schema.
    • Partial remote clones and fetch operations now fully supported.
    • CLI has been placed under test coverage, added interface usage to docs.
    • TensorFlow and PyTorch Machine Learning Dataloader Methods (Experimental Release).

    Improvements

    • Record format versioning and standardization so as not to break backwards compatibility in the future.
    • Backend addition and update developer protocols and documentation.
    • Read-only checkout arrayset sample get methods now are multithread and multiprocess safe.
    • Read-only checkout metadata sample get methods are thread safe if used within a context manager.
    • Samples can be assigned integer names in addition to string names.
    • Forgetting to close a write-enabled checkout before terminating the python process will close the checkout automatically for many situations.
    • Repository software version compatibility methods added to ensure upgrade paths in the future.
    • Many tests added (including support for Mac OSX on Travis-CI).

    Bug Fixes

    • Diffs for fast-forward merges now return sensible results.
    • Many type annotations added, and developer documentation improved.

    Breaking changes

    • Renamed all references to datasets in the API / world-view to arraysets.
    • These are backwards incompatible changes. For all versions > 0.2, repository upgrade utilities will be provided if breaking changes occur.
  • v0.1.1(May 24, 2019)

  • v0.1.0(May 24, 2019)

    New Features

    • Remote client-server config negotiation and administrator permissions (#10) @rlizzo
    • Allow single python process to access multiple repositories simultaneously (#20) @rlizzo
    • Fast-Forward and 3-Way Merge and Diff methods now fully supported and behaving as expected (#32) @rlizzo
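
    For readers unfamiliar with the term, a 3-way merge compares both branch heads against their common ancestor. The sketch below applies that rule to plain dicts mapping sample keys to content digests. It is a generic illustration with hypothetical names, not Hangar's merge code, and it assumes None is never a stored value.

```python
def three_way_merge(base, ours, theirs):
    """Merge two dicts of key -> digest against their common ancestor."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both sides agree (including both deleting the key)
        elif o == b:
            value = t  # only theirs changed this key
        elif t == b:
            value = o  # only ours changed this key
        else:
            conflicts.append(key)  # both changed it, in different ways
            continue
        if value is not None:  # None means the key was deleted
            merged[key] = value
    return merged, conflicts


base = {"a": "h1", "b": "h2", "c": "h3"}
ours = {"a": "h1", "b": "h2x", "c": "h3", "d": "h4"}  # changed b, added d
theirs = {"a": "h1", "b": "h2", "e": "h5"}            # removed c, added e

merged, conflicts = three_way_merge(base, ours, theirs)

# The same key changed differently on each side is reported as a conflict.
_, clash = three_way_merge({"a": "h1"}, {"a": "h1-ours"}, {"a": "h1-theirs"})
```

    A fast-forward merge is the degenerate case where one side equals the base, so the result is simply the other side.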

    Improvements

    • Initial test-case specification (#14) @hhsecond
    • Checkout test-case work (#25) @hhsecond
    • Metadata test-case work (#27) @hhsecond
    • Any potential failure cases raise exceptions instead of silently returning (#16) @rlizzo
    • Many usability improvements in a variety of commits

    Bug Fixes

    • Ensure references to checkout dataset or metadata objects cannot operate after the checkout is closed. (#41) @rlizzo
    • Sensible exception classes and error messages raised on a variety of situations (Many commits) @hhsecond & @rlizzo
    • Many minor issues addressed.

    API Additions

    • Refer to API documentation (#23)

    Breaking changes

    • All repositories written with previous versions of Hangar are liable to break when using this version. Please upgrade versions immediately.
Owner
Tensorwerk