A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way

Overview

Squirrel Core

Share, load, and transform data in a collaborative, flexible, and efficient way



What is Squirrel?

Squirrel is a Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.

  1. SPEED: Avoid data stalls, i.e. the expensive GPU will not sit idle while waiting for data.

  2. COSTS: First, avoid GPU stalling; second, shard & cluster your data and store & load it in bundles, decreasing the cost of your cloud storage bucket.

  3. FLEXIBILITY: Work with a flexible, standard data schema that adapts to any setting, including multimodal data.

  4. COLLABORATION: Make it easier to share data & code between teams and projects in a self-service model.

Stream data from anywhere to your machine learning model as easily as:

from squirrel.catalog import Catalog

# "augment" stands in for any user-defined transform
it = (Catalog.from_plugins()["imagenet"].get_driver()
      .get_iter("train")
      .map(lambda r: (augment(r["image"]), r["label"]))
      .batched(100))
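
Writing data is just as compact. Below is a minimal, hedged sketch of the bundled (sharded) storage mentioned under COSTS; the import paths follow the squirrel documentation, but exact signatures may differ between versions, and the URL is a placeholder.

from squirrel.driver import MessagepackDriver
from squirrel.serialization import MessagepackSerializer
from squirrel.store import SquirrelStore

url = "data/my_dataset"  # placeholder: any local or cloud (fsspec) URL works

# write samples as one bundled shard
store = SquirrelStore(url=url, serializer=MessagepackSerializer())
store.set(value=[{"image": [1, 2, 3], "label": 0}], key="shard_0")

# stream them back
samples = MessagepackDriver(url=url).get_iter().collect()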

Check out our full getting started tutorial notebook. If you have any questions or would like to contribute, join our Slack community.

Installation

You can install squirrel-core with:

pip install "squirrel-core[all]"

Documentation

Read our documentation at ReadTheDocs

Example Notebooks

Check out the Squirrel-datasets repository for open-source and community-contributed tutorials and example notebooks on using Squirrel.

Contributing

Squirrel is open source and community contributions are welcome!

Check out the contribution guide to learn how to get involved.

The humans behind Squirrel

We are Merantix Momentum, a team of ~30 machine learning engineers developing machine learning solutions for industry and research. Each project comes with its own challenges, data types, and learnings, but one issue we always faced was scalable data loading, transformation, and sharing. We were looking for a solution that would allow us to load data in a fast and cost-efficient way, while keeping the flexibility to work with any possible dataset and integrate with any API. That's why we built Squirrel – and we hope you'll find it as useful as we do! By the way, we are hiring!

Citation

If you use Squirrel in your research, please cite it using:

@article{2022squirrelcore,
  title={Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.},
  author={Squirrel Developer Team},
  journal={GitHub. Note: https://github.com/merantix-momentum/squirrel-core},
  doi={10.5281/zenodo.6418280},
  year={2022}
}
Comments
  • Update Doc-String of MapDriver.get_iter


    • Better document the behavior of max_workers and link to official ThreadPoolExecutor documentation.
    • Update *_map doc-strings that use ThreadPoolExecutor and link to official ThreadPoolExecutor documentation.

    Fixes #60 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by kai-tub 14
  • Refactoring DataFrameDriver and related drivers


    Description

    Refactors the DataFrameDriver and all data frame-related drivers. In particular:

    • Fixes what I believe is a current bug: storage options are not passed down when reading using the pandas/dask interface. This affected the implementation of CsvDriver.
    • Refactors the DataFrameDriver base class to provide a common interface for all drivers that use some read functionality from pandas or dask. The base class now handles the precedence of storage options and read arguments for all derived classes.
    • Using the new abstraction, adds FeatherDriver, JsonDriver, ParquetDriver, and XlsDriver, and refactors CsvDriver.
    • This does break some datasets using the CsvDriver, as read_csv_kwargs is renamed to a common read_kwargs (see the sketch after this list). However, so far only two research datasets used this property; see the corresponding PR in squirrel-datasets. This is still a bit of a rough sketch: I tested the existing CsvDriver-based datasets, but otherwise it needs a bit more cleanup, I suppose.
    • Renames the previous use_dask option to engine across all data frame drivers.
    • Changes the default DataFrame engine to pandas.
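
    A hedged sketch of how the refactored interface might be used; the driver names and the read_kwargs/engine keywords are taken from this PR, while import paths and constructor signatures are assumptions.

    from squirrel.driver import CsvDriver, ParquetDriver  # import paths assumed

    # dataset-specific read options go into the common read_kwargs;
    # engine selects the backend (default now pandas, per this PR)
    csv = CsvDriver("data/table.csv", read_kwargs={"sep": ";"}, engine="pandas")
    parquet = ParquetDriver("data/table.parquet", engine="dask")

    df = csv.get_df()          # full dataframe
    it = parquet.get_iter()    # row-wise iterstream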

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 8
  • zip multiple iterables as a source


    Description

    The use case: we have a store with samples, and 1..n other stores that each contain only features. These stores must have the same keys and same number of samples per shard.

    IterableZipSource makes it possible to zip items from several iterables and use that as a source. For instance:

    from squirrel.driver import MessagepackDriver  # import path assumed

    # url1 and url2 are placeholders; IterableZipSource is the class proposed in this PR
    it1 = MessagepackDriver(url1).get_iter()
    it2 = MessagepackDriver(url2).get_iter()

    it3 = IterableZipSource(iterables=[it1, it2]).collect()

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 5
  • Add csv driver option to specify csv read args


    Description

    Adds a read_csv_kwargs argument to CsvDriver initialization, which is used in all read_csv calls in the class. Does not break backward compatibility, as get_df and get_iter still allow specifying kwargs for read_csv, which take precedence over the ones given at initialization.

    This makes the creation of new catalog entries based on the CsvDriver much easier, as dataset-specific read options (such as separator, dtypes, etc.) can be specified in the driver_kwargs.
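
    A hedged sketch of what this enables for catalog entries; Source and driver_kwargs follow squirrel's catalog API, while the driver name, keyword names, and URL are assumptions/placeholders.

    from squirrel.catalog import Catalog, Source

    cat = Catalog()
    cat["my_csv_dataset"] = Source(
        driver_name="csv",  # driver name assumed
        driver_kwargs={
            "url": "gs://my-bucket/data.csv",   # placeholder URL
            "read_csv_kwargs": {"sep": ";"},    # dataset-specific pandas options (this PR)
        },
    )
    df = cat["my_csv_dataset"].get_driver().get_df()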

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 4
  • Quantify randomness of shuffle in squirrel


    Description

    Introduces a function to measure the randomness of a shuffle operation in the squirrel pipeline by implementing a simple example driver, sampling randomly, and comparing the distances of sampled trajectories.
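
    As a rough illustration of what "measuring randomness" can mean here, the hedged sketch below feeds ordered indices through a buffered shuffle and computes the mean displacement of each item from its original position. It assumes squirrel.iterstream.IterableSource and its shuffle(size=...) method; it is not the implementation added in this PR.

    from squirrel.iterstream import IterableSource

    n = 10_000
    shuffled = IterableSource(range(n)).shuffle(size=1_000).collect()

    # mean absolute displacement from the original position;
    # a perfectly uniform shuffle would give roughly n / 3
    mean_disp = sum(abs(pos - item) for pos, item in enumerate(shuffled)) / n
    print(f"mean displacement: {mean_disp:.1f} (ideal ≈ {n / 3:.1f})")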

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by winfried-ripken 4
  • Explain automatic version iteration


    Description

    Adds an explanation of the catalog's default version iteration behaviour, which was not clearly stated before.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AdemFr 3
  • Add storage options kwargs to FPGen


    Description

    FilePathGenerator does not expose storage options, which are used by fsspec when instantiating a filesystem. This can be problematic when advanced options are needed to access the data, e.g. the requester_pays argument for accessing data in a Google Cloud Storage bucket. This change adds such kwargs to the constructor of the FilePathGenerator object; they are passed on to fsspec.
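
    A hedged sketch of the usage this enables; the import path and the exact way kwargs are forwarded are assumptions based on the description above, and the URL is a placeholder.

    from squirrel.iterstream.source import FilePathGenerator  # import path assumed

    # extra kwargs are forwarded to fsspec when the filesystem is instantiated,
    # e.g. requester-pays access to a GCS bucket
    paths = FilePathGenerator(
        url="gs://some-requester-pays-bucket/dataset",
        requester_pays=True,
    ).collect()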

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 3
  • Warn when creating driver that points to an empty directory


    Description

    A warning is shown when we iterate over a driver that points to an empty or non-existent directory.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • Nvidia DALI external source integration


    Description

    Motivation

    Squirrel is a fast data loading framework, and Nvidia DALI is a fast, GPU-accelerated library for complex ML workloads such as run-time augmentations. The aim is to provide users with an intuitive interface to use Squirrel as a backend for Nvidia DALI.

    Context

    For more context, check out the internal benchmarks below. Running the Squirrel pipeline without any augmentations reaches approx. 33k samples/sec. If you use Squirrel as an external source together with an affine image augmentation from DALI, you can reach approx. 28k samples/sec. This suggests that DALI can make full use of Squirrel's speed, since data loading is barely slowed down by the run-time augmentations (33k vs. 28k). DALI brings two things to the table: you can augment data in batches rather than one-by-one as is necessary with other frameworks, and you can do it on the GPU.

    (Screenshot of the internal benchmark results omitted.)

    Code Design

    • DALI comes with a concept called "pipeline" (docs) that defines how data should be read and transformed by DALI.
    • We use the external_source data reader API in the DALI pipeline, which we can provide with a modified Squirrel Iterable, the squirrel.iterstream.DaliExternalSource.
    • As suggested by Nvidia DALI staff, I benchmarked loading the samples one-by-one and letting DALI do the batching. It turned out that batching in Squirrel was much faster (18.2k sps vs 32.2k sps). This suggests that DALI profits from the async loading in Squirrel here.
    • As suggested by Nvidia DALI staff, I tried using the parallel external source, which is multi-proc data loading by DALI. As stated in their docs, DALI prefers single samples (un-batched) here, so that DALI can handle the multi-proc logic of parallel data fetching. The problem is that DALI expects a Callable external source for this; Iterables are not allowed for parallel fetching. While this is technically possible (e.g. fit your dataset in one shard and then access the items by their keys, i.e. shard names), indexability is not straightforward and not yet integrated in squirrel. Since DALI already makes use of nearly all of Squirrel's performance, we don't expect that this could speed things up here. But it's worth investigating once the feature is implemented in Squirrel.
    • There was no performance increase from returning cupy arrays on the GPU to the external_source reader. Numpy was slightly faster, so users are advised to return numpy arrays in their collation function.

    Usage Pattern

    • users will simply turn their iterable into an external source with the iterstream API.
    # imports for this sketch (DALI import paths per its documentation; exact
    # locations may vary slightly between DALI versions)
    from typing import Tuple

    import nvidia.dali.fn as fn
    from nvidia.dali import pipeline_def
    from nvidia.dali.data_node import DataNode
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    from squirrel.iterstream import DaliExternalSource  # the class added in this PR

    # define a dummy pipeline
    @pipeline_def
    def pipeline(it: DaliExternalSource, device: str) -> Tuple[DataNode, DataNode]:
        img, label = fn.external_source(source=it, num_outputs=2, device=device)
        enhanced = fn.brightness_contrast(img, contrast=2)  # do other augmentations here
        return enhanced, label

    # squirrel_iterator, my_collation_fn, device and BATCH_SIZE are user-defined
    it = squirrel_iterator.to_dali_external_source(BATCH_SIZE, my_collation_fn)
    pipe = pipeline(it, device, batch_size=BATCH_SIZE)
    pipe.build()

    loader = DALIGenericIterator([pipe], ["img", "label"])
    for item in loader:
        ...  # do something with item

    Things to Discuss

    1. I tried turning the iterstream into a DALIGenericIterator directly and abstracting the above code away, but in my mind that does not make a lot of sense, as DALI users are used to the above API and we are really just an external source. The user will need to define their custom pipeline anyway for their use case, so I don't see a big benefit in abstracting this code away into a squirrel functionality, possibly adding some assumptions here and there and thereby limiting the original functionality of DALI (wdyt @AlirezaSohofi ?).
    2. We would need to find out if the self.i and self.n parameters need to be set for the external source as indicated here. For now, it seems to work out of the box, but maybe for more complex use-cases these variables are needed for DALI to keep track of the loaded samples. Sidenote: Currently DaliExternalSource could also simply be replaced with squirrel_iterable.batched(bs, fn), but I assume that self.i and self.n are needed somehow (input from NVIDIA needed here), so it's useful to have DaliExternalSource where we can add more features.
    3. Please check out the test_to_dali_external_source_gpu_multi_epoch. After iterating over Squirrel's generator once the iterable is empty. Hence after each epoch we need to create a new DALIGenericIterator. Afaik this is also how e.g. Pytorch Lightning handles it. Let me know if that sounds ok, or if we need to loop over the data.
    4. Tests & Requirements: Note that I added pytests for the code, but did not update the requirements accordingly, because the CI currently doesn't run GPU tasks. Moreover, we won't ask users to install DALI for now (also, there are many different versions for different cuda drivers), so we assume people will prefer installing themselves. The DaliExternalSource doesn't depend on any DALI code, so the DALI install is technically not required.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 3
  • Store the processing steps in a stream


    Description

    Store more information in Composables:

    • Which Squirrel version is used
    • Git info e.g. commit-hash, remote repository
    • Log processing steps when chaining Composables

    This aims to provide the user with more information about the stream. When a Composable stores sensitive information, e.g. the url in FilePathGenerator, this should not be logged.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [x] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • [FEATURE] Make `get_iter` method documentation about `max_workers` more explicit


    Hey, I've stumbled across a potentially easy-to-misunderstand part of the MapDriver.get_iter documentation:

    https://github.com/merantix-momentum/squirrel-core/blob/8e2942313c7d7dd974b1ca2f2308895f660d3d26/squirrel/driver/driver.py#L68-L155

    The documentation of max_workers states that by default None will be used and also mentions that this will cause async_map to be called, but I missed these parts of the documentation and was surprised to see that so many threads were allocated.

    I am/was not too familiar with the ThreadPoolExecutor interface and find it somewhat surprising that None equals number_of_processors x 5 according to the ThreadPoolExecutor definition. Maybe it would be helpful to explicitly state that by default a ThreadPoolExecutor will be used with that many threads? The documentation string reads a bit unintuitively, as it starts out saying that max_workers defines how many items are fetched simultaneously and then continues to state that otherwise map is used. From that perspective, max_workers=None doesn't sound like it should be using any threads at all. Without knowing the default values of ThreadPoolExecutor, I would make it more explicit that to disable threading one has to set max_workers=0/1 and that by default many threads are used.
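
    For reference, a quick self-contained way to see what max_workers=None translates to for the underlying ThreadPoolExecutor on a given machine (it reads a private attribute, so it is purely illustrative):

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=None) as pool:
        # _max_workers is a CPython implementation detail, used here only to show
        # how many threads the default actually allocates
        print(pool._max_workers)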

    I am happy to add a PR with my suggested doc-string update if you agree! :)

    enhancement 
    opened by kai-tub 3
  • Interaction Nvidia DALI and Squirrel


    Description

    Describes in detail how Squirrel and DALI can work together. Also includes benchmarks on how to best utilize DALI and how it compares to transforms in Torchvision.

    Attaching PDF rendered version of the Sphinx documentation here. Unfortunately, I couldn't get syntax highlighting to work.

    Apparent next steps are figuring out how Squirrel and DALI can work together in multi-processing. It is not obvious how we could implement this, or whether it would provide a performance boost. Using a DALI parallel external source would probably be the way to go, but DALI expects a callable here that fetches individual images given a specific image index. This could be implemented easily if we set shard-size=1, but our initial experiments showed that larger shard sizes are more desirable.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 0
  • Bugfix deserializer kwargs


    Description

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 0
  • PoC to cache data


    from squirrel.driver import MessagepackDriver  # import path assumed

    # cache_url is the caching option proposed in this PoC; url and another_url are placeholders
    driver = MessagepackDriver(url=url, cache_url=another_url)
    

    Description

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 0
  • Safety checks for store and driver using FilePathGenerator


    Description

    For both the store and the driver we need to assess whether a URL points to an empty directory or nested empty directories.

    • For drivers, warning about empty directories alerts the user early on that the url might be invalid
    • For stores, we want to overwrite an existing non-empty directory only when it is explicitly allowed

    In both cases, checking whether the directories/nested directories are empty is done through the FilePathGenerator
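
    A minimal sketch of such an emptiness check, written directly against fsspec rather than squirrel's internals (FilePathGenerator wraps similar filesystem calls), purely for illustration:

    import fsspec

    def is_empty_or_missing(url: str, **storage_options) -> bool:
        """Return True if the URL does not exist or contains no files (recursively)."""
        fs, path = fsspec.core.url_to_fs(url, **storage_options)
        return not fs.exists(path) or len(fs.find(path)) == 0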

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 2
  • [DRAFT] Support for different SquirrelStore compression modes


    Description

    See #59

    Fixes #59 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [X] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [X] All dependency changes have been reflected in the pip requirement files.

    Draft State!

    This is a draft PR to make it easier to discuss the different pros and cons of various solutions. This is not in a final state.

    I tried to add some tests and verify that they pass locally, but the tests spam a lot of `ValueError: Bucket is requester pays. Set requester_pays=True when creating the GCSFileSystem.` and it is hard to tell where these tests/errors are coming from. The contributing guideline provides no further information on how to run the tests.

    opened by kai-tub 9
  • [FEATURE] Allow configuring compression mode in MessagepackSerializer


    Hey,

    Thank you for working on this library! I think it has huge potential, especially for dataset creators who want to provide their dataset in an optimized deep-learning format that is well suited for distribution. The performance of the MessagepackSerializer is amazing, and being able to distribute subsets of the dataset (shards) is something I never knew I wanted but really want to utilize in the future!

    I have played around with some "MessagepackSerializer" configurations and according to some internal benchmarks, it would be helpful to allow the user to configure the compression algorithm.

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L28-L48

    Currently, the compression mode is "locked" to gzip. I assume the main reason is the wide usage of gzip and keeping the code 'simple', as it makes it easy for the deserializer to know that gzip compression was used:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L58-L81

    Here I would like to note that given the extension, fsspec (default) could also infer the compression by inspecting the filename suffix. But I can see how this might cause problems if somebody would like to switch out fsspec with something else (although I would have no idea with what and why :D )

    Other spots within the codebase that are coupled to this compression assumption are the methods from the SquirrelStore:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L12-L67

    Or to show the significant parts:

    • get: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L40-L41

    • set: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L59-L60

    • keys: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L66-L67

    In my internal benchmarks, I was able to greatly speed up data loading by simply using no compression at all (None), although I am fully aware that the correct compression mode heavily depends on the specific hardware/use case. But even in a network-limited setting, I can see reasons to prefer xz instead, due to its better compression ratio and relatively similar decompression speed to gzip.

    IMHO, it should be ok not to store any suffix at all for the squirrel store. If I or another user look inside the squirrel store URL, it is not mandatory to show which compression algorithm was used. The user could/should use the designated driver/metadata that comes bundled with the dataset and let the driver handle the correct decompression.

    If you don't agree, I still think the gz extension doesn't have to be 'hardcoded' into these functions. This is actually something that confused me when I was looking at the internals of the code base. So instead, we could use something like:

    # hypothetical mapping from compression mode to file extension, just to show the concept
    comp_to_ext = {"gzip": ".gz", "xz": ".xz", None: ""}
    comp = kwargs.get("compression", "gzip")
    ext = comp_to_ext[comp]


    With these modifications, it should be possible to utilize different compression modes and make them easily configurable. I would be very happy to create a PR and contribute to this project!
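
    To illustrate that the change is mostly plumbing, here is a hedged sketch (not squirrel's actual serializer API; the function names are made up) of how fsspec already handles a configurable compression codec transparently:

    from typing import Optional

    import fsspec

    def write_shard(payload: bytes, url: str, compression: Optional[str] = "gzip") -> None:
        # fsspec accepts e.g. "gzip", "xz", or None (no compression)
        with fsspec.open(url, mode="wb", compression=compression) as f:
            f.write(payload)

    def read_shard(url: str, compression: Optional[str] = "gzip") -> bytes:
        with fsspec.open(url, mode="rb", compression=compression) as f:
            return f.read()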

    enhancement 
    opened by kai-tub 3
Releases (v0.18.0)
  • v0.18.0(Nov 10, 2022)

    What's Changed

    • zip_index method for Composable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/92
    • Quantify randomness of shuffle in squirrel by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/86
    • Change Catalog repr to sorted set by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/94
    • Installation instruction by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/96
    • Upgrade requirements by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/97
    • Reference Huggingface, Hub and Torchvision Drivers by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/99
    • Update requirements by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/101
    • Refactoring DataFrameDriver and related drivers by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/98

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.7...v0.18.0

  • v0.17.7(Oct 7, 2022)

    What's Changed

    • Add hooks to check backwards compatibility with py3.6+ by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/87
    • Add pyupgrade, yaml formatting and update all hooks by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/88
    • Fix file driver storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/85
    • Peng add kwargs to map by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/90
    • Add hooks to csv driver by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/91
    • Explain automatic version iteration by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/84
    • Add csv driver option to specify csv read args by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/93

    New Contributors

    • @MaxSchambach made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/93

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.4...v0.17.7

  • v0.17.4(Aug 31, 2022)

    What's Changed

    • Make this repo installable with all python versions by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/82
    • Fix storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/83

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.2...v0.17.4

  • v0.17.2(Aug 25, 2022)

    What's Changed

    • Make CatalogSource visible in the API by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/71
    • Minor tweaks in documentation by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/73
    • Introduce rst linting via precommit hook by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/74
    • Remove binary file in tests dir by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/75
    • Unifies folder-creation behaviour when instantiation SquirrelStore by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/72
    • Bugfix - Register Torch Composables by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/78
    • Upgrade infra to py3.9 by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/79
    • Add storage options kwargs to FPGen by @mg515 in https://github.com/merantix-momentum/squirrel-core/pull/81

    New Contributors

    • @axkoenig made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/78
    • @mg515 made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/81

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.16.0...v0.17.2

  • v0.16.0(Jul 26, 2022)

    What's Changed

    • introduce loop and fixed size iterable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/47
    • Move cla assistant to workflows by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/62
    • *add tutorials, *ignore test in api-ref, *remove unused execption by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/63
    • First draft of advanced section for iterstreams by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/55
    • Update Doc-String of MapDriver.get_iter by @kai-tub in https://github.com/merantix-momentum/squirrel-core/pull/61
    • Composable.compose gets source as kwarg, which is equal to self by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/66
    • Peng add pytorch convenience functions to composable by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/69
    • partial function for keys method by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/70

    New Contributors

    • @kai-tub made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/61

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/0.14.2...v0.16.0

  • 0.14.2(Jun 23, 2022)

    What's Changed

    • change squirrel test using a tmp public bucket by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/46
    • Update fs.open mode for catalog by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/48
    • CatalogKey can be used to index catalog by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/49
    • accept callable as source for composable to make it completly lazy by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/44
    • add sphinxcontrib-mermaid by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/51
    • Architecture overview by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/54
    • *add advanced store *reorganize sections *add icon,favicon by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/53
    • Create codeql-analysis.yml by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/52
    • Upgrade numpy & numba by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/57
    • Winnie bump pyjwt by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/58

    New Contributors

    • @AdemFr made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/48
    • @pzdkn made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/53
    • @winfried-loetzsch made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/57

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.2...0.14.2

  • v0.13.2(May 18, 2022)

    What's Changed

    • Fix SourceCombiner.get_iter() not interleaving correctly by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/45

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.1...v0.13.2

  • v0.13.1(May 18, 2022)

    What's Changed

    • Add community files by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/38
    • Minor requirement changes by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/40
    • messagepack unpacker set use_list argument to False by default by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/39

    New Contributors

    • @AlpAribal made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/40

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.3...v0.13.1

  • v0.12.3(Apr 11, 2022)

    What's Changed

    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/31
    • pin numpy and update PR template by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/34
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/33
    • update document links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/36
    • update version to 0.12.3 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/37

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.2...v0.12.3

  • v0.12.2(Apr 6, 2022)

    What's Changed

    • update img to github raw file so public pypi can load it by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/26
    • Tiansu add readthedocs.yml by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/27
    • add dependencies for readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/28
    • fix readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/29
    • update readthedocs links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/30
    • Tiansu move leftover commits by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/32

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.1...v0.12.2

  • v0.12.1(Apr 5, 2022)

    What's Changed

    • update docs link by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/12
    • add logo by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/13
    • remove old extra file by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/14
    • add back keyring until public release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/16
    • key_hook param of get_iter accepts SplitByRank and SplitByWorker, par… by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/15
    • fix install instruction by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/18
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/19
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/20
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/21
    • Tiansu update black by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/22
    • add CLA bot by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/23
    • switch to publish in public pypi by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/24
    • update version to 0.12.1 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/25

    New Contributors

    • @ThomasWollmann made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/13
    • @AlirezaSohofi made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/15

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.0...v0.12.1

  • v0.12.0(Mar 12, 2022)

    What's Changed

    • add basic files to get infrastructure running by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/3
    • new semantic versioning format for dev release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/4
    • tiansu copy squirrel codebase by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/5
    • Tiansu add docs by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/9
    • add pypi classifiers by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/10
    • change version norm by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/11

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/commits/v0.12.0
