A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

Overview

Squirrel Core

Share, load, and transform data in a collaborative, flexible, and efficient way


What is Squirrel?

Squirrel is a Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.

  1. SPEED: Avoid data stalls, i.e. keep the expensive GPU from sitting idle while it waits for data.

  2. COSTS: First, avoid GPU stalls. Second, shard and cluster your data and store and load it in bundles, which decreases the cost of your cloud storage bucket.

  3. FLEXIBILITY: Work with a flexible standard data scheme that is adaptable to any setting, including multimodal data.

  4. COLLABORATION: Make it easier to share data and code between teams and projects in a self-service model.

Stream data from anywhere to your machine learning model as easily as:

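from squirrel.catalog import Catalog

# augment is a user-defined transformation that is applied to each image
it = (Catalog.from_plugins()["imagenet"].get_driver()
      .get_iter("train")
      .map(lambda r: (augment(r["image"]), r["label"]))
      .batched(100))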

Check out our full getting started tutorial notebook. If you have any questions or would like to contribute, join our Slack community.

Installation

Install squirrel-core with pip:

pip install "squirrel-core[all]"

Documentation

Read our documentation at ReadTheDocs.

Example Notebooks

Check out the Squirrel-datasets repository for open source and community-contributed tutorials and example notebooks that use Squirrel.

Contributing

Squirrel is open source and community contributions are welcome!

Check out the contribution guide to learn how to get involved.

The humans behind Squirrel

We are Merantix Momentum, a team of ~30 machine learning engineers developing machine learning solutions for industry and research. Each project comes with its own challenges, data types and learnings, but one issue we always faced was scalable data loading, transforming and sharing. We were looking for a solution that would allow us to load the data in a fast and cost-efficient way, while keeping the flexibility to work with any possible dataset and integrate with any API. That's why we built Squirrel, and we hope you'll find it as useful as we do! By the way, we are hiring!

Citation

If you use Squirrel in your research, please cite it using:

@article{2022squirrelcore,
  title={Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.},
  author={Squirrel Developer Team},
  journal={GitHub. Note: https://github.com/merantix-momentum/squirrel-core},
  doi={10.5281/zenodo.6418280},
  year={2022}
}
Comments
  • Update Doc-String of MapDriver.get_iter

    • Better document the behavior of max_workers and link to official ThreadPoolExecutor documentation.
    • Update *_map doc-strings that use ThreadPoolExecutor and link to official ThreadPoolExecutor documentation.

    Fixes #60 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by kai-tub 14
  • Refactoring DataFrameDriver and related drivers

    Description

    Refactors the DataFrameDriver and all data frame-related drivers. In particular:

    • Fixes a bug (I believe) where storage options were not passed down when reading through the pandas/dask interface. This affected the implementation of CsvDriver.
    • Refactors the DataFrameDriver base class to provide a common interface for all drivers that use read functionality from pandas or dask. The base class now handles the storage options and the precedence of read arguments for all derived classes.
    • Using the new abstraction, adds FeatherDriver, JsonDriver, ParquetDriver, and XlsDriver and refactors CsvDriver.
    • This breaks some datasets that use the CsvDriver, because read_csv_kwargs is now renamed to a common read_kwargs. So far, only two research datasets used this property; see the corresponding PR in squirrel-datasets. This is still a bit of a rough sketch: I tested the existing CsvDriver-based datasets, but otherwise this requires a bit more cleanup, I suppose.
    • Renames the previous use_dask option to engine across all data frame drivers.
    • Changes the default DataFrame engine to pandas.

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 8
  • zip multiple iterables as a source

    Description

    The use case: we have a store with samples, and 1..n other stores that each contain only features. These stores must have the same keys and same number of samples per shard.

    IterableZipSource makes it possible to zip items from several iterables and use that as a source. For instance:

    it1 = MessagepackDriver(url1).get_iter()
    it2 = MessagepackDriver(url2).get_iter()
    
    it3 = IterableZipSource(iterables=[it1, it2]).collect()
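
As a rough illustration of the zipping idea, the sketch below merges one stream of samples with one or more feature streams in plain Python. It is not the IterableZipSource implementation: zip_samples and the "key" field on each item are assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator


def zip_samples(samples: Iterable[Dict], *features: Iterable[Dict]) -> Iterator[Dict]:
    """Merge items that arrive in the same order from several iterables."""
    for items in zip(samples, *features):
        keys = {item["key"] for item in items}
        if len(keys) != 1:
            # The stores are expected to share keys; fail loudly if they drift apart.
            raise ValueError(f"Mismatched keys: {sorted(keys)}")
        merged: Dict = {}
        for item in items:
            merged.update(item)
        yield merged


# Hypothetical in-memory stand-ins for two stores with matching keys.
samples = [{"key": "a", "image": 1}, {"key": "b", "image": 2}]
features = [{"key": "a", "embedding": [0.1]}, {"key": "b", "embedding": [0.2]}]
zipped = list(zip_samples(samples, features))
```

The built-in zip stops at the shortest input, so in this sketch a store with fewer samples would go unnoticed; that is one reason the stores need the same number of samples per shard.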
    
    

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 5
  • Add csv driver option to specify csv read args

    Description

    Adds a read_csv_kwargs argument to CsvDriver initialization, which is used in all read_csv calls in the class. This does not break backward compatibility, as get_df and get_iter still accept kwargs for read_csv, which take precedence over the ones given at initialization.

    This makes creating new catalog entries based on the CsvDriver much easier, as dataset-specific read options (such as separator, dtypes, etc.) can be specified in the driver_kwargs.
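
The precedence rule described above can be sketched with the standard library alone. CsvReaderSketch and get_rows are hypothetical names used only for illustration; this is not the CsvDriver code, and it uses csv.reader rather than pandas to stay dependency-free.

```python
import csv
import io


class CsvReaderSketch:
    """Hypothetical stand-in for a CSV driver; not squirrel's CsvDriver."""

    def __init__(self, text: str, **read_csv_kwargs) -> None:
        self.text = text
        self.read_csv_kwargs = read_csv_kwargs

    def get_rows(self, **kwargs) -> list:
        # Call-time kwargs override the defaults given at initialization.
        merged = {**self.read_csv_kwargs, **kwargs}
        return list(csv.reader(io.StringIO(self.text), **merged))


driver = CsvReaderSketch("a;b\n1;2\n", delimiter=";")
rows_default = driver.get_rows()                # uses the init-time delimiter
rows_override = driver.get_rows(delimiter=",")  # call-time value wins
```

Merging the two dictionaries with the call-time one last is what lets a catalog entry carry dataset-specific defaults while callers can still override them.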

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 4
  • Quantify randomness of shuffle in squirrel

    Description

    Introduce a function to measure the randomness of a shuffle operation in the squirrel pipeline by implementing a simple example driver, random sampling and comparing the distances of sampled trajectories.
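
To make the idea of measuring a shuffle more concrete, here is a minimal sketch. It does not reproduce the function added in this PR: buffer_shuffle is a generic buffer-based shuffle and mean_displacement is a simple stand-in metric (how far, on average, each item moved from its original position), both chosen as assumptions for this example.

```python
import random
from typing import Iterable, Iterator, List


def buffer_shuffle(items: Iterable[int], buffer_size: int, rng: random.Random) -> Iterator[int]:
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    buf: List[int] = []
    for item in items:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    rng.shuffle(buf)
    yield from buf


def mean_displacement(order: List[int]) -> float:
    """Average distance between each item's original index and its new position."""
    return sum(abs(pos - val) for pos, val in enumerate(order)) / len(order)


small = list(buffer_shuffle(range(1000), 10, random.Random(0)))
large = list(buffer_shuffle(range(1000), 500, random.Random(0)))
scores = (mean_displacement(small), mean_displacement(large))
```

Under this metric an unshuffled stream scores 0, and larger buffers are expected to score higher because items can travel further from where they started.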

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by winfried-ripken 4
  • Explain automatic version iteration

    Description

    Adding explanation of the default version iteration behaviour of the catalog, which was not clearly stated before.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AdemFr 3
  • Add storage options kwargs to FPGen

    Description

    FilePathGenerator does not expose the storage options that fsspec uses when instantiating a filesystem. This can be problematic when advanced options are needed to access the data, e.g. when the requester_pays argument is needed to access data in a Google bucket. This change adds such kwargs to the constructor of the FilePathGenerator object and passes them on to fsspec.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 3
  • Warn when creating driver that points to an empty directory

    Description

    A warning is shown when we iterate over a driver that points to an empty or non-existent directory.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • Nvidia DALI external source integration

    Description

    Motivation: Squirrel is a fast data loading framework, and Nvidia DALI is a fast, GPU-accelerated library for complex ML workloads such as run-time augmentations. The aim is to provide users with an intuitive interface to use Squirrel as a backend for Nvidia DALI.

    Context: For more context, check out the internal benchmarks below. Running the Squirrel pipeline without any augmentations yields approx. 33k samples / sec. If you use Squirrel as an external source together with an affine image augmentation from DALI, you can reach approx. 28k samples / sec. This suggests that DALI can make full use of Squirrel's speed, as the run-time augmentations barely slow down data loading (33k vs 28k). DALI brings two things to the table: you can augment data in batches rather than one-by-one, as is necessary with the other frameworks, and you can do it on the GPU.

    [Screenshot of the internal benchmark results]

    Code Design

    • DALI comes with a concept called "pipeline" (docs), that defines how data should be read and transformed by DALI.
    • We use the external_source data reader API in the DALI pipeline, which we can provide with a modified Squirrel Iterable, the squirrel.iterstream.DaliExternalSource.
    • As suggested by Nvidia DALI staff, I benchmarked loading the samples one-by-one and letting DALI do the batching. It turned out that batching in Squirrel was much faster (18.2k sps vs 32.2k sps). This suggests that DALI profits from the async loading in Squirrel here.
    • As suggested by Nvidia DALI staff, I tried using parallel external source, which is multi-proc dataloading by DALI. As stated in their docs, DALI prefers single samples (un-batched) here to let DALI handle the multi-proc logic of parallel data fetching. The problem here is that DALI would like a Callable external source here, Iterables are not allowed for parallel fetching. While this is technically possible (e.g. fit your dataset in one shard and then access the items by their keys, i.e. shard names), indexability is not straightforward and not yet integrated in squirrel. Since DALI already nearly makes use of Squirrel's full performance, we don't see that DALI could speed things up here. But it's worth investigating once the feature is implemented in Squirrel.
    • There was no performance increase by returning cupy arrays on the GPU to the external_source reader. Numpy was slightly faster, so users are advised to return numpy arrays in their collation function.

    Usage Pattern

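    • Users will simply turn their iterable into an external source with the iterstream API.
    from typing import Tuple

    from nvidia.dali import fn, pipeline_def
    from nvidia.dali.data_node import DataNode
    from nvidia.dali.plugin.pytorch import DALIGenericIterator  # assuming the PyTorch plugin
    from squirrel.iterstream import DaliExternalSource  # module path as named in this PR

    # define a dummy pipeline
    @pipeline_def
    def pipeline(it: DaliExternalSource, device: str) -> Tuple[DataNode, DataNode]:
        img, label = fn.external_source(source=it, num_outputs=2, device=device)
        enhanced = fn.brightness_contrast(img, contrast=2)  # do other augmentations here
        return enhanced, label

    # squirrel_iterator, my_collation_fn, device and BATCH_SIZE are defined by the user
    it = squirrel_iterator.to_dali_external_source(BATCH_SIZE, my_collation_fn)
    pipe = pipeline(it, device, batch_size=BATCH_SIZE)
    pipe.build()

    loader = DALIGenericIterator([pipe], ["img", "label"])
    for item in loader:
        ...  # consume the batch, e.g. run a training step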
    

    Things to Discuss

    1. I tried turning the iterstream into a DALIGenericIterator directly to abstract the above code away, but in my mind that does not make a lot of sense, as DALI users are used to the above API and we are really just an external source. The user will need to define their custom pipeline anyway for their use-case, so I don't see a big benefit in abstracting the above code away into Squirrel functionality, which might add some assumptions here and there and thereby limit the original functionality of DALI (wdyt @AlirezaSohofi ?).
    2. We would need to find out if the self.i and self.n parameters need to be set for the external source as indicated here. For now, it seems to work out of the box, but maybe for more complex use-cases these variables are needed for DALI to keep track of the loaded samples. Sidenote: Currently DaliExternalSource could also simply be replaced with squirrel_iterable.batched(bs, fn), but I assume that self.i and self.n are needed somehow (input from NVIDIA needed here), so it's useful to have DaliExternalSource where we can add more features.
    3. Please check out the test_to_dali_external_source_gpu_multi_epoch. After iterating over Squirrel's generator once, the iterable is empty. Hence, after each epoch we need to create a new DALIGenericIterator. Afaik this is also how e.g. PyTorch Lightning handles it. Let me know if that sounds ok, or if we need to loop over the data.
    4. Tests & Requirements: Note that I added pytests for the code, but did not update the requirements accordingly, because the CI currently doesn't run GPU tasks. Moreover, we won't ask users to install DALI for now (also, there are many different versions for different cuda drivers), so we assume people will prefer installing themselves. The DaliExternalSource doesn't depend on any DALI code, so the DALI install is technically not required.
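
The multi-epoch point in item 3 comes down to ordinary Python generator behavior, which the following sketch shows without any DALI or Squirrel code. make_stream is a hypothetical stand-in for building a fresh iterator.

```python
def make_stream():
    """Stand-in for building a fresh iterator; yields three dummy samples."""
    return (i for i in range(3))


stream = make_stream()
first_pass = list(stream)
second_pass = list(stream)  # the generator is already exhausted, so this is empty

# Rebuilding the iterator at the start of every epoch gives full passes each time.
epochs = [list(make_stream()) for _ in range(2)]
```

This is why a new iterator has to be created per epoch rather than reusing the one that has already been consumed.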

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 3
  • Store the processing steps in a stream

    Description

    Store more information in Composables:

    • Which Squirrel version is used
    • Git info e.g. commit-hash, remote repository
    • Log processing steps when chaining Composables

    This aims to provide the user more information about the stream. When a Composable stores sensitive information e.g. url in FilePathGenerator, then this should not be logged.
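
One way to picture the logging of processing steps is the sketch below. LoggedStream is a hypothetical class written for this example and does not mirror the Composable changes in the PR; it records a label per chained step and expects the caller to pass an already redacted label for the source.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LoggedStream:
    """Hypothetical chainable stream that records each processing step."""

    def __init__(self, source: Iterable, steps: Optional[List[str]] = None) -> None:
        self._source = source
        self.steps = list(steps or [])

    def map(self, fn: Callable) -> "LoggedStream":
        return LoggedStream((fn(x) for x in self._source), self.steps + [f"map({fn.__name__})"])

    def filter(self, predicate: Callable) -> "LoggedStream":
        return LoggedStream(
            (x for x in self._source if predicate(x)),
            self.steps + [f"filter({predicate.__name__})"],
        )

    def __iter__(self) -> Iterator:
        return iter(self._source)


def double(x: int) -> int:
    return 2 * x


def is_positive(x: int) -> bool:
    return x > 0


# The source label is redacted on purpose so that no URL ends up in the log.
stream = LoggedStream([-1, 1, 2], steps=["source(<url redacted>)"]).filter(is_positive).map(double)
result = list(stream)
```

Keeping the step log as plain strings makes it easy to store next to version and git information, and leaves it to each step to decide what is safe to record.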

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [x] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • [FEATURE] Make `get_iter` method documentation about `max_workers` more explicit

    Hey, I've stumbled across a potentially easy-to-misunderstand part of the MapDriver.get_iter documentation:

    https://github.com/merantix-momentum/squirrel-core/blob/8e2942313c7d7dd974b1ca2f2308895f660d3d26/squirrel/driver/driver.py#L68-L155

    The documentation of max_workers states that None is used by default and also mentions that this causes async_map to be called, but I missed these parts of the documentation and was surprised to see so many threads allocated.

    I am/was not too familiar with the ThreadPoolExecutor interface and find it somewhat surprising that None equals number_of_processors x 5 according to the ThreadPoolExecutor definition. Maybe it would be helpful to explicitly state that by default ThreadPoolExecutor will be used with that many threads? The documentation string reads a bit unintuitively, as it starts out stating that max_workers defines how many items are fetched simultaneously and then continues to state that otherwise map is used. From that perspective, max_workers=None doesn't sound like it should be using any threads at all. Without knowing the default values of ThreadPoolExecutor, I would make it more explicit that to disable threading one has to set max_workers=0/1 and that by default many threads are used.

    I am happy to add a PR with my suggested doc-string update if you agree! :)
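
The following sketch, which is not Squirrel's implementation, illustrates the behavior discussed above. The hypothetical fetch_all helper runs sequentially when max_workers is 0 or 1 and otherwise hands max_workers (including None) to ThreadPoolExecutor. default_thread_count reads the private _max_workers attribute of CPython's executor purely for inspection. The default itself depends on the Python version: number of processors x 5 in older releases, and a value capped at 32 since Python 3.8.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def default_thread_count() -> int:
    """Return the worker count ThreadPoolExecutor picks for max_workers=None."""
    with ThreadPoolExecutor(max_workers=None) as pool:
        # _max_workers is a private CPython attribute; used here only for inspection.
        return pool._max_workers


def fetch_all(keys: Iterable, fetch: Callable, max_workers: Optional[int] = None) -> List:
    """Fetch items in threads, or sequentially when max_workers is 0 or 1."""
    if max_workers is not None and max_workers <= 1:
        return [fetch(k) for k in keys]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the keys.
        return list(pool.map(fetch, keys))


threads_by_default = default_thread_count()
doubled = fetch_all([1, 2, 3], lambda k: k * 2, max_workers=0)  # no threads used
```

Printing threads_by_default on the target machine is a quick way to see how many threads a max_workers=None call would allocate.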

    enhancement 
    opened by kai-tub 3
  • Interaction Nvidia DALI and Squirrel

    Description

    Describes in detail how Squirrel and DALI can work together. Also includes benchmarks on how to best utilize DALI and how it compares to transforms in Torchvision.

    Attaching PDF rendered version of the Sphinx documentation here. Unfortunately, I couldn't get syntax highlighting to work.

    Apparent next steps are figuring out how Squirrel and DALI can work together in multi-processing. It is not obvious how we could implement this, and if this provides a performance boost. Using a DALI parallel external source would probably be the way to go, but DALI expects a callable here that fetches individual images given a specific image index. This can be implemented easily if we set shard-size=1, but our initial experiments showed that larger shard sizes are more desirable.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 0
  • Bugfix deserializer kwargs

    Description

    Fixes # issue

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 0
  • PoC to cache data

    driver = MessagepackDriver(url=url, cache_url=another_url)
    

    Description

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 0
  • Safety checks for store and driver using FilePathGenerator

    Description

    For both store and driver we need to assess whether a URL points to an empty directory or to nested empty directories.

    • For drivers, warning about empty directories alerts the user early on that the URL might be invalid.
    • For stores, we only want to overwrite an existing non-empty directory when it is explicitly allowed.

    In both cases, the check for empty directories/nested directories is done through the FilePathGenerator.
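
A minimal local-filesystem sketch of the two checks is shown below. It uses os.walk instead of FilePathGenerator, and is_empty_dir, check_driver_url and check_store_url are hypothetical helper names, so it only approximates what the PR does for remote URLs.

```python
import os
import warnings


def is_empty_dir(path: str) -> bool:
    """True if path does not exist or contains no files at any nesting depth."""
    return not any(files for _, _, files in os.walk(path))


def check_driver_url(path: str) -> None:
    """Warn early when a driver points at a location without any data."""
    if is_empty_dir(path):
        warnings.warn(f"{path} is empty or does not exist", UserWarning)


def check_store_url(path: str, overwrite: bool = False) -> None:
    """Refuse to write into a non-empty directory unless explicitly allowed."""
    if not is_empty_dir(path) and not overwrite:
        raise FileExistsError(f"{path} is not empty; pass overwrite=True to replace it")
```

Treating nested but file-free directories as empty matches the description above, since a tree of empty folders holds no data to read or to protect.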

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 2
  • [DRAFT] Support for different SquirrelStore compression modes

    Description

    See #59

    Fixes #59 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [X] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [X] All dependency changes have been reflected in the pip requirement files.

    Draft State!

    This is a draft PR to make it easier to discuss the different pros and cons of various solutions. This is not in a final state.

    I tried to add some tests and verify that they pass locally, but the tests spam a lot of "ValueError: Bucket is requester pays. Set requester_pays=True when creating the GCSFileSystem." errors, and it is hard to tell where these tests/errors are coming from. The contributing guideline provides no further information on how to run the tests.

    opened by kai-tub 9
  • [FEATURE] Allow configuring compression mode in MessagepackSerializer

    Hey,

    Thank you for working on this library! I think it has a huge potential, especially for dataset creators to provide their dataset in an optimized deep-learning format that is well suited for distribution. The performance of the MessagepackSerializer is amazing and being able to distribute subsets of the dataset (shards) is something I never wanted but really want to utilize in the future!

    I have played around with some "MessagepackSerializer" configurations and according to some internal benchmarks, it would be helpful to allow the user to configure the compression algorithm.

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L28-L48

    Currently, the compression mode is "locked" to gzip. I assume the main reason is due to the wide usage of gzip and to keep the code 'simple' as it makes it easy in the deserializer to know that the gzip compression was used:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L58-L81

    Here I would like to note that given the extension, fsspec (default) could also infer the compression by inspecting the filename suffix. But I can see how this might cause problems if somebody would like to switch out fsspec with something else (although I would have no idea with what and why :D )

    Other spots within the codebase that are coupled to this compression assumption are the methods from the SquirrelStore:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L12-L67

    Or to show the significant parts:

    • get: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L40-L41

    • set: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L59-L60

    • keys: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L66-L67

    In my internal benchmarks, I was able to greatly speed up data loading by simply using no compression at all (None), although I am fully aware that the right compression mode depends heavily on the specific hardware and use case. But even in a network-limited setting, I can see reasons to prefer xz because of its better compression ratio and relatively similar decompression speed to gzip.
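A comparison like the one described above could be set up roughly as follows. This is only a sketch using the Python standard library, not the benchmark from this issue: the `CODECS` table, the `measure` helper, and the synthetic payload are all assumptions made for illustration, and real numbers will depend on the data and hardware.

```python
import gzip
import lzma
import time

# Hypothetical codec table: name -> (compress, decompress).
# None stands for "no compression", i.e. the raw bytes are stored as-is.
CODECS = {
    None: (lambda b: b, lambda b: b),
    "gzip": (gzip.compress, gzip.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def measure(payload: bytes, repeats: int = 5) -> dict:
    """Return the stored size and mean decompression time for each codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        blob = compress(payload)
        start = time.perf_counter()
        for _ in range(repeats):
            restored = decompress(blob)
        elapsed = (time.perf_counter() - start) / repeats
        assert restored == payload  # round-trip sanity check
        results[name] = {"size": len(blob), "seconds": elapsed}
    return results


if __name__ == "__main__":
    # Synthetic stand-in for a serialized shard; replace with real shard bytes.
    sample = b"squirrel shard payload " * 50_000
    for name, stats in measure(sample).items():
        print(f"{name!s:>5}: {stats['size']:>9} bytes, "
              f"{stats['seconds'] * 1e3:.2f} ms to decompress")
```

Running this on actual shard bytes rather than the repetitive sample is what would show whether skipping compression or switching to xz pays off for a given setup.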

    IMHO, it should be OK to not store any suffix at all in the squirrel store. If a user looks inside the squirrel store URL, I don't think it is mandatory to show which compression algorithm was used. The user could/should use the designated driver/metadata that comes bundled with the dataset and let the driver handle the correct decompression.

    If you don't agree, I still think the gz extension doesn't have to be hardcoded into these functions. This is actually something that confused me when I was looking at the internals of the codebase. So instead, we could use something like:

    comp = kwargs.get("compression", "gzip")
    comp_to_ext_dict[comp] # just to show the concept
    

    With these modifications, it should be possible to utilize different compression modes and make them easily configurable. I would be very happy to create a PR and contribute to this project!
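To make the concept more concrete, here is a minimal, self-contained sketch of a compression-to-extension lookup. It does not use squirrel's or fsspec's actual API: `COMP_TO_EXT`, `OPENERS`, `shard_path`, `write_shard`, and `read_shard` are hypothetical names, and the standard-library `gzip` and `lzma` modules stand in for whatever the store would really use.

```python
import gzip
import lzma
from pathlib import Path

# Hypothetical lookup tables: compression name -> file suffix / opener.
# None means "no compression" and therefore no suffix.
COMP_TO_EXT = {None: "", "gzip": ".gz", "xz": ".xz"}
OPENERS = {None: open, "gzip": gzip.open, "xz": lzma.open}


def shard_path(root: Path, key: str, compression="gzip") -> Path:
    """Build the file path for a shard key instead of hardcoding '.gz'."""
    if compression not in COMP_TO_EXT:
        raise ValueError(f"Unsupported compression: {compression!r}")
    return Path(root) / f"{key}{COMP_TO_EXT[compression]}"


def write_shard(root: Path, key: str, data: bytes, **kwargs) -> Path:
    """Write raw bytes under the given key, using the configured compression."""
    comp = kwargs.get("compression", "gzip")  # gzip stays the default
    path = shard_path(root, key, comp)
    with OPENERS[comp](path, "wb") as f:
        f.write(data)
    return path


def read_shard(root: Path, key: str, **kwargs) -> bytes:
    """Read the bytes back; the caller passes the same compression setting."""
    comp = kwargs.get("compression", "gzip")
    with OPENERS[comp](shard_path(root, key, comp), "rb") as f:
        return f.read()
```

Keeping gzip as the default leaves existing stores readable, while a caller that passes `compression=None` or `compression="xz"` gets a matching suffix (or none) without any change to the lookup logic.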

    enhancement 
    opened by kai-tub 3
Releases(v0.18.0)
  • v0.18.0(Nov 10, 2022)

    What's Changed

    • zip_index method for Composable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/92
    • Quantify randomness of shuffle in squirrel by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/86
    • Change Catalog repr to sorted set by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/94
    • Installation instruction by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/96
    • Upgrade requirements by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/97
    • Reference Huggingface, Hub and Torchvision Drivers by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/99
    • Update requirements by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/101
    • Refactoring DataFrameDriver and related drivers by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/98

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.7...v0.18.0

  • v0.17.7(Oct 7, 2022)

    What's Changed

    • Add hooks to check backwards compatibility with py3.6+ by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/87
    • Add pyupgrade, yaml formatting and update all hooks by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/88
    • Fix file driver storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/85
    • Peng add kwargs to map by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/90
    • Add hooks to csv driver by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/91
    • Explain automatic version iteration by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/84
    • Add csv driver option to specify csv read args by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/93

    New Contributors

    • @MaxSchambach made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/93

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.4...v0.17.7

  • v0.17.4(Aug 31, 2022)

    What's Changed

    • Make this repo installable with all python versions by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/82
    • Fix storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/83

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.2...v0.17.4

  • v0.17.2(Aug 25, 2022)

    What's Changed

    • Make CatalogSource visible in the API by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/71
    • Minor tweaks in documentation by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/73
    • Introduce rst linting via precommit hook by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/74
    • Remove binary file in tests dir by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/75
    • Unifies folder-creation behaviour when instantiating SquirrelStore by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/72
    • Bugfix - Register Torch Composables by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/78
    • Upgrade infra to py3.9 by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/79
    • Add storage options kwargs to FPGen by @mg515 in https://github.com/merantix-momentum/squirrel-core/pull/81

    New Contributors

    • @axkoenig made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/78
    • @mg515 made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/81

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.16.0...v0.17.2

  • v0.16.0(Jul 26, 2022)

    What's Changed

    • introduce loop and fixed size iterable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/47
    • Move cla assistant to workflows by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/62
    • *add tutorials, *ignore test in api-ref, *remove unused exception by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/63
    • First draft of advanced section for iterstreams by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/55
    • Update Doc-String of MapDriver.get_iter by @kai-tub in https://github.com/merantix-momentum/squirrel-core/pull/61
    • Composable.compose gets source as kwarg, which is equal to self by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/66
    • Peng add pytorch convenience functions to composable by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/69
    • partial function for keys method by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/70

    New Contributors

    • @kai-tub made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/61

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/0.14.2...v0.16.0

  • 0.14.2(Jun 23, 2022)

    What's Changed

    • change squirrel test using a tmp public bucket by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/46
    • Update fs.open mode for catalog by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/48
    • CatalogKey can be used to index catalog by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/49
    • accept callable as source for composable to make it completely lazy by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/44
    • add sphinxcontrib-mermaid by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/51
    • Architecture overview by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/54
    • *add advanced store *reorganize sections *add icon,favicon by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/53
    • Create codeql-analysis.yml by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/52
    • Upgrade numpy & numba by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/57
    • Winnie bump pyjwt by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/58

    New Contributors

    • @AdemFr made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/48
    • @pzdkn made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/53
    • @winfried-loetzsch made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/57

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.2...0.14.2

  • v0.13.2(May 18, 2022)

    What's Changed

    • Fix SourceCombiner.get_iter() not interleaving correctly by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/45

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.1...v0.13.2

  • v0.13.1(May 18, 2022)

    What's Changed

    • Add community files by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/38
    • Minor requirement changes by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/40
    • messagepack unpacker set use_list argument to False by default by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/39

    New Contributors

    • @AlpAribal made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/40

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.3...v0.13.1

  • v0.12.3(Apr 11, 2022)

    What's Changed

    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/31
    • pin numpy and update PR template by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/34
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/33
    • update document links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/36
    • update version to 0.12.3 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/37

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.2...v0.12.3

  • v0.12.2(Apr 6, 2022)

    What's Changed

    • update img to github raw file so public pypi can load it by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/26
    • Tiansu add readthedocs.yml by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/27
    • add dependencies for readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/28
    • fix readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/29
    • update readthedocs links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/30
    • Tiansu move leftover commits by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/32

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.1...v0.12.2

  • v0.12.1(Apr 5, 2022)

    What's Changed

    • update docs link by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/12
    • add logo by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/13
    • remove old extra file by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/14
    • add back keyring until public release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/16
    • key_hook param of get_iter accepts SplitByRank and SplitByWorker, par… by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/15
    • fix install instruction by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/18
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/19
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/20
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/21
    • Tiansu update black by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/22
    • add CLA bot by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/23
    • switch to publish in public pypi by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/24
    • update version to 0.12.1 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/25

    New Contributors

    • @ThomasWollmann made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/13
    • @AlirezaSohofi made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/15

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.0...v0.12.1

  • v0.12.0(Mar 12, 2022)

    What's Changed

    • add basic files to get infrastructure running by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/3
    • new semantic versioning format for dev release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/4
    • tiansu copy squirrel codebase by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/5
    • Tiansu add docs by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/9
    • add pypi classifiers by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/10
    • change version norm by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/11

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/commits/v0.12.0
