Large dataset storage format for PyTorch

Overview

H5Record

Large dataset (> 100 GB, <= 1 TB) storage format for PyTorch (WIP)

Supports Python 3

pip install h5record

Why?

  • Writing large datasets is still a wild west in PyTorch. Approaches seen in the wild include:

    • a large directory with lots of small files: slow I/O, since complex files are fetched and deserialized frequently
    • a database approach: depending on the database engine used, multi-process reads are usually not supported
    • both of the above scale non-linearly in data-to-storage size
  • TFRecord solves the above problems well (multi-process fetch, (de)compression) and offers fast serialization (protobuf).

  • However, the TFRecord port does not support dataset size evaluation (used frequently by DataLoader) and provides no index-level access (important for data evaluation or verification).

H5Record aims to tackle these problems by compressing the dataset into an HDF5 file, exposed through an easy-to-use interface built on predefined types (String, Image, Sequences, Integer).

Some advantages of using H5Record:

  • Supports multi-process reads

  • Relatively simple to use, with low technical debt

  • Supports compression/decompression on the fly

  • Quick to load into memory if required (see the h5py sketch below)
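
On the last point, one general way to pull an entire HDF5 file into RAM is h5py's in-memory 'core' driver. The sketch below is plain h5py and independent of H5Record's own API (which may expose its own option for this); the file path is a hypothetical placeholder.

import h5py

# Load the whole file into memory via the 'core' driver; subsequent reads
# never touch the disk. Only sensible when the file fits in RAM.
with h5py.File('some_dataset.h5', 'r', driver='core') as f:
    print(list(f.keys()))  # inspect whatever layout was written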

Simple usage

pip install h5record
  1. Sentence Similarity
from h5record import H5Dataset, Float, String

schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

# Build ./question_pair.h5 from the iterator, then read rows back by index.
dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])
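
Because the dataset exposes len() and integer indexing, it can also be wrapped in a standard torch.utils.data.DataLoader for multi-process reads. A minimal sketch, reusing the dataset built above; the batch size, worker count, and pass-through collate function are illustrative choices, not part of the H5Record API.

from torch.utils.data import DataLoader

def identity_collate(batch):
    # Keep each batch as a plain list of records; the exact record structure
    # depends on what H5Dataset.__getitem__ returns, so no tensor stacking here.
    return batch

loader = DataLoader(
    dataset,                      # the H5Dataset created above
    batch_size=32,
    num_workers=4,                # multi-process read
    collate_fn=identity_collate,  # swap in a task-specific collate as needed
)

for batch in loader:
    pass  # each batch is a list of up to 32 records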

Note

As development is still in progress, this package should be used with care on storage formatted as FAT or FAT-32 (FAT-32 in particular caps individual files at 4 GB).

Comparison between different compression algorithms

No chunking is used

Compression type    File size    Read speed (rows/second)
no compression      2.0 GB       2084.55
lzf                 1.7 GB       1496.14
gzip                1.1 GB        843.78

Benchmarked on an i7-9700 with a 1 TB NVMe SSD.
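
For reference, HDF5 compression is configured per dataset at creation time. The standalone sketch below reproduces the three settings benchmarked above on a dummy array using plain h5py; it is independent of whatever compression options H5Record itself exposes.

import numpy as np
import h5py

data = np.random.rand(10_000, 128).astype('float32')

with h5py.File('compression_demo.h5', 'w') as f:
    f.create_dataset('raw', data=data)                        # no compression
    f.create_dataset('lzf', data=data, compression='lzf')     # fast, modest ratio
    f.create_dataset('gzip', data=data, compression='gzip')   # slower, smaller file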

If you are interested in learning more, feel free to check out the note as well!

