An unopinionated replacement for PyTorch's Dataset and ImageFolder that handles Tar archives

Last update: Dec 20, 2022

Simple Tar Dataset

An unopinionated replacement for PyTorch's Dataset and ImageFolder classes, for datasets stored as uncompressed Tar archives.

Just Tar it: No particular structure is enforced in the Tar archive. This means that you can just archive your files with no modification, and handle any data/meta-data with your dataset code.

Why? Storing a dataset as millions of small files makes access inefficient, and can create other difficulties in large-scale scenarios (e.g. running out of inodes, or inefficient operations in distributed filesystems, which are optimised for fewer large files). A Tar file is a simple, uncompressed archive format for which numerous utilities exist, and it allows fast random access into a single archive file.
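The random-access property can be illustrated with Python's standard tarfile module alone. The sketch below is not taken from this library and is not necessarily how TarDataset is implemented; the file names and contents are made up for the demonstration. It scans the archive once to build an index, then reads a single member directly without touching the others.

```python
import io
import tarfile


def build_example_tar():
    """Create a small uncompressed Tar archive in memory (illustrative data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for i in range(5):
            payload = f"contents of file {i}".encode("utf-8")
            info = tarfile.TarInfo(name=f"data/file_{i}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Mode "r:" opens the archive strictly as an uncompressed Tar.
archive = tarfile.open(fileobj=build_example_tar(), mode="r:")

# One pass over the headers: each TarInfo records where its data starts.
members = {m.name: m for m in archive.getmembers() if m.isfile()}


def read_file(name):
    """Seek straight to one member's data and return its bytes."""
    return archive.extractfile(members[name]).read()


print(read_file("data/file_3.txt"))
```

Because the archive is uncompressed, each lookup is a seek plus a read of that file's bytes, regardless of how many other files the archive holds.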

Example

The default TarDataset simply loads all PNG, JPG and JPEG images from a Tar file, and allows you to iterate over them.

Images are returned as tensors. The example below prints the RGB values of the top-left pixel of each image.

from tardataset import TarDataset

dataset = TarDataset('example-data/colors.tar')

for (idx, image) in enumerate(dataset):
  print(f"Image #{idx}, color: {image[:,0,0]}")

Usage

For image classification datasets, where images are usually stored in one folder per class (e.g. ImageNet), TarImageFolder is a drop-in replacement for torchvision.datasets.ImageFolder.
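To make the "one folder per class" convention concrete, here is a rough sketch that derives labels from the paths inside a Tar archive using only the standard library. It is an illustration of the idea, not TarImageFolder's actual code: the helper names (make_tar, labels_from_folders), the example paths, and the choice to treat each image's immediate parent folder as its class are all assumptions made here.

```python
import io
import posixpath
import tarfile


def make_tar(names):
    """Build an in-memory Tar holding placeholder files with the given names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def labels_from_folders(tar, extensions=(".png", ".jpg", ".jpeg")):
    """Map each image to a class index based on its parent folder name."""
    files = sorted(
        m.name for m in tar.getmembers()
        if m.isfile() and m.name.lower().endswith(extensions)
    )
    # Tar member names always use "/" as the separator, hence posixpath.
    folder_of = {name: posixpath.basename(posixpath.dirname(name)) for name in files}
    classes = sorted(set(folder_of.values()))
    class_to_idx = {cls: idx for idx, cls in enumerate(classes)}
    samples = [(name, class_to_idx[folder_of[name]]) for name in files]
    return class_to_idx, samples


example = make_tar([
    "train/cat/1.png",
    "train/dog/2.jpg",
    "train/cat/3.jpeg",
    "train/notes.txt",  # ignored: not an image extension
])
with tarfile.open(fileobj=example, mode="r:") as tar:
    class_to_idx, samples = labels_from_folders(tar)

print(class_to_idx)
print(samples)
```

Classes are indexed in alphabetical order of folder name, which matches the convention ImageFolder users generally expect.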

For more complex scenarios -- say, you store some data in one or more JSON files, or you have folders with video frames in specific formats -- you can subclass TarDataset, and read the data in any format you like.
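As a loose illustration of the JSON scenario, the sketch below reads a metadata file from the archive and uses it to look up the remaining files. It deliberately does not subclass TarDataset, so that it runs with the standard library alone; the class JsonTarRecords, the file meta.json, and its "file"/"label" fields are hypothetical and not part of this library. In practice the same logic would live inside a TarDataset subclass.

```python
import io
import json
import tarfile


def build_archive():
    """Create an in-memory Tar with a JSON index and two data files (made-up content)."""
    files = {
        "clips/a.bin": b"\x00\x01\x02",
        "clips/b.bin": b"\x03\x04",
    }
    meta = [
        {"file": "clips/a.bin", "label": "walk"},
        {"file": "clips/b.bin", "label": "run"},
    ]
    files["meta.json"] = json.dumps(meta).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


class JsonTarRecords:
    """Hypothetical stand-alone reader; not the tardataset API."""

    def __init__(self, fileobj, metadata_name="meta.json"):
        self.tar = tarfile.open(fileobj=fileobj, mode="r:")
        self.members = {m.name: m for m in self.tar.getmembers() if m.isfile()}
        # The JSON file lists which archive members form the dataset.
        raw = self.tar.extractfile(self.members[metadata_name]).read()
        self.records = json.loads(raw.decode("utf-8"))

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        record = self.records[index]
        data = self.tar.extractfile(self.members[record["file"]]).read()
        return data, record["label"]


records = JsonTarRecords(build_archive())
for i in range(len(records)):
    data, label = records[i]
    print(label, len(data))
```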

Jupyter notebook tutorial

There is a more comprehensive set of examples as a Jupyter notebook in example.ipynb.

Full "ImageNet in a Tar file" example

A large-scale data loading example is given in imagenet-example.py. Only the section of code responsible for data loading was modified from the official PyTorch ImageNet example.

First, ensure that the data is in the expected format for the original example to work, in a folder named ILSVRC12. Then, create a Tar archive from it (tar cf ILSVRC12.tar ILSVRC12 on Linux or a utility like 7-Zip on Windows). Finally, run our modified imagenet-example.py, passing it the path to the Tar archive instead.
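If neither tar nor 7-Zip is at hand, an uncompressed archive can also be produced from Python with the standard tarfile module. This is a minimal sketch rather than part of the repository; the function name make_uncompressed_tar and the demo folder layout are placeholders chosen for the example.

```python
import os
import tarfile
import tempfile


def make_uncompressed_tar(source_dir, tar_path):
    """Archive source_dir into tar_path without compression (similar to `tar cf`)."""
    # Mode "w" writes a plain Tar; "w:gz" or "w:bz2" would compress it instead.
    with tarfile.open(tar_path, mode="w") as tar:
        tar.add(source_dir, arcname=os.path.basename(os.path.normpath(source_dir)))
    return tar_path


# Small demonstration on a temporary folder with a made-up layout.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "demo_dataset")
    os.makedirs(os.path.join(source, "class_a"))
    with open(os.path.join(source, "class_a", "sample.txt"), "w") as f:
        f.write("hello")

    tar_path = make_uncompressed_tar(source, os.path.join(tmp, "demo_dataset.tar"))

    # Opening with "r:" succeeds only if the archive is uncompressed.
    with tarfile.open(tar_path, mode="r:") as tar:
        names = tar.getnames()

print(names)
```

Passing arcname keeps the paths inside the archive relative to the dataset folder, mirroring what tar cf ILSVRC12.tar ILSVRC12 does when run from the parent directory.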

Author

João Henriques, Visual Geometry Group (VGG), University of Oxford
