Installation:

pip install lm_dataloader

Design Philosophy

- A library to unify LM data loading at large scale.
- Simple interface; any tokenizer can be integrated.
- Minimal changes needed to go from small to large scale (multiple GPU nodes).
- Follows fairseq / Megatron's 'mmap' data format, with the following improvements:
  - Easily combine multiple datasets
  - Easily split a dataset into train / val / test splits
  - Easily build a weighted dataset out of a list of existing ones along with weights
  - Unified into a single 'file' (which is actually a directory containing a .bin / .idx file)
  - Index files that are built on the fly are hidden files, leaving less mess in the directory
  - More straightforward interface, better documentation
  - Inspectable with a command line tool
  - Can load from URLs
  - Can load from S3 buckets
  - Can load from GCS buckets
  - Can tokenize on the fly instead of preprocessing
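
To make the 'mmap' idea above concrete, here is a minimal sketch of an indexed binary token store: one file of packed tokens plus an index of document boundaries, read back through a memory map. The file names (`data.bin`, `data.idx`), the uint16 token width, and both helper functions are assumptions made for illustration; this is not the actual on-disk layout used by lm_dataloader, fairseq, or Megatron.

```python
import mmap
import os
import struct
import tempfile
from array import array

# Assumption for this sketch: tokens are stored as unsigned 16-bit integers.
ITEM_SIZE = array("H").itemsize


def write_dataset(path, documents):
    """Write token lists to <path>/data.bin and their boundaries to <path>/data.idx."""
    os.makedirs(path, exist_ok=True)
    offsets = [0]  # offsets[i] is the index of the first token of document i
    with open(os.path.join(path, "data.bin"), "wb") as bin_file:
        for doc in documents:
            bin_file.write(array("H", doc).tobytes())
            offsets.append(offsets[-1] + len(doc))
    with open(os.path.join(path, "data.idx"), "wb") as idx_file:
        idx_file.write(struct.pack(f"<{len(offsets)}Q", *offsets))


def read_document(path, index):
    """Return the tokens of one document without reading the whole .bin file."""
    with open(os.path.join(path, "data.idx"), "rb") as idx_file:
        raw = idx_file.read()
    offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
    start, end = offsets[index], offsets[index + 1]
    with open(os.path.join(path, "data.bin"), "rb") as bin_file:
        with mmap.mmap(bin_file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            tokens = array("H")
            tokens.frombytes(mm[start * ITEM_SIZE:end * ITEM_SIZE])
    return tokens.tolist()


# Usage: write three toy documents into a temporary directory and read one back.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_dataset")
write_dataset(toy_path, [[1, 2, 3], [4, 5], [50256]])
print(read_document(toy_path, 1))
```

Because only the requested byte range is touched, random access stays cheap even when the .bin file is far larger than memory, which is the main appeal of this kind of format.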

Misc. TODO:

- [ ] Option to set mpu globally (for distributed dataloading)

Example usage
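
The tokenization example that follows reads its input from a .jsonl file with one JSON object per line and the text stored under the 'text' key. For a quick trial, such a file can be produced with the standard library alone. This is a minimal sketch: the sample sentences and the `write_jsonl` helper are made up for illustration and are not part of lm_dataloader.

```python
import json

# Hypothetical sample records; each one carries its text under the "text" key.
records = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Language models are trained on large text corpora."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (the JSON Lines convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("test.jsonl", records)
```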

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast

jsonl_path = "test.jsonl"  # input: one JSON object per line, text under the 'text' key
output = "my_dataset.lmd"  # output dataset path
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,  # any callable mapping text to token ids
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,  # end-of-document token id
)

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])
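
How LMDataset turns tokenized documents into samples of `seq_length` tokens is not described here. One common approach in language-model data loading is to join documents with an end-of-document token and cut the resulting stream into fixed-length windows. The sketch below illustrates that idea only; `pack_into_sequences` is a hypothetical helper, not part of lm_dataloader, and the library's own behaviour may differ.

```python
def pack_into_sequences(documents, seq_length, eod_token):
    """Yield fixed-length token windows from documents joined by eod_token.

    Tokens left over at the end that do not fill a whole window are dropped.
    """
    buffer = []
    for doc in documents:
        buffer.extend(doc)
        buffer.append(eod_token)  # mark the document boundary
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]


# Usage with toy token ids, a window of 4 tokens, and 0 as the end-of-document id.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for sample in pack_into_sequences(docs, seq_length=4, eod_token=0):
    print(sample)
```

Packing this way keeps every sample the same length, so batches need no padding, at the cost of samples that can span a document boundary.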

Command line utilities

Command line utilities are also provided for inspecting and merging datasets, e.g.:

lm-dataloader inspect my_dataset.lmd

This launches an interactive terminal for inspecting the data in my_dataset.lmd.

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

Merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new file at "new_dataset.lmd".

Dataloader tools for language modelling
