Dataloader tools for language modelling

Overview

Installation:

pip install lm_dataloader

Design Philosophy

  • A library to unify lm dataloading at large scale

  • Simple interface, any tokenizer can be integrated

  • Minimal changes needed from small -> large scale (many multiple GPU nodes)

  • follows fairseq / megatron's 'mmap' dataformat, but with improvements. Those being:

    • Easily combine multiple datasets
    • Easily split a dataset into train / val / test splits
    • Easily build a weighted dataset out of a list of existing ones along with weights.
    • unified into a single 'file' (which is actually a directory containing a .bin / .idx file)
    • index files that are built on the fly are hidden files, leaving less mess in the directory.
    • More straightforward interface, better documentation.
    • Inspectable with a command line tool
    • Can load from urls
    • Can load from S3 buckets
    • Can load from GCS buckets
    • Can tokenize on the fly instead of preprocessing

Misc. TODO: - [ ] Option to set mpu globally (for distributed dataloading)

Example usage

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast 

jsonl_path = "test.jsonl"
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,
)

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])

Command line utilities

There are also command line utilities provided to inspect / merge datasets, e.g:

lm-dataloader inspect my_dataset.lmd

Launches an interactive terminal to inspect the data in my_dataset.lmd

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

Merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new file at "new_dataset.lmd".

Efficient Deep Learning Systems course

Efficient Deep Learning Systems This repository contains materials for the Efficient Deep Learning Systems course taught at the Faculty of Computer Sc

Max Ryabinin 173 Dec 29, 2022
Implementation of Pix2Seq in PyTorch

pix2seq-pytorch Implementation of Pix2Seq paper Different from the paper image input size 1280 bin size 1280 LambdaLR scheduler used instead of Linear

Tony Shin 9 Dec 15, 2022
Compare neural networks by their feature similarity

PyTorch Model Compare A tiny package to compare two neural networks in PyTorch. There are many ways to compare two neural networks, but one robust and

Anand Krishnamoorthy 181 Jan 04, 2023
Conformer: Local Features Coupling Global Representations for Visual Recognition

Conformer: Local Features Coupling Global Representations for Visual Recognition (arxiv) This repository is built upon DeiT and timm Usage First, inst

Zhiliang Peng 378 Jan 08, 2023
Streamlit app demonstrating an image browser for the Udacity self-driving-car dataset with realtime object detection using YOLO.

Streamlit Demo: The Udacity Self-driving Car Image Browser This project demonstrates the Udacity self-driving-car dataset and YOLO object detection in

Streamlit 992 Jan 04, 2023
Official source code to CVPR'20 paper, "When2com: Multi-Agent Perception via Communication Graph Grouping"

When2com: Multi-Agent Perception via Communication Graph Grouping This is the PyTorch implementation of our paper: When2com: Multi-Agent Perception vi

34 Nov 09, 2022
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

VisualGPT Our Paper VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning Main Architecture of Our VisualGPT Downloa

Vision CAIR Research Group, KAUST 140 Dec 28, 2022
Automatic number plate recognition using tech: Yolo, OCR, Scene text detection, scene text recognation, flask, torch

Automatic Number Plate Recognition Automatic Number Plate Recognition (ANPR) is the process of reading the characters on the plate with various optica

Meftun AKARSU 52 Dec 22, 2022
Creating multimodal multitask models

Fusion Brain Challenge The English version of the document can be found here. Обновления 01.11 Мы выкладываем пример данных, аналогичных private test

Sber AI 43 Nov 28, 2022
EZ graph is an easy to use AI solution that allows you to make and train your neural networks without a single line of code.

EZ-Graph EZ Graph is a GUI that allows users to make and train neural networks without writing a single line of code. Requirements python 3 pandas num

1 Jul 03, 2022
计算机视觉中用到的注意力模块和其他即插即用模块PyTorch Implementation Collection of Attention Module and Plug&Play Module

PyTorch实现多种计算机视觉中网络设计中用到的Attention机制,还收集了一些即插即用模块。由于能力有限精力有限,可能很多模块并没有包括进来,有任何的建议或者改进,可以提交issue或者进行PR。

PJDong 599 Dec 23, 2022
Hidden-Fold Networks (HFN): Random Recurrent Residuals Using Sparse Supermasks

Hidden-Fold Networks (HFN): Random Recurrent Residuals Using Sparse Supermasks by Ángel López García-Arias, Masanori Hashimoto, Masato Motomura, and J

Ángel López García-Arias 4 May 19, 2022
Official implementation of NeurIPS'2021 paper TransformerFusion

TransformerFusion: Monocular RGB Scene Reconstruction using Transformers Project Page | Paper | Video TransformerFusion: Monocular RGB Scene Reconstru

Aljaz Bozic 118 Dec 25, 2022
A unofficial pytorch implementation of PAN(PSENet2): Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network Requirements pytorch 1.1+ torchvision 0.3+ pyclipper opencv3 gcc

zhoujun 400 Dec 26, 2022
Transformer part of 12th place solution in Riiid! Answer Correctness Prediction

kaggle_riiid Transformer part of 12th place solution in Riiid! Answer Correctness Prediction. Please see here for more information. Execution You need

Sakami Kosuke 2 Apr 23, 2022
GRaNDPapA: Generator of Rad Names from Decent Paper Acronyms

GRaNDPapA: Generator of Rad Names from Decent Paper Acronyms Trying to publish a new machine learning model and can't write a decent title for your pa

264 Nov 08, 2022
Starter code for the ICCV 2021 paper, 'Detecting Invisible People'

Detecting Invisible People [ICCV 2021 Paper] [Website] Tarasha Khurana, Achal Dave, Deva Ramanan Introduction This repository contains code for Detect

Tarasha Khurana 28 Sep 16, 2022
Code for Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019) We propose Disentangled Audio-Visual System (DAVS) to ad

Hang_Zhou 750 Dec 23, 2022
Codes for CVPR2021 paper "PWCLO-Net: Deep LiDAR Odometry in 3D Point Clouds Using Hierarchical Embedding Mask Optimization"

PWCLO-Net: Deep LiDAR Odometry in 3D Point Clouds Using Hierarchical Embedding Mask Optimization (CVPR 2021) This is the official implementation of PW

Intelligent Robotics and Machine Vision Lab 42 Dec 18, 2022
Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286

Pytorch-DPPO Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286 Using PPO with clip loss (from https

Alexis David Jacq 163 Dec 26, 2022