Data pipelines for both TensorFlow and PyTorch!

Overview

rapidnlp-datasets

Python package PyPI version Python

Data pipelines for both TensorFlow and PyTorch !

If you want to load public datasets, try:

If you want to load local, personal dataset with minimized boilerplate, use rapidnlp-datasets!

installation

pip install -U rapidnlp-datasets

If you work with PyTorch, you should install PyTorch first.

If you work with TensorFlow, you should install TensorFlow first.

Usage

Here are few examples to show you how to use this library.

sequence-classification-quickstart

In PyTorch,

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForSequenceClassification
>>> dataset = DatasetForSequenceClassification.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForSequenceClassifiation
>>> dataset, d = TFDatasetForSequenceClassifiation.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

Especially, you can save dataset to tfrecord format when working with TensorFlow, and then build dataset from tfrecord files directly!

>>> d.save_tfrecord("testdata/sequence_classification.tfrecord")
2021-12-08 14:52:41,295    INFO             utils.py  128] Finished to write 2 examples to tfrecords.
>>> dataset = TFDatasetForSequenceClassifiation.from_tfrecord_files("testdata/sequence_classification.tfrecord")
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

question-answering-quickstart

In PyTorch:

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForQuestionAnswering
>>>
>>> dataset = DatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForQuestionAnswering
>>> dataset, d = TFDatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
2021-12-08 15:09:06,747    INFO question_answering_dataset.py  101] Read 3 examples in total.
>>> for idx, batch in enumerate(iter(dataset)):
        print()
        print("NO.{} batch: \n{}".format(idx, batch))
... 

Especially, you can save dataset to tfrecord format when working with TensorFlow, and then build dataset from tfrecord files directly!

>>> d.save_tfrecord("testdata/qa.tfrecord")
2021-12-08 15:09:31,329    INFO             utils.py  128] Finished to write 3 examples to tfrecords.
>>> dataset = TFDatasetForQuestionAnswering.from_tfrecord_files(
        "testdata/qa.tfrecord",
        batch_size=32,
        padding="batch",
    )
>>> for idx, batch in enumerate(iter(dataset)):
        print()
        print("NO.{} batch: \n{}".format(idx, batch))
... 

token-classification-quickstart

masked-language-models-quickstart

simcse-quickstart

You might also like...
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.

cdf_att_classification classes = {0: 'cat', 1: 'dog', 2: 'flower'} In this project we use both Resnet and Self-attention layer for cdf-Classification.

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

🤗 Push your spaCy pipelines to the Hugging Face Hub
🤗 Push your spaCy pipelines to the Hugging Face Hub

spacy-huggingface-hub: Push your spaCy pipelines to the Hugging Face Hub This package provides a CLI command for uploading any trained spaCy pipeline

AI pipelines for Nvidia Jetson Platform

Jetson Multicamera Pipelines Easy-to-use realtime CV/AI pipelines for Nvidia Jetson Platform. This project: Builds a typical multi-camera pipeline, i.

This is a repository for a No-Code object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operating systems.
This is a repository for a No-Code object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operating systems.

OpenVINO Inference API This is a repository for an object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operati

Machine learning framework for both deep learning and traditional algorithms
Machine learning framework for both deep learning and traditional algorithms

NeoML is an end-to-end machine learning framework that allows you to build, train, and deploy ML models. This framework is used by ABBYY engineers for

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

A transformer which can randomly augment VOC format dataset (both image and bbox) online.
A transformer which can randomly augment VOC format dataset (both image and bbox) online.

VocAug It is difficult to find a script which can augment VOC-format dataset, especially the bbox. Or find a script needs complex requirements so it i

Official repository for GCR rerank, a GCN-based reranking method for both image and video re-ID

Official repository for GCR rerank, a GCN-based reranking method for both image and video re-ID

Releases(v0.2.0)
Tandem Mass Spectrum Prediction with Graph Transformers

MassFormer This is the original implementation of MassFormer, a graph transformer for small molecule MS/MS prediction. Check out the preprint on arxiv

Röst Lab 13 Oct 27, 2022
Colour detection is necessary to recognize objects, it is also used as a tool in various image editing and drawing apps.

Colour Detection On Image Colour detection is the process of detecting the name of any color. Simple isn’t it? Well, for humans this is an extremely e

Astitva Veer Garg 1 Jan 13, 2022
CN24 is a complete semantic segmentation framework using fully convolutional networks

Build status: master (production branch): develop (development branch): Welcome to the CN24 GitHub repository! CN24 is a complete semantic segmentatio

Computer Vision Group Jena 123 Jul 14, 2022
Source code for Fixed-Point GAN for Cloud Detection

FCD: Fixed-Point GAN for Cloud Detection PyTorch source code of Nyborg & Assent (2020). Abstract The detection of clouds in satellite images is an ess

Joachim Nyborg 8 Dec 22, 2022
YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

Introduction Yolov5-face is a real-time,high accuracy face detection. Performance Single Scale Inference on VGA resolution(max side is equal to 640 an

DeepCam Shenzhen 1.4k Jan 07, 2023
This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).

MoEBERT This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022). Installation Create an

Simiao Zuo 34 Dec 24, 2022
Gif-caption - A straightforward GIF Captioner written in Python

Broksy's GIF Captioner Have you ever wanted to easily caption a GIF without havi

3 Apr 09, 2022
Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".

multilingual-mrc-isdg Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph". This r

Liyan 5 Dec 07, 2022
PyElastica is the Python implementation of Elastica, an open-source software for the simulation of assemblies of slender, one-dimensional structures using Cosserat Rod theory.

PyElastica PyElastica is the python implementation of Elastica: an open-source project for simulating assemblies of slender, one-dimensional structure

Gazzola Lab 105 Jan 09, 2023
EMNLP'2021: SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Dec 29, 2022
Consistency Regularization for Adversarial Robustness

Consistency Regularization for Adversarial Robustness Official PyTorch implementation of Consistency Regularization for Adversarial Robustness by Jiho

40 Dec 17, 2022
Malware Env for OpenAI Gym

Malware Env for OpenAI Gym Citing If you use this code in a publication please cite the following paper: Hyrum S. Anderson, Anant Kharkar, Bobby Fila

ENDGAME 563 Dec 29, 2022
Attention-guided gan for synthesizing IR images

SI-AGAN Attention-guided gan for synthesizing IR images This repository contains the Tensorflow code for "Pedestrian Gender Recognition by Style Trans

1 Oct 25, 2021
Learning from Synthetic Data with Fine-grained Attributes for Person Re-Identification

Less is More: Learning from Synthetic Data with Fine-grained Attributes for Person Re-Identification Suncheng Xiang Shanghai Jiao Tong University Over

SunchengXiang 68 Dec 13, 2022
SpeechNAS Better Trade off between Latency and Accuracy for Large Scale Speaker Verification

SpeechNAS Better Trade off between Latency and Accuracy for Large Scale Speaker Verification

Wentao Zhu 24 May 20, 2022
Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices, ACM Multimedia 2021

Codes for ECBSR Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices Xindong Zhang, Hui Zeng, Lei Zhang ACM Multimedia 202

xindong zhang 236 Dec 26, 2022
Paper Code:A Self-adaptive Weighted Differential Evolution Approach for Large-scale Feature Selection

1. SaWDE.m is the main function 2. DataPartition.m is used to randomly partition the original data into training sets and test sets with a ratio of 7

wangxb 14 Dec 08, 2022
Generate fine-tuning samples & Fine-tuning the model & Generate samples by transferring Note On

UPMT Generate fine-tuning samples & Fine-tuning the model & Generate samples by transferring Note On See main.py as an example: from model import PopM

7 Sep 01, 2022
Apollo optimizer in tensorflow

Apollo Optimizer in Tensorflow 2.x Notes: Warmup is important with Apollo optimizer, so be sure to pass in a learning rate schedule vs. a constant lea

Evan Walters 1 Nov 09, 2021
PyTorch implementation of CloudWalk's recent work DenseBody

densebody_pytorch PyTorch implementation of CloudWalk's recent paper DenseBody. Note: For most recent updates, please check out the dev branch. Update

Lingbo Yang 401 Nov 19, 2022