The tool to make NLP datasets ready to use

Overview

chazutsu

chazutsu_top.PNG
photo from Kaikado, traditional Japanese chazutsu maker

PyPI version Build Status codecov

chazutsu is the dataset downloader for NLP.

>>> import chazutsu
>>> r = chazutsu.datasets.IMDB().download()
>>> r.train_data().head(5)

Then

   polarity  rating                                             review
0         0       3  You'd think the first landing on the Moon woul...
1         1       9  I took a flyer in renting this movie but I got...
2         1      10  Sometimes I just want to laugh. Don't you? No ...
3         0       2  I knew it wasn't gunna work out between me and...
4         0       2  Sometimes I rest my head and think about the r...

You can use chazutsu on Jupyter.

Install

pip install chazutsu

Supported datasetd

chazutsu supports various kinds of datasets!
Please see the details here!

  • Sentiment Analysis
    • Movie Review Data
    • Customer Review Datasets
    • Large Movie Review Dataset(IMDB)
  • Text classification
    • 20 Newsgroups
    • Reuters News Courpus (RCV1-v2)
  • Language Modeling
    • Penn Tree Bank
    • WikiText-2
    • WikiText-103
    • text8
  • Text Summarization
    • DUC2003
    • DUC2004
    • Gigaword
  • Textual entailment
    • The Multi-Genre Natural Language Inference (MultiNLI)
  • Question Answering
    • The Stanford Question Answering Dataset (SQuAD)

How it works

chazutsu not only download the dataset, but execute expand archive file, shuffle, split, picking samples process also (You can disable the process by arguments if you don't need).

chazutsu_process1.png

r = chazutsu.datasets.MovieReview.polarity(shuffle=False, test_size=0.3, sample_count=100).download()
  • shuffle: The flag argument for executing shuffle or not(True/False).
  • test_size: The ratio of the test dataset (If dataset already prepares train and test dataset, this value is ignored).
  • sample_count: You can pick some samples from the dataset to avoid the editor freeze caused by the heavy text file.
  • force: Don't use cache, re-download the dataset.

chazutsu supports fundamental process for tokenization.

chazutsu_process2.png

>>> import chazutsu
>>> r = chazutsu.datasets.MovieReview.subjectivity().download()
>>> r.train_data().head(3)

Then

    subjectivity                                             review
0             0  . . . works on some levels and is certainly wo...
1             1  the hulk is an anger fueled monster with incre...
2             1  when the skittish emma finds blood on her pill...

Now we want to convert this data to train various frameworks.

fixed_len = 10
r.make_vocab(vocab_size=1000)
r.column("review").as_word_seq(fixed_len=fixed_len)
X, y = r.to_batch("train")
assert X.shape == (len(y), fixed_len, len(r.vocab))
assert y.shape == (len(y), 1)
  • make_vocab
    • vocab_resources: resources to make vocabulary ("train", "valid", "test")
    • columns_for_vocab: The columns to make vocabulary
    • tokenizer: Tokenizer
    • vocab_size: Vocacbulary size
    • min_word_freq: Minimum word count to include the vocabulary
    • unknown: The tag used for out of vocabulary word
    • padding: The tag used to pad the sequence
    • end_of_sentence: If you want to clarify the end-of-line by specific tag, then use this.
    • reserved_words: The word that should included in vocabulary (ex. tag for padding)
    • force: Don't use cache, re-create the dataset.

If you don't want to load all the training data? You can use to_batch_iter.

Additional Feature

Use on Jupyter

You can use chazutsu on Jupyter Notebook.

on_jupyter.png

Before you execute chazutsu on Jupyter, you have to enable widget extention by below command.

jupyter nbextension enable --py --sys-prefix widgetsnbextension
LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating

LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating (Dataset) The dataset is from Amazon Review Data (2018)

Immanuvel Prathap S 1 Jan 16, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 02, 2023
Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

1.1k Dec 27, 2022
Share constant definitions between programming languages and make your constants constant again

Introduction Reconstant lets you share constant and enum definitions between programming languages. Constants are defined in a yaml file and converted

Natan Yellin 47 Sep 10, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Pretrained Language Model This repository provides the latest pretrained language models and its related optimization techniques developed by Huawei N

HUAWEI Noah's Ark Lab 2.6k Jan 08, 2023
The official repository of the ISBI 2022 KNIGHT Challenge

KNIGHT The official repository holding the data for the ISBI 2022 KNIGHT Challenge About The KNIGHT Challenge asks teams to develop models to classify

Nicholas Heller 4 Jan 22, 2022
Generate vector graphics from a textual caption

VectorAscent: Generate vector graphics from a textual description Example "a painting of an evergreen tree" python text_to_painting.py --prompt "a pai

Ajay Jain 97 Dec 15, 2022
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. Flair is: A powerful NLP library. Flair allo

flair 12.3k Jan 02, 2023
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
jiant is an NLP toolkit

jiant is an NLP toolkit The multitask and transfer learning toolkit for natural language processing research Why should I use jiant? jiant supports mu

ML² AT CILVR 1.5k Jan 04, 2023
Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

Koichi Yasuoka 3 Dec 22, 2021
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022
**NSFW** A chatbot based on GPT2-chitchat

DangBot -- 好怪哦,再来一句 卡群怪话bot,powered by GPT2 for Chinese chitchat Training Example: python train.py --lr 5e-2 --epochs 30 --max_len 300 --batch_size 8

Tommy Yang 11 Jul 21, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention

Sinkhorn Transformer This is a reproduction of the work outlined in Sparse Sinkhorn Attention, with additional enhancements. It includes a parameteriz

Phil Wang 217 Nov 25, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
SurvTRACE: Transformers for Survival Analysis with Competing Events

⭐ SurvTRACE: Transformers for Survival Analysis with Competing Events This repo provides the implementation of SurvTRACE for survival analysis. It is

Zifeng 13 Oct 06, 2022