simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Overview

simplet5

Quickly train T5 models in just 3 lines of code + ONNX support

PyPI version License

simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quickly train your T5 models.

T5 models can be used for several NLP tasks such as summarization, QA, QG, translation, text generation, and more.

Here's a link to Medium article along with an example colab notebook

Install

pip install --upgrade simplet5

Usage

simpleT5 for summarization task Open In Collab

# import
from simplet5 import SimpleT5

# instantiate
model = SimpleT5()

# load
model.from_pretrained("t5","t5-base")

# train
model.train(train_df=train_df, # pandas dataframe with 2 columns: source_text & target_text
            eval_df=eval_df, # pandas dataframe with 2 columns: source_text & target_text
            source_max_token_len = 512, 
            target_max_token_len = 128,
            batch_size = 8,
            max_epochs = 5,
            use_gpu = True,
            outputdir = "outputs",
            early_stopping_patience_epochs = 0
            )

# load trained T5 model
model.load_model("t5","path/to/trained/model/directory", use_gpu=False)

# predict
model.predict("input text for prediction")

# need faster inference on CPU, get ONNX support
model.convert_and_load_onnx_model("path/to/T5 model/directory")
model.onnx_predict("input text for prediction")
Comments
  • Suppress the Output Models

    Suppress the Output Models

    Hello there!

    I'd like to ask if there is any possible way to eliminate all models, except for the last trained one. When I fine tune a model, it gives me X different models if I fine tune the model X epochs. I just need the last model and couldn't find a way to prevent writing those models to disk.

    Thanks!

    opened by bayismet 6
  • TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask In onnx_predict function

    TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask In onnx_predict function

    Hello, when I run the fine-tuned mt5 model under onnx, I get the following error:

    `TypeError Traceback (most recent call last) in ----> 1 model.onnx_predict(text)

    ~\AppData\Roaming\Python\Python38\site-packages\simplet5\simplet5.py in onnx_predict(self, source_text) 469 """ generates prediction from ONNX model """ 470 token = self.onnx_tokenizer(source_text, return_tensors="pt") --> 471 tokens = self.onnx_model.generate( 472 input_ids=token["input_ids"], 473 attention_mask=token["attention_mask"],

    C:\ProgramData\Anaconda3\lib\site-packages\torch\autograd\grad_mode.py in decorate_context(*args, **kwargs) 26 def decorate_context(*args, **kwargs): 27 with self.class(): ---> 28 return func(*args, **kwargs) 29 return cast(F, decorate_context) 30

    C:\ProgramData\Anaconda3\lib\site-packages\transformers\generation_utils.py in generate(self, input_ids, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, repetition_penalty, bad_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, **model_kwargs) 1051 input_ids, expand_size=num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs 1052 ) -> 1053 return self.beam_search( 1054 input_ids, 1055 beam_scorer,

    C:\ProgramData\Anaconda3\lib\site-packages\transformers\generation_utils.py in beam_search(self, input_ids, beam_scorer, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs) 1788 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) 1789 -> 1790 outputs = self( 1791 **model_inputs, 1792 return_dict=True,

    C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs) 1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks 1050 or _global_forward_hooks or _global_forward_pre_hooks): -> 1051 return forward_call(*input, **kwargs) 1052 # Do not call functions when jit is used 1053 full_backward_hooks, non_full_backward_hooks = [], []

    TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'`

    I tried to downgrade transformers and onnxruntime but the error still remains.

    opened by farshadfiruzi 6
  • colab error

    colab error

    When running google colab, the line below produced error:

    model.load_model("t5","outputs/SimpleT5-epoch-2-train-loss-0.9526", use_gpu=True)

    Error: 404 Client Error: Not Found for url: https://huggingface.co/outputs/SimpleT5-epoch-2-train-loss-0.9526/resolve/main/config.json

    Please help. Thanks a lot.

    opened by yzhang-github-pub 6
  • shows object has no attribute 'convert_and_load_onnx_model'

    shows object has no attribute 'convert_and_load_onnx_model'

    Shows the error AttributeError: 'SimpleT5' object has no attribute 'convert_and_load_onnx_model' while running the example notebook provided in the repository

    https://github.com/Shivanandroy/simpleT5/blob/main/examples/simpleT5-summarization.ipynb

    opened by pradeepdev-1995 4
  • Adding logger

    Adding logger

    With this change, users can provide a PyTorch Lightning logger object to the .train() method:

    from pytorch_lightning.loggers import WandbLogger
    
    wandb_logger = WandbLogger(project="my-project", name="run-name")
    
    model.train(
        train_df=train_df,
        eval_df=eval_df,
        logger=wandb_logger
    )
    
    opened by versae 3
  • codet5 support added

    codet5 support added

    Needed to use this package for my experiemnts,

    Salesforce/codet5-base uses a roberta tokenizer hence this pull request.

    users can now specify : model.from_pretrained("codet5","Salesforce/codet5-base")

    If you want to read through codet5

    Here are the links: https://huggingface.co/Salesforce/codet5-base

    Kind regards, Mosh

    opened by mosh98 3
  • byT5 with version 0.1.2

    byT5 with version 0.1.2

    hi there, it seems that the newest version of simpleT5 does no longer work with byT5. The line elif model_type == "byt5": is commented out. The newest version of transformers seems to use a new type of tokenizer T5TokenizerFast and ByT5TokenizerFast does not exist. Any ideas about how to fix that?

    opened by kimgerdes 2
  • Is there any option for fine-tuning mt5 models instead of training from scratch?

    Is there any option for fine-tuning mt5 models instead of training from scratch?

    Hi, Thanks for the amazing simpleT5 package. I use the following script to train a mt5 model for summarization task.

    from simplet5 import SimpleT5

    model = SimpleT5()

    model.from_pretrained(model_type="mt5", model_name="google/mt5-small")

    model.train(train_df=train_df, eval_df=test_df, source_max_token_len=256, target_max_token_len=64, batch_size=8, max_epochs=3, use_gpu=True, outputdir = "outputs", early_stopping_patience_epochs = 0, )

    When I run this code, training start from scratch. My question is that is there any flag to fine-tune the mt5 model instead of training from scratch?

    opened by farshadfiruzi 2
  • Kernel dies every time when I start training the model

    Kernel dies every time when I start training the model

    Hi Shiva, Thank you very much for a such clean and neat wrapper for training ML models. I am using t5(precisely t5-small) as the base to train my model for summarization. I use the dataset using datasets from huggingface. However, everytime when I initiate the training code, the kernel dies and restarts. Any help here is much appreciated!

    Following is my code.

    Import dependencies

    %%capture
    !pip install --user simplet5==0.1.4
    !pip install transformers
    !pip install wandb
    !pip install pandas
    !pip install datasets
    !pip install --user simpletransformers
    

    Load data using datasets from huggingface

    import pandas as pd
    import warnings
    warnings.filterwarnings("ignore")
    from datasets import load_dataset
    dataset = load_dataset("scitldr")
    

    Preparing the train and eval data

    train_df = dataset["train"].to_pandas().copy()
    train_df.drop(columns=["source_labels","rouge_scores","paper_id"],inplace=True)
    train_df.rename(columns={"source":"source_text","target":"target_text"}, inplace=True)
    train_df.count() ## No NaN found - zero 1992 dataset
    
    train_df['source_text'] = train_df['source_text'].astype('str').str.rstrip(']\'')
    train_df['source_text'] = train_df['source_text'].astype('str').str.lstrip('[\'')
    train_df['target_text'] = train_df['target_text'].astype('str').str.rstrip(']\'')
    train_df['target_text'] = train_df['target_text'].astype('str').str.lstrip('[\'')
    
    train_df["source_text"]=train_df["source_text"].str.replace('\'','')
    train_df["target_text"]=train_df["target_text"].str.replace('\'','')
    train_df["source_text"]="summarize: "+train_df["source_text"]
    train_df.to_csv("train.csv")
    
    eval_df = dataset["validation"].to_pandas().copy()
    eval_df.drop(columns=["source_labels","rouge_scores","paper_id"],inplace=True)
    eval_df.rename(columns={"source":"source_text","target":"target_text"}, inplace=True)
    eval_df.count() ## No NaN found - zero 1992 dataset
    
    eval_df['source_text'] = eval_df['source_text'].astype('str').str.rstrip(']\'')
    eval_df['source_text'] = eval_df['source_text'].astype('str').str.lstrip('[\'')
    eval_df['target_text'] = eval_df['target_text'].astype('str').str.rstrip(']\'')
    eval_df['target_text'] = eval_df['target_text'].astype('str').str.lstrip('[\'')
    
    eval_df["source_text"]=train_df["source_text"].str.replace('\'','')
    eval_df["target_text"]=train_df["target_text"].str.replace('\'','')
    eval_df["source_text"]="summarize: "+train_df["source_text"]
    eval_df.to_csv("eval.csv")
    

    Loading simpleT5 and wandb_logger and finally loading the model and training code

    from simplet5 import SimpleT5
    from pytorch_lightning.loggers import WandbLogger
    wandb_logger = WandbLogger(project="ask-poc-logger")
    model = SimpleT5()
    model.from_pretrained("t5","t5-small")
    model.train(train_df=train_df[0:100], 
                eval_df=eval_df[0:100],
                source_max_token_len = 512, 
                target_max_token_len = 100,
                batch_size = 2,
                max_epochs = 3,
                use_gpu = True,
                outputdir = "outputs",
                logger = wandb_logger
                )
    

    I am running this code on the following machine. A vertex AI workbench from Google Cloud. N1-Standard-16 machine type with 16 core and 60 GB Memory. And added GPU P100. Any help is much appreciated ! Thanks in advance!

    opened by kkrishnan90 1
  • Is the task string necessary?

    Is the task string necessary?

    Hi,

    I have fine-tuned the model to write a compliment for a person, given the person's profile and it works pretty well. In the training examples, I haven't prepended the string 'summarize :' to the source_string column entries. Is it necessary (does it lead to better results) to prepend the string indicating the task?

    opened by nikogamulin 1
  • Push finished model

    Push finished model

    Is there a way of automatically pushing the checkpoints to the HuggingFace hub? I am running this mainly in Colab. Works great but often the Colab has timed out, and the checkpoints are lost.

    opened by peregilk 2
  • Unicode Charecter training issue

    Unicode Charecter training issue

    I tried to train My model for translating English to Bengali. After Training when I run the code, The output is not Unicode Bengali character.

    I Eat Rice (eng)=> আমি ভাত খাই (Bn)

    this type of data is input to the model while training. After complete, when I tested the model by inputting "I Eat Rice" I was expecting "আমি ভাত খাই" as output. But instead of this, the model gave me "Ich esse Reis." I dont know what kind of language is this. Its not related to bengali.

    opened by rahat10120141 5
  • ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

    ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

    import soundfile as sf from scipy.io import wavfile from IPython.display import Audio from transformers import Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer

    import speech_recognition as sr import io from pydub import AudioSegment

    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h") model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    r = sr.Recognizer() with sr.Microphone(sample_rate=16000) as source: print("speak") while True: audio = r.listen(source) data = io.BytesIO(audio.get_wav_data()) clip = AudioSegment.from_file(data) x = torch.FloatTensor(clip.get_array_of_samples()) print(x)

        inputs = tokenizer(x, sampling_rate=16000, return_tensors='pt', padding='longest').input_values
        logits = model(inputs).logits
        tokens = torch.argmax(logits, axis=-1)
        text = tokenizer.batch_decode(tokens)
    
        print('you said: ', str(text).lower())
    
    opened by Ushanjay 1
  • Saved model name not customizable

    Saved model name not customizable

    def training_epoch_end(self, training_step_outputs): """ save tokenizer and model on epoch end """ self.average_training_loss = np.round( torch.mean(torch.stack([x["loss"] for x in training_step_outputs])).item(), 4, ) path = f"{self.outputdir}/simplet5-epoch-{self.current_epoch}-train-loss-{str(self.average_training_loss)}-val-loss-{str(self.average_validation_loss)}"

    Will be very helpful if you can allow the name customizable (note the 'path' assignment).

    Btw, SimpleT5 is simply cool!

    opened by ke-lara 0
Releases(v0.1.4)
  • v0.1.4(Feb 15, 2022)

    SimpleT5 v0.1.4

    • Added support for all the pytorch-lightning supported loggers #11
    • Added support to save model only at last epoch #21
    • Added dataloader_num_workers parameter to.train( ) method to specify number of worker in train/test/val dataloader #19
    • fixed warnings and made compatible with latest transformers and pytorch-lightning
    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Sep 4, 2021)

  • version-0.1.0(Jul 13, 2021)

    SimpleT5 - version 0.1.0

    • Supports ByT5 model - Thanks to @mapmeld for his contribution
    from simplet5 import SimpleT5
    model = SimpleT5()
    model.from_pretrained("byt5", "google/byt5-small")
    
    • Added precision flag to support mixed precision training
    # train
    model.train(train_df=train_df, # pandas dataframe with 2 columns: source_text & target_text
                eval_df=eval_df, # pandas dataframe with 2 columns: source_text & target_text
                source_max_token_len = 512, 
                target_max_token_len = 128,
                batch_size = 8,
                max_epochs = 5,
                use_gpu = True,
                outputdir = "outputs",
                early_stopping_patience_epochs = 0,
                precision = 32
                )
    
    Source code(tar.gz)
    Source code(zip)
Owner
Shivanand Roy
Data Scientist.
Shivanand Roy
Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

Nicolas Ruscher 1 Feb 01, 2022
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

Matheus Alves 2 Jan 06, 2022
NLP Text Classification

多标签文本分类任务 近年来随着深度学习的发展,模型参数的数量飞速增长。为了训练这些参数,需要更大的数据集来避免过拟合。然而,对于大部分NLP任务来说,构建大规模的标注数据集非常困难(成本过高),特别是对于句法和语义相关的任务。相比之下,大规模的未标注语料库的构建则相对容易。为了利用这些数据,我们可以

Jason 1 Nov 11, 2021
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Sequence-to-Sequence Framework in PyTorch

nmtpytorch allows training of various end-to-end neural architectures including but not limited to neural machine translation, image captioning and au

LIUM 395 Nov 21, 2022
Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense.

PythonTextObfuscator Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense. Requi

2 Aug 29, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022
This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

37 Dec 04, 2022
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack 🐙 Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

QData 2.2k Jan 03, 2023
NeuTex: Neural Texture Mapping for Volumetric Neural Rendering

NeuTex: Neural Texture Mapping for Volumetric Neural Rendering Paper: https://arxiv.org/abs/2103.00762 Running Run on the provided DTU scene cd run ba

Fanbo Xiang 68 Jan 06, 2023
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
End-2-end speech synthesis with recurrent neural networks

Introduction New: Interactive demo using Google Colaboratory can be found here TTS-Cube is an end-2-end speech synthesis system that provides a full p

Tiberiu Boros 214 Dec 07, 2022
Application to help find best train itinerary, uses speech to text, has a spam filter to segregate invalid inputs, NLP and Pathfinding algos.

T-IAI-901-MSC2022 - GROUP 18 Gestion de projet Notre travail a été organisé et réparti dans un Trello. https://trello.com/b/X3s2fpPJ/ia-projet Install

1 Feb 05, 2022
Get list of common stop words in various languages in Python

Python Stop Words Table of contents Overview Available languages Installation Basic usage Python compatibility Overview Get list of common stop words

Alireza Savand 142 Dec 21, 2022
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023
Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

Twitter-News-Summarizer Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline 1.) Extracts all tweets fr

Rohit Govindan 1 Jan 27, 2022
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021