Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Overview

MT5_paddle

Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

English | 简体中文

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Abstract: The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

This project is an open source implementation of MT5 on Paddle 2.x.

Environment Installation

label value
python >=3.6
GPU V100
Frame PaddlePaddle2.1.2
Cuda 10.1
Cudnn 7.6

Cloud platform used in this recurrence:https://aistudio.baidu.com/

# Clone the repository
git clone https://github.com/27182812/MT5_paddle
# Enter the root directory
cd MT5_paddle
# Install the necessary python libraries locally
pip install -r requirements.txt

"test.ipynb" has run results display.

Quick Start

(一)Tokenizer Accuracy Alignment

### 对齐tokenizer
text = "Welcome to use paddle and paddlenlp!"
torch_tokenizer = PTT5Tokenizer.from_pretrained("./mt5-large")
paddle_tokenizer = PDT5Tokenizer.from_pretrained("./mt5-large")
torch_inputs = torch_tokenizer(text)
paddle_inputs = paddle_tokenizer(text)
print(torch_inputs)
print(paddle_inputs)

(二)Model Accuracy Alignment

run python compare.py,Comparing the accuracy between huggingface and paddle.

python compare.py
# MT5-large-pytorch vs paddle MT5-large-paddle
mean difference: tensor(2.0390e-06)
max difference: tensor(0.0004)

(三)Weights Transform

run python convert.py,transform weights of huggingface model to weights of paddle model. The weight path needs to be replaced

(四)Downstream task fine-tuning

run python train.py. "args.py" is for parameter.

Reference

大佬的T5代码:https://github.com/JunnYu/paddle_t5

@unknown{unknown,
author = {Xue, Linting and Constant, Noah and Roberts, Adam and Kale, Mihir and Al-Rfou, Rami and Siddhant, Aditya and Barua, Aditya and Raffel, Colin},
year = {2020},
month = {10},
pages = {},
title = {mT5: A massively multilingual pre-trained text-to-text transformer}
}
1 Jun 28, 2022
🤕 spelling exceptions builder for lazy people

🤕 spelling exceptions builder for lazy people

Vlad Bokov 3 May 12, 2022
Search with BERT vectors in Solr and Elasticsearch

Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan 123 Dec 29, 2022
Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

Lightning ASR Modular and extensible speech recognition library leveraging pytorch-lightning and hydra What is Lightning ASR • Installation • Get Star

Soohwan Kim 40 Sep 19, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
hashily is a Python module that provides a variety of text decoding and encoding operations.

hashily is a python module that performs a variety of text decoding and encoding functions. It also various functions for encrypting and decrypting text using various ciphers.

DevMysT 5 Jul 17, 2022
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.

PART 2: CHAIN LINKING AUDIO-TO-TEXT NLP TASKS 2A: TRANSCRIBE-TRANSLATE-SENTIMENT-ANALYSIS In notebook3.0, I demo a simple workflow to: transcribe a lo

Chua Chin Hon 30 Jul 13, 2022
This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Splinter This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to

Ori Ram 88 Dec 31, 2022
This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

Twitter COVID-19 Sentiment Analysis Members: Christopher Bach | Khalid Hamid Fallous | Jay Hirpara | Jing Tang | Graham Thomas | David Wetherhold Pro

4 Oct 15, 2022
apple's universal binaries BUT MUCH WORSE (PRACTICAL SHITPOST) (NOT PRODUCTION READY)

hyperuniversality investment opportunity: what if we could run multiple architectures in a single file, again apple universal binaries, but worse how

luna 2 Oct 19, 2021
Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

Vikash Singh 5.3k Jan 01, 2023
Fine-tuning scripts for evaluating transformer-based models on KLEJ benchmark.

The KLEJ Benchmark Baselines The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language und

Allegro Tech 17 Oct 18, 2022
Open Source Neural Machine Translation in PyTorch

OpenNMT-py: Open-Source Neural Machine Translation OpenNMT-py is the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine trans

OpenNMT 5.8k Jan 04, 2023
Code release for "COTR: Correspondence Transformer for Matching Across Images"

COTR: Correspondence Transformer for Matching Across Images This repository contains the inference code for COTR. We plan to release the training code

UBC Computer Vision Group 358 Dec 24, 2022
GVT is a generic translation tool for parts of text on the PC screen with Text to Speak functionality.

GVT is a generic translation tool for parts of text on the PC screen with Text to Speech functionality. I wanted to create it because the existing tools that I experimented with did not satisfy me in

Nuked 1 Aug 21, 2022
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022
NLPShala , the best IDE for all Natural language processing tasks.

The revolutionary IDE for all NLP (Natural language processing) stuffs on the internet.

Abhi 3 Aug 08, 2021
Retraining OpenAI's GPT-2 on Discord Chats

Train OpenAI's GPT-2 on Discord Chats Retraining a Text Generation Model on Discord Chats using gpt-2-simple that wraps existing model fine-tuning and

Ayush Mishra 4 Oct 27, 2022
Speech to text streamlit app

Speech to text Streamlit-app! 👄 This speech to text recognition is powered by t

Charly Wargnier 9 Jan 01, 2023