LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

Tasks | Datasets | LongLM | Baselines | Paper

Introduction

LOT is a benchmark for evaluating Chinese long text modeling. LOT consists of two understanding tasks and two generation tasks. We construct new datasets for these tasks based on human-written Chinese stories.

Furthermore, we release LongLM, an encoder-decoder Chinese long-text pretraining model with up to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms similar-sized pretrained models on both the understanding and generation tasks in LOT.

Tasks

We design LOT as an aggregation of two understanding tasks including Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks including Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). We show the task descriptions in the table below.

Datasets

We show the data statistics in the table below. The abbreviations sent and len are short for sentence and length, respectively. The datasets and evaluation scripts can be downloaded from THUCloud.

LongLM

1. Parameters

  • $d_m$: the dimension of hidden states
  • $d_{ff}$: the dimension of feed forward layers
  • $d_{kv}$: the dimension of the keys/values in the self-attention layers
  • $n_h$: the number of attention heads
  • $n_e$: the number of hidden layers of the encoder
  • $n_d$: the number of hidden layers of the decoder
  • #P: the number of parameters
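
If a LongLM checkpoint has been downloaded locally (see the Checkpoints section below), these hyperparameters can be read from its Transformers configuration. The snippet below is a minimal sketch assuming the path 'LongLM-large' points at the downloaded checkpoint directory; the mapping between the symbols above and the T5 config fields is noted in the comments.

    from transformers import T5Config

    config = T5Config.from_pretrained('LongLM-large')  # assumed local checkpoint directory
    print(config.d_model)             # $d_m$: dimension of hidden states
    print(config.d_ff)                # $d_{ff}$: dimension of feed-forward layers
    print(config.d_kv)                # $d_{kv}$: dimension of keys/values in self-attention
    print(config.num_heads)           # $n_h$: number of attention heads
    print(config.num_layers)          # $n_e$: number of encoder layers
    print(config.num_decoder_layers)  # $n_d$: number of decoder layers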

2. Pretraining Tasks
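
As noted in the Introduction, the two pretraining tasks are text infilling and conditional continuation. The sketch below illustrates how such input/target pairs could be constructed, assuming a T5-style sentinel format for infilling; the sentinel token and helper functions are illustrative and not taken from the official preprocessing code.

    # Illustrative sketch of the two pretraining objectives (assumed T5-style format).
    def make_infilling_pair(text, start, end):
        # Text infilling: replace a span with a sentinel; the target reconstructs the span.
        source = text[:start] + "<extra_id_0>" + text[end:]
        target = "<extra_id_0>" + text[start:end]
        return source, target

    def make_continuation_pair(text, split):
        # Conditional continuation: condition on a prefix; the target is the continuation.
        return text[:split], text[split:]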

3. Pretraining Data

We collect 120G novels as the pretraining data for LongLM. The pretraining data will be publicly available soon.

4. Checkpoints

  1. Download: The checkpoints and example data can be downloaded from THUCloud. The training and generation scripts are under the directory longlm. You can also use the official script provided by Transformers to fine-tune the model.

  2. Model Loading:

    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    tokenizer = T5Tokenizer.from_pretrained('LongLM-large')
    model = T5ForConditionalGeneration.from_pretrained('LongLM-large')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # also used by the generation example below
    model = model.to(device)
    
    • Dependencies: torch=1.8.1, transformers=4.6.1
  3. Training:

    Execute bash ./finetune.sh to fine-tune LongLM. If deepspeed is available, you can execute bash ./finetune_deepspped.sh to accelerate.

    env CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 CUDA_LAUNCH_BLOCKING=1 python3 -m torch.distributed.launch --nproc_per_node=8 \
    finetune_trainer.py \
    --data_dir=./data \ # directory of the data
    --train_name=train \ # file prefix of the training data
    --output_dir=./save_model \ # output directory to save the checkpoints
    --save_total_limit=10 \ # maximum number of saved checkpoints
    --per_gpu_train_batch_size=3 \ # batch size for training
    --per_gpu_eval_batch_size=3 \ # batch size for evaluation
    --num_train_epochs=1 \ # number of training epochs
    --logging_steps=5 \ # number of steps between logging the loss value
    --model_name_or_path=./LongLM-small \ # path to the pretrained model
    --warmup_steps=100 \ # number of steps for warmup
    --learning_rate=1e-4 \ # learning rate
    --n_val=100 \ # number of examples for validation
    --do_train --do_eval \ # whether to run training/validation
    --evaluation_strategy steps \ # strategy of evaluation
    --gradient_accumulation_steps=40 \ # number of steps for gradient accumulation
    --overwrite_output_dir \
    --load_best_model_at_end
  4. Generation:

    ",return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device) gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512) ">
    input_ids = tokenizer("小咕噜对,
         
          "
         ,return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device)
    
    gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512)
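
The generated ids can then be decoded back into text with the same tokenizer, for example:

    print(tokenizer.batch_decode(gen, skip_special_tokens=True))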

Baselines

1. Understanding Tasks

The example data and the training and evaluation scripts for LongLM are under the directory ./baselines/understanding. You can execute bash ./finetune.sh to fine-tune LongLM and bash ./eval.sh to evaluate the fine-tuned model.

2. Generation Tasks

The training script of LongLM for the generation tasks is the same as the pretraining script. The generation script and example data can be found under ./baselines/generation. You can execute bash ./gen.sh for generation.

Citation

@misc{guan2021lot,
      title={LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation}, 
      author={Jian Guan and Zhuoer Feng and Yamei Chen and Ruilin He and Xiaoxi Mao and Changjie Fan and Minlie Huang},
      year={2021},
      eprint={2108.12960},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}