NLPIR tutorial: pre-training for IR. Pre-train on a raw text corpus, then fine-tune on MS MARCO Document Ranking

Last update: Apr 07, 2022

pretrain4ir_tutorial

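Used in the NLPIR lab as an introduction to the Pre-training for IR research direction.

The code consists of the following parts:

tasks/ : generates the pre-training data
pretrain/: pre-trains on the generated data (MLM + NSP)
finetune/: fine-tunes on MS MARCO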

Preinstallation

First, prepare a Python3 environment, and run the following commands:

  git clone git@github.com:zhengyima/pretrain4ir_tutorial.git pretrain4ir_tutorial
  cd pretrain4ir_tutorial
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

You should also download a BERT model checkpoint in Hugging Face transformers format and save it in a directory, referred to here as BERT_MODEL_PATH. In our paper, we use bert-base-uncased. You can download it from the official Hugging Face model zoo or from the Tsinghua mirror.
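
Before moving on, it can help to confirm that BERT_MODEL_PATH actually holds a usable checkpoint. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoint_files` is hypothetical, and the file names are assumed from the usual Hugging Face layout for BERT (a config, a vocabulary, and one weight file).

```python
from pathlib import Path

# Assumed file names, taken from the usual Hugging Face BERT layout.
# The tutorial scripts are not known to check for these themselves.
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def missing_checkpoint_files(bert_model_path):
    """Return the names of expected files that are absent from the directory."""
    root = Path(bert_model_path)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Any one of the weight files is enough.
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

An empty list means the directory looks complete under these assumptions; anything else names what to download again.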

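Generating Pre-training Data

The repository provides the simplest, easiest-to-understand pre-training task, rand. This task randomly selects 1 to 5 words from a document to serve as the query, as a demo of pre-training for IR.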
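
The idea behind rand can be sketched in a few lines. This is an illustration only, not the code in tasks/rand: the function name `make_rand_query` is hypothetical, and whitespace tokenization and sampling without replacement are assumptions.

```python
import random


def make_rand_query(doc_text, rng, min_words=1, max_words=5):
    """Build a pseudo-query from 1 to 5 randomly chosen words of a document."""
    words = doc_text.split()  # whitespace tokenization is an assumption
    if not words:
        return ""
    upper = min(max_words, len(words))
    lower = min(min_words, upper)
    k = rng.randint(lower, upper)  # number of query words, inclusive bounds
    return " ".join(rng.sample(words, k))  # sample positions without replacement


# Usage: a seeded generator keeps the generated data reproducible.
rng = random.Random(42)
query = make_rand_query("pre-training objectives tailored for ad-hoc retrieval", rng)
```

The (query, document) pair produced this way is what a pre-training example for IR is built from.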

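Command to generate data for the rand pre-training task: cd tasks/rand && bash gen.sh

You can write your own script, modeled on the rand task, to generate data for any pre-training task you consider reasonable.

Notes: before running the shell script for the rand task, change the msmarco_docs_path parameter in gen.sh to the path of the MS MARCO documents tsv file, and change the bert_model parameter to the directory of the downloaded BERT model.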
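
For writing your own generation script, the documents file has to be read first. The sketch below assumes the layout of the official msmarco-docs.tsv release, one document per line with four tab-separated fields (docid, url, title, body); the function name `iter_msmarco_docs` is hypothetical and gen.sh may read the file differently.

```python
def iter_msmarco_docs(handle):
    """Yield (docid, title, body) from lines of the form docid<TAB>url<TAB>title<TAB>body."""
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed layout
        docid, _url, title, body = parts
        yield docid, title, body
```

A script would typically call it as `with open(msmarco_docs_path, encoding="utf-8") as f: for docid, title, body in iter_msmarco_docs(f): ...`, streaming the file rather than loading it into memory.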

Model Pre-training

The repository provides the code for model pre-training; see pretrain. This code pre-trains the model on two tasks, MLM + NSP.
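
As a reminder of what the MLM objective does to its input, here is a minimal sketch of token masking. It assumes the standard BERT recipe (select about 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged); the code in pretrain may use different rates, and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # random replacement
        # otherwise the token is left unchanged
    return inputs, labels
```

The loss is computed only at positions whose label is not None.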

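Pre-training command: cd pretrain && bash train_bert.sh

Notes: remember to modify the corresponding parameters in train_bert.sh: change bert_model to the directory of the downloaded BERT model, and change train_file to the path of the pre-training data file generated in the previous step.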

Model Fine-tuning

The repository provides the code for fine-tuning on the MS MARCO Document Ranking task; see finetune. This code implements the full point-wise fine-tuning process on MS MARCO.
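
Point-wise here means that each (query, document) pair is scored and penalized independently of the other candidates for the same query. The sketch below illustrates this with a binary cross-entropy loss on raw relevance scores; the exact loss used in finetune is not stated in this README, so treat the choice of loss and the name `pointwise_bce_loss` as assumptions.

```python
import math


def pointwise_bce_loss(scores, labels):
    """Mean binary cross-entropy over independent (query, document) pairs.

    scores: raw relevance logits from the model, one per pair.
    labels: 1 for a relevant pair, 0 for a non-relevant pair.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        # Numerically stable form of -[y*log(sigmoid(s)) + (1-y)*log(1-sigmoid(s))]
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)
```

At inference time the same scores are simply sorted per query to produce the ranking.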

Fine-tuning command: cd finetune && bash train_bert.sh

Leaderboard

Tasks	MRR@100 on dev set
PROP-MARCO	0.4201
PROP-WIKI	0.4188
BERT-Base	0.4184
rand	0.4123
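
For reference, the scores above can be read as mean reciprocal rank with a cutoff of 100, the official metric of MS MARCO Document Ranking. A minimal sketch of that computation follows; the function name `mrr_at_k` and the dictionary-based input format are assumptions, not the repository's evaluation script.

```python
def mrr_at_k(rankings, relevant, k=100):
    """Mean reciprocal rank of the first relevant document within the top k.

    rankings: {query_id: [doc_id, ...]} ordered best first.
    relevant: {query_id: set of relevant doc_ids}.
    """
    total = 0.0
    for qid, gold in relevant.items():
        for rank, docid in enumerate(rankings.get(qid, [])[:k], start=1):
            if docid in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)
```

Queries with no relevant document in the top k contribute 0 to the average.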

Homework

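Design a pre-training task you consider reasonable, pre-train the BERT model on it, fine-tune on MS MARCO, and update the Leaderboard with your result on the dev set.

What you need to do:

Write your own pre-training data generation script and put it in the tasks/yourtask directory.
Use that script to generate your own pre-training data.
Run the pre-train and fine-tune scripts provided by the repository, obtain the results, and update the Leaderboard.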

Owner

ZYMa
