CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

Related tags

Text Data & NLPCCQA
Overview

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

This is the official repository for the code and models of the paper CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. If you use our dataset, code or any parts thereof, please cite this paper:

@misc{huber-etal-2021-ccqa,
  title={CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training}, 
  author={Patrick Huber and Armen Aghajanyan and Barlas Oğuz and Dmytro Okhonko and Wen-tau Yih and Sonal Gupta and Xilun Chen},
  year={2021},
  eprint={2110.07731},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Getting Common Crawl Snapshots

The Common Crawl project provides monthly web snapshots of new and updates websites in raw HTML format. Every monthly snapshot (~50-70TB) is further separated into smaller WARC (Web ARChive) files. To download a single WARC file, go to the Common Crawl website for the respective month (e.g. May 2021) and download the WARC paths file. The downloaded WARC paths file contains a \newline separated list of download destination of the actual files. Pick a path and prepend s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ for the complete URL. Once downloaded, gunzip the archive and a single Common Crawl web archive is ready to be processed.

Dataset Generation

Dependencies

Below are the required dependencies to run the dataset generation, curation and model evaluations.

  • Rust
  • Rust packages: clap, html-escape, indicatif, kuchiki, rayon, regex, serde, serde_json, warc (see Cargo.toml file for versions)
  • Python 3.7.3
  • Python dependencies: fasttext language identification, fasttext==0.9.2, lxml==4.3.2

Processing Common Crawl data (Rust)

  • Build the cargo package with cargo build from within the rust folder
  • Run the script with cargo run <path/to/warc/file> <path/to/output/file.mhtml>

Curating the minified HTML data (Python)

To generate json objects for every webpage in the minified HTML, run

python mhtml_to_json.py <path/to/fasttext/lid.176.bin> <path/to/mhtml/file> <path/to/output/file>

Aggregating datapoints to remove duplicate URL entries (Python)

As mentioned in the paper, we use the original dataset for our in-domain pre-training experiments. However, we also provide a cleaned version of the dataset, aggregating same-URL duplicates into a single object. To run the datapoint aggregation script, execute

python json_duplicate_filter.py <path/to/json/file> <path/to/output/file>

Converting json dataset into closed-book and passage retrieval formats (Python)

To be able to train closed-book (sequence-to-sequence) and passage retrieval (DPR) models on the CCQA dataset, the corpus needs to be further processed

Closed-book processing

To prepare the dataset for closed-book question-answering training, run:

python closed_book_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>

Passage retrieval (DPR) processing

To prepare the dataset for passage rertieval (DPR) training, run:

python passage_retrieval_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>

CCQA In-Domain Pre-Trained Model Checkpoints

BART and T5 checkpoints are Huggingface transformer models tested with transformers version 4.8.2

The DPR model checkpoint can be downloaded for the original DPR codebase or the DPR v2 codebase

LICENSE

The majority of CCQA is licensed under CC-BY-NC, however portions of the project are available under separate license terms: crowbook-text-processing is licensed under the MPL-2.0 license.

Owner
Meta Research
Meta Research
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023
LCG T-TEST USING EUCLIDEAN METHOD

This project has been created for statistical usage, purposing for determining ATL takers and nontakers using LCG ttest and Euclidean Method, especially for internal business case in Telkomsel.

2 Jan 21, 2022
Machine Psychology: Python Generated Art

Machine Psychology: Python Generated Art A limited collection of 64 algorithmically generated artwork. Each unique piece is then given a title by the

Pixegami Team 67 Dec 13, 2022
A natural language processing model for sequential sentence classification in medical abstracts.

NLP PubMed Medical Research Paper Abstract (Randomized Controlled Trial) A natural language processing model for sequential sentence classification in

Hemanth Chandran 1 Jan 17, 2022
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.5k Jan 04, 2023
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
The official repository of the ISBI 2022 KNIGHT Challenge

KNIGHT The official repository holding the data for the ISBI 2022 KNIGHT Challenge About The KNIGHT Challenge asks teams to develop models to classify

Nicholas Heller 4 Jan 22, 2022
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022
Estimation of the CEFR complexity score of a given word, sentence or text.

NLP-Swedish … allows to estimate CEFR (Common European Framework of References) complexity score of a given word, sentence or text. CEFR scores come f

3 Apr 30, 2022
A cross platform OCR Library based on PaddleOCR & OnnxRuntime

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

RapidOCR Team 767 Jan 09, 2023
端到端的长本文摘要模型(法研杯2020司法摘要赛道)

端到端的长文本摘要模型(法研杯2020司法摘要赛道)

苏剑林(Jianlin Su) 334 Jan 08, 2023
Practical Machine Learning with Python

Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system.

Dipanjan (DJ) Sarkar 2k Jan 08, 2023
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (S

InstaDeep Ltd 72 Dec 09, 2022
T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

55 Nov 22, 2022
Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification"

PTR Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification" If you use the code, please cite the following paper: @art

THUNLP 118 Dec 30, 2022
Bnagla hand written document digiiztion

Bnagla hand written document digiiztion This repo addresses the problem of digiizing hand written documents in Bangla. Documents have definite fields

Mushfiqur Rahman 1 Dec 10, 2021
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022