Tools to download and clean up Common Crawl data


cc_net

Tools to download and clean Common Crawl as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}

Installation

We have only tried this on Linux, but installation should be possible on macOS too.

  1. Create or symlink a data folder to where you want to download the corpus.

  2. Run make install. This will download some resources and install required packages.

  3. If you have a C++17 compiler you can also run pip install .[getpy]; it provides a more memory-efficient hash set.

  4. If make install fails, install the following tools manually:

  • lmplz and build_binary from KenLM
  • spm_train and spm_encode from Sentence Piece
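
A minimal shell sketch of steps 1-4 (the large-disk path is illustrative, adjust to your setup):

mkdir -p /large/disk/cc_data         # any location with enough space for the corpus
ln -s /large/disk/cc_data data       # symlink the repo's data folder to it
make install                         # downloads resources and installs required packages
pip install ".[getpy]"               # optional, requires a C++17 compiler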

Training Language Models

The Makefile is used to train Sentence Piece models and LMs on Wikipedia data.

  • make help shows help
  • make lang=de lm trains a Sentence Piece and a LM on German Wikipedia
  • make all_lm trains the same models as in the paper
  • make lang=de dl_lm downloads the LM trained for the paper
  • make dl_all_lm downloads all of them
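
For example, a hypothetical workflow that downloads the German models and then points the mining pipeline at them (the data/lm_sp path is an assumption, adjust it to wherever your models end up):

make lang=de dl_lm
python -m cc_net --dump 2019-13 -l de --lm_dir data/lm_sp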

Pipeline overview

The full mining pipeline is divided into 3 steps:

  • hashes downloads one Common Crawl snapshot and computes hashes for each paragraph
  • mine removes duplicates, detects the language, runs the LM and splits documents into lang/perplexity buckets
  • regroup regroups the files created by mine into chunks of 4 GB

Each step needs the previous one to finish before it can start. You can launch the full pipeline using python -m cc_net.

  • python -m cc_net --help shows help
  • python -m cc_net --dump 2019-13 processes a specific snapshot
  • python -m cc_net -l my -l gu restricts to specific languages
  • python -m cc_net --lm_dir my_lms/ uses custom LMs
  • python -m cc_net --lang_threshold 0.3 sets a specific field of mine.Config
  • python -m cc_net --config test runs on a tiny subset of a snapshot
  • python -m cc_net --config config/my_config.json uses configuration from the given config file
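
Flags can be combined. For example, a hypothetical invocation mining only German and French from one snapshot with a stricter language threshold:

python -m cc_net --dump 2019-13 -l de -l fr --lang_threshold 0.5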

Reproducing our work

Given the CPU time required to run the full pipeline on such a big corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper with:

python -m cc_net --conf reproduce --dump 2019-09

Extract XLM-R data

The XLM-RoBERTa model from the paper Unsupervised Cross-lingual Representation Learning at Scale was trained on data extracted by an internal version of cc_net.

Because that data uses a slightly different format, please use the following commands instead:

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data please also consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

Adapting to your infrastructure

Given the computation cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. submitit defaults to spawning processes on your machine if no Slurm cluster is found. You should tweak --task_parallelism to a value suited to your machine. Defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process use --execution debug.
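
For example (values are illustrative):

python -m cc_net --config test --execution debug        # tiny test config, runs in-process without a cluster
python -m cc_net --dump 2019-13 --task_parallelism 16   # cap concurrent tasks for a smaller cluster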

Output format

Generated files are compressed JSON files. There is one JSON object per line.

List of fields:

  • url: webpage URL (part of CC)
  • date_download: date of download (part of CC)
  • digest: sha1 digest of the webpage (part of CC)
  • length: number of chars
  • nlines: number of lines
  • source_domain: web domain of the webpage
  • title: page title (part of CC)
  • raw_content: webpage content after deduplication
  • original_nlines: number of lines before deduplication
  • original_length: number of chars before deduplication
  • language: language detected by FastText LID
  • language_score: language identification confidence score
  • perplexity: perplexity under an LM trained on Wikipedia

Sample JSON object:

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11,
}

You can peek at those files using the UNIX tools zcat and jq, e.g.:

zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .

jq can do some complicated filtering. jsonql.py provides a Python API with multiprocessing support for more involved operations such as LM scoring of documents.
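
For heavier processing, a minimal Python sketch that streams a mined shard line by line (the path matches the example above; the perplexity cut-off is arbitrary):

import gzip
import json

# one JSON document per line in each shard
with gzip.open("data/mined/2019-09/en_head_0000.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        # keep only documents the Wikipedia LM scores as fluent (threshold is illustrative)
        if doc["perplexity"] < 300.0:
            print(doc["url"], doc["language"], doc["perplexity"])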

License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.
