This repository contains the scripts for downloading and validating scripts for the documents

Related tags

Deep LearningHC4
Overview

HC4: HLTCOE CLIR Common-Crawl Collection

This repository contains the scripts for downloading and validating scripts for the documents. Document ids, topics, and qrel files are in resources/hc4/

Required packages for the scripts are recorded in requirements.txt.

Topics and Qrels

Topics are stored in jsonl format and located in resources/hc4. The language(s) the topic is annotated for is recored in the language_with_qrels field. We provide the English topic title and description for all topics and human translation for the languages that it has qrels for. We also provide machine translation of them in all three languages for all topics. Narratives(field narratives) are all in English and has one entry for each of the languages that has qrels. Each topic also has an English report(field report) that is designed to record the prior knowledge the searcher has.

Qrels are stored in the classic TREC style located in resources/hc4/{lang}.

Download Documents

To download the documents from Common Crawl, please use the following command. If you plan to use HC4 with ir_datasets, please specify ~/.ir_datasets/hc4 as the storage or make a soft link to to the directory you wish to store the documents. The document ids and hashs are stored in resources/hc4/{lang}/ids*.jsonl.gz. Russian document ids are separated into 8 files.

python download_documents.py --storage ./data/ \
                             --zho ./resources/hc4/zho/ids.jsonl.gz \
                             --fas ./resources/hc4/fas/ids.jsonl.gz \
                             --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                             --jobs 4 \
                             --check_hash 

If you wish to only download the documents for one language, just specify the id file for the language you wish to download. We encourage using the flag --check_hash to varify the documents downloaded match with the documents we intend to use in the collection. The full description of the arguments can be found when execute with the --help flag.

Validate

After documents are downloaded, please run the validate_hc4_documents.py to verify all documents are downloaded for each language.

python validate_hc4_documents.py --hc4_file ./data/zho/hc4_docs.jsonl \
                                 --id_file ./resources/hc4/zho/ids.jsonl.gz \
                                 --qrels ./resources/hc4/zho/*.qrels.v1-0.txt

Reference

If you use this collection, please kindly cite our dataset paper with the following bibtex entry.

@inproceedings{hc4,
	author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
	title = {{HC4}: A New Suite of Test Collections for Ad Hoc {CLIR}},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022}
}
Owner
JHU Human Language Technology Center of Excellence
JHU Human Language Technology Center of Excellence
Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by

Mehdi Cherti 135 Dec 30, 2022
High frequency AI based algorithmic trading module.

Flow Flow is a high frequency algorithmic trading module that uses machine learning to self regulate and self optimize for maximum return. The current

59 Dec 14, 2022
A complete speech segmentation system using Kaldi and x-vectors for voice activity detection (VAD) and speaker diarisation.

bbc-speech-segmenter: Voice Activity Detection & Speaker Diarization A complete speech segmentation system using Kaldi and x-vectors for voice activit

BBC 16 Oct 27, 2022
Official PyTorch implementation for FastDPM, a fast sampling algorithm for diffusion probabilistic models

Official PyTorch implementation for "On Fast Sampling of Diffusion Probabilistic Models". FastDPM generation on CIFAR-10, CelebA, and LSUN datasets. S

Zhifeng Kong 68 Dec 26, 2022
Get 2D point positions (e.g., facial landmarks) projected on 3D mesh

points2d_projection_mesh Input 2D points (e.g. facial landmarks) on an image Camera parameters (extrinsic and intrinsic) of the image Aligned 3D mesh

5 Dec 08, 2022
yolov5目标检测模型的知识蒸馏(基于响应的蒸馏)

代码地址: https://github.com/Sharpiless/yolov5-knowledge-distillation 教师模型: python train.py --weights weights/yolov5m.pt \ --cfg models/yolov5m.ya

52 Dec 04, 2022
A whale detector design for the Kaggle whale-detector challenge!

CNN (InceptionV1) + STFT based Whale Detection Algorithm So, this repository is my PyTorch solution for the Kaggle whale-detection challenge. The obje

Tarin Ziyaee 92 Sep 28, 2021
TensorFlow implementation of Elastic Weight Consolidation

Elastic weight consolidation Introduction A TensorFlow implementation of elastic weight consolidation as presented in Overcoming catastrophic forgetti

James Stokes 67 Oct 11, 2022
Python Blood Vessel Topology Analysis

Python Blood Vessel Topology Analysis This repository is not being updated anymore. The new version of PyVesTo is called PyVaNe and is available at ht

6 Nov 15, 2022
Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra

850-Safra-DS-ModuloI Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra Para aprender mais Git https://learngitbranc

Brian Nunes 7 Dec 10, 2022
Transformer Tracking (CVPR2021)

TransT - Transformer Tracking [CVPR2021] Official implementation of the TransT (CVPR2021) , including training code and trained models. We are revisin

chenxin 465 Jan 06, 2023
SVG Icon processing tool for C++

BAWR This is a tool to automate the icons generation from sets of svg files into fonts and atlases. The main purpose of this tool is to add it to the

Frank David Martínez M 66 Dec 14, 2022
People movement type classifier with YOLOv4 detection and SORT tracking.

Movement classification The goal of this project would be movement classification of people, in other words, walking (normal and fast) and running. Yo

4 Sep 21, 2021
Angora is a mutation-based fuzzer. The main goal of Angora is to increase branch coverage by solving path constraints without symbolic execution.

Angora Angora is a mutation-based coverage guided fuzzer. The main goal of Angora is to increase branch coverage by solving path constraints without s

833 Jan 07, 2023
Fit Fast, Explain Fast

FastExplain Fit Fast, Explain Fast Installing pip install fast-explain About FastExplain FastExplain provides an out-of-the-box tool for analysts to

8 Dec 15, 2022
A lightweight python AUTOmatic-arRAY library.

A lightweight python AUTOmatic-arRAY library. Write numeric code that works for: numpy cupy dask autograd jax mars tensorflow pytorch ... and indeed a

Johnnie Gray 62 Dec 27, 2022
A GUI for Face Recognition, based upon Docker, Tkinter, GPU and a camera device.

Face Recognition GUI This repository is a GUI version of Face Recognition by Adam Geitgey, where e.g. Docker and Tkinter are utilized. All the materia

Kasper Henriksen 6 Dec 05, 2022
FridaHookAppTool - Frida Hook App Tool With Python

FridaHookAppTool(以下是Hook mpaas框架的例子) mpaas移动开发框架ios端抓包hook脚本 使用方法:链接数据线,开启burp设置

13 Nov 30, 2022
The repository contain code for building compiler using puthon.

Building Compiler This is a python implementation of JamieBuild's "Super Tiny Compiler" Overview JamieBuilds developed a wonderfully educative compile

Shyam Das Shrestha 1 Nov 21, 2021
An Api for Emotion recognition.

PLAYEMO Playemo was built from the ground-up with Flask, a python tool that makes it easy for developers to build APIs. Use Cases Is Python your langu

greek geek 2 Jul 16, 2022