DaReCzech is a dataset for text relevance ranking in Czech

Last update: Jul 26, 2022

DaReCzech Dataset

DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated query-document pairs, which makes it one of the largest available datasets for this task.

The dataset was introduced in the paper "Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset", which was accepted at IAAI 2022 (Innovative Application Award).

Obtaining the Annotated Data

Please first read the disclaimer, which contains the terms of use. If you agree to them, send an email to [email protected] and a link to the dataset will be sent to you.

Overview

DaReCzech is divided into four parts:

Train-big (more than 1.4M records) – intended for training of a (neural) text relevance model
Train-small (97k records) – intended for GBRT training (with a text relevance feature trained on Train-big)
Dev (41k records)
Test (64k records)

Each set is distributed as a .tsv file with 6 columns:

ID – unique record ID
query – user query
url – URL of annotated document
doc – representation of the document under the URL; each document is represented by its title, URL and Body Text Extract (BTE), which was obtained using an internal module of our search engine
title – document title
label – the annotated relevance of the document to the query. There are 5 relevance labels ranging from 0 (the document is not useful for the given query) to 1 (the document is useful for the given query)
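
As a rough illustration of the record layout, the sketch below parses one tab-separated line into a typed record. It assumes that the columns appear in the order listed above and that the label is stored as a decimal number; `Record` and `parse_line` are hypothetical names that are not part of the repository.

```python
from dataclasses import dataclass

# Column order as listed above (assumed to match the order in the files).
COLUMNS = ("ID", "query", "url", "doc", "title", "label")


@dataclass(frozen=True)
class Record:
    record_id: str
    query: str
    url: str
    doc: str
    title: str
    label: float


def parse_line(line: str) -> Record:
    """Split one line on tabs and check the documented label range."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    *text_fields, raw_label = fields
    label = float(raw_label)  # assumption: the label is written as a number
    if not 0.0 <= label <= 1.0:
        raise ValueError(f"label {label} is outside the range 0 to 1")
    return Record(*text_fields, label)


example = parse_line("1\tpocasi praha\thttps://example.com\tdoc text\tTitle\t0.5\n")
print(example.query, example.label)
```

Splitting on tabs alone is enough here only because the values never contain a tab.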

The files are UTF-8 encoded. The values never contain a tab and are neither quoted nor escaped. To load the dataset in pandas, use:

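import csv
import pandas as pd

path = 'dev.tsv'  # any of the four .tsv files
# QUOTE_NONE keeps quote characters in the text as ordinary characters
df = pd.read_csv(path, sep='\t', quoting=csv.QUOTE_NONE)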
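
To see why disabling quoting matters, the following sketch parses a small synthetic sample with Python's standard `csv` module, once with `csv.QUOTE_NONE` and once with the default quoting. The sample rows, the presence of a header row and the `read_rows` helper are assumptions made for illustration only.

```python
import csv
import io

# Synthetic sample: the first data row has a stray double quote in the query.
SAMPLE = (
    'ID\tquery\turl\tdoc\ttitle\tlabel\n'
    '1\t"praha pocasi\thttps://example.com/a\ttext a\tTitle A\t0.75\n'
    '2\tbrno mapa\thttps://example.com/b\ttext "b"\tTitle B\t0.0\n'
)


def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting)
    return list(reader)


plain = read_rows(SAMPLE, csv.QUOTE_NONE)      # quotes are ordinary characters
quoted = read_rows(SAMPLE, csv.QUOTE_MINIMAL)  # default: quotes open a quoted field

print(len(plain), [len(row) for row in plain])
print(len(quoted), [len(row) for row in quoted])
```

With quoting disabled, every line yields six fields. With the default mode, the stray quote starts a quoted field that swallows the following tabs and the line break, so rows are merged.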

Baselines

We provide code to train two BERT-based baseline models: a query-doc model (train_querydoc_model.py) and a siamese model (train_siamese_model.py).

Before running the scripts, install the requirements listed in requirements.txt. The scripts were tested with Python 3.6.

pip install -r requirements.txt

Model Training

To train a query-doc model with default settings, run:

python train_querydoc_model.py train_big.tsv dev.tsv outputs

To train a siamese model without a teacher, run:

python train_siamese_model.py train_big.tsv dev.tsv outputs

To train a siamese model with a trained query-doc teacher, run:

python train_siamese_model.py train_big.tsv dev.tsv outputs --teacher path_to_query_doc_checkpoint

Note that the example scripts run training with our Small-E-Czech model, which was pretrained without supervision.
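
The two baselines differ in how the query and the document meet. As general background that this README does not spell out, a query-doc model usually scores the pair jointly, while a siamese model encodes each side separately, so document vectors can be computed ahead of time. The toy sketch below illustrates only that difference. `embed`, `siamese_score` and `querydoc_score` are hypothetical helpers that use word counts instead of BERT; they are not the code in train_querydoc_model.py or train_siamese_model.py.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def siamese_score(query, doc_vector):
    """Siamese pattern: the document vector was computed independently."""
    return cosine(embed(query), doc_vector)


def querydoc_score(query, doc):
    """Joint pattern: the scorer sees both texts at once (toy token overlap)."""
    query_tokens = query.lower().split()
    doc_tokens = set(doc.lower().split())
    if not query_tokens:
        return 0.0
    return sum(t in doc_tokens for t in query_tokens) / len(query_tokens)


docs = {"u1": "pocasi praha zitra", "u2": "mapa brno centrum"}
index = {url: embed(text) for url, text in docs.items()}  # precomputed once
ranked = sorted(index, key=lambda u: siamese_score("pocasi praha", index[u]),
                reverse=True)
print(ranked)
```

In the siamese pattern only the query has to be encoded at search time, whereas the joint pattern must process every query-document pair.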

Model Evaluation

To evaluate the trained query-doc model on test data, run:

python evaluate_model.py model_path test.tsv --is_querydoc

To evaluate the trained siamese model on test data, run:

python evaluate_model.py model_path test.tsv --is_siamese
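
The metric reported by evaluate_model.py is not described here. As a rough illustration of how ranked results can be scored against the 0 to 1 labels, the sketch below groups predictions by query, sorts them by model score and averages the labels of the top k documents. `mean_top_k_label` is a hypothetical helper and the metric is an assumption, not necessarily what the script computes.

```python
from collections import defaultdict


def mean_top_k_label(rows, k=10):
    """rows: iterable of (query, predicted_score, annotated_label) tuples.

    Returns the mean over queries of the average label among the k
    highest-scored documents for that query.
    """
    by_query = defaultdict(list)
    for query, score, label in rows:
        by_query[query].append((score, label))
    if not by_query:
        raise ValueError("no rows to evaluate")
    per_query = []
    for items in by_query.values():
        top = sorted(items, key=lambda item: item[0], reverse=True)[:k]
        per_query.append(sum(label for _, label in top) / len(top))
    return sum(per_query) / len(per_query)


predictions = [
    ("a", 0.9, 1.0), ("a", 0.1, 0.0),
    ("b", 0.2, 0.5), ("b", 0.8, 0.0),
]
print(mean_top_k_label(predictions, k=1))
```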

Acknowledgements

If you use the dataset in your work, please cite the original paper:

@article{kocian2021siamese,
  title={Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset},
  author={Kocián, Matěj and Náplava, Jakub and Štancl, Daniel and Kadlec, Vladimír},
  journal={arXiv preprint arXiv:2112.01810},
  year={2021}
}

Owner

Seznam.cz a.s.
