DaReCzech is a dataset for text relevance ranking in Czech

Overview

DaReCzech Dataset

DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated query-documents pairs, which makes it one of the largest available datasets for this task.

The dataset was introduced in paper Siamese BERT-based Model for Web Search Relevance RankingEvaluated on a New Czech Dataset which has been accepted at the IAAI 2022 (Innovative Application Award).

Obtaining the Annotated Data

Please, first read a disclaimer that contains the terms of use. If you comply with them, send an email to [email protected] and the link to the dataset will be sent to you.

Overview

DaReCzech is divided into four parts:

  • Train-big (more than 1.4M records) – intended for training of a (neural) text relevance model
  • Train-small (97k records) – intended for GBRT training (with a text relevance feature trained on Train-big)
  • Dev (41k records)
  • Test (64k records)

Each set is distributed as a .tsv file with 6 columns:

  • ID – unique record ID
  • query – user query
  • url – URL of annotated document
  • doc – representation of the document under the URL, each document is represented using its title, URL and Body Text Extract (BTE) that was obtained using the internal module of our search engine
  • title: document title
  • label – the annotated relevance of the document to the query. There are 5 relevance labels ranging from 0 (the document is not useful for given query) to 1 (document is for given query useful)

The files are UTF-8 encoded. The values never contain a tab and are not quoted nor escaped – to load the dataset in pandas, use

import csv
import pandas as pd
pd.read_csv(path, sep='\t', quoting=csv.QUOTE_NONE)

Baselines

We provide code to train two BERT-based baseline models: a query-doc model (train_querydoc_model.py) and a siamese model (train_siamese_model.py).

Before running the scripts, install requirements that are listed in requirements.txt. The scripts were tested with Python 3.6.

pip install -r requirements.txt

Model Training

To train a query-doc model with default settings, run:

python train_querydoc_model.py train_big.tsv dev.tsv outputs

To train a siamese model without a teacher, run:

python train_siamese_model.py train_big.tsv dev.tsv outputs

To train a siamese model with a trained query-doc teacher, run:

python train_siamese_model.py train_big.tsv dev.tsv outputs --teacher path_to_query_doc_checkpoint

Note that example scripts run training with our (unsupervisedly) pretrained Small-E-Czech model.

Model Evaluation

To evaluate the trained query-doc model on test data, run:

python evaluate_model.py model_path test.tsv --is_querydoc

To evaluate the trained siamese model on test data, run:

python evaluate_model.py model_path test.tsv --is_siamese

Acknowledgements

If you use the dataset in your work, please cite the original paper:

@article{kocian2021siamese,
  title={Siamese BERT-based Model for Web Search Relevance RankingEvaluated on a New Czech Dataset},
  author={Kocián, Matěj and Náplava, Jakub and Štancl, Daniel and Kadlec, Vladimír},
  journal={arXiv preprint arXiv:2112.01810},
  year={2021}
}
Owner
Seznam.cz a.s.
Seznam.cz a.s.
Implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT : Cross-Attention Multi-Scale Vision Transformer for Image Classification This is an unofficial PyTorch implementation of CrossViT: Cross-Att

Rishikesh (ऋषिकेश) 103 Nov 25, 2022
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)

This is a playground for pytorch beginners, which contains predefined models on popular dataset. Currently we support mnist, svhn cifar10, cifar100 st

Aaron Chen 2.4k Dec 28, 2022
Computational inteligence project on faces in the wild dataset

Table of Contents The general idea How these scripts work? Loading data Needed modules and global variables Parsing the arrays in dataset Extracting a

tooraj taraz 4 Oct 21, 2022
一个目标检测的通用框架(不需要cuda编译),支持Yolo全系列(v2~v5)、EfficientDet、RetinaNet、Cascade-RCNN等SOTA网络。

一个目标检测的通用框架(不需要cuda编译),支持Yolo全系列(v2~v5)、EfficientDet、RetinaNet、Cascade-RCNN等SOTA网络。

Haoyu Xu 203 Jan 03, 2023
A curated list of programmatic weak supervision papers and resources

A curated list of programmatic weak supervision papers and resources

Jieyu Zhang 118 Jan 02, 2023
ESL: Event-based Structured Light

ESL: Event-based Structured Light Video (click on the image) This is the code for the 2021 3DV paper ESL: Event-based Structured Light by Manasi Mugli

Robotics and Perception Group 29 Oct 24, 2022
Gesture recognition on Event Data

Event based Gesture Recognition Gesture recognition on Event Data usually involv

2 Feb 14, 2022
Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

Faster R-CNN pretrained on VisualGenome This repository modifies maskrcnn-benchmark for object detection and attribute prediction on VisualGenome data

Shizhe Chen 7 Apr 20, 2021
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

880 Jan 07, 2023
This implementation contains the application of GPlearn's symbolic transformer on a commodity futures sector of the financial market.

GPlearn_finiance_stock_futures_extension This implementation contains the application of GPlearn's symbolic transformer on a commodity futures sector

Chengwei <a href=[email protected]"> 189 Dec 25, 2022
Covid19-Forecasting - An interactive website that tracks, models and predicts COVID-19 Cases

Covid-Tracker This is an interactive website that tracks, models and predicts CO

Adam Lahmadi 1 Feb 01, 2022
.NET bindings for the Pytorch engine

TorchSharp TorchSharp is a .NET library that provides access to the library that powers PyTorch. It is a work in progress, but already provides a .NET

Matteo Interlandi 17 Aug 30, 2021
Molecular AutoEncoder in PyTorch

MolEncoder Molecular AutoEncoder in PyTorch Install $ git clone https://github.com/cxhernandez/molencoder.git && cd molencoder $ python setup.py insta

Carlos Hernández 80 Dec 05, 2022
A script that trains a model to recognize handwritten digits using the MNIST data set.

handwritten-digits-recognition A script that trains a model to recognize handwritten digits using the MNIST data set. Then it loads external files and

Hamza Sayih 1 Oct 30, 2021
領域を指定し、キーを入力することで画像を保存するツールです。クラス分類用のデータセット作成を想定しています。

image-capture-class-annotation 領域を指定し、キーを入力することで画像を保存するツールです。 クラス分類用のデータセット作成を想定しています。 Requirement OpenCV 3.4.2 or later Usage 実行方法は以下です。 起動後はマウスクリック4

KazuhitoTakahashi 5 May 28, 2021
This repository contains the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields Project Page | Paper | Supplementary | Video | Slides | Blog | Talk If

1.1k Dec 30, 2022
Temporal Knowledge Graph Reasoning Triggered by Memories

MTDM Temporal Knowledge Graph Reasoning Triggered by Memories To alleviate the time dependence, we propose a memory-triggered decision-making (MTDM) n

4 Sep 25, 2022
Code of our paper "Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning"

CCOP Code of our paper Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning Requirement Install OpenSelfSup Install Detectron2

Chenhongyi Yang 21 Dec 13, 2022
SOLOv2 on onnx & tensorRT

SOLOv2.tensorRT: NOTE: code based on WXinlong/SOLO add support to TensorRT inference onnxruntime tensorRT full_dims and dynamic shape postprocess with

47 Nov 26, 2022
Patient-Survival - Using Python, I developed a Machine Learning model using classification techniques such as Random Forest and SVM classifiers to predict a patient's survival status that have undergone breast cancer surgery.

Patient-Survival - Using Python, I developed a Machine Learning model using classification techniques such as Random Forest and SVM classifiers to predict a patient's survival status that have underg

Nafis Ahmed 1 Dec 28, 2021