⚖️ A Statutory Article Retrieval Dataset in French.

Overview

DOI

A Statutory Article Retrieval Dataset in French

This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code to reproduce the experimental results from the associated paper by A. Louis, G. Spanakis, and G. Van Dijck.

Abstract. Statutory article retrieval is the task of automatically retrieving law articles relevant to a legal question. While recent advances in natural language processing have sparked considerable interest in many legal tasks, statutory article retrieval remains primarily untouched due to the scarcity of large-scale and high-quality annotated datasets. To address this bottleneck, we introduce the Belgian Statutory Article Retrieval Dataset (BSARD), which consists of 1,100+ French native legal questions labeled by experienced jurists with relevant articles from a corpus of 22,600+ Belgian law articles. Using BSARD, we benchmark several unsupervised information retrieval methods based on term weighting and pooled embeddings. Our best performing baseline achieves 50.8% [email protected], which is promising for the feasibility of the task and indicates that there is still substantial room for improvement. By the specificity of the data domain and addressed task, BSARD presents a unique challenge problem for future research on legal information retrieval.

Documentation

Detailed documentation on the dataset and how to reproduce the main experimental results can be found here.

Citation

For attribution in academic contexts, please cite this work as:

@article{louis2021statutory,
  title = {A Statutory Article Retrieval Dataset in French},
  author = {Louis, Antoine and Spanakis, Gerasimos and Van Dijck, Gijs},
  journal = {arXiv preprint arXiv:2108.11792},
  year = {2021},
}

License

This repository is licensed under the terms of the CC BY-NC-SA 4.0 license.

You might also like...
🏆 The 1st Place Submission to AICity Challenge 2021 Natural Language-Based Vehicle Retrieval Track (Alibaba-UTS submission)
🏆 The 1st Place Submission to AICity Challenge 2021 Natural Language-Based Vehicle Retrieval Track (Alibaba-UTS submission)

🏆 The 1st Place Submission to AICity Challenge 2021 Natural Language-Based Vehicle Retrieval Track (Alibaba-UTS submission)

Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)
Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)

CIRPLANT This repository contains the code and pre-trained models for Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT) For d

[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation Official Code Repository for the paper "Unsupervised Documen

Question and answer retrieval in Turkish with BERT

trfaq Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉 What is this? At this repo, I'm

Legal text retrieval for python
Legal text retrieval for python

legal-text-retrieval Overview This system contains 2 steps: generate training data containing negative sample found by mixture score of cosine(tfidf)

Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

Memorizing Transformers - Pytorch Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memori

Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Comments
  • transformers library version should be specified

    transformers library version should be specified

    Hi,

    As specified in the documentation, I made the environment from the yml file: time conda env create -n bsard -f environment.yml

    However, I run into an error when doing the grid search:

    Traceback (most recent call last):                                                                                                                                                   
      File "scripts/search_hyperparameters.py", line 11, in <module>
        from retriever import Word2vecRetriever, FasttextRetriever, BERTRetriever
      File "/u/salaunol/Documents/_2021_automne/bsard/scripts/retriever.py", line 16, in <module>
        from transformers import (CamembertModel, CamembertTokenizer,
    ImportError: cannot import name 'CamembertModel' from 'transformers' (/u/salaunol/anaconda3/envs/bsard/lib/python3.8/site-packages/transformers/__init__.py)
    

    The transformers version was the following: transformers 2.1.1 pyhd3eb1b0_0

    I fixed it with pip install transformers==2.5.0 but it would be preferable to specify the version for each library in environment.yml

    Edit: issue submitted too fast

    opened by oliviersalaun 1
Releases(v1.0)
  • v1.0(Aug 26, 2021)

    The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0 is a French native corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.

    Source code(tar.gz)
    Source code(zip)
Owner
Maastricht Law & Tech Lab
The Lab aims to offer innovative education and to build a creative community of researchers at the intersections of law, technology and data science.
Maastricht Law & Tech Lab
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
I label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive

I label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others

1 Jan 13, 2022
DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa: Decoding-enhanced BERT with Disentangled Attention This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Dis

Microsoft 1.2k Jan 03, 2023
[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

New Benchmarks for Learning on Non-Homophilous Graphs Here are the codes and datasets accompanying the paper: New Benchmarks for Learning on Non-Homop

94 Dec 21, 2022
Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"

GDAP The code of paper "Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"" Event Datasets Prep

45 Oct 29, 2022
Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Ankur Dhuriya 10 Oct 13, 2022
American Sign Language (ASL) to Text Converter

Signterpreter American Sign Language (ASL) to Text Converter Recommendations Although there is grayscale and gaussian blur, we recommend that you use

0 Feb 20, 2022
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.

Good news! Our new work exhibits state-of-the-art performances on DocUNet benchmark dataset: DocScanner: Robust Document Image Rectification with Prog

Hao Feng 231 Dec 26, 2022
New Modeling The Background CodeBase

Modeling the Background for Incremental Learning in Semantic Segmentation This is the updated official PyTorch implementation of our work: "Modeling t

Fabio Cermelli 9 Dec 28, 2022
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
nlpcommon is a python Open Source Toolkit for text classification.

nlpcommon nlpcommon, Python Text Tool. Guide Feature Install Usage Dataset Contact Cite Reference Feature nlpcommon is a python Open Source

xuming 3 May 29, 2022
Calibre recipe to convert latest issue of Analyse & Kritik into an ebook

Calibre Recipe für "Analyse & Kritik" Dies ist ein "Recipe" für die Konvertierung der aktuellen Ausgabe der Zeitung Analyse & Kritik in ein Ebook. Es

Henning 3 Jan 04, 2022
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

Dani El-Ayyass 57 Dec 07, 2022
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022