KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

Related tags

Deep LearningKIND
Overview

KIND (Kessler Italian Named-entities Dataset)

KIND is an Italian dataset for Named-Entity Recognition.

It contains more than one million tokens with the annotation covering three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses.

For the construction of the dataset, we decide to use texts available for free, under a license that permits both research and commercial use.

In particular we release four chapters with texts taken from: (i) Wikinews (WN) as a source of news texts belonging to the last decades; (ii) some Italian fiction books (FIC) whose authors died more than 70 years ago; (iii) writings and speeches from Italian politicians Aldo Moro (AM) and (iv) Alcide De Gasperi (ADG).

Wikinews

Wikinews is a multi-language free project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License.

In building KIND, we randomly choose 1,000 articles evenly distributed in the last 20 years, for a total of 308,622 tokens.

Literature

Regarding fiction literature, we annotate 86 book chapters taken from 10 books written by Italian authors, who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.

In particular, we choose: Il giorno delle Mésules (Ettore Castiglioni, 12,853 tokens), L'amante di Cesare (Augusto De Angelis, 13,464 tokens), Canne al vento (Grazia Deledda, 13,945 tokens), 1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli (Guido Fabiani, 10,801 tokens), Lettere dal carcere (Antonio Gramsci, 10,655), Anarchismo e democrazia (Errico Malatesta, 11,557 tokens), L'amore negato (Maria Messina, 31,115 tokens), La luna e i falò (Cesare Pavese, 10,705 tokens), La coscienza di Zeno (Italo Svevo, 56,364 tokens), Le cose piu grandi di lui (Luciano Zuccoli, 20,989 tokens).

In selecting works without copyright, we favored texts as recent as possible, so that the model trained on this data can be used efficiently on novels written in the last years, since the language used in these novels is more likely to be similar to the language used in the novels of our days.

Aldo Moro's Works

Writings belonging to Aldo Moro have recently been collected by the University of Bologna and published on a platform called Edizione Nazionale delle Opere di Aldo Moro.

The project is still ongoing and, by now, it contains 806 documents for a total of about one million tokens.

In the first release of KIND, we include 392,604 tokens from the Aldo Moro's works dataset, with silver annotations (see the reference below).

Alcide De Gasperi's Writings

Finally, we annotate 158 document (150,632 tokens) from Alcide Digitale, spanning 50 years of European history.

The complete corpus contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954.

License

The NER annotations in (i), (ii), and (iii) are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Annotation from Alcide De Gasperi's writings are released under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner
Digital Humanities
Digital Humanities Unit at Fondazione Bruno Kessler
Digital Humanities
3D HourGlass Networks for Human Pose Estimation Through Videos

3D-HourGlass-Network 3D CNN Based Hourglass Network for Human Pose Estimation (3D Human Pose) from videos. This was my summer'18 research project. Dis

Naman Jain 51 Jan 02, 2023
Learning kernels to maximize the power of MMD tests

Code for the paper "Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy" (arXiv:1611.04488; published at ICLR 2017), by Douga

Danica J. Sutherland 201 Dec 17, 2022
Code for the KDD 2021 paper 'Filtration Curves for Graph Representation'

Filtration Curves for Graph Representation This repository provides the code from the KDD'21 paper Filtration Curves for Graph Representation. Depende

Machine Learning and Computational Biology Lab 16 Oct 16, 2022
Fast, general, and tested differentiable structured prediction in PyTorch

Fast, general, and tested differentiable structured prediction in PyTorch

HNLP 1.1k Dec 16, 2022
Line-level Handwritten Text Recognition (HTR) system implemented with TensorFlow.

Line-level Handwritten Text Recognition with TensorFlow This model is an extended version of the Simple HTR system implemented by @Harald Scheidl and

Hoàng Tùng Lâm (Linus) 72 May 07, 2022
Distinguishing Commercial from Editorial Content in News

Distinguishing Commercial from Editorial Content in News In this repository you can find the following: An anonymized version of the data used for my

Timo Kats 3 Sep 26, 2022
Official implementation for “Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior”

HEP Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior Implementation Python3 PyTorch=1.0 NVIDIA GPU+CUDA Training process The

FengZhang 34 Dec 04, 2022
Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORAL)

Scribble-Supervised LiDAR Semantic Segmentation Dataset and code release for the paper Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORA

102 Dec 25, 2022
Save-restricted-v-3 - Save restricted content Bot For telegram

Save restricted content Bot Contact: Telegram A stable telegram bot to get restr

DEVANSH 11 Dec 21, 2022
Garbage classification using structure data.

垃圾分类模型使用说明 1.包含以下数据文件 文件 描述 data/MaterialMapping.csv 物体以及其归类的信息 data/TestRecords 光谱原始测试数据 CSV 文件 data/TestRecordDesc.zip CSV 文件描述文件 data/Boundaries.cs

wenqi 1 Dec 10, 2021
[NeurIPS 2020] Official Implementation: "SMYRF: Efficient Attention using Asymmetric Clustering".

SMYRF: Efficient attention using asymmetric clustering Get started: Abstract We propose a novel type of balanced clustering algorithm to approximate a

Giannis Daras 46 Dec 22, 2022
Source code for our CVPR 2019 paper - PPGNet: Learning Point-Pair Graph for Line Segment Detection

PPGNet: Learning Point-Pair Graph for Line Segment Detection PyTorch implementation of our CVPR 2019 paper: PPGNet: Learning Point-Pair Graph for Line

SVIP Lab 170 Oct 25, 2022
Repository containing detailed experiments related to the paper "Memotion Analysis through the Lens of Joint Embedding".

Memotion Analysis Through The Lens Of Joint Embedding This repository contains the experiments conducted as described in the paper 'Memotion Analysis

Nethra Gunti 1 Mar 16, 2022
An implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks in PyTorch.

Neural Attention Distillation This is an implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep

Yige-Li 84 Jan 04, 2023
Practical tutorials and labs for TensorFlow used by Nvidia, FFN, CNN, RNN, Kaggle, AE

TensorFlow Tutorial - used by Nvidia Learn TensorFlow from scratch by examples and visualizations with interactive jupyter notebooks. Learn to compete

Alexander R Johansen 1.9k Dec 19, 2022
BBScan py3 - BBScan py3 With Python

BBScan_py3 This repository is forked from lijiejie/BBScan 1.5. I migrated the fo

baiyunfei 12 Dec 30, 2022
TrackTech: Real-time tracking of subjects and objects on multiple cameras

TrackTech: Real-time tracking of subjects and objects on multiple cameras This project is part of the 2021 spring bachelor final project of the Bachel

5 Jun 17, 2022
Rl-quickstart - Reinforcement Learning Quickstart

Reinforcement Learning Quickstart To get setup with the repository, git clone ht

UCLA DataRes 3 Jun 16, 2022
Thermal Control of Laser Powder Bed Fusion using Deep Reinforcement Learning

This repository is the implementation of the paper "Thermal Control of Laser Powder Bed Fusion Using Deep Reinforcement Learning", linked here. The project makes use of the Deep Reinforcement Library

BaratiLab 11 Dec 27, 2022
Log4j JNDI inj. vuln scanner

Log-4-JAM - Log 4 Just Another Mess Log4j JNDI inj. vuln scanner Requirements pip3 install requests_toolbelt Usage # make sure target list has http/ht

Ashish Kunwar 66 Nov 09, 2022