An open-source Kazakh named entity recognition dataset (KazNERD), annotation guidelines, and baseline NER models.

Overview

Kazakh Named Entity Recognition

This repository contains an open-source Kazakh named entity recognition dataset (KazNERD), named entity annotation guidelines (in Kazakh), and NER model training codes (CRF, BiLSTM-CNN-CRF, BERT and XLM-RoBERTa).

  1. KazNERD Corpus
  2. Annotation Guidelines
  3. NER Models
    1. CRF
    2. BiLSTM-CNN-CRF
    3. BERT and XLM-RoBERTa
  4. Citation

1. KazNERD Corpus

KazNERD contains 112,702 sentences, extracted from the television news text, and 136,333 annotations for 25 entity classes. All sentences in the dataset were manually annotated by two native Kazakh-speaking linguists, supervised by an ISSAI researcher. The IOB2 scheme was used for annotation. The dataset, in CoNLL 2002 format, is located here.

2. Annotation Guidelines

The annotation guidelines followed to build KazNERD are located here. The guidelines contain rules for annotating 25 named entity classes and their examples. The guidelines are in the Kazakh language.

3. NER Models

3.1 CRF

Conda Environment Setup for CRF

The CRF-based NER model training codes are based on Python 3.8. To ease the experiment replication experience, we recommend setting up a Conda environment.

conda create --name knerdCRF python=3.8
conda activate knerdCRF
conda install -c anaconda nltk scikit-learn
conda install -c conda-forge sklearn-crfsuite seqeval

Start CRF training

$ cd crf
$ python runCRF_KazNERD.py

3.2 BiLSTM-CNN-CRF

Conda Environment Setup for BiLSTM-CNN-CRF

The BiLSTM-CNN-CRF-based NER model training codes are based on Python 3.8 and PyTorch 1.7.1. To ease the experiment replication experience, we recommend setting up a Conda environment.

conda create --name knerdLSTM python=3.8
conda activate knerdLSTM
# Check https://pytorch.org/get-started/previous-versions/#v171
# to install a PyTorch version suitable for your OS and CUDA
# or feel free to adapt the code to a newer PyTorch version
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch   # we used this version
conda install -c conda-forge tqdm seqeval

Start BiLSTM-CNN-CRF training

$ cd BiLSTM_CNN_CRF
$ bash run_train_p.sh

3.3 BERT and XLM-RoBERTa

Conda Environment Setup for BERT and XLM-RoBERTa

The BERT- and XLM-RoBERTa-based NER models training codes are based on Python 3.8 and PyTorch 1.7.1. To ease the experiment replication experience, we recommend setting up a Conda environment.

conda create --name knerdBERT python=3.8
conda activate knerdBERT
# Check https://pytorch.org/get-started/previous-versions/#v171
# to install a PyTorch version suitable for your OS and CUDA
# or feel free to adapt the code to a newer PyTorch version
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch   # we used this version
conda install -c anaconda numpy
conda install -c conda-forge seqeval
pip install transformers
pip install datasets

Start BERT training

$ cd bert
$ python run_finetune_kaznerd.py bert

Start XLM-RoBERTa training

$ cd bert
$ python run_finetune_kaznerd.py roberta

4. Citation

@misc{yeshpanov2021kaznerd,
      title={KazNERD: Kazakh Named Entity Recognition Dataset}, 
      author={Rustem Yeshpanov and Yerbolat Khassanov and Huseyin Atakan Varol},
      year={2021},
      eprint={2111.13419},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
ISSAI
Institute of Smart Systems and Artificial Intelligence
ISSAI
基于Pytorch实现优秀的自然图像分割框架!(包括FCN、U-Net和Deeplab)

语义分割学习实验-基于VOC数据集 usage: 下载VOC数据集,将JPEGImages SegmentationClass两个文件夹放入到data文件夹下。 终端切换到目标目录,运行python train.py -h查看训练 (torch) Li Xiang 28 Dec 21, 2022

Asymmetric metric learning for knowledge transfer

Asymmetric metric learning This is the official code that enables the reproduction of the results from our paper: Asymmetric metric learning for knowl

20 Dec 06, 2022
Code for paper Adaptively Aligned Image Captioning via Adaptive Attention Time

Adaptively Aligned Image Captioning via Adaptive Attention Time This repository includes the implementation for Adaptively Aligned Image Captioning vi

Lun Huang 45 Aug 27, 2022
This is the pytorch implementation for the paper: Generalizable Mixed-Precision Quantization via Attribution Rank Preservation, which is accepted to ICCV2021.

GMPQ: Generalizable Mixed-Precision Quantization via Attribution Rank Preservation This is the pytorch implementation for the paper: Generalizable Mix

18 Sep 02, 2022
Deep learning algorithms for muon momentum estimation in the CMS Trigger System

Deep learning algorithms for muon momentum estimation in the CMS Trigger System The Compact Muon Solenoid (CMS) is a general-purpose detector at the L

anuragB 2 Oct 06, 2021
MGFN: Multi-Graph Fusion Networks for Urban Region Embedding was accepted by IJCAI-2022.

Multi-Graph Fusion Networks for Urban Region Embedding (IJCAI-22) This is the implementation of Multi-Graph Fusion Networks for Urban Region Embedding

202 Nov 18, 2022
The fastest way to visualize GradCAM with your Keras models.

VizGradCAM VizGradCam is the fastest way to visualize GradCAM in Keras models. GradCAM helps with providing visual explainability of trained models an

58 Nov 19, 2022
Consumer Fairness in Recommender Systems: Contextualizing Definitions and Mitigations

Consumer Fairness in Recommender Systems: Contextualizing Definitions and Mitigations This is the repository for the paper Consumer Fairness in Recomm

7 Nov 30, 2022
Google AI Open Images - Object Detection Track: Open Solution

Google AI Open Images - Object Detection Track: Open Solution This is an open solution to the Google AI Open Images - Object Detection Track 😃 More c

minerva.ml 46 Jun 22, 2022
The final project of "Applying AI to 3D Medical Imaging Data" from "AI for Healthcare" nanodegree - Udacity.

Quantifying Hippocampus Volume for Alzheimer's Progression Background Alzheimer's disease (AD) is a progressive neurodegenerative disorder that result

Omar Laham 1 Jan 14, 2022
[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

LBYL-Net This repo implements paper Look Before You Leap: Learning Landmark Features For One-Stage Visual Grounding CVPR 2021. Getting Started Prerequ

SVIP Lab 45 Dec 12, 2022
Pytorch reimplement of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" ACL2020. The original code is written in keras.

CasRel-pytorch-reimplement Pytorch reimplement of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" ACL2020. The o

longlongman 170 Dec 01, 2022
Contrastive Language-Image Pretraining

CLIP [Blog] [Paper] [Model Card] [Colab] CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pair

OpenAI 11.5k Jan 08, 2023
Add gui for YoloV5 using PyQt5

HEAD 更新2021.08.16 **添加图片和视频保存功能: 1.图片和视频按照当前系统时间进行命名 2.各自检测结果存放入output文件夹 3.摄像头检测的默认设备序号更改为0,减少调试报错 温馨提示: 1.项目放置在全英文路径下,防止项目报错 2.默认使用cpu进行检测,自

Ruihao Wang 65 Dec 27, 2022
Co-GAIL: Learning Diverse Strategies for Human-Robot Collaboration

CoGAIL Table of Content Overview Installation Dataset Training Evaluation Trained Checkpoints Acknowledgement Citations License Overview This reposito

Jeremy Wang 29 Dec 24, 2022
Deep learning model for EEG artifact removal

DeepSeparator Introduction Electroencephalogram (EEG) recordings are often contaminated with artifacts. Various methods have been developed to elimina

23 Dec 21, 2022
Python Implementation of the CoronaWarnApp (CWA) Event Registration

Python implementation of the Corona-Warn-App (CWA) Event Registration This is an implementation of the Protocol used to generate event and location QR

MaZderMind 17 Oct 05, 2022
Official Pytorch implementation of Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Scene Representation Networks This is the official implementation of the NeurIPS submission "Scene Representation Networks: Continuous 3D-Structure-Aw

Vincent Sitzmann 365 Jan 06, 2023
Implementation of self-attention mechanisms for general purpose. Focused on computer vision modules. Ongoing repository.

Self-attention building blocks for computer vision applications in PyTorch Implementation of self attention mechanisms for computer vision in PyTorch

AI Summer 962 Dec 23, 2022
利用Tensorflow实现基于CNN的中文短文本分类

Text Classification with CNN 使用卷积神经网络进行中文文本分类 CNN做句子分类的论文可以参看: Convolutional Neural Networks for Sentence Classification 还可以去读dennybritz大牛的博客:Implemen

Jeremiah 4 Nov 08, 2022