Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks


NERDA


Not only is NERDA a mesmerizing muppet-like character. NERDA is also a Python package that offers a slick, easy-to-use interface for fine-tuning pretrained transformers for Named Entity Recognition (NER) tasks.

You can also use NERDA to access a selection of precooked NERDA models that you can use right off the shelf for NER tasks.

NERDA is built on Hugging Face transformers and the popular PyTorch framework.

Installation guide

NERDA can be installed from PyPI with

pip install NERDA

If you want the development version, install directly from GitHub.
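
A minimal sketch, assuming the package is hosted under the ebanalyse organization on GitHub:

pip install git+https://github.com/ebanalyse/NERDA.git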

Named-Entity Recognition tasks

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Task

Identify person names and organizations in text:

Jim bought 300 shares of Acme Corp.

Solution

Named Entity   Type
'Jim'          Person
'Acme Corp.'   Organization
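
Under the hood, NER models like the ones trained here typically encode such solutions with IOB tags, where 'B-' marks the beginning of an entity, 'I-' its continuation and 'O' a token outside any entity. A minimal sketch in plain Python (no NERDA required):

# tokens of the example sentence
tokens = ['Jim', 'bought', '300', 'shares', 'of', 'Acme', 'Corp.']

# IOB encoding of the solution above: 'Jim' is a person,
# 'Acme Corp.' is an organization, everything else is outside ('O')
tags   = ['B-PER', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']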

Read more about NER on Wikipedia.

Train Your Own NERDA Model

Say we want to fine-tune a pretrained multilingual BERT transformer for NER in English.

Load the package.

from NERDA.models import NERDA

Instantiate a NERDA model (with default settings) for the CoNLL-2003 English NER data set.

from NERDA.datasets import get_conll_data
model = NERDA(dataset_training = get_conll_data('train'),
              dataset_validation = get_conll_data('valid'),
              transformer = 'bert-base-multilingual-uncased')

By default, the network architecture is analogous to that of the models in Hvingelby et al. 2020.

The model can then be trained/fine-tuned by invoking the train method, e.g.

model.train()

Note: this will take some time depending on the specs of your machine. If you want to skip training, you can go ahead and use one of the models that we have already precooked for you instead.

After the model has been trained, the model can be used for predicting named entities in new texts.

# text to identify named entities in.
text = 'Old MacDonald had a farm'
model.predict_text(text)
([['Old', 'MacDonald', 'had', 'a', 'farm']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

This means that the model identified 'Old MacDonald' as a person (PER).
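
You can also evaluate the fine-tuned model on the CoNLL-2003 test set. A minimal sketch, assuming the model exposes an evaluate_performance method (check the documentation for the exact name and signature):

from NERDA.datasets import get_conll_data

# compute F1-scores on the held-out test split
test = get_conll_data('test')
model.evaluate_performance(test)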

Please note that the NERDA model configuration above was instantiated with all default settings. You can, however, customize your NERDA model in many ways (see the sketch after this list):

  • Use your own data set (fine-tune a transformer for any given language)
  • Choose whatever transformer you like
  • Set all of the hyperparameters for the model
  • You can even apply your own Network Architecture
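
A sketch of such a customized configuration, assuming constructor arguments like tag_scheme, tag_outside and hyperparameters, and assuming data sets are dicts of tokenized sentences and their IOB tags (check the detailed documentation for the exact signature):

from NERDA.models import NERDA

# toy custom data set; the sentences/tags dict format is assumed
# here for illustration
training   = {'sentences': [['Jim', 'bought', 'shares']],
              'tags':      [['B-PER', 'O', 'O']]}
validation = {'sentences': [['Acme', 'Corp.', 'grew']],
              'tags':      [['B-ORG', 'I-ORG', 'O']]}

model = NERDA(dataset_training   = training,
              dataset_validation = validation,
              tag_scheme      = ['B-PER', 'I-PER',
                                 'B-ORG', 'I-ORG'],            # entity tags to learn
              tag_outside     = 'O',                           # tag for non-entity tokens
              transformer     = 'bert-base-multilingual-uncased',  # any Hugging Face model id
              hyperparameters = {'epochs': 4,
                                 'learning_rate': 0.0001})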

Read more about advanced usage of NERDA in the detailed documentation.

Use a Precooked NERDA model

We have precooked a number of NERDA models for Danish and English that you can download and use right off the shelf.

Here is an example.

Instantiate a multilingual BERT model that has been fine-tuned for NER in Danish, DA_BERT_ML.

from NERDA.precooked import DA_BERT_ML
model = DA_BERT_ML()

Download the network from the web and load it:

model.download_network()
model.load_network()

You can now predict named entities in new (Danish) texts:

# (Danish) text to identify named entities in:
# 'Jens Hansen har en bondegård' = 'Old MacDonald had a farm'
text = 'Jens Hansen har en bondegård'
model.predict_text(text)
([['Jens', 'Hansen', 'har', 'en', 'bondegård']], [['B-PER', 'I-PER', 'O', 'O', 'O']])
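
predict_text returns tokens paired with IOB tags. If you prefer contiguous entity spans, a small post-processing sketch in plain Python (works on the output format shown above):

def extract_entities(tokens, tags):
    """Collect contiguous B-/I- runs into (entity text, type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):           # a new entity starts
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current:
            current.append(token)          # the entity continues
        else:                              # outside any entity
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((' '.join(current), current_type))
    return entities

sentences, tag_lists = model.predict_text('Jens Hansen har en bondegård')
print(extract_entities(sentences[0], tag_lists[0]))
# [('Jens Hansen', 'PER')]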

List of Precooked Models

The table below shows the precooked NERDA models publicly available for download.

Model          Language  Transformer        Dataset     F1-score
DA_BERT_ML     Danish    Multilingual BERT  DaNE        82.8
DA_ELECTRA_DA  Danish    Danish ELECTRA     DaNE        79.8
EN_BERT_ML     English   Multilingual BERT  CoNLL-2003  90.4
EN_ELECTRA_EN  English   English ELECTRA    CoNLL-2003  89.1

F1-score is the micro-averaged F1-score across entity tags, evaluated on the respective test sets (which were used for neither training nor validation of the models).

Note that we have not spent much time on actually fine-tuning the models, so there could well be room for improvement. If you manage to improve the models, we will be happy to hear from you and include your NERDA model.

Model Performance

The table below summarizes the performance (F1-scores) of the precooked NERDA models.

Level      DA_BERT_ML  DA_ELECTRA_DA  EN_BERT_ML  EN_ELECTRA_EN
B-PER      93.8        92.0           96.0        95.1
I-PER      97.8        97.1           98.5        97.9
B-ORG      69.5        66.9           88.4        86.2
I-ORG      69.9        70.7           85.7        83.1
B-LOC      82.5        79.0           92.3        91.1
I-LOC      31.6        44.4           83.9        80.5
B-MISC     73.4        68.6           81.8        80.1
I-MISC     86.1        63.6           63.4        68.4
AVG_MICRO  82.8        79.8           90.4        89.1
AVG_MACRO  75.6        72.8           86.3        85.3
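
On the difference between the two bottom rows: the micro average pools all token-level decisions before computing one F1-score, while the macro average computes F1 per tag and takes the unweighted mean. A minimal sketch with scikit-learn (not part of NERDA's interface, shown only for illustration):

from sklearn.metrics import f1_score

y_true = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O']
y_pred = ['B-PER', 'O',     'O', 'B-ORG', 'O']

# score only the entity tags, not the 'O' class
labels = ['B-PER', 'I-PER', 'B-ORG']
print(f1_score(y_true, y_pred, labels = labels, average = 'micro'))
print(f1_score(y_true, y_pred, labels = labels, average = 'macro'))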

'NERDA'?

'NERDA' originally stands for 'Named Entity Recognition for DAnish'. However, this is somewhat misleading, since the functionality is no longer limited to Danish. On the contrary, it generalizes to all other languages: NERDA supports fine-tuning of transformers for NER tasks in any language.

Background

NERDA is developed as part of Ekstra Bladet's activities on Platform Intelligence in News (PIN). PIN is an industrial research project carried out in collaboration between the Technical University of Denmark, the University of Copenhagen and Copenhagen Business School, with funding from Innovation Fund Denmark. The project runs from 2020 to 2023 and develops recommender systems and natural language processing systems geared towards news publishing, some of which are open sourced, like NERDA.

Shout-outs

Read more

The detailed documentation for NERDA including code references and extended workflow examples can be accessed here.

Contact

We hope that you will find NERDA useful.

Please direct any questions and feedback to us!

If you want to contribute (which we encourage you to do), open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.

Owner
Ekstra Bladet (GitHub of Ekstra Bladet Analyse)
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
A framework for cleaning Chinese dialog data

A framework for cleaning Chinese dialog data

Yida 136 Dec 20, 2022
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31st, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022
Script and models for clustering LAION-400m CLIP embeddings.

clustering-laion400m Script and models for clustering LAION-400m CLIP embeddings. Models were fit on the first million or so image embeddings. A subje

Peter Baylies 22 Oct 04, 2022
End-to-end long-text summarization model (CAIL 2020 judicial summarization track)

End-to-end long-text summarization model (CAIL 2020 judicial summarization track)

Jianlin Su (苏剑林) 334 Jan 08, 2023
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
Turn clang-tidy warnings and fixes to comments in your pull request

clang-tidy pull request comments A GitHub Action to post clang-tidy warnings and suggestions as review comments on your pull request. What platisd/cla

Dimitris Platis 30 Dec 13, 2022
Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Udit Arora 19 Oct 28, 2022
NLP Text Classification

Multi-label text classification task. In recent years, with the development of deep learning, the number of model parameters has grown rapidly. Training these parameters requires larger data sets to avoid overfitting. However, for most NLP tasks, building large-scale annotated data sets is very difficult (too costly), especially for syntax- and semantics-related tasks. In contrast, building large-scale unannotated corpora is relatively easy. To make use of such data, we can

Jason 1 Nov 11, 2021
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Simplemma: a simple multilingual lemmatizer for Python Purpose Lemmatization is the process of grouping together the inflected forms of a word so they

Adrien Barbaresi 70 Dec 29, 2022
ttslearn: Library for "Text-to-Speech with Python" (Pythonで学ぶ音声合成)

ttslearn: Library for "Text-to-Speech with Python" (Pythonで学ぶ音声合成). Japanese follows below (日本語は以下に続きます). English: This book is written in Japanese and primaril

Ryuichi Yamamoto 189 Dec 29, 2022
Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

New State-of-the-Art in Preposition Sense Disambiguation Supervisor: Prof. Dr. Alexander Mehler Alexander Henlein Institutions: Goethe University TTLa

Dirk Neuhäuser 4 Apr 06, 2022
This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Word-Level Coreference Resolution This is a repository with the code to reproduce the experiments described in the paper of the same name, which was a

79 Dec 27, 2022
Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.

LibreTranslate Try it online! | API Docs | Community Forum Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it d

3.4k Dec 27, 2022
Chinese event extraction based on PyTorch + BERT

pytorch_bert_event_extraction: Chinese event extraction based on PyTorch + BERT; the main idea is QA (question answering). Download the chinese-roberta-wwm-ext model in advance and specify its location at runtime.

西西嘛呦 31 Nov 30, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022
Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written form.

Neural G2P to portuguese language Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written for

fluz 11 Nov 16, 2022
Prithivida 690 Jan 04, 2023
Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

japanese-gpt2 This repository provides the code for training Japanese GPT-2 models. This code has been used for producing japanese-gpt2-medium release

rinna Co.,Ltd. 491 Jan 07, 2023