DaCy: The State-of-the-Art Danish NLP Pipeline Using spaCy

Overview

DaCy: A SpaCy NLP Pipeline for Danish

DaCy is a Danish NLP pipeline trained using spaCy. At the time of writing, it achieves state-of-the-art performance on all benchmark tasks for Danish. This repository contains the code for reproducing DaCy. To download the models, use the DaNLP package (request pending), spaCy (request pending), or download the project directly here.

Reproduction

The folder DaCy contains a spaCy project that allows for reproducing the results. The folder also includes the evaluation metrics on DaNE.

Usage

To load the model from a direct download, place the downloaded "packages" folder in your working directory and load the model using spaCy:

import spacy
nlp = spacy.load("da_dacy_large_tft-0.0.0")

More explicitly, loading from the unpacked folder looks like this:

nlp = spacy.load("da_dacy_large_tft-0.0.0/da_dacy_large_tft/da_dacy_large_tft-0.0.0")

If you get an error, you may be loading from the outer folder called da_dacy_large_tft-0.0.0 rather than the inner one.

To obtain state-of-the-art performance on lemmatization as well, add the lemmy lemmatization component to the pipeline:

import lemmy.pipe

pipe = lemmy.pipe.load('da')

# Add the component to the spaCy pipeline.
nlp.add_pipe(pipe, after='tagger')

# Lemmas can now be accessed using the `._.lemmas` attribute on the tokens.
nlp("akvariernes")[0]._.lemmas

This requires installing the lemmy package beforehand, which is easily done using:

pip install lemmy
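
Putting the pieces together, a minimal end-to-end sketch (assuming the model folder has been downloaded and lemmy installed as described above; the add_pipe usage follows the lemmy snippet above):

import spacy
import lemmy.pipe

# Load the downloaded DaCy model (see the path notes above).
nlp = spacy.load("da_dacy_large_tft-0.0.0")

# Attach the lemmy lemmatizer after the tagger.
pipe = lemmy.pipe.load("da")
nlp.add_pipe(pipe, after="tagger")

doc = nlp("akvariernes")
print(doc[0]._.lemmas)  # expected: ['akvarium']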

Performance and Training

The following table shows the performance on DaNE compared to other models. The highest scores are highlighted in bold and the second highest are underlined.

Want to learn more about how the model was trained? Check out this blog post.

Issues and Usage Q&A

To ask questions, report issues, or request features 🤔, please use the GitHub Issue Tracker. Questions related to spaCy should be directed to the spaCy GitHub or forum.

Acknowledgements

This is really an acknowledgement of great open-source software and its contributors. DaCy wouldn't have been possible without the work of the spaCy team, who developed and integrated the software. Thanks to Hugging Face for developing Transformers and making model sharing convenient, to BotXO for training and sharing the Danish BERT model, and to Malte Bertelsen for making it easily available. DaNLP has made it extremely easy to access Danish resources to train on, supplied some of the tagged data themselves, and does a great job of actively developing these datasets.

References

If you use this library in your research, please cite:

@inproceedings{enevoldsen2020dacy,
    title={DaCy: A SpaCy NLP Pipeline for Danish},
    author={Enevoldsen, Kenneth},
    year={2021}
}

LICENSE

DaCy is released under the Apache License, Version 2.0. See the LICENSE file for more details.

Comments
  • Make cache dir configurable

    I would like to make the default cache dir configurable with an environment variable. This is a simple PR that allows one to do that with the variable DACY_CACHE_DIR.
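
    A minimal sketch of the pattern described (the helper name and default path are hypothetical; only the DACY_CACHE_DIR variable comes from the PR):

    import os
    from pathlib import Path

    # Hypothetical default; DaCy's actual cache location may differ.
    DEFAULT_CACHE_DIR = Path.home() / ".dacy"

    def get_cache_dir() -> Path:
        # Prefer the DACY_CACHE_DIR environment variable over the default.
        return Path(os.environ.get("DACY_CACHE_DIR", DEFAULT_CACHE_DIR))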

    opened by dhpollack 9
  • Remove protobuf dependency

    dacy has a very tight version bound on some auxiliary libraries like protobuf. It's not apparent why this is required, as protobuf does not appear to be used internally, though it could of course be intentional. The pinned version is lagging enough that it is starting to cause compatibility problems with other libraries, so it would be very helpful if it could be relaxed.

    enhancement 
    opened by Bonnevie 4
  • Add Tutorials: "Extracting text statistics and readability metrics using DaCy and Textdescriptives"

    After removing readability, it would be nice to have a tutorial on: "Extracting text statistics and readability metrics using DaCy and Textdescriptives"

    Potentially using the two packages to examine the difference in language complexity between conversational data and legal documents in DAGW, or a similar task using a publicly available dataset.
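
    For illustration, a hedged sketch of what such a tutorial could start from (the component and function names assume TextDescriptives' spaCy-component API; check its documentation for the exact names):

    import dacy
    import textdescriptives as td  # assumes: pip install textdescriptives

    nlp = dacy.load("medium")
    # Assumed component name from TextDescriptives' namespaced pipes.
    nlp.add_pipe("textdescriptives/readability")

    doc = nlp("DaCy er en effektiv pipeline til dansk sprogprocessering.")
    print(td.extract_df(doc))  # readability metrics as a pandas DataFrame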

    enhancement 
    opened by KennethEnevoldsen 4
  • loosen requirements

    The requirements of this package are unnecessarily strict. Specifically, I am having issues with tqdm; I give a more in-depth explanation in the issue I created, centre-for-humanities-computing/DaCy#75. There are also a few possible optimizations to your setup.py file. I notice that the requirements.txt file is not used, which could cause a mismatch between pip install -r requirements.txt and pip install .
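
    One common way to avoid that mismatch is to have setup.py read requirements.txt directly (a sketch, not DaCy's actual setup.py):

    # setup.py (sketch): keep install_requires in sync with requirements.txt
    from pathlib import Path

    from setuptools import find_packages, setup

    requirements = [
        line.strip()
        for line in Path("requirements.txt").read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]

    setup(
        name="dacy",
        packages=find_packages(),
        install_requires=requirements,
    )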

    opened by dhpollack 4
  • ContextualVersionConflict Traceback (most recent call last)

    Moved from #133, originally posted by @EaLindhardt

    I've tried to install dacy through Anaconda, both with pip and conda install, and via the different installation methods listed at: https://centre-for-humanities-computing.github.io/DaCy/installation.html

    When running:

    import dacy

    I get the following:

    ---------------------------------------------------------------------------
    ContextualVersionConflict                 Traceback (most recent call last)
    Input In [14], in <cell line: 1>()
    ----> 1 import dacy

    File ~\AppData\Roaming\Python\Python39\site-packages\dacy\__init__.py:4
          1 from dacy.hate_speech import make_offensive_transformer  # noqa
          2 from dacy.sentiment import make_emotion_transformer  # noqa
    ----> 4 from .about import download_url, title, version  # noqa
          5 from .download import download_model  # noqa
          6 from .load import load, models, where_is_my_dacy

    File ~\AppData\Roaming\Python\Python39\site-packages\dacy\about.py:3
          1 import pkg_resources
    ----> 3 version = pkg_resources.get_distribution("dacy").version
          4 title = "dacy"
          5 download_url = "https://github.com/centre-for-humanities-computing/DaCy"

    File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:477, in get_distribution(dist)
        475     dist = Requirement.parse(dist)
        476 if isinstance(dist, Requirement):
    --> 477     dist = get_provider(dist)
        478 if not isinstance(dist, Distribution):
        479     raise TypeError("Expected string, Requirement, or Distribution", dist)

    File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:353, in get_provider(moduleOrReq)
        351 """Return an IResourceProvider for the named module or requirement"""
        352 if isinstance(moduleOrReq, Requirement):
    --> 353     return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
        354 try:
        355     module = sys.modules[moduleOrReq]

    File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:897, in WorkingSet.require(self, *requirements)
        888 def require(self, *requirements):
        889     """Ensure that distributions matching requirements are activated
        (...)
        895     included, even if they were already activated in this working set.
        896     """
    --> 897     needed = self.resolve(parse_requirements(requirements))
        899     for dist in needed:
        900         self.add(dist)

    File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:788, in WorkingSet.resolve(self, requirements, env, installer, replace_conflicting, extras)
        785 if dist not in req:
        786     # Oops, the "best" so far conflicts with a dependency
        787     dependent_req = required_by[req]
    --> 788     raise VersionConflict(dist, req).with_context(dependent_req)
        790 # push the new requirements onto the stack
        791 new_requirements = dist.requires(req.extras)[::-1]

    ContextualVersionConflict: (spacy 3.3.1 (c:\users\au576018\anaconda3\lib\site-packages), Requirement.parse('spacy<3.3.0,>=3.2.0'), {'dacy'})

    How do I solve this?
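
    The traceback itself shows dacy requiring spacy<3.3.0,>=3.2.0 while spacy 3.3.1 is installed, so a likely (though untested here) fix is to pin spaCy back into the required range:

    pip install "spacy>=3.2.0,<3.3.0"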

    @EaLindhardt will you please add the following information:

    • DaCy Version Used:
    • Operating System:
    • Python Version Used:
    • spaCy Version Used:
    • Environment Information:

    You can also type python -m spacy info --markdown and copy-paste the result here, along with the DaCy version, which you can get using python -c "import dacy; print(dacy.__version__)".

    bug Stale 
    opened by KennethEnevoldsen 3
  • Update WandbLogger in configs to v2

    Update the WandbLogger in the configs to v2. The new version has the same experiment-tracking features as v1, but adds model checkpointing and dataset versioning.
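
    For reference, a sketch of what the updated logger block could look like in a spaCy training config (the option values are illustrative; see the spacy-loggers documentation for WandbLogger.v2's exact options):

    [training.logger]
    @loggers = "spacy.WandbLogger.v2"
    project_name = "dacy"
    remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
    model_log_interval = 1000
    log_dataset_dir = "corpus"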

    opened by scottire 3
  • Augmentation

    • [x] Entity augmentation
      • [x] Gender augmentation (awareness of gender)
      • [x] Second order person augmentation (Lastname, Firstname)
      • [ ] Usernames (autogenerates e.g. WhiteTruffle101 or Kenneth Enevoldsen -> KennethEnevoldsen)
    • [ ] Misspelling augmentations, see e.g. this repo
      • [x] Keystroke error based on keyboard distance
    • [ ] Historic augmentations (see the sketch after this list)
      • [x] æ -> ae, å -> aa (and a), ø -> oe
      • [ ] uppercasing of nouns
    • [ ] Social media
      • [ ] Adding hashtags augmentation
    • [ ] Others, potentially see this tweet or this kaggle summary
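
    As a standalone illustration of the historic-spelling item above (a hypothetical helper, not DaCy's actual augmenter API):

    # Hypothetical sketch: map modern Danish letters to their historic digraphs.
    HISTORIC_SPELLING = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}

    def historic_spelling(text: str) -> str:
        for modern, historic in HISTORIC_SPELLING.items():
            text = text.replace(modern, historic)
        return text

    print(historic_spelling("grød med æbler"))  # -> "groed med aebler"
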
    enhancement 
    opened by KennethEnevoldsen 3
  • :arrow_up: Update sphinxext-opengraph requirement from <0.7.0,>=0.6.3 to >=0.6.3,<0.8.0

    Updates the requirements on sphinxext-opengraph to permit the latest version.

    Release notes

    Sourced from sphinxext-opengraph's releases: v0.7.4

    Full Changelog: https://github.com/wpilibsuite/sphinxext-opengraph/compare/v0.7.3...v0.7.4

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies python 
    opened by dependabot[bot] 2
  • :arrow_up: Bump schneegans/dynamic-badges-action from 1.2.0 to 1.3.0

    Bumps schneegans/dynamic-badges-action from 1.2.0 to 1.3.0.

    Release notes

    Sourced from schneegans/dynamic-badges-action's releases.

    Dynamic Badges v1.3.0

    This release adds the possibility to auto-generate the badge color. You can read the full changelog.

    Changelog

    Sourced from schneegans/dynamic-badges-action's changelog.

    Dynamic Badges Action 1.3.0

    Release Date: 2022-04-18

    Changes

    • Added the possibility to generate the badge color automatically between red and green based on a numerical value and its bounds. Thanks to @LucasWolfgang for this contribution!
    Commits
    • a6775a6 :memo: Add changelog entry
    • 7ce4e74 :wrench: USe color range for example badge
    • a3f7e7f :memo: Improve documentation
    • 6511e52 :memo: Tweak documentation
    • e43bdee :sparkles: Tweak formatting of the code
    • 3dd7c22 :sparkles: Apply clang-format
    • ee32073 :wrench: Fix typo
    • 9bce11b Merge pull request #11 from LucasWolfgang/master
    • 53c821a :tada: Added saturation and lightness parameters
    • 6363528 :tada: Added saturation and lightness parameters
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies github_actions 
    opened by dependabot[bot] 2
  • Address cuda warnings and spaCy version warning.

    When running:

    import dacy
    
    for model in dacy.models():
        print(model)
    
    dacy_nlp = dacy.load('medium')
    
    doc = dacy_nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering bygget i SpaCy.")
    
    print('hej')
    

    I get the following warnings:

    
    da_dacy_small_tft-0.0.0
    da_dacy_medium_tft-0.0.0
    da_dacy_large_tft-0.0.0
    da_dacy_small_trf-0.1.0
    da_dacy_medium_trf-0.1.0
    da_dacy_large_trf-0.1.0
    /venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_medium_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
      warnings.warn(warn_msg)
    /venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_small_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
      warnings.warn(warn_msg)
    /venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
      warnings.warn(warn_msg)
    /venv/lib/python3.9/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
      warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
    /venv/lib/python3.9/site-packages/spacy/pipeline/attributeruler.py:150: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
      matches = self.matcher(doc, allow_missing=True, as_spans=False)
    hej
    

    Notably, this includes three warnings: the spaCy version, the CUDA device, and the matcher object (see also #72).

    This issue was originally sent to me by mail.

    Note: while these warnings appear, DaCy still works as intended; the spaCy version does not influence model performance.
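
    If the warnings are a nuisance, one possible workaround (plain standard-library warning filtering, not a DaCy API) is:

    import warnings

    # Suppress the spaCy compatibility warning (W095) before loading the model.
    warnings.filterwarnings("ignore", message=r"\[W095\]")

    import dacy

    nlp = dacy.load("medium")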

    opened by KennethEnevoldsen 2