A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

Last update: Jan 03, 2023

Overview

spaCyOpenTapioca

A spaCy wrapper of OpenTapioca for named entity linking on Wikidata.

Installation
How to use
Local OpenTapioca
Visualization

Installation

pip install spacyopentapioca

git clone https://github.com/UB-Mannheim/spacyopentapioca
cd spacyopentapioca/
pip install .

How to use

After installation, the OpenTapioca pipeline can be used on its own, without any other pipeline components:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works in Germany.")
for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))

('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)

The types and aliases are also available:

for span in doc.ents:
    print((span._.types, span._.aliases[0:5]))

({'Q43229': False, 'Q618123': False, 'Q5': True, 'P2427': False, 'P1566': False, 'P496': True}, ['كريستيان دروستين', 'Крістіан Дростен', 'Christian Heinrich Maria Drosten', 'کریستین دروستن', '크리스티안 드로스텐'])
({'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': False, 'P1566': True, 'P496': False}, ['IJalimani', 'R. F. A.', 'Alemania', '도이칠란트', 'Germaniya'])
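
Each span._.types value maps a Wikidata ID to a boolean, so the IDs that actually apply to an entity can be pulled out with a short helper. A minimal sketch, using the dictionary printed above for 'Christian Drosten' as input; the function name matched_types is made up for this example and is not part of spaCyOpenTapioca:

```python
def matched_types(types):
    """Return the Wikidata IDs whose flag is True, in the order they appear."""
    return [wikidata_id for wikidata_id, flag in types.items() if flag]


# The types dictionary printed above for 'Christian Drosten'.
drosten_types = {'Q43229': False, 'Q618123': False, 'Q5': True,
                 'P2427': False, 'P1566': False, 'P496': True}

print(matched_types(drosten_types))  # ['Q5', 'P496']
```

With a live pipeline, the same helper could be called as matched_types(span._.types) for each span in doc.ents.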

The Wikidata QIDs are attached to tokens:

for token in doc:
    print((token.text, token.ent_kb_id_))

('Christian', 'Q1079331')
('Drosten', 'Q1079331')
('works', '')
('in', '')
('Germany', 'Q183')
('.', '')
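
Because the QID sits on every token of a mention, the token-level output can be folded back into mentions. The sketch below works on the (text, QID) pairs printed above and adds a Wikidata page URL for each mention. The helper name group_linked_tokens is hypothetical, tokens are joined with a single space (an approximation of the original spacing), and two different adjacent mentions that share a QID would be merged into one:

```python
from itertools import groupby


def group_linked_tokens(token_pairs):
    """Merge runs of adjacent tokens that share the same non-empty QID."""
    mentions = []
    for qid, run in groupby(token_pairs, key=lambda pair: pair[1]):
        if qid:  # skip tokens that carry no link
            text = " ".join(token_text for token_text, _ in run)
            mentions.append((text, qid, f"https://www.wikidata.org/wiki/{qid}"))
    return mentions


# The (token.text, token.ent_kb_id_) pairs printed above.
token_pairs = [('Christian', 'Q1079331'), ('Drosten', 'Q1079331'), ('works', ''),
               ('in', ''), ('Germany', 'Q183'), ('.', '')]

for mention in group_linked_tokens(token_pairs):
    print(mention)
```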

The raw response of the OpenTapioca API can be accessed through the doc and span objects:

raw_annotations1 = doc._.annotations
raw_annotations2 = [span._.annotations for span in doc.ents]

Partial metadata for the response returned by the OpenTapioca API is available in:

doc._.metadata

The available span extensions are:

span._.annotations
span._.description
span._.aliases
span._.rank
span._.score
span._.types
span._.label
span._.extra_aliases
span._.nb_sitelinks
span._.nb_statements
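
For export to JSON or a table, the extensions listed above can be copied into a plain dictionary. This is a sketch under assumptions: span_to_record is a hypothetical helper, and a stand-in object replaces a real spaCy span so that the example runs without calling the OpenTapioca API. Extensions that are missing on the object come back as None.

```python
from types import SimpleNamespace

# Extension names as listed above.
EXTENSION_NAMES = (
    "annotations", "description", "aliases", "rank", "score", "types",
    "label", "extra_aliases", "nb_sitelinks", "nb_statements",
)


def span_to_record(span):
    """Copy the custom extensions of one span into a plain dictionary."""
    record = {name: getattr(span._, name, None) for name in EXTENSION_NAMES}
    record["text"] = span.text
    record["kb_id"] = span.kb_id_
    return record


# Stand-in for a linked span; with a real pipeline, pass spans from doc.ents.
fake_span = SimpleNamespace(
    text="Germany",
    kb_id_="Q183",
    _=SimpleNamespace(description="sovereign state in Central Europe",
                      score=2.1099332471902863),
)

record = span_to_record(fake_span)
print(record["text"], record["kb_id"], record["description"])
```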

Note that spaCyOpenTapioca applies some light processing to the entities that appear in doc.ents. All entities returned by OpenTapioca can be found in doc.spans['all_entities_opentapioca'].

Local OpenTapioca

If OpenTapioca is deployed locally, specify the URL of the local OpenTapioca API in the config:

import spacy
nlp = spacy.blank("en")
# OpenTapiocaAPI is a placeholder: set it to the URL of your local OpenTapioca API.
nlp.add_pipe('opentapioca', config={"url": OpenTapiocaAPI})
doc = nlp("Christian Drosten works in Germany.")
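
One way to avoid hard-coding the URL is to read it from the environment and pass a config only when a value is set. This is a sketch, not a feature of the package: the variable name OPENTAPIOCA_URL and the helper opentapioca_config are assumptions made for this example, and the fallback relies on add_pipe using the component's default URL when no "url" key is given.

```python
import os


def opentapioca_config(env_var="OPENTAPIOCA_URL"):
    """Build the add_pipe config; empty when no local URL is configured."""
    url = os.environ.get(env_var, "").strip()
    return {"url": url} if url else {}


config = opentapioca_config()
print(config)

# With spaCy and spacyopentapioca installed:
# nlp.add_pipe('opentapioca', config=opentapioca_config())
```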

Visualization

NER visualization in spaCy via displaCy cannot yet show the links to entities. Support for this could be added to spaCy as proposed in issue 9129.
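
The spaCy documentation for displaCy describes kb_id and kb_url keys for manually supplied entities, which may cover this use case in newer spaCy versions. The sketch below only builds that input structure from (mention, QID, label) triples. The helper to_displacy_manual is hypothetical, offsets are found by plain string search (with a real doc, span.start_char and span.end_char would be the better source), and whether the links are rendered depends on the installed spaCy version.

```python
def to_displacy_manual(text, linked_entities):
    """Build displaCy 'manual' input from (mention, QID, label) triples."""
    ents = []
    search_from = 0
    for mention, qid, label in linked_entities:
        start = text.find(mention, search_from)
        if start == -1:
            continue  # mention not found; skip instead of guessing offsets
        end = start + len(mention)
        ents.append({"start": start, "end": end, "label": label,
                     "kb_id": qid,
                     "kb_url": f"https://www.wikidata.org/entity/{qid}"})
        search_from = end
    return {"text": text, "ents": ents, "title": None}


example = to_displacy_manual(
    "Christian Drosten works in Germany.",
    [("Christian Drosten", "Q1079331", "PERSON"), ("Germany", "Q183", "LOC")],
)
print(example["ents"])

# With spaCy installed, the result can be passed to displaCy:
# from spacy import displacy
# html = displacy.render(example, style="ent", manual=True)
```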

Comments

AttributeError: 'NoneType' object has no attribute 'text' when using nlp.pipe()

Hi, when I process multiple text documents as a batch, processing fails with the error message: AttributeError: 'NoneType' object has no attribute 'text'. However, processing each text document by itself produces no such error. Here is an easy-to-reproduce example:

import spacy

docs = ["""String of 126 characters. String of 126 characters. String of 126 characters. String of 126 characters. String of 126 characte""","""Any string which is 93 characters. Any string which is 93 characters. Any string which is 93 """]
nlp = spacy.blank("en")
nlp.add_pipe("opentapioca")
for doc in nlp.pipe(docs):
    print(doc)

Full stack trace below:

AttributeError                            Traceback (most recent call last)
<command-370658210397732> in <module>
      4 nlp = spacy.blank("en")
      5 nlp.add_pipe("opentapioca")
----> 6 for doc in nlp.pipe(docs):
      7     print(doc)

/databricks/python/lib/python3.8/site-packages/spacy/language.py in pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
   1570         else:
   1571             # if n_process == 1, no processes are forked.
-> 1572             docs = (self._ensure_doc(text) for text in texts)
   1573             for pipe in pipes:
   1574                 docs = pipe(docs)

/databricks/python/lib/python3.8/site-packages/spacy/util.py in _pipe(docs, proc, name, default_error_handler, kwargs)
   1597     if hasattr(proc, "pipe"):
   1598         yield from proc.pipe(docs, **kwargs)
-> 1599     else:
   1600         # We added some args for pipe that __call__ doesn't expect.
   1601         kwargs = dict(kwargs)

/databricks/python/lib/python3.8/site-packages/spacyopentapioca/entity_linker.py in pipe(self, stream, batch_size)
    117                     self.make_request, doc): doc for doc in docs}
    118                 for doc, future in zip(docs, concurrent.futures.as_completed(future_to_url)):
--> 119                     yield self.process_single_doc_after_call(doc, future.result())

/databricks/python/lib/python3.8/site-packages/spacyopentapioca/entity_linker.py in process_single_doc_after_call(self, doc, r)
     66                                      alignment_mode='expand')
     67                 log.warning('The OpenTapioca-entity "%s" %s does not fit the span "%s" %s in spaCy. EXPANDED!',
---> 68                             ent['tags'][0]['label'][0], (start, end), span.text, (span.start_char, span.end_char))
     69             span._.annotations = ent
     70             span._.description = ent['tags'][0]['desc']

AttributeError: 'NoneType' object has no attribute 'text'

I don't know what it is about the lengths of the strings that causes the issue, but they do seem to matter in some way. Adding or removing a couple of characters from either string can resolve the issue.
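
One possible reading of the trace, offered as a hypothesis and not a confirmed diagnosis: line 118 of entity_linker.py pairs docs with concurrent.futures.as_completed(...) through zip, but as_completed yields futures in completion order, not submission order. A response could then be attached to the wrong document, and character offsets that do not fit a shorter text could produce a None span. The sketch below shows a general pattern for keeping results aligned with their inputs. The function fake_request is a stand-in for the HTTP call, and nothing here is taken from the package itself.

```python
import concurrent.futures
import time


def fake_request(text):
    """Stand-in for an HTTP call; short texts 'respond' later than long ones."""
    time.sleep(0.05 if len(text) < 5 else 0.0)
    return text.upper()


def annotate_in_order(texts):
    """Run requests concurrently and return (text, response) in input order."""
    results = [None] * len(texts)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Remember which input position each future belongs to.
        future_to_index = {executor.submit(fake_request, text): index
                           for index, text in enumerate(texts)}
        for future in concurrent.futures.as_completed(future_to_index):
            # Look the position up instead of relying on completion order.
            results[future_to_index[future]] = future.result()
    return list(zip(texts, results))


print(annotate_in_order(["a longer text", "ab"]))
```

Looking each future up in the dictionary keeps the pairing correct even when the second request finishes first; executor.map would be an alternative that preserves input order directly.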

opened by coltonpeltier-db 6

Add methods to highlights

Just as clicking a highlighted entity opens a web page, it might be possible to extend this functionality so that a user-supplied method is run when a highlighted entity is clicked.

opened by joseberlines 4

Add CodeQL workflow for GitHub code scanning

Hi UB-Mannheim/spacyopentapioca!

This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

Questions? Check out the FAQ below!

FAQ

How often will the code scanning analysis run?

By default, code scanning will trigger a scan with the CodeQL engine on the following events:

On every pull request — to flag up potential security problems for you to investigate before merging a PR.

On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.

Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

What will this cost?

Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

What types of problems does CodeQL find?

The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

How do I upgrade my CodeQL engine?

No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

The analysis doesn’t seem to be working

If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

How do I disable LGTM.com?

If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

Which source code hosting platforms does code scanning support?

GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

How do I know this PR is legitimate?

This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

I have another question / how do I get in touch?

Please join the discussion here to ask further questions and send us suggestions!
opened by lgtm-com[bot] 1

'ent_kb_id' referenced before assignment

Hello, when I try the example nlp("M. Knajdek"), an error occurs in entity_linker.py: UnboundLocalError: local variable 'ent_kb_id' referenced before assignment (line 67 of that file). This is due to the "." separator.

opened by TheNizzo 1

Added logging & Fixed Reference Error

Added logger to allow user to suppress logs coming from spacyopentapioca.

Fixed the "local variable 'etype' referenced before assignment" error at line 65.
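
Errors of this kind typically arise when a variable is assigned only inside conditional branches and none of them runs. A minimal sketch of the defensive pattern, assuming the label is derived from the types flags: the mapping of Q5, Q618123 and Q43229 to PERSON, LOC and ORG, the order of the checks, and the empty-string fallback are assumptions chosen to match the two examples under How to use, not the package's actual logic.

```python
# Hypothetical mapping from type flags to labels, checked in this order.
TYPE_TO_LABEL = (("Q5", "PERSON"), ("Q618123", "LOC"), ("Q43229", "ORG"))


def label_for_types(types, default=""):
    """Pick a label from the type flags, never leaving it unassigned."""
    etype = default  # assigned up front, so no branch can leave it unbound
    for wikidata_id, label in TYPE_TO_LABEL:
        if types.get(wikidata_id):
            etype = label
            break
    return etype


germany_types = {'Q43229': True, 'Q618123': True, 'Q5': False,
                 'P2427': False, 'P1566': True, 'P496': False}
print(label_for_types(germany_types))
```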

opened by jordanparker6 1

Releases (v.0.1.6)

v.0.1.6 (Nov 16, 2022)
fixed batching problems by @Hmkhalla

v0.1.5 (Oct 4, 2022)
added batching via nlp.pipe() by @davidberenstein1957

v0.1.4 (Nov 26, 2021)
fixed #3

v0.1.3 (Nov 5, 2021)
added docs & binder by @shigapov
fixed reference error & added logging by @jordanparker6

v0.1.2 (Sep 13, 2021)
fixed the case of overlapping spans
added span-extensions for label, extra_aliases, nb_sitelinks and nb_statements

v0.1.1 (Sep 10, 2021)
fixed entity type evaluator
explained NEL visualization

v0.1.0 (Sep 10, 2021)
sends requests to the OpenTapioca API
attaches annotations to doc- and span-objects in spaCy

Owner

Universitätsbibliothek Mannheim

Mannheim University Library

