nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Overview


nlabel is currently alpha software and in an early stage of development.


nlabel is also a system to collate results from various taggers and keep track of used models and configurations.

Apart from its standard persistence through sqlite and json files, nlabel's binary arriba format combines especially low storage requirements with high performance (see benchmarks below).

Through arriba, nlabel is thus especially suitable for

  • inspecting many features on few documents
  • inspecting few features on many documents

To support external tool chains, nlabel supports exporting to REFI-QDA.

Quick Start

Processing text occurs in two steps. First, an NLP instance is built from an existing NLP pipeline:

from nlabel import NLP

import spacy

nlp = NLP(spacy.load("en_core_web_sm"), renames={
    'pos': 'upos',
    'tag': 'xpos'
}, require_gpu=False)

In the example above, nlp now contains a pipeline based on spacy's en_core_web_sm model. We instruct nlp to rename two tags, namely pos to upos and tag to xpos; embedding vectors can additionally be requested via a vectors argument (see Computing Embeddings below).

In the next step, we run the pipeline and look at its output:

doc = nlp(
    "If you're going to San Francisco,"
    "be sure to wear some flowers in your hair.")

for sent in doc.sentences:
    for token in sent.tokens:
        print(token.text, token.upos, token.vector)

You can ask a doc which tags it carries by calling doc.tags. In the example above, this would give:

['dep', 'ent_iob', 'lemma', 'morph', 'sentence', 'token', 'upos', 'xpos']

In the following sections, some of nlabel's internal concepts will be explained. To get directly to code that generates archives for document collections, skip to Importing a CSV to a local archive.

Tags and Labels

nlabel handles everything as a tag, even if it has no label. That means that nlabel regards tokens and sentences as tags (that happen to carry no labels). Tags can both be iterated over and asked for their labels. Tags can also be regarded as containers that contain other tags. The following examples illustrate these concepts:

for ent in doc.ents:
    print(ent.label, ent.text)

outputs

GPE San Francisco,

while

for ent in doc.ents:
    for token in ent.tokens:
        print(ent.label, token.text, token.xpos)

outputs

GPE San PROPN
GPE Francisco PROPN

NLP engines

To plug in a different nlp engine, simply construct nlp from a different pipeline:

import stanza
nlp = NLP(stanza.Pipeline('en'))

Since we renamed tag and pos in the spacy example above, the earlier code works here without any additional changes.

At the moment nlabel has implementations for spacy, stanza, flair and deeppavlov. You can also write your own nlp data generators (based on nlabel.nlp.Tagger).

While NLP usually auto-detects the type of NLP parser you provide it, there are specialized constructors (NLP.spacy, NLP.flair, etc.) that cover some border cases.
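For example, the Quick Start pipeline could also be built through the explicit constructor (a minimal sketch using the same model as above):

import spacy
from nlabel import NLP

# explicit constructor instead of relying on auto-detection
nlp = NLP.spacy(spacy.load("en_core_web_sm"))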

Saving and Loading Documents

Documents can be saved to disk:

doc.save("path/to/file")

By default, this will generate a json-based format that should be easy to parse, even if you decide not to use nlabel after this point - see the bahia json documentation.

Of course, you can also use nlabel to load its own documents:

from nlabel import Document

with Document.open("path/to/file") as doc:
    for sent in doc.sentences:
        for token in sent.tokens:
            print(token.text, token.upos, token.vector)

Working with Archives

When storing data from multiple taggers and texts, the approach from the last section would generate lots of separate files. nlabel offers a much better alternative through archives.

There is more detailed information on archives later on; for now, here is a quick run-through of how to use them.

A first example

This creates a new archive (or opens an existing one) using the carenero engine (details later on) and adds a newly parsed document to it:

from nlabel import open_archive

with open_archive("/path/to/archive", engine="carenero") as archive:
    doc = nlp(text)
    archive.add(doc)

Opening the archive later would allow us to retrieve all documents:

with open_archive("/path/to/archive", "r") as archive:
    for doc in archive.iter():
        print(doc.text)

Archives also know a few more things, such as the number of documents they contain - use len(archive) - and information about their taggers (see next section).
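For instance (a small sketch, reusing the archive from the examples above):

from nlabel import open_archive

with open_archive("/path/to/archive", "r") as archive:
    print(len(archive))        # number of documents in the archive
    print(archive.taggers[0])  # signature of the first tagger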

Multiple Taggers

Things get interesting when using more than one tagger, e.g.:

with open_archive("/path/to/archive", engine="carenero") as archive:
    archive.add(nlp1(text))  # e.g. spacy
    archive.add(nlp2(text))  # e.g. stanza

In such an archive, calling archive.iter() will produce an error:

there are 2 taggers with conflicting tag names in this archive,
please use a selector

The reason for this error message is that spacy's and stanza's tag names clash, and nlabel would not know whether doc.tokens should map to spacy's or to stanza's token data.

To resolve this issue, we can specify which tagger to use in iter.

To do this, we can first ask the archive which taggers it knows by calling archive.taggers. Each tagger carries a unique signature that identifies it. For example, print(archive.taggers[0]) might print the following signature:

env:
  machine: arm64
  platform: macOS-12.1-arm64-arm-64bit
  runtime:
    nlabel: 0.0.1.dev0
    python: 3.9.7
library:
  name: spacy
  version: 3.2.1
model:
  lang: en
  name: core_web_sm
  version: 3.2.0
renames:
  pos: upos
  tag: xpos
type: nlp
vectors:
  token:
    type: native

To iterate over documents getting tag data from this tagger, we can use archive.iter(archive.taggers[0]).
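Put together, a brief sketch (assuming the archive from above and the tagger whose signature was just shown, i.e. with pos renamed to upos):

from nlabel import open_archive

with open_archive("/path/to/archive", "r") as archive:
    spacy_tagger = archive.taggers[0]
    for doc in archive.iter(spacy_tagger):
        for token in doc.tokens:
            print(token.text, token.upos)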

More commonly, we want to select a tagger based on its attributes rather than on its index in an archive. To do this, we can use a MongoDB-style query syntax:

spacy_tagger = archive.taggers[{
    'library': {
        'name': 'spacy'
    }
}]

This will return the tagger that carries the name 'spacy' in the 'library' section of its signature. If there is no such tagger, or if there are several, a KeyError is raised.

As shorthand for the query above, you can also use:

spacy_tagger = archive.taggers[{
    'library.name': 'spacy'
}]

Mixing and Bridging Taggers

What happens if we do not want exactly one tagger, but output from multiple taggers?

Archive.iter() also allows us to specify individual tags and even to rename them.

Using spacy_tagger from the last section and a new stanza_tagger:

for doc in archive.iter(
        spacy_tagger.sentence, spacy_tagger.xpos,
        stanza_tagger.xpos.to('st_xpos')):

With these docs, we can now access spacy's sentence and xpos tags, as well as stanza's xpos tag, which we rename to st_xpos to avoid a name clash with spacy's xpos tag:

    for token in doc.tokens:  # spacy tokens
        print(token.xpos)  # spacy xpos
        print(token.st_xpos)  # stanza xpos

Note that this only works if stanza's tokenization for a token exactly matches spacy's.

The Design of nlabel and Inherent Quirks

nlabel does not differentiate between tags and structuring entities such as sentences and tokens. All of them are the same concept to nlabel: labeled spans that can act as containers for other spans.

What can look like a bug at times is in fact a conscious design decision: nlabel is completely agnostic about tags, knowing only a single concept that it applies to everything.

Due to this design, there are various formulations in the API that are perfectly valid but rather confusing.

Obviously, it is desirable to write code that avoids these valid but quirky formulations.

Anything is a span with a label

The code below will look for a tag called "pos" that is perfectly aligned with the current token. If such a tag exists, nlabel considers it to be the "token's pos tag", and will return this tag's label.

for token in doc.tokens:
    print(token.pos)

Here is a quirky twist on the code above:

for token in doc.tokens:
    print(token.sentence)

This is allowed. The code will do the same thing as above: first it looks for a tag called "sentence" that is perfectly aligned with the current token. If such a tag exists, its label is returned.

Since the "sentence" tags provided by nlp libraries carry no labels, and "sentence" tags are not aligned to "token" tags, this will fail at step one or two, and therefore just return an empty label. Still, it is valid in terms of nlabel's concepts.

Using the "label" attribute

for ent in sentence.ents:
    print(ent.label)

The following code does exactly the same thing (avoid using it):

for ent in sentence.ents:
    print(ent.ent)

Label Types

There are four label types in nlabel:

          description                                 notes
labels    all labels, consisting of value and score
label     first label only                            ignores ensuing labels
strs      string list of label values                 ignores scores
str       first label value as string                 ignores score and ensuing labels

strs and labels are suitable for getting output from taggers that return multiple labels.

The default type is str. The exceptions to this rule are morphology tags (e.g. spacy's morph and stanza's feats), which default to strs.

To specify label types, use the .to(label_type=x) method on tags when passing them to Archive.iter or Group.view.
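As a hedged illustration (spacy_tagger is assumed to come from the archive examples above, and the attribute used to read the scored labels back is an assumption):

# sketch: request all labels (value and score pairs) for spacy's 'ent' tag
for doc in archive.iter(spacy_tagger.ent.to(label_type='labels')):
    for ent in doc.ents:
        print(ent.labels)  # assumed to yield all labels, each with value and score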

Groups and Views

Groups are an underlying building block of nlabel. You might not encounter them directly.

A group contains data from multiple taggers for one shared text. If you need to collect data for multiple texts, use archives.

Documents can be combined into Groups, which will then contain information from multiple taggers:

from nlabel import Group

group = Group.join([doc1, doc2])

Groups have a view method that works similarly to the iter method available on archives.
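A rough sketch of what that might look like (group.taggers is an assumption here, mirroring archive.taggers):

# hypothetical: pick taggers from the group and build a combined view
spacy_tagger = group.taggers[{'library.name': 'spacy'}]
stanza_tagger = group.taggers[{'library.name': 'stanza'}]

doc = group.view(
    spacy_tagger.sentence, spacy_tagger.token, spacy_tagger.xpos,
    stanza_tagger.xpos.to('st_xpos'))
for token in doc.tokens:
    print(token.xpos, token.st_xpos)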

Computing Embeddings

The following code uses a spacy model to generate token vectors from spacy's native vector attribute:

nlp = NLP.spacy(
    spacy_model,
    vectors={'token': nlabel.embeddings.native})

Spacy's vector attribute is usually filled via spacy's own Tok2Vec and Transformer components or external extensions such as spacy-sentence-bert.

Alternatively, the following code constructs a model that computes transformer embeddings for tokens via flair:

nlp = NLP.flair(
    vectors={'token': nlabel.embeddings.huggingface(
        "dbmdz/bert-base-german-cased", layers="-1, -2")},
    from_spacy=spacy_model)

from_spacy indicates that the sentence splitter and tokenizer should be taken from the provided spacy model.

Archives

Engines

nlabel comes with three different persistence engines:

  • carenero is designed for collecting data, especially in batch settings: it supports restartability and transaction safety, and it can export the full data or subsets of it to bahia or arriba.
  • bahia is suitable for archival purposes, as it is just a thin wrapper around a zip of human-readable json files; it is not the ideal format for exports.
  • arriba is a binary format optimized for read performance and suited to data analysis; it is not suitable for exports.

Storage Size

The following graph shows data from a real-world dataset consisting of 18861 texts (125.3 MB of text data), tagged with 4 taggers and a total of 31 tags (no embedding data). The y axis shows size in GB (note the logarithmic scale). REFI-QDA is roughly 100 times the size of arriba.

[Figure: storage size requirements for different engines]

Random Access Speeds

The exact speed of arriba depends on the task and data, but arriba often performs 10 to 100 times faster than bahia and carenero on real-world projects. The following comes from the same data set as above (extracting all POS tags from one of 4 taggers over 2000 documents):

[Figure: access times for different engines]

The carenero/ALL benchmark shows the time for accessing all tags from all taggers through carenero.

More Engine Details

All three engines support storing both tagging data and embedding vectors. In the ordering above (carenero, bahia, arriba), they go from slower to faster.

                        carenero   bahia   arriba
data collection            +         -*       -
exporting                  +         -        -
read speeds                -         -        +
suitable for archival      -         +        -

(*) bahia supports writes, but it does not avoid adding duplicates and does not support proper restartability in batch settings, i.e. it is not suited to incremental updates.

Additional Examples

Importing a CSV to a local archive

Create a carenero archive from a CSV:

from nlabel.importers import CSV

import spacy

csv = CSV(
    "/path/to/some.csv",
    keys=['zeitung_id', 'text_type_id', 'filename'],
    text='text')
csv.importer(spacy.load("en_core_web_sm")).to_local_archive()

This will create an archive located in the same folder as the CSV. The code above is restartable, i.e. it is okay to interrupt and continue later - it will not add duplicate entries.

Once the archive has been created, one can either use it directly, e.g. by iterating over its documents:

from nlabel import open_archive

with open_archive("some/archive.nlabel", mode="r") as archive:
    for doc in archive.iter(some_selector):
        for x in doc.tokens:
            print(x.text, x.xpos, x.vector)

Or, one can save the archive to different formats for faster traversal:

archive.save("demo2", engine="bahia")
archive.save("demo3", engine="arriba")

The open_archive call from above works with all archive types.

Note that the iter call on archives takes an optional view description that allows picking/renaming tags as described earlier.
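For instance, some_selector in the snippet above might simply be a tagger picked via the query syntax shown earlier (a hypothetical illustration):

# hypothetical: restrict iteration to the spacy tagger's tags
some_selector = archive.taggers[{'library.name': 'spacy'}]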

Exporting to a remote archive

For larger jobs, it is often useful to separate computation and storage, and to allow multiple computation processes (both of which often apply in GPU cluster environments). Since carenero's sqlite backend is bad at handling concurrent writes, the solution is to start a dedicated web service that handles the writing on a dedicated machine.

On machine A, start an archive server (it will write a carenero archive to the given path):

python -m nlabel.importers.server /path/to/archive.nlabel --password your_pwd

On machine B, you can start one or multiple importers writing to that remote archive. Modifying the example from the local archive:

from nlabel import RemoteArchive

remote_archive = RemoteArchive("http://localhost:8000", ("user", "your_pwd"))
csv.importer(spacy.load("en_core_web_sm")).to_remote_archive(
    remote_archive, batch_size=8)

Exporting REFI-QDA

The following code exports ent tags to a REFI-QDA project.

from nlabel import NLP

import spacy
nlp = NLP(spacy.load("en_core_web_lg"))
text = 'some longer text...'
doc = nlp(text)

doc.save_to_qda(
    "/path/to/your.qdp", {
        'tagger': {
        },
        'tags': {
            'ent'
        }
    })

A save_to_qda method is also part of carenero and bahia archives.
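A minimal sketch of that variant (assuming the archive-level method accepts the same selection dict as doc.save_to_qda above):

from nlabel import open_archive

with open_archive("/path/to/archive.nlabel", mode="r") as archive:
    archive.save_to_qda(
        "/path/to/your.qdp", {
            'tagger': {
            },
            'tags': {
                'ent'
            }
        })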
