nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Overview

Logo

nlabel is currently alpha software and in an early stage of development.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is also a system to collate results from various taggers and keep track of used models and configurations.

Apart from its standard persistence through sqlite and json files, nlabel's binary arriba format combines especially low storage requirements with high performance (see benchmarks below).

Through arriba, nlabel is thus especially suitable for

  • inspecting many features on few documents
  • inspecting few features on many documents

To support external tool chains, nlabel supports exporting to REFI-QDA.

Quick Start

Processing text works occurs in two steps. First, a NLP instance is built from an existing NLP pipeline:

from nlabel import NLP

import spacy

nlp = NLP(spacy.load("en_core_web_sm"), renames={
    'pos': 'upos',
    'tag': 'xpos'
}, require_gpu=False)

In the example, above nlp now contains a pipeline based on spacy's en_core_web_sm model. We instruct nlp to generate embedding vectors via vectors, and to rename two tags, namely pos to upos and tag to xpos.

In the next step, we run the pipeline and look at its output:

doc = nlp(
    "If you're going to San Francisco,"
    "be sure to wear some flowers in your hair.")

for sent in doc.sentences:
    for token in doc.tokens:
        print(token.text, token.upos, token.vector)

You can ask a doc which tags it carries, by calling doc.tags. In the example above, this would give:

['dep', 'ent_iob', 'lemma', 'morph', 'sentence', 'token', 'upos', 'xpos']

In the following sections, some of internal concepts will be explained. To get directly to code that will generate archives for document collections, skip to Importing a CSV to a local archive.

Tags and Labels

nlabel handles everything as tags, even it is has no label. That means that nlabel regards tokens and sentences as as tags with labels. Tags can both be iterated but also asked for labels. Tags can also be regarded as containers that contain other tags. The following examples illustrate the concepts:

for ent in doc.ents:
    print(ent.label, ent.text)

outputs

GPE San Francisco,

while

for ent in doc.ents:
    for token in ent.tokens:
        print(ent.label, token.text, token.xpos)

outputs

GPE San PROPN
GPE Francisco PROPN

NLP engines

To plug in a different nlp engine, set nlp differently:

import stanza
nlp = NLP(stanza.Pipeline('en'))

Since we renamed tag and pos, in the spacy example above, this would work without additional work.

At the moment nlabel has implementations for spacy, stanza, flair and deeppavlov. You can also write your own nlp data generators (based on nlabel.nlp.Tagger).

While NLP usually auto-detects the type of NLP parser you provide it, there are specialized constructors (NLP.spacy, NLP.flair, etc.) that cover some border cases.

Saving and Loading Documents

Documents can be saved to disk:

doc.save("path/to/file")

By default, this will generate a json-based format that should be easy to parse, even if you do decide to not use nlabel after this point - see bahia json documentation.

Of course, you can also use nlabel to load its own documents:

from nlabel import Document

with Document.open("path/to/file") as doc:
    for sent in doc.sentences:
        for token in sent.tokens:
            print(token.text, token.upos, token.vector)

Working with Archives

To store data from multiple taggers and texts, the approach from the last section would generate lots of separate files. nlabel offers a much better alternative through Archives.

There will be more detailed info on archives later on, for now, here is a quick run-through of how to use them.

A first example

This creates (or opens an existing) archive using the carenero engine (details later on), and adds a newly parsed document to it.

with open_archive("/path/to/archive", engine="carenero") as archive:
    doc = nlp(text)
    archive.add(doc)

Opening the archive later would allow us to retrieve all documents:

with open_archive("/path/to/archive", "r") as archive:
    for doc in archive.iter():
        print(doc.text)

Archives know some more things like the number of documents - use len(archive) - or information about its taggers (see next section).

Multiple Taggers

Things get interesting when using more than one tagger, e.g.:

with open_archive("/path/to/archive", engine="carenero") as archive:
    archive.add(nlp1(text))  # e.g. spacy
    archive.add(nlp2(text))  # e.g. stanza

In such an archive, calling archive.iter() will produce an error:

there are 2 taggers with conflicting tag names in this archive,
please use a selector

The reason for this error message is that spacy's and stanza's tag names clash, and nlabel would not know how to deciper doc.tokens to map either to spacy's or stanza's token data.

To resolve this issue, we can specify which tagger to use in iter.

To do this, we can first ask the archive for the taggers it knows by calling archive.taggers. Each tagger carries a unique signature that identifies it. For example, print(archive.taggers[0]) might the following signature:

env:
  machine: arm64
  platform: macOS-12.1-arm64-arm-64bit
  runtime:
    nlabel: 0.0.1.dev0
    python: 3.9.7
library:
  name: spacy
  version: 3.2.1
model:
  lang: en
  name: core_web_sm
  version: 3.2.0
renames:
  pos: upos
  tag: xpos
type: nlp
vectors:
  token:
    type: native

To iterate over documents getting tag data from this tagger, we can use archive.iter(archive.taggers[0]).

More commonly, we want to select a tagger based on its attributes, not on its index in an archive. To do this, we can use a MongoDB style query syntax:

spacy_tagger = archive.taggers[{
    'library': {
        'name': 'spacy'
    }
}]

This will return the tagger, that carries the name 'spacy' in the 'library' section of its signature. If there are no or multiple such taggers, we will get a KeyError.

As shorthand for the query above, you can also use:

spacy_tagger = archive.taggers[{
    'library.name': 'spacy'
}]

Mixing and Bridging Taggers

What happens if we want not exactly one tagger, but the output from multiple taggers.

Archive.iter() also allows to specify single tags and even rename them.

Using spacy_tagger from the last section and a new stanza_tagger:

for doc in archive.iter( spacy_tagger.sentence, spacy_tagger.xpos, stanza_tagger.xpos.to('st_xpos'))):

With these docs, we now can access spacy's sentence and xpos tags, but also stanza's xpos tag, which we rename to st_xpos to avoid a name clash with spacy's `xpos' tag:

    for token in doc.tokens:  # spacy tokens
        print(token.xpos)  # spacy xpos
        print(token.st_xpos)  # stanza xpos

Note that this only works, if stanza's tokenization for a token exactly matches that of spacy.

The Design of nlabel and Inherent Quirks

nlabel does not differentiate between tags and structuring entities such as sentences and tokens. All of them are the same concept to nlabel: labeled spans, that can be containers to other spans.

What can look like a bug at times, is a very conscious design decision: nlabel is completely agnostic to tags in terms of knowing only a single concept that it applies to everything.

Due to this design, there are various formulations in the API that are perfectly valid but rather confusing.

Obviously, it is desirable to write code that avoids these valid but quirky formulations.

Anything is a span with a label

The code below will look for a tag called "pos" that is perfectly aligned with the current token. If such a tag exists, nlabel considers it to be the "token's pos tag", and will return this tag's label.

for token in doc.tokens:
    print(token.pos)

Here is a quirky twist on the code above:

for token in doc.tokens:
    print(token.sentence)

This is allowed. The code will do the same thing as above: first it looks for a tag called "sentence" that is perfectly aligned with the current token. If such a tag exists, its label is returned.

Since the "sentence" tags provided by nlp libraries carry no labels, and "sentence" tags are not aligned to "token" tags, this will fail at step one or two, and therefore just return an empty label. Still, it is valid in terms of nlabel's concepts.

Using the "label" attribute

for ent in sentence.ents:
    print(ent.label)

The following code does exactly the same thing (avoid using it):

for ent in sentence.ents:
    print(ent.ent)

Label Types

There are four label types in nlabel:

description notes
labels all labels constisting of value and score
label first label only ignores ensuing labels
strs string list of label values ignores scores
str first label value as string ignores score and ensuing labels

strs and labels are suitable for getting output from taggers that return multiple labels.

The default type is str. The exception to this rule are morphology tags (e.g. spacy's morph and stanza's feats, which default to strs).

To specify label types, use the .to(label_type=x) method on tags, when specifying them to Archive.iter or Group.view.

Groups and Views

Groups are an underlying building block of nlabel. You might not encounter them directly.

A group contains data from multiple taggers for one shared text. If you need to collect data for multiple texts, use archives.

Documents can be combined into Groups, which will then contain information from multiple taggers:

from nlabel import Group

group = Group.join([doc1, doc2])

Groups have a view method that works similar to the iter method available in Archives.

Computing Embeddings

The following code uses a spacy model to generate token vectors from spacy's native vector attribute:

nlp = NLP.spacy(
    spacy_model,
    vectors={'token': nlabel.embeddings.native})

Spacy's vector attribute is usually filled via spacy's own Tok2Vec and Transformer components or external extensions such as spacy-sentence-bert.

Alternatively, the following code constructs a model that computes transformer embeddings for tokens via flair:

nlp = NLP.flair(
    vectors={'token': nlabel.embeddings.huggingface(
        "dbmdz/bert-base-german-cased", layers="-1, -2")},
    from_spacy=spacy_model)

from_spacy indicates that sentence splitter and tokenizer should be taken from the provided spacy model.

Archives

Engines

nlabel comes with three different persistence engines:

  • carenero is for collecting data, esp. in a batch setting - by supporting restartability and transaction safety, and enabling export of full data or sub sets of it into bahia or arriba.
  • bahia is suitable for archival purposes, as it is just a thin wrapper around a zip of human-readable json files; it is not the ideal format for exports.
  • arriba is a binary format optimized for read performance, it is suitable for data analysis; it is not suitable for exports.

Storage Size

The following graph shows data from a real-world dataset, consisting of 18861 texts (125.3 MB text data), tagged with 4 taggers and a total of 31 tags (no embedding data). Y axis shows size in GB (note logarithmic scale). REFI-QDA is roughly 100 times the size of arriba.

storage size requirements for different engines

Random Access Speeds

The exact speed of arriba depends on the task and data, but but often arriba performs 10 to 100 times faster than bahia and carenero on real-world projects. From the same data set as earlier (when extracting all POS tags from one of 4 taggers over 2000 documents):

access times for different engines

The carenero/ALL benchmarks shows the time when accessing all tags from all taggers through carenero.

More Engine Details

These engines support storing both tagging data and embedding vectors. In the ordering above, they go from slower to faster.

carenero bahia arriba
data collection + - -
exporting + - -
read speeds - - +
suitable for archival - + -

(*) bahia supports writes, but does not avoid adding duplicates or support proper restartability in batch settings, i.e. it is not suited to incremental updates.

Additional Examples

Importing a CSV to a local archive

Create a carenero archive from a CSV:

from nlabel.importers import CSV

import spacy

csv = CSV(
    "/path/to/some.csv",
    keys=['zeitung_id', 'text_type_id', 'filename'],
    text='text')
csv.importer(spacy.load("en_core_web_sm")).to_local_archive()

This will create an archive located in the same folder as the CSV. The code above is restartable, i.e. it is okay to interrupt and continue later - it will not add duplicate entries.

Once the archive has been created, one can either use it directly, e.g. iterating its documents:

from nlabel import open_archive

with open_archive("some/archive.nlabel", mode="r") as archive:
    for doc in archive.iter(some_selector):
        for x in doc.tokens:
            print(x.text, x.xpos, x.vector)

Or, one can save the archive to different formats for faster traversal:

archive.save("demo2", engine="bahia")
archive.save("demo3", engine="arriba")

The open_archive call from above works with all archive types.

Note that the iter call on archives takes an optional view description that allows picking/renaming tags as described earlier.

Exporting to a remote archive

For larger jobs, it is often useful to separate computation and storage, and to allow multiple computation processes (both often applies to GPU cluster environments). Since carenero's sqlite is bad at handling concurrent writes, the solution is starting a dedicated web service that handles the writing on a dedicated machine.

On machine A, start an archive server (it will write a carenero archive to the given path):

python -m nlabel.importers.server /path/to/archive.nlabel --password your_pwd

On machine B, you can start one or multiple importers writing to that remote archive. Modifying the example from the local archive:

from nlabel import RemoteArchive

remote_archive = RemoteArchive("http://localhost:8000", ("user", "your_pwd"))
csv.importer(spacy.load("en_core_web_sm")).to_remote_archive(
    remote_archive, batch_size=8)

Exporting REFI-QDA

The following code exports ent tags to a REFI-QDA project.

from nlabel import NLP

import spacy
nlp = NLP(spacy.load("en_core_web_lg"))
text = 'some longer text...'
doc = nlp(text)

doc.save_to_qda(
    "/path/to/your.qdp", {
        'tagger': {
        },
        'tags': {
            'ent'
        }
    })

A save_to_qda method is also part of cantenero and bahia archives.

Owner
Bernhard Liebl
Bernhard Liebl
This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Improving Transformer Models by Reordering their Sublayers This repository contains the code for running the character-level Sandwich Transformers fro

Ofir Press 53 Sep 26, 2022
GPT-3 command line interaction

Writer_unblock Straight-forward command line interfacing with GPT-3. Finding yourself stuck at a conceptual stage? Spinning your wheels needlessly on

Seth Nuzum 6 Feb 10, 2022
🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation 한국어 | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

18 Nov 28, 2022
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. Flair is: A powerful NLP library. Flair allo

flair 12.3k Jan 02, 2023
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
Code for the paper "Are Sixteen Heads Really Better than One?"

Are Sixteen Heads Really Better than One? This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than

Paul Michel 143 Dec 14, 2022
Klexikon: A German Dataset for Joint Summarization and Simplification

Klexikon: A German Dataset for Joint Summarization and Simplification Dennis Aumiller and Michael Gertz Heidelberg University Under submission at LREC

Dennis Aumiller 8 Jan 03, 2023
Pre-Training with Whole Word Masking for Chinese BERT

Pre-Training with Whole Word Masking for Chinese BERT

Yiming Cui 7.7k Dec 31, 2022
vits chinese, tts chinese, tts mandarin

vits chinese, tts chinese, tts mandarin 史上训练最简单,音质最好的语音合成系统

AmorTX 12 Dec 14, 2022
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

537 Jan 05, 2023
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

Steven Loria 8.4k Dec 26, 2022
🏖 Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

Axel Springer Ideas Engineering GmbH 231 Nov 18, 2022
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA Introduction ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using

Google Research 2.1k Dec 28, 2022