nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Overview

Logo

nlabel is currently alpha software and in an early stage of development.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is also a system to collate results from various taggers and keep track of used models and configurations.

Apart from its standard persistence through sqlite and json files, nlabel's binary arriba format combines especially low storage requirements with high performance (see benchmarks below).

Through arriba, nlabel is thus especially suitable for

  • inspecting many features on few documents
  • inspecting few features on many documents

To support external tool chains, nlabel supports exporting to REFI-QDA.

Quick Start

Processing text works occurs in two steps. First, a NLP instance is built from an existing NLP pipeline:

from nlabel import NLP

import spacy

nlp = NLP(spacy.load("en_core_web_sm"), renames={
    'pos': 'upos',
    'tag': 'xpos'
}, require_gpu=False)

In the example, above nlp now contains a pipeline based on spacy's en_core_web_sm model. We instruct nlp to generate embedding vectors via vectors, and to rename two tags, namely pos to upos and tag to xpos.

In the next step, we run the pipeline and look at its output:

doc = nlp(
    "If you're going to San Francisco,"
    "be sure to wear some flowers in your hair.")

for sent in doc.sentences:
    for token in doc.tokens:
        print(token.text, token.upos, token.vector)

You can ask a doc which tags it carries, by calling doc.tags. In the example above, this would give:

['dep', 'ent_iob', 'lemma', 'morph', 'sentence', 'token', 'upos', 'xpos']

In the following sections, some of internal concepts will be explained. To get directly to code that will generate archives for document collections, skip to Importing a CSV to a local archive.

Tags and Labels

nlabel handles everything as tags, even it is has no label. That means that nlabel regards tokens and sentences as as tags with labels. Tags can both be iterated but also asked for labels. Tags can also be regarded as containers that contain other tags. The following examples illustrate the concepts:

for ent in doc.ents:
    print(ent.label, ent.text)

outputs

GPE San Francisco,

while

for ent in doc.ents:
    for token in ent.tokens:
        print(ent.label, token.text, token.xpos)

outputs

GPE San PROPN
GPE Francisco PROPN

NLP engines

To plug in a different nlp engine, set nlp differently:

import stanza
nlp = NLP(stanza.Pipeline('en'))

Since we renamed tag and pos, in the spacy example above, this would work without additional work.

At the moment nlabel has implementations for spacy, stanza, flair and deeppavlov. You can also write your own nlp data generators (based on nlabel.nlp.Tagger).

While NLP usually auto-detects the type of NLP parser you provide it, there are specialized constructors (NLP.spacy, NLP.flair, etc.) that cover some border cases.

Saving and Loading Documents

Documents can be saved to disk:

doc.save("path/to/file")

By default, this will generate a json-based format that should be easy to parse, even if you do decide to not use nlabel after this point - see bahia json documentation.

Of course, you can also use nlabel to load its own documents:

from nlabel import Document

with Document.open("path/to/file") as doc:
    for sent in doc.sentences:
        for token in sent.tokens:
            print(token.text, token.upos, token.vector)

Working with Archives

To store data from multiple taggers and texts, the approach from the last section would generate lots of separate files. nlabel offers a much better alternative through Archives.

There will be more detailed info on archives later on, for now, here is a quick run-through of how to use them.

A first example

This creates (or opens an existing) archive using the carenero engine (details later on), and adds a newly parsed document to it.

with open_archive("/path/to/archive", engine="carenero") as archive:
    doc = nlp(text)
    archive.add(doc)

Opening the archive later would allow us to retrieve all documents:

with open_archive("/path/to/archive", "r") as archive:
    for doc in archive.iter():
        print(doc.text)

Archives know some more things like the number of documents - use len(archive) - or information about its taggers (see next section).

Multiple Taggers

Things get interesting when using more than one tagger, e.g.:

with open_archive("/path/to/archive", engine="carenero") as archive:
    archive.add(nlp1(text))  # e.g. spacy
    archive.add(nlp2(text))  # e.g. stanza

In such an archive, calling archive.iter() will produce an error:

there are 2 taggers with conflicting tag names in this archive,
please use a selector

The reason for this error message is that spacy's and stanza's tag names clash, and nlabel would not know how to deciper doc.tokens to map either to spacy's or stanza's token data.

To resolve this issue, we can specify which tagger to use in iter.

To do this, we can first ask the archive for the taggers it knows by calling archive.taggers. Each tagger carries a unique signature that identifies it. For example, print(archive.taggers[0]) might the following signature:

env:
  machine: arm64
  platform: macOS-12.1-arm64-arm-64bit
  runtime:
    nlabel: 0.0.1.dev0
    python: 3.9.7
library:
  name: spacy
  version: 3.2.1
model:
  lang: en
  name: core_web_sm
  version: 3.2.0
renames:
  pos: upos
  tag: xpos
type: nlp
vectors:
  token:
    type: native

To iterate over documents getting tag data from this tagger, we can use archive.iter(archive.taggers[0]).

More commonly, we want to select a tagger based on its attributes, not on its index in an archive. To do this, we can use a MongoDB style query syntax:

spacy_tagger = archive.taggers[{
    'library': {
        'name': 'spacy'
    }
}]

This will return the tagger, that carries the name 'spacy' in the 'library' section of its signature. If there are no or multiple such taggers, we will get a KeyError.

As shorthand for the query above, you can also use:

spacy_tagger = archive.taggers[{
    'library.name': 'spacy'
}]

Mixing and Bridging Taggers

What happens if we want not exactly one tagger, but the output from multiple taggers.

Archive.iter() also allows to specify single tags and even rename them.

Using spacy_tagger from the last section and a new stanza_tagger:

for doc in archive.iter( spacy_tagger.sentence, spacy_tagger.xpos, stanza_tagger.xpos.to('st_xpos'))):

With these docs, we now can access spacy's sentence and xpos tags, but also stanza's xpos tag, which we rename to st_xpos to avoid a name clash with spacy's `xpos' tag:

    for token in doc.tokens:  # spacy tokens
        print(token.xpos)  # spacy xpos
        print(token.st_xpos)  # stanza xpos

Note that this only works, if stanza's tokenization for a token exactly matches that of spacy.

The Design of nlabel and Inherent Quirks

nlabel does not differentiate between tags and structuring entities such as sentences and tokens. All of them are the same concept to nlabel: labeled spans, that can be containers to other spans.

What can look like a bug at times, is a very conscious design decision: nlabel is completely agnostic to tags in terms of knowing only a single concept that it applies to everything.

Due to this design, there are various formulations in the API that are perfectly valid but rather confusing.

Obviously, it is desirable to write code that avoids these valid but quirky formulations.

Anything is a span with a label

The code below will look for a tag called "pos" that is perfectly aligned with the current token. If such a tag exists, nlabel considers it to be the "token's pos tag", and will return this tag's label.

for token in doc.tokens:
    print(token.pos)

Here is a quirky twist on the code above:

for token in doc.tokens:
    print(token.sentence)

This is allowed. The code will do the same thing as above: first it looks for a tag called "sentence" that is perfectly aligned with the current token. If such a tag exists, its label is returned.

Since the "sentence" tags provided by nlp libraries carry no labels, and "sentence" tags are not aligned to "token" tags, this will fail at step one or two, and therefore just return an empty label. Still, it is valid in terms of nlabel's concepts.

Using the "label" attribute

for ent in sentence.ents:
    print(ent.label)

The following code does exactly the same thing (avoid using it):

for ent in sentence.ents:
    print(ent.ent)

Label Types

There are four label types in nlabel:

description notes
labels all labels constisting of value and score
label first label only ignores ensuing labels
strs string list of label values ignores scores
str first label value as string ignores score and ensuing labels

strs and labels are suitable for getting output from taggers that return multiple labels.

The default type is str. The exception to this rule are morphology tags (e.g. spacy's morph and stanza's feats, which default to strs).

To specify label types, use the .to(label_type=x) method on tags, when specifying them to Archive.iter or Group.view.

Groups and Views

Groups are an underlying building block of nlabel. You might not encounter them directly.

A group contains data from multiple taggers for one shared text. If you need to collect data for multiple texts, use archives.

Documents can be combined into Groups, which will then contain information from multiple taggers:

from nlabel import Group

group = Group.join([doc1, doc2])

Groups have a view method that works similar to the iter method available in Archives.

Computing Embeddings

The following code uses a spacy model to generate token vectors from spacy's native vector attribute:

nlp = NLP.spacy(
    spacy_model,
    vectors={'token': nlabel.embeddings.native})

Spacy's vector attribute is usually filled via spacy's own Tok2Vec and Transformer components or external extensions such as spacy-sentence-bert.

Alternatively, the following code constructs a model that computes transformer embeddings for tokens via flair:

nlp = NLP.flair(
    vectors={'token': nlabel.embeddings.huggingface(
        "dbmdz/bert-base-german-cased", layers="-1, -2")},
    from_spacy=spacy_model)

from_spacy indicates that sentence splitter and tokenizer should be taken from the provided spacy model.

Archives

Engines

nlabel comes with three different persistence engines:

  • carenero is for collecting data, esp. in a batch setting - by supporting restartability and transaction safety, and enabling export of full data or sub sets of it into bahia or arriba.
  • bahia is suitable for archival purposes, as it is just a thin wrapper around a zip of human-readable json files; it is not the ideal format for exports.
  • arriba is a binary format optimized for read performance, it is suitable for data analysis; it is not suitable for exports.

Storage Size

The following graph shows data from a real-world dataset, consisting of 18861 texts (125.3 MB text data), tagged with 4 taggers and a total of 31 tags (no embedding data). Y axis shows size in GB (note logarithmic scale). REFI-QDA is roughly 100 times the size of arriba.

storage size requirements for different engines

Random Access Speeds

The exact speed of arriba depends on the task and data, but but often arriba performs 10 to 100 times faster than bahia and carenero on real-world projects. From the same data set as earlier (when extracting all POS tags from one of 4 taggers over 2000 documents):

access times for different engines

The carenero/ALL benchmarks shows the time when accessing all tags from all taggers through carenero.

More Engine Details

These engines support storing both tagging data and embedding vectors. In the ordering above, they go from slower to faster.

carenero bahia arriba
data collection + - -
exporting + - -
read speeds - - +
suitable for archival - + -

(*) bahia supports writes, but does not avoid adding duplicates or support proper restartability in batch settings, i.e. it is not suited to incremental updates.

Additional Examples

Importing a CSV to a local archive

Create a carenero archive from a CSV:

from nlabel.importers import CSV

import spacy

csv = CSV(
    "/path/to/some.csv",
    keys=['zeitung_id', 'text_type_id', 'filename'],
    text='text')
csv.importer(spacy.load("en_core_web_sm")).to_local_archive()

This will create an archive located in the same folder as the CSV. The code above is restartable, i.e. it is okay to interrupt and continue later - it will not add duplicate entries.

Once the archive has been created, one can either use it directly, e.g. iterating its documents:

from nlabel import open_archive

with open_archive("some/archive.nlabel", mode="r") as archive:
    for doc in archive.iter(some_selector):
        for x in doc.tokens:
            print(x.text, x.xpos, x.vector)

Or, one can save the archive to different formats for faster traversal:

archive.save("demo2", engine="bahia")
archive.save("demo3", engine="arriba")

The open_archive call from above works with all archive types.

Note that the iter call on archives takes an optional view description that allows picking/renaming tags as described earlier.

Exporting to a remote archive

For larger jobs, it is often useful to separate computation and storage, and to allow multiple computation processes (both often applies to GPU cluster environments). Since carenero's sqlite is bad at handling concurrent writes, the solution is starting a dedicated web service that handles the writing on a dedicated machine.

On machine A, start an archive server (it will write a carenero archive to the given path):

python -m nlabel.importers.server /path/to/archive.nlabel --password your_pwd

On machine B, you can start one or multiple importers writing to that remote archive. Modifying the example from the local archive:

from nlabel import RemoteArchive

remote_archive = RemoteArchive("http://localhost:8000", ("user", "your_pwd"))
csv.importer(spacy.load("en_core_web_sm")).to_remote_archive(
    remote_archive, batch_size=8)

Exporting REFI-QDA

The following code exports ent tags to a REFI-QDA project.

from nlabel import NLP

import spacy
nlp = NLP(spacy.load("en_core_web_lg"))
text = 'some longer text...'
doc = nlp(text)

doc.save_to_qda(
    "/path/to/your.qdp", {
        'tagger': {
        },
        'tags': {
            'ent'
        }
    })

A save_to_qda method is also part of cantenero and bahia archives.

Owner
Bernhard Liebl
Bernhard Liebl
gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

决赛答辩已经过去一段时间了,我们队伍ac milan最终获得了复赛第3,决赛第4的成绩。在此首先感谢一些队友的carry~ 经过2个多月的比赛,学习收获了很多,也认识了很多大佬,在这里记录一下自己的参赛体验和学习收获。

102 Dec 19, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
Higher quality textures for the Metal Gear Solid series.

Metal Gear Solid: HD Textures Higher quality textures for the Metal Gear Solid series. The goal is to maximize the quality of assets that the engine w

Samantha 6 Dec 06, 2022
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

CrossSum This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summ

BUET CSE NLP Group 29 Nov 19, 2022
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 03, 2023
In this project, we aim to achieve the task of predicting emojis from tweets. We aim to investigate the relationship between words and emojis.

Making Emojis More Predictable by Karan Abrol, Karanjot Singh and Pritish Wadhwa, Natural Language Processing (CSE546) under the guidance of Dr. Shad

Karanjot Singh 2 Jan 17, 2022
Awesome-NLP-Research (ANLP)

Awesome-NLP-Research (ANLP)

Language, Information, and Learning at Yale 72 Dec 19, 2022
Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU

GPU Docker NLP Application Deployment Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU, to setup the enviroment on

Ritesh Yadav 9 Oct 14, 2022
justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

justCatTheFish 25 Dec 27, 2022
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
neural network based speaker embedder

Content What is deepaudio-speaker? Installation Get Started Model Architecture How to contribute to deepaudio-speaker? Acknowledge What is deepaudio-s

20 Dec 29, 2022
Tools and data for measuring the popularity & growth of various programming languages.

growth-data Tools and data for measuring the popularity & growth of various programming languages. Install the dependencies $ pip install -r requireme

3 Jan 06, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration This is the official repository for the EMNLP 2021 long pa

70 Dec 11, 2022
MiCECo - Misskey Custom Emoji Counter

MiCECo Misskey Custom Emoji Counter Introduction This little script counts custo

7 Dec 25, 2022
A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework.

Unpacker Karton Service A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework. This project is

c3rb3ru5 45 Jan 05, 2023
texlive expressions for documents

tex2nix Generate Texlive environment containing all dependencies for your document rather than downloading gigabytes of texlive packages. Installation

Jörg Thalheim 70 Dec 26, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022