skweak: A software toolkit for weak supervision applied to NLP tasks

Overview



skweak logo


Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming.

skweak (pronounced /skwi:k/) is a Python-based software toolkit that provides a concrete solution to this problem using weak supervision. skweak is built around a very simple idea: Instead of annotating texts by hand, we define a set of labelling functions to automatically label our documents, and then aggregate their results to obtain a labelled version of our corpus.

The labelling functions may take various forms, such as domain-specific heuristics (like pattern-matching rules), gazetteers (based on large dictionaries), machine learning models, or even annotations from crowd-workers. The aggregation is done using a statistical model that automatically estimates the relative accuracy (and confusions) of each labelling function by comparing their predictions with one another.

skweak can be applied to both sequence labelling and text classification, and comes with a complete API that makes it possible to create, apply and aggregate labelling functions with just a few lines of code. The toolkit is also tightly integrated with SpaCy, which makes it easy to incorporate into existing NLP pipelines. Give it a try!


Full Paper:
Pierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), "skweak: Weak Supervision Made Easy for NLP", arXiv:2104.09683.

Documentation & API: See the Wiki for details on how to use skweak.



Dependencies

  • spacy >= 3.0.0
  • hmmlearn >= 0.2.4
  • pandas >= 0.23
  • numpy >= 1.18

You also need Python >= 3.6.

Install

The easiest way to install skweak is through pip:

pip install skweak

or if you want to install from the repo:

pip install --user git+https://github.com/NorskRegnesentral/skweak

The above installation only includes the core library (not the additional examples in the examples folder).

Basic Overview


Overview of skweak


Weak supervision with skweak goes through the following steps:

  • Start: First, you need raw (unlabelled) data from your text domain. skweak is built on top of SpaCy, and operates with SpaCy Doc objects, so you first need to convert your documents to Doc objects using SpaCy.
  • Step 1: Then, we need to define a range of labelling functions that will take those documents and annotate spans with labels. Those labelling functions can come from heuristics, gazetteers, machine learning models, etc. See the documentation for more details.
  • Step 2: Once the labelling functions have been applied to your corpus, you need to aggregate their results in order to obtain a single annotation layer (instead of the multiple, possibly conflicting annotations from the labelling functions). This is done in skweak using a generative model that automatically estimates the relative accuracy and possible confusions of each labelling function.
  • Step 3: Finally, based on those aggregated labels, we can train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to estimate any kind of machine learning model; you are free to use whichever model/framework you prefer (a minimal sketch follows this list).
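
Here is a minimal sketch of Step 3 for sequence labelling, assuming docs holds your corpus of aggregated Doc objects, the aggregated layer is named "hmm" (as in the Quickstart below), and the final model is a SpaCy NER pipeline:

import skweak

# promote the aggregated spans to entities
# (overlapping spans must be resolved first, or this assignment will fail)
for doc in docs:
    doc.ents = doc.spans["hmm"]

# write the labelled corpus to SpaCy's binary format...
skweak.utils.docbin_writer(docs, "train.spacy")

# ...and train any model on it, for instance with SpaCy's command-line interface:
#   python -m spacy init config config.cfg --lang en --pipeline ner
#   python -m spacy train config.cfg --paths.train train.spacy --paths.dev train.spacy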

Quickstart

Here is a minimal example with three labelling functions (LFs) applied on a single document:

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils

# LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, "MONEY"
lf1 = heuristics.FunctionAnnotator("money", money_detector)

# LF 2: detection of years with a regex
lf2 = heuristics.TokenConstraintAnnotator("years", lambda tok: re.match(r"(19|20)\d{2}$", tok.text), "DATE")

# LF 3: a gazetteer with a few names
NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":trie})

# We create a corpus (here with a single text)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")

# apply the labelling functions
doc = lf3(lf2(lf1(doc)))

# and aggregate them
hmm = aggregation.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit_and_aggregate([doc])

# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, "hmm")

Obviously, to get the most out of skweak, you will need more than three labelling functions. And, most importantly, you will need a larger corpus including as many documents as possible from your domain, so that the model can derive good estimates of the relative accuracy of each labelling function.
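
To apply many labelling functions to a whole corpus, they can also be bundled and applied in one pass. A short sketch, assuming docs is a list of SpaCy Doc objects and lf1, lf2 and lf3 are the labelling functions defined above:

from skweak import base

combined = base.CombinedAnnotator()
combined.add_annotator(lf1)
combined.add_annotator(lf2)
combined.add_annotator(lf3)
docs = list(combined.pipe(docs))  # apply all labelling functions to every document

hmm = aggregation.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit_and_aggregate(docs)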

Documentation

See the Wiki.

License

skweak is released under an MIT License.

The MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve the copyright and license notices (see file License). Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Citation

See our paper describing the framework:

Pierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), "skweak: Weak Supervision Made Easy for NLP", arXiv:2104.09683

@misc{lison2021skweak,
      title={skweak: Weak Supervision Made Easy for NLP}, 
      author={Pierre Lison and Jeremy Barnes and Aliaksandr Hubin},
      year={2021},
      eprint={2104.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Comments
  • Label Function Analysis

    Label Function Analysis

    First of all, thanks for open sourcing such an awesome project!

    Our team has been playing around with skweak for a sequential labeling task, and we were wondering if there were any plans in the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.

    Snorkel, for example, provides an LF analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequential labeling problem.
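
    As a starting point, a per-LF coverage statistic can already be sketched on top of skweak's span storage (a minimal sketch, assuming each labelling function writes its spans to doc.spans[lf_name]):

    def lf_coverage(docs, lf_names):
        """Fraction of documents in which each labelling function fires at least once."""
        coverage = {}
        for name in lf_names:
            n_fired = sum(1 for doc in docs if name in doc.spans and len(doc.spans[name]) > 0)
            coverage[name] = n_fired / len(docs)
        return coverage

    # e.g. lf_coverage(docs, ["money", "years", "presidents"])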

    Are there any plans to add such functionality down the line as a feature enhancement?

    enhancement 
    opened by schopra8 20
  • Tokens with no possible state

    Tokens with no possible state

    I very often get the error from this line saying there is a "problem with token X", which causes HMM training to be aborted after only a couple of documents in the very first iteration.

    I found out that this is due to framelogprob having all -np.inf for the token in question. So I checked what happens in self._compute_log_likelihood for the respective document and found that this document had only one labeling function firing and X[source] in this line was all False for the first token (or state?).

    This means that this token/state is also all masked with -np.inf in logsum in this line.

    Now, I am unsure how to fix that. This clearly does not look like the desired behavior, but I suppose "testing for tokens with no possible states" is there for a reason. Can I simply replace -np.inf in self._compute_log_likelihood with -100000? Then, of course, the test will not fail and not abort training, but there will be a token with only very improbable states. Is that OK?

    Or is that the wrong approach? Should tokens without observed labels from the labeling functions instead get a default label (e.g., O)? If so, why is that not done here? Is it a bug? I am not sure where I should look for a bug, if there is one. Can someone with a better knowledge of the code base give some advice on this?

    opened by mnschmit 10
  • _do_forward_pass, _do_backward_pass, _compute_posteriors not defined in skweak.aggregation

    _do_forward_pass, _do_backward_pass, _compute_posteriors not defined in skweak.aggregation

    skweak/aggregation.py", line 405, in fit logprob, fwdlattice = self._do_forward_pass(framelogprob) AttributeError: 'HMM' object has no attribute '_do_forward_pass'

    opened by ManuBohra 10
  • TypeError: unhashable type: 'list'

    TypeError: unhashable type: 'list'

    Upon applying a config file in order to train a textcat model using the following code:

    !spacy init config - --lang en --pipeline ner --optimize accuracy | \
      spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \
      --initialize.vectors en_core_web_md --output train

    I receive the following error message:

    [i] Saving to output directory: train
    [i] Using CPU

    =========================== Initializing pipeline ===========================
    2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
    2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
    2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    [2022-03-27 15:50:05,376] [INFO] Set up nlp object from config
    [2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner']
    [2022-03-27 15:50:05,395] [INFO] Created vocabulary
    [2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md
    [2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object
    Traceback (most recent call last):
      File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\ProgramData\Anaconda3\Scripts\spacy.exe\__main__.py", line 7, in <module>
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\_util.py", line 71, in setup_cli
        command(prog_name=COMMAND)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main
        rv = self.invoke(ctx)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper
        return callback(**use_params)  # type: ignore
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli
        train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train
        nlp = init_nlp(config, use_gpu=use_gpu)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp
        nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize
        proc.initialize(get_examples, nlp=self, **p_settings)
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize
        validate_get_examples(get_examples, "Tok2Vec.initialize")
      File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples
      File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in __call__
        for real_eg in examples:
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples
        for reference in reference_docs:
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin
        for doc in docs:
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens\_serialize.py", line 150, in get_docs
        doc.spans.from_bytes(self.span_groups[i])
      File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens\_dict_proxies.py", line 54, in from_bytes
        group = SpanGroup(doc).from_bytes(value_bytes)
      File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes
      File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\_msgpack_api.py", line 27, in msgpack_loads
        msg = msgpack.loads(data, raw=False, use_list=use_list)
      File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack\__init__.py", line 79, in unpackb
        return _unpackb(packed, **kwargs)
      File "srsly\msgpack\_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
    TypeError: unhashable type: 'list'

    Seems like a dependency issue. What is the reason for it? And is there a way to fix it?

    Also: is the following error message a problem? "[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." Or can it be avoided by simply applying the following?

    for document in train_data:
        try:
            document.ents = document.spans["hmm"]
            skweak.utils.docbin_writer(train_data, "train.spacy")
        except Exception as e:
            print(e)

    opened by AlineBornschein 6
  • TypeError when nothing is found in a document

    TypeError when nothing is found in a document

    Hi! I'm getting an exception from fit_and_aggregate. TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'. The exception is from line 227 in aggregation.py, np.apply_along_axis(...)

    This seems to happen when all of my labeling functions return empty on one of the docs so the DataFrame is empty.

    opened by oholter 6
  • Error in MultilabelNaiveBayes

    Error in MultilabelNaiveBayes

    [screenshot of the error]

    I am using skweak's multi-label classification and I am getting the following error message: RuntimeError: No valid state found at position 0

    I aggregated the LFs using CombinedAnnotator, then initialized MultilabelNaiveBayes with MultilabelNaiveBayes("skweak_preds", final_label_list), and then trained the model with skweak_model.fit(d2s).

    Any help in fixing this is appreciated. Thanks!

    opened by sujeethrv 5
  • Converting .spacy files to CoNLL format to train other models on them

    Converting .spacy files to CoNLL format to train other models on them

    Once I had fit the aggregation model on the data, I used skweak's function to write it as a DocBin, which gets saved as a .spacy file. How do I convert this into a normal CoNLL-format file? Are there any libraries or tools that can do that?
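
    For reference, a possible conversion sketch using SpaCy directly (assuming the entities were stored in doc.ents before writing the DocBin; the output follows a simple token-per-line, tab-separated CoNLL-style layout):

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    doc_bin = DocBin().from_disk("train.spacy")
    with open("train.conll", "w", encoding="utf-8") as f:
        for doc in doc_bin.get_docs(nlp.vocab):
            for tok in doc:
                tag = tok.ent_iob_ + ("-" + tok.ent_type_ if tok.ent_type_ else "")
                f.write(f"{tok.text}\t{tag}\n")
            f.write("\n")  # blank line between documents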

    opened by Akshay0799 5
  • Gazetteer is not working with single tokens

    Gazetteer is not working with single tokens

    Hello.

    Why doesn't the gazetteer match the single name 'Barack'?

    import spacy, re
    from skweak import heuristics, gazetteers, aggregation, utils, base
    nlp = spacy.load("en_core_web_sm", disable=["ner"])
    doc = nlp('Barack Obama and Donald Trump')
    NAMES = [("Barack"), ("Donald", "Trump")]
    lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
    doc = lf3(doc)
    print(doc.spans)
    

    {'presidents': [Donald Trump]}

    Any ideas?

    Thanks for a remarkable lib!

    opened by slavaGanzin 5
  • [Question] Underspecified Labels w/ out Fine-Grained Label

    [Question] Underspecified Labels w/ out Fine-Grained Label

    Context

    • I'm training an NER model using the HMM aggregator.
    • I have 2 label classes [A, B] and an under-specified label [C] which is a super-class of A and B within my ontology.
    • I have 3 sets of gazetteer label functions - one set for A, one set for B, and one set for C.

    Issue

    • When training the HMM, I have tokens which are annotated by label functions for C (superclass) but are not annotated by label functions for A and B (e.g., the term "Apple" is being labeled as an ENT but is not being captured by the LFs for PER or PROD).
    • Currently I'm calling the HMM function as follows:
    hmm = aggregation.HMM("hmm", [A, B], sequence_labelling=True)
    hmm.add_underspecified_label(C, [A, B])
    _ = hmm.fit_and_aggregate(annotated_docs)
    
    • This triggers an error from the below aggregation code, since all probability mass is being placed on a label that was not included in the HMM (i.e., the under-specified label C). https://github.com/NorskRegnesentral/skweak/blob/0613f20b9c8be3f22553e303ec22c72dea1f206a/skweak/aggregation.py#L397-L401

    Question(s)

    • Should I be including the under-specified label as a possible label option in the HMM?
    hmm = aggregation.HMM("hmm", [A, B, C], sequence_labelling=True)
    hmm.add_underspecified_label(C, [A, B])
    _ = hmm.fit_and_aggregate(annotated_docs)
    
    • How are underspecified labels "learned" or trained differently vs. the "specified labels" (e.g., A, B in the example)?

    Thanks in advance!

    opened by schopra8 5
  • use Flair with skweak

    use Flair with skweak

    Hello, has anyone here tried to implement another model/framework other than SpaCy (NER) as a labelling function? I tried to work with Flair but it didn't work. Can anyone help me? Thanks in advance.

    opened by Ihebzayen 4
  • Runtime error in display_entities

    Runtime error in display_entities

    I am using the latest version of skweak: 0.2.17. I tried running the example (quick-start.ipynb) in the repo. When I try to execute

    skweak.utils.display_entities(docs[28], "other_org_detector")

    I get this error:

    [screenshot of the error]

    opened by latchukarthick98 3
  • Step by step NER alternative 2

    Step by step NER alternative 2

    Hello,

    First of all, thank you for the library.

    I'm kind of new to NER, and I'd like to know how the 2nd alternative of the NER process would be done, where a more sophisticated model is created, since I didn't find it in Step by Step NER.

    opened by boskis222 0
  • minimal example not working

    minimal example not working

    When I try to run the minimal example on the home page, an error appears: AttributeError: 'BaseHMM' object has no attribute '_do_forward_log_pass'

    Am I missing something from the install or is it just pip install skweak?

    opened by davidbetancur8 2
  • Support options in displacy.render

    Support options in displacy.render

    This is an enhancement request: display_entities could be a bit more flexible if you included options={} as part of its parameters. Ex:

    def display_entities(doc: Doc, layer=None, add_tooltip=False, options={}):

    and then change the line below to:

    html = spacy.displacy.render(doc2, jupyter=False, style="ent", manual=True, options=options)

    That would extend the functionality of render when creating new entities.

    Thanks for the great work with SKWEAK.

    opened by lidiexy-palinode 0
  • Support for relation extraction

    Support for relation extraction

    Right now, skweak supports two main types of NLP tasks: (token-level) sequence labelling and text classification. Both rest on the idea that labelling functions associate labels with text spans, and the role of the aggregation model is then to merge the outputs of those labelling functions so as to get unified predictions.

    However, some NLP tasks cannot easily be associated with single text spans. For instance, relation extraction requires predictions over pairs of spans.

    The question is then how to provide support for such types of tasks, for instance by implementing a RelationAnnotator that could be used to associate pairs of spans with a label.

    Technically speaking, we could still encode the annotations internally as SpanGroup objects. One solution would be to only add one span of the pair to the SpanGroup, but then specify that this span is connected to a second span (SpanGroup objects allow the inclusion of JSON-serialised attributes). The method get_observation_df in the BaseAggregator class could then be extended to detect whether a span is a normal one or is connected to a second span. If that is the case, the aggregation would then be done on pairs of spans instead of single spans.
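
    A hypothetical sketch of that encoding (the attribute names "relation", "tail_start" and "tail_end" are made up for illustration; they are not part of the current API):

    import spacy
    from spacy.tokens import SpanGroup

    nlp = spacy.blank("en")
    doc = nlp("Pierre Lison works at Norsk Regnesentral")
    head, tail = doc[0:2], doc[4:6]
    # store only the head span, and point to the tail span via JSON-serialisable attrs
    doc.spans["rel"] = SpanGroup(doc, name="rel", spans=[head],
                                 attrs={"relation": "WORKS_AT",
                                        "tail_start": tail.start,
                                        "tail_end": tail.end})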

    Do get in touch if this functionality is something you need, so that we know whether we should prioritise this in our next release :-)

    enhancement 
    opened by plison 4
  • Regression-based outcome

    Regression-based outcome

    Hello, thank you for sharing this repo. Do you have plans for providing capability for a regression-based outcome? Something along the lines of fine-grained sentiment on a scale from 1-5?

    enhancement 
    opened by dmracek 1
Releases (0.3.1)
  • 0.3.1 (Mar 25, 2022)

    Brand new version of skweak, including both a number of bug fixes and some new functionalities:

    1. skweak now uses the latest version of hmmlearn, thereby fixing a number of errors due to a mismatch between method names
    2. We now have a clearer split between aggregation models for sequence labelling and for text classification. Possible aggregators for sequence labelling are SequentialMajorityVoter and HMM (preferred), while the aggregators for non-sequential text classification are MajorityVoter and NaiveBayes.

    We also introduce a brand new functionality: multi-label classification! Instead of assuming that all labels are mutually exclusive, you can now aggregate the results of labelling functions without assuming that only one label is correct. This multi-label scheme is available for both sequence labelling (see MultilabelSequentialMajorityVoter and MultilabelHMM) and text classification (see MultilabelMajorityVoter and MultilabelNaiveBayes).

    By default, all labels can be simultaneously true for a given data point, but you can enforce exclusivity relations between labels through the method set_exclusive_labels. If all labels are set to be mutually exclusive, the aggregation is equivalent to a standard multi-class setup. Internally, this functionality is implemented by constructing and fitting separate aggregation models for each label.
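
    A small sketch of this multi-label API (the label names are made up, the constructor mirrors the usage reported in the issues above, and the exact argument form of set_exclusive_labels is assumed here):

    from skweak import aggregation

    nb = aggregation.MultilabelNaiveBayes("nb", ["POLITICS", "SPORT", "ECONOMY"])
    # assume SPORT and ECONOMY are mutually exclusive, while POLITICS may co-occur
    nb.set_exclusive_labels(["SPORT", "ECONOMY"])
    nb.fit_and_aggregate(docs)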

    The code for the aggregation models has also been heavily refactored, making it hopefully easier to create new aggregation models.

Owner
Norsk Regnesentral (Norwegian Computing Center)
The Norwegian Computing Center is a private foundation performing research in statistical modeling, machine learning and information/communication technology.