Natural language Understanding Toolkit

Related tags

Text Data & NLPnut
Overview

Natural language Understanding Toolkit

TOC

Requirements

To install nut you need:

  • Python 2.5 or 2.6
  • Numpy (>= 1.1)
  • Sparsesvd (>= 0.1.4) [1] (only CLSCL)

Installation

To clone the repository run,

git clone git://github.com/pprett/nut.git

To build the extension modules inplace run,

python setup.py build_ext --inplace

Add project to python path,

export PYTHONPATH=$PYTHONPATH:$HOME/workspace/nut

Documentation

CLSCL

An implementation of Cross-Language Structural Correspondence Learning (CLSCL). See [Prettenhofer2010] for a detailed description and [Prettenhofer2011] for more experiments and enhancements.

The data for cross-language sentiment classification that has been used in the above study can be found here [2].

clscl_train

Training script for CLSCL. See ./clscl_train --help for further details.

Usage:

$ ./clscl_train en de cls-acl10-processed/en/books/train.processed cls-acl10-processed/en/books/unlabeled.processed cls-acl10-processed/de/books/unlabeled.processed cls-acl10-processed/dict/en_de_dict.txt model.bz2 --phi 30 --max-unlabeled=50000 -k 100 -m 450 --strategy=parallel

|V_S| = 64682
|V_T| = 106024
|V| = 170706
|s_train| = 2000
|s_unlabeled| = 50000
|t_unlabeled| = 50000
debug: DictTranslator contains 5012 translations.
mutualinformation took 5.624 sec
select_pivots took 7.197 sec
|pivots| = 450
create_inverted_index took 59.353 sec
Run joblib.Parallel
[Parallel(n_jobs=-1)]: Done   1 out of 450 |elapsed:    9.1s remaining: 67.8min
[Parallel(n_jobs=-1)]: Done   5 out of 450 |elapsed:   15.2s remaining: 22.6min
[..]
[Parallel(n_jobs=-1)]: Done 449 out of 450 |elapsed: 14.5min remaining:    1.9s
train_aux_classifiers took 881.803 sec
density: 0.1154
Ut.shape = (100,170706)
learn took 903.588 sec
project took 175.483 sec

Note

If you have access to a hadoop cluster, you can use --strategy=hadoop to train the pivot classifiers even faster, however, make sure that the hadoop nodes have Bolt (feature-mask branch) [3] installed.

clscl_predict

Prediction script for CLSCL.

Usage:

$ ./clscl_predict cls-acl10-processed/en/books/train.processed model.bz2 cls-acl10-processed/de/books/test.processed 0.01
|V_S| = 64682
|V_T| = 106024
|V| = 170706
load took 0.681 sec
load took 0.659 sec
classes = {negative,positive}
project took 2.498 sec
project took 2.716 sec
project took 2.275 sec
project took 2.492 sec
ACC: 83.05

Named-Entity Recognition

A simple greedy left-to-right sequence labeling approach to named entity recognition (NER).

pre-trained models

We provide pre-trained named entity recognizers for place, person, and organization names in English and German. To tag a sentence simply use:

>>> from nut.io import compressed_load
>>> from nut.util import WordTokenizer

>>> tagger = compressed_load("model_demo_en.bz2")
>>> tokenizer = WordTokenizer()
>>> tokens = tokenizer.tokenize("Peter Prettenhofer lives in Austria .")

>>> # see tagger.tag.__doc__ for input format
>>> sent = [((token, "", ""), "") for token in tokens]
>>> g = tagger.tag(sent)  # returns a generator over tags
>>> print(" ".join(["/".join(tt) for tt in zip(tokens, g)]))
Peter/B-PER Prettenhofer/I-PER lives/O in/O Austria/B-LOC ./O

You can also use the convenience demo script ner_demo.py:

$ python ner_demo.py model_en_v1.bz2

The feature detector modules for the pre-trained models are en_best_v1.py and de_best_v1.py and can be found in the package nut.ner.features. In addition to baseline features (word presence, shape, pre-/suffixes) they use distributional features (brown clusters), non-local features (extended prediction history), and gazetteers (see [Ratinov2009]). The models have been trained on CoNLL03 [4]. Both models use neither syntactic features (e.g. part-of-speech tags, chunks) nor word lemmas, thus, minimizing the required pre-processing. Both models provide state-of-the-art performance on the CoNLL03 shared task benchmark for English [Ratinov2009]:

processed 46435 tokens with 4946 phrases; found: 4864 phrases; correct: 4455.
accuracy:  98.01%; precision:  91.59%; recall:  90.07%; FB1:  90.83
              LOC: precision:  91.69%; recall:  90.53%; FB1:  91.11  1648
              ORG: precision:  87.36%; recall:  85.73%; FB1:  86.54  1630
              PER: precision:  95.84%; recall:  94.06%; FB1:  94.94  1586

and German [Faruqui2010]:

processed 51943 tokens with 2845 phrases; found: 2438 phrases; correct: 2168.
accuracy:  97.92%; precision:  88.93%; recall:  76.20%; FB1:  82.07
              LOC: precision:  87.67%; recall:  79.83%; FB1:  83.57  957
              ORG: precision:  82.62%; recall:  65.92%; FB1:  73.33  466
              PER: precision:  93.00%; recall:  78.02%; FB1:  84.85  1015

To evaluate the German model on the out-domain data provided by [Faruqui2010] use the raw flag (-r) to write raw predictions (without B- and I- prefixes):

./ner_predict -r model_de_v1.bz2 clner/de/europarl/test.conll - | clner/scripts/conlleval -r
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 40.9214s sec.
processed 110405 tokens with 2112 phrases; found: 2930 phrases; correct: 1676.
accuracy:  98.50%; precision:  57.20%; recall:  79.36%; FB1:  66.48
              LOC: precision:  91.47%; recall:  71.13%; FB1:  80.03  563
              ORG: precision:  43.63%; recall:  83.52%; FB1:  57.32  1673
              PER: precision:  62.10%; recall:  83.85%; FB1:  71.36  694

Note that the above results cannot be compared directly to the resuls of [Faruqui2010] since they use a slighly different setting (incl. MISC entity).

ner_train

Training script for NER. See ./ner_train --help for further details.

To train a conditional markov model with a greedy left-to-right decoder, the feature templates of [Rationov2009]_ and extended prediction history (see [Ratinov2009]) use:

./ner_train clner/en/conll03/train.iob2 model_rr09.bz2 -f rr09 -r 0.00001 -E 100 --shuffle --eph
________________________________________________________________________________
Feature extraction

min count:  1
use eph:  True
build_vocabulary took 24.662 sec
feature_extraction took 25.626 sec
creating training examples... build_examples took 42.998 sec
[done]
________________________________________________________________________________
Training

num examples: 203621
num features: 553249
num classes: 9
classes:  ['I-LOC', 'B-ORG', 'O', 'B-PER', 'I-PER', 'I-MISC', 'B-MISC', 'I-ORG', 'B-LOC']
reg: 0.00001000
epochs: 100
9 models trained in 239.28 seconds.
train took 282.374 sec

ner_predict

You can use the prediction script to tag new sentences formatted in CoNLL format and write the output to a file or to stdout. You can pipe the output directly to conlleval to assess the model performance:

./ner_predict model_rr09.bz2 clner/en/conll03/test.iob2 - | clner/scripts/conlleval
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 11.2883s sec.
processed 46435 tokens with 5648 phrases; found: 5605 phrases; correct: 4799.
accuracy:  96.78%; precision:  85.62%; recall:  84.97%; FB1:  85.29
              LOC: precision:  87.29%; recall:  88.91%; FB1:  88.09  1699
             MISC: precision:  79.85%; recall:  75.64%; FB1:  77.69  665
              ORG: precision:  82.90%; recall:  78.81%; FB1:  80.80  1579
              PER: precision:  88.81%; recall:  91.28%; FB1:  90.03  1662

References

[1] http://pypi.python.org/pypi/sparsesvd/0.1.4
[2] http://www.webis.de/research/corpora/corpus-webis-cls-10/cls-acl10-processed.tar.gz
[3] https://github.com/pprett/bolt/tree/feature-mask
[4] For German we use the updated version of CoNLL03 by Sven Hartrumpf.
[Prettenhofer2010] Prettenhofer, P. and Stein, B., Cross-language text classification using structural correspondence learning. In Proceedings of ACL '10.
[Prettenhofer2011] Prettenhofer, P. and Stein, B., Cross-lingual adaptation using structural correspondence learning. ACM TIST (to appear). [preprint]
[Ratinov2009] (1, 2, 3) Ratinov, L. and Roth, D., Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL '09.
[Faruqui2010] (1, 2, 3) Faruqui, M. and Padó S., Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In Proceedings of KONVENS '10

Developer Notes

  • If you copy a new version of bolt into the externals directory make sure to run cython on the *.pyx files. If you fail to do so you will get a PickleError in multiprocessing.
Owner
Peter Prettenhofer
Peter Prettenhofer
Package for controllable summarization

summarizers summarizers is package for controllable summarization based CTRLsum. currently, we only supports English. It doesn't work in other languag

Hyunwoong Ko 72 Dec 07, 2022
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

49 Dec 17, 2022
Random Directed Acyclic Graph Generator

DAG_Generator Random Directed Acyclic Graph Generator verison1.0 简介 工作流通常由DAG(有向无环图)来定义,其中每个计算任务$T_i$由一个顶点(node,task,vertex)表示。同时,任务之间的每个数据或控制依赖性由一条加权

Livion 17 Dec 27, 2022
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
Google AI 2018 BERT pytorch implementation

BERT-pytorch Pytorch implementation of Google AI's 2018 BERT, with simple annotation BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers f

Junseong Kim 5.3k Jan 07, 2023
PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

Poincaré Embeddings for Learning Hierarchical Representations PyTorch implementation of Poincaré Embeddings for Learning Hierarchical Representations

Facebook Research 1.6k Dec 29, 2022
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022
A programming language with logic of Python, and syntax of all languages.

Pytov The idea was to take all well known syntaxes, and combine them into one programming language with many posabilities. Installation Install using

Yuval Rosen 14 Dec 07, 2022
Korean Sentence Embedding Repository

Korean-Sentence-Embedding 🍭 Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

80 Jan 02, 2023
Telegram AI chat bot written in Python using Pyrogram

Aurora_Al Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @AuroraAl. Require

♗CσNϙUҽRσR_MҽSƙEƚҽҽR 1 Oct 31, 2021
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Fastseq 基于ONNXRUNTIME的文本生成加速框架

Fastseq 基于ONNXRUNTIME的文本生成加速框架

Jun Gao 9 Nov 09, 2021
🕹 An esoteric language designed so that the program looks like the transcript of a Pokémon battle

PokéBattle is an esoteric language designed so that the program looks like the transcript of a Pokémon battle. Original inspiration and specification

Eduardo Correia 9 Jan 11, 2022
HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

HuggingSound HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools. I have no intention of building a very complex tool here.

Jonatas Grosman 247 Dec 26, 2022
A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Rule-Based-Classification-in-a-Banking-Case. A CRM department in a local bank works on classify their lost customers with their past datas. So they wa

ÖMER YILDIZ 4 Mar 20, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

abhishek thakur 33 Dec 18, 2022
Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR

Speech_38_ru_commands Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR Программа умеет распознавать 38 ключевы

Andrey 9 May 05, 2022
A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Crosslingual Coreference Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non

Pandora Intelligence 71 Jan 04, 2023