Multilingual text (NLP) processing toolkit

Last update: Jan 07, 2023

Overview

polyglot

Polyglot is a natural language pipeline that supports massive multilingual applications.

Free software: GPLv3 license
Documentation: http://polyglot.readthedocs.org.

Features

Tokenization (165 Languages)
Language detection (196 Languages)
Named Entity Recognition (40 Languages)
Part of Speech Tagging (16 Languages)
Sentiment Analysis (136 Languages)
Word Embeddings (137 Languages)
Morphological analysis (135 Languages)
Transliteration (69 Languages)
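
Several of the features above (for example embeddings, POS tagging and NER) depend on per-language model packages that have to be downloaded before first use. The sketch below builds package names and shows one way to fetch them. The "<task>2.<lang>" naming pattern and the helper names package_ids and download_models are assumptions made for illustration; check the polyglot documentation for the exact package ids available for your language.

```python
# Sketch: build polyglot model package names such as "ner2.de".
# The "<task>2.<lang>" pattern is an assumption about the downloader's
# package ids; verify it against the polyglot documentation.

def package_ids(tasks, lang):
    """Return downloader package ids for the given tasks and language code."""
    return ["{}2.{}".format(task, lang) for task in tasks]


def download_models(tasks, lang):
    """Download the models (needs polyglot installed and network access)."""
    from polyglot.downloader import downloader  # imported lazily on purpose
    for package in package_ids(tasks, lang):
        downloader.download(package)


packages = package_ids(["embeddings", "ner"], "de")
# The same packages can be fetched from the shell with this command line.
print("polyglot download " + " ".join(packages))
```

Calling download_models(["embeddings", "ner"], "de") would perform the download from Python instead of the shell.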

Developer

Rami Al-Rfou @ rmyeid gmail com

Quick Tutorial

import polyglot
from polyglot.text import Text, Word

Language Detection

text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))

Language Detected: Code=fr, Name=French
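
The detected language code is handy for routing mixed-language input. The following is a minimal sketch, not part of polyglot itself: group_by_language and polyglot_language_code are hypothetical helper names, and the detector is passed in as a parameter so another one can be swapped in.

```python
from collections import defaultdict


def polyglot_language_code(raw_text):
    """Detect a language code with polyglot (must be installed)."""
    from polyglot.text import Text  # imported lazily on purpose
    return Text(raw_text).language.code


def group_by_language(texts, detect=polyglot_language_code):
    """Group raw strings by the language code returned by `detect`."""
    groups = defaultdict(list)
    for raw_text in texts:
        groups[detect(raw_text)].append(raw_text)
    return dict(groups)


# Usage, assuming polyglot and its language detection dependencies are installed:
# group_by_language(["Bonjour, Mesdames.", "Good morning."])
```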

Tokenization

zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)

[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']

print(zen.sentences)

[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]
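
Sentences and words can be combined, for instance to count tokens per sentence. This sketch assumes each Sentence object exposes a words attribute in the same way Text does; tokens_per_sentence is a hypothetical helper name.

```python
def tokens_per_sentence(text):
    """Return the number of tokens in each sentence of a Text-like object.

    `text` only needs a `sentences` attribute whose items have `words`.
    """
    return [len(sentence.words) for sentence in text.sentences]


# Usage, assuming polyglot is installed and `zen` is the Text defined above:
# print(tokens_per_sentence(zen))
```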

Part of Speech Tagging

text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))

Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT

Named Entity Recognition

text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)

[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]
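
Entities are often easier to use when grouped by type. This sketch assumes each entity is a list of words that carries its label in a tag attribute (such as "I-LOC"); group_entities is a hypothetical helper name.

```python
from collections import defaultdict


def group_entities(entities):
    """Map each entity tag (e.g. "I-LOC") to a list of entity strings."""
    grouped = defaultdict(list)
    for entity in entities:
        # An entity may span several words; join them into one string.
        grouped[entity.tag].append(" ".join(entity))
    return dict(grouped)


# Usage, assuming polyglot is installed and `text` is the Text defined above:
# print(group_entities(text.entities))
```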

Polarity

print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0
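
Word polarities can be aggregated into a rough score for a longer span. The sketch below averages the non-zero polarities; this is one simple aggregation chosen for illustration, not necessarily how polyglot scores a whole text, and mean_polarity is a hypothetical helper name.

```python
def mean_polarity(polarities):
    """Average the non-zero word polarities; 0.0 if no word carries sentiment."""
    scored = [p for p in polarities if p != 0]
    return sum(scored) / len(scored) if scored else 0.0


# Polarities copied from the output above; with polyglot installed they
# would come from [w.polarity for w in zen.words[:6]].
polarities = [0, 0, 1, 0, -1, 0]
print(mean_polarity(polarities))
```

In this sentence the positive "better" and the negative "ugly" cancel out, so the average is zero.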

Embeddings

word = Word("Obama", language="en")
print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])

Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164
  2.92784619 -0.25694436 -1.40958667 -2.39675403]
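
Embedding vectors can be compared directly. Cosine similarity is one common measure; whether polyglot ranks neighbors this way is not stated here, so treat the sketch below as a generic illustration. It uses only the standard library, and cosine_similarity is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration; with polyglot installed, word.vector from
# two Word objects could be passed in instead.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0.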

Morphology

word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)

[u'Pre', u'process', u'ing']

Transliteration

from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))

препрокессинг
