A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Last update: Oct 07, 2022

Overview

Multilingual Latent Dirichlet Allocation (LDA) Pipeline

This project clusters text using the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to any language supported by the Snowball stemmer, which is a dependency of this project.

Usage

from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",
    "a", "b", "c", "s",
    "est", "sur", "tres", "donc", "sont",
    # even slang/texto stop words:
    "ya", "pis", "yer"]
# Note: this list of stop words is deliberately small and serves only as an example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')

print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)

Output:

array([[0.14218195, 0.85781805],
       [0.11032992, 0.88967008],
       [0.16960695, 0.83039305],
       [0.88967041, 0.11032959],
       [0.8578187 , 0.1421813 ],
       [0.83039303, 0.16960697]])

['Un super-chien aboie', 'Les super-chats aiment ronronner']

[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
 [('chats',  3.4911393765493255), ('super', 2.499979634668601 )]]

[[('super chiens', 2.4921035508342464)],
 [('super chats',  2.492102155345991 )]]
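
To turn these topic proportions into clusters, one option is to assign each comment to its highest-scoring topic. The sketch below is not part of the library: it copies the values printed above and assumes that each row of transformed_comments corresponds to the comment at the same position in fr_comments. The helper name dominant_topic is hypothetical.

```python
# Topic proportions copied from the example output above (one row per comment).
transformed_comments = [
    [0.14218195, 0.85781805],
    [0.11032992, 0.88967008],
    [0.16960695, 0.83039305],
    [0.88967041, 0.11032959],
    [0.8578187, 0.1421813],
    [0.83039303, 0.16960697],
]

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]


def dominant_topic(row):
    """Return the index of the largest topic proportion in a row (hypothetical helper)."""
    return max(range(len(row)), key=lambda i: row[i])


# Group comments by their dominant topic.
# Assumption: rows are in the same order as the input comments.
clusters = {}
for comment, row in zip(fr_comments, transformed_comments):
    clusters.setdefault(dominant_topic(row), []).append(comment)

for topic, comments in sorted(clusters.items()):
    print(topic, comments)
```

With the values above, the three dog comments fall under topic 0 and the three cat comments under topic 1, which matches the top words listed for each topic.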

How it works

See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom, not skimmed). For more details on the inverse lemmatization step (the "Inverse Stemming" named in the project description), see Stemming-words-from-multiple-languages.
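
As a rough illustration of what inverse stemming can mean, the sketch below maps each stem back to the most frequent original word that produced it, so that topic words are displayed as readable words rather than truncated stems. This is an assumption about the general idea, not a copy of the project's implementation: toy_stem is a hypothetical stand-in for a real Snowball stemmer, and the word list is illustrative.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Snowball stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def build_inverse_stemming_table(words, stem=toy_stem):
    """Map each stem to the most frequent original word that produced it."""
    forms_by_stem = defaultdict(Counter)
    for word in words:
        forms_by_stem[stem(word)][word] += 1
    return {s: forms.most_common(1)[0][0] for s, forms in forms_by_stem.items()}


# Illustrative tokens, loosely based on the French example in the Usage section.
words = ["chat", "chats", "chats", "chien", "chiens", "chiens", "super", "super"]

table = build_inverse_stemming_table(words)
print(table)
```

Here the stem "chien" maps back to "chiens" because the plural form is the more frequent one in the token list, which is consistent with the plural forms shown in the Usage output.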

Supported Languages

The following languages are supported:

Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter (the original Porter stemming algorithm for English)
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish

You need to bring your own list of stop words. One way to build it is to compute term frequencies on your corpus (or on a larger corpus in the same language) and use some of the most common words as stop words.
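
A minimal sketch of that term-frequency approach, using only the Python standard library. The function name most_common_words and the regular-expression tokenizer are assumptions made for illustration; they are not part of this project.

```python
import re
from collections import Counter


def most_common_words(documents, top_k):
    """Return the top_k most frequent lowercase word tokens across documents."""
    counts = Counter()
    for doc in documents:
        # Simple tokenizer (an assumption): runs of word characters, lowercased.
        counts.update(re.findall(r"\w+", doc.lower()))
    return [word for word, _ in counts.most_common(top_k)]


fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]

# Candidate stop words: review this list by hand before using it.
candidates = most_common_words(fr_comments, top_k=5)
print(candidates)
```

On a corpus this small, content words such as "super" rank among the most frequent, so the candidates should be reviewed manually and only the genuinely uninformative words kept. A larger corpus gives a more reliable ranking.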

Dependencies and their license

numpy==1.14.3           # BSD-3-Clause, BSD-2-Clause, BSD-like, and Zlib
scikit-learn==0.19.1    # BSD-3-Clause
PyStemmer==1.3.0        # BSD-3-Clause and MIT
snowballstemmer==1.2.1  # BSD-3-Clause and BSD-2-Clause
translitcodec==0.4.0    # MIT License
scipy==1.1.0            # BSD-3-Clause and MIT-like

Unit tests

Run pytest with ./run_tests.sh. Coverage:

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
artifici_lda/__init__.py                       0      0   100%
artifici_lda/data_utils.py                    39      0   100%
artifici_lda/lda_service.py                   31      0   100%
artifici_lda/logic/__init__.py                 0      0   100%
artifici_lda/logic/count_vectorizer.py         9      0   100%
artifici_lda/logic/lda.py                     23      7    70%
artifici_lda/logic/letter_splitter.py         36      4    89%
artifici_lda/logic/stemmer.py                 60      3    95%
artifici_lda/logic/stop_words_remover.py      61      5    92%
--------------------------------------------------------------
TOTAL                                        259     19    93%

License

This project is published under the MIT License (MIT).

Coded by Guillaume Chevalier at Neuraxio Inc.

Owner

Artifici Online Services inc.