Biterm Topic Model (BTM): modeling topics in short texts

Overview

Bitermplus implements the biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. It is a cythonized implementation of BTM, and it can also compute perplexity and semantic coherence metrics.

Development

Please note that bitermplus is under active development. Refer to the documentation to stay up to date.

Requirements

  • cython
  • numpy
  • pandas
  • scipy
  • scikit-learn
  • tqdm

Setup

Linux and Windows

There should be no issues installing bitermplus on these operating systems. You can install the package directly from PyPI:

pip install bitermplus

Or from this repo:

pip install git+https://github.com/maximtrp/bitermplus.git

macOS

First, you need to install the Xcode Command Line Tools and Homebrew, then install libomp using brew. If a later build fails with 'omp.h' file not found (a frequent report in the comments below), clang is likely not finding Homebrew's libomp; pointing the CPPFLAGS and LDFLAGS environment variables at the libomp installation prefix before running pip is a common workaround.

xcode-select --install
brew install libomp
pip3 install bitermplus

Example

Model fitting

import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# PREPROCESSING
# Obtaining terms frequency in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_
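
The most relevant words per topic (a recurring question in the comments below) can be read directly from the topics-vs-words probability matrix. A minimal sketch, assuming the fitted model and the vocabulary array from the snippet above; n_top is an illustrative variable, not a library parameter:

# Top topic words: a sketch based on the public matrix_topics_words_
# attribute (topics vs. words probabilities) and the vocabulary array
n_top = 3  # number of words to show per topic (illustrative)
word_ids = np.argsort(model.matrix_topics_words_, axis=1)[:, :-n_top - 1:-1]
for topic, ids in enumerate(word_ids):
    print(topic, [vocabulary[i] for i in ids])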

Results visualization

You need to install tmplot first.

import tmplot as tmp
tmp.report(model=model, docs=texts)

Report interface

Tutorial

There is a tutorial in the documentation that covers the important steps of topic modeling, including stability measures and results visualization.
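
As a rough illustration of the stability workflow, the hedged sketch below fits a second model on the same biterms and compares the two topics-vs-words matrices, mirroring the btm.get_closest_topics and btm.get_stable_topics calls quoted in the comments below (the seed and threshold values are illustrative). Note that get_closest_topics expects one matrix per model; unpacking a single matrix with * is a common mistake that raises an IndexError.

# Topic stability: a sketch assuming the model, X, vocabulary, and
# biterms objects from the example above
model2 = btm.BTM(
    X, vocabulary, seed=32123, T=8, M=20, alpha=50/8, beta=0.01)
model2.fit(biterms, iterations=20)

# Pass one topics-vs-words matrix per model
closest_topics, dist = btm.get_closest_topics(
    model.matrix_topics_words_, model2.matrix_topics_words_)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)
print(stable_topics)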

Comments
  • the topic distribution for all docs is similar

    The document-topic distributions come out nearly identical for every document, each dominated by a single topic:

    [9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
    [9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
    [9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
    [9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
    [9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
    [9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]

    bug help wanted good first issue 
    opened by JennieGerhardt 11
  • ERROR: Failed building wheel for bitermplus

    Building the wheel fails on macOS while compiling the Cython extension:

    creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
    src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
    #include <omp.h>
             ^~~~~~~
    1 error generated.
    error: command '/usr/bin/clang' failed with exit code 1
    [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for bitermplus
    Failed to build bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    bug documentation 
    opened by QinrenK 9
  • Got an unexpected result in marked sample

    Hi, @maximtrp, I am trying to use bitermplus for topic modeling. However, when I use a labeled sample to train the model, I get an unexpected result. First, the labeled samples contain 5 classes, but the trained model gets a huge perplexity when the number of topics is 5. Second, when I test the topic parameter from 1 to 20, perplexity keeps decreasing as the number of topics grows. My code is the following:

    df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts'])
    texts = df['texts'].str.strip().tolist()
    stop_words = segmentWord.stopwordslist()
    perplexitys = []
    coherences = []

    for T in range(1, 21):
        print(T)
        X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
        # Vectorizing documents
        docs_vec = btm.get_vectorized_docs(texts, vocabulary)
        # Generating biterms
        biterms = btm.get_biterms(docs_vec)
        # Initializing and running the model
        model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01)
        model.fit(biterms, iterations=2000)
        p_zd = model.transform(docs_vec)
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T)
        coherence = model.coherence_
        perplexitys.append(perplexity)
        coherences.append(coherence)

    opened by Chen-X666 7
  • Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

    Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code raises the error mentioned in the title. It seems something in the get_words_freqs function goes wrong. I would appreciate advice on how to fix it.

    opened by Sajad7010 4
  • Cannot find Closest topics and Stable topics

    Hello there, I am able to generate the model and visualize it. But when I try to find the closest and stable topics, I get an error on this line:

    closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)
    

    The error is:

    IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
    

    This is despite me separately checking that the array is 2-D. I am pasting the code below; please check whether I am doing anything wrong.

    Thank you.

    X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))
    
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)
    
    # Generating biterms
    Y = X.todense()
    biterms = btm.get_biterms(docs_vec, 15)
    
    # INITIALIZING AND RUNNING MODEL
    model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
    model.fit(biterms, iterations=500, verbose=True)
    p_zd = model.transform(docs_vec,verbose=True)  
    print(p_zd) 
    
    # matrix of document-topics; topics vs. documents, topics vs. words probabilities 
    matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
    topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
    matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.
    
    # Getting stable topics
    print("Array Dimension = ",len(matrix_topic_words.shape))
    closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
    stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)
    
    # Stable topics indices list
    print(stable_topics)
    
    help wanted question 
    opened by RashmiBatra 4
  • Questions regarding Perplexity and Model Comparison with C++

    I have two questions regarding this model. First, I noticed that the perplexity evaluation metric is implemented. Traditionally, however, perplexity is mostly computed on a held-out dataset. Does that mean that, when using this model, we should leave out a certain proportion of the data and compute perplexity on the samples that were not used for training? (A hedged held-out sketch is given after this comments list.) My second question: I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which part was implemented differently?

    help wanted question 
    opened by orpheus92 3
  • How do I get the topic words?

    Hi,

    Firstly, thanks for sharing your code.

    Not an issue, just a question. I'm able to see the relevant words for a topic in the tmplot report. How do I get those words programmatically? I need at least the three most relevant terms per topic (see the top-words sketch in the Example section above).

    Thanks in advance.

    question 
    opened by aguinaldoabbj 3
  • failed building wheels

    Hi!

    I got an error when running pip3 install bitermplus on macOS (Intel-based, Ventura), using Python 3.10.8 in a separate venv (not Anaconda):

    Building wheels for collected packages: bitermplus
      Building wheel for bitermplus (pyproject.toml) ... error
      error: subprocess-exited-with-error
    
      × Building wheel for bitermplus (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [34 lines of output]
          Error in sitecustomize; set PYTHONVERBOSE for traceback:
          AssertionError:
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build/lib.macosx-12-x86_64-cpython-310
          creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          running egg_info
          writing src/bitermplus.egg-info/PKG-INFO
          writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
          writing requirements to src/bitermplus.egg-info/requires.txt
          writing top-level names to src/bitermplus.egg-info/top_level.txt
          reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          adding license file 'LICENSE'
          writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
          copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          running build_ext
          building 'bitermplus._btm' extension
          creating build/temp.macosx-12-x86_64-cpython-310
          creating build/temp.macosx-12-x86_64-cpython-310/src
          creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
          clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
          src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
          #include <omp.h>
                   ^~~~~~~
          1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for bitermplus
    Failed to build bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
    

    Could this error be related to #29? I've tested on a PC and it worked though.

    bug documentation 
    opened by alanmaehara 2
  • Failed building wheel for bitermplus

    When I try to install bitermplus with pip install bitermplus, I get an error message like this:

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    bug 
    opened by novra 2
  • Calculation of NMI, AMI, RI

    I'm trying to test the model and see whether it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply NMI, AMI, and RI, so I'm wondering how to get the labels from the model. @maximtrp (A hedged sketch is given after this comments list.)

    opened by gitassia 2
  • Implementation Guide

    I was wondering whether there is any way to print the topics generated by the BTM model, just as I can with Gensim. In addition, I am getting all negative coherence values in the range of -500 to -600, and I am not sure whether I am doing something wrong. The issue is that I am not able to interpret the results; even plotting gives some strange output (screenshots attached in the original issue).

    The attached image shows what is held by the variable adobe; again, I am not sure whether it needs to be in this form or whether each row needs to be a list.
    opened by neel6762 2
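
On the held-out perplexity question above: nothing prevents scoring unseen documents, since btm.perplexity only needs the topics-vs-words matrix, a document-topic matrix, and a document-term matrix. A hedged sketch follows; the split ratio and variable names are illustrative, and it assumes that get_words_freqs forwards keyword arguments (here, a fixed vocabulary) to scikit-learn's CountVectorizer, as the stop_words and max_df calls quoted above suggest.

# Held-out perplexity: a sketch, not the package's documented workflow
from sklearn.model_selection import train_test_split

train_texts, test_texts = train_test_split(
    texts, test_size=0.2, random_state=0)

# Fit on the training part only
X_train, vocabulary, vocab_dict = btm.get_words_freqs(train_texts)
docs_vec_train = btm.get_vectorized_docs(train_texts, vocabulary)
model = btm.BTM(
    X_train, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(btm.get_biterms(docs_vec_train), iterations=20)

# Score the held-out part, reusing the training vocabulary
# (vocabulary kwarg forwarded to CountVectorizer -- an assumption)
X_test, _, _ = btm.get_words_freqs(test_texts, vocabulary=vocab_dict)
docs_vec_test = btm.get_vectorized_docs(test_texts, vocabulary)
p_zd_test = model.transform(docs_vec_test)
heldout_perplexity = btm.perplexity(
    model.matrix_topics_words_, p_zd_test, X_test, 8)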
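
On the NMI/AMI/RI question above: a per-document label can be taken as the argmax of the document-topic distribution returned by model.transform. A minimal sketch, assuming p_zd from the example above and a hypothetical true_labels array of ground-truth classes; the metric functions are standard scikit-learn:

# Per-document topic labels for NMI / AMI / RI: a sketch
from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    rand_score)

predicted = p_zd.argmax(axis=1)  # most probable topic per document
# true_labels is a hypothetical array of ground-truth classes
nmi = normalized_mutual_info_score(true_labels, predicted)
ami = adjusted_mutual_info_score(true_labels, predicted)
ri = rand_score(true_labels, predicted)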
Releases

Latest release: v0.6.12

Owner

Maksim Terpilowski, research scientist