Biterm Topic Model (BTM): modeling topics in short texts

Overview


Bitermplus implements the Biterm Topic Model (BTM) for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. It is a Cythonized re-implementation of the original BTM. The package can also compute perplexity and semantic coherence metrics.
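
For intuition: a biterm is an unordered pair of distinct words co-occurring in the same short document (or within a fixed-size window). Here is a toy illustration of biterm extraction, independent of the package's actual implementation:

from itertools import combinations

doc = ['cat', 'sits', 'mat']          # one tokenized short text
biterms = list(combinations(doc, 2))  # [('cat', 'sits'), ('cat', 'mat'), ('sits', 'mat')]

BTM models the generation of these word pairs directly over the whole corpus, which alleviates the data sparsity that hampers document-level topic models on short texts.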

Development

Please note that bitermplus is under active development. Refer to the documentation to stay up to date.

Requirements

  • cython
  • numpy
  • pandas
  • scipy
  • scikit-learn
  • tqdm

Setup

Linux and Windows

There should be no issues installing bitermplus under these operating systems. You can install the package directly from PyPI:

pip install bitermplus

Or from this repo:

pip install git+https://github.com/maximtrp/bitermplus.git

macOS

First, you need to install the Xcode Command Line Tools and Homebrew. Then, install libomp using brew:

xcode-select --install
brew install libomp
pip3 install bitermplus
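
If the build still fails with fatal error: 'omp.h' file not found (see the issues below), explicitly pointing the compiler at Homebrew's libomp before installing may help. This is only a sketch assuming libomp was installed via Homebrew; adjust the paths to your setup:

# Assumption: libomp installed via Homebrew; brew --prefix resolves its location
export CPPFLAGS="$CPPFLAGS -I$(brew --prefix libomp)/include"
export LDFLAGS="$LDFLAGS -L$(brew --prefix libomp)/lib"
pip3 install bitermplus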

Example

Model fitting

import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# PREPROCESSING
# Obtaining term frequencies in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_
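
After fitting, you will usually want to inspect the top words of each topic and assign each document its most probable topic. A minimal sketch, assuming the helper functions btm.get_top_topic_words and btm.get_docs_top_topic are present in your installed version (check the documentation):

# INSPECTING TOPICS (assumed helpers; verify against your bitermplus version)
top_words = btm.get_top_topic_words(model, words_num=3)  # top words per topic
docs_top_topics = btm.get_docs_top_topic(texts, p_zd)    # most probable topic per document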

Results visualization

You need to install tmplot first:
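
pip install tmplot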

import tmplot as tmp
tmp.report(model=model, docs=texts)

Report interface

Tutorial

There is a tutorial in the documentation that covers the important steps of topic modeling, including stability measures and results visualization; a sketch of the stability workflow follows below.
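
For orientation, here is a minimal sketch of the stability workflow: comparing topics between two independently fitted models with btm.get_closest_topics and btm.get_stable_topics. The signatures below are assumptions inferred from usage in the issues that follow; verify them against the documentation. Note that get_closest_topics expects one topics-words matrix per model, so do not unpack a single matrix with *:

# STABILITY SKETCH (assumed signatures; check the docs)
model2 = btm.BTM(X, vocabulary, seed=54321, T=8, M=20, alpha=50/8, beta=0.01)
model2.fit(biterms, iterations=20)

# One topics vs. words matrix per fitted model; the first is the reference
closest_topics, dist = btm.get_closest_topics(
    model.matrix_topics_words_, model2.matrix_topics_words_)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)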

Comments
  • the topic distribution for all doc is similar

    The topic distribution for every document is nearly identical:

    [9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
    [9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
    [9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
    [9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
    [9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
    [9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]

    bug help wanted good first issue 
    opened by JennieGerhardt 11
  • ERROR: Failed building wheel for bitermplus

    creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
    src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
    #include <omp.h>
             ^~~~~~~
    1 error generated.
    error: command '/usr/bin/clang' failed with exit code 1
    [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for bitermplus
    Failed to build bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    bug documentation 
    opened by QinrenK 9
  • Got an unexpected result in marked sample

    Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, when I train the model on a labeled sample, I get unexpected results. First, the labeled samples contain 5 classes, but the trained model yields a huge perplexity when the number of topics is 5. Second, when I test topic numbers from 1 to 20, perplexity keeps decreasing as the number of topics increases. My code is the following:

    df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts'])
    texts = df['texts'].str.strip().tolist()
    print(df)
    stop_words = segmentWord.stopwordslist()
    perplexitys = []
    coherences = []

    for T in range(1, 21, 1):
        print(T)
        X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
        # Vectorizing documents
        docs_vec = btm.get_vectorized_docs(texts, vocabulary)
        # Generating biterms
        biterms = btm.get_biterms(docs_vec)
        # INITIALIZING AND RUNNING MODEL
        model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01)
        model.fit(biterms, iterations=2000)
        p_zd = model.transform(docs_vec)
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T)
        coherence = model.coherence_
        perplexitys.append(perplexity)
        coherences.append(coherence)

    opened by Chen-X666 7
  • Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

    Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code raises the error mentioned in the title. Something in the get_words_freqs function seems to go wrong. I would appreciate your advice on how to fix this.

    opened by Sajad7010 4
  • Cannot find Closest topics and Stable topics

    Hello there, I am able to generate the model and visualize it. But when I try to find the closest and stable topics, I get an error for this line of code:

    closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)
    

    The error is:

    IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
    

    This is despite me separately checking the array size: it is 2-D. I am pasting the code below. Please check whether I am doing anything wrong.

    Thank you.

    X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))
    
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)
    
    # Generating biterms
    Y = X.todense()
    biterms = btm.get_biterms(docs_vec, 15)
    
    # INITIALIZING AND RUNNING MODEL
    model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
    model.fit(biterms, iterations=500, verbose=True)
    p_zd = model.transform(docs_vec,verbose=True)  
    print(p_zd) 
    
    # matrix of document-topics; topics vs. documents, topics vs. words probabilities 
    matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
    topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
    matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.
    
    # Getting stable topics
    print("Array Dimension = ",len(matrix_topic_words.shape))
    closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
    stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)
    
    # Stable topics indices list
    print(stable_topics)
    
    help wanted question 
    opened by RashmiBatra 4
  • Questions regarding Perplexity and Model Comparison with C++

    I have two questions regarding this model. First, I noticed that the perplexity evaluation metric is implemented. However, perplexity is traditionally computed on a held-out dataset. Does that mean that when using this model, we should leave out a certain proportion of the data and compute perplexity on samples that have not been used for training? My second question is that I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which part was implemented differently?

    help wanted question 
    opened by orpheus92 3
  • How do I get the topic words?

    Hi,

    Firstly, thanks for sharing your code.

    Not an issue, just a question. I can see the relevant words for a topic in the tmplot report. How do I get those words programmatically? I need at least the three most relevant terms.

    Thanks in advance.

    question 
    opened by aguinaldoabbj 3
  • failed building wheels

    Hi!

    I've got an error when running pip3 install bitermplus on macOS (Intel-based, Ventura), using Python 3.10.8 in a separate venv (not Anaconda):

    Building wheels for collected packages: bitermplus
      Building wheel for bitermplus (pyproject.toml) ... error
      error: subprocess-exited-with-error
    
      × Building wheel for bitermplus (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [34 lines of output]
          Error in sitecustomize; set PYTHONVERBOSE for traceback:
          AssertionError:
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build/lib.macosx-12-x86_64-cpython-310
          creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          running egg_info
          writing src/bitermplus.egg-info/PKG-INFO
          writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
          writing requirements to src/bitermplus.egg-info/requires.txt
          writing top-level names to src/bitermplus.egg-info/top_level.txt
          reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          adding license file 'LICENSE'
          writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
          copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          running build_ext
          building 'bitermplus._btm' extension
          creating build/temp.macosx-12-x86_64-cpython-310
          creating build/temp.macosx-12-x86_64-cpython-310/src
          creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
          clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
          src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
          #include <omp.h>
                   ^~~~~~~
          1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for bitermplus
    Failed to build bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
    

    Could this error be related to #29? I've tested on a PC and it worked though.

    bug documentation 
    opened by alanmaehara 2
  • Failed building wheel for bitermplus

    When I try to install bitermplus with pip install bitermplus, I get an error message like this:

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    bug 
    opened by novra 2
  • Calculation of nmi,ami,ri

    I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply NMI, AMI, and RI, so I'm wondering how to get the labels from the model. @maximtrp

    opened by gitassia 2
  • Implementation Guide

    I was wondering whether there is any way to print the topics generated by the BTM model, just like I can with Gensim. In addition, I am getting all negative coherence values in the range of -500 to -600. I am not sure if I am doing something wrong. The issue is that I am not able to interpret the results; even plotting gives some strange output.

    [image]

    The following image shows what is held by the variable adobe; again, I am not sure whether it needs to be in this form or whether each row needs to be a list.

    [image]
    opened by neel6762 2
Releases (v0.6.12)

Owner

Maksim Terpilowski, research scientist