Pipeline for training LSA models using Scikit-Learn.

Last update: Sep 05, 2022

Overview

Latent Semantic Analysis

Pipeline for training LSA models using Scikit-Learn.

Usage

Instead of writing custom code for latent semantic analysis, you just need to:

Install the pipeline:

pip install latent-semantic-analysis

Run the pipeline:

either in the terminal:

lsa-train --path_to_config config.yaml

or in Python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section.

No data preparation is needed: the pipeline only requires a CSV file with a raw text column (the column name is arbitrary).
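As a minimal sketch of the expected input, the snippet below writes a one-column CSV file using only the Python standard library. The helper name `write_corpus` is hypothetical and not part of the package; the defaults `text` and `,` simply mirror the `text_column` and `sep` values from the default config.

```python
import csv
from pathlib import Path


def write_corpus(path, texts, text_column="text", sep=","):
    """Write raw texts to a CSV file with a single text column.

    Hypothetical helper for illustration; not part of latent-semantic-analysis.
    """
    path = Path(path)
    # Create the parent folder (e.g. "data/") if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow([text_column])  # header row with the column name
        writer.writerows([text] for text in texts)  # one raw document per row
    return path
```

Calling `write_corpus("data/data.csv", ["first document", "second document"])` would produce a file that matches the `data_path`, `sep`, and `text_column` values of the default config.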

Config

The user interface consists of only one file:

config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Change config.yaml to create the desired configuration, then train the LSA model with the following command:

terminal:

lsa-train --path_to_config config.yaml

python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: the tf-idf and svd sections hold the parameters of sklearn's TfidfVectorizer and TruncatedSVD respectively, so you can set any parameter that these classes accept.
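To illustrate how these two config sections map onto scikit-learn, here is a rough sketch that builds an equivalent pipeline by hand. It is not the package's own code: the variable names, the step names, and the toy corpus are assumptions, and `n_components` is lowered to 2 because the arpack solver needs fewer components than documents or terms. Note that a YAML loader reads `(1, 1)` as a string, so the sketch converts it to a tuple with `ast.literal_eval`; how the package itself handles this value is not documented here.

```python
import ast

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Values as a YAML loader would return them for the default config.
tfidf_params = {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1}
# n_components reduced from 10 to 2 so the toy corpus below is large enough.
svd_params = {"n_components": 2, "algorithm": "arpack", "random_state": 42}

# TfidfVectorizer expects a tuple, so parse the string "(1, 1)" safely.
tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])

# Toy corpus (illustrative data only).
docs = [
    "cats chase mice in the garden",
    "dogs chase cats around the house",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the garden has flowers and trees",
    "banks raised interest rates again",
]

lsa = Pipeline(
    [
        ("tfidf", TfidfVectorizer(**tfidf_params)),
        ("svd", TruncatedSVD(**svd_params)),
    ]
)

# Rows are documents, columns are latent topics.
doc_embeddings = lsa.fit_transform(docs)
# Rows are vocabulary terms, columns are latent topics.
term_embeddings = lsa.named_steps["svd"].components_.T

print(doc_embeddings.shape, term_embeddings.shape)
```

Any other keyword accepted by TfidfVectorizer or TruncatedSVD (for example `stop_words` or `n_iter`) could be added to the corresponding dictionary in the same way.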

Output

After training the model, the pipeline produces the following files:

model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
config.yaml - config that was used to train the model
logging.txt - logging file
doc2topic.json - document embeddings
term2topic.json - term embeddings
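The two embedding files listed above can be read back with the standard library, as in the sketch below. The helper `load_embeddings` is hypothetical, and the folder passed to it is an assumption: the exact location of the files under `path_to_save_folder` and the internal structure of the JSON are not described here, so the code returns the parsed JSON unchanged.

```python
import json
from pathlib import Path


def load_embeddings(folder):
    """Load doc2topic.json and term2topic.json from a results folder.

    Hypothetical helper for illustration; returns the parsed JSON as is.
    """
    folder = Path(folder)
    result = {}
    for name in ("doc2topic", "term2topic"):
        with open(folder / f"{name}.json", encoding="utf-8") as f:
            result[name] = json.load(f)
    return result
```

The trained model.joblib file would typically be loaded with `joblib.load`, which requires scikit-learn to be installed in the same environment.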

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}

Releases (v0.1.0)

v0.1.0 (Oct 8, 2021)

First Release! 🥳🎉🍾

Owner

Dani El-Ayyass
