A simple machine learning package to cluster keywords in higher-level groups.

Last update: Dec 18, 2022

Overview

Simple Keyword Clusterer

A simple machine learning package to cluster keywords in higher-level groups.

Example:
"Senior Frontend Engineer" --> "Frontend Engineer"
"Junior Backend developer" --> "Backend developer"

Installation

pip install simple_keyword_clusterer

Usage

# import the package
from simple_keyword_clusterer import Clusterer

# read your keywords into a list
with open("../my_keywords.txt", "r") as f:
    data = f.read().splitlines()

# instantiate object
clusterer = Clusterer()

# apply clustering
df = clusterer.extract(data)

print(df)

Performance

The algorithm will find the optimal number of clusters automatically based on the best Silhouette Score.
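As an illustration of what selecting the number of clusters by Silhouette Score can look like, here is a minimal sketch. It is not the package's own implementation: the TF-IDF features, the KMeans estimator, the candidate range and the helper name `pick_n_clusters` are all assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_n_clusters(keywords, candidates=range(2, 6), seed=0):
    """Return (best_k, best_score), maximizing the Silhouette Score.

    Hypothetical helper: TF-IDF + KMeans are assumptions, not
    necessarily what simple_keyword_clusterer does internally.
    """
    # Turn each keyword into a dense TF-IDF vector
    features = TfidfVectorizer().fit_transform(keywords).toarray()
    best_k, best_score = None, -1.0
    for k in candidates:
        # The Silhouette Score is only defined for 2 <= k <= n_samples - 1
        if k < 2 or k > len(keywords) - 1:
            continue
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        # Skip degenerate results where only one cluster is populated
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(features, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


keywords = [
    "senior frontend engineer", "junior frontend engineer", "frontend engineer",
    "senior backend developer", "junior backend developer", "backend developer",
    "data scientist", "senior data scientist", "lead data scientist",
]
best_k, best_score = pick_n_clusters(keywords)
print(best_k, round(best_score, 3))
```

The score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters, which is why the candidate with the highest score is kept.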

You can also specify the number of clusters yourself:

# instantiate object
clusterer = Clusterer(n_clusters=4)

# apply clustering
df = clusterer.extract(data)

For best performance, reduce the variance of the data by keeping all keywords within the same semantic context.
For example, a file of job-title keywords should stay coherent and should not contain unrelated terms such as gardening keywords.

If items are clearly separable, the algorithm should still be able to provide a usable output.

Customization

You can customize the clustering mechanism through two files:

blacklist.txt
to_normalize.txt

If you notice that the clustering identifies unwanted groups, you can blacklist certain words simply by appending them to the blacklist.txt file.

The to_normalize.txt file contains tuples, each identifying a transformation to apply to the keywords. For instance:

("back end", "backend"), ("front end", "frontend"), ("sr", "Senior"), ("jr", "junior")

Simply add your tuples to use this functionality.
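To make the effect of these two files concrete, the sketch below applies replacement tuples and a blacklist to a single keyword. How the package actually reads and applies blacklist.txt and to_normalize.txt is not documented here, so the in-memory `TO_NORMALIZE` and `BLACKLIST` values, the `normalize` helper and the lowercasing step are assumptions for illustration only.

```python
import re

# Hypothetical in-memory stand-ins for to_normalize.txt and blacklist.txt
TO_NORMALIZE = [
    ("back end", "backend"),
    ("front end", "frontend"),
    ("sr", "Senior"),
    ("jr", "junior"),
]
BLACKLIST = {"remote"}  # example word, not taken from the package


def normalize(keyword, pairs=TO_NORMALIZE, blacklist=BLACKLIST):
    """Apply each (source, target) replacement, then drop blacklisted words."""
    text = keyword.lower()
    for source, target in pairs:
        # \b restricts matches to whole words, so "sr" does not
        # match inside a longer word
        text = re.sub(rf"\b{re.escape(source)}\b", target, text)
    # Compare case-insensitively against the blacklist
    words = [w for w in text.split() if w.lower() not in blacklist]
    return " ".join(words)


print(normalize("Sr Front End Engineer remote"))  # Senior frontend engineer
```

Matching on word boundaries is a design choice in this sketch: a plain substring replacement would also rewrite short abbreviations that happen to appear inside other words.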

Dependencies

Scikit-learn
Pandas
Matplotlib
Seaborn
Numpy
NLTK
Tqdm

Make sure to download the NLTK English stopwords and the Punkt tokenizer models with the following commands:

import nltk

nltk.download("stopwords")
nltk.download("punkt")

Contact

If you would like to get in touch, send me an email. You can find my contact information on my website.

Owner

Andrea D'Agostino
