BERT, LDA, and TFIDF based keyword extraction in Python

Overview

kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichlet Allocation. The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see kwx.languages for the various degrees of language support). A unique focus is allowing users to decide which words to not include in outputs, thereby guaranteeing sensible results that are in line with user intuitions.

For a thorough overview of the process and techniques see the Google slides, and reference the documentation for explanations of the models and visualization methods.

Installation

kwx can be downloaded from PyPI via pip or sourced directly from this repository:

pip install kwx

git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install

import kwx

Models

Implemented NLP modeling methods within kwx.model include:

BERT

Bidirectional Encoder Representations from Transformers (BERT) derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.

kwx uses sentence-transformers pretrained models. See their GitHub and documentation for the available models.

LDA

Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.

Although not as computationally robust as some machine learning models, LDA provides quick results that are suitable for many applications. Specifically for keyword extraction, in most settings the results are similar to those of BERT in a fraction of the time.

Other Methods

The user can also choose to simply query the most common words in a text corpus or compute TFIDF (Term Frequency Inverse Document Frequency) keywords: those that are distinctive of one text body relative to another that it is compared against. The former method is used in kwx as a baseline to check model efficacy, and the latter is a useful baseline when the user has another text or text body to compare the target corpus against.
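
As a rough illustration of these two baselines, the following sketch uses only the Python standard library. It is not kwx's implementation: the helper names most_frequent and tfidf_keywords, the whitespace tokenization, and the choice to treat each corpus as a single document are all assumptions made for the example.

```python
import math
from collections import Counter


def most_frequent(texts, num_keywords=3):
    """Return the most common whitespace-separated tokens across the texts."""
    counts = Counter(token for text in texts for token in text.split())
    return [word for word, _ in counts.most_common(num_keywords)]


def tfidf_keywords(target_texts, comparison_corpora, num_keywords=3):
    """Rank target tokens by term frequency times inverse document frequency.

    Assumption for this sketch: the target corpus and each comparison corpus
    are each collapsed into one document.
    """
    documents = [" ".join(target_texts)] + [" ".join(c) for c in comparison_corpora]
    tokenized = [doc.split() for doc in documents]
    # Number of documents in which each token appears at least once
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    target_counts = Counter(tokenized[0])
    n_docs = len(tokenized)
    scores = {
        word: (count / len(tokenized[0])) * math.log(n_docs / doc_freq[word])
        for word, count in target_counts.items()
    }
    # Sort by descending score, breaking ties alphabetically
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:num_keywords]]


airline_tweets = ["flight delay again", "flight cancelled no refund", "delay at gate"]
hotel_reviews = ["room was clean", "no refund for room", "late check in at desk"]

print(most_frequent(airline_tweets, num_keywords=2))
print(tfidf_keywords(airline_tweets, [hotel_reviews], num_keywords=2))
```

Words shared by both corpora (here "no", "refund", and "at") receive a score of zero, so only words specific to the target corpus can rank as TFIDF keywords.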

Usage

Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. examples/kw_extraction provides an example of how to use kwx by deriving keywords from tweets in the Kaggle Twitter US Airline Sentiment dataset.

The following outlines using kwx to derive keywords from a text corpus with prompt_remove_words set to True (the user will be asked if some of the extracted words need to be removed):

Text Cleaning

from kwx.utils import prepare_data

input_language = "english" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

Keyword Extraction

from kwx.model import extract_kws

num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keywords for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT", # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)

The BERT keywords are:

['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']

Should words be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica

The new BERT keywords are:

['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']

Should words be removed [y/n]? n

The model is rerun until the user has removed all words known to be unreasonable, leaving a suitable output. kwx.model.gen_files can also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
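
The rerun behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: extract_until_accepted and its ask_user callback are hypothetical names, and a simple frequency count stands in for the BERT or LDA model that kwx would actually rerun.

```python
from collections import Counter


def extract_until_accepted(texts, num_keywords, ignore_words, ask_user):
    """Re-extract keywords until the reviewer asks for no further removals.

    `ask_user` is a hypothetical callback: it receives the current keywords
    and returns the words to drop (an empty list ends the loop).
    """
    ignored = set(ignore_words)
    while True:
        # Stand-in extractor: most frequent tokens that are not ignored
        counts = Counter(
            token for text in texts for token in text.split() if token not in ignored
        )
        keywords = [word for word, _ in counts.most_common(num_keywords)]
        to_remove = ask_user(keywords)
        if not to_remove:
            return keywords
        ignored.update(to_remove)


tweets = ["united flight delay", "united lost baggage", "flight delay again"]
# Scripted answers stand in for the interactive [y/n] prompt
answers = iter([["united"], []])
final_keywords = extract_until_accepted(
    tweets, num_keywords=3, ignore_words=["again"], ask_user=lambda kws: next(answers)
)
print(final_keywords)
```

As in the session shown above, removing a word frees a slot that the next pass fills with a new keyword, so the output keeps the requested length where enough candidates exist.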

Visuals

kwx.visuals includes the following functions for presenting and analyzing the results of keyword extraction:

Topic Number Evaluation

A graph of topic coherence and overlap given a variable number of topics to derive keywords from.

from kwx.visuals import graph_topic_num_evals
import matplotlib.pyplot as plt

graph_topic_num_evals(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
    metrics=True, #  stability and coherence
)
plt.show()
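
To give a sense of what an overlap measure between topics might look like, the sketch below computes the average pairwise Jaccard similarity of topic keyword lists. This is an assumption for illustration only: the function name mean_topic_overlap is hypothetical, and the metric kwx actually plots may be defined differently.

```python
from itertools import combinations


def mean_topic_overlap(topic_keywords):
    """Average pairwise Jaccard similarity between non-empty topic keyword lists."""
    pairs = list(combinations(topic_keywords, 2))
    if not pairs:
        # Fewer than two topics: nothing to compare
        return 0.0
    similarities = [
        len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs
    ]
    return sum(similarities) / len(similarities)


topics = [
    ["flight", "delay", "hour"],
    ["baggage", "lost", "claim"],
    ["flight", "cancel", "hour"],
]
print(round(mean_topic_overlap(topics), 3))
```

A lower value means the topics share fewer keywords, which is one reason to compare several topic numbers before settling on one.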

t-SNE

t-SNE allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
    dimension="both",  # 2d and 3d are options
    text_corpus=text_corpus,
    num_topics=10,
    remove_3d_outliers=True,
)
plt.show()

pyLDAvis

pyLDAvis is included so that users can inspect LDA extracted topics, and further so that it can easily be generated for output files.

from kwx.visuals import pyLDAvis_topics

pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)

Word Cloud

Word clouds via wordcloud are included for a basic representation of the text corpus, specifically as a way to convey basic visual information to potential stakeholders. The following figure from examples/kw_extraction shows a word cloud generated from tweets of US air carrier passengers:

from kwx.visuals import gen_word_cloud

ignore_words = ["words", "user", "knows", "they", "don't", "want"]

gen_word_cloud(
    text_corpus=text_corpus,
    ignore_words=ignore_words,
    height=500,
)

To-Do

Please see the contribution guidelines if you are interested in contributing to this project. Work that is in progress or could be implemented includes:

Comments
  • Text by text keyword extraction in dataset

    First of all, thank you for the model. I want to do the following: there are 20 texts in my dataset, and I want to extract the keywords of each text. How can I do that?

    bug question 
    opened by AhmetCakar 6
  • Error

    Error "__init__() got an unexpected keyword argument 'common_terms'" occurred when running example kw_extraction.ipynb

    Hi, I am trying to run the notebook "kw_extraction.ipynb" given as an example in Google Colab. At the data preparation step, I got the error "__init__() got an unexpected keyword argument 'common_terms'".

    May I know how to solve this? It seems like it is using a parameter that does not exist in gensim_models.phrases anymore, so shall I change the version of gensim to a lower level...?

    bug 
    opened by Y-H-Lai 5
  • ModuleNotFoundError: No module named 'pyLDAvis.gensim'

    Hi Andrew, I found this ModuleNotFoundError while running the line

    from kwx.model import extract_kws

    Error description:

         25 import pandas as pd
         26 import pyLDAvis
    ---> 27 import pyLDAvis.gensim
         28 import seaborn as sns
         29 from gensim import corpora

    ModuleNotFoundError: No module named 'pyLDAvis.gensim'

    However, it can be solved by installing: pip install pyLDAvis==3.2.2

    bug 
    opened by AbhiPawar5 5
  • [WinError 3] The system cannot find the path specified: 'C:\\mysystem/.cache\\torch\\sentence_transformers\\sbert.net_models_xlm-r-bert-base-nli-stsb-mean-tokens'

    I get this error at different points of progress while running keyword extraction with BERT. For example, it first occurred at 96 percent, then at 100 percent, and most recently at 26 percent. Can you help me?

    opened by AhmetCakar 4
  • Bump certifi from 2021.10.8 to 2022.12.7

    Bumps certifi from 2021.10.8 to 2022.12.7.

    dependencies 
    opened by dependabot[bot] 3
  • Keyword extraction for BERT does not work for less samples

    Hi Andrew, I tried the keyword extraction API for just 5 samples in a dataframe.

    bert_kws = extract_kws(
        method="BERT",  # "BERT", "LDA", "TFIDF", "frequency"
        bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
        text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
        input_language=input_language,
        output_language=None,  # allows the output to be translated
        num_keywords=num_keywords,
        num_topics=num_topics,
        corpuses_to_compare=None,  # for TFIDF
        ignore_words=ignore_words,
        prompt_remove_words=True,  # check words with user
        show_progress_bar=True,
        batch_size=3,
    )

    This returns: ValueError: n_samples=5 should be >= n_clusters=10 for batch_size. I wonder why that's happening? Thanks!

    bug question 
    opened by AbhiPawar5 3
  • Bump ipython from 7.10.0 to 7.16.3

    Bumps ipython from 7.10.0 to 7.16.3.

    dependencies 
    opened by dependabot[bot] 2
  • Bump tensorflow from 2.4.1 to 2.5.0

    Bumps tensorflow from 2.4.1 to 2.5.0.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.0

    Release 2.5.0

    Major Features and Improvements

    • Support for Python3.9 has been added.
    • tf.data:
      • tf.data service now supports strict round-robin reads, which is useful for synchronous training workloads where example sizes vary. With strict round robin reads, users can guarantee that consumers get similar-sized examples in the same step.
      • tf.data service now supports optional compression. Previously data would always be compressed, but now you can disable compression by passing compression=None to tf.data.experimental.service.distribute(...).
      • tf.data.Dataset.batch() now supports num_parallel_calls and deterministic arguments. num_parallel_calls is used to indicate that multiple input batches should be computed in parallel. With num_parallel_calls set, deterministic is used to indicate that outputs can be obtained in the non-deterministic order.
      • Options returned by tf.data.Dataset.options() are no longer mutable.
      • tf.data input pipelines can now be executed in debug mode, which disables any asynchrony, parallelism, or non-determinism and forces Python execution (as opposed to trace-compiled graph execution) of user-defined functions passed into transformations such as map. The debug mode can be enabled through tf.data.experimental.enable_debug_mode().
    • tf.lite
      • Enabled the new MLIR-based quantization backend by default
        • The new backend is used for 8 bits full integer post-training quantization
        • The new backend removes the redundant rescales and fixes some bugs (shared weight/bias, extremely small scales, etc)
        • Set experimental_new_quantizer in tf.lite.TFLiteConverter to False to disable this change
    • tf.keras
      • tf.keras.metrics.AUC now support logit predictions.
      • Enabled a new supported input type in Model.fit, tf.keras.utils.experimental.DatasetCreator, which takes a callable, dataset_fn. DatasetCreator is intended to work across all tf.distribute strategies, and is the only input type supported for Parameter Server strategy.
    • tf.distribute
      • tf.distribute.experimental.ParameterServerStrategy now supports training with Keras Model.fit when used with DatasetCreator.
      • Creating tf.random.Generator under tf.distribute.Strategy scopes is now allowed (except for tf.distribute.experimental.CentralStorageStrategy and tf.distribute.experimental.ParameterServerStrategy). Different replicas will get different random-number streams.
    • TPU embedding support
      • Added profile_data_directory to EmbeddingConfigSpec in _tpu_estimator_embedding.py. This allows embedding lookup statistics gathered at runtime to be used in embedding layer partitioning decisions.
    • PluggableDevice
    • oneAPI Deep Neural Network Library (oneDNN) CPU performance optimizations from Intel-optimized TensorFlow are now available in the official x86-64 Linux and Windows builds.
      • They are off by default. Enable them by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.
      • We do not recommend using them in GPU systems, as they have not been sufficiently tested with GPUs yet.
    • TensorFlow pip packages are now built with CUDA11.2 and cuDNN 8.1.0

    Breaking Changes

    • The TF_CPP_MIN_VLOG_LEVEL environment variable has been renamed to TF_CPP_MAX_VLOG_LEVEL, which correctly describes its effect.

    Bug Fixes and Other Changes

    • tf.keras:
      • Preprocessing layers API consistency changes:
        • StringLookup added output_mode, sparse, and pad_to_max_tokens arguments with same semantics as TextVectorization.
        • IntegerLookup added output_mode, sparse, and pad_to_max_tokens arguments with same semantics as TextVectorization. Renamed max_values, oov_value and mask_value to max_tokens, oov_token and mask_token to align with StringLookup and TextVectorization.
        • TextVectorization default for pad_to_max_tokens switched to False.
        • CategoryEncoding no longer supports adapt, IntegerLookup now supports equivalent functionality. max_tokens argument renamed to num_tokens.
        • Discretization added num_bins argument for learning bins boundaries through calling adapt on a dataset. Renamed bins argument to bin_boundaries for specifying bins without adapt.
      • Improvements to model saving/loading:
        • model.load_weights now accepts paths to saved models.

    ... (truncated)

    Commits
    • a4dfb8d Merge pull request #49124 from tensorflow/mm-cherrypick-tf-data-segfault-fix-...
    • 2107b1d Merge pull request #49116 from tensorflow-jenkins/version-numbers-2.5.0-17609
    • 16b8139 Update snapshot_dataset_op.cc
    • 86a0d86 Merge pull request #49126 from geetachavan1/cherrypicks_X9ZNY
    • 9436ae6 Merge pull request #49128 from geetachavan1/cherrypicks_D73J5
    • 6b2bf99 Validate that a and b are proper sparse tensors
    • c03ad1a Ensure validation sticks in banded_triangular_solve_op
    • 12a6ead Merge pull request #49120 from geetachavan1/cherrypicks_KJ5M9
    • b67f5b8 Merge pull request #49118 from geetachavan1/cherrypicks_BIDTR
    • a13c0ad [tf.data][cherrypick] Fix snapshot segfault when using repeat and prefecth
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 2
  • Bump tensorflow from 2.5.0 to 2.5.1

    Bumps tensorflow from 2.5.0 to 2.5.1.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.1

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)
    • Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)
    • Fixes a division by 0 in inplace operations (CVE-2021-37660)
    • Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)
    • Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)
    • Fixes a heap OOB in boosted trees (CVE-2021-37664)
    • Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)
    • Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)
    • Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)
    • Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)
    • Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)
    • Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)
    • Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)
    • Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)
    • Fixes a CHECK-fail in MapStage (CVE-2021-37673)
    • Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)
    • Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)
    • Fixes a division by 0 in most convolution operators (CVE-2021-37675)
    • Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)
    • Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)
    • Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

    ... (truncated)

    Commits
    • 8222c1c Merge pull request #51381 from tensorflow/mm-fix-r2.5-build
    • d584260 Disable broken/flaky test
    • f6c6ce3 Merge pull request #51367 from tensorflow-jenkins/version-numbers-2.5.1-17468
    • 3ca7812 Update version numbers to 2.5.1
    • 4fdf683 Merge pull request #51361 from tensorflow/mm-update-relnotes-on-r2.5
    • 05fc01a Put CVE numbers for fixes in parentheses
    • bee1dc4 Update release notes for the new patch release
    • 47beb4c Merge pull request #50597 from kruglov-dmitry/v2.5.0-sync-abseil-cmake-bazel
    • 6f39597 Merge pull request #49383 from ashahab/abin-load-segfault-r2.5
    • 0539b34 Merge pull request #48979 from liufengdb/r2.5-cherrypick
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Update gensim LDA to 4.X

    This issue is for discussing and eventually implementing an update for the gensim implementations of LDA in kwx. The package was originally written with 3.X versions of gensim, and 4.X versions apparently have some dramatic improvements in modeling options/efficiency and n-gram creation (for kwx.utils.clean). Changes would need to be made in kwx.utils, kwx.model, and kwx.topic_model.

    Documenting what would need to happen for the switch and then working towards implementing it would be very much appreciated :)

    Thanks for your interest in contributing!

    enhancement good first issue question 
    opened by andrewtavis 1
  • [ImgBot] Optimize images

    opened by imgbot[bot] 1
  • Edit spaCy loading based on version

    spaCy has new loading mechanisms in its later versions that produce errors in data preparation within kwx.utils. The scripts should be changed to check the spaCy version so that these changes are accounted for and errors are not produced.
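
A minimal sketch of such a version check, not the kwx implementation. It assumes the breaking change in question is that spaCy 3 no longer accepts shortcut names such as "en" and needs the full pipeline package name. The helper names and the mapping below are hypothetical.

```python
def spacy_major_version(version: str) -> int:
    """Return the major component of a version string such as '3.1.2'."""
    return int(version.split(".")[0])


# Hypothetical mapping from language codes to full pipeline package names.
FULL_MODEL_NAMES = {"en": "en_core_web_sm", "de": "de_core_news_sm"}


def model_name_for(language: str, version: str) -> str:
    """Pick the name to pass to spacy.load for the given spaCy version."""
    if spacy_major_version(version) >= 3:
        # Assumption: spaCy 3 requires the full package name.
        return FULL_MODEL_NAMES[language]
    # spaCy 2 accepted shortcut names such as "en" when a link existed.
    return language
```

With spaCy installed, the result would be used as `spacy.load(model_name_for("en", spacy.__version__))`.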

    bug good first issue 
    opened by andrewtavis 0
  • Adding t-SNE and pyLDA style visualizations for BERT

    A major difference between BERT and LDA kwx implementations is that there are no visualization methods for BERT. It would be good to add a pyLDAvis style visualization of topic words as well as a t-SNE visualization of topic similarities. These would be added to kwx.visuals.

    enhancement help wanted 
    opened by andrewtavis 0
  • Convert translation feature

    opened by andrewtavis 1
  • Remove ngrams and topic number

    Hi Andrew, it's me again :) I want to ask two questions about the algorithm. First, when using the first BERT model, why do we remove ngrams, and can't we use the corpus without removing them? Second, when using BERT we give the number of keywords and the number of topics. How does the number of threads work; what is the logic?

    question 
    opened by AhmetCakar 9
  • TFIDF requires a corpus to compare

    Hi Andrew, I was trying the Keyword Extraction API with TF-IDF. The code is:

        bert_kws = extract_kws(
            method="TFIDF",  # "BERT", "LDA", "TFIDF", "frequency"
            bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
            text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
            input_language=input_language,
            output_language=None,  # allows the output to be translated
            num_keywords=num_keywords,
            num_topics=num_topics,
            corpuses_to_compare=None,  # for TFIDF
            ignore_words=ignore_words,
            prompt_remove_words=True,  # check words with user
            show_progress_bar=True,
            batch_size=5,
        )

    This returns the error: AssertionError: TFIDF requires another text corpus to be passed to the corpuses_to_compare argument.

    I wonder why a corpus to compare is required for keyword extraction? Thanks!
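
One way to see why a comparison corpus matters is a plain-Python TF-IDF sketch. This is not the kwx implementation (the release notes describe its TFIDF method as finding words unique to one corpus when compared to others), and the function name `tfidf_scores` is hypothetical. It uses the unsmoothed inverse document frequency log(N / df), where each corpus counts as one document.

```python
import math
from collections import Counter


def tfidf_scores(corpora):
    """Score every word in every corpus; a corpus is a list of tokens."""
    n = len(corpora)
    # Document frequency: the number of corpora that contain each word.
    df = Counter(word for corpus in corpora for word in set(corpus))
    scores = []
    for corpus in corpora:
        tf = Counter(corpus)
        scores.append({
            word: (count / len(corpus)) * math.log(n / df[word])
            for word, count in tf.items()
        })
    return scores


# With a single corpus every word has df == n, so every score is zero.
single = tfidf_scores([["cat", "sat", "cat"]])

# With a second corpus, words unique to one corpus get a positive score.
pair = tfidf_scores([["cat", "sat", "cat"], ["dog", "sat"]])
```

In this sketch a lone corpus gives no basis for ranking words, while the shared word "sat" scores zero and the corpus-specific words "cat" and "dog" stand out. Smoothed variants of the inverse document frequency avoid the zeros, but with one corpus they still reduce to a constant multiple of term frequency.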

    question 
    opened by AbhiPawar5 2
  • Adding TFIDF key-phrase extraction

    This issue is for discussing and eventually implementing key-phrase extraction for TFIDF in kwx. It would be best to first collect code snippets and documentation links for how to best implement this with scikit-learn based TFIDF models, and then from there work on an implementation can begin :)

    Thanks for your interest in contributing!

    enhancement help wanted 
    opened by andrewtavis 0
Releases(v1.0.0)
  • v1.0.0(Dec 28, 2021)

  • v0.1.8(Apr 29, 2021)

    Changes include:

    • Support has been added for gensim 3.8.x and 4.x
    • Dependencies in requirement and environment files are now condensed
    • An alert was added for users when the corpus size is too small for the number of topics
    • An import error for pyLDAvis was fixed
  • v0.1.7.3(Mar 30, 2021)

    Changes include:

    • Switching over to an src structure
    • Removing the lda_bert method because its dependencies were causing breaks
    • Code quality is now checked with Codacy
    • Extensive code formatting to improve quality and style
    • Bug fixes and a more explicit use of exceptions
    • More extensive contributing guidelines
    • Tests now use random seeds and are thus more robust
  • v0.1.5(Mar 15, 2021)

    Changes include:

    • Keyword extraction and selection are now separate steps, so modeling doesn't have to run again to get new keywords

    • Keyword extraction and cleaning are now fully separate processes

    • kwargs for sentence-transformers BERT, LDA, and TFIDF can now be passed

    • The cleaning process is verbose and uses multiprocessing

    • The user has greater control over the cleaning process

    • Reformatting of the code to make the process clearer

  • v0.1.0(Feb 17, 2021)

    First stable release of kwx

    Changes include:

    • Full documentation of the package

    • Virtual environment files

    • Bug fixes

    • Extensive testing of all modules with GH Actions and Codecov

    • Code of conduct and contribution guidelines

  • v0.0.2.2(Jan 31, 2021)

    The minimum viable product of kwx:

    • Users are able to extract keywords using the following methods

      • Most frequent words
      • TFIDF words unique to one corpus when compared to others
      • Latent Dirichlet Allocation
      • Bidirectional Encoder Representations from Transformers
      • An autoencoder application of LDA and BERT combined
    • Users are able to tell the model to remove certain words to fine tune results

    • Support is offered for a universal cleaning process in all major languages

    • Visualization techniques to display keywords and topics are included

    • Outputs can be cleanly organized in a directory or zip file

    • Runtimes for topic number comparisons are estimated using tqdm

Owner
Andrew Tavis McAllister
Data scientist, developer and designer. Humboldt University of Berlin (MS); University of Oregon (BA).