Various Algorithms for Short Text Mining

Overview

Short Text Mining in Python

CircleCI GitHub release Documentation Status Updates Python 3 pypi download stars

Introduction

This package shorttext is a Python package that facilitates supervised and unsupervised learning for short text categorization. Due to the sparseness of words and the lack of information carried in the short texts themselves, an intermediate representation of the texts and documents are needed before they are put into any classification algorithm. In this package, it facilitates various types of these representations, including topic modeling and word-embedding algorithms.

Since release 1.5.2, it runs on Python 3.9. Since release 1.5.0, support for Python 3.6 was decommissioned. Since release 1.2.4, it runs on Python 3.8. Since release 1.2.3, support for Python 3.5 was decommissioned. Since release 1.1.7, support for Python 2.7 was decommissioned. Since release 1.0.8, it runs on Python 3.7 with 'TensorFlow' being the backend for keras. Since release 1.0.7, it runs on Python 3.7 as well, but the backend for keras cannot be TensorFlow. Since release 1.0.0, shorttext runs on Python 2.7, 3.5, and 3.6.

Characteristics:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification;
  • neural network classification (including ConvNet, and C-LSTM);
  • maximum entropy classification;
  • metrics of phrases differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD);
  • character-level sequence-to-sequence (seq2seq) learning;
  • spell correction;
  • API for word-embedding algorithm for one-time loading; and
  • Sentence encodings and similarities based on BERT.

Documentation

Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.

See tutorial for how to use the package, and FAQ.

Installation

To install it, in a console, use pip.

>>> pip install -U shorttext

or, if you want the most recent development version on Github, type

>>> pip install -U git+https://github.com/stephenhky/[email protected]

Developers are advised to make sure Keras >=2 be installed. Users are advised to install the backend Tensorflow (preferred) or Theano in advance. It is desirable if Cython has been previously installed too.

See installation guide for more details.

Issues

To report any issues, go to the Issues tab of the Github page and start a thread. It is welcome for developers to submit pull requests on their own to fix any errors.

Contributors

If you would like to contribute, feel free to submit the pull requests. You can talk to me in advance through e-mails or the Issues page.

Useful Links

News

  • 07/11/2021: shorttext 1.5.3 released.
  • 07/06/2021: shorttext 1.5.2 released.
  • 04/10/2021: shorttext 1.5.1 released.
  • 04/09/2021: shorttext 1.5.0 released.
  • 02/11/2021: shorttext 1.4.8 released.
  • 01/11/2021: shorttext 1.4.7 released.
  • 01/03/2021: shorttext 1.4.6 released.
  • 12/28/2020: shorttext 1.4.5 released.
  • 12/24/2020: shorttext 1.4.4 released.
  • 11/10/2020: shorttext 1.4.3 released.
  • 10/18/2020: shorttext 1.4.2 released.
  • 09/23/2020: shorttext 1.4.1 released.
  • 09/02/2020: shorttext 1.4.0 released.
  • 07/23/2020: shorttext 1.3.0 released.
  • 06/05/2020: shorttext 1.2.6 released.
  • 05/20/2020: shorttext 1.2.5 released.
  • 05/13/2020: shorttext 1.2.4 released.
  • 04/28/2020: shorttext 1.2.3 released.
  • 04/07/2020: shorttext 1.2.2 released.
  • 03/23/2020: shorttext 1.2.1 released.
  • 03/21/2020: shorttext 1.2.0 released.
  • 12/01/2019: shorttext 1.1.6 released.
  • 09/24/2019: shorttext 1.1.5 released.
  • 07/20/2019: shorttext 1.1.4 released.
  • 07/07/2019: shorttext 1.1.3 released.
  • 06/05/2019: shorttext 1.1.2 released.
  • 04/23/2019: shorttext 1.1.1 released.
  • 03/03/2019: shorttext 1.1.0 released.
  • 02/14/2019: shorttext 1.0.8 released.
  • 01/30/2019: shorttext 1.0.7 released.
  • 01/29/2019: shorttext 1.0.6 released.
  • 01/13/2019: shorttext 1.0.5 released.
  • 10/03/2018: shorttext 1.0.4 released.
  • 08/06/2018: shorttext 1.0.3 released.
  • 07/24/2018: shorttext 1.0.2 released.
  • 07/17/2018: shorttext 1.0.1 released.
  • 07/14/2018: shorttext 1.0.0 released.
  • 06/18/2018: shorttext 0.7.2 released.
  • 05/30/2018: shorttext 0.7.1 released.
  • 05/17/2018: shorttext 0.7.0 released.
  • 02/27/2018: shorttext 0.6.0 released.
  • 01/19/2018: shorttext 0.5.11 released.
  • 01/15/2018: shorttext 0.5.10 released.
  • 12/14/2017: shorttext 0.5.9 released.
  • 11/08/2017: shorttext 0.5.8 released.
  • 10/27/2017: shorttext 0.5.7 released.
  • 10/17/2017: shorttext 0.5.6 released.
  • 09/28/2017: shorttext 0.5.5 released.
  • 09/08/2017: shorttext 0.5.4 released.
  • 09/02/2017: end of GSoC project. (Report)
  • 08/22/2017: shorttext 0.5.1 released.
  • 07/28/2017: shorttext 0.4.1 released.
  • 07/26/2017: shorttext 0.4.0 released.
  • 06/16/2017: shorttext 0.3.8 released.
  • 06/12/2017: shorttext 0.3.7 released.
  • 06/02/2017: shorttext 0.3.6 released.
  • 05/30/2017: GSoC project (Chinmaya Pancholi, with gensim)
  • 05/16/2017: shorttext 0.3.5 released.
  • 04/27/2017: shorttext 0.3.4 released.
  • 04/19/2017: shorttext 0.3.3 released.
  • 03/28/2017: shorttext 0.3.2 released.
  • 03/14/2017: shorttext 0.3.1 released.
  • 02/23/2017: shorttext 0.2.1 released.
  • 12/21/2016: shorttext 0.2.0 released.
  • 11/25/2016: shorttext 0.1.2 released.
  • 11/21/2016: shorttext 0.1.1 released.

Possible Future Updates

  • Dividing components to other packages;
  • More available corpus.
Comments
  • standalone ?

    standalone ?

    Hi. I have many questions.... :-)

    I'm a beginner for python. Is there any method to run the code standalone ?

    e.g. I trained my data. And I'd like to see the scores on terminal by classifier.score('apple') . The word 'apple' can be changed.

    Thank you regards,

    opened by chocosando 20
  • ImportError: No module named classification_exceptions

    ImportError: No module named classification_exceptions

    import shorttext

    
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-5-cb09b3381050> in <module>()
    ----> 1 import shorttext
    
    /usr/local/lib/python2.7/dist-packages/shorttext/__init__.py in <module>()
          5 sys.path.append(thisdir)
          6 
    ----> 7 from . import utils
          8 from . import data
          9 from . import classifiers
    
    /usr/local/lib/python2.7/dist-packages/shorttext/utils/__init__.py in <module>()
          4 from . import textpreprocessing
          5 from .wordembed import load_word2vec_model
    ----> 6 from . import compactmodel_io
          7 
          8 from .textpreprocessing import spacy_tokenize as tokenize
    
    /usr/local/lib/python2.7/dist-packages/shorttext/utils/compactmodel_io.py in <module>()
         13 from functools import partial
         14 
    ---> 15 import utils.classification_exceptions as e
         16 
         17 def removedir(dir):
    
    ImportError: No module named classification_exceptions
    
    
    opened by spate141 11
  • ImportError: dlopen: cannot load any more object with static TLS

    ImportError: dlopen: cannot load any more object with static TLS

    Hi, I got the following error when i import shorttext, how shall i resolve?

    Using TensorFlow backend.

    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so.7.5 locally I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so.5 locally I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so.7.5 locally I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so.7.5 locally Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.7/dist-packages/shorttext/init.py", line 7, in from . import utils File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/init.py", line 3, in from . import gensim_corpora File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/gensim_corpora.py", line 2, in from .textpreprocessing import spacy_tokenize as tokenize File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/textpreprocessing.py", line 5, in import spacy File "/usr/local/lib/python2.7/dist-packages/spacy/init.py", line 8, in from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he File "/usr/local/lib/python2.7/dist-packages/spacy/en/init.py", line 4, in from ..language import Language File "/usr/local/lib/python2.7/dist-packages/spacy/language.py", line 12, in from .syntax.parser import get_templates ImportError: dlopen: cannot load any more object with static TLS

    opened by kenyeung128 8
  • extend score to take an array of shorttext

    extend score to take an array of shorttext

    Currently, score takes only a single input and as a result, the method is very slow if you are trying to classify thousands of examples. Is there a way you can generate scores for 10K+ samples at the same time.

    opened by rja172 6
  • Importing problem (not installation) over google colab

    Importing problem (not installation) over google colab

    I am experimenting with the library for the first time. The installation was successful and didn't need any extra steps. however when I started importing the library I got the following error related to keras:

    /usr/local/lib/python3.7/dist-packages/shorttext/generators/bow/AutoEncodingTopicModeling.py in () 8 from gensim.corpora import Dictionary 9 from keras import Input ---> 10 from keras.engine import Model 11 from keras.layers import Dense 12 from scipy.spatial.distance import cosine

    ImportError: cannot import name 'Model' from 'keras.engine' (/usr/local/lib/python3.7/dist-packages/keras/engine/init.py)

    I tried to install keras separately but no improvement. any suggestions would be appreciated.

    opened by yomnamahmoud 6
  • RuntimeWarning: overflow encountered in exp2 topicmodeler.train

    RuntimeWarning: overflow encountered in exp2 topicmodeler.train

    Code: trainclassdict = shorttext.data.nihreports(sample_size=None) topicmodeler = shorttext.generators.LDAModeler() topicmodeler.train(trainclassdict, 128) Error message: /lib/python2.7/site-packages/gensim/models/ldamodel.py:535: RuntimeWarning: overflow encountered in exp2 perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

    Then the results are variable for topicmodeler.retrieve_topicvec('stem cell research')

    opened by dbonner 6
  • Remove negation terms from stopwords.txt

    Remove negation terms from stopwords.txt

    I noticed that stopwords.txt includes negation terms such as "no" and "not". These terms revert the meaning of a word or a sentence, so they should be preserved in the text data. For example, "not a good idea" would become "good idea" after stopword removal. Therefore, I recommend removing negation terms from the stopword list. Thanks!

    opened by star1327p 5
  • Input to shorttext.generators.LDAModeler()

    Input to shorttext.generators.LDAModeler()

    I was wondering what should be the format of data as input for:

    shorttext.generators.LDAModeler() topicmodeler.train(data, 100)

    Can I feed it with a pandas column? Or it should be in a dictionary format? If a dictionary, what should be the keys? I have a large set of tweets.

    opened by malizad 5
  • from shorttext.classifiers import MaxEntClassifier is it regression?

    from shorttext.classifiers import MaxEntClassifier is it regression?

    seems to be maxent is a fancy word for regression or you do have something special in your maxent? https://www.quora.com/What-is-the-relationship-between-Log-Linear-model-MaxEnt-model-and-Logistic-Regression or https://en.wikipedia.org/wiki/Multinomial_logistic_regression

    Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2][3] multiclass LR, softmax regression, multinomial logit, the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.[4]
    
    opened by Sandy4321 5
  • No Python 3.6 support with SciPy 1.6

    No Python 3.6 support with SciPy 1.6

    opened by Dobatymo 4
  • Data nihreports not available anymore

    Data nihreports not available anymore

    Some datasets are not available anymore.

    For example the following: nihtraindata = shorttext.data.nihreports(sample_size=None)

    Error message:

    Downloading...
    Source:  http://storage.googleapis.com/pyshorttext/nih_grant_public/nih_full.csv.zip
    Failure to download file!
    (<class 'urllib.error.HTTPError'>, <HTTPError 404: 'Not Found'>, <traceback object at 0x7f09063ed788>)
    

    Python error:

    HTTPError: HTTP Error 404: Not Found
    
    During handling of the above exception, another exception occurred:
    

    When opening the link the same error appears:

    image

    opened by AlessandroVol23 4
Releases(1.5.8)
Owner
Kwan-Yuet "Stephen" Ho
quantitative research, machine learning, data science, text mining, physics
Kwan-Yuet
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x using fastT5.

Reduce T5 model size by 3X and increase the inference speed up to 5X. Install Usage Details Functionalities Benchmarks Onnx model Quantized onnx model

Kiran R 399 Jan 05, 2023
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Yomichad - a Japanese pop-up dictionary that can display readings and English definitions of Japanese words

Yomichad is a Japanese pop-up dictionary that can display readings and English definitions of Japanese words, kanji, and optionally named entities. It is similar to yomichan, 10ten, and rikaikun in s

Jonas Belouadi 7 Nov 07, 2022
Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Semantic search through Wikipedia with the Weaviate vector search engine Weaviate is an open source vector search engine with build-in vectorization a

SeMI Technologies 191 Dec 26, 2022
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning

SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning We propose a SASE mode

Tower 1 Nov 20, 2021
Code for "Generative adversarial networks for reconstructing natural images from brain activity".

Reconstruct handwritten characters from brains using GANs Example code for the paper "Generative adversarial networks for reconstructing natural image

K. Seeliger 2 May 17, 2022
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Introduction Funnel-Transformer is a new self-attention model that gradually compresses the sequence of hidden states to a shorter one and hence reduc

GUOKUN LAI 197 Dec 11, 2022
A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex)

CodeJ A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex) Install requirements pip install -r

TheProtagonist 1 Dec 06, 2021
Findings of ACL 2021

Assessing Dialogue Systems with Distribution Distances [arXiv][code] We propose to measure the performance of a dialogue system by computing the distr

Yahui Liu 16 Feb 24, 2022
Using Bert as the backbone model for lime, designed for NLP task explanation (sentence pair text classification task)

Lime Comparing deep contextualized model for sentences highlighting task. In addition, take the classic explanation model "LIME" with bert-base model

JHJu 2 Jan 18, 2022
🤗🖼️ HuggingPics: Fine-tune Vision Transformers for anything using images found on the web.

🤗 🖼️ HuggingPics Fine-tune Vision Transformers for anything using images found on the web. Check out the video below for a walkthrough of this proje

Nathan Raw 185 Dec 21, 2022
Retraining OpenAI's GPT-2 on Discord Chats

Train OpenAI's GPT-2 on Discord Chats Retraining a Text Generation Model on Discord Chats using gpt-2-simple that wraps existing model fine-tuning and

Ayush Mishra 4 Oct 27, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
This repository has a implementations of data augmentation for NLP for Japanese.

daaja This repository has a implementations of data augmentation for NLP for Japanese: EDA: Easy Data Augmentation Techniques for Boosting Performance

Koga Kobayashi 60 Nov 11, 2022
Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, whic

Jesse Zaneveld 33 Dec 28, 2022
Easy to start. Use deep nerual network to predict the sentiment of movie review.

Easy to start. Use deep nerual network to predict the sentiment of movie review. Various methods, word2vec, tf-idf and df to generate text vectors. Various models including lstm and cov1d. Achieve f1

1 Nov 19, 2021