File-based TF-IDF: Calculates keywords in a document, using a word corpus.

Why?

Because I found myself with hundreds of plain text files, with no way to know what each one contains. I then recalled this thing called TF-IDF from university, but found no utility that operates on files. Hence, here we are.

How?

Basically, each word in the current document gets a score. The score increases each time the word appears in this document, and decreases with every other document in the corpus that also contains it. The words with the highest scores will thus (theoretically) be the keywords.
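
The scoring idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in tfidf.py: the tokenizer, the function names `tokenize` and `keywords`, and the toy corpus are assumptions made for the example, and the base-10 logarithm used for idf is likewise an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def keywords(document, corpus, top_n=3):
    """Rank the words of `document` by tf * idf, using `corpus` for document counts."""
    words = tokenize(document)
    counts = Counter(words)

    # df: in how many corpus documents does each word occur at least once?
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))

    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        if df[word] == 0:
            continue  # word never seen in the corpus, so no idf can be computed
        tf = count / len(words)
        idf = math.log10(n_docs / df[word])  # assumed base-10 logarithm
        scores[word] = tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical three-document corpus; the first document is the one being scored.
corpus = [
    "the whale swam under the ship",
    "the captain watched the sea",
    "the ship sailed on the sea",
]
ranking = keywords(corpus[0], corpus)
print(ranking)
```

Words that occur in every document, such as "the", get a score of zero, while words unique to the scored document rise to the top.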

Of course, this requires you to have many other documents (the corpus) to compare with. They should be written in approximately the same language and style as the document you are analysing. For example, it makes sense to split a book into chapters and use those as the corpus. Use your judgment.
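
One way to build such a corpus from a single book is sketched below. It is only an illustration: the helper names `split_chapters` and `write_corpus`, the `CHAPTER <number>` heading pattern, and the file prefix are assumptions, so adjust them to match your own text.

```python
import re
import tempfile
from pathlib import Path


def split_chapters(book_text):
    """Split a book at lines starting with 'CHAPTER <number>' (assumed heading style)."""
    parts = re.split(r"^(?=CHAPTER \d+)", book_text, flags=re.MULTILINE)
    # Anything before the first heading (title page, preface) is dropped.
    return [part.strip() for part in parts if part.strip().startswith("CHAPTER")]


def write_corpus(chapters, directory, prefix="chapter"):
    """Write each chapter to its own plain text file and return the file paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chapter in enumerate(chapters, start=1):
        path = out_dir / f"{prefix}{number}.txt"
        path.write_text(chapter + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Tiny stand-in for a real book, written to a temporary directory.
sample = "Preface\nCHAPTER 1\nCall me Ishmael.\nCHAPTER 2\nThe Carpet-Bag.\n"
chapters = split_chapters(sample)
with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in write_corpus(chapters, tmp, prefix="mobydick_chapter")]
print(names)
```

The resulting files can then be passed to tfidf as the corpus arguments.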

Installation

Copy tfidf.py to some location on $PATH

Usage

usage: tfidf [-h] [--json] [--min-df MIN_DF] [-n N | --all] --input-document INPUT_DOCUMENT [corpus ...]

Calculates keywords in a document, using a word corpus.

positional arguments:
  corpus                corpus files (optional but highly recommended)

options:
  -h, --help            show this help message and exit
  --json, -j            get output as json
  --min-df MIN_DF       if a word occurs less than this number of times in the corpus, it's not considered (default: 2)
  -n N                  limit output to this many words (default: 10)
  --all                 Don't limit the amount of words to output (default: false)
  --input-document INPUT_DOCUMENT, -i INPUT_DOCUMENT
                        document file to extract keywords from
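
For readers who want to script around the tool, the interface above could be declared with argparse roughly as follows. This is a reconstruction from the help text only, not the parser in tfidf.py; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text above (a sketch, not the original)."""
    parser = argparse.ArgumentParser(
        prog="tfidf",
        description="Calculates keywords in a document, using a word corpus.",
    )
    parser.add_argument("--json", "-j", action="store_true", help="get output as json")
    parser.add_argument("--min-df", type=int, default=2,
                        help="minimum number of occurrences in the corpus")
    # -n and --all cannot be combined, as shown by [-n N | --all] in the usage line.
    limit = parser.add_mutually_exclusive_group()
    limit.add_argument("-n", type=int, default=10, help="limit output to this many words")
    limit.add_argument("--all", action="store_true", help="do not limit the output")
    parser.add_argument("--input-document", "-i", required=True,
                        help="document file to extract keywords from")
    parser.add_argument("corpus", nargs="*", help="corpus files")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-n", "5", "-i", "mobydick_chapter1.txt", "a.txt", "b.txt"]
)
print(args)
```

Passing both `-n` and `--all` to this parser makes argparse exit with an error, which matches the mutually exclusive pair in the usage line.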

Examples

To get the top 10 keywords for chapter 1 of Moby Dick:

# assume that *.txt matches all other chapters of mobydick
$ tfidf -n 10 -i mobydick_chapter1.txt *.txt

WORD             TF_IDF           TF               
passenger        0.003            0.002            
whenever         0.003            0.002            
money            0.003            0.002            
passengers       0.002            0.001            
purse            0.002            0.001            
me               0.002            0.011            
image            0.002            0.001            
hunks            0.002            0.001            
respectfully     0.002            0.001            
robust           0.002            0.001            
-----
num words in corpus: 208425

To get every keyword as JSON:

$ tfidf --all -j -i mobydick_chapter1.txt *.txt
[
    {
        "word": "lazarus",
        "tf_idf": 0.0052818627137794375,
        "tf": 0.0028169014084507044
    },
    {
        "word": "frost",
        "tf_idf": 0.004433890895007659,
        "tf": 0.0028169014084507044
    },
    {
        "word": "bedford",
        "tf_idf": 0.0037492766733561254,
        "tf": 0.0028169014084507044
    },
    ...
]
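
The JSON output is convenient to post-process. The sketch below is one possible way to do that from Python. It assumes that tfidf is on $PATH and that, with --json, standard output contains only the JSON array; the helper names `run_tfidf` and `words_above` and the threshold value are hypothetical.

```python
import json
import subprocess


def run_tfidf(document, corpus_files):
    """Run tfidf with --all --json and return the parsed list of entries."""
    # Assumes the command is installed as described under Installation.
    result = subprocess.run(
        ["tfidf", "--all", "--json", "-i", document, *corpus_files],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def words_above(entries, threshold):
    """Return the words whose tf_idf is at least `threshold`, best first."""
    ranked = sorted(entries, key=lambda entry: entry["tf_idf"], reverse=True)
    return [entry["word"] for entry in ranked if entry["tf_idf"] >= threshold]


# Three entries copied from the example above, so this part runs without tfidf.
sample = json.loads("""
[
    {"word": "lazarus", "tf_idf": 0.0052818627137794375, "tf": 0.0028169014084507044},
    {"word": "frost", "tf_idf": 0.004433890895007659, "tf": 0.0028169014084507044},
    {"word": "bedford", "tf_idf": 0.0037492766733561254, "tf": 0.0028169014084507044}
]
""")
selected = words_above(sample, 0.004)
print(selected)
```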

TF-IDF equations

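t: term (word)
d: document (a sequence of words)
corpus: set of documents
N: number of documents in the corpus

tf(t, d) = (count of t in d) / (number of words in d)
df(t) = number of documents in the corpus that contain t
idf(t) = log10(N / df(t))

tf_idf(t, d) = tf(t, d) * idf(t)

The logarithm is base 10, which is what the JSON example above implies: for "lazarus", tf_idf / tf ≈ 1.875 = log10(75).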
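
As a sanity check, the first entry of the JSON example above can be reproduced by hand. The sketch below assumes idf is the base-10 logarithm of N / df(t), since a plain ratio does not reproduce the reported score. The values N = 150, df = 2, a term count of 2 and a document length of 710 words are inferred from the reported numbers rather than stated anywhere, so treat them as assumptions.

```python
import math

# Values reported for "lazarus" in the JSON example.
reported_tf = 0.0028169014084507044
reported_tf_idf = 0.0052818627137794375

# Assumed inputs, chosen because they reproduce the reported values.
term_count = 2      # occurrences of the term in the document
words_in_doc = 710  # total number of words in the document
n_docs = 150        # N, number of documents in the corpus
df = 2              # number of corpus documents that contain the term

tf = term_count / words_in_doc
idf = math.log10(n_docs / df)
tf_idf = tf * idf

print(f"tf     = {tf:.6f}")
print(f"idf    = {idf:.6f}")
print(f"tf_idf = {tf_idf:.6f}")
```

The same assumptions with df = 4 and df = 7 reproduce the reported scores for "frost" and "bedford".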

Owner: Jakob Lindskog