Python implementation of TextRank for phrase extraction and summarization of text documents

Overview

PyTextRank

PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to:

  • extract the top-ranked phrases from text documents
  • run low-cost extractive summarization of text documents
  • help infer links from unstructured text into structured data

Background

One of the goals for PyTextRank is to provide support (eventually) for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall.

The introduction of graph algorithms -- notably, eigenvector centrality -- provides a more flexible and robust basis for integrating additional techniques that enhance the natural language work being performed. The entity linking aspects here are still a work-in-progress scheduled for a later release.

Internally PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and their supporting language. Generally speaking, any means of enriching that graph prior to phrase ranking will tend to improve results. Possible ways to enrich the lemma graph include coreference resolution and semantic relations, as well as leveraging knowledge graphs in the general case.

For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text. Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns since the lemma kitten is a hyponym of the lemma cat -- such that an inferred link can be added between them.

This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.
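
As a concrete illustration of this kind of inference, here is a minimal sketch using NLTK's WordNet interface together with NetworkX; both libraries are choices made for illustration, not PyTextRank APIs. Whether WordNet encodes a given pair directly varies, so the example uses "feline" and "cat", a hypernym/hyponym pair that is present:

import networkx as nx
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def add_inferred_link (graph, hypernym, hyponym):
    """Add an edge when WordNet relates the two lemmas as hypernym/hyponym."""
    for syn_upper in wn.synsets(hypernym, pos=wn.NOUN):
        for syn_lower in wn.synsets(hyponym, pos=wn.NOUN):
            # walk the hypernym closure upward from the candidate hyponym
            if syn_upper in syn_lower.closure(lambda s: s.hypernyms()):
                graph.add_edge(hypernym, hyponym, weight=1.0, kind="inferred")
                return

lemma_graph = nx.Graph()
add_inferred_link(lemma_graph, "feline", "cat")
print(list(lemma_graph.edges(data=True)))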

The TextRank algorithm used here is based on research published in:
"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
Empirical Methods in Natural Language Processing (2004)

Several modifications in PyTextRank improve on the algorithm originally described in the paper (the core ranking step is sketched just after this list):

  • fixed a bug: see Java impl, 2008
  • use lemmatization in place of stemming
  • include verbs in the graph (but not in the resulting phrases)
  • leverage preprocessing via noun chunking and named entity recognition
  • provide extractive summarization based on ranked phrases
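
To illustrate the core ranking step, here is a simplified sketch (toy code, not the library's internals): lemmas become graph nodes, co-occurrences within a sliding window add weighted edges, then eigenvector centrality gets computed via PageRank:

import itertools
import networkx as nx

def rank_lemmas (lemmas, window=3):
    """Toy version of the TextRank ranking step over a stream of lemmas."""
    graph = nx.Graph()

    # link lemmas which co-occur within a sliding window
    for i in range(len(lemmas)):
        for a, b in itertools.combinations(lemmas[i:i + window], 2):
            if a != b:
                prev = graph.get_edge_data(a, b, default={"weight": 0.0})
                graph.add_edge(a, b, weight=prev["weight"] + 1.0)

    # eigenvector centrality via PageRank
    return nx.pagerank(graph, weight="weight")

lemmas = "compatibility system linear constraint natural number criterion equation".split()

for lemma, rank in sorted(rank_lemmas(lemmas).items(), key=lambda item: -item[1]):
    print(f"{rank:.4f}  {lemma}")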

This implementation was inspired by the Williams 2016 talk on text summarization. Note that while much better approaches exist for summarizing text, questions linger about some of the top contenders -- see: 1, 2. Arguably, having alternatives such as this one allows for cost trade-offs.

Installation

Prerequisites:

To install from PyPI:

pip install pytextrank
python -m spacy download en_core_web_sm

If you install directly from this Git repo, be sure to install the dependencies as well:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Usage

import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank into the spaCy pipeline
# (spaCy 3.x API; the "textrank" factory gets registered by the pytextrank import above)
nlp.add_pipe("textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)
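
The pipeline also provides extractive summarization based on the ranked phrases, via the summary() method -- for example (the parameter values here are arbitrary):

# summarize the document, bounded by the top-ranked phrases
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=3):
    print(sent)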

For other example usage, see the PyTextRank wiki. If you need to troubleshoot any problems, check the project's GitHub issue tracker.

For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".

Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc. For inquiries about consulting work in machine learning, natural language, knowledge graph, and other AI applications, contact Derwen, Inc.

Testing

To run the unit tests:

coverage run -m unittest discover

To generate a coverage report and upload it to the codecov.io reporting site:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/DerwenAI/pytextrank

License and Copyright

Source code for PyTextRank, plus its logo, documentation, and examples, is offered under an MIT license, which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2016-2021 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing PyTextRank if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.

@software{PyTextRank,
  author = {Paco Nathan},
  title = {{PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents}},
  year = 2016,
  publisher = {Derwen},
  url = {https://github.com/DerwenAI/pytextrank}
}

TODOs

  • kglab integration
  • generate MkDocs
  • MyPy and PyLint coverage
  • include more unit tests
  • show examples of spacy-wordnet to enrich the lemma graph
  • leverage neuralcoref to enrich the lemma graph
  • generate a phrase graph, with entity linking into Wikidata, etc.
  • fix Sphinx errors, generate docs

Kudos

Many thanks to our contributors: @louisguitton, @anna-droid-beep, @kavorite, @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @dvsrepo, @SaiThejeshwar, @laxatives, @dimmu, @JasonZhangzy1757, @jake-aft, @junchen1992, @Ankush-Chander, @shyamcody, @chikubee, encouragement from the wonderful folks at spaCy, plus general support from Derwen, Inc.


Comments
  • Example file throws KeyError: 1255

    Have not been able to get either the long form (from wiki) or short form (from github readme) files to work successfully.

    The file @ https://github.com/DerwenAI/pytextrank/blob/master/example.py throws a KeyError: 1255 when run. Output for this is below.

    I have been able to get the example from the github page working but only for very small strings. Anything larger than a few words throws a KeyError with varying number depending on the length of the string.

    Can't figure out the issue even using all input (txt files) from the example on the wiki page and changing the spacy version to various releases from 2.0.0 to present.


    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>()
         31 text = f.read()
         32
    ---> 33 doc = nlp(text)
         34
         35 print("pipeline", nlp.pipe_names)

    /home/pete/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
        433     if not hasattr(proc, "__call__"):
        434         raise ValueError(Errors.E003.format(component=type(proc), name=name))
    --> 435     doc = proc(doc, **component_cfg.get(name, {}))
        436     if doc is None:
        437         raise ValueError(Errors.E005.format(name=name))

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in PipelineComponent(self, doc)
        530     """
        531     self.doc = doc
    --> 532     Doc.set_extension("phrases", force=True, default=self.calc_textrank())
        533     Doc.set_extension("textrank", force=True, default=self)
        534

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in calc_textrank(self)
        389
        390     for chunk in self.doc.noun_chunks:
    --> 391         self.collect_phrases(chunk)
        392
        393     for ent in self.doc.ents:

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in collect_phrases(self, chunk)
        345     if key in self.seen_lemma:
        346         node_id = list(self.seen_lemma.keys()).index(key)
    --> 347         rank = self.ranks[node_id]
        348         phrase.sq_sum_rank += rank
        349         compound_key.add(key)

    KeyError: 1255

    bug 
    opened by oldskewlcool 17
  • A question on keyphrases that are subsets of others and overlapping `Spans`

    I think the current implementation returns keyphrases that are potential subsets of each other, that this is due to the use of noun_chunks and ents, and that this is not the desired output. Specifically, if a document has an entity that is a superset (as far as span start and end is concerned) of a noun chunk (or vice-versa), and both contain a key token, then both will be returned as keyphrases.

    While also possibly linked to the issue of entity linkage (which I'd love to know more about!), this can simply be a matter of defining "entity" boundaries and a "duplication" issue, as the example below with "Seoul's Four Seasons hotel" and "Four Seasons" demonstrates -- where I believe one keyphrase is enough and having both is confusing.

    Am I missing something? Is this the desired logic?

    Example:

    from spacy.util import filter_spans
    import pytextrank
    import en_core_web_sm
    
    nlp = en_core_web_sm.load()
    nlp.add_pipe("textrank", last=True);
    
    # from dat/lee.txt
    text = """
    After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul.  The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel.  The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go.  Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players.  Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match.  "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
    """
    
    doc = nlp(text)
    
    key_spans = []
    for phrase in doc._.phrases:
        for chunk in phrase.chunks:
            key_spans.append(chunk)
    
    print(len(key_spans))
    
    full_set = set([p.text for p in doc._.phrases])
    
    print(full_set)
    
    print(len(filter_spans(key_spans)))
    
    sub_set = set([pytextrank.util.default_scrubber(p) for p in filter_spans(key_spans)])
    
    print(sub_set)
    
    print(full_set - sub_set)
    
    print(sub_set - full_set)
    

    Possible solution?:

    all_spans = list(self.doc.noun_chunks) + list(self.doc.ents)
    filtered_spans = filter_spans(all_spans)
    filtered_phrases = self._collect_phrases(filtered_spans, self.ranks) # replacing all_phrases
    

    instead of

    nc_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.noun_chunks, self.ranks)
    ent_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.ents, self.ranks)
    all_phrases: typing.Dict[Span, float] = { **nc_phrases, **ent_phrases }
    

    see https://github.com/DerwenAI/pytextrank/blob/29339027b905844af0064ed9a0326e2578f21bf6/pytextrank/base.py#L362

    Note:

    • My understanding is that self._get_min_phrases is doing something else.
    • spacy.util.filter_spans simply looks for the (first) longest span, which might not be the best solution.
    enhancement 
    opened by DayalStrub 11
  • Errors importing from pytextrank

    Hi! I'm working on a project connected with NLP and was happy to find out that there is such a tool as PyTextRank. However, I've encountered an issue at the very beginning, trying to just import the package to run the example code given here. The error that I get is the following:

    ----> from pytextrank import json_iter, parse_doc, pretty_print
    ImportError: cannot import name 'json_iter'
    ----> from pytextrank import parse_doc
    ImportError: cannot import name 'parse_doc'
    

    I've tried running it in an iPython console and a Jupyter Notebook, both with the same result. I've installed PyTextRank with pip; the Python version that I have is 3.5.4, with spaCy 2.1.8, networkx 2.4, and graphviz 0.13.2.

    question 
    opened by Erin59 9
  • NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ru'.

    It seems that nlp.add_pipe("textrank") requires "noun chunks", which raises a NotImplementedError for language models where "noun chunks" have not been implemented. I've got a NotImplementedError with the "ru_core_news_lg" and "ru_core_news_sm" spaCy models.

    The proposal is to make the use of "noun chunks" optional to prevent such errors.
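
    Until something like that lands, one possible caller-side guard (a sketch; it probes noun_chunks first and skips the component when the iterator is missing):

    import spacy
    import pytextrank  # registers the "textrank" factory

    nlp = spacy.load("ru_core_news_sm")

    try:
        list(nlp("проверка").noun_chunks)
        nlp.add_pipe("textrank")
    except NotImplementedError:
        print("no noun_chunks iterator for this language; skipping textrank")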

    bug 
    opened by gremur 8
  • How to use this?

    Hi there, I've been looking at your code and example for a long time and I still have no idea how to use this.

    I have documents in string format, what JSON format should they have if I want to use the stages as in the examples?

    I find there's a crucial piece of information missing in the documentation: how to use the functionality of this package with a simple document in string format (or a list of strings representing sentences), since I don't know beforehand what JSON format I have to convert my text to in order to use the stage pipeline.

    Cheers

    question 
    opened by romanovzky 8
  • Error: Can't find factory for 'textrank' for language English....

    Hi there,

    Does anyone know how to fix the errors below when running the example code?

    Thanks.

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        nlp.add_pipe("textrank")
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 773, in add_pipe
        validate=validate,
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 639, in create_pipe
        raise ValueError(err)
    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer

    question 
    opened by r76941156 7
  • Differences between 2.1.0 and 3.0.0

    Are the changes between the two versions of pytextrank documented anywhere?

    The queries seem to be giving different results, so I would like to understand if that is because of changes to spaCy or to the algorithm itself?

    Thank you for your help.

    question howto 
    opened by debraj135 7
  • Keyword extraction

    Hi there, I'm working on a project extracting keywords from a German text. Is there a tutorial on how to extract keywords using pytextrank?

    Best regards,

    question 
    opened by danielp3011 7
  • AttributeError: 'DiGraph' object has no attribute 'edge'

    Fixed by changing the code in pytextrank (line 307) from:

    try:
        graph.edge[pair[0]][pair[1]]["weight"] += 1.0
    except KeyError:
        graph.add_edge(pair[0], pair[1], weight=1.0)

    to:

    if "edge" in dir(graph):
        graph.edge[pair[0]][pair[1]]["weight"] += 1.0
    else:
        graph.add_edge(pair[0], pair[1], weight=1.0)

    opened by Vickoh 7
  • Add biasedtextrank module.

    Hey @ceteri I have added a basic version of biased textrank.

    It takes into account "focus" as well as "bias" to augment the ranking in favour of the focus. As per the paper, it should add bias to the graph based on a similarity calculation between the "focus" and the nodes. But this version just assigns the "bias" to focus terms while leaving other nodes unbiased.
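
    For reference, this kind of bias maps onto the personalization vector in PageRank; a rough sketch (toy graph and parameter values, not the code in this PR):

    import networkx as nx

    graph = nx.Graph([("machine", "learning"), ("learning", "rate"), ("rate", "limit")])
    focus_terms = {"machine", "learning"}

    # focus nodes get a larger restart probability; the rest stay uniform
    personalization = {
        node: 10.0 if node in focus_terms else 1.0
        for node in graph.nodes
    }

    print(nx.pagerank(graph, personalization=personalization))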

    Let me know of your ideas so that we can improve upon this version.

    @louisguitton

    opened by Ankush-Chander 6
  • IndexError: list index out of range

    Hi,

    I'm getting the following error when trying to run pytextrank with my own data. Is there a way to fix this?

    Traceback (most recent call last):
      File "index.py", line 26, in <module>
        for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 581, in normalize_key_phrases
        for rl in collect_entities(sent, ranks, stopwords, spacy_nlp):
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 485, in collect_entities
        w_ranks, w_ids = find_entity(sent, ranks, ent.text.split(" "), 0)
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
        return find_entity(sent, ranks, ent, i + 1)
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
        return find_entity(sent, ranks, ent, i + 1)
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
        return find_entity(sent, ranks, ent, i + 1)
      [Previous line repeated 137 more times]
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 451, in find_entity
        w = sent[i + j]
    IndexError: list index out of range

    wontfix 
    opened by rabinneslo 6
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1

    This PR was automatically created by Snyk using the credentials of a real user.


    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity: low
    Priority Score (*): 441/1000 -- Why? Recently disclosed, has a fix available, CVSS 3.1
    Issue: Regular Expression Denial of Service (ReDoS), SNYK-PYTHON-SETUPTOOLS-3113904
    Upgrade: setuptools 39.0.1 -> 65.5.1
    Breaking Change: No
    Exploit Maturity: No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by ceteri 0
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1

    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity: low
    Priority Score (*): 441/1000 -- Why? Recently disclosed, has a fix available, CVSS 3.1
    Issue: Regular Expression Denial of Service (ReDoS), SNYK-PYTHON-SETUPTOOLS-3113904
    Upgrade: setuptools 39.0.1 -> 65.5.1
    Breaking Change: No
    Exploit Maturity: No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by snyk-bot 0
  • suggestion: allow "wildcard" POS for stopwords

    The current approach which specifies stopwords as lemma: [POS] presents two issues:

    1. There are some terms which POS taggers will fail on. For example, spaCy labels "AI" (artificial intelligence) as PROPN
    2. If I create software to be used by people without linguistic knowledge, I cannot expect them to know about POS.

    As a work-around, it is necessary to specify all POS tags, which is rather inelegant.

    opened by arc12 0
  • "ValueError: [E002] Can't find factory for 'textrank' for language English (en)." - incompatibility with SpaCy 3.3.1?

    I'm trying to use this package for the first time and followed the README:

    !pip install pytextrank
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    

    This throws an error at the last line:

    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls 'nlp.create_pipe' with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator '@Language.component' (for function components) or '@Language.factory' (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer`

    Is this an incompatibility with spaCy version 3.3.1, or have I overlooked something crucial? Which spaCy version do you recommend? (I restarted the kernel after installing pytextrank)
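
    One likely cause, judging from the snippet above: the "textrank" factory only gets registered with spaCy when the package is imported, and the snippet never runs import pytextrank. A corrected version (assuming that is indeed the cause):

    import spacy
    import pytextrank  # this import registers the "textrank" factory

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")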

    question 
    opened by lisabecker-ml6 1
  • Is `biasedtextrank` implemented?

    https://github.com/DerwenAI/pytextrank/blob/9ab64507a26f946191504598f86021f511245cd7/pytextrank/base.py#L305

    self.focus_tokens is initialized to an empty set but I don't see where it is parameterized?

    e.g.

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")
    focus = "my example focus"
    doc = nlp(text)
    

    At what point can I inform the model of the focus?
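
    Judging from the v3.1.1 release notes further down, a change_focus() method re-ranks doc._.phrases against a new focus; a sketch of how that would likely be invoked (the bias parameter name and value are assumptions):

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")
    doc = nlp(text)

    # re-rank the document's phrases toward the focus terms
    doc._.textrank.change_focus(focus="my example focus", bias=10.0)

    for phrase in doc._.phrases[:5]:
        print(phrase.rank, phrase.text)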

    question kg 
    opened by Ayenem 4
  • ZeroDivisionError: division by zero in _calc_discounted_normalised_rank

    Hi,

    I use this library together with spaCy to extract the most important words. However, when using the Catalan spaCy model, the algorithm raises the following error:

    Traceback (most recent call last):
      File "/code/app.py", line 20, in getNlpEntities
        entities = runTextRankEntities(hl, contents['contents'], algorithm, num)
      File "/code/nlp/textRankEntities.py", line 51, in runTextRankEntities
        doc = nlp(joined_content)
      File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1022, in __call__
        error_handler(name, proc, [doc], e)
      File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 1617, in raise_error
        raise e
      File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1017, in __call__
        doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 253, in __call__
        doc._.phrases = doc._.textrank.calc_textrank()
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 363, in calc_textrank
        nc_phrases = self._collect_phrases(self.doc.noun_chunks, self.ranks)
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 548, in _collect_phrases
        return {
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 549, in <dictcomp>
        span: self._calc_discounted_normalised_rank(span, sum_rank)
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 592, in _calc_discounted_normalised_rank
        phrase_rank = math.sqrt(sum_rank / (len(span) + non_lemma))
    ZeroDivisionError: division by zero
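
    One possible defensive fix would guard the denominator in _calc_discounted_normalised_rank() (a hypothetical sketch, not a committed patch):

    import math

    def calc_discounted_normalised_rank (span, sum_rank, non_lemma):
        # hypothetical guard: empty spans rank as zero instead of dividing by zero
        denom = len(span) + non_lemma
        return math.sqrt(sum_rank / denom) if denom > 0 else 0.0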

    bug help wanted good first issue 
    opened by sumitkumarjethani 2
Releases (v3.2.4)
  • v3.2.4(Jul 27, 2022)

    2022-07-27

    • better support for "ru" and other languages without noun_chunks support in spaCy
    • updated example notebook to illustrate TopicRank algorithm
    • made the node bias setting case-independent for Biased Textrank algorithm; kudos @Ankush-Chander
    • updated summarization tests; kudos @tomaarsen
    • reworked some unit tests to be less brittle, less dependent on specific spaCy point releases

    What's Changed

    • updated docs and example to show TopicRank by @ceteri in https://github.com/DerwenAI/pytextrank/pull/211
    • working on #204 by @ceteri in https://github.com/DerwenAI/pytextrank/pull/212
    • Prevent exception on TopicRank when there are no noun_chunks by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/219
    • Biasedrank case fix by @Ankush-Chander in https://github.com/DerwenAI/pytextrank/pull/217
    • Docs update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/221
    • rework some unit tests by @ceteri in https://github.com/DerwenAI/pytextrank/pull/222

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.3...v3.2.4

  • v3.2.3(Mar 6, 2022)

    2022-03-06

    • handles missing noun_chunks in some language models (e.g., "ru") #204
    • add TopicRank algorithm; kudos @tomaarsen
    • improved test suite; fixed tests for newer spacy releases; kudos @tomaarsen

    What's Changed

    • [Snyk] Security upgrade mistune from 0.8.4 to 2.0.1 by @snyk-bot in https://github.com/DerwenAI/pytextrank/pull/201
    • Improved test suite; fixed tests by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/205
    • Updated Copyright year from 2021 to 2022 by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/206
    • update API reference docs by @ceteri in https://github.com/DerwenAI/pytextrank/pull/207
    • Inclusion of the TopicRank Keyphrase Extraction algorithm by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/208
    • Prep release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/210

    New Contributors

    • @snyk-bot made their first contribution in https://github.com/DerwenAI/pytextrank/pull/201

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.2...v3.2.3

  • v3.2.2(Oct 10, 2021)

    What's Changed

    • prep next release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/189
    • warning about the deprecated code in archive by @ceteri in https://github.com/DerwenAI/pytextrank/pull/190
    • fixes chunk to be between sent_start and sent_end in BaseTextRank.calc_sent_dist by @clabornd in https://github.com/DerwenAI/pytextrank/pull/191
    • Update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/198
    • add more scrubber examples and documentation by @dayalstrub-cma in https://github.com/DerwenAI/pytextrank/pull/197
    • kudos by @ceteri in https://github.com/DerwenAI/pytextrank/pull/199
    • prep PyPi release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/200

    New Contributors

    • @clabornd made their first contribution in https://github.com/DerwenAI/pytextrank/pull/191
    • @dayalstrub-cma made their first contribution in https://github.com/DerwenAI/pytextrank/pull/197

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.1...v3.2.2

  • v3.2.1(Jul 24, 2021)

  • v3.2.0(Jul 17, 2021)

    2021-07-17

    Various updates to support spaCy 3.1.x, which changes some interfaces.

    • NB: THE SCRUBBER UPDATE WILL BREAK PREVIOUS RELEASES
    • allow Span as scrubber argument, to align with spaCy 3.1.x; kudos @Ankush-Chander
    • add lgtm code reviews (slow, not integrating into GitHub PRs directly)
    • evaluating grayskull to generate a conda-forge recipe
    • add use of pipdeptree to analyze dependencies
    • use KG from biblio.ttl to generate bibliography
    • fixed overlooked comment from earlier code; kudos @debraj135
    • add visualisation using altair; kudos @louisguitton
    • add scrubber usage in sample notebook; kudos @Ankush-Chander
    • integrating use of MkRefs to generate semantic reference pages in docs
  • v3.1.1(Mar 25, 2021)

    2021-03-25

    • fix the span length calculation in explanation notebook; kudos @Ankush-Chander
    • add BiasedTextRank by @Ankush-Chander (many thanks!)
    • add conda environment.yml plus instructions
    • use bandit to check for security issues
    • use codespell to check for spelling errors
    • add pre-commit checks in general
    • update doc._.phrases in the call to change_focus() so the summarization will sync with the latest focus
  • v3.1.0(Mar 12, 2021)

    2021-03-12

    • rename master branch to main
    • add a factory class that assigns each doc its own Textrank object; kudos @Ankush-Chander
    • refactor the stopwords feature as a constructor argument
    • add get_unit_vector() method to expose the characteristic unit vector
    • add calc_sent_dist() method to expose the sentence distance measures (for summarization)
    • include a unit test for summarization
    • updated contributor instructions
    • pylint coverage for code checking
    • linking definitions and citations in source code apidocs to our online docs
    • updated links on PyPi
  • v3.0.1(Feb 27, 2021)

  • v3.0.0(Feb 14, 2021)

    2021-02-14

    • THIS WILL BREAK THINGS!!!
    • support for spaCy 3.0.x; kudos @Lord-V15
    • full integration of PositionRank
    • migrated all unit tests to pytest
    • removed use of logger for debugging, introducing icecream instead
  • v2.1.0(Jan 31, 2021)

    2021-01-31

    • add PositionRank by @louisguitton (many thanks!)
    • fixes chunk in explain_summ.ipynb by @anna-droid-beep
    • add option preserve_order in TextRank.summary by @kavorite
    • tested with spaCy 2.3.5
  • v2.0.3(Sep 15, 2020)

    2020-09-15

    • try-catch ZeroDivisionError in summary method -- kudos @shyamcody
    • tested with updated dependencies: spaCy 2.3.x and NetworkX 2.5
  • v2.0.2(Jun 28, 2020)

  • v2.0.1(Mar 2, 2020)

    2020-03-02

    • fix KeyError issue for pre Python 3.6
    • integrated codecov.io
    • added PyTextRank to the spaCy uniVerse
    • fixed README.md instructions to download en_core_web_sm
  • v2.0.0(Nov 5, 2019)

    • refactored library to run as a spaCy extension
    • supports multiple languages
    • significantly faster, with less memory required
    • better extraction of top-ranked phrases
    • changed license to MIT
    • uses lemma-based stopwords for more precise control
    • WIP toward integration with knowledge graph use cases
  • v1.2.1(Nov 1, 2019)

  • v1.2.0(Nov 1, 2019)

  • v1.1.1(Sep 15, 2017)

  • v1.1.0(Jun 7, 2017)

    Replaced TextBlob usage with spaCy for improved parsing results. Updated the other Python dependencies. Also added better handling for UTF-8.

  • v1.0.1(May 1, 2017)

  • v1.0.0(Mar 13, 2017)
