Python implementation of TextRank for phrase extraction and summarization of text documents

Overview

PyTextRank

PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to:

  • extract the top-ranked phrases from text documents
  • run low-cost extractive summarization of text documents
  • help infer links from unstructured text into structured data

Background

One of the goals for PyTextRank is to provide support (eventually) for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall.

The introduction of graph algorithms -- notably, eigenvector centrality -- provides a more flexible and robust basis for integrating additional techniques that enhance the natural language work being performed. The entity linking aspects here are still a work-in-progress scheduled for a later release.

Internally PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and their supporting language. Generally speaking, any means of enriching that graph prior to phrase ranking will tend to improve results. Possible ways to enrich the lemma graph include coreference resolution and semantic relations, as well as leveraging knowledge graphs in the general case.

For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text. Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns since the lemma kitten is a hyponym of the lemma cat -- such that an inferred link can be added between them.

This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.
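
As a concrete illustration of this kind of inference, here is a minimal sketch using NLTK's WordNet interface together with NetworkX; both libraries are choices made for illustration, not PyTextRank APIs. Whether WordNet encodes a given pair directly varies, so the example uses "feline" and "cat", a hypernym/hyponym pair that is present:

import networkx as nx
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def add_inferred_link (graph, hypernym, hyponym):
    """Add an edge when WordNet relates the two lemmas as hypernym/hyponym."""
    for syn_upper in wn.synsets(hypernym, pos=wn.NOUN):
        for syn_lower in wn.synsets(hyponym, pos=wn.NOUN):
            # walk the hypernym closure upward from the candidate hyponym
            if syn_upper in syn_lower.closure(lambda s: s.hypernyms()):
                graph.add_edge(hypernym, hyponym, weight=1.0, kind="inferred")
                return

lemma_graph = nx.Graph()
add_inferred_link(lemma_graph, "feline", "cat")
print(list(lemma_graph.edges(data=True)))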

The TextRank algorithm used here is based on research published in:
"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
Empirical Methods in Natural Language Processing (2004)

Several modifications in PyTextRank improve on the algorithm originally described in the paper (the core ranking step is sketched just after this list):

  • fixed a bug: see Java impl, 2008
  • use lemmatization in place of stemming
  • include verbs in the graph (but not in the resulting phrases)
  • leverage preprocessing via noun chunking and named entity recognition
  • provide extractive summarization based on ranked phrases
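
To illustrate the core ranking step, here is a simplified sketch (toy code, not the library's internals): lemmas become graph nodes, co-occurrences within a sliding window add weighted edges, then eigenvector centrality gets computed via PageRank:

import itertools
import networkx as nx

def rank_lemmas (lemmas, window=3):
    """Toy version of the TextRank ranking step over a stream of lemmas."""
    graph = nx.Graph()

    # link lemmas which co-occur within a sliding window
    for i in range(len(lemmas)):
        for a, b in itertools.combinations(lemmas[i:i + window], 2):
            if a != b:
                prev = graph.get_edge_data(a, b, default={"weight": 0.0})
                graph.add_edge(a, b, weight=prev["weight"] + 1.0)

    # eigenvector centrality via PageRank
    return nx.pagerank(graph, weight="weight")

lemmas = "compatibility system linear constraint natural number criterion equation".split()

for lemma, rank in sorted(rank_lemmas(lemmas).items(), key=lambda item: -item[1]):
    print(f"{rank:.4f}  {lemma}")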

This implementation was inspired by the Williams 2016 talk on text summarization. Note that while much better approaches exist for summarizing text, questions linger about some of the top contenders -- see: 1, 2. Arguably, having alternatives such as this one allows for cost trade-offs.

Installation

Prerequisites:

To install from PyPI:

pip install pytextrank
python -m spacy download en_core_web_sm

If you install directly from this Git repo, be sure to install the dependencies as well:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Usage

import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank into the spaCy pipeline
# (spaCy 3.x API; the "textrank" factory gets registered by the pytextrank import above)
nlp.add_pipe("textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)
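
The pipeline also provides extractive summarization based on the ranked phrases, via the summary() method -- for example (the parameter values here are arbitrary):

# summarize the document, bounded by the top-ranked phrases
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=3):
    print(sent)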

For other example usage, see the PyTextRank wiki. If you need to troubleshoot any problems, check the project's GitHub issue tracker.

For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".

Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc. For inquiries about consulting work in machine learning, natural language, knowledge graph, and other AI applications, contact Derwen, Inc.

Testing

To run the unit tests:

coverage run -m unittest discover

To generate a coverage report and upload it to the codecov.io reporting site:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/DerwenAI/pytextrank

License and Copyright

Source code for PyTextRank, plus its logo, documentation, and examples, is offered under an MIT license, which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2016-2021 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing PyTextRank if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.

@software{PyTextRank,
  author = {Paco Nathan},
  title = {{PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents}},
  year = 2016,
  publisher = {Derwen},
  url = {https://github.com/DerwenAI/pytextrank}
}

TODOs

  • kglab integration
  • generate MkDocs
  • MyPy and PyLint coverage
  • include more unit tests
  • show examples of spacy-wordnet to enrich the lemma graph
  • leverage neuralcoref to enrich the lemma graph
  • generate a phrase graph, with entity linking into Wikidata, etc.
  • fix Sphinx errors, generate docs

Kudos

Many thanks to our contributors: @louisguitton, @anna-droid-beep, @kavorite, @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @dvsrepo, @SaiThejeshwar, @laxatives, @dimmu, @JasonZhangzy1757, @jake-aft, @junchen1992, @Ankush-Chander, @shyamcody, @chikubee, encouragement from the wonderful folks at spaCy, plus general support from Derwen, Inc.


Comments
  • Example file throws KeyError: 1255

    Have not been able to get either the long form (from wiki) or short form (from github readme) files to work successfully.

    The file @ https://github.com/DerwenAI/pytextrank/blob/master/example.py throws a KeyError: 1255 when run. Output for this is below.

    I have been able to get the example from the github page working but only for very small strings. Anything larger than a few words throws a KeyError with varying number depending on the length of the string.

    Can't figure out the issue even using all input (txt files) from the example on the wiki page and changing the spacy version to various releases from 2.0.0 to present.


    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>()
         31 text = f.read()
         32
    ---> 33 doc = nlp(text)
         34
         35 print("pipeline", nlp.pipe_names)

    /home/pete/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
        433     if not hasattr(proc, "__call__"):
        434         raise ValueError(Errors.E003.format(component=type(proc), name=name))
    --> 435     doc = proc(doc, **component_cfg.get(name, {}))
        436     if doc is None:
        437         raise ValueError(Errors.E005.format(name=name))

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in PipelineComponent(self, doc)
        530     """
        531     self.doc = doc
    --> 532     Doc.set_extension("phrases", force=True, default=self.calc_textrank())
        533     Doc.set_extension("textrank", force=True, default=self)
        534

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in calc_textrank(self)
        389
        390     for chunk in self.doc.noun_chunks:
    --> 391         self.collect_phrases(chunk)
        392
        393     for ent in self.doc.ents:

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in collect_phrases(self, chunk)
        345     if key in self.seen_lemma:
        346         node_id = list(self.seen_lemma.keys()).index(key)
    --> 347         rank = self.ranks[node_id]
        348         phrase.sq_sum_rank += rank
        349         compound_key.add(key)

    KeyError: 1255

    bug 
    opened by oldskewlcool 17
  • A question on keyphrases that are subsets of others and overlapping `Spans`

    I think the current implementation returns keyphrases that are potential subsets of each other, that this is due to the use of noun_chunks and ents, and that this is not the desired output. Specifically, if a document has an entity that is a superset (as far as span start and end is concerned) of a noun chunk (or vice-versa), and both contain a key token, then both will be returned as keyphrases.

    While also possibly linked to the issue of entity linkage (which I'd love to know more about!), this can simply be a matter of defining "entity" boundaries and a "duplication" issue, as the example below with "Seoul's Four Seasons hotel" and "Four Seasons" demonstrates -- where I believe one keyphrase is enough and having both is confusing.

    Am I missing something? Is this the desired logic?

    Example:

    from spacy.util import filter_spans
    import pytextrank
    import en_core_web_sm
    
    nlp = en_core_web_sm.load()
    nlp.add_pipe("textrank", last=True);
    
    # from dat/lee.txt
    text = """
    After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul.  The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel.  The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go.  Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players.  Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match.  "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
    """
    
    doc = nlp(text)
    
    key_spans = []
    for phrase in doc._.phrases:
        for chunk in phrase.chunks:
            key_spans.append(chunk)
    
    print(len(key_spans))
    
    full_set = set([p.text for p in doc._.phrases])
    
    print(full_set)
    
    print(len(filter_spans(key_spans)))
    
    sub_set = set([pytextrank.util.default_scrubber(p) for p in filter_spans(key_spans)])
    
    print(sub_set)
    
    print(full_set - sub_set)
    
    print(sub_set - full_set)
    

    Possible solution?:

    all_spans = list(self.doc.noun_chunks) + list(self.doc.ents)
    filtered_spans = filter_spans(all_spans)
    filtered_phrases = self._collect_phrases(filtered_spans, self.ranks) # replacing all_phrases
    

    instead of

    nc_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.noun_chunks, self.ranks)
    ent_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.ents, self.ranks)
    all_phrases: typing.Dict[Span, float] = { **nc_phrases, **ent_phrases }
    

    see https://github.com/DerwenAI/pytextrank/blob/29339027b905844af0064ed9a0326e2578f21bf6/pytextrank/base.py#L362

    Note:

    • My understanding is that self._get_min_phrases is doing something else.
    • spacy.util.filter_spans simply looks for the (first) longest span, which might not be the best solution.
    enhancement 
    opened by DayalStrub 11
  • Errors importing from pytextrank

    Hi! I'm working on a project connected with NLP and was happy to find out that there is such a tool as PyTextRank. However, I've encountered an issue at the very beginning, trying to just import the package to run the example code given here. The error that I get is the following:

    ----> from pytextrank import json_iter, parse_doc, pretty_print
    ImportError: cannot import name 'json_iter'
    ----> from pytextrank import parse_doc
    ImportError: cannot import name 'parse_doc'
    

    I've tried running it in an iPython console and a Jupyter Notebook, both with the same result. I've installed PyTextRank with pip; the Python version that I have is 3.5.4, with spaCy 2.1.8, networkx 2.4, and graphviz 0.13.2.

    question 
    opened by Erin59 9
  • NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ru'.

    It seems that nlp.add_pipe("textrank") requires "noun chunks", which raises a NotImplementedError for language models where "noun chunks" have not been implemented. I've got a NotImplementedError with the "ru_core_news_lg" and "ru_core_news_sm" spaCy models.

    The proposal is to make the use of "noun chunks" optional to prevent such errors.
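
    Until something like that lands, one possible caller-side guard (a sketch; it probes noun_chunks first and skips the component when the iterator is missing):

    import spacy
    import pytextrank  # registers the "textrank" factory

    nlp = spacy.load("ru_core_news_sm")

    try:
        list(nlp("проверка").noun_chunks)
        nlp.add_pipe("textrank")
    except NotImplementedError:
        print("no noun_chunks iterator for this language; skipping textrank")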

    bug 
    opened by gremur 8
  • How to use this?

    Hi there, I've been looking at your code and example for a long time and I still have no idea how to use this.

    I have documents in string format, what JSON format should they have if I want to use the stages as in the examples?

    I find there's a crucial piece of information missing in the documentation: how to use the functionality of this package with a simple document in string format (or a list of strings representing sentences), since I don't know beforehand what JSON format I have to convert my text to in order to use the stage pipeline.

    Cheers

    question 
    opened by romanovzky 8
  • Error: Can't find factory for 'textrank' for language English....

    Hi there,

    Does anyone know how to fix the errors below when running the example code?

    Thanks.

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        nlp.add_pipe("textrank")
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 773, in add_pipe
        validate=validate,
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 639, in create_pipe
        raise ValueError(err)
    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer

    question 
    opened by r76941156 7
  • Differences between 2.1.0 and 3.0.0

    Are the changes between the two versions of pytextrank documented anywhere?

    The queries seem to be giving different results, so I would like to understand if that is because of changes to spaCy or to the algorithm itself?

    Thank you for your help.

    question howto 
    opened by debraj135 7
  • Keyword extraction

    Hi there, I'm working on a project extracting keywords from a German text. Is there a tutorial on how to extract keywords using pytextrank?

    Best regards,

    question 
    opened by danielp3011 7
  • AttributeError: 'DiGraph' object has no attribute 'edge'

    Fixed by changing the code in pytextrank (line 307) from:

    try:
        graph.edge[pair[0]][pair[1]]["weight"] += 1.0
    except KeyError:
        graph.add_edge(pair[0], pair[1], weight=1.0)

    to:

    if "edge" in dir(graph):
        graph.edge[pair[0]][pair[1]]["weight"] += 1.0
    else:
        graph.add_edge(pair[0], pair[1], weight=1.0)

    opened by Vickoh 7
  • Add biasedtextrank module.

    Hey @ceteri I have added a basic version of biased textrank.

    It takes into account "focus" as well as "bias" to augment the ranking in favour of the focus. As per the paper, it should add bias to the graph based on a similarity calculation between the "focus" and the nodes. But this version just assigns the "bias" to focus terms while leaving other nodes unbiased.
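
    For reference, this kind of bias maps onto the personalization vector in PageRank; a rough sketch (toy graph and parameter values, not the code in this PR):

    import networkx as nx

    graph = nx.Graph([("machine", "learning"), ("learning", "rate"), ("rate", "limit")])
    focus_terms = {"machine", "learning"}

    # focus nodes get a larger restart probability; the rest stay uniform
    personalization = {
        node: 10.0 if node in focus_terms else 1.0
        for node in graph.nodes
    }

    print(nx.pagerank(graph, personalization=personalization))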

    Let me know of your ideas so that we can improve upon this version.

    @louisguitton

    opened by Ankush-Chander 6
  • IndexError: list index out of range

    Hi,

    I'm getting the following error when trying to run pytextrank with my own data. Is there a way to fix this?

    Traceback (most recent call last):
      File "index.py", line 26, in <module>
        for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 581, in normalize_key_phrases
        for rl in collect_entities(sent, ranks, stopwords, spacy_nlp):
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 485, in collect_entities
        w_ranks, w_ids = find_entity(sent, ranks, ent.text.split(" "), 0)
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
        return find_entity(sent, ranks, ent, i + 1)
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
        return find_entity(sent, ranks, ent, i + 1)
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
        return find_entity(sent, ranks, ent, i + 1)
      [Previous line repeated 137 more times]
      File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 451, in find_entity
        w = sent[i + j]
    IndexError: list index out of range

    wontfix 
    opened by rabinneslo 6
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1

    This PR was automatically created by Snyk using the credentials of a real user.


    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity: low
    Priority Score (*): 441/1000 -- Why? Recently disclosed, has a fix available, CVSS 3.1
    Issue: Regular Expression Denial of Service (ReDoS), SNYK-PYTHON-SETUPTOOLS-3113904
    Upgrade: setuptools 39.0.1 -> 65.5.1
    Breaking Change: No
    Exploit Maturity: No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by ceteri 0
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1

    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity: low
    Priority Score (*): 441/1000 -- Why? Recently disclosed, has a fix available, CVSS 3.1
    Issue: Regular Expression Denial of Service (ReDoS), SNYK-PYTHON-SETUPTOOLS-3113904
    Upgrade: setuptools 39.0.1 -> 65.5.1
    Breaking Change: No
    Exploit Maturity: No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by snyk-bot 0
  • suggestion: allow "wildcard" POS for stopwords

    The current approach which specifies stopwords as lemma: [POS] presents two issues:

    1. There are some terms which POS taggers will fail on. For example, spaCy labels "AI" (artificial intelligence) as PROPN
    2. If I create software to be used by people without linguistic knowledge, I cannot expect them to know about POS.

    As a work-around, it is necessary to specify all POS tags, which is rather inelegant.

    opened by arc12 0
  • "ValueError: [E002] Can't find factory for 'textrank' for language English (en)." - incompatibility with SpaCy 3.3.1?

    I'm trying to use this package for the first time and followed the README:

    !pip install pytextrank
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    

    This throws an error at the last line:

    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls 'nlp.create_pipe' with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator '@Language.component' (for function components) or '@Language.factory' (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer`

    Is this an incompatibility with spaCy version 3.3.1, or have I overlooked something crucial? Which spaCy version do you recommend? (I restarted the kernel after installing pytextrank)
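
    One likely cause, judging from the snippet above: the "textrank" factory only gets registered with spaCy when the package is imported, and the snippet never runs import pytextrank. A corrected version (assuming that is indeed the cause):

    import spacy
    import pytextrank  # this import registers the "textrank" factory

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")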

    question 
    opened by lisabecker-ml6 1
  • Is `biasedtextrank` implemented?

    https://github.com/DerwenAI/pytextrank/blob/9ab64507a26f946191504598f86021f511245cd7/pytextrank/base.py#L305

    self.focus_tokens is initialized to an empty set but I don't see where it is parameterized?

    e.g.

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")
    focus = "my example focus"
    doc = nlp(text)
    

    At what point can I inform the model of the focus?
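
    Judging from the v3.1.1 release notes further down, a change_focus() method re-ranks doc._.phrases against a new focus; a sketch of how that would likely be invoked (the bias parameter name and value are assumptions):

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")
    doc = nlp(text)

    # re-rank the document's phrases toward the focus terms
    doc._.textrank.change_focus(focus="my example focus", bias=10.0)

    for phrase in doc._.phrases[:5]:
        print(phrase.rank, phrase.text)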

    question kg 
    opened by Ayenem 4
  • ZeroDivisionError: division by zero in _calc_discounted_normalised_rank

    Hi,

    I use this library together with spaCy to extract the most important words. However, when using the Catalan spaCy model, the algorithm raises the following error:

    Traceback (most recent call last):
      File "/code/app.py", line 20, in getNlpEntities
        entities = runTextRankEntities(hl, contents['contents'], algorithm, num)
      File "/code/nlp/textRankEntities.py", line 51, in runTextRankEntities
        doc = nlp(joined_content)
      File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1022, in __call__
        error_handler(name, proc, [doc], e)
      File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 1617, in raise_error
        raise e
      File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1017, in __call__
        doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 253, in __call__
        doc._.phrases = doc._.textrank.calc_textrank()
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 363, in calc_textrank
        nc_phrases = self._collect_phrases(self.doc.noun_chunks, self.ranks)
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 548, in _collect_phrases
        return {
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 549, in <dictcomp>
        span: self._calc_discounted_normalised_rank(span, sum_rank)
      File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 592, in _calc_discounted_normalised_rank
        phrase_rank = math.sqrt(sum_rank / (len(span) + non_lemma))
    ZeroDivisionError: division by zero
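
    One possible defensive fix would guard the denominator in _calc_discounted_normalised_rank() (a hypothetical sketch, not a committed patch):

    import math

    def calc_discounted_normalised_rank (span, sum_rank, non_lemma):
        # hypothetical guard: empty spans rank as zero instead of dividing by zero
        denom = len(span) + non_lemma
        return math.sqrt(sum_rank / denom) if denom > 0 else 0.0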

    bug help wanted good first issue 
    opened by sumitkumarjethani 2
Releases (v3.2.4)
  • v3.2.4(Jul 27, 2022)

    2022-07-27

    • better support for "ru" and other languages without noun_chunks support in spaCy
    • updated example notebook to illustrate TopicRank algorithm
    • made the node bias setting case-independent for Biased Textrank algorithm; kudos @Ankush-Chander
    • updated summarization tests; kudos @tomaarsen
    • reworked some unit tests to be less brittle, less dependent on specific spaCy point releases

    What's Changed

    • updated docs and example to show TopicRank by @ceteri in https://github.com/DerwenAI/pytextrank/pull/211
    • working on #204 by @ceteri in https://github.com/DerwenAI/pytextrank/pull/212
    • Prevent exception on TopicRank when there are no noun_chunks by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/219
    • Biasedrank case fix by @Ankush-Chander in https://github.com/DerwenAI/pytextrank/pull/217
    • Docs update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/221
    • rework some unit tests by @ceteri in https://github.com/DerwenAI/pytextrank/pull/222

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.3...v3.2.4

  • v3.2.3(Mar 6, 2022)

    2022-03-06

    • handles missing noun_chunks in some language models (e.g., "ru") #204
    • add TopicRank algorithm; kudos @tomaarsen
    • improved test suite; fixed tests for newer spacy releases; kudos @tomaarsen

    What's Changed

    • [Snyk] Security upgrade mistune from 0.8.4 to 2.0.1 by @snyk-bot in https://github.com/DerwenAI/pytextrank/pull/201
    • Improved test suite; fixed tests by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/205
    • Updated Copyright year from 2021 to 2022 by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/206
    • update API reference docs by @ceteri in https://github.com/DerwenAI/pytextrank/pull/207
    • Inclusion of the TopicRank Keyphrase Extraction algorithm by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/208
    • Prep release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/210

    New Contributors

    • @snyk-bot made their first contribution in https://github.com/DerwenAI/pytextrank/pull/201

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.2...v3.2.3

  • v3.2.2(Oct 10, 2021)

    What's Changed

    • prep next release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/189
    • warning about the deprecated code in archive by @ceteri in https://github.com/DerwenAI/pytextrank/pull/190
    • fixes chunk to be between sent_start and sent_end in BaseTextRank.calc_sent_dist by @clabornd in https://github.com/DerwenAI/pytextrank/pull/191
    • Update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/198
    • add more scrubber examples and documentation by @dayalstrub-cma in https://github.com/DerwenAI/pytextrank/pull/197
    • kudos by @ceteri in https://github.com/DerwenAI/pytextrank/pull/199
    • prep PyPi release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/200

    New Contributors

    • @clabornd made their first contribution in https://github.com/DerwenAI/pytextrank/pull/191
    • @dayalstrub-cma made their first contribution in https://github.com/DerwenAI/pytextrank/pull/197

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.1...v3.2.2

  • v3.2.1(Jul 24, 2021)

  • v3.2.0(Jul 17, 2021)

    2021-07-17

    Various updates to support spaCy 3.1.x, which changes some interfaces.

    • NB: THE SCRUBBER UPDATE WILL BREAK PREVIOUS RELEASES
    • allow Span as scrubber argument, to align with spaCy 3.1.x; kudos @Ankush-Chander
    • add lgtm code reviews (slow, not integrating into GitHub PRs directly)
    • evaluating grayskull to generate a conda-forge recipe
    • add use of pipdeptree to analyze dependencies
    • use KG from biblio.ttl to generate bibliography
    • fixed overlooked comment from earlier code; kudos @debraj135
    • add visualisation using altair; kudos @louisguitton
    • add scrubber usage in sample notebook; kudos @Ankush-Chander
    • integrating use of MkRefs to generate semantic reference pages in docs
  • v3.1.1(Mar 25, 2021)

    2021-03-25

    • fix the span length calculation in explanation notebook; kudos @Ankush-Chander
    • add BiasedTextRank by @Ankush-Chander (many thanks!)
    • add conda environment.yml plus instructions
    • use bandit to check for security issues
    • use codespell to check for spelling errors
    • add pre-commit checks in general
    • update doc._.phrases in the call to change_focus() so the summarization will sync with the latest focus
  • v3.1.0(Mar 12, 2021)

    2021-03-12

    • rename master branch to main
    • add a factory class that assigns each doc its own Textrank object; kudos @Ankush-Chander
    • refactor the stopwords feature as a constructor argument
    • add get_unit_vector() method to expose the characteristic unit vector
    • add calc_sent_dist() method to expose the sentence distance measures (for summarization)
    • include a unit test for summarization
    • updated contributor instructions
    • pylint coverage for code checking
    • linking definitions and citations in source code apidocs to our online docs
    • updated links on PyPi
  • v3.0.1(Feb 27, 2021)

  • v3.0.0(Feb 14, 2021)

    2021-02-14

    • THIS WILL BREAK THINGS!!!
    • support for spaCy 3.0.x; kudos @Lord-V15
    • full integration of PositionRank
    • migrated all unit tests to pytest
    • removed use of logger for debugging, introducing icecream instead
  • v2.1.0(Jan 31, 2021)

    2021-01-31

    • add PositionRank by @louisguitton (many thanks!)
    • fixes chunk in explain_summ.ipynb by @anna-droid-beep
    • add option preserve_order in TextRank.summary by @kavorite
    • tested with spaCy 2.3.5
  • v2.0.3(Sep 15, 2020)

    2020-09-15

    • try-catch ZeroDivisionError in summary method -- kudos @shyamcody
    • tested with updated dependencies: spaCy 2.3.x and NetworkX 2.5
  • v2.0.2(Jun 28, 2020)

  • v2.0.1(Mar 2, 2020)

    2020-03-02

    • fix KeyError issue for pre Python 3.6
    • integrated codecov.io
    • added PyTextRank to the spaCy uniVerse
    • fixed README.md instructions to download en_core_web_sm
  • v2.0.0(Nov 5, 2019)

    • refactored library to run as a spaCy extension
    • supports multiple languages
    • significantly faster, with less memory required
    • better extraction of top-ranked phrases
    • changed license to MIT
    • uses lemma-based stopwords for more precise control
    • WIP toward integration with knowledge graph use cases
  • v1.2.1(Nov 1, 2019)

  • v1.2.0(Nov 1, 2019)

  • v1.1.1(Sep 15, 2017)

  • v1.1.0(Jun 7, 2017)

    Replaced TextBlob usage with spaCy for improved parsing results. Updated the other Python dependencies. Also added better handling for UTF-8.

  • v1.0.1(May 1, 2017)

  • v1.0.0(Mar 13, 2017)
