"elect", "electoral", "electorate" etc. | PythonRepo" /> "elect", "electoral", "electorate" etc. | PythonRepo">

Accurately generate all possible forms of an English word, e.g. "election" --> "elect", "electoral", "electorate" etc.

Overview


Word forms can accurately generate all possible forms of an English word. It can conjugate verbs. It can connect different parts of speech, e.g. noun to adjective, adjective to adverb, noun to verb etc. It can pluralize singular nouns. It does all of this in one function. Enjoy!

Examples

Some very timely examples :-P

>>> from word_forms.word_forms import get_word_forms
>>> get_word_forms("president")
{'n': {'presidents', 'presidentships', 'presidencies', 'presidentship', 'president', 'presidency'},
 'a': {'presidential'},
 'v': {'preside', 'presided', 'presiding', 'presides'},
 'r': {'presidentially'}}
>>> get_word_forms("elect")
{'n': {'elects', 'electives', 'electors', 'elect', 'eligibilities', 'electorates', 'eligibility', 'elector', 'election', 'elections', 'electorate', 'elective'},
 'a': {'eligible', 'electoral', 'elective', 'elect'},
 'v': {'electing', 'elects', 'elected', 'elect'},
 'r': set()}
>>> get_word_forms("politician")
{'n': {'politician', 'politics', 'politicians'},
 'a': {'political'},
 'v': set(),
 'r': {'politically'}}
>>> get_word_forms("am")
{'n': {'being', 'beings'},
 'a': set(),
 'v': {'was', 'be', "weren't", 'am', "wasn't", "aren't", 'being', 'were', 'is', "isn't", 'been', 'are', 'am not'},
 'r': set()}
>>> get_word_forms("ran")
{'n': {'run', 'runniness', 'runner', 'runninesses', 'running', 'runners', 'runnings', 'runs'},
 'a': {'running', 'runny'},
 'v': {'running', 'run', 'ran', 'runs'},
 'r': set()}
>>> get_word_forms('continent', 0.8)  # with configurable similarity threshold
{'n': {'continents', 'continency', 'continences', 'continent', 'continencies', 'continence'},
 'a': {'continental', 'continent'},
 'v': set(),
 'r': set()}

As you can see, the output is a dictionary with four keys. "r" stands for adverb, "a" for adjective, "n" for noun and "v" for verb. Don't ask me why "r" stands for adverb. This is what WordNet uses, so this is why I use it too :-)
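If the one-letter keys get in the way, they are easy to relabel. The snippet below is only a sketch: POS_NAMES and with_readable_keys are hypothetical names that are not part of word_forms, and the sample dictionary is copied from the "politician" output above so that it runs without the library installed.

```python
# WordNet part-of-speech tags and the names they stand for.
POS_NAMES = {"n": "noun", "a": "adjective", "v": "verb", "r": "adverb"}

# Sample result copied from the get_word_forms("politician") example.
politician_forms = {
    "n": {"politician", "politics", "politicians"},
    "a": {"political"},
    "v": set(),
    "r": {"politically"},
}


def with_readable_keys(forms_by_pos):
    """Return the same sets keyed by full part-of-speech names."""
    return {POS_NAMES[tag]: words for tag, words in forms_by_pos.items()}


readable = with_readable_keys(politician_forms)
print(sorted(readable["noun"]))
print(sorted(readable["adverb"]))
```

In real use you would pass the return value of get_word_forms() instead of the hard-coded dictionary.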

Help can be obtained at any time by typing the following:

>>> help(get_word_forms)

Why?

In Natural Language Processing and Search, one often needs to treat words like "run" and "ran", "love" and "lovable" or "politician" and "politics" as the same word. This is usually done by algorithmically reducing each word into a base word and then comparing the base words. The process is called Stemming. For example, the Porter Stemmer reduces both "love" and "lovely" into the base word "love".

Stemmers have several shortcomings. Firstly, the base word produced by the Stemmer is not always a valid English word. For example, the Porter Stemmer reduces the word "operation" to "oper". Secondly, Stemmers have a high false negative rate. For example, "run" is reduced to "run" and "ran" is reduced to "ran", so the two are never matched. This happens because Stemmers use a set of rational rules for finding the base words, and as we all know, the English language does not always behave very rationally.

Lemmatizers are more accurate than Stemmers because they produce a base form that is present in the dictionary (also called the Lemma). So the reduced word is always a valid English word. However, Lemmatizers also have false negatives because they are not very good at connecting words across different parts of speech. The WordNet Lemmatizer included with NLTK fails at almost all such examples. "operations" is reduced to "operation" and "operate" is reduced to "operate".

Word Forms tries to solve this problem by finding all possible forms of a given English word. It can perform verb conjugations, connect noun forms to verb forms, adjective forms and adverb forms, pluralize singular forms etc.
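One way to use this for matching is to check whether one word appears among the generated forms of another. The helper below is a rough sketch, not part of the package: are_related and fake_get_word_forms are made-up names, and the stand-in data is copied from the "ran" example in this README so the snippet runs without word_forms installed.

```python
def are_related(word_a, word_b, get_forms):
    """Return True if word_b appears among the forms generated for word_a.

    get_forms is any callable that returns a dict mapping
    'n', 'a', 'v' and 'r' to sets of words, like get_word_forms does.
    """
    forms = get_forms(word_a)
    return any(word_b in words for words in forms.values())


def fake_get_word_forms(word):
    """Stand-in for get_word_forms that only knows the word "ran"."""
    table = {
        "ran": {
            "n": {"run", "runniness", "runner", "runninesses",
                  "running", "runners", "runnings", "runs"},
            "a": {"running", "runny"},
            "v": {"running", "run", "ran", "runs"},
            "r": set(),
        }
    }
    empty = {"n": set(), "a": set(), "v": set(), "r": set()}
    return table.get(word, empty)


print(are_related("ran", "run", fake_get_word_forms))      # True
print(are_related("ran", "operate", fake_get_word_forms))  # False
```

With the library installed, passing get_word_forms itself as the third argument should behave the same way.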

Bonus: A simple lemmatizer

We also offer a very simple lemmatizer based on word_forms. Here is how to use it.

>>> from word_forms.lemmatizer import lemmatize
>>> lemmatize("operations")
'operant'
>>> lemmatize("operate")
'operant'

Enjoy!

Compatibility

Tested on Python 3

Installation

Using pip:

pip install -U word_forms

From source

Or you can install it from source:

  1. Clone the repository:
git clone https://github.com/gutfeeling/word_forms.git
  2. Install it using pip or setup.py:
pip install -e word_forms
# or
cd word_forms
python setup.py install

Acknowledgement

  1. The XTAG project for information on verb conjugations.
  2. WordNet

Maintainer

Hi, I am Dibya and I maintain this repository. I would love to hear from you. Feel free to get in touch with me at [email protected].

Contributors

  • Tom Aarsen @CubieDev is a major contributor and is singlehandedly responsible for v2.0.0.
  • Sajal Sharma @sajal2692 is a major contributor.
  • Pamphile Roy @tupui is responsible for the PyPI package.

Contributions

Word Forms is not perfect. In particular, a couple of aspects can be improved.

  1. It sometimes generates non-dictionary words like "runninesses" because the pluralization/singularization algorithm is not perfect. At the moment, I am using inflect for it.

If you like this package, feel free to contribute. Your pull requests are most welcome.

Comments
  • Using python-Levenshtein for similarity between lemmas, for performance

    Hello again!

    @chrisjbryant suggested the use of python-Levenshtein for getting similarity ratios between strings in the comments of pull request #10, and though I've been busy until now, I finally got around to testing it. This pull request has some scripts in the Markdown. They are included in case you want to reproduce my tests and results. However, I would recommend not giving the scripts themselves much attention, and focusing on the explanations and outputs.

    Is it equivalent?

    I created the following file in the root folder of the project, where it could read from test_values.py.

    #!/usr/bin/env python
    # encoding: utf-8
    
    from difflib import SequenceMatcher
    from Levenshtein import ratio
    import unittest
    
    
    class TestWordForms(unittest.TestCase):
        """
        Simple TestCase for a specific input to output, one instance generated per test case for use in a TestSuite
        """
    
        def __init__(self, text_input: str, expected_output: dict, description: str = ""):
            super().__init__()
            self.text_input = text_input
            self.expected_output = expected_output
            self.description = description
    
        def setUp(self):
            pass
    
        def tearDown(self):
            pass
    
        def runTest(self):
            self.assertEqual(
                SequenceMatcher(None, self.text_input, self.expected_output).ratio(),
                ratio(self.text_input, self.expected_output),
                self.description,
            )
    
    
    if __name__ == "__main__":
        from test_values import test_values
    
        suite = unittest.TestSuite()
        suite.addTests(
            TestWordForms(
                inp,
                value,
                f"difflib.SequenceMatcher(None, {repr(inp)}, {repr(value)}) ~= Levenshtein.ratio({repr(inp)}, {repr(value)})",
            )
            for inp, out in test_values
            for values in out.values()
            for value in values
        )
        unittest.TextTestRunner().run(suite)
    

    In short, this takes all input values from test_values.py, and all individual outputs for these inputs, and checks whether the similarity ratio between these two is identical when using difflib.SequenceMatcher().ratio() vs Levenshtein.ratio(). The output is the following:

    .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
    ----------------------------------------------------------------------
    Ran 621 tests in 0.080s
    
    OK
    

    So, the Levenshtein.ratio() is indeed equivalent to difflib.SequenceMatcher().ratio() for these test cases.

    But is it faster?

    Again, I wrote a quick script for this. None of these are included in the actual commits as they should not be packaged with word_forms. The script is:

    #!/usr/bin/env python
    # encoding: utf-8
    
    from timeit import timeit
    from difflib import SequenceMatcher
    from Levenshtein import ratio
    
    from test_values import test_values
    test_cases = [
        (
            inp,
            value,
        )
        for inp, out in test_values
        for values in out.values()
        for value in values
    ]
    
    n = 100
    ratio_list = []
    for str_one, str_two in test_cases:
        diff  = timeit(lambda: SequenceMatcher(None, str_one, str_two).ratio(), number=n)
        leven = timeit(lambda: ratio(str_one, str_two), number=n)
        ratio_list.append(diff / leven)
        #print(f"Levenshtein.ratio() is {ratio_list[-1]:.4f} times as fast as difflib.SequenceMatcher().ratio() for {repr(str_one)} to {repr(str_two)}")
    print(f"Minimum performance gain (ratio): {min(ratio_list):.4f} times as fast")
    print(f"Maximum performance gain (ratio): {max(ratio_list):.4f} times as fast")
    print(f"Median performance gain (ratio) : {sorted(ratio_list)[round(len(ratio_list)/2)]:.4f} times as fast")
    
    

    Which outputted:

    Minimum performance gain (ratio): 21.2509 times as fast
    Maximum performance gain (ratio): 194.4625 times as fast
    Median performance gain (ratio) : 78.2975 times as fast
    

    So, yes, it is much faster. Will it be a noticeable performance increase when implemented in word_forms? Well, I made a quick script for that too.

    Is it actually noticeably faster in get_word_forms?

    #!/usr/bin/env python
    # encoding: utf-8
    
    from timeit import timeit
    from word_forms.word_forms import get_word_forms
    from test_values import test_values
    
    n = 200
    speed_list = []
    for test_case in test_values:
        str_one = test_case[0]
        speed = timeit(lambda: get_word_forms(str_one), number=n)
        speed_list.append(speed)
        #print(f"Took {speed:.8f}s")
    print(f"Minimum execution time (seconds): {min(speed_list):.8f}s")
    print(f"Maximum execution time (seconds): {max(speed_list):.8f}s")
    print(f"Median execution time (seconds) : {sorted(speed_list)[round(len(speed_list)/2)]:.8f}s")
    

    Which outputted the following when using difflib.SequenceMatcher().ratio(): (Note, the execution time is for calling the function 200 times)

    Minimum execution time (seconds): 0.01940580s
    Maximum execution time (seconds): 1.16317950s
    Median execution time (seconds) : 0.07265300s
    

    and the following for Levenshtein.ratio():

    Minimum execution time (seconds): 0.01647300s
    Maximum execution time (seconds): 1.23246420s
    Median execution time (seconds) : 0.05827050s
    

    When considering the median, there is a measurable performance increase of roughly 20%, but this is likely not enough for any user to actually notice. Regardless, using Levenshtein.ratio() is definitely preferable, unless you desperately want to keep the number of dependencies to a minimum.


    Thank you for the recommendation @chrisjbryant

    All tests still pass, as expected.

    • Tom Aarsen
    opened by tomaarsen 9
  • Use difflib instead of python-Levenshtein for computing similarity ratio

    The python-Levenshtein library has a GPLv2 license, meaning that derived works must be available under the same license. Due to this, I, and presumably others, cannot use this library if we want a different license for works that use the word_forms library as a dependency. (Thinking about this, it may mean that this particular library should also be GPLv2.)

    While looking for alternatives to this, I chanced upon Python's own difflib library and its SequenceMatcher.ratio() function. The output of this ratio is exactly the same as the python-Levenshtein ratios. In fact, there is some overlap between the actual implementations of these libraries, as mentioned in the python-Levenshtein docs.

    Code block to demonstrate this:

    from difflib import SequenceMatcher
    from Levenshtein import ratio
    
    def sequence_matcher_ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()
    
    def compare_equality(a, b):
        print(sequence_matcher_ratio(a, b) == ratio(a, b))
    
    def compare_print(a, b):
        print("Sequence Matcher Ratio: ", sequence_matcher_ratio(a, b))
        print("Levenshtein Ratio: ", ratio(a, b))
    
    >>> compare_equality('continent', 'continence') 
    True
    
    >>> compare_print('continent', 'continence')
    Sequence Matcher Ratio:  0.8421052631578947
    Levenshtein Ratio:  0.8421052631578947
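
    The value above can also be reproduced by hand. difflib documents ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes it from get_matching_blocks(). It only illustrates the difflib side and makes no claim about how python-Levenshtein arrives at the same number.

```python
from difflib import SequenceMatcher

a, b = "continent", "continence"
matcher = SequenceMatcher(None, a, b)

# Add up the sizes of all matching blocks; the last block is a
# zero-length sentinel, so it does not change the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total)                   # 8 19
print(manual_ratio)                     # 0.8421052631578947
print(manual_ratio == matcher.ratio())  # True
```

    Here the only matching block is the shared prefix "continen" (8 characters), and the combined length is 9 + 10 = 19.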
    

    I propose we move to using SequenceMatcher, or some other library instead of the python-Levenshtein library.

    I'm already doing this in my own fork, so I can raise a PR for this if needed.

    opened by sajal2692 6
  • Unexpected results

    Hi, Your module looks great! I did want to report the following weird case though.

    wf.get_word_forms("genetic") {'n': {'geneticss', 'originations', 'gene', 'origins', 'originators', 'origin', 'geneticist', 'geneticists', 'originator', 'genes', 'genetics', 'origination'}, 'r': {'genetically'}, 'a': {'genetic', 'genic', 'genetical', 'originative'}, 'v': {'originate', 'originating', 'originated', 'originates'}}

    Clearly "genetic" and "originated" should not be considered different forms of the same word. Do you have any idea why this happens? And is it easy to fix or just an exception? Thanks!

    opened by chrisjbryant 5
  • Any way to avoid the download every time?

    I'm writing an anagram solver, and using this to help pull different versions of words in the dictionary like "weakened" vs. "weaken" - "weaken" is in the dictionary file I'm using, but "weakened" is not because it's a different form of the base word.

    Every time I run my program, there's a 7 or so second delay while word_forms downloads files and sets things up - is there any way to just have it use the files it downloaded last time?

    opened by Ralithune 4
  • "Fished" not showing up in results

    I was running a simple test.

    get_word_forms("fish")

    However, "fished" is not showing up as expected under "v".

    Actual results: {'n': {'fisher', 'fisheries', 'fishery', 'fishers', 'fishings', 'fishing', 'fish'}, 'a': {'fishy'}, 'v': {'fish'}, 'r': {'fishily'}}

    opened by ivanmkc 3
  • How to use word_forms without internet access?

    I have inherited some code that uses word_forms. It needs to run on a system that has no internet access (firewalled). It looks to me that when word_forms initializes it calls nltk.download('wordnet'), which croaks after timing out. I wonder if there's a way for me to add the WordNet dataset locally and modify the word_forms code to have it initialize nltk with the local copy of WordNet instead?

    For example:

    >>> from word_forms.word_forms import get_word_forms
    [nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
    [nltk_data]     connection attempt failed because the connected party
    [nltk_data]     did not properly respond after a period of time, or
    [nltk_data]     established connection failed because connected host
    [nltk_data]     has failed to respond>
    >>> import nltk
    >>> nltk.download('wordnet')
    [nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
    [nltk_data]     connection attempt failed because the connected party
    [nltk_data]     did not properly respond after a period of time, or
    [nltk_data]     established connection failed because connected host
    [nltk_data]     has failed to respond>
    False
    

    Finally, thanks for all the effort that has gone into making this package available, it's very appreciated!

    opened by monocongo 3
  • Across the board improvements

    Hello Dibya,

    I've been looking for projects exactly like this one, that do something concise and interesting with text. The fact that this project was shown to have a few small flaws made it even more appealing to me. Over the past few days I've played around with your work, and found some ways to improve its results. This kind of work is very interesting to me, and maybe we can learn a thing or two with the work I did.

    I'll give a short overview of what I've done before diving into details.

    Improvement Overview

      i. Added a Testing Suite to track results. Run it with python test_word_forms.py.
      ii. Fixed strange case sensitivity for some nouns. e.g.:
    get_word_forms("death") => {
     'n': {'death', 'dying', 'dice', 'Death', 'dyings', 'die', 'deaths', 'Deaths'},
          ...
    }
    
      iii. Resolved some of the incorrect pluralisation provided by inflect: No more "politicss".
      iv. Resolved some issues with weird unrelated results, such as:
    get_word_forms("verb") => {
      'a': {'verbal'},
      'n': {'wordings', 'wording', 'word', 'verbs', 'verb', 'words'},
      'r': {'verbally'},
      'v': {'verbified',
            'verbifies',
            'verbify',
            'verbifying',
            'word',
            'worded',
            'wording',
            'words'}
    }
    

    This is now:

    get_word_forms("verb") => {
      "a": {"verbal"},
      "n": {"verbs", "verb"},
      "r": {"verbally"},
      "v": {"verbifying", "verbified", "verbify", "verbifies"},
    }
    
      v. Resolved words like "ran", "am" and "was" not returning any values. The old system returns:
    get_word_forms("ran") => {'n': set(), 'a': set(), 'v': set(), 'r': set()}
    get_word_forms("run") => {
      'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
      'a': {'running', 'runny'}, 
      'v': {'running', 'run', 'runs', 'ran'}, 
      'r': set()
    }
    

    This is now:

    get_word_forms("ran") => {
      'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
      'a': {'running', 'runny'}, 
      'v': {'running', 'run', 'runs', 'ran'}, 
      'r': set()
    }
    get_word_forms("run") => {
      'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
      'a': {'running', 'runny'}, 
      'v': {'running', 'run', 'runs', 'ran'}, 
      'r': set()
    }
    
      vi. Heavily improved performance of importing the module. The following program
    from time import time
    t = time()
    from word_forms.word_forms import get_word_forms
    print(f"It took {time() - t:.4f}s to load the module")
    

    used to output

    It took 10.1868s to load the module
    

    and now it outputs

    It took 2.7437s to load the module
    

    In addition to these fixes, this pull request solves issues #2 and #3.


    Detailed changes

    I've split up my work in small commits, so that with each commit the reasoning and changes are concise and hopefully followable. I would heavily recommend going through each commit in order to see what I've done, rather than immediately jumping to see the overall effect they had on the two main files of your project.

    I'll go through each of my commits and tell you my reasoning and what they did.

    • 885445c Added a test suite with a bunch of words and their (expected) results. These were manually edited to remove any errors, such as "politicss".
    • ead7285 Rather than looping over all synsets, and then over all lemmas, to get all words, I now call nltk.corpus.words.words(), which gives you (if I recall correctly) some 240000 words rather than the previous 180000, and in considerably less time. This is responsible for improvement vi.
    • 0f6d9b1 Commit title and description speak for themselves.
    • 4f848e4 The function get_related_lemmas() is more intuitively implemented as a recursive function. I keep track of a list of known lemmas. When the function is called, it will take the word parameter, get the lemmas for that word exactly, add those to known_lemmas if not already present, and recursively call get_related_lemmas() again with lemmas related to the original word. Each of these recursive calls will add more entries to known_lemmas, which is then returned. Note that at this time (this will change in a later commit), this function is identical in functionality, it just has a different implementation.
    • 8a05104 Commit title and description speak for themselves.
    • fe740e6 Slight simplification using .copy() on a list rather than using a list comprehension to copy a list.
    • 2e8c2a9 Moved away from iterating over a nested list, and using a dictionary instead. Previously, for each verb you iterated over the entire CONJUGATED_VERB_LIST, and then for each nested list checked whether the verb was in that nested list. This is a very slow operation, one which does not need to be this slow. Now, I use a dict that points to an instance of the Verb class, which holds a set of strings. To illustrate the change, I'll show the old and new representations:
    CONJUGATED_VERB_LIST = [
      ['convoluted', 'convoluting', 'convolute', 'convolutes'],
      ['fawn', 'fawned', 'fawns', 'fawning']
    ]
    
    v1 = Verbs({'convoluted', 'convoluting', 'convolute', 'convolutes'})
    v2 = Verbs({'fawn', 'fawned', 'fawns', 'fawning'})
    CONJUGATED_VERB_DICT = {
      'convoluted': v1, 
      'convoluting': v1,
      'convolute': v1, 
      'convolutes': v1, 
      'fawn': v2, 
      'fawned': v2, 
      'fawns': v2, 
      'fawning': v2
    }
    

    In the old system, we need:

    for verb in verb_set:
        for word_list in CONJUGATED_VERB_LIST:
            if verb in word_list:
                for word in word_list:
                    related_word_dict["v"].add(word)
    

    You can count the amount of nested loops for yourself. Now we only need:

    for verb in verb_set:
        if verb in CONJUGATED_VERB_DICT:
            related_word_dict["v"] |= CONJUGATED_VERB_DICT[verb].verbs
    

    This is considerably faster. Note that |= is a set union operation.
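
    A self-contained sketch of the two lookups, to show that they return the same conjugations. Plain sets stand in for the Verbs instances here, which is a simplifying assumption, and the helper names conjugations_via_list and conjugations_via_dict are made up for illustration.

```python
# Old representation: a list of conjugation lists.
CONJUGATED_VERB_LIST = [
    ["convoluted", "convoluting", "convolute", "convolutes"],
    ["fawn", "fawned", "fawns", "fawning"],
]

# New representation: every form points at one shared set of conjugations.
CONJUGATED_VERB_DICT = {}
for forms in CONJUGATED_VERB_LIST:
    shared = set(forms)
    for form in forms:
        CONJUGATED_VERB_DICT[form] = shared


def conjugations_via_list(verbs):
    """Scan every nested list for every verb (the old approach)."""
    found = set()
    for verb in verbs:
        for word_list in CONJUGATED_VERB_LIST:
            if verb in word_list:
                found.update(word_list)
    return found


def conjugations_via_dict(verbs):
    """One dictionary lookup per verb (the new approach)."""
    found = set()
    for verb in verbs:
        if verb in CONJUGATED_VERB_DICT:
            found |= CONJUGATED_VERB_DICT[verb]
    return found


print(conjugations_via_list({"fawned"}) == conjugations_via_dict({"fawned"}))
```

    The list version does work proportional to the size of the whole verb table for each verb, while the dictionary version does a single hash lookup.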

    • 30c8282 The changes in line 35 are reverted later on. Furthermore, now every pertainym is used in ADJECTIVE_TO_ADVERB, rather than just the first one. Other than that, word_forms is just slightly optimized to not need a list comprehension.
    • a9cd983 This is a very interesting commit. It uses something very similar to the difflib.get_close_matches() used in constants.py: difflib.SequenceMatcher. Now, new lemmas found in get_related_lemmas will only be considered if they are deemed at least 40% similar to the previous lemma. This will avoid jumps like "verbal" -> "word" and "genetic" -> "origin".
    • 7f6d5d2 Commit title and description speak for themselves.
    • 4bd4cd5 Commit title and description speak for themselves.
    • d1c6340 Commit title and description speak for themselves. This is responsible for improvement ii.
    • b2a9daf Rather than blindly adding the plural of any noun, we ensure that the pluralized noun does not end with "css" to avoid "geneticss" and "politicss". We override this change later in commit 2ea150e.
    • 80d27bf Commit title and description speak for themselves.
    • 5f31968 Commit title and description speak for themselves.
    • 77b9c25 Commit title and description speak for themselves.
    • 0cf0628 Commit title and description speak for themselves.
    • 359fc0c Rather than turning words inputted to get_word_forms to singular, we now use the NLTK WordNetLemmatizer() to lemmatize the input word for nouns, verbs, adjectives and adverbs. With this change, we get {"run"} as words when the input was "ran", or {"be", "am"} if we input "am". Then, for each of these words we call get_related_lemmas, and duplicate lemmas are removed. This is responsible for improvement v.
    • 18eeba6 Commit title and description speak for themselves.
    • 2ea150e Improves on commit b2a9daf. Uses a simple regular expression to detect whether the pluralised noun ends with a consonant followed by "ss". Commit title and description speak for themselves. This is responsible for improvement iii.
    • c832b6b Commit title and description speak for themselves.
    • 384f298 Commit title and description speak for themselves.
    • e71f64f Now that the program output has been improved, some of the examples are outdated. I've updated them, and added the now relevant examples of "am" and "ran". The Contribution section is also updated to reflect that a bug was now fixed.
    • d11297e Commit title and description speak for themselves.
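
    The plural check described for commit 2ea150e can be sketched roughly as follows. The exact expression used in the commit is not reproduced here, so the pattern and the name looks_like_bad_plural are assumptions made for illustration only.

```python
import re

# Assumed pattern: any non-vowel character followed by "ss" at the end of the word.
CONSONANT_SS = re.compile(r"[^aeiou]ss$")


def looks_like_bad_plural(plural):
    """Return True for pluralised forms such as "politicss"."""
    return CONSONANT_SS.search(plural) is not None


candidates = ["politicss", "geneticss", "continents", "presidencies"]
kept = [word for word in candidates if not looks_like_bad_plural(word)]
print(kept)  # ['continents', 'presidencies']
```

    Compared with the earlier "css" check, a consonant class also catches bad plurals whose singular ends in a consonant other than "c" followed by "s".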

    Tests

    The modified program passes all provided tests. The original program will fail 9 of them. The output of the tests on the original program is provided here: original_test_output.txt

    I've tested my changes with Python 3.7.4. Whether the program is still Python 2 compatible I do not know.

    Potential additions/considerations for you

    • Add tests to setup.py so tests can be run using setup.py.
    • Check/confirm Python 2 compatibility.
    • Consider adding a contributors section in the README with contributors.
    • Consider checking the README to see if it still has any inconsistencies with the new version, e.g. if it warns for bugs that no longer exist.
    • Note that the pluralisation by inflect can still cause issues, as it comes up with words like "runninesses" as the plural of "runniness".

    I've had a lot of fun messing with this project, but I recognise there are a lot of changes proposed in this pull request. If you feel like the spirit of the original version is lost in some way if this pull request was accepted, then I will turn my version into a standalone fork so people can use it if they'd like, with this project preserved like it is.

    Let me know if you need anything else from me.

    Tom Aarsen

    opened by tomaarsen 3
  • How to add domain specific words to this Library

    I want to add some domain-specific words to this library. For example, the code below gives all word forms for the word 'moisturize':

    get_word_forms("moisturize")
    {'n': {'moistener',
      'moisteners',
      'moistening',
      'moistenings',
      'moisture',
      'moistures'},
     'a': set(),
     'v': {'moisten',
      'moistened',
      'moistening',
      'moistens',
      'moisturise',
      'moisturize',
      'moisturized',
      'moisturizes',
      'moisturizing'},
     'r': set()}
    

    But when I try the one below, I get empty sets. How can I add the word 'moisturizer' and get the above result as its word forms? Thanks in advance for helping me.

    get_word_forms("moisturizer")
    {'n': set(), 'a': set(), 'v': set(), 'r': set()}
    
    opened by srinivas365 1
  • Performance improvement on importing

    Closes #20

    Hello!


    Pull request overview

    This PR is about a performance improvement with three main changes:

    1. Removing the now-defunct ALL_WORDNET_WORDS as inspired by #20.
    2. Making ADJECTIVE_TO_ADVERB be loaded from a file rather than through NLTK's Wordnet.
    3. (Slightly) changing how Wordnet is imported in word_forms.py.

    Change 1:

    Before my commit 359fc0c8dba0a2d7308f2ffd3f0dc5f6b37025f0, there existed a function singularize(noun), which attempted to return the singular form of a noun, and checked if it existed in Wordnet using singular in ALL_WORDNET_WORDS. Essentially, it tried to find a form of the word passed to get_word_forms so the rest of the program could work with a word in Wordnet. In that commit I changed this function to use WordNetLemmatizer().lemmatize(), which ensures that the passed word is in Wordnet. However, the import of ALL_WORDNET_WORDS remained, without needing to exist. In @gutfeeling's commit 143abc35b01971f6d23ad63f826c3e1f27e15225 this import was removed, but the initialisation in constants.py remained unnecessarily.

    Change 2:

    I tried to look for more ways to improve import times, as they were roughly 2.8 seconds for me earlier. I looked at constants, and realised that the dict that is created for ADJECTIVE_TO_ADVERB is only ~2800 items big, while it costs me just under a second to compute. Because this dict is unlikely to change much between different runs of the program (it can only change if NLTK's Wordnet is updated), this second can be gained by simply moving this dict into a file, and loading that - very similarly to what is done with en-verbs.txt.

    So, with change 1 and 2 combined, there is no use for NLTK's wordnet in constants.py, so those imports are removed too.
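
    As a rough illustration of change 2, the sketch below writes a small mapping to disk once and loads it back. The JSON format, the file name and the helper names are assumptions made for the example; the format actually used in the repository is not shown here.

```python
import json
import os
import tempfile


def save_mapping(mapping, path):
    """Write the precomputed dict to disk once."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle)


def load_mapping(path):
    """Load the dict at import time instead of recomputing it."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Tiny stand-in for the ~2800-entry ADJECTIVE_TO_ADVERB dict.
adjective_to_adverb = {
    "presidential": "presidentially",
    "political": "politically",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "adj_to_adv.json")  # hypothetical file name
    save_mapping(adjective_to_adverb, cache_path)
    reloaded = load_mapping(cache_path)

print(reloaded == adjective_to_adverb)  # True
```

    The trade-off is that the file has to be regenerated whenever the underlying Wordnet data changes.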

    Change 3:

    As you can see in the list of commits, I also attempted a faster method of checking whether NLTK's wordnet is downloaded:

    try:
        from nltk.data import find
        find("corpora/wordnet.zip")
    except LookupError:
        from nltk import download
        download("wordnet", quiet=True)
    from nltk.corpus import wordnet as wn
    

    This is considerably faster for importing (improving import times fourfold), but will increase the execution time of the very first call of get_word_forms() by roughly 1.5 seconds. This is because NLTK lazy loads its wordnet data, and the original check that uses wn.synsets("python") will cause NLTK to load wordnet, costing 1.5 seconds.

    So, the question is: would you rather wait 1.5 seconds at the start of your program, or 1.5 extra seconds whenever you first call get_word_forms? My take is that any program that calls the function in a "live" environment, e.g. based on user input, will rather have the former. So, I didn't implement this change, and only changed wn.synsets("python") to wn.synset("python.n.01"), which simply gets one synset instead of 3. This is marginally faster.
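
    The trade-off can be pictured with a toy lazy loader. This is only an illustrative stand-in: LazyCorpus and expensive_load are made-up names and do not reflect how NLTK implements its corpus readers.

```python
class LazyCorpus:
    """Defers an expensive load until the first lookup."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_count = 0

    def lookup(self, key):
        if self._data is None:
            # The first lookup pays the full loading cost.
            self._data = self._loader()
            self.load_count += 1
        return self._data.get(key)


def expensive_load():
    # Placeholder for reading the Wordnet files from disk.
    return {"python.n.01": "<synset data>"}


corpus = LazyCorpus(expensive_load)
print(corpus.load_count)      # 0: creating the object loads nothing

corpus.lookup("python.n.01")  # eager warm-up, e.g. at import time
print(corpus.load_count)      # 1

corpus.lookup("python.n.01")  # later lookups reuse the loaded data
print(corpus.load_count)      # 1
```

    Doing the first lookup at import time moves the delay to program start, while skipping it moves the delay to the first real call.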


    Rough performance changes

    The import time of this module has been reduced by about 35%, from ~2.3975 to ~1.5668s (median time between 5 runs each), with no performance difference of the get_word_forms function. Furthermore, all tests still run perfectly.


    Thank you @sajal2692 for letting me know of the inconsistency in #20, and for your work on this repo!


    • Tom Aarsen
    opened by tomaarsen 1
  • feat: add similarity threshold for get_word_forms function

    Sometimes, the get_word_forms function returns outlandish results:

    >>> get_word_forms("continent")
    {'n': {'contentments', 'continent', 'containment', 'continence', 'continency', 'continents', 'continences', 'continencies', 'container', 'content', 'containments', 'contents', 'contentment', 'containers'}, 'a': {'continent', 'continental', 'content'}, 'v': {'containerized', 'containerize', 'containerizing', 'contains', 'containing', 'contain', 'contented', 'containerizes', 'contenting', 'contents', 'content', 'contained', 'containerise'}, 'r': set()}
    

    Word forms such as "container" or "content" may not really make sense here for a user's purpose.

    This pull request allows users to set a custom similarity threshold to configure the similarity of the results obtained from the get_word_forms function as an easy way to filter out (or perhaps, include more) results.

    >>> get_word_forms('continent', 0.8)
    {'n': {'continents', 'continent', 'continences', 'continence', 'continencies', 'continency'}, 'a': {'continent', 'continental'}, 'v': set(), 'r': set()}
    

    The default behaviour of the library remains unchanged. If a user does not specify the similarity threshold, a default value of 0.4, which was hardcoded in the code till now, is used.
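As a rough illustration of how such a threshold can gate which related words are followed, here is a minimal sketch. It assumes a toy link table (`TOY_LINKS`) and a helper `related_words()`, both hypothetical; the real library draws its links from WordNet and its internals may differ. The similarity measure shown is `difflib.SequenceMatcher`, which the release notes mention for the 40% check.

```python
from difflib import SequenceMatcher

# Hypothetical toy table of "related lemma" links (not real WordNet data).
TOY_LINKS = {
    "verbal": ["verbally", "verb", "word"],
    "verb": ["verbs"],
    "word": ["wording"],
}


def similarity(a: str, b: str) -> float:
    """Similarity ratio between two strings, in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def related_words(word: str, threshold: float = 0.4, known=None) -> set:
    """Follow links recursively, skipping candidates that are too dissimilar."""
    if known is None:
        known = set()
    known.add(word)
    for candidate in TOY_LINKS.get(word, []):
        if candidate not in known and similarity(word, candidate) >= threshold:
            related_words(candidate, threshold, known)
    return known


print(sorted(related_words("verbal")))       # ['verb', 'verbal', 'verbally', 'verbs']
print(sorted(related_words("verbal", 0.0)))  # also reaches 'word' and 'wording'
```

Raising the threshold prunes more aggressively, and lowering it lets looser associations through, which is the knob this pull request exposes.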

    opened by sajal2692 1
  • NLTK Version Incompatibility with spacy-wordnet


    Hey! Is there a reason the minimum nltk version was increased to 3.5, and not >=3.3 (or some other lower version)? Spacy-wordnet requires nltk 3.3, so word_forms isn't compatible with that at the moment. Thanks!

    opened by YanisaHS 1
  • British only verb conjugation?


    I'm finding word_forms to be very useful, and I really appreciate you putting it out there.

    One small issue I ran into is with the word "model"

    Here are the results I get: { 'n': {'modellings', 'modelings', 'modeller', 'model', 'modelers', 'models', 'modellers', 'modeler', 'modeling', 'modelling'}, 'a': {'model'}, 'v': {'modelling', 'modelled', 'model', 'models'}, 'r': set() }

    As you can see for the noun forms both British and American spellings are provided, but for the verb form, only the British conjugations are shown. I haven't dug into the code yet to see how it works, but if there's an easy fix that would be great.

    opened by jstafford 0
Releases(v2.1.0)
  • v2.1.0(Sep 18, 2020)

    Message from @CubieDev

    Hello again!

    @chrisjbryant suggested the use of python-Levenshtein for getting similarity ratios between strings in the comments of pull request #10, and though I've been busy until now, I finally got around to testing it. This pull request has some scripts in the Markdown. They are included in case you want to reproduce my tests and results. However, I would recommend not giving the scripts themselves much attention, and focusing on the explanations and outputs.

    Is it equivalent?

    I created the following file in the root folder of the project, where it could read from test_values.py.

    #!/usr/bin/env python
    # encoding: utf-8
    
    from difflib import SequenceMatcher
    from Levenshtein import ratio
    import unittest
    
    
    class TestWordForms(unittest.TestCase):
        """
        Simple TestCase for a specific input to output, one instance generated per test case for use in a TestSuite
        """
    
        def __init__(self, text_input: str, expected_output: dict, description: str = ""):
            super().__init__()
            self.text_input = text_input
            self.expected_output = expected_output
            self.description = description
    
        def setUp(self):
            pass
    
        def tearDown(self):
            pass
    
        def runTest(self):
            self.assertEqual(
                SequenceMatcher(None, self.text_input, self.expected_output).ratio(),
                ratio(self.text_input, self.expected_output),
                self.description,
            )
    
    
    if __name__ == "__main__":
        from test_values import test_values
    
        suite = unittest.TestSuite()
        suite.addTests(
            TestWordForms(
                inp,
                value,
                f"difflib.SequenceMatcher(None, {repr(inp)}, {repr(value)}) ~= Levenshtein.ratio({repr(inp)}, {repr(value)})",
            )
            for inp, out in test_values
            for values in out.values()
            for value in values
        )
        unittest.TextTestRunner().run(suite)
    

    In short, this takes all input values from test_values.py, and all individual outputs for these inputs, and checks whether the similarity ratio between these two is identical when using difflib.SequenceMatcher().ratio() vs Levenshtein.ratio(). The output is the following:

    .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
    ----------------------------------------------------------------------
    Ran 621 tests in 0.080s
    
    OK
    

    So, the Levenshtein.ratio() is indeed equivalent to difflib.SequenceMatcher().ratio() for these test cases.

    But is it faster?

    Again, I wrote a quick script for this. None of these are included in the actual commits as they should not be packaged with word_forms. The script is:

    #!/usr/bin/env python
    # encoding: utf-8
    
    from timeit import timeit
    from difflib import SequenceMatcher
    from Levenshtein import ratio
    
    from test_values import test_values
    test_cases = [
        (
            inp,
            value,
        )
        for inp, out in test_values
        for values in out.values()
        for value in values
    ]
    
    n = 100
    ratio_list = []
    for str_one, str_two in test_cases:
        diff  = timeit(lambda: SequenceMatcher(None, str_one, str_two).ratio(), number=n)
        leven = timeit(lambda: ratio(str_one, str_two), number=n)
        ratio_list.append(diff / leven)
        #print(f"Levenshtein.ratio() is {ratio_list[-1]:.4f} times as fast as difflib.SequenceMatcher().ratio() for {repr(str_one)} to {repr(str_two)}")
    print(f"Minimum performance gain (ratio): {min(ratio_list):.4f} times as fast")
    print(f"Maximum performance gain (ratio): {max(ratio_list):.4f} times as fast")
    print(f"Median performance gain (ratio) : {sorted(ratio_list)[round(len(ratio_list)/2)]:.4f} times as fast")
    
    

    Which outputted:

    Minimum performance gain (ratio): 21.2509 times as fast
    Maximum performance gain (ratio): 194.4625 times as fast
    Median performance gain (ratio) : 78.2975 times as fast
    

    So, yes, it is much faster. Will it be a noticeable performance increase when implemented in word_forms? Well, I made a quick script for that too.

    Is it actually noticeably faster in get_word_forms?

    #!/usr/bin/env python
    # encoding: utf-8
    
    from timeit import timeit
    from word_forms.word_forms import get_word_forms
    from test_values import test_values
    
    n = 200
    speed_list = []
    for test_case in test_values:
        str_one = test_case[0]
        speed = timeit(lambda: get_word_forms(str_one), number=n)
        speed_list.append(speed)
        #print(f"Took {speed:.8f}s")
    print(f"Minimum execution time (seconds): {min(speed_list):.8f}s")
    print(f"Maximum execution time (seconds): {max(speed_list):.8f}s")
    print(f"Median execution time (seconds) : {sorted(speed_list)[round(len(speed_list)/2)]:.8f}s")
    

    Which outputted the following when using difflib.SequenceMatcher().ratio(): (Note, the execution time is for calling the function 200 times)

    Minimum execution time (seconds): 0.01940580s
    Maximum execution time (seconds): 1.16317950s
    Median execution time (seconds) : 0.07265300s
    

    and the following for Levenshtein.ratio():

    Minimum execution time (seconds): 0.01647300s
    Maximum execution time (seconds): 1.23246420s
    Median execution time (seconds) : 0.05827050s
    

    When considering the median, there is a measurable performance increase of some ~20%, but this is likely not enough for any user to actually notice. Regardless, using Levenshtein.ratio() is definitely preferable, unless you desperately want to keep the number of dependencies to a minimum.


    Thank you for the recommendation @chrisjbryant

    All tests still pass, as expected.

    • Tom Aarsen

    Message from @gutfeeling

    1. Added a simple lemmatizer in word_forms.lemmatizer. It uses get_word_forms() to generate all forms of the word and then picks the shortest form that appears first in the dictionary (i.e. in alphabetically sorted order).
    2. Updated dependencies. Unipath has been replaced by Python3's pathlib. NLTK and inflect versions have been bumped.
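The selection rule described in item 1 (shortest form, ties broken alphabetically) can be sketched as follows. This is not the code in word_forms.lemmatizer; `pick_lemma()` is a hypothetical helper, and the input dictionary is hand-written in the shape that get_word_forms() returns.

```python
def pick_lemma(word_forms: dict) -> str:
    """Pick the shortest form; break ties by alphabetical order."""
    all_forms = set().union(*word_forms.values())
    if not all_forms:
        raise ValueError("no word forms to choose from")
    # Sorting key: length first, then the string itself.
    return min(all_forms, key=lambda form: (len(form), form))


# Hand-written example input (assumed, not produced by the library here).
forms = {
    "n": {"election", "elections", "elector"},
    "a": {"electoral", "elective"},
    "v": {"elect", "elects", "elected", "electing"},
    "r": set(),
}
print(pick_lemma(forms))  # elect
```

Note that under this rule two equally short forms such as "ran" and "run" would resolve to "ran", so the result is a simple heuristic rather than a linguistically guaranteed base form.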
  • v2.0.0(Sep 4, 2020)

    This release is a result of the awesome work done by @CubieDev.

    Improvement Overview

      1. Added a Testing Suite to track results. Run it with python test_word_forms.py.
      2. Fixed strange case sensitivity for some nouns, e.g.:
    get_word_forms("death") => {
     'n': {'death', 'dying', 'dice', 'Death', 'dyings', 'die', 'deaths', 'Deaths'},
          ...
    }
    
      3. Resolved some of the incorrect pluralisation provided by inflect: No more "politicss".
      4. Resolved some issues with weird unrelated results, such as:
    get_word_forms("verb") => {
      'a': {'verbal'},
      'n': {'wordings', 'wording', 'word', 'verbs', 'verb', 'words'},
      'r': {'verbally'},
      'v': {'verbified',
            'verbifies',
            'verbify',
            'verbifying',
            'word',
            'worded',
            'wording',
            'words'}
    }
    

    This is now:

    get_word_forms("verb") => {
      "a": {"verbal"},
      "n": {"verbs", "verb"},
      "r": {"verbally"},
      "v": {"verbifying", "verbified", "verbify", "verbifies"},
    }
    
      5. Resolved words like "ran", "am" and "was" not returning any values. The old system returned:
    get_word_forms("ran") => {'n': set(), 'a': set(), 'v': set(), 'r': set()}
    get_word_forms("run") => {
      'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
      'a': {'running', 'runny'}, 
      'v': {'running', 'run', 'runs', 'ran'}, 
      'r': set()
    }
    

    This is now:

    get_word_forms("ran") => {
      'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
      'a': {'running', 'runny'}, 
      'v': {'running', 'run', 'runs', 'ran'}, 
      'r': set()
    }
    get_word_forms("run") => {
      'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
      'a': {'running', 'runny'}, 
      'v': {'running', 'run', 'runs', 'ran'}, 
      'r': set()
    }
    
      6. Heavily improved performance of importing the module. The following program
    from time import time
    t = time()
    from word_forms.word_forms import get_word_forms
    print(f"It took {time() - t:.4f}s to load the module")
    

    used to output

    It took 10.1868s to load the module
    

    and now it outputs

    It took 2.7437s to load the module
    

    In addition to these fixes, this pull request solves issues #2 and #3.


    Detailed changes

    I've split up my work into small commits, so that the reasoning and changes in each commit are concise and hopefully easy to follow. I would strongly recommend going through the commits in order to see what I've done, rather than immediately jumping to the overall effect they had on the two main files of your project.

    I'll go through each of my commits and tell you my reasoning and what they did.

    • 885445c Added a test suite with a bunch of words and their (expected) results. These were manually edited to remove any errors, such as "politicss".
    • ead7285 Rather than looping over all synsets, and then over all lemmas, to get all words, I now call nltk.corpus.words.words(), which gives you (if I recall correctly) some 240000 words rather than the previous 180000, and in considerably less time. This is responsible for improvement vi.
    • 0f6d9b1 Commit title and description speak for themselves.
    • 4f848e4 The function get_related_lemmas() is more intuitively implemented as a recursive function. I keep track of a list of known lemmas. When the function is called, it will take the word parameter, get the lemmas for that word exactly, add those to known_lemmas if not already present, and recursively call get_related_lemmas() again with lemmas related to the original word. Each of these recursive calls will add more entries to known_lemmas, which is then returned. Note that at this time (this will change in a later commit), this function is identical in functionality, it just has a different implementation.
    • 8a05104 Commit title and description speak for themselves.
    • fe740e6 Slight simplification using .copy() on a list rather than using a list comprehension to copy a list.
    • 2e8c2a9 Moved away from iterating over a nested list, and using a dictionary instead. Previously, for each verb you iterated over the entire CONJUGATED_VERB_LIST, and then for each nested list checked whether the verb was in that nested list. This is a very slow operation, one which does not need to be this slow. Now, I use a dict that points to an instance of the Verb class, which holds a set of strings. To illustrate the change, I'll show the old and new representations:
    # Old representation: a list of lists, one inner list per verb
    CONJUGATED_VERB_LIST = [
      ['convoluted', 'convoluting', 'convolute', 'convolutes'],
      ['fawn', 'fawned', 'fawns', 'fawning']
    ]
    
    # New representation: every conjugation maps to a shared Verb instance
    v1 = Verb({'convoluted', 'convoluting', 'convolute', 'convolutes'})
    v2 = Verb({'fawn', 'fawned', 'fawns', 'fawning'})
    CONJUGATED_VERB_DICT = {
      'convoluted': v1, 
      'convoluting': v1,
      'convolute': v1, 
      'convolutes': v1, 
      'fawn': v2, 
      'fawned': v2, 
      'fawns': v2, 
      'fawning': v2
    }
    

    In the old system, we need:

    for verb in verb_set:
        for word_list in CONJUGATED_VERB_LIST:
            if verb in word_list:
                for word in word_list:
                    related_word_dict["v"].add(word)
    

    You can count the number of nested loops for yourself. Now we only need:

    for verb in verb_set:
        if verb in CONJUGATED_VERB_DICT:
            related_word_dict["v"] |= CONJUGATED_VERB_DICT[verb].verbs
    

    This is considerably faster. Note that |= is a set union operation.

    • 30c8282 The changes in line 35 are reverted later on. Furthermore, now every pertainym is used in ADJECTIVE_TO_ADVERB, rather than just the first one. Other than that, word_forms is just slightly optimized to not need a list comprehension.
    • a9cd983 This is a very interesting commit. It uses something very similar to the difflib.get_close_matches() used in constants.py: difflib.SequenceMatcher. Now, new lemmas found in get_related_lemmas will only be considered if they are deemed at least 40% similar to the previous lemma. This will avoid jumps like "verbal" -> "word" and "genetic" -> "origin".
    • 7f6d5d2 Commit title and description speak for themselves.
    • 4bd4cd5 Commit title and description speak for themselves.
    • d1c6340 Commit title and description speak for themselves. This is responsible for improvement ii.
    • b2a9daf Rather than blindly adding the plural of any noun, we ensure that the pluralized noun does not end with "css" to avoid "geneticss" and "politicss". We override this change later in commit 2ea150e.
    • 80d27bf Commit title and description speak for themselves.
    • 5f31968 Commit title and description speak for themselves.
    • 77b9c25 Commit title and description speak for themselves.
    • 0cf0628 Commit title and description speak for themselves.
    • 359fc0c Rather than turning words inputted to get_word_forms to singular, we now use the NLTK WordNetLemmatizer() to lemmatize the input word for nouns, verbs, adjectives and adverbs. With this change, we get {"run"} as words when the input was "ran", or {"be", "am"} if we input "am". Then, for each of these words we call get_related_lemmas, and duplicate lemmas are removed. This is responsible for improvement v.
    • 18eeba6 Commit title and description speak for themselves.
    • 2ea150e Improves on commit b2a9daf. Uses a simple regular expression to detect whether the pluralised noun ends with a consonant followed by "ss". Commit title and description speak for themselves. This is responsible for improvement iii.
    • c832b6b Commit title and description speak for themselves.
    • 384f298 Commit title and description speak for themselves.
    • e71f64f Now that the program output has been improved, some of the examples are outdated. I've updated them, and added the now relevant examples of "am" and "ran". The Contribution section is also updated to reflect that a bug was now fixed.
    • d11297e Commit title and description speak for themselves.
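The plural check mentioned for commits b2a9daf and 2ea150e (reject a pluralised noun that ends with a consonant followed by "ss") could look roughly like the sketch below. The exact regular expression used in the commit is not shown in these notes, so the pattern and the helper `plausible_plural()` are assumptions made for illustration.

```python
import re

# Assumed pattern: one consonant letter followed by "ss" at the end of the word.
BAD_PLURAL = re.compile(r"[b-df-hj-np-tv-z]ss$")


def plausible_plural(plural: str) -> bool:
    """Return False for suspicious plurals such as 'politicss'."""
    return BAD_PLURAL.search(plural) is None


candidates = ["politicss", "geneticss", "presidents", "glass", "runs"]
print([word for word in candidates if plausible_plural(word)])
# ['presidents', 'glass', 'runs']
```

Requiring a consonant before the "ss" keeps ordinary words such as "glass" (vowel followed by "ss") from being rejected, which a plain check for a trailing "ss" would not do.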

    Tests

    The modified program passes all provided tests. The original program will fail 9 of them. The output of the tests on the original program is provided here: original_test_output.txt

    I've tested my changes with Python 3.7.4. Whether the program is still Python 2 compatible I do not know.

    Potential additions/considerations for you

    • Add tests to setup.py so tests can be run using setup.py.
    • Check/confirm Python 2 compatibility.
    • Consider adding a contributors section to the README.
    • Consider checking the README to see if it still has any inconsistencies with the new version, e.g. if it warns about bugs that no longer exist.
    • Note that the pluralisation by inflect can still cause issues, as it comes up with words like "runninesses" as the plural of "runniness".
Owner
Dibya Chakravorty