TextDescriptives - A Python library for calculating a large variety of statistics from text

Overview

TextDescriptives

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives

đź“° News

  • TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the stanza_version branch and will no longer be maintained.
  • Check out the brand new documentation here!

👩‍💻 Usage

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length
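
Both extensions return dicts. A quick sketch of accessing individual values (the key names are assumptions based on the extract_df column names shown further down; values rounded):

doc._.readability["flesch_reading_ease"]  # ~107.88 for the text above
doc._.token_length["token_length_mean"]   # ~3.29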

TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.

td.extract_df(doc)
# td.extract_dict(doc)

The one-row DataFrame contains the following columns (shown transposed here, one metric per line, for readability):

text                                      The world (...)
token_length_mean                         3.28571
token_length_median                       3
token_length_std                          1.54127
sentence_length_mean                      7
sentence_length_median                    6
sentence_length_std                       3.09839
syllables_per_token_mean                  1.08571
syllables_per_token_median                1
syllables_per_token_std                   0.368117
n_tokens                                  35
n_unique_tokens                           23
proportion_unique_tokens                  0.657143
n_characters                              121
n_sentences                               5
flesch_reading_ease                       107.879
flesch_kincaid_grade                      -0.0485714
smog                                      5.68392
gunning_fog                               3.94286
automated_readability_index               -2.45429
coleman_liau_index                        -0.708571
lix                                       12.7143
rix                                       0.4
dependency_distance_mean                  1.69524
dependency_distance_std                   0.422282
prop_adjacent_dependency_relation_mean    0.44381
prop_adjacent_dependency_relation_std     0.0863679
pos_prop_DT                               0.097561
pos_prop_NN                               0.121951
pos_prop_VBZ                              0.0487805
pos_prop_VBN                              0.0487805
pos_prop_.                                0.121951
pos_prop_PRP                              0.170732
pos_prop_VBP                              0.121951
pos_prop_IN                               0.121951
pos_prop_RB                               0.0731707
pos_prop_VBD                              0.0243902
pos_prop_,                                0.0243902
pos_prop_WP                               0.0243902

Set which group(s) of metrics to extract using the metrics parameter (one or more of readability, dependency_distance, descriptive_stats, and pos_stats; defaults to all).

If extract_df is called on an object created with nlp.pipe, the output is formatted with one row per document and a column for each metric. Similarly, extract_dict will have a key for each metric, with the value being a list of one entry per doc.

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")
   text             dependency_distance_mean  dependency_distance_std  prop_adjacent_dependency_relation_mean  prop_adjacent_dependency_relation_std
0  The world (...)  1.69524                   0.422282                 0.44381                                 0.0863679
1  He felt (...)    2.56                      0                        0.44                                    0
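
extract_dict works the same way on multiple docs. A small sketch (assuming extract_dict accepts the same metrics parameter as extract_df; per the description above, each key maps to a list with one value per doc):

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])
out = td.extract_dict(docs, metrics="dependency_distance")
out["dependency_distance_mean"]  # [1.69524, 2.56] - one value per doc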

The text column can be excluded by setting include_text to False.

Using specific components

The specific components (descriptive_stats, readability, dependency_distance and pos_stats) can be loaded individually. This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run the dependency parser or part-of-speech tagger.

nlp = spacy.blank("da")
nlp.add_pipe("descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent pĂĄ ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text=False)
The output contains only the descriptive-statistics columns (shown transposed here for readability; one column per doc):

                            doc 0     doc 1
token_length_mean           4.4       4
token_length_median         3         3.5
token_length_std            2.59615   2.44949
sentence_length_mean        10        6
sentence_length_median      10        6
sentence_length_std         1         3
syllables_per_token_mean    1.65      1.58333
syllables_per_token_median  1         1
syllables_per_token_std     0.852936  0.862007
n_tokens                    20        12
n_unique_tokens             19        12
proportion_unique_tokens    0.95      1
n_characters                90        53
n_sentences                 2         2
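
The same pattern works for the other components. A hedged sketch for loading only the readability component (assumptions: the component can be added on its own, and a sentencizer is needed on a blank pipeline since readability metrics rely on sentence boundaries):

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # assumed prerequisite: sentence boundaries
nlp.add_pipe("readability")
doc = nlp("The world is changed. I feel it in the water.")
doc._.readability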

Available attributes

The table below shows the metrics included in TextDescriptives and their attributes on spaCy's Doc, Span, and Token objects. For more information, see the docs.

Attribute Component Description
Doc._.token_length descriptive_stats Dict containing mean, median, and std of token length.
Doc._.sentence_length descriptive_stats Dict containing mean, median, and std of sentence length.
Doc._.syllables descriptive_stats Dict containing mean, median, and std of number of syllables per token.
Doc._.counts descriptive_stats Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc.
Doc._.pos_proportions pos_stats Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the document fit the POSTAG.
Doc._.readability readability Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc.
Doc._.dependency_distance dependency_distance Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc.
Span._.token_length descriptive_stats Dict containing mean, median, and std of token length in the span.
Span._.counts descriptive_stats Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span.
Span._.dependency_distance dependency_distance Dict containing the mean dependency distance and proportion adjacent dependency relations in the span.
Token._.dependency_distance dependency_distance Dict containing the dependency distance and whether the head word is adjacent for a Token.
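
The Span and Token extensions are accessed the same way as the Doc extensions. A short sketch using the doc from the Usage example above:

span = doc[0:4]              # "The world is changed"
span._.token_length          # dict with mean, median, and std for the span
span._.counts                # token and character counts for the span
token = doc[1]               # "world"
token._.dependency_distance  # distance and head adjacency for this token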

Authors

Developed by Lasse Hansen (@HLasse) at the Center for Humanities Computing Aarhus.

Collaborators:

Comments
  • :arrow_up: Update numpy requirement from <1.24.0,>=1.20.0 to >=1.20.0,<1.25.0

    :arrow_up: Update numpy requirement from <1.24.0,>=1.20.0 to >=1.20.0,<1.25.0

    Updates the requirements on numpy to permit the latest version.

    Release notes

    Sourced from numpy's releases.

    v1.24.1

    NumPy 1.24.1 Release Notes

    NumPy 1.24.1 is a maintenance release that fixes bugs and regressions discovered after the 1.24.0 release. The Python versions supported by this release are 3.8-3.11.

    Contributors

    A total of 12 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

    • Andrew Nelson
    • Ben Greiner +
    • Charles Harris
    • ClĂ©ment Robert
    • Matteo Raso
    • Matti Picus
    • Melissa Weber Mendonça
    • Miles Cranmer
    • Ralf Gommers
    • Rohit Goswami
    • Sayed Adel
    • Sebastian Berg

    Pull requests merged

    A total of 18 pull requests were merged for this release.

    • #22820: BLD: add workaround in setup.py for newer setuptools
    • #22830: BLD: CIRRUS_TAG redux
    • #22831: DOC: fix a couple typos in 1.23 notes
    • #22832: BUG: Fix refcounting errors found using pytest-leaks
    • #22834: BUG, SIMD: Fix invalid value encountered in several ufuncs
    • #22837: TST: ignore more np.distutils.log imports
    • #22839: BUG: Do not use getdata() in np.ma.masked_invalid
    • #22847: BUG: Ensure correct behavior for rows ending in delimiter in...
    • #22848: BUG, SIMD: Fix the bitmask of the boolean comparison
    • #22857: BLD: Help raspian arm + clang 13 about __builtin_mul_overflow
    • #22858: API: Ensure a full mask is returned for masked_invalid
    • #22866: BUG: Polynomials now copy properly (#22669)
    • #22867: BUG, SIMD: Fix memory overlap in ufunc comparison loops
    • #22868: BUG: Fortify string casts against floating point warnings
    • #22875: TST: Ignore nan-warnings in randomized out tests
    • #22883: MAINT: restore npymath implementations needed for freebsd
    • #22884: BUG: Fix integer overflow in in1d for mixed integer dtypes #22877
    • #22887: BUG: Use whole file for encoding checks with charset_normalizer.

    Checksums

    ... (truncated)

    Commits
    • a28f4f2 Merge pull request #22888 from charris/prepare-1.24.1-release
    • f8fea39 REL: Prepare for the NumPY 1.24.1 release.
    • 6f491e0 Merge pull request #22887 from charris/backport-22872
    • 48f5fe4 BUG: Use whole file for encoding checks with charset_normalizer [f2py] (#22...
    • 0f3484a Merge pull request #22883 from charris/backport-22882
    • 002c60d Merge pull request #22884 from charris/backport-22878
    • 38ef9ce BUG: Fix integer overflow in in1d for mixed integer dtypes #22877 (#22878)
    • bb00c68 MAINT: restore npymath implementations needed for freebsd
    • 64e09c3 Merge pull request #22875 from charris/backport-22869
    • dc7bac6 TST: Ignore nan-warnings in randomized out tests
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies python 
    opened by dependabot[bot] 7
  • Write JOSS paper

    Write JOSS paper

    • [ ] Summary describing the purpose of the software
    • [ ] Statement of need
    • [ ] Package features & functionality
    • [ ] Target audience
    • [ ] References to other software addressing related needs
    • [ ] Past or ongoing research projects using the software
    opened by HLasse 4
  • Fix pos_stats extraction

    Fix pos_stats extraction

    closes #74

    Log:

    • map pos_stats -> pos_proportions in __unpack_extensions, and adapt conditional logic accordingly
    • fix and simplify iterative metric extraction in Extractor init method (all metrics which do not throw an error should work with __unpack_extension)
    opened by rbroc 4
  • Optimize pos-stats

    Optimize pos-stats

    Calculating POS stats seems to slow things down significantly. TODO:

    • Profile the package, what causes the slowdown?
    • Calculate the sum of the values once and reuse it in the dict comprehension in PosStatistics (see the sketch below)

    Ideas for speedup:

    • Identify which pos_tags the model can make and predefine the counter/dictionary with those keys (would also solve the issue of different numbers of keys across docs/sentences)
    • Alternatives to Counter?

    Other options:

    • Remove posstats from default TextDescriptives and make it an optional component that takes in which specific POS tags the user is interested in and extracts those (+ 'others')
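
    A minimal sketch of the one-pass computation proposed above (a hypothetical standalone function for illustration, not the package's PosStatistics implementation):

    from collections import Counter

    def pos_proportions(doc):
        counts = Counter(token.tag_ for token in doc)
        total = sum(counts.values())  # summed once, not once per key
        return {f"pos_prop_{tag}": n / total for tag, n in counts.items()}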
    enhancement 
    opened by HLasse 4
  • Add_pipe problem

    Add_pipe problem

    Hello, I am trying textdescriptives on Python 3.6 + spaCy 3.1 on a Linux system through JupyterLab.

    I found this error. Not quite sure how to deal with spaCy decorator. Could you please help? Thanks!

    nlp.add_pipe("textdescriptives")

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    bug 
    opened by ruidaiphd 4
  • Ci: add pre-commit

    Ci: add pre-commit

    @HLasse, will you add the pre-commit CI?

    Fixes #98

    We could also add mypy to this as well.

    I could not resolve this issue in the tests (which of the two versions below is correct?):

    def test_readability_multi_process(nlp):
        texts = [oliver_twist, secret_garden, flatland]
        texts = [ftfy.fix_text(text) for text in texts]
    
        docs = nlp.pipe(texts, n_process=3)
        for doc in docs:
            assert doc._.readability
        text = ftfy.fix_text(text)
        text = " ".join(text.split())
        doc = nlp(text)
        assert pytest.approx(expected, rel=1e-2) == doc._.readability["rix"]
    
    
    def test_readability_multi_process(nlp):
        texts = [oliver_twist, secret_garden, flatland]
        texts = [ftfy.fix_text(text) for text in texts]
    
        docs = nlp.pipe(texts, n_process=3)
        for doc in docs:
            assert doc._.readability
    

    anything I am missing in the CI?

    opened by KennethEnevoldsen 3
  • Fixed to work with attribute ruler

    Fixed to work with attribute ruler

    The pipeline does not work for e.g. the Danish pipeline, which uses an attribute ruler (as opposed to a tagger) for assigning POS tags. Maybe it is worth removing this restriction altogether, since other things could also set the POS tags, and instead checking whether the document is POS-tagged using has_annotation.
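
    A minimal sketch of the suggested check, using spaCy v3's Doc.has_annotation:

    import spacy

    nlp = spacy.blank("da")            # no tagger or attribute ruler
    doc = nlp("Spis skovsneglen, Mulle.")
    doc.has_annotation("TAG")          # False - the component could raise a clear error here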

    Frida and Kenneth

    opened by frillecode 3
  • ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en).

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en).

    How to reproduce the behaviour

    import spacy
    import textdescriptives as td
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textdescriptives")
    doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
    doc._.readability
    doc._.token_length

    Environment

    Name: textdescriptives
    Version: 0.1.1
    Windows 10
    Python 3.6

    Error message

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, spancat, textcat_multilabel, en.lemmatizer

    ValueError                                Traceback (most recent call last)
    <ipython-input> in <module>
          6 import textdescriptives as td
          7 nlp = spacy.load('en_core_web_sm')
    ----> 8 nlp.add_pipe('textdescriptives')
          9 doc = nlp('This is a short test text')
         10 doc._.readability # access some of the values

    ~\.conda\envs\tdstanza\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
        780             config=config,
        781             raw_config=raw_config,
    --> 782             validate=validate,
        783         )
        784         pipe_index = self._get_pipe_index(before, after, first, last)

    ~\.conda\envs\tdstanza\lib\site-packages\spacy\language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
        639                 lang_code=self.lang,
        640             )
    --> 641         raise ValueError(err)
        642         pipe_meta = self.get_factory_meta(factory_name)
        643         config = config or {}

    bug 
    opened by id8314 3
  • Add pos_proportions

    Add pos_proportions

    Here goes!

    The function runs fine separate from the package:

    
    import spacy
    from collections import Counter  # collections.Counter for a runtime counter
    from spacy.tokens import Doc, Span

    # Load English tokenizer, tagger, parser and NER
    nlp = spacy.load("en_core_web_sm")

    # Process whole documents
    text = ("Here is the first sentence. It was pretty short, yes. Let's make another one that's slightly longer and more complex.")

    doc = nlp(text)

    def pos_proportions(doc: Doc) -> dict:
        """
        Returns:
            Dict with proportions of each part-of-speech tag in doc.
        """
        pos_counts = Counter()

        for token in doc:
            pos_counts[token.tag_] += 1

        pos_proportions = {}

        for tag in pos_counts:
            pos_proportions[tag] = pos_counts[tag] / sum(pos_counts.values())

        return pos_proportions

    print(pos_proportions(doc))
    
    

    However, the test fails with:

    textdescriptives/tests/test_descriptive_stats.py F                       [100%]
    
    =================================== FAILURES ===================================
    _____________________________ test_pos_proportions _____________________________
    
    nlp = <spacy.lang.en.English object at 0x7ffc82162550>
    
        def test_pos_proportions(nlp):
            doc = nlp(
                "Here is the first sentence. It was pretty short. Let's make another one that's slightly longer and more complex."
            )
        
    >       assert doc._.pos_proportions == {'RB': 0.125, 'VBZ': 0.08333333333333333, 'DT': 0.08333333333333333, 'JJ': 0.125, 'NN': 0.08333333333333333, '.': 0.125, 'PRP': 0.08333333333333333, 'VBD': 0.041666666666666664, 'VB': 0.08333333333333333, 'WDT': 0.041666666666666664, 'JJR': 0.041666666666666664, 'CC': 0.041666666666666664, 'RBR': 0.041666666666666664}
    E       AssertionError: assert {'': 1.0} == {'.': 0.125, ...': 0.125, ...}
    E         Left contains 1 more item:
    E         {'': 1.0}
    E         Right contains 13 more items:
    E         {'.': 0.125,
    E          'CC': 0.041666666666666664,
    E          'DT': 0.08333333333333333,
    E          'JJ': 0.125,...
    

    I wager that's because I've not implemented the function correctly in the package somewhere, and would love a hand with that :-)

    opened by MartinBernstorff 3
  • Ci mypy

    Ci mypy

    Copy-pasted from previous PR

    fixes #104, fixes #103, fixes #108

    Note for reviewer:

    • ~~Now the extract_dict output is a jsonl style list of dicts instead of a singular dict with keys and list of values. What is the ideal format?~~
    • The current tests for extract_df are quite bad (e.g., we don't check that we get all the right keys out). I made plenty of mistakes during the refactor which weren't caught - might be worth adding better tests in another PR
    • Merged #110 into this branch as well, since this branch fixes some of the tests
    opened by HLasse 2
  • introduce src folder (probably leads to better behaviour)

    introduce src folder (probably leads to better behaviour)

    It actually caught a bug when I did this with DaCy: files were not being properly added to the MANIFEST.in file.

    I don't want to do this if you don't agree, @HLasse?

    enhancement 
    opened by KennethEnevoldsen 2
  • :arrow_up: Update pyphen requirement from <0.12.0,>=0.11.0 to >=0.11.0,<0.14.0

    :arrow_up: Update pyphen requirement from <0.12.0,>=0.11.0 to >=0.11.0,<0.14.0

    Updates the requirements on pyphen to permit the latest version.

    Release notes

    Sourced from pyphen's releases.

    0.13.2

    • Add Thai dictionary
    Changelog

    Sourced from pyphen's changelog.

    Version 0.13.2

    Released on 2022-11-29.

    • Add Thai dictionary.

    Version 0.13.1

    Released on 2022-11-15.

    • Update Italian dictionary.

    Version 0.13.0

    Released on 2022-09-01.

    • Make language parameter case-insensitive.
    • Add Catalan dictionary.
    • Update French dictionary.
    • Update script upgrading dictionaries.

    Version 0.12.0

    Released on 2021-12-27.

    • Support Python 3.10, drop Python 3.6 support.
    • Add documentation.
    • Update Belarusian dictionary.

    Version 0.11.0

    Released on 2021-06-26.

    • Update dictionaries (add Albanian, Belarusian, Esperanto, Mongolian; update Italian, Portuguese of Brazil, Russian).
    • Use Flit for packaging. You can now build packages using pip install flit, flit build.

    Version 0.10.0

    ... (truncated)

    Commits

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Bump ruff from 0.0.191 to 0.0.212

    :arrow_up: Bump ruff from 0.0.191 to 0.0.212

    Bumps ruff from 0.0.191 to 0.0.212.

    Release notes

    Sourced from ruff's releases.

    v0.0.212

    What's Changed

    New Contributors

    Full Changelog: https://github.com/charliermarsh/ruff/compare/v0.0.211...v0.0.212

    v0.0.211

    What's Changed

    Full Changelog: https://github.com/charliermarsh/ruff/compare/v0.0.210...v0.0.211

    v0.0.210

    What's Changed

    ... (truncated)

    Commits
    • ee4cae9 Bump version to 0.0.212
    • 2e3787a Remove an unneeded .to_string() in tokenize_files_to_codes_mapping (#1676)
    • 81b211d Simplify Option<String> → Option<&str> conversion using as_deref (#1675)
    • 1ad7226 Replace &String with &str in AnnotatedImport::ImportFrom (#1674)
    • 914287d Fix format and lint errors
    • 75bb6ad Implement duplicate isinstance detection (SIM101) (#1673)
    • 04111da Improve Pandas call and attribute detection (#1671)
    • 2464cf6 Fix some &String, &Option, and &Vec usages (#1670)
    • d34e6c0 Allow overhang in Google-style docstring arguments (#1668)
    • e6611c4 Fix flake8-import-conventions configuration examples (#1660)
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0

    :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0

    Updates the requirements on ftfy to permit the latest version.

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Bump pre-commit from 2.20.0 to 2.21.0

    :arrow_up: Bump pre-commit from 2.20.0 to 2.21.0

    Bumps pre-commit from 2.20.0 to 2.21.0.

    Release notes

    Sourced from pre-commit's releases.

    pre-commit v2.21.0

    Features

    Fixes

    Changelog

    Sourced from pre-commit's changelog.

    2.21.0 - 2022-12-25

    Features

    Fixes

    Commits
    • 40c5bda v2.21.0
    • bb27ea3 Merge pull request #2642 from rkm/fix/dotnet-nuget-config
    • c38e0c7 dotnet: ignore nuget source during tool install
    • bce513f Merge pull request #2641 from rkm/fix/dotnet-tool-prefix
    • e904628 fix dotnet hooks with prefixes
    • d7b8b12 Merge pull request #2646 from pre-commit/pre-commit-ci-update-config
    • 94b6178 [pre-commit.ci] pre-commit autoupdate
    • b474a83 Merge pull request #2643 from pre-commit/pre-commit-ci-update-config
    • a179808 [pre-commit.ci] pre-commit autoupdate
    • 3aa6206 Merge pull request #2605 from lorenzwalthert/r/fix-exe
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
Releases (v2.1.0)
  • v2.1.0 (Jan 6, 2023)

    Feature

    Fix

    • Remove previously assigned extensions before extracting new metrics (1a7ca00)
    • Remove doc extension instead of pipe component. TODO: double-check all assignments are correct (bc32d47)

    Documentation

    • Add arxiv badge to readme (7b57aea)
    • Update readme after review and add citation in docs (728a0d4)
    • Add arxiv citation (bfab60b)
    • Add extract_metrics to docs and readme (163bee5)
    • Download spacy model in tutorial (96634cb)
    • Reset changelog (12007b7)
  • v2.0.0 (Jan 2, 2023)

    New API and updated docs and tutorials. See the documentation for more.

    What's Changed

    • Icon by @HLasse in https://github.com/HLasse/TextDescriptives/pull/68
    • ci: update pytest-coverage.comment version by @HLasse in https://github.com/HLasse/TextDescriptives/pull/70
    • :arrow_up: Update pandas requirement from <1.5.0,>=1.0.0 to >=1.0.0,<1.6.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/69
    • :arrow_up: Update pytest requirement from <7.2.0,>=7.1.3 to >=7.1.3,<7.3.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/73
    • :arrow_up: Bump schneegans/dynamic-badges-action from 1.3.0 to 1.6.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/72
    • :arrow_up: Bump MishaKav/pytest-coverage-comment from 1.1.37 to 1.1.39 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/76
    • ci: dependabot automerge if tests pass by @HLasse in https://github.com/HLasse/TextDescriptives/pull/78
    • Fix pos_stats extraction by @rbroc in https://github.com/HLasse/TextDescriptives/pull/75
    • Update docstrings by @HLasse in https://github.com/HLasse/TextDescriptives/pull/84
    • docs: docs for dependency distance formula by @HLasse in https://github.com/HLasse/TextDescriptives/pull/89
    • feat: Separate component loaders by @HLasse in https://github.com/HLasse/TextDescriptives/pull/88
    • Simple tutorial and misc docs by @HLasse in https://github.com/HLasse/TextDescriptives/pull/90
    • fix: allow multiprocessing in descriptive stats component by @HLasse in https://github.com/HLasse/TextDescriptives/pull/91
    • feat: spacy 3.4 compatibility - dashes to slashes in factory names by @HLasse in https://github.com/HLasse/TextDescriptives/pull/95
    • feat: add word embedding coherence/similarity by @HLasse in https://github.com/HLasse/TextDescriptives/pull/92
    • HLasse/Make-quality-work-with-n_process->-1 by @HLasse in https://github.com/HLasse/TextDescriptives/pull/96
    • Ci: add pre-commit by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/97
    • CI: Added semantic release by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/110
    • Ci mypy by @HLasse in https://github.com/HLasse/TextDescriptives/pull/111
    • Extract_df_and_tutorial_fix by @HLasse in https://github.com/HLasse/TextDescriptives/pull/116
    • Docs-move-documentation-to-create-func by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/117
    • Build-transition-to-pyproject-toml by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/122
    • HLasse/Change-documentation-landing-page by @HLasse in https://github.com/HLasse/TextDescriptives/pull/123
    • tutorial: add open in colab button by @HLasse in https://github.com/HLasse/TextDescriptives/pull/125
    • HLasse/Update-README by @HLasse in https://github.com/HLasse/TextDescriptives/pull/128
    • Version 2.0 by @HLasse in https://github.com/HLasse/TextDescriptives/pull/118
    • CI: Fix errors in CI by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/129

    New Contributors

    • @rbroc made their first contribution in https://github.com/HLasse/TextDescriptives/pull/75

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.1.0...2.0.0

  • v1.1.0 (Sep 26, 2022)

    Added quality filter to check the data quality of your texts! Thanks to @KennethEnevoldsen for the PR.

    What's Changed

    • build: update requirements for python 3.10 by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/62
    • Feature: Add quality descriptives by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/63
    • Update pytest-cov-comment.yml by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/66
    • docs: minor readme updates by @HLasse in https://github.com/HLasse/TextDescriptives/pull/67

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.7...v1.1.0

  • v1.0.7 (May 4, 2022)

    Lots of minor changes, mainly related to GitHub Actions and workflows. Fixed a couple of minor issues causing tests to fail.

    What's Changed

    • update: more wiggle room for pos tests by @HLasse in https://github.com/HLasse/TextDescriptives/pull/30
    • updated ci workflow by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/27
    • Added dependabot workflow by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/26
    • Updated setup by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/25
    • Fixed error causing workflows to pass by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/40
    • add: black workflow and format everything by @HLasse in https://github.com/HLasse/TextDescriptives/pull/34
    • :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/35
    • :arrow_up: Bump actions/setup-python from 2 to 3 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/39
    • :arrow_up: Bump schneegans/dynamic-badges-action from 1.2.0 to 1.3.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/41
    • update: more robust spacy model download by @HLasse in https://github.com/HLasse/TextDescriptives/pull/46
    • check tests by @HLasse in https://github.com/HLasse/TextDescriptives/pull/47
    • update pytest-coverage-comment by @HLasse in https://github.com/HLasse/TextDescriptives/pull/45
    • update package version by @HLasse in https://github.com/HLasse/TextDescriptives/pull/48

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.6...v1.0.7

  • v1.0.6 (Mar 4, 2022)

    Fixed to also work with attribute ruler as opposed to just a tagger

    What's Changed

    • add extract_dict function by @HLasse in https://github.com/HLasse/TextDescriptives/pull/5
    • Add pos_proportions by @martbern in https://github.com/HLasse/TextDescriptives/pull/6
    • master to posstatistics by @HLasse in https://github.com/HLasse/TextDescriptives/pull/7
    • Add documentation for pos_stats by @martbern in https://github.com/HLasse/TextDescriptives/pull/8
    • Add documentation for pos_stats by @HLasse in https://github.com/HLasse/TextDescriptives/pull/9
    • change numpy requirement by @HLasse in https://github.com/HLasse/TextDescriptives/pull/11
    • Add Span support to pos_proportions by @martbern in https://github.com/HLasse/TextDescriptives/pull/14
    • Added references and changed pos-stats to part-of-speech stats by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/16
    • Added missing word by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/18
    • Fixed to work with attribute ruler by @frillecode in https://github.com/HLasse/TextDescriptives/pull/19

    New Contributors

    • @martbern made their first contribution in https://github.com/HLasse/TextDescriptives/pull/6
    • @frillecode made their first contribution in https://github.com/HLasse/TextDescriptives/pull/19

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.1...v1.0.6

  • v1.0.1 (Aug 9, 2021)

  • v0.1 (Jul 26, 2021)
