TextDescriptives - A Python library for calculating a large variety of statistics from text

Overview

TextDescriptives

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives

📰 News

  • TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the stanza_version branch and will no longer be maintained.
  • Check out the brand new documentation here!

👩‍💻 Usage

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.

td.extract_df(doc)
# td.extract_dict(doc)
| | text | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | pos_prop_DT | pos_prop_NN | pos_prop_VBZ | pos_prop_VBN | pos_prop_. | pos_prop_PRP | pos_prop_VBP | pos_prop_IN | pos_prop_RB | pos_prop_VBD | pos_prop_, | pos_prop_WP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | The world (...) | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 1.69524 | 0.422282 | 0.44381 | 0.0863679 | 0.097561 | 0.121951 | 0.0487805 | 0.0487805 | 0.121951 | 0.170732 | 0.121951 | 0.121951 | 0.0731707 | 0.0243902 | 0.0243902 | 0.0243902 |
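
As a sanity check on the readability columns, the sketch below recomputes four of them from the counts in the example row. It assumes the standard textbook formulas for Flesch Reading Ease, Flesch-Kincaid Grade, LIX and RIX, and it assumes that the two long words (more than six letters) are "changed" and "remember". It is not taken from the library's source.

```python
# Counts read off the example row above.
n_tokens = 35
n_sentences = 5
n_syllables = 38        # assumption: 35 tokens * syllables_per_token_mean (1.08571)
n_long_words = 2        # assumption: "changed" and "remember" (> 6 letters)

words_per_sentence = n_tokens / n_sentences
syllables_per_word = n_syllables / n_tokens

# Standard readability formulas (not copied from TextDescriptives itself).
flesch_reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
flesch_kincaid_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
lix = words_per_sentence + 100 * n_long_words / n_tokens
rix = n_long_words / n_sentences

print(round(flesch_reading_ease, 3))  # 107.879
print(round(lix, 4))                  # 12.7143
print(rix)                            # 0.4
```

Under these assumptions the results agree with the flesch_reading_ease, flesch_kincaid_grade, lix and rix values in the row above.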

Use the metrics parameter to choose which group(s) of metrics to extract: one or more of readability, dependency_distance, descriptive_stats, and pos_stats. It defaults to all of them.

If extract_df is called on the output of nlp.pipe, the result has one row per document and one column per metric. Similarly, extract_dict returns one key per metric, each mapping to a list of values (one per document).

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")
| | text | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std |
| --- | --- | --- | --- | --- | --- |
| 0 | The world (...) | 1.69524 | 0.422282 | 0.44381 | 0.0863679 |
| 1 | He felt (...) | 2.56 | 0 | 0.44 | 0 |

The text column can be excluded by setting include_text to False.
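
To illustrate the two output layouts described above (one row per document versus one key per metric with a list of values), here is a small stand-alone sketch. The helper name rows_to_columns is hypothetical and the values are copied from the dependency_distance example; this is not the library's own conversion code.

```python
from collections import defaultdict

# One dict per document, i.e. the "one row per document" layout.
rows = [
    {"dependency_distance_mean": 1.69524, "dependency_distance_std": 0.422282},
    {"dependency_distance_mean": 2.56, "dependency_distance_std": 0.0},
]

def rows_to_columns(rows):
    """Turn a list of per-document dicts into one dict of per-metric lists."""
    columns = defaultdict(list)
    for row in rows:
        for metric, value in row.items():
            columns[metric].append(value)
    return dict(columns)

columns = rows_to_columns(rows)
print(columns["dependency_distance_mean"])  # [1.69524, 2.56]
```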

Using specific components

The specific components (descriptive_stats, readability, dependency_distance and pos_stats) can be loaded individually. This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run the dependency parser or part-of-speech tagger.

nlp = spacy.blank("da")
nlp.add_pipe("descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text = False)
| | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.4 | 3 | 2.59615 | 10 | 10 | 1 | 1.65 | 1 | 0.852936 | 20 | 19 | 0.95 | 90 | 2 |
| 1 | 4 | 3.5 | 2.44949 | 6 | 6 | 3 | 1.58333 | 1 | 0.862007 | 12 | 12 | 1 | 53 | 2 |
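
To make the descriptive_stats columns concrete, the sketch below recomputes a few of them for the English example text used earlier in this README. It uses a naive regex sentence splitter and tokenizer as stand-ins for spaCy (an assumption; spaCy's tokenization differs in general), lowercases tokens before counting unique ones, and uses the population standard deviation.

```python
import re
import statistics

text = (
    "The world is changed. I feel it in the water. I feel it in the earth. "
    "I smell it in the air. Much that once was is lost, for none now live "
    "who remember it."
)

# Naive stand-ins for spaCy's sentence segmentation and tokenization.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
tokens_per_sentence = [re.findall(r"[A-Za-z]+", s) for s in sentences]
tokens = [tok for sent in tokens_per_sentence for tok in sent]

token_lengths = [len(tok) for tok in tokens]
sentence_lengths = [len(sent) for sent in tokens_per_sentence]

stats = {
    "token_length_mean": statistics.mean(token_lengths),
    "token_length_median": statistics.median(token_lengths),
    "token_length_std": statistics.pstdev(token_lengths),
    "sentence_length_mean": statistics.mean(sentence_lengths),
    "sentence_length_median": statistics.median(sentence_lengths),
    "sentence_length_std": statistics.pstdev(sentence_lengths),
    "n_tokens": len(tokens),
    "n_unique_tokens": len({tok.lower() for tok in tokens}),
}
stats["proportion_unique_tokens"] = stats["n_unique_tokens"] / stats["n_tokens"]

for name, value in stats.items():
    print(name, round(value, 5))
```

With this simple tokenizer the numbers agree with the token_length, sentence_length and count columns reported for the English example.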

Available attributes

The table below shows the metrics included in TextDescriptives and their attributes on spaCy's Doc, Span, and Token objects. For more information, see the docs.

| Attribute | Component | Description |
| --- | --- | --- |
| Doc._.token_length | descriptive_stats | Dict containing mean, median, and std of token length. |
| Doc._.sentence_length | descriptive_stats | Dict containing mean, median, and std of sentence length. |
| Doc._.syllables | descriptive_stats | Dict containing mean, median, and std of number of syllables per token. |
| Doc._.counts | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc. |
| Doc._.pos_proportions | pos_stats | Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the document fit the POSTAG. |
| Doc._.readability | readability | Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc. |
| Doc._.dependency_distance | dependency_distance | Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc. |
| Span._.token_length | descriptive_stats | Dict containing mean, median, and std of token length in the span. |
| Span._.counts | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span. |
| Span._.dependency_distance | dependency_distance | Dict containing the mean dependency distance and proportion adjacent dependency relations in the span. |
| Token._.dependency_distance | dependency_distance | Dict containing the dependency distance and whether the head word is adjacent for a Token. |
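
The dependency_distance attributes build on each other from Token to Span to Doc. The sketch below shows one plausible way to compute them, under stated assumptions: a token's dependency distance is the absolute difference between its position and its head's position (the root counts as 0), a relation is adjacent when that distance is 1, and Doc-level values are the mean and population standard deviation over sentences. The head indices are hypothetical, not real parser output, and the function name is made up for illustration.

```python
import statistics

def sentence_stats(heads):
    """Per-sentence stats; heads[i] is the index of token i's head (root points to itself)."""
    distances = [abs(head - i) for i, head in enumerate(heads)]
    return {
        "dependency_distance": sum(distances) / len(distances),
        "prop_adjacent": sum(d == 1 for d in distances) / len(distances),
    }

# Hypothetical head indices for two sentences.
doc_heads = [
    [1, 3, 3, 3],  # distances 1, 2, 1, 0
    [1, 1, 1, 2],  # distances 1, 0, 1, 1
]
per_sentence = [sentence_stats(heads) for heads in doc_heads]

dists = [s["dependency_distance"] for s in per_sentence]
props = [s["prop_adjacent"] for s in per_sentence]
doc_stats = {
    "dependency_distance_mean": statistics.mean(dists),
    "dependency_distance_std": statistics.pstdev(dists),
    "prop_adjacent_dependency_relation_mean": statistics.mean(props),
    "prop_adjacent_dependency_relation_std": statistics.pstdev(props),
}
print(doc_stats)
```

A document with a single sentence gets a standard deviation of 0 under this definition, which matches the second row of the dependency_distance example output.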

Authors

Developed by Lasse Hansen (@HLasse) at the Center for Humanities Computing Aarhus

Collaborators:

Comments
  • :arrow_up: Update numpy requirement from <1.24.0,>=1.20.0 to >=1.20.0,<1.25.0

    Updates the requirements on numpy to permit the latest version.

    Release notes

    Sourced from numpy's releases.

    v1.24.1

    NumPy 1.24.1 Release Notes

    NumPy 1.24.1 is a maintenance release that fixes bugs and regressions discovered after the 1.24.0 release. The Python versions supported by this release are 3.8-3.11.

    Contributors

    A total of 12 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

    • Andrew Nelson
    • Ben Greiner +
    • Charles Harris
    • Clément Robert
    • Matteo Raso
    • Matti Picus
    • Melissa Weber Mendonça
    • Miles Cranmer
    • Ralf Gommers
    • Rohit Goswami
    • Sayed Adel
    • Sebastian Berg

    Pull requests merged

    A total of 18 pull requests were merged for this release.

    • #22820: BLD: add workaround in setup.py for newer setuptools
    • #22830: BLD: CIRRUS_TAG redux
    • #22831: DOC: fix a couple typos in 1.23 notes
    • #22832: BUG: Fix refcounting errors found using pytest-leaks
    • #22834: BUG, SIMD: Fix invalid value encountered in several ufuncs
    • #22837: TST: ignore more np.distutils.log imports
    • #22839: BUG: Do not use getdata() in np.ma.masked_invalid
    • #22847: BUG: Ensure correct behavior for rows ending in delimiter in...
    • #22848: BUG, SIMD: Fix the bitmask of the boolean comparison
    • #22857: BLD: Help raspian arm + clang 13 about __builtin_mul_overflow
    • #22858: API: Ensure a full mask is returned for masked_invalid
    • #22866: BUG: Polynomials now copy properly (#22669)
    • #22867: BUG, SIMD: Fix memory overlap in ufunc comparison loops
    • #22868: BUG: Fortify string casts against floating point warnings
    • #22875: TST: Ignore nan-warnings in randomized out tests
    • #22883: MAINT: restore npymath implementations needed for freebsd
    • #22884: BUG: Fix integer overflow in in1d for mixed integer dtypes #22877
    • #22887: BUG: Use whole file for encoding checks with charset_normalizer.

    Checksums

    ... (truncated)

    Commits
    • a28f4f2 Merge pull request #22888 from charris/prepare-1.24.1-release
    • f8fea39 REL: Prepare for the NumPY 1.24.1 release.
    • 6f491e0 Merge pull request #22887 from charris/backport-22872
    • 48f5fe4 BUG: Use whole file for encoding checks with charset_normalizer [f2py] (#22...
    • 0f3484a Merge pull request #22883 from charris/backport-22882
    • 002c60d Merge pull request #22884 from charris/backport-22878
    • 38ef9ce BUG: Fix integer overflow in in1d for mixed integer dtypes #22877 (#22878)
    • bb00c68 MAINT: restore npymath implementations needed for freebsd
    • 64e09c3 Merge pull request #22875 from charris/backport-22869
    • dc7bac6 TST: Ignore nan-warnings in randomized out tests
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies python 
    opened by dependabot[bot] 7
  • Write JOSS paper

    • [ ] Summary describing the purpose of the software
    • [ ] Statement of need
    • [ ] Package features & functionality
    • [ ] Target audience
    • [ ] References to other software addressing related needs
    • [ ] Past or ongoing research projects using the software
    opened by HLasse 4
  • Fix pos_stats extraction

    closes #74

    Log:

    • map pos_stats -> pos_proportions in __unpack_extensions, and adapt conditional logic accordingly
    • fix and simplify iterative metric extraction in Extractor init method (all metrics which do not throw an error should work with __unpack_extension)
    opened by rbroc 4
  • Optimize pos-stats

    Calculating POS stats seems to slow things down significantly. TODO:

    • Profile the package, what causes the slowdown?
    • Calculate the sum of values once and then call in the dict comprehension in PosStatistics

    Ideas for speedup:

    • Identify which pos_tags the model can make and predefine the counter/dictionary with those keys (would also solve the issue of different numbers of keys across docs/sentences)
    • Alternatives to Counter?

    Other options:

    • Remove posstats from default TextDescriptives and make it an optional component that takes in which specific POS tags the user is interested in and extracts those (+ 'others')
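
Two of the ideas above (computing the sum once, and predefining the keys so every document yields the same columns) can be sketched as follows. This is only an illustration: KNOWN_TAGS is a hypothetical tag inventory that would in practice come from the loaded model, and the function is not the package's PosStatistics implementation.

```python
from collections import Counter

# Hypothetical tag inventory; in practice this would come from the model.
KNOWN_TAGS = ("DT", "NN", "VBZ", "VBN", ".", "JJ")

def pos_proportions(tags, known_tags=KNOWN_TAGS):
    """Proportion of tokens per tag, with a fixed set of keys across documents."""
    counts = Counter(tags)
    total = sum(counts.values())  # computed once instead of once per tag
    proportions = {f"pos_prop_{tag}": 0.0 for tag in known_tags}
    if total:
        for tag, n in counts.items():
            proportions[f"pos_prop_{tag}"] = n / total
    return proportions

# Hypothetical tags for a five-token sentence.
result = pos_proportions(["DT", "NN", "VBZ", "VBN", "."])
print(result)
```

Tags that never occur keep a proportion of 0.0 instead of being dropped, so the number of keys no longer varies between documents.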
    enhancement 
    opened by HLasse 4
  • Add_pipe problem

    Hello, I am trying textdescriptives with Python 3.6 and spaCy 3.1 on a Linux system through JupyterLab.

    I ran into this error and am not quite sure how to deal with the spaCy decorator. Could you please help? Thanks!

    nlp.add_pipe("textdescriptives")

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    bug 
    opened by ruidaiphd 4
  • Ci: add pre-commit

    @hlasse will you add the pre-commit ci.

    Fixes #98

    We could also add mypy to this as well.

    I could not resolve this issue in the test (which test is correct):

    def test_readability_multi_process(nlp):
        texts = [oliver_twist, secret_garden, flatland]
        texts = [ftfy.fix_text(text) for text in texts]
    
        docs = nlp.pipe(texts, n_process=3)
        for doc in docs:
            assert doc._.readability
        text = ftfy.fix_text(text)
        text = " ".join(text.split())
        doc = nlp(text)
        assert pytest.approx(expected, rel=1e-2) == doc._.readability["rix"]
    
    
    def test_readability_multi_process(nlp):
        texts = [oliver_twist, secret_garden, flatland]
        texts = [ftfy.fix_text(text) for text in texts]
    
        docs = nlp.pipe(texts, n_process=3)
        for doc in docs:
            assert doc._.readability
    

    anything I am missing in the CI?

    opened by KennethEnevoldsen 3
  • Fixed to work with attribute ruler

    The pipeline does not work for e.g. the Danish pipeline, which uses an attribute ruler (as opposed to a tagger) for assigning POS tags. Maybe it is worth removing this restriction altogether, assuming other things could also set the POS tag. Instead, check whether the document is POS tagged using has_annotation.

    Frida and Kenneth

    opened by frillecode 3
  • ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en).

    How to reproduce the behaviour

    import spacy
    import textdescriptives as td
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textdescriptives")
    doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
    doc._.readability
    doc._.token_length

    Environment

    Name: textdescriptives
    Version: 0.1.1
    Windows 10
    Python 3.6

    Error message

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, spancat, textcat_multilabel, en.lemmatizer

    ValueError Traceback (most recent call last) in
          6 import textdescriptives as td
          7 nlp = spacy.load('en_core_web_sm')
    ----> 8 nlp.add_pipe('textdescriptives')
          9 doc = nlp('This is a short test text')
         10 doc._.readability # access some of the values

    ~.conda\envs\tdstanza\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
        780 config=config,
        781 raw_config=raw_config,
    --> 782 validate=validate,
        783 )
        784 pipe_index = self._get_pipe_index(before, after, first, last)

    ~.conda\envs\tdstanza\lib\site-packages\spacy\language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
        639 lang_code=self.lang,
        640 )
    --> 641 raise ValueError(err)
        642 pipe_meta = self.get_factory_meta(factory_name)
        643 config = config or {}

    bug 
    opened by id8314 3
  • Add pos_proportions

    Here goes!

    The function runs fine separate from the package:

    
    import spacy
    from collections import Counter
    from spacy.tokens import Doc
    
    # Load the English pipeline (tokenizer, tagger, parser and NER)
    nlp = spacy.load("en_core_web_sm")
    
    # Process a whole document
    text = ("Here is the first sentence. It was pretty short, yes. Let's make another one that's slightly longer and more complex.")
    
    doc = nlp(text)
    
    def pos_proportions(doc: Doc) -> dict:
        """
        Returns:
            Dict with proportions of part-of-speech tag in doc.
        """
        # Count fine-grained tags, then normalise by the total token count
        pos_counts = Counter(token.tag_ for token in doc)
        total = sum(pos_counts.values())
        return {tag: count / total for tag, count in pos_counts.items()}
    
    print(pos_proportions(doc))
    
    

    However, the test fails with:

    textdescriptives/tests/test_descriptive_stats.py F                       [100%]
    
    =================================== FAILURES ===================================
    _____________________________ test_pos_proportions _____________________________
    
    nlp = <spacy.lang.en.English object at 0x7ffc82162550>
    
        def test_pos_proportions(nlp):
            doc = nlp(
                "Here is the first sentence. It was pretty short. Let's make another one that's slightly longer and more complex."
            )
        
    >       assert doc._.pos_proportions == {'RB': 0.125, 'VBZ': 0.08333333333333333, 'DT': 0.08333333333333333, 'JJ': 0.125, 'NN': 0.08333333333333333, '.': 0.125, 'PRP': 0.08333333333333333, 'VBD': 0.041666666666666664, 'VB': 0.08333333333333333, 'WDT': 0.041666666666666664, 'JJR': 0.041666666666666664, 'CC': 0.041666666666666664, 'RBR': 0.041666666666666664}
    E       AssertionError: assert {'': 1.0} == {'.': 0.125, ...': 0.125, ...}
    E         Left contains 1 more item:
    E         {'': 1.0}
    E         Right contains 13 more items:
    E         {'.': 0.125,
    E          'CC': 0.041666666666666664,
    E          'DT': 0.08333333333333333,
    E          'JJ': 0.125,...
    

    I wager that's because I've not implemented the function correctly in the package somewhere, and would love a hand with that :-)

    opened by MartinBernstorff 3
  • Ci mypy

    Copy-pasted from previous PR

    fixes #104, fixes #103, fixes #108

    Note for reviewer:

    • ~~Now the extract_dict output is a jsonl style list of dicts instead of a singular dict with keys and list of values. What is the ideal format?~~
    • The current tests for extract_df are quite bad (e.g. we don't check that we get all the right keys out). I made plenty of mistakes during refactoring that weren't caught, so it might be worth adding better tests in another PR
    • Merged #110 into this branch as well, as this branch fixes some of the tests
    opened by HLasse 2
  • introduce src folder (probably lead to better behaviour)

    It actually caught a bug when I did it with dacy: files were not being properly added to the manifest.in file.

    I don't want to do this if you don't agree @HLasse ?

    enhancement 
    opened by KennethEnevoldsen 2
  • :arrow_up: Update pyphen requirement from <0.12.0,>=0.11.0 to >=0.11.0,<0.14.0

    Updates the requirements on pyphen to permit the latest version.

    Release notes

    Sourced from pyphen's releases.

    0.13.2

    • Add Thai dictionary
    Changelog

    Sourced from pyphen's changelog.

    Version 0.13.2

    Released on 2022-11-29.

    • Add Thai dictionary.

    Version 0.13.1

    Released on 2022-11-15.

    • Update Italian dictionary.

    Version 0.13.0

    Released on 2022-09-01.

    • Make language parameter case-insensitive.
    • Add Catalan dictionary.
    • Update French dictionary.
    • Update script upgrading dictionaries.

    Version 0.12.0

    Released on 2021-12-27.

    • Support Python 3.10, drop Python 3.6 support.
    • Add documentation.
    • Update Belarusian dictionary.

    Version 0.11.0

    Released on 2021-06-26.

    • Update dictionaries (add Albanian, Belarusian, Esperanto, Mongolian; update Italian, Portuguese of Brazil, Russian).
    • Use Flit for packaging. You can now build packages using pip install flit, flit build.

    Version 0.10.0

    ... (truncated)

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Bump ruff from 0.0.191 to 0.0.212

    Bumps ruff from 0.0.191 to 0.0.212.

    Release notes

    Sourced from ruff's releases.

    v0.0.212

    What's Changed

    New Contributors

    Full Changelog: https://github.com/charliermarsh/ruff/compare/v0.0.211...v0.0.212

    v0.0.211

    What's Changed

    Full Changelog: https://github.com/charliermarsh/ruff/compare/v0.0.210...v0.0.211

    v0.0.210

    What's Changed

    ... (truncated)

    Commits
    • ee4cae9 Bump version to 0.0.212
    • 2e3787a Remove an unneeded .to_string() in tokenize_files_to_codes_mapping (#1676)
    • 81b211d Simplify Option<String> → Option<&str> conversion using as_deref (#1675)
    • 1ad7226 Replace &String with &str in AnnotatedImport::ImportFrom (#1674)
    • 914287d Fix format and lint errors
    • 75bb6ad Implement duplicate isinstance detection (SIM101) (#1673)
    • 04111da Improve Pandas call and attribute detection (#1671)
    • 2464cf6 Fix some &String, &Option, and &Vec usages (#1670)
    • d34e6c0 Allow overhang in Google-style docstring arguments (#1668)
    • e6611c4 Fix flake8-import-conventions configuration examples (#1660)
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0

    Updates the requirements on ftfy to permit the latest version.

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Bump pre-commit from 2.20.0 to 2.21.0

    Bumps pre-commit from 2.20.0 to 2.21.0.

    Release notes

    Sourced from pre-commit's releases.

    pre-commit v2.21.0

    Features

    Fixes

    Changelog

    Sourced from pre-commit's changelog.

    2.21.0 - 2022-12-25

    Features

    Fixes

    Commits
    • 40c5bda v2.21.0
    • bb27ea3 Merge pull request #2642 from rkm/fix/dotnet-nuget-config
    • c38e0c7 dotnet: ignore nuget source during tool install
    • bce513f Merge pull request #2641 from rkm/fix/dotnet-tool-prefix
    • e904628 fix dotnet hooks with prefixes
    • d7b8b12 Merge pull request #2646 from pre-commit/pre-commit-ci-update-config
    • 94b6178 [pre-commit.ci] pre-commit autoupdate
    • b474a83 Merge pull request #2643 from pre-commit/pre-commit-ci-update-config
    • a179808 [pre-commit.ci] pre-commit autoupdate
    • 3aa6206 Merge pull request #2605 from lorenzwalthert/r/fix-exe
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
Releases(v2.1.0)
  • v2.1.0(Jan 6, 2023)

    Feature

    Fix

    • Remove previously assigned extensions before extracting new metrics (1a7ca00)
    • Remove doc extension instead of pipe component. TODO double check all assings are correct (bc32d47)

    Documentation

    • Add arxiv badge to readme (7b57aea)
    • Update readme after review and add citation in docs (728a0d4)
    • Add arxiv citation (bfab60b)
    • Add extract_metrics to docs and readme (163bee5)
    • Download spacy model in tutorial (96634cb)
    • Reset changelog (12007b7)
    Source code(tar.gz)
    Source code(zip)
    textdescriptives-2.1.0-py3-none-any.whl(241.87 KB)
    textdescriptives-2.1.0.tar.gz(1.20 MB)
  • v2.0.0(Jan 2, 2023)

    New API and updated docs and tutorials. See the documentation for more.

    What's Changed

    • Icon by @HLasse in https://github.com/HLasse/TextDescriptives/pull/68
    • ci: update pytest-coverage.comment version by @HLasse in https://github.com/HLasse/TextDescriptives/pull/70
    • :arrow_up: Update pandas requirement from <1.5.0,>=1.0.0 to >=1.0.0,<1.6.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/69
    • :arrow_up: Update pytest requirement from <7.2.0,>=7.1.3 to >=7.1.3,<7.3.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/73
    • :arrow_up: Bump schneegans/dynamic-badges-action from 1.3.0 to 1.6.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/72
    • :arrow_up: Bump MishaKav/pytest-coverage-comment from 1.1.37 to 1.1.39 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/76
    • ci: dependabot automerge if tests pass by @HLasse in https://github.com/HLasse/TextDescriptives/pull/78
    • Fix pos_stats extraction by @rbroc in https://github.com/HLasse/TextDescriptives/pull/75
    • Update docstrings by @HLasse in https://github.com/HLasse/TextDescriptives/pull/84
    • docs: docs for dependency distance formula by @HLasse in https://github.com/HLasse/TextDescriptives/pull/89
    • feat: Separate component loaders by @HLasse in https://github.com/HLasse/TextDescriptives/pull/88
    • Simple tutorial and misc docs by @HLasse in https://github.com/HLasse/TextDescriptives/pull/90
    • fix: allow multiprocessing in descriptive stats component by @HLasse in https://github.com/HLasse/TextDescriptives/pull/91
    • feat: spacy 3.4 compatibility - dashes to slashes in factory names by @HLasse in https://github.com/HLasse/TextDescriptives/pull/95
    • feat: add word embedding coherence/similarity by @HLasse in https://github.com/HLasse/TextDescriptives/pull/92
    • HLasse/Make-quality-work-with-n_process->-1 by @HLasse in https://github.com/HLasse/TextDescriptives/pull/96
    • Ci: add pre-commit by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/97
    • CI: Added semantic release by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/110
    • Ci mypy by @HLasse in https://github.com/HLasse/TextDescriptives/pull/111
    • Extract_df_and_tutorial_fix by @HLasse in https://github.com/HLasse/TextDescriptives/pull/116
    • Docs-move-documentation-to-create-func by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/117
    • Build-transition-to-pyproject-toml by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/122
    • HLasse/Change-documentation-landing-page by @HLasse in https://github.com/HLasse/TextDescriptives/pull/123
    • tutorial: add open in colab button by @HLasse in https://github.com/HLasse/TextDescriptives/pull/125
    • HLasse/Update-README by @HLasse in https://github.com/HLasse/TextDescriptives/pull/128
    • Version 2.0 by @HLasse in https://github.com/HLasse/TextDescriptives/pull/118
    • CI: Fix errors in CI by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/129

    New Contributors

    • @rbroc made their first contribution in https://github.com/HLasse/TextDescriptives/pull/75

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.1.0...2.0.0

  • v1.1.0(Sep 26, 2022)

    Added quality filter to check the data quality of your texts! Thanks to @KennethEnevoldsen for the PR.

    What's Changed

    • build: update requirements for python 3.10 by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/62
    • Feature: Add quality descriptives by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/63
    • Update pytest-cov-comment.yml by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/66
    • docs: minor readme updates by @HLasse in https://github.com/HLasse/TextDescriptives/pull/67

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.7...v1.1.0

  • v1.0.7(May 4, 2022)

    Many minor changes, mainly related to GitHub Actions and workflows. Fixed a couple of minor issues that caused tests to fail.

    What's Changed

    • update: more wiggle room for pos tests by @HLasse in https://github.com/HLasse/TextDescriptives/pull/30
    • updated ci workflow by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/27
    • Added dependabot workflow by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/26
    • Updated setup by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/25
    • Fixed error causing workflows to pass by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/40
    • add: black workflow and format everything by @HLasse in https://github.com/HLasse/TextDescriptives/pull/34
    • :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/35
    • :arrow_up: Bump actions/setup-python from 2 to 3 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/39
    • :arrow_up: Bump schneegans/dynamic-badges-action from 1.2.0 to 1.3.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/41
    • update: more robust spacy model download by @HLasse in https://github.com/HLasse/TextDescriptives/pull/46
    • check tests by @HLasse in https://github.com/HLasse/TextDescriptives/pull/47
    • update pytest-coverage-comment by @HLasse in https://github.com/HLasse/TextDescriptives/pull/45
    • update package version by @HLasse in https://github.com/HLasse/TextDescriptives/pull/48

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.6...v1.0.7

  • v1.0.6(Mar 4, 2022)

    Fixed to also work with an attribute ruler, as opposed to just a tagger.

    What's Changed

    • add extract_dict function by @HLasse in https://github.com/HLasse/TextDescriptives/pull/5
    • Add pos_proportions by @martbern in https://github.com/HLasse/TextDescriptives/pull/6
    • master to posstatistics by @HLasse in https://github.com/HLasse/TextDescriptives/pull/7
    • Add documentation for pos_stats by @martbern in https://github.com/HLasse/TextDescriptives/pull/8
    • Add documentation for pos_stats by @HLasse in https://github.com/HLasse/TextDescriptives/pull/9
    • change numpy requirement by @HLasse in https://github.com/HLasse/TextDescriptives/pull/11
    • Add Span support to pos_proportions by @martbern in https://github.com/HLasse/TextDescriptives/pull/14
    • Added references and changed pos-stats to part-of-speech stats by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/16
    • Added missing word by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/18
    • Fixed to work with attribute ruler by @frillecode in https://github.com/HLasse/TextDescriptives/pull/19

    New Contributors

    • @martbern made their first contribution in https://github.com/HLasse/TextDescriptives/pull/6
    • @frillecode made their first contribution in https://github.com/HLasse/TextDescriptives/pull/19

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.1...v1.0.6

  • v1.0.1(Aug 9, 2021)

  • v0.1(Jul 26, 2021)

Owner
PhD student in machine learning for healthcare at Aarhus University