Overview

Visions

And these visions of data types, they kept us up past the dawn.

Visions provides an extensible suite of tools to support common data analysis operations, including:

  • type inference on unknown data
  • casting data types
  • automated data summarization
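
A minimal sketch of these operations on a pandas Series (the exact import paths vary between versions; infer_type and StandardSet are the names used in the issues further down this page):

    import pandas as pd
    from visions.functional import detect_type, infer_type, cast_to_inferred
    from visions.typesets import StandardSet

    typeset = StandardSet()
    series = pd.Series(["1", "2", "3"])

    detect_type(series, typeset)       # String: the type of the data as stored
    infer_type(series, typeset)        # Integer: the type the data can be cast to
    cast_to_inferred(series, typeset)  # the casting step: pd.Series([1, 2, 3])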

https://github.com/dylan-profiler/visions/raw/develop/docsrc/source/_static/side-by-side.png

Documentation

Full documentation can be found here.

Installation

You can install visions via pip:

pip install visions

Alternatives and more details can be found in the documentation.

Supported frameworks

These frameworks are supported out-of-the-box in addition to native Python types:

https://github.com/dylan-profiler/visions/raw/develop/docsrc/source/_static/frameworks.png

  • Numpy
  • Pandas
  • Spark

Contributing and support

Contributions to visions are welcome. For more information, please visit the Community contributions page. The GitHub issue tracker is used for reporting bugs, feature requests, and support questions.

Acknowledgements

This package is part of the dylan-profiler project. The package is a core component of pandas-profiling. More information can be found here. This work was partially supported by SIDN Fonds.

https://github.com/dylan-profiler/visions/raw/master/SIDNfonds.png

Comments
  • Numpy backend

    Summary

    This PR adds complete numpy backend support for the StandardSet of types. The type implementation is fully compatible with the pandas equivalent implementations with the exception of the object type.

    Caveats

Whereas pandas provides support for Optional[int] and Optional[bool], numpy doesn't; in order to support those types by default, I was forced to make object completely disjoint from every other concrete type. A similar story plays out for timezone-aware datetime objects, which also default to object in numpy.
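
    For illustration, the gap is visible in numpy itself (plain numpy/pandas behavior, not visions code):

        import numpy as np
        import pandas as pd

        pd.Series([1, 2, None], dtype="Int64")  # pandas: nullable integer (Optional[int])
        np.array([1, 2, None])                  # numpy: silently falls back to dtype=object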

    opened by ieaves 8
  • API and Usage

    (This is part of the review in openjournals/joss-reviews#2145)

    Hi @sbrugman,

    I am currently going through the package and I found it a very interesting project. The type inference that already exists in built-in Python is frequently not enough, and I often find myself writing my own functions for it on a case-by-case basis. So, it is nice to see that this is being done.

    However, I do find myself having quite a bit of trouble using the package effectively. Funnily enough, this is mostly caused by the (in my opinion) confusing naming schemes and the structure of the namespace. It may require a bit of effort to solve (not to mention that it might create incompatibilities with previous versions), but I believe that it will greatly improve the user experience when fixed.

So, the main problem in my opinion is that almost all definitions are stored in their own subpackages in the visions.core subpackage, often separated as well over an implementations and a model subpackage. This means that in order to access a definition, let's say VisionsTypeset, I have to import it from visions.core.model.typeset, while pre-defined typesets I have to import from visions.core.implementations.typesets. In my opinion, it is incredibly confusing that these are stored in different subpackages/submodules, as they relate to the same thing, namely typesets, and I would expect to find all of these definitions in a visions.typesets subpackage. Preferably, all definitions the average user would use should be available either at root (visions) or a single level deep (visions.xxx).

    I also noticed that almost all definitions have the word visions in their name. I get the feeling that the reason for this is to avoid namespace clashes when someone uses a wildcard import (from visions import *). However, as wildcard imports are heavily discouraged in the community, this leads to the user writing the word visions at least twice for every definition used (for example, using the visions_integer type requires me to write visions.visions_integer, which could be simplified to visions.integer or even visions.int).
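
    To make that concrete, a hypothetical before/after of the imports under discussion (the "proposed" lines are the reviewer's suggestion, not the API as released at the time):

        # v0.2.x layout criticized above
        from visions.core.model.typeset import VisionsTypeset
        from visions.core.implementations.typesets import visions_standard_set

        # the layout this review argues for
        from visions.typesets import VisionsTypeset, StandardSet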

Finally, I am not entirely sure if this has to do with the online documentation being outdated as mentioned in #21, but according to the example here, a visions_standard_set object has a method called detect_type_series. In v0.2.3, this object has neither a method called detect_type_series nor one called type_detect_series (the name that the stand-alone function has in visions.core.functional); instead it is called detect_series_type. If possible, could you check and make sure that the methods and stand-alone functions use consistent naming schemes?

    Please let me know if you have any questions.

    enhancement 
    opened by 1313e 7
  • Recommended stack overflow tag for questions

    I have a question about how to use the library. I considered opening an issue, but I see in the documentation that you recommend asking questions about how to use the package on Stack Overflow. Is there a tag that you'd suggest people use when asking questions there? I don't see anything with visions as a tag, but maybe I'm just the first person to ask a question over there.

If you think visions would be a good tag choice, it would make sense to update the Stack Overflow "ask a question" link to pre-populate the question with the tag (https://stackoverflow.com/questions/ask?tags=visions).

    Thanks!

    enhancement 
    opened by sterlinm 6
  • Automate building of documentation

The documentation should be rebuilt on every merge. The diffs caused by manually built documentation clutter code reviews. The steps to build the documentation are simple and can be automated.

    Suggested solution via Github Actions (e.g.): https://github.com/marketplace/actions/sphinx-build https://github.com/ammaraskar/sphinx-action-test/blob/master/.github/workflows/default.yml

    enhancement 
    opened by sbrugman 6
  • Please push an updated version to pypi to correct dependency on attrs not attr

Describe the bug: visions uses the @attr.s decorator, which is provided by the attrs package, not the attr package. The master version of visions has the correct dependency, but the PyPI versions do not.

Additional context: When using pandas_profiling, which depends on visions, I got the following error:

    AttributeError: module 'attr' has no attribute 's'
    

    which led me to post this issue.
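
    For reference, the decorator lives in the attr import namespace, but the PyPI distribution that provides it is named attrs; a minimal check (generic attrs usage, not visions code):

        import attr  # provided by `pip install attrs`

        @attr.s
        class Point:
            x = attr.ib()
            y = attr.ib()

        assert attr.asdict(Point(1, 2)) == {"x": 1, "y": 2}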

    bug 
    opened by proinsias 5
  • Version 0.7.5

    0.7.5 Includes:

    • Fixes to numpy backend for complex, object, email_address, URL, boolean
    • Support for new versions of the pandas ABCIndex class (previously called ABCIndexClass); see the compatibility sketch below
    • Updated tests for numpy backend
    • Automated Github Actions unit tests on PR
    • Additional documentation
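
    The ABCIndex rename is typically bridged with a shim of the following shape (a sketch of the usual pattern, not necessarily the exact code this release ships):

        try:
            from pandas.core.dtypes.generic import ABCIndex
        except ImportError:
            # pandas < 1.3 exported the same class as ABCIndexClass
            from pandas.core.dtypes.generic import ABCIndexClass as ABCIndex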
    opened by ieaves 4
  • fail to pass the test with 0.6.1 release

Describe the bug: The tests fail with the 0.6.1 release. Steps to reproduce the behavior:

    python setup.py build
    PYTHONPATH=build/lib pytest -v
    

Expected behavior: all tests pass.

    Additional context: error log:

    =================================== FAILURES ===================================
    _____________________ test_contains[file_mixed_ext x File] _____________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: file_mixed_ext, dtype: object
    type = File, member = True
    
        @pytest.mark.parametrize(**get_contains_cases(series, contains_map, typeset))
        def test_contains(series, type, member):
            """Test the generated combinations for "series in type"
        
            Args:
                series: the series to test
                type: the type to test against
                member: the result
            """
            result, message = contains(series, type, member)
    >       assert result, message
    E       AssertionError: file_mixed_ext in File; expected True, got False
    E       assert False
    
    tests/typesets/test_complete_set.py:190: AssertionError
    _______________________ test_contains[image_png x File] ________________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: image_png, dtype: object
    type = File, member = True
    
        @pytest.mark.parametrize(**get_contains_cases(series, contains_map, typeset))
        def test_contains(series, type, member):
            """Test the generated combinations for "series in type"
        
            Args:
                series: the series to test
                type: the type to test against
                member: the result
            """
            result, message = contains(series, type, member)
    >       assert result, message
    E       AssertionError: image_png in File; expected True, got False
    E       assert False
    
    tests/typesets/test_complete_set.py:190: AssertionError
    _______________________ test_contains[image_png x Image] _______________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: image_png, dtype: object
    type = Image, member = True
    
        @pytest.mark.parametrize(**get_contains_cases(series, contains_map, typeset))
        def test_contains(series, type, member):
            """Test the generated combinations for "series in type"
        
            Args:
                series: the series to test
                type: the type to test against
                member: the result
            """
            result, message = contains(series, type, member)
    >       assert result, message
    E       AssertionError: image_png in Image; expected True, got False
    E       assert False
    
    tests/typesets/test_complete_set.py:190: AssertionError
    ___________________ test_contains[image_png_missing x File] ____________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2       ...c/visions-0.6.1/build/...
    4                                                 None
    Name: image_png_missing, dtype: object
    type = File, member = True
    
        @pytest.mark.parametrize(**get_contains_cases(series, contains_map, typeset))
        def test_contains(series, type, member):
            """Test the generated combinations for "series in type"
        
            Args:
                series: the series to test
                type: the type to test against
                member: the result
            """
            result, message = contains(series, type, member)
    >       assert result, message
    E       AssertionError: image_png_missing in File; expected True, got False
    E       assert False
    
    tests/typesets/test_complete_set.py:190: AssertionError
    ___________________ test_contains[image_png_missing x Image] ___________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2       ...c/visions-0.6.1/build/...
    4                                                 None
    Name: image_png_missing, dtype: object
    type = Image, member = True
    
        @pytest.mark.parametrize(**get_contains_cases(series, contains_map, typeset))
        def test_contains(series, type, member):
            """Test the generated combinations for "series in type"
        
            Args:
                series: the series to test
                type: the type to test against
                member: the result
            """
            result, message = contains(series, type, member)
    >       assert result, message
    E       AssertionError: image_png_missing in Image; expected True, got False
    E       assert False
    
    tests/typesets/test_complete_set.py:190: AssertionError
    _____________ test_inference[file_mixed_ext x File expected True] ______________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: file_mixed_ext, dtype: object
    type = File, typeset = CompleteSet, difference = False
    
        @pytest.mark.parametrize(**get_inference_cases(series, inference_map, typeset))
        def test_inference(series, type, typeset, difference):
            """Test the generated combinations for "inference(series) == type"
        
            Args:
                series: the series to test
                type: the type to test against
            """
            result, message = infers(series, type, typeset, difference)
    >       assert result, message
    E       AssertionError: inference of file_mixed_ext expected File to be True (typeset=CompleteSet)
    E       assert False
    
    tests/typesets/test_complete_set.py:317: AssertionError
    _____________ test_inference[file_mixed_ext x Path expected False] _____________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: file_mixed_ext, dtype: object
    type = Path, typeset = CompleteSet, difference = True
    
        @pytest.mark.parametrize(**get_inference_cases(series, inference_map, typeset))
        def test_inference(series, type, typeset, difference):
            """Test the generated combinations for "inference(series) == type"
        
            Args:
                series: the series to test
                type: the type to test against
            """
            result, message = infers(series, type, typeset, difference)
    >       assert result, message
    E       AssertionError: inference of file_mixed_ext expected Path to be False (typeset=CompleteSet)
    E       assert False
    
    tests/typesets/test_complete_set.py:317: AssertionError
    _______________ test_inference[image_png x Image expected True] ________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: image_png, dtype: object
    type = Image, typeset = CompleteSet, difference = False
    
        @pytest.mark.parametrize(**get_inference_cases(series, inference_map, typeset))
        def test_inference(series, type, typeset, difference):
            """Test the generated combinations for "inference(series) == type"
        
            Args:
                series: the series to test
                type: the type to test against
            """
            result, message = infers(series, type, typeset, difference)
    >       assert result, message
    E       AssertionError: inference of image_png expected Image to be True (typeset=CompleteSet)
    E       assert False
    
    tests/typesets/test_complete_set.py:317: AssertionError
    _______________ test_inference[image_png x Path expected False] ________________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2    /build/python-visions/src/visions-0.6.1/build/...
    Name: image_png, dtype: object
    type = Path, typeset = CompleteSet, difference = True
    
        @pytest.mark.parametrize(**get_inference_cases(series, inference_map, typeset))
        def test_inference(series, type, typeset, difference):
            """Test the generated combinations for "inference(series) == type"
        
            Args:
                series: the series to test
                type: the type to test against
            """
            result, message = infers(series, type, typeset, difference)
    >       assert result, message
    E       AssertionError: inference of image_png expected Path to be False (typeset=CompleteSet)
    E       assert False
    
    tests/typesets/test_complete_set.py:317: AssertionError
    ___________ test_inference[image_png_missing x Image expected True] ____________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2       ...c/visions-0.6.1/build/...
    4                                                 None
    Name: image_png_missing, dtype: object
    type = Image, typeset = CompleteSet, difference = False
    
        @pytest.mark.parametrize(**get_inference_cases(series, inference_map, typeset))
        def test_inference(series, type, typeset, difference):
            """Test the generated combinations for "inference(series) == type"
        
            Args:
                series: the series to test
                type: the type to test against
            """
            result, message = infers(series, type, typeset, difference)
    >       assert result, message
    E       AssertionError: inference of image_png_missing expected Image to be True (typeset=CompleteSet)
    E       assert False
    
    tests/typesets/test_complete_set.py:317: AssertionError
    ___________ test_inference[image_png_missing x Path expected False] ____________
    
    series = 0    /build/python-visions/src/visions-0.6.1/build/...
    1    /build/python-visions/src/visions-0.6.1/build/...
    2       ...c/visions-0.6.1/build/...
    4                                                 None
    Name: image_png_missing, dtype: object
    type = Path, typeset = CompleteSet, difference = True
    
        @pytest.mark.parametrize(**get_inference_cases(series, inference_map, typeset))
        def test_inference(series, type, typeset, difference):
            """Test the generated combinations for "inference(series) == type"
        
            Args:
                series: the series to test
                type: the type to test against
            """
            result, message = infers(series, type, typeset, difference)
    >       assert result, message
    E       AssertionError: inference of image_png_missing expected Path to be False (typeset=CompleteSet)
    E       assert False
    
    tests/typesets/test_complete_set.py:317: AssertionError
    =============================== warnings summary ===============================
    tests/test_root.py::test_multiple_roots
      /build/python-visions/src/visions-0.6.1/build/lib/visions/typesets/typeset.py:88: UserWarning: {Generic} were isolates in the type relation map and consequently orphaned. Please add some mapping to the orphaned nodes.
        warnings.warn(message)
    
    tests/test_summarization.py::test_complex_missing_summary
      /usr/lib/python3.8/site-packages/numpy/core/_methods.py:47: ComplexWarning: Casting complex values to real discards the imaginary part
        return umr_sum(a, axis, dtype, out, keepdims, initial, where)
    
    -- Docs: https://docs.pytest.org/en/stable/warnings.html
    =========================== short test summary info ============================
    FAILED tests/typesets/test_complete_set.py::test_contains[file_mixed_ext x File]
    FAILED tests/typesets/test_complete_set.py::test_contains[image_png x File]
    FAILED tests/typesets/test_complete_set.py::test_contains[image_png x Image]
    FAILED tests/typesets/test_complete_set.py::test_contains[image_png_missing x File]
    FAILED tests/typesets/test_complete_set.py::test_contains[image_png_missing x Image]
    FAILED tests/typesets/test_complete_set.py::test_inference[file_mixed_ext x File expected True]
    FAILED tests/typesets/test_complete_set.py::test_inference[file_mixed_ext x Path expected False]
    FAILED tests/typesets/test_complete_set.py::test_inference[image_png x Image expected True]
    FAILED tests/typesets/test_complete_set.py::test_inference[image_png x Path expected False]
    FAILED tests/typesets/test_complete_set.py::test_inference[image_png_missing x Image expected True]
    FAILED tests/typesets/test_complete_set.py::test_inference[image_png_missing x Path expected False]
    ================= 11 failed, 8954 passed, 2 warnings in 15.94s =================
    

    see also the complete build log here

    bug 
    opened by hubutui 4
  • No module named 'shapely'

    (This is part of the review in openjournals/joss-reviews#2145)

When I try to execute the example given here, I get an error stating No module named 'shapely'. I see that this is a dependency of visions, but it is only listed in requirements_test.txt. You probably have to add this requirement to requirements.txt as well.

PS: Currently, the requirements of the package are listed both in their own separate files and in the setup.py file. To avoid confusion, it is probably better to use only one of the two. You can read in a requirements file and use it in setup.py like this:

    # Get the requirements list
    with open('requirements.txt', 'r') as f:
        requirements = f.read().splitlines()
    

Keep in mind that it is possible to link different requirements files together. For example, you can link requirements.txt and requirements_dev.txt by adding the line -r requirements.txt to the top of requirements_dev.txt, so that installing the requirements of requirements_dev.txt uses both files. This, however, won't work if you parse the file in setup.py; in that case, you can simply read both files and concatenate the results.
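
    A sketch of that approach (the helper name is made up for illustration; the file names follow the issue):

        def read_requirements(path):
            """Collect requirements, skipping comments and -r includes."""
            with open(path) as f:
                return [
                    line.strip()
                    for line in f
                    if line.strip() and not line.startswith(("#", "-r"))
                ]

        install_requires = read_requirements("requirements.txt")
        dev_requires = install_requires + read_requirements("requirements_dev.txt")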

    opened by 1313e 4
  • Bump pyarrow from 1.0.1 to 5.0.0

    Bumps pyarrow from 1.0.1 to 5.0.0.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • all nulls should be inferred as generic

I don't think this should be expected behavior.

    In [41]: infer_type(pd.DataFrame({'x':['', '']}), StandardSet())
    Out[41]: {'x': DateTime}
    
    In [39]: infer_type(pd.DataFrame({'x':[None, None]}), StandardSet())
    Out[39]: {'x': Boolean}
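
    A guard of roughly this shape would produce the behavior the title asks for (a sketch only; infer_type and Generic are names visions exposes elsewhere on this page, but the import paths here are assumptions):

        import pandas as pd
        from visions.functional import infer_type
        from visions.types import Generic

        def infer_type_with_null_guard(series: pd.Series, typeset):
            # All-missing / all-empty data carries no evidence for any concrete
            # type, so report Generic instead of a vacuous DateTime/Boolean hit.
            if series.replace("", pd.NA).dropna().empty:
                return Generic
            return infer_type(series, typeset)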
    
    bug 
    opened by majidaldo 3
  • comma separator

    New functionality

    • comma separator handling for string digits (see the sketch below)
    • new utility functionality for working with missing values

    Major Proposed Changes

    • Integer should be a strict subset of Float
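
    The comma-separator handling amounts to something like this before numeric casting (an illustration of the feature, not the PR's actual code):

        import pandas as pd

        s = pd.Series(["1,000", "2,500", "10"])
        s.str.replace(",", "", regex=False).astype(int)  # 1000, 2500, 10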
    opened by ieaves 3
  • Add CodeQL workflow for GitHub code scanning

    Hi dylan-profiler/visions!

    This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    Questions? Check out the FAQ below!

    FAQ

    How often will the code scanning analysis run?

    By default, code scanning will trigger a scan with the CodeQL engine on the following events:

    • On every pull request — to flag up potential security problems for you to investigate before merging a PR.
    • On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.
    • Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

    What will this cost?

    Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

    What types of problems does CodeQL find?

    The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

    How do I upgrade my CodeQL engine?

    No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

    The analysis doesn’t seem to be working

    If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

    How do I disable LGTM.com?

    If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

    Which source code hosting platforms does code scanning support?

    GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

    How do I know this PR is legitimate?

    This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

    I have another question / how do I get in touch?

    Please join the discussion here to ask further questions and send us suggestions!

    opened by lgtm-com[bot] 0
  • Sktime semantic data types for time series & vision

    I've recently been made aware of this excellent and imo much needed library by @lmmentel.

The reason it caught my eye is its similarity to the datatypes module of sktime, which introduces semantic typing for time series related data types: we distinguish "mtypes" (machine representations) and "scitypes" (scientific types, what visions calls semantic types). More details here as reference.

    Few questions for visions devs:

    • time series are known to be a notoriously splintered field in terms of data representation, and even more when it comes to learning tasks (as in your ML example). Do you see visions moving in the direction of typing for ML?
    • would you have time to look into the sktime datatypes module and assess how similar this is to visions? If similar, we might be tempted to take a dependency on visions and contribute. Key features are mtype conversions, scitype inference, checks that also return metadata (e.g., number of time stamps in a series, which can be represented 4 different ways)
    enhancement 
    opened by fkiraly 7
  • src/visions/types/url.py passes non URLs

    src/visions/types/url.py does not correctly validate URLs.

First, the example code (lines 14--19) from the docs does not return True:

    Python 3.9.4 (default, Apr  9 2021, 09:32:38)
    [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import visions
    >>> from urllib.parse import urlparse
    >>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
    >>> x = [urlparse(url) for url in urls]
    >>> x in visions.URL
    False
    >>> x
    [ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment=''), ParseResult(scheme='https', netloc='github.com', path='/pandas-profiling/pandas-profiling', params='', query='', fragment='')]
    >>> import pkg_resources
    >>> pkg_resources.get_distribution("visions").version
    '0.7.4'
    

Second, non-URLs pass:

    >>> urlparse('junk') in visions.URL
    True
    >>>
    

The code should probably check something like the following for each element of x:

        from urllib.parse import urlparse

        def is_url(value) -> bool:
            try:
                result = urlparse(value)
                return all([result.scheme, result.netloc])
            except (ValueError, AttributeError):
                return False
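
    With that helper, both cases above come out as expected (using the hypothetical is_url from the snippet):

        >>> is_url('junk')
        False
        >>> is_url('https://github.com/pandas-profiling/pandas-profiling')
        True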
    

    Finally, and this is a suggested enhancement, I think the behavior would be more useful if it handled raw strings and did the parsing internally without the caller having to supply a parser:

>>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
    >>> urls in visions.URL
    True
    
    bug 
    opened by leapingllamas 3
  • How to check if a type is/is not a parent of another type?

Following the example of "Problem type inference":

[image: type relation graph]

From one dataframe, I already made a list of types, one per column. Here is the type_list:

    [Discrete,
     Nominal,
     Discrete,
     Nominal,
     Nominal,
     Nominal,
     Nominal,
     Nominal,
     Nominal,
     Binary,
     Discrete,
     Discrete,
     Discrete,
     Nominal,
     Binary]
    

type(type_list[0]) gives visions.types.type.VisionsBaseTypeMeta

Now, I want to check whether each type has Categorical or Numeric as a parent type.

for column, t in zip(columns, type_list):
        if is_type_parent_of_categorical(t):
            category_job(dataframe[column])

    # Binary is a child of Categorical
    is_type_parent_of_categorical(type_list[14])  # -> True

    # Discrete is a child of Numeric
    is_type_parent_of_categorical(type_list[0])   # -> False
    

How should I implement is_type_parent_of_categorical?

My workaround seems to work because of string comparison:

        def is_type_parent_of_categorical(visions_type):
            type_str = str(visions_type)
            return type_str in ["Categorical", "Ordinal", "Nominal", "Binary"]
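
    Comparing type objects instead of their string forms is less brittle: the parent/child structure lives in the typeset's relation graph, so walking that graph is closer to what the question asks. A hedged sketch, assuming the typeset exposes its networkx graph as base_graph (the attribute used for plotting; the name may differ across versions):

        import networkx as nx

        def has_ancestor(typeset, vision_type, ancestor):
            # nx.ancestors returns every node with a path to vision_type,
            # i.e. its parents, grandparents, and so on
            return ancestor in nx.ancestors(typeset.base_graph, vision_type)

        # usage sketch: has_ancestor(my_typeset, type_list[14], Categorical)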
    
    enhancement 
    opened by ttpro1995 2
  • function: 'lowest' common type

Sometimes going through the whole array is not needed: you already have the types of subsets of the array and just want a data type compatible with all of them.

A common scenario when assembling horrible CSVs is that the same column is inferred as different types in different files, for example float in one and int in another (float <-- int). The worst case is to 'fall back' to string.
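
    A toy sketch of the requested helper over a linear widening chain (real typesets form a graph, so this only illustrates the idea; the type names are stand-ins):

        WIDENING = ["Boolean", "Integer", "Float", "String"]  # narrow -> wide

        def lowest_common_type(chunk_types):
            """Narrowest type that every chunk's type widens to."""
            return max(chunk_types, key=WIDENING.index)

        lowest_common_type(["Integer", "Float", "Integer"])  # 'Float'
        lowest_common_type(["Integer", "String"])            # 'String' (fallback)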

    enhancement 
    opened by majidaldo 2
Releases(v0.7.5)
  • v0.7.5(Dec 5, 2021)

  • v0.7.4(Sep 27, 2021)

  • v0.7.2(Sep 27, 2021)

  • v0.7.1(Feb 4, 2021)

  • v0.7.0(Jan 5, 2021)

  • v0.6.4(Oct 17, 2020)

    • ENH: swifter apply for pandas backend
    • FIX: fix for issue #147
    • ENH: __version__ attribute made available
    • ENH: improved typing and CI
    • ENH: contrib types/typesets for a low-threshold contribution of types

  • v0.6.1(Oct 11, 2020)

    • ENH: Expose state using typeset.detect and typeset.infer
    • ENH: plotting of typesets improved
    • FIX: fix and test cases for #136
    • CLN: pre-commit with black, isort, pyupgrade, flake8
    • ENH: type relations are now accessible by type (e.g. Float.relations[Integer])

  • v0.6.0(Sep 22, 2020)

  • v0.5.1(Sep 22, 2020)

    • Introduce stateful type inference and casting
    • Expose test utils to users and fix diagnostic information
    • Integer consistency for the standard set
    • Use pd.BooleanDtype for newer versions of pandas
    • Latest black formatting
  • v0.5.0(Aug 16, 2020)

    API breaking changes:

    • migration to single dispatch on typeset methods
    • updated API to unify detect / infer / cast against Series and DataFrames
    • improvements to boolean type
  • v0.4.6(Jul 28, 2020)

  • v0.4.5(Jul 28, 2020)

  • v0.4.4(May 11, 2020)

  • v0.4.3(May 10, 2020)

  • v0.4.2(May 10, 2020)

    Support for Files and Images, rewritten summarization functions

    • Renamed ExistingPath to File
    • Renamed ImagePath to Image
    • Version bump to 0.4.2
    • Summaries: return series instead of dict
    • Categorical: unicode counts are now based on the original character distribution; unique characters are only used as an intermediate step, for performance.
    • Categorical: aggregate functions are included for string length (min, max, mean, median).
    • Path: the number of unique values for each path part is returned.
    • Image: make Exif and hash calculations optional. Also return width, height and area.
    • File: in addition to the file_size, return the creation, modification and access times (which were already returned).