Efficient matrix representations for working with tabular data

Overview


Installation

Simply install via conda-forge!

conda install -c conda-forge tabmat

Use case

TL;DR: We provide matrix classes for efficiently building statistical algorithms with data that is partially dense, partially sparse and partially categorical.

Data used in economics, actuarial science, and many other fields is often tabular, containing rows and columns. Such data commonly has further properties:

  • It is often very sparse.
  • It often contains a mix of dense and sparse columns.
  • It often contains categorical data, processed into many columns of indicator values created by "one-hot encoding."

High-performance statistical applications often require fast computation of certain operations, such as

  • Computing sandwich products of the data, transpose(X) @ diag(d) @ X. A sandwich product shows up in the solution to weighted least squares, as well as in the Hessian of the likelihood in generalized linear models such as Poisson regression.
  • Matrix-vector products, possibly on only a subset of the rows or columns. For example, when limiting computation to an "active set" in an L1-penalized coordinate descent implementation, we may only need to compute a matrix-vector product on a small subset of the columns.
  • Computing all operations on standardized predictors which have mean zero and standard deviation one. This helps with numerical stability and optimizer efficiency in a wide range of machine learning algorithms.
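In plain NumPy terms, a sandwich product can be computed without ever materializing the diagonal matrix. This is only an illustration of the operation itself, not tabmat's optimized kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
d = rng.random(1000)  # e.g. working weights in a weighted-least-squares step

# X^T @ diag(d) @ X, computed by scaling the rows of X instead of
# building the 1000x1000 diagonal matrix:
sandwich = X.T @ (d[:, None] * X)

# The same contraction written as an einsum:
sandwich2 = np.einsum("ji,j,jk->ik", X, d, X)
assert np.allclose(sandwich, sandwich2)
```

Tabmat's specialized classes compute this same quantity while exploiting sparsity and categorical structure.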

This library and its design

We designed this library with the above use cases in mind. We built it first for estimating generalized linear models, but expect it will be useful in a variety of econometric and statistical use cases. This library was born out of our need for speed, and its unified API is motivated by the desire to work with a single matrix interface inside our statistical algorithms.

Design principles:

  • Speed and memory efficiency are paramount.
  • You don't need to sacrifice functionality by using this library: DenseMatrix and SparseMatrix subclass np.ndarray and scipy.sparse.csc_matrix respectively, and inherit behavior from those classes wherever it is not improved on.
  • As much as possible, syntax follows NumPy syntax, and dimension-reducing operations (like sum) return NumPy arrays, following NumPy conventions about the dimensions of results. The aim is to make these classes as close as possible to drop-in replacements for numpy.ndarray. This is not always possible, however, due to the differing APIs of numpy.ndarray and scipy.sparse.
  • Other operations, such as toarray, mimic Scipy sparse syntax.
  • All matrix classes support matrix-vector products, sandwich products, and getcol.

Individual subclasses may support significantly more operations.

Matrix types

  • DenseMatrix represents dense matrices, subclassing numpy.ndarray. It additionally supports the methods getcol, toarray, sandwich, standardize, and unstandardize.
  • SparseMatrix represents column-major sparse data, subclassing scipy.sparse.csc_matrix. It additionally supports methods sandwich and standardize.
  • CategoricalMatrix represents one-hot encoded categorical matrices. Because all the non-zeros in these matrices are ones and because each row has only one non-zero, the data can be represented and multiplied much more efficiently than a generic sparse matrix.
  • SplitMatrix represents matrices with a mix of dense, sparse, and categorical parts, allowing for a significant speedup in matrix multiplications.
  • StandardizedMatrix efficiently and sparsely represents a matrix that has had its columns normalized to have mean zero and variance one. Even if the underlying matrix is sparse, such a standardized matrix would be dense if materialized. By storing the scaling and shifting factors separately, StandardizedMatrix retains the sparsity of the underlying matrix.
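The identities that CategoricalMatrix and StandardizedMatrix exploit can be sketched in plain NumPy. This illustrates the underlying math only; it is not tabmat's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# CategoricalMatrix: store only the integer code of each row.
codes = rng.integers(0, k, size=n)   # category index per row
one_hot = np.eye(k)[codes]           # the dense one-hot equivalent
v = rng.standard_normal(k)

# matvec is a gather instead of an n-by-k multiply:
assert np.allclose(one_hot @ v, v[codes])

# transpose-matvec is a bincount instead of a k-by-n multiply:
w = rng.standard_normal(n)
assert np.allclose(one_hot.T @ w, np.bincount(codes, weights=w, minlength=k))

# StandardizedMatrix: represent (X - mu) / sigma via per-column
# mult and shift factors, without densifying or modifying X.
X = rng.standard_normal((n, k))
mu, sigma = X.mean(axis=0), X.std(axis=0)
mult, shift = 1.0 / sigma, -mu / sigma

u = rng.standard_normal(k)
# matvec against the standardized matrix, using only X and the factors:
result = X @ (mult * u) + np.sum(shift * u)
assert np.allclose(((X - mu) / sigma) @ u, result)
```

Because the correction term `np.sum(shift * u)` is a scalar, the standardized matvec costs one sparse matvec plus O(k) extra work.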

[Benchmark plot: performance on a wide data set]

Benchmarks

See here for detailed benchmarking.

API documentation

See here for detailed API documentation.

Comments
  • Poor performance on narrow sparse matrices.

    I've been investigating problems where our MKL-based sparse matrices are massively underperforming scipy.sparse. For example:

      operation           storage memory        time
    0    matvec  scipy.sparse csc      0  0.00211215
    1    matvec  quantcore.matrix      0   0.0266283
    

    This is a matrix with 3e6 rows and 3 columns.

    It seems like having a small number of columns makes MKL perform quite poorly. I'm not sure why that's the case. But, it may be worth having a check and just falling back to scipy.sparse in narrow cases like this. This kind of narrow case may actually be the dominant use case for sparse matrices because they will be a small component of a SplitMatrix.
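A minimal sketch of the fallback suggested here, using scipy.sparse directly. The threshold value and the `fast_kernel` hook are hypothetical, purely to illustrate the dispatch:

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical cutoff below which we skip the specialized kernel; the
# right value would have to come from benchmarking.
NARROW_THRESHOLD = 8

def matvec(X, v, fast_kernel=None):
    """Fall back to plain scipy.sparse for very narrow matrices, where a
    specialized (e.g. MKL-backed) routine may underperform."""
    if fast_kernel is None or X.shape[1] < NARROW_THRESHOLD:
        return X @ v
    return fast_kernel(X, v)

X = sps.random(3_000, 3, density=0.01, format="csc", random_state=0)
v = np.ones(X.shape[1])
out = matvec(X, v)  # takes the scipy.sparse branch: only 3 columns
assert np.allclose(out, X @ v)
```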

    help wanted 
    opened by tbenthompson 11
  • Swap n_rows with n_cols in matvec

    This might fix https://github.com/Quantco/quantcore.glm/issues/323. I think we were passing the number of rows into matvec when we meant to pass the number of columns. But maybe I'm misunderstanding what's going on.

    The function signature for matvec is

    https://github.com/Quantco/quantcore.matrix/blob/9ef54c6cb21e8d8063c0968fe47c300b79d3af4b/src/quantcore/matrix/ext/categorical.pyx#L61-L62

    but before we were passing in the number of rows as the last argument.

    opened by jtilly 6
  • Build script in PyPI source version uses default `jemalloc`

    I see the build script for linux uses jemalloc with disable-tls: "./autogen.sh --disable-cxx --with-jemalloc-prefix=local --with-install-suffix=local --disable-tls --disable-initial-exec-tls",

    However, the source distribution in PyPI doesn't run that script when installing it through pip, relying instead on whatever jemalloc it finds when it tries to compile. If, for example, one tries to install tabmat from source through pip, it will later on fail to import, complaining about an error with jemalloc:

    cannot allocate memory in static TLS block
    
    opened by david-cortes 5
  • BUG: cannot allocate memory in static TLS block when installing through pip

    The installation via conda-forge was getting stuck in the "Solving environment" part, so I tried to install with pip, given that the package is available on PyPI. pip install glum runs in seconds, but then I am unable to import stuff from it, with the following error:

    In [1]: from glum import GeneralizedLinearRegressor
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-1-0284693fe484> in <module>
    ----> 1 from glum import GeneralizedLinearRegressor
    
    ~/anaconda3/lib/python3.9/site-packages/glum/__init__.py in <module>
          1 import pkg_resources
          2 
    ----> 3 from ._distribution import TweedieDistribution
          4 from ._glm import GeneralizedLinearRegressor
          5 from ._glm_cv import GeneralizedLinearRegressorCV
    
    ~/anaconda3/lib/python3.9/site-packages/glum/_distribution.py in <module>
          6 import numpy as np
          7 from scipy import sparse, special
    ----> 8 from tabmat import MatrixBase, StandardizedMatrix
          9 
         10 from ._functions import (
    
    ~/anaconda3/lib/python3.9/site-packages/tabmat/__init__.py in <module>
    ----> 1 from .categorical_matrix import CategoricalMatrix
          2 from .constructor import from_csc, from_pandas
          3 from .dense_matrix import DenseMatrix
          4 from .matrix_base import MatrixBase
          5 from .sparse_matrix import SparseMatrix
    
    ~/anaconda3/lib/python3.9/site-packages/tabmat/categorical_matrix.py in <module>
        171 from .ext.split import sandwich_cat_cat, sandwich_cat_dense
        172 from .matrix_base import MatrixBase
    --> 173 from .sparse_matrix import SparseMatrix
        174 from .util import (
        175     check_matvec_out_shape,
    
    ~/anaconda3/lib/python3.9/site-packages/tabmat/sparse_matrix.py in <module>
          4 from scipy import sparse as sps
          5 
    ----> 6 from .ext.sparse import (
          7     csc_rmatvec,
          8     csc_rmatvec_unrestricted,
    
    ImportError: /home/mathurin/anaconda3/lib/python3.9/site-packages/tabmat/ext/../../tabmat.libs/libjemalloclocal-691a3dac.so.2: cannot allocate memory in static TLS block
    

    Googling did not help. Is there a way to make the pip-installed version work?

    opened by mathurinm 5
  • Improvements to SplitMatrix

    • Allow SplitMatrix to be constructed from another SplitMatrix.
    • Allow inputs of SplitMatrix to be 1-d
    • Implement __getitem__ for column subset
    • Also had to implement column subsetting for CategoricalMatrix
    • __repr__ uses the __repr__ method of components instead of str()

    ToDo:

    • [ ] FIX BUG WITH _split_col_subsets (first confirm that it's a bug)
    • [ ] Add testing for new features

    Checklist

    • [ ] Added a CHANGELOG.rst entry
    opened by MarcAntoineSchmidtQC 5
  • Enable dropping one column from a CategoricalMatrix?

    Currently, CategoricalMatrix does not provide an easy way to drop a column. We are required to include a category for every row in the dataset, but in an unregularized setting, it is nice to sometimes drop one column.

    Something sort of like this is already implemented by the cols parameter to the matrix vector and sandwich functions.
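For reference, a drop-one ("reference category") encoding in plain NumPy looks like the following. This is an illustration of the requested behavior, not existing tabmat API; pandas.get_dummies(drop_first=True) produces the same layout:

```python
import numpy as np

codes = np.array([0, 2, 1, 0, 2])   # category index per row, k = 3 levels
k = 3

full = np.eye(k)[codes]             # one column per category
dropped = full[:, 1:]               # drop the first level as the reference

# A row of all zeros now encodes the reference category:
assert dropped.shape == (5, 2)
assert np.array_equal(dropped[0], np.zeros(2))
```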

    question on hold 
    opened by tbenthompson 5
  • BUG: segfault when fitting a GeneralizedLinearRegressor

    Requirements: pip install libsvmdata

    The following script gives me a segfault:

    from libsvmdata import fetch_libsvm
    from glum import GeneralizedLinearRegressor
    
    X, y = fetch_libsvm("rcv1.binary")
    clf = GeneralizedLinearRegressor(alpha=0.01, fit_intercept=False,
                                     family="gaussian")
    clf.fit(X, y)
    

    Output:

    In [1]: %run glum_segfault.py
    Dataset: rcv1.binary
    [1]    271745 segmentation fault (core dumped)  ipython
    

    I'm using glum 2.0.3

    @qb3

    opened by mathurinm 4
  • Bump pypa/cibuildwheel from 2.2.2 to 2.3.0

    Bumps pypa/cibuildwheel from 2.2.2 to 2.3.0.

    Release notes

    Sourced from pypa/cibuildwheel's releases.

    v2.3.0

    • cibuildwheel now defaults to the manylinux2014 image for Linux builds, rather than manylinux2010. If you want to stick with manylinux2010, it's simple to set this using the image options. (#926)
    • You can now pass environment variables from the host machine into the Docker container during a Linux build. Check out the docs for CIBW_ENVIRONMENT_PASS_LINUX for the details. (#914)
    • Added support for building PyPy 3.8 wheels. (#881)
    • Added support for building Windows arm64 CPython wheels on a Windows arm64 runner. We can't test this in CI yet, so for now, this is experimental. (#920)
    • Improved the deployment documentation (#911)
    • Changed the escaping behaviour inside cibuildwheel's option placeholders e.g. {project} in before_build or {dest_dir} in repair_wheel_command. This allows bash syntax like ${SOME_VAR} to pass through without being interpreted as a placeholder by cibuildwheel. See this section in the docs for more info. (#889)
    • Pip updated to 21.3, meaning it now defaults to in-tree builds again. If this causes an issue with your project, setting the environment variable PIP_USE_DEPRECATED=out-of-tree-build is available as a temporary flag to restore the old behaviour. However, be aware that this flag will probably be removed soon. (#881)
    • You can now access the current Python interpreter using python3 within a build on Windows (#917)
    Changelog

    Sourced from pypa/cibuildwheel's changelog.

    v2.3.0

    26 November 2021

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 4
  • Use a namespaced version of `jemalloc`

    We are currently observing issues when using quantcore.matrix in conjunction with onnx and onnxruntime on macOS. The call to python -c 'import onnx; import quantcore.matrix.ext.dense; import onnxruntime' fails with a bus error or segfault, whereas the call DYLD_INSERT_LIBRARIES=$CONDA_PREFIX/lib/libjemalloc.dylib python -c 'import onnx; import quantcore.matrix.ext.dense; import onnxruntime' passes just fine. This indicates that using an unnamespaced jemalloc may be problematic here, as the following traceback indicates:

    collecting ... Process 6259 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x4efffffff7)
        frame #0: 0x000000013c0f3704 libjemalloc.2.dylib`je_free_default + 240
    libjemalloc.2.dylib`je_free_default:
    ->  0x13c0f3704 <+240>: str    x20, [x8, w9, sxtw #3]
        0x13c0f3708 <+244>: ldr    w8, [x19, #0x200]
        0x13c0f370c <+248>: sub    w9, w8, #0x1              ; =0x1
        0x13c0f3710 <+252>: str    w9, [x19, #0x200]
    Target 0: (python) stopped.
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x4efffffff7)
      * frame #0: 0x000000013c0f3704 libjemalloc.2.dylib`je_free_default + 240
        frame #1: 0x0000000142745010 onnxruntime_pybind11_state.so`std::__1::__hash_table<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, std::__1::__unordered_map_hasher<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_hash, pybind11::detail::type_equal_to, true>, std::__1::__unordered_map_equal<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_equal_to, pybind11::detail::type_hash, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > > > >::__rehash(unsigned long) + 76
        frame #2: 0x0000000142744dd0 onnxruntime_pybind11_state.so`std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, void*>*>, bool> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, std::__1::__unordered_map_hasher<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_hash, pybind11::detail::type_equal_to, true>, std::__1::__unordered_map_equal<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_equal_to, pybind11::detail::type_hash, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > > > >::__emplace_unique_key_args<std::__1::type_index, std::__1::piecewise_construct_t const&, std::__1::tuple<std::__1::type_index const&>, std::__1::tuple<> >(std::__1::type_index const&, std::__1::piecewise_construct_t const&, std::__1::tuple<std::__1::type_index const&>&&, std::__1::tuple<>&&) + 480
        frame #3: 0x00000001427427dc onnxruntime_pybind11_state.so`pybind11::detail::generic_type::initialize(pybind11::detail::type_record const&) + 396
        frame #4: 0x0000000142751688 onnxruntime_pybind11_state.so`pybind11::class_<onnxruntime::ExecutionOrder>::class_<>(pybind11::handle, char const*) + 140
        frame #5: 0x00000001427513f8 onnxruntime_pybind11_state.so`pybind11::enum_<onnxruntime::ExecutionOrder>::enum_<>(pybind11::handle const&, char const*) + 52
        frame #6: 0x000000014272b5c8 onnxruntime_pybind11_state.so`onnxruntime::python::addObjectMethods(pybind11::module_&, onnxruntime::Environment&) + 296
        frame #7: 0x0000000142734e68 onnxruntime_pybind11_state.so`PyInit_onnxruntime_pybind11_state + 340
        frame #8: 0x000000010019f994 python`_imp_create_dynamic + 2412
        frame #9: 0x00000001000b40f8 python`cfunction_vectorcall_FASTCALL + 208
        frame #10: 0x000000010016bfd8 python`_PyEval_EvalFrameDefault + 30088
    

    My suggestion would be to add an output to the jemalloc-feedstock as described in https://github.com/conda-forge/jemalloc-feedstock/issues/23 that comes with a prefixed version of the library.

    opened by xhochy 4
  • Bump google-github-actions/setup-gcloud from 0.2.0 to 0.2.1

    Bumps google-github-actions/setup-gcloud from 0.2.0 to 0.2.1.

    Release notes

    Sourced from google-github-actions/setup-gcloud's releases.

    setup-gcloud v0.2.1

    Bug Fixes

    Changelog

    Sourced from google-github-actions/setup-gcloud's changelog.

    0.2.1 (2021-02-12)

    Bug Fixes

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @xhochy.


    dependencies 
    opened by dependabot[bot] 4
  • Update linter

    Updating the flake8 config to match the new flake8 config from glm_benchmarks.

    Changes:

    • changed linter according to the issue
    • added simple docstrings to public functions (most functions were in the main matrix classes)
    • preceded function names with underscores if the functions were only being used internally
    • added the "no docstrings in magic function" flake8 error to the list of ignores (didn't seem helpful)
    • added # noqa in places where flake8 errors were unhelpful

    Closes #45

    Checklist

    • [ ] Added a CHANGELOG.rst entry
    opened by MargueriteBastaQC 4
  • Bump pypa/cibuildwheel from 2.11.3 to 2.11.4

    Bumps pypa/cibuildwheel from 2.11.3 to 2.11.4.

    Release notes

    Sourced from pypa/cibuildwheel's releases.

    v2.11.4

    • πŸ› Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • πŸ›  Updates CPython 3.11 to 3.11.1 (#1371)
    • πŸ›  Updates PyPy 3.7 to 3.7.10, except on macOS which remains on 7.3.9 due to a bug. (#1371)
    • πŸ“š Added a reference to abi3audit to the docs (#1347)
    Changelog

    Sourced from pypa/cibuildwheel's changelog.

    v2.11.4

    24 Dec 2022

    • πŸ› Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • πŸ›  Updates CPython 3.11 to 3.11.1 (#1371)
    • πŸ›  Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug on that platform. (#1371)
    • πŸ“š Added a reference to abi3audit to the docs (#1347)
    Commits
    • 27fc88e Bump version: v2.11.4
    • a7e9ece Merge pull request #1371 from pypa/update-dependencies-pr
    • b9a3ed8 Update cibuildwheel/resources/build-platforms.toml
    • 3dcc2ff fix: not skipping the tests stops the copy (Windows ARM) (#1377)
    • 1c9ec76 Merge pull request #1378 from pypa/henryiii-patch-3
    • 22b433d Merge pull request #1379 from pypa/pre-commit-ci-update-config
    • 98fdf8c [pre-commit.ci] pre-commit autoupdate
    • cefc5a5 Update dependencies
    • e53253d ci: move to ubuntu 20
    • e9ecc65 [pre-commit.ci] pre-commit autoupdate (#1374)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    dependencies 
    opened by dependabot[bot] 0
  • Cannot sandwich SplitMatrix with non-owned array

    This throws an error:

    import numpy as np
    import tabmat
    from scipy.sparse import csc_matrix
    
    rng = np.random.default_rng(seed=123)
    X = rng.standard_normal(size=(100,20))
    Xd = tabmat.DenseMatrix(X[:,:10])
    Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
    Xm = tabmat.SplitMatrix([Xd, Xs])
    Xm.sandwich(np.ones(X.shape[0]))
    
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    <ipython-input-2-91ba52e4f568> in <module>
          8 Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
          9 Xm = tabmat.SplitMatrix([Xd, Xs])
    ---> 10 Xm.sandwich(np.ones(X.shape[0]))
    
    ~/anaconda3/envs/py39/lib/python3.9/site-packages/tabmat/split_matrix.py in sandwich(self, d, rows, cols)
        287             idx_i = subset_cols_indices[i]
        288             mat_i = self.matrices[i]
    --> 289             res = mat_i.sandwich(d, rows, subset_cols[i])
        290             if isinstance(res, sps.dia_matrix):
        291                 out[(idx_i, idx_i)] += np.squeeze(res.data)
    
    ~/anaconda3/envs/py39/lib/python3.9/site-packages/tabmat/dense_matrix.py in sandwich(self, d, rows, cols)
         62         d = np.asarray(d)
         63         rows, cols = setup_restrictions(self.shape, rows, cols)
    ---> 64         return dense_sandwich(self, d, rows, cols)
         65 
         66     def _cross_sandwich(
    
    src/tabmat/ext/dense.pyx in tabmat.ext.dense.dense_sandwich()
    
    Exception: 
    

    Compare against this:

    Xd = tabmat.DenseMatrix(X[:,:10].copy())
    Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
    Xm = tabmat.SplitMatrix([Xd, Xs])
    Xm.sandwich(np.ones(X.shape[0]))
    

    (No error)
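The difference between the failing and working snippets appears to be memory layout: X[:, :10] is a non-contiguous view into X, while .copy() yields an owned, C-contiguous array. A tabmat-independent NumPy check of that difference:

```python
import numpy as np

rng = np.random.default_rng(123)
X = rng.standard_normal((100, 20))

view = X[:, :10]           # column slice of a larger array: not contiguous
owned = X[:, :10].copy()   # fresh allocation: C-contiguous and owned

assert not view.flags["C_CONTIGUOUS"]
assert view.base is X       # it still points into X's buffer
assert owned.flags["C_CONTIGUOUS"]
assert owned.base is None

# Workaround: pass np.ascontiguousarray(...) to code that assumes
# contiguous input; it copies only when needed.
fixed = np.ascontiguousarray(view)
assert fixed.flags["C_CONTIGUOUS"]
```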

    opened by david-cortes 0
  • tabmat has no attribute __version__

    I find it convenient to be able to check directly inside a python shell.

    In [1]: import tabmat
    
    In [2]: tabmat.__version__
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-2-aae82a909ca3> in <module>
    ----> 1 tabmat.__version__
    
    AttributeError: module 'tabmat' has no attribute '__version__'
    

    this is tabmat 3.0.7 installed via PyPI

    opened by mathurinm 3
  • Support initializing matrices with Patsy?

    I think we've discussed this, but I don't remember the conclusion and can't find an issue now.

    We recommend from_pandas as the way "most users" should construct tabmat objects. from_pandas then guesses which columns should be treated as categorical. I think it would be really nice to have Patsy-like formulas as an alternative, since

    1. R users (including many economists) like using formulas, and
    2. It's easy to infer from a Patsy formula which columns are categorical, which are sparse (generally interactions with categoricals), and which are dense (everything else), so this could remove some of the guesswork from tabmat and improve performance.

    I'm not sure how feasible this would be, since Patsy is a sizable library that allows for fairly sophisticated formulas and it would be quite an endeavor to replicate all of the functionality. A few ways of doing this would be

    1. Don't change any code, but document how Patsy can already be used to construct a dataframe that can then be passed to tabmat / glum. Warn that this involves creating a large dense matrix as an intermediate. See Twitter discussion: https://twitter.com/esantorella22/status/1447980727820296198
    2. Have tabmat call patsy.dmatrix with "return_type = 'dataframe'", then call tabmat.from_pandas on the resulting pd.DataFrame. That would not be any more efficient than (1), but would just save the user a little typing and the need to install patsy. On the down side, it adds a dependency and may force creation of a very large dense matrix.
    3. Support very simple patsy-like formulas without having patsy as a dependency or reproducing its full functionality. That would allow the user to designate which columns should be treated as categorical in a more natural way. See Twitter discussion: https://twitter.com/esantorella22/status/1447981081358184461
    4. Make it so that any Patsy formula can be used to create a tabmat object -- I'm not sure how. Might be hard.
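As a sketch of option 1 (no changes to tabmat), pandas.get_dummies stands in here for the DataFrame that a call like patsy.dmatrix(..., return_type="dataframe") would produce; the resulting frame could then be handed to tabmat.from_pandas, with the dense intermediate noted above:

```python
import pandas as pd

df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "group": pd.Categorical(["a", "b", "a", "b"]),
})

# Build the dense design matrix as a DataFrame; this approximates what a
# Patsy formula like "x + group" would expand to.
design = pd.get_dummies(df[["x", "group"]])

# `design` could now be passed to tabmat.from_pandas(...), at the cost of
# materializing the dense indicator columns first.
assert set(design.columns) == {"x", "group_a", "group_b"}
```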
    opened by esantorella 2
Releases (latest: 3.1.2)
  • 3.1.2(Jul 1, 2022)

  • 3.1.1(Jul 1, 2022)

    3.1.1 - 2022-07-01

    Other changes:

    • Add Python 3.10 support to CI (remove Python 3.6).
    • We are now building the wheel for PyPI without -march=native to make it more portable across architectures.
    Source code(tar.gz)
    Source code(zip)
  • 3.1.0(Mar 7, 2022)

  • 3.0.8(Jan 3, 2022)

  • 3.0.7(Nov 23, 2021)

  • 3.0.6(Nov 12, 2021)

    Bug fix

    • We fixed a bug in SplitMatrix.matvec, where incorrect matrix vector products were computed when a SplitMatrix did not contain any dense components.
    Source code(tar.gz)
    Source code(zip)
  • 3.0.5(Nov 5, 2021)

    Other changes

    • We are now specifying the run time dependencies in setup.py, so that missing dependencies are automatically installed from PyPI when installing tabmat via pip.
    Source code(tar.gz)
    Source code(zip)
  • 3.0.4(Nov 3, 2021)

  • 3.0.3(Oct 15, 2021)

  • 3.0.2(Oct 15, 2021)

  • 3.0.1(Oct 8, 2021)

    3.0.1 - 2021-10-07

    Bug fix

    • The license was mistakenly left as proprietary. Corrected to BSD-3-Clause.

    Other changes

    • ReadTheDocs integration.
    • CONTRIBUTING.md
    • Correct pyproject.toml to work with PEP-517
  • 3.0.0 (Oct 7, 2021)

    3.0.0 - 2021-10-07

    It's public! Yay!

    Breaking changes:

    • The package has been renamed to tabmat. CELEBRATE!
    • The one_over_var_inf_to_val function has been made private.
    • The csc_to_split function has been renamed to tabmat.from_csc to match the tabmat.from_pandas function.
    • The tabmat.MatrixBase.get_col_means and tabmat.MatrixBase.get_col_stds methods have been made private.
    • The cross_sandwich method has also been made private.

    Bug fixes:

    • StandardizedMatrix.transpose_matvec was giving the wrong answer when the out parameter was provided. This is now fixed.
    • SplitMatrix.__repr__ now calls the __repr__ method of component matrices instead of __str__.
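    The first fix above concerns the out argument of transpose_matvec. A common convention for such arguments, shown here as a generic NumPy sketch (an illustration, not tabmat's code), is that the result is accumulated into out rather than overwriting it; a bug in that bookkeeping yields answers like the one this release fixed.

    ```python
    # Accumulate-into-out convention for a transpose matrix-vector product.
    import numpy as np

    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    v = np.array([1.0, 1.0])

    out = np.ones(2)
    out += X.T @ v   # accumulate, don't overwrite
    # X.T @ v == [4., 6.], so out == [5., 7.]
    ```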

    Other changes:

    • Optimized tabmat.SparseMatrix.matvec and tabmat.SparseMatrix.transpose_matvec for the case when rows and cols are None.
    • Implemented CategoricalMatrix.__rmul__.
    • Reorganized the documentation and updated the text to match the current API.
    • Enabled indexing the rows of a CategoricalMatrix; previously CategoricalMatrix.__getitem__ only supported column indexing.
    • Allowed creating a SplitMatrix from a list of any MatrixBase objects, including another SplitMatrix.
    • Reduced memory usage in tabmat.SplitMatrix.matvec.
  • 2.0.3 (Jul 15, 2021)

    2.0.3 - 2021-07-15

    Bug fix:

    • In SplitMatrix.sandwich, when a col subset was specified, incorrect output was produced if the components of the indices array were not sorted. SplitMatrix.__init__ now checks for sorted indices and maintains sorted index lists when combining matrices.
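    For reference, the column-subset sandwich product at issue can be written directly in NumPy (a dense sketch, not tabmat's blocked implementation; data here is synthetic):

    ```python
    # Column-subset sandwich product: X[:, cols].T @ diag(d) @ X[:, cols].
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 4))
    d = rng.random(5)
    cols = np.array([0, 2])  # the fix keeps such index lists sorted

    sub = X[:, cols]
    sandwich = sub.T @ (d[:, None] * sub)

    # Same result as subsetting the full sandwich product:
    assert np.allclose(sandwich, (X.T @ np.diag(d) @ X)[np.ix_(cols, cols)])
    ```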

    Other changes:

    • SplitMatrix.__init__ now filters out any empty matrices.
    • StandardizedMatrix.sandwich passes rows=None and cols=None onwards to the underlying matrix instead of replacing them with full arrays of indices. This should improve performance slightly.
    • SplitMatrix.__repr__ now includes the type of the underlying matrix objects in the string output.
  • 2.0.2 (Jun 24, 2021)

  • 2.0.1 (Jun 20, 2021)

  • 2.0.0 (Jun 17, 2021)

    2.0.0 - 2021-06-17

    Breaking changes:

    We renamed several public functions to make them private. These include functions in quantcore.matrix.benchmark that are unlikely to be used outside of this package, as well as:

    • quantcore.matrix.dense_matrix._matvec_helper
    • quantcore.matrix.sparse_matrix._matvec_helper
    • quantcore.matrix.split_matrix._prepare_out_array

    Other changes:

    • We removed the dependency on sparse_dot_mkl. We now use scipy.sparse.csr_matvec instead of sparse_dot_mkl.dot_product_mkl on all platforms, because the latter suffered from poor performance, especially on narrow problems. This also means that we removed the function quantcore.matrix.sparse_matrix._dot_product_maybe_mkl.
    • We updated the pre-commit hooks and made sure the code is in line with the new hooks.
  • 1.0.6 (Apr 26, 2021)

  • 1.0.5 (Apr 26, 2021)

  • 1.0.3 (Apr 22, 2021)

    Bug fixes:

    • Added a check in SplitMatrix.__init__ that matrices are two-dimensional.
    • Replaced np.int with np.int64 where appropriate, due to NumPy's deprecation of np.int.
  • 1.0.2 (Apr 20, 2021)

  • 1.0.1 (Nov 25, 2020)

    Bug fixes:

    • Added handling for nulls when setting up a CategoricalMatrix.
    • Fixed several functions to work with row and column restrictions as well as the out parameter.

    Other changes:

    • Added various tests and documentation improvements
  • 1.0.0 (Nov 11, 2020)

    Breaking change:

    • Renamed dot to matvec. Our dot function supported matrix-vector multiplication for every subclass, but matrix-matrix multiplication only for some. We therefore renamed it to matvec, in line with other libraries.

    Bug fix:

    • Fixed a bug in matvec for categorical components when the number of categories exceeds the number of rows.
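    The failing case is easy to picture with the implicit one-hot matrix: matvec is just a per-row lookup of the vector at each row's category code, so the output length is the number of rows even when there are more categories than rows. A NumPy sketch of that invariant (an illustration, not tabmat's code):

    ```python
    # Two rows but five categories: the matvec result still has length 2.
    import numpy as np

    codes = np.array([0, 3])       # category code of each row
    n_categories = 5
    v = np.arange(n_categories, dtype=float)

    result = v[codes]              # matvec of the implicit one-hot matrix

    # Same as materializing the one-hot matrix and multiplying:
    assert np.allclose(result, np.eye(n_categories)[codes] @ v)
    ```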