Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

Overview

scikit-fda: Functional Data Analysis in Python

scikit-fda: Functional Data Analysis in Python

python build status Documentation Status Codecov PyPIBadge license doi

Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

This package offers classes, methods and functions to give support to FDA in Python. Includes a wide range of utils to work with functional data, and its representation, exploratory analysis, or preprocessing, among other tasks such as inference, classification, regression or clustering of functional data. See documentation for further information on the features included in the package.

Documentation

The documentation is available at fda.readthedocs.io/en/stable/, which includes detailed information of the different modules, classes and methods of the package, along with several examples showing different functionalities.

The documentation of the latest version, corresponding with the develop version of the package, can be found at fda.readthedocs.io/en/latest/.

Installation

Currently, scikit-fda is available in Python 3.6 and 3.7, regardless of the platform. The stable version can be installed via PyPI:

pip install scikit-fda

Installation from source

It is possible to install the latest version of the package, available in the develop branch, by cloning this repository and doing a manual installation.

git clone https://github.com/GAA-UAM/scikit-fda.git
pip install ./scikit-fda

Make sure that your default Python version is currently supported, or change the python and pip commands by specifying a version, such as python3.6:

git clone https://github.com/GAA-UAM/scikit-fda.git
python3.6 -m pip install ./scikit-fda

Requirements

scikit-fda depends on the following packages:

The dependencies are automatically installed.

Contributions

All contributions are welcome. You can help this project grow in multiple ways, from creating an issue, reporting an improvement or a bug, to doing a repository fork and creating a pull request to the development branch.

The people involved at some point in the development of the package can be found in the contributors file.

License

The package is licensed under the BSD 3-Clause License. A copy of the license can be found along with the code.

Comments
  • Feature/improve visualization

    Feature/improve visualization

    The plotting functions have been changed to not use global state (using pyplot) whenever possible. Instead, the plotting functions in skfda create and return a new figure, if none is passed.

    Pyplot is still used for Sphinx-gallery, as it has to scrap the produced images.

    All the examples have been changed in order to use the new API and the object oriented functionalities of Matplotlib.

    opened by vnmabus 17
  • Retrieve coefficients for function reconstruction

    Retrieve coefficients for function reconstruction

    Hey there!

    I fitted a KNN FDataGrid to my input data. It actually looks pretty good so far and I now would like to "export" it so I can represent the function as numerical values (preferable in a numpy array).

    I saw that you offer some basis that can be used to "export" the underlying representation. Could you elaborrate on what basis should be used when? My data represents a demand/supply curve. I tried the BSpline one but it only constructs something close to a sine wave which doesn't really represent my data.

    Here is an image of the graph itself: grafik

    Is there some way to get the raw representation instead of transforming it to another basis?

    opened by mainrs 16
  • add inverse_transform #297 method to FPCA

    add inverse_transform #297 method to FPCA

    Reference issue: https://github.com/GAA-UAM/scikit-fda/issues/297#issue-775429060

    What does it address ? The inverse_transform method projects the functional principal score (coefficient value w.r.t to eigenfunctions basis) of a (or set of) sample back to the input space. In other words FPCA.inverse_transform(FPCA.transform()) should approximate the identity map.

    Empirical tests: I've only tested the inverse_transform method on synthetic datasets (both FDataGrid and FDataBasis) and assessed the results in terms of 'identity regression' i.e. with QQ-plots on the distribution of the residuals where residuals = X_input.value - X_recovered.value, where:

    • X_recovered is computed as FPCA.inverse_transform(FPCA.transform(X_input)),
    • value is .coefficients or .data_matrix attribute, depending on the input data format.

    X_input is generated according to:

    cov = Exponential(length_scale=0.5)
    n_grid = 100
    n_samples = 50
    
    fd = make_gaussian_process(
        start=0,
        stop=4,
        n_samples=n_samples,
        n_features=n_grid,
        mean=lambda t: np.power(t, 2) + 5,
        cov=cov,
    )
    

    Herebelow are the resulting QQ-plots (computed with scipy.stats.probplot) when `X_recovered' is computed with 3 principal components:

    • When `fd' is let in FDataGrid format: slope ~0.675, intercept ~ -3.e-17 and an R^2 ~ 0.9994. Note that the residuals are computed in terms of funtional data values. qqplot-fdatagrid

    • When `fd' is mapped to FDataBasis format with a 4-order Bspline basis with cardinality 50: -slope ~ 1.08, intercept -5.e-17 and R^2 ~0.9996. Note that the residuals are computed in terms of the coefficients in the (here Bspline) basis. qqplot-fdatabasis

    I can enhance the PR and provide further emprical results, just tell me :)

    opened by Clej 12
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    Scikit-fda is successfully installed, but when I try to import it, I receive the following error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-17-97c44e0d3210> in <module>
    ----> 1 import skfda
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/__init__.py in <module>
         35 from .representation._functional_data import concatenate
         36 
    ---> 37 from . import representation, datasets, preprocessing, exploratory, misc, ml, \
         38     inference
         39 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/__init__.py in <module>
          5                              fetch_weather, fetch_aemet,
          6                              fetch_octane, fetch_gait)
    ----> 7 from ._samples_generators import (make_gaussian,
          8                                   make_gaussian_process,
          9                                   make_sinusoidal_process,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py in <module>
          7 from .. import FDataGrid
          8 from .._utils import _cartesian_product
    ----> 9 from ..misc import covariances
         10 from ..preprocessing.registration import normalize_warping
         11 from ..representation.interpolation import SplineInterpolation
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/__init__.py in <module>
    ----> 1 from . import covariances, kernels, metrics
          2 from . import operators
          3 from . import regularization
          4 from ._math import (log, log2, log10, exp, sqrt, cumsum,
          5                     inner_product, inner_product_matrix)
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/covariances.py in <module>
          7 import sklearn.gaussian_process.kernels as sklearn_kern
          8 
    ----> 9 from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
         10 
         11 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/__init__.py in <module>
          1 from . import depth
    ----> 2 from . import outliers
          3 from . import stats
          4 from . import visualization
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/__init__.py in <module>
          4 )
          5 from ._iqr import IQROutlierDetector
    ----> 6 from .neighbors_outlier import LocalOutlierFactor
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/neighbors_outlier.py in <module>
          2 from sklearn.base import OutlierMixin
          3 
    ----> 4 from ...misc.metrics import lp_distance
          5 from ...ml._neighbors_base import (
          6     KNeighborsMixin,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/metrics.py in <module>
          6 
          7 from .._utils import _pairwise_commutative
    ----> 8 from ..preprocessing.registration import normalize_warping, ElasticRegistration
          9 from ..preprocessing.registration._warping import _normalize_scale
         10 from ..preprocessing.registration.elastic import SRSF
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/__init__.py in <module>
    ----> 1 from . import registration
          2 from . import smoothing
          3 from . import dim_reduction
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/__init__.py in <module>
         14 from ._warping import invert_warping, normalize_warping
         15 
    ---> 16 from .elastic import ElasticRegistration
         17 
         18 from . import validation, elastic
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in <module>
          1 
    ----> 2 from fdasrsf.utility_functions import optimum_reparam
          3 import scipy.integrate
          4 from sklearn.base import BaseEstimator, TransformerMixin
          5 from sklearn.utils.validation import check_is_fitted
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/__init__.py in <module>
         20 del sys
         21 
    ---> 22 from .time_warping import fdawarp, align_fPCA, align_fPLS, pairwise_align_bayes
         23 from .plot_style import f_plot, rstyle, plot_curve, plot_reg_open_curve, plot_geod_open_curve, plot_geod_close_curve
         24 from .utility_functions import smooth_data, optimum_reparam, f_to_srsf, gradient_spline, elastic_distance, invertGamma, srsf_to_f
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/time_warping.py in <module>
          7 import numpy as np
          8 import matplotlib.pyplot as plt
    ----> 9 import fdasrsf.utility_functions as uf
         10 import fdasrsf.fPCA as fpca
         11 import fdasrsf.geometry as geo
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/utility_functions.py in <module>
         17 from joblib import Parallel, delayed
         18 import numpy.random as rn
    ---> 19 import optimum_reparamN2 as orN2
         20 import optimum_reparam_N as orN
         21 import cbayesian as bay
    
    src/optimum_reparamN2.pyx in init optimum_reparamN2()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    

    I'm using the standard format

    import skfda
    

    I don't know why this is happening, any help is greatly appreciated

    Version information

    • OS: MacOS
    • Python version: 3.7.9
    • scikit-fda version: 0.5
    • Version of other packages involved: [numpy: 1.19.2, scipy: 1.6.0, matplotlib: 3.3.4 , conda: 4.9.2 ]
    bug 
    opened by adeyemi-lawal 10
  • ModuleNotFoundError: No module named 'optimum_reparam'

    ModuleNotFoundError: No module named 'optimum_reparam'

    Hi, I just forked/cloned the repository and I found a module which is not listed in the dependencies. It's called "optimum_reparam" and it's imported right here. Maybe somewhere else. I failed finding the module myself, could someone provide details about this module in the README requirements section, please?

    Thanks for this initiative!

    opened by alejandro-ariza 8
  • Problem compiling binaries in macOS

    Problem compiling binaries in macOS

    Describe the bug Problem compiling C code from fdasrsf:

    error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure

    In the fdasrsf package it is fixed the variable with os.environ['MACOSX_DEPLOYMENT_TARGET'] = '10.14' in the setup.py. After cloning the repository and remove the line I get the same error. I tried to export the environment variable before running the installation with no success.

    Complete trace:

    Building wheel for fdasrsf (PEP 517) ... error ERROR: Command errored out with exit status 1: command: /Users/pablomm/scikit_fda_test2/venv/bin/python /Users/pablomm/scikit_fda_test2/venv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/tmpo2_rkkrs cwd: /private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-install-ym8nls3t/fdasrsf_a923c378ec5a4ec689f1d7459d35c8c7 Complete output (38 lines): generating build/_DP.c (already up-to-date) running bdist_wheel running build running build_py creating build/lib.macosx-10.15-x86_64-3.8 creating build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/plot_style.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/curve_stats.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/curve_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/geodesic.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/utility_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/tolerance.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/pcr_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/init.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/fPCA.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/umap_metric.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/time_warping.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/geometry.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/curve_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/rbfgs.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/fPLS.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf copying fdasrsf/boxplots.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf running build_ext cythoning src/optimum_reparamN2.pyx to src/optimum_reparamN2.c cythoning src/fpls_warp.pyx to src/fpls_warp.c cythoning src/mlogit_warp.pyx to src/mlogit_warp.c cythoning src/ocmlogit_warp.pyx to src/ocmlogit_warp.c cythoning src/oclogit_warp.pyx to src/oclogit_warp.c cythoning src/optimum_reparam_N.pyx to src/optimum_reparam_N.c cythoning src/cbayesian.pyx to src/cbayesian.cpp building 'optimum_reparamN2' extension creating build/temp.macosx-10.15-x86_64-3.8 creating build/temp.macosx-10.15-x86_64-3.8/src clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -I/private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-build-env-qnbgsf0y/overlay/lib/python3.8/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/opt/tcl-tk/include -I/Users/pablomm/scikit_fda_test2/venv/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/include/python3.8 -c src/optimum_reparamN2.c -o build/temp.macosx-10.15-x86_64-3.8/src/optimum_reparamN2.o not modified: 'build/_DP.c' error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure


    ERROR: Failed building wheel for fdasrsf Building wheel for findiff (setup.py) ... done Created wheel for findiff: filename=findiff-0.8.9-py3-none-any.whl size=29218 sha256=c2bc96e93c195fb5c2a1acbf932b5311e8e94e571edf74929eca6e019c66532d Stored in directory: /Users/pablomm/Library/Caches/pip/wheels/df/48/68/71cc95b16d5f7c5115a009f92f9a5a3896fb2ece31228b0aa5 Successfully built findiff Failed to build fdasrsf ERROR: Could not build wheels for fdasrsf which use PEP 517 and cannot be installed directly

    To Reproduce

    virtualenv venv
    source venv/bin/activate
    python --version # 3.8.8 and also tried 3.9.2
    pip install scikit-fda
    

    Also, it is happening when the package is installed from code with python setup.py install

    Version information

    • OS: macOS catalina 10.15.7
    • Python version: 3.9.2 and 3.8.8
    • scikit-fda version: stable master (0.5) and develop
    • gcc and g++ version: 12.0.0
    bug 
    opened by pablomm 7
  • Importing from skfda import FDataGrid gives you ValueError

    Importing from skfda import FDataGrid gives you ValueError

    Describe the bug When importing FDataGrid from skfda it will launch the following error: ` ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    `To Reproduce We are installing scikit-fda==0.3 with the following docker image:

    FROM python:3.7-slim-buster
    
    # ------------------------------------------------------------------------------------Basic Linux Packages------------------------------------------------------------------------------------
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
        ca-certificates \
        cmake \
        build-essential \
        gcc \
        g++ \
        git \
        wget \
        curl \
        libffi-dev \
        python-dev \
        unixodbc-dev \
        && rm -rf /var/lib/apt/lists/*
    

    Expected behavior To be able to import the library Version information

    • OS: Linux
    • Python version: python:3.7-slim-buster
    • scikit-fda version: 0.3
    • Version of other packages involved [e.g. numpy, scipy, matplotlib, ... ]: numpy==1.16.2 , scipy==1.3.1, matplotlib==3.1.1

    Additional context Add any other context about the problem here. Since 01/02/2021 we have the issue of not being able to import the library. Before of that date we could import it without problem with the same requirements and environment

    Error Thread:

    src/hvl/utils/model_utils.py:12: in <module>
        from skfda import FDataGrid
    /usr/local/lib/python3.7/site-packages/skfda/__init__.py:36: in <module>
        from . import representation, datasets, preprocessing, exploratory, misc, ml
    /bin/bash failed with return code: 1
    /usr/local/lib/python3.7/site-packages/skfda/datasets/__init__.py:6: in <module>
    return code: 1
        from ._samples_generators import (make_gaussian_process,
    /usr/local/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py:8: in <module>
        from ..misc import covariances
    /usr/local/lib/python3.7/site-packages/skfda/misc/__init__.py:2: in <module>
        from . import covariances, kernels, metrics
    /usr/local/lib/python3.7/site-packages/skfda/misc/covariances.py:9: in <module>
        from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/__init__.py:4: in <module>
        from . import visualization
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/__init__.py:1: in <module>
        from . import clustering, representation
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/clustering.py:11: in <module>
        from ...ml.clustering.base_kmeans import FuzzyKMeans
    /usr/local/lib/python3.7/site-packages/skfda/ml/__init__.py:2: in <module>
        from . import classification, clustering, regression
    /usr/local/lib/python3.7/site-packages/skfda/ml/classification/__init__.py:3: in <module>
        from ..._neighbors import (KNeighborsClassifier, RadiusNeighborsClassifier,
    /usr/local/lib/python3.7/site-packages/skfda/_neighbors/__init__.py:11: in <module>
        from .unsupervised import NearestNeighbors
    >   ???
    E   ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    
    deps/fdasrsf/optimum_reparam.pyx:1: ValueError
    
    bug 
    opened by JavierHernandezMontes 6
  • Feature/local outlier factor

    Feature/local outlier factor

    • Created LocalOutlierFactor (which wraps scikit-learn multivariate version)
    • Example in gallery of detection of outliers
    • New real dataset employed in the example (fetch_octane)
    • Test and Doctests added
    enhancement pending theory 
    opened by pablomm 6
  • Error when calling L2Regularization

    Error when calling L2Regularization

    I have tried different python versions, but calling:

    regularization=L2Regularization(LinearDifferentialOperator())

    As in the docs, results in the following error:

    TypeError: init() takes 1 positional argument but 2 were given

    opened by dcbrien 5
  • Feature/neighbors

    Feature/neighbors

    Added the following neighbors estimators:

    • NearestNeighbors
    • KNeighborsClassifier
    • RadiusNeighborsClassifier
    • NearestCentroids
    • KNeighborsScalarRegressor
    • RadiusNeighborsScalarRegressor
    • KNeighborsFunctionalRegressor
    • RadiusNeighborsFunctionalRegressor

    I wrote some examples of KNeighborsClassifier, RadiusNeighborsClassifier and KNeighborsScalarRegressor, I will write other for the regressors with functional response and the nearest centroids, but in another PR.

    Also, I had to modify the lp_distance lo accept the infinity case and the mean to accept a list of weights.

    There are some little things which can be improved after merge #101.

    opened by pablomm 5
  • Transformer to reduce the image dimension

    Transformer to reduce the image dimension

    Created a transformer which receives a multivariate function from R^n->R^d and applies a vectorial norm to reduce the image dimension to 1.

    There are two version of the transformer, a procedural one and a sklearn style transformer.

    I put the functions in skfda.preprocessing.dim_reduction, but maybe there is a better place.

    opened by pablomm 5
  • Dose BasisSmoother rely on smoothing_parameter values?

    Dose BasisSmoother rely on smoothing_parameter values?

    Describe the bug It seems to me change smoothing_parameter doesn't effect result of BasisSmoother, particularly the score

    To Reproduce I'm using something similar to the one used in the docs

    t = np.linspace(0, 1, 5)
    x = np.sin(2 * np.pi * t) + np.cos(2 * np.pi * t) + 2
    fd = skfda.FDataGrid(data_matrix=x, grid_points=t)
    basis = skfda.representation.basis.FourierBasis((0, 1), n_basis=3)
    smoother = skfda.preprocessing.smoothing.BasisSmoother(basis,smoothing_parameter=100)
    fd_smooth = smoother.fit_transform(fd)
    fd_smooth.data_matrix.round(2)
    smoother.score(fd,fd)
    

    Expected behavior changing smoothing_parameter doesn't seem to affect the score. I'm having the updated github version

    Version information

    • OS: [e.g. MacOs 13.0]
    • Python version: [e.g. 3.9.15]
    • scikit-fda version: [e.g. latest]
    • Version of other packages involved [e.g. numpy 1.23.4]

    Additional context

    bug 
    opened by mikeaalv 2
  • Allow BasisSmoother to accept irregularly sampled data

    Allow BasisSmoother to accept irregularly sampled data

    Alternatively, I have implemented a separate SparseBasisSmoother class which encapsulates the functionality of BasisSmoother in order to apply it to irregularly sampled data, in the format given in Issue #325

    opened by opintosant 1
  • Data structure for discretized sparse data

    Data structure for discretized sparse data

    Is your feature request related to a problem? Please describe. We need a FData subclass capable of storing discretized functions where:

    • The points in which the functions are measured are NOT in a grid (specially relevant for high dimensional functions such as surfaces).
    • The points in which each function is measured are different, and probably a different number of points per observation.
    • There are typically few points measured per observation (sparse instead of dense).

    This structure is necessary for efficient storage and computation with this kind of data.

    Describe the solution you'd like The proposal for the implementation is to have three arrays:

    • A first array containing the (concatenated) input points for the observations. For example, for three functional observations $f_1$, $f_2$, $f_3$, evaluated at $f_1(1, 2) = (3, 4, 5), f_1(2, 4) = (1, 7, 3), f_2(0, 0) = (1, 0, 1), f_3(0, 1) = (2, 2, 2), f_3(3, 5) = (1, 0, 0)$, this array would contain [(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)].
    • A second array with the corresponding values. In the previous example it would be [(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)].
    • A third array with the indexes where the points of each observation start. In this case [0, 2, 3].

    This approach has a compact representation and also allows for vectorization to be applied. The indexes can be used to apply the reduceat method of NumPy ufuncs.

    Describe alternatives you've considered A more high-level API should be also exposed and used when possible.

    enhancement 
    opened by vnmabus 1
  • Fix regularized FPCA with FDataGrid

    Fix regularized FPCA with FDataGrid

    The issue

    When using FPCA with regularization of FDataGrid objects, the obtained components are orthogonal wrt to the usual inner product, which is not the one that should be used in the regularization. As specified in Ramsay, J. O. and Silverman, B. W.. Functional Data Analysis (page 178), the components should not be orthogonal, but they should fulfill

    $$ \int \xi_j(s) \xi_k(s) ds + \int D^2 \xi_j(s) D^2 \xi_k(s) ds = 0, \quad (j\ne k) $$

    where $\xi_i$ are the loadings and $D^2$ is the second order diferential operator.

    Steps to reproduce

    The following code shows the problem:

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    n_components = 4
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=20,
        ),
    )
    
    fpca_regression.fit(X, y)
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Grid regularized:\n", matrix)
    
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=0.25,
        ),
    )
    fpca_regression.fit(X, y)
    
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Basis regularized:\n", matrix)
    

    Output:

    Grid regularized:
     [[ 1.00000000e+00  6.96057795e-16  9.19403442e-17 -2.91433544e-16]
     [ 6.71554826e-16  1.00000000e+00  1.51354623e-16 -2.83627288e-16]
     [ 6.93889390e-18  1.55257751e-16  1.00000000e+00  1.49619900e-16]
     [-2.43294968e-16 -2.56305394e-16  1.47451495e-16  1.00000000e+00]]
    Basis regularized:
     [[ 0.99393024 -0.0180874  -0.01324601  0.0030897 ]
     [-0.0180874   0.79784464 -0.06858017  0.07861877]
     [-0.01324601 -0.06858017  0.74805033  0.09757001]
     [ 0.0030897   0.07861877  0.09757001  0.70994851]]
    

    In the grid regularized case, the output is the identity matrix, even though the regularization coefficient has been set to a very high value (20). In contrast, the basis regularized case outputs a matrix very different from the identity.

    Current implementation

    The relevant extract of the code is the following:

    basis = FDataGrid(
        data_matrix=np.identity(n_points_discretization),
        grid_points=X.grid_points,
    )
    regularization_matrix = compute_penalty_matrix(
        basis_iterable=(basis,),
        regularization_parameter=1,
        regularization=self.regularization,
    )
    
    basis_matrix = basis.data_matrix[..., 0]
    if regularization_matrix is not None:
        basis_matrix += regularization_matrix
    
    fd_data = np.linalg.solve(
        basis_matrix.T,
        fd_data.T,
    ).T
    
    # see docstring for more information
    final_matrix = fd_data @ np.sqrt(weights_matrix)
    
    pca = PCA(n_components=self.n_components)
    pca.fit(final_matrix)
    self.components_ = X.copy(
        data_matrix=np.transpose(
            np.linalg.solve(
                np.sqrt(weights_matrix),
                np.transpose(pca.components_),
            ),
        ),
        sample_names=(None,) * self.n_components,
    )
    

    We can see that the code "matches" the TFG corresponding to this functionality. In this TFG, it is stated that the problem is equivalent to solving the multivariate PCA on the following matrix:

    $$ \mathbf{V}=\frac{\mathbf{X}\mathbf{W}^{1/2}}{\sqrt{N}(\mathbf{I}_M + \mathbf{P}_k)} $$

    This formula has been extracted directly from the TFG, where the matrix operation is expressed as a fraction. Note, however that the correct equation would have the inverse of the denominator multiplying the nominator from the right side. In the TFG, this equation is justified by stating that maximizing the penalized variance is equivalent to the following eigenvalue problem:

    $$ \frac{\mathbf{W}^{1/2} (\mathbf{X}^\intercal \mathbf{X})\mathbf{W}^{1/2} v_j(t)} {N (\mathbf{I}_M + \mathbf{P}_k)^\intercal(\mathbf{I}_M + \mathbf{P}_k)} = \lambda_j v_j(t) $$

    Then, the previous equation can be seen as the usual eigenvalue problem for the covariance matrix of the data, but with a modified covariance matrix. However, even though the previous equation is similar to $\mathbf{V}^\intercal \mathbf{V}$, there are issues with this notation and analysis.

    fda.usc implementation

    The implementation in fda.usc is the following:

     if (lambda > 0) {
        if (is.vector(P)) {
          P <- P.penalty(tt, P)
        }
        dimp <- dim(P)
        if (!(dimp[1] == dimp[2] & dimp[1] == J)) {
          stop("Incorrect matrix dimension P")
        }
        M <- solve(diag(J) + lambda * P)
        Xcen.fdata$data <- Xcen.fdata$data %*% t(M)
      }
    

    where

    • diag(J) is the identity matrix of size J
    • P is the penalty matrix

    Therefore, the implementation in fda.usc is equivalent to the current implementation and also returns the identity matrix as the inner product matrix. The issue (moviedo5/fda.usc/#6) has been opened in the fda.usc repository.

    Proposed solution

    Mathematical analysis

    Non-regularized FPCA summary

    In the usual PCA, the goal is to maximize the variance of the projections upon the principal components found subject to two restrictions. That is to say, maximize

    $$ \frac{1}{N} \mathbf{\phi^\intercal X^\intercal X\phi} $$

    subject to:

    • $\mathbf{\phi^\intercal \phi}=1$
    • The loadings are orthogonal $\mathbf{\phi_i^\intercal \phi_j}=0$ if $i\neq j$

    Regularized FPCA

    The regularized version of FPCA aims to maximize the penalized sample variance. If we want to penalized the second derivative, we can define it as:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) }, $$

    where $p(\phi,2)$ is the penalty function for the second derivative applied to $\phi$. The restrictions now are:

    • $|\phi|^2=1$
    • $\int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j= 0$

    For simplicity, we will ignore the quadrature for now and assume $\mathbf{W=Id}$ (the integration weights). Then, we can make the following substitutions:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\xi,2) } \approx \frac{ \frac 1N \boldsymbol \phi^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \boldsymbol \phi + \lambda \boldsymbol \phi^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi } = \frac{ \frac 1N \boldsymbol {\phi}^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi }, $$

    where $\mathbf P_2$ is the penalty matrix for the second derivative, $\boldsymbol \phi$ is the vector of $\phi$ evaluated in the grid points and $\mathbf X$ is the usual data matrix. That is to say, the functions evaluated in the grid points.

    We can also obtain:

    $$ 0=\int \phi_i \phi_j + \int D^2\phi_i D^2\phi_j= \boldsymbol \phi_i^\intercal \boldsymbol \phi_j + \lambda \boldsymbol \phi_i ^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi_j = \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi $$

    In light of these relationships, we can see that we can transform the problem back to the usual PCA problem using the Cholesky decomposition $\mathbf{L} \mathbf{L}^\intercal = \mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2$. Then, we can apply the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$. Using the Cholesky decomposition, we can obtain the following equation.

    $$ psv(\phi) = \frac{ \frac 1N \boldsymbol \psi^\intercal \mathbf{L}^{-1} \mathbf X^\intercal \mathbf X \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }{ \boldsymbol \psi^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }= \frac{ \frac 1N \boldsymbol \psi ^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big)^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big) \boldsymbol \psi } { \boldsymbol \psi^\intercal \boldsymbol \psi } $$

    Then the constraints become:

    $$ 1 = |\boldsymbol \phi|^2 = |(\mathbf L^\intercal)^{-1} \boldsymbol \psi|^2 = \boldsymbol \psi^\intercal \mathbf L^{-1} (\mathbf L^{-1})^\intercal \boldsymbol \psi $$

    $$ 0 = \boldsymbol \psi_i^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi_j = \boldsymbol \psi_i^\intercal \boldsymbol \psi_j $$

    Threfore, we can see that the problem is equivalent to the usual PCA problem with the following changes:

    • The data matrix $\mathbf X$ is replaced by $\mathbf X \mathbf {(L^\intercal)}^{-1}$.
    • Use the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$ so that loadings obtained are orthogonal with respect to the new inner product given by $\mathbf{L}^{-1} (\mathbf L^{-1})^\intercal$.

    Note that after the change of variable, the loadings will no longer have norm 1 with respect to the original inner product. However, this can be easily fixed by dividing normailizing them at the end.

    Progress so far

    This Choesky decomposition approach has been implemented in the attached pull request. Note however that the code has not been optimized yet (still using np.inv instead of np.linalg.solve for example), nor has it been tested extensively.

    Regularization parameter effect

    The main issue remaining is that the regularization parameter $\lambda$ does not seem to have the same effect in grid implementation as in the basis implementation. The reason may be related to the penalty matrix. To reproduce the issue, run the following code:

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    
    def fit_plot(ax, title, regularization_parameter=0):
        fpca_regression = FPCA(
            n_components=4,
            regularization=L2Regularization(
                LinearDifferentialOperator(1),
                regularization_parameter=regularization_parameter,
            ),
        )
        fpca_regression.fit(X, y)
        fpca_regression.components_.plot(axes=ax)
        ax.set_title(title)
        ax.set_xlabel("")
        ax.set_ylabel("")
    
    
    fig, ax = plt.subplots(3, 2, figsize=(10, 10))
    
    fit_plot(ax[0, 0], "Grid not regularized")
    fit_plot(ax[0, 1], "Grid regularized", regularization_parameter=2.5)
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fit_plot(ax[1, 0], "Basis not regularized")
    fit_plot(ax[1, 1], "Basis regularized", regularization_parameter=0.125)
    
    # Convert back to grid using the same points as before
    X = X.to_grid(grid_points)
    
    fit_plot(ax[2, 0], "Grid from basis not regularized")
    fit_plot(ax[2, 1], "Grid from basis regularized", regularization_parameter=2.5)
    
    plt.show()
    

    image

    As it can be seen, the regularization parameter has less of an effect in the basis implementation than in the grid implementation. However, by adjusting the parameters, the results do seem to match almost perfectly.

    Next steps

    • [ ] Fix the regularization parameter effect issue.
    • [ ] Either change the tests so that they do not compare against fda.usc in this particular case or wait for them to respond to the issue (moviedo5/fda.usc/#6). Taking advantage of the migration from unittest to pytest, it would be possible to use pytest mechanisms to prevent these tests from being executed while documenting the reason why they should not be executed.
    • [ ] Optimize and revise the implementation in #485
    • [ ] Investigate the fact that the results in FDataBasis might not have norm 1.
    opened by Ddelval 0
Releases(0.7.1)
  • 0.7.1(Jan 25, 2022)

  • 0.7(Jan 25, 2022)

    • TikhonovRegularization renamed to L2Regularization. The old name is deprecated. See #409 for details.
    • Add the option for MagnitudeShapePlot to plot the inlier ellipse.
    • Additional bug fixes.
    Source code(tar.gz)
    Source code(zip)
  • 0.6.1(Dec 23, 2021)

  • 0.6(Sep 12, 2021)

    Some highlights of this version:

    • Add typing support for most functionalities of the package. Now you can use Mypy to check for errors.
    • New depth-based classifiers
    • New visualization functions
      • Parametric plot
      • Outliergram
      • Multiple interactive plots
    • New historical functional regression model
    • Other minor additions and fixes
    Source code(tar.gz)
    Source code(zip)
  • 0.5(Dec 31, 2020)

    Some highlights of the new functionality:

    • Depth functions refactored
    • Removed C code and added fdasrsf as a dependency
    • Added new variable selection methods
    • Added Maximum Depth classifier
    • Optimized FDataGrid inner product
    Source code(tar.gz)
    Source code(zip)
  • 0.4(Jul 23, 2020)

    Adds functional component analysis, ANOVA and Hotelling tests, basis representation for multivariate and vector valued functions and Tikhonov regularization.

    Source code(tar.gz)
    Source code(zip)
Owner
Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid
Machine Learning Group at Universidad Autónoma de Madrid
Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022
Extract Thailand COVID-19 Cluster data from daily briefing pdf.

Thailand COVID-19 Cluster Data Extraction About Extract Clusters from Thailand Daily COVID-19 briefing PDF Download latest data Here. Data will be upd

Noppakorn Jiravaranun 5 Sep 27, 2021
Analyzing Covid-19 Outbreaks in Ontario

My group and I took Covid-19 outbreak statistics from ontario, and analyzed them to find different patterns and future predictions for the virus

Vishwaajeeth Kamalakkannan 0 Jan 20, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
DataPrep — The easiest way to prepare data in Python

DataPrep — The easiest way to prepare data in Python

SFU Database Group 1.5k Dec 27, 2022
ASTR 302: Python for Astronomy (Winter '22)

ASTR 302, Winter 2022, University of Washington: Python for Astronomy Mario Jurić Location When: 2:30-3:50, Monday & Wednesday, Winter quarter 2022 Wh

UW ASTR 302: Python for Astronomy 4 Jan 12, 2022
A Python adaption of Augur to prioritize cell types in perturbation analysis.

A Python adaption of Augur to prioritize cell types in perturbation analysis.

Theis Lab 2 Mar 29, 2022
Tools for analyzing data collected with a custom unity-based VR for insects.

unityvr Tools for analyzing data collected with a custom unity-based VR for insects. Organization: The unityvr package contains the following submodul

Hannah Haberkern 1 Dec 14, 2022
PySpark Structured Streaming ROS Kafka ApacheSpark Cassandra

PySpark-Structured-Streaming-ROS-Kafka-ApacheSpark-Cassandra The purpose of this project is to demonstrate a structured streaming pipeline with Apache

Zekeriyya Demirci 5 Nov 13, 2022
Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

Gábor Vecsei 12 Aug 30, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. 😃 Motiv

Souvik Pratiher 31 Dec 16, 2022
Functional tensors for probabilistic programming

Funsor Funsor is a tensor-like library for functions and distributions. See Functional tensors for probabilistic programming for a system description.

208 Dec 29, 2022
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
.npy, .npz, .mtx converter.

npy-converter Matrix Data Converter. Expand matrix for multi-thread, multi-process Divid matrix for multi-thread, multi-process Support: .mtx, .npy, .

taka 1 Feb 07, 2022
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022
This is a repo documenting the best practices in PySpark.

Spark-Syntax This is a public repo documenting all of the "best practices" of writing PySpark code from what I have learnt from working with PySpark f

Eric Xiao 447 Dec 25, 2022
Convert monolithic Jupyter notebooks into Ploomber pipelines.

Soorgeon Join our community | Newsletter | Contact us | Blog | Website | YouTube Convert monolithic Jupyter notebooks into Ploomber pipelines. soorgeo

Ploomber 65 Dec 16, 2022
Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

PyMC 7.2k Dec 30, 2022