Streaming Anomaly Detection Framework in Python (Outlier Detection for Streaming Data)

Last update: Dec 18, 2022

Overview

Python Streaming Anomaly Detection (PySAD)

PySAD is an open-source Python framework for anomaly detection on streaming multivariate data.

Features

Online Anomaly Detection

PySAD provides methods for online/sequential anomaly detection, i.e., anomaly detection on streaming data, where the model updates itself as each new instance arrives.
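
As a rough illustration of the online setting, the sketch below scores each value against what the model has seen so far and then updates the model with it. It is a self-contained toy, not PySAD's implementation: the class `RunningZScoreDetector` and its `score`/`update` methods are hypothetical names, and the running mean and variance are kept with Welford's algorithm.

```python
import math


class RunningZScoreDetector:
    """Toy online detector: constant memory, one update per instance."""

    def __init__(self):
        self.n = 0        # number of instances seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def score(self, x):
        """Return the absolute z-score of x under the statistics seen so far."""
        if self.n < 2:
            return 0.0    # not enough history to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Welford's update of the running mean and variance."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


detector = RunningZScoreDetector()
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1]
scores = []
for x in stream:
    scores.append(detector.score(x))  # score with the model as it is now
    detector.update(x)                # then let the model absorb x
```

Scoring before updating is a choice made for this sketch, so that an outlier cannot dilute its own score.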

Resource-Efficient

Streaming methods work within the limited memory and processing time that data streams allow, so they can be used in near real time. They store only a single instance or a small window of recent instances.
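
The bounded-memory idea can be sketched with a fixed-size window. This is an illustrative example using only the standard library; `WindowedMedianDetector` and its parameters are hypothetical names, not part of PySAD. A `collections.deque` with `maxlen` discards the oldest instance automatically, so memory stays constant however long the stream runs.

```python
from collections import deque
from statistics import median


class WindowedMedianDetector:
    """Scores a value by its distance from the median of recent values."""

    def __init__(self, window_size=5):
        # deque(maxlen=...) drops the oldest item when a new one is appended.
        self.window = deque(maxlen=window_size)

    def fit_score(self, x):
        # Score against the current window, then add x to it.
        score = abs(x - median(self.window)) if self.window else 0.0
        self.window.append(x)
        return score


detector = WindowedMedianDetector(window_size=5)
scores = [detector.fit_score(x) for x in range(1000)]
stored = len(detector.window)  # never exceeds window_size
```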

Streaming Anomaly Detection Tools

PySAD contains stream simulators, evaluators, preprocessors, statistic trackers, postprocessors, probability calibrators and more. In addition to streaming models, PySAD also provides integrations for the batch anomaly detectors of PyOD so that they can be used in the streaming setting.
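
To make two of these tool types concrete, here is a minimal standard-library sketch of a stream simulator feeding a probability calibrator. It is an assumption-laden illustration rather than PySAD's code: `simulate_stream` and `GaussianTailCalibrator` are hypothetical names, the calibrator simply fits a Gaussian to a window of recent scores and returns the Gaussian CDF at the new score, and the 0.95 alert threshold is arbitrary.

```python
import math
from collections import deque


def simulate_stream(values):
    """Yield instances one at a time, as a stream simulator would."""
    for value in values:
        yield value


class GaussianTailCalibrator:
    """Maps a raw score to a Gaussian CDF value fitted on recent scores."""

    def __init__(self, window_size=100):
        self.scores = deque(maxlen=window_size)

    def fit_transform(self, score):
        self.scores.append(score)
        n = len(self.scores)
        mean = sum(self.scores) / n
        var = sum((s - mean) ** 2 for s in self.scores) / n
        if var == 0:
            return 0.5  # no spread yet, so the score is uninformative
        z = (score - mean) / math.sqrt(var)
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Gaussian CDF


calibrator = GaussianTailCalibrator(window_size=100)
raw_scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.95]
probs = [calibrator.fit_transform(s) for s in simulate_stream(raw_scores)]
alerts = [i for i, p in enumerate(probs) if p > 0.95]
```

Calibrated values lie in [0, 1], which makes a single threshold meaningful across models whose raw scores live on very different scales.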

Comprehensiveness

PySAD offers models designed specifically for univariate data as well as models for multivariate data. Furthermore, one can experiment via PySAD in supervised, semi-supervised and unsupervised settings.

User Friendly

Users with any experience level can easily use PySAD. One can easily design experiments and combine the tools in the framework. Moreover, the existing methods in PySAD are easy to extend.

Free and Open Source Software (FOSS)

PySAD is distributed under BSD License 2.0 and favors FOSS principles.

Installation

The PySAD framework can be installed via:

pip install -U pysad

Alternatively, you can install the library directly from the source code in the GitHub repository:

git clone https://github.com/selimfirat/pysad.git
cd pysad
pip install .

Required Dependencies:

numpy>=1.18.5
scipy>=1.4.1
scikit-learn>=0.23.2
pyod>=0.7.7.1

Optional Dependencies:

rrcf==0.4.3 (Only required for pysad.models.robust_random_cut_forest.RobustRandomCutForest)
PyNomaly==0.3.3 (Only required for pysad.models.loop.StreamLocalOutlierProbability)
mmh3==2.5.1 (Only required for pysad.models.xstream.xStream)
pandas==1.1.0 (Only required for pysad.utils.pandas_streamer.PandasStreamer)
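
One way to see which optional packages are present before picking a model is to probe for them with `importlib`. The helper below is a hypothetical convenience, not something PySAD ships, and it assumes each package's import name matches the name in the list above.

```python
from importlib.util import find_spec

# Import names of the optional packages listed above (assumed to match the
# distribution names), mapped to the PySAD component that needs each one.
OPTIONAL_DEPENDENCIES = {
    "rrcf": "pysad.models.robust_random_cut_forest.RobustRandomCutForest",
    "PyNomaly": "pysad.models.loop.StreamLocalOutlierProbability",
    "mmh3": "pysad.models.xstream.xStream",
    "pandas": "pysad.utils.pandas_streamer.PandasStreamer",
}


def missing_optional_dependencies():
    """Return {package: component} for optional packages that are not installed."""
    return {
        package: component
        for package, component in OPTIONAL_DEPENDENCIES.items()
        if find_spec(package) is None
    }


missing = missing_optional_dependencies()
for package, component in missing.items():
    print(f"{package} is not installed; {component} will be unavailable.")
```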

Versioning

Semantic versioning is used for this project.

License

This project is licensed under the BSD License 2.0.

Citing PySAD

If you use PySAD for a scientific publication, we would appreciate citations to the following paper:

@article{pysad,
  title={PySAD: A Streaming Anomaly Detection Framework in Python},
  author={Yilmaz, Selim F and Kozat, Suleyman S},
  journal={arXiv preprint arXiv:2009.02572},
  year={2020}
}

Comments

Your docs favicon makes me think a Colab notebook stopped with an error

When I'm reading your documentation, the favicon you have looks almost identical to the Colab favicon when it stopped execution because of an error. I can't possibly be the only person that has been fooled by this.

opened by FuriouStyles 0

There is a problem in the method fit_partial in reference_window_model.py

If initial_window_X is not provided, training of the model stops when the size of cur_window_X reaches window_size - 1 and resumes only when the size of cur_window_X is divisible by sliding_size. This problem occurs mainly when window_size and sliding_size have different parity.

opened by eljabrichaymae 0

How can I access the training data that has been used?

Hello everyone,

Once a model such as LocalOutlierProbability has been trained, how can I access the training data that was used?

I have managed to access the first dataset, the one used when initialising the model, via LocalOutlierProbability.model.data, but I need the new batch of training data that is generated after calling fit_partial(X).

Thanks in advance!

opened by joaquinCaceres 0

Only xStream could detect anomalous cases in the example

Hi, I tried different models based on example_usage.py, but only xStream could detect anomalous cases; the other models either fail to run or do not predict any anomalous cases. Here is the code:

# Import modules.
from sklearn.utils import shuffle
from pysad.evaluation import AUROCMetric
from pysad.models import xStream, ExactStorm, HalfSpaceTrees, IForestASD, KitNet, KNNCAD, LODA, LocalOutlierProbability, MedianAbsoluteDeviation, RelativeEntropy, RobustRandomCutForest, RSHash
from pysad.utils import ArrayStreamer
from pysad.transform.postprocessing import RunningAveragePostprocessor
from pysad.transform.preprocessing import InstanceUnitNormScaler
from pysad.transform.probability_calibration import ConformalProbabilityCalibrator, GaussianTailProbabilityCalibrator
from pysad.utils import Data
from tqdm import tqdm
import numpy as np
from pdb import set_trace

# This example demonstrates the usage of most modules in the PySAD framework.
if __name__ == "__main__":
    np.random.seed(61)  # Fix random seed.

    # Get data to stream.
    data = Data("data")
    X_all, y_all = data.get_data("arrhythmia.mat")
    X_all, y_all = shuffle(X_all, y_all)

    iterator = ArrayStreamer(shuffle=False)  # Init streamer to simulate streaming data.
    # set_trace()
    model = xStream()  # Init xStream anomaly detection model.
    # model = ExactStorm(window_size=25)
    # model = HalfSpaceTrees(feature_mins=np.zeros(X_all.shape[1]), feature_maxes=np.ones(X_all.shape[1]))
    # model = IForestASD()
    # model = KitNet(grace_feature_mapping =100, max_size_ae=100)
    # model = KNNCAD(probationary_period=10)
    # model = LODA()
    # model = LocalOutlierProbability()
    # model = MedianAbsoluteDeviation()
    # model = RelativeEntropy(min_val=0, max_val=1)
    # model = RobustRandomCutForest(num_trees=200)
    # model = RSHash(feature_mins=0, feature_maxes=1)
    
    preprocessor = InstanceUnitNormScaler()  # Init normalizer.
    postprocessor = RunningAveragePostprocessor(window_size=5)  # Init running average postprocessor.
    auroc = AUROCMetric()  # Init area under the receiver operating characteristic curve (AUROC) metric.

    calibrator = GaussianTailProbabilityCalibrator(window_size=100)  # Init probability calibrator.
    idx = 0
    for X, y in tqdm(iterator.iter(X_all[100:], y_all[100:])):  # Stream data.
        X = preprocessor.fit_transform_partial(X)  # Fit preprocessor to and transform the instance.

        score = model.fit_score_partial(X)  # Fit model to and score the instance.        
        score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.
        
        # print(score)
        auroc.update(y, score)  # Update AUROC metric.
        try:
            # set_trace()
            calibrated_score = calibrator.fit_transform(score)  # Fit & calibrate score.
        except:           
            calibrated_score = 0
            # set_trace()
        # set_trace()
        # Output if the instance is anomalous.
        if calibrated_score > 0.95:  # If probability of being normal is less than 5%.
            print(f"Alert: {idx}th data point is anomalous.")
            
        idx += 1

    # Output the resulting AUROC metric.
    # print("AUROC: ", auroc.get())

Does anyone know how to fix this problem? Thank you very much.

opened by dangmanhtruong1995 0

KitNet + RunningAveragePostprocessor producing nan scores

It seems that when I use KitNet with a RunningAveragePostprocessor, I get nan scores from the RunningAveragePostprocessor.

If I do this:

# Import modules.
from sklearn.utils import shuffle
from pysad.evaluation import AUROCMetric
from pysad.models import xStream, RobustRandomCutForest, KNNCAD, ExactStorm, HalfSpaceTrees, IForestASD, KitNet
from pysad.utils import ArrayStreamer
from pysad.transform.postprocessing import RunningAveragePostprocessor
from pysad.transform.preprocessing import InstanceUnitNormScaler
from pysad.utils import Data
from tqdm import tqdm
import numpy as np

# This example demonstrates the usage of most modules in the PySAD framework.
if __name__ == "__main__":
    np.random.seed(61)  # Fix random seed.

    n_initial = 100

    # Get data to stream.
    data = Data("data")
    X_all, y_all = data.get_data("arrhythmia.mat")
    #X_all, y_all = shuffle(X_all, y_all)
    X_initial, y_initial = X_all[:n_initial], y_all[:n_initial]
    X_stream, y_stream = X_all[n_initial:], y_all[n_initial:]

    iterator = ArrayStreamer(shuffle=False)  # Init streamer to simulate streaming data.

    model = KitNet(max_size_ae=10, grace_feature_mapping=100, grace_anomaly_detector=100, learning_rate=0.1, hidden_ratio=0.75)
    preprocessor = InstanceUnitNormScaler()  # Init normalizer.
    postprocessor = RunningAveragePostprocessor(window_size=5)  # Init running average postprocessor.
    auroc = AUROCMetric()  # Init area under the receiver operating characteristic curve (AUROC) metric.

    for X, y in tqdm(iterator.iter(X_stream, y_stream)):  # Stream data.
        X = preprocessor.fit_transform_partial(X)  # Fit preprocessor to and transform the instance.

        score = model.fit_score_partial(X)  # Fit model to and score the instance.
        print(score)
        #score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.
        #print(score)

        auroc.update(y, score)  # Update AUROC metric.

    # Output the resulting AUROC metric.
    print("\nAUROC: ", auroc.get())

I see output that looks generally OK, but it seems a nan got in and breaks things when it comes to the AUC:

/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.utils.testing module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.
  warnings.warn(message, FutureWarning)
0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/pysad/models/kitnet_model/dA.py:119: RuntimeWarning: invalid value encountered in true_divide
  x = (x - self.norm_min) / (self.norm_max - self.norm_min + 0.0000000000000001)
101it [00:00, 948.75it/s]Feature-Mapper: train-mode, Anomaly-Detector: off-mode
0.0
...
0.0
The Feature-Mapper found a mapping: 274 features to 136 autoencoders.
Feature-Mapper: execute-mode, Anomaly-Detector: train-mode
nan
176861904806278.84
1.2789157528725288
0.04468589042395759
0.1220238749287982
0.059888825651861544
0.09122945608076023
...
0.1389761646050123
/usr/local/lib/python3.6/dist-packages/pysad/models/kitnet_model/utils.py:14: RuntimeWarning: overflow encountered in exp
  return 1. / (1 + numpy.exp(-x))
220it [00:03, 54.62it/s]0.12782183995180338
49677121607436.65
136071359600522.08
0.10972949863882411
...
0.1299215446450402
0.1567376498625513
0.1494816850581486
352it [00:05, 69.36it/s]
0.1402801274133297
0.18201141940107077
52873910494109.26
0.13997148683334693
0.13615269873450922
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-8af057e15ede> in <module>()
     47 
     48     # Output resulting AUROCS metric.
---> 49     print("\nAUROC: ", auroc.get())

6 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     97                     msg_err.format
     98                     (type_err,
---> 99                      msg_dtype if msg_dtype is not None else X.dtype)
    100             )
    101     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I think the issue is the nan right after the lines "The Feature-Mapper found a mapping: 274 features to 136 autoencoders." and "Feature-Mapper: execute-mode, Anomaly-Detector: train-mode".

This might be OK, but if I then use it with a RunningAveragePostprocessor, the nan seems to break the running average, so it is all just nans:

# Import modules.
from sklearn.utils import shuffle
from pysad.evaluation import AUROCMetric
from pysad.models import xStream, RobustRandomCutForest, KNNCAD, ExactStorm, HalfSpaceTrees, IForestASD, KitNet
from pysad.utils import ArrayStreamer
from pysad.transform.postprocessing import RunningAveragePostprocessor
from pysad.transform.preprocessing import InstanceUnitNormScaler
from pysad.utils import Data
from tqdm import tqdm
import numpy as np

# This example demonstrates the usage of most modules in the PySAD framework.
if __name__ == "__main__":
    np.random.seed(61)  # Fix random seed.

    n_initial = 100

    # Get data to stream.
    data = Data("data")
    X_all, y_all = data.get_data("arrhythmia.mat")
    #X_all, y_all = shuffle(X_all, y_all)
    X_initial, y_initial = X_all[:n_initial], y_all[:n_initial]
    X_stream, y_stream = X_all[n_initial:], y_all[n_initial:]

    iterator = ArrayStreamer(shuffle=False)  # Init streamer to simulate streaming data.

    model = KitNet(max_size_ae=10, grace_feature_mapping=100, grace_anomaly_detector=100, learning_rate=0.1, hidden_ratio=0.75)
    preprocessor = InstanceUnitNormScaler()  # Init normalizer.
    postprocessor = RunningAveragePostprocessor(window_size=5)  # Init running average postprocessor.
    auroc = AUROCMetric()  # Init area under the receiver operating characteristic curve (AUROC) metric.

    for X, y in tqdm(iterator.iter(X_stream, y_stream)):  # Stream data.
        X = preprocessor.fit_transform_partial(X)  # Fit preprocessor to and transform the instance.

        score = model.fit_score_partial(X)  # Fit model to and score the instance.
        #print(score)
        score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.
        print(score)

        auroc.update(y, score)  # Update AUROC metric.

    # Output the resulting AUROC metric.
    print("\nAUROC: ", auroc.get())

So the output, with the nan being propagated, is:

/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.utils.testing module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.
  warnings.warn(message, FutureWarning)
0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/pysad/models/kitnet_model/dA.py:119: RuntimeWarning: invalid value encountered in true_divide
  x = (x - self.norm_min) / (self.norm_max - self.norm_min + 0.0000000000000001)
101it [00:00, 881.82it/s]Feature-Mapper: train-mode, Anomaly-Detector: off-mode
0.0
0.0
0.0
...
0.0
The Feature-Mapper found a mapping: 274 features to 136 autoencoders.
Feature-Mapper: execute-mode, Anomaly-Detector: train-mode
nan
nan
nan
nan
185it [00:02, 46.04it/s]nan
nan
nan
193it [00:02, 42.56it/s]nan
nan
nan
200it [00:02, 41.06it/s]nan
nan
nan
nan
Feature-Mapper: execute-mode, Anomaly-Detector: exeute-mode
nan
nan
206it [00:02, 45.11it/s]/usr/local/lib/python3.6/dist-packages/pysad/models/kitnet_model/utils.py:14: RuntimeWarning: overflow encountered in exp
  return 1. / (1 + numpy.exp(-x))
213it [00:02, 49.93it/s]nan
nan
nan
nan
nan
nan
...
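
The propagation can be reproduced in isolation. The sketch below is a guess at the mechanism, not PySAD's actual code: it assumes the running average is maintained incrementally as a running sum (add the new score, subtract the one leaving the window), in which case a single nan poisons the sum permanently. `RunningAverage` and `skip_non_finite` are hypothetical names, and dropping non-finite scores before they reach the averager is only one possible workaround.

```python
import math
from collections import deque


class RunningAverage:
    """Windowed mean kept as an incrementally updated running sum."""

    def __init__(self, window_size=5, skip_non_finite=False):
        self.window = deque()
        self.window_size = window_size
        self.total = 0.0
        self.skip_non_finite = skip_non_finite

    def fit_transform_partial(self, score):
        if self.skip_non_finite and not math.isfinite(score):
            # Workaround: ignore nan/inf and report the current average.
            return self.total / len(self.window) if self.window else 0.0
        self.window.append(score)
        self.total += score
        if len(self.window) > self.window_size:
            self.total -= self.window.popleft()  # nan - nan is still nan
        return self.total / len(self.window)


scores = [0.0, 0.0, float("nan"), 1.0, 0.5, 0.1, 0.2, 0.1, 0.3, 0.2]
naive_avg = RunningAverage(window_size=5)
guarded_avg = RunningAverage(window_size=5, skip_non_finite=True)
naive = [naive_avg.fit_transform_partial(s) for s in scores]
guarded = [guarded_avg.fit_transform_partial(s) for s in scores]
```

With the naive averager every value from the first nan onward is nan, even after the nan has left the window, while the guarded one stays finite throughout.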

opened by andrewm4894 2

KNNCAD with low probationary_period fails

I think I found an issue that occurs if you set the probationary_period for KNNCAD too low.

This was tripping me up a little, so I thought it worth raising here. I'm not quite sure what the solution would be; maybe some sort of reasonable default for probationary_period in KNNCAD could help others avoid this in future.

Or maybe it's just fine and people should not set such a low probationary_period, but it was one of the first things I did, so others might too :)

Reproducible example:

# Import modules.
from sklearn.utils import shuffle
from pysad.evaluation import AUROCMetric
from pysad.models import xStream, RobustRandomCutForest, KNNCAD
from pysad.utils import ArrayStreamer
from pysad.transform.postprocessing import RunningAveragePostprocessor
from pysad.transform.preprocessing import InstanceUnitNormScaler
from pysad.utils import Data
from tqdm import tqdm
import numpy as np

# This example demonstrates the usage of most modules in the PySAD framework.
if __name__ == "__main__":
    np.random.seed(61)  # Fix random seed.

    # Get data to stream.
    data = Data("data")
    X_all, y_all = data.get_data("arrhythmia.mat")
    X_all, y_all = shuffle(X_all, y_all)

    iterator = ArrayStreamer(shuffle=False)  # Init streamer to simulate streaming data.

    model = KNNCAD(probationary_period=10)
    #model = RobustRandomCutForest()
    #model = xStream()  # Init xStream anomaly detection model.
    preprocessor = InstanceUnitNormScaler()  # Init normalizer.
    postprocessor = RunningAveragePostprocessor(window_size=5)  # Init running average postprocessor.
    auroc = AUROCMetric()  # Init area under the receiver operating characteristic curve (AUROC) metric.

    for X, y in tqdm(iterator.iter(X_all[100:], y_all[100:])):  # Stream data.
        X = preprocessor.fit_transform_partial(X)  # Fit preprocessor to and transform the instance.

        score = model.fit_score_partial(X)  # Fit model to and score the instance.
        score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.

        auroc.update(y, score)  # Update AUROC metric.

    # Output the resulting AUROC metric.
    print("\nAUROC: ", auroc.get())

Gives error:

/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.utils.testing module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.
  warnings.warn(message, FutureWarning)
0it [00:00, ?it/s]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-c8fd98afee64> in <module>()
     31         X = preprocessor.fit_transform_partial(X)  # Fit preprocessor to and transform the instance.
     32 
---> 33         score = model.fit_score_partial(X)  # Fit model to and score the instance.
     34         score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.
     35 

1 frames
/usr/local/lib/python3.6/dist-packages/pysad/models/knn_cad.py in fit_partial(self, X, y)
     73                 self.training.append(self.calibration.pop(0))
     74 
---> 75             self.scores.pop(0)
     76             self.calibration.append(new_item)
     77             self.scores.append(new_score)

IndexError: pop from empty list

If I set the probationary_period to 25, I see a slightly different error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-fb6b7ffc5fde> in <module>()
     31         X = preprocessor.fit_transform_partial(X)  # Fit preprocessor to and transform the instance.
     32 
---> 33         score = model.fit_score_partial(X)  # Fit model to and score the instance.
     34         score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.
     35 

4 frames
<__array_function__ internals> in partition(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in partition(a, kth, axis, kind, order)
    744     else:
    745         a = asanyarray(a).copy(order="K")
--> 746     a.partition(kth, axis=axis, kind=kind, order=order)
    747     return a
    748 

ValueError: kth(=28) out of bounds (6)

Then if I set probationary_period=50 it works.

So it feels like there is some sort of edge case I may be hitting when probationary_period is low.

I'm happy to work on a PR if there is some easy fix we can make, or even if we just want to set a default that might keep people from doing what I did :)

opened by andrewm4894 0

Releases (v0.1.1)

v0.1.1 (Aug 19, 2020)

pysad-0.1.1.tar.gz

v0.1.0 (Aug 15, 2020)

Initial release: pysad-0.1.0.tar.gz

Owner

Selim Firat Yilmaz

M.S. in Bilkent University EEE
