A PyTorch implementation of Deep SAD, a deep Semi-supervised Anomaly Detection method.

Overview

Deep SAD: A Method for Deep Semi-Supervised Anomaly Detection

This repository provides a PyTorch implementation of the Deep SAD method presented in our ICLR 2020 paper ”Deep Semi-Supervised Anomaly Detection”.

Citation and Contact

You find a PDF of the Deep Semi-Supervised Anomaly Detection ICLR 2020 paper on arXiv https://arxiv.org/abs/1906.02694.

If you find our work useful, please also cite the paper:

@InProceedings{ruff2020deep,
  title     = {Deep Semi-Supervised Anomaly Detection},
  author    = {Ruff, Lukas and Vandermeulen, Robert A. and G{\"o}rnitz, Nico and Binder, Alexander and M{\"u}ller, Emmanuel and M{\"u}ller, Klaus-Robert and Kloft, Marius},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://openreview.net/forum?id=HkgH0TEYwH}
}

If you would like get in touch, just drop us an email to [email protected].

Abstract

Deep approaches to anomaly detection have recently shown promising results over shallow methods on large and complex datasets. Typically anomaly detection is treated as an unsupervised learning problem. In practice however, one may have---in addition to a large set of unlabeled samples---access to a small pool of labeled samples, e.g. a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection aim to utilize such labeled samples, but most proposed methods are limited to merely including labeled normal samples. Only a few methods take advantage of labeled anomalies, with existing deep approaches being domain-specific. In this work we present Deep SAD, an end-to-end deep methodology for general semi-supervised anomaly detection. We further introduce an information-theoretic framework for deep anomaly detection based on the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution, which can serve as a theoretical interpretation for our method. In extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10, along with other anomaly detection benchmark datasets, we demonstrate that our method is on par or outperforms shallow, hybrid, and deep competitors, yielding appreciable performance improvements even when provided with only little labeled data.

The need for semi-supervised anomaly detection

fig1

Installation

This code is written in Python 3.7 and requires the packages listed in requirements.txt.

Clone the repository to your machine and directory of choice:

git clone https://github.com/lukasruff/Deep-SAD-PyTorch.git

To run the code, we recommend setting up a virtual environment, e.g. using virtualenv or conda:

virtualenv

# pip install virtualenv
cd <path-to-Deep-SAD-PyTorch-directory>
virtualenv myenv
source myenv/bin/activate
pip install -r requirements.txt

conda

cd <path-to-Deep-SAD-PyTorch-directory>
conda create --name myenv
source activate myenv
while read requirement; do conda install -n myenv --yes $requirement; done < requirements.txt

Running experiments

We have implemented the MNIST, Fashion-MNIST, and CIFAR-10 datasets as well as the classic anomaly detection benchmark datasets arrhythmia, cardio, satellite, satimage-2, shuttle, and thyroid from the Outlier Detection DataSets (ODDS) repository (http://odds.cs.stonybrook.edu/) as reported in the paper.

The implemented network architectures are as reported in the appendix of the paper.

Deep SAD

You can run Deep SAD experiments using the main.py script.

Here's an example on MNIST with 0 considered to be the normal class and having 1% labeled (known) training samples from anomaly class 1 with a pollution ratio of 10% of the unlabeled training data (with unknown anomalies from all anomaly classes 1-9):

cd <path-to-Deep-SAD-PyTorch-directory>

# activate virtual environment
source myenv/bin/activate  # or 'source activate myenv' for conda

# create folders for experimental output
mkdir log/DeepSAD
mkdir log/DeepSAD/mnist_test

# change to source directory
cd src

# run experiment
python main.py mnist mnist_LeNet ../log/DeepSAD/mnist_test ../data --ratio_known_outlier 0.01 --ratio_pollution 0.1 --lr 0.0001 --n_epochs 150 --lr_milestone 50 --batch_size 128 --weight_decay 0.5e-6 --pretrain True --ae_lr 0.0001 --ae_n_epochs 150 --ae_batch_size 128 --ae_weight_decay 0.5e-3 --normal_class 0 --known_outlier_class 1 --n_known_outlier_classes 1;

Have a look into main.py for all possible arguments and options.

Baselines

We also provide an implementation of the following baselines via the respective baseline_<method_name>.py scripts: OC-SVM (ocsvm), Isolation Forest (isoforest), Kernel Density Estimation (kde), kernel Semi-Supervised Anomaly Detection (ssad), and Semi-Supervised Deep Generative Model (SemiDGM).

Here's how to run SSAD for example on the same experimental setup as above:

cd <path-to-Deep-SAD-PyTorch-directory>

# activate virtual environment
source myenv/bin/activate  # or 'source activate myenv' for conda

# create folder for experimental output
mkdir log/ssad
mkdir log/ssad/mnist_test

# change to source directory
cd src

# run experiment
python baseline_ssad.py mnist ../log/ssad/mnist_test ../data --ratio_known_outlier 0.01 --ratio_pollution 0.1 --kernel rbf --kappa 1.0 --normal_class 0 --known_outlier_class 1 --n_known_outlier_classes 1;

The autoencoder is provided through Deep SAD pre-training using --pretrain True with main.py. To then run a hybrid approach using one of the classic methods on top of autoencoder features, simply point to the saved autoencoder model using --load_ae ../log/DeepSAD/mnist_test/model.tar and set --hybrid True.

To run hybrid SSAD for example on the same experimental setup as above:

cd <path-to-Deep-SAD-PyTorch-directory>

# activate virtual environment
source myenv/bin/activate  # or 'source activate myenv' for conda

# create folder for experimental output
mkdir log/hybrid_ssad
mkdir log/hybrid_ssad/mnist_test

# change to source directory
cd src

# run experiment
python baseline_ssad.py mnist ../log/hybrid_ssad/mnist_test ../data --ratio_known_outlier 0.01 --ratio_pollution 0.1 --kernel rbf --kappa 1.0 --hybrid True --load_ae ../log/DeepSAD/mnist_test/model.tar --normal_class 0 --known_outlier_class 1 --n_known_outlier_classes 1;

License

MIT

Owner
Lukas Ruff
PhD student in the ML group at TU Berlin.
Lukas Ruff
Build AGNOS, the operating system for your comma three

agnos-builder This is the tool to build AGNOS, our Ubuntu based OS. AGNOS runs on the comma three devkit. NOTE: the edk2_tici and agnos-firmare submod

comma.ai 21 Dec 24, 2022
Some of the best ways and practices of doing code in Python!

Pythonicness ❤ This repository contains some of the best ways and practices of doing code in Python! Features Properly formatted codes (PEP 8) for bet

Samyak Jain 2 Jan 15, 2022
layout-parser 3.4k Dec 30, 2022
A fast time mocking alternative to freezegun that wraps libfaketime.

python-libfaketime: fast date/time mocking python-libfaketime is a wrapper of libfaketime for python. Some brief details: Linux and OS X, Pythons 3.5

Simon Weber 68 Jun 10, 2022
Main repository for the Sphinx documentation builder

Sphinx Sphinx is a tool that makes it easy to create intelligent and beautiful documentation for Python projects (or other documents consisting of mul

5.1k Jan 04, 2023
Course Materials for Math 340

UBC Math 340 Materials This repository aims to be the one repository for which you can find everything you about Math 340. Lecture Notes Lecture Notes

2 Nov 25, 2021
Automatically open a pull request for repositories that have no CONTRIBUTING.md file

automatic-contrib-prs Automatically open a pull request for repositories that have no CONTRIBUTING.md file for a targeted set of repositories. What th

GitHub 8 Oct 20, 2022
This tutorial will guide you through the process of self-hosting Polygon

Hosting guide This tutorial will guide you through the process of self-hosting Polygon Before starting Make sure you have the following tools installe

Polygon 2 Jan 31, 2022
:blue_book: Automatic documentation from sources, for MkDocs.

mkdocstrings Automatic documentation from sources, for MkDocs. Features - Python handler - Requirements - Installation - Quick usage Features Language

1.1k Jan 04, 2023
Watch a Sphinx directory and rebuild the documentation when a change is detected. Also includes a livereload enabled web server.

sphinx-autobuild Rebuild Sphinx documentation on changes, with live-reload in the browser. Installation sphinx-autobuild is available on PyPI. It can

Executable Books 440 Jan 06, 2023
Fully reproducible, Dockerized, step-by-step, tutorial on how to mock a "real-time" Kafka data stream from a timestamped csv file. Detailed blog post published on Towards Data Science.

time-series-kafka-demo Mock stream producer for time series data using Kafka. I walk through this tutorial and others here on GitHub and on my Medium

Maria Patterson 26 Nov 15, 2022
Build documentation in multiple repos into one site.

mkdocs-multirepo-plugin Build documentation in multiple repos into one site. Setup Install plugin using pip: pip install git+https://github.com/jdoiro

Joseph Doiron 47 Dec 28, 2022
Use Brainf*ck with python!

Brainfudge Run Brainf*ck code with python! Classes Interpreter(array_len): encapsulate all functions into class __init__(self, array_len: int=30000) -

1 Dec 14, 2021
Python Advanced --- numpy, decorators, networking

Python Advanced --- numpy, decorators, networking (and more?) Hello everyone 👋 This is the project repo for the "Python Advanced - ..." introductory

Andreas Poehlmann 2 Nov 05, 2021
🐱‍🏍 A curated list of awesome things related to Hugo themes.

awesome-hugo-themes Automated deployment @ 2021-10-12 06:24:07 Asia/Shanghai &sorted=updated Theme Author License GitHub Stars Updated Blonde wamo MIT

13 Dec 12, 2022
Data science on SDGs - Udemy Online Course Material: Data Science on Sustainable Development Goals

Data Science on Sustainable Development Goals (SDGs) Udemy Online Course Material: Data Science on Sustainable Development Goals https://bit.ly/data_s

Frank Kienle 1 Jan 04, 2022
Easy OpenAPI specs and Swagger UI for your Flask API

Flasgger Easy Swagger UI for your Flask API Flasgger is a Flask extension to extract OpenAPI-Specification from all Flask views registered in your API

Flasgger 3.1k Dec 24, 2022
JMESPath is a query language for JSON.

JMESPath JMESPath (pronounced "james path") allows you to declaratively specify how to extract elements from a JSON document. For example, given this

1.7k Dec 31, 2022
Fun interactive program to sort a list :)

LHD-Build-Sort-a-list Fun interactive program to sort a list :) Inspiration LHD Build Write a script to sort a list. What it does It is a menu driven

Ananya Gupta 1 Jan 15, 2022
Projeto em Python colaborativo para o Bootcamp de Dados do Itaú em parceria com a Lets Code

🧾 lets-code-todo-list por Henrique V. Domingues e Josué Montalvão Projeto em Python colaborativo para o Bootcamp de Dados do Itaú em parceria com a L

Henrique V. Domingues 1 Jan 11, 2022