A rule learning algorithm for the deduction of syndrome definitions from time series data.

Overview

This project provides a rule learning algorithm for the deduction of syndrome definitions from time series data. Large parts of the algorithm are based on "BOOMER".

Features

The algorithm that is provided by this project currently supports the following functionalities for learning descriptive rules:

  • The quality of rules is assessed by comparing the predictions of the current model to the ground truth in terms of the Pearson correlation coefficient.
  • When learning a new rule, random samples of the features may be used.
  • Hyper-parameters that provide control over the specificity/generality of rules are available.
  • The algorithm can natively handle numerical, ordinal and nominal features (without the need for pre-processing techniques such as one-hot encoding).
  • The algorithm is able to deal with missing feature values, i.e., occurrences of NaN in the feature matrix.
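To make the first bullet point more concrete, the Pearson-based quality criterion can be sketched in plain Python. This is only an illustrative sketch; the function name and its inputs are assumptions for illustration, not part of the project's API (the actual criterion is implemented in C++):

```python
import math

def pearson_quality(predictions, ground_truth):
    """Quality of a candidate rule: Pearson correlation between the
    predictions of the current model (including the candidate rule)
    and the ground truth. Returns 0.0 if either series is constant."""
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_g = sum(ground_truth) / n
    cov = sum((p - mean_p) * (g - mean_g)
              for p, g in zip(predictions, ground_truth))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_g = sum((g - mean_g) ** 2 for g in ground_truth)
    if var_p == 0 or var_g == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_g)
```

A value close to 1 indicates that the model's predictions track the ground truth well, which is why candidate rules that increase this value are preferred.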

In addition, the following features that may speed up training or reduce the memory footprint are currently implemented:

  • Dense or sparse feature matrices can be used for training. The use of sparse matrices may speed up training significantly on some data sets.
  • Multi-threading can be used to parallelize the evaluation of a rule's potential refinements across multiple CPU cores.
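The parallel evaluation of refinements mentioned above can be pictured along the following lines. This is only a schematic Python sketch (the candidate representation and the scoring function are made up for illustration); in the project itself the parallelism is implemented in the C++ core, where it is not limited by Python's GIL:

```python
from concurrent.futures import ThreadPoolExecutor

def best_refinement(candidates, score, num_threads=4):
    """Evaluate all potential refinements of a rule in parallel and
    return the highest-scoring one. `candidates` is a list of arbitrary
    refinement objects; `score` maps a candidate to a float."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves the order of the candidates
        scores = list(pool.map(score, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

Because the candidates are scored independently, the work distributes naturally across CPU cores, which is what the `--num-threads-refinement` parameter described below controls.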

Project structure

|-- cpp                     Contains the implementation of core algorithms in C++
    |-- subprojects
        |-- common          Contains implementations that all algorithms have in common
        |-- tsa             Contains implementations for time series analysis
    |-- ...
|-- python                  Contains Python code for running experiments
    |-- rl
        |-- common          Contains Python code that is needed to run any kind of algorithms
            |-- cython      Contains commonly used Cython wrappers
            |-- ...
        |-- tsa             Contains Python code for time series analysis
            |-- cython      Contains time series-specific Cython wrappers
            |-- ...
        |-- testbed         Contains useful functionality for running experiments
            |-- ...
    |-- main.py             Can be used to start an experiment
    |-- ...
|-- Makefile                Makefile for compilation
|-- ...

Project setup

The algorithm provided by this project is implemented in C++. In addition, a Python wrapper that implements the scikit-learn API is available. To be able to integrate the underlying C++ implementation with Python, Cython is used.

The C++ implementation, as well as the Cython wrappers, must be compiled in order to be able to run the provided algorithm. To facilitate compilation, this project comes with a Makefile that automatically executes the necessary steps.

First, a virtual Python environment can be created via the following command:

make venv

As a prerequisite, Python 3.7 (or a more recent version) must be available on the host system. All compile-time dependencies (numpy, scipy, Cython, meson and ninja) that are required for building the project will automatically be installed into the virtual environment. As a result of executing the above command, a subdirectory venv should have been created within the project's root directory.

Afterwards, the compilation can be started by executing the following command:

make compile

Finally, the library must be installed into the virtual environment, together with all of its runtime dependencies (e.g. scikit-learn; a full list can be found in setup.py). For this purpose, the project's Makefile provides the following command:

make install

Whenever any C++ or Cython source files have been modified, they must be recompiled by running the command make compile again! If compiled files already exist, only the modified files will be recompiled.

Cleanup: To get rid of any compilation files, as well as of the virtual environment, the following command can be used:

make clean

For more fine-grained control, the command make clean_venv (for deleting the virtual environment) or make clean_compile (for deleting the compiled files) can be used. If only the compiled Cython files should be removed, the command make clean_cython can be used. Accordingly, the command make clean_cpp removes the compiled C++ files.

Parameters

The file python/main.py can be used to run experiments on a specific data set using different configurations of the learning algorithm. The implementation takes care of writing the experimental results into .csv files, and the learned model can (optionally) be stored on disk so that it can be reused later.

In order to run an experiment, the following command line arguments can be provided (most of them are optional; those marked as non-optional are required):

Parameter Optional? Default Description
--data-dir No None The path of the directory where the data sets are located.
--temp-dir No None The path of the directory where temporary files should be saved.
--dataset No None The name of the .csv files that store the raw data (without suffix).
--feature-definition No None The name of the .txt file that specifies the names of the features to be used (without suffix).
--from-year No None The first year (inclusive) that should be taken into account.
--to-year No None The last year (inclusive) that should be taken into account.
--from-week Yes -1 The first week (inclusive) of the first year that should be taken into account or -1, if all weeks of that year should be used.
--to-week Yes -1 The last week (inclusive) of the last year that should be taken into account or -1, if all weeks of that year should be used.
--count-file-name Yes None The name of the file that stores the number of cases that correspond to individual weeks (without suffix). If not specified, the name results from appending "_counts" to the dataset name.
--one-hot-encoding Yes False True, if one-hot-encoding should be used for nominal attributes, False otherwise.
--output-dir Yes None The path of the directory into which the experimental results (.csv files) should be written.
--print-rules Yes True True, if the induced rules should be printed on the console, False otherwise.
--store-rules Yes True True, if the induced rules should be stored as a .txt file, False otherwise. Only has an effect if the parameter --output-dir is specified.
--print-options Yes {} A dictionary that specifies additional options to be used for printing or storing rules, if the parameter --print-rules and/or --store-rules is set to True, e.g. {'print_feature_names':True,'print_label_names':True,'print_nominal_values':True}.
--store-predictions Yes True True, if the predictions for the training data should be stored as a .csv file, False otherwise. Only has an effect if the parameter --output-dir is specified.
--model-dir Yes None The path of the directory where models (.model files) are located.
--max-rules Yes 50 The maximum number of rules to be induced or -1, if the number of rules should not be restricted.
--time-limit Yes -1 The duration in seconds after which the induction of rules should be canceled or -1, if no time limit should be used.
--feature-sub-sampling Yes None The name of the strategy to be used for feature sub-sampling. Must be random-feature-selection or None. Additional arguments may be provided as a dictionary, e.g. random-feature-selection{'sample_size':0.5}.
--min-support Yes 0.0001 The percentage of training examples that must be covered by a rule. Must be greater than 0 and smaller than 1.
--max-conditions Yes -1 The maximum number of conditions to be included in a rule's body. Must be at least 1 or -1, if the number of conditions should not be restricted.
--random-state Yes 1 The seed to be used by random number generators.
--feature-format Yes auto The format to be used for the feature matrix. Must be sparse, if a sparse matrix should be used, dense, if a dense matrix should be used, or auto, if the format should be chosen automatically.
--num-threads-refinement Yes 1 The number of threads to be used to search for potential refinements of rules. Must be at least 1 or -1, if the number of cores that are available on the machine should be used.
--log-level Yes info The log level to be used. Must be debug, info, warn, warning, error, critical, fatal or notset.
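As an illustration, an invocation that combines several of the optional parameters above might look like the following (the paths are placeholders that must be adapted to the host system):

```
venv/bin/python3 python/main.py \
    --data-dir /path/to/data/ \
    --temp-dir /path/to/temp/ \
    --dataset example \
    --feature-definition features \
    --from-year 2018 --to-year 2019 \
    --max-rules 100 \
    --feature-sub-sampling "random-feature-selection{'sample_size':0.5}" \
    --num-threads-refinement -1 \
    --output-dir /path/to/results/
```

Note that the dictionary argument is quoted so that the shell does not interpret the braces.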

Example and data format

In the following, we give a more detailed description of the data that must be provided to the algorithm. All input files must use UTF-8 encoding and they must be available in a single directory. The path of the directory must be specified via the parameter --data-dir. The following files must be included in the directory:

  • A .csv file that stores the raw training data (see data/example.csv for an example). Each row (separated by line breaks) must correspond to an individual instance and the columns (separated by commas) must correspond to the available features. The names of the columns/features must be given as the first row. The names of columns can be arbitrary, but there must be a column named "week" that associates each instance with a corresponding year and week (using the format year-week, e.g. 2019-2 for the second week of 2019).
  • A .csv file that specifies the number of cases that correspond to individual weeks (see data/example_counts.csv for an example). The file must consist of three columns, year,week,cases, separated by commas. The names of columns must be given as the first row. Each of the other rows (separated by line breaks) assigns a specific number of cases to a certain week of a year (all values must be positive integers). For each combination of year and week that occurs in the column "week" of the first .csv file, the number of cases must be specified in this second .csv file.
  • A .txt file that specifies the names of the features that should be taken into account (see data/features.txt for an example). Each feature name must be given as a new line. For each feature that is specified in the text file, a column with the same name must exist in the first .csv file.
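To make the three files more concrete, hypothetical excerpts (the feature names are made up for illustration) could look as follows:

```
example.csv:
    week,fever,cough,age_group
    2019-1,1,0,child
    2019-1,0,1,adult
    2019-2,1,1,adult

example_counts.csv:
    year,week,cases
    2019,1,2
    2019,2,1

features.txt:
    fever
    cough
    age_group
```

Note that every combination of year and week that occurs in example.csv also has a corresponding row in example_counts.csv, as required above.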

The parameter --dataset is used to identify the .csv files that should be used by the algorithm. Its value must correspond to the name of the first .csv file mentioned above, omitting the file's suffix (e.g. example if the file's name is example.csv). The second .csv file must be named accordingly by appending the suffix _counts to the name of the first file (e.g. example_counts.csv). The parameter --feature-definition is used to specify the name of the text file that stores the names of relevant features. The given value must correspond to the name of the text file, again omitting the file's suffix (e.g. features, if the file's name is features.txt).

In the following, the command for running an experiment, including all mandatory parameters, can be seen:

venv/bin/python3 python/main.py --data-dir /path/to/data/ --temp-dir /path/to/temp/ --dataset example --feature-definition features --from-year 2018 --to-year 2019

When running the program for the first time, the .csv files that are located in the specified data directory will be loaded. The data will be filtered according to the parameters --from-year and --to-year, such that only instances that belong to the specified timespan are retained. Furthermore, all columns that are not listed in the supplied text file will be removed. Finally, the data is converted into the format that is required for learning a rule model. This results in two files (an .arff file and a .xml file) that are written to the directory that is specified via the parameter --temp-dir. The resulting files are named according to the following scheme: <dataset>_<feature-definition>_<from-year>-<to-year> (e.g. example_features_2018-2019). When running the program multiple times, it will check whether these files already exist. If this is the case, the preprocessing step will be skipped and the available files will be used as they are.
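The naming scheme and the caching behavior described above can be sketched as follows. The helper functions are illustrative only (the actual logic lives in the project's Python code):

```python
import os

def temp_file_names(temp_dir, dataset, feature_definition, from_year, to_year):
    """Build the names of the preprocessed files according to the scheme
    <dataset>_<feature-definition>_<from-year>-<to-year>."""
    prefix = f'{dataset}_{feature_definition}_{from_year}-{to_year}'
    return (os.path.join(temp_dir, prefix + '.arff'),
            os.path.join(temp_dir, prefix + '.xml'))

def preprocessing_required(temp_dir, dataset, feature_definition, from_year, to_year):
    """Preprocessing is skipped if both output files already exist."""
    files = temp_file_names(temp_dir, dataset, feature_definition,
                            from_year, to_year)
    return not all(os.path.isfile(f) for f in files)
```

For the example invocation above, the preprocessed files would be named example_features_2018-2019.arff and example_features_2018-2019.xml.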

Releases(0.1.0)
  • 0.1.0(Sep 24, 2021)

    The first release of the algorithm. It supports the functionalities for learning descriptive rules that are listed in the "Features" section above.