Overview

logdata-anomaly-miner

This tool parses log data and lets you define analysis pipelines for anomaly detection. It was designed to run the analysis with limited resources and the lowest possible permissions, which makes it suitable for use on production servers.

AECID Demo – Anomaly Detection with aminer and Reporting to IBM QRadar

Requirements

Installing logdata-anomaly-miner requires a Linux system with Python >= 3.6. Debian-based distributions are currently recommended.

See requirements.txt for further module dependencies.

Installation

Debian

There are Debian packages for logdata-anomaly-miner in the official Debian/Ubuntu repositories.

apt-get update && apt-get install logdata-anomaly-miner

From source

The following commands will install the latest stable release:

cd $HOME
wget https://raw.githubusercontent.com/ait-aecid/logdata-anomaly-miner/main/scripts/aminer_install.sh
chmod +x aminer_install.sh
./aminer_install.sh

Docker

For installation with Docker see: Deployment with Docker

Getting started

Here are some resources to read in order to get started with configurations:

Publications

Publications and talks:

A complete list of publications can be found at https://aecid.ait.ac.at/further-information/.

Contribution

We're happily taking patches and other contributions. Please see the following links for how to get started:

Bugs

If you encounter any bugs, please create an issue on GitHub.

Security

If you discover any security-related issues, read SECURITY.md first and then report them.

License

GPL-3.0

Comments
  • Multiline support

    Since issue 372 was closed, I open a new issue for multiline support. See https://github.com/ait-aecid/logdata-anomaly-miner/issues/372

    As I mentioned in the issue, it would be good to have an optional EOL parameter in the config to support simple multiline logs that are clearly separable, e.g., by \n\n that otherwise does not occur. We could also think about supporting more advanced multiline logs, in particular, json formatted logs where each json object spans over several lines rather than a single line. This could be solved by counting brackets, i.e., the ByteStreamAtomizer increases a counter (initially set to 0) for every "{" and decreases it for every "}" (or any other user-defined characters), and passes a log_atom to the parser every time this counter reaches 0.
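The bracket-counting idea described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not aminer code: the function name `split_objects` and its parameters are hypothetical, and the sketch ignores brackets that occur inside JSON string values, which a real atomizer would have to handle.

```python
def split_objects(data: bytes, open_char: bytes = b"{", close_char: bytes = b"}"):
    """Yield one chunk per top-level object by counting brackets.

    Hypothetical helper: the counter starts at 0, goes up on every opening
    character and down on every closing character. A chunk is emitted each
    time the counter returns to 0.
    """
    depth = 0
    start = None
    for i in range(len(data)):
        ch = data[i:i + 1]  # one-byte slice, so it compares equal to b"{"
        if ch == open_char:
            if depth == 0:
                start = i  # remember where the top-level object begins
            depth += 1
        elif ch == close_char and depth > 0:
            depth -= 1
            if depth == 0:
                yield data[start:i + 1]
                start = None


if __name__ == "__main__":
    sample = b'{\n  "a": {"b": 1}\n}\n{"c": 2}\n'
    for atom in split_objects(sample):
        print(atom)
```

Each yielded chunk would correspond to one log_atom passed to the parser. The opening and closing characters are parameters so that other user-defined delimiters can be tried.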

    enhancement 
    opened by landauermax 15
  • Allowlist and blocklist for detector path lists

    allowlisted_paths in ECD should be named blocklisted_paths, since these paths are not considered for detection.

    allowlisted_paths should also exist, but do the opposite: analysis should only be carried out when one of the paths in the log atom's match dictionary contains one of the allowlisted_paths.

    The attribute paths should overrule these lists.

    This feature should be available for all detectors that may be analyzing all available parser matches, such as the VTD.
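A minimal sketch of the proposed semantics, under stated assumptions: `select_paths` is a hypothetical helper (not part of aminer), "contains" is interpreted as substring matching, and an explicit `paths` list overrules both other lists.

```python
def select_paths(match_paths, paths=None, allowlisted_paths=None, blocklisted_paths=None):
    """Return the parser match paths a detector would analyze.

    Hypothetical helper illustrating the proposal:
    - paths: if set, it overrules both lists.
    - allowlisted_paths: analysis only happens if at least one match path
      contains one of these entries (substring match is an assumption).
    - blocklisted_paths: match paths containing one of these are skipped.
    """
    if paths:
        return [p for p in match_paths if p in paths]
    if allowlisted_paths and not any(
        allowed in p for p in match_paths for allowed in allowlisted_paths
    ):
        return []  # no allowlisted path present, so skip this log atom
    blocked = blocklisted_paths or []
    return [p for p in match_paths if not any(b in p for b in blocked)]


if __name__ == "__main__":
    matches = ["/model/host", "/model/user", "/model/time"]
    print(select_paths(matches, blocklisted_paths=["/model/time"]))
    print(select_paths(matches, allowlisted_paths=["/other"]))
```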

    enhancement 
    opened by landauermax 15
  • Fix import warnings

    /usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from spec or package, falling back on name and path

    return f(*args, **kwds)

    should not occur, when running the aminer.

    bug 
    opened by 4cti0nfi9ure 15
  • %z makes parsing way too slow

    When using the %z in the parsing model (see slow.txt), I get around 50 lines per second. Without it I get around 1000 lines per second (see fast.txt). There is something wrong with parsing %z in the DateTimeModelElement.

    fast.txt slow.txt train.log config.py.txt

    bug high 
    opened by landauermax 12
  • added nullable functionality to JsonModelElements.

    Make sure these boxes are signed before submitting your Pull Request -- thank you.

    Must haves

    • [x] I have read and followed the contributing guide lines at https://github.com/ait-aecid/logdata-anomaly-miner/wiki/Git-development-workflow
    • [x] Issues exist for this PR
    • [x] I added related issues using the "Fixes #"-notations
    • [x] This Pull-Requests merges into the "development"-branch

    Fixes #1061 Fixes #1074

    Submission specific

    • [ ] This PR introduces breaking changes
    • [ ] My change requires a change to the documentation
    • [ ] I have updated the documentation accordingly
    • [ ] I have added tests to cover my changes
    • [ ] All new and existing tests passed

    Describe changes:

    opened by ernstleierzopf 11
  • Create backups of persistency

    There should be a command-line parameter that backs up the persistency at regular intervals. Also, there should be a remote control command that saves the persistency when executed.

    The persistency should be copied into a directory /var/lib/aminer/backup/yyyy-mm-dd-hh-mm-ss/...

    There should also be the possibility to restore configs, by remote control, config settings, etc.
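The backup step could look roughly like the following sketch. It assumes that a plain directory copy is sufficient; `backup_persistence` is a hypothetical name, and the backup directory itself is excluded from the copy because it lives inside the persistency directory.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_persistence(persistence_dir, backup_root):
    """Copy persistence_dir into backup_root/yyyy-mm-dd-hh-mm-ss/.

    Hypothetical helper. Entries named like the backup directory are
    ignored so that earlier backups are not copied into the new one.
    """
    stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    target = Path(backup_root) / stamp
    shutil.copytree(
        persistence_dir,
        target,
        ignore=shutil.ignore_patterns(Path(backup_root).name),
    )
    return target


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "aminer"
        (src / "NewMatchPathValueDetector").mkdir(parents=True)
        (src / "NewMatchPathValueDetector" / "test").write_text("[]")
        print(backup_persistence(src, src / "backup"))
```

Running this at regular intervals, or from a remote control command, is left out here because it depends on how aminer schedules its own tasks.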

    enhancement 
    opened by landauermax 11
  • Tabs in logs

    My log file contains tabulators (e.g. System name:\tTESTNAME). However, the byte strings in the parsing models cannot interpret these tabulators (\t): FixedDataModelElement('fixed1', b'System name:\t'),

    How can I make it possible for the tabs to be interpreted correctly?

    opened by tschohanna 10
  • Add overall output for aminer

    There should be a way to write everything that the AMiner outputs in a file. For example, in the beginning of the config, a parameter StandardOutput: "/etc/aminer/output.txt" can be set, where all the output (anomalies, errors, etc) is written to in addition to the usual output components. By default, it should be None and not write anything.

    enhancement 
    opened by landauermax 10
  • Warning if two detectors persist on same file

    It is possible to define two detectors of the same type that will end up persisting in the same file - this can especially happen by accident, when the "Default" name is used. We should not prevent it completely, but at least print a warning when two or more detectors persist on the same file.
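Such a check could be sketched as follows. It assumes that the persistence file of a detector is determined by its type and its persistence_id, and `find_shared_persistence` is a hypothetical helper rather than an existing aminer function.

```python
from collections import defaultdict


def find_shared_persistence(detectors):
    """Return {(detector_type, persistence_id): [indices]} for collisions.

    detectors is an iterable of (detector_type, persistence_id) tuples.
    Assumption: two detectors persist to the same file exactly when both
    values are equal.
    """
    by_file = defaultdict(list)
    for index, key in enumerate(detectors):
        by_file[tuple(key)].append(index)
    return {key: idx for key, idx in by_file.items() if len(idx) > 1}


if __name__ == "__main__":
    configured = [
        ("NewMatchPathValueDetector", "Default"),
        ("NewMatchPathValueDetector", "Default"),
        ("EventFrequencyDetector", "Default"),
    ]
    for (dtype, pid), indices in find_shared_persistence(configured).items():
        print(f"Warning: detectors {indices} of type {dtype} share persistence id {pid!r}")
```

The sketch only reports collisions and does not prevent them, which matches the intent of printing a warning.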

    enhancement 
    opened by landauermax 9
  • AtomFilterMatchAction YAML support

    There should be a way to use a MatchRule so that only logs that match are forwarded to a specific detector, using the AtomFilterMatchAction. This can be done in Python configs, but not in YAML configs. Also, tests and documentation are missing.

    enhancement high 
    opened by landauermax 8
  • Paths to JSON list elements

    I have this sample data:

    /home/ubuntu# cat file3.log
    {"a": ["success", "a.png"]}
    {"a": ["success", "b.png"]}
    {"a": ["fail", "c.png"]}
    {"a": ["success", "c.png"]}
    

    The values in the list should be detected with a value detector. They should not be mixed, i.e., the first and second element in the list are independent.

    I use the following config to parse the file:

    LearnMode: True
    
    LogResourceList:
      - "file:///home/ubuntu/file3.log"
    
    Parser:  
           - id: x
             type: VariableByteDataModelElement
             name: 'x'
             args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'
    
           - id: json
             start: True
             type: JsonModelElement
             name: 'model'
             key_parser_dict:
               "a": 
                 - x
    
    Input:
            timestamp_paths: None
            verbose: True
            json_format: True
    
    Analysis:
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/x'
              learn_mode: true
              persistence_id: test
    
    EventHandlers:
            - id: stpe
              json: true
              type: StreamPrinterEventHandler
    

    Note that I use a value detector on the list. The result is as follows:

    /home/ubuntu# cat /var/lib/aminer/NewMatchPathValueDetector/test
    ["bytes:a.png", "bytes:c.png", "bytes:b.png"]
    

    Only the last value has been learned, but I also want to learn the first element in the array.

    I propose to model all elements of the lists as their own elements, so that the parser looks like this:

    Parser:
           - id: y
             type: FixedWordlistDataModelElement
             name: 'y'
             args:
               - 'success'
               - 'fail'
                 
           - id: x
             type: VariableByteDataModelElement
             name: 'x'
             args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'
    
           - id: json
             start: True
             type: JsonModelElement
             name: 'model'
             key_parser_dict:
               "a": 
                 - y
                 - x
    

    and the analysis could look like this, where each element can be addressed individually by an analysis component:

    Analysis:
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/x'
              learn_mode: true
              persistence_id: test
    
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/y'
              learn_mode: true
              persistence_id: test
    

    The current implementation uses a single element to model all elements of the list. This can also be convenient and should be possible by introducing a new element called ListOfElements. It should parse any number of elements in the list with the specified parsing model element. For example, the list of elements here is a list of variable byte elements:

    Parser:
           - id: loe
             type: ListOfElements
             name: 'loe'
             args: z
                 
           - id: z
             type: VariableByteDataModelElement
             name: 'z'
             args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'
    
           - id: json
             start: True
             type: JsonModelElement
             name: 'model'
             key_parser_dict:
               "a": 
                 - loe
    

    The ListOfElements element should then assign the index of the element in the JSON list at the end of the path. For example, the following paths can be used in the analysis section:

    Analysis:
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/loe/0'
              learn_mode: true
              persistence_id: test
    
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/loe/1'
              learn_mode: true
              persistence_id: test
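
The index-suffixed paths can be illustrated with a small Python sketch that learns values per list position from the sample data above. The function `learn_values_per_index` is hypothetical, and the ListOfElements element is a proposal in this issue, not an existing model element.

```python
import json
from collections import defaultdict


def learn_values_per_index(lines, key="a", base_path="/model/loe"):
    """Collect the values seen at each list position under its own path.

    Hypothetical illustration: element i of the JSON list stored under
    key is addressed as base_path/i, so positions stay independent.
    """
    learned = defaultdict(set)
    for line in lines:
        for index, value in enumerate(json.loads(line)[key]):
            learned[f"{base_path}/{index}"].add(value)
    return dict(learned)


if __name__ == "__main__":
    sample = [
        '{"a": ["success", "a.png"]}',
        '{"a": ["success", "b.png"]}',
        '{"a": ["fail", "c.png"]}',
        '{"a": ["success", "c.png"]}',
    ]
    for path, values in sorted(learn_values_per_index(sample).items()):
        print(path, sorted(values))
```

With the four sample lines, the first position only ever holds success or fail, while the second position holds the file names.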
    
    enhancement medium 
    opened by landauermax 8
  • extended FrequencyDetector wiki tests.

    Make sure these boxes are signed before submitting your Pull Request -- thank you.

    Must haves

    • [x] I have read and followed the contributing guide lines at https://github.com/ait-aecid/logdata-anomaly-miner/wiki/Git-development-workflow
    • [x] Issues exist for this PR
    • [x] I added related issues using the "Fixes #"-notations
    • [x] This Pull-Requests merges into the "development"-branch

    Fixes #1008 Fixes #1009

    Submission specific

    • [ ] This PR introduces breaking changes
    • [ ] My change requires a change to the documentation
    • [ ] I have updated the documentation accordingly
    • [ ] I have added tests to cover my changes
    • [ ] All new and existing tests passed

    Describe changes:

    opened by ernstleierzopf 0
  • fixed test26 so no fix definition number has to be added.

    Make sure these boxes are signed before submitting your Pull Request -- thank you.

    Must haves

    • [x] I have read and followed the contributing guide lines at https://github.com/ait-aecid/logdata-anomaly-miner/wiki/Git-development-workflow
    • [x] Issues exist for this PR
    • [x] I added related issues using the "Fixes #"-notations
    • [x] This Pull-Requests merges into the "development"-branch

    Fixes #1181

    Submission specific

    • [ ] This PR introduces breaking changes
    • [ ] My change requires a change to the documentation
    • [ ] I have updated the documentation accordingly
    • [ ] I have added tests to cover my changes
    • [ ] All new and existing tests passed

    Describe changes:

    opened by ernstleierzopf 0
  • Random test fails when new detector is added

    When adding a new detector and running the tests, they usually fail at test26_filter_config_errors in YamlConfigTest.py as there is an integer that needs to be incremented. For example, see PR #1180 where this had to be fixed when adding a new detector. It is hard to spot why this test fails as it has nothing to do with the added detector and it is not an indicator of something that needs to be fixed. I therefore suggest to modify this test case so that no matter what integer comes after the "definition" keyword, the test passes. Then adding new detectors in the future should not make it necessary to always update this test.
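One way to make such a comparison independent of the number is to normalize it before asserting. The sketch below is an assumption about how this might be done: the message text is a made-up example, and `normalize_definition_number` is a hypothetical helper, not code from YamlConfigTest.py.

```python
import re

# Matches the keyword "definition" followed by an integer.
DEFINITION_NUMBER = re.compile(r"\bdefinition \d+")


def normalize_definition_number(message):
    """Replace the integer after "definition" with a fixed placeholder."""
    return DEFINITION_NUMBER.sub("definition <n>", message)


if __name__ == "__main__":
    # Hypothetical error messages that differ only in the definition number.
    before = "oneof definition 23: unallowed value"
    after = "oneof definition 24: unallowed value"
    print(normalize_definition_number(before) == normalize_definition_number(after))
```

A test could then compare the normalized expected and actual messages, so adding a detector would no longer change the outcome.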

    test medium 
    opened by landauermax 0
  • Add possibility to run some LogResources as json input and some as normal text input.

    LogResourceList:
    
       - url: "file:///var/log/apache2/access.log"
       - url: "unix:///var/lib/akafka/aminer.sock"
         type: json  # configures the byte stream
         parser_id: kafka_audit_logs  # configures the associated parser
    
    
    Parser:
       - id: kafka_audit_logs
         type: AuditDingsParser
    
       - id: ApacheAccessModel
         start: true
    
    opened by ernstleierzopf 0
  • Shorten the build-time for docker builds

    Currently the complete Docker image is built at once. This takes a lot of time for each build. We could shorten the build time by inheriting from a pre-built image.

    enhancement 
    opened by whotwagner 0
Releases(V2.5.1)
  • V2.5.1(May 17, 2022)

    Bugfixes:

    • EFD: Fixed problem that appears with empty windows
    • Fixed index out of range if matches are empty in JsonModelElement array.
    • EFD: Enabled immediate detection without training, if both limits are set
    • EFD: Fixed bug related to auto_include_flag
    • Remove spaces in aminer logo
    • ParserCounter: Fixed do_timer
    • Fixed code to allow the usage of AtomFilterMatchAction in yaml configs
    • Fixed JsonModelElement when json object is null
    • Fix incorrect message of charset detector
    • Fix match list handling for json objects

    Changes:

    • Added nullable functionality to JsonModelElements
    • Added include-directive to supervisord.conf
    • ETD: Output warning when count first exceeds range
    • EFD: Added option to output anomaly when the count first exceeds the range
    • VTD: Added variable type 'range'
    • EFD: Added the function reset_counter
    • EFD: Added option to set the lower and upper limit of the range interval
    • Enhance EFD to consider multiple time windows
    • VTD: Changed the value of parameter num_updates_until_var_reduction to track all variables from False to 0.
    • PAD: Used binom_test from the scipy package to test whether the model should be reinitialized when fewer anomalies occur than expected
    • Add ParsedLogAtom to aminer parser to ensure compatibility with lower versions
    • Added script to add build-id to the version-string
    • Support for installations from source in install-script
    • Fixed and standardized the persistence time of various detectors
    • Refactoring
    • Improve performance
    • Improve output handling
    • Improved testing
  • V2.5.0(Dec 6, 2021)

    Bugfixes:

    • Fixed bug in YamlConfig

    Changes:

    • Added supervisord to docker
    • Moved unparsed atom handlers to analysis(yamlconfig)
    • Moved new_match_path_detector to analysis(yamlconfig)
    • Refactor: merged all UnparsedHandlers into one python-file
    • Added remotecontrol-command for reopening eventhandlers
    • Added config-parameters for logrotation
    • Improved testing
  • V2.4.2(Nov 24, 2021)

    Bugfixes:

    • PVTID: Fixed output format of previously appeared times
    • VTD: Fixed bugs (static -> discrete)
    • VTD: Fixed persistency-bugs
    • Fixed %z performance issues
    • Fixed error where optional keys with an array type are not parsed when being null
    • Fixed issues with JsonModelElement
    • Fixed persistence handling for ValueRangeDetector
    • PTSAD: Fixed a bug, which occurs, when the ETD stops saving the values of one analyzed path
    • ETD: Fixed the problem when entries of the match_dictionary are not of type MatchElement
    • Fixed error where json data instead of array was parsed successfully.

    Changes:

    • Added multiple parameters to VariableCorrelationDetector
    • Improved VTD
    • PVTID: Renamed parameter time_window_length to time_period_length
    • PVTID: Added check if atom time is None
    • Enhanced output of MTTD and PVTID
    • Improved docker-compose-configuration
    • Improved testing
    • Enhanced PathArimaDetector
    • Improved documentation
    • Improved KernelMsgParsingModel
    • Added pretty print for json output
    • Added the PathArimaDetector
    • TSA: Added functionality to discard arima models with too few log lines per time step
    • TSA: improved confidence calculation
    • TSA: Added the option to force the period length
    • TSA: Automatic selection of the pause area of the ACF
    • Extended EximGenericParsingModel
    • Extended AudispdParsingModel
  • V2.4.1(Jul 23, 2021)

    Bugfixes:

    • Fixed issues with array of arrays in JsonParser
    • Fixed problems with invalid json-output
    • Fixed ValueError in DTME
    • Fixed error with parsing floats in scientific notation with the JsonModelElement.
    • Fixed issue with paths in JsonModelElement
    • Fixed error with \x encoded json
    • Fixed error where EMPTY_ARRAY and EMPTY_OBJECT could not be parsed from the yaml config
    • Fixed a bug in the TSA when encountering a new event type
    • Fixed systemd script
    • Fixed encoding errors when reading yaml configs

    Changes:

    • Add entropy detector
    • Add charset detector
    • Add value range detector
    • Improved ApacheAccessModel, AudispdParsingModel
    • Refactoring
    • Improved documentation
    • Improved testing
    • Improved schema for yaml-config
    • Added EMPTY_STRING option to the JsonModelElement
    • Implemented check to report unparsed atom if ALLOW_ALL is used with data with a type other than list or dict
  • V2.4.0(Jun 10, 2021)

    Bugfixes:

    • Fixed error in JsonModelElement
    • Fixed problems with umlauts in JsonParser
    • Fixed problems with the start element of the ElementValueBranchModelElement
    • Fixed issues with the stat and debug command line parameters
    • Fixed issues if posix acl are not supported by the filesystem
    • Fixed issues with output for non ascii characters
    • Modified kafka-version

    Changes:

    • Improved command-line-options install-script
    • Added documentation
    • Improved VTD CM-Test
    • Improved unit-tests
    • Refactoring
    • Added TSAArimaDetector
    • Improved ParserCount
    • Added the PathValueTimeIntervalDetector
    • Implemented offline mode
    • Added PCA detector
    • Added timeout-parameter to ESD
  • V2.3.1(Apr 8, 2021)

  • V2.3.0(Mar 31, 2021)

    Bugfixes:

    • Changed pyyaml-version to 5.4
    • NewMatchIdValueComboDetector: Fix allow multiple values per id path
    • ByteStreamLineAtomizer: fixed encoding error
    • Fixed too many open directory-handles
    • Added close() function to LogStream

    Changes:

    • Added EventFrequencyDetector
    • Added EventSequenceDetector
    • Added JsonModelElement
    • Added tests for Json-Handling
    • Added command line parameter for update checks
    • Improved testing
    • Split yaml-schemas into multiple files
    • Improved support for yaml-config
    • YamlConfig: set verbose default to true
    • Various refactoring
  • V2.2.3(Feb 5, 2021)

  • V2.2.2(Jan 29, 2021)

  • V2.2.1(Jan 26, 2021)

    Bugfixes:

    • Fixed warnings due to files in Persistency-Directory
    • Fixed ACL-problems in dockerfile and autocreate /var/lib/aminer/log

    Changes:

    • Added simple test for dockercontainer
    • Negate result of the timeout-command. 1 is okay. 0 must be an error
    • Added bullseye-tests
    • Make tmp-dir in debian-bullseye-test and debian-buster-test unique
  • V2.2.0(Dec 23, 2020)

    Changes:

    • Added Dockerfile
    • Added checks for acl of persistency directory
    • Added VariableCorrelationDetector
    • Added tool for managing multiple persistency files
    • Added suppress-list for output
    • Added suspend-mode to remote-control
    • Added requirements.txt
    • Extended documentation
    • Extended yaml-configuration-support
    • Standardize command line parameters
    • Removed --Forground cli parameter
    • Fixed Security warnings by removing functions that allow race-condition
    • Refactoring
    • Ethically correct naming of variables
    • Enhanced testing
    • Added statistic outputs
    • Enhanced status info output
    • Changed global learn_mode behavior
    • Added RemoteControlSocket to yaml-config
    • Reimplemented the default mailnotificationhandler

    Bugfixes:

    • Fixed typos in documentation
    • Fixed issue with the AtomFilter in the yaml-config
    • Fixed order of ETD in yaml-config
    • Fixed various issues in persistency
  • V2.1.0(Nov 5, 2020)

    • Changes:
      • Added VariableTypeDetector,EventTypeDetector and EventCorrelationDetector
      • Added support for unclean format strings in the DateTimeModelElement
      • Added timezones to the DateTimeModelElement
      • Enhanced ApacheAccessModel
      • Yamlconfig: added support for kafka stream
      • Removed cpu limit configuration
      • Various refactoring
      • Yamlconfig: added support for more detectors
      • Added new command-line-parameters
      • Renamed executables to aminer.py and aminerremotecontrol.py
      • Run Aminer in foreground mode by default
      • Added various unit-tests
      • Improved yamlconfig and checks
      • Added start-config for parser to yamlconfig
      • Renamed config templates
      • Removed imports from init.py for better modularity
      • Created AnalysisComponentsPerformanceTests for the EventTypeDetector
      • Extended demo-config
      • Renamed whitelist to allowlist
      • Added warnings for non-existent resources
      • Changed default of auto_include_flag to false
    • Bugfixes:
      • Fixed some exit() in forks
      • Fixed debian files
      • Fixed JSON output of the AffectedLogAtomValues in all detectors
      • Fixed normal output of the NewMatchPathValueDetector
      • Fixed recurring alerting in MissingMatchPathValueDetector
  • V2.0.2(Jul 17, 2020)

    • Changes:
      • Added help parameters
      • Added help-screen
      • Added version parameter
      • Added path and value filter
      • Change time model of ApacheAccessModel for arbitrary time zones
      • Update link to documentation
      • Added SECURITY.md
      • Refactoring
      • Updated man-page
      • Added unit-tests for loadYamlconfig
    • Bugfixes:
      • Fixed header comment type in schema file
      • Fix debian files
  • V2.0.1(Jun 24, 2020)

    • Changes:
      • Updated documentation
      • Updated testcases
      • Updated demos
      • Updated debian files
      • Added copyright headers
      • Added executable bit to AMiner
  • V2.0.0(May 29, 2020)

    • Changes:
      • Updated documentation
      • Added functions getNameByComponent and getIdByComponent to AnalysisChild.py
      • Update DefaultMailNotificationEventHandler.py to python3
      • Extended AMinerRemoteControl
      • Added support for configuration in yaml format
      • Refactoring
      • Added KafkaEventHandler
      • Added JsonConverterHandler
      • Added NewMatchIdValueComboDetector
      • Enabled multiple default timestamp paths
      • Added debug feature ParserCount
      • Added unit and integration tests
      • Added installer script
      • Added VerboseUnparsedHandler
    • Bugfixes including:
      • Fixed dependencies in Debian packaging
      • Fixed typo in various analysis components
      • Fixed import of ModelElementInterface in various parsing components
      • Fixed issues with byte/string comparison
      • Fixed issue in DecimalIntegerValueModelElement, when parsing integer including sign and padding character
      • Fixed unnecessary long blocking time in SimpleMultisourceAtomSync
      • Changed minimum matchLen in DelimitedDataModelElement to 1 byte
      • Fixed timezone offset in ModuloTimeMatchRule
      • Minor bugfixes
Owner
AECID
Automatic Event Correlation for Incident Detection