
MOTIF Dataset

The Malware Open-source Threat Intelligence Family (MOTIF) dataset contains 3,095 disarmed PE malware samples from 454 families, labeled with ground truth confidence. Family labels were obtained by surveying thousands of open-source threat reports published by 14 major cybersecurity organizations between Jan. 1st, 2016 and Jan. 1st, 2021. The dataset also provides a comprehensive alias mapping for each family and EMBER raw features for each file.

Further information about the MOTIF dataset is provided in our paper.

If you use the provided data or code, please make sure to cite our paper:

@misc{joyce2021motif,
      title={MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels},
      author={Robert J. Joyce and Dev Amlani and Charles Nicholas and Edward Raff},
      year={2021},
      eprint={2111.15031},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Downloading the Dataset

Due to the size of the dataset, you must use Git LFS in order to clone the repository. Installation instructions are available from the Git LFS project. On Debian-based systems, the Git LFS package can be installed using:

sudo apt-get install git-lfs

Once Git LFS is installed, you can clone this repository using:

git lfs clone https://github.com/boozallen/MOTIF.git

Dataset Contents

The main dataset is located in dataset/ and contains the following files:

motif_dataset.jsonl

Each line of motif_dataset.jsonl is a JSON object with the following fields:

Name             Description
md5              MD5 hash of malware sample
sha1             SHA-1 hash of malware sample
sha256           SHA-256 hash of malware sample
reported_hash    Hash of malware sample provided in report
reported_family  Normalized family name provided in report
aliases          List of known aliases for family
label            Unique id for malware family (for ML purposes)
report_source    Name of organization that published report
report_date      Date report was published
report_url       URL of report
report_ioc_url   URL to report appendix (if any)
appeared         Year and month malware sample was first seen

Each JSON object also contains the EMBER raw features (version 2) for the file:

Name              Description
histogram         EMBER byte histogram
byteentropy       EMBER byte-entropy histogram
strings           EMBER strings metadata
general           EMBER general file metadata
header            EMBER PE header metadata
section           EMBER PE section metadata
imports           EMBER imports metadata
exports           EMBER exports metadata
datadirectories   EMBER data directories metadata
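
For example, the metadata can be loaded line by line with Python's standard json module. This is a minimal sketch that assumes the repository has been cloned and that the field names match the tables above:

import json

# Read one JSON object per line from motif_dataset.jsonl.
samples = []
with open("dataset/motif_dataset.jsonl", "r") as f:
    for line in f:
        samples.append(json.loads(line))

# Example: count samples and distinct family labels.
labels = {s["label"] for s in samples}
print(f"{len(samples)} samples, {len(labels)} families")

# Example: inspect the reported family and its aliases for the first sample.
print(samples[0]["reported_family"], samples[0]["aliases"])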

motif_families.csv

This file contains an alias mapping for each of the 454 malware families in the MOTIF dataset. It also contains a succinct description of the family and the threat group or campaign that the family is attributed to (if any).

Column                Description
Aliases               List of known aliases for family
Description           Brief sentence describing capabilities of malware family
Attribution (If any)  Name of threat actor or campaign the malware is attributed to
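
As a quick illustration, the alias mapping can be read with Python's csv module. This is a sketch that assumes the CSV column headers match the table above:

import csv

# Load the family alias mapping from motif_families.csv.
with open("dataset/motif_families.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Example: print each family's aliases and attribution, if one is listed.
for row in rows:
    print(row["Aliases"], "|", row.get("Attribution (If any)", ""))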

motif_reports.csv

This file provides information gathered from our original survey of open-source threat reports. We identified 4,369 malware hashes with 595 distinct reported family names during the survey, but we were unable to obtain some of the files and we restricted the MOTIF dataset to only files in the PE file format. The reported hash, family, source, date, URL, and IOC URL of any malware samples which did not make it into the final MOTIF dataset are located here.

MOTIF.7z

The disarmed malware samples are provided in this 1.47GB encrypted .7z file, which can be extracted using the following password:

i_assume_all_risk_opening_malware

Each file is named in the format MOTIF_MD5, where MD5 is the hash of the file prior to being disarmed.
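
The archive can be opened with any 7-Zip-compatible tool. The sketch below uses the third-party py7zr package (not a project dependency; shown only as one option) together with the password above:

import py7zr

# Decrypt and extract the disarmed samples into MOTIF_defanged/.
with py7zr.SevenZipFile("dataset/MOTIF.7z", mode="r", password="i_assume_all_risk_opening_malware") as archive:
    archive.extractall(path="MOTIF_defanged")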

X_train.dat and y_train.dat

EMBERv2 feature vectors and labels are provided in X_train.dat and y_train.dat, respectively. Feature vectors were computed using LIEF v0.9.0. These files are named for compatibility with the EMBER read_vectorized_features() function. MOTIF is not split into training and test sets; X_train.dat and y_train.dat contain feature vectors and labels for the entire dataset.
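
For instance, once EMBER is installed (see the Benchmark Models section below), the vectors can be loaded as follows. This is a minimal sketch; the shape comment assumes the EMBERv2 feature dimension of 2,381.

import ember

# Load the full MOTIF feature matrix and labels. The files follow the EMBER
# "train" naming convention, so subset="train" returns the entire dataset.
X, y = ember.read_vectorized_features("/path/to/MOTIF/dataset/", subset="train")
print(X.shape, y.shape)  # expected: (3095, 2381) and (3095,)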

Benchmark Models

We provide code for training the ML models described in our paper, located in benchmarks/. To support these models, code for modified versions of MalConv2 is included in the MalConv2/ directory.

Requirements:

Packages required for training the ML models can be installed using the following commands:

pip3 install -r requirements.txt
python3 setup.py install

Training the LightGBM or outlier detection models also requires EMBER:

pip3 install git+https://github.com/elastic/ember.git

Training the models:

The LightGBM model can be trained using the following command, where /path/to/MOTIF/dataset/ indicates the path to the dataset/ directory.

python3 lgbm.py /path/to/MOTIF/dataset/

The MalConv2 model can be trained using the following command, where /path/to/MOTIF/MOTIF_defanged/ indicates the path to the unzipped folder containing the disarmed malware samples:

python3 malconv.py /path/to/MOTIF/MOTIF_defanged/ /path/to/MOTIF/dataset/motif_dataset.jsonl

The three outlier detection models can be trained using the following command:

python3 outliers.py /path/to/MOTIF/dataset/

Proper Use of Data

Use of this dataset must follow the provided terms of licensing. We intend this dataset to be used for research purposes and have taken measures to prevent abuse by attackers. All files are prevented from running using the same technique as the SOREL dataset. We refer to their statement regarding safety and abuse of the data.

The malware we’re releasing is “disarmed” so that it will not execute. This means it would take knowledge, skill, and time to reconstitute the samples and get them to actually run. That said, we recognize that there is at least some possibility that a skilled attacker could learn techniques from these samples or use samples from the dataset to assemble attack tools to use as part of their malicious activities. However, in reality, there are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use. In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers.
