A Comprehensive Study on Learning-Based PE Malware Family Classification Methods

Datasets

Because of copyright issues, both the MalwareBazaar dataset and the MalwareDrift dataset contain only the SHA-256 hashes of the malware samples; the hashes and all related information can be found in the Datasets folder. You can download the raw malware samples from the open-source malware release websites after applying for an API key, and then use a disassembly tool to convert the malware into binary and disassembly files.

  • The MalwareBazaar dataset: you can download the samples from MalwareBazaar.
  • The MalwareDrift dataset: you can download the samples from VirusShare.
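
Before downloading, it may help to clean the hash list and prepare one request per sample. The sketch below is a minimal illustration, not part of this repository: the endpoint URL, the `Auth-Key` header, and the `query`/`sha256_hash` form fields reflect our understanding of the MalwareBazaar public API and should be checked against its current documentation. The helper names `clean_hashes` and `build_download_request` are hypothetical.

```python
import re

# Assumption: endpoint and field names follow MalwareBazaar's public API;
# verify them against the current API documentation before use.
MALWAREBAZAAR_API = "https://mb-api.abuse.ch/api/v1/"
SHA256_RE = re.compile(r"[0-9a-f]{64}")


def clean_hashes(lines):
    """Return unique, lowercase SHA-256 hashes in first-seen order."""
    seen, result = set(), []
    for line in lines:
        candidate = line.strip().lower()
        # Keep only well-formed 64-character hex digests, dropping repeats.
        if SHA256_RE.fullmatch(candidate) and candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


def build_download_request(sha256, api_key):
    """Build the (headers, form data) pair for one sample download."""
    headers = {"Auth-Key": api_key}  # assumed header name
    data = {"query": "get_file", "sha256_hash": sha256}  # assumed fields
    return headers, data
```

The returned pair could then be sent with an HTTP client, for example `requests.post(MALWAREBAZAAR_API, data=data, headers=headers)`. VirusShare uses a different API, so the request builder would need to be adapted for the MalwareDrift samples.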

Experimental Settings

| Model        | Training Strategy | Optimizer | Learning Rate | Batch Size | Input Format                                           |
|--------------|-------------------|-----------|---------------|------------|--------------------------------------------------------|
| ResNet-50    | From Scratch      | Adam      | 1e-3          | 64         | 224*224 color image                                    |
| ResNet-50    | Transfer          | Adam      | 1e-3          | All data*  | 224*224 color image                                    |
| VGG-16       | From Scratch      | SGD       | 5e-6**        | 64         | 224*224 color image                                    |
| VGG-16       | Transfer          | SGD       | 5e-6          | 64         | 224*224 color image                                    |
| Inception-V3 | From Scratch      | Adam      | 1e-3          | 64         | 224*224 color image                                    |
| Inception-V3 | Transfer          | Adam      | 1e-3          | All data   | 224*224 color image                                    |
| IMCFN        | From Scratch      | SGD       | 5e-6***       | 32         | 224*224 color image                                    |
| IMCFN        | Transfer          | SGD       | 5e-6***       | 32         | 224*224 color image                                    |
| CBOW+MLP     | -                 | SGD       | 1e-3          | 128        | CBOW: byte sequences; MLP: 256*256 matrix              |
| MalConv      | -                 | SGD       | 1e-3          | 32         | 2MB raw byte values                                    |
| MAGIC        | -                 | Adam      | 1e-4          | 10         | ACFG                                                   |
| Word2Vec+KNN | -                 | -         | -             | -          | Word2Vec: Opcode sequences; KNN distance measure: WMD  |
| MCSC         | -                 | SGD       | 5e-3          | 64         | Opcode sequences                                       |

* The batch size is set to 128 for the MalwareBazaar dataset
** The learning rate is set to 5e-5 for the Malimg dataset and 1e-5 for the MalwareBazaar dataset
*** The learning rate is set to 1e-5 for the MalwareBazaar dataset
CBOW uses the default parameters of the Word2Vec implementation in the Gensim library for Python.
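
The per-dataset footnotes can be easy to miss when reproducing a run. As a rough sketch (not code from this repository), the settings for the image-based models could be encoded as a base table plus dataset-specific overrides. The names `BASE`, `OVERRIDES` and `get_settings` are hypothetical, and the batch-size override is applied only to the row that carries the `*` marker, which is our reading of the table.

```python
# Base settings copied from the table; keys are (model, training strategy).
# "all" stands for the "All data" batch size.
BASE = {
    ("ResNet-50", "From Scratch"): {"optimizer": "Adam", "lr": 1e-3, "batch_size": 64},
    ("ResNet-50", "Transfer"): {"optimizer": "Adam", "lr": 1e-3, "batch_size": "all"},
    ("VGG-16", "From Scratch"): {"optimizer": "SGD", "lr": 5e-6, "batch_size": 64},
    ("VGG-16", "Transfer"): {"optimizer": "SGD", "lr": 5e-6, "batch_size": 64},
    ("Inception-V3", "From Scratch"): {"optimizer": "Adam", "lr": 1e-3, "batch_size": 64},
    ("Inception-V3", "Transfer"): {"optimizer": "Adam", "lr": 1e-3, "batch_size": "all"},
    ("IMCFN", "From Scratch"): {"optimizer": "SGD", "lr": 5e-6, "batch_size": 32},
    ("IMCFN", "Transfer"): {"optimizer": "SGD", "lr": 5e-6, "batch_size": 32},
}

# Footnote overrides; keys are (model, training strategy, dataset).
OVERRIDES = {
    ("ResNet-50", "Transfer", "MalwareBazaar"): {"batch_size": 128},   # footnote *
    ("VGG-16", "From Scratch", "Malimg"): {"lr": 5e-5},                # footnote **
    ("VGG-16", "From Scratch", "MalwareBazaar"): {"lr": 1e-5},         # footnote **
    ("IMCFN", "From Scratch", "MalwareBazaar"): {"lr": 1e-5},          # footnote ***
    ("IMCFN", "Transfer", "MalwareBazaar"): {"lr": 1e-5},              # footnote ***
}


def get_settings(model, strategy, dataset):
    """Return the table settings with any footnote override applied."""
    settings = dict(BASE[(model, strategy)])  # copy so BASE stays unchanged
    settings.update(OVERRIDES.get((model, strategy, dataset), {}))
    return settings
```

For example, `get_settings("VGG-16", "From Scratch", "Malimg")` yields a learning rate of 5e-5, while the same model on BIG-15 keeps the base value of 5e-6.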

Graphical Analysis of Table 4 and Table 5

Here is a more detailed graphical analysis of Table 4 and Table 5, intended to make the raw information in the paper easier to digest.

Table 4

  • The classification performance (F1-Score) of each approach on three datasets
    [Figure: classification performance]

    The figure shows the classification performance (F1-Score) of each method on the three datasets. Note that the Malimg dataset contains only malware images, so it can only be used to evaluate the four image-based methods.

  • The average classification performance (F1-Score) of each approach for three datasets
    [Figure: average classification performance]

    The figure shows the average classification performance (F1-Score) of each method across the three datasets. The F1-Score reported for each model is the mean of its F1-Scores on the three datasets.

  • The training time and resource overhead of each method on three datasets
    [Figure: resource consumption]

    The figure shows the training time (left subgraph) and resource overhead (right subgraph) required by each method on the three datasets. The bar immediately to the right of each training time bar is the memory overhead of the same model. As before, only the four image-based models are shown for the Malimg dataset.
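
The averaging step described above can be sketched in a few lines. This is an illustration only: the function name `average_f1` is hypothetical, the scores are placeholder values rather than results from the paper, and averaging over only the datasets a model was actually evaluated on is our assumption for methods that cannot run on Malimg.

```python
from statistics import mean


def average_f1(scores):
    """Map each model to the mean of its per-dataset F1-Scores."""
    return {
        model: mean(per_dataset.values())
        for model, per_dataset in scores.items()
        if per_dataset  # skip models with no recorded scores
    }


# Placeholder values for illustration only; NOT results from the paper.
example_scores = {
    "image_model": {"BIG-15": 0.5, "Malimg": 0.75, "MalwareBazaar": 1.0},
    "opcode_model": {"BIG-15": 0.5, "MalwareBazaar": 1.0},
}
averages = average_f1(example_scores)
```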

Table 5

  • The classification performance (F1-Score) of transfer learning for image-based approaches on three datasets
    [Figure: transfer learning]

    This figure shows the F1-Score obtained by each image-based model when trained from scratch and with 10%, 50%, 80%, and 100% transfer learning. The subgraphs correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively.

  • The training time and resource overhead of transfer learning for image-based approaches on three datasets
    [Figure: resource consumption]

    The rows correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively. Each row contains four models (ResNet-50, VGG-16, Inception-V3 and IMCFN). Each model has eight bars: the left four bars show the training time under 10%, 50%, 80% and 100% transfer learning, and the right four bars show the memory overhead under the same settings.
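
This README does not define what the 10%, 50%, 80% and 100% settings mean. If they are read as the fraction of training samples used for fine-tuning (an assumption on our part, to be checked against the paper), such subsets could be drawn per family as in the sketch below. The function name `stratified_subset` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted sample indices keeping about `fraction` of each family."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_family = defaultdict(list)
    for index, family in enumerate(labels):
        by_family[family].append(index)
    chosen = []
    for indices in by_family.values():
        # Keep at least one sample per family so no class disappears.
        count = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, count))
    return sorted(chosen)
```

Sampling within each family keeps the class balance of the subset close to that of the full training set, which matters when some families have very few samples.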
