A neural-based binary analysis tool

Last update: Dec 22, 2022

Introduction

This directory contains a demo of a neural-based binary analysis tool. We test the framework on multiple binary analysis tasks: (i) vulnerability detection, (ii) code similarity measures, (iii) decompilation, and (iv) malware analysis (coming later).

Requirements

Python 3.7.6
Python packages
- dgl 0.6.0
- numpy 1.18.1
- pandas 1.2.0
- scipy 1.4.1
- sklearn 0.0
- tensorboard 2.2.1
- torch 1.5.0
- torchtext 0.2.0
- tqdm 4.42.1
- wget 3.2
C++14 compatible compiler
Clang++ 3.7.1
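
The pinned versions above can be compared against the current environment with a short script. This is a minimal sketch, not part of the repository: it assumes the names in the list are the installed distribution names (a CUDA build of dgl, for example, may be published under a different name), and the helper names `installed_version` and `find_mismatches` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

try:  # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions copied from the requirements list above.
REQUIRED: Dict[str, str] = {
    "dgl": "0.6.0",
    "numpy": "1.18.1",
    "pandas": "1.2.0",
    "scipy": "1.4.1",
    "sklearn": "0.0",
    "tensorboard": "2.2.1",
    "torch": "1.5.0",
    "torchtext": "0.2.0",
    "tqdm": "4.42.1",
    "wget": "3.2",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(
    required: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> List[Tuple[str, str, Optional[str]]]:
    """List (name, wanted, found) for every package that is missing or differs."""
    mismatches = []
    for name, wanted in required.items():
        found = lookup(name)
        # Ignore local build suffixes such as "+cu101" when comparing.
        normalized = found.split("+", 1)[0] if found else None
        if normalized != wanted:
            mismatches.append((name, wanted, found))
    return mismatches


if __name__ == "__main__":
    for name, wanted, found in find_mismatches(REQUIRED):
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter is injectable so the comparison can be exercised without touching the real environment.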

Tasks and Dataset preparation

Binary code similarity measures

Download dataset
- Download POJ-104 datasets from here and extract them into data/.
Compile and preprocess
- Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
- Run python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl to split the dataset into train/valid/test sets.
- Run python preprocess/sim_preprocess.py to convert the binary code into graph data.
- *(Part of the preprocessing code is from [1].)
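
The compile-and-preprocess steps above can be chained so that a failing step stops the run. The sketch below only wraps the three commands listed; the function names `similarity_prep_commands` and `run_all` are hypothetical, and it assumes the commands are launched from the repository root after the dataset has been extracted into data/.

```python
import subprocess
import sys
from typing import List


def similarity_prep_commands(
    obj_dir: str = "data/obj",
    split_file: str = "data/split.pkl",
    python: str = sys.executable,
) -> List[List[str]]:
    """Build the three preprocessing commands in the order given above."""
    return [
        [python, "extract_obj.py", "-a", obj_dir],
        [python, "preprocess/split_dataset.py",
         "-i", obj_dir, "-m", "p", "-o", split_file],
        [python, "preprocess/sim_preprocess.py"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it too unless dry_run is set."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on incomplete output.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(similarity_prep_commands(), dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to confirm paths before starting a long compile.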

Binary vulnerability detection

Cramming the binary dataset
- The dataset is built on top of Devign [2]. We compile the entire library at the corresponding commit ID and dump the binary code of the vulnerable functions. The cramming code is given in preprocess/cram_vul_dataset.
Download preprocessed data
- Run ./preprocess.sh (clang++-3.7.1 required), or
- download the preprocessed datasets directly from here and extract them into data/.
- Then run python preprocess/vul_preprocess.py to convert the binary code into graph data.
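
Because the compile steps require clang++ 3.7.1, it can help to confirm the compiler version before running them. The following is a sketch under assumptions: it expects the compiler to be reachable as `clang++` on PATH and to print a line containing "clang version X.Y.Z" for `--version`; the helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_clang_version(version_output: str) -> Optional[str]:
    """Extract 'X.Y.Z' from the text printed by `clang++ --version`."""
    match = re.search(r"clang version (\d+\.\d+\.\d+)", version_output)
    return match.group(1) if match else None


def detect_clang_version(binary: str = "clang++") -> Optional[str]:
    """Return the version of the given compiler, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True
    )
    return parse_clang_version(result.stdout)


if __name__ == "__main__":
    found = detect_clang_version()
    if found != "3.7.1":
        print(f"clang++ 3.7.1 is required, found: {found or 'none'}")
```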

Binary decompilation [N-Bref]

Download dataset
- Download the demo datasets (raw and preprocessed data) from here and extract them into data/. (More datasets to come.)
- There is no need to convert the code into graphs again, as the data has already been preprocessed.
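
Extracting a downloaded archive into data/ can be done from Python as well. This is a minimal sketch: the archive path is whatever file was downloaded (its name and format are not specified here), the function name `extract_into_data` is hypothetical, and the format is inferred from the file extension by `shutil.unpack_archive`.

```python
import shutil
from pathlib import Path
from typing import List


def extract_into_data(archive: str, data_dir: str = "data") -> List[str]:
    """Unpack an archive into data_dir and return the top-level entry names."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The format (zip, tar, tar.gz, ...) is inferred from the extension.
    shutil.unpack_archive(str(archive), str(target))
    return sorted(p.name for p in target.iterdir())
```

Listing the top-level entries afterwards makes it easy to check that the files landed directly under data/ rather than in a nested folder.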

Training and Evaluation

Binary code similarity measures

Run cd baseline_model && python run_similarity_check.py

Binary vulnerability detection

Run cd baseline_model && python run_vulnerability_detection.py

Binary decompilation [N-Bref]

Dump the trace of tree expansion:
- To accelerate online processing of the tree output, dump the trace of the tree data by running python -m preprocess.dump_trace
Training scripts:
- First, cd baseline_model.
- To train the model using torch parallel, run python run_tree_transformer.py.
- To train it on multiple GPUs using distributed PyTorch, run python run_tree_transformer_multi_gpu.py
- To evaluate, run python run_tree_transformer.py --eval
- To evaluate a multi-gpu trained model, run python run_tree_transformer_multi_gpu.py --eval
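
The four train and evaluate variants above differ only in the script name and the --eval flag, which a small helper can make explicit. This is an illustrative sketch; `tree_transformer_command` is a hypothetical name, and only the script names and the --eval flag come from the list above.

```python
import subprocess
from typing import List


def tree_transformer_command(
    multi_gpu: bool = False, evaluate: bool = False, python: str = "python"
) -> List[str]:
    """Pick the single- or multi-GPU script and optionally add --eval."""
    script = (
        "run_tree_transformer_multi_gpu.py"
        if multi_gpu
        else "run_tree_transformer.py"
    )
    cmd = [python, script]
    if evaluate:
        cmd.append("--eval")
    return cmd


def run_in_baseline_model(cmd: List[str]) -> None:
    """Run the command from baseline_model, mirroring the `cd` step."""
    subprocess.run(cmd, cwd="baseline_model", check=True)
```

A model trained with the multi-GPU script is evaluated with the same script plus --eval, so `multi_gpu` should keep the same value for both calls.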

References

[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).

[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.

[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion." ICLR (2019).

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

Owner

Facebook Research