A neural-based binary analysis tool

Related tags

Data Analysisnbref
Overview

A neural-based binary analysis tool

Introduction

This directory contains the demo of a neural-based binary analysis tool. We test the framework using multiple binary analysis tasks: (i) vulnerability detection. (ii) code similarity measures. (iii) decompilations. (iv) malware analysis (coming later).

Requirements

  • Python 3.7.6
  • Python packages
    • dgl 0.6.0
    • numpy 1.18.1
    • pandas 1.2.0
    • scipy 1.4.1
    • sklearn 0.0
    • tensorboard 2.2.1
    • torch 1.5.0
    • torchtext 0.2.0
    • tqdm 4.42.1
    • wget 3.2
  • C++14 compatible compiler
  • Clang++ 3.7.1

Tasks and Dataset preparation

Binary code similarity measures

  1. Download dataset
    • Download POJ-104 datasets from here and extract them into data/.
  2. Compile and preprocess
    • Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
    • Run python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl to split the dataset into train/valid/test sets.
    • Run python preprocess/sim_preprocess.py to compile the binary code into graphs data.
    • *(part of the preprocessing code are from [1])

Binary Vulnerability detections

  1. Cramming the binary dataset
    • The dataset is built on top of Devign. We compile the entire library based on the commit id and dump the binary code of the vulnerable functions. The cramming code is given in preprocess/cram_vul_dataset.
  2. Download Preprocessed data
    • Run ./preprocess.sh (clang++-3.7.1 required), or
    • You can directly download the preprocessed datasets from here and extract them into data/.
    • Run python preprocess/vul_preprocess.py to compile the binary code into graphs data

Binary decompilation [N-Bref]

  1. Download dataset
    • Download the demo datasets (raw and preprocessed data) from here and extract them into data/. (More datasets to come.)
    • No need to compile the code into graph again as the data has already been preprocessed.

Training and Evaluation

Binary code similarity measures

  • Run cd baseline_model && python run_similarity_check.py

Binary Vulnerability detections

  • Run cd baseline_model && python run_vulnerability_detection.py

Binary decompilation [N-Bref]

  1. Dump the trace of tree expansion:
    • To accelerate the online processing of the tree output, we will dump the trace of the trea data by running python -m preprocess.dump_trace
  2. Training scripts:
    • First, cd baseline model.
    • To train the model using torch parallel, run python run_tree_transformer.py.
    • To train it on multi-gpu using distribute pytorch, run python run_tree_transformer_multi_gpu.py
    • To evaluate, run python run_tree_transformer.py --eval
    • To evaluate a multi-gpu trained model, run python run_tree_transformer_multi_gpu.py --eval

References

[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).

[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.

[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion.", ICLR (2019).

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

Owner
Facebook Research
Facebook Research
The Dash Enterprise App Gallery "Oil & Gas Wells" example

This app is based on the Dash Enterprise App Gallery "Oil & Gas Wells" example. For more information and more apps see: Dash App Gallery See the Dash

Austin Caudill 1 Nov 08, 2021
Lale is a Python library for semi-automated data science.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-

International Business Machines 293 Dec 29, 2022
Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

Cloudera 759 Jan 07, 2023
Handle, manipulate, and convert data with units in Python

unyt A package for handling numpy arrays with units. Often writing code that deals with data that has units can be confusing. A function might return

The yt project 304 Jan 02, 2023
Transform-Invariant Non-Negative Matrix Factorization

Transform-Invariant Non-Negative Matrix Factorization A comprehensive Python package for Non-Negative Matrix Factorization (NMF) with a focus on learn

EMD Group 6 Jul 01, 2022
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
Analyze the Gravitational wave data stored at LIGO/VIRGO observatories

Gravitational-Wave-Analysis This project showcases how to analyze the Gravitational wave data stored at LIGO/VIRGO observatories, using Python program

1 Jan 23, 2022
Ejercicios Panda usando Pandas

Readme Below we add configuration details to locally test your application To co

1 Jan 22, 2022
Project: Netflix Data Analysis and Visualization with Python

Project: Netflix Data Analysis and Visualization with Python Table of Contents General Info Installation Demo Usage and Main Functionalities Contribut

Kathrin Hälbich 2 Feb 13, 2022
First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

dbt-osmosis First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we wan

Alexander Butler 150 Jan 06, 2023
statDistros is a Python library for dealing with various statistical distributions

StatisticalDistributions statDistros statDistros is a Python library for dealing with various statistical distributions. Now it provides various stati

1 Oct 03, 2021
ped-crash-techvol: Texas Ped Crash Tech Volume Pack

ped-crash-techvol: Texas Ped Crash Tech Volume Pack In conjunction with the Final Report "Identifying Risk Factors that Lead to Increase in Fatal Pede

Network Modeling Center; Center for Transportation Research; The University of Texas at Austin 2 Sep 28, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022
X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

Nguyễn Quang Huy 5 Sep 28, 2022
Aggregating gridded data (xarray) to polygons

A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons. Check out the binder link above for a sample c

Kevin Schwarzwald 42 Nov 09, 2022
A crude Hy handle on Pandas library

Quickstart Hyenas is a curde Hy handle written on top of Pandas API to allow for more elegant access to data-scientist's powerhouse that is Pandas. In

Peter Výboch 4 Sep 05, 2022
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 09, 2023
MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

SeungHeonDoh 3 Jul 02, 2022
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022
AWS Glue ETL Code Samples

AWS Glue ETL Code Samples This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilit

AWS Samples 1.2k Jan 03, 2023