Malware-Related Sentence Classification

This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Classification using External Knowledge".

Installation

Install from source. A Python virtual environment or a Conda environment is recommended.

git clone https://github.com/chaumng/malware_related_sentence_classification.git
cd malware_related_sentence_classification
pip install -r requirements.txt

This repo has been tested on Python 3.7.

Classification and Evaluation

Preprocess data

python preprocess_data.py

Parameter searching: Classify and evaluate

The GAT weak labels are already provided in a file in this repo. To run the parameter search, use the following command. By default, it performs the second grid search. Set the argument param_grid_setting to "first_grid_search" to perform the first grid search, or to "best_setting" to run only the best setting.

python svm_param_search.py --param_grid_setting second_grid_search
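
To run several settings back to back, the command above can be wrapped in a small helper. The following is only a sketch: the names build_command and run_all are hypothetical and not part of this repo, and it assumes it is run from the repo root with the requirements installed. The three setting values and the --param_grid_setting flag are the ones named above.

```python
import subprocess
import sys

# Values of --param_grid_setting named in this README.
PARAM_GRID_SETTINGS = ("first_grid_search", "second_grid_search", "best_setting")


def build_command(setting="second_grid_search"):
    """Return the argv list for one run of svm_param_search.py (hypothetical helper)."""
    if setting not in PARAM_GRID_SETTINGS:
        raise ValueError(f"unknown param_grid_setting: {setting!r}")
    # sys.executable keeps the run inside the currently active environment.
    return [sys.executable, "svm_param_search.py", "--param_grid_setting", setting]


def run_all(settings=PARAM_GRID_SETTINGS):
    """Run each setting in turn and stop at the first failing run (hypothetical helper)."""
    for setting in settings:
        subprocess.run(build_command(setting), check=True)
```

Calling run_all() runs the first grid search, the second grid search, and the best setting in that order; run_all(["best_setting"]) runs only the best setting.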

Citation

If you find this paper or this code useful, please cite this paper:

@inproceedings{chaunguyen_et_al_2021,
  title={Enrichment of Features for Malware-Related Sentence Classification using External Knowledge},
  author={Nguyen, Chau and Tran, Vu and Nguyen, Le Minh},
  booktitle={Proceedings of the 33rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI)},
  year={2021},
  organization={IEEE},
}

Owner

Chau Nguyen
