An Unsupervised Detection Framework for Chinese Jargons in the Darknet

This repo is the Python 3 implementation of "An Unsupervised Detection Framework for Chinese Jargons in the Darknet" (Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22).

Introduction

This project proposes a Chinese jargon detection framework based on unsupervised learning.

Requirements

pip install -r requirements.txt

Data

Due to the sensitivity of darknet information, we do not distribute the dataset directly. Some samples of the dataset are shown in /dataset/sample.csv, and contact information is given below for readers who wish to request the raw corpus.

Please contact Liang Ke ([email protected]) for the Darknet corpus dataset.
The Modern Chinese Dictionary (the 7th edition) that we used for cross-corpus comparison is from here.

Code

Preprocess the raw corpus using preprocess.py to obtain the clean corpus.
Find out-of-vocabulary words using newWordsDiscovey.py, and add them to the tokenizer dictionary.
Pretrain the word-based DC-BERT model on the clean corpus using pretrain.py.
Generate word embeddings with the pretrained DC-BERT using genEmbedding.py.
Construct seed criminal keywords with findSeedKeywords.py. We provide an example list of seed criminal keywords for reference; you can add or delete words to suit your task.
Find jargon candidates (words that are related to the relevant cybercrimes and are very likely to be jargon) with findCandidate.py.
Finally, obtain the real darknet Chinese jargons detected by our framework using findJargon.py.
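
To give a rough feel for how seed keywords and word embeddings can be combined to shortlist candidates, here is a minimal sketch. It is not the code in findCandidate.py: it assumes, purely for illustration, that a candidate is any non-seed word whose embedding has a high cosine similarity to at least one seed keyword. The function names, the toy vectors, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_candidates(embeddings, seeds, threshold=0.8):
    """Return (word, score) pairs for non-seed words close to any seed.

    Assumption: closeness is the maximum cosine similarity to a seed
    keyword. The actual criterion used by the framework may differ.
    """
    seed_vecs = [embeddings[s] for s in seeds if s in embeddings]
    scored = []
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best = max((cosine(vec, sv) for sv in seed_vecs), default=0.0)
        if best >= threshold:
            scored.append((word, best))
    # Highest-scoring candidates first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings standing in for DC-BERT output (hypothetical).
toy_embeddings = {
    "seed_word": [1.0, 0.0, 0.1],
    "near_word": [0.9, 0.1, 0.1],
    "far_word": [0.0, 1.0, 0.0],
}
candidates = find_candidates(toy_embeddings, {"seed_word"}, threshold=0.8)
print(candidates)
```

With these toy vectors only near_word passes the threshold. On a real corpus the threshold would need tuning, and the shortlist would still go through the final jargon detection step.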

Citation

waiting for camera-ready
