NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models

Last update: Dec 28, 2022

Overview

NaturalCC

NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks, e.g., code summarization, code retrieval, code completion, code clone detection and type inference. Our vision is to bridge the gap between programming language and natural language through machine learning techniques.

⭐ Features

A collection of code corpus with data preprocessing
Performance benchmark
Mixed precision training
- Nvidia APEX
- Automatic Mixed Precision
Multi-GPU training
Better logging output
Various Implementations:
- TensorFlow gradient clipping
- optimizers or learning schedulers
- baseline models
- binary data formats

🚀 Installation

Requirements

PyTorch version >= 1.6.0
Python version >= 3.6
GCC/G++ > 5.0
For training new models, you'll also need an NVIDIA GPU and NCCL
(optional) For faster training, you need to install NVIDIA's apex library.
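
The Python and PyTorch minimums above can be checked with a short script. This is only a sketch and is not part of NaturalCC: the helper names `version_tuple`, `meets_minimum`, and `check_requirements` are made up for this example, and GCC and NCCL are not checked here.

```python
import sys


def version_tuple(version):
    """Turn a version string such as '1.6.0+cu101' into (1, 6, 0).

    Local suffixes after '+' and non-numeric tails such as 'a0' are ignored.
    The result is padded with zeros to at least three components.
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(found, minimum):
    """Return True when version string `found` is at least `minimum`."""
    return version_tuple(found) >= version_tuple(minimum)


def check_requirements():
    """Report whether the Python and PyTorch minimums listed above are met."""
    report = {"python>=3.6": sys.version_info >= (3, 6)}
    try:
        import torch  # only needed for the version string
        report["torch>=1.6.0"] = meets_minimum(torch.__version__, "1.6.0")
    except ImportError:
        report["torch>=1.6.0"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```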

1. Install prerequisite libraries

git clone https://github.com/xcodemind/naturalcc && cd naturalcc
pip install -r requirements.txt

Once you have installed the prerequisite libraries, you can check them with python -m env_test.

2. Build or install NaturalCC

Export your NaturalCC cache directory (data and models will be saved there) as an environment variable in your shell profile (~/.bashrc or ~/.zshrc).


echo "export NCC=/data/ncc_data" >> ~/.bashrc

Note: PyCharm may not pick up environment variables set in your shell profile, so we recommend setting the NCC variable in ncc/__init__.py instead.
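
One way to make the cache directory available even when the shell profile is not read is to resolve it in Python with a fallback. The sketch below is an assumption about how this could be done, not the actual content of ncc/__init__.py. The function name `resolve_ncc_dir` and the fallback path `~/ncc_data` are hypothetical.

```python
import os


def resolve_ncc_dir(default="~/ncc_data"):
    """Return the NaturalCC cache directory as an absolute path.

    Uses the NCC environment variable when it is set and non-empty;
    otherwise falls back to `default` (a hypothetical path for this example).
    """
    raw = os.environ.get("NCC") or default
    return os.path.abspath(os.path.expanduser(raw))


if __name__ == "__main__":
    # Prints /data/ncc_data if the export line above has taken effect.
    print(resolve_ncc_dir())
```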

Compile the Cython files to speed up execution, and register NaturalCC in your pip package list:

# compile for debug
# python setup.py build_ext --inplace
# install 
pip install --editable ./

3. Half precision computation (optional)

NaturalCC supports half precision training.

If your PyTorch version (torch.__version__) is lower than 1.6.0 and nvcc -V runs successfully, please install apex.
Otherwise, use Automatic Mixed Precision (AMP), which is available now: set amp: 1 in the YAML configuration file.
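
The rule above can be written as a small decision function. This is a sketch with a hypothetical name, `choose_half_precision_backend`, not NaturalCC code. It also adds one assumption that the text does not spell out: since native AMP (torch.cuda.amp) first shipped in PyTorch 1.6, an older PyTorch without nvcc gets no half precision backend at all.

```python
import shutil


def choose_half_precision_backend(torch_version, nvcc_available):
    """Pick 'apex', 'amp', or None from the PyTorch version and nvcc presence."""
    # Compare only major.minor, ignoring local suffixes such as '+cu113'.
    major_minor = tuple(int(p) for p in torch_version.split("+")[0].split(".")[:2])
    if major_minor < (1, 6):
        # Old PyTorch: apex must be compiled, which needs the CUDA compiler.
        return "apex" if nvcc_available else None
    # PyTorch >= 1.6 ships native Automatic Mixed Precision.
    return "amp"


if __name__ == "__main__":
    has_nvcc = shutil.which("nvcc") is not None
    print(choose_half_precision_backend("1.6.0", has_nvcc))
```

If the result is "amp", the corresponding step is to set amp: 1 in the YAML file as described above.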

4. Install GCC/G++ with conda (if you do not have permission)

Since NCC is built with Cython, your GCC/G++ version should be greater than 4.9. If you have root permission, update GCC/G++; otherwise, install GCC/G++ with conda.

# install GCC/G++ with conda
conda install -c anaconda gxx_linux-64
conda install -c conda-forge gcc_linux-64
cd ~/anaconda/envs/XXX/bin
ln -s x86_64-conda_cos6-linux-gnu-gcc gcc
ln -s x86_64-conda_cos6-linux-gnu-g++ g++
# check
conda deactivate
conda activate XXX
# then run gcc -v and g++ -v in a terminal
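
The final check can also be scripted. The sketch below asks the compiler for its version with `-dumpversion` and compares it against the 4.9 threshold mentioned above. The helper names are hypothetical, and "greater than 4.9" is read here as a strict comparison on the version tuple.

```python
import shutil
import subprocess


def parse_dumpversion(text):
    """Turn `gcc -dumpversion` output such as '7.3.0' or '9' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_new_enough(version, minimum=(4, 9)):
    """Return True when the version tuple is strictly greater than `minimum`."""
    return version > minimum


def installed_compiler_version(compiler="gcc"):
    """Return the version tuple of `compiler`, or None if it is not on PATH."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-dumpversion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_dumpversion(result.stdout)


if __name__ == "__main__":
    for name in ("gcc", "g++"):
        version = installed_compiler_version(name)
        if version is None:
            print(f"{name}: not found on PATH")
        else:
            status = "ok" if is_new_enough(version) else "too old"
            print(f"{name}: {'.'.join(map(str, version))} ({status})")
```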

📚 Dataset

Currently, we have processed the following datasets:

🤖 Implementations

Code retrieval (search)

Code completion

SeqRNN
GPT2

Heterogeneous mapping

Code summarization

Naive Copy
CodeNN
DeepCom
Seq2Seq + Attention
Nary-/ChildSum-Tree2Seq
Code2Seq
Transformer + (Sinusoidal/Relative/Learned Position Encoding)
CodeBERT
GraphCodeBERT
PLBART

📋 Experiments

Code Summarization

Dataset: Python (Wan et al.)

	BLEU-4	METEOR	ROUGE-L	Cost	Logs
Seq2Seq+Attn	25.57	14.40	39.41	0.09s/b	click here
Tree2Seq+Attn	23.35	12.59	36.49	0.48s/b	click here
Transformer	30.64	17.65	44.59	0.26s/b	click here
Transformer+RPE	31.57	17.74	45.18	0.27s/b	click here
PLBART	32.71	18.13	46.05	0.80s/b	TBC

Code Retrieval

Dataset: CodeSearchNet (Husain et al.)

MRR	Go	Java	JS	PHP	Python	Ruby	Cost	Logs
NBOW	66.59	59.92	47.15	54.75	63.33	42.86	0.16s/b	click here
Conv1d	70.87	60.49	38.81	61.92	67.29	36.53	0.30s/b	click here
BiRNN	65.80	48.60	23.23	51.36	48.28	19.35	0.74s/b	click here
SelfAttn	78.45	66.55	50.38	65.78	79.09	47.96	0.25s/b	click here

Code Completion

Dataset: Py150 (official processed) (raw)

MRR	Attr	Num	Name	Param	Tokens	Cost	Logs
LSTM	51.67	47.45	46.52	66.06	73.73	0.31s/b	click here
GPT-2	70.37	62.20	63.84	73.54	82.17	0.43s/b	click here
TravTrans	72.08	68.55	76.33	71.08	83.17	0.43s/b	click here

Type Inference

Dataset: CodeSearchNet-Java (Husain et al.)

	Acc@1 (All types)	Acc@5 (All types)	Acc@1 (Any types)	Acc@5 (Any types)	Cost	Logs
DeepTyper	0.52	0.67	0.43	0.67	0.42s/b	TBC
Transformer	0.32	0.64	0.37	0.75	0.85s/b	TBC

Heterogeneous Mapping

Dataset: OpenCL (Grewe et al.)

Accuracy	AMD	NVIDIA
Static mapping	58.82	56.91
Decision tree	70.29	74.56
Inst2vec	82.79	81.76
DeepTune	83.24	80.15

🏫 Examples & Tutorials

All commands here should be run from the root of the project folder (the path of your naturalcc checkout), for example /data/wanyao/Dropbox/ghproj-v100/naturalcc in the author's environment.

More detailed READMEs are available to walk you through NaturalCC.

Step 1: Download and process a dataset from `datasets`, and follow the instructions from the README.md file.

# ref: dataset/python_wan/README.md
# download dataset
bash dataset/python_wan/download.sh
# clean data
python -m dataset.python_wan.clean
# cast data attributes into different files
python -m dataset.python_wan.attributes_cast

# ref: dataset/python_wan/summarization/README.md
# save code tokens and docstring tokens into MMAP format
python -m dataset.python_wan.summarization.preprocess
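
For readers unfamiliar with memory-mapped storage, the sketch below shows the general idea behind keeping token ids in a flat binary file and reading single sequences back through a memory map. It is illustrative only: the on-disk layout that NaturalCC actually uses is not described in this README, and the function names and the int32-plus-offsets layout are assumptions made for this example.

```python
import mmap
import os
import tempfile
from array import array


def write_token_ids(path, sequences):
    """Write integer token sequences back to back; return their offsets.

    offsets[i] and offsets[i + 1] delimit sequence i, counted in items.
    """
    offsets = [0]
    with open(path, "wb") as f:
        for seq in sequences:
            array("i", seq).tofile(f)
            offsets.append(offsets[-1] + len(seq))
    return offsets


def read_sequence(path, offsets, index):
    """Read one sequence through a memory map without loading the whole file."""
    item = array("i").itemsize
    start, end = offsets[index] * item, offsets[index + 1] * item
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            out = array("i")
            out.frombytes(mm[start:end])
            return out.tolist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tokens.bin")
        offsets = write_token_ids(path, [[11, 12, 13], [21, 22]])
        print(read_sequence(path, offsets, 1))
```

The point of such a layout is that a data loader can fetch any example by index while the operating system pages in only the bytes that are touched.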

Step 2 (optional): Register your self-defined models

If you want to create a new model, please add your model at ncc/models and ncc/modules.
If your training policy is more complex than the defaults support, update your criterions and training procedure in ncc/criterions and ncc/trainers, respectively.

Do not forget to register your self-defined module in ncc/XX/__init__.py.
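
Toolkits in the Fairseq family usually register components through a decorator that fills a name-to-class table. The sketch below shows that pattern in plain Python. Whether ncc exposes exactly this interface is an assumption, so check the existing ncc/models/__init__.py for the real names. `MODEL_REGISTRY`, `register_model`, `build_model`, and `MySeq2Seq` are all hypothetical here.

```python
MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that records a model class under `name`."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"Cannot register duplicate model: {name}")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register_model("my_seq2seq")
class MySeq2Seq:
    """Placeholder for a self-defined model; a real one would subclass a base model."""

    def __init__(self, hidden_size=256):
        self.hidden_size = hidden_size


def build_model(name, **kwargs):
    """Look up a registered model by name and instantiate it."""
    return MODEL_REGISTRY[name](**kwargs)


if __name__ == "__main__":
    model = build_model("my_seq2seq", hidden_size=512)
    print(type(model).__name__, model.hidden_size)
```

With this pattern, a model is only discoverable once the module that defines it has been imported, which is why the __init__.py file needs updating.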

Step 3: Training and inference.

Select a task and a model from the task list, and follow the instructions in its README.md to start training.

# ref: run/summarization/transformer/README.md
# train
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
# inference
CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt

❓ FAQ

Please feel free to contact me if you run into any trouble.

😘 License and Acknowledgement

NaturalCC is MIT-licensed. The license applies to the pre-trained models as well. This project is also highly inspired by Fairseq and AllenNLP.

🔗 Related Links

NaturalCC-demo
About us: XCodeMind

❤️ Citation

Please cite as:

Under review.
