Overview

NaturalCC

NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks, e.g., code summarization, code retrieval, code completion, code clone detection and type inference. Our vision is to bridge the gap between programming language and natural language through machine learning techniques.


⭐ Features

  • A collection of code corpus with data preprocessing
  • Performance benchmark
  • Mixed precision training
    • Nvidia APEX
    • Automatic Mixed Precision
  • Multi-GPU training
  • Better logging output
  • Various Implementations:
    • TensorFlow gradient clipping
    • optimizers and learning rate schedulers
    • baseline models
    • binary data formats

🚀 Installation

Requirements

  • PyTorch version >= 1.6.0
  • Python version >= 3.6
  • GCC/G++ > 5.0
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • (optional) For faster training, install NVIDIA's apex library.

1. Install prerequisite libraries

git clone https://github.com/xcodemind/naturalcc && cd naturalcc
pip install -r requirements.txt

Once you have installed the prerequisite libraries, you can check them via python -m env_test
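If you want a quick standalone sanity check as well, the sketch below tests the Python version and whether a few packages can be imported. It is not the project's env_test module: the function name check_environment and the default package list are assumptions made for illustration.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 6), packages=("torch", "Cython")):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: the defaults mirror the requirements listed above,
    but the real env_test module may check different things.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append("Python >= %d.%d is required" % min_python)
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: " + name)
    return problems
```

Calling check_environment() with no arguments and printing the returned list is enough for a first look before running the project's own check.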

2. Build or install NaturalCC

Export your NaturalCC cache directory (data and models will be saved there) as an environment variable in your shell profile (~/.bashrc or ~/.zshrc).

echo "export NCC=/data/ncc_data" >> ~/.bashrc

Note: PyCharm does not pick up environment variables exported in your shell profile, so we recommend registering your NCC variable in ncc/__init__.py.
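One possible way to register the variable from Python is sketched below. It assumes a simple fallback is acceptable: an exported NCC value is kept, and a default is filled in only when the variable is missing. The helper name resolve_ncc_dir is hypothetical, and /data/ncc_data is just the example path from the export command.

```python
import os

# Assumption: example cache path taken from the export command; use your own.
DEFAULT_NCC_DIR = "/data/ncc_data"


def resolve_ncc_dir(environ=None, default=DEFAULT_NCC_DIR):
    """Return the NCC cache directory, registering `default` if NCC is unset.

    Hypothetical helper, not part of the NaturalCC code base.
    """
    environ = os.environ if environ is None else environ
    # setdefault keeps an already exported value and only fills in the fallback.
    return environ.setdefault("NCC", default)
```

Calling resolve_ncc_dir() once at import time would make the variable visible to code started from an IDE that did not read the shell profile.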

Compile the Cython files to speed up execution and register NaturalCC as an editable pip package:

# compile for debug
# python setup.py build_ext --inplace
# install 
pip install --editable ./

3. Half precision computation (optional)

NaturalCC supports half precision training.

  • If your PyTorch version (torch.__version__) is lower than 1.6.0 and nvcc -V runs successfully, please install apex.
  • Otherwise, use Automatic Mixed Precision (AMP), which is now available: set amp: 1 in the YAML config file.
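The choice between the two options can be written down as a small decision function. This is only a sketch of the rule stated above, not code from NaturalCC: the function names are hypothetical, and returning None for an old PyTorch without nvcc is my reading, since native AMP (torch.cuda.amp) only exists from PyTorch 1.6.0 onward.

```python
import re


def parse_version(version):
    """Turn a string such as '1.5.1+cu101' into a 3-tuple of ints, e.g. (1, 5, 1)."""
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])][:3]
    # Pad short versions such as '1.6' so tuple comparison behaves as expected.
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def choose_half_precision_backend(torch_version, nvcc_available):
    """Return 'amp', 'apex', or None (hypothetical helper following the rule above)."""
    if parse_version(torch_version) >= (1, 6, 0):
        return "amp"  # native Automatic Mixed Precision
    if nvcc_available:
        return "apex"  # older PyTorch, but apex can be compiled with nvcc
    return None  # assumption: neither option is usable in this case
```

In practice the arguments could come from torch.__version__ and from shutil.which("nvcc") is not None.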

4. Install GCC/G++ with conda (if you do not have root permission)

Since NCC is built with Cython, your GCC/G++ version should be greater than 4.9. If you have root permission, update GCC/G++; otherwise, install GCC/G++ with conda.

# install GCC/G++ with conda
conda install -c anaconda gxx_linux-64
conda install -c conda-forge gcc_linux-64
cd ~/anaconda/envs/XXX/bin
ln -s x86_64-conda_cos6-linux-gnu-gcc gcc
ln -s x86_64-conda_cos6-linux-gnu-g++ g++
# check: re-activate the environment, then run "gcc -v" and "g++ -v" in a terminal
conda deactivate
conda activate XXX
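The compiler version can also be checked from Python, for example before compiling the Cython files. The sketch below relies on the standard gcc -dumpversion flag; the helper names are hypothetical, and which minimum to compare against (the requirements list asks for more than 5.0) is left to the caller.

```python
import re
import shutil
import subprocess


def parse_gcc_version(text):
    """Extract (major, minor) from output such as '7.5.0' or '9'; None if absent."""
    match = re.search(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2) or 0))


def compiler_version(binary="gcc"):
    """Return the (major, minor) version of `binary`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-dumpversion"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text output, compatible with Python 3.6
    )
    return parse_gcc_version(result.stdout)
```

A result such as (7, 5) can then be compared directly with a tuple like (5, 0).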

📚 Dataset

Currently, we have processed the following datasets:

🤖 Implementations

Code retrieval (search)

Code completion

Heterogeneous mapping

Code summarization

📋 Experiments

Code Summarization

Dataset: Python (Wan et al.)

Model            BLEU-4  METEOR  ROUGE-L  Cost     Logs
Seq2Seq+Attn     25.57   14.40   39.41    0.09s/b  click here
Tree2Seq+Attn    23.35   12.59   36.49    0.48s/b  click here
Transformer      30.64   17.65   44.59    0.26s/b  click here
Transformer+RPE  31.57   17.74   45.18    0.27s/b  click here
PLBART           32.71   18.13   46.05    0.80s/b  TBC

Code Retrieval

Dataset: CodeSearchNet (Husain et al.)

MRR       Go     Java   JS     PHP    Python  Ruby   Cost     Logs
NBOW      66.59  59.92  47.15  54.75  63.33   42.86  0.16s/b  click here
ConV1d    70.87  60.49  38.81  61.92  67.29   36.53  0.30s/b  click here
BiRNN     65.80  48.60  23.23  51.36  48.28   19.35  0.74s/b  click here
SelfAttn  78.45  66.55  50.38  65.78  79.09   47.96  0.25s/b  click here
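For readers unfamiliar with the metric, mean reciprocal rank (MRR) averages 1/rank of the correct code snippet over all queries. The toy sketch below shows the arithmetic only; it is not the evaluation code behind the numbers above, the function names are hypothetical, and the tie handling (only strictly higher scores push the correct item down) is an assumption. The table values appear to be on a 0-100 scale, which is also an assumption.

```python
def rank_of_correct(scores, correct_index):
    """1-based rank of the correct candidate; higher scores rank first."""
    target = scores[correct_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: three queries with made-up similarity scores.
ranks = [
    rank_of_correct([0.9, 0.1, 0.3], 0),       # correct snippet ranked 1st
    rank_of_correct([0.2, 0.8, 0.5], 2),       # correct snippet ranked 2nd
    rank_of_correct([0.7, 0.6, 0.5, 0.1], 3),  # correct snippet ranked 4th
]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4) / 3
```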

Code Completion

Dataset: Py150 (official processed) (raw)

MRR        Attr   Num    Name   Param  Tokens  Cost     Logs
LSTM       51.67  47.45  46.52  66.06  73.73   0.31s/b  click here
GPT-2      70.37  62.20  63.84  73.54  82.17   0.43s/b  click here
TravTrans  72.08  68.55  76.33  71.08  83.17   0.43s/b  click here

Type Inference

Dataset: CodeSearchNet-Java (Husain et al.)

Model        Acc@1 (All types)  Acc@5 (All types)  Acc@1 (Any types)  Acc@5 (Any types)  Cost     Logs
DeepTyper    0.52               0.67               0.43               0.67               0.42s/b  TBC
Transformer  0.32               0.64               0.37               0.75               0.85s/b  TBC
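For reference, top-k accuracy (often written Acc@k) counts a prediction as correct when the true type appears among the k highest-ranked candidates. The sketch below illustrates the computation on made-up predictions; the function name and the data are hypothetical, and it does not reproduce how the "All types" and "Any types" settings are defined.

```python
def accuracy_at_k(ranked_predictions, targets, k):
    """Fraction of examples whose target is within the top-k ranked predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets) if target in preds[:k]
    )
    return hits / len(targets)


# Toy example: ranked type predictions for four variables (made-up data).
predictions = [
    ["int", "long", "String"],
    ["String", "int", "boolean"],
    ["List", "Map", "Set"],
    ["boolean", "int", "String"],
]
targets = ["int", "int", "Set", "double"]
acc1 = accuracy_at_k(predictions, targets, 1)  # only the first example hits
acc3 = accuracy_at_k(predictions, targets, 3)  # the first three examples hit
```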

Heterogeneous Mapping

Dataset: OpenCL (Grewe et al.)

Accuracy        AMD    NVIDIA
Static mapping  58.82  56.91
Decision tree   70.29  74.56
Inst2vec        82.79  81.76
DeepTune        83.24  80.15

🏫 Examples & Tutorials

Run all commands below from the root of the project folder (the path of your naturalcc checkout). For example, in my environment that is /data/wanyao/Dropbox/ghproj-v100/naturalcc.

More detailed READMEs are also available to help you get started with NaturalCC.

Step 1: Download and process a dataset from the dataset list, following the instructions in its README.md file.

# ref: dataset/python_wan/README.md
# download dataset
bash dataset/python_wan/download.sh
# clean data
python -m dataset.python_wan.clean
# cast data attributes into different files
python -m dataset.python_wan.attributes_cast

# ref: dataset/python_wan/summarization/README.md
# save code tokens and docstring tokens into MMAP format
python -m dataset.python_wan.summarization.preprocess

Step 2 (optional): Register your self-defined models

  • If you want to create a new model, please add your model at ncc/models and ncc/modules.

  • If your training policy is more complex than the defaults support, update your criterions and training procedure at ncc/criterions and ncc/trainers, respectively.

    Do not forget to register your self-defined module in ncc/XX/__init__.py.
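As an illustration of what registering a self-defined model can look like, here is a minimal registry sketch in the decorator style popularized by Fairseq. It is not NaturalCC's actual registration code: MODEL_REGISTRY, register_model, build_model and MySeq2Seq are all hypothetical names, so check the existing ncc/models/__init__.py for the real mechanism.

```python
# Hypothetical registry: maps a model name to the class that implements it.
MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that records a model class under `name`."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError("model already registered: " + name)
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register_model("my_seq2seq")
class MySeq2Seq:
    """Placeholder model; a real one would subclass the toolkit's base model."""

    def __init__(self, hidden_size=256):
        self.hidden_size = hidden_size


def build_model(name, **kwargs):
    """Look up a registered model by name and instantiate it."""
    return MODEL_REGISTRY[name](**kwargs)
```

With such a pattern, a config file only needs to name the model, and build_model("my_seq2seq", hidden_size=512) returns an instance of it.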

Step 3: Training and inference.

  • Select a task and a model from the task list and follow the instructions in its README.md to start training.

# ref: run/summarization/transformer/README.md
# train
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
# inference
CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt

❓ FAQ

Please feel free to contact me if you run into any trouble.

😘 License and Acknowledgement

NaturalCC is MIT-licensed. The license applies to the pre-trained models as well. This project is also highly inspired by Fairseq and AllenNLP.

🔗 Related Links

NaturalCC-demo
About us: XCodeMind

❀ Citation

Please cite as:

Under review.