Overview

NaturalCC

NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks, e.g., code summarization, code retrieval, code completion, code clone detection and type inference. Our vision is to bridge the gap between programming language and natural language through machine learning techniques.


Features

  • A collection of code corpora with data preprocessing
  • Performance benchmark
  • Mixed precision training
    • Nvidia APEX
    • Automatic Mixed Precision
  • Multi-GPU training
  • Better logging output
  • Various implementations:
    • TensorFlow gradient clipping
    • optimizers and learning rate schedulers
    • baseline models
    • binary data formats

🚀 Installation

Requirements

  • PyTorch version >= 1.6.0
  • Python version >= 3.6
  • GCC/G++ > 5.0
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • (Optional) For faster training, install NVIDIA's apex library.

1. Install prerequisite libraries

git clone https://github.com/xcodemind/naturalcc && cd naturalcc
pip install -r requirements.txt

Once the prerequisite libraries are installed, you can check them via python -m env_test.
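If you want to check the two version requirements by hand, a minimal sketch is shown below. It is not the env_test module from the repository; the helper names parse_version and check_requirements are made up for illustration, and only the Python and PyTorch minimums listed under Requirements are checked.

```python
import re
import sys

# Minimums taken from the Requirements list above.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 6, 0)


def parse_version(text):
    """Turn a string such as '1.6.0+cu101' into (1, 6, 0)."""
    parts = []
    # Drop local build suffixes ('+cu101') and keep leading digits per field.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.6' compares equal to '1.6.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(python_version, torch_version):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python >= 3.6 is required")
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif parse_version(torch_version) < MIN_TORCH:
        problems.append("PyTorch >= 1.6.0 is required")
    return problems


if __name__ == "__main__":
    try:
        import torch
        found = torch.__version__
    except ImportError:
        found = None
    for line in check_requirements(sys.version_info, found) or ["all checks passed"]:
        print(line)
```

The GCC/G++, GPU, and NCCL requirements are not covered by this sketch.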

2. Build or install NaturalCC

Export your NaturalCC cache directory (where data and models will be saved) as an environment variable in your shell profile (~/.bashrc or ~/.zshrc).

echo "export NCC=/data/ncc_data" >> ~/.bashrc

Note: PyCharm may not pick up environment variables set in your shell profile, so we recommend that you also register your NCC variable in ncc/__init__.py.
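One way to make the NCC directory robust against a missing environment variable is to fall back to a default path in Python. The sketch below is an assumption about how this could be done, not the actual contents of ncc/__init__.py; resolve_ncc_dir and DEFAULT_NCC_DIR are hypothetical names, and the default simply reuses the example path from the export command above.

```python
import os

# Hypothetical default; reuses the example path from the export command.
DEFAULT_NCC_DIR = "/data/ncc_data"


def resolve_ncc_dir(environ=None, default=DEFAULT_NCC_DIR):
    """Return the NCC cache directory, preferring the NCC environment variable."""
    environ = os.environ if environ is None else environ
    # An unset or empty NCC variable falls back to the default path.
    path = environ.get("NCC") or default
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    print(resolve_ncc_dir())
```

Passing a plain dict as environ makes the function easy to test without touching the real environment.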

Compile the Cython files to speed up the code, and register NaturalCC in your pip package list.

# compile for debug
# python setup.py build_ext --inplace
# install 
pip install --editable ./

3. Half precision computation (optional)

NaturalCC supports half precision training.

  • If your torch.__version__ is below 1.6.0 and nvcc -V runs successfully, please install apex.
  • Otherwise, use Automatic Mixed Precision (AMP), which is available now: set amp: 1 in the YAML config file.
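The rule above can be written as a small decision function. This is only a sketch: choose_half_precision_backend is a hypothetical helper, not part of NaturalCC, and it assumes that native AMP (torch.cuda.amp) requires PyTorch 1.6 or later, so an older PyTorch without nvcc gets no half precision backend at all.

```python
import shutil


def choose_half_precision_backend(torch_version, nvcc_available):
    """Return 'amp', 'apex', or None following the rule described above."""
    # Only major.minor matter; '1.6.0+cu101' becomes (1, 6).
    major, minor = (int(piece) for piece in torch_version.split(".")[:2])
    if (major, minor) >= (1, 6):
        return "amp"   # native Automatic Mixed Precision
    if nvcc_available:
        return "apex"  # older PyTorch, but apex can be compiled with nvcc
    return None        # assumption: no half precision backend in this case


if __name__ == "__main__":
    # shutil.which only tells us whether nvcc is on PATH.
    has_nvcc = shutil.which("nvcc") is not None
    print(choose_half_precision_backend("1.5.1", has_nvcc))
```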

4. Install GCC/G++ with conda (if you do not have root permission)

Since NCC is built with Cython, your GCC/G++ version should be greater than 4.9. If you have root permission, update GCC/G++; otherwise, install GCC/G++ with conda.

# install GCC/G++ with conda
conda install -c anaconda gxx_linux-64
conda install -c conda-forge gcc_linux-64
cd ~/anaconda/envs/XXX/bin
ln -s x86_64-conda_cos6-linux-gnu-gcc gcc
ln -s x86_64-conda_cos6-linux-gnu-g++ g++
# check
conda deactivate
conda activate XXX
# then run "gcc -v" and "g++ -v" in a terminal to verify the versions

📚 Dataset

Currently, we have processed the following datasets:

🤖 Implementations

Code retrieval (search)

Code completion

Heterogeneous mapping

Code summarization

📋 Experiments

Code Summarization

Dataset: Python (Wan et al.)

| Model | BLEU-4 | METEOR | ROUGE-L | Cost | Logs |
|---|---|---|---|---|---|
| Seq2Seq+Attn | 25.57 | 14.40 | 39.41 | 0.09s/b | click here |
| Tree2Seq+Attn | 23.35 | 12.59 | 36.49 | 0.48s/b | click here |
| Transformer | 30.64 | 17.65 | 44.59 | 0.26s/b | click here |
| Transformer+RPE | 31.57 | 17.74 | 45.18 | 0.27s/b | click here |
| PLBART | 32.71 | 18.13 | 46.05 | 0.80s/b | TBC |

Code Retrieval

Dataset: CodeSearchNet (Husain et al.)

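| Model (MRR) | Go | Java | JS | PHP | Python | Ruby | Cost | Logs |
|---|---|---|---|---|---|---|---|---|
| NBOW | 66.59 | 59.92 | 47.15 | 54.75 | 63.33 | 42.86 | 0.16s/b | click here |
| Conv1d | 70.87 | 60.49 | 38.81 | 61.92 | 67.29 | 36.53 | 0.30s/b | click here |
| BiRNN | 65.80 | 48.60 | 23.23 | 51.36 | 48.28 | 19.35 | 0.74s/b | click here |
| SelfAttn | 78.45 | 66.55 | 50.38 | 65.78 | 79.09 | 47.96 | 0.25s/b | click here |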
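For readers unfamiliar with the metric, MRR (mean reciprocal rank) averages 1/rank of the correct item over all queries. The sketch below shows the standard definition; it is not taken from the NaturalCC evaluation code, and the function name is hypothetical. The values in the table above are presumably this quantity multiplied by 100.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from 1-based ranks of the correct item per query.

    A rank of None means the correct item was not retrieved and
    contributes 0 to the sum.
    """
    if not ranks:
        return 0.0
    total = sum(1.0 / rank for rank in ranks if rank)
    return total / len(ranks)


if __name__ == "__main__":
    # Correct snippet ranked 1st, 2nd, and 4th for three queries.
    print(mean_reciprocal_rank([1, 2, 4]))
```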

Code Completion

Dataset: Py150 (official processed) (raw)

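| Model (MRR) | Attr | Num | Name | Param | Tokens | Cost | Logs |
|---|---|---|---|---|---|---|---|
| LSTM | 51.67 | 47.45 | 46.52 | 66.06 | 73.73 | 0.31s/b | click here |
| GPT-2 | 70.37 | 62.20 | 63.84 | 73.54 | 82.17 | 0.43s/b | click here |
| TravTrans | 72.08 | 68.55 | 76.33 | 71.08 | 83.17 | 0.43s/b | click here |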

Type Inference

Dataset: CodeSearchNet-Java (Husain et al.)

| Model | Acc@1 (All types) | Acc@5 (All types) | Acc@1 (Any types) | Acc@5 (Any types) | Cost | Logs |
|---|---|---|---|---|---|---|
| DeepTyper | 0.52 | 0.67 | 0.43 | 0.67 | 0.42s/b | TBC |
| Transformer | 0.32 | 0.64 | 0.37 | 0.75 | 0.85s/b | TBC |
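The accuracy columns for type inference are read here as top-k accuracy (Acc@1 and Acc@5): a prediction counts as correct when the true type appears among the k highest-ranked candidates. The following is a sketch of that standard definition under this reading, not the project's evaluation script; top_k_accuracy and the toy data are illustrative only.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k ranked predictions."""
    if not targets:
        return 0.0
    hits = sum(
        1
        for predictions, target in zip(ranked_predictions, targets)
        if target in predictions[:k]
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Toy example: candidate types per variable, best guess first.
    ranked = [
        ["int", "str", "bool"],
        ["str", "int", "bool"],
        ["bool", "int", "str"],
    ]
    truth = ["int", "int", "str"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(ranked, truth, k))
```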

Heterogeneous Mapping

Dataset: OpenCL (Grewe et al.)

| Method (accuracy) | AMD | NVIDIA |
|---|---|---|
| Static mapping | 58.82 | 56.91 |
| Decision tree | 70.29 | 74.56 |
| Inst2vec | 82.79 | 81.76 |
| DeepTune | 83.24 | 80.15 |

🏫 Examples & Tutorials

Run all commands below from the root of the project folder (the path of your naturalcc checkout). For example, the author's working directory is /data/wanyao/Dropbox/ghproj-v100/naturalcc.

More detailed READMEs are also available to help you get started with NaturalCC.

Step 1: Download and process one of the datasets, following the instructions in its README.md file.

# ref: dataset/python_wan/README.md
# download dataset
bash dataset/python_wan/download.sh
# clean data
python -m dataset.python_wan.clean
# cast data attributes into different files
python -m dataset.python_wan.attributes_cast

# ref: dataset/python_wan/summarization/README.md
# save code tokens and docstring tokens into MMAP format
python -m dataset.python_wan.summarization.preprocess

Step 2 (optional): Register your self-defined models

  • If you want to create a new model, add it under ncc/models and ncc/modules.

  • If your training policy is more complex than the defaults support, update your criterions and training procedure under ncc/criterions and ncc/trainers, respectively.

    Do not forget to add your self-defined module to ncc/XX/__init__.py.
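Since NaturalCC is inspired by Fairseq, registration may follow a decorator-based registry pattern. The sketch below illustrates that general pattern only; MODEL_REGISTRY, register_model, build_model, and ToySeq2Seq are hypothetical names and are not claimed to match the actual ncc API, so check the existing files in ncc/models for the real mechanism.

```python
# Hypothetical registry; NOT the actual ncc API.
MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that stores a model class under a unique name."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError("model already registered: " + name)
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register_model("toy_seq2seq")
class ToySeq2Seq:
    """Placeholder model used only to demonstrate registration."""

    def __init__(self, hidden_size=256):
        self.hidden_size = hidden_size


def build_model(name, **kwargs):
    """Look up a registered class by name and instantiate it."""
    return MODEL_REGISTRY[name](**kwargs)


if __name__ == "__main__":
    model = build_model("toy_seq2seq", hidden_size=8)
    print(type(model).__name__, model.hidden_size)
```

In this pattern the decorator only runs when the module is imported, which is why a new module typically has to be listed in the package's __init__.py.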

Step 3: Training and inference.

  • Select a task and a model from the task list and follow the instructions in its README.md to start training.
# ref: run/summarization/transformer/README.md
# train
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
# inference
CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt

FAQ

Please feel free to contact me if you run into any trouble.

😘 License and Acknowledgement

NaturalCC is MIT-licensed. The license applies to the pre-trained models as well. This project is also highly inspired by Fairseq and AllenNLP.

🔗 Related Links

NaturalCC-demo
About us: XCodeMind

❤️ Citation

Please cite as:

Under review.