GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot

Overview

GPT-Code-Clippy (GPT-CC)

Please refer to our new GitHub Wiki which documents our efforts in detail in creating the open source version of GitHub Copilot



Courtesy of the awesome Aimee Trevett!

Introduction

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Datasets

The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:

  • >10 GitHub stars
  • >2 commits
  • Must have a licence
  • Exclude forks
  • Size < 70708 bytes

These repositories are then combined with all of the GitHub repositories contain in The Pile.

The repositories are then filtered for duplicate files. Filtering is performed by regexing each file in each repository to obtain a list of "variables" (the tokens which only contain alphanumeric characters) and then filtering out any files which contain the same sequence of "variables. The deduplication script is available here.

The final dataset is available here. The dataset without the duplicates filtered out is also available here.

The datasheet discussing in more detail the construction, usage, and limitation of the dataset can be found here. We hope to get it officially into Huggingface's datasets library soon!

Models

The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.

The available models can be found here

The ones that perform relatively well (None improve on the standard GPT-Neo 125M model except for APPs specific models and only for the APPs task):

TODO: which is the recommended model?

Training

Training is done using the training scripts available here.

For fine-tuning GPTNeo-125M on CodeClippy dataset we used AdamW optimizer (beta1=0.9, beta2=0.95) with GPT3-like learning rate schedule (4k warmup steps from 0 to 5e-5 followed by 50k cosine decay steps to 5e-6), weight decay 0.1 and batch size 1024, sequence length 2048. The choice of relatively large batch size and low LR with long warmup are made to avoid agressive updates and preserve the knowledge contained in pretrained GPTNeo weights.

For fine-tuning GPTNe0-125M on APPS dataset we used AdamW optimizer (beta1=0.9, beta2=0.98) with linear learning rate schedule (800 warmup steps from 0 to peak LR followed by linear decay to 0, a range of value for peak LR was [1e-5; 1e-4]), weight decay 0.1 and batch size 256, sequence length 1024. We trained model for 5 epochs selecting best checkpoint judging by validation loss. The language modelling objective for APPS dataset is modified to backpropagate loss only for the tokens corresponding to code solution (refer to Hendrycks et al for more details).

For fine-tuning GPTNe0-1.3B on APPS dataset we used Adafactor optimizer with linear learning rate schedule (5k warmup steps from 0 to 2e-5 followed by linear decay to 0), weight decay 0.1 and batch size 24, sequence length 1024. The choice of hyperparameters for 1.3B model is in part determined by hardware limitations. We trained model for 5 epochs selecting best checkpoint judging by validation loss.

TODO: which is the recommended way to train GPT-CC?

Evaluation

The models are also evaluated on the APPS and HumanEval datasets.

Human Eval Results

Model [email protected] [email protected] [email protected] [email protected]
EleutherAI/gpt-neo 0.12% 0.24% 0.61% 1.22%
gpt-neo-125M-apps 0.06% 0.12% 0.30% 0.61%
dedup-filtered-no-resize-2048bs 0.00% 0.00% 0.00% 0.00%
1024-filtered 0.00% 0.00% 0.00% 0.00%
dedup-2048 0.00% 0.00% 0.00% 0.00%

APPS Eval Results

Coming soon...

Demo

A Visual Studio Code which uses the HuggingFace Inference API is available and can be found here.

We also have Huggingface's Space demo where you can specify and problem in the format of a programming competition question.

TODO: more information about this when complete.

Further Reading

For more information about GPT-CC, GitHub Copilot, etc, see:

TODO: add more further reading.

Acknowledgements

Special thanks to our contributors!!

Collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning.

Collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning Installation

Pytorch Lightning 1.6k Jan 08, 2023
Implementation of Invariant Point Attention, used for coordinate refinement in the structure module of Alphafold2, as a standalone Pytorch module

Invariant Point Attention - Pytorch Implementation of Invariant Point Attention as a standalone module, which was used in the structure module of Alph

Phil Wang 113 Jan 05, 2023
CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation (ACMMM'21 Oral Paper)

CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation (ACMMM'21 Oral Paper) (Accepted for oral presentation at ACM

Minha Kim 1 Nov 12, 2021
Scripts of Machine Learning Algorithms from Scratch. Implementations of machine learning models and algorithms using nothing but NumPy with a focus on accessibility. Aims to cover everything from basic to advance.

Algo-ScriptML Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The goal of this project is not t

Algo Phantoms 81 Nov 26, 2022
The official implementation of the CVPR2021 paper: Decoupled Dynamic Filter Networks

Decoupled Dynamic Filter Networks This repo is the official implementation of CVPR2021 paper: "Decoupled Dynamic Filter Networks". Introduction DDF is

F.S.Fire 180 Dec 30, 2022
A Dying Light 2 (DL2) PAKFile Utility for Modders and Mod Makers.

Dying Light 2 PAKFile Utility A Dying Light 2 (DL2) PAKFile Utility for Modders and Mod Makers. This tool aims to make PAKFile (.pak files) modding a

RHQ Online 12 Aug 26, 2022
Text mining project; Using distilBERT to predict authors in the classification task authorship attribution.

DistilBERT-Text-mining-authorship-attribution Dataset used: https://www.kaggle.com/azimulh/tweets-data-for-authorship-attribution-modelling/version/2

1 Jan 13, 2022
LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≥10x.

LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≥10x.

Simon Boehm 183 Jan 02, 2023
A TensorFlow implementation of Neural Program Synthesis from Diverse Demonstration Videos

ViZDoom http://vizdoom.cs.put.edu.pl ViZDoom allows developing AI bots that play Doom using only the visual information (the screen buffer). It is pri

Hyeonwoo Noh 1 Aug 19, 2020
Improving Deep Network Debuggability via Sparse Decision Layers

Improving Deep Network Debuggability via Sparse Decision Layers This repository contains the code for our paper: Leveraging Sparse Linear Layers for D

Madry Lab 35 Nov 14, 2022
StarGAN - Official PyTorch Implementation (CVPR 2018)

StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Yunjey Choi 5.1k Dec 30, 2022
【ACMMM 2021】DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning (ACMMM 2021) Overview We release the code of the DSANet (Dynamic S

Wenhao Wu 46 Dec 27, 2022
ICCV2021 Papers with Code

ICCV2021 Papers with Code

Amusi 1.4k Jan 02, 2023
Perform Linear Classification with Multi-way Data

MultiwayClassification This is an R package to perform linear classification for data with multi-way structure. The distance-weighted discrimination (

Eric F. Lock 2 Dec 15, 2020
PyTorch code for the "Deep Neural Networks with Box Convolutions" paper

Box Convolution Layer for ConvNets Single-box-conv network (from `examples/mnist.py`) learns patterns on MNIST What This Is This is a PyTorch implemen

Egor Burkov 515 Dec 18, 2022
Code implementing "Improving Deep Learning Interpretability by Saliency Guided Training"

Saliency Guided Training Code implementing "Improving Deep Learning Interpretability by Saliency Guided Training" by Aya Abdelsalam Ismail, Hector Cor

8 Sep 22, 2022
Few-Shot-Intent-Detection includes popular challenging intent detection datasets with/without OOS queries and state-of-the-art baselines and results.

Few-Shot-Intent-Detection Few-Shot-Intent-Detection is a repository designed for few-shot intent detection with/without Out-of-Scope (OOS) intents. It

Jian-Guo Zhang 73 Dec 26, 2022
New approach to benchmark VQA models

VQA Benchmarking This repository contains the web application & the python interface to evaluate VQA models. Documentation Please see the documentatio

4 Jul 25, 2022
constructing maps of intellectual influence from publication data

Influencemap Project @ ANU Influence in the academic communities has been an area of interest for researchers. This can be seen in the popularity of a

CS Metrics 13 Jun 18, 2022
App customer segmentation cohort rfm clustering

CUSTOMER SEGMENTATION COHORT RFM CLUSTERING TỔNG QUAN VỀ HỆ THỐNG DỮ LIỆU Nên chuyển qua theme màu dark thì sẽ nhìn đẹp hơn https://customer-segmentat

hieulmsc 3 Dec 18, 2021