GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot

Overview

GPT-Code-Clippy (GPT-CC)

Please refer to our new GitHub Wiki, which documents in detail our efforts to create an open source version of GitHub Copilot.



Courtesy of the awesome Aimee Trevett!

Introduction

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model (based on GPT-3 and called GPT-Codex) that is fine-tuned on publicly available code from GitHub.

Datasets

The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:

  • >10 GitHub stars
  • >2 commits
  • Must have a licence
  • Exclude forks
  • Size < 70708 bytes

These repositories are then combined with all of the GitHub repositories contained in The Pile.

The repositories are then filtered for duplicate files. Filtering is performed by running a regex over each file in each repository to obtain a list of "variables" (the tokens which contain only alphanumeric characters) and then filtering out any files which contain the same sequence of "variables". The deduplication script is available here.
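The deduplication step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual deduplication script: the token pattern, the function names `variable_fingerprint` and `deduplicate`, and the choice to hash the token sequence with SHA-256 are all assumptions made for the example.

```python
import hashlib
import re
from typing import Dict

# Assumption: a "variable" is a maximal run of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def variable_fingerprint(source: str) -> str:
    """Hash the ordered sequence of alphanumeric tokens in a file."""
    tokens = TOKEN_RE.findall(source)
    # Join with a separator that cannot occur inside a token, so that
    # ["ab", "c"] and ["a", "bc"] produce different fingerprints.
    joined = "\x00".join(tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(files: Dict[str, str]) -> Dict[str, str]:
    """Keep the first file seen for each distinct token sequence."""
    seen = set()
    kept = {}
    for path, source in files.items():
        fingerprint = variable_fingerprint(source)
        if fingerprint in seen:
            continue  # same sequence of "variables" as an earlier file
        seen.add(fingerprint)
        kept[path] = source
    return kept


if __name__ == "__main__":
    example = {
        "a.py": "x = 1 + y",
        "b.py": "x=1+y",  # differs from a.py only in whitespace
        "c.py": "z = 2",
    }
    print(list(deduplicate(example)))
```

Because only the token sequence is compared, files that differ in whitespace or punctuation alone are treated as duplicates under this scheme.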

The final dataset is available here. The dataset without the duplicates filtered out is also available here.

The datasheet, which discusses the construction, usage, and limitations of the dataset in more detail, can be found here. We hope to get it officially into Hugging Face's datasets library soon!

Models

The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.

The available models can be found here.

The ones that perform relatively well (none improve on the standard GPT-Neo 125M model, except for the APPS-specific models, and only on the APPS task):

TODO: which is the recommended model?

Training

Training is done using the training scripts available here.

To fine-tune GPTNeo-125M on the CodeClippy dataset, we used the AdamW optimizer (beta1=0.9, beta2=0.95) with a GPT-3-like learning rate schedule (4k warmup steps from 0 to 5e-5, followed by 50k cosine decay steps to 5e-6), a weight decay of 0.1, a batch size of 1024, and a sequence length of 2048. We chose a relatively large batch size and a low learning rate with a long warmup to avoid aggressive updates and preserve the knowledge contained in the pretrained GPTNeo weights.
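As a rough illustration of the warmup-then-cosine schedule described above, the learning rate at a given step could be computed as below. This is a sketch rather than the project's training code: the function name `learning_rate` is hypothetical, and holding the rate at 5e-6 after the decay ends is an assumption.

```python
import math

WARMUP_STEPS = 4_000   # linear warmup from 0 to the peak LR
DECAY_STEPS = 50_000   # cosine decay from the peak LR to the final LR
PEAK_LR = 5e-5
FINAL_LR = 5e-6


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: once the decay finishes, the LR stays at FINAL_LR.
    progress = min(1.0, (step - WARMUP_STEPS) / DECAY_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    for step in (0, 2_000, 4_000, 29_000, 54_000):
        print(step, learning_rate(step))
```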

To fine-tune GPTNeo-125M on the APPS dataset, we used the AdamW optimizer (beta1=0.9, beta2=0.98) with a linear learning rate schedule (800 warmup steps from 0 to the peak LR, followed by linear decay to 0; we tried peak LR values in the range [1e-5, 1e-4]), a weight decay of 0.1, a batch size of 256, and a sequence length of 1024. We trained the model for 5 epochs and selected the best checkpoint by validation loss. The language modelling objective for the APPS dataset is modified to backpropagate the loss only for the tokens corresponding to the code solution (refer to Hendrycks et al. for more details).
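The modified objective can be pictured as masking out the problem-statement tokens when the loss is averaged. The sketch below is a simplified, framework-free illustration and not the project's implementation: the helper names `build_labels` and `masked_nll` are hypothetical, the per-token log-probabilities are assumed to be already aligned with the labels, and the sentinel value -100 is borrowed from the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Sequence

# Sentinel marking positions excluded from the loss (assumption: same
# value as the default ignore_index of PyTorch's cross-entropy loss).
IGNORE_INDEX = -100


def build_labels(prompt_ids: Sequence[int], solution_ids: Sequence[int]) -> List[int]:
    """Mask the prompt tokens; keep the code-solution tokens as targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(solution_ids)


def masked_nll(token_log_probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean negative log-likelihood over unmasked positions only.

    token_log_probs[i] is assumed to be the log-probability the model
    assigned to the target token at position i.
    """
    kept = [
        -log_prob
        for log_prob, label in zip(token_log_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    labels = build_labels(prompt_ids=[5, 6, 7], solution_ids=[8, 9])
    # The three prompt positions do not contribute to the loss.
    print(masked_nll([-9.0, -9.0, -9.0, -1.0, -3.0], labels))
```

Since masked positions contribute nothing to the average, no gradient flows back from the problem-statement tokens.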

To fine-tune GPTNeo-1.3B on the APPS dataset, we used the Adafactor optimizer with a linear learning rate schedule (5k warmup steps from 0 to 2e-5, followed by linear decay to 0), a weight decay of 0.1, a batch size of 24, and a sequence length of 1024. The choice of hyperparameters for the 1.3B model is determined in part by hardware limitations. We trained the model for 5 epochs and selected the best checkpoint by validation loss.

TODO: which is the recommended way to train GPT-CC?

Evaluation

The models are also evaluated on the APPS and HumanEval datasets.

HumanEval Results

Model                            pass@1  pass@2  pass@5  pass@10
EleutherAI/gpt-neo               0.12%   0.24%   0.61%   1.22%
gpt-neo-125M-apps                0.06%   0.12%   0.30%   0.61%
dedup-filtered-no-resize-2048bs  0.00%   0.00%   0.00%   0.00%
1024-filtered                    0.00%   0.00%   0.00%   0.00%
dedup-2048                       0.00%   0.00%   0.00%   0.00%
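For readers unfamiliar with the metric in the table above, pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. The source does not state how the numbers above were computed, so the following is a sketch of that standard estimator rather than the project's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of samples the metric is allowed to draw (k <= n)
    """
    if n - c < k:
        # Every possible draw of k samples contains at least one pass.
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(pass_at_k(n=10, c=1, k=1))
    print(pass_at_k(n=10, c=1, k=5))
```

The benchmark score is then the mean of this per-problem estimate over all problems.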

APPS Eval Results

Coming soon...

Demo

A Visual Studio Code extension that uses the Hugging Face Inference API is available and can be found here.

We also have a Hugging Face Space demo where you can specify a problem in the format of a programming competition question.

TODO: more information about this when complete.

Further Reading

For more information about GPT-CC, GitHub Copilot, etc, see:

TODO: add more further reading.

Acknowledgements

Special thanks to our contributors!!
