Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

Related tags

Deep LearningAVATAR
Overview

AVATAR

  • Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
  • AVATAR stands for jAVA-pyThon progrAm tRanslation.
  • AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
  • Supervised fine-tuning and evaluation in terms of Computational Accuracy, see details here.

Table of Contents

Dataset

We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.

  • CodeForces
  • AtCoder
  • CodeJam
  • GeeksforGeeks
  • LeetCode
  • ProjectEuler

Data collected can be downloaded by following:

cd data
bash download.sh

To prepare the data, we perform the following steps.

  • Removing docstrings, comments, etc.
  • Use baseline models' tokenizer to perform tokenization.
  • Filter data based on length threshold (~512).
  • Perform de-duplication. (remove examples that are duplicates)

To perform the preparation, run:

cd data
bash prepare.sh

Models

We studied 8 models for program translation.

Models trained from scratch

Pre-trained models

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute the run.sh script.

# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2

# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2

# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT

# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2

# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2

# GraphCoderBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2

# PLBART
cd plbart
# fine-tuning either for Java->Python or Python-Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID

# Naive Copy
cd naivecopy
bash run.sh
  • Here, LANG1 LANG2=Java Python or LANG1 LANG2=Python Java.
  • Download pre-trained PLBART, GraphCodeBERT, and Transcoder model files by running download.sh script.
  • We trained the models on GeForce RTX 2080 ti GPUs (11019MiB).

Benchmarks

We evaluate the models' performances on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), Exact Match (EM). We report the model performances below.

Training Models Java to Python Python to Java
CA BLEU SM DM CB EM CA BLEU SM DM CB EM
None Naive Copy - 23.4 - - - 0.0 - 26.9 - - - 0.0
TransCoder 76.9 36.8 31.0 17.1 29.1 0.1 100 49.4 37.6 18.5 31.9 0.0
TC-DOBF 77.7 43.4 29.7 33.9 34.8 0.0 100 46.1 36.0 12.6 28.8 0.0
From Scratch Seq2Seq+Attn. 66.5 56.3 39.1 18.4 37.9 1.0 71.8 62.7 46.6 28.5 43.0 0.8
Transformer 61.5 38.9 34.2 16.5 29.1 0.0 67.4 45.6 45.7 26.4 37.4 0.1
Pre-trained CodeGPT 47.3 38.2 32.5 11.5 26.1 1.1 71.2 44.0 38.8 26.7 33.8 0.1
CodeGPT-adapted 48.1 38.2 32.5 12.1 26.2 1.2 68.6 42.4 37.2 27.2 33.1 0.5
CodeBERT 62.3 59.3 37.7 16.2 36.7 0.5 74.7 55.3 38.4 22.5 36.1 0.6
GraphCodeBERT 65.7 59.7 38.9 16.4 37.1 0.7 57.2 60.6 48.4 20.6 40.1 0.4
PLBARTmono 76.4 67.1 42.6 19.3 43.3 2.4 34.4 69.1 57.1 34.0 51.4 1.2
PLBARTmulti 70.4 67.1 42.0 17.6 42.4 2.4 30.8 69.4 56.6 34.5 51.8 1.0

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license, see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}
Owner
Wasi Ahmad
I am a Ph.D. student in CS at UCLA.
Wasi Ahmad
StyleGAN-Human: A Data-Centric Odyssey of Human Generation

StyleGAN-Human: A Data-Centric Odyssey of Human Generation Abstract: Unconditional human image generation is an important task in vision and graphics,

stylegan-human 762 Jan 08, 2023
MMDetection3D is an open source object detection toolbox based on PyTorch

MMDetection3D is an open source object detection toolbox based on PyTorch, towards the next-generation platform for general 3D detection. It is a part of the OpenMMLab project developed by MMLab.

OpenMMLab 3.2k Jan 05, 2023
High-Resolution Image Synthesis with Latent Diffusion Models

Latent Diffusion Models arXiv | BibTeX High-Resolution Image Synthesis with Latent Diffusion Models Robin Rombach*, Andreas Blattmann*, Dominik Lorenz

CompVis Heidelberg 5.6k Dec 30, 2022
Motion Reconstruction Code and Data for Skills from Videos (SFV)

Motion Reconstruction Code and Data for Skills from Videos (SFV) This repo contains the data and the code for motion reconstruction component of the S

268 Dec 01, 2022
Implementation of various Vision Transformers I found interesting

Implementation of various Vision Transformers I found interesting

Kim Seonghyeon 78 Dec 06, 2022
This is the official pytorch implementation of the BoxEL for the description logic EL++

BoxEL: Box EL++ Embedding This is the official pytorch implementation of the BoxEL for the description logic EL++. BoxEL++ is a geometric approach bas

1 Nov 03, 2022
Keras-tensorflow implementation of Fully Convolutional Networks for Semantic Segmentation(Unfinished)

Keras-FCN Fully convolutional networks and semantic segmentation with Keras. Models Models are found in models.py, and include ResNet and DenseNet bas

645 Dec 29, 2022
unet for image segmentation

Implementation of deep learning framework -- Unet, using Keras The architecture was inspired by U-Net: Convolutional Networks for Biomedical Image Seg

zhixuhao 4.1k Dec 31, 2022
FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks

FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks Image Classification Dataset: Google Landmark, COCO, ImageNet Model: Efficient

FedML-AI 62 Dec 10, 2022
MASS (Mueen's Algorithm for Similarity Search) - a python 2 and 3 compatible library used for searching time series sub-sequences under z-normalized Euclidean distance for similarity.

Introduction MASS allows you to search a time series for a subquery resulting in an array of distances. These array of distances enable you to identif

Matrix Profile Foundation 79 Dec 31, 2022
This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis

This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis Install the package in the requirements.txt, the

108 Dec 23, 2022
🔥RandLA-Net in Tensorflow (CVPR 2020, Oral & IEEE TPAMI 2021)

RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds (CVPR 2020) This is the official implementation of RandLA-Net (CVPR2020, Oral

Qingyong 1k Dec 30, 2022
Official PyTorch implementation of "Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble" (NeurIPS'21)

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble This is the code for reproducing the results of the paper Uncertainty-Bas

43 Nov 23, 2022
Official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution"

RealBasicVSR [Paper] This is the official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution, arXiv". This repository contain

Kelvin C.K. Chan 566 Dec 28, 2022
Using Clinical Drug Representations for Improving Mortality and Length of Stay Predictions

Using Clinical Drug Representations for Improving Mortality and Length of Stay Predictions Usage Clone the code to local. https://github.com/tanlab/MI

Computational Biology and Machine Learning lab @ TOBB ETU 3 Oct 18, 2022
The code repository for "PyCIL: A Python Toolbox for Class-Incremental Learning" in PyTorch.

PyCIL: A Python Toolbox for Class-Incremental Learning Introduction • Methods Reproduced • Reproduced Results • How To Use • License • Acknowledgement

Fu-Yun Wang 258 Dec 31, 2022
PyTorch Implementation of the paper Learning to Reweight Examples for Robust Deep Learning

Learning to Reweight Examples for Robust Deep Learning Unofficial PyTorch implementation of Learning to Reweight Examples for Robust Deep Learning. Th

Daniel Stanley Tan 325 Dec 28, 2022
Source code, data, and evaluation details for “Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Formation, and Ramifications”

Analysis of cross-lingual citations in English papers Contents initial_analysis Source code, data, and evaluation details as published at ICADL2020 ci

Tarek Saier 1 Oct 27, 2022
learning and feeling SLAM together with hands-on-experiments

modern-slam-tutorial-python Learning and feeling SLAM together with hands-on-experiments 😀 😃 😆 Dependencies Most of the examples are based on GTSAM

Giseop Kim 59 Dec 22, 2022
AAAI-22 paper: SimSR: Simple Distance-based State Representationfor Deep Reinforcement Learning

SimSR Code and dataset for the paper SimSR: Simple Distance-based State Representationfor Deep Reinforcement Learning (AAAI-22). Requirements We assum

7 Dec 19, 2022