PLUR is a collection of source code datasets suitable for graph-based machine learning.

Overview

PLUR

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.

Installation

SRC_DIR=${PWD}/src
mkdir -p ${SRC_DIR} && cd ${SRC_DIR}
# For Cubert.
git clone https://github.com/google-research/google-research --depth=1
export PYTHONPATH=${PYTHONPATH}:${SRC_DIR}/google-research
git clone https://github.com/google-research/plur && cd plur
python -m pip install -r requirements.txt
python setup.py install

Test execution on small dataset

cd plur
python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset \
  --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 \
  --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2 \
  --train_data_percentage=40 \
  --validation_data_percentage=30 \
  --test_data_percentage=30

Usage

Basic usage

Data generation (step 1)

Data generation is done by calling plur.plur_data_generation.create_dataset(). The data generation runs in two stages:

  1. Convert raw data to plur.utils.GraphToOutputExample.
  2. Convert plur.utils.GraphToOutputExample to TFExample.

Stage 1 is unique for each dataset, but stage 2 is the same for almost all datasets.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

plur_data_generation.py also provides a command line interface, but it offers less flexibility.

python3 plur_data_generation.py --stage_1_dir=/tmp/code2seq_dataset/stage_1 --stage_2_dir=/tmp/code2seq_dataset/stage_2

Data loader (step 2)

After the data is generated, you can use PlurDataLoader to load the data. The data loader loads TFExamples but returns them as numpy arrays.

from plur.plur_data_loader import PlurDataLoader
from plur.util import constants

dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
split = constants.TRAIN_SPLIT_NAME
batch_size = 32
repeat_count = -1
drop_remainder = True
train_data_generator = PlurDataLoader(dataset_stage_2_directory, split, batch_size, repeat_count, drop_remainder)

for batch_data in train_data_generator:
  # your training loop...

Training (step 3)

This is the part where you use your own model to train on the PLUR data.

The models and the training code from the PLUR paper are not yet part of the current release. We plan to release it in the near future.

Evaluating (step 4)

Once the training is finished, you can generate the predictions on the test data and use plur_evaluator.py to evaluate the performance. plur_evaluator.py works in offline mode, meaning that it expects a file containing the ground truths, and a file containing the predictions.

python3 plur_evaluator.py --dataset_name=code2seq_dataset --target_file=/tmp/code2seq_dataset/targets.txt --prediction_file=/tmp/code2seq_dataset/predictions.txt

Transforming and filtering data

If there is something fundamental you want to change in the dataset, you should apply them in stage 1 of data generation, otherwise apply them in stage 2. The idea is that stage 1 should only be run once per dataset (to create the plur.utils.GraphToOutputExample), and stage 2 should be run each time you want to train on different data (to create the TFRecords).

All transformation and filtering functions are applied on plur.utils.GraphToOutputExample, see plur.utils.GraphToOutputExample for more information.

E.g. a transformation that can be run in stage 1 is that your model expects that graphs in the dataset have no loop, and you write your transformation function to remove loops. This will ensure that stage 2 will read data where the graph has no loops.

E.g. of filters that can be run in stage 2 is that you want to check your model performance on different graph sizes in terms of number of nodes. You write your own filter function to filter graphs with a large number of nodes.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
def _filter_graph_size(graph_to_output_example, graph_size=1024):
  return len(graph_to_output_example.get_nodes()) <= graph_size
stage_2_kwargs = dict(
    train_filter_funcs=(_filter_graph_size,),
    validation_filter_funcs=(_filter_graph_size,)
)
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

Advanced usage

plur.plur_data_generation.create_dataset() is just a thin wrapper around plur.stage_1.plur_dataset and plur.stage_2.graph_to_output_example_to_tfexample.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

is equivalent to

from plur.stage_1.code2seq_dataset import Code2seqDataset
from plur.stage_2.graph_to_output_example_to_tfexample import GraphToOutputExampleToTfexample

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
dataset = Code2seqDataset(dataset_stage_1_directory)
dataest.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

dataset = GraphToOutputExampleToTfexample(dataset_stage_1_directory, dataset_stage_2_directory, dataset_name)
dataset.stage_2_mkdirs()
dataset.run_pipeline()

You can check out plur.stage_1.code2seq_dataset for arguments relevant for code2seq dataset. For example code2seq dataset provides java-small, java-med and java-large datasets. Therefore you can create a java-large dataset in this way.

from plur.stage_1.code2seq_dataset import Code2seqDataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'

dataset = Code2seqDataset(dataset_stage_1_directory, dataset_size='large')
dataest.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

Adding a new dataset

All datasets should inherit plur.stage_1.plur_dataset.PlurDataset, and placed under plur/stage_1/, which requires you to implement:

  • download_dataset(): Code to download the dataset, we provide download_dataset_using_git() to download from git and download_dataset_using_requests() to download from a URL, which also works with a Google Drive URL. In download_dataset_using_git() we download the dataset from a specific commit id. In download_dataset_using_requests() we check the sha1sum for the downloaded files. This is to ensure that the same version of PLUR downloads the same raw data.
  • get_all_raw_data_paths(): It should return a list of paths, where each path is a file containing the raw data in the datasets.
  • raw_data_paths_to_raw_data_do_fn(): It should return a beam.DoFn class that overrides process(). The process() should tell beam how to open the files returned by get_all_raw_data_paths(). It is also here we define if the data belongs to any split (train/validation/test).
  • raw_data_to_graph_to_output_example(): This function transforms raw data from raw_data_paths_to_raw_data_do_fn() to GraphToOutputExample.

Then add/change the following lines in plur/plur_data_generation.py:

from plur.stage_1.foo_dataset import FooDataset

flags.DEFINE_enum('dataset_name', 'dummy_dataset',
                  ['code2seq_dataset', 'dummy_dataset',
                   'funcom_dataset', 'great_var_misuse_dataset',
                   'hoppity_single_ast_diff_dataset',
                   'manysstubs4j_dataset', 'foo_dataset'],
                  'Name of the dataset to generate data.')


def get_dataset_class(dataset_name):
  """Get the dataset class based on dataset_name."""
  if dataset_name == 'code2seq_dataset':
    return Code2SeqDataset
  elif dataset_name == 'dummy_dataset':
    return DummyDataset
  elif dataset_name == 'funcom_dataset':
    return FuncomDataset
  elif dataset_name == 'great_var_misuse_dataset':
    return GreatVarMisuseDataset
  elif dataset_name == 'hoppity_single_ast_diff_dataset':
    return HoppitySingleAstDiffDataset
  elif dataset_name == 'manysstubs4j_dataset':
    return ManySStuBs4JDataset
  elif dataset_name == 'foo_dataset':
    return FooDataset
  else:
    raise ValueError('{} is not supported.'.format(dataset_name))

Evaluation details

The details of how evaluation is performed are in plur/eval/README.md.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an officially supported Google product.

Citation

Please cite the PLUR paper, Chen et al. https://proceedings.neurips.cc//paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html

Owner
Google Research
Google Research
Bayesian optimization based on Gaussian processes (BO-GP) for CFD simulations.

BO-GP Bayesian optimization based on Gaussian processes (BO-GP) for CFD simulations. The BO-GP codes are developed using GPy and GPyOpt. The optimizer

KTH Mechanics 8 Mar 31, 2022
Create large-scale ML-driven multiscale simulation ensembles to study the interactions

MuMMI RAS v0.1 Released: Nov 16, 2021 MuMMI RAS is the application component of the MuMMI framework developed to create large-scale ML-driven multisca

4 Feb 16, 2022
The unified machine learning framework, enabling framework-agnostic functions, layers and libraries.

The unified machine learning framework, enabling framework-agnostic functions, layers and libraries. Contents Overview In a Nutshell Where Next? Overv

Ivy 8.2k Dec 31, 2022
Bayesian Modeling and Computation in Python

Bayesian Modeling and Computation in Python Open access and Code This repository contains the open access version of the text and the code examples in

Bayesian Modeling and Computation in Python 339 Jan 02, 2023
MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use

Pachter Lab 26 Nov 29, 2022
Flightfare-Prediction - It is a Flightfare Prediction Web Application Using Machine learning,Python and flask

Flight_fare-Prediction It is a Flight_fare Prediction Web Application Using Machine learning,Python and flask Using Machine leaning i have created a F

1 Dec 06, 2022
Machine Learning toolbox for Humans

Reproducible Experiment Platform (REP) REP is ipython-based environment for conducting data-driven research in a consistent and reproducible way. Main

Yandex 663 Dec 31, 2022
dirty_cat is a Python module for machine-learning on dirty categorical variables.

dirty_cat dirty_cat is a Python module for machine-learning on dirty categorical variables.

637 Dec 29, 2022
Python Automated Machine Learning library for tabular data.

Simple but powerful Automated Machine Learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate routine Data Scie

Daniel Khromov 47 Dec 17, 2022
A collection of interactive machine-learning experiments: 🏋️models training + 🎨models demo

🤖 Interactive Machine Learning experiments: 🏋️models training + 🎨models demo

Oleksii Trekhleb 1.4k Jan 06, 2023
cleanlab is the data-centric ML ops package for machine learning with noisy labels.

cleanlab is the data-centric ML ops package for machine learning with noisy labels. cleanlab cleans labels and supports finding, quantifying, and lear

Cleanlab 51 Nov 28, 2022
scikit-multimodallearn is a Python package implementing algorithms multimodal data.

scikit-multimodallearn is a Python package implementing algorithms multimodal data. It is compatible with scikit-learn, a popul

12 Jun 29, 2022
This project has Classification and Clustering done Via kNN and K-Means respectfully

This project has Classification and Clustering done Via kNN and K-Means respectfully. It later tests its efficiency via F1/accuracy/recall/precision for kNN and Davies-Bouldin Index for Clustering. T

Mohammad Ali Mustafa 0 Jan 20, 2022
Model factory is a ML training platform to help engineers to build ML models at scale

Model Factory Machine learning today is powering many businesses today, e.g., search engine, e-commerce, news or feed recommendation. Training high qu

16 Sep 23, 2022
Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

Regularized Greedy Forest Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better r

RGF-team 363 Dec 14, 2022
XGBoost + Optuna

AutoXGB XGBoost + Optuna: no brainer auto train xgboost directly from CSV files auto tune xgboost using optuna auto serve best xgboot model using fast

abhishek thakur 517 Dec 31, 2022
easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

Neuron AI 5 Jun 18, 2022
Relevance Vector Machine implementation using the scikit-learn API.

scikit-rvm scikit-rvm is a Python module implementing the Relevance Vector Machine (RVM) machine learning technique using the scikit-learn API. Quicks

James Ritchie 204 Nov 18, 2022
Python ML pipeline that showcases mltrace functionality.

mltrace tutorial Date: October 2021 This tutorial builds a training and testing pipeline for a toy ML prediction problem: to predict whether a passeng

Log Labs 28 Nov 09, 2022
Coursera Machine Learning - Python code

Coursera Machine Learning This repository contains python implementations of certain exercises from the course by Andrew Ng. For a number of assignmen

Jordi Warmenhoven 859 Dec 10, 2022