Few-shot NLP benchmark for unified, rigorous eval

FLEX

FLEX is a benchmark and framework for unified, rigorous few-shot NLP evaluation. FLEX enables:

  • First-class NLP support
  • Support for meta-training
  • Reproducible few-shot evaluations
  • Extensible benchmark creation (benchmarks defined using HuggingFace Datasets)
  • Advanced sampling functions for creating episodes with class imbalance, etc.

For more context, see our arXiv preprint.

Together with FLEX, we also released a simple yet strong few-shot model called UniFew. For more details, see our preprint.

Leaderboards

These instructions are geared towards users of the first benchmark created with this framework. The benchmark has two leaderboards, for the Pretraining-Only and Meta-Trained protocols described in Section 4.2 of our paper:

  • FLEX (Pretraining-Only): for models that do not use meta-training data related to the test tasks (do not follow the Model Training section below).
  • FLEX-META (Meta-Trained): for models that use only the provided meta-training and meta-validation data (please do see the Model Training section below).

Installation

  • Clone the repository: git clone git@github.com:allenai/flex.git
  • Create a Python 3 environment (3.7 or greater), e.g., using conda create --name flex python=3.9
  • Activate the environment: conda activate flex
  • Install the package locally with pip install -e .

Data Preparation

Creating the data for the flex challenge for the first time takes about 10 minutes (on a recent MacBook Pro with a broadband connection) and requires 3GB of disk space. You can start this process by running:

python -c "import fewshot; fewshot.make_challenge('flex');"

You can control the location of the cached data by setting the environment variable HF_DATASETS_CACHE. If you have not set this variable, the location should default to ~/.cache/huggingface/datasets/. See the HuggingFace docs for more details.
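
As a minimal sketch of the cache setting described above, the variable can also be set from Python, provided this happens before fewshot (and therefore HuggingFace Datasets) is imported. The helper name configure_cache is a hypothetical illustration and not part of FLEX.

```python
import os
from pathlib import Path


def configure_cache(cache_dir: str) -> Path:
    """Point HuggingFace Datasets at cache_dir (hypothetical helper, not part of FLEX)."""
    path = Path(cache_dir).expanduser().resolve()
    # Create the directory up front so a bad path fails early.
    path.mkdir(parents=True, exist_ok=True)
    # Set the variable before `import fewshot`, so the library can pick it up on import.
    os.environ["HF_DATASETS_CACHE"] = str(path)
    return path
```

Calling configure_cache with a directory of your choice and only then running import fewshot; fewshot.make_challenge('flex') should place the prepared data under that directory.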

Model Evaluation

"Challenges" are datasets of sampled tasks for evaluation. They are defined in fewshot/challenges/__init__.py.

To evaluate a model on the flex challenge (our first challenge), write a program that produces a predictions.json file, for example:

#!/usr/bin/env python3
import random
from typing import Iterable, Dict, Any, Sequence
import fewshot


class YourModel(fewshot.Model):
    def fit_and_predict(
        self,
        support_x: Iterable[Dict[str, Any]],
        support_y: Iterable[str],
        target_x: Iterable[Dict[str, Any]],
        metadata: Dict[str, Any]
    ) -> Sequence[str]:
        """Return random label predictions for a fewshot task."""
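        # Texts and labels of the few-shot support (training) examples;
        # this random baseline ignores them.
        train_x = [d['txt'] for d in support_x]
        train_y = support_y
        # Texts of the target (test) examples to label.
        test_x = [d['txt'] for d in target_x]
        # Pick a label uniformly at random for each test example.
        test_y = [random.choice(metadata['labels']) for _ in test_x]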
        # >>> print(test_y)
        # ['some', 'list', 'of', 'label', 'predictions']
        return test_y


if __name__ == '__main__':
    evaluator = fewshot.make_challenge("flex")
    model = YourModel()
    evaluator.save_model_predictions(model=model, save_path='/path/to/predictions.json')

Warning: Calling fewshot.make_challenge("flex") above requires some time to prepare all the necessary data (see the "Data Preparation" section).

Running the above script produces /path/to/predictions.json with contents formatted as:

{
    "[QUESTION_ID]": {
        "label": "[CLASS_LABEL]",  # Currently an integer converted to a string
        "score": float  # Only used for ranking tasks
    },
    ...
}

Each [QUESTION_ID] is an ID for a test example in a few-shot problem.
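
Before running the full validation, a quick structural check can catch obvious formatting mistakes. The sketch below assumes the file is a JSON object mapping ID strings to objects with a string "label" and a numeric "score" that may be omitted for non-ranking tasks (the optionality is an assumption). The function names are hypothetical, and the fewshot validate command remains the authoritative check.

```python
import json
from typing import Any, Dict, List


def check_predictions(preds: Dict[str, Any]) -> List[str]:
    """Return a list of format problems, empty if none (hypothetical helper)."""
    problems = []
    for qid, entry in preds.items():
        if not isinstance(entry, dict):
            problems.append(f"{qid}: entry is not an object")
            continue
        # Labels are expected as strings (currently integers converted to strings).
        if not isinstance(entry.get("label"), str):
            problems.append(f"{qid}: 'label' must be a string")
        score = entry.get("score")
        # Assumption: 'score' may be absent; bool is excluded since it subclasses int.
        if score is not None and (
            isinstance(score, bool) or not isinstance(score, (int, float))
        ):
            problems.append(f"{qid}: 'score' must be a number")
    return problems


def check_predictions_file(path: str) -> List[str]:
    """Load a predictions file and run the structural check on it."""
    with open(path) as f:
        return check_predictions(json.load(f))
```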

[Optional] Parallelizing Evaluation

Two options are available for parallelizing evaluation.

First, one can restrict evaluation to a subset of tasks with indices from [START] to [STOP] (exclusive) via

evaluator.save_model_predictions(model=model, start_task_index=[START], stop_task_index=[STOP])

Notes:

  • You may use stop_task_index=None (or omit it) to avoid specifying an end.
  • You can find the total number of tasks in the challenge with fewshot.get_challenge_spec([CHALLENGE]).num_tasks.
  • To merge partial evaluation outputs into a complete predictions.json file, use fewshot merge partial1.json partial2.json ... predictions.json.
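
One way to choose the [START] and [STOP] values for several workers is to split the task indices into contiguous, near-equal ranges. The following is a minimal sketch; shard_ranges is a hypothetical helper and not part of the fewshot package.

```python
from typing import List, Tuple


def shard_ranges(num_tasks: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split range(num_tasks) into contiguous (start, stop) pairs, stop exclusive."""
    if num_tasks < 0 or num_shards < 1:
        raise ValueError("num_tasks must be >= 0 and num_shards must be >= 1")
    base, extra = divmod(num_tasks, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards each take one additional task.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges
```

Each worker would then pass its own pair as start_task_index and stop_task_index, write to a separate output file, and the partial files can be combined with fewshot merge. If there are more shards than tasks, some ranges are empty and can be skipped.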

The second option will call your model's .fit_and_predict() method with batches of [BATCH_SIZE] tasks, via

evaluator.save_model_predictions(model=model, batched=True, batch_size=[BATCH_SIZE])

Result Validation and Scoring

To validate the contents of your predictions, run:

fewshot validate --challenge_name flex --predictions /path/to/predictions.json

This validates all the inputs and takes some time. To validate predictions for a different challenge, replace flex with that challenge's name.

(There is also a score CLI command which should not be used on the final challenge except when reporting final results.)

Model Training

For the meta-training protocol (e.g., the FLEX-META leaderboard), challenges come with a set of related training and validation data. This data is most easily accessible in one of two formats:

  1. Iterable from sampled episodes. fewshot.get_challenge_spec('flex').get_sampler(split='[SPLIT]') returns an iterable that samples datasets and episodes from meta-training or meta-validation datasets, via [SPLIT]='train' or [SPLIT]='val', respectively. The sampler defaults to the fewshot.samplers.Sample2WayMax8ShotCfg sampler configuration (for the fewshot.samplers.sample.Sampler class), but can be reconfigured.

  2. Raw dataset stores. This option is for directly accessing the raw data. fewshot.get_challenge_spec('flex').get_stores(split='[SPLIT]') returns a mapping from dataset names to fewshot.datasets.store.Store instances. Each Store instance has a Store.store attribute containing a raw HuggingFace Dataset instance. It also has a Store.label attribute holding the Dataset key for the target label (e.g., accessed via Store.store[Store.label]); the FLEX-formatted text is available under the flex.txt key (e.g., via Store.store['flex.txt']).
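
To illustrate the raw-store access pattern, the sketch below counts target labels per dataset using only the store and label attributes described above. It runs on stand-in objects rather than real FLEX data: label_counts, toy_stores, and the toy values are hypothetical and exist only for illustration.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Dict, Mapping


def label_counts(stores: Mapping[str, Any]) -> Dict[str, Counter]:
    """Count target labels per dataset, given a name -> Store-like mapping."""
    counts = {}
    for name, store in stores.items():
        # store.label is the label key; indexing store.store with it
        # yields the label values for that dataset.
        counts[name] = Counter(store.store[store.label])
    return counts


# Stand-in objects mimicking the two attributes (not real FLEX data).
toy_stores = {
    "toy_dataset": SimpleNamespace(
        label="label",
        store={"label": [0, 1, 1], "flex.txt": ["a", "b", "c"]},
    )
}
toy_counts = label_counts(toy_stores)
```

With the package installed, the same function could presumably be applied to the mapping returned by fewshot.get_challenge_spec('flex').get_stores(split='train').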

Two examples of these respective approaches are available at:

  1. The UniFew model repository. For more details on UniFew, see also the FLEX arXiv paper.
  2. The baselines/bao/ directory, for training and evaluating the approach described in the following paper:

Yujia Bao*, Menghua Wu*, Shiyu Chang, and Regina Barzilay. Few-shot Text Classification with Distributional Signatures. In International Conference on Learning Representations, 2020.

Benchmark Construction and Optimization

To add a new benchmark (challenge) named [NEW_CHALLENGE], you must edit fewshot/challenges/__init__.py or otherwise add it to the registry. In the usage instructions above, substitute [NEW_CHALLENGE] for flex when calling fewshot.get_challenge_spec('[NEW_CHALLENGE]') and fewshot.make_challenge('[NEW_CHALLENGE]').

For an example of how to optimize the sample size of the challenge, see scripts/README-sample-size.md.

Attribution

If you make use of our framework, benchmark, or model, please cite our preprint:

@misc{bragg2021flex,
      title={FLEX: Unifying Evaluation for Few-Shot NLP},
      author={Jonathan Bragg and Arman Cohan and Kyle Lo and Iz Beltagy},
      year={2021},
      eprint={2107.07170},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}