Few-shot NLP benchmark for unified, rigorous eval

FLEX

FLEX is a benchmark and framework for unified, rigorous few-shot NLP evaluation. FLEX enables:

  • First-class NLP support
  • Support for meta-training
  • Reproducible few-shot evaluations
  • Extensible benchmark creation (benchmarks defined using HuggingFace Datasets)
  • Advanced sampling functions for creating episodes with class imbalance, etc.

For more context, see our arXiv preprint.

Together with FLEX, we also released a simple yet strong few-shot model called UniFew. For more details, see our preprint.

Leaderboards

These instructions are geared towards users of the first benchmark created with this framework. The benchmark has two leaderboards, for the Pretraining-Only and Meta-Trained protocols described in Section 4.2 of our paper:

  • FLEX (Pretraining-Only): for models that do not use meta-training data related to the test tasks (do not follow the Model Training section below).
  • FLEX-META (Meta-Trained): for models that use only the provided meta-training and meta-validation data (please do see the Model Training section below).

Installation

  • Clone the repository: git clone git@github.com:allenai/flex.git
  • Create a Python 3 environment (3.7 or greater), e.g., using conda create --name flex python=3.9
  • Activate the environment: conda activate flex
  • Install the package locally with pip install -e .

Data Preparation

Creating the data for the flex challenge for the first time takes about 10 minutes (on a recent MacBook Pro with a broadband connection) and requires 3GB of disk space. You can initiate this process by running:

python -c "import fewshot; fewshot.make_challenge('flex');"

You can control the location of the cached data by setting the environment variable HF_DATASETS_CACHE. If you have not set this variable, the location should default to ~/.cache/huggingface/datasets/. See the HuggingFace docs for more details.
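As a minimal sketch of the lookup described above, the following helper reports which directory the setting implies. It only mirrors the default stated in this section and does not call into HuggingFace Datasets; resolve_cache_dir is a hypothetical name, not part of fewshot or datasets.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the dataset cache directory implied by HF_DATASETS_CACHE.

    Hypothetical helper: it mirrors the default documented in this README
    and does not inspect the HuggingFace Datasets library itself.
    """
    env = os.environ if env is None else env
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    # Fall back to the documented default when the variable is unset.
    return Path(env.get("HF_DATASETS_CACHE", str(default))).expanduser()


if __name__ == "__main__":
    print(resolve_cache_dir())
```

Setting the variable in the shell before launching Python is the safest route, since the value is then in place when the libraries are imported.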

Model Evaluation

"Challenges" are datasets of sampled tasks for evaluation. They are defined in fewshot/challenges/__init__.py.

To evaluate a model on challenge flex (our first challenge), you should write a program that produces a predictions.json, for example:

#!/usr/bin/env python3
import random
from typing import Iterable, Dict, Any, Sequence
import fewshot


class YourModel(fewshot.Model):
    def fit_and_predict(
        self,
        support_x: Iterable[Dict[str, Any]],
        support_y: Iterable[str],
        target_x: Iterable[Dict[str, Any]],
        metadata: Dict[str, Any]
    ) -> Sequence[str]:
        """Return random label predictions for a fewshot task."""
        train_x = [d['txt'] for d in support_x]
        train_y = support_y
        test_x = [d['txt'] for d in target_x]
        test_y = [random.choice(metadata['labels']) for _ in test_x]
        # >>> print(test_y)
        # ['some', 'list', 'of', 'label', 'predictions']
        return test_y


if __name__ == '__main__':
    evaluator = fewshot.make_challenge("flex")
    model = YourModel()
    evaluator.save_model_predictions(model=model, save_path='/path/to/predictions.json')

Warning: Calling fewshot.make_challenge("flex") above requires some time to prepare all the necessary data (see the "Data Preparation" section).

Running the above script produces /path/to/predictions.json with contents formatted as:

{
    "[QUESTION_ID]": {
        "label": "[CLASS_LABEL]",  # Currently an integer converted to a string
        "score": float  # Only used for ranking tasks
    },
    ...
}

Each [QUESTION_ID] is an ID for a test example in a few-shot problem.
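Before running the full validation, a quick local sanity check of this structure can catch obvious mistakes. The sketch below is not the library's validator: find_format_problems is a hypothetical helper, and it assumes that "score" may be omitted for non-ranking tasks, which this README implies but does not state outright.

```python
import json
from typing import Any, Dict, List


def find_format_problems(predictions: Dict[str, Any]) -> List[str]:
    """Collect human-readable problems with a predictions mapping."""
    problems = []
    for question_id, entry in predictions.items():
        if not isinstance(entry, dict):
            problems.append(f"{question_id}: entry is not an object")
            continue
        # Labels are stored as strings, even when they encode integers.
        if not isinstance(entry.get("label"), str):
            problems.append(f"{question_id}: 'label' must be a string")
        score = entry.get("score")  # assumed optional (ranking tasks only)
        # bool is a subclass of int, so reject it explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if score is not None and not is_number:
            problems.append(f"{question_id}: 'score' must be a number")
    return problems


if __name__ == "__main__":
    example = json.loads('{"q1": {"label": "0", "score": 0.5}, "q2": {"label": 1}}')
    for problem in find_format_problems(example):
        print(problem)
```

An empty result only means the file has the expected shape; the fewshot validate command remains the authoritative check.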

[Optional] Parallelizing Evaluation

Two options are available for parallelizing evaluation.

First, one can restrict evaluation to a subset of tasks with indices from [START] to [STOP] (exclusive) via

evaluator.save_model_predictions(model=model, start_task_index=[START], stop_task_index=[STOP])

Notes:

  • You may use stop_task_index=None (or omit it) to avoid specifying an end.
  • You can find the total number of tasks in the challenge with fewshot.get_challenge_spec([CHALLENGE]).num_tasks.
  • To merge partial evaluation outputs into a complete predictions.json file, use fewshot merge partial1.json partial2.json ... predictions.json.
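One way to choose the [START] and [STOP] values for several workers is to cut the task indices into contiguous, near-equal ranges. The helper below is an illustrative sketch; split_task_ranges is a hypothetical name and is not provided by fewshot.

```python
from typing import List, Tuple


def split_task_ranges(num_tasks: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split range(num_tasks) into contiguous (start, stop) pairs.

    Each stop is exclusive, matching stop_task_index above. Workers that
    would receive no tasks are dropped from the result.
    """
    if num_tasks < 0 or num_workers < 1:
        raise ValueError("num_tasks must be >= 0 and num_workers >= 1")
    base, extra = divmod(num_tasks, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional task each.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for start, stop in split_task_ranges(num_tasks=10, num_workers=3):
        print(start, stop)
```

Each pair can then be passed as start_task_index and stop_task_index, presumably with a separate save_path per worker, with num_tasks taken from fewshot.get_challenge_spec('flex').num_tasks and the partial files combined using fewshot merge.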

The second option will call your model's .fit_and_predict() method with batches of [BATCH_SIZE] tasks, via

evaluator.save_model_predictions(model=model, batched=True, batch_size=[BATCH_SIZE])

Result Validation and Scoring

To validate the contents of your predictions, run:

fewshot validate --challenge_name flex --predictions /path/to/predictions.json

This validates all the inputs and takes some time. To evaluate on a different challenge, substitute its name for flex.

(There is also a score CLI command, which should not be used on the final challenge except when reporting final results.)

Model Training

For the meta-training protocol (e.g., the FLEX-META leaderboard), challenges come with a set of related training and validation data. This data is most easily accessible in one of two formats:

  1. Iterable from sampled episodes. fewshot.get_challenge_spec('flex').get_sampler(split='[SPLIT]') returns an iterable that samples datasets and episodes from meta-training or meta-validation datasets, via [SPLIT]='train' or [SPLIT]='val', respectively. The sampler defaults to the fewshot.samplers.Sample2WayMax8ShotCfg sampler configuration (for the fewshot.samplers.sample.Sampler class), but can be reconfigured.

  2. Raw dataset stores. This option is for directly accessing the raw data. fewshot.get_challenge_spec('flex').get_stores(split='[SPLIT]') returns a mapping from dataset names to fewshot.datasets.store.Store instances. Each Store instance has a Store.store attribute containing a raw HuggingFace Dataset instance, and a Store.label attribute holding the Dataset key for the target label (e.g., accessed via Store.store[Store.label]). The FLEX-formatted text is available at the flex.txt key (e.g., via Store.store['flex.txt']).

Two examples of these respective approaches are available at:

  1. The UniFew model repository. For more details on UniFew, see also the FLEX arXiv paper.
  2. The baselines/bao/ directory, for training and evaluating the approach described in the following paper:

Yujia Bao*, Menghua Wu*, Shiyu Chang, and Regina Barzilay. Few-shot Text Classification with Distributional Signatures. In International Conference on Learning Representations, 2020.

Benchmark Construction and Optimization

To add a new benchmark (challenge) named [NEW_CHALLENGE], you must edit fewshot/challenges/__init__.py or otherwise add it to the registry. The above usage instructions would change to substitute [NEW_CHALLENGE] in place of flex when calling fewshot.get_challenge_spec('[NEW_CHALLENGE]') and fewshot.make_challenge('[NEW_CHALLENGE]').

For an example of how to optimize the sample size of the challenge, see scripts/README-sample-size.md.

Attribution

If you make use of our framework, benchmark, or model, please cite our preprint:

@misc{bragg2021flex,
      title={FLEX: Unifying Evaluation for Few-Shot NLP},
      author={Jonathan Bragg and Arman Cohan and Kyle Lo and Iz Beltagy},
      year={2021},
      eprint={2107.07170},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}