# FLEX
FLEX is a benchmark and framework for unified, rigorous few-shot NLP evaluation. FLEX enables:
- First-class NLP support
- Support for meta-training
- Reproducible few-shot evaluations
- Extensible benchmark creation (benchmarks defined using HuggingFace Datasets)
- Advanced sampling functions for creating episodes with class imbalance, etc.
For more context, see our arXiv preprint.
Together with FLEX, we also released a simple yet strong few-shot model called UniFew. For more details, see our preprint.
## Leaderboards
These instructions are geared towards users of the first benchmark created with this framework. The benchmark has two leaderboards, for the Pretraining-Only and Meta-Trained protocols described in Section 4.2 of our paper:
- FLEX (Pretraining-Only): for models that do not use meta-training data related to the test tasks (do not follow the Model Training section below).
- FLEX-META (Meta-Trained): for models that use only the provided meta-training and meta-validation data (please do see the Model Training section below).
## Installation
- Clone the repository:
  ```bash
  git clone git@github.com:allenai/flex.git
  ```
- Create a Python 3 environment (3.7 or greater), e.g., using:
  ```bash
  conda create --name flex python=3.9
  ```
- Activate the environment:
  ```bash
  conda activate flex
  ```
- Install the package locally:
  ```bash
  pip install -e .
  ```
## Data Preparation
Creating the data for the `flex` challenge for the first time takes about 10 minutes (using a recent MacBook Pro on a broadband connection) and requires 3GB of disk space. You can initiate this process by running:
```bash
python -c "import fewshot; fewshot.make_challenge('flex');"
```
You can control the location of the cached data by setting the environment variable `HF_DATASETS_CACHE`. If you have not set this variable, the location should default to `~/.cache/huggingface/datasets/`. See the HuggingFace docs for more details.
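For example, here is a minimal sketch of overriding the cache location from Python; the path below is illustrative, and the variable is set before importing `fewshot` so that the HuggingFace Datasets library picks it up:
```python
import os

# Illustrative cache path; set the variable before importing fewshot so
# that HuggingFace Datasets reads it when it is first loaded.
os.environ['HF_DATASETS_CACHE'] = '/data/hf_datasets_cache'

import fewshot
fewshot.make_challenge('flex')  # challenge data is now cached under the path above
```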
## Model Evaluation
"Challenges" are datasets of sampled tasks for evaluation. They are defined in `fewshot/challenges/__init__.py`.

To evaluate a model on the `flex` challenge (our first challenge), you should write a program that produces a `predictions.json` file, for example:
```python
#!/usr/bin/env python3
import random
from typing import Iterable, Dict, Any, Sequence

import fewshot


class YourModel(fewshot.Model):
    def fit_and_predict(
        self,
        support_x: Iterable[Dict[str, Any]],
        support_y: Iterable[str],
        target_x: Iterable[Dict[str, Any]],
        metadata: Dict[str, Any]
    ) -> Sequence[str]:
        """Return random label predictions for a few-shot task."""
        train_x = [d['txt'] for d in support_x]
        train_y = support_y
        test_x = [d['txt'] for d in target_x]
        test_y = [random.choice(metadata['labels']) for _ in test_x]
        # >>> print(test_y)
        # ['some', 'list', 'of', 'label', 'predictions']
        return test_y


if __name__ == '__main__':
    evaluator = fewshot.make_challenge("flex")
    model = YourModel()
    evaluator.save_model_predictions(model=model, save_path='/path/to/predictions.json')
```
Warning: Calling `fewshot.make_challenge("flex")` above requires some time to prepare all the necessary data (see the "Data Preparation" section).

Running the above script produces `/path/to/predictions.json` with contents formatted as:
```
{
    "[QUESTION_ID]": {
        "label": "[CLASS_LABEL]",  # Currently an integer converted to a string
        "score": float  # Only used for ranking tasks
    },
    ...
}
```
Each `[QUESTION_ID]` is an ID for a test example in a few-shot problem.
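As a quick sanity check before formal validation (see "Result Validation and Scoring" below), you can load and inspect the file; this is a minimal sketch using only the format described above:
```python
import json

# Peek at a few entries of the predictions file produced above.
with open('/path/to/predictions.json') as f:
    predictions = json.load(f)

for question_id, prediction in list(predictions.items())[:3]:
    # 'score' is only used for ranking tasks, so it may be absent elsewhere.
    print(question_id, prediction['label'], prediction.get('score'))
```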
## [Optional] Parallelizing Evaluation
Two options are available for parallelizing evaluation.
First, one can restrict evaluation to a subset of tasks with indices from `[START]` to `[STOP]` (exclusive) via:
```python
evaluator.save_model_predictions(model=model, start_task_index=[START], stop_task_index=[STOP])
```
Notes:
- You may use `stop_task_index=None` (or omit it) to avoid specifying an end.
- You can find the total number of tasks in the challenge with `fewshot.get_challenge_spec([CHALLENGE]).num_tasks`.
- To merge partial evaluation outputs into a complete `predictions.json` file, use `fewshot merge partial1.json partial2.json ... predictions.json`.
The second option will call your model's `.fit_and_predict()` method with batches of `[BATCH_SIZE]` tasks, via:
```python
evaluator.save_model_predictions(model=model, batched=True, batch_size=[BATCH_SIZE])
```
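For instance, the first option can be used to shard the challenge across worker processes. The sketch below is illustrative: it assumes the `save_model_predictions()` arguments shown above can be combined, and `YourModel` refers to the class defined in the "Model Evaluation" section:
```python
#!/usr/bin/env python3
import sys

import fewshot
from your_module import YourModel  # hypothetical: the model class defined earlier

NUM_WORKERS = 4
worker_id = int(sys.argv[1])  # e.g., 0 through 3, one per worker process

# Compute this worker's contiguous slice of task indices.
num_tasks = fewshot.get_challenge_spec('flex').num_tasks
shard_size = (num_tasks + NUM_WORKERS - 1) // NUM_WORKERS  # ceiling division

evaluator = fewshot.make_challenge('flex')
evaluator.save_model_predictions(
    model=YourModel(),
    start_task_index=worker_id * shard_size,
    stop_task_index=min((worker_id + 1) * shard_size, num_tasks),
    save_path=f'partial{worker_id}.json',
)
# Combine the shards afterwards with:
#   fewshot merge partial0.json partial1.json partial2.json partial3.json predictions.json
```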
## Result Validation and Scoring
To validate the contents of your predictions, run:
```bash
fewshot validate --challenge_name flex --predictions /path/to/predictions.json
```
This validates all the inputs and takes some time. Substitute another challenge name for `flex` to validate predictions on a different challenge.

(There is also a `score` CLI command, which should not be used on the final challenge except when reporting final results.)
## Model Training
For the meta-training protocol (e.g., the FLEX-META leaderboard), challenges come with a set of related training and validation data. This data is most easily accessible in one of two formats:
- **Iterable from sampled episodes.** `fewshot.get_challenge_spec('flex').get_sampler(split='[SPLIT]')` returns an iterable that samples datasets and episodes from the meta-training or meta-validation datasets, via `[SPLIT]='train'` or `[SPLIT]='val'`, respectively. The sampler defaults to the `fewshot.samplers.Sample2WayMax8ShotCfg` sampler configuration (for the `fewshot.samplers.sample.Sampler` class), but can be reconfigured (see the sketch after this list).
- **Raw dataset stores.** This option is for directly accessing the raw data. `fewshot.get_challenge_spec('flex').get_stores(split='[SPLIT]')` returns a mapping from dataset names to `fewshot.datasets.store.Store` instances. Each `Store` instance has a `Store.store` attribute containing a raw HuggingFace Dataset instance. The `Store` instance also has a `Store.label` attribute with the Dataset object key for accessing the target label (e.g., via `Store.store[Store.label]`), and the FLEX-formatted text is available at the `flex.txt` key (e.g., via `Store.store['flex.txt']`).
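Here is a minimal sketch of both access patterns, using only the calls described above; the structure of the sampled episodes is not specified here, so the sketch simply inspects one:
```python
import fewshot

spec = fewshot.get_challenge_spec('flex')

# 1. Iterate over sampled meta-training episodes.
sampler = spec.get_sampler(split='train')
for episode in sampler:
    print(episode)  # inspect one sampled episode, then stop
    break

# 2. Access the raw dataset stores directly.
stores = spec.get_stores(split='train')
for name, store in stores.items():
    dataset = store.store          # raw HuggingFace Dataset instance
    labels = dataset[store.label]  # target labels, keyed by Store.label
    texts = dataset['flex.txt']    # FLEX-formatted input text
    print(name, len(dataset))
```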
Two examples of these respective approaches are available at:
- The UniFew model repository. For more details on UniFew, see also the FLEX arXiv paper.
- The `baselines/bao/` directory, for training and evaluating the approach described in the following paper:

  Yujia Bao*, Menghua Wu*, Shiyu Chang, and Regina Barzilay. Few-shot Text Classification with Distributional Signatures. In International Conference on Learning Representations 2020.
## Benchmark Construction and Optimization
To add a new benchmark (challenge) named `[NEW_CHALLENGE]`, you must edit `fewshot/challenges/__init__.py` or otherwise add it to the registry. The above usage instructions would change to substitute `[NEW_CHALLENGE]` in place of `flex` when calling `fewshot.get_challenge_spec('[NEW_CHALLENGE]')` and `fewshot.make_challenge('[NEW_CHALLENGE]')`.
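For example (a sketch reusing only the entry points shown earlier; `[NEW_CHALLENGE]` is a placeholder for your registered challenge name):
```python
import fewshot

# After registering [NEW_CHALLENGE] in fewshot/challenges/__init__.py:
spec = fewshot.get_challenge_spec('[NEW_CHALLENGE]')   # inspect the spec, e.g., spec.num_tasks
evaluator = fewshot.make_challenge('[NEW_CHALLENGE]')  # prepare and cache the challenge data
```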
For an example of how to optimize the sample size of the challenge, see `scripts/README-sample-size.md`.
## Attribution
If you make use of our framework, benchmark, or model, please cite our preprint:
```bibtex
@misc{bragg2021flex,
    title={FLEX: Unifying Evaluation for Few-Shot NLP},
    author={Jonathan Bragg and Arman Cohan and Kyle Lo and Iz Beltagy},
    year={2021},
    eprint={2107.07170},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```