CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

Related tags

Deep LearningCLUES
Overview

License: MIT

CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

This repo contains the data and source code for baseline models in the NeurIPS 2021 benchmark paper for Constrained Language Understanding Evaluation Standard (CLUES) under MIT License.

Overview

The benchmark data is located in the data directory. We also release source codes for two fine-tuning strategies on CLUES, one with classic fine-tuning and the other with prompt-based fine-tuning.

Classic finetuning

Setup Environment

  1. > git clone [email protected]:microsoft/CLUES.git
  2. > git clone [email protected]:namisan/mt-dnn.git
  3. > cp -rf CLUES/classic_finetuning/ mt-dnn/
  4. > cd mt-dnn/

Run Experiments

  1. Preprocess data
    > bash run_clues_data_process.sh

  2. Train/test Models
    > bash run_clues_batch.sh

Prompt fine-tuning

Setup

  1. cd prompt_finetuning
  2. Run sh setup.sh to automatically fetch dependency codebase and apply our patch for CLUES

Run Experiments

All prompt-based funetuning baselines run commands are in experiments.sh, simple run by sh experiments.sh

Leaderboard

Here we maintain a leaderboard, allowing researchers to submit their results as entries.

Submission Instructions

  • Each submission must be submitted as a pull request modifying the markdown file underlying the leaderboard.
  • The submission must attach an accompanying public paper and public source code for reproducing their results on our dataset.
  • A submission can be toward any subset of tasks in our benchmark, or toward the aggregate leaderboard.
  • For any task targeted by the submission, we require evaluation on (1) 10, 20, and 30 shots, and (2) all 5 splits of the corresponding dataset and a report of their mean and standard deviation.
  • Each leaderboard will be sorted by the 30-shot mean S1 score (where S1 score is a variant of F1 score defined in our paper).
  • The submission should not use data from the 4 other splits during few-shot finetuning of any 1 split, either as extra training set or as validation set for hyperparameter tuning.
  • However, we allow external data, labeled or unlabeled, to be used for such purposes. Each submission using external data must mark the corresponding columns "external labeled" and/or "external unlabeled". Note, in this context, "external data" refers to data used after pretraining (e.g., for task-specific tuning); in particular, methods using existing pretrained models only, without extra data, should not mark either column. For obvious reasons, models cannot be trained on the original labeled datasets from where we sampled the few-shot CLUES data.
  • In the table entry, the submission should include a method name and a citation, hyperlinking to their publicly released source code reproducing the results. See the last entry of the table below for an example.

Abbreviations

  • FT = (classic) finetuning
  • PT = prompt based tuning
  • ICL = in-context learning, in the style of GPT-3
  • μ±σ = mean μ and standard deviation σ across our 5 splits. Aggregate standard deviation is calculated using the sum-of-variance formula from individual tasks' standard deviations.

Benchmarking CLUES for Aggregate 30-shot Evaluation

Shots (K=30) external labeled external unlabeled Average ▼ SST-2 MNLI CoNLL03 WikiANN SQuAD-v2 ReCoRD
Human N N 81.4 83.7 69.4 87.4 82.6 73.5 91.9
T5-Large-770M-FT N N 43.1±6.7 52.3±2.9 36.8±3.8 51.2±0.1 62.4±0.6 43.7±2.7 12±3.8
BERT-Large-336M-FT N N 42.1±7.8 55.4±2.5 33.3±1.4 51.3±0 62.5±0.6 35.3±6.4 14.9±3.4
BERT-Base-110M-FT N N 41.5±9.2 53.6±5.5 35.4±3.2 51.3±0 62.8±0 32.6±5.8 13.1±3.3
DeBERTa-Large-400M-FT N N 40.1±17.8 47.7±9.0 26.7±11 48.2±2.9 58.3±6.2 38.7±7.4 21.1±3.6
RoBERTa-Large-355M-FT N N 40.0±10.6 53.2±5.6 34.0±1.1 44.7±2.6 48.4±6.7 43.5±4.4 16±2.8
RoBERTa-Large-355M-PT N N 90.2±1.8 61.6±3.5
DeBERTa-Large-400M-PT N N 88.4±3.3 62.9±3.1
BERT-Large-336M-PT N N 82.7±4.1 45.3±2.0
GPT3-175B-ICL N N 91.0±1.6 33.2±0.2
BERT-Base-110M-PT N N 79.4±5.6 42.5±3.2
LiST (Wang et al.) N Y 91.3 ±0.7 67.9±3.0
Example (lastname et al.) Y/N Y/N 0±0 0±0 0±0 0±0 0±0 0±0 0±0

Individual Task Performance over Multiple Shots

SST-2

Shots (K) external labeled external unlabeled 10 20 30 ▼ All
GPT-3 (175B) ICL N N 85.9±3.7 92.0±0.7 91.0±1.6 -
RoBERTa-Large PT N N 88.8±3.9 89.0±1.1 90.2±1.8 93.8
DeBERTa-Large PT N N 83.4±5.3 87.8±3.5 88.4±3.3 91.9
Human N N 79.8 83 83.7 -
BERT-Large PT N N 63.2±11.3 78.2±9.9 82.7±4.1 91
BERT-Base PT N N 63.9±10.0 76.7±6.6 79.4±5.6 91.9
BERT-Large FT N N 46.3±5.5 55.5±3.4 55.4±2.5 99.1
BERT-Base FT N N 46.2±5.6 54.0±2.8 53.6±5.5 98.1
RoBERTa-Large FT N N 38.4±21.7 52.3±5.6 53.2±5.6 98.6
T5-Large FT N N 51.2±1.8 53.4±3.2 52.3±2.9 97.6
DeBERTa-Large FT N N 43.0±11.9 40.8±22.6 47.7±9.0 100
Example (lastname et al.) Y/N Y/N 0±0 0±0 0±0 -

MNLI

Shots (K) external labeled external unlabeled 10 20 30 ▼ All
Human N Y 78.1 78.6 69.4 -
LiST (wang et al.) N N 60.5±8.3 67.2±4.5 67.9±3.0 -
DeBERTa-Large PT N N 44.5±8.2 60.7±5.3 62.9±3.1 88.1
RoBERTa-Large PT N N 57.7±3.6 58.6±2.9 61.6±3.5 87.1
BERT-Large PT N N 41.7±1.0 43.7±2.1 45.3±2.0 81.9
BERT-Base PT N N 40.4±1.8 42.1±4.4 42.5±3.2 81
T5-Large FT N N 39.8±3.3 37.9±4.3 36.8±3.8 85.9
BERT-Base FT N N 37.0±5.2 35.2±2.7 35.4±3.2 81.6
RoBERTa-Large FT N N 34.3±2.8 33.4±0.9 34.0±1.1 85.5
BERT-Large FT N N 33.7±0.4 28.2±14.8 33.3±1.4 80.9
GPT-3 (175B) ICL N N 33.5±0.7 33.1±0.3 33.2±0.2 -
DeBERTa-Large FT N N 27.4±14.1 33.6±2.5 26.7±11.0 87.6

CoNLL03

Shots (K) external labeled external unlabeled 10 20 30 ▼ All
Human N N 87.7 89.7 87.4 -
BERT-Base FT N N 51.3±0 51.3±0 51.3±0 -
BERT-Large FT N N 51.3±0 51.3±0 51.3±0 89.3
T5-Large FT N N 46.3±6.9 50.0±0.7 51.2±0.1 92.2
DeBERTa-Large FT N N 50.1±1.2 47.8±2.5 48.2±2.9 93.6
RoBERTa-Large FT N N 50.8±0.5 44.6±5.1 44.7±2.6 93.2

WikiANN

Shots (K) external labeled external unlabeled 10 20 30 ▼ All
Human N N 81.4 83.5 82.6 -
BERT-Base FT N N 62.8±0 62.8±0 62.8±0 88.8
BERT-Large FT N N 62.8±0 62.6±0.4 62.5±0.6 91
T5-Large FT N N 61.7±0.7 62.1±0.2 62.4±0.6 87.4
DeBERTa-Large FT N N 58.5±3.3 57.9±5.8 58.3±6.2 91.1
RoBERTa-Large FT N N 58.5±8.8 56.9±3.4 48.4±6.7 91.2

SQuAD v2

Shots (K) external labeled external unlabeled 10 20 30 ▼ All
Human N N 71.9 76.4 73.5 -
T5-Large FT N N 43.6±3.5 28.7±13.0 43.7±2.7 87.2
RoBERTa-Large FT N N 38.1±7.2 40.1±6.4 43.5±4.4 89.4
DeBERTa-Large FT N N 41.4±7.3 44.4±4.5 38.7±7.4 90
BERT-Large FT N N 42.3±5.6 35.8±9.7 35.3±6.4 81.8
BERT-Base FT N N 46.0±2.4 34.9±9.0 32.6±5.8 76.3

ReCoRD

Shots (K) external labeled external unlabeled 10 20 30 ▼ All
Human N N 94.1 94.2 91.9 -
DeBERTa-Large FT N N 15.7±5.0 16.8±5.7 21.1±3.6 80.7
RoBERTa-Large FT N N 12.0±1.9 9.9±6.2 16.0±2.8 80.3
BERT-Large FT N N 9.9±5.2 11.8±4.9 14.9±3.4 66
BERT-Base FT N N 10.3±1.8 11.7±2.4 13.1±3.3 54.4
T5-Large FT N N 11.9±2.7 11.7±1.5 12.0±3.8 77.3

How do I cite CLUES?

@article{cluesteam2021,
  title={Few-Shot Learning Evaluation in Natural Language Understanding},
  author={Mukherjee, Subhabrata and Liu, Xiaodong and Zheng, Guoqing and Hosseini, Saghar and Cheng, Hao and Yang, Greg and Meek, Christopher and Awadallah, Ahmed Hassan and Gao, Jianfeng},
  year={2021}
}

Acknowledgments

MT-DNN: https://github.com/namisan/mt-dnn
LM-BFF: https://github.com/princeton-nlp/LM-BFF

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Few-shot Relation Extraction via Bayesian Meta-learning on Relation Graphs

Few-shot Relation Extraction via Bayesian Meta-learning on Relation Graphs This is an implemetation of the paper Few-shot Relation Extraction via Baye

MilaGraph 36 Nov 22, 2022
Melanoma Skin Cancer Detection using Convolutional Neural Networks and Transfer Learning🕵🏻‍♂️

This is a Kaggle competition in which we have to identify if the given lesion image is malignant or not for Melanoma which is a type of skin cancer.

Vipul Shinde 1 Jan 27, 2022
OpenDILab Multi-Agent Environment

Go-Bigger: Multi-Agent Decision Intelligence Environment GoBigger Doc (中文版) Ongoing 2021.11.13 We are holding a competition —— Go-Bigger: Multi-Agent

OpenDILab 441 Jan 05, 2023
ivadomed is an integrated framework for medical image analysis with deep learning.

Repository on the collaborative IVADO medical imaging project between the Mila and NeuroPoly labs.

144 Dec 19, 2022
Yet Another Robotics and Reinforcement (YARR) learning framework for PyTorch.

Yet Another Robotics and Reinforcement (YARR) learning framework for PyTorch.

Stephen James 51 Dec 27, 2022
piSTAR Lab is a modular platform built to make AI experimentation accessible and fun. (pistar.ai)

piSTAR Lab WARNING: This is an early release. Overview piSTAR Lab is a modular deep reinforcement learning platform built to make AI experimentation a

piSTAR Lab 0 Aug 01, 2022
Code for CVPR 2021 oral paper "Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts"

Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts The rapid progress in 3D scene understanding has come with growing dem

Facebook Research 182 Dec 30, 2022
The Python ensemble sampling toolkit for affine-invariant MCMC

emcee The Python ensemble sampling toolkit for affine-invariant MCMC emcee is a stable, well tested Python implementation of the affine-invariant ense

Dan Foreman-Mackey 1.3k Dec 31, 2022
A PyTorch implementation of PointRend: Image Segmentation as Rendering

PointRend A PyTorch implementation of PointRend: Image Segmentation as Rendering [arxiv] [Official Implementation: Detectron2] This repo for Only Sema

AhnDW 336 Dec 26, 2022
SBINN: Systems-biology informed neural network

SBINN: Systems-biology informed neural network The source code for the paper M. Daneker, Z. Zhang, G. E. Karniadakis, & L. Lu. Systems biology: Identi

Lu Group 15 Nov 19, 2022
Code for Learning Manifold Patch-Based Representations of Man-Made Shapes, in ICLR 2021.

LearningPatches | Webpage | Paper | Video Learning Manifold Patch-Based Representations of Man-Made Shapes Dmitriy Smirnov, Mikhail Bessmeltsev, Justi

Dima Smirnov 22 Nov 14, 2022
Tiny Object Detection in Aerial Images.

AI-TOD AI-TOD is a dataset for tiny object detection in aerial images. [Paper] [Dataset] Description AI-TOD comes with 700,621 object instances for ei

jwwangchn 116 Dec 30, 2022
TensorFlow2 Classification Model Zoo playing with TensorFlow2 on the CIFAR-10 dataset.

Training CIFAR-10 with TensorFlow2(TF2) TensorFlow2 Classification Model Zoo. I'm playing with TensorFlow2 on the CIFAR-10 dataset. Architectures LeNe

Chia-Hung Yuan 16 Sep 27, 2022
Calling Julia from Python - an experiment on data loading

Calling Julia from Python - an experiment on data loading See the slides. TLDR After reading Patrick's blog post, we decided to try to replace C++ wit

Abel Siqueira 8 Jun 07, 2022
Used to record WKU's utility bills on a regular basis.

WKU水电费小助手 一个用于定期记录WKU水电费的脚本 Looking for English Readme? 背景 由于WKU校园内的水电账单系统时常存在扣费延迟的现象,而补扣的费用缺乏令人信服的证明。不少学生为费用摸不着头脑,但也没有申诉的依据。为了更好地掌握水电费使用情况,留下一手证据,我开源

2 Jul 21, 2022
OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.

English | 简体中文 Documentation: https://mmtracking.readthedocs.io/ Introduction MMTracking is an open source video perception toolbox based on PyTorch.

OpenMMLab 2.7k Jan 08, 2023
Object detection using yolo-tiny model and opencv used as backend

Object detection Algorithm used : Yolo algorithm Backend : opencv Library required: opencv = 4.5.4-dev' Quick Overview about structure 1) main.py Load

2 Jul 06, 2022
This is the repository of our article published on MDPI Entropy "Feature Selection for Recommender Systems with Quantum Computing".

Collaborative-driven Quantum Feature Selection This repository was developed by Riccardo Nembrini, PhD student at Politecnico di Milano. See the websi

Quantum Computing Lab @ Politecnico di Milano 10 Apr 21, 2022
Deep Learning for 3D Point Clouds: A Survey (IEEE TPAMI, 2020)

🔥Deep Learning for 3D Point Clouds (IEEE TPAMI, 2020)

Qingyong 1.4k Jan 08, 2023
Neural Scene Graphs for Dynamic Scene (CVPR 2021)

Implementation of Neural Scene Graphs, that optimizes multiple radiance fields to represent different objects and a static scene background. Learned representations can be rendered with novel object

151 Dec 26, 2022