Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Overview

CheckList

This repository contains code for testing NLP Models as described in the following paper:

Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Association for Computational Linguistics (ACL), 2020

Bibtex for citations:

 @inproceedings{checklist:acl20,
 author = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh},
 title = {Beyond Accuracy: Behavioral Testing of NLP models with CheckList},
 booktitle = {Association for Computational Linguistics (ACL)},
 year = {2020}
 }


Installation

From pypi:

pip install checklist
jupyter nbextension install --py --sys-prefix checklist.viewer
jupyter nbextension enable --py --sys-prefix checklist.viewer

Note: --sys-prefix installs into Python's sys.prefix, which is useful in virtual environments such as conda or virtualenv. If you are not in such an environment, switch to --user to install into the user's home Jupyter directories.

From source:

git clone git@github.com:marcotcr/checklist.git
cd checklist
pip install -e .

Either way, you need to install PyTorch or TensorFlow if you want to use masked language model suggestions:

pip install torch

For most tutorials, you also need to download a spacy model:

python -m spacy download en_core_web_sm

Tutorials

Please note that the visualizations are implemented as ipywidgets and do not work on Colab or JupyterLab (use Jupyter Notebook). Everything else should work in those environments.

  1. Generating data
  2. Perturbing data
  3. Test types, expectation functions, running tests
  4. The CheckList process

Paper tests

Notebooks: how we created the tests in the paper

  1. Sentiment analysis
  2. QQP
  3. SQuAD

Replicating paper tests, or running them with new models

For all of these, you need to unpack the release data (in the main repo folder after cloning):

tar xvzf release_data.tar.gz

Sentiment Analysis

Loading the suite:

import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/sentiment/sentiment_suite.pkl'
suite = TestSuite.from_file(suite_path)

Running tests with precomputed BERT predictions (replace bert in pred_path with amazon, google, microsoft, or roberta for the other models):

pred_path = 'release_data/sentiment/predictions/bert'
suite.run_from_file(pred_path, overwrite=True)
suite.summary() # or suite.visual_summary_table()

To test your own model, get predictions for the texts in release_data/sentiment/tests_n500 and save them in a file where each line has 4 numbers: the prediction (0 for negative, 1 for neutral, 2 for positive) and the prediction probabilities for (negative, neutral, positive).
Then, update pred_path with this file and run the lines above.
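
A minimal sketch of writing such a file is shown here. It assumes the four numbers are separated by spaces (the separator is not stated above) and that your model gives you (negative, neutral, positive) probabilities for each text. The helper names, the example probabilities, and the output path are illustrative and not part of CheckList.

```python
import os
import tempfile

def format_prediction_line(probs):
    """Return 'label p_neg p_neu p_pos' for one example.

    The label is the index of the largest probability
    (0 = negative, 1 = neutral, 2 = positive).
    """
    label = max(range(len(probs)), key=lambda i: probs[i])
    return ' '.join([str(label)] + ['%.6f' % p for p in probs])

def write_predictions(all_probs, path):
    """Write one line per example, in the same order as the test file."""
    with open(path, 'w') as f:
        for probs in all_probs:
            f.write(format_prediction_line(probs) + '\n')

# Hypothetical model output for three texts.
example_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.70, 0.20],
    [0.02, 0.08, 0.90],
]
out_path = os.path.join(tempfile.gettempdir(), 'my_sentiment_preds.txt')
write_predictions(example_probs, out_path)
with open(out_path) as f:
    written_lines = f.read().splitlines()
```

The resulting path can then be used as pred_path, provided the lines are in the same order as the texts in the tests file.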

QQP

import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/qqp/qqp_suite.pkl'
suite = TestSuite.from_file(suite_path)

Running tests with precomputed BERT predictions (replace bert in pred_path with roberta if you want):

pred_path = 'release_data/qqp/predictions/bert'
suite.run_from_file(pred_path, overwrite=True, file_format='binary_conf')
suite.visual_summary_table()

To test your own model, get predictions for pairs in release_data/qqp/tests_n500 (format: tsv) and output them in a file where each line has a single number: the probability that the pair is a duplicate.
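
As a rough sketch of that workflow, the code below reads question pairs from a TSV file and writes one probability per line. It assumes the two questions sit in the first two columns (check the actual file before relying on this) and uses a hypothetical word-overlap scorer, duplicate_probability, purely as a stand-in for a real model. The tiny input file is made up so the sketch runs without the release data.

```python
import csv
import os
import tempfile

def duplicate_probability(q1, q2):
    """Hypothetical stand-in for a real model: word-overlap (Jaccard) score."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def predict_tsv(in_path, out_path, score_fn):
    """Read question pairs from a TSV file and write one probability per line."""
    with open(in_path, newline='') as fin, open(out_path, 'w') as fout:
        reader = csv.reader(fin, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            q1, q2 = row[0], row[1]  # assumed column layout
            fout.write('%.4f\n' % score_fn(q1, q2))

# Tiny made-up input so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
pairs_path = os.path.join(tmp_dir, 'pairs.tsv')
preds_path = os.path.join(tmp_dir, 'my_qqp_preds.txt')
with open(pairs_path, 'w', newline='') as f:
    f.write('How do I learn Python?\tHow do I learn Python?\n')
    f.write('What is NLP?\tWhere is Paris?\n')

predict_tsv(pairs_path, preds_path, duplicate_probability)
with open(preds_path) as f:
    qqp_lines = f.read().splitlines()
```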

SQuAD

import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/squad/squad_suite.pkl'
suite = TestSuite.from_file(suite_path)

Running tests with precomputed bert predictions:

pred_path = 'release_data/squad/predictions/bert'
suite.run_from_file(pred_path, overwrite=True, file_format='pred_only')
suite.visual_summary_table()

To test your own model, get predictions for pairs in release_data/squad/squad.jsonl (format: jsonl) or release_data/squad/squad.json (format: json, like SQuAD dev) and output them in a file where each line has a single string: the prediction span.

Testing huggingface transformer pipelines

See this notebook.

Code snippets

Templates

See 1. Generating data for more details.

import checklist
from checklist.editor import Editor
import numpy as np
editor = Editor()
ret = editor.template('{first_name} is {a:profession} from {country}.',
                       profession=['lawyer', 'doctor', 'accountant'])
np.random.choice(ret.data, 3)

['Mary is a doctor from Afghanistan.',
'Jordan is an accountant from Indonesia.',
'Kayla is a lawyer from Sierra Leone.']
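
To make the idea behind templates concrete, here is a small standard-library sketch that expands a template over every combination of fill-in lists and picks "a" or "an" for keys written as {a:key}. It is only an approximation for illustration, not the Editor implementation: the function name fill_template is made up, and the article choice is a simple first-letter vowel check.

```python
import itertools
import re

def fill_template(template, **fillers):
    """Expand a template over every combination of the supplied fill-in lists.

    '{a:key}' is replaced by the value prefixed with 'a' or 'an',
    chosen by a simple first-letter vowel check (an approximation).
    """
    keys = sorted(fillers)
    results = []
    for combo in itertools.product(*(fillers[k] for k in keys)):
        values = dict(zip(keys, combo))

        def substitute(match):
            with_article, key = match.group(1), match.group(2)
            value = values[key]
            if with_article:
                article = 'an' if value[0].lower() in 'aeiou' else 'a'
                return '%s %s' % (article, value)
            return value

        results.append(re.sub(r'\{(a:)?(\w+)\}', substitute, template))
    return results

sentences = fill_template(
    '{first_name} is {a:profession}.',
    first_name=['Mary', 'Jordan'],
    profession=['doctor', 'accountant'],
)
```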

RoBERTa suggestions

See 1. Generating data for more details.
In template:

ret = editor.template('This is {a:adj} {mask}.',  
                      adj=['good', 'bad', 'great', 'terrible'])
ret.data[:3]

['This is a good idea.',
'This is a good sign.',
'This is a good thing.']

Multiple masks:

ret = editor.template('This is {a:adj} {mask} {mask}.',
                      adj=['good', 'bad', 'great', 'terrible'])
ret.data[:3]

['This is a good history lesson.',
'This is a good chess move.',
'This is a good news story.']

Getting suggestions rather than filling out templates:

editor.suggest('This is {a:adj} {mask}.',
               adj=['good', 'bad', 'great', 'terrible'])[:5]

['idea', 'sign', 'thing', 'example', 'start']

Getting suggestions for replacements (only a single text allowed, no templates):

editor.suggest_replace('This is a good movie.', 'good')[:5]

['great', 'horror', 'bad', 'terrible', 'cult']

Getting suggestions through jupyter visualization:

editor.visual_suggest('This is {a:mask} movie.')

[Screenshot: visual suggest]

Multilingual suggestions

Just initialize the editor with the language argument (should work with language names and ISO 639-1 codes):

import checklist
from checklist.editor import Editor
import numpy as np
# in Portuguese
editor = Editor(language='portuguese')
ret = editor.template('O João é um {mask}.',)
ret.data[:3]

['O João é um português.',
'O João é um poeta.',
'O João é um brasileiro.']

# in Chinese
editor = Editor(language='chinese')
ret = editor.template('西游记的故事很{mask}。',)
ret.data[:3]

['西游记的故事很精彩。',
'西游记的故事很真实。',
'西游记的故事很经典。']

We're using FlauBERT for French, German BERT for German, and XLM-RoBERTa for everything else (see the XLM-RoBERTa documentation for a list of supported languages). We can't vouch for the quality of the suggestions in other languages, but it seems to work reasonably well for the languages we speak (although not as well as English).

Lexicons (somewhat multilingual)

editor.lexicons is a dictionary, which can be used in templates. For example:

import checklist
from checklist.editor import Editor
import numpy as np
# Default: English
editor = Editor()
ret = editor.template('{male1} went to see {male2} in {city}.', remove_duplicates=True)
list(np.random.choice(ret.data, 3))

['Dan went to see Hugh in Riverside.',
'Stephen went to see Eric in Omaha.',
'Patrick went to see Nick in Kansas City.']

Person names and location (country, city) names are multilingual, depending on the editor language. We got the data from Wikidata, so there is a bias towards names that appear on Wikipedia.

editor = Editor(language='german')
ret = editor.template('{male1} went to see {male2} in {city}.', remove_duplicates=True)
list(np.random.choice(ret.data, 3))

['Rolf went to see Klaus in Leipzig.',
'Richard went to see Jörg in Marl.',
'Gerd went to see Fritz in Schwerin.']

List of available lexicons:

editor.lexicons.keys()

dict_keys(['male', 'female', 'first_name', 'first_pronoun', 'last_name', 'country', 'nationality', 'city', 'religion', 'religion_adj', 'sexual_adj', 'country_city', 'male_from', 'female_from', 'last_from'])

Some of these cannot be used directly in templates because they are themselves dictionaries. For example, male_from, female_from, last_from and country_city are dictionaries from country to male names, female names, last names and most populous cities.
You can call editor.lexicons.male_from.keys() for a list of country names. Example usage:

import numpy as np
countries = ['France', 'Germany', 'Brazil']
for country in countries:
    ts = editor.template('{male} {last} is from {city}',
                male=editor.lexicons.male_from[country],
                last=editor.lexicons.last_from[country],
                city=editor.lexicons.country_city[country],
               )
    print('Country: %s' % country)
    print('\n'.join(np.random.choice(ts.data, 3)))
    print()

Country: France
Jean-Jacques Brun is from Avignon
Bruno Deschamps is from Vitry-sur-Seine
Ernest Picard is from Chambéry

Country: Germany
Rainer Braun is from Schwerin
Markus Brandt is from Gera
Reinhard Busch is from Erlangen

Country: Brazil
Gilberto Martins is from Anápolis
Alfredo Guimarães is from Indaiatuba
Jorge Barreto is from Fortaleza

Perturbing data for INVs and DIRs

See 2. Perturbing data for more details.
Custom perturbation function:

import re
import checklist
from checklist.perturb import Perturb
def replace_john_with_others(x, *args, **kwargs):
    # Returns None (if John is not present) or a list of strings with John replaced by Luke and Mark
    if not re.search(r'\bJohn\b', x):
        return None
    return [re.sub(r'\bJohn\b', n, x) for n in ['Luke', 'Mark']]

dataset = ['John is a man', 'Mary is a woman', 'John is an apostle']
ret = Perturb.perturb(dataset, replace_john_with_others)
ret.data

[['John is a man', 'Luke is a man', 'Mark is a man'],
['John is an apostle', 'Luke is an apostle', 'Mark is an apostle']]
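
The output above suggests the following behavior: each example is grouped with its perturbed versions, and examples for which the function returns None are dropped. A minimal standard-library sketch of that behavior is given below. The helper perturb_all is hypothetical and is not how Perturb.perturb is implemented.

```python
import re

def perturb_all(dataset, perturb_fn, keep_original=True):
    """Apply perturb_fn to each example and group the results per example.

    Examples for which perturb_fn returns None (or an empty list) are dropped,
    mirroring the output shown above.
    """
    groups = []
    for example in dataset:
        perturbed = perturb_fn(example)
        if not perturbed:
            continue
        original = [example] if keep_original else []
        groups.append(original + list(perturbed))
    return groups

def replace_john_with_others(x):
    """Return None if John is absent, else copies with John replaced."""
    if not re.search(r'\bJohn\b', x):
        return None
    return [re.sub(r'\bJohn\b', n, x) for n in ['Luke', 'Mark']]

dataset = ['John is a man', 'Mary is a woman', 'John is an apostle']
groups = perturb_all(dataset, replace_john_with_others)
```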

General purpose perturbations (see tutorial for more):

import spacy
nlp = spacy.load('en_core_web_sm')
pdataset = list(nlp.pipe(dataset))
ret = Perturb.perturb(pdataset, Perturb.change_names, n=2)
ret.data

[['John is a man', 'Ian is a man', 'Robert is a man'],
['Mary is a woman', 'Katherine is a woman', 'Alexandra is a woman'],
['John is an apostle', 'Paul is an apostle', 'Gabriel is an apostle']]

ret = Perturb.perturb(pdataset, Perturb.add_negation)
ret.data

[['John is a man', 'John is not a man'],
['Mary is a woman', 'Mary is not a woman'],
['John is an apostle', 'John is not an apostle']]

Creating and running tests

See 3. Test types, expectation functions, running tests for more details.

MFT:

import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
editor = Editor()

t = editor.template('This is {a:adj} {mask}.',  
                      adj=['good', 'great', 'excellent', 'awesome'])
test1 = MFT(t.data, labels=1, name='Simple positives',
           capability='Vocabulary', description='')

INV:

dataset = ['This was a very nice movie directed by John Smith.',
           'Mary Keen was brilliant.',
          'I hated everything about this.',
          'This movie was very bad.',
          'I really liked this movie.',
          'just bad.',
          'amazing.',
          ]
t = Perturb.perturb(dataset, Perturb.add_typos)
test2 = INV(**t)

DIR:

from checklist.expect import Expect
def add_negative(x):
    phrases = ['Anyway, I thought it was bad.', 'Having said this, I hated it', 'The director should be fired.']
    return ['%s %s' % (x, p) for p in phrases]

t = Perturb.perturb(dataset, add_negative)
monotonic_decreasing = Expect.monotonic(label=1, increasing=False, tolerance=0.1)
test3 = DIR(**t, expect=monotonic_decreasing)
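
One plausible reading of a "monotonic decreasing" expectation with a tolerance is sketched below in plain Python: the confidence for the chosen label may not rise by more than the tolerance after the perturbation. This is an illustration of the idea, not the code behind Expect.monotonic. The function name and the binary (negative, positive) confidence vectors are assumptions.

```python
def monotonic_decreasing(orig_conf, conf, label, tolerance=0.0):
    """Check that confidence in `label` did not rise by more than `tolerance`.

    Returns True on pass, otherwise a negative float whose magnitude is the
    amount by which the increase exceeded the tolerance.
    """
    increase = conf[label] - orig_conf[label]
    if increase <= tolerance:
        return True
    return -(increase - tolerance)

# Assumed layout: (negative, positive) probabilities, so label 1 = positive.
orig = [0.2, 0.8]
dropped = [0.5, 0.5]        # positive confidence fell: pass
slight_rise = [0.15, 0.85]  # rose by 0.05, within tolerance 0.1: pass
big_rise = [0.0, 1.0]       # rose by 0.2, exceeds tolerance: fail

results = [
    monotonic_decreasing(orig, c, label=1, tolerance=0.1)
    for c in (dropped, slight_rise, big_rise)
]
```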

Running tests directly:

from checklist.pred_wrapper import PredictorWrapper
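# model is assumed to be your own classifier exposing predict_proba(texts),
# and test is any of the tests created above (e.g. test1).
# wrapped_pp returns a tuple with (predictions, softmax confidences)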
wrapped_pp = PredictorWrapper.wrap_softmax(model.predict_proba)
test.run(wrapped_pp)
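
To illustrate the (predictions, softmax confidences) tuple mentioned in the comment above, here is a plain-Python sketch of a wrapper with that return shape. The names wrap_softmax_like and toy_predict_proba are hypothetical, and the toy model is a keyword rule used only so the sketch runs. This is not the PredictorWrapper code.

```python
def wrap_softmax_like(predict_proba):
    """Wrap a probability function so it also returns argmax predictions."""
    def wrapped(texts):
        confs = [list(row) for row in predict_proba(texts)]
        preds = [max(range(len(row)), key=lambda i: row[i]) for row in confs]
        return preds, confs
    return wrapped

def toy_predict_proba(texts):
    """Hypothetical model: leans positive iff the text contains 'good'."""
    return [[0.1, 0.9] if 'good' in t else [0.8, 0.2] for t in texts]

wrapped_pp = wrap_softmax_like(toy_predict_proba)
preds, confs = wrapped_pp(['This is a good idea.', 'This is a bad idea.'])
```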

Running from a file:

# One line per example
test.to_raw_file('/tmp/raw_file.txt')
# each line has prediction probabilities (softmax)
test.run_from_file('/tmp/softmax_preds.txt', file_format='softmax', overwrite=True)

Summary of results:

test.summary(n=1)

Test cases: 400
Fails (rate): 200 (50.0%)

Example fails:
0.2 This is a good idea

Visual summary:

test.visual_summary()

[Screenshot: visual summary]

Saving and loading individual tests:

# save
test.save(path)
# load
test = MFT.from_file(path)

Custom expectation functions

See 3. Test types, expectation functions, running tests for more details.

If you are writing a custom expectation function, it must return a float or bool for each example such that:

  • > 0 (or True) means passed,
  • <= 0 (or False) means failed; the magnitude can optionally indicate how severe the failure is, measured as distance from 0 (e.g. -10 is worse than -1),
  • None means the test does not apply, and this should not be counted
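
The return-value convention above can be sketched as a small tally function. This is a plain-Python illustration of how such results might be interpreted, not CheckList's own aggregation code. The name summarize and the example values are made up.

```python
def summarize(results):
    """Count passes, fails, and not-applicable results.

    None is skipped, True or any value > 0 passes, and False or any
    value <= 0 fails. The most negative numeric result is also tracked.
    """
    counts = {'passed': 0, 'failed': 0, 'skipped': 0}
    worst = None  # most negative numeric result seen, if any
    for r in results:
        if r is None:
            counts['skipped'] += 1
        elif r is True or (r is not False and r > 0):
            counts['passed'] += 1
        else:
            counts['failed'] += 1
            if r is not False and (worst is None or r < worst):
                worst = r
    return counts, worst

counts, worst = summarize([True, 0.3, False, -1, -10, None, 0])
```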

Expectation on a single example:

def high_confidence(x, pred, conf, label=None, meta=None):
    return conf.max() > 0.95
expect_fn = Expect.single(high_confidence)

Expectation on pairs of (orig, new) examples (for INV and DIR):

def changed_pred(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    return pred != orig_pred
expect_fn = Expect.pairwise(changed_pred)

There's also Expect.testcase and Expect.test, amongst many others.
Check out expect.py for more details.

Test Suites

See 4. The CheckList process for more details.

Adding tests:

from checklist.test_suite import TestSuite
suite = TestSuite()
# assuming test exists:
suite.add(test)

Running a suite is the same as running an individual test, either directly or through a file:

from checklist.pred_wrapper import PredictorWrapper
# wrapped_pp returns a tuple with (predictions, softmax confidences)
wrapped_pp = PredictorWrapper.wrap_softmax(model.predict_proba)
suite.run(wrapped_pp)
# or suite.run_from_file, see examples above

To visualize results, you can call suite.summary() (same as test.summary), or suite.visual_summary_table(). This is what the latter looks like for BERT on sentiment analysis:

suite.visual_summary_table()

[Screenshot: visual summary table]

Finally, it's easy to save, load, and share a suite:

# save
suite.save(path)
# load
suite = TestSuite.from_file(path)

API reference

On readthedocs

Code of Conduct

Microsoft Open Source Code of Conduct

Owner
Marco Tulio Correia Ribeiro