General tricks that may help you find bad, or noisy, labels in your dataset

doubtlab

A lab for bad labels.

Warning: still in progress.

This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that this repository makes it easier for folks to quickly check their own datasets before they invest too much time and compute on grid search.

Install

You can install the tool via pip.

python -m pip install doubtlab

Quickstart

Doubtlab allows you to define "reasons" for a row of data to deserve another look. These reasons can form a pipeline which can be used to retrieve a sorted list of examples worth checking again.

from doubtlab import DoubtLab
from doubtlab.reasons import ProbaReason, WrongPredictionReason

# Let's say we have some model already
model.fit(X, y)

# Next we can define the reasons for doubt. In this case we're saying
# that examples deserve another look if the associated proba values
# are low or if the model output doesn't match the associated label.
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model)
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtLab(**reasons)

# Get the predicates, or reasoning, behind the order
predicates = doubt.get_predicates(X, y)
# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)
# Get the (X, y) candidates worth checking again
X_check, y_check = doubt.get_candidates(X, y)
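
To make the idea of a sorted list concrete, here is a minimal numpy sketch of how per-reason flags could be combined into an ordering. It is an illustration under assumptions, not doubtlab's actual implementation: the helper `sort_by_doubt` and the rule "sum the flags, most doubt first" are hypothetical.

```python
import numpy as np

def sort_by_doubt(predicates):
    """Return row indices with at least one reason for doubt, most-doubted first.

    `predicates` maps a reason name to an array of per-row scores
    (0.0 means no doubt). Hypothetical helper, not part of doubtlab.
    """
    scores = np.sum([np.asarray(p, dtype=float) for p in predicates.values()], axis=0)
    order = np.argsort(-scores, kind="stable")  # stable sort keeps row order on ties
    return order[scores[order] > 0]             # drop rows that no reason flagged

predicates = {
    "proba": np.array([0.0, 1.0, 1.0, 0.0]),
    "wrong_pred": np.array([0.0, 0.0, 1.0, 1.0]),
}
indices = sort_by_doubt(predicates)
```

Row 2 is flagged by both reasons and so comes first; row 0 is flagged by neither and is left out.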

Features

The library implements many "reasons" for doubt.

  • ProbaReason: assign doubt when a model's confidence values are low
  • RandomReason: assign doubt randomly, just to be sure
  • LongConfidenceReason: assign doubt when a wrong class gains too much confidence
  • ShortConfidenceReason: assign doubt when the correct class gains too little confidence
  • DisagreeReason: assign doubt when two models disagree on a prediction
  • CleanLabReason: assign doubt according to cleanlab
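
As a rough illustration of what a few of these reasons check, the sketch below recomputes three of them with numpy from a matrix of predicted probabilities. The function names and the thresholds (0.55 and 0.2) are assumptions made for illustration; the library's own implementations may differ.

```python
import numpy as np

def low_proba(probas, max_proba=0.55):
    """Doubt when even the most confident class stays at or below `max_proba`."""
    return (probas.max(axis=1) <= max_proba).astype(float)

def wrong_prediction(probas, y, classes):
    """Doubt when the predicted class differs from the given label."""
    return (classes[probas.argmax(axis=1)] != y).astype(float)

def short_confidence(probas, y, classes, threshold=0.2):
    """Doubt when the labelled class receives less than `threshold` confidence."""
    col = np.searchsorted(classes, y)  # column of each row's label; `classes` must be sorted
    return (probas[np.arange(len(y)), col] < threshold).astype(float)

classes = np.array([0, 1, 2])
probas = np.array([
    [0.90, 0.05, 0.05],   # confident and correct
    [0.40, 0.35, 0.25],   # correct but unsure
    [0.10, 0.80, 0.10],   # confident in a class other than the label
])
y = np.array([0, 0, 2])

low = low_proba(probas)
wrong = wrong_prediction(probas, y, classes)
short = short_confidence(probas, y, classes)
```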

Related Projects

  • The cleanlab project was an inspiration for this one. They have a great heuristic for bad label detection but I wanted to have a library that implements many. Be sure to check out their work on the labelerrors.com project.
  • My employer, Rasa, has always had a focus on data quality. Some of that attitude is bound to have seeped in here. Be sure to check out Rasa X if you're working on virtual assistants.
Comments
  • `QuantileDifferenceReason` and `StandardDeviationReason`

    Hey! I was wondering whether it would make sense to add two more reasons for regression tasks, namely something like HighLeveragePointReason and HighStudentizedResidualReason.

    Citing Wikipedia:

    • Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables (link)
    • A studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. [...] This is an important technique in the detection of outliers. (link)
    opened by FBruzzesi 31
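
The two quantities quoted above can be computed directly for an ordinary least squares fit. The following is a minimal numpy sketch, assuming a linear model with an intercept and internally studentized residuals; the function name is hypothetical and nothing here is part of doubtlab.

```python
import numpy as np

def leverage_and_studentized_residuals(X, y):
    """Fit OLS with an intercept; return (leverage, studentized residuals)."""
    A = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept column
    H = A @ np.linalg.pinv(A)                   # hat matrix, projects y onto the fit
    leverage = np.diag(H)
    residuals = y - H @ y
    dof = len(y) - A.shape[1]                   # n - p degrees of freedom
    sigma = np.sqrt(residuals @ residuals / dof)
    studentized = residuals / (sigma * np.sqrt(1.0 - leverage))
    return leverage, studentized

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 20.0])  # last point is far from the others
y = 2.0 * X + 1.0
y[2] += 5.0                                     # corrupt one target value

leverage, studentized = leverage_and_studentized_residuals(X, y)
```

In this toy data the point at `X = 20` has the highest leverage, while the corrupted target at index 2 has the largest studentized residual, so the two measures flag different rows.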
  • Doubt Reason Based on Entropy

    If a machine learning model is very "confident" then the proba scores will have low entropy. The most uncertain outcome is a uniform distribution which would contain high entropy. Therefore, it could be sensible to add entropy as a reason for doubt.

    opened by koaning 10
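
A minimal sketch of the entropy idea, assuming the reason receives a matrix of class probabilities. The function name, the normalisation by `log(n_classes)` and the threshold of 0.5 are illustrative assumptions, not the library's implementation.

```python
import numpy as np

def entropy_doubt(probas, threshold=0.5):
    """Flag rows whose normalised Shannon entropy exceeds `threshold`.

    Entropy is divided by log(n_classes), so 0 means fully confident
    and 1 means a uniform distribution. The threshold is an arbitrary choice.
    """
    probas = np.asarray(probas, dtype=float)
    safe = np.clip(probas, 1e-12, 1.0)                # avoid log(0)
    entropy = -(probas * np.log(safe)).sum(axis=1)
    normalised = entropy / np.log(probas.shape[1])
    return (normalised > threshold).astype(float), normalised

probas = np.array([
    [1.0, 0.0, 0.0],          # fully confident
    [1 / 3, 1 / 3, 1 / 3],    # uniform, maximally uncertain
    [0.9, 0.05, 0.05],        # fairly confident
])
flags, normalised = entropy_doubt(probas)
```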
  • Add staticmethods to reasons to prevent re-compute.

    I really like the current design with reasons just being function calls.

    However, when working with large datasets or in use cases where you already have the predictions of a model, I wonder if you have thought about letting users pass either a sklearn model or the pre-computed probas (for those Reasons where it makes sense). For threshold-based reasons and large datasets this could save some time and compute, allow for faster iteration, and it would open up the possibility of using other models beyond sklearn.

    I understand that the design wouldn't be as clean as it is right now and might cause misalignments if users don't send the correct shapes/positions, but I wonder if you have considered this (or any other way to pass pre-computed predictions).

    Just to illustrate what I mean (sorry about the dirty-pseudo code):

    import numpy as np

    class ProbaReason:

        def __init__(self, model=None, probas=None, max_proba=0.55):
            # At least one source of probabilities is required.
            if model is None and probas is None:
                raise ValueError("You should pass at least a model or probas")
            self.model = model
            self.probas = probas
            self.max_proba = max_proba

        def __call__(self, X, y=None):
            # Prefer pre-computed probas; otherwise ask the model.
            probas = self.probas if self.probas is not None else self.model.predict_proba(X)
            result = probas.max(axis=1) <= self.max_proba
            return result.astype(np.float16)
    
    opened by dvsrepo 9
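
One way the static-method idea from the issue title could look is sketched below. The class name `ProbaReasonSketch`, the method name `from_proba` and the stand-in `FixedModel` are assumptions made for illustration; this is not a statement about the library's API.

```python
import numpy as np

class ProbaReasonSketch:
    """Hypothetical reason with a static entry point for pre-computed probabilities."""

    def __init__(self, model, max_proba=0.55):
        self.model = model
        self.max_proba = max_proba

    @staticmethod
    def from_proba(proba, max_proba=0.55):
        # Works on probabilities from any source; no model is required.
        proba = np.asarray(proba, dtype=float)
        return (proba.max(axis=1) <= max_proba).astype(np.float16)

    def __call__(self, X, y=None):
        # The instance path only adds the predict_proba call.
        return self.from_proba(self.model.predict_proba(X), self.max_proba)


class FixedModel:
    """Stand-in model that returns fixed probabilities (for the example only)."""

    def predict_proba(self, X):
        return np.array([[0.9, 0.1], [0.5, 0.5]])[: len(X)]


proba = FixedModel().predict_proba([[0], [1]])
via_static = ProbaReasonSketch.from_proba(proba)           # no re-compute
via_model = ProbaReasonSketch(FixedModel())([[0], [1]])    # computes probas itself
```

Both paths share the same thresholding code, so cached predictions and live models give identical results.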
  • "Fair" Sorting

    Suppose there are 5 reasons for doubt, 4 of which overlap a lot. Then we may end up in a situation where we ignore a reason. That could be bad ... maybe it's worth exploring voting systems a bit to figure out alternative sorting methods.

    opened by koaning 7
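
As one possible alternative to summing flags, the sketch below interleaves each reason's own ranking in round-robin fashion, so a reason that stands alone is not pushed down by several overlapping ones. This is only an exploration of the idea raised above; the function name and the round-robin rule are assumptions, not an existing feature.

```python
import numpy as np

def round_robin_order(predicates):
    """Interleave each reason's own ranking so no reason is crowded out."""
    rankings = []
    for scores in predicates.values():
        scores = np.asarray(scores, dtype=float)
        ranked = np.argsort(-scores, kind="stable")
        rankings.append([int(i) for i in ranked if scores[i] > 0])

    order, seen = [], set()
    longest = max((len(r) for r in rankings), default=0)
    for depth in range(longest):
        for ranking in rankings:            # each reason gets one pick per round
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                order.append(ranking[depth])
    return order

predicates = {
    "a": [1, 1, 0, 0],   # "a" and "b" overlap completely
    "b": [1, 1, 0, 0],
    "c": [0, 0, 0, 1],   # "c" flags a row nobody else does
}
order = round_robin_order(predicates)
```

Sorting by the summed flags would put row 3 last; here it moves up to second place because reason "c" gets its turn in the first round.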
  • Add example to docs that shows lambda X, y: y.isna()

    Hey! First of all: this is a very cool project ;) I have been thinking about potential new "reasons" to doubt and I personally often look into predictions generated by a model whenever the data instance had missing values (and part of the model pipeline imputes them)... So I wonder if it would be useful to have a FillNaNReason (or something similar) based, for example, on the MissingIndicator transformer.

    opened by juanitorduz 4
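
A minimal sketch of such a reason written as a plain function, assuming a reason can be any callable that takes `(X, y)` and returns one score per row. With a pandas Series the one-liner from the title would play the same role; the numpy versions below avoid that dependency, and both function names are hypothetical.

```python
import numpy as np

def missing_label_reason(X, y):
    """Doubt rows whose label is missing (NaN)."""
    return np.isnan(np.asarray(y, dtype=float)).astype(float)

def missing_feature_reason(X, y=None):
    """Doubt rows where any feature is missing, i.e. would need imputation."""
    return np.isnan(np.asarray(X, dtype=float)).any(axis=1).astype(float)

X = np.array([
    [1.0, np.nan],
    [2.0, 3.0],
    [np.nan, np.nan],
])
y = np.array([0.0, np.nan, 1.0])

label_doubt = missing_label_reason(X, y)
feature_doubt = missing_feature_reason(X, y)
```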
  • added conda-install-option and badges to readme

    This closes #14: doubtlab can now be installed with conda from conda-forge channel.

    • [x] Created conda-forge/doubtlab-feedstock to make doubtlab available on conda-forge channel.
    • [x] Added conda install option to readme.
    • [x] Added the following badges to readme.

    GitHub - License, PyPI - Python Version, PyPI - Package Version, PyPI - Downloads, Conda - Platform, Conda (channel only), Docs - GitHub.io

    opened by sugatoray 4
  • Added a LICENSE

    Hi @koaning,

    I am assuming MIT License is okay for this repository. If you think otherwise, please feel free to make changes in the PR accordingly.

    • [x] Added an MIT License
    • [x] ~~Added a Citation file~~ Removed the citation file and updated the name of the PR. - ~~If you have an orcid, please consider adding it to the citation.cff file.~~
    opened by sugatoray 4
  • Add a conda installation option using conda-forge channel

    I have already started this one. Will push a PR once the conda installation option is available.

    See: Adding doubtlab from PyPI to conda-forge channel.

    @koaning As the primary maintainer of this repo, would you like to be listed as one of the maintainers of doubtlab on conda-forge channel? Please let me know, I will add your name as another maintainer of conda-forge/doubtlab-feedstock, once it is accepted.

    opened by sugatoray 3
  • Doubt about MarginConfidenceReason :-)

    Hi Vincent,

    Nice library! As mentioned a while ago on Twitter I'm doing a review to understand and compare different approaches to find label errors.

    I'm playing with the AG News dataset, which we know contains a lot of errors from our own previous experiments with Rubrix (using the training loss and using cleanlab).

    While playing with the different reasons, I'm having difficulty understanding the reasoning behind the MarginConfidenceReason. As far as I can tell, if the model is in doubt, the margin between the top two predicted labels should be small, and that could point to an ambiguous example and/or a label error. If I read the code and description correctly, MarginConfidenceReason is doing the opposite, so I'd love to know the reasoning behind this to make sure I'm not missing something.

    For context, using the MarginConfidenceReason with the AG News training set yields almost the entire dataset (117788 examples for the default threshold of 0.2, and 112995 for threshold=0.5). I guess this could become useful when there's overlap with other reasons, but I want to make sure about the reasoning :-).

    opened by dvsrepo 2
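
The reading described in this issue, where a small gap between the top two classes signals doubt, can be sketched as follows. This illustrates the interpretation put forward above and makes no claim about what MarginConfidenceReason itself computes; the function name and threshold are assumptions.

```python
import numpy as np

def small_margin_doubt(probas, threshold=0.2):
    """Flag rows where the top two class probabilities are within `threshold`."""
    top_two = np.sort(np.asarray(probas, dtype=float), axis=1)[:, -2:]
    margin = top_two[:, 1] - top_two[:, 0]     # best minus second best
    return (margin < threshold).astype(float), margin

probas = np.array([
    [0.50, 0.45, 0.05],   # two classes compete: small margin
    [0.90, 0.05, 0.05],   # clear winner: large margin
])
flags, margin = small_margin_doubt(probas)
```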
  • updated docs: installation and badges

    Updated docs:

    • [x] updated installation (with conda)
    • [x] ~~added badges from readme~~

    @koaning I am not sure if you would prefer to include the badges in the docs (website). If you don't, please feel free to remove them.

    UPDATE: removed badges from the docs (docs/index.md).

    opened by sugatoray 2
  • Issue with cleanlab upgrading to v2

    Issue

    image

    Environment details

    image

    Temporary fix

    pip install "cleanlab==1.0.0"

    More permanent fix

    Pin the cleanlab dependency to "cleanlab<2.0.0"

    More more permanent fix

    They've made some changes to their API

    Let me know if you'd like me to make a PR

    Thanks for a great package @koaning 😄

    opened by duarteocarmo 1
  • Consider a fairlearn demo.

    When two models disagree something interesting might be happening. But that'll only happen if you have two models that are actually different.

    What if you have one model that's better at accuracy and another one that's better at fairness?

    Maybe these labels deserve more attention too.

    opened by koaning 0
  • Assign Doubt for Dissimilarity from Labelled Set

    Suppose that y can contain NaN values if they aren't labeled. In that case, we may want to favor a subset of these NaN values. In particular: if they differ substantially from the already labeled datapoints.

    The idea here is that we may be able to sample more diverse datapoints.

    opened by koaning 10
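
One simple way to score "differs substantially from the already labeled datapoints" is the distance to the nearest labelled row. The sketch below assumes numeric features, Euclidean distance and NaN as the marker for unlabelled rows; the function name is hypothetical and other dissimilarity measures would fit the idea equally well.

```python
import numpy as np

def distance_to_labelled(X, y):
    """Score unlabelled rows (NaN in y) by distance to the nearest labelled row.

    Labelled rows get a score of 0.
    """
    X = np.asarray(X, dtype=float)
    unlabelled = np.isnan(np.asarray(y, dtype=float))
    scores = np.zeros(len(X))
    if unlabelled.any() and (~unlabelled).any():
        # Pairwise differences: (n_unlabelled, n_labelled, n_features)
        diffs = X[unlabelled][:, None, :] - X[~unlabelled][None, :, :]
        scores[unlabelled] = np.linalg.norm(diffs, axis=2).min(axis=1)
    return scores

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
y = np.array([0.0, 1.0, np.nan, np.nan])

scores = distance_to_labelled(X, y)
```

The unlabelled point at `(5, 5)` is far from every labelled row and gets the highest score, which makes it the first candidate to label next.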
  • Does it make sense to add an ensemble for spaCy?

    This seems to be a like-able method of dealing with text outside the realm of scikit-learn. But I prefer to delay this feature until I really understand the use-case. For anything related to entities we cannot use sklearn, but tags/classes should work fine as-is.

    opened by koaning 1
Releases(0.2.4)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].