Algorithmic encoding of protected characteristics and its implications on disparities across subgroups
This repository contains the code for the paper
B. Glocker, S. Winzeck. Algorithmic encoding of protected characteristics and its implications on disparities across subgroups. Under review, 2021. arXiv:2110.14755
Dataset
The CheXpert imaging dataset together with the patient demographic information used in this work can be downloaded from https://stanfordmlgroup.github.io/competitions/chexpert/.
Code
For running the code, we recommend setting up a dedicated Python environment.
Setup Python environment using conda
Create and activate a Python 3 conda environment:
conda create -n chexploration python=3
conda activate chexploration
Install PyTorch using conda:
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
Setup Python environment using virtualenv
Create and activate a Python 3 virtual environment:
virtualenv -p python3 <path_to_envs>/chexploration
source <path_to_envs>/chexploration/bin/activate
Install PyTorch using pip:
pip install torch torchvision
Install additional Python packages:
pip install matplotlib jupyter pandas seaborn pytorch-lightning scikit-learn scikit-image tensorboard tqdm openpyxl
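To confirm the environment is complete, a quick import check such as the following can be run (a minimal sketch; the package list mirrors the install commands above, and some module names differ from their pip package names):

```python
import importlib.util

# Packages installed in the setup steps above; module names differ from
# the pip package names where noted in the comments.
required = [
    "torch", "torchvision", "matplotlib", "pandas", "seaborn",
    "pytorch_lightning",  # pip package: pytorch-lightning
    "sklearn",            # pip package: scikit-learn
    "skimage",            # pip package: scikit-image
    "tqdm", "openpyxl",
]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")
```

If any package is reported as missing, re-run the corresponding install command above.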
How to use
In order to replicate the results presented in the paper, please follow these steps:
- Download the CheXpert dataset and copy the file `train.csv` to the `datafiles` folder
- Download the CheXpert demographics data and copy the file `CHEXPERT DEMO.xlsx` to the `datafiles` folder
- Run the notebook `chexpert.sample.ipynb` to generate the study data
- Adjust the variable `img_data_dir` to point to the imaging data and run the following scripts:
  - Run the script `chexpert.disease.py` to train a disease detection model
  - Run the script `chexpert.sex.py` to train a sex classification model
  - Run the script `chexpert.race.py` to train a race classification model
- Run the notebook `chexpert.predictions.ipynb` to evaluate all three prediction models
- Run the notebook `chexpert.explorer.ipynb` for the unsupervised exploration of feature representations
Additionally, the scripts `chexpert.sex.split.py` and `chexpert.race.split.py` run SPLIT on the disease detection model. By default, all scripts train a DenseNet-121 using the training data from all patients. Results for models trained on subgroups only can be produced by changing the paths to the data files (e.g., using `full_sample_train_white.csv` and `full_sample_val_white.csv` instead of `full_sample_train.csv` and `full_sample_val.csv`).
Note that the Python scripts also contain code for running the experiments with a ResNet-34 backbone, which requires less GPU memory.
Trained models
All trained models, feature embeddings and output predictions can be found here.
Funding sources
This work is supported through funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 757173, Project MIRA, ERC-2017-STG) and by the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare.
License
This project is licensed under the Apache License 2.0.
