Companion code for "Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees"
Installation
We use pip to install dependencies into a Python virtual environment. See requirements.txt for the package requirements. We use nestly + SCons to run simulations.
File descriptions
generate_data_single_pop.py -- Simulate a data stream from a single population following a logistic regression model.
- Inputs:
--simulation: string for selecting the type of distribution shift. Options for this argument are the keys in SIM_SETTINGS in constants.py.
- Outputs:
--out-file: pickle file containing the data stream
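As a rough illustration of what "a data stream from a logistic regression model" means, here is a minimal standard-library sketch. It is not the code in generate_data_single_pop.py: the function name, its parameters, the standard normal covariates, and the single abrupt coefficient change used as the "distribution shift" are all assumptions made for this example. The actual shift types are defined by SIM_SETTINGS in constants.py.

```python
import math
import pickle
import random


def simulate_stream(coefs_before, coefs_after, shift_time, n_obs, seed=0):
    """Hypothetical helper: draw (x, y) pairs with P(y=1 | x) = sigmoid(x . coefs).

    The coefficients switch from coefs_before to coefs_after at shift_time,
    which is one assumed example of a distribution shift.
    """
    rng = random.Random(seed)
    stream = []
    for t in range(n_obs):
        coefs = coefs_before if t < shift_time else coefs_after
        # Covariates are assumed to be independent standard normals.
        x = [rng.gauss(0.0, 1.0) for _ in coefs]
        logit = sum(b * xi for b, xi in zip(coefs, x))
        prob = 1.0 / (1.0 + math.exp(-logit))
        y = 1 if rng.random() < prob else 0
        stream.append((x, y))
    return stream


stream = simulate_stream([1.0, -0.5], [0.2, 0.8], shift_time=100, n_obs=200)
# Serialize the stream with pickle, the format named for --out-file.
payload = pickle.dumps(stream)
```

The real script may store a richer object than a list of tuples; only the pickle format is taken from the description above.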
generate_data_two_pop.py -- Simulate a data stream from two subpopulations, where each is generated using a logistic regression model. Takes similar arguments to generate_data_single_pop.py. The percentage split between the two subpopulations is controlled by the --subpopulations argument.
- Outputs:
--out-file: pickle file containing the data stream
create_modeler.py -- Creates a model developer who fits the original prediction model and may propose a continually refitted model at each time point.
- Inputs:
--data-file: pickle file with the entire data stream
--simulation: string for selecting the model refitting strategy used by the model developer. Options are to keep the model locked (locked), refit on all accumulated data (cumulative_refit), or refit on the latest observations within some window length (boxed, window length specified by --max-box). The last two options are to train an ensemble of the original and the cumulative_refit models (combo_refit) and to train an ensemble of the original and the boxed models (combo_boxed).
- Outputs:
--out-file: pickle file containing the modeler
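The three basic refitting strategies differ only in which observations are used for training at each time point. The sketch below makes that concrete. The function name, the n_init parameter (the number of observations assumed to be used for the original model), and the index-based view of the data stream are hypothetical and are not taken from create_modeler.py.

```python
def training_indices(strategy, t, n_init, max_box=None):
    """Hypothetical helper: indices of the observations used to fit at time t.

    strategy: one of "locked", "cumulative_refit", "boxed".
    t: number of observations seen so far.
    n_init: assumed number of observations used to fit the original model.
    max_box: window length for the "boxed" strategy.
    """
    if strategy == "locked":
        # The model is never refit, so only the initial data matter.
        return range(0, n_init)
    if strategy == "cumulative_refit":
        # Refit on everything accumulated so far.
        return range(0, t)
    if strategy == "boxed":
        if max_box is None:
            raise ValueError("boxed refitting needs a window length")
        # Refit on the latest max_box observations only.
        return range(max(0, t - max_box), t)
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with 10 observations seen and a window of 4, `training_indices("boxed", 10, 3, max_box=4)` selects observations 6 through 9. The combo_refit and combo_boxed options would additionally combine the resulting refitted model with the original one, which is not shown here.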
main.py -- Given the data and the model developer, run online model recalibration/revision using MarBLR and BLR.
- Inputs:
--data-file: pickle file with the entire data stream
--model-file: pickle file with the model developer
--type-i-regret-factor: Type I regret will be controlled at the rate of args.type_i_regret_factor * (initial loss of the original model)
--reference-recalibs: comma-separated string to select which other online model revisers to run. Options are no updating at all (locked), ADAM (adam), and cumulative logistic regression (cumulativeLR).
- Outputs:
--obs-scores-file: csv file containing predicted probabilities and observed outcomes on the data stream
--history-file: csv file containing the predicted and actual probabilities on a held-out test data stream (only available if the data stream was simulated)
--scores-file: csv file containing performance measures on a held-out test data stream (only available if the data stream was simulated)
--recalibrators-file: pickle file containing the history of the online model revisers
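To make the --type-i-regret-factor and --reference-recalibs arguments concrete, here is a small sketch of how the regret rate and the reviser list could be derived from them. This is an illustration, not the logic in main.py: the helper names are hypothetical, and using the mean log loss as the "initial loss of the original model" is an assumption.

```python
import math

# Option names as listed for --reference-recalibs.
KNOWN_REVISERS = ("locked", "adam", "cumulativeLR")


def mean_log_loss(probs, outcomes):
    """Mean negative log-likelihood of binary outcomes (assumed loss)."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def type_i_regret_rate(factor, probs, outcomes):
    """Hypothetical helper: factor * (initial loss of the original model)."""
    return factor * mean_log_loss(probs, outcomes)


def parse_reference_recalibs(arg):
    """Hypothetical helper: split the comma-separated string and validate it."""
    names = [name.strip() for name in arg.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_REVISERS]
    if unknown:
        raise ValueError(f"unknown revisers: {unknown}")
    return names
```

Under these assumptions, an original model that predicts 0.5 for every observation has an initial loss of log 2, so a factor of 0.1 gives a rate of 0.1 * log 2.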
Reproducing simulation results
The simulation_recalib folder contains the first set of simulations for online model recalibration. The simulation_revise folder contains the second set of simulations where we perform online logistic revision. The simulation_revise folder contains the third set of simulations where we perform online ensembling of the original model with a continually refitted model. The copd_analysis folder contains code for online model recalibration and revision for the COPD dataset. To reproduce the simulations, run scons.