MAg: a simple learning-based patient-level aggregation method for detecting microsatellite instability from whole-slide images


MAg

Paper

This repository contains the code and some potentially useful data for the paper "MAg: a simple learning-based patient-level aggregation method for detecting microsatellite instability from whole-slide images".

Our paper has been accepted at the IEEE International Symposium on Biomedical Imaging (ISBI) 2022. The arXiv link is https://arxiv.org/abs/2201.04769.

Abstract

MAg: a simple learning-based patient-level aggregation method for detecting microsatellite instability from whole-slide images

The prediction of microsatellite instability (MSI) and microsatellite stability (MSS) is essential in predicting both the treatment response and prognosis of gastrointestinal cancer. In clinical practice, a universal MSI testing is recommended, but the accessibility of such a test is limited. Thus, a more cost-efficient and broadly accessible tool is desired to cover the traditionally untested patients. In the past few years, deep-learning-based algorithms have been proposed to predict MSI directly from haematoxylin and eosin (H&E)-stained whole-slide images (WSIs). Such algorithms can be summarized as (1) patch-level MSI/MSS prediction, and (2) patient-level aggregation. Compared with the advanced deep learning approaches that have been employed for the first stage, only the naïve first-order statistics (e.g., averaging and counting) were employed in the second stage. In this paper, we propose a simple yet broadly generalizable patient-level MSI aggregation (MAg) method to effectively integrate the precious patch-level information. Briefly, the entire probabilistic distribution in the first stage is modeled as histogram-based features to be fused as the final outcome with machine learning (e.g., SVM). The proposed MAg method can be easily used in a plug-and-play manner, which has been evaluated upon five broadly used deep neural networks: ResNet, MobileNetV2, EfficientNet, Dpn and ResNext. From the results, the proposed MAg method consistently improves the accuracy of patient-level aggregation for two publicly available datasets. It is our hope that the proposed method could potentially leverage the low-cost H&E based MSI detection method.

The comparison between our method and the two commonly used methods (counting and averaging) is shown below:


The proposed method is shown in the figure below:

File structure

Here is the structure of the MAg folder:

[Image: MAg folder structure]

Dataset preparation

1. The complete patch-level datasets can be downloaded from https://zenodo.org/record/2530835#.YXIlO5pBw2z. Each patch belongs to one patient, and the patient can be identified from the file name of the patch. For example, the patch blk-AAAFIYHTSVIE-TCGA-G4-6309-01Z-00-DX1.png belongs to the patient TCGA-G4-6309.

2. We have split the CRC_DX and STAD datasets into a training set, a validation set and a testing set at the patient level. After downloading them from the link, please split the dataset according to the patient names listed in /MAg/name_patients/.

3. If you want to split the dataset differently, you can of course do the split yourself. For your reference, you can use the code at https://github.com/jnkather/MSIfromHE/blob/master/step_05_split_train_test.m to do this split.
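The file-name convention in step 1 can be sketched in a few lines of Python. This is only an illustration and not code from this repository: the helper names patient_id and group_by_patient are hypothetical, and the regular expression assumes that every patch name contains a TCGA barcode of the form TCGA-XX-XXXX, as in the example above.

```python
import re
from collections import defaultdict

# Assumption: each patch file name contains a TCGA patient barcode
# of the form TCGA-XX-XXXX (two characters, then four characters).
PATIENT_RE = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def patient_id(patch_name):
    """Return the patient barcode embedded in a patch file name."""
    match = PATIENT_RE.search(patch_name)
    if match is None:
        raise ValueError(f"no TCGA patient barcode in {patch_name!r}")
    return match.group(0)


def group_by_patient(patch_names):
    """Map each patient barcode to the list of its patch file names."""
    groups = defaultdict(list)
    for name in patch_names:
        groups[patient_id(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    print(patient_id("blk-AAAFIYHTSVIE-TCGA-G4-6309-01Z-00-DX1.png"))
```

Grouping the patches this way makes it straightforward to assign whole patients, rather than individual patches, to the training, validation and testing sets.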

Data description

To help your experiments go smoothly, here is a description of the data you may use as input or produce as output when reproducing MAg:

1. In the notebook 2.0.patch2image_counting.ipynb, you will use files that supply the names of patients. These files are placed in the folder /MAg/name_patients/ and are organized by dataset, set and class.

NOTE: the code contains some patient-level for loops, so please modify the range parameters of these loops according to the number of patients in each set and class.

2. In the notebooks 2.1.patient-level MAg-SVM_histogram.ipynb and 2.2.patient-level MAg-network.ipynb, you will use histogram-based features as the new training, validation and testing sets. These features are produced by 2.0. If you just want to test the performance of MAg without doing a complete reproduction, we also provide the histogram-based features from our experiments in /MAg/datasets, organized by patch-level dataset, set, model and class.

3. To compare the performance of MAg with other baselines, you may also use the baseline results in the code. We provide the results of the counting baseline in /MAg/results/counting_baselines_results, organized by patch-level classification model. These results can also be obtained from 2.0.

4. For your reference, we also provide the result for each patch in the folder /MAg/results/patch_level_result/.

5. We also provide the names of the patches in the folder /MAg/name_patch/.

How to use MAg?

The code of our method is in the demo folder. By following the steps below, you can use MAg to complete training and prediction.

1. First, use 1.patch-level classification training.ipynb to do the patch-level training and get the classification models. The Timm library makes this training process easy. For example, if you want to use ResNet18 in this stage, run the code below after entering the working directory:

import timm
!python train.py path_to_your_dataset -d ImageFolder --drop 0.25 --train-split train --val-split validation --pretrained --model resnet18 --num-classes 2 --opt adam --lr 1e-6 --hflip 0.5 --epochs 40 -b 32 --output path_to_your_model

The script train.py and other useful Timm scripts can be obtained from https://github.com/rwightman/pytorch-image-models. Here are also some very helpful links that teach you how to use Timm: https://fastai.github.io/timmdocs/ and https://rwightman.github.io/pytorch-image-models/

2. Second, after obtaining the patch-level classification model with the process above, use 2.0.patch2image_counting.ipynb to make the patch-level predictions. Follow the code in the notebook to get the patch-level probabilities and the histogram-based features.

3. With the patch-level model and the patient-level histogram-based features from the steps above, you can now train the patient-level classification models. We provide two ways to do this training: an SVM with 2.1.patient-level MAg-SVM_histogram.ipynb, or a two-layer fully-connected neural network with 2.2.patient-level MAg-network.ipynb. If your dataset is not very large, we suggest using 2.1, while 2.2 is the better choice if you have a huge dataset (e.g., a dataset that contains 100000 patients).

4. Both 2.1 and 2.2 contain code for a simple testing process. After getting the patient-level classification model, continue following the code in these two notebooks to get the final results (e.g., F1 score, BACC and AUC).

NOTE: the patient-level training also requires you to follow the split you made before, so please remember to save the patient-level histogram-based features in xlsx files such as train.xlsx, validation.xlsx and test.xlsx.

5. If you just want to reproduce our results with our SVM parameters, please use the demo reproduce_demo.ipynb.

6. In the demo folder, we also provide some notebooks whose file names start with 0. We used these demos in our experiments. Although they are not directly part of the MAg process, they may help you in your own experiments. Each has a different role. For example, 0.3.confusion_matrix.ipynb can help you calculate a patient-level confusion matrix. The role of each demo is described at the beginning of its code.
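To make the aggregation in steps 2 and 3 more concrete, here is a minimal sketch of how the patch-level MSI probabilities of one patient could be turned into a histogram-based feature, next to the counting and averaging baselines. It is not the code from the notebooks: the function names are hypothetical, and the equal-width bins on [0, 1], the normalisation by patch count and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np


def histogram_feature(probs, hist_num=10):
    """Histogram of one patient's patch-level MSI probabilities.

    Assumption: hist_num equal-width bins on [0, 1], normalised by the
    number of patches so that patients with different patch counts
    are comparable.
    """
    counts, _ = np.histogram(probs, bins=hist_num, range=(0.0, 1.0))
    return counts / max(counts.sum(), 1)


def counting_baseline(probs, threshold=0.5):
    """Fraction of patches whose MSI probability reaches the threshold."""
    return float(np.mean(np.asarray(probs) >= threshold))


def averaging_baseline(probs):
    """Mean patch-level MSI probability."""
    return float(np.mean(probs))


if __name__ == "__main__":
    # Toy patient with four patches.
    probs = [0.05, 0.15, 0.15, 0.95]
    print(histogram_feature(probs))   # one feature vector per patient
    print(counting_baseline(probs))
    print(averaging_baseline(probs))
```

The two baselines reduce each patient to a single number, whereas the histogram keeps the shape of the whole probability distribution. That vector is what a patient-level classifier such as an SVM is then trained on.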

Why not try the MAg_lib!

As you can see, these jupyter notebooks look complex and loosely organized, and they do not give MAg the modularity and portability it should have. We therefore provide a very early version of the MAg_lib library, which we hope you can call directly. (Up to now, we only provide the MAg method using SVM. In the future, we may add other ML methods.) Here are some instructions and tips that may help you when using MAg_lib.

  1. To keep the code concise, MAg_lib no longer uses xlsx files to store data. Instead, it uses the dict (or json) format. In MAg_lib.MAg.convert_format, we provide the functions json_file_to_dict and dict_to_json_file to convert between a dict and a json file.

  2. Now let us see how to use the much more concise MAg_lib to run the MAg task. Each function in this library has its own purpose, so we strongly recommend that you open the source files first and quickly scan the comments of each function to understand its role and its input and output formats, and then combine the functions according to your needs. Here is an example:

First, you need three json files that associate the sample names with the paths of the patches, corresponding to the training set, the validation set and the test set. The format of the files looks like this:

[Image: example of the json file format]

Second, run the patch-level prediction with the classification model, which is obtained as in step 1 of How to use MAg?. With the function in MAg_lib, you can directly get a dict that contains the patient-level features (please remember to do this step for all three sets so that you can do the next step). Here is the example code:

import MAg_lib.modules
import timm
model = timm.create_model(model_name, num_classes = 2,checkpoint_path = path_to_model)
save_features_dict = MAg_lib.modules.MAg.get_feature(model,path_to_step1_json,hist_num = 10)

The save_features_dict looks like this:

[Image: example of save_features_dict]

In fact, the function we provide can perform patient-level prediction directly on the json file containing the names of the patches. If you are not interested in the features and want to skip that step, use this function directly to get the prediction results:

save_predict_dict = MAg_lib.modules.MAg.patient_predict(model, path_to_test_json, method, hist_num, svm)

NOTE: up to now we provide three choices for the parameter method: 'counting', 'averaging' and 'MAg', which represent the counting baseline, the averaging baseline and our MAg method. The hist_num and svm arguments are required only when you choose 'MAg'.

You then get the dict that contains the final patient-level prediction results. The save_predict_dict looks like this:

[Image: example of save_predict_dict]

  3. You may then ask: how can I get the SVM I need in MAg? The sklearn.svm module handles this. In our initial experiments, we manually adjusted the parameters of the SVM and used the one that performed best on the validation set for the prediction task. (Please remember to use MAg_lib.modules.convert_format.convert_feature to convert the json file containing the features into the feature lists for training and validation.)

For your reference, we also provide a naive function, similar to grid search, for parameter optimization. If you have a better optimization method, please contact us. We are happy to discuss this topic:

from MAg_lib.modules.MAg import find_best_svm
best_parameters = find_best_svm(X,y,X_val,y_val,['sigmoid'],C,class_weight)

X, y represent the training set and X_val, y_val represent the validation set. The next three parameters are lists of the kernels, penalty coefficients and class weights that you want this function to try.

Here is another function, which evaluates the performance of an SVM:

eval_dict = MAg_lib.modules.MAg.evaluate(X_val,y_val,svm)
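For readers who want to see what such a search and evaluation might look like with plain scikit-learn, here is a minimal sketch. It is not the implementation of find_best_svm or evaluate in MAg_lib: the function names naive_svm_search and evaluate_svm are hypothetical, choosing the model by balanced accuracy on the validation set is an assumption, and the toy data below only stands in for real histogram-based features.

```python
from itertools import product

import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.svm import SVC


def naive_svm_search(X, y, X_val, y_val, kernels, Cs, class_weights):
    """Try every combination and keep the SVM that scores best on validation.

    Assumption: the selection criterion is balanced accuracy.
    """
    best_score, best_svm, best_params = -1.0, None, None
    for kernel, C, cw in product(kernels, Cs, class_weights):
        svm = SVC(kernel=kernel, C=C, class_weight=cw).fit(X, y)
        score = balanced_accuracy_score(y_val, svm.predict(X_val))
        if score > best_score:
            best_params = {"kernel": kernel, "C": C, "class_weight": cw}
            best_score, best_svm = score, svm
    return best_svm, best_params, best_score


def evaluate_svm(svm, X_val, y_val):
    """F1, balanced accuracy and AUC of a fitted SVM on a labelled set."""
    pred = svm.predict(X_val)
    return {
        "F1": f1_score(y_val, pred),
        "BACC": balanced_accuracy_score(y_val, pred),
        "AUC": roc_auc_score(y_val, svm.decision_function(X_val)),
    }


def toy_split(rng, n_per_class, n_bins=10):
    """Toy stand-in for histogram features: class 1 is shifted upwards."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, n_bins))
    X1 = rng.normal(1.5, 1.0, size=(n_per_class, n_bins))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = toy_split(rng, 40)
    X_val, y_val = toy_split(rng, 20)
    svm, params, score = naive_svm_search(
        X, y, X_val, y_val,
        kernels=["linear", "sigmoid"],
        Cs=[0.1, 1.0, 10.0],
        class_weights=[None, "balanced"],
    )
    print(params, evaluate_svm(svm, X_val, y_val))
```

Because the model is selected on the validation set, the final numbers should be reported on the separate test set.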

Trained models

In the folder trained models, we provide the parameters of the SVMs from our past experiments that gave the best performance on the validation set.

NOTE: if you want to use our data for stage-2 training, these parameters only apply to experiments that do not oversample the MSIMUT class. That is, please set the number of training samples to 188 in CRC_DX and 124 in STAD and do not use the remaining copied samples. If you do try oversampling the MSIMUT class, you are more than welcome to tell us your results.

Experiment and results

The experiments were performed on a Google Colab workstation with an NVIDIA Tesla P100 GPU. In stage I, five prevalent models were used as the baseline feature extractors: ResNet, MobileNetV2, EfficientNet, Dpn and ResNext. In stage II, we mainly used an SVM. Moreover, to assess generalizability, the experiments above were run on both the CRC dataset and the STAD dataset. Below are the results of our experiments and the comparison between MAg and the two commonly used methods (counting and averaging):


Some supplements

Because our research is still at a very early stage of exploration, our code may have some defects. We may continue to improve the code in the future, hoping that it can achieve higher portability and modularity. If you encounter any problems when using MAg or have any suggestions for this research, please let us know on GitHub or contact us directly 😊

Owner
Calvin Pang