Bulk2Space is a spatial deconvolution method based on deep learning frameworks

Overview

Bulk2Space

Spatially resolved single-cell deconvolution of bulk transcriptomes using Bulk2Space

python 3.8

Bulk2Space is a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles.

Image text

Installation

For bulk2space, the python version need is over 3.8. If you have installed Python3.6 or Python3.7, consider installing Anaconda, and then you can create a new environment.

conda create -n bulk2space python=3.8.5
conda activate bulk2space

cd bulk2space
pip install -r requirements.txt 

Usage

Run the demo data

If you choose the spatial barcoding-based data(like 10x Genomics or ST) as spatial reference, run the following command:

python bulk2space.py --project_name test1 --data_path example_data/demo1 --input_sc_meta_path demo1_sc_meta.csv --input_sc_data_path demo1_sc_data.csv --input_bulk_path demo1_bulk.csv --input_st_data_path demo1_st_data.csv --input_st_meta_path demo1_st_meta.csv --BetaVAE_H --epoch 10 --spot_data True

else, if you choose the image-based in situ hybridization data(like MERFISH, SeqFISH, and STARmap) as spatial reference, run the following command:

python bulk2space.py --project_name test2 --data_path example_data/demo2 --input_sc_meta_path demo2_sc_meta.csv --input_sc_data_path demo2_sc_data.csv --input_bulk_path demo2_bulk.csv --input_st_data_path demo2_st_data.csv --input_st_meta_path demo2_st_meta.csv --BetaVAE_H --epoch 10 --spot_data False

Run your own data

When using your own data, make sure

  • the bulk.csv file must contain one column of gene expression

    Sample
    Gene1 5.22
    Gene2 3.67
    ... ...
    GeneN 15.76
  • the sc_meta.csv file must contain two columns of cell name and cell type. Make sure the column names are correct, i.e., Cell and Cell_type

    Cell Cell_type
    Cell_1 Cell_1 T cell
    Cell_2 Cell_2 B cell
    ... ... ...
    Cell_n Cell_n Monocyte
  • the st_meta.csv file must contain at least two columns of spatial coordinates. Make sure the column names are correct, i.e., xcoord and ycoord

    xcoord ycoord
    Cell_1 / Spot_1 1.2 5.2
    Cell_2 / Spot_2 5.4 4.3
    ... ... ...
    Cell_n / Spot_n 11.3 6.3
  • the sc_data.csv and st_data.csv files are gene expression matrices

Then you will get your results in the output_data folder.

For more details, see user guide in the document.

About

Bulk2Space manuscript is under major revision. Should you have any questions, please contact Jie Liao at [email protected], Jingyang Qian at [email protected], or Yin Fang at [email protected]

Comments
  • Data availability

    Data availability

    Hey team, thanks for coming up with this useful tool. I'm looking to follow your tutorial on hypothalamus deconvolution, and it seems the lcm.gz data file on your Github only contains a single file, without all the processes count matrices and cell metadata table. Is that supposed to be the case? If so, I wonder how I should process this single file to generate the input data I need. Thanks for any heads up!

    opened by loganminhdang 6
  • Cannot locate the bulk2space.py script and directory after installation

    Cannot locate the bulk2space.py script and directory after installation

    Hi, I'm writing to seek your assistance on an issue I'm having. After installation of the conda environment, I cannot locate the bulk2space directory, which should contain the bulk2space python script to run the algorithm. The installation also seems incomplete, seeing that after I manually retrieve the python script from your Github page, I received the following error message: Traceback (most recent call last): File "bulk2space.py", line 2, in from utils.tool import * ModuleNotFoundError: No module named 'utils'

    I would appreciate any guidance. Thanks!

    opened by loganminhdang 5
  • Preproccessed PDAC data

    Preproccessed PDAC data

    Hello,

    I am trying to understand how to use bulk2space by going though the tutorials. I am currently going though the first tutorial with the PDAC datasets. I would like to know how you generated the preprocessed files "st_data" and "st_meta".

    I went to the original data from Moncada et al. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111672) but I don't know which files you used from there to make the above preprocessed files. Could you clarify that and explain a bit more in detail how you generated "st_data" and "st_meta"? This will be helpful to understand how to process other reference datasets.

    opened by AlexUOM 4
  • the question of

    the question of "quick start" section

    Dear professors, We are very sorry to bother you. We recently downloaded the bulk2space and used the test data of demo1, but we don't know why there are no result output, and we don't know whether the data are written normally. After operation, the Bulk2space-1.0.0-Py3.8egg displayed empty. Some information are as follows. I would wonder if you can help check it in your busy schedule or if there is any other step guidance. bulk2space

    opened by coconutll 2
  • Only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data?

    Only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data?

    Hi, thanks for coming up with this useful tool.  I have bulk RNAseq data and scRNA-seq data from the same patient which was made by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profile. Here are my questions: 1.Why do I only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data? However, there are many other celltypes like Fibroblasts, T cell and B cell in my scRNA reference. 2.How to normalize my bulk data?

    Thanks, Qi.

    opened by zhangqi234 2
  • convert bulk transcriptomes into spatially resolved single-cell expression profile

    convert bulk transcriptomes into spatially resolved single-cell expression profile

    Hi, I'm new to bulk2space, and I only have bulk RNAseq data from mouse brain which was made by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profile. I know how to convert bulk RNAseq data into single cell data. Here are my questions:

    1. How to get the spatial information from my bulk RNAseq data, do I have to do some experiments about spatial information by Laser capture microdissection (LCM) technology?
    2. since my bulk RNAseq data are form brain tissue, tissues contain many layers of cells. How do you distinguish between different layers of cells? Or do I have to do bulk RNAseq from single layers?

    Thanks, Echo.

    opened by Echoloria 2
  • Cannot import CascadeForestClassifier from deepforest

    Cannot import CascadeForestClassifier from deepforest

    I am running the bulk2space.py script via Python 3.8.5. The deepforest package is installed and imports successfully, but I am still receiving the following error message:

    ImportError: cannot import name 'CascadeForestClassifier' from 'deepforest'

    I would appreciate any help you could offer.

    opened by sarah-chapin 2
  • Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

    Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

    Hi,

    Thank you very much for putting together this code.

    I would like to better understand when Bulk2Space might help versus when there are limits to applicability to Bulk2Space, following a journal club presentation where I learned more about the paper and method.

    I apologize that I am not sure how best to precisely ask my question, but I have tried to use a few examples to try and give a sense of what I am asking about.

    Example 1 (Exact Code for Concrete Test):

    In the spirit of a GitHub “issue,” I tried to start with concrete examples for discussion based upon issue #8 .

    I have attached a summary of that analysis (PDAC_Test.pdf), and I have also attached any input files not already provided on this repository.

    However, when I changed the bulk RNA-Seq gene symbols in order to use the same gene symbols for both the PDAC example and the demo1 example, I lost the Ductal cells in the PDAC example that otherwise still used only files derived from the same samples used for the PDAC example. I also have some more details notes in the uploaded PDF.

    Nevertheless, if that might possibly help the discussion, I have provided those.

    If there are any other relatively small files that it would help to upload to GitHub, then I would also be happy to add those. For example, I also ran the analysis with epoch_num=1000 instead of epoch_num=3500. I am currently not providing those results, but my impression is that they look qualitatively similar in terms of cancer cell and ductal cell assignments (for all of the provided PDAC files).

    Example 2 (Theoretical Question):

    Is it possible to run bulk2spatial as described below?

    1) Use bulk RNA-Seq + scRNA-Seq + spatial data that all come from Patient A.

    2) Export model from Patient A.

    3) Only provide bulk RNA-Seq data from patient B, and test how predictions from model defined on Patient A compare to scRNA-Seq and spatial data generated for Patient B.

    • Additionally, if I understand correctly, then I think an image for the tissue for Patient B can not be provided. If so, I think the shape of the issue section for Patient B can’t be known, and I would guess the spatial coordinates from Bulk2Space might not be directly applicable to interpret Patient B. However, if I might be misunderstanding anything, then please let me know.

    Example 3 (Summary Questions):

    Am I correctly understanding that consecutive slides are often used in the paper? For example, the 2 slices in Figure S17f already have different shapes, and it looks like you a projection of estimations on the histology image for slide 2 was not (or could not?) be provided.

    Data from different patients would be even more different. So, is it reasonable and/or correct to say that there is a preference to use all 3 data types generated from the same experiment? Even if the exact slice is not the same, the true composition of the multiple data types can hopefully be as close as possible?

    For example, I am not sure if the difference is sufficiently extreme, but let’s say Patient A has histology like the “Inflammation” sample in Figure 6 and Patient B has histology like the “Cancer” sample in Figure 6. If you didn’t have a spatial transcriptomics (ST) dataset for Patient B, then I think use of the ST data from Patient A might not be of much benefit to Patient B. Do you think that is a fair conclusion?

    Similarly, if your training sample had 90% tumor, then I would expect limitations is looking at the projection from a spatial transcriptomics project where the tissues had a very different percent tumor such as closer to 20% tumor. I would also expect there often could be a challenge in even knowing the general shape of an independent/unrelated tumor sample, and I believe that you should not be able to know the spatial information for the tumor cells within an independent tissue without a more direct measurement.

    I am not sure if the points above might also possibly relate to the shift in the frequency of cancer cells per spot with the reduced/matching gene symbols in the uploaded PDF for Example 1.

    However, if I am then understanding correctly, then might that be at least somewhat contradictory to what I believe is a recommendation to use public data in issue #7? If I might be misunderstanding anything, then please let me know.

    Thank you very much for your help!

    Sincerely, Charles

    Code.zip demo1_bulk-FALSE_PDAC_LABEL.csv demo1_bulk-FALSE_PDAC_LABEL-MATCHING_SUBSET.csv pdac_bulk-MATCHING_SUBSET.csv

    SC Cell_Type_Counts.pdf SC Cell_Type_Correlation.pdf ST Spot_Deconvolution.pdf ST Cancer_Cells_per_Spot.pdf

    PDAC_Test.pdf

    opened by cwarden45 0
  • Confused about the train/test steps

    Confused about the train/test steps

    Dear Professors,

    Thanks for coming up with this great tool. However, I'm confused with how to use it by the tutorial. In PDAC deconvolution, the tutorial only uses the train_vae function, however, in demo1 tutorial for example, it uses additional load_vae_and_generate function from the .pth vae model from train_vae function.

    So here comes to my question, if I only focus on the first step to transform bulkRNA to single-cell RNA (i.e., no consideration of further scRNA to spatial RNA):

    If I have e.g., two bulkRNAseq from 2-month-old and 7-month-old mice lung cancer tissue, say bulkA and bulkB. I also have one single-cell RNA reference, say scRNAref. When I deconvolute bulkA using scRNAref to a new, bulk2space-generated scRNA data (name it "generated-scRNA from bulkA"), I will get a .pth vae model (name it "A.pth"). Next, when I'd like to deconvolute bulkB, which step should I use? Should I 1) use "load_vae_and_generate" function that use the previous A.pth model, or 2) use "train_vae" function that will generate a new B.pth model?

    I believe this is crucial because it directly guides us how to use this tool. In CIBERSORT, we provide only two variables, the bulkRNAseq and the reference immune cell expression profile. The reference would not change most of the time, thus we just feed CIBERSORT with many bulkRNAseq dataset and it will return many generated immune cell expression dataframes. Simple and easy. But in Bulk2space, we got a new .pth model everytime if we follow step 2, and to be honest, I don't know what this .pth model is used for if not following step1 to use it to load and generate new scRNA dataset.

    Besides issues above, if we use step 1), there'll also be problems. What if bulkA and bulkB are from different status of tissues as the example above? I see that in the article, you mentioned that "the state of each cell type still fluctuates within a relatively stable high-dimensional space". But if bulkA was from a pre-cancerous tissue, and bulkB was from a cancerous tissue, would bulk2space still work fine? This is important because if we'd like to deconvolute bulkRNAseq from longitudinal dataset, for example, a series of bulkRNAseq data from 10 timepoints along cancer progression that contains normal, pre-cancerous, turning stage and finally cancerous tissue, or a series of bulkRNAseq data from different development stages of liver, what is the correct way of using bulk2space if I want single-cell RNA dataset from bulkRNA? Would bulk2space still work under this scenario?

    Also, does bulk2space requires that scRNA ref and bulkRNA are from similar status of tissue? For example, can bulk2space deconvolute bulkRNA derived from cancer lung using the reference scRNA derived from normal lung?

    Actually I've tried to use step 1 (i.e., the same model) to deal with my longitudinal dataset but the results seemed very identical concerning the distribution of cell types that bulk2space returned (which should have some difference at least in immune cell types since I'm deconvoluting bulkRNA from normal and cancer tissues using the same scRNA ref). Also, another key issue is, I don't know whether the generated sc_cell_type and sc_data dataframe can be treated as a standard Seurat object that we can use standard analysis pipeline (like filtering nfeature and nCount, scaling, centering, pca, umap, or newly assign cell types according to FindMarkers function, etc. Acturally I've tried on them but the PCA, tSNE or UMAP can't efficiently separate cell types well), and whether different scRNA datasets generated by bulk2space can be supported to integrate into a single Seurat object like other normal single-cell data do?

    Thank you so much and it would be of a great help if the experts in your team who developped this nice tool could answer the issues above.

    opened by Bennylikescoding 1
  •  β-VAE  algorithm in the paper

    β-VAE algorithm in the paper

    Hello, author, In Figure 1b of your paper,I don't know why β-VAE can analyze the rate of cells of each cell type. I have studied this algorithm carefully and its input and output should correspond, so I don't understand why the input cell type is changed into the output of a single cell. Could you please answer it, or what is the input data of this step? image

    opened by wxpbioinfo 0
  • Question: Scalability

    Question: Scalability

    Good day,

    I am eager to test this excellent tool on our data. I have seen in the tutorial and demo data that the vignette uses only one bulk RNA sample as well as an ST experiment.

    Is it possible to scale up and process several bulk RNA samples and ST experiments in one go? and for the inferred single-cell data derived from the bulk, can we have those integrated across multiple biological replicates, as if they were truly scRNA-seq data?

    Thanks in advance!

    opened by ccruizm 2
  • model.train_df_and_spatial_deconvolution error

    model.train_df_and_spatial_deconvolution error

    Hi, thanks for coming up with this useful tool. When I conducted the model.train_df_and_spatial_deconvolution function to decompose ST data into spatially resolved single-cell transcriptomics data, I found the error like "pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False". I don't know what caused this error.

    1668324980352

    opened by zhangqi234 7
Releases(v1.0.0)
Owner
Dr. FAN, Xiaohui
single-cell omics; spatial transcriptomics; TCM network biology
Dr. FAN, Xiaohui
[ICLR 2022] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR

DAB-DETR This is the official pytorch implementation of our ICLR 2022 paper DAB-DETR. Authors: Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi

336 Dec 25, 2022
NVIDIA Deep Learning Examples for Tensor Cores

NVIDIA Deep Learning Examples for Tensor Cores Introduction This repository provides State-of-the-Art Deep Learning examples that are easy to train an

NVIDIA Corporation 10k Dec 31, 2022
Paper: Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification

Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification T M Feroz Ali, Subhasis Chaudhuri, ICVGIP-20-21

T M Feroz Ali 3 Jun 17, 2022
Learning Features with Parameter-Free Layers (ICLR 2022)

Learning Features with Parameter-Free Layers (ICLR 2022) Dongyoon Han, YoungJoon Yoo, Beomyoung Kim, Byeongho Heo | Paper NAVER AI Lab, NAVER CLOVA Up

NAVER AI 65 Dec 07, 2022
Code for the IJCAI 2021 paper "Structure Guided Lane Detection"

SGNet Project for the IJCAI 2021 paper "Structure Guided Lane Detection" Abstract Recently, lane detection has made great progress with the rapid deve

Jinming Su 27 Dec 08, 2022
Streamlit component for TensorBoard, TensorFlow's visualization toolkit

streamlit-tensorboard This is a work-in-progress, providing a function to embed TensorBoard, TensorFlow's visualization toolkit, in Streamlit apps. In

Snehan Kekre 27 Nov 13, 2022
PyTorch Implementation of Temporal Output Discrepancy for Active Learning, ICCV 2021

Temporal Output Discrepancy for Active Learning PyTorch implementation of Semi-Supervised Active Learning with Temporal Output Discrepancy, ICCV 2021.

Siyu Huang 33 Dec 06, 2022
Official implementation of "DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation"

DSP Official implementation of "DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation". Accepted by ACM Multimedia 2021. Authors

20 Oct 24, 2022
A coin flip game in which you can put the amount of money below or equal to 1000 and then choose heads or tail

COIN_FLIPPY ##This is a simple example package. You can use Github-flavored Markdown to write your content. Coinflippy A coin flip game in which you c

2 Dec 26, 2021
Bootstrapped Unsupervised Sentence Representation Learning (ACL 2021)

Install first pip3 install -e . Training python3 training/unsupervised_tuning.py python3 training/supervised_tuning.py python3 training/multilingual_

yanzhang_nlp 26 Jul 22, 2022
OpenVINO黑客松比赛项目

Window_Guard OpenVINO黑客松比赛项目 英文名称:Window_Guard 中文名称:窗口卫士 硬件 树莓派4B 8G版本 一个磁石开关 USB摄像头(MP4视频文件也可以) 软件(库) OpenVINO RPi 使用方法 本项目使用的OPenVINO是是2021.3版本,并使用了

Tango 6 Jul 04, 2021
CLIP+FFT text-to-image

Aphantasia This is a text-to-image tool, part of the artwork of the same name. Based on CLIP model, with FFT parameterizer from Lucent library as a ge

vadim epstein 690 Jan 02, 2023
Official implementation of the ICLR 2021 paper

You Only Need Adversarial Supervision for Semantic Image Synthesis Official PyTorch implementation of the ICLR 2021 paper "You Only Need Adversarial S

Bosch Research 272 Dec 28, 2022
Transformer part of 12th place solution in Riiid! Answer Correctness Prediction

kaggle_riiid Transformer part of 12th place solution in Riiid! Answer Correctness Prediction. Please see here for more information. Execution You need

Sakami Kosuke 2 Apr 23, 2022
An easier way to build neural search on the cloud

An easier way to build neural search on the cloud Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g

Jina AI 17k Jan 02, 2023
Compressed Video Action Recognition

Compressed Video Action Recognition Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl. In CVPR, 2018. [Proj

Chao-Yuan Wu 479 Dec 26, 2022
Point Cloud Registration using Representative Overlapping Points.

Point Cloud Registration using Representative Overlapping Points (ROPNet) Abstract 3D point cloud registration is a fundamental task in robotics and c

ZhuLifa 36 Dec 16, 2022
Colour detection is necessary to recognize objects, it is also used as a tool in various image editing and drawing apps.

Colour Detection On Image Colour detection is the process of detecting the name of any color. Simple isn’t it? Well, for humans this is an extremely e

Astitva Veer Garg 1 Jan 13, 2022
Train SN-GAN with AdaBelief

SNGAN-AdaBelief Train a state-of-the-art spectral normalization GAN with AdaBelief https://github.com/juntang-zhuang/Adabelief-Optimizer Acknowledgeme

Juntang Zhuang 10 Jun 11, 2022
This repository contains the DendroMap implementation for scalable and interactive exploration of image datasets in machine learning.

DendroMap DendroMap is an interactive tool to explore large-scale image datasets used for machine learning. A deep understanding of your data can be v

DIV Lab 33 Dec 30, 2022