SPACEL workflow (1/3): Deconvolution by Spoint on mouse brain ST dataset

July 2023

Dataset: 75 ST slices of mouse brain (here)

[2]:

import os
import sys

import anndata
import numpy as np
import pandas as pd
import scanpy as sc
from tqdm import tqdm

Load spatial transcriptomics data

The inputs are anndata objects that store raw counts for the scRNA-seq and ST data. The scRNA-seq anndata object must include a cell type annotation column in .obs.

[3]:

adata = sc.read('../data/ST_mouse_brain/mouse_brain_st.h5ad')
scadata = sc.read_h5ad('../data/ST_mouse_brain/scRNA_Mouse_Nervous_System.h5ad')

[4]:

scadata.var_names_make_unique()
scadata.obs_names_make_unique()
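Before training, it can help to confirm that the inputs meet the requirements stated above (raw counts, plus a cell type column in .obs for the scRNA-seq data). The following is a minimal sketch, not part of SPACEL: `check_inputs` is a hypothetical helper, and the toy matrix and table only stand in for `scadata.X` and `scadata.obs`. It assumes the matrix is either a dense array or a scipy sparse matrix that exposes its non-zero entries through `.data`.

```python
import numpy as np
import pandas as pd


def check_inputs(X, obs, celltype_key=None):
    """Return a list of problems found in a counts matrix and its .obs table.

    Hypothetical helper for illustration; it is not a SPACEL function.
    """
    problems = []
    # scipy sparse matrices (CSR/CSC/COO) keep non-zero entries in .data;
    # anything else is treated as a dense array.
    values = X.data if hasattr(X, "tocsr") else np.asarray(X)
    values = np.asarray(values, dtype=float)
    if values.size and (values < 0).any():
        problems.append("negative values found; raw counts should be non-negative")
    if values.size and not np.allclose(values, np.round(values)):
        problems.append("non-integer values found; the matrix may be normalized or log-transformed")
    if celltype_key is not None:
        if celltype_key not in obs.columns:
            problems.append(f"'{celltype_key}' is missing from .obs")
        elif obs[celltype_key].isna().any():
            problems.append(f"'{celltype_key}' contains missing annotations")
    return problems


# Toy stand-ins for scadata.X and scadata.obs (values invented for illustration).
toy_counts = np.array([[0, 3, 1], [2, 0, 5]])
toy_obs = pd.DataFrame({"Description": ["Purkinje cells", "Neuroblasts, cerebellum"]})

ok_report = check_inputs(toy_counts, toy_obs, celltype_key="Description")
bad_report = check_inputs(np.log1p(toy_counts), toy_obs, celltype_key="celltype")
print(ok_report)
print(bad_report)
```

On the real data, the same helper could be called as `check_inputs(scadata.X, scadata.obs, celltype_key='Description')` for the scRNA-seq object and `check_inputs(adata.X, adata.obs)` for the ST object.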

Initialize and train the Spoint model

In this step, we initialize the Spoint model using the scRNA-seq and ST anndata objects as input. The celltype_key parameter is the name of the column in .obs of the scRNA-seq anndata object that holds the cell type annotation. The sm_size parameter controls the number of simulated spots. A sufficiently large sm_size is important for accurate prediction, but increasing it also increases the simulation and training time. In general, we recommend setting sm_size to a value greater than 100,000.
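To make the idea of a simulated spot concrete, the sketch below builds pseudo-spots by summing the counts of a few randomly drawn cells and recording the resulting cell type proportions as the training target. It is a conceptual illustration only and is not Spoint's implementation: the function name `simulate_spots`, the `cells_per_spot` range, and the uniform sampling are all assumptions, and the sampling modes that Spoint reports in its log are not reproduced here.

```python
import numpy as np


def simulate_spots(counts, labels, n_spots, cells_per_spot=(2, 10), seed=0):
    """Build pseudo-spots by summing the counts of randomly drawn cells.

    Hypothetical illustration, not Spoint's simulation procedure.
    Returns (spot_counts, proportions, celltypes).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    labels = np.asarray(labels)
    celltypes = np.unique(labels)
    spot_counts = np.zeros((n_spots, counts.shape[1]), dtype=counts.dtype)
    proportions = np.zeros((n_spots, len(celltypes)))
    for i in range(n_spots):
        # Draw how many cells contribute to this spot (upper bound inclusive).
        n_cells = rng.integers(cells_per_spot[0], cells_per_spot[1] + 1)
        # Sample cells uniformly with replacement.
        picked = rng.integers(0, counts.shape[0], size=n_cells)
        spot_counts[i] = counts[picked].sum(axis=0)
        for j, ct in enumerate(celltypes):
            proportions[i, j] = np.mean(labels[picked] == ct)
    return spot_counts, proportions, celltypes


# Toy scRNA-seq data: 50 cells x 8 genes, two invented cell types.
toy_rng = np.random.default_rng(1)
toy_counts = toy_rng.poisson(2, size=(50, 8))
toy_labels = np.repeat(["A", "B"], 25)

sim_counts, sim_props, sim_types = simulate_spots(toy_counts, toy_labels, n_spots=5)
print(sim_counts.shape, sim_props.shape, list(sim_types))
```

In this picture, sm_size corresponds to `n_spots`: more simulated spots give the model more labeled mixtures to learn from, at the cost of longer simulation and training.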

[5]:

import SPACEL
from SPACEL.setting import set_environ_seed
set_environ_seed()
from SPACEL import Spoint

Setting environment seed: 42
Using GPU: 1

Global seed set to 0

[6]:

spoint_model = Spoint.init_model(scadata,adata,celltype_key='Description',sm_size=500000,use_gpu=True,n_threads=2)
spoint_model.train(max_steps=5000, batch_size=1024)

Setting global seed: 42
### Finding marker genes...
Adrenergic cell groups of the medulla              200
Noradrenergic neurons of the medulla               200
Non-telencephalon astrocytes, fibrous              200
Non-border Cck interneurons, hippocampus           200
Non-border Cck interneurons, cortex/hippocampus    200
                                                  ...
Neuroblasts, cerebellum                            122
Purkinje cells                                     116
Granular layer interneurons, cerebellum             67
Pmch neurons, hypothalamus                          46
Neuroblast-like, habenula                           18
Name: Description, Length: 103, dtype: int64
### Used gene numbers: 5723
### Initializing sample probability
### Genetating simulated spatial data using scRNA data with mode: unbalance
### Genetating simulated spatial data using scRNA data with mode: sqrt
### Genetating simulated spatial data using scRNA data with mode: balance

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]

Epoch 100/100: 100%|██████████| 100/100 [51:24<00:00, 32.87s/it, loss=3.79e+03, v_num=1]

`Trainer.fit` stopped: `max_epochs=100` reached.

Epoch 100/100: 100%|██████████| 100/100 [51:24<00:00, 30.84s/it, loss=3.79e+03, v_num=1]

Step 5000: test inf loss=-0.787, train inf loss=-0.736, test rec loss=-0.381, train rec loss=-0.345, st test rec loss=-0.399, mmd loss=0.015: 100%|██████████| 5000/5000 [4:36:21<00:00,  3.32s/it]

Next, we use the trained model to predict the cell type composition of each spot in the ST data. The prediction is returned as a DataFrame in which each row corresponds to a spot, each column to a cell type from the scRNA-seq data, and each entry to the predicted proportion of that cell type in that spot. The ST anndata object with the deconvolution results stored in .obs is also available from the model as spoint_model.st_ad.

[7]:

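# DataFrame of predicted proportions (spots x cell types)
pre = spoint_model.deconv_spatial()
# ST anndata object with the predictions stored in .obs
st_ad = spoint_model.st_ad
# Note: this path is the same file that was loaded as input, so it is overwritten
st_ad.write('../data/ST_mouse_brain/mouse_brain_st.h5ad')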
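As a small illustration of how a spots-by-cell-types table of this shape can be summarized, the sketch below uses a toy DataFrame in place of `pre`. The spot names and proportion values are invented for the example; only the layout (rows are spots, columns are cell types) follows the description above.

```python
import pandas as pd

# Toy stand-in for `pre`: rows are spots, columns are cell types.
# All values are invented for illustration.
toy_pre = pd.DataFrame(
    {
        "Purkinje cells": [0.70, 0.10, 0.05],
        "Neuroblasts, cerebellum": [0.20, 0.30, 0.15],
        "Pmch neurons, hypothalamus": [0.10, 0.60, 0.80],
    },
    index=["spot_1", "spot_2", "spot_3"],
)

# Cell type with the largest predicted proportion in each spot.
dominant = toy_pre.idxmax(axis=1)
# The proportion of that dominant cell type.
dominant_prop = toy_pre.max(axis=1)
# Average proportion of each cell type over all spots, largest first.
mean_prop = toy_pre.mean(axis=0).sort_values(ascending=False)

print(dominant)
print(mean_prop)
```

The same three lines should apply unchanged to the real `pre` returned by `deconv_spatial()`, assuming it is an ordinary pandas DataFrame as described.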

Visualize results

As a demonstration, we plot the predicted cell type compositions on a single slice (slice 37). The cell types shown are the 20 with the highest maximum predicted proportion across the spots of that slice.

[8]:

import matplotlib
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['ps.fonttype'] = 42
matplotlib.rcParams['font.serif'] = ['Arial']
sc.settings.set_figure_params(dpi=50,dpi_save=300,facecolor='white',fontsize=10,vector_friendly=True,figsize=(3,3))
sc.settings.verbosity = 3

[9]:

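# Keep only the spots that belong to slice 37
st_ad = st_ad[st_ad.obs['slice'] == 37]
# For each cell type, take its maximum predicted proportion over the spots of this slice
celltype = st_ad.obs.loc[:, pre.columns].max(0)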

[10]:

sc.pl.embedding(st_ad,color=celltype.sort_values(ascending=False)[:20].index,basis='spatial',ncols=5)

[Figure: spatial plots of the predicted proportions of the 20 selected cell types on slice 37]
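The selection used for the plot ranks cell types by their maximum proportion over spots, which favors cell types that are concentrated in a few spots. The sketch below contrasts this with ranking by mean proportion on a toy table. `top_celltypes` is a hypothetical helper and the values are invented; it is meant only to show how the two criteria can pick different cell types.

```python
import pandas as pd


def top_celltypes(props, k, by="max"):
    """Return the k column names of `props` with the largest per-column max or mean.

    Hypothetical helper for illustration; it is not a SPACEL function.
    """
    if by not in ("max", "mean"):
        raise ValueError("by must be 'max' or 'mean'")
    score = props.max(axis=0) if by == "max" else props.mean(axis=0)
    return list(score.sort_values(ascending=False).index[:k])


# Toy proportions: rows are spots, columns are invented cell types.
toy = pd.DataFrame(
    {
        "focal": [0.9, 0.0, 0.0, 0.0],  # very high in one spot, absent elsewhere
        "broad": [0.1, 0.5, 0.5, 0.5],  # moderate in most spots
        "rare": [0.0, 0.2, 0.1, 0.1],
        "other": [0.0, 0.3, 0.4, 0.4],
    }
)

by_max = top_celltypes(toy, k=2, by="max")
by_mean = top_celltypes(toy, k=2, by="mean")
print(by_max)
print(by_mean)
```

Ranking by maximum suits spatial plots, where a sharply localized cell type is often the most informative pattern; ranking by mean would instead highlight the cell types that dominate the slice overall.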