0%

python-scanpy

scanpy | Preprocessing and clustering 3k PBMCs

本文记录使用scanpy处理3k PBMCs scRNA-seq数据的流程。

环境配置

创建一个虚拟环境以方便管理相关的库。

1
2
3
4
5
6
7
8
9
conda create --name pysc python=3.9
conda activate pysc
conda install -c anaconda ipykernel
python -m ipykernel install --user --name pysc

pip3 install scanpy
pip3 install pandas
pip3 install loompy

本文使用PBMCs 3k数据可以在该网址下载(http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz).

下载后,将文件解压至当前的data目录下.

1
2
3
4
# !mkdir data
# !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir output
1
2
3
import numpy as np
import pandas as pd
import scanpy as sc
1
2
3
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
e:\miniconda3\envs\pysc\lib\site-packages\umap\distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
e:\miniconda3\envs\pysc\lib\site-packages\umap\distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
e:\miniconda3\envs\pysc\lib\site-packages\umap\distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
e:\miniconda3\envs\pysc\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm


scanpy==1.9.3 anndata==0.9.1 umap==0.5.3 numpy==1.24.4 scipy==1.11.1 pandas==2.0.3 scikit-learn==1.3.0 statsmodels==0.14.0 python-igraph==0.10.6 pynndescent==0.5.10


e:\miniconda3\envs\pysc\lib\site-packages\umap\umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
1
results_file='output/pbmc3k.h5ad' # the file that will store the analysis results
1
2
3
4
adata = sc.read_10x_mtx('data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True # write a cache file for faster subsequent reading
)
... reading from cache file cache\data-filtered_gene_bc_matrices-hg19-matrix.h5ad
1
2
3
# remove duplicated symbols
adata.var_names_make_unique() # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
adata
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

我们读入的数据为AnnData格式。

AnnData是python中处理带注释的数据的一种格式,读入的数据并不直接读入内存当中,而是创建与磁盘数据的链接来进行处理。对于单细胞测序数据而言,一般与细胞相关数据存储于.obs,而与feature相关数据存储于.var

anndata_schema

Preprocessing

scanpy包提供的api有几个主要的modules,包括:

  • preprocessing: scanpy.pp, such as pp.calculate_qc_metrics, pp.filter_cells, pp.filter_genes, …;
  • tools: scanpy.tl, such as tl.pca, tl.tsne, tl.umap, …;
  • plots: scanpy.pl, such as pl.scatter, pl.highest_expr_genes, pl.umap, …

    https://scanpy.readthedocs.io/en/stable/api.html

接下来,我们先使用scanpy包进行数据预处理,展示高表达基因在各个细胞中的counts。

1
sc.pl.highest_expr_genes(adata, n_top=20)
normalizing counts per cell
    finished (0:00:00)

执行简单的过滤操作。保留至少有200个基因表达的细胞,至少有3个细胞表达的基因。

1
2
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells

然后,再根据基因的counts和线粒体基因表达进行进一步过滤。

首先,计算线粒体基因比例.

adata.var 存的是feature-level相关的信息,adata.obs 存的是cell-level的信息

1
2
3
adata.var['mt'] = adata.var_names.str.startswith("MT-") 
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata.obs['pct_counts_mt']
AAACATACAACCAC-1    3.017776
AAACATTGAGCTAC-1    3.793596
AAACATTGATCAGC-1    0.889736
AAACCGTGCTTCCG-1    1.743085
AAACCGTGTATGCG-1    1.224490
                      ...   
TTTCGAACTCTCAT-1    2.110436
TTTCTACTGAGGCA-1    0.929422
TTTCTACTTCCTCG-1    2.197150
TTTGCATGAGAGGC-1    2.054795
TTTGCATGCCTCAC-1    0.806452
Name: pct_counts_mt, Length: 2700, dtype: float32
1
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)
e:\miniconda3\envs\pysc\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)

1
2
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')


过滤基因数和线粒体基因占比过高的细胞

通过切片的方法过滤

1
2
3
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
adata
View of AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

接下来,我们对read counts进行文库大小校正,并进行log转换。

随后,根据normalized expression鉴定highly-variable genes以供后续PCA等分析使用。

1
2
3
4
5
6
7
8
9
10
11
# make a copy of raw count
adata.layers['counts'] = adata.X.copy()

# Total-count normalization
sc.pp.normalize_total(adata, target_sum=1e4)

# log transformation
sc.pp.log1p(adata)

# store normalized data
adata.layers['data'] = adata.X.copy()
normalizing counts per cell
    finished (0:00:00)


e:\miniconda3\envs\pysc\lib\site-packages\scanpy\preprocessing\_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
1
2
3
# Identify highly-variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

鉴定的基因存储在adata.var.highly_variable中,后续可被PCA、clustering和UMAP/tSNE等函数识别,所以不需要再对数据进行过滤。

1
adata.var.highly_variable
AL627309.1       False
AP006222.2       False
RP11-206L10.2    False
RP11-206L10.9    False
LINC00115        False
                 ...  
AC145212.1       False
AL592183.1       False
AL354822.1       False
PNRC2-1          False
SRSF10-1         False
Name: highly_variable, Length: 13714, dtype: bool

这里将adata.raw设置为校正后的数据,以供后续差异分析和可视化使用。相当于是把默认的表达矩阵替换为校正过的。可以通过.raw.to_adata()再替换为原始数据。

1
2
adata.raw = adata
print(adata.raw)
Raw AnnData with n_obs × n_vars = 2638 × 13714
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
1
2
3
4
5
6
7
8
9
10
11
# filtering hvg
adata = adata[:, adata.var.highly_variable]

# regress out effects of total counts per cell and the percentage of mt genes expressed
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])

# cale each gene to unit variance. clip values exceeding standard deviation 10.
sc.pp.scale(adata, max_value=10)

# store scaled data
adata.layers['scaled'] = adata.X.copy()
regressing out ['total_counts', 'pct_counts_mt']
    sparse input is densified and may lead to high memory use
    finished (0:00:03)
1
2
3
4
5
# PCA
sc.tl.pca(adata, svd_solver='arpack')

# elbow plot
sc.pl.pca_variance_ratio(adata, log=True)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:00)

1
adata
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

Computing the neighborhood graph

接下来,我们基于PCA计算细胞的邻接图

1
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)

Embedding the neighborhood graph

使用UMAP可视化

1
2
sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3','NKG7', 'PPBP'])
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:03)

pl.umap默认使用adata.raw中的值作图,由于上面我们将其替换为了normalized后的值。如果想用scaled values作图,可以使用参数use_raw=False

1
sc.pl.umap(adata, color=['CST3','NKG7', 'PPBP'], use_raw=False)


Clustering the neighborhood graph

使用Leiden graph-clustering进行聚类

1
2
sc.tl.leiden(adata, resolution=1)
sc.pl.umap(adata, color=['leiden'])
running Leiden clustering
    finished: found 8 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)


e:\miniconda3\envs\pysc\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

1
2
# save the result
adata.write(results_file)

Finding marker genes

Wilcoxon rank-sum test to identify markers

1
2
3
# compare each group with the rest of cell
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False) # sharey: Controls if the y-axis of each panels should be shared.
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:01)

1
adata.write(results_file)
1
2
3
4
5
6
7
# marker gene
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

# top 5 ranked genes per cluster
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)

0 1 2 3 4 5 6 7
0 RPS12 LYZ CD74 CCL5 NKG7 LST1 HLA-DPA1 PF4
1 LDHB S100A9 CD79A NKG7 GNLY FCER1G HLA-DPB1 SDPR
2 RPS25 S100A8 HLA-DRA B2M GZMB AIF1 HLA-DRA GNG11
3 RPS27 TYROBP CD79B CST7 CTSW COTL1 HLA-DRB1 PPBP
4 RPS6 FTL HLA-DPB1 GZMA PRF1 FCGR3A CD74 NRGN
1
2
3
4
5
6
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:1] : result[key][group] # create new dictionary
for group in groups for key in ['names', 'pvals']}
).head(5)

0_n 0_p 1_n 1_p 2_n 2_p 3_n 3_p 4_n 4_p 5_n 5_p 6_n 6_p 7_n 7_p
0 RPS12 1.659522e-223 LYZ 7.634876e-249 CD74 2.487145e-183 CCL5 1.674383e-128 NKG7 1.203971e-96 LST1 1.322111e-88 HLA-DPA1 5.422417e-21 PF4 4.722886e-10
1 LDHB 3.010889e-218 S100A9 4.626358e-246 CD79A 1.679730e-170 NKG7 5.307754e-104 GNLY 1.257170e-88 FCER1G 6.259712e-85 HLA-DPB1 7.591860e-21 SDPR 4.733899e-10
2 RPS25 9.032935e-198 S100A8 1.622835e-238 HLA-DRA 6.949695e-167 B2M 1.364508e-81 GZMB 1.429027e-88 AIF1 1.348814e-83 HLA-DRA 1.306768e-19 GNG11 4.733899e-10
3 RPS27 8.254312e-188 TYROBP 2.957652e-220 CD79B 2.569135e-154 CST7 7.977466e-78 CTSW 4.144726e-87 COTL1 5.974694e-82 HLA-DRB1 1.865104e-19 PPBP 4.744938e-10
4 RPS6 8.002057e-185 FTL 2.479195e-214 HLA-DPB1 3.580735e-148 GZMA 7.266551e-74 PRF1 1.692100e-85 FCGR3A 1.392377e-77 CD74 5.853161e-19 NRGN 4.800511e-10
1
2
3
# compare between groups
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)

1
2
# 0 vs. 1
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
e:\miniconda3\envs\pysc\lib\site-packages\seaborn\categorical.py:166: FutureWarning: Setting a gradient palette using color= is deprecated and will be removed in version 0.13. Set `palette='dark:black'` for same effect.
  warnings.warn(msg, FutureWarning)

1
2
3
# Reload the object with the computed differential expression (i.e. DE via a comparison with the rest of the groups):
adata = sc.read(results_file)
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
e:\miniconda3\envs\pysc\lib\site-packages\seaborn\categorical.py:166: FutureWarning: Setting a gradient palette using color= is deprecated and will be removed in version 0.13. Set `palette='dark:black'` for same effect.
  warnings.warn(msg, FutureWarning)

1
2
# plot expression across clusters
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')



1
2
3
4
5
6
7
8
# rename cluster based on cell types
new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')
WARNING: saving figure to file figures\umap.pdf


e:\miniconda3\envs\pysc\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

1
2
# dotplot for marker genes
sc.pl.dotplot(adata, marker_genes, groupby='leiden')
e:\miniconda3\envs\pysc\lib\site-packages\scanpy\plotting\_dotplot.py:749: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  dot_ax.scatter(x, y, **kwds)

1
2
# violin plot
sc.pl.stacked_violin(adata, marker_genes, groupby='leiden')



1
adata
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

总结

scanpy中scRNA-seq分析流程包括:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# read in data
adata = sc.read_10x_mtx('data/filtered_gene_bc_matrices/hg19/', cache=True)
# normalization
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# scale
sc.pp.scale(adata, max_value=10)
# PCA
sc.tl.pca(adata, svd_solver='arpack')
# find neighbors
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# clustering
sc.tl.leiden(adata)
# UMAP
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'])
# find markers
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

由于后续需要在Geneformer中使用该数据,这里我将行名替换为ensembl id,并在细胞注释.obs中增加一列cell_type指定细胞类型和organ_major指定组织类型。同时,替换默认表达矩阵为原始counts。随后,以loom格式存储。

1
2
3
4
5
6
# make dataset suitable for Geneformer
# convert row attributes to Ensembl IDs
adata.var['gene_name'] = adata.var_names
adata.var_names = adata.var['gene_ids']
adata.var.rename(columns={"gene_ids": "ensembl_id"}, inplace=True)
adata.var

ensembl_id n_cells mt n_cells_by_counts mean_counts pct_dropout_by_counts total_counts highly_variable means dispersions dispersions_norm mean std gene_name
gene_ids
ENSG00000186827 ENSG00000186827 155 False 155 0.077407 94.259259 209.0 True 0.277410 2.086050 0.665406 -3.244448e-10 0.424481 TNFRSF4
ENSG00000127054 ENSG00000127054 202 False 202 0.094815 92.518519 256.0 True 0.385194 4.506987 2.955005 -1.157975e-10 0.460416 CPSF3L
ENSG00000215915 ENSG00000215915 9 False 9 0.009259 99.666667 25.0 True 0.038252 3.953486 4.352607 8.472988e-12 0.119465 ATAD3C
ENSG00000162585 ENSG00000162585 501 False 501 0.227778 81.444444 615.0 True 0.678283 2.713522 0.543183 3.502168e-10 0.685145 C1orf86
ENSG00000157916 ENSG00000157916 608 False 608 0.298148 77.481481 805.0 True 0.814813 3.447533 1.582528 9.461503e-11 0.736050 RER1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ENSG00000160223 ENSG00000160223 34 False 34 0.016667 98.740741 45.0 True 0.082016 2.585818 1.652185 9.984445e-12 0.217672 ICOSLG
ENSG00000184900 ENSG00000184900 570 False 570 0.292963 78.888889 791.0 True 0.804815 4.046776 2.431045 -3.911696e-10 0.723121 SUMO3
ENSG00000173638 ENSG00000173638 31 False 31 0.018519 98.851852 50.0 True 0.058960 3.234231 2.932458 -1.969986e-10 0.173017 SLC19A1
ENSG00000160307 ENSG00000160307 94 False 94 0.076667 96.518519 207.0 True 0.286282 3.042992 1.078783 5.980517e-10 0.399946 S100B
ENSG00000160310 ENSG00000160310 588 False 588 0.275926 78.222222 745.0 True 0.816647 2.774169 0.629058 -6.552444e-10 0.762753 PRMT2

1838 rows × 14 columns

1
2
3
4
5
# add cell_type and organ_major columns for cell
adata.obs['cell_type'] = adata.obs['leiden']
adata.obs['organ_major'] = 'immune'
adata.obs.rename(columns={"total_counts": "n_counts"}, inplace=True)
adata.obs

n_genes n_genes_by_counts n_counts total_counts_mt pct_counts_mt leiden cell_type organ_major
AAACATACAACCAC-1 781 779 2419.0 73.0 3.017776 CD4 T CD4 T immune
AAACATTGAGCTAC-1 1352 1352 4903.0 186.0 3.793596 B B immune
AAACATTGATCAGC-1 1131 1129 3147.0 28.0 0.889736 CD4 T CD4 T immune
AAACCGTGCTTCCG-1 960 960 2639.0 46.0 1.743085 FCGR3A Monocytes FCGR3A Monocytes immune
AAACCGTGTATGCG-1 522 521 980.0 12.0 1.224490 NK NK immune
... ... ... ... ... ... ... ... ...
TTTCGAACTCTCAT-1 1155 1153 3459.0 73.0 2.110436 CD14 Monocytes CD14 Monocytes immune
TTTCTACTGAGGCA-1 1227 1224 3443.0 32.0 0.929422 B B immune
TTTCTACTTCCTCG-1 622 622 1684.0 37.0 2.197150 B B immune
TTTGCATGAGAGGC-1 454 452 1022.0 21.0 2.054795 B B immune
TTTGCATGCCTCAC-1 724 723 1984.0 16.0 0.806452 CD4 T CD4 T immune

2638 rows × 8 columns

1
2
3
# replace count matrix with raw counts
adata.X = adata.layers['counts']
adata.X.toarray()[10:15,0:3]
array([[0., 1., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)
1
2
3
4
# save in h5ad
adata.write(results_file)
# save in loom
adata.write_loom("output/pbmc3k.loom")
The loom file will lack these fields:
{'X_umap', 'PCs', 'X_pca'}
Use write_obsm_varm=True to export multi-dimensional annotations
e:\miniconda3\envs\pysc\lib\site-packages\loompy\bus_file.py:68: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def twobit_to_dna(twobit: int, size: int) -> str:
e:\miniconda3\envs\pysc\lib\site-packages\loompy\bus_file.py:85: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def dna_to_twobit(dna: str) -> int:
e:\miniconda3\envs\pysc\lib\site-packages\loompy\bus_file.py:102: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def twobit_1hamming(twobit: int, size: int) -> List[int]:

Ref:

https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html#Clustering-3K-PBMCs

https://anndata.readthedocs.io/en/latest/