Seurat standard pipeline
记录一下Seurat标准的单细胞分析流程,这里使用官方提供的pbmc3k作为示例
pbmc3k: https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
Seurat单细胞分析流程主要就是以下十句代码
1 | pbmc.counts <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/") |
以下详细展开某一步的功能
1 | library(dplyr) |
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
1 | library(Seurat) |
## Attaching SeuratObject
1 | library(patchwork) |
创建Seurat对象
Seurat接受counts文件作为输入(一般经过cellranger处理),创建包含细胞信息和counts信息的对象。
1 | # Load the PBMC dataset |
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
1 | pbmc |
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
Seurat对象中的counts以稀疏矩阵的方式存储以节省内存,.
表示没有检测到counts
1 | # Lets examine a few genes in the first thirty cells |
## 3 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 4 . 10 . . 1 2 3 1 .
## TCL1A . . . . . . . . 1 .
## MS4A1 . 6 . . . . . . 1 1
质控
一般而言,我们需要对数据进行质控以保证数据的质量,在进行后续的分析。常用的质控指标包括:
每个细胞的唯一基因数目
低质量或空液泡往往只能检测到少量基因
双液泡(doublet)或多液泡(multiplets)会具有异常多的基因数目
每个细胞的总counts数(相当于每个细胞的测序深度)
线粒体基因占比
- 低质量或死细胞会具有异常高的线粒体基因表达
由于每个细胞的基因数和测序深度在cellranger分析的时候已经计算过了,这里我们只需要再计算线粒体基因表达的比例即可
1 | # The [[ operator can add columns to object metadata. This is a great place to stash QC stats |
Seurat将细胞相关的元数据以列的形式存储在 pbmc@meta.data
1 | # Show QC metrics for the first 5 cells |
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
1 | # Visualize QC metrics as a violin plot |
随后,我们过滤掉基因数(nFeature_RNA
)大于2500或小于200的细胞,以及线粒体基因组比例大于5%的细胞
1 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) |
需要注意的是这里的过滤标准在适合在这个数据集中使用,未必适用于其他的数据集。更好的质控方法是根据质控指标的分位数进行过滤,例如过滤掉 nFeature_RNA
上四分位数和下四分位数的细胞。
另外,这里只使用了三种指标对细胞进行质控,在实际分析中我们还可以使用其他工具进行更精密的质控,例如:
- SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data (https://github.com/constantAmateur/SoupX)
- DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors (https://github.com/chris-mcginnis-ucsf/DoubletFinder)
- DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data (https://github.com/powellgenomicslab/DropletQC)
Normalization
在质控后,我们进行counts的normalization,默认使用 “LogNormalize” 的方法,即将每个基因的counts除以细胞总的counts数,乘上10,000,再进行对数转换。
1 | pbmc <- NormalizeData(pbmc) |
Seurat提供了另外的normalization方法,通过
normalization.method
指定, 包括:
“CLR”: centered log ratio transformation
“RC”: equals to “LogNormalize” without log-transformation
校正后的数据在 pbmc[["RNA"]]@data
1 | pbmc[["RNA"]]@data[c("CD3D", "TCL1A", "MS4A1"), 1:10] |
## 3 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 2.864242 . 3.489706 . . 1.726902 2.321937 2.658463 2.179642
## TCL1A . . . . . . . . 2.179642
## MS4A1 . 2.583047 . . . . . . 2.179642
##
## CD3D .
## TCL1A .
## MS4A1 2.309182
特征选择
Seurat选择在细胞细胞之间具有高度变异性的基因(例如某些细胞高表达,而其他细胞不表达)进行后续分析,这是由于这些基因可以代表了细胞与细胞间的主要生物学差异
1 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) |
## When using repel, set xnudge and ynudge to 0 for optimal results
1 | plot1 + plot2 |
默认选择前2000个高度变异基因。
数据缩放
Normalization后,需要对数据进行缩放(Scaling)。 Scaling后,数据的均值为0,方差为1
1 | all.genes <- rownames(pbmc) |
## Centering and scaling data matrix
scaled data存放在 pbmc[["RNA"]]@scale.data
1 | pbmc[["RNA"]]@scale.data[c("CD3D", "TCL1A", "MS4A1"), 1:10] |
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## CD3D 1.2509633 -0.9797929 1.7380926 -0.9797929
## TCL1A -0.3187677 -0.3187677 -0.3187677 -0.3187677
## MS4A1 -0.4110536 2.5965712 -0.4110536 -0.4110536
## AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 AAACGCTGACCAGT-1 AAACGCTGGTTCTT-1
## CD3D -0.9797929 0.3651696 0.8286000 1.0906967
## TCL1A -0.3187677 -0.3187677 -0.3187677 -0.3187677
## MS4A1 -0.4110536 -0.4110536 -0.4110536 -0.4110536
## AAACGCTGTAGCCA-1 AAACGCTGTTTCTG-1
## CD3D 0.7177763 -0.9797929
## TCL1A 2.3330706 -0.3187677
## MS4A1 2.1268583 2.2776908
线性降维
Seurat 使用PCA进行降维,这里只对 FindVariableFeatures
挑选出的高变基因进行PCA分析
1 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) |
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
1 | DimPlot(pbmc, reduction = "pca") |
1 | DimHeatmap(pbmc, dims = 1:5, cells = 500, balanced = TRUE) |
维数选择
Seurat在主成分PC上进行聚类。然而直接对所有PC聚类是不现实的,我们需要选择足够的PC以代表数据的主要变异度,同时控制计算资源的开销。
因此,Seurat结合JackStraw程序和置换检验对PC进行显著性分析,鉴定出显著的PC以进行后续分析。
1 | # NOTE: This process can take a long time for big datasets, comment out for expediency. More |
PC11之后,PC的p-value就发生了迅速的上升,而变得不显著。
1 | JackStrawPlot(pbmc, dims = 1:15) |
## Warning: Removed 23496 rows containing missing values (geom_point).
我们还可以结合elbow plot进行判断,选择拐点和曲线平滑的PC
1 | ElbowPlot(pbmc) |
综上,我们选取前10个维度进行后续分析
细胞聚类
Seurat使用基于图的聚类算法对细胞进行聚类
FindNeighbors
中的 dims
参数指定聚类使用的维度
FindClusters
中的 resolution
参数指定类别的精度,越大则分出越多的类;越小则类别越少
1 | pbmc <- FindNeighbors(pbmc, dims = 1:10) |
## Computing nearest neighbor graph
## Computing SNN
1 | pbmc <- FindClusters(pbmc, resolution = 0.5) |
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95965
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds
1 | table(Idents(pbmc)) |
##
## 0 1 2 3 4 5 6 7 8
## 711 480 472 344 279 162 144 32 14
非线性降维(UMAP/tSNE)
非线性降维捕捉数据内部的流式(manifold)以将细胞投射到低维空间中。
1 | # If you haven't installed UMAP, you can do so via reticulate::py_install(packages = |
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 22:52:48 UMAP embedding parameters a = 0.9922 b = 1.112
## 22:52:48 Read 2638 rows and found 10 numeric columns
## 22:52:48 Using Annoy for neighbor search, n_neighbors = 30
## 22:52:48 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 22:52:48 Writing NN index file to temp file C:\Users\lda\AppData\Local\Temp\Rtmp8CGyov\file5f84f137628
## 22:52:48 Searching Annoy index using 1 thread, search_k = 3000
## 22:52:49 Annoy recall = 100%
## 22:52:49 Commencing smooth kNN distance calibration using 1 thread
## 22:52:50 Initializing from normalized Laplacian + noise
## 22:52:50 Commencing optimization for 500 epochs, with 105124 positive edges
## 22:52:57 Optimization finished
1 | pbmc <- RunTSNE(pbmc, dims = 1:10) |
1 | # note that you can set `label = TRUE` or use the LabelClusters function to help label |
1 | DimPlot(pbmc, reduction = "tsne", label = TRUE) |
UMAP和tSNE的降维效果不同,需要根据实际情况选择。
在这里可以保存中间数据,作为一个checkpoint
1 | saveRDS(pbmc, file = "data/pbmc_tutorial.rds") |
鉴定差异表达特征(cluster markers)
Seurat支持对cluster之间进行差异表达分析,主要有 FindMarkers
和 FindAllMarkers
两种方法。
这里鉴定cluster 5和cluster 0, 3之间的差异基因。如果不指定 ident.2
则鉴定cluster 5 与其余clusters的差异基因。
min.pct
指定差异基因需要在cluster中的表达占比
1 | # find all markers distinguishing cluster 5 from clusters 0 and 3 |
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 2.150929e-209 4.267579 0.975 0.039 2.949784e-205
## IFITM3 6.103366e-199 3.877105 0.975 0.048 8.370156e-195
## CFD 8.891428e-198 3.411039 0.938 0.037 1.219370e-193
## CD68 2.374425e-194 3.014535 0.926 0.035 3.256286e-190
## RP11-290F20.3 9.308287e-191 2.722684 0.840 0.016 1.276538e-186
FindAllMarkers
可以一次寻找所有clusters的markers,但只返回上调的markers
1 | # find markers for every cluster compared to all remaining cells, report only the positive |
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
1 | pbmc.markers %>% |
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
## # A tibble: 18 x 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 2 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 3 0. 5.57 0.996 0.215 0. 1 S100A9
## 4 0. 5.48 0.975 0.121 0. 1 S100A8
## 5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
## 6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
## 7 0. 4.31 0.936 0.041 0. 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
## 10 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
## 14 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
## 18 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
Visualization
Seurat提供多种基因表达量可视化方法
- 小提琴图
1 | VlnPlot(pbmc, features = c("MS4A1", "CD79A")) |
- 细胞降维图
1 | FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", |
- 热图
1 | pbmc.markers %>% |
细胞注释
我们可以根据细胞marker基因的表达对细胞进行注释。虽然目前有一些自动注释的工具,但总的来说大家还是根据细胞的经典markers对细胞进行注释。
这里,我们根据教程中提供的cluster markers和细胞类型进行注释
Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 CD14, LYZ CD14+ Mono
2 IL7R, S100A4 Memory CD4+
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet
1 | new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", |
1 | saveRDS(pbmc, file = "data/pbmc3k_final.rds") |
至此,Seurat分析的常规流程就结束了。
Ref:
Seurat - Guided Clustering Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html