Single-cell sequencing: scImpute

An accurate and robust imputation method scImpute for single-cell RNA-seq data

http://jsb.ucla.edu/sites/default/files/publications/NC_scImpute.pdf

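A technology paper from UCLA, published in Nature Communications (NC) in 2018. I tried the software and it does work, but on my own data it showed no obvious advantage, so more validation is needed.

Still, the principle is worth understanding thoroughly so that it can inform later analyses.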

The workflow of the imputation step in the scImpute method. scImpute first learns each gene's dropout probability in each cell by fitting a mixture model. Next, scImpute imputes the (highly probable) dropout values in cell j (gene set $A_j$) by borrowing information of the same gene in other similar cells, which are selected based on gene set $B_j$ (not severely affected by dropout events).

Schematic of how the matrix is selected:

Comparison with other software, with a blank control (no imputation) and parallel controls: scImpute, MAGIC, SAVER

one more dimension reduction plot:

clustering:

The results come first, and they do look quite good. So how are the algorithm and the statistical model actually built?

Step-by-step breakdown:

establishing the normalized count matrix

1. PCA is performed on matrix X for dimension reduction and the resulting matrix is denoted as Z, where columns represent cells and rows represent principal components (PCs). The purpose of dimension reduction is to reduce the impact of large portions of dropout values. The PCs are selected such that at least 40% of the variance in data could be explained.

2. Based on the PCA-transformed data Z, the distance matrix $D_{J \times J}$ between the cells could be calculated. For each cell j, we denote its distance to the nearest neighbor as $l_j$. For the set $L = \{l_1, \dots, l_J\}$, we denote its first quartile as $Q_1$, and third quartile as $Q_3$. The outlier cells are those cells which do not have close neighbors:

$$ O = \{\, j : l_j > Q_3 + 1.5\,(Q_3 - Q_1) \,\} \tag{1} $$

For each outlier cell, we set its candidate neighbor set $N_j = \emptyset$. Please note that the outlier cells could be a result of experimental/technical errors or biases, but they may also represent real biological variation as rare cell types. scImpute would not impute gene expression values in outlier cells, nor use them to impute gene expression values in other cells.

3. The remaining cells $\{1, \dots, J\} \setminus O$ are clustered into K groups by spectral clustering (ref. 23 of the paper). We denote $g_j = k$ if cell j is assigned to cluster k (k = 1, …, K). Hence, cell j has the candidate neighbor set $N_j = \{\, j' : g_{j'} = g_j,\ j' \neq j \,\}$.

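In short: PCA is applied to the normalized matrix from the previous step, giving a PCA-transformed matrix Z. From Z, a cell-to-cell distance matrix D (J × J) is computed, and outliers are identified from each cell's distance to its nearest neighbor in that matrix, according to equation 1.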
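The PCA and outlier steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the scImpute implementation: the function names `pca_scores` and `outlier_cells` are made up here, and the outlier rule is assumed to be the usual boxplot criterion, $l_j > Q_3 + 1.5\,(Q_3 - Q_1)$.

```python
import numpy as np

def pca_scores(X, min_var=0.40):
    """X is genes x cells. Return Z (PCs x cells), keeping just enough
    PCs to explain at least `min_var` of the total variance."""
    Xc = X - X.mean(axis=1, keepdims=True)          # center each gene across cells
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative variance ratio
    n_pc = int(np.searchsorted(explained, min_var)) + 1
    return s[:n_pc, None] * Vt[:n_pc]               # PC scores, one column per cell

def outlier_cells(Z):
    """Flag cells whose nearest-neighbor distance is unusually large
    (assumed rule: l_j > Q3 + 1.5 * (Q3 - Q1))."""
    diff = Z[:, :, None] - Z[:, None, :]
    D = np.sqrt((diff ** 2).sum(axis=0))            # J x J Euclidean distances
    np.fill_diagonal(D, np.inf)                     # ignore self-distances
    l = D.min(axis=1)                               # distance to nearest neighbor
    q1, q3 = np.percentile(l, [25, 75])
    return np.where(l > q3 + 1.5 * (q3 - q1))[0]
```

Calling `outlier_cells(pca_scores(X))` on a normalized genes × cells matrix returns the indices of the cells that would be set aside before clustering.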

The remaining cells (those not in the outlier set O) are then clustered into K groups, and these K groups can be regarded as the subpopulations of the cell population.
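Once cluster labels are available, building the candidate neighbor sets is straightforward. The sketch below assumes the labels come from some clustering routine (spectral clustering in the paper) and that outlier cells are given as a list of indices; `candidate_neighbors` is a hypothetical helper name.

```python
def candidate_neighbors(labels, outliers):
    """For each cell j, list the other non-outlier cells in the same cluster.
    Outlier cells get an empty neighbor set and are never used as neighbors."""
    outliers = set(outliers)
    neighbors = {}
    for j, g_j in enumerate(labels):
        if j in outliers:
            neighbors[j] = []
            continue
        neighbors[j] = [
            jp for jp, g in enumerate(labels)
            if jp != j and jp not in outliers and g == g_j
        ]
    return neighbors
```

For example, with `labels = [0, 0, 1, None, 1, 0]` and `outliers = [3]`, cell 0 gets neighbors `[1, 5]` and cell 3 gets none.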

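############### Once we have settled which cells are usable, we need a statistically meaningful description of how expression is distributed across this large number of single cells:
For each gene i, its expression in cell subpopulation k is modeled as a random variable $X_i^{(k)}$ with density function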

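$$ f_{X_i^{(k)}}(x) = \lambda_i^{(k)}\,\mathrm{Gamma}\big(x;\ \alpha_i^{(k)}, \beta_i^{(k)}\big) + \big(1 - \lambda_i^{(k)}\big)\,\mathrm{Normal}\big(x;\ \mu_i^{(k)}, \sigma_i^{(k)}\big) \tag{2} $$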

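where $\lambda_i^{(k)}$ is the dropout rate of gene i in cell subpopulation k; $\alpha_i^{(k)}$ and $\beta_i^{(k)}$ are the shape and rate parameters of gene i's Gamma distribution; and $\mu_i^{(k)}$ and $\sigma_i^{(k)}$ are the mean and standard deviation of gene i's Normal distribution. These parameters are maximum-likelihood estimates obtained with the EM (expectation-maximization) algorithm.

The point of this formula is that, across the different expression levels of a gene, it gives a principled way to judge whether an observed value is a dropout (the Gamma component) or reflects real biological variation (the Normal component).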

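$$ d_{ij} = \frac{\hat\lambda_i^{(k)}\,\mathrm{Gamma}\big(X_{ij};\ \hat\alpha_i^{(k)}, \hat\beta_i^{(k)}\big)}{\hat\lambda_i^{(k)}\,\mathrm{Gamma}\big(X_{ij};\ \hat\alpha_i^{(k)}, \hat\beta_i^{(k)}\big) + \big(1 - \hat\lambda_i^{(k)}\big)\,\mathrm{Normal}\big(X_{ij};\ \hat\mu_i^{(k)}, \hat\sigma_i^{(k)}\big)} \tag{3} $$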

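Equation 3 is the formula for the dropout probability $d_{ij}$ of gene i in cell j, where cell j belongs to subpopulation k.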
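To make the dropout probability concrete, here is a small sketch that evaluates it for one observed value, assuming it is the posterior weight of the Gamma component in the Gamma-Normal mixture described above. The parameter values in the usage note are invented for illustration, and the function names are not from the scImpute package. Expression values are assumed to be strictly positive (for example, log-transformed counts with a pseudocount).

```python
import math

def gamma_pdf(x, shape, rate):
    """Gamma density with shape/rate parameterization, for x > 0."""
    if x <= 0:
        raise ValueError("x must be strictly positive")
    log_pdf = (shape * math.log(rate) + (shape - 1) * math.log(x)
               - rate * x - math.lgamma(shape))
    return math.exp(log_pdf)

def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def dropout_probability(x, lam, shape, rate, mean, sd):
    """Posterior probability that value x came from the dropout (Gamma) component."""
    dropout_part = lam * gamma_pdf(x, shape, rate)
    expressed_part = (1 - lam) * normal_pdf(x, mean, sd)
    return dropout_part / (dropout_part + expressed_part)
```

With made-up parameters `lam=0.3, shape=0.5, rate=5, mean=2, sd=0.5`, a value near zero such as `x=0.01` gets a dropout probability close to 1, while `x=2.0` gets a probability close to 0.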

Now for the core of the paper, namely how to impute the dropout values identified above:

Imputation of dropout values. Now, we impute the gene expressions cell by cell. For each cell j, we select a gene set $A_j$ in need of imputation based on the genes' dropout probabilities in cell j: $A_j = \{i : d_{ij} \ge t\}$, where t is a threshold on dropout probabilities. We also have a gene set $B_j = \{i : d_{ij} < t\}$ that have accurate gene expression with high confidence and do not need imputation. We learn cells' similarities through the gene set $B_j$. Then we impute the expression of genes in the set $A_j$ by borrowing information from the same gene's expression in other similar cells learned from $B_j$. Supplementary Figs. 19 and 20c give some real data distributions of genes' zero count proportions across cells and genes' dropout probabilities, showing that it is reasonable to divide genes into two sets. To learn the cells similar to cell j from $B_j$, we use the non-negative least squares (NNLS) regression:

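$$ \hat\beta^{(j)} = \arg\min_{\beta^{(j)}} \left\| X_{B_j, j} - X_{B_j, N_j}\,\beta^{(j)} \right\|_2^2 \quad \text{subject to } \beta^{(j)} \ge 0 \tag{4} $$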

Recall that $N_j$ represents the indices of cells that are candidate neighbors of cell j. The response $X_{B_j, j}$ is a vector representing the $B_j$ rows in the j-th column of X, the design matrix $X_{B_j, N_j}$ is a sub-matrix of X with dimensions $|B_j| \times |N_j|$, and the coefficients $\beta^{(j)}$ is a vector of length $|N_j|$. Note that NNLS itself has the property of leading to a sparse estimate $\hat\beta^{(j)}$, whose components may have exact zeros (ref. 39 of the paper), so NNLS can be used to select similar cells of cell j from its neighbors $N_j$. Finally, the estimated coefficients $\hat\beta^{(j)}$ from the set $B_j$ are used to impute the expression of genes in the set $A_j$ in cell j:

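$$ \hat X_{ij} = \begin{cases} X_{ij}, & i \in B_j \\ X_{i, N_j}\,\hat\beta^{(j)}, & i \in A_j \end{cases} \tag{5} $$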

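After all that, the core idea is this: we learn cells' similarities through the gene set $B_j$, then impute the expression of genes in the set $A_j$ by borrowing information from the same gene's expression in other similar cells learned from $B_j$. NNLS stands for non-negative least squares regression, and it is what finds the cells similar to cell j among its candidate neighbors. $\hat\beta^{(j)}$ is the coefficient vector estimated on gene set $B_j$, and it is then used to impute the dropout-affected expression values of the genes in $A_j$. Here $A_j$ is the set of genes identified as needing imputation, while $B_j$ is the set of genes whose values are considered accurate and need no imputation; the split comes from comparing $d_{ij}$ with the threshold t. The resulting $\hat\beta^{(j)}$ is a sparse estimate, that is, a coefficient vector whose components may be exactly zero.

At this point, each entry $\hat X_{ij}$ of the imputed matrix falls into one of two cases, depending on whether gene i belongs to gene set $A_j$ or gene set $B_j$.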
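The per-cell imputation can be sketched as follows. This is a simplified illustration under several assumptions: a plain projected-gradient loop stands in for a dedicated NNLS solver (in practice a ready-made routine such as SciPy's `scipy.optimize.nnls` would be the usual choice), the threshold `t=0.5` is just an example value, and `impute_cell` is a hypothetical name, not the scImpute API.

```python
import numpy as np

def nnls_projected_gradient(A, y, n_iter=5000):
    """Minimize ||y - A @ beta||^2 subject to beta >= 0 by projected gradient
    descent (a simple stand-in for a dedicated NNLS solver)."""
    beta = np.zeros(A.shape[1])
    lipschitz = np.linalg.norm(A, 2) ** 2       # largest singular value squared
    if lipschitz == 0:
        return beta
    for _ in range(n_iter):
        grad = A.T @ (A @ beta - y)
        beta = np.maximum(0.0, beta - grad / lipschitz)  # step, then clip at 0
    return beta

def impute_cell(X, d, j, neighbors, t=0.5):
    """Impute cell j of X (genes x cells) given dropout probabilities d
    (same shape as X) and the indices of its candidate neighbor cells."""
    A_j = d[:, j] >= t                          # genes that need imputation
    B_j = ~A_j                                  # genes trusted as accurate
    beta = nnls_projected_gradient(X[B_j][:, neighbors], X[B_j, j])
    x_new = X[:, j].copy()                      # genes in B_j keep their values
    x_new[A_j] = X[A_j][:, neighbors] @ beta    # borrow from similar cells
    return x_new, beta
```

Because the weights are fitted only on the trusted genes in $B_j$, values that are already well measured are left untouched, which is the property the paper emphasizes.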

We construct a separate regression model for each cell to impute the expression of genes with high dropout probabilities. The whole scImpute procedure needs only two user-set parameters: K, the number of groups the cells are clustered into, and t, the threshold on dropout probabilities.

Advantages of scImpute according to the article: scImpute simultaneously determines the values that need imputation, and would not introduce biases to the high expression values of accurately measured genes. In addition, scImpute's imputation is relatively conservative: it neither over-imputes nor leaves the data overly sparse.

############ Validation step

Generation of simulated scRNA-seq data. ## See the paper for details.

Four evaluation measures of clustering results

(adjusted Rand index, Jaccard index, normalized mutual information (nmi), and purity)

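Adjusted Rand index (ARI): a classic and widely used measure for evaluating clustering and classification results; it penalizes both false-positive and false-negative decisions.

Jaccard index (JI): similar to ARI, but it does not take true-negative events into account.

NMI: interprets the similarity between the two sets of subpopulations from an information-theoretic perspective.

Purity: the percentage of samples that fall in the dominant true class of the cluster they were assigned to.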
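For readers who want to see how these measures are computed, here is a small pure-Python sketch of three of them (Jaccard index, ARI, and purity) based on counting pairs of cells. NMI is left out to keep the example short. The function names are my own, and the ARI uses the standard pair-counting form; this is an illustration, not the evaluation code from the paper.

```python
from collections import Counter
from itertools import combinations

def pair_counts(truth, pred):
    """Count cell pairs by whether they share a true class / a predicted cluster."""
    tp = fp = fn = tn = 0
    for a, b in combinations(range(len(truth)), 2):
        same_true = truth[a] == truth[b]
        same_pred = pred[a] == pred[b]
        if same_pred and same_true:
            tp += 1          # together in both
        elif same_pred:
            fp += 1          # clustered together, but different true classes
        elif same_true:
            fn += 1          # same true class, but split apart
        else:
            tn += 1          # apart in both
    return tp, fp, fn, tn

def jaccard_index(truth, pred):
    tp, fp, fn, _ = pair_counts(truth, pred)   # true negatives are ignored
    total = tp + fp + fn
    return tp / total if total else 1.0

def adjusted_rand_index(truth, pred):
    tp, fp, fn, tn = pair_counts(truth, pred)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return 2.0 * (tp * tn - fp * fn) / denom if denom else 1.0

def purity(truth, pred):
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    # each cluster contributes the size of its dominant true class
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)
```

The pair counts make the difference between the measures visible: the Jaccard index drops the true-negative count entirely, while ARI uses all four counts.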

Original post: https://www.cnblogs.com/beckygogogo/p/9194947.html

Time: 2024-08-02 21:18:38
