单细胞数据初步处理 | drop-seq | QC | 质控 | 正则化 normalization

比对

The raw Drop-seq data was processed with the standard pipeline (Drop-seq tools version 1.12 from McCarroll laboratory). Reads were aligned to the ENSEMBL release 84Mus musculusgenome.

10x Genomics data was processed using the same pipeline as for Drop-seq data, adjusting the barcode locations accordingly

我还没有深入接触10x和drop-seq的数据,目前的10x数据都是用官网cellranger跑出来的。

质控

We selected cells for downstream processing in each Drop-seq run, using the quality control metrics output by the Drop-seq tools package9, as well as metrics derived from the UMI matrix.

1) We first removed cells with a low number (<700) of unique detected genes. From the remaining cells, we filtered additional outliers.

2) We removed cells for which the overall alignment rate was less than the mean minus three standard deviations.

3) We removed cells for which the total number of reads (after log10 transformation) was not within three standard deviations of the mean.

4) We removed cells for which the total number of unique molecules (UMIs, after log10 transformation) was not within three standard deviations of the mean.

5) We removed cells for which the transcriptomic alignment rate (defined by PCT_USABLE_BASES) was not within three standard deviations of the mean.

6) We removed cells that showed an unusually high or low number of UMIs given their number of reads by fitting a loess curve (span= 0.5, degree= 2) to the number of UMIs with number of reads as predictor (both after log10 transformation). Cells with a residual more than three standard deviations away from the mean were removed.

7) With the same criteria, we removed cells that showed an unusually high or low number of genes given their number of UMIs. Of these filter steps, step 1 removed the majority of cells.

Steps 2 to 7 removed only a small number of additional cells from each eminence (2% to 4%), and these cells did not exhibit unique or biologically informative patterns of gene expression.

1. 过滤掉基因数量太少的细胞;

2. 过滤基因组比对太差的细胞;

3. 过滤掉总reads数太少的细胞;

4. 过滤掉UMI太少的细胞;

5. 过滤掉转录本比对太少的细胞;

6. 根据统计分析,过滤reads过多或过少的细胞;

7. 根据统计分析,过滤UMI过低或过高的细胞;

注:连过滤都有点统计的门槛,其实也简单,应该是默认为正态分布,去掉了左右极端值。

还有一个就是简单的拟合回归,LOESS Curve Fitting (Local Polynomial Regression)

正则化

The raw data per Drop-seq run is a UMI count matrix with genes as rows and cells as columns. The values represent the number of UMIs that were detected. The aim of normalization is to make these numbers comparable between cells by removing the effect of sequencing depth and biological sources of heterogeneity that may confound the signal of interest, in our case cell cycle stage.

目前有很多正则化的方法,但是作者还是自己开发了一个。

正则化就是去掉一些影响因素,使得我们的数据之间可以相互比较。这里就提到了两个最主要的因素:测序深度和细胞周期。

A common approach to correct for sequencing depth is to create a new normalized expression matrix x with (see Fig), in which ci,j is the molecule count of gene i in cell j and mj is the sum of all molecule counts for cell j. This approach assumes that ci,j increases linearly with mj, which is true only when the set of genes detected in each cell is roughly the same.

可以看到常规的正则化方法是不适合的,

However, for Drop-seq, in which the number of UMIs is low per cell compared to the number of genes present, the set of genes detected per cell can be quite different. Hence, we normalize the expression of each gene separately by modelling the UMI counts as coming from a generalized linear model with negative binomial distribution, the mean of which can be dependent on technical factors related to sequencing depth. Specifically, for every gene we model the expected value of UMI counts as a function of the total number of reads assigned to that cell, and the number of UMIs per detected gene (sum of UMI divided by number of unique detected genes).

这个就有些门槛了,用了广义线性回归模型来做正则化。

To solve the regression problem, we use a generalized linear model (glm function of base R package) with a regularized overdispersion parameter theta. Regularizing theta helps us to avoid overfitting which could occur for genes whose variability is mostly driven by biological processes rather than sampling noise and dropout events. To learn a regularized theta for every gene, we perform the following procedure.

1) For every gene, obtain an empirical theta using the maximum likelihood model (theta.ml function of the MASS R package) and the estimated mean vector that is obtained by a generalized linear model with Poisson error distribution.

2) Fit a line (loess, span = 0.33, degree = 2) through the variance–mean UMI count relationship (both log10 transformed) and predict regularized theta using the fit. The relationship between variance and theta and mean is given by variance= mean + (mean2/theta).
Normalized expression is then defined as the Pearson residual of the regression model, which can be interpreted as the number of standard deviations by which an observed UMI count was higher or lower than its expected value. Unless stated otherwise, we clip expression to the range [-30, 30] to prevent outliers from dominating downstream analyses.

原文地址:https://www.cnblogs.com/leezx/p/8648328.html

时间: 2024-10-10 22:21:12

单细胞数据初步处理 | drop-seq | QC | 质控 | 正则化 normalization的相关文章

单细胞数据整合方法 | Comprehensive Integration of Single-Cell Data

操作代码:https://satijalab.org/seurat/ Comprehensive Integration of Single-Cell Data 实在是没想到,这篇seurat的V3里面的整合方法居然发在了Cell主刊. 果然:大佬+前沿领域=无限可能 可以看到bioRxiv上是November 02, 2018发布的,然后Cell主刊June 06, 2019正式发表. 方法的创意应该在2017年底就有了,那时候我才刚来做single cell. Single-cell tra

单细胞数据高级分析之初步降维和聚类 | Dimensionality reduction | Clustering

Dimensionality reduction. Throughout the manuscript we use diffusion maps, a non-linear dimensionality reduction technique37. We calculate a cell-to-cell distance matrix using 1 - Pearson correlation and use the diffuse function of the diffusionMap R

Excel数据导入QC操作流程

环境: QC9 WindowsXP Office2007   1. 准备 1.通过Excel导入QC,需要下载Microsoft Excel Add-in: http://update.external.hp.com/qualitycenter/qc100/msoffice/msexcel/index.html 2.安装Microsoft Excel Add-in,安装前需要关闭Excel应用程序,并且以管理员身份运行. 这里注意:如果以前安装过该插件的早期版本,最好将其卸载后再安装. 3.完成

sql 数据表操作 create alert drop

--创建数据表php41_goods商品表create table php41_goods(    goods_id mediumint unsigned not null  auto_increment comment '主键',    goods_name varchar(32) not null comment '商品名称',    goods_price decimal(10,2) not null default 0 comment '市场价格',    goods_shop_pric

Oracle中恢复drop掉的表中的数据

今天同事不小心把生产上的一张表直接drop掉了,没有做备份,哥们慌的一匹,来找我这个小白来帮忙解决,于是心血来潮简单总结一下. 其实在oralce中,用drop删掉一张表,其实不会真正的删除,只是把表放到了回收站中,可以通过flashback命令来恢复drop掉的表. 例如: 1.创建一张表,删除:再创建一张同名表,字段不同,再删除 在 select original_name,dropscn from recyclebin时候,两个表的dropscn是不同的,在利用闪回恢复时 flashbac

阿里天池大数据之移动推荐算法大赛总结及代码全公布

移动推荐算法比赛已经结束了一个多星期了,现在写一篇文章来回顾一下自己的参赛历程. 首先,对不了解这个比赛的同学们介绍一下这个比赛(引用自官网): 赛题简介 2014年是阿里巴巴集团移动电商业务快速发展的一年,例如2014双11大促中移动端成交占比达到42.6%,超过240亿元.相比PC时代,移动端网络的访问是随时随地的,具有更丰富的场景数据,比如用户的位置信息.用户访问的时间规律等. 本次大赛以阿里巴巴移动电商平台的真实用户-商品行为数据为基础,同时提供移动时代特有的位置信息,而参赛队伍则需要通

数据竞赛思路分享:机场客流量的时空分布预测

历时两个月的比赛终于结束了,最终以第32名的成绩告终,在此和大家分享下解决问题的思路. 从初赛到复赛,有走过弯路,也有突然灵光一现的时刻.一路走来,对数据各种把玩,分析了各种可能的情况,尽可能得挖掘数据中潜在的信息来构建更为准确的模型. 本文无法涵盖所有的分析历程,但是会涉及解决问题的主要思路以及部分代码,详细的代码见Github页面 竞赛详细信息参见比赛官方网站 1. 问题描述 机场拥有巨大的旅客吞吐量,与巨大的人员流动相对应的则是巨大的服务压力.安防.安检.突发事件应急.值机.行李追踪等机场

mysqldump备份及数据还原

mysqldump建议使用在10g以下的数据进行备份,是mysql数据库的逻辑备份 mysqldump 指令介绍 -A 所有表  -all-databases -B   --databases   db1 db2 ...: 备份指定的多个库,同时创建库(不能跟A同时使用) -x 锁表   --lock-all-tables    请求锁定所有表之后再备份,对MyISAM.InnoDB.Aria做温备 --single-transaction: 能够对InnoDB存储引擎实现热备:(要确定表是in

怎么从sqlserver的存储过程获得返回的数据

1.返回一个数值 declare @count int exec @count = testReturn \'111\',\'222\' select @count @count就是返回的数值是int类型 2.返回一个数据表 首先要新建一个数据表CREATE TABLE test ( TMP_ORDER_ID varchar(20))然后将返回的数据插入到表中insert into test exec [dbo].[P_ORDER_ADD] 'U0001','01','2015040923202