Single cell RNA-seq denoising using a deep count autoencoder

Autoencoder:机器学习中的自动编码器,这篇文章里面用的是去噪编码器,坊间称之为denoise autoencoder(DAE),在sc-RNAseq中除去dropout的噪声是非常理想的一种模型。

Therefore,这篇文章已经发表在了NC的18年预印本上,证明其方法和文章质量很是不错。

基本原理of DAE:

###measurement noise from dropout events, moves data points away from the data manifold (black line). The autoencoder is trained to denoise the data by mapping corrupted data points back onto the data manifold (green arrows),蓝色实心点代表的是corruption points(变体位点)也就是被观测到的点,而蓝色空心点代表的是没有noise的数据点。而manifold则是指的是流形学习中形成的分类流。manifold learning也是一种非常常用以及常见的机器学习方法,后续中我们依然会进一步的介绍。

而文章中,说明的是DCA takes the count distribution(数目分布), overdispersion and sparsity of the data(过离散性和稀疏性) into account using a zero-inflated negative binomial noise model(零膨胀负二项模型,后续会对ZINB模型做进一步的介绍,因为本puppy对它很是钟爱), and nonlinear gene-gene(基因与基因之间的非线性作用) or gene-dispersion (基因与离散度的交互作用)interactions are captured.

Input counts, mean, dispersion and dropout probabilities are denoted as x, μ, θ and π. respectively。A typical autoencoders compresses high dimensional data into lower dimensions in order to constrain the model and extract features that summarize the data well in the bottleneck layer.

Extension in machine learning:

###################

To test whether a ZINB loss function is necessary, we compared DCA to a classical autoencoder with mean squared error (MSE) loss function using log transformed count data. The MSE based autoencoder was unable to recover the celltypes, indicating that the specialized ZINB loss function is necessary for scRNA-seq data.

下述为没有dp的数据,有dp的数据,DCA的ZINB损失函数的的去噪自动编码器method,以及经典DCA中log化之后的均方误的损失函数method。主要目的是突出它们开发的新型零膨胀负二项分布的损失函数的模型优越性,是显著优于经典函数的模型。四种情况下的分类,降维以及聚类分析印证。文中很多显示该法优越性的图,本文重方法,画图这里不多表。

method:

ZINB is parameterized with mean and dispersion parameters of the negative binomial component (μ and θ) and the mixture coefficient that represents the weight of the point mass (π):这里的派指的是dropout基因表达量计数的权重——estimates dropout 

首先,调整基因表达矩阵的library size, log value 以及 z-score normalization

再用Xbar 去计算E B D分别代表的是编码层,瓶颈层以及解码层的不同matrix

前三步在进行矩阵的LU分解

后三步在根据分解出来的矩阵进行参数的定义,D是解码层得出的激活函数矩阵

The loss function that is likelihood of zero-inflated negative binomial distribution:

根据上述系列等式得到的参数,损失函数的参数估计,其中的估计过程就用到了ZINB分布,其中的M代表的是乘上了size factor的mean matrix,公式如:

同时,如果你有不同的dp rate的权重,你就需要不同用dp以及零膨胀参数的脊先验概率去估计它:公式如下:

where NLLZINB function represents the negative log likelihood of ZINB distribution.

||TT||^2是TT在二维空间向量中的范数大小。使用ZINB的负逻辑似然值去估计X TT M 塞塔变得可行。

总之,这个公式的优点在于可以在不同量级或者等级的基因离散程度上都有很好的参数估计效果。(cites:the alternative option estimates a scalar dispersion parameter per gene )

====================================

====================================

现存的很多单细胞优化的分析软件有:scImpute、MAGIC、DCA、SAVER、BEIR、etc

大家感兴趣的可以去搜索一下,以飨the socpe your knowledge

原文地址:https://www.cnblogs.com/beckygogogo/p/9195139.html

时间: 2024-10-10 22:21:31

Single cell RNA-seq denoising using a deep count autoencoder的相关文章

单细胞参考文献 single cell

许多分析软件 : https://github.com/seandavi/awesome-single-cell#software-packages Smart-seq.CEL-seq.SCRB-seq和Drop-seq.Smart-seqSMART(Switching mechanism at 5' end of the RNA transcript)是一个具有里程碑意义的重要技术.实际上,能够从单细胞生成全长cDNA的测序方案并不多,Smart-seq就是其中之一.对于等位基因特异性表达或者

Multiclonal Invasion in Breast Tumors Identified by Topographic Single Cell Sequencing

Title:  Multiclonal Invasion in Breast Tumors Identified by Topographic Single Cell Sequencing 课题的目的和意义: Ductal carcinoma in situ (DCIS,原位导管癌)是一种早期的乳腺癌,很少发展成invasive ductal carcinoma(IDC,浸润型导管癌).由于瘤内异质性和导管中肿瘤细胞数量低,因此很难刻画在侵袭期间的基因组进化过程.为了克服这个障碍,作者开发了To

Cell theory|Bulk RNA-seq|Cellar heterogeneity|Micromanipulation|Limiting dilution|LCM|FACS|MACS|Droplet|10X genomics|Human cell atlas|Spatially resolved transcriptomes|ST|Slide-seq|SeqFISH|MERFISH

生物信息学 Cell theory:7个要点 All known living things are made up of one or more cells. All living cells arise from pre-existing cells by division. The cell is the fundamental unit of structure and function in all living organisms. The activity of an organi

Deep Networks : Overview

Overview In the previous sections, you constructed a 3-layer neural network comprising an input, hidden and output layer. While fairly effective for MNIST, this 3-layer model is a fairly shallow network; by this, we mean that the features (hidden lay

细胞迁移 | cell migration

一些基本概念: 参考: Single-Cell Migration in Complex Microenvironments: Mechanics and Signaling Dynamics Single and collective cell migration: the mechanics of adhesions Single-cell Migration Chip for Chemotaxis-based Microfluidic Selection of Heterogeneous

RNA测序相对基因表达芯片有什么优势?

RNA测序相对基因表达芯片有什么优势? RNA-Seq和基因表达芯片相比,哪种方法更有优势?关键看适用不适用.那么RNA-Seq适用哪些研究方向?是否您的研究?来跟随本文了解一下RNA测序相对基因表达芯片有什么优势? 无假设的研究设计和更高的发现能力RNA-Seq是一种基于测序的强大方法,让研究人员能够打破传统技术的低效和花费,如实时定量PCR(RT-PCR)和芯片.无论是将RNA-Seq添加到现有的研究方法中,还是从一种方法彻底转换到另一种,RNA-Seq都带来了许多显而易见的优势.这种方法不

Research Guide for Video Frame Interpolation with Deep Learning

Research Guide for Video Frame Interpolation with Deep Learning This blog is from: https://heartbeat.fritz.ai/research-guide-for-video-frame-interpolation-with-deep-learning-519ab2eb3dda In this research guide, we’ll look at deep learning papers aime

更新tableView的某个cell

异步加载完数据后更新某个cell,这应该是非常常见的使用方法了,我们经常会用reloadData. 效果: 源码: // // RootViewController.m // DataTableView // // Copyright (c) 2014年 Y.X. All rights reserved. // #import "RootViewController.h" #import "SDWebImage.h" @interface RootViewContr

(转) Written Memories: Understanding, Deriving and Extending the LSTM

R2RT Written Memories: Understanding, Deriving and Extending the LSTM Tue 26 July 2016 When I was first introduced to Long Short-Term Memory networks (LSTMs), it was hard to look past their complexity. I didn’t understand why they were designed they