Lasso linear model example | Proliferation index | Estimating the proliferation index of single cells

Background: We developed a cell-cycle scoring approach that uses expression data to compute an index for every cell that scores the cell according to its expression of cell-cycle genes. In brief, our approach proceeded through four steps. (A) We reduced dimensionality of the dataset to the cell-cycle relevant genes. (B) In this subspace we performed, as a first approximation, a simple K-means clustering to separate non cycling from cycling cells and (C) we used this clustering as a reference to learn a function that takes the gene expression as the input and returns a cell-cycle score as an output. (D) We used this function to calculate a score for each single cell.

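The data is a gene expression matrix with one profile per cell, and the goal is to compute a proliferation index for every cell from its expression of cell-cycle genes.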

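The first thing that comes to mind is a linear model: treat each cell-cycle gene as a variable, output a single number as the proliferation index, and rescale it to the range 0 to 1.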
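Written out, this naive idea is a weighted sum over the cell-cycle genes followed by a rescaling. The notation is introduced here for illustration only and is not taken from the paper: x_{cg} is the expression of gene g in cell c, G is the set of cell-cycle genes, w_g and b are the unknown coefficients, and min-max rescaling is just one assumed way to map the score onto 0 to 1.

```latex
s_c = b + \sum_{g \in G} w_g \, x_{cg},
\qquad
\tilde{s}_c = \frac{s_c - \min_{c'} s_{c'}}{\max_{c'} s_{c'} - \min_{c'} s_{c'}} \in [0, 1]
```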

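The problem is how to determine the coefficient of each gene. Simply writing down an equation by hand is not feasible, so we need a supervised learning model. But where does the supervision come from? Our data has no labels.

The paper's method is as follows:

Since the data we want to score has no labels, we create labels for it ourselves.

A simple K-means clustering picks out the cluster of highly proliferative cells, and the cluster assignments then serve as the training labels for a supervised model.

The fitted model is then applied back to our data to predict a proliferation index for every cell.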
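The pseudo-labelling idea can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the notebook's code: the function name `pseudo_label_scores`, the use of two clusters, the rule "the cluster with the higher mean expression is the cycling one", and the clipping to [0, 1] are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso


def pseudo_label_scores(X, alpha=0.01, seed=0):
    """X: cells x genes matrix of log-scaled cell-cycle gene expression."""
    Xc = X - X.mean(axis=0)  # center each gene
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(Xc)
    # Assumption: the cluster with the higher mean expression of
    # cell-cycle genes is treated as the cycling cluster.
    cluster_means = [X[km.labels_ == k].mean() for k in range(2)]
    cycling_cluster = int(np.argmax(cluster_means))
    y = (km.labels_ == cycling_cluster).astype(float)  # pseudo-labels: 1 = cycling
    # L1-regularized linear regression on the pseudo-labels.
    model = Lasso(alpha=alpha).fit(Xc, y)
    # Assumption: clip predictions so the score stays within [0, 1].
    scores = np.clip(model.predict(Xc), 0.0, 1.0)
    return scores, y, model


# Synthetic data: 60 resting cells and 40 cycling cells, 30 genes.
rng = np.random.default_rng(0)
n_genes = 30
resting = rng.normal(0.5, 0.3, size=(60, n_genes))
cycling = rng.normal(3.0, 0.3, size=(40, n_genes))
X = np.vstack([resting, cycling])

scores, labels, model = pseudo_label_scores(X)
print("mean score, resting cells:", scores[:60].mean())
print("mean score, cycling cells:", scores[60:].mean())
```

On real data the two groups are not this cleanly separated, which is exactly why the regression output is more useful than the hard cluster labels: cells between the clusters receive intermediate scores.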

We started by selecting a wide selection of genes related to cell-cycle and proliferation. We used the PANTHER GO database and selected all the genes that were described by one of the following terms: DNA metabolic process, DNA replication, mitosis, regulation of cell cycle, cell cycle, cytokinesis, histone, DNA-directed DNA polymerase, DNA polymerase processivity factor, centromere DNA-binding protein. We restricted our features to those genes. Genes that were detected at less than 10 molecules in the dataset were removed. We calculated the pairwise correlation coefficient matrix, and selected the genes that were strongly correlated (99th percentile of the matrix) with at least 12 other genes. The genes passing the filters described above were used for clustering cells using K-means (Python scikit-learn implementation, on log-centered data, default parameters) with the rationale that the main axis of variation expected would span across dividing and non-dividing cells. Then a linear regression model with L1-norm regularization was fitted that used a learning function which took expression data of a cell and categorized into two classes, 1 when a cell belongs to the cycling cluster and 0 when it did not. Importantly, to avoid both overfitting the score on the first approximation clusters and also to obtain a more generalizable model, we used a strong regularization (5 times the one determined by cross-validation; alpha = 0.01).
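The correlation filter described in this passage might look roughly like the sketch below. It is one reading of the text, not the authors' code: the function name `select_coexpressed_genes` is hypothetical, and taking the threshold as the 99th percentile of all off-diagonal entries of the correlation matrix is an assumption about what "99th percentile of the matrix" means.

```python
import numpy as np


def select_coexpressed_genes(X, percentile=99.0, min_partners=12):
    """Return indices of genes strongly correlated with at least
    `min_partners` other genes. X is a cells x genes matrix."""
    corr = np.corrcoef(X, rowvar=False)  # genes x genes correlation matrix
    n_genes = corr.shape[0]
    off_diag = ~np.eye(n_genes, dtype=bool)  # exclude self-correlations
    # Assumption: threshold = given percentile of all off-diagonal entries.
    threshold = np.percentile(corr[off_diag], percentile)
    strong = (corr >= threshold) & off_diag
    partners = strong.sum(axis=1)  # number of strong partners per gene
    return np.flatnonzero(partners >= min_partners)


# Synthetic data: 300 cells, 200 genes; the first 15 genes share one
# latent "cycling" factor, the remaining genes are independent noise.
rng = np.random.default_rng(1)
n_cells, n_all_genes, n_module = 300, 200, 15
X = rng.normal(size=(n_cells, n_all_genes))
factor = rng.normal(size=(n_cells, 1))
X[:, :n_module] = factor + 0.5 * rng.normal(size=(n_cells, n_module))

selected = select_coexpressed_genes(X)
print(selected)
```

The effect of the filter is to keep genes that move together as a module and to drop genes that only pick up a few chance correlations.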

This procedure was used for both the mouse and human embryonic dataset. The function learnt on the human embryonic dataset was also used to determine the proliferation index of the hPSCs.

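The paper, of course, handles this more carefully:

1. First, select cell-cycle-related genes from the PANTHER GO database;

2. Compute the pairwise correlations between genes and drop genes that are not co-expressed with other genes;

3. Run K-means clustering into three clusters to obtain the training data;

4. Fit a linear regression model with L1-norm regularization, using a deliberately strong penalty to prevent overfitting.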
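The "strong penalty" in step 4 can be illustrated as follows: pick alpha by cross-validation, then refit with a multiple of it. This is a hedged sketch on synthetic data, assuming scikit-learn's `LassoCV` and `Lasso`; the helper name `fit_strongly_regularized_lasso` and the toy labels are made up for the example, and the factor of 5 follows the paper's description.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV


def fit_strongly_regularized_lasso(X, y, factor=5.0, cv=5):
    """Choose alpha by cross-validation, then refit with factor * alpha."""
    cv_model = LassoCV(cv=cv).fit(X, y)
    strong_alpha = factor * cv_model.alpha_
    strong_model = Lasso(alpha=strong_alpha).fit(X, y)
    return strong_model, cv_model, strong_alpha


# Synthetic data: 120 cells, 40 genes; only the first 10 genes drive
# the 0/1 pseudo-label (1 = cycling), the rest are noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 40))
y = (X[:, :10].mean(axis=1) > 0).astype(float)

strong_model, cv_model, strong_alpha = fit_strongly_regularized_lasso(X, y)
print("alpha from cross-validation:", cv_model.alpha_)
print("alpha used for the final fit:", strong_alpha)
print("non-zero coefficients (CV fit):", np.count_nonzero(cv_model.coef_))
print("non-zero coefficients (strong fit):", np.count_nonzero(strong_model.coef_))
```

Because the labels come from a first-pass clustering rather than from experiments, shrinking the coefficients harder than cross-validation suggests keeps the model from simply memorizing the cluster boundary.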

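From a machine-learning perspective this method gives a rough proliferation index. It is unlikely to be badly wrong, but it is probably not very precise either; it is still good enough for comparing proliferation between cells.

To get ground truth, you would need more rigorous experimental data, such as gene expression profiles of highly proliferative cells and of completely non-proliferating cells.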

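Code: ipynb-lamanno2016-proliferation.ipynb

The code is already well commented. I will summarize and analyze it later and extend it to other applications.

This kind of model is fairly general.

For example, genes related to apoptosis and cellular senescence could be used to score how senescent each cell is.

The core problem is choosing a suitable gene list! For some properties, a suitable gene list is hard to find.

Original post: https://www.cnblogs.com/leezx/p/8623037.html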

Time: 2024-07-31 01:11:03
