GOSemSim:GO-terms Semantic Similarity Measures
Installation
Install GOSemSim
is easy, follow the guide in the Bioconductor page:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
## biocLite("BiocUpgrade") ## you may need this
biocLite("GOSemSim")
GO ID
找到 Gene Ontology (GO)勾选下面几个选项
然后获得所有酵母蛋白的Gene ontology数据
提取Gene Ontology ID
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 28 19:04:38 2016
@author: sun
"""
import pandas as pd
import re
yeast=pd.read_csv(‘yeast.csv‘)
#Gene ontology (biological process)
#Gene ontology (molecular function)
#Gene ontology (cellular component)
bp=yeast[‘Gene ontology (biological process)‘]
bp=bp.fillna(value=‘‘)
for i in range(len(bp)):
temp=re.findall(r"GO:\d{7}",bp[i])
bp[i]=‘;‘.join(temp)
mf=yeast[‘Gene ontology (molecular function)‘]
mf=mf.fillna(value=‘‘)
for i in range(len(mf)):
temp=re.findall(r"GO:\d{7}",mf[i])
mf[i]=‘;‘.join(temp)
cc=yeast[‘Gene ontology (cellular component)‘]
cc=cc.fillna(value=‘‘)
for i in range(len(cc)):
temp=re.findall(r"GO:\d{7}",cc[i])
cc[i]=‘;‘.join(temp)
yeast[‘Gene ontology (biological process)‘]=bp
yeast[‘Gene ontology (molecular function)‘]=mf
yeast[‘Gene ontology (cellular component)‘]=cc
yeast.to_csv(‘go.csv‘,index=False,columns =[‘Entry‘,
‘Gene ontology (cellular component)‘,
‘Gene ontology (molecular function)‘,
‘Gene ontology (biological process)‘])
获得Gene ontology
获取gold_yeast的Gene ontology
yeast_gold_protein_pair.csv
yeast=pd.read_csv(‘yeast_gold_protein_pair.csv‘)
go=pd.read_csv(‘go.csv‘,index_col=0)
protein_a=go.loc[yeast.idA,:]
protein_b=go.loc[yeast.idB,:]
protein_a.to_csv(‘GOProteinA.csv‘)
protein_b.to_csv(‘GOProteinB.csv‘)
GOProteinA.csv
GOProteinB.csv
R语言的安装
官方网址:https://www.r-project.org/
科大镜像:https://mirrors.ustc.edu.cn/CRAN/
没什么好说的,直接双击安装即可。注意:不能装到带有空格的目录中
注意:以下内容FQ或许会顺利点
R的一个可视化界面(RStudio)的安装
下载地址:https://www.rstudio.com/products/rstudio/download/
直接选择对应的操作系统就行了
也没什么好说,双击直接安装就行了。
软件界面如下
然后开始安装GOSemSim,运行文章开头安装的代码
如果顺利的话,这样就算成功安装GOSemSim了。
如果没FQ的话,一般会有如下错误
解决办法:
进入到R的安装目录,编辑etc/Rprofile.site
添加 options(download.file.method="libcurl"),重新打开RStudio。
选择Tools->Global Options...->Package->Cran mirror选择科大的镜像。如图所示
最后如果还是不行,建议手动下载安装包,下面列出了所需安装包。
GOSemSim 说明文档
http://www.bioconductor.org/packages/release/bioc/vignettes/GOSemSim/inst/doc/GOSemSim.html
刚刚下载好的Supported organisms的安装
3.3 Supported organisms
For IC-based methods, information of GO term is species specific. We need to calculate IC
for all GO terms of a species before we measure semantic similarity. GOSemSim support all organisms that have an OrgDb
object available.
Bioconductor have already provided OrgDb
for about 20 species, seehttp://bioconductor.org/packages/release/BiocViews.html#___OrgDb.
首先需要下载酵母的OrgDb数据库
打开RStudio把刚刚下载好的org.Sc.sgd.db_3.4.0.tar.gz安装上去。
Tools->Install Packages
Install from:Package Archive File(.zip;.tar.gz)
点击Browse...选择刚刚下载好的org.Sc.sgd.db_3.4.0.tar.gz文件
到这里我们需要的文件已经安装好了
Once we have OrgDb
, we can build annotation data needed by GOSemSim via godata
function.
library(GOSemSim)hsGO <- godata(‘org.Hs.eg.db‘, ont="MF")
## [1] "preparing gene to GO mapping data..."
## [1] "preparing IC data..."
User can set computeIC=FALSE
if they only want to use Wang’s method.
goSim 和mgoSim的介绍
In GOSemSim, we implemented all these IC-based and graph-based methods. goSim function calculates semantic similarity between two GO terms, while mgoSim function calculates semantic similarity between two sets of GO terms.
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
## [1] 0.155
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
## [1] 0.158
go1 = c("GO:0004022","GO:0004024","GO:0004174")go2 = c("GO:0009055","GO:0005515")mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
## GO:0009055 GO:0005515
## GO:0004022 0.205 0.158
## GO:0004024 0.185 0.141
## GO:0004174 0.205 0.158
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.192
执行结果
> library(GOSemSim)
> hsGO <- godata(‘org.Hs.eg.db‘, ont="MF")
[1]"preparing gene to GO mapping data..."
[1]"preparing IC data..."
> goSim("GO:0004022","GO:0005515", semData=hsGO, measure="Jiang")
[1]0.155
> goSim("GO:0004022","GO:0005515", semData=hsGO, measure="Wang")
[1]0.158
> go1 = c("GO:0004022","GO:0004024","GO:0004174")
> go2 = c("GO:0009055","GO:0005515")
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
GO:0009055 GO:0005515
GO:00040220.2050.158
GO:00040240.1850.141
GO:00041740.2050.158
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
[1]0.192
至此测试完成!
使用gold_yeast数据获得GO特征
代码如下
library(GOSemSim)
library(org.Sc.sgd.db)
GOProteinA<- read.csv("GOProteinA.csv", stringsAsFactors = F)
GOProteinB<- read.csv("GOProteinB.csv", stringsAsFactors = F)
cc <- c()
mf <- c()
bp <- c()
scGo <- godata(‘org.Sc.sgd.db‘, ont ="cc", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,2], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,2], split =";")[[1]])
cc[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
scGo <- godata(‘org.Sc.sgd.db‘, ont ="mf", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,3], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,3], split =";")[[1]])
mf[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
scGo <- godata(‘org.Sc.sgd.db‘, ont ="bp", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,4], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,4], split =";")[[1]])
bp[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
GOFeature<-
data.frame(GOProteinA$Entry,GOProteinB$Entry, cc, mf, bp)
write.csv(GOFeature,
‘GOFeature.csv‘,
na =‘0‘,#将nan值填充为0
row.names = FALSE)
GOFeature.csv
与原论文(Ensemble learning prediction of protein–protein interactions using proteins functional annotations)提供的数据相比,
结果不一样?下图是原论文数据
检查错误
- 以第一对蛋白质为例,首先单独获得P00546和P25302的GO ID
P00546
Gene ontology (cellular component) cc总共5个
GO:0005935;GO:0000307;GO:0005737;GO:0005783;GO:0005634
cellular bud neck [GO:0005935];
cyclin-dependent protein kinase holoenzyme complex [GO:0000307];
cytoplasm [GO:0005737];
endoplasmic reticulum [GO:0005783];
nucleus [GO:0005634]
Gene ontology (molecular function) mf总共5个
GO:0005524;GO:0004693;GO:0042393;GO:0004674;GO:0000993
ATP binding [GO:0005524];
cyclin-dependent protein serine/threonine kinase activity [GO:0004693];
histone binding [GO:0042393];
protein serine/threonine kinase activity [GO:0004674];
RNA polymerase II core binding [GO:0000993]
Gene ontology (biological process) bp总共35个
GO:0006370;GO:0051301;GO:0000706;GO:1990758;GO:2001033;GO:0051447;GO:0045930;GO:0045875;GO:0045892;GO:0007070;
GO:0018105;GO:0018107;GO:0070816;GO:0051446;GO:0045931;GO:0010571;GO:0010696;GO:0045893;GO:0045944;GO:0010898;
GO:1990139;GO:0034504;GO:1902002;GO:1990802;GO:1990804;GO:1990801;GO:1990803;GO:0010568;GO:0010569;GO:0010570;
GO:0060303;GO:0090169;GO:0032210;GO:0007130;GO:0016192
7-methylguanosine mRNA capping [GO:0006370];
cell division [GO:0051301];
meiotic DNA double-strand break processing [GO:0000706];
mitotic sister chromatid biorientation [GO:1990758];
negative regulation of double-strand break repair via nonhomologous end joining [GO:2001033];
negative regulation of meiotic cell cycle [GO:0051447];
negative regulation of mitotic cell cycle [GO:0045930];
negative regulation of sister chromatid cohesion [GO:0045875];
negative regulation of transcription, DNA-templated [GO:0045892];
negative regulation of transcription from RNA polymerase II promoter during mitosis [GO:0007070];
peptidyl-serine phosphorylation [GO:0018105];
peptidyl-threonine phosphorylation [GO:0018107];
phosphorylation of RNA polymerase II C-terminal domain [GO:0070816];
positive regulation of meiotic cell cycle [GO:0051446];
positive regulation of mitotic cell cycle [GO:0045931];
positive regulation of nuclear cell cycle DNA replication [GO:0010571];
positive regulation of spindle pole body separation [GO:0010696];
positive regulation of transcription, DNA-templated [GO:0045893];
positive regulation of transcription from RNA polymerase II promoter [GO:0045944];
positive regulation of triglyceride catabolic process [GO:0010898];
protein localization to nuclear periphery [GO:1990139];
protein localization to nucleus [GO:0034504];
protein phosphorylation involved in cellular protein catabolic process [GO:1902002];
protein phosphorylation involved in DNA double-strand break processing [GO:1990802];
protein phosphorylation involved in double-strand break repair via nonhomologous end joining [GO:1990804];
protein phosphorylation involved in mitotic spindle assembly [GO:1990801];
protein phosphorylation involved in protein localization to spindle microtubule [GO:1990803];
regulation of budding cell apical bud growth [GO:0010568];
regulation of double-strand break repair via homologous recombination [GO:0010569];
regulation of filamentous growth [GO:0010570];
regulation of nucleosome density [GO:0060303];
regulation of spindle assembly [GO:0090169];
regulation of telomere maintenance via telomerase [GO:0032210];
synaptonemal complex assembly [GO:0007130];
vesicle-mediated transport [GO:0016192]
P25302
Gene ontology (cellular component) cc总共2个
GO:0000790;GO:0033309
nuclear chromatin [GO:0000790];
SBF transcription complex [GO:0033309]
Gene ontology (molecular function) mf总共3个
GO:0003677;GO:0042802;GO:0001077
DNA binding [GO:0003677];
identical protein binding [GO:0042802];
transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding [GO:0001077]
Gene ontology (biological process) bp总共2个
GO:0061408;GO:0071931
positive regulation of transcription from RNA polymerase II promoter in response to heat stress [GO:0061408];
positive regulation of transcription involved in G1/S transition of mitotic cell cycle [GO:0071931]
程序计算个数没错
- 分别计算CC,MF,BP
ccA <- c(strsplit(GOProteinA[1,2], split =";")[[1]])
ccB <- c(strsplit(GOProteinB[1,2], split =";")[[1]])
mfA <- c(strsplit(GOProteinA[1,3], split =";")[[1]])
mfB <- c(strsplit(GOProteinB[1,3], split =";")[[1]])
bpA <- c(strsplit(GOProteinA[1,4], split =";")[[1]])
bpB <- c(strsplit(GOProteinB[1,4], split =";")[[1]])
scGo <- godata(‘org.Sc.sgd.db‘, ont ="cc")
cc <- mgoSim(ccA, ccB, semData = scGo, measure ="Wang")
scGo <- godata(‘org.Sc.sgd.db‘, ont ="mf")
mf <- mgoSim(mfA, mfB, semData = scGo, measure ="Wang")
scGo <- godata(‘org.Sc.sgd.db‘, ont ="bp")
bp <- mgoSim(bpA, bpB, semData = scGo, measure ="Wang")
单个执行结果与原始程序运行结果一样。
未找到错误继续
- 检查原论文说明。
似乎也没有其他说明。
结论。
- 可以肯定一点。程序没错。
- 可能是数据更新了。原始论文是2014的论文。2年可是GO数据的更新导致数据不一致。