GOSemSim：GO-terms Semantic Similarity Measures

Installation

Install GOSemSim is easy, follow the guide in the Bioconductor page:

## try http:// if https:// URLs are not supportedsource("https://bioconductor.org/biocLite.R")## biocLite("BiocUpgrade") ## you may need thisbiocLite("GOSemSim")

GO ID

找到 Gene Ontology (GO)勾选下面几个选项

然后获得所有酵母蛋白的Gene ontology数据

提取Gene Ontology ID

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 28 19:04:38 2016
@author: sun
"""
import pandas as pd
import re
yeast=pd.read_csv(‘yeast.csv‘)
#Gene ontology (biological process)
#Gene ontology (molecular function)
#Gene ontology (cellular component)
bp=yeast[‘Gene ontology (biological process)‘]
bp=bp.fillna(value=‘‘)
for i in range(len(bp)):
temp=re.findall(r"GO:\d{7}",bp[i])
bp[i]=‘;‘.join(temp)
mf=yeast[‘Gene ontology (molecular function)‘]
mf=mf.fillna(value=‘‘)
for i in range(len(mf)):
temp=re.findall(r"GO:\d{7}",mf[i])
mf[i]=‘;‘.join(temp)
cc=yeast[‘Gene ontology (cellular component)‘]
cc=cc.fillna(value=‘‘)
for i in range(len(cc)):
temp=re.findall(r"GO:\d{7}",cc[i])
cc[i]=‘;‘.join(temp)
yeast[‘Gene ontology (biological process)‘]=bp
yeast[‘Gene ontology (molecular function)‘]=mf
yeast[‘Gene ontology (cellular component)‘]=cc
yeast.to_csv(‘go.csv‘,index=False,columns =[‘Entry‘,
‘Gene ontology (cellular component)‘,
‘Gene ontology (molecular function)‘,
‘Gene ontology (biological process)‘])

获得Gene ontology

获取gold_yeast的Gene ontology

yeast_gold_protein_pair.csv

yeast=pd.read_csv(‘yeast_gold_protein_pair.csv‘)
go=pd.read_csv(‘go.csv‘,index_col=0)
protein_a=go.loc[yeast.idA,:]
protein_b=go.loc[yeast.idB,:]
protein_a.to_csv(‘GOProteinA.csv‘)
protein_b.to_csv(‘GOProteinB.csv‘)

GOProteinA.csv

GOProteinB.csv

R语言的安装

官方网址：https://www.r-project.org/

科大镜像：https://mirrors.ustc.edu.cn/CRAN/

没什么好说的，直接双击安装即可。注意：不能装到带有空格的目录中

注意：以下内容FQ或许会顺利点

R的一个可视化界面（RStudio）的安装

下载地址：https://www.rstudio.com/products/rstudio/download/

直接选择对应的操作系统就行了

也没什么好说，双击直接安装就行了。

软件界面如下

然后开始安装GOSemSim，运行文章开头安装的代码

如果顺利的话，这样就算成功安装GOSemSim了。

如果没FQ的话，一般会有如下错误

解决办法：

进入到R的安装目录，编辑etc/Rprofile.site

添加 options(download.file.method="libcurl")，重新打开RStudio。

选择Tools->Global Options...->Package->Cran mirror选择科大的镜像。如图所示

最后如果还是不行，建议手动下载安装包，下面列出了所需安装包。

GOSemSim 说明文档

http://www.bioconductor.org/packages/release/bioc/vignettes/GOSemSim/inst/doc/GOSemSim.html

刚刚下载好的Supported organisms的安装

3.3 Supported organisms

For IC-based methods, information of GO term is species specific. We need to calculate IC for all GO terms of a species before we measure semantic similarity. GOSemSim support all organisms that have an OrgDb object available.

Bioconductor have already provided OrgDb for about 20 species, seehttp://bioconductor.org/packages/release/BiocViews.html#___OrgDb.

首先需要下载酵母的OrgDb数据库

打开RStudio把刚刚下载好的org.Sc.sgd.db_3.4.0.tar.gz安装上去。

Tools->Install Packages

Install from:Package Archive File(.zip;.tar.gz)

点击Browse...选择刚刚下载好的org.Sc.sgd.db_3.4.0.tar.gz文件

到这里我们需要的文件已经安装好了

Once we have OrgDb, we can build annotation data needed by GOSemSim via godata function.

library(GOSemSim)hsGO <- godata(‘org.Hs.eg.db‘, ont="MF")

## [1] "preparing gene to GO mapping data..."
## [1] "preparing IC data..."

User can set computeIC=FALSE if they only want to use Wang’s method.

goSim 和mgoSim的介绍

In GOSemSim, we implemented all these IC-based and graph-based methods. goSim function calculates semantic similarity between two GO terms, while mgoSim function calculates semantic similarity between two sets of GO terms.

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")

## [1] 0.155

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")

## [1] 0.158

go1 = c("GO:0004022","GO:0004024","GO:0004174")go2 = c("GO:0009055","GO:0005515")mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)

##            GO:0009055 GO:0005515
## GO:0004022      0.205      0.158
## GO:0004024      0.185      0.141
## GO:0004174      0.205      0.158

mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")

## [1] 0.192

执行结果

> library(GOSemSim)
> hsGO <- godata(‘org.Hs.eg.db‘, ont="MF")
[1]"preparing gene to GO mapping data..."
[1]"preparing IC data..."
> goSim("GO:0004022","GO:0005515", semData=hsGO, measure="Jiang")
[1]0.155
> goSim("GO:0004022","GO:0005515", semData=hsGO, measure="Wang")
[1]0.158
> go1 = c("GO:0004022","GO:0004024","GO:0004174")
> go2 = c("GO:0009055","GO:0005515")
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
GO:0009055 GO:0005515
GO:00040220.2050.158
GO:00040240.1850.141
GO:00041740.2050.158
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
[1]0.192

至此测试完成！

使用gold_yeast数据获得GO特征

代码如下

library(GOSemSim)
library(org.Sc.sgd.db)
GOProteinA<- read.csv("GOProteinA.csv", stringsAsFactors = F)
GOProteinB<- read.csv("GOProteinB.csv", stringsAsFactors = F)
cc <- c()
mf <- c()
bp <- c()
scGo <- godata(‘org.Sc.sgd.db‘, ont ="cc", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,2], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,2], split =";")[[1]])
cc[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
scGo <- godata(‘org.Sc.sgd.db‘, ont ="mf", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,3], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,3], split =";")[[1]])
mf[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
scGo <- godata(‘org.Sc.sgd.db‘, ont ="bp", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,4], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,4], split =";")[[1]])
bp[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
GOFeature<-
data.frame(GOProteinA$Entry,GOProteinB$Entry, cc, mf, bp)
write.csv(GOFeature,
‘GOFeature.csv‘,
na =‘0‘,#将nan值填充为0
row.names = FALSE)

GOFeature.csv

与原论文（Ensemble learning prediction of protein–protein interactions using proteins functional annotations）提供的数据相比，

结果不一样？下图是原论文数据

检查错误

以第一对蛋白质为例，首先单独获得P00546和P25302的GO ID

P00546

Gene ontology (cellular component) cc总共5个

GO:0005935;GO:0000307;GO:0005737;GO:0005783;GO:0005634

cellular bud neck [GO:0005935];
cyclin-dependent protein kinase holoenzyme complex [GO:0000307];
cytoplasm [GO:0005737];
endoplasmic reticulum [GO:0005783];
nucleus [GO:0005634]

Gene ontology (molecular function) mf总共5个

GO:0005524;GO:0004693;GO:0042393;GO:0004674;GO:0000993

ATP binding [GO:0005524];
cyclin-dependent protein serine/threonine kinase activity [GO:0004693];
histone binding [GO:0042393];
protein serine/threonine kinase activity [GO:0004674];
RNA polymerase II core binding [GO:0000993]

Gene ontology (biological process) bp总共35个

GO:0006370;GO:0051301;GO:0000706;GO:1990758;GO:2001033;GO:0051447;GO:0045930;GO:0045875;GO:0045892;GO:0007070;

GO:0018105;GO:0018107;GO:0070816;GO:0051446;GO:0045931;GO:0010571;GO:0010696;GO:0045893;GO:0045944;GO:0010898;

GO:1990139;GO:0034504;GO:1902002;GO:1990802;GO:1990804;GO:1990801;GO:1990803;GO:0010568;GO:0010569;GO:0010570;

GO:0060303;GO:0090169;GO:0032210;GO:0007130;GO:0016192

7-methylguanosine mRNA capping [GO:0006370];
cell division [GO:0051301];
meiotic DNA double-strand break processing [GO:0000706];
mitotic sister chromatid biorientation [GO:1990758];
negative regulation of double-strand break repair via nonhomologous end joining [GO:2001033];
negative regulation of meiotic cell cycle [GO:0051447];
negative regulation of mitotic cell cycle [GO:0045930];
negative regulation of sister chromatid cohesion [GO:0045875];
negative regulation of transcription, DNA-templated [GO:0045892];
negative regulation of transcription from RNA polymerase II promoter during mitosis [GO:0007070];
peptidyl-serine phosphorylation [GO:0018105];
peptidyl-threonine phosphorylation [GO:0018107];
phosphorylation of RNA polymerase II C-terminal domain [GO:0070816];
positive regulation of meiotic cell cycle [GO:0051446];
positive regulation of mitotic cell cycle [GO:0045931];
positive regulation of nuclear cell cycle DNA replication [GO:0010571];
positive regulation of spindle pole body separation [GO:0010696];
positive regulation of transcription, DNA-templated [GO:0045893];
positive regulation of transcription from RNA polymerase II promoter [GO:0045944];
positive regulation of triglyceride catabolic process [GO:0010898];
protein localization to nuclear periphery [GO:1990139];
protein localization to nucleus [GO:0034504];
protein phosphorylation involved in cellular protein catabolic process [GO:1902002];
protein phosphorylation involved in DNA double-strand break processing [GO:1990802];
protein phosphorylation involved in double-strand break repair via nonhomologous end joining [GO:1990804];
protein phosphorylation involved in mitotic spindle assembly [GO:1990801];
protein phosphorylation involved in protein localization to spindle microtubule [GO:1990803];
regulation of budding cell apical bud growth [GO:0010568];
regulation of double-strand break repair via homologous recombination [GO:0010569];
regulation of filamentous growth [GO:0010570];
regulation of nucleosome density [GO:0060303];
regulation of spindle assembly [GO:0090169];
regulation of telomere maintenance via telomerase [GO:0032210];
synaptonemal complex assembly [GO:0007130];
vesicle-mediated transport [GO:0016192]

P25302

Gene ontology (cellular component) cc总共2个

GO:0000790;GO:0033309

nuclear chromatin [GO:0000790];
SBF transcription complex [GO:0033309]

Gene ontology (molecular function) mf总共3个

GO:0003677;GO:0042802;GO:0001077

DNA binding [GO:0003677];
identical protein binding [GO:0042802];
transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding [GO:0001077]

Gene ontology (biological process) bp总共2个

GO:0061408;GO:0071931

positive regulation of transcription from RNA polymerase II promoter in response to heat stress [GO:0061408];
positive regulation of transcription involved in G1/S transition of mitotic cell cycle [GO:0071931]

程序计算个数没错

分别计算CC,MF,BP

ccA <- c(strsplit(GOProteinA[1,2], split =";")[[1]])
ccB <- c(strsplit(GOProteinB[1,2], split =";")[[1]])
mfA <- c(strsplit(GOProteinA[1,3], split =";")[[1]])
mfB <- c(strsplit(GOProteinB[1,3], split =";")[[1]])
bpA <- c(strsplit(GOProteinA[1,4], split =";")[[1]])
bpB <- c(strsplit(GOProteinB[1,4], split =";")[[1]])
scGo <- godata(‘org.Sc.sgd.db‘, ont ="cc")
cc <- mgoSim(ccA, ccB, semData = scGo, measure ="Wang")
scGo <- godata(‘org.Sc.sgd.db‘, ont ="mf")
mf <- mgoSim(mfA, mfB, semData = scGo, measure ="Wang")
scGo <- godata(‘org.Sc.sgd.db‘, ont ="bp")
bp <- mgoSim(bpA, bpB, semData = scGo, measure ="Wang")

单个执行结果与原始程序运行结果一样。

未找到错误继续

检查原论文说明。

似乎也没有其他说明。

结论。

可以肯定一点。程序没错。
可能是数据更新了。原始论文是2014的论文。2年可是GO数据的更新导致数据不一致。

来自为知笔记(Wiz)

附件列表

时间： 2024-11-09 19:59:12

GOSemSim