DisGeNET database: downloading the data and putting it to use

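The DisGeNET database integrates gene-disease associations (GDAs) from multiple databases and from a large body of literature, and uses text mining to analyse associations for Mendelian, complex, and environmental diseases. The techniques involved include mapping of gene and disease vocabularies and the DisGeNET ontology.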

The data sources used are as follows:

Curated Data:

UNIPROT, CTD, CLINVAR, ORPHANET, GWAS CATALOG

Predicted Data:

CTD, MGD, RGD

Literature Data:

GAD, LHGDN, BeFree Data

Variant Data:

dbSNP, ExAC, 1000 Genomes Project, Ensembl

The data download page looks like this:

[Figure: DisGeNET data download page]

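I. Understanding the data

All files are tab-separated.

1. First, Gene-Disease Associations

There are three different file types to tell apart:

1. Gene-disease associations from UniProt, CGI, ClinGen, Genomics England PanelApp, PsyGeNET, Orphanet, HPO, and CTD (human data)

2. Gene-disease associations obtained using the BeFree system

3. All gene-disease associations in DisGeNET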

The columns in the files are:
geneId 		-> NCBI Entrez Gene Identifier
geneSymbol	-> Official Gene Symbol
DSI		-> The Disease Specificity Index for the gene
DPI		-> The Disease Pleiotropy Index for the gene
diseaseId 	-> UMLS concept unique identifier
diseaseName 	-> Name of the disease
diseaseType  	-> The DisGeNET disease type: disease, phenotype and group
diseaseClass	-> The MeSH disease class(es)
diseaseSemanticType	-> The UMLS Semantic Type(s) of the disease
score		-> DisGeNET score for the Gene-Disease association
EI		-> The Evidence Index for the Gene-Disease association
YearInitial	-> First time that the Gene-Disease association was reported
YearFinal	-> Last time that the Gene-Disease association was reported
NofPmids	-> Total number of publications reporting the Gene-Disease association
NofSnps		-> Total number of SNPs associated to the Gene-Disease association
source		-> Original source reporting the Gene-Disease association
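Since the files are gzipped and tab-separated, they can be read with the Python standard library alone. The sketch below is only an illustration: the file name `sample_gda.tsv.gz`, the helper names `read_gda` and `filter_by_score`, the 0.3 threshold, and every value in the sample rows are made up for the example. Only the column names come from the list above.

```python
import csv
import gzip
import os
import tempfile


def read_gda(path):
    """Read a gzipped, tab-separated file into a list of dicts keyed by column name."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_score(rows, min_score):
    """Keep only associations whose score column is at least min_score."""
    return [row for row in rows if float(row["score"]) >= min_score]


# Made-up sample with a subset of the documented columns, so the sketch runs
# without downloading anything.
sample = (
    "geneId\tgeneSymbol\tdiseaseId\tdiseaseName\tscore\tsource\n"
    "1\tGENE_A\tC0000001\tExample disease 1\t0.70\tEXAMPLE_CURATED\n"
    "2\tGENE_B\tC0000002\tExample disease 2\t0.10\tEXAMPLE_TEXT_MINED\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample_gda.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
    handle.write(sample)

rows = read_gda(path)
strong = filter_by_score(rows, 0.3)
print([row["geneSymbol"] for row in strong])
```

For a real download, `read_gda` would be pointed at the downloaded file instead of the temporary sample. Note that `csv.DictReader` returns every field as a string, so numeric columns such as `score` need an explicit conversion.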

2. Next, Variant-Disease Associations

curated_variant_disease_associations.tsv.gz 		=> Variant-Disease associations from UniProt, ClinVar, GWASdb and the GWAS Catalog
befree_variant_disease_associations.tsv.gz 		=> Variant-Disease associations obtained using BeFree System
all_variant_disease_associations.tsv.gz 		=> All Variant-Disease associations in DisGeNET

The columns in the files are:
snpId 		-> dbSNP variant Identifier
chromosome	-> Chromosome of the variant
position	-> Position in chromosome
DSI		-> The Disease Specificity Index for the variant
DPI		-> The Disease Pleiotropy Index for the variant
diseaseId 	-> UMLS concept unique identifier
diseaseName 	-> Name of the disease
diseaseType  	-> The DisGeNET disease type: disease, phenotype and group
diseaseClass	-> The MeSH disease class(es)
diseaseSemanticType	-> The UMLS Semantic Type(s) of the disease
score		-> DisGeNET score for the Variant-Disease association
EI		-> The Evidence Index for the Variant-Disease association
YearInitial	-> First time that the Variant-Disease association was reported
YearFinal	-> Last time that the Variant-Disease association was reported
NofPmids	-> Total number of publications reporting the Variant-Disease association
source		-> Original source reporting the Variant-Disease association
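As a rough illustration of how the variant-disease columns might be used, the following sketch groups variants by disease after applying a score cutoff. The rs numbers, CUIs, positions, and scores are placeholders, and `variants_by_disease` is a hypothetical helper, not part of any DisGeNET tooling.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows using a subset of the documented column names.
sample_vda = (
    "snpId\tchromosome\tposition\tdiseaseId\tscore\n"
    "rs0000001\t1\t1000\tC0000001\t0.80\n"
    "rs0000002\t2\t2000\tC0000001\t0.20\n"
    "rs0000003\tX\t3000\tC0000002\t0.75\n"
)


def variants_by_disease(tsv_text, min_score=0.0):
    """Map each diseaseId to the snpIds whose score is at least min_score."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if float(row["score"]) >= min_score:
            grouped[row["diseaseId"]].append(row["snpId"])
    return dict(grouped)


by_disease = variants_by_disease(sample_vda, min_score=0.5)
print(by_disease)
```

The chromosome column is kept as text on purpose, because values such as X cannot be parsed as integers.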

3. gene_disease_pmid: a PMID (PubMed Unique Identifier) is the number assigned to each life-science and biomedical publication indexed by the PubMed search engine.
all_gene_disease_pmid_associations.tsv.gz		=> All Gene-Disease-PMID associations in DisGeNET

The columns in the files are:
geneId 		-> NCBI Entrez Gene Identifier
geneSymbol	-> Official Gene Symbol
DSI		-> The Disease Specificity Index for the gene
DPI		-> The Disease Pleiotropy Index for the gene
diseaseId 	-> UMLS concept unique identifier
diseaseName 	-> Name of the disease
diseaseType  	-> The DisGeNET disease type: disease, phenotype and group
diseaseClass	-> The MeSH disease class(es)
diseaseSemanticType	-> The UMLS Semantic Type(s) of the disease
score		-> DisGeNET score for the Gene-Disease association
EI		-> The Evidence Index for the Gene-Disease association
YearInitial	-> First time that the Gene-Disease association was reported
YearFinal	-> Last time that the Gene-Disease association was reported
pmid		-> Publication reporting the Gene-Disease association
source		-> Original source reporting the Gene-Disease association
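Because this file has one row per publication, a per-pair publication count similar in spirit to the NofPmids column can be derived from it. The sketch below counts distinct PMIDs per gene-disease pair. Whether DisGeNET computes NofPmids in exactly this way is an assumption, and all identifiers in the sample rows are invented.

```python
from collections import defaultdict


def count_pmids(rows):
    """Count distinct PMIDs for every (geneId, diseaseId) pair."""
    pmids = defaultdict(set)
    for row in rows:
        pmids[(row["geneId"], row["diseaseId"])].add(row["pmid"])
    return {pair: len(ids) for pair, ids in pmids.items()}


# Invented rows: the same paper can appear twice when two sources report it.
rows = [
    {"geneId": "1", "diseaseId": "C0000001", "pmid": "100"},
    {"geneId": "1", "diseaseId": "C0000001", "pmid": "101"},
    {"geneId": "1", "diseaseId": "C0000001", "pmid": "100"},
    {"geneId": "2", "diseaseId": "C0000001", "pmid": "100"},
]
counts = count_pmids(rows)
print(counts)
```

Using a set per pair means a publication reported by several sources is counted once.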

4. variant_disease_pmid
all_variant_disease_pmid_associations.tsv.gz 		=> All Variant-Disease-PMID associations in DisGeNET

The columns in the files are:
snpId 		-> dbSNP variant Identifier
chromosome	-> Chromosome of the variant
position	-> Position in chromosome
DSI	-> The Disease Specificity Index for the variant
DPI	-> The Disease Pleiotropy Index for the variant
diseaseId 	-> UMLS concept unique identifier
diseaseName 	-> Name of the disease
diseaseType  	-> The DisGeNET disease type: disease, phenotype and group
diseaseClass	-> The MeSH disease class(es)
score		-> DisGeNET score for the Variant-Disease association
YearInitial	-> First time that the Variant-Disease association was reported
YearFinal	-> Last time that the Variant-Disease association was reported
pmid		-> Publication reporting the Variant-Disease association
source		-> Original source reporting the Variant-Disease association

5. disease_mappings.tsv.gz				=> Mappings from UMLS concept unique identifier to disease vocabularies: DO, EFO, HPO, ICD9CM, MSH, NCI, OMIM, and ORDO

6. variant_to_gene_mappings.tsv.gz 		=> Variants mapped to their corresponding genes, according to dbSNP.

The columns in the files are:
snpId 		-> dbSNP variant Identifier
geneId		-> NCBI Entrez Gene Identifier
geneSymbol		-> Official Gene Symbol
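The variant-to-gene mapping can be joined onto variant-disease rows through the shared snpId column. The sketch below assumes that one variant may map to more than one gene and therefore collects gene symbols in a set. The helper names and all sample identifiers are hypothetical.

```python
from collections import defaultdict


def build_variant_index(mapping_rows):
    """Index the mapping rows as snpId -> set of gene symbols."""
    index = defaultdict(set)
    for row in mapping_rows:
        index[row["snpId"]].add(row["geneSymbol"])
    return index


def annotate(vda_rows, index):
    """Return copies of the variant-disease rows with a sorted geneSymbols list added."""
    return [
        {**row, "geneSymbols": sorted(index.get(row["snpId"], ()))}
        for row in vda_rows
    ]


# Hypothetical mapping and association rows.
mapping_rows = [
    {"snpId": "rs0000001", "geneId": "1", "geneSymbol": "GENE_A"},
    {"snpId": "rs0000001", "geneId": "2", "geneSymbol": "GENE_B"},
    {"snpId": "rs0000002", "geneId": "2", "geneSymbol": "GENE_B"},
]
vda_rows = [
    {"snpId": "rs0000001", "diseaseId": "C0000001"},
    {"snpId": "rs0000003", "diseaseId": "C0000002"},
]
annotated = annotate(vda_rows, build_variant_index(mapping_rows))
print(annotated)
```

Variants without an entry in the mapping keep an empty gene list rather than being dropped, which makes unmapped variants easy to spot.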

7. Mapping files: DisGeNET genes (Entrez Gene Identifiers) to UniProt entries, and UMLS CUIs to several disease vocabularies

8. Other files: BeFree gene-disease-pmid associations for PubAnnotation

II. Statistics for each source database

III. The DisGeNET scoring system

The DisGeNET score is assigned according to the number of sources that report the association.

IV. Disease Specificity Index (DSI)

DSI = log2(Nd / NT) / log2(1 / NT)

where:

- Nd is the number of diseases associated to the gene

- NT is the total number of diseases in DisGeNET (13,674)

The DSI ranges from 0 to 1.

DSI = 0 implies that the gene is associated only to phenotypes.

Example: TNF, associated to more than 1,500 diseases, has a DSI of 0.247, while IDH3A is associated to one disease, with a DSI of 1.
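The definition can be tried out in a few lines of Python. This sketch assumes the formula DSI = log2(Nd / NT) / log2(1 / NT) as given in the DisGeNET documentation, and treats a gene with no associated diseases as DSI = 0, following the statement above. The function name is an invention for this example.

```python
import math


def disease_specificity_index(n_diseases, n_total):
    """DSI under the assumed formula log2(Nd / NT) / log2(1 / NT).

    A gene with no associated diseases (phenotypes only) gets 0 by convention.
    """
    if n_diseases == 0:
        return 0.0
    return math.log2(n_diseases / n_total) / math.log2(1 / n_total)


N_TOTAL = 13674  # total number of diseases quoted in the text

one_disease = disease_specificity_index(1, N_TOTAL)
many_diseases = disease_specificity_index(1500, N_TOTAL)
print(one_disease, many_diseases)
```

A gene linked to a single disease reaches the maximum of 1, and the value falls towards 0 as the number of associated diseases approaches the total.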

V. Disease Pleiotropy Index (DPI)

DPI = Ndc / NTC

where:

- Ndc is the number of the different MeSH disease classes of the diseases associated to the gene

- NTC is the total number of MeSH disease classes in DisGeNET (27)

The DPI ranges from 0 to 1.

DPI = 0 implies that the gene is associated only to phenotypes, or that the associated diseases do not map to any MeSH classes.

Example: gene KCNE2 is associated to 38 diseases and 10 phenotypes. 36 out of the 38 diseases have a MeSH disease class. The 36 diseases are associated to 10 different MeSH classes. The DPI for KCNE2 = 10/27 ≈ 0.37. By contrast, gene APOE, associated to more than 700 diseases of different disease classes, has a DPI of 1.
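The KCNE2 example can be reproduced with a short sketch. It assumes DPI is the plain ratio Ndc / NTC, which is what a 0 to 1 range and the value 0.37 imply. The class labels below are placeholders, not real MeSH class codes.

```python
def disease_pleiotropy_index(disease_classes, n_total_classes):
    """DPI as the fraction of MeSH disease classes covered by a gene's diseases.

    disease_classes holds one label per associated disease; duplicates are
    collapsed, and diseases without a class are simply left out by the caller.
    """
    return len(set(disease_classes)) / n_total_classes


N_TOTAL_CLASSES = 27  # total number of MeSH disease classes quoted in the text

# 36 classified diseases spread over 10 placeholder classes, as in the KCNE2 example.
kcne2_classes = [f"class_{i % 10}" for i in range(36)]
kcne2_dpi = disease_pleiotropy_index(kcne2_classes, N_TOTAL_CLASSES)
print(round(kcne2_dpi, 2))
```

An empty list of classes gives 0, matching the case where a gene is associated only to phenotypes or to diseases without a MeSH class.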

VI. Vocabulary Mapping

Diseases:

The vocabulary used for diseases in the current release of DisGeNET is the Unified Medical Language System® (UMLS®) vocabulary. The repositories of gene-disease associations use different disease vocabularies: OMIM® terms for diseases from UniProt, CTD™, and MGD; MeSH terms for CTD™, LHGDN, and RGD; and UMLS® Concept Unique Identifiers (CUIs) for CLINVAR. Orphanet identifiers are mapped using Orphanet cross-references. Disease names from GAD and the GWAS Catalog are normalized using the UMLS Metathesaurus. We also used the UMLS® Metathesaurus® concept structure to map MIM and MeSH terms to UMLS® CUIs.

Genes:

For human genes, HGNC symbols (used for some entries in GAD) and UniProt accession numbers (used by UniProt) are converted to NCBI Entrez gene identifiers using an in-house dictionary that cross-references HGNC, UniProt, and NCBI Gene information. For mouse and rat genes, we used the files ftp://ftp.informatics.jax.org/pub/reports/HOM_MouseHumanSequence.rpt and ftp://rgd.mcw.edu/pub/data_release/RGD_ORTHOLOGS.txt, which carry orthology information from MGD and RGD respectively, to map mouse and rat Entrez gene identifiers to human Entrez identifiers. We discarded relationships for which no human ortholog of the mouse or rat gene could be found.
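The ortholog step described above amounts to a lookup followed by a filter. The sketch below shows the idea with a plain dictionary. The identifiers are invented, the real orthology files have their own column layouts that are not parsed here, and `map_to_human` is a hypothetical helper rather than DisGeNET code.

```python
def map_to_human(associations, orthologs):
    """Translate (model_gene_id, disease_id) pairs to human gene identifiers.

    Pairs whose gene has no human ortholog in the lookup table are discarded.
    """
    mapped = []
    for gene_id, disease_id in associations:
        human_id = orthologs.get(gene_id)
        if human_id is None:
            continue  # no human ortholog found: drop the relationship
        mapped.append((human_id, disease_id))
    return mapped


# Invented mouse/rat gene ids mapped to invented human gene ids.
orthologs = {"mouse_gene_1": "human_gene_1", "rat_gene_7": "human_gene_7"}
associations = [
    ("mouse_gene_1", "C0000001"),
    ("mouse_gene_2", "C0000001"),  # has no entry in the ortholog table
    ("rat_gene_7", "C0000002"),
]
human_associations = map_to_human(associations, orthologs)
print(human_associations)
```

A one-to-one dictionary is the simplest possible model. Genes with several human orthologs would need a mapping to lists instead.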

VII. The DisGeNET Association Type Ontology

VIII. Data attributes

Diseases

·the disease name, provided by the UMLS® Metathesaurus®

·the UMLS® semantic types

·the MeSH class: We classify the diseases according to the MeSH hierarchy using 23 upper level concepts of the MeSH tree branch C (Diseases) plus three concepts of the F branch (Psychiatry and Psychology: "Behavior and Behavior Mechanisms", "Psychological Phenomena and Processes", and "Mental Disorders").

·the top level concepts from the Human Disease Ontology.

·the DisGeNET disease type: disease, phenotype and group.

Entries with the following UMLS® semantic types are classified as disease:

- Disease or Syndrome

- Neoplastic Process

- Acquired Abnormality

- Anatomical Abnormality

- Congenital Abnormality

- Mental or Behavioral Dysfunction

Entries with the following UMLS® semantic types are classified as phenotype:

- Pathologic Function

- Sign or Symptom

- Finding

- Laboratory or Test Result

- Individual Behavior

- Clinical Attribute

- Organism Attribute

- Organism Function

- Organ or Tissue Function

- Cell or Molecular Dysfunction

These classifications were manually checked. In addition, disease entries referring to disease groups such as "Cardiovascular Diseases", "Autoimmune Diseases", "Neurodegenerative Diseases", and "Lung Neoplasms" were classified as disease group.
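The two lists of UMLS semantic types above can be turned into a small lookup. This sketch assumes the first list marks the type disease and the second the type phenotype, and that a disease semantic type takes precedence when a concept carries several types. The group type is not handled, because the text describes it as a manual classification.

```python
DISEASE_TYPES = {
    "Disease or Syndrome",
    "Neoplastic Process",
    "Acquired Abnormality",
    "Anatomical Abnormality",
    "Congenital Abnormality",
    "Mental or Behavioral Dysfunction",
}

PHENOTYPE_TYPES = {
    "Pathologic Function",
    "Sign or Symptom",
    "Finding",
    "Laboratory or Test Result",
    "Individual Behavior",
    "Clinical Attribute",
    "Organism Attribute",
    "Organism Function",
    "Organ or Tissue Function",
    "Cell or Molecular Dysfunction",
}


def classify(semantic_types):
    """Return 'disease', 'phenotype', or None for a concept's semantic types.

    Precedence of disease over phenotype is an assumption of this sketch.
    """
    types = set(semantic_types)
    if types & DISEASE_TYPES:
        return "disease"
    if types & PHENOTYPE_TYPES:
        return "phenotype"
    return None


print(classify(["Neoplastic Process"]), classify(["Finding"]), classify(["Gene or Genome"]))
```

Returning `None` for anything outside both lists mirrors the removal of terms that are not strictly diseases, such as those typed "Gene or Genome".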

We removed terms that other sources treat as diseases but that are not strictly diseases, such as terms belonging to the following UMLS® semantic types:

- Gene or Genome

- Genetic Function

- Immunologic Factor

- Injury or Poisoning

These attributes are shown in the different views of the browser, and they are all shown in the Disease Tab.

Genes

·the official gene symbol, from the NCBI

·the NCBI Official Full Name

·the UniProt accession

·the top level Panther protein class.

·the top level Reactome pathways.

·the Specificity Index (SI)

·the Pleiotropy Index (PI)

Variants

·The position in the chromosome

·The reference and alternative alleles

·The class of the variant: SNP, deletion, insertion, indel, somatic SNV, substitution, sequence alteration, and tandem repeat

·The allelic frequency according to the 1000 Genomes Project

·The allelic frequency according to the Exome Aggregation Consortium

·The most severe consequence type according to the VEP

·Links to dbSNP

·Links to ClinVar

·Links to Ensembl

Gene-disease associations

·the DisGeNET score

·the DisGeNET Gene-Disease Association Type

·the publication(s) that report the gene-disease association, with the PubMed Identifier

·a representative sentence from the publication describing the association between the gene and the disease (If a representative sentence is not found, we provide the title of the paper)

·the original source reporting the Gene-Disease Association

·For some sources, we provide the variant(s) associated to the gene-disease association

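IX. Cytoscape plugin

DisGeNET Cytoscape App

X. Local downloads

Tab-separated files are available for download (curated and BeFree gene-disease associations, and publications).

An RDF Linked Dataset is provided.

For querying large amounts of data at once, Python, Perl, and R scripts are also supported.

Mapping files are provided as well:

1. UniProt Downloads    DisGeNET genes -> UniProt entries

2. UMLS CUI  ->  MeSH Identifier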

 

  

Original article: https://www.cnblogs.com/wangshicheng/p/11018801.html

Time: 2024-10-11 16:20:47
