Introduction to the GWAS Catalog Database

GWAS Catalog

The NHGRI-EBI Catalog of published genome-wide association studies

A database of published GWAS studies, maintained by EBI.

Catalog stats

  • Last data release on 2019-09-24
  • 4220 publications
  • 107486 SNPs
  • 157336 associations
  • Genome assembly GRCh38.p12
  • dbSNP Build 151
  • Ensembl Build 96

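Basic search methods

Search by trait: e.g. breast carcinoma. The results give well-standardized trait information based on EFO (Experimental Factor Ontology), which, much like GO, is a controlled classification scheme, in this case for traits. The genes associated with the trait are listed as well.

Search by SNP: e.g. rs7329174. This returns detailed information about the variant and the corresponding genes.

Search by author name: e.g. Yao. This returns the relevant publications.

Search by chromosomal location: e.g. 2q37.1 (a cytogenetic region).

Search by gene: e.g. HBS1L.

Search by region: e.g. 6:16000000-25000000.

Although it is called a database, it is essentially a single table. It can be downloaded from the Catalog website and is no more than about 100 MB.

The table contains the following columns (in the Catalog's file-header documentation, * marks a column in the associations file and + marks a column in the studies file):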

DATE ADDED TO CATALOG* +: Date a study is published in the catalog

PUBMEDID* +: PubMed identification number

FIRST AUTHOR* +: Last name and initials of first author

DATE* +: Publication date (online (epub) date if available)

JOURNAL* +: Abbreviated journal name

LINK* +: PubMed URL

STUDY* +: Title of paper

DISEASE/TRAIT* +: Disease or trait examined in study

INITIAL SAMPLE DESCRIPTION* +: Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)

REPLICATION SAMPLE DESCRIPTION* +: Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)

REGION*: Cytogenetic region associated with rs number

CHR_ID*: Chromosome number associated with rs number

CHR_POS*: Chromosomal position associated with rs number

REPORTED GENE(S)*: Gene(s) reported by author

MAPPED GENE(S)*: Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.

UPSTREAM_GENE_ID*: Entrez Gene ID for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_ID*: Entrez Gene ID for nearest downstream gene to rs number, if not within gene

SNP_GENE_IDS*: Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts

UPSTREAM_GENE_DISTANCE*: distance in kb for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_DISTANCE*: distance in kb for nearest downstream gene to rs number, if not within gene

STRONGEST SNP-RISK ALLELE*: SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.

SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)

MERGED*: denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes)

SNP_ID_CURRENT*: current rs number (will differ from strongest SNP when merged = 1)

CONTEXT*: SNP functional class

INTERGENIC*: denotes whether SNP is in intergenic region (0 = no; 1 = yes)

RISK ALLELE FREQUENCY*: Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.

P-VALUE*: Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 x 10^-7 is rounded to 5 x 10^-7).

PVALUE_MLOG*: -log(p-value)

P-VALUE (TEXT)*: Information describing context of p-value (e.g. females, smokers).

OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.

95% CI (TEXT)*: Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.

PLATFORM (SNPS PASSING QC)*: Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable

CNV*: Study of copy number variation (yes/no)

ASSOCIATION COUNT+: Number of associations identified for this study
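As a rough sketch of how the region search mentioned earlier (e.g. 6:16000000-25000000) could be reproduced on the downloaded table, the Python snippet below filters rows by CHR_ID and CHR_POS. It assumes the file is tab-separated and uses the column names listed above; the three sample rows and the rsTEST identifiers are made up for illustration and are not real Catalog records.

```python
import csv
import io

# Hypothetical stand-in for the downloaded associations file (tab-separated).
SAMPLE_TSV = (
    "DISEASE/TRAIT\tCHR_ID\tCHR_POS\tSNPS\tP-VALUE\n"
    "example trait A\t6\t16500000\trsTEST1\t5E-9\n"
    "example trait B\t6\t30000000\trsTEST2\t2E-8\n"
    "example trait A\t2\t233000000\trsTEST3\t4E-12\n"
)


def load_rows(handle):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def in_region(row, chrom, start, end):
    """True if the row's SNP lies on `chrom` within [start, end] (inclusive)."""
    # Skip rows whose position is empty or not a plain integer.
    if row["CHR_ID"] != chrom or not row["CHR_POS"].isdigit():
        return False
    return start <= int(row["CHR_POS"]) <= end


rows = load_rows(io.StringIO(SAMPLE_TSV))
hits = [r["SNPS"] for r in rows if in_region(r, "6", 16_000_000, 25_000_000)]
print(hits)
```

For the real file, `io.StringIO(SAMPLE_TSV)` would be replaced by an opened file handle; the same filter idea applies to searching by SNP or by trait.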

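Some open questions:

What is genotyping technology?

What is an Experimental Factor Ontology trait?

What is a cytogenetic region? (See karyotype.)

What does "trait + risk allele" mean? Here it is important to distinguish a SNP from an allele: the SNP is the locus, whereas the allele is the base found at that locus. Think about the two strands of DNA, and about polyploidy.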
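To make the SNP/allele distinction concrete, here is a minimal sketch that splits a STRONGEST SNP-RISK ALLELE value into its locus and its allele. It assumes the field has the form "rsID-allele" with "?" for an unknown allele, as the column description suggests; the function name and the rsTEST identifiers are hypothetical.

```python
def split_snp_risk_allele(value):
    """Split 'rsID-allele' into (snp, allele); an unknown allele ('?') becomes None."""
    value = value.strip()
    # Split on the last hyphen so that any hyphen inside the ID is preserved.
    snp, sep, allele = value.rpartition("-")
    if not sep:
        # No hyphen at all: treat the whole value as the SNP with no allele given.
        return value, None
    return snp, (None if allele == "?" else allele)


print(split_snp_risk_allele("rsTEST1-A"))
print(split_snp_risk_allele("rsTEST2-?"))
```

The SNP part identifies where in the genome the variant is; the allele part says which base at that position was reported as the risk allele.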

What is risk/effect allele frequency?

What kind of measure is the odds ratio in GWAS? From Wikipedia:

The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele.

As an example, suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).

When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[18]
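The (A/B)/(X/Y) definition and the chi-squared test can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical, the chi-squared statistic is the plain 2x2 Pearson form without continuity correction, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(chi2/2)).

```python
import math


def odds_ratio(a, b, x, y):
    """Odds ratio for allele T: (A/B) / (X/Y).

    a, b: cases / controls carrying allele T
    x, y: cases / controls carrying allele C
    """
    return (a / b) / (x / y)


def chi_squared_2x2(a, b, x, y):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [x, y]]."""
    n = a + b + x + y
    numerator = n * (a * y - b * x) ** 2
    denominator = (a + b) * (x + y) * (a + x) * (b + y)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: allele T is more common among cases than controls.
A, B, X, Y = 200, 100, 100, 200
or_t = odds_ratio(A, B, X, Y)
chi2 = chi_squared_2x2(A, B, X, Y)
print(or_t, chi2, p_value_1df(chi2))
```

With these counts the odds ratio is 4, well above 1, and the p-value is very small, which is the pattern a GWAS looks for.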

What is MAF? The frequency of the minor allele.

What annotations can GWAS data carry? Phenotype annotation, plus population and linkage disequilibrium (LD) information.

What is a CP (cross-phenotype) locus? An effective region associated with at least two phenotypes.

What is genotype calling?

What are the most basic QC steps for GWAS?

Quality Control Procedures for Genome Wide Association Studies

Data quality control in genetic case-control association studies

  • minor allele frequency (MAF) > 0.01; statistical power is extremely low for rare SNPs. This is easy to understand: for a very rare SNP, a very large sample size is needed to reach adequate power.
  • Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05;
  • missing genotypes rate < 10%; Genotypes are classified as missing if the genotype-calling algorithm cannot infer the genotype with sufficient confidence. Can be calculated across each individual and/or SNP.
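The three thresholds above can be combined into a per-SNP filter. The sketch below is an assumption-laden illustration, not a substitute for a dedicated tool: it takes genotype counts for one biallelic SNP, and it approximates the HWE test with a 1-degree-of-freedom chi-squared test, whereas real pipelines often use an exact test. The function and parameter names are hypothetical.

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           maf_min=0.01, hwe_p_min=5e-5, max_missing=0.10):
    """Apply MAF, HWE and missingness filters to one biallelic SNP.

    n_aa, n_ab, n_bb: counts of the three called genotypes
    n_missing: number of individuals whose genotype could not be called
    """
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return {"maf": None, "hwe_p": None, "missing_rate": 1.0, "passed": False}

    missing_rate = n_missing / total
    p = (2 * n_aa + n_ab) / (2 * called)  # frequency of allele A
    maf = min(p, 1 - p)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (called * p * p, 2 * called * p * (1 - p), called * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-squared upper tail, 1 df

    passed = maf > maf_min and hwe_p > hwe_p_min and missing_rate < max_missing
    return {"maf": maf, "hwe_p": hwe_p, "missing_rate": missing_rate, "passed": passed}


print(snp_qc(250, 500, 250, 0))   # common SNP in perfect HWE
print(snp_qc(998, 2, 0, 0))       # rare SNP, fails the MAF filter
```

The same missingness calculation can be done per individual instead of per SNP by counting missing genotypes across all SNPs for that person.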

What is the Experimental Factor Ontology?

What is LD information (r2 and D' values)?

Mathematical properties of the r2 measure of linkage disequilibrium
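For orientation, both LD measures can be computed from three frequencies at two biallelic loci: the frequency of the haplotype carrying allele A at the first locus and allele B at the second, and the two allele frequencies. The sketch below uses the standard definitions D = pAB - pA*pB, r2 = D^2 / (pA(1-pA)pB(1-pB)), and D' = |D| / Dmax; the function name and input values are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (r2, |D'|) from haplotype frequency p_ab and allele frequencies p_a, p_b."""
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    r2 = d * d / denom

    # D' rescales D by its largest possible magnitude given the allele frequencies.
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d != 0 else 0.0
    return r2, d_prime


print(ld_stats(0.5, 0.5, 0.5))    # alleles always travel together
print(ld_stats(0.25, 0.5, 0.5))   # loci are independent
```

Complete LD gives r2 = 1 and D' = 1, while independent loci give 0 for both; in between, D' can be high while r2 stays low when the allele frequencies at the two loci differ.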

To be continued.

Original post: https://www.cnblogs.com/leezx/p/11594823.html

Time: 2024-10-29 16:36:59
