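Using Google Cloud Platform | Hail | GWAS

References:

Hail

Hail - Tutorial. Hail can also be installed on Windows; see: Setting up a Spark environment on Windows

spark-2.2.0-bin-hadoop2.7 - the platform Hail depends on, used for parallel processing

Google Cloud Platform - the cloud platform

Broad's data cluster set-up tool

A thin wrapper around the Google Cloud SDK that makes it easier to operate.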

cloudtools is a small collection of command line tools intended to make using Hail on clusters running in Google Cloud's Dataproc service simpler.
These tools are written in Python and mostly function as wrappers around the gcloud suite of command line tools included in the Google Cloud SDK.

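Google Cloud basics

Install gcloud

Log in: [GCloud] Connecting gcloud to Google Cloud Platform under a new Google account

Run Jupyter Notebook on Google Cloud Platform in just 15 minutes

Basic operations:

Create a project

Open the console and click the three-dot icon

Create and delete virtual machines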

gcloud dataproc clusters delete name

Upload and delete files

# my-bucket and myfile.txt are placeholder names
gsutil cp myfile.txt gs://my-bucket/
gsutil rm gs://my-bucket/myfile.txt

Read and write files in a program

f1 = hc.read("gs://somewhere")
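As a rough sketch of the read-then-write pattern, the helper below takes an already created Hail context and copies a dataset from one path to another. The function name `copy_dataset` and the example paths are made up for illustration, and the `write(path, overwrite=...)` call assumes a Hail 0.1-style dataset object, so check it against the Hail version you actually run.

```python
def copy_dataset(hc, src, dst, overwrite=False):
    """Read a dataset from `src` and write it to `dst`.

    `hc` is an existing Hail context, as in the line above.
    Both paths may be gs:// URLs when running on Google Cloud.
    """
    dataset = hc.read(src)                   # same call as f1 = hc.read(...)
    dataset.write(dst, overwrite=overwrite)  # assumed Hail 0.1-style write
    return dataset
```

With a real context this would be called as `copy_dataset(hc, "gs://somewhere", "gs://somewhere-else")`, where the second path is a placeholder.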

  

So far this uses only a single VM. To use many Google Cloud VMs in parallel, you need a distributed computing framework such as Spark; Hail is built on top of Spark.
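The idea of splitting work across machines can be illustrated on a single machine. The sketch below is not Spark or Hail code: it uses Python's standard-library thread pool as a stand-in for cluster workers, and the partition data and function names are invented for illustration. Each worker counts alternate alleles in its own partition, and the partial results are then combined, which is roughly the map-and-reduce pattern that Spark applies across VMs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical genotype calls (0, 1 or 2 alternate alleles), already split
# into partitions the way a distributed system splits a large dataset.
partitions = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 0, 1],
]

def count_alt_alleles(partition):
    """Map step: each worker processes only its own partition."""
    return sum(partition)

def total_alt_alleles(parts, max_workers=3):
    """Run the map step in parallel, then reduce the partial sums."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_sums = list(pool.map(count_alt_alleles, parts))
    return sum(partial_sums)

print(total_alt_alleles(partitions))
```

On a real cluster the partitions live on different machines, so only the small partial sums need to travel over the network.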

Basic Hail usage

This snippet starts a cluster named "testcluster" with 1 master machine, 2 worker machines (the minimum/default), and 6 additional preemptible worker machines. Then, after the cluster is started (this can take a few minutes), a Hail script is submitted to the cluster "testcluster".

Basic principles of Spark

1. Run the wrapper locally to create the Google Cloud virtual machines

cluster start testcluster   --master-machine-type n1-highmem-8   --worker-machine-type n1-standard-8   --num-workers 8   --version devel   --spark 2.2.0   --zone asia-east1-a
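When the same cluster is created repeatedly, it can help to keep the options in one place. The sketch below only assembles the command line shown above from keyword arguments; the helper name `build_cluster_start` is made up, and nothing is executed unless you pass the result to `subprocess.run` yourself.

```python
import shlex

def build_cluster_start(name, **options):
    """Return the argv list for `cluster start NAME --key value ...`."""
    argv = ["cluster", "start", name]
    for key, value in options.items():
        # Python keyword master_machine_type becomes --master-machine-type
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv

argv = build_cluster_start(
    "testcluster",
    master_machine_type="n1-highmem-8",
    worker_machine_type="n1-standard-8",
    num_workers=8,
    version="devel",
    spark="2.2.0",
    zone="asia-east1-a",
)
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)
# To actually start the cluster (requires cloudtools and gcloud):
# import subprocess; subprocess.run(argv, check=True)
```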

2. Start a notebook

cluster connect testcluster notebook

3. Submit a script from your local machine to Google Cloud

cluster submit testcluster myhailscript.py

4. Log in to Google Cloud to install required software

gcloud compute ssh testcluster-m --zone asia-east1-a

5. Install scikit-learn

sudo su # to be root and install packages
/opt/conda/bin/conda install scikit-learn
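After installing, a quick check from Python confirms that the package is visible to the interpreter the notebook uses. This is a small sketch with a made-up helper name, `is_installed`; it only looks the module up and does not import it.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None

# scikit-learn is imported under the name "sklearn"
print("scikit-learn available:", is_installed("sklearn"))
```

If this prints False inside the notebook but the install succeeded, the notebook is probably running a different Python from /opt/conda/bin.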

Example papers

Genome-wide gene-environment analyses of depression and reported lifetime traumatic experiences in UK Biobank

Once you understand 80% of this paper, you have a basic grounding in genetics and statistics; it is very hands-on.

Depression is more frequently observed among individuals exposed to traumatic events. The relationship between trauma exposure and depression, including the role of genetic variation, is complex and poorly understood. The UK Biobank concurrently assessed depression and reported trauma exposure in 126,522 genotyped individuals of European ancestry. We compared the shared aetiology of depression and a range of phenotypes, contrasting individuals reporting trauma exposure with those who did not (final sample size range: 24,094- 92,957). Depression was heritable in participants reporting trauma exposure and in unexposed individuals, and the genetic correlation between the groups was substantial and not significantly different from 1. Genetic correlations between depression and psychiatric traits were strong regardless of reported trauma exposure, whereas genetic correlations between depression and body mass index (and related phenotypes) were observed only in trauma exposed individuals. The narrower range of genetic correlations in trauma unexposed depression and the lack of correlation with BMI echoes earlier ideas of endogenous depression.

Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression

Major depressive disorder (MDD) is a common illness accompanied by considerable morbidity, mortality, costs, and heightened risk of suicide. We conducted a genome-wide association meta-analysis based in 135,458 cases and 344,901 controls and identified 44 independent and significant loci. The genetic findings were associated with clinical features of major depression and implicated brain regions exhibiting anatomical differences in cases. Targets of antidepressant medications and genes involved in gene splicing were enriched for smaller association signal. We found important relationships of genetic risk for major depression with educational attainment, body mass, and schizophrenia: lower educational attainment and higher body mass were putatively causal, whereas major depression and schizophrenia reflected a partly shared biological etiology. All humans carry lesser or greater numbers of genetic risk factors for major depression. These findings help refine the basis of major depression and imply that a continuous measure of risk underlies the clinical phenotype.

Some questions

What is Hail used for?

Example: gnomAD

The Neale Lab at the Broad Institute used Hail to perform QC and genome-wide association analysis of 2419 phenotypes across 10 million variants and 337,000 samples from the UK Biobank in 24 hours. paper

Hail’s functionality is exposed through Python and backed by distributed algorithms built on top of Apache Spark to efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on a cluster.

Hail is:

  • a library for analyzing structured tabular and matrix data
  • a collection of primitives for operating on data in parallel
  • a suite of functionality for processing genetic data
  • not an acronym

# conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail
cd $HAIL_HOME/tutorials
jhail

 

Running a GWAS

1kg_annotations.txt

Sample  Population      SuperPopulation isFemale        PurpleHair      CaffeineConsumption
HG00096 GBR     EUR     False   False   77.0
HG00097 GBR     EUR     True    True    67.0
HG00098 GBR     EUR     False   False   83.0
HG00099 GBR     EUR     True    False   64.0
HG00100 GBR     EUR     True    False   59.0
HG00101 GBR     EUR     False   True    77.0
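To get a feel for the annotation file before loading it into Hail, it can be read with the standard library alone. The sketch below assumes the file is tab-separated (the columns above appear to be tabs flattened into spaces) and embeds the six rows shown above instead of opening the real file. The function name is invented for illustration.

```python
import csv
import io
from collections import defaultdict

# The six rows shown above, written out with explicit tabs.
SAMPLE = (
    "Sample\tPopulation\tSuperPopulation\tisFemale\tPurpleHair\tCaffeineConsumption\n"
    "HG00096\tGBR\tEUR\tFalse\tFalse\t77.0\n"
    "HG00097\tGBR\tEUR\tTrue\tTrue\t67.0\n"
    "HG00098\tGBR\tEUR\tFalse\tFalse\t83.0\n"
    "HG00099\tGBR\tEUR\tTrue\tFalse\t64.0\n"
    "HG00100\tGBR\tEUR\tTrue\tFalse\t59.0\n"
    "HG00101\tGBR\tEUR\tFalse\tTrue\t77.0\n"
)

def mean_caffeine_by_sex(handle):
    """Average the CaffeineConsumption column separately for each sex."""
    totals = defaultdict(lambda: [0.0, 0])  # sex -> [sum, count]
    for row in csv.DictReader(handle, delimiter="\t"):
        sex = "female" if row["isFemale"] == "True" else "male"
        totals[sex][0] += float(row["CaffeineConsumption"])
        totals[sex][1] += 1
    return {sex: total / count for sex, (total, count) in totals.items()}

means = mean_caffeine_by_sex(io.StringIO(SAMPLE))
print(means)
```

For the real file, replace `io.StringIO(SAMPLE)` with `open("1kg_annotations.txt", newline="")`.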

The 1kg.mt directory

.
├── _SUCCESS
├── cols
│   ├── _SUCCESS
│   ├── metadata.json.gz
│   └── rows
│       ├── metadata.json.gz
│       └── parts
│           └── part-0
├── entries
│   ├── _SUCCESS
│   ├── metadata.json.gz
│   └── rows
│       ├── metadata.json.gz
│       └── parts
│           ├── part-00-2-0-0-6886f608-afb6-1e68-684b-3c5920e7edd5
│           ├── part-01-2-1-0-3d30160f-dba0-16f4-e898-4e7c30148855
│           ├── part-02-2-2-0-1051da4b-6799-6074-7d32-9bd7fa9ed9af
├── globals
│   ├── _SUCCESS
│   ├── globals
│   │   ├── metadata.json.gz
│   │   └── parts
│   │       └── part-0
│   ├── metadata.json.gz
│   └── rows
│       ├── metadata.json.gz
│       └── parts
│           └── part-0
├── metadata.json.gz
├── references
└── rows
    ├── _SUCCESS
    ├── metadata.json.gz
    └── rows
        ├── metadata.json.gz
        └── parts
            ├── part-00-2-0-0-6886f608-afb6-1e68-684b-3c5920e7edd5
            ├── part-01-2-1-0-3d30160f-dba0-16f4-e898-4e7c30148855
            ├── part-02-2-2-0-1051da4b-6799-6074-7d32-9bd7fa9ed9af
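The listing shows four components (globals, cols, rows, entries), each holding its data as files under a `parts` directory. The sketch below counts those partition files per component. It is only an illustration of the layout shown above, not a Hail API: the helper names are made up, and the demo builds a mock directory with shortened file names in a temporary folder instead of reading a real 1kg.mt.

```python
import tempfile
from pathlib import Path

COMPONENTS = ("globals", "cols", "rows", "entries")

def count_part_files(mt_path):
    """Count the files under every `parts` directory of each component."""
    root = Path(mt_path)
    counts = {}
    for component in COMPONENTS:
        comp_dir = root / component
        if not comp_dir.is_dir():
            counts[component] = 0
            continue
        counts[component] = sum(
            1
            for parts_dir in comp_dir.rglob("parts")
            if parts_dir.is_dir()
            for entry in parts_dir.iterdir()
            if entry.is_file()
        )
    return counts

def make_mock_mt(root):
    """Create an empty directory tree mirroring the listing above."""
    layout = {
        "cols/rows/parts": ["part-0"],
        "entries/rows/parts": ["part-00", "part-01", "part-02"],
        "globals/globals/parts": ["part-0"],
        "globals/rows/parts": ["part-0"],
        "rows/rows/parts": ["part-00", "part-01", "part-02"],
    }
    for directory, files in layout.items():
        target = Path(root) / directory
        target.mkdir(parents=True)
        for name in files:
            (target / name).touch()

with tempfile.TemporaryDirectory() as tmp:
    make_mock_mt(tmp)
    counts = count_part_files(tmp)
print(counts)
```

In the listing, rows and entries have the same number of partition files with matching names, which suggests the two components are partitioned in step.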

Question: Run Jupyter Notebook on Google Cloud Platform in just 15 minutes

 

How GWAS works

GWAS analysis in clinical bioinformatics

Basic components of a GWAS analysis
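At its core, a GWAS on a quantitative trait repeats one simple test for every variant: regress the phenotype on the genotype dosage (0, 1 or 2 copies of the alternate allele) and ask whether the slope differs from zero. The sketch below does this for a single variant in plain Python. It is a simplified illustration, not Hail's regression method: the data are invented, there are no covariates, and a real analysis would also compute p-values and correct for multiple testing.

```python
import math

def linear_association(genotypes, phenotypes):
    """Simple linear regression of phenotype on genotype dosage.

    Returns (beta, standard_error, t_statistic) for the slope.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                       # effect size per alternate allele
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of the slope
    return beta, se, beta / se

# Hypothetical data: six samples, one variant, one quantitative phenotype.
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [60.0, 64.0, 70.0, 72.0, 80.0, 86.0]

beta, se, t_stat = linear_association(genotypes, phenotypes)
print(f"beta={beta:.3f} se={se:.3f} t={t_stat:.3f}")
```

A genome-wide scan loops this over millions of variants, which is why the work is distributed across a cluster.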

  

 

To be continued~

Original URL: https://www.cnblogs.com/leezx/p/8973288.html

Time: 2024-10-17 04:57:32
