A Hands-on Whole-Genome Resequencing Analysis

Original 2017-06-08 曾健明 生信技能树

This walkthrough follows the GATK Best Practices, currently the most widely recognized pipeline for whole-genome resequencing analysis, especially for human studies.

PS: this article really belongs to my "live-streaming my genome" series, but I am publishing it separately for two reasons:

  1. First, the readership of that series has been low — perhaps people who missed the earlier posts feel the later ones are not worth following. Let me assure you: this post is self-contained and covers the full pipeline. Master it and you can skip the rest of the series.
  2. Second, several friends have recently written GATK tutorials, most of which are not to my taste. In response, here is one in my own style.

Pipeline overview

bwa(MEM alignment)

picard(SortSam)

picard(MarkDuplicates)

picard(FixMateInfo)

GATK(RealignerTargetCreator)

GATK(IndelRealigner)

GATK(BaseRecalibrator)

GATK(PrintReads)

GATK(HaplotypeCaller)

GATK(GenotypeGVCFs)

In this post I will run my own whole-genome resequencing data through the entire pipeline above, giving the code and the time consumed at each step.

Preparation

First, install the software:

  ## Download and install BWA
  cd ~/biosoft
  mkdir bwa && cd bwa
  # http://sourceforge.net/projects/bio-bwa/files/
  wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2
  tar xvfj bwa-0.7.15.tar.bz2 # x extracts, v is verbose, f takes the archive file name, j handles .bz2 compression
  cd bwa-0.7.15
  make
  ## Download and install samtools
  ## http://samtools.sourceforge.net/
  ## http://www.htslib.org/doc/samtools.html
  cd ~/biosoft
  mkdir samtools && cd samtools
  wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
  tar xvfj samtools-1.3.1.tar.bz2
  cd samtools-1.3.1
  ./configure --prefix=/home/jianmingzeng/biosoft/myBin
  make
  make install
  ~/biosoft/myBin/bin/samtools --help
  ~/biosoft/myBin/bin/plot-bamstats --help
  cd htslib-1.3.1
  ./configure --prefix=/home/jianmingzeng/biosoft/myBin
  make
  make install
  ~/biosoft/myBin/bin/tabix
  ## Download and install picardtools
  ## https://sourceforge.net/projects/picard/
  ## https://github.com/broadinstitute/picard
  cd ~/biosoft
  mkdir picardtools && cd picardtools
  wget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zip
  unzip picard-tools-1.119.zip
  mkdir 2.9.2 && cd 2.9.2
  wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jar
  ## GATK must be downloaded after registering on its website; it cannot be redistributed publicly

Next, download the required reference data:

  cd ~/reference
  mkdir -p genome/human_g1k_v37 && cd genome/human_g1k_v37
  # http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
  nohup wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz &
  gunzip human_g1k_v37.fasta.gz
  wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
  wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt
  java -jar ~/biosoft/picardtools/picard-tools-1.119/CreateSequenceDictionary.jar R=human_g1k_v37.fasta O=human_g1k_v37.dict
  cd ~/reference
  mkdir -p index/bwa && cd index/bwa
  nohup bwa index -p ~/reference/index/bwa/human_g1k_v37 ~/reference/genome/human_g1k_v37/human_g1k_v37.fasta 1>human_g1k_v37.bwa_index.log 2>&1 &
  mkdir -p ~/biosoft/GATK/resources/bundle/b37
  cd ~/biosoft/GATK/resources/bundle/b37
  wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz
  wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.idx.gz
  wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
  wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
  gunzip 1000G_phase1.indels.b37.vcf.idx.gz
  gunzip 1000G_phase1.indels.b37.vcf.gz
  gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.gz
  gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
  mkdir -p ~/annotation/variation/human/dbSNP
  cd ~/annotation/variation/human/dbSNP
  ## https://www.ncbi.nlm.nih.gov/projects/SNP/
  ## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/
  ## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/
  nohup wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz &
  wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz.tbi

Only once the software is installed and the reference genome and other required files are in place can the whole-genome resequencing analysis proper begin!

The whole-genome resequencing data

My whole-genome fastq files (the screenshot of the file listing is omitted here) come from 5 sequencing lanes, with unequal amounts of data per lane.

Data analysis

fastq2bam

First, the code:

  module load java/1.8.0_91
  GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta
  INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37
  GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar
  PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar
  DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz
  SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz
  INDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
  TMPDIR=/home/jianmingzeng/tmp/software
  ## samtools and bwa are in the environment
  ## samtools Version: 1.3.1 (using htslib 1.3.1)
  ## bwa Version: 0.7.15-r1140
  ## please keep the config file ($1) in three-column format: fq1 fq2 sample
  cat $1 | while read id
  do
     arr=($id)
     fq1=${arr[0]}
     fq2=${arr[1]}
     sample=${arr[2]}
     #####################################################
     ################ Step 1 : Alignment #################
     #####################################################
     echo bwa `date`
     bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2 > $sample.sam
     echo bwa `date`
     #####################################################
     ################ Step 2: Sort and Index #############
     #####################################################
     echo SortSam `date`
     java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bam
     samtools index $sample.bam
     echo SortSam `date`
     #####################################################
     ################ Step 3: Basic Statistics ###########
     #####################################################
     echo stats `date`
     samtools flagstat $sample.bam > ${sample}.alignment.flagstat
     samtools stats $sample.bam > ${sample}.alignment.stat
     echo plot-bamstats -p ${sample}_QC ${sample}.alignment.stat
     echo stats `date`
     #####################################################
     ####### Step 4: multiple filtering for bam files ####
     #####################################################
     ###MarkDuplicates###
     echo MarkDuplicates `date`
     java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD MarkDuplicates \
     INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metrics
     echo MarkDuplicates `date`
     ###FixMateInfo###
     echo FixMateInfo `date`
     java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD FixMateInformation \
     INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinate
     samtools index ${sample}_marked_fixed.bam
     echo FixMateInfo `date`
     echo ${sample}_marked_fixed.bam >> files.bamlist
     rm $sample.sam $sample.bam ${sample}_marked.bam
  done
  samtools merge -@ 5 -b files.bamlist merged.bam
  samtools index merged.bam

The code above is a bit long, but it is worth understanding carefully. It is simply batch processing: a loop over the 5 lanes of sequencing data. In a production pipeline I normally run the lanes in parallel rather than in a loop; here the loop lets me show the time consumed at each step, so you get a feel for whole-genome resequencing analysis.
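As a concrete illustration, here is a minimal sketch of the three-column config file the script expects as $1 (the fastq names below are hypothetical), with the same parsing logic run with echo standing in for the real tools:

```shell
# Hypothetical 3-column config: fq1 fq2 sample, one line per lane
cat > conf.txt <<'EOF'
L1_1.clean.fq.gz L1_2.clean.fq.gz L1
L2_1.clean.fq.gz L2_2.clean.fq.gz L2
EOF

# Identical parsing to the pipeline script; echo stands in for bwa/picard
cat conf.txt | while read id
do
    arr=($id)          # split the line on whitespace into an array
    fq1=${arr[0]}
    fq2=${arr[1]}
    sample=${arr[2]}
    echo "align $fq1 + $fq2 -> $sample.sam"
done
```

Each line of the config drives one full alignment-and-filtering pass of the loop.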

For lane L1, the time consumed was:

  [main] Real time: 15870.794 sec; CPU: 77463.156 sec
  picard.sam.SortSam done. Elapsed time: 45.60 minutes.
  picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 64.20 minutes.
  picard.sam.FixMateInformation done. Elapsed time: 58.05 minutes.

That is about 7.2 hours in total, just to align, compress, sort, and mark PCR duplicates for 10 GB of fastq data.
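The 7.2-hour figure is simply the sum of the four runtimes logged above:

```shell
# bwa real time is reported in seconds, the three Picard steps in minutes
awk 'BEGIN { printf "%.1f\n", 15870.794/3600 + (45.60 + 64.20 + 58.05)/60 }'
# → 7.2
```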

For fastq inputs of other sizes, the same pipeline took:

  [main] Real time: 9527.240 sec; CPU: 47758.233 sec
  [main] Real time: 16000.325 sec; CPU: 80595.629 sec
  [main] Real time: 29286.523 sec; CPU: 147524.841 sec
  [main] Real time: 28104.568 sec; CPU: 141519.377 sec
  picard.sam.SortSam done. Elapsed time: 29.02 minutes.
  picard.sam.SortSam done. Elapsed time: 61.26 minutes.
  picard.sam.SortSam done. Elapsed time: 98.39 minutes.
  picard.sam.SortSam done. Elapsed time: 117.16 minutes.
  picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 35.52 minutes.
  picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 54.41 minutes.
  picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 90.40 minutes.
  picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 93.03 minutes.
  picard.sam.FixMateInformation done. Elapsed time: 35.92 minutes.
  picard.sam.FixMateInformation done. Elapsed time: 66.31 minutes.
  picard.sam.FixMateInformation done. Elapsed time: 131.65 minutes.
  picard.sam.FixMateInformation done. Elapsed time: 122.31 minutes.

As mentioned above, the 5 lanes can run these steps in parallel; the slowest lane takes about 12 hours. Each job was given 5 threads and 40 GB of memory.

Reprocessing the BAM file with GATK

This stage operates on the merged.bam obtained by merging the 5 lanes in the previous step:

  -rw-rw-r-- 1 jianmingzeng jianmingzeng  57G Jun  7 11:32 merged.bam
  -rw-rw-r-- 1 jianmingzeng jianmingzeng 8.4M Jun  7 12:05 merged.bam.bai

The code:

  module load java/1.8.0_91
  GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta
  INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37
  GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar
  PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar
  DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz
  KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz
  Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
  KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf
  TMPDIR=/home/jianmingzeng/tmp/software
  ## samtools and bwa are in the environment
  ## samtools Version: 1.3.1 (using htslib 1.3.1)
  ## bwa Version: 0.7.15-r1140
  sample='merged'    ## the merged file from the previous step is merged.bam
  ###RealignerTargetCreator###
  echo RealignerTargetCreator `date`
  java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T RealignerTargetCreator \
  -I ${sample}.bam -R $GENOME -o ${sample}_target.intervals \
  -known $Mills_indels -known $KG_indels -nt 5
  echo RealignerTargetCreator `date`
  ###IndelRealigner###
  echo IndelRealigner `date`
  java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T IndelRealigner \
  -I ${sample}.bam -R $GENOME -targetIntervals ${sample}_target.intervals \
  -o ${sample}_realigned.bam -known $Mills_indels -known $KG_indels
  echo IndelRealigner `date`
  ###BaseRecalibrator###
  echo BaseRecalibrator `date`
  java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T BaseRecalibrator \
  -I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNP
  echo BaseRecalibrator `date`
  ###PrintReads###
  echo PrintReads `date`
  java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T PrintReads \
  -R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.table
  samtools index ${sample}_recal.bam
  echo PrintReads `date`
  ###delete_intermediate_files###

For sample L1, the time consumed was:

  INFO  15:50:24,097 ProgressMeter - Total runtime 1165.34 secs, 19.42 min, 0.32 hours
  INFO  17:21:00,917 ProgressMeter - Total runtime 4265.44 secs, 71.09 min, 1.18 hours
  INFO  19:58:23,969 ProgressMeter - Total runtime 9436.69 secs, 157.28 min, 2.62 hours
  INFO  23:41:00,540 ProgressMeter - Total runtime 13349.77 secs, 222.50 min, 3.71 hours

Clearly the most time-consuming step is the last one, PrintReads.

For the merged.bam of all 5 lanes, the time consumed so far is:

  INFO  17:54:59,396 ProgressMeter - Total runtime 5194.10 secs, 86.57 min, 1.44 hours
  INFO  02:04:10,907 ProgressMeter - Total runtime 22558.84 secs, 375.98 min, 6.27 hours
  ...........................
  ...........................

Runtime depends on the size of the input BAM: merged.bam is about 6 times the size of L1.bam, although runtime does not grow in strict proportion. In any case, the time cost on this whole-genome dataset is extreme — by the time I finished writing this article, the four GATK BAM-processing steps had still not completed on merged.bam. L1 needs about 8 hours, so merged.bam should take roughly 40 hours.
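The ~8-hour estimate for L1 is just the sum of the four GATK step runtimes from its logs:

```shell
# RealignerTargetCreator + IndelRealigner + BaseRecalibrator + PrintReads, in hours
awk 'BEGIN { printf "%.1f\n", 0.32 + 1.18 + 2.62 + 3.71 }'
# → 7.8
```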

Variant calling with GATK HaplotypeCaller

The code:

  module load java/1.8.0_91
  GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta
  INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37
  GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar
  PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar
  DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz
  KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz
  Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
  KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf
  TMPDIR=/home/jianmingzeng/tmp/software
  ## samtools and bwa are in the environment
  ## samtools Version: 1.3.1 (using htslib 1.3.1)
  ## bwa Version: 0.7.15-r1140
  fq1=P_jmzeng_DHG09057_AH33KVALXX_L1_1.clean.fq.gz   ## not used below
  fq2=P_jmzeng_DHG09057_AH33KVALXX_L1_2.clean.fq.gz   ## not used below
  sample='merged'
  java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T HaplotypeCaller \
  -R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP \
  -stand_emit_conf 10 -o ${sample}_recal_raw.snps.indels.vcf
  java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T HaplotypeCaller \
  -R $GENOME -I ${sample}_realigned.bam --dbsnp $DBSNP \
  -stand_emit_conf 10 -o ${sample}_realigned_raw.snps.indels.vcf

Time consumed:

  INFO  20:40:49,063 ProgressMeter - Total runtime 39243.88 secs, 654.06 min, 10.90 hours
  INFO  08:53:17,633 ProgressMeter - Total runtime 43939.69 secs, 732.33 min, 12.21 hours

Calling on recal.bam finished about 1.3 hours sooner than calling on realigned.bam, but both runs took a very long time.

Running the full pipeline produces output files at every step (the directory listing for L1 is omitted here).

Digging into the pipeline

A tutorial that only hands out code is of limited value. One that also gives the inputs, the outputs, and the time consumed is above average: readers can reproduce it with their own data, size their hardware accordingly, and anticipate the runtime.

But that is still not enough. I can also explain why each step is run and what each step actually does. After reading the walkthrough below, you will essentially have done a whole-genome resequencing analysis yourself.

Let us dissect the L1 sample.

First, BWA

Not much to say here: BWA is the first choice for aligning genomic data, and its runtime depends on the number of reads in the fastq files.

  CMD: bwa mem -t 5 -R @RG\tID:L1\tSM:L1\tLB:WGS\tPL:Illumina /home/jianmingzeng/reference/index/bwa/human_g1k_v37 P_jmzeng_DHG09057_AH33KVALXX_L1_1.clean.fq.gz P_jmzeng_DHG09057_AH33KVALXX_L1_2.clean.fq.gz
  [main] Real time: 15870.794 sec; CPU: 77463.156 sec
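A sanity check worth doing on any bwa log: the ratio of CPU time to wall-clock time tells you how effectively the requested threads were used. Here it comes out close to the 5 threads requested with -t 5:

```shell
# CPU seconds / real seconds ≈ effective number of busy threads
awk 'BEGIN { printf "%.1f\n", 77463.156 / 15870.794 }'
# → 4.9
```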

Next, Picard

Three steps use this software; their time and memory consumption were:

  picard.sam.SortSam done. Elapsed time: 44.15 minutes. Runtime.totalMemory()=13184794624
  picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 53.71 minutes. Runtime.totalMemory()=39832256512
  picard.sam.FixMateInformation done. Elapsed time: 53.79 minutes. Runtime.totalMemory()=9425649664

Alignment produces SAM-format data, which takes far too much disk space, so it is normally compressed and sorted into binary BAM; that is what SortSam does. The FixMateInformation step merely adds two tags to the BAM:

  add MC (CIGAR string for mate) and MQ (mapping quality of the mate/next segment) tags

The MarkDuplicates step is more involved. Because REMOVE_DUPLICATES=True was not set, reads are not removed, only marked: the FLAG in the second column of each SAM record is changed.

  Read 119776742 records.
  INFO    2017-06-05 10:57:22     MarkDuplicates  Marking 14482525 records as duplicates.
  INFO    2017-06-05 10:57:22     MarkDuplicates  Found 943146 optical duplicate clusters.
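From those counts you can compute the duplication rate directly, a number worth tracking across libraries:

```shell
# marked duplicates / total records, as a percentage
awk 'BEGIN { printf "%.1f%%\n", 100 * 14482525 / 119776742 }'
# → 12.1%
```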

Some of the changed FLAG values are listed below; you can look up what each value means at the page in the first comment.

  # https://broadinstitute.github.io/picard/explain-flags.html
  # diff -y -W 50 | grep '|'
  163              | 1187
  83               | 1107
  99               | 1123
  163              | 1187
  147              | 1171
  83               | 1107
  99               | 1123
  99               | 1123
  147              | 1171
  147              | 1171
  99               | 1123
  147              | 1171
  163              | 1187
  83               | 1107
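Every changed pair above differs by exactly 1024 (0x400), the SAM FLAG bit for "read is a PCR or optical duplicate" — a quick way to confirm that MarkDuplicates only set that one bit:

```shell
# For each (before, after) FLAG pair, show which bit was added
for pair in "163 1187" "83 1107" "99 1123" "147 1171"; do
    set -- $pair
    echo "$1 -> $2 : added bit $(( $2 - $1 ))"
done
```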

Finally, GATK

The SplitNCigarReads step can be skipped for genomic data; it is mainly intended for RNA-seq data.

The command:

  Program Args: -T SplitNCigarReads -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta \
  -I L1_marked_fixed.bam -o L1_marked_fixed_split.bam \
  -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

The run log:

  INFO  13:04:52,813 ProgressMeter - Total runtime 2398.74 secs, 39.98 min, 0.67 hours
  INFO  13:04:52,854 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)
  INFO  13:04:52,854 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  13:04:52,854 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
  INFO  13:04:52,855 MicroScheduler -   -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

As you can see, for whole-genome sequencing data this step has no effect at all, yet it costs 40 minutes; it should be dropped.

Next comes realignment around indels, which combines RealignerTargetCreator and IndelRealigner.

The command:

  Program Args: -T RealignerTargetCreator -I L1_marked_fixed_split.bam \
  -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -o L1_target.intervals \
  -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
  -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf -nt 5

The run log:

  INFO  15:50:24,097 ProgressMeter - Total runtime 1165.34 secs, 19.42 min, 0.32 hours
  INFO  15:50:24,097 MicroScheduler - 22094746 reads were filtered out during the traversal out of approximately 120826819 total reads (18.29%)
  INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  15:50:24,104 MicroScheduler -   -> 1774279 reads (1.47% of total) failing BadMateFilter
  INFO  15:50:24,104 MicroScheduler -   -> 14006627 reads (11.59% of total) failing DuplicateReadFilter
  INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
  INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
  INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
  INFO  15:50:24,105 MicroScheduler -   -> 6313840 reads (5.23% of total) failing MappingQualityZeroFilter
  INFO  15:50:24,105 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
  INFO  15:50:24,105 MicroScheduler -   -> 0 reads (0.00% of total) failing Platform454Filter
  INFO  15:50:24,105 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

The command:

  Program Args: -T IndelRealigner -I L1_marked_fixed_split.bam \
  -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta \
  -targetIntervals L1_target.intervals -o L1_realigned.bam \
  -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
  -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

The run log:

  INFO  17:21:00,917 ProgressMeter - Total runtime 4265.44 secs, 71.09 min, 1.18 hours
  INFO  17:21:00,920 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)
  INFO  17:21:00,920 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  17:21:00,920 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

Last comes base quality score recalibration, which combines BaseRecalibrator and PrintReads.

The command:

  Program Args: -T BaseRecalibrator -I L1_realigned.bam \
  -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -o L1_temp.table \
  -knownSites /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

The run log:

  INFO  19:58:23,969 ProgressMeter - Total runtime 9436.69 secs, 157.28 min, 2.62 hours
  INFO  19:58:23,970 MicroScheduler - 21179430 reads were filtered out during the traversal out of approximately 120614036 total reads (17.56%)
  INFO  19:58:23,970 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  19:58:23,970 MicroScheduler -   -> 14073643 reads (11.67% of total) failing DuplicateReadFilter
  INFO  19:58:23,970 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
  INFO  19:58:23,970 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
  INFO  19:58:23,971 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
  INFO  19:58:23,971 MicroScheduler -   -> 7105787 reads (5.89% of total) failing MappingQualityZeroFilter
  INFO  19:58:23,971 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
  INFO  19:58:23,971 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

The command:

  Program Args: -T PrintReads -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_realigned.bam -o L1_recal.bam -BQSR L1_temp.table

The run log:

  INFO  23:41:00,540 ProgressMeter - Total runtime 13349.77 secs, 222.50 min, 3.71 hours
  INFO  23:41:00,542 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)
  INFO  23:41:00,542 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  23:41:00,542 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

As you can see, this step is very time-consuming, and the BAM file nearly doubles in size.

Finally, GATK's real job: variant calling

I ran the variant-calling step not only on the final recalibrated BAM but also on the realigned BAM, so two timing records are shown below. Because the BaseRecalibrator step is so time-consuming and greatly inflates BAM storage, it is worth checking whether it is actually necessary.

The command:

  Program Args: -T HaplotypeCaller -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_recal.bam --dbsnp /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz -stand_emit_conf 10 -o L1_recal_raw.snps.indels.vcf

The run log:

  INFO  20:40:49,062 ProgressMeter -            done    3.101804739E9    10.9 h           12.0 s      100.0%    10.9 h       0.0 s
  INFO  20:40:49,063 ProgressMeter - Total runtime 39243.88 secs, 654.06 min, 10.90 hours
  INFO  20:40:49,064 MicroScheduler - 22384946 reads were filtered out during the traversal out of approximately 119776742 total reads (18.69%)
  INFO  20:40:49,064 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  20:40:49,064 MicroScheduler -   -> 13732328 reads (11.46% of total) failing DuplicateReadFilter
  INFO  20:40:49,065 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
  INFO  20:40:49,065 MicroScheduler -   -> 8652618 reads (7.22% of total) failing HCMappingQualityFilter
  INFO  20:40:49,066 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
  INFO  20:40:49,066 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
  INFO  20:40:49,066 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
  INFO  20:40:49,067 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

The command:

  Program Args: -T HaplotypeCaller -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_realigned.bam --dbsnp /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz -stand_emit_conf 10 -o L1_realigned_raw.snps.indels.vcf

The run log:

  INFO  08:53:17,633 ProgressMeter -            done    3.101804739E9    12.2 h           14.0 s      100.0%    12.2 h       0.0 s
  INFO  08:53:17,633 ProgressMeter - Total runtime 43939.69 secs, 732.33 min, 12.21 hours
  INFO  08:53:17,634 MicroScheduler - 22384946 reads were filtered out during the traversal out of approximately 119776742 total reads (18.69%)
  INFO  08:53:17,634 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
  INFO  08:53:17,635 MicroScheduler -   -> 13732328 reads (11.46% of total) failing DuplicateReadFilter
  INFO  08:53:17,635 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
  INFO  08:53:17,635 MicroScheduler -   -> 8652618 reads (7.22% of total) failing HCMappingQualityFilter
  INFO  08:53:17,636 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
  INFO  08:53:17,636 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
  INFO  08:53:17,636 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
  INFO  08:53:17,637 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter


For my whole-genome resequencing data, running this GATK best-practice pipeline took roughly 12 + 40 + 15 = 67 hours in total.

As for using GATK's multithreading options to speed up these steps, I will not go into that here; the GATK website documents it in detail.
