1, What is GATK?
The Genome Analysis Toolkit or GATK is a software package developed at the
Broad Institute to analyse next-generation resequencing data.
The toolkit offers a wide variety of tools, with a primary focus on variant
discovery and genotyping as well as strong emphasis on data quality
assurance.
Its robust architecture, powerful processing engine and high-performance
computing features make it capable of taking on projects of any size.
2, How do you call SNPs with GATK?
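SNP calling takes processed BAM files as input (how to prepare them is covered in a separate post). The tool used is HaplotypeCaller. Suppose I have several BAM files, such as
LC17-1_L005.sorted.rmp.rg.recal.bam,
LC17-2_L008.sorted.rmp.rg.recal.bam,
RC17-1_L003.sorted.rmp.rg.recal.bam,
RC17-3_L004.sorted.rmp.rg.recal.bam,
all of which have been processed to meet GATK's requirements and all of which belong to sample C17. To call SNPs for sample C17, the command is as follows: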
java -jar ./GenomeAnalysisTK.jar -nct 50 -T HaplotypeCaller -R RAP_cDAN.fasta \
-I LC17-1_L002.sorted.rmp.rg.recal.bam -I LC17-1_L005.sorted.rmp.rg.recal.bam \
-I LC17-2_L006.sorted.rmp.rg.recal.bam -I LC17-2_L008.sorted.rmp.rg.recal.bam \
-I LC17-3_L002.sorted.rmp.rg.recal.bam -I RC17-1_L003.sorted.rmp.rg.recal.bam \
-I RC17-2_L004.sorted.rmp.rg.recal.bam -I RC17-3_L004.sorted.rmp.rg.recal.bam \
-o gatk.vcf
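The command has to be entered as a single line, which is why each line ends with a continuation backslash. The tool used is GATK's HaplotypeCaller:
-R is followed by the reference sequence, -I by a BAM file (all of these BAM files belong to one sample), and -o by the name of the output file.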
-nct sets the number of threads, but multithreading does not currently work here, so only one CPU is actually used.
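When a sample is spread over many lane-level BAM files, typing each -I by hand makes it easy to drop or mistype one. A minimal sketch of generating the command instead, assuming a POSIX shell and the same -T, -R, -I and -o flags as above; the function name build_hc_cmd is hypothetical, -nct is left out, and the function only prints the command rather than running it:

```shell
# Hypothetical helper: print a HaplotypeCaller command with one -I per BAM file.
# Usage: build_hc_cmd <reference.fasta> <output.vcf> <bam> [<bam> ...]
# The body runs in a subshell, so its variables do not leak into the caller.
# Assumption: file names contain no spaces.
build_hc_cmd() (
    ref=$1
    out=$2
    shift 2
    cmd="java -jar ./GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref"
    for bam in "$@"; do
        cmd="$cmd -I $bam"    # append one -I argument per input BAM
    done
    printf '%s -o %s\n' "$cmd" "$out"
)

# Dry run: print the command for two of the BAM files from the example.
build_hc_cmd RAP_cDAN.fasta gatk.vcf \
    LC17-1_L005.sorted.rmp.rg.recal.bam \
    LC17-2_L008.sorted.rmp.rg.recal.bam
```

Printing first and running only after a visual check keeps a wrong file list from starting a long job.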
The result file is gatk.vcf.
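HaplotypeCaller writes indels as well as SNPs into the same VCF, so a quick sanity check is to count how many records are actually SNPs. A rough sketch, assuming the standard VCF column layout (REF in column 4, ALT in column 5) and counting a record as a SNP only when REF and every ALT allele are single bases; the function name count_snps is hypothetical:

```shell
# Hypothetical helper: count SNP records in a VCF file.
# Usage: count_snps gatk.vcf
count_snps() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        length($4) == 1 {                  # REF must be a single base
            n = split($5, alt, ",")        # ALT may list several alleles
            ok = 1
            for (i = 1; i <= n; i++)
                if (alt[i] !~ /^[ACGT]$/) ok = 0
            if (ok) count++
        }
        END { print count + 0 }            # print 0 when nothing matched
    ' "$1"
}
```

This only counts records and does not write a filtered file; multi-allelic sites are counted once.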