Annotating variants in human and non-human genomes with ANNOVAR / 憋错料

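Partially adapted from: Yang H, Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR[J]. Nature Protocols, 2015, 10(10).

This post is only meant for sharing and discussion between the author and others who are new to ANNOVAR. For deeper study, please read the full paper carefully. Please credit the source when reposting.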

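ANNOVAR is a command-line tool written in Perl that runs on many operating systems where a Perl interpreter is installed. It accepts several input file formats, including the most widely used VCF format. It can also write several output formats, including annotated VCF files and tab- or comma-delimited text files. ANNOVAR quickly annotates genetic variants and predicts their functional effects. Similar variant annotation tools include VEP, snpEff, VAAST, and AnnTools.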

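ANNOVAR supports three types of annotation: gene-based, region-based, and filter-based. Each addresses a different aspect of a variant. Gene-based annotation reveals the relationship between a variant and known genes, and the functional impact of the variant on them. Region-based annotation reveals the relationship between a variant and specific genomic regions, for example whether it falls within a known conserved genomic region. Filter-based annotation provides further information about the variant, such as its allele frequency in different populations and various variant-deleteriousness prediction scores, which can be used to filter out common and probably nondeleterious variants.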

(A) Annotating human genome variants with ANNOVAR

(i) Fill in the registration form and download the ANNOVAR package (http://annovar.openbioinformatics.org/), the ‘annovar.latest.tar.gz’ file, then unpack it:

tar xvfz annovar.latest.tar.gz

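Critical: you can also add the ANNOVAR directory to your operating system's PATH environment variable, so that the ANNOVAR scripts can be run directly by typing the command name.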
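To confirm that the PATH change took effect, a small check like the following can help. It is only a sketch: the helper name `find_scripts` is made up for illustration, and it relies on Python's standard `shutil.which`, which only reports files that are marked executable.

```python
import shutil


def find_scripts(names):
    """Map each script name to its resolved location on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Script names taken from the commands used in this post.
    for name, location in find_scripts(
        ["annotate_variation.pl", "table_annovar.pl", "retrieve_seq_from_fasta.pl"]
    ).items():
        print(name, "->", location or "not found on PATH")
```

If a script is reported as not found even though its directory is on PATH, it may simply lack execute permission; running it through `perl <script>` as in the commands of this post still works in that case.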

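(ii) Download all the annotation databases you need. The databases for gene-based annotation already ship with the ANNOVAR package. For other annotations, download the databases into the ‘humandb/’ directory with the following commands: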

perl annotate_variation.pl --downdb --buildver hg19 cytoBand humandb/
perl annotate_variation.pl --downdb --webfrom annovar --buildver hg19 1000g2014oct humandb/
perl annotate_variation.pl --downdb --webfrom annovar --buildver hg19 exac03 humandb/
perl annotate_variation.pl --downdb --webfrom annovar --buildver hg19 ljb26_all humandb/
perl annotate_variation.pl --downdb --webfrom annovar --buildver hg19 clinvar_20140929 humandb/
perl annotate_variation.pl --downdb --webfrom annovar --buildver hg19 snp138 humandb/

These commands download several commonly used databases:

1. ‘cytoBand’: the chromosome coordinates of each cytogenetic band.

2. ‘1000g2014oct’: alternative allele frequencies in the 1000 Genomes Project (version October 2014). Like ExAC (the Exome Aggregation Consortium), this is a public, open database.

3. ‘exac03’: the variants reported in the Exome Aggregation Consortium (version 0.3).

4. ‘ljb26_all’: various functional deleteriousness prediction scores from the dbNSFP database (version 2.6). dbNSFP is described as ‘A Lightweight Database of Human Nonsynonymous SNPs and Their Functional Predictions’.

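5. ‘clinvar_20140929’: the variants reported in the ClinVar database (version 20140929).

ClinVar is a free public database that the US National Center for Biotechnology Information (NCBI) announced in November 2012 and formally launched in April 2013. As a core database, ClinVar integrates more than ten databases of different types, describes diseases with a standardized nomenclature, and lets researchers download the data locally for more customized analyses. NCBI and various research groups had already built many databases on genetic variation and clinical phenotypes, so the data were relatively scattered. ClinVar aims to bring these scattered data together, combining four kinds of information (variants, clinical phenotypes, supporting evidence, and functional annotation and analysis) and, through expert review, gradually building a standard, reliable, and stable database of relationships between genetic variants and clinical phenotypes.

6. ‘snp138’: the dbSNP database (version 138).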

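Notes: 1. The first command does not include the ‘--webfrom annovar’ option, so the file is downloaded from the UCSC Genome Browser annotation database.

2. The ‘--buildver hg19’ option specifies the hg19 genome build.

3. After running the commands above, several new files prefixed with ‘hg19’ appear in the ‘humandb/’ directory.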
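A quick way to see what was actually downloaded is to list the files in the database directory that carry the build prefix. This is a minimal sketch: the function name `list_build_files` is hypothetical, and it assumes the prefix is followed by an underscore (as in ‘panTro2_refGene.txt’ later in this post). It makes no claim about the exact file names each database produces.

```python
from pathlib import Path


def list_build_files(db_dir, buildver):
    """Return sorted names of files in db_dir that start with '<buildver>_'."""
    directory = Path(db_dir)
    if not directory.is_dir():
        return []
    prefix = buildver + "_"  # assumption: build prefix is followed by an underscore
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.name.startswith(prefix)
    )


if __name__ == "__main__":
    for name in list_build_files("humandb", "hg19"):
        print(name)
```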

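(iii) Annotate the variants with the ‘table_annovar.pl’ script. It lets you select multiple annotation types in a single command and controls the order in which they appear in the output (custom selection).

Enter the following command to annotate the variants in a VCF file with the annotation databases downloaded earlier: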

perl table_annovar.pl <variant.vcf> humandb/ --outfile final --buildver hg19 --protocol refGene,cytoBand,1000g2014oct_eur,1000g2014oct_afr,exac03,ljb26_all,clinvar_20140929,snp138 --operation g,r,f,f,f,f,f,f --vcfinput

<variant.vcf> refers to the name of the input VCF file.

The ‘--protocol’ option is followed by the exact names of the annotation source databases.

The ‘--operation’ option is followed by the annotation types: ‘g’ for gene-based annotation, ‘r’ for region-based annotation, and ‘f’ for filter-based annotation.

The ‘--outfile’ option specifies the prefix of the output files.

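CRITICAL STEP: 1. Make sure the annotation database names are correct and listed in the order in which you want them to appear in the output file.

2. Make sure the order of annotation types given to ‘--operation’ matches the order of databases given to ‘--protocol’.

3. Make sure that protocol names and annotation types are each separated by exactly one comma, with no whitespace.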
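These three checks are easy to get wrong in a long command line, so it can be worth validating the two strings before running ‘table_annovar.pl’. The sketch below is not part of ANNOVAR; `check_protocol_operation` is a hypothetical helper, and it only knows the three operation codes described in this post (‘g’, ‘r’, ‘f’).

```python
KNOWN_OPERATIONS = {"g", "r", "f"}  # only the codes covered in this post


def check_protocol_operation(protocol, operation):
    """Return a list of problems found in the --protocol and --operation strings."""
    problems = []
    for label, value in (("protocol", protocol), ("operation", operation)):
        if any(ch.isspace() for ch in value):
            problems.append(f"{label} contains whitespace")
        if "" in value.split(","):
            problems.append(f"{label} has an empty entry (doubled, leading or trailing comma)")

    names = protocol.split(",")
    codes = operation.split(",")
    if len(names) != len(codes):
        problems.append(f"{len(names)} protocol names but {len(codes)} operation codes")

    for code in codes:
        cleaned = code.strip()
        if cleaned and cleaned not in KNOWN_OPERATIONS:
            problems.append(f"operation code '{cleaned}' is not one of g, r, f covered in this post")
    return problems


if __name__ == "__main__":
    protocol = "refGene,cytoBand,1000g2014oct_eur,1000g2014oct_afr,exac03,ljb26_all,clinvar_20140929,snp138"
    operation = "g,r,f,f,f,f,f,f"
    print(check_protocol_operation(protocol, operation) or "looks consistent")
```

The helper cannot tell whether a database name is spelled correctly or whether ‘g’ is really paired with a gene database; it only catches the mechanical mistakes (whitespace, stray commas, mismatched counts).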

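(iv) Check the ‘final.hg19_multianno.vcf’ output file. It should be a VCF file whose INFO column holds several ‘key=value’ fields separated by ‘;’, for example ‘Func.refGene=intronic;Gene.refGene=SAMD11’. Each key-value pair represents one piece of ANNOVAR annotation. The file can be processed further with genetic analysis software designed for VCF files.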
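To illustrate the ‘key=value’ layout, the INFO string can be split into a dictionary with a few lines of Python. This is a minimal sketch with a hypothetical helper name (`parse_info`); it treats keys without ‘=’ as flags and does not handle any escaping that a particular tool may apply inside values.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' are stored as True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing or doubled ';'
        key, separator, value = item.partition("=")
        fields[key] = value if separator else True
    return fields


example = "Func.refGene=intronic;Gene.refGene=SAMD11"
annotations = parse_info(example)
print(annotations["Func.refGene"], annotations["Gene.refGene"])  # prints: intronic SAMD11
```

For real analyses, a dedicated VCF library is usually a safer choice than hand-written parsing.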

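(v) Check the ‘final.hg19_multianno.txt’ output file. Each line represents one variant. The file is tab-delimited; the extra columns hold the added annotations, in the order of the databases given as arguments to the ‘--protocol’ option.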
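Because the text output is tab-delimited with a header line, it can be read with Python's standard `csv` module. The sketch below filters rows by the value of one column. The helper name `rows_matching` is hypothetical, and the in-memory sample is made up purely for illustration: its coordinates and the gene name GENE_X are invented, and the leading column names (Chr, Start, End, Ref, Alt) are an assumption about the header.

```python
import csv
import io


def rows_matching(handle, column, wanted):
    """Return the rows of a tab-delimited table whose `column` equals `wanted`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row.get(column) == wanted]


# Invented sample data, NOT real ANNOVAR output.
sample = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n"
    "1\t100\t100\tA\tG\tintronic\tSAMD11\n"
    "1\t200\t200\tC\tT\texonic\tGENE_X\n"
)

hits = rows_matching(io.StringIO(sample), "Func.refGene", "exonic")
print([row["Gene.refGene"] for row in hits])  # prints: ['GENE_X']
```

With a real file, the same function can be called as `rows_matching(open("final.hg19_multianno.txt", newline=""), "Func.refGene", "exonic")`.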

(B) Gene-based annotation of a non-human species with ANNOVAR

CRITICAL STEP: this example annotates the chimpanzee genome (with the genome build identifier panTro2). Install ANNOVAR as in A(i).

For gene-based annotation, ANNOVAR needs a gene definition file in genePred format and a transcript sequence file in FASTA format.

(i) Enter the following commands to download the chimpanzee gene definition file and the genome sequence FASTA files into the ‘chimpdb/’ directory:

perl annotate_variation.pl --downdb --buildver panTro2 gene chimpdb/
perl annotate_variation.pl --downdb --buildver panTro2 seq chimpdb/panTro2_seq

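(ii) Note that ANNOVAR provides prebuilt transcript FASTA files only for the human genome, not for other species, so you need to build the transcript FASTA file for your species yourself with the following command: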

perl retrieve_seq_from_fasta.pl chimpdb/panTro2_refGene.txt --seqdir chimpdb/panTro2_seq --format refGene --outfile chimpdb/panTro2_refGeneMrna.fa

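1. ‘--seqdir’ gives the directory that holds the downloaded sequence files.

2. ‘--format’ gives the format of the gene definition file.

3. ‘--outfile’ gives the name of the output mRNA sequence file.

Critical: the output file name after ‘--outfile’ must have the form ‘<buildver>_refGeneMrna.fa’; otherwise the next step cannot find the correct transcript FASTA sequence file.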
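Since the file names are derived from the build identifier, a small check before running the annotation can catch a misnamed file early. The following is a sketch with hypothetical helper names (`expected_gene_files`, `missing_gene_files`); it only encodes the naming pattern shown in this post for the refGene protocol (‘<buildver>_refGene.txt’ and ‘<buildver>_refGeneMrna.fa’).

```python
from pathlib import Path


def expected_gene_files(db_dir, buildver):
    """Return the two file paths this post uses for refGene-based annotation."""
    base = Path(db_dir)
    return {
        "gene_definition": base / f"{buildver}_refGene.txt",
        "transcript_fasta": base / f"{buildver}_refGeneMrna.fa",
    }


def missing_gene_files(db_dir, buildver):
    """Return the expected files that do not exist yet, as strings."""
    return [
        str(path)
        for path in expected_gene_files(db_dir, buildver).values()
        if not path.is_file()
    ]


if __name__ == "__main__":
    missing = missing_gene_files("chimpdb", "panTro2")
    print("missing:", missing if missing else "none")
```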

(iii) Annotate the variants with the chimpanzee gene annotation:

perl table_annovar.pl <variant.vcf> chimpdb/ --vcfinput --outfile final --buildver panTro2 --protocol refGene --operation g

Here <variant.vcf> is the input VCF file, ‘chimpdb/’ is the directory of the downloaded databases, and the other arguments are the same as in the human genome annotation.

(iv) Check the output file ‘final.panTro2_multianno.txt’. The chimpanzee gene annotation is added after the input variants.

Critical: if no ready-made gene definition file is available, you can convert a GFF3 or GTF file produced by a gene prediction tool into a gene definition file.

The following example builds the files needed to annotate Arabidopsis thaliana.

#1. Download the Arabidopsis GTF file and genome FASTA file from http://plants.ensembl.org/info/website/ftp/index.html into the ‘atdb’ directory.

mkdir atdb
cd atdb
wget ftp://ftp.ensemblgenomes.org/pub/release-27/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/release-27/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.27.gtf.gz

#2. Unpack the files.

gunzip Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.27.gtf.gz

#3. Download the gff3ToGenePred or gtfToGenePred tool (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/). GTF is the recommended format, because some GFF3 files may not convert correctly.

#4. Use gtfToGenePred to convert the GTF file to a genePred file:

gtfToGenePred -genePredExt Arabidopsis_thaliana.TAIR10.27.gtf AT_refGene.txt

#5. Use retrieve_seq_from_fasta.pl to generate the transcript FASTA file:

perl ../retrieve_seq_from_fasta.pl --format refGene --seqfile Arabidopsis_thaliana.TAIR10.27.dna.genome.fa AT_refGene.txt --outfile AT_refGeneMrna.fa

#After this step, the annotation database files needed for gene-based annotation are ready. Now you can annotate a given VCF file using the procedure starting from B(iii). Please note that the ‘--buildver’ argument should be set to ‘AT’.

See http://annovar.openbioinformatics.org/en/latest/user-guide/gene/ for more details.