Variant Call Format (VCF)

Introduction

Variant Call Format (VCF) is a text file format for storing marker and genotype data. This short tutorial describes how Variant Call Format encodes data for single nucleotide variants.

Every VCF file has three parts in the following order:

  1. Meta-information lines (lines beginning with "##").
  2. One header line (line beginning with "#CHROM").
  3. Data lines containing marker and genotype data (one variant per line). A data line is called a VCF record.

Each VCF record has the same number of tab-separated fields as the header line. The symbol "." is used to denote missing data.
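The three-part layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name split_vcf is hypothetical, and the input is assumed to be an iterable of text lines with tab-separated fields.

```python
def split_vcf(lines):
    """Split VCF text lines into (meta_lines, header_fields, records).

    Hypothetical helper for illustration; assumes tab-separated fields.
    """
    meta, header, records = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        if line.startswith("##"):
            meta.append(line)  # meta-information line
        elif line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")  # header line
        else:
            fields = line.split("\t")  # data line (VCF record)
            if header is not None and len(fields) != len(header):
                raise ValueError("record and header differ in field count")
            records.append(fields)
    return meta, header, records


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMP001\tSAMP002",
    "20\t1291018\trs11449\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/1",
]
meta, header, records = split_vcf(example)
print(len(meta), header[9:], records[0][9:])
```

The field-count check mirrors the rule that every record has the same number of fields as the header line.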

Example VCF file

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
#CHROM  POS      ID       REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMP001  SAMP002
20      1291018  rs11449  G    A    .     PASS    .     GT      0/0      0/1
20      2300608  rs84825  C    T    .     PASS    .     GT:GP   0/1:.    0/1:0.03,0.97,0
20      2301308  rs84823  T    G    .     PASS    .     GT:PL   ./.:.    1/1:10,5,0

Meta-information lines

Each meta-information line must have the format ##KEY=VALUE and cannot contain white-space outside quoted strings (such as the Description values in the example). The first meta-information line must specify the VCF version number (version 4.1 in the example). Additional meta-information lines are optional, but are often included to describe terms used in the FILTER, INFO, and FORMAT fields. In the example, the additional meta-information lines say that GT means genotype, GP means the probability of each possible genotype call, and PL means the phred-scaled likelihood of each possible genotype call.
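As a rough sketch of how a ##KEY=VALUE line might be read, the Python below splits a meta-information line into its key and value, and unpacks structured values of the form <ID=...,Description="..."> into a dictionary. The function name parse_meta_line is hypothetical, and the regular expression is a simplification that assumes quoted strings contain no escaped quotes.

```python
import re


def parse_meta_line(line):
    """Return (key, value) for a '##KEY=VALUE' line.

    Hypothetical helper; structured values such as <ID=GT,...> are
    returned as a dict, plain values as a string.
    """
    if not line.startswith("##") or "=" not in line:
        raise ValueError("not a meta-information line: %r" % line)
    key, value = line[2:].split("=", 1)
    if value.startswith("<") and value.endswith(">"):
        # Match name=value pairs; a value is either a quoted string
        # (which may contain commas) or a run of non-comma characters.
        pairs = re.findall(r'([^=,]+)=("[^"]*"|[^,]*)', value[1:-1])
        value = {name: text.strip('"') for name, text in pairs}
    return key, value


print(parse_meta_line("##fileformat=VCFv4.1"))
print(parse_meta_line(
    '##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">'
))
```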

Marker information

The first nine columns of the header line and data lines describe the variants:

CHROM the chromosome.
POS the 1-based genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position.
ID a semicolon-separated list of marker identifiers.
REF the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
ALT the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate allele, the field should be a comma-separated list of alternate alleles.
QUAL probability that the ALT allele is incorrectly specified, expressed on the phred scale (-10 * log10(probability)).
FILTER either "PASS" or a semicolon-separated list of failed quality control filters.
INFO additional information, given as a semicolon-separated list of entries (no white space permitted).
FORMAT colon-separated list of data subfields reported for each sample. The format fields in the Example are explained below.
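The column descriptions above can be illustrated with a small Python sketch that reads the fixed columns of one record and converts a phred-scaled QUAL value back to a probability. The names parse_fixed_fields and qual_to_error_probability are hypothetical, the record is assumed to be tab-separated, and INFO is left as a raw string.

```python
def parse_fixed_fields(record_line):
    """Parse the fixed columns of one tab-separated VCF record (sketch)."""
    f = record_line.rstrip("\n").split("\t")
    if len(f) < 8:
        raise ValueError("a VCF record needs at least 8 fixed columns")
    chrom, pos, vid, ref, alt, qual, flt, info = f[:8]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": [] if vid == "." else vid.split(";"),
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),
        "QUAL": None if qual == "." else float(qual),  # "." means missing
        "FILTER": [] if flt == "." else flt.split(";"),
        "INFO": info,
        "FORMAT": f[8].split(":") if len(f) > 8 else [],
    }


def qual_to_error_probability(qual):
    """Invert the phred scale: QUAL = -10 * log10(probability)."""
    return 10 ** (-qual / 10)


rec = parse_fixed_fields(
    "20\t2300608\trs84825\tC\tT\t.\tPASS\t.\tGT:GP\t0/1:.\t0/1:0.03,0.97,0"
)
print(rec["POS"], rec["ALT"], rec["QUAL"], rec["FORMAT"])
print(qual_to_error_probability(30))  # about 0.001
```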

Sample data

After the nine fixed columns, each remaining column corresponds to one individual: the header line gives the sample identifier, and each data line gives that individual's colon-separated data subfields. The data subfields in a record must match that record's format subfields.
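One way to picture the pairing of format subfields with data subfields is the Python sketch below. The name sample_fields is hypothetical. Padding a short sample entry with "." is an assumption made here for robustness, not something this tutorial states.

```python
def sample_fields(format_field, sample_field):
    """Pair FORMAT keys with one sample's data subfields (sketch).

    Missing values (".") become None. Padding short entries with "."
    is an assumption of this example.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    if len(values) > len(keys):
        raise ValueError("more data subfields than format subfields")
    values += ["."] * (len(keys) - len(values))
    return {k: (None if v == "." else v) for k, v in zip(keys, values)}


print(sample_fields("GT:GP", "0/1:."))
print(sample_fields("GT:PL", "1/1:10,5,0"))
```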

The most common format subfield is GT (genotype) data. If the GT subfield is present, it must be the first subfield. In the sample data, genotype alleles are numeric: the REF allele is 0, the first ALT allele is 1, and so on. The allele separator is '/' for unphased genotypes and '|' for phased genotypes. In the example, all genotypes are unphased, and the genotypes for SAMP001 are homozygote reference, heterozygote, and missing in the first, second, and third records.
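The GT encoding can be decoded with a short Python sketch. The function names parse_genotype and describe are hypothetical, and the labels returned by describe simply reuse the wording of this tutorial.

```python
import re


def parse_genotype(gt):
    """Return (alleles, phased) for a GT string such as '0/1' or '1|0'.

    Alleles are integers (0 = REF, 1 = first ALT, ...) or None if missing.
    """
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased


def describe(gt):
    """Classify a genotype using the terms from the text above."""
    alleles, _ = parse_genotype(gt)
    if any(a is None for a in alleles):
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygote"
    return "homozygote reference" if alleles[0] == 0 else "homozygote alternate"


# SAMP001 genotypes from the three example records
for gt in ["0/0", "0/1", "./."]:
    print(gt, describe(gt))
```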

The second record contains a GP (genotype probability) format subfield, and the third record contains a PL (phred-scaled genotype likelihood) format subfield. GP and PL data subfields are three comma-separated values corresponding to the REF/REF, REF/ALT, and ALT/ALT genotypes in that order. To convert a phred-scaled likelihood P to a raw likelihood L, use the formula L = 10^(-P/10).

In the second record of the Example, the GP data subfield is missing for SAMP001 and the GP subfield for SAMP002 has probabilities of 0.03, 0.97, and 0 for the REF/REF, REF/ALT, and ALT/ALT genotypes.

In the third record of the Example, the PL data subfield is missing for SAMP001. The PL subfield for SAMP002 has phred-scaled likelihoods of 10, 5, and 0 and raw likelihoods of 0.1, 0.316, and 1 for the REF/REF, REF/ALT, and ALT/ALT genotypes. It is not necessary for the genotype likelihoods to sum to 1.0.
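The conversion from phred-scaled to raw likelihoods can be checked with a few lines of Python. The function name phred_to_likelihood is hypothetical. The input is the SAMP002 value "10,5,0" from the third record of the Example.

```python
def phred_to_likelihood(p):
    """Convert a phred-scaled likelihood P to a raw likelihood 10^(-P/10)."""
    return 10 ** (-p / 10)


pl = "10,5,0"  # SAMP002, third record of the Example
likelihoods = [phred_to_likelihood(float(x)) for x in pl.split(",")]
print([round(x, 3) for x in likelihoods])  # [0.1, 0.316, 1.0]
print(round(sum(likelihoods), 3))          # 1.416, so the values do not sum to 1
```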

Resources

To learn more about VCF, see the following:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

http://vcftools.sourceforge.net/VCF-poster.pdf

Date: 2024-08-02 09:30:50
