From:
https://arnaudceol.wordpress.com/2014/09/18/chromosome-coordinate-systems-0-based-1-based/
I had a hard time figuring out that different websites and file formats use different systems to represent genome coordinates.
Basically, the bases can be numbered in two ways: starting at 0 or starting at 1. These are the 0-based and 1-based coordinate systems.
0-based:
ACTGACTG 01234567
1-based:
ACTGACTG 12345678
A range is then called inclusive if its end index is part of the sequence, or exclusive (half-open) if it is not.
For instance, to represent the sequence TGAC in ACTGACTG:
0-based inclusive: 2-5
0-based exclusive: 2-6
1-based inclusive: 3-6
1-based exclusive: 3-7
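As a quick check, here is a minimal Python sketch that extracts TGAC from ACTGACTG under each convention (the three ranges listed above, plus the 0-based exclusive form 2-6). The function names are made up for this illustration. Python slices are themselves 0-based and exclusive, so every other convention is just an offset on the start, the end, or both.

```python
SEQ = "ACTGACTG"

def slice_zero_based_inclusive(seq, start, end):
    # The end index is part of the range, so extend the slice by one.
    return seq[start:end + 1]

def slice_zero_based_exclusive(seq, start, end):
    # Same convention as a native Python slice.
    return seq[start:end]

def slice_one_based_inclusive(seq, start, end):
    # Shift the start down by one; the inclusive 1-based end already
    # equals the exclusive 0-based end.
    return seq[start - 1:end]

def slice_one_based_exclusive(seq, start, end):
    # Shift both ends down by one.
    return seq[start - 1:end - 1]

print(slice_zero_based_inclusive(SEQ, 2, 5))  # TGAC
print(slice_zero_based_exclusive(SEQ, 2, 6))  # TGAC
print(slice_one_based_inclusive(SEQ, 3, 6))   # TGAC
print(slice_one_based_exclusive(SEQ, 3, 7))   # TGAC
```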
I tried to figure out which websites and applications use each
coordinate system. The results can be found below. For each source, I
provide the URL of the reference page where I found the information,
and an excerpt in which the system is described.
I found most of those links on Biostar (https://www.biostars.org/p/6373/) and on the blog of Casey M. Bergman (http://bergmanlab.smith.man.ac.uk/?p=36), who also wrote an article on this topic: https://www.landesbioscience.com/journals/mge/article/19479/.
- Ensembl: 1-based inclusive, ref: http://www.ensembl.org/Help/Faq?id=286 and http://www.ensembl.org/info/docs/api/core/core_tutorial.html
Ensembl, and many other bioinformatics applications, use inclusive
coordinates which start at 1. The first nucleotide of a DNA sequence is
1 and the first amino acid of a protein sequence is also 1. The length
of a sequence is defined as end – start + 1. (Ensembl GTF format = GFF2 = 1-based)
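The length rule quoted above is easy to get wrong by one, so here is a small Python sketch of it. The function name is hypothetical and is not part of the Ensembl API.

```python
def inclusive_length(start, end):
    """Length of a 1-based inclusive interval: end - start + 1."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based inclusive interval: {start}-{end}")
    return end - start + 1

# TGAC occupies positions 3-6 of ACTGACTG in 1-based inclusive coordinates.
print(inclusive_length(3, 6))  # 4
# A single base has start == end and length 1.
print(inclusive_length(5, 5))  # 1
```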
- UCSC: internal representation: 0-based start and 1-based end; display: 1-based, ref: http://www.ensembl.org/Help/Faq?id=286 and http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
Question:
“I am confused about the start coordinates for items in the refGene
table. It looks like you need to add “1” to the starting point in order
to get the same start coordinate as is shown by the Genome Browser. Why
is this the case?”
Response:
Our internal database representations of coordinates always have a
zero-based start and a one-based end. We add 1 to the start before
displaying coordinates in the Genome Browser. Therefore, they appear as
one-based start, one-based end in the graphical display. The refGene.txt file is a database file, and consequently is based on the internal representation.
We use this particular internal representation because it
simplifies coordinate arithmetic, i.e. it eliminates the need to add or
subtract 1 at every step. Unfortunately, it does create some confusion
when the internal representation is exposed or when we forget to add 1
before displaying a start coordinate. However, it saves us from much
trickier bugs. If you use a database dump file but would prefer to see
the one-based start coordinates, you will always need to add 1 to each
start coordinate.
If you submit data to the browser in position format
(chr#:##-##), the browser assumes this information is 1-based. If you
submit data in any other format (BED (chr# ## ##) or otherwise), the
browser will assume it is 0-based. You can see this both in our liftOver
utility and in our search bar, by entering the same numbers in position
or BED format and observing the results. Similarly, any data returned
by the browser in position format is 1-based, while data returned in BED
format is 0-based.
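The rule in the UCSC answer (add 1 to the internal start, leave the end alone) can be sketched in Python as below. This is only an illustration, not UCSC code: the function names and the regular expression are my own, and accepting thousands separators in the position string is an assumption of this sketch.

```python
import re

# Position format: chrom:start-end, read as 1-based inclusive.
# Commas in the numbers are tolerated (an assumption of this sketch).
_POSITION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")

def position_to_bed(position):
    """Convert 'chr:start-end' (1-based) to (chrom, chromStart, chromEnd) (0-based start)."""
    match = _POSITION.match(position.strip())
    if match is None:
        raise ValueError(f"not a position string: {position!r}")
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based range: {start}-{end}")
    # Only the start changes: subtract 1 to go from 1-based to 0-based.
    return match.group("chrom"), start - 1, end

def bed_to_position(chrom, chrom_start, chrom_end):
    """Convert a 0-based-start interval back to a 1-based position string."""
    # Add 1 to the start before displaying; the end is unchanged.
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"

print(position_to_bed("chr1:11-20"))    # ('chr1', 10, 20)
print(bed_to_position("chr1", 10, 20))  # chr1:11-20
```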
- BED format: 0-based exclusive (half-open), ref: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
BED format uses zero-based, half-open
coordinates, so the first 25 bases of a sequence are in the range 0-25
(those bases being numbered 0 to 24).
The first three required BED fields are:
chrom – The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart – The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd – The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
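One practical consequence of the half-open convention is that interval arithmetic needs no +1 or -1 corrections. The Python sketch below shows this for lengths and overlaps. The helper names are hypothetical, and both intervals are assumed to lie on the same chromosome.

```python
def bed_length(chrom_start, chrom_end):
    """Number of bases covered by a 0-based half-open interval."""
    return chrom_end - chrom_start

def bed_overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals on the same chromosome."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

# The first 100 bases: chromStart=0, chromEnd=100, bases numbered 0-99.
print(bed_length(0, 100))             # 100
# Adjacent features share a boundary value but no base.
print(bed_overlap(0, 100, 100, 200))  # 0
# Moving the second start back by one gives a single shared base (99).
print(bed_overlap(0, 100, 99, 200))   # 1
```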
- MAF: 1-based inclusive, ref: https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+%28MAF%29+Specification+-+v2.4
start: Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system).
end: Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system).
- Other 1-based coordinate formats: SAM, VCF, GFF and Wiggle, ref: http://samtools.github.io/hts-specs/SAMv1.pdf
- Other 0-based coordinate formats: BAM, BCFv2 and PSL, ref: http://samtools.github.io/hts-specs/SAMv1.pdf
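To keep the list above handy in a script, one option is a small lookup that converts a start coordinate from any of these formats to a 1-based value. This is only a sketch: the sets contain just the formats named in this post, the function name is hypothetical, and it says nothing about whether the end coordinate is inclusive, which has to be checked per format.

```python
# Formats grouped by the base of their start coordinate, as listed in this post.
ONE_BASED = {"SAM", "VCF", "GFF", "WIGGLE"}
ZERO_BASED = {"BAM", "BCFV2", "PSL", "BED"}

def start_to_one_based(fmt, start):
    """Return the start coordinate as a 1-based position for the given format."""
    key = fmt.upper()
    if key in ONE_BASED:
        return start
    if key in ZERO_BASED:
        return start + 1
    raise KeyError(f"unknown format: {fmt}")

print(start_to_one_based("BED", 0))    # 1
print(start_to_one_based("VCF", 1))    # 1
print(start_to_one_based("BCFv2", 9))  # 10
```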