Biopython - sequences and alphabets

The Sequence object

Some examples will also require a working internet connection in order to run.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq(‘AGTACACTGGT‘, IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()

A Seq object in python acts like a normal python string.

>>> for letter in my_seq:
...     print letter
>>> len(my_seq)
>>> my_seq[4:12]
>>> my_seq[::-1]
>>> str(my_seq)

Nucleotide counts, transcription, translation

>>> my_seq.count("A")
3

to get the GC nucleotide content.

>>> from Bio.SeqUtils import GC
>>> GC(my_seq)
45.45454545454545

transcription and translation

>>> my_mRNA = my_seq.transcribe()
Seq(‘AGUACACUGGU‘, IUPACUnambiguousRNA())
>>> my_seq.translate()
Seq(‘STL‘, IUPACProtein())

complement and reverse complement

>>> str(my_seq)
‘AGTACACTGGT‘
>>> my_seq.complement()
Seq(‘TCATGTGACCA‘, IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq(‘ACCAGTGTACT‘, IUPACUnambiguousDNA())

You can translate directly from the DNA coding sequence or you can use the mRNA directly.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq(‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG‘, IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq(‘MAIVMGR*KGAR*‘, HasStopCodon(IUPACProtein(), ‘*‘))

Now, you may want to translate the nucleotides up to the first in frame stop codon, and then stop (as happens in nature):

>>> coding_dna.translate()
Seq(‘MAIVMGR*KGAR*‘, HasStopCodon(IUPACProtein(), ‘*‘))
>>> coding_dna.translate(to_stop=True)
Seq(‘MAIVMGR‘, IUPACProtein())

Exercise

  1. There is so much stuff available in biopython.  What happens if you do this?

    >>> from Bio.Data import CodonTable
    >>> standard_table = CodonTable.unambiguous_dna_by_id[1]
    >>> mito_table = CodonTable.unambiguous_dna_by_id[2]
    >>> print standard_table
    >>> print mito_table
    

The Sequence record object

The SeqRecord objects are the basic data type for the SeqIO objects and they are very similar to Seq objects,however, there are a few additional attributes.

  • seq  - The sequence itself, typically a Seq object.
  • id    - The primary ID used to identify the sequence – a string. In most cases this is something like an accession number.
  • name - A ‘common’ name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be a clone name. Analagous to the LOCUS id in a GenBank record.
  • description - A human readable description or expressive name for the sequence – a string.

We can think of the SeqRecord as a container that has the above attributes including the Seq.

Exercise

  1. Copy the following script into an editor and save as ‘BioSequences.py’
  2. Open a terminal window and cd into the appropriate directory.
  3. Fill in the missing lines with code
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

## create a simple SeqRecord object
simple_seq = Seq("GATCAGGATTAGGCC")
simple_seq_r = SeqRecord(simple_seq)
simple_seq_r.id = "AC12345"
simple_seq_r.description = "I am not a real sequence"

## print summary
print simple_seq_r.id
print simple_seq_r.description
print str(simple_seq_r.seq)
print simple_seq_r.seq.alphabet

## translate the sequence
translated_seq = simple_seq_r.seq.translate()
print translated_seq

# exercise 1 -- translate the sequence only until the stop codon

# exercise 2 -- get the reverse complement of the sequence

# exercise 3 -- get the reverse of the sequence (just like for lists)

# exercise 4 -- get the GC nucleotide content

The Sequence IO object

There is one more object that we have to discuss and this the SeqIO object is like a container for multiple SeqRecord objects.

import os
from Bio import SeqIO

‘‘‘
We use a list here to save the gene records from a FASTA file
If you have many records a dictionary will make the program faster.

‘‘‘

## save the sequence records to a list
allSeqRecords = []
allSeqIDs     = []
pathToFile = os.path.join("..","data","ls_orchid.fasta")
for seq_record in SeqIO.parse(pathToFile, "fasta"):
    allSeqRecords.append(seq_record)
    allSeqIDs.append(seq_record.id.split("|")[1])
    print seq_record.id
    print str(seq_record.seq)
    print len(seq_record)

## print out fun stuff about the sequences
print "We found ", len(allSeqIDs), "sequences"
print "information on the third sequence:"
ind = 2
seqRec = allSeqRecords[ind]
print "\t", "GI number     ", allSeqIDs[ind]
print "\t", "full id       ", seqRec.id
print "\t", "num nucleo.   ", len(seqRec.seq)
print "\t", "1st 10 nucleo.", seqRec.seq[:10]

Just as easy as it is to read a set of files we can save modified versions (i.e. QA). Also, it is almost the exact same code as above to parse sequences from a GenBank (.gb) file.

There is really way to much to cover in the time we have, but if you have Next Generation Sequencing data then refer to sections 4.8, 16.1.7 and 16.1.8 of the biopython tutorial. There is even support for binary formats (i.e. SFF).

时间: 2024-09-28 14:19:00

Biopython - sequences and alphabets的相关文章

Biopython - basics

Introduction From the biopython website their goal is to “make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts.” These modules use the biopython tutorial as a template for what you will l

[LeetCode]Repeated DNA Sequences

题目:Repeated DNA Sequences 给定包含A.C.G.T四个字符的字符串找出其中十个字符的重复子串. 思路: 首先,string中只有ACGT四个字符,因此可以将string看成是1,3,7,20这三个数字的组合串: 并且可以发现{ACGT}%5={1,3,2,0};于是可以用两个位就能表示上面的四个字符: 同时,一个子序列有10个字符,一共需要20bit,即int型数据类型就能表示一个子序列: 这样可以使用计数排序的思想来统计重复子序列: 这个思路时间复杂度只有O(n),但是

Repeated DNA Sequences

package cn.edu.xidian.sselab.hashtable; import java.util.ArrayList;import java.util.HashSet;import java.util.List;import java.util.Set; /** *  * @author zhiyong wang * title: Repeated DNA Sequences * content: *  All DNA is composed of a series of nuc

LeetCode-Repeated DNA Sequences (位图算法减少内存)

Repeated DNA Sequences All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA. Write a function to find all

【矩阵快速幂 】Codeforces 450B - Jzzhu and Sequences (公式转化)

[题目链接]click here~~ [题目大意] Jzzhu has invented a kind of sequences, they meet the following property: You are given x and y, please calculate fn modulo1000000007(109?+?7). [解题思路] /*A - Jzzhu and Sequences Codeforces 450B - Jzzhu and Sequences ( 矩阵快速幂 )

Codeforces Round #FF(255) C. DZY Loves Sequences (LIS升级)

题目:C. DZY Loves Sequences (LIS升级) 题意: 在n个数中,最多改变一个数字,并求能够达到的最长严格上升子序列(连续)长度 分析: 考虑第i个数,能否改变后拼接前后两个字串,并维护当前最大值 状态: left[i]:  表示以i为终点的最长严格上升子序列长度 right[i]: 表示以i为起点的最长严格上升子序列长度 dp[i]:   表示改变第i个数后,拼接前后字串的长度 转移方程:       dp[i] = max{left[i-1] + right[i+1] 

Codeforces Round #243 (Div. 1)——Sereja and Two Sequences

题目链接 题意:给两个长度分别为n和m的序列,现在有两种操作:1.分别选择两个序列的一个非空前缀,切两个前缀的最后一位相同,删除之,得到1分(只累计),消耗e:2.直接删除两个序列,消耗值定于两个序列之前删除的元素个数之和,并且使得得到的分有效(之前没有有效分) 分析: 首先,问题其实就是转化成,进行若干次操作1,然后进行操作2 还要找到一个判别标准,来评判较优的状态(贪心) 每次的消耗值比较大,其实可以计算出最大的删除次数,这个值不是很大 状态表示: 简单的,一个状态可以表示为串A的位置.串B

[2016-04-13][codeforces][447][C][DZY Loves Sequences]

时间:2016-04-13 23:39:47 星期三 题目编号:[2016-04-13][codeforces][447][C][DZY Loves Sequences] 题目大意:给定一串数字,问改变其中一个数字之和,最长能得到多长的严格增加的子串 分析: 维护每个数字往左和往右能延续多长(严格减,增),然后枚举每个点, 如果这个点已经在一个严格增加的序列中,那么ans =min(n, max(ans , l[i] + r[i] + 1)) 即左右两边延伸之后,改变后面非递增的一个数字 注意这

[LeetCode] 187. Repeated DNA Sequences 解题思路

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA. Write a function to find all the 10-letter-long seq