DNA sequence open reading frames (ORFs) | ORF prediction for DNA sequences

Common ORF prediction tools

Open Reading Frame Finder - NCBI

ORF Finder - SMS

OrfPredictor - YSU

Basic concepts

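An open reading frame (ORF; also called an open reading framework) is a stretch of sequence that, in a given reading frame, contains no stop codon. It is the part of an organism's genome that could potentially serve as a protein-coding sequence. Within a gene, the ORF lies between the start codon and the stop codon. Because a DNA or RNA sequence can be read in several different ways, many different open reading frames may exist at the same time. Some computer programs can identify the sequences most likely to encode a protein.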

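Key points:

1. A stretch of sequence that contains no stop codon;

2. A part of the genome that could potentially encode a protein;

3. A sequence can be read in several ways, so many different open reading frames may coexist;

4. Some tools use BLAST alignment to increase confidence in the prediction.

Example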

Take the sequence 5'-UCUAAAGGUCCA-3'. On this strand it can be read in three ways:

  1. UCU AAA GGU CCA
  2. CUA AAG GUC
  3. UAA AGG UCC

Because UAA is a stop codon, the third reading cannot be translated into a protein, so only the first two are open reading frames.
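The example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `reading_frames` and `is_open` are made up for illustration, it looks at the three forward frames of an RNA string only, and it treats a frame as "open" simply when none of its complete codons is UAA, UAG, or UGA (the stop codons of the standard genetic code).

```python
# Stop codons of the standard genetic code, written as RNA.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def reading_frames(seq):
    """Split seq into complete codons for each of the three forward frames."""
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time, dropping any
        # incomplete codon left over at the end.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


def is_open(codons):
    """A frame counts as open here if it contains no stop codon."""
    return not any(codon in STOP_CODONS for codon in codons)


seq = "UCUAAAGGUCCA"
for number, codons in enumerate(reading_frames(seq), start=1):
    status = "open" if is_open(codons) else "contains a stop codon"
    print(number, " ".join(codons), "->", status)
```

Real ORF finders also scan the reverse-complement strand and apply start-codon and minimum-length rules, which this sketch leaves out.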

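Personally, I recommend the tool developed by NCBI; results from it tend to carry more credibility in a publication.

Below is the help text for the Linux version of the tool: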

USAGE
  ORFfinder [-h] [-help] [-xmlhelp] [-in Input_File] [-id Accession_GI]
    [-b begin] [-e end] [-c circular] [-g Genetic_code] [-s Start_codon]
    [-ml minimal_length] [-n nested_ORFs] [-strand Strand] [-out Output_File]
    [-outfmt output_format] [-logfile File_Name] [-conffile File_Name]
    [-version] [-version-full] [-dryrun]

DESCRIPTION
   Searching open reading frames in a sequence

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -xmlhelp
   Print USAGE, DESCRIPTION and ARGUMENTS in XML format; ignore all other
   parameters
 -logfile <File_Out>
   File to which the program log should be redirected
 -conffile <File_In>
   Program's configuration (registry) data file
 -version
   Print version number;  ignore other arguments
 -version-full
   Print extended version data;  ignore other arguments
 -dryrun
   Dry run the application: do nothing, only test all preconditions

 *** Input query options (one of them has to be provided):
 -in <File_In>
   name of file with the nucleotide sequence in FASTA format
   (more than one sequence is allowed)
   Default = `'
 -id <String>
   Accession or gi number of the nucleotide sequence
   (ignored, if the file name is provided)
   Default = `'

 *** Query sequence details:
 -b <Integer>
   Start address of sequence fragment to be processed
   Default = `1'
 -e <Integer>
   Stop address of sequence fragment to be processed (0 - to the end of the
   sequence)
   Default = `0'
 -c <Boolean>
   Is the sequence circular? (t/f) *** Under development
   Default = `false'

 *** Search parameters:
 -g <Integer>
   Genetic code to use (1-31)
   see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details
   Default = `1'
 -s <Integer>
   ORF start codon to use:
       0 = "ATG" only
       1 = "ATG" and alternative initiation codons
       2 = any sense codon
   Default = `1'
 -ml <Integer>
   Minimal length of the ORF (nt)
   Value less than 30 is automatically changed by 30.
   Default = `75'
 -n <Boolean>
   Ignore nested ORFs (completely placed within another)
   Default = `false'
 -strand <String>
   Output ORFs on specified strand only (both|plus|minus)
   Default = `both'

 *** Output options:
 -out <File_Out>
   Output file name
 -outfmt <Integer>
   Output options:
       0 = list of ORFs in FASTA format
       1 = CDS in FASTA format
       2 = Text ASN.1
       3 = Feature table
   Default = `0'

  

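Example: search in.fasta with any sense codon allowed as the start, a minimal ORF length of 100 nt, and the result written to test.out as a feature table:

ORFfinder -in in.fasta -s 2 -ml 100 -out test.out -outfmt 3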
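If the tool is driven from a script, the same command line can be assembled in Python. The sketch below is an assumption-laden illustration, not part of ORFfinder: `build_orffinder_cmd` and `run_orffinder` are hypothetical helper names, only flags listed in the help text above are used, and the code assumes the binary is installed under the name `ORFfinder` on the PATH.

```python
import shlex
import shutil
import subprocess


def build_orffinder_cmd(in_fasta, out_file, start_codon=1, min_len=75,
                        outfmt=0, strand="both"):
    """Build an ORFfinder argument list (hypothetical helper).

    Defaults mirror the ones printed in the help text above.
    """
    if start_codon not in (0, 1, 2):
        raise ValueError("-s must be 0, 1 or 2")
    if outfmt not in (0, 1, 2, 3):
        raise ValueError("-outfmt must be 0, 1, 2 or 3")
    if strand not in ("both", "plus", "minus"):
        raise ValueError("-strand must be both, plus or minus")
    return [
        "ORFfinder",
        "-in", in_fasta,
        "-s", str(start_codon),
        "-ml", str(min_len),
        "-strand", strand,
        "-out", out_file,
        "-outfmt", str(outfmt),
    ]


def run_orffinder(cmd):
    """Run the command, assuming the ORFfinder binary is on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("ORFfinder not found on PATH")
    subprocess.run(cmd, check=True)


# Same settings as the example command: any sense codon as the start,
# minimal length 100 nt, feature-table output.
cmd = build_orffinder_cmd("in.fasta", "test.out",
                          start_codon=2, min_len=100, outfmt=3)
print(shlex.join(cmd))
```

Passing the arguments as a list rather than one shell string avoids quoting problems with unusual file names; `run_orffinder(cmd)` would then execute the search.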

  

Original post: https://www.cnblogs.com/leezx/p/8645696.html

Time: 2024-10-14 22:50:13
