DNA sequence (Mapping + BFS)

Problem Description

The twenty-first century is the century of biotechnology. We know that a gene is made of DNA. The nucleotide bases from which DNA is built are A (adenine), C (cytosine), G (guanine), and T (thymine). Finding the longest common subsequence between DNA/protein sequences is one of the basic problems in modern computational molecular biology. But this problem is a little different. Given several DNA sequences, you are asked to make a shortest sequence from them so that each of the given sequences is a subsequence of it.

For example, given "ACGT", "ATGC", "CGTT" and "CAGT", you can make a sequence in the following way. It is the shortest but may not be the only one.

Input

The first line is the number of test cases t. Then t test cases follow. In each case, the first line is an integer n ( 1<=n<=8 ), the number of DNA sequences. The following n lines contain the n sequences, one per line. The length of each sequence is between 1 and 5.

Output

For each test case, print a line containing the length of the shortest sequence that can be made from these sequences.

Sample Input

1
4
ACGT
ATGC
CGTT
CAGT

Sample Output

8

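The task: given several DNA sequences, find a sequence such that every given sequence is a subsequence of it (not necessarily contiguous). A naive search gets MLE, TLE, or RE, so searching directly is out. The usual trick for sequence problems like this is to map the sequences to integers or to something else that is easy to handle. The problem also says that no DNA sequence is longer than 5, so we can treat each sequence as one digit and pack all of them into a single integer. And since we only have to output the length of the shortest sequence, we don't need to map characters at all; mapping the matched lengths is enough. There are at most 8 sequences, each of length 1 to 5, so the largest value is bounded by 6^8. Why 6^8 and not 5^8? I won't explain that just yet; it's in the code comments. Code: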
#include <cstring>
#include <iostream>
#include <map>
#include <queue>
#include <string>

using namespace std;

int t, n;
map<int, int> vis;    // visited states
char s[10][10];       // the sequences, stored 1-based
int len[10];          // length of each sequence
int p[10] = {1, 6, 36, 216, 1296, 7776, 46656, 279936, 1679616, 10077696};    // powers of 6
char temp[4] = {'A', 'C', 'G', 'T'};

struct node {
    int step;    // length of the sequence built so far
    int st;      // the encoded state (the mapped integer)
    node() {}
    node(int _step, int _st) : step(_step), st(_st) {}
};

int bfs(int res) {
    vis.clear();
    queue<node> q;
    q.push(node(0, 0));
    vis[0] = 1;
    while (!q.empty()) {
        node nxt, k = q.front();
        q.pop();
        if (k.st == res) {    // state equals the target: return the length
            return k.step;
        }
        for (int i = 0; i < 4; i++) {    // try appending each of the four bases
            nxt.st = 0;
            nxt.step = k.step + 1;
            int tp = k.st;
            for (int j = 1; j <= n; j++) {
                int x = tp % 6;    // digit j: characters of sequence j matched so far
                tp /= 6;
                if (x == len[j] || s[j][x + 1] != temp[i]) {    // already finished, or next character does not match
                    nxt.st += x * p[j - 1];
                }
                else {
                    nxt.st += (x + 1) * p[j - 1];
                }
            }
            if (vis[nxt.st] == 0) {    // skip states that were already searched
                q.push(nxt);
                vis[nxt.st] = 1;
            }
        }
    }
    return -1;    // not reached for valid input; keeps the function from falling off the end
}

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(0);
    cin >> t;
    while (t--) {
        cin >> n;
        int res = 0;
        for (int i = 1; i <= n; i++) {    // arrays count from 0, but the mapping and everything after it work on positions, so start at 1
            string str;
            cin >> str;
            strcpy(s[i] + 1, str.c_str());    // likewise, store the characters starting at index 1
            len[i] = strlen(s[i] + 1);
            res += len[i] * p[i - 1];    // this is why it is 6^8: a digit holds a position 1..5, plus 0 for "nothing matched yet"
        }
        cout << bfs(res) << endl;
    }
    return 0;
}

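So if you insist on working from position 0 and using 5^8, that isn't wrong and it can be made to work, but the bookkeeping gets much messier. It is easier to allow one extra value per digit for convenience.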
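
The base-6 packing used by the code can be sketched in isolation. This is a minimal illustration, not part of the original solution: `encodeState` and `decodeState` are hypothetical helper names, and the sketch assumes each digit stores how many characters of one sequence have been matched (0 to 5, hence six values per digit).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: pack per-sequence matched counts into one integer.
// matched[j] is the number of characters of sequence j covered so far (0..5),
// and sequence j occupies the base-6 digit with weight 6^j.
int encodeState(const std::vector<int>& matched) {
    int code = 0;
    int weight = 1;
    for (int m : matched) {
        assert(m >= 0 && m <= 5);  // a digit has exactly six possible values
        code += m * weight;
        weight *= 6;
    }
    return code;
}

// Hypothetical helper: unpack the first n base-6 digits again.
std::vector<int> decodeState(int code, int n) {
    std::vector<int> matched(n);
    for (int j = 0; j < n; ++j) {
        matched[j] = code % 6;
        code /= 6;
    }
    return matched;
}
```

For the sample input, all four sequences have length 4, so the target state is `encodeState({4, 4, 4, 4})`, which is 4 + 24 + 144 + 864 = 1036. With eight sequences of length 5 the largest code is 6^8 - 1 = 1679615, which is why a table of powers of 6 is enough.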



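The hard part of this problem is knowing where to start. Even if you know that some transformation is needed, it is not obvious how to transform the input or how to search. Here we avoid searching over characters and search directly over matched lengths instead.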
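
To make "searching over lengths" concrete, here is a sketch of a single transition, written with plain vectors instead of the packed integer so the effect is easier to see. The function name `advance` and the 0-based indexing are assumptions made for this illustration; the original code does the same thing on base-6 digits with 1-based strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append character c to the sequence being built and
// return the new matched counts. Every input sequence whose next unmatched
// character is c moves forward by one; all others stay where they are.
std::vector<int> advance(const std::vector<std::string>& seqs,
                         const std::vector<int>& matched,
                         char c) {
    assert(seqs.size() == matched.size());
    std::vector<int> next = matched;
    for (std::size_t j = 0; j < seqs.size(); ++j) {
        bool finished = next[j] == static_cast<int>(seqs[j].size());
        if (!finished && seqs[j][next[j]] == c) {
            ++next[j];
        }
    }
    return next;
}
```

With the sample sequences "ACGT", "ATGC", "CGTT", "CAGT" and the start state {0, 0, 0, 0}, appending 'A' gives {1, 1, 0, 0}, and appending 'C' after that gives {2, 1, 1, 1}. Feeding the eight characters of "ACAGTGCT" one by one ends at {4, 4, 4, 4}, so that string is one valid answer of length 8. BFS over these states finds the smallest number of steps that reaches the all-finished state.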

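It is worth mentioning that when I asked my teammates, they said this problem can be solved in many ways: with IDA* or another heuristic search, or even without searching at all, using an AC automaton plus matrices. Those approaches all search over characters, though. It's not that one is better than the other; the way of thinking is simply different. Many problems have more than one solution, and thinking through several of them is really useful. As for the other approaches, I couldn't be bothered to write them (actually, I don't know how, lol).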
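
For comparison, the IDA* idea mentioned above could look roughly like the sketch below. It is not the teammates' code and not taken from the original post; the function names are hypothetical, and the heuristic (the largest number of unmatched characters in any one sequence) is an assumption chosen because it never overestimates the remaining work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical heuristic: the longest unmatched suffix is a lower bound
// on how many more characters must still be appended.
static int remainingLowerBound(const std::vector<std::string>& seqs,
                               const std::vector<int>& pos) {
    int h = 0;
    for (std::size_t j = 0; j < seqs.size(); ++j)
        h = std::max(h, static_cast<int>(seqs[j].size()) - pos[j]);
    return h;
}

// Depth-limited DFS: true if all sequences can be finished within `limit`.
static bool dfs(const std::vector<std::string>& seqs,
                const std::vector<int>& pos, int depth, int limit) {
    int h = remainingLowerBound(seqs, pos);
    if (h == 0) return true;                // everything matched
    if (depth + h > limit) return false;    // cannot finish in time: prune
    const char bases[4] = {'A', 'C', 'G', 'T'};
    for (char c : bases) {
        std::vector<int> next = pos;
        bool moved = false;
        for (std::size_t j = 0; j < seqs.size(); ++j) {
            if (next[j] < static_cast<int>(seqs[j].size()) && seqs[j][next[j]] == c) {
                ++next[j];
                moved = true;
            }
        }
        // Appending a character that advances nothing is never useful.
        if (moved && dfs(seqs, next, depth + 1, limit)) return true;
    }
    return false;
}

// Iterative deepening: raise the limit until the DFS succeeds.
int shortestSupersequenceLength(const std::vector<std::string>& seqs) {
    assert(!seqs.empty());
    std::vector<int> start(seqs.size(), 0);
    for (int limit = remainingLowerBound(seqs, start);; ++limit) {
        if (dfs(seqs, start, 0, limit)) return limit;
    }
}
```

Calling `shortestSupersequenceLength({"ACGT", "ATGC", "CGTT", "CAGT"})` returns 8, matching the sample output. Unlike the BFS version, this keeps no visited table, so it trades repeated work for very low memory use.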

Original article: https://www.cnblogs.com/xenny/p/9388400.html

Time: 2024-11-09 02:14:26
