Blue Jeans - POJ 3080 (suffix array)

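Problem summary:

Given n DNA sequences, each 60 bases long (A adenine, G guanine, T thymine, C cytosine), find their longest common substring, i.e. the longest contiguous run of bases that occurs in every sequence. This is not the longest common subsequence in the dynamic-programming sense.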

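The problem is solved with a suffix array: concatenate all sequences with a distinct separator after each one, build the suffix array and the height (LCP) array, then binary-search the answer length. A length mid is feasible when some run of consecutive suffixes with height >= mid contains suffixes from every sequence.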
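As a rough illustration of the concatenated input that such a solution works on, the sketch below builds the integer text and an owner array for a handful of sequences. The helper name `build_text`, the constant `FIRST_SEPARATOR`, and the choice of -1 as the owner of a separator are assumptions made for this example only; the full listing stores the separator value itself in `loc` instead.

```c
#include <string.h>

enum { FIRST_SEPARATOR = 29 };  /* same starting value as sp in the full listing */

/* Hypothetical helper: concatenates m sequences into text[], putting a
 * distinct separator after each one and a final 0 sentinel.
 * owner[i] is the index of the sequence that position i came from,
 * or -1 for a separator (an assumption of this sketch).
 * Returns the number of entries written before the sentinel. */
int build_text(const char *seqs[], int m, int *text, int *owner) {
    int n = 0;
    int sep = FIRST_SEPARATOR;
    for (int s = 0; s < m; s++) {
        for (const char *p = seqs[s]; *p != '\0'; p++) {
            owner[n] = s;
            text[n++] = *p - 'A' + 1;   /* A=1, C=3, G=7, T=20 keeps alphabetical order */
        }
        owner[n] = -1;
        text[n++] = sep++;              /* unique and larger than any base code */
    }
    text[n] = 0;                        /* sentinel, smaller than everything else */
    return n;
}
```

Because every separator value is used only once, no two suffixes can share a prefix that runs across a sequence boundary, which is what lets the height array be read as "common substring length" directly.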

#include<stdio.h>
#include<string.h>
char str[6200],res[6200];
/* num: encoded concatenated text; loc: which sequence each position belongs to */
int num[6200],loc[6200];
int sa[6200],rank[6200],height[6200];
int wa[6200],wb[6200],wv[6200],wd[6200];
int vis[6200];
int seq_num;
/* Two suffixes agree on their first 2*l symbols iff both halves have equal rank. */
int cmp(int *r,int a,int b,int l){
    return r[a]==r[b]&&r[a+l]==r[b+l];
}
/* Prefix-doubling suffix array. r[0..n-1] holds values < m, and r[n-1] must be
   a unique smallest sentinel (0 here). */
void DA(int *r,int n,int m){
    int i,j,p,*x=wa,*y=wb,*t;
    /* counting sort on the first symbol */
    for(i=0;i<m;i++)wd[i]=0;
    for(i=0;i<n;i++)wd[x[i]=r[i]]++;
    for(i=1;i<m;i++)wd[i]+=wd[i-1];
    for(i=n-1;i>=0;i--) sa[--wd[x[i]]]=i;
    for(j=1,p=1;p<n;j*=2,m=p){
        /* y: suffixes ordered by their second key (rank of position i+j) */
        for(p=0,i=n-j;i<n;i++) y[p++]=i;
        for(i=0;i<n;i++) if(sa[i]>=j) y[p++] = sa[i] -j;
        /* stable counting sort by the first key */
        for(i=0;i<n;i++)wv[i]=x[y[i]];
        for(i=0;i<m;i++) wd[i]=0;
        for(i=0;i<n;i++)wd[wv[i]]++;
        for(i=1;i<m;i++)wd[i]+=wd[i-1];
        for(i=n-1;i>=0;i--) sa[--wd[wv[i]]]=y[i];
        /* recompute ranks for prefixes of length 2*j */
        for(t=x,x=y,y=t,p=1,x[sa[0]]=0,i=1;i<n;i++){
            x[sa[i]]=cmp(y,sa[i-1],sa[i],j)?p-1:p++;
        }
    }
}
/* height[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i] */
void calHeight(int *r,int n){
    int i,j,k=0;
    for(i=1;i<=n;i++)rank[sa[i]]=i;
    for(i=0;i<n;height[rank[i++]]=k){
        for(k?k--:0,j=sa[rank[i]-1];r[i+k]==r[j+k];k++);
    }
}
/* Returns 1 if some substring of length mid occurs in all seq_num sequences.
   Groups are scanned in suffix order, so the first hit is the alphabetically
   smallest such substring; it is stored in res. */
int check(int mid,int len){
    int i,j,tot;
    tot=0;
    memset(vis,0,sizeof(vis));
    for(i=2;i<=len;i++){
        if(height[i]<mid){
            /* a new group starts here: forget which sequences were seen */
            memset(vis,0,sizeof(vis));
            tot=0;
        }else{
            if(!vis[loc[sa[i-1]]]){
                vis[loc[sa[i-1]]]=1;
                tot++;
            }
            if(!vis[loc[sa[i]]]){
                vis[loc[sa[i]]]=1;
                tot++;
            }
            if(tot==seq_num){
                for(j=0;j<mid;j++){
                    res[j]=num[sa[i]+j]+'A'-1;
                }res[mid]='\0';
                return 1;
            }
        }
    }
    return 0;
}
int main() {
    int case_num,n,sp,ans;
    scanf("%d",&case_num);
    for(int i=0;i<case_num;i++){
        scanf("%d",&seq_num);
        n=0;
        sp=29;      /* separators start above 'Z'-'A'+1 */
        ans=0;
        for(int j=0;j<seq_num;j++){
            scanf("%s",str);
            for(int k=0;k<60;k++){
                loc[n]=j;
                num[n++]=str[k]-'A'+1;
            }
            /* a distinct separator after every sequence */
            loc[n]=sp;
            num[n++]=sp++;
        }
        num[n]=0;   /* sentinel */
        DA(num,n+1,sp);
        calHeight(num,n);
        int left=0,right=60,mid;

        /* binary search on the length of the common substring */
        while(right>=left){
            mid=(right+left)/2;
            int tt=check(mid,n);
            if(tt){
                left=mid+1;
                ans=mid;
            }else{
                right=mid-1;
            }
        }
        if(ans>=3){
            printf("%s\n",res);
        }else{
            printf("no significant commonalities\n");
        }
    }
    return 0;
}
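With at most 10 sequences of 60 bases, a brute-force search is also fast enough and can serve as a cross-check for the suffix-array solution. The sketch below is not the method used above: it enumerates the substrings of the first sequence from longest to shortest and tests each one against the other sequences with `strstr`. The function name `common_substring_brute` and the 64-byte candidate buffer (which assumes sequences shorter than 64 characters) are choices made for this example.

```c
#include <string.h>

/* Hypothetical cross-check: finds the longest substring (length >= 3) that
 * occurs in all m sequences, picking the alphabetically first among ties.
 * Writes it to out and returns its length, or returns 0 with out empty.
 * Assumes every sequence is shorter than 64 characters. */
int common_substring_brute(const char *seqs[], int m, char *out) {
    int n = (int)strlen(seqs[0]);
    char cand[64];
    for (int len = n; len >= 3; len--) {
        int found = 0;
        for (int start = 0; start + len <= n; start++) {
            /* candidate = seqs[0][start .. start+len-1] */
            memcpy(cand, seqs[0] + start, (size_t)len);
            cand[len] = '\0';
            int ok = 1;
            for (int s = 1; s < m && ok; s++)
                if (strstr(seqs[s], cand) == NULL) ok = 0;
            /* keep the smallest candidate of this length */
            if (ok && (!found || strcmp(cand, out) < 0)) {
                strcpy(out, cand);
                found = 1;
            }
        }
        if (found) return len;   /* longer lengths were already ruled out */
    }
    out[0] = '\0';
    return 0;
}
```

A return value of 0 corresponds to the "no significant commonalities" case. Comparing this function's output with the suffix-array program on random 60-base inputs is one way to gain confidence in the tie-breaking order.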

Appendix: original problem statement

Time Limit: 1000MS   Memory Limit: 65536K
Total Submissions: 14020   Accepted: 6227

Description

The Genographic Project is a research partnership between IBM and The National Geographic Society that is analyzing DNA from hundreds of thousands of contributors to map how the Earth was populated.

As an IBM researcher, you have been tasked with writing a program that will find commonalities amongst given snippets of DNA that can be correlated with individual survey information to identify new genetic markers.

A DNA base sequence is noted by listing the nitrogen bases in the order in which they are found in the molecule. There are four bases: adenine (A), thymine (T), guanine (G), and cytosine (C). A 6-base DNA sequence could be represented as TAGACC.

Given a set of DNA base sequences, determine the longest series of bases that occurs in all of the sequences.

Input

Input to this problem will begin with a line containing a single integer n indicating the number of datasets. Each dataset consists of the following components:

  • A single positive integer m (2 <= m <= 10) indicating the number of base sequences in this dataset.
  • m lines each containing a single base sequence consisting of 60 bases.

Output

For each dataset in the input, output the longest base subsequence common to all of the given base sequences. If the longest common subsequence is less than three bases in length, display the string "no significant commonalities" instead. If multiple subsequences of the same longest length exist, output only the subsequence that comes first in alphabetical order.

Sample Input

3
2
GATACCAGATACCAGATACCAGATACCAGATACCAGATACCAGATACCAGATACCAGATA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3
GATACCAGATACCAGATACCAGATACCAGATACCAGATACCAGATACCAGATACCAGATA
GATACTAGATACTAGATACTAGATACTAAAGGAAAGGGAAAAGGGGAAAAAGGGGGAAAA
GATACCAGATACCAGATACCAGATACCAAAGGAAAGGGAAAAGGGGAAAAAGGGGGAAAA
3
CATCATCATCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
ACATCATCATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AACATCATCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

Sample Output

no significant commonalities
AGATAC
CATCATCAT

Time: 2024-11-08 03:22:22
