Suffix array (longest common substring of multiple strings): POJ 3294


Life Forms

Time Limit: 6666MS     Memory Limit: 0KB     64bit IO Format: %lld & %llu


Description

Problem C: Life Forms

You may have wondered why most extraterrestrial life forms resemble humans, differing by superficial traits such as height, colour, wrinkles, ears, eyebrows and the like. A few bear no human resemblance; these typically have geometric or amorphous shapes like cubes, oil slicks or clouds of dust.

The answer is given in the 146th episode of Star Trek - The Next Generation, titled The Chase. It turns out that the vast majority of the quadrant's life forms ended up with a large fragment of common DNA.

Given the DNA sequences of several life forms represented as strings of letters, you are to find the longest substring that is shared by more than half of them.

Standard input contains several test cases. Each test case begins with 1 ≤ n ≤ 100, the number of life forms. n lines follow; each contains a string of lower case letters representing the DNA sequence of a life form. Each DNA sequence contains at least one and not more than 1000 letters. A line containing 0 follows the last test case.

For each test case, output the longest string or strings shared by more than half of the life forms. If there are many, output all of them in alphabetical order. If there is no solution with at least one letter, output "?". Leave an empty line between test cases.

Sample Input

3
abcdefg
bcdefgh
cdefghi
3
xxx
yyy
zzz
0

Output for Sample Input

bcdefg
cdefgh

?

Gordon V. Cormack

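Task: given a number n followed by n strings, find the longest substring common to more than n/2 of the strings (strictly more than half, as the problem statement requires).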

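Approach: this is the standard suffix array technique for the longest common substring of multiple strings. Concatenate the strings with separators, binary search on the answer length to find its upper bound, and for each candidate length group adjacent suffixes by the height array. One detail: all separators in the code have the same value 0, so the common prefix of a group can run across a separator; after a group's common prefix is found, it must be checked for a separator before it is accepted.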
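The grouping step can also be sketched in isolation. What follows is a minimal sketch, not the code from this post: `lifeForms` is a hypothetical helper, the suffix array is built with a plain comparison sort (far too slow for the judge's limits, used only to keep the sketch short), and it assumes every string gets its own distinct separator, so a common prefix can never run across one and the separator check becomes unnecessary.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not the author's code): longest substrings shared by
// more than half of `strs`, in alphabetical order; empty if there is none.
std::vector<std::string> lifeForms(const std::vector<std::string>& strs) {
    assert(!strs.empty());
    const int n = (int)strs.size();
    if (n == 1) return {strs[0]};  // a lone string is trivially shared by more than half

    // Concatenate with a distinct separator per string (values 0..n-1);
    // letters are shifted to n..n+25 so every separator sorts first.
    std::vector<int> text, owner;
    size_t longest = 0;
    for (int i = 0; i < n; i++) {
        for (char c : strs[i]) { text.push_back(c - 'a' + n); owner.push_back(i); }
        text.push_back(i);
        owner.push_back(i);
        longest = std::max(longest, strs[i].size());
    }
    const int len = (int)text.size();

    // Suffix array by comparison sort (assumption: small inputs only).
    std::vector<int> sa(len), rnk(len), height(len, 0);
    for (int i = 0; i < len; i++) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return std::lexicographical_compare(text.begin() + a, text.end(),
                                            text.begin() + b, text.end());
    });
    for (int i = 0; i < len; i++) rnk[sa[i]] = i;

    // Kasai's algorithm: height[i] = LCP of the suffixes ranked i-1 and i.
    for (int i = 0, k = 0; i < len; i++) {
        if (rnk[i] == 0) { k = 0; continue; }
        int j = sa[rnk[i] - 1];
        while (i + k < len && j + k < len && text[i + k] == text[j + k]) k++;
        height[rnk[i]] = k;
        if (k > 0) k--;
    }

    // Collect one prefix per group of adjacent suffixes with height >= mid
    // whose suffixes come from more than n/2 distinct strings.
    auto collect = [&](int mid) {
        std::vector<std::string> found;
        int i = 1;
        while (i < len) {
            if (height[i] < mid) { i++; continue; }
            int start = sa[i - 1];
            std::set<int> who = {owner[start]};
            while (i < len && height[i] >= mid) who.insert(owner[sa[i++]]);
            if ((int)who.size() > n / 2) {
                std::string s;
                for (int k = 0; k < mid; k++) s += char(text[start + k] - n + 'a');
                found.push_back(s);
            }
        }
        return found;
    };

    // Binary search on the length; the upper bound is the longest string.
    std::vector<std::string> best;
    int lo = 1, hi = (int)longest;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        std::vector<std::string> cur = collect(mid);
        if (!cur.empty()) { best = cur; lo = mid + 1; }
        else hi = mid - 1;
    }
    return best;
}
```

For the first sample, `lifeForms({"abcdefg", "bcdefgh", "cdefghi"})` gives `bcdefg` and `cdefgh`; an empty result corresponds to printing "?". Groups are visited in suffix array order, so the results come out already sorted.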

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MS(x, y) memset(x, y, sizeof(x))
const int MAXN = 100000+2000;
const int INF = 1<<30;

int wa[MAXN],wb[MAXN],wv[MAXN],ws[MAXN];
int rank[MAXN],r[MAXN],sa[MAXN],height[MAXN];
char str[1005];
int vis[1005], ID[2005];// for one length, fewer than 2000 substrings can each occur in more than half of the strings
int block[MAXN];

int cmp(int *r, int a, int b, int l)
{
	return r[a] == r[b] && r[a+l] == r[b+l];
}

void da(int *r, int *sa, int n, int m)
{
	int i, j, p, *x = wa, *y = wb, *t;

	for(i=0; i<m; i++) ws[i] = 0;
	for(i=0; i<n; i++) ws[x[i] = r[i]]++;
	for(i=1; i<m; i++) ws[i] += ws[i-1];
	for(i=n-1; i>=0; i--) sa[--ws[x[i]]] = i;

	for(j=1,p=1; p<n; j<<=1, m=p){

		for(p=0,i=n-j; i<n; i++) y[p++] = i;
		for(i=0; i<n; i++) if(sa[i] >= j) y[p++] = sa[i] - j;

		for(i=0; i<n; i++) wv[i] = x[y[i]];
		for(i=0; i<m; i++) ws[i] = 0;
		for(i=0; i<n; i++) ws[wv[i]]++;
		for(i=1; i<m; i++) ws[i] += ws[i-1];
		for(i=n-1; i>=0; i--) sa[--ws[wv[i]]] = y[i];

		for(t=x,x=y,y=t,p=1,x[sa[0]]=0,i=1; i<n; i++)
			x[sa[i]] = cmp(y, sa[i-1], sa[i], j) ? p-1 : p++;

	}
	return;
}

void calheight(int *r, int *sa, int n)
{
	int i, j, k = 0;
	for(i=1; i<n; i++) rank[sa[i]] = i;
	for(i=0; i<n-1; height[rank[i++]] = k)
		for(k ? k-- : 0,j=sa[rank[i]-1]; r[i+k] == r[j+k]; k++);
	return;
}

int main()
{
	//freopen("in.txt", "r", stdin);
	int n;
	scanf("%d", &n);
	while(n)
	{
		int i, j, k;
		MS(rank, 0);
		MS(sa, 0);
		MS(wa, 0);
		MS(wb, 0);
		MS(ws, 0);
		MS(wv, 0);
		MS(r, 0);
		MS(height, 0);
		MS(block, 0);
		MS(ID, 0);

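		int len = 1, tmp_l, maxn = 0;
		int left = 1, right = 0;
		for(i=0; i<n; i++){// join all strings into one string, with a separator after each
			scanf("%s", str);
			tmp_l = strlen(str);
			if(tmp_l > right) right = tmp_l;// right bound of the binary search: the longest string, since the answer only has to occur in more than half of the strings, not in the shortest one
			int k;
			for(j=len, k=0; k<tmp_l; j++, k++){
				block[j] = i;// the character at index j belongs to string i
				r[j] = str[k] - 'a' + 1;
				if(r[j] > maxn) maxn = r[j];
			}
			len += tmp_l;
			r[len++] = 0;// append the smallest value as a separator
		}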

		da(r, sa, len, maxn+1);
		calheight(r, sa, len);

		int beg = 0, end = 0, ok, u = 0, ul = 0, LEN = 0;
		while(left <= right)
		{
			ok = u = 0;
			int mid = left + (right - left)/2;// binary search on the answer length

			for(i=n+1; i<len; i++){
				if(height[i] >= mid){// find where a group starts and ends
					for(k=sa[i]; k < sa[i] + mid; k++)
						if(0 == r[k]) break;// this common prefix contains a separator
					if(k == sa[i] + mid){
						if(!beg) beg = i;
						end = i;
					}
				}
				if((beg && end) && (i == len - 1 || height[i] < mid)){
					int count = 0;
					MS(vis, 0);
					for(j=beg-1; j<=end; j++){// count how many different strings the suffixes in this group come from
						int num = block[sa[j]];
						if(!vis[num]) {
							vis[num] = 1;
							count++;
						}
					}
					if(count > n/2){// a valid answer of length mid
						ID[u++] = sa[j-1];// save its starting index
						LEN = mid;
						ok = 1;
					}
					beg = end = 0;
				}
			}

			if(ok) ul = u;// u is reset to 0 in every round, so keep the count from the last successful round
			if(ok) left = mid + 1;// length mid works, so try a longer one
			else right = mid - 1;
		}

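		if(1 == n) printf("%s\n", str);// with a single life form the whole string is the answer; the height groups cannot report it
		else if(ul){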
			for(i=0; i<ul; i++){
				for(j=ID[i]; j<ID[i] + LEN; j++)
					printf("%c", char(r[j] - 1 +'a'));
				printf("\n");
			}
		}
		else printf("?\n");
		scanf("%d", &n);
		if(n) printf("\n");
	}
}
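To cross-check the suffix array solution on small inputs, a brute-force reference can help. This is a minimal sketch under stated assumptions: `lifeFormsBrute` is a hypothetical name, it enumerates every substring of every string, and it is only practical for short strings.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical brute-force reference: longest substrings contained in more
// than half of `strs`, in alphabetical order; empty if there is none.
std::vector<std::string> lifeFormsBrute(const std::vector<std::string>& strs) {
    assert(!strs.empty());
    const size_t n = strs.size();

    // substring -> number of strings that contain it at least once
    std::map<std::string, size_t> owners;
    for (const std::string& s : strs) {
        std::set<std::string> seen;  // de-duplicate within one string
        for (size_t i = 0; i < s.size(); i++)
            for (size_t l = 1; i + l <= s.size(); l++)
                seen.insert(s.substr(i, l));
        for (const std::string& sub : seen) owners[sub]++;
    }

    size_t bestLen = 0;
    std::vector<std::string> best;
    for (const auto& kv : owners) {         // std::map iterates in sorted key order
        if (kv.second * 2 <= n) continue;   // needs strictly more than half
        if (kv.first.size() > bestLen) { bestLen = kv.first.size(); best.clear(); }
        if (kv.first.size() == bestLen) best.push_back(kv.first);
    }
    return best;
}
```

Feeding the same random short strings to this function and to the program above, then comparing the outputs, is a quick way to catch mistakes in the binary search bounds or in the handling of n = 1.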
Time: 2024-08-08 13:57:56
