Suffix Tree (from geeksforgeeks)

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Preprocess Pattern or Preoprocess Text? 
We have discussed the following algorithms in the previous posts:

KMP Algorithm
Rabin Karp Algorithm
Finite Automata based Algorithm
Boyer Moore Algorithm

All of the above algorithms preprocess the pattern to make the pattern searching faster. The best time complexity that we could get by preprocessing pattern is O(n) where n is length of the text. In this post, we will discuss an approach that preprocesses the text. A suffix tree is built of the text. After preprocessing text (building suffix tree of text), we can search any pattern in O(m) time where m is length of the pattern.
Imagine you have stored complete work of William Shakespeare and preprocessed it. You can search any string in the complete work in time just proportional to length of the pattern. This is really a great improvement because length of pattern is generally much smaller than text.
Preprocessing of text may become costly if the text changes frequently. It is good for fixed text or less frequently changing text though.

A Suffix Tree for a given text is a compressed trie for all suffixes of the given text. We have discussed Standard Trie. Let us understand Compressed Trie with the following array of words.

{bear, bell, bid, bull, buy, sell, stock, stop}

Following is standard trie for the above input set of words.

Following is the compressed trie. Compress Trie is obtained from standard trie by joining chains of single nodes. The nodes of a compressed trie can be stored by storing index ranges at the nodes.

How to build a Suffix Tree for a given text?
As discussed above, Suffix Tree is compressed trie of all suffixes, so following are very abstract steps to build a suffix tree from given text.
1) Generate all suffixes of given text.
2) Consider all suffixes as individual words and build a compressed trie.

Let us consider an example text “banana\0″ where ‘\0′ is string termination character. Following are all suffixes of “banana\0″

banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0

If we consider all of the above suffixes as individual words and build a trie, we get following.

If we join chains of single nodes, we get the following compressed trie, which is the Suffix Tree for given text “banana\0″

Please note that above steps are just to manually create a Suffix Tree. We will be discussing actual algorithm and implementation in a separate post.

How to search a pattern in the built suffix tree?
We have discussed above how to build a Suffix Tree which is needed as a preprocessing step in pattern searching. Following are abstract steps to search a pattern in the built Suffix Tree.
1) Starting from the first character of the pattern and root of Suffix Tree, do following for every character.
…..a) For the current character of pattern, if there is an edge from the current node of suffix tree, follow the edge.
…..b) If there is no edge, print “pattern doesn’t exist in text” and return.
2) If all characters of pattern have been processed, i.e., there is a path from root for characters of the given pattern, then print “Pattern found”.

Let us consider the example pattern as “nan” to see the searching process. Following diagram shows the path followed for searching “nan” or “nana”.

How does this work?
Every pattern that is present in text (or we can say every substring of text) must be a prefix of one of all possible suffixes. The statement seems complicated, but it is a simple statement, we just need to take an example to check validity of it.

Applications of Suffix Tree
Suffix tree can be used for a wide range of problems. Following are some famous problems where Suffix Trees provide optimal time complexity solution.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string

There are many more applications. See thisfor more details.

Ukkonen’s Suffix Tree Construction is discussed in following articles:
Ukkonen’s Suffix Tree Construction – Part 1
Ukkonen’s Suffix Tree Construction – Part 2
Ukkonen’s Suffix Tree Construction – Part 3
Ukkonen’s Suffix Tree Construction – Part 4
Ukkonen’s Suffix Tree Construction – Part 5
Ukkonen’s Suffix Tree Construction – Part 6

References:
http://fbim.fh-regensburg.de/~saj39122/sal/skript/progr/pr45102/Tries.pdf
http://www.cs.ucf.edu/~shzhang/Combio12/lec3.pdf
http://www.allisons.org/ll/AlgDS/Tree/Suffix/

时间: 2024-10-29 00:39:46

Suffix Tree (from geeksforgeeks)的相关文章

[算法系列之二十四]后缀树(Suffix Tree)

之前有篇文章([算法系列之二十]字典树(Trie))我们详细的介绍了字典树.有了这些基础我们就能更好的理解后缀树了. 一 引言 模式匹配问题 给定一个文本text[0-n-1], 和一个模式串 pattern[0-m-1],写一个函数 search(char pattern[], char text[]), 打印出pattern在text中出现的所有位置(n > m). 这个问题已经有两个经典的算法:KMP算法 ,有限自动机,前者是对模式串pattern做预处理,后者是对待查证文本text做预处

编程算法 - 后缀树(Suffix Tree) 代码(C)

后缀树(Suffix Tree) 代码(C) 本文地址: http://blog.csdn.net/caroline_wendy 给你一个长字符串s与很多短字符串集合{T1,, T2, ...}, 设计一个方法在s中查询T1, T2, ..., 要求找出Ti在s中的位置. 代码: /* * main.cpp * * Created on: 2014.7.20 * Author: Spike */ /*eclipse cdt, gcc 4.8.1*/ #include <iostream> #i

Trie / Radix Tree / Suffix Tree

Trie (字典树) "A", "to", "tea", "ted", "ten", "i", "in", "inn" 这些单词组成的字典树. Radix Tree (基数树) 基数树与字典树的区别在于基数树将单词压缩了, 节点变得更少 Suffix Tree (后缀树) 单词 "BANANA" 的后缀树. 每个后缀以 $ 结尾

18.8 suffix tree

给定一个字符串s 和 一个包含较短字符串的数组t,设计一个方法,根据t中的每一个较短的字符换,对s进行搜索. 思路:基于s建立后缀树,然后在后缀树中进行查找,返回所有可能的索引值. import java.util.ArrayList; public class Question { public static void main(String[] args) { String testString = "bibs"; String[] stringList = {"b&qu

Suffix array

A suffix array is a sorted array of all suffixes of a given string. The definition is similar to Suffix Tree which is compressed trie of all suffixes of the given text. Any suffix tree based algorithm can be replaced with an algorithm that uses a suf

SPOJ 375. Query on a tree (树链剖分)

Query on a tree Time Limit: 5000ms Memory Limit: 262144KB This problem will be judged on SPOJ. Original ID: QTREE64-bit integer IO format: %lld      Java class name: Main Prev Submit Status Statistics Discuss Next Font Size: + - Type:   None Graph Th

Codeforces Round #256 (Div. 2) B. Suffix Structures

Bizon the Champion isn't just a bison. He also is a favorite of the "Bizons" team. At a competition the "Bizons" got the following problem: "You are given two distinct words (strings of English letters), s and t. You need to trans

BNU 28887——A Simple Tree Problem——————【将多子树转化成线段树+区间更新】

A Simple Tree Problem Time Limit: 3000ms Memory Limit: 65536KB This problem will be judged on ZJU. Original ID: 368664-bit integer IO format: %lld      Java class name: Main Prev Submit Status Statistics Discuss Next Type: None None Graph Theory 2-SA

Codeforces Round #256 (Div. 2) B. Suffix Structures(模拟)

题目链接:http://codeforces.com/contest/448/problem/B ---------------------------------------------------------------------------------------------------------------------------------------------------------- 欢迎光临天资小屋:http://user.qzone.qq.com/593830943/ma