Suffix trees are a data structure that makes many important string operations particularly fast, such as searching for a pattern or finding the longest common substring.
Introduction and definitions
I already introduced the Z-Algorithm, which speeds up the search for a pattern by preprocessing the pattern. It is very useful if you have to search for one single pattern in a large number of words. But often you’ll try to find many patterns in a single text, so preprocessing each pattern is inefficient. Suffix trees preprocess the text instead, to speed up the search for any pattern.
As expected, a suffix tree of the word $S$ (length $n = |S|$) is stored as a rooted tree. Every path from the root to a leaf represents a suffix of $S\$$, where the sentinel $\$ \notin \Sigma$ does not occur in the alphabet $\Sigma$ of $S$; edge labels are drawn from the union $\Sigma \cup \{\$\}$. Every inner node (except the root) has at least two children. Every edge is labeled with a non-empty string over $\Sigma \cup \{\$\}$, the labels of the edges leaving a single node start with different symbols, and each leaf is indexed with the start position $i$ of its suffix. The concatenation of all edge labels on the path from the root to the leaf with index $i$ spells the suffix $S[i..n]\$$.
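Definition 1: For each node $v$:
- The path from the root to $v$ is called $p$
- The concatenation of the labels of all edges on $p$ is $\alpha$
- $\alpha$ is the label of path $p$ ($\alpha = label(p)$)
- $\alpha$ is the path label to $v$ ($\alpha = pathlabel(v)$)
- instead of $v$ we can call this node $\overline{\alpha}$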
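Definition 2: A pattern $\alpha \in \Sigma^*$ exists in the suffix tree of $S$ (further called $ST(S)$) if and only if there is a $\beta \in (\Sigma \cup \{\$\})^*$, so that $ST(S)$ contains a node $\overline{\alpha\beta}$, that is, a node with path label $\alpha\beta$ ($\beta$ may be the empty word $\epsilon$).
Definition 3: A substring $\alpha$ ends in the node $\overline{\alpha}$ or in an edge to a node $\overline{\alpha\beta}$ with $\beta \neq \epsilon$.
Definition 4: An edge to a leaf is called a leaf-edge.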
Figure 1 visualizes the suffix tree of an example word. The tree contains all suffixes of that word extended with the sentinel $\$$.
Building suffix trees: Write only top down
The write-only, top-down (WOTD) algorithm constructs the suffix tree in a top-down fashion.
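Let $\overline{u}$ be a node in the suffix tree $ST(S)$; then $u$ denotes the concatenation of all edge labels on the path to $\overline{u}$ ($u = pathlabel(\overline{u})$). Each node $\overline{u}$ represents the set of all suffixes of $S\$$ that have the prefix $u$. So the leaves below $\overline{u}$ can be described by the set of remaining suffixes $R(\overline{u}) = \{ s \mid us \text{ is a suffix of } S\$ \}$ (what is left of all suffixes that start with $u$ once $u$ is cut off).
This set is split into equivalence classes, one for each symbol $c$: $G(\overline{u}, c) = \{ cw \mid cw \in R(\overline{u}) \}$ is the $c$-group of $R(\overline{u})$.
Case 1: For groups that contain only one suffix $cw$ we create a leaf with the index $n + 2 - |ucw|$ (the start position of the suffix $ucw$ in $S\$$) and connect it to $\overline{u}$ with an edge labeled $cw$.
Case 2: In groups with a size of at least two we compute their longest common prefix $\mathit{pre}$, which starts with $c$, and create a node $\overline{u\,\mathit{pre}}$. The edge connecting $\overline{u}$ and $\overline{u\,\mathit{pre}}$ gets the label $\mathit{pre}$, and we continue recursively with this algorithm in node $\overline{u\,\mathit{pre}}$ with $R(\overline{u\,\mathit{pre}}) = \{ w \mid \mathit{pre}\,w \in G(\overline{u}, c) \}$.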
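The two cases can be sketched in a few lines of Python. This is only a minimal illustration and not the Java implementation mentioned at the end of this post: the function name `build_suffix_tree` and the nested-dict representation are assumptions made for the example, and the sketch assumes that `$` does not occur in the word.

```python
def build_suffix_tree(s):
    """WOTD sketch: suffix tree of s + "$" as nested dicts.

    Inner node: dict mapping the first symbol c of an outgoing edge
    to a pair (edge_label, child). Leaf: int, the 1-based start
    position of its suffix. Assumes "$" does not occur in s.
    """
    text = s + "$"

    def expand(starts, depth):
        # starts: 0-based start positions of all suffixes whose first
        # `depth` symbols spell the path label u of the current node.
        groups = {}
        for i in starts:
            groups.setdefault(text[i + depth], []).append(i)  # c-groups
        node = {}
        for c, group in groups.items():
            first = group[0] + depth
            if len(group) == 1:
                # Case 1: a single suffix remains, so it becomes a leaf.
                node[c] = (text[first:], group[0] + 1)
            else:
                # Case 2: extend the common prefix symbol by symbol.
                # The unique sentinel guarantees the loop stops in range.
                lcp = 1
                while len({text[i + depth + lcp] for i in group}) == 1:
                    lcp += 1
                node[c] = (text[first:first + lcp], expand(group, depth + lcp))
        return node

    return expand(range(len(text)), 0)


tree = build_suffix_tree("abab")
# The edge starting with "a" is labeled "ab" and leads to an inner node
# whose two leaf-edges end in the leaves 1 and 3.
```

The sketch compares symbols one at a time and keeps whole edge labels as strings, so it favors readability over the memory tricks a real implementation would use.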
Exact pattern matching
Every path that starts at the root of the suffix tree is labeled with a prefix of a path label. That is, it is labeled with a prefix of a suffix of the string $S$ or, in other words, with a substring of $S$. To search for a pattern $P$ in $S$, just walk through the suffix tree $ST(S)$, following the path labeled by the characters of $P$. At any node $\overline{\alpha}$ (the node with path label $\alpha$) where $\alpha$ is a prefix of $P$, find the edge $e$ whose label $\beta$ starts with the symbol $P[|\alpha| + 1]$. If such an edge doesn’t exist, $P$ isn’t a substring of $S$. Otherwise try to match the rest of the pattern against $\beta$ on the way to node $\overline{\alpha\beta}$. If $\alpha\beta$ is not a prefix of $P$, you’ll either hit a mismatch, meaning that $P$ isn’t a substring of $S$, or you run out of characters of $P$ and have found it in the tree. If $\alpha\beta$ is a prefix of $P$, continue searching at node $\overline{\alpha\beta}$. If you were able to find $P$, then $S$ contains $P$ at every position given by the indexes of the leaves below your point of discovery.
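The search described above might look like this in Python. The names `build`, `leaf_indexes` and `find` are hypothetical, and `build` is just a compact WOTD-style helper so that the example runs on its own; it assumes that `$` does not occur in the text.

```python
def build(s):
    """Compact WOTD-style builder (helper for this sketch only).

    Inner node: dict first symbol -> (edge_label, child).
    Leaf: int, 1-based start position of the suffix.
    """
    text = s + "$"

    def expand(starts, depth):
        groups = {}
        for i in starts:
            groups.setdefault(text[i + depth], []).append(i)
        node = {}
        for c, group in groups.items():
            first = group[0] + depth
            if len(group) == 1:
                node[c] = (text[first:], group[0] + 1)
            else:
                lcp = 1
                while len({text[i + depth + lcp] for i in group}) == 1:
                    lcp += 1
                node[c] = (text[first:first + lcp], expand(group, depth + lcp))
        return node

    return expand(range(len(text)), 0)


def leaf_indexes(node):
    """Collect the indexes of all leaves at or below node."""
    if isinstance(node, int):
        return [node]
    return [i for _, child in node.values() for i in leaf_indexes(child)]


def find(tree, pattern):
    """Return the sorted 1-based positions of pattern, or [] if absent."""
    node, rest = tree, pattern
    while rest:
        # No edge starts with the next pattern symbol: not a substring.
        if isinstance(node, int) or rest[0] not in node:
            return []
        label, node = node[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []  # mismatch inside the edge label
        rest = rest[k:]  # empty once the pattern is used up
    return sorted(leaf_indexes(node))


tree = build("abab")
print(find(tree, "ab"))  # [1, 3]
print(find(tree, "bb"))  # []
```

Once the tree exists, each search only touches as many symbols as the pattern has, plus the leaves that get reported.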
Minimal unique substrings
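$\alpha$ is a minimal unique substring if and only if $S$ contains $\alpha$ exactly once and every proper prefix of $\alpha$ can be found at least two times in $S$.
To find such minimal unique substrings, walk through the tree to the nodes $\overline{\alpha}$ (nodes with path label $\alpha$) that have a leaf-edge $f$. A minimal unique substring is $\alpha c$, where $c \neq \$$ is the first character of $label(f)$: its prefix $\alpha$ isn’t unique ($\overline{\alpha}$ has at least two leaving edges), and every extended version $\alpha c \beta$ has the prefix $\alpha c$, which is already unique.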
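As a rough sketch of this walk in Python: for every inner node, take each leaf-edge and append its first character to the node’s path label. The names `build` and `minimal_unique_substrings` are hypothetical, `build` is a compact WOTD-style helper so the example is self-contained, and leaf-edges that start with the sentinel are skipped because `$` is not part of the word.

```python
def build(s):
    """Compact WOTD-style builder (helper for this sketch only).

    Inner node: dict first symbol -> (edge_label, child).
    Leaf: int, 1-based start position of the suffix.
    """
    text = s + "$"

    def expand(starts, depth):
        groups = {}
        for i in starts:
            groups.setdefault(text[i + depth], []).append(i)
        node = {}
        for c, group in groups.items():
            first = group[0] + depth
            if len(group) == 1:
                node[c] = (text[first:], group[0] + 1)
            else:
                lcp = 1
                while len({text[i + depth + lcp] for i in group}) == 1:
                    lcp += 1
                node[c] = (text[first:first + lcp], expand(group, depth + lcp))
        return node

    return expand(range(len(text)), 0)


def minimal_unique_substrings(s):
    """Path label of each inner node plus the first symbol of a leaf-edge."""
    found = []

    def visit(node, path):
        for c, (label, child) in node.items():
            if isinstance(child, int):
                if c != "$":  # the sentinel is not part of s
                    found.append(path + c)
            else:
                visit(child, path + label)

    visit(build(s), "")
    return sorted(found)


print(minimal_unique_substrings("abab"))  # ['aba', 'ba']
```

In `abab` the substring `ab` occurs twice, but `aba` only once, so `aba` is reported; likewise `b` occurs twice and `ba` once.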
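Maximal repeats
A maximal pair is a tuple $(p_1, p_2, l)$ with $p_1 \neq p_2$, so that $S[p_1..p_1+l-1] = S[p_2..p_2+l-1]$, but $S[p_1-1] \neq S[p_2-1]$ and $S[p_1+l] \neq S[p_2+l]$. A maximal repeat is the string represented by such a tuple. If $\alpha$ is a maximal repeat, there is a node $\overline{\alpha}$ (a node with path label $\alpha$) in the suffix tree $ST(S)$. To find the maximal repeats, do a DFS on the tree. Label each leaf with the left character of the suffix that it represents, that is, the character directly in front of the suffix in $S$. For each internal node:
- If at least one child is labeled with *, then label it with *.
- Else if its children’s labels are diverse, label it with *.
- Else all children have the same label; copy it to the current node.
Nodes labeled with * are called left-diverse. Path labels to left-diverse nodes are maximal repeats.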
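The labeling rules can be sketched as a recursive DFS in Python. Again this is only an illustration: `build` is a compact WOTD-style helper, `maximal_repeats` and the marker object `DIVERSE` (standing in for the * label) are hypothetical names, and the suffix starting at position 1 gets an empty left character so that it differs from every real character.

```python
DIVERSE = object()  # plays the role of the "*" label


def build(s):
    """Compact WOTD-style builder (helper for this sketch only).

    Inner node: dict first symbol -> (edge_label, child).
    Leaf: int, 1-based start position of the suffix.
    """
    text = s + "$"

    def expand(starts, depth):
        groups = {}
        for i in starts:
            groups.setdefault(text[i + depth], []).append(i)
        node = {}
        for c, group in groups.items():
            first = group[0] + depth
            if len(group) == 1:
                node[c] = (text[first:], group[0] + 1)
            else:
                lcp = 1
                while len({text[i + depth + lcp] for i in group}) == 1:
                    lcp += 1
                node[c] = (text[first:first + lcp], expand(group, depth + lcp))
        return node

    return expand(range(len(text)), 0)


def maximal_repeats(s):
    """Path labels of all left-diverse inner nodes (root excluded)."""
    found = []

    def visit(node, path):
        if isinstance(node, int):
            # Left character of the suffix starting at 1-based position
            # `node`; the first suffix has none, so use "".
            return s[node - 2] if node > 1 else ""
        marks = {visit(child, path + label) for label, child in node.values()}
        if DIVERSE in marks or len(marks) > 1:
            if path:
                found.append(path)
            return DIVERSE
        return marks.pop()  # all children agree, copy their label

    visit(build(s), "")
    return sorted(found)


print(maximal_repeats("xabcyabcz"))  # ['abc']
```

In `xabcyabcz` the node for `bc` is not left-diverse, because both occurrences are preceded by `a`; only `abc` is preceded by two different characters.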
Generalized suffix trees
Generalized suffix trees are an extension of suffix trees. With them you can represent multiple words in one single tree. Of course you have to modify the tree, so that you know which leaf index corresponds to which word. Just a little bit more to store in the leaves ;) A generalized suffix tree is shown in figure 2 on page one.
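One way this might look in Python, as a sketch under a few assumptions: every word gets the same sentinel `$`, each leaf stores a list of (word number, start position) pairs so that identical suffixes of different words share a leaf, and the names `build_generalized` and `longest_common_substring` are made up for this example. The second function uses the tree for the longest common substring problem mentioned at the top of this post.

```python
def build_generalized(words):
    """Generalized suffix tree as nested dicts.

    Inner node: dict first symbol -> (edge_label, child).
    Leaf: list of (word_number, start_position) pairs, with 0-based
    word numbers and 1-based positions. Assumes no word contains "$".
    """
    texts = [w + "$" for w in words]

    def expand(sufs, depth):
        groups = {}
        for k, i in sufs:
            groups.setdefault(texts[k][i + depth], []).append((k, i))
        node = {}
        for c, group in groups.items():
            rests = {texts[k][i + depth:] for k, i in group}
            if len(rests) == 1:
                # One distinct suffix left, possibly shared by several words.
                node[c] = (rests.pop(), [(k, i + 1) for k, i in group])
            else:
                lcp = 1
                while len({r[lcp] for r in rests}) == 1:
                    lcp += 1
                pre = next(iter(rests))[:lcp]
                node[c] = (pre, expand(group, depth + lcp))
        return node

    all_suffixes = [(k, i) for k, t in enumerate(texts) for i in range(len(t))]
    return expand(all_suffixes, 0)


def longest_common_substring(a, b):
    """Deepest point of the tree that has leaves of both words below it."""
    best = ""

    def visit(node, path):
        nonlocal best
        if isinstance(node, list):
            words = {k for k, _ in node}
            path = path[:-1]  # drop the sentinel at the end of a leaf path
        else:
            words = set().union(
                *(visit(child, path + label) for label, child in node.values())
            )
        if len(words) == 2 and len(path) > len(best):
            best = path
        return words

    visit(build_generalized([a, b]), "")
    return best


print(longest_common_substring("xabcy", "zabcw"))  # abc
```

If several common substrings share the maximal length, this sketch simply returns the first one it reaches.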
There are a lot of other applications for a suffix tree structure, for example finding palindromes, searching for regular expressions, computing the Levenshtein distance faster, data compression and so on…
I’ve implemented a suffix tree in Java. The tree is constructed via WOTD and finds maximal repeats and minimal unique substrings. I also wanted pictures for this post, so I added a function that prints GraphViz code representing the tree.