I’m actually learning some stuff related to algorithms on sequences. The naive search for a pattern in a long string is of course very slow and comes with a lot of unintelligent compares. The Z-Algorithm improves the searching by preprocessing the pattern.

## Naive searching

A simple search algorithm written in java may look like

This code reliably finds any existence of needle in haystack in , with length of needle and length of haystack. That screams for improvements ;)

## Definitions

The first algorithm that I want to present in this series is called Z-Algorithm. First of all we need some definitions.

**Definition 1**:
In the following we denote as the substring of beginning at position and ending at position . We can also leave one of the limits clear, so that is the substring and means .

**Definition 2**:
So is the length of the longest prefix of the suffix that is also prefix of itself. To abbreviate is further on mentioned as .

**Definition 3**:
The set for a is called *Z-Box* at position .

**Definition 4**:
is the set of limits of all Z-Box’es that start at the left-handed side of .
Consider .

**Definition 5**:
If and , defines the rightest Z-Box that starts before respectively at position .
Consider .

## Algorithm

In the following will denote the actual position we are looking for, and describe the current respectively last found of a Z-Box. First of all we set the values and to zero because we haven’t found any Z-Box yet. of our text is according to Definition 2 the length of the longest prefix of that is also prefix of itself. If we found a first Z-Box and update the limits to and .

Now we have to run through the word , so with defines the length of .

**Case 1:** Let’s assume position is outside of the last found Z-Box or we didn’t find any Z-Box yet (). We find by comparing the prefixes of and . If we’ve found a new Z-Box and need to update the limits to and .

**Case 2:** If the current position is inside of a current Z-Box () we try to find the equivalent position at the beginning of . The position we are searching for is steps off the beginning of (we are steps behind and has the same prefix as ).
**Case 2a:** If we don’t break out of the current Z-Box by creating another Z-Box with the length of the box at position (, so position is not behind position ), we can simply apply this Z-Box to the current position and .
**Case 2b:** Otherwise, if we would leave the actual Z-Box () we have to recheck the prefix conditions of and . We know that equals , so we only have to find the length of the longest prefix of that equals the prefix of . Now we can apply the new Z-Box such that and of course we update the Z-Box limits to and .

If we reached the end of all Z-Boxes are found in .

## Pseudo code

## Example

Let me demonstrate the algorithm with a small example. Let’s take the word . First we start with and at position 2. is the length of the shared prefix of () and (). Easy to see the prefix is with a length of 1. So , and . At the beginning of our for-loop the program’s status is:

$$T$$ | a | a | b | a | a | a | b |
---|---|---|---|---|---|---|---|

$$i$$ | 1 | 2 | |||||

$$Z_i$$ | 1 | ||||||

$$l$$ | 2 | ||||||

$$r$$ | 2 |

At the first round in the loop , so because . So we meet **case 1** and have to find the length of the prefix of () and (). Of course it’s zero, nothing to do.

$$T$$ | a | a | b | a | a | a | b |
---|---|---|---|---|---|---|---|

$$i$$ | 1 | 2 | 3 | ||||

$$Z_i$$ | 1 | 0 | |||||

$$l$$ | 2 | 2 | |||||

$$r$$ | 2 | 2 |

Next round, we’re at position 4 and again (**case 1**). So we have to compare and . The longest prefix of both words is with a length of 2. So we start a new Z-Box at 4 with a size of 2, so and .

$$T$$ | a | a | b | a | a | a | b |
---|---|---|---|---|---|---|---|

$$i$$ | 1 | 2 | 3 | 4 | |||

$$Z_i$$ | 1 | 0 | 2 | ||||

$$l$$ | 2 | 2 | 4 | ||||

$$r$$ | 2 | 2 | 5 |

With and we reach **case 2** for the first time. so our similar position at the beginning of is position 2. and so we are in **case 2b** and have to find the shared prefix of () and (). It’s , so and . and .

$$T$$ | a | a | b | a | a | a | b |
---|---|---|---|---|---|---|---|

$$i$$ | 1 | 2 | 3 | 4 | 5 | ||

$$Z_i$$ | 1 | 0 | 2 | 3 | |||

$$l$$ | 2 | 2 | 4 | 5 | |||

$$r$$ | 2 | 2 | 5 | 7 |

Next round brings us , therefor we’re in **case 2**. Equivalent position is again , but now and we’re in **case 2a** and can just set .

$$T$$ | a | a | b | a | a | a | b |
---|---|---|---|---|---|---|---|

$$i$$ | 1 | 2 | 3 | 4 | 5 | 6 | |

$$Z_i$$ | 1 | 0 | 2 | 3 | 1 | ||

$$l$$ | 2 | 2 | 4 | 5 | 5 | ||

$$r$$ | 2 | 2 | 5 | 7 | 7 |

The last round we have to process is , **case 2**. Equivalent position is and , so **case 2a** and .

$$T$$ | a | a | b | a | a | a | b |
---|---|---|---|---|---|---|---|

$$i$$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

$$Z_i$$ | 1 | 0 | 2 | 3 | 1 | 0 | |

$$l$$ | 2 | 2 | 4 | 5 | 5 | 5 | |

$$r$$ | 2 | 2 | 5 | 7 | 7 | 7 |

That’s it. The Z-Box’es we’ve found are visualized in the image.

## Searching

To search for a pattern in a text just calculate the Z-Boxes of with . These calculations are done in . For any : If means is prefix of , so is found at position in .

## Code

Of course I’m providing an implementation, see attachment.

- bioinformatics (21) ,
- explained (39) ,
- java (22) ,
- pattern (3) ,
- programming (73) ,
- search (4) ,
- university (41)

## 1 Comment

Suffix trees…Suffix trees are an application to particularly fast implement many important string operations like searching for a pattern or finding the longest common substring…….

## Post a comment

read more about submitting comments

`_comments`

directory and send me a pull request