I’m actually learning some stuff related to algorithms on sequences. The naive search for a pattern in a long string is of course very slow and comes with a lot of unintelligent compares. The Z-Algorithm improves the searching by preprocessing the pattern.

Naive searching

A simple search algorithm written in java may look like

public void search (String needle, String haystack)
{
	for (int off = 0; off < haystack.length () - needle.length () + 1; off++)
	{
		boolean found = true;
		for (int p = 0; p < needle.length (); p++)
			if (needle.charAt (p) != haystack.charAt (off + p))
			{
				found = false;
				break;
			}
		if (found) System.out.println ("Fount pattern at position " + off);
	}
}

This code reliably finds any existence of needle in haystack in , with length of needle and length of haystack. That screams for improvements ;)

Definitions

The first algorithm that I want to present in this series is called Z-Algorithm. First of all we need some definitions.

Definition 1: In the following we denote as the substring of beginning at position and ending at position . We can also leave one of the limits clear, so that is the substring and means .

Definition 2: So is the length of the longest prefix of the suffix that is also prefix of itself. To abbreviate is further on mentioned as .

Definition 3: The set for a is called Z-Box at position .

Definition 4: is the set of limits of all Z-Box’es that start at the left-handed side of . Consider .

Definition 5: If and , defines the rightest Z-Box that starts before respectively at position . Consider .

Algorithm

In the following will denote the actual position we are looking for, and describe the current respectively last found of a Z-Box. First of all we set the values and to zero because we haven’t found any Z-Box yet. of our text is according to Definition 2 the length of the longest prefix of that is also prefix of itself. If we found a first Z-Box and update the limits to and .

Now we have to run through the word , so with defines the length of .

Case 1: Let’s assume position is outside of the last found Z-Box or we didn’t find any Z-Box yet (). We find by comparing the prefixes of and . If we’ve found a new Z-Box and need to update the limits to and .

Case 2: If the current position is inside of a current Z-Box () we try to find the equivalent position at the beginning of . The position we are searching for is steps off the beginning of (we are steps behind and has the same prefix as ). Case 2a: If we don’t break out of the current Z-Box by creating another Z-Box with the length of the box at position (, so position is not behind position ), we can simply apply this Z-Box to the current position and . Case 2b: Otherwise, if we would leave the actual Z-Box () we have to recheck the prefix conditions of and . We know that equals , so we only have to find the length of the longest prefix of that equals the prefix of . Now we can apply the new Z-Box such that and of course we update the Z-Box limits to and .

If we reached the end of all Z-Boxes are found in .

Pseudo code

l = r = 0
Z[2] = prefix (S, S[2 ..]).length
if Z[2] > 0 then
	l = 2
	r = 2 + Z[2] - 1

for i = 3..|S| do
	if i > r then 										'(case 1)'
		Z[i] = prefix (S, S[i ..]).length
		if Z[i] > 0 then
			l = i
			r = i + Z[i] - 1

	else 												'(case 2)'
		k = i - l + 1
		if Z[k] < r - i + 1 then 						'(case 2a)'
			Z[i] = Z[k]

		else											'(case 2b)'
			p = prefix (S[r - i + 2 ..], S[r + 1 ..]).length
			Z[i] = r - i + 1 + p
			l = i
			r = i + Z[i] - 1

Example

Let me demonstrate the algorithm with a small example. Let’s take the word . First we start with and at position 2. is the length of the shared prefix of () and (). Easy to see the prefix is with a length of 1. So , and . At the beginning of our for-loop the program’s status is:

$$T$$aabaaab
$$i$$12
$$Z_i$$1
$$l$$2
$$r$$2

At the first round in the loop , so because . So we meet case 1 and have to find the length of the prefix of () and (). Of course it’s zero, nothing to do.

$$T$$aabaaab
$$i$$123
$$Z_i$$10
$$l$$22
$$r$$22

Next round, we’re at position 4 and again (case 1). So we have to compare and . The longest prefix of both words is with a length of 2. So we start a new Z-Box at 4 with a size of 2, so and .

$$T$$aabaaab
$$i$$1234
$$Z_i$$102
$$l$$224
$$r$$225

With and we reach case 2 for the first time. so our similar position at the beginning of is position 2. and so we are in case 2b and have to find the shared prefix of () and (). It’s , so and . and .

$$T$$aabaaab
$$i$$12345
$$Z_i$$1023
$$l$$2245
$$r$$2257

Next round brings us , therefor we’re in case 2. Equivalent position is again , but now and we’re in case 2a and can just set .

$$T$$aabaaab
$$i$$123456
$$Z_i$$10231
$$l$$22455
$$r$$22577

The last round we have to process is , case 2. Equivalent position is and , so case 2a and .

$$T$$aabaaab
$$i$$1234567
$$Z_i$$102310
$$l$$224555
$$r$$225777

That’s it. The Z-Box’es we’ve found are visualized in the image.

Searching

To search for a pattern in a text just calculate the Z-Boxes of with . These calculations are done in . For any : If means is prefix of , so is found at position in .

Code

Of course I’m providing an implementation, see attachment.

Download: Java: Zbox.java (Please take a look at the man-page. Browse bugs and feature requests.)

Martin Scharm

stuff. just for the records.


1 Comment

binfalse | Permalink | 2010-09-20 03:02:28

Suffix trees…

Suffix trees are an application to particularly fast implement many important string operations like searching for a pattern or finding the longest common substring…….

Post a comment

read more about submitting comments