Advanced searching via Z-Algorithm

I’m actually learning some stuff related to algorithms on sequences. The naive search for a pattern in a long string is of course very slow and comes with a lot of unintelligent compares. The Z-Algorithm improves the searching by preprocessing the pattern.

Naive searching

A simple search algorithm written in java may look like

public void search (String needle, String haystack)
{
	for (int off = 0; off < haystack.length () - needle.length () + 1; off++)
	{
		boolean found = true;
		for (int p = 0; p < needle.length (); p++)
			if (needle.charAt (p) != haystack.charAt (off + p))
			{
				found = false;
				break;
			}
		if (found) System.out.println ("Fount pattern at position " + off);
	}
}

This code reliably finds any existence of needle in haystack in \(O(m \cdot n)\), with \(m=\) length of needle and \(n=\) length of haystack. That screams for improvements ;)

Definitions

The first algorithm that I want to present in this series is called Z-Algorithm. First of all we need some definitions.

Definition 1: In the following we denote \(S[i\dots j]\) as the substring of \(S\) beginning at position \(i\) and ending at position \(j\). We can also leave one of the limits clear, so that \(S[i\dots]\) is the substring \(S[i\dots |S|]\) and \(S[\dots j]\) means \(S[1\dots j]\).

Definition 2: \(Z_i(S) := \max \{p | S[i \dots i+p-1] = S[1 \dots p]\}\) So \(Z_i(S)\) is the length of the longest prefix of the suffix \(S[i\dots]\) that is also prefix of \(S\) itself. To abbreviate \(Z_i(S)\) is further on mentioned as \(Z_i\).

Definition 3: The set \([i,i+Z_i-1]\) for a \(Z_i > 0\) is called Z-Box at position \(i\).

Definition 4: \(V_i := \{[a_j, b_j] | [a_j, b_j] \text{ is Z-Box at } a_j \wedge a_j < i\}\) \(V_i\) is the set of limits of all Z-Box’es that start at the left-handed side of \(i\). Consider \(i<j \Rightarrow V_i \subseteq V_j\).

Definition 5: \([l_i,r_i] := \begin{cases} \underset{b_j}{\arg\max} \ [a_j,b_j] \in V_i, & \text{if } V_i \ne \varnothing\\ [0,0] & \text{else}\end{cases}\) If \(l_i>0\) and \(r_i>0\), \([l_i,r_i]\) defines the rightest Z-Box that starts before respectively at position \(i\). Consider \(i<j \Rightarrow r_i\le r_j\).

Algorithm

In the following \(i\) will denote the actual position we are looking for, \(l\) and \(r\) describe the current respectively last found of a Z-Box. First of all we set the values \(l\) and \(r\) to zero because we haven’t found any Z-Box yet. \(Z_2\) of our text \(S\) is according to Definition 2 the length of the longest prefix of \(S[2\dots]\) that is also prefix of \(S\) itself. If \(Z_2>0\) we found a first Z-Box and update the limits to \(l=2\) and \(r=2+Z_2-1\).

Now we have to run through the word \(S\), so \(i=3\dots \|S\|\) with \(\|S\|\) defines the length of \(S\).

Case 1: Let’s assume position \(i\) is outside of the last found Z-Box or we didn’t find any Z-Box yet (\(i>r\)). We find \(Z_i\) by comparing the prefixes of \(S\) and \(S[i\dots]\). If \(Z_i>0\) we’ve found a new Z-Box and need to update the limits to \(l=i\) and \(r=i+Z_i-1\).

Case 2: If the current position \(i\) is inside of a current Z-Box (\(i\le r\)) we try to find the equivalent position at the beginning of \(S\). The position we are searching for is \(k=i-l+1\) steps off the beginning of \(S\) (we are \(i-l+1\) steps behind \(l\) and \(S[l\dots]\) has the same prefix as \(S\)). Case 2a: If we don’t break out of the current Z-Box by creating another Z-Box with the length of the box at position \(k\) (\(Z_k<r-i+1\), so position \(i+Z_k\) is not behind position \(r\)), we can simply apply this Z-Box to the current position and \(Z_i=Z_k\). Case 2b: Otherwise, if we would leave the actual Z-Box (\(i + Z_k>r\)) we have to recheck the prefix conditions of \(S[i\dots]\) and \(S\). We know that \(S[i\dots r]\) equals \(S[1\dots r-i+1]\), so we only have to find the length of the longest prefix \(p\) of \(S[r-i+2\dots]\) that equals the prefix of \(S[r+1\dots]\). Now we can apply the new Z-Box such that \(Z_i=r-i+1+p\) and of course we update the Z-Box limits to \(l=i\) and \(r=i+Z_i-1\).

If we reached the end of \(S\) all Z-Boxes are found in \(\Theta(\|S\|)\).

Pseudo code

l = r = 0
Z[2] = prefix (S, S[2 ..]).length
if Z[2] > 0 then
	l = 2
	r = 2 + Z[2] - 1

for i = 3..|S| do
	if i > r then 										'(case 1)'
		Z[i] = prefix (S, S[i ..]).length
		if Z[i] > 0 then
			l = i
			r = i + Z[i] - 1

	else 												'(case 2)'
		k = i - l + 1
		if Z[k] < r - i + 1 then 						'(case 2a)'
			Z[i] = Z[k]

		else											'(case 2b)'
			p = prefix (S[r - i + 2 ..], S[r + 1 ..]).length
			Z[i] = r - i + 1 + p
			l = i
			r = i + Z[i] - 1

Example

Let me demonstrate the algorithm with a small example. Let’s take the word \(S=aabaaab\). First we start with \(l=0\) and \(r=0\) at position 2. \(Z_2\) is the length of the shared prefix of \(S\) (\(aabaaab\)) and \(S[2\dots]\) (\(abaaab\)). Easy to see the prefix is \(a\) with a length of 1. So \(Z_2=1\), \(l=2\) and \(r=2\). At the beginning of our for-loop the program’s status is:

$$T$$aabaaab
$$i$$12
$$Z_i$$1
$$l$$2
$$r$$2

At the first round in the loop \(i=3\), so \(i>r\) because \(r=2\). So we meet case 1 and have to find the length of the prefix of \(S\) (\(aabaaab\)) and \(S[3\dots]\) (\(baaab\)). Of course it’s zero, nothing to do.

$$T$$aabaaab
$$i$$123
$$Z_i$$10
$$l$$22
$$r$$22

Next round, we’re at position 4 and again \(i>r\) (case 1). So we have to compare \(aabaaab\) and \(aaab\). The longest prefix of both words is \(aa\) with a length of 2. So we start a new Z-Box at 4 with a size of 2, so \(l=4\) and \(r=5\).

$$T$$aabaaab
$$i$$1234
$$Z_i$$102
$$l$$224
$$r$$225

With \(i=5\) and \(r=5\) we reach case 2 for the first time. \(k=i-l+1=2\) so our similar position at the beginning of \(S\) is position 2. \(Z_2=1\) and \(r-i+1=1\) so we are in case 2b and have to find the shared prefix of \(S[2 ..]\) (\(abaaab\)) and \(S[6 ..]\) (\(ab\)). It’s \(ab\), so \(p=2\) and \(Z_5=r-i+1+p=3\). \(l=5\) and \(r=7\).

$$T$$aabaaab
$$i$$12345
$$Z_i$$1023
$$l$$2245
$$r$$2257

Next round brings us \(i=6<r\), therefor we’re in case 2. Equivalent position is again \(k=i-l+1=2\), but now \(Z_2=1<r-i+1=2\) and we’re in case 2a and can just set \(Z_6=1\).

$$T$$aabaaab
$$i$$123456
$$Z_i$$10231
$$l$$22455
$$r$$22577

The last round we have to process is \(i=7<r\), case 2. Equivalent position is \(k=i-l+1=3\) and \(Z_3=0<r-i+1=1\), so case 2a and \(Z_7 = 0\).

$$T$$aabaaab
$$i$$1234567
$$Z_i$$102310
$$l$$224555
$$r$$225777

That’s it. The Z-Box’es we’ve found are visualized in the image.

Searching

To search for a pattern \(P \in A^*\) in a text \(T \in A^*\) just calculate the Z-Boxes of \(P\$T\) with \(\$\notin A\). These calculations are done in \(\Theta(|T|)\). For any \(i>|P|\): If \(Z_i=|P|\) means \(P\$T[i\dots i+|P|-1]\) is prefix of \(P\$T\), so \(P\) is found at position \(i-(|P|+1)\) in \(T\).

Code

Of course I’m providing an implementation, see attachment.

Download: Java: Zbox.java (Please take a look at the man-page. Browse bugs and feature requests.)

SSH escape sequences

Such as telnet the SSH protocol also has a control character, it’s the tilde (~).

If you for example want to kill a hanging SSH session just type ~. . With ~^Z you can suspend a running session and get back to your local machine. To reactivate it just type fg (yes, the SSH session is also just a job). All supported escape sequences will be listed with ~? :

me@remote ~ % ~?
Supported escape sequences:
  ~.  - terminate connection (and any multiplexed sessions)
  ~B  - send a BREAK to the remote system
  ~C  - open a command line
  ~R  - Request rekey (SSH protocol 2 only)
  ~^Z - suspend ssh
  ~#  - list forwarded connections
  ~&  - background ssh (when waiting for connections to terminate)
  ~?  - this message
  ~~  - send the escape character by typing it twice
(Note that escapes are only recognized immediately after newline.)

All sequences are of course only understood after a newline ;)

First HTML5 experiences

Although I have too much to do it’s in the nick of time to try some stuff with HTML5.

You should all have heard about HTML5, next generation of web ;) I still saw a lot of new features, some are still not supported in many browsers but all in all I’m looking forward.

Here I played a little bit with the canvas stuff and created a binary clock:

Wasn’t that difficult, just created an HTML element of type canvas with enough space in it to draw the clock:

<canvas id="clock" width="250" height="100"></canvas>

and via JavaScript I draw the clock in it:

/* JS binary clock by Martin Scharm <http://binfalse.de> */
function init()
{
	clock();
	setInterval(clock,1000);
}
function draw (ctx, x, y, stroke)
{
	ctx.beginPath(); 
	ctx.arc(x, y, 9, 0, Math.PI*2,true);
	if (stroke) ctx.stroke();
	else ctx.fill ();
}
function clock ()
{
	var canvas = document.getElementById("clock");  
	if (canvas.getContext)
	{  
		var offset = 60;
		var ctx = canvas.getContext("2d");
		ctx.save();
		ctx.clearRect(0,0,300,300); 
		var now = new Date();
		var sec = now.getSeconds();  
		var min = now.getMinutes(); 
		var hr  = now.getHours(); 
		for (var i = 0; i < 3; i++)
			for (var x = 0; x < 2; x++)
				for (var y = 0; y < 3; y++)
				{
					draw (ctx, i*offset + x*20 + 20, y*20 + 20, true);
				}
				for (var x = 1; x < 3; x++)
					for (var y = 2; y < 4; y++)
					{
						ctx.beginPath();
						ctx.arc(x * offset, y * 20, 4, 0, Math.PI*2,true);
						ctx.fill ();
					}
					for (var x = 0; x < 2; x++)
						for (var y = 0; y < 3; y++)
						{
							if (sec & Math.pow (2, (1 - x) * 3 + 2 - y)) draw (ctx, 2*offset + x*20 + 20, y*20 + 20, false);
							if (min & Math.pow (2, (1 - x) * 3 + 2 - y)) draw (ctx, 1*offset + x*20 + 20, y*20 + 20, false);
							if (hr & Math.pow (2, (1 - x) * 3 + 2 - y)) draw (ctx, x*20 + 20, y*20 + 20, false);
						}
						ctx.fillText(hr + ":" + min + ":" + sec, 70, 80);
					ctx.restore();
	}
}

After wards just called init (); , that calls clock(); once a second to draw the clock. Please tell me whether it works in your browser ;)

If anybody is interested, here is the code: html5_clock. If you also want to deal with it, Mozilla has a good tutorial.

I hope this new age of web will delete all the flash trash out there!

Download: Javascript: html5_clock.js (Please take a look at the man-page. Browse bugs and feature requests.)

Umlauts on English keyboards

Micha is just sitting next to me, writing a new blog post. He’s writing in German with an English keyboard, so he has to encode umlauts like ä with an &auml; . I can not watch any longer, here is the trick.

Still blogged about it, you can create such additional keys with Xmodmap. So choose a key, get its key code for example with xbindkeys -k and create a file $HOME/.Xmodmap with the following syntax:

keycode XXX = YYY

XXX ist the code of your key and YYY is that what should happen. For example:

keycode  137 = adiaeresis Adiaeresis
keycode  139 = udiaeresis Udiaeresis
keycode  141 = odiaeresis Odiaeresis 
keycode  143 = ssharp ssharp

That gives you an ä/Ä on the key with code 137 and so on. To let the file take effect just run xmodmap $HOME/.Xmodmap . Btw xmodmap -pke will give you the actual running keymap. So Micha, no need to type to much ;)

Twitter disabled Basic Authentication

Some of you may have recognized that twitter has disabled the so called Basic Authentication. So my previous twitter-tools don’t work anymore. But don’t bury your head in the sand, here are the newer versions.

Basic Authentication means you send your account information (username/password) unencrypted with your query (status update/timeline request/…) to twitter. Of course this method isn’t a nice way, so twitter disabled this method of authentication.

But the new methods of API calls are more complicated (called “OAuthcalypse”) and I really don’t like them. But whoever listens to me?

If you now want to interact with the twitter API, you have to register your tool as new twitter tool. Don’t ask me why, but you have to choose an unique name (all over the twitter world) for your application and get some random strings. For example for a Perl script you need the ones called Consumer key and Consumer secret.

If you want to interact with twitter, you have to do the following:

<li>send the combination of <em>Consumer key</em> and <em>Consumer secret</em> to the server
<li>receive an URL from the server where the user itself can find a pin code (when (s)he is logged into twitter)
<li>send this code to the server again and the user is verified
<li>receive some more authentication information from the server, store it for the next time, so the user don't have to authenticate again

Very annoying method, but there is no alternative method and at least your account is more save against hijacker.

By the way I found a Perl module called Net::Twitter that helps a lot.

Here is my snippet to solve this authentication stuff:

use Net::Twitter;

my $CRED_FILE = "somefile";

sub restore_cred {#read creds from $CRED_FILE}
sub save_cred {#write creds to $CRED_FILE}

my $nt = Net::Twitter->new(traits => ['API::REST', 'OAuth'], consumer_key => "KEY", consumer_secret => "SECRET",);
my ($access_token, $access_token_secret) = restore_cred();
if ($access_token && $access_token_secret)
{
	$nt->access_token($access_token);
	$nt->access_token_secret($access_token_secret);
}

unless ( $nt->authorized )
{
	print "Authorize this app at ", $nt->get_authorization_url, " and enter the PIN: ";
	chomp (my $pin = <stdin>);
	my($access_token, $access_token_secret, $user_id, $screen_name) = $nt->request_access_token(verifier => $pin);
	if (save_cred($access_token, $access_token_secret))
	{ print "successfull enabled this app! credentials are stored in: " . $CRED_FILE . "\\n" }
	else
	{ die "failed\\n"; }
}
if ($nt->update({ status => $status }))

Ok, you see it’s not impossible to solve this problem. And there is another advantage, with these two scripts I don’t have to provide my username/passwort any more.

Here is the script to tweet from command line and this script dumps the actual news to the console.

To use my tools just download them to your machine, rename them as you want and then just run it:

  • To tweet something call tweet-v2.pl with your status message as argument.
  • To get latest informations from the people you are following just call twitstat-v2.pl with an optional argument defining the maximal number of messages you want to see.

For the first time you’ll see a link where you’ll get your pin (open the link with your browser), after wards the tools will store your credentials in [toolname].credentials . Just try it, won’t (hopefully) break anything :P

Download: Perl: tweet-v2.pl (tweet from command line) Perl: twitstat-v2.pl (get latest news) (Please take a look at the man-page. Browse bugs and feature requests.)


Martin Scharm

stuff. just for the records.

Do you like this page?
You can actively support me!