web content anlayzer

Just developed a small crawler to check my online content at binfalse.de in terms of W3C validity and the availability of external links. Here is the code and some statistics…

The new year just started and I wanted to check what I produced the last year in my blog. Mainly I wanted to ensure more quality, my aim was to make sure all my blog content is W3C valid and all external resources I’m linking to are still available. First I thought about parsing the database-content, but at least I decided to check the real content as it is available to all of you. The easiest way to do something like this is doing it with Perl, at least for me. The following task were to do for each site of my blog:

  • Check if W3C likes the site
  • For each link to external resources: Check if they respond with 200 OK
  • For each internal link: Check this site too if not already checked

While I’m checking each site I also saved the number of leaving links to a file to get an overview. Here is the code:

You need to install LWP::UserAgent , XML::TreeBuilder and WebService::Validator::HTML::W3C . Sitting in front of a Debian based distribution just execute:

aptitude install libxml-treebuilder-perl libwww-perl libwebservice-validator-css-w3c-perl libxml-xpath-perl

The script checks all sites that it can find and that match to

m/^(http(s)?:\\/\\/)?[^\\/]*$domain/i

So adjust the $domain variable at the start of the script to fit your needs. It writes all W3C results to /tmp/check-links.val , the following line-types may be found within that file:

# SITE is valid
ok: SITE
# SITE contains invalid FAILURE at line number LINE
error: SITE -> FAILURE (LINE)
# failed to connect to W3C because of CAUSE
failed: CAUSE

So it should be easy to parse if you are searching for invalids. Each external link that doesn’t answer with 200 OK produces an entry to /tmp/check-links.fail with the form

SITE -> EXTERNAL (RESPONSE_CODE)

Additionally it writes for each website the number of internal links and the number of external links to /tmp/check-links.log .

If you want to try it on your site keep in mind to change the content of $domain and take care of the pattern in line 65:

$href =~ m/\\/$/

Because I don’t want to check internal links to files like .png or .tgz the URL has to end with / . All my sites containing parseable XML end with / , if your sites doesn’t, try to find a similar expression.

As I said I’ve looked to the results a bit. Here are some statistics (as at 2011/Jan/06):

Processed sites481
Sites containing W3C errors38
Number of errors63
Mean error per site0.1309771
Mean of internal/external links per site230.9833 / 15.39875
Median of internal/external links per site216 / 15
Dead external links82
Dead external links w/o Twitter5

Most of the errors are now repaired, the other ones are in progress. The high number of links that aren’t working anymore comes from the little twitter buttons at the end of each article. My crawler is of course not authorized to tweet, so twitter responds with 401 Unauthorized . One of the other five fails because of a cert problem, all administrators of the other dead links are informed.

I also analyzed the outgoing links per site. I’ve clustered them with K-Means, the result can be seen in figure 1. How did I produce this graphic? Here is some R code:

library(MASS)

x=read.table("check-links.log")
intern=x[,1]
extern=x[,2]
z <- kde2d(intern,extern, n=50)

# cluster
colnames(x) <- c("internal", "external")
(cl <- kmeans(x, 2))

# draw
png("check-links.png", width = 600, height = 600)

# save actual settings
op <- par()
layout( matrix( c(2,1,0,3), 2, 2, byrow=T ), c(1,6), c(4,1))

par(mar=c(1,1,5,2))
contour(z, col = "black", lwd = 2, drawlabels = FALSE)
points(x, col =  2*cl$cluster, pch = 2 * (cl$cluster - 1),lwd=2,cex=1.3)
points(cl$centers, col = c(2,4), pch = 13, cex=3,lwd=3)
rug(side=1, intern )
rug(side=2, extern )
abline(lm( external ~ internal, data = x))
title(main = "internal -vs- external links per site")

# boxplot left
par(mar=c(1,2,5,1))
boxplot(extern, axes=F)
title(ylab='external links', line=0)

# boxplot right
par(mar=c(5,1,1,2))
boxplot(intern, horizontal=T, axes=F)
title(xlab='internal links', line=1)
# restore settings
par(op)

dev.off()

You’re right, there is a lot stuff in the image that is not essential, but use it as example to show R beginners what is possible. Maybe you want to produce similar graphics!?

Download: Perl: check-links.pl R: check-links.R (Please take a look at the man-page. Browse bugs and feature requests.)

Pre- and Post-suspend

Today I got another complaint in a row of complaints of my Jabber contacts, arguing that they can’t send me messages although my account seems to be online in their buddy list. That happens when I put my notebook to sleep, this time I got informed by Micha.

Here are 3 steps to patch this problem, dealing with gajim-remote, PowerManagement-Utils and DBus.

This annoying events happens when I was online with my notebook and close the lid so the notebook goes sleeping. Unfortunately my Jabber client Gajim doesn’t notice that I’m going to disconnect and so the Jabber server isn’t informed about my absence. Due to connection instabilities the server waits some time of inactivity until it recognizes that there is really no more client before it tells all my friends I’m gone. During this time I appear online but messages are not able to reach my client, so they are lost in hell. That sucks, I know, and now I’ve reacted.

First of all I checked how to tell Gajim to disconnect via command line and found the tool gajim-remote , it comes with Gajim itself. Here are some examples of using it:

# disconnect
gajim-remote change_status offline
# reconnect setting status message to "meetunix is sexy"
gajim-remote change_status online "meetunix is sexy"
# learn what can be done!?
gajim-remote -h

Of course the manpage will give you more information.

Ok, so far, next task is to understand what is done when the lid is closed. The task to suspend or hilbernate is, at least in my case, done by pm-utils (PowerManagement-Utils). It comes with some tools like pm-suspend or pm-hibernate and so on. To tell these tools to do something before respectively after suspending there is a directory in /etc/pm/sleep.d . Here You can leave some script that look like those in /etc/init.d/* . Here is a smart example now located in /etc/pm/sleep.d/01users on my notebook, you can use it as skeleton:

#!/bin/bash
# check which users are logged in
USERS=$(/usr/bin/finger | /bin/grep ':0'| /bin/grep -o '^\\w*'| /usr/bin/uniq)
prepare_sleep ()
{
        for user in $USERS
        do
                uhome=$(/bin/su $user -c 'echo $HOME')
                [ -x $uhome/.suspend ] && /bin/su $user -c $uhome/.suspend
        done
}
revive ()
{
        for user in $USERS
        do
                uhome=$(/bin/su $user -c 'echo $HOME')
                [ -x $uhome/.awake ] && /bin/su $user -c $uhome/.awake
        done
}
case "$1" in
        hibernate|suspend|suspend_hybrid)
                prepare_sleep
                ;;
        thaw|resume)
                revive
                ;;
        *) exit 1
                ;;
esac
exit 0

Make it executable and give it a try. It checks for each logged-in user whether there is a .suspend or .awake in its $HOME to execute it before suspending respectively after resuming.

Next step is telling Gajim to change its status. Unfortunately the gajim-remote script is speaking to the running Gajim-instance via DBus. You may have heard about DBus, there are two main options of DBus buses: system- and session-bus. To speak to Gajim you use the session DBus and need the bus address. That is a problem, this address is acquired while your X-login, and you don’t know it from a remote session or if the system executes scripts while suspending. So if you just try to execute gajim-remote change_status offline in your .suspend you’ll get an error like D-Bus is not present on this machine or python module is missing or Failed to open connection to "session" message bus . Your DBus session address within an X-session is set in your environment in $DBUS_SESSION_BUS_ADDRESS ( echo $DBUS_SESSION_BUS_ADDRESS ). So what are your options to get this address for your .suspend script?

  • You can export your env to a file when you login (maybe automatically via .xinitrc ) to parse it
  • All addresses are saved in $HOME/.dbus/session-bus/ , so try to find the right one..
  • Get it from a process environment

The last possibility is of course the nicest one. So check if Gajim is running and extract the DBUS_SESSION_BUS_ADDRESS from /proc/GAJIM_PID/environ ! Here is how it can be done:

#!/bin/bash

gajim_logout ()
{
    # is gajim-remote present
    [ -x /usr/bin/gajim-remote ] || return 1

    # is gajim running
    gajim_pid=$(/bin/ps -f --user $(whoami) | /bin/grep [g]ajim.py | /usr/bin/awk '{print $2}')
    [ -f /proc/$gajim_pid/environ ] || return 1

    # get the dbus address
    DBUS_SESSION_BUS_ADDRESS=$(/bin/cat /proc/$gajim_pid/environ | /bin/grep -z DBUS_SESSION_BUS_ADDRESS | /bin/sed 's/^[^=]*=//')
    [ -n $DBUS_SESSION_BUS_ADDRESS ] || return 1
    export DBUS_SESSION_BUS_ADDRESS

    # now gajim is ready to go offline ;-)
    /usr/bin/gajim-remote change_status offline >> /dev/null
}

gajim_logout

exit 0

That’s it, great work! Save this file in $HOME/.suspend and give it the right for execution. You can also write a similar script for $HOME/.awake to reconnect to your Jabber server, but you eventually don’t want to reconnect each time you open the lid..

So the next time I close my laptops lid Gajim disconnects immediately! No annoyed friends anymore :P

ShortCut[image]: ScreenShooting a JPanel

This is about taking a screenshot of a JPanel to save the visible stuff as an image to disk.

Sometimes it may happen that you create a swing GUI with a panel to draw cool stuff. Here you can learn how to let the user take a screenshot of this graphic with a single click on a button.

First of all create such a JPanel and fill it with crazy graphics, then create a BufferedImage with the size of the panel and tell the panel to draw its content to this image instead of printing the content to the screen and, last but not least, save this image:

// create a panel
javax.swing.JPanel panel = new javax.swing.JPanel ();
// [..] tell the panel to look nice

// create an image
java.awt.image.BufferedImage bImage = new java.awt.image.BufferedImage (panel.getSize ().width, panel.getSize().height, java.awt.image.BufferedImage.TYPE_INT_RGB);
// tell the panel to draw its content to the new image
panel.paint (bImage.createGraphics ());
// save everything
try
{
  java.io.File imageFile = new java.io.File ("/tmp/screener.png")
  imageFile.createNewFile ();  
  javax.imageio.ImageIO.write (bImage, "png", imageFile);  
}
catch(Exception ex)
{
  ex.printStackTrace ();
}

You see it’s very simple. Of course it’s also possible to create other types of image with ImageIO, like JPEG or GIF, but for more information take a look in the documentation. In a project that will be published you should think about using a JFileChooser instead of hard coding the name of the new image ;-)

Adding a hyperlink to Java Swing GUI

Just dealt with an annoying topic: How to add a link to a Java swing frame. It’s not that hard to create some blue labels, but it’s a bit tricky to call a browser browsing a specific website…

As I mentioned the problem is to call the users web browser. Since Java SE 6 they’ve added a new class called Desktop. With it you may interact with the users specific desktop. The call for a browser is more than simple, just tell the desktop to browse to an URL:

java.awt.Desktop.getDesktop ().browse ("binfalse.de")

Unfortunately there isn’t support for every OS, before you could use it you should check if it is supported instead of falling into runtime errors..

if (java.awt.Desktop.isDesktopSupported ())
{
	java.awt.Desktop desktop = java.awt.Desktop.getDesktop ();
	if (desktop.isSupported (java.awt.Desktop.Action.BROWSE))
	{
		try
		{
			desktop.browse (new java.net.URI ("binfalse.de"));
		}
		catch (java.io.IOException e)
		{
			e.printStackTrace ();
		}
		catch (java.net.URISyntaxException e)
		{
			e.printStackTrace ();
		}
	}
}

So far.. But what if this technique isn’t supported!? Yeah, thats crappy ;) You have to check which OS is being used, and decide what’s to do! I searched a little bit through the Internet and developed the following solutions:

String url = "";
String osName = System.getProperty("os.name");
try
{
	if (osName.startsWith ("Windows"))
	{
		Runtime.getRuntime ().exec ("rundll32 url.dll,FileProtocolHandler " + url);
	}
	else if (osName.startsWith ("Mac OS"))
	{
		Class fileMgr = Class.forName ("com.apple.eio.FileManager");
		java.lang.reflect.Method openURL = fileMgr.getDeclaredMethod ("openURL", new Class[] {String.class});
		openURL.invoke (null, new Object[] {url});
	}
	else
	{
		//check for $BROWSER
		java.util.Map<String, String> env = System.getenv ();
		if (env.get ("BROWSER") != null)
		{
			Runtime.getRuntime ().exec (env.get ("BROWSER") + " " + url);
			return;
		}

		//check for common browsers
		String[] browsers = { "firefox", "iceweasel", "chrome", "opera", "konqueror", "epiphany", "mozilla", "netscape" };
		String browser = null;
		for (int count = 0; count < browsers.length && browser == null; count++)
			if (Runtime.getRuntime ().exec (new String[] {"which", browsers[count]}).waitFor () == 0)
			{
				browser = browsers[count];
				break;
			}
		if (browser == null)
			throw new RuntimeException ("couldn't find any browser...");
		else
			Runtime.getRuntime ().exec (new String[] {browser, url});
	}
}
catch (Exception e)
{
	javax.swing.JOptionPane.showMessageDialog (null, "couldn't find a webbrowser to use...\\nPlease browser for yourself:\\n" + url);
}

Combining these solutions, one could create a browse function. Extending the javax.swing.JLabel class, implementing java.awt.event.MouseListener and adding some more features (such as blue text, overloading some functions…) I developed a new class Link, see attachment.

Of course it is also attached, so feel free to use it on your own ;)

Unfortunately I’m one of these guys that don’t have a Mac, shame on me! So I just could test these technique for Win and Linux. If you are a proud owner of a different OS please test it and let me know whether it works or not. If you have improvements please tell me also.

Download: JAVA: Link (Please take a look at the man-page. Browse bugs and feature requests.)

Revived iso2l

Someone informed me about a serious bug, so I spent the last few days with rebuilding the iso2l.

This tool was a project in 2007, the beginning of my programming experiences. Not bad I think, but nowadays a pharmacist told me that there is a serious bug. It works very well on small compounds, but the result is wrong for bigger molecules.. Publishing software known to fail is not that nice and may cause serious problems, so I took some time to look into our old code…

Let me explain the problem on a little example. Let’s denote the atom Carbon \(\text{C}\) has two isotopes. The first one \(\text{C}^{12}\) with a mass of 12 and an abundance of 0.9 and another one \(\text{C}^{13}\) with a mass of 13 and an abundance of 0.1 (these numbers are only for demonstration). Let’s further assume a molecule \(\text{C}_{30}\) consisting of 30 Carbon atoms. Since we can choose from two isotopes for each position in this molecule and each position is independent from the others there are many many possibilities to create this molecule from these two isotopes, exactly \(2^{30}=1073741824\). Let’s for example say we are only using 15 isotopes of \(\text{C}^{12}\) and 15 isotopes of \(\text{C}^{13}\), there still exists \(\frac{30!}{15!\cdot 2} = 155117520\) combinations of these elements. Each combination has an abundance of \(0.9^{15} \cdot 0.1^{15} \approx 2.058911 \cdot 10^{-16}\).

Unfortunately in the first version we created a tree to calculate the isotopic distribution (here it would be a binary tree with a depth of 10). If a branch has an abundance smaller than a threshold it’s cut to decrease the number of calculations and the number of numeric instabilities. If this thresh is here \(10^{-10}\) none of these combinations would give us a peak, but if you add them all together (they all have the same mass of \(15 \cdot 12 + 15 \cdot 13 = 375\)) there would be a peak with this mass and an abundance of \(\frac{30!}{15!\cdot 2} \cdot 0.9^{15} \cdot 0.1^{15} \approx 3.193732 \cdot 10^{-8} > 10^{-10}\) above the threshold. Not only the threshold is killing peaks. This problem is only shifted if you decrease the thresh or remove it, because you’ll run in abundances that aren’t representable for your machine, especially in larger compounds (number of carbons > 100). So I had to improve this and rejected this tree-approach.

And while I’m touching the code, I translated the tool to English, increased the isotopic accuracy and added some more features. In figure 1 you can see the output of version 1 for \(\text{C}_{150}\). As I told this version loses many peaks. I contrast you can see the output of version 2 for the same compound in figure 2. Here the real isotopic cluster is visible.

Please try it out and tell me if you find further bugs or space for improvements or extensions!



Martin Scharm

stuff. just for the records.

Do you like this page?
You can actively support me!