## WordPress WordCloud with R

These days one can frequently read about wordclouds created with R, initiated by the release of the wordcloud package by Ian Fellows on July 23rd. So here I am to put in my two cents.

I thought about creating a wordcloud of a complete blog history, so I built a script that connects to a MySQL database and grabs all published posts and pages. All articles are combined into one huge text that, once stripped of tags and special characters, is visualized as a wordcloud:

[cc lang=”rsplus” lines=”-1” file=”pipapo/R/wordpress-wordcloud.R”][/cc]
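The cleaning step described above (strip tags and special characters, then count words) can be sketched in a few lines. This is a Python illustration rather than the R script itself; the function name `word_frequencies` and the `min_length` cut-off are assumptions made for the example, and the posts are passed in as plain strings instead of being read from MySQL.

```python
import re
from collections import Counter
from html import unescape


def word_frequencies(posts, min_length=3):
    """Count words over all posts after removing markup and special chars.

    `posts` is a list of raw article bodies (hypothetical input; the
    original script fetches them from the WordPress database).
    """
    text = " ".join(posts)
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    text = unescape(text).lower()                   # &amp; -> &, then lowercase
    text = re.sub(r"[^a-z\s]", " ", text)           # keep ASCII letters only
    words = [w for w in text.split() if len(w) >= min_length]
    return Counter(words)


if __name__ == "__main__":
    demo = ["<p>Hello R world!</p>", "Hello &amp; goodbye, world."]
    for word, count in word_frequencies(demo).most_common():
        print(word, count)
```

The resulting word/count pairs are what a wordcloud renderer needs as input. Note that the ASCII-only filter would also throw away umlauts, so a real blog with German posts would need a wider character class.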

Enough code, here is the result for my small blog:

Nice image, isn’t it? Unfortunately it takes about 30 seconds to generate, otherwise it would be cool to create such a cloud live, for example using rApache.

## Installing an HP ProLiant

I just installed a new server from HP, a ProLiant DL180 G6. Here are some notes about the setup.

To check the hardware status you need to install the ProLiant Support Package (PSP). On Debian/Ubuntu you should add the HP PSP mirror to your sources.list. It can be found here; you might include something like:

After an aptitude update you’ll find some new packages. I recommend installing hpacucli to talk to your RAID controllers and hp-health to interact with your hardware.

With hpacucli you can ask the RAID controllers for some information:

So you get an idea of your storage.

The hp-health package comes with a tool called hpasmcli. It’s used to query all the hardware states:

Both tools are very easy to use and give a great overview of the machine’s health, so I immediately developed a monitoring plugin that parses their output. Then I hit a wall: I wasn’t able to find any documentation for the hpasmcli tool. Most of its output is clear, but I don’t know what it prints if a fan breaks. The output with working fans looks like:

So what if a fan is broken? Is it still Present, with the Speed string just changing to NONE or something like that? I sent a support request to HP, but all they responded with was a premium-rate number to call. It seems my understanding of service differs from theirs. Since I don’t know what the output looks like in an error case (I don’t want to stick pencils into new machines), the plugin can’t decide whether the fans are OK. If you want to use my plugin you need to skip the fan checks until HP publishes a document with the possible values. IMHO a public tool should be open source, so I can get this information on my own, or at least be well documented!

## Now I'm R-Blogging

Today a lot of great mails arrived in my inbox. In one of them I read: “I’ve just added your feed to the site.”

## Where did this mail come from?

The sender of the email was Tal Galili. He is a researcher in biostatistics at Tel Aviv University and very active around the internet. He also founded R-Bloggers, and in this email he told me that I’ve been recruited ;-)

## What is R-Bloggers

R-Bloggers is an aggregation of more than 200 bloggers writing about GNU R and the statistics/math/hacks that can be done using R. If you haven’t heard about R-Bloggers I strongly recommend taking a look at their website. I’ve been following this project for a while, it’s a great fusion of brainiacs! So I’m proud to have my modest R-related articles listed among them.

By the way, if you like R-Bloggers and/or already have some R experience you should also take a look at the ‘R’ programming Wikibook. Contributing your knowledge is greatly appreciated!

Let’s see what the future brings, happy hacking!

## Rate My Po...

…sts (of course!). Yesterday I installed a rating plugin, inspired by the Stack Exchange platforms.

Searching through the WordPress plugin directory didn’t make me happy. All existing plugins lack features I wanted. After some tests I decided to modify UpDownUpDown by Dave Konopka. It’s a nice plugin, but it still didn’t match my criteria. For example, guests were not allowed to vote, there were some XHTML bugs and I didn’t like the style. So I created a patch (it’s attached) and sent it to Dave (I don’t have a GitHub account yet). He told me that he’ll take a look at it and might apply it to the official plugin, so if you also want to use this rating plugin with my additional features just keep the URL in mind and watch out for a new version.

The special version I’m using here right now of course has some more small changes, to match it perfectly to my own blog. So you are now able to vote for articles, positive or negative, to give me a hint about what my visitors like to read ;-)

I additionally set up a page that lists my articles sorted by votes: top. So you can get a quick overview of the best/worst content.

With this in mind: Happy voting! ;-)

Download: Patch for UpDownUpDown [03e9bb8017...]

## Build your own Network packets

Of course it’s more than nonsense to create all packets on your own, but sometimes there might be a reason that makes you wish you could. For example, for my last article I was looking for a way to modify some contents of a packet. First I thought about using iptables, but then I found a nice tool: scapy!

With scapy you can create your own packets, IP/TCP/UDP, whatever! It is very powerful but comes with a user-friendly interface. On Debian/Ubuntu you need to install python-scapy:

To open the interface just run scapy. You can easily create an IP packet by typing something like this:

So an IP packet is stored in the variable ippacket. This packet will be sent to binfalse.de and has a TTL of 12 (every router on the way decrements the TTL, so if there are more than 12 network nodes between your machine and the target the packet is dropped and never arrives). Let’s create some TCP stuff:

We stored some TCP information in tcpcrap. This packet will be sent from your port 1337 and hopefully arrive at port 80 (in general a webserver is listening on port 80). That’s it for the networking part. Last but not least we will create some data to send:

Combining all parts we get a very nice packet. Sending it will trigger my webserver to send the main page of my website (actually, sending exactly this packet will never result in any website from my webserver. Why? Just think about it…):
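The three building blocks can be put together roughly as follows. This is a sketch, not the exact code typed at the scapy prompt: the helper names `http_request` and `build_packet` are made up for this example, and the HTTP GET payload is an assumption about what “some data” looked like. The variable names ippacket and tcpcrap, the target, the TTL and the ports come from the text.

```python
def http_request(host):
    """Raw bytes of a minimal HTTP GET for the main page (assumed payload)."""
    return ("GET / HTTP/1.1\r\nHost: %s\r\n\r\n" % host).encode("ascii")


def build_packet(host="binfalse.de", ttl=12, sport=1337, dport=80):
    """Stack IP, TCP and payload into one scapy packet.

    scapy is a third-party package (python-scapy on Debian/Ubuntu), so it
    is imported here and not at module level.
    """
    from scapy.all import IP, TCP

    ippacket = IP(dst=host, ttl=ttl)          # where to send, how far it may travel
    tcpcrap = TCP(sport=sport, dport=dport)   # from local port 1337 to port 80
    # The "/" operator stacks layers; the bytes payload becomes a Raw layer.
    return ippacket / tcpcrap / http_request(host)
```

At the scapy prompt (started as root), `send(build_packet())` would put the packet on the wire and `build_packet().show()` prints the individual layers.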

Well done! OK, that’s a lot to do. But fortunately most of that code was just for explanation; you can send the same packet in a single line:

Very neat, isn’t it? You can also sniff passing packets! But I won’t explain that here, find out by yourself ;-)

## Connecting through a NAT - the non-trivial direction

I often hear people saying something like

SSH to your home PC? Sitting behind a NAT? A snowball's chance in hell...

But is it really impossible?

## What is a NAT?

NAT (network address translation) is a technique to cover multiple clients behind one router. Kristian Köhntopp explained the technology very well in his article NAT ist kein Sicherheitsfeature (GER). But let me summarize some things. Here is a small image to visualize the topology of an example network:

You see, the NAT represents something like a bridge between its clients (in network 10.0.0.0/24) and the rest of the world. The connections of the clients are translated by this router. Assuming client 10.0.0.3 wants to speak to my webserver 87.118.88.39, it sends a packet containing, among other things, the following information:

So all machines on the way from 10.0.0.3 to 87.118.88.39 know where to send the packet next. When this packet arrives at the NAT, the NAT will rewrite it. The NAT keeps a table of all recent connections. Each entry consists of a client IP, a client port and a local port on its public interface. For our example the table entry might look like:

| Source IP | Source Port | NAT IP | NAT Port |
| --- | --- | --- | --- |
| 10.0.0.3 | 39478 | 88.66.88.66 | 1234 |

The resulting port on the NAT is arbitrary, it’s just some free port. Each packet arriving on port 1234 of the public interface of the NAT is forwarded to 10.0.0.3:39478. Our rewritten packet 10.0.0.3->87.118.88.39 now contains the following information:

and is sent to the next node on the Internet. Nobody outside of 10.0.0.0/24 will ever know that there is a machine 10.0.0.3 requesting a website from 87.118.88.39. The webserver on 87.118.88.39 will send its answer to the pretended source, 88.66.88.66:1234, and the NAT will forward the traffic according to its table entry to 10.0.0.3.
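The table lookup in both directions can be modelled with two dictionaries. This is a toy sketch under simplifying assumptions: the class name `Nat` and its methods are invented for the example, ports are handed out sequentially starting at 1234 instead of picking an arbitrary free one, and entries never expire.

```python
import itertools


class Nat:
    """Toy source NAT: maps (client IP, client port) to a public port."""

    def __init__(self, public_ip, first_port=1234):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)  # assumed allocation scheme
        self.outbound = {}  # (client_ip, client_port) -> nat_port
        self.inbound = {}   # nat_port -> (client_ip, client_port)

    def translate_out(self, src_ip, src_port):
        """Rewrite the source of an outgoing packet, creating an entry if needed."""
        key = (src_ip, src_port)
        if key not in self.outbound:
            nat_port = next(self._ports)
            self.outbound[key] = nat_port
            self.inbound[nat_port] = key
        return self.public_ip, self.outbound[key]

    def translate_in(self, nat_port):
        """Find the client for a packet arriving on the public interface."""
        return self.inbound.get(nat_port)  # None: no entry, packet is dropped


if __name__ == "__main__":
    nat = Nat("88.66.88.66")
    print(nat.translate_out("10.0.0.3", 39478))  # ('88.66.88.66', 1234)
    print(nat.translate_in(1234))                # ('10.0.0.3', 39478)
    print(nat.translate_in(4321))                # None
```

Note that the inbound lookup only uses the NAT port and ignores who sent the packet, which matches the table shown above: the entry does not record the remote target.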

Why do NATs exist? The only plausible reason seems to be the lack of IPv4 addresses. With a NAT an ISP just needs to offer a single IP address for a huge bunch of clients. Hopefully this will change in times of IPv6!

## Why does it seem to be impossible?

Since the private network 10.0.0.0/24 is not known to the outside world (it is simply not routable on the Internet, see Wikipedia), you cannot connect from outside 10.0.0.0/24 straight to 10.0.0.3. The outside world will only see 88.66.88.66 as the source for all the clients. That means all clients in 10.0.0.0/24 share the same public IP from the point of view of any machine that is not in 10.0.0.0/24. So how to access 10.0.0.3? Speaking to 88.66.88.66 will result in crap, because you don’t know which port will be forwarded to whom, if it is forwarded at all…

## How is it nevertheless possible?

#### Method one...

…is not very nice; if you are looking for a real solution please skip this paragraph and continue with methods two and three ;-) Since the NAT table entry does not specify an outside target, you can send packets from any location to 88.66.88.66:1234 and the NAT will forward them to 10.0.0.3:39478 (according to my example). So to create a path from outside to 10.0.0.3’s SSH server you just need to send a packet from 10.0.0.3:22 to any server outside that tells you the source IP and source port reported by the NAT (that is the address that will be forwarded to the client). If you immediately connect to this address, and if an SSH server is listening on 10.0.0.3:22, you should be able to establish an SSH session. Simple, isn’t it? ;-) To get this working you could try something like repeating the following commands frequently:

Of course you can also install some iptables rules to rewrite the TCP packets. Then you can send the packets from some port other than 22, and iptables will rewrite them so that the target machine (and the NAT) thinks they came from :22. With this setup you don’t have to stop SSH, because you don’t need the port to be free… But just hack it your way ;-)

#### Method two...

…is much more comfortable. You can set up a reverse SSH tunnel! Again you need another machine outside the NAT that has an SSH server running and will act as your gateway. Just connect to it from your local machine behind the NAT:

That will open port 1337 on your.server. All packets arriving at this port are transferred through the SSH tunnel to your home PC. Run something like screen or top on the server so that packets are always being transferred (otherwise the connection will be closed after some time); with -o ServerAliveInterval=XXX the client sends a keep-alive message every XXX seconds, which also keeps an idle connection from being closed. Wrap it in a while loop and closed connections (network errors or something like that) will be reestablished:
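The reconnect loop could look roughly like this when driven from Python instead of a shell loop. It is a sketch under assumptions: the function names are invented, the keep-alive interval of 60 seconds and the retry delay are arbitrary, the local SSH server is assumed to listen on port 22, and `-N` (no remote command) is used in place of running screen or top on the server.

```python
import subprocess
import time


def tunnel_command(server, remote_port=1337, local_port=22, alive_interval=60):
    """Build the ssh call that opens remote_port on `server` and forwards
    it through the tunnel to local_port on this machine."""
    return [
        "ssh", "-N",                                       # tunnel only, no remote command
        "-o", "ServerAliveInterval=%d" % alive_interval,   # periodic keep-alives
        "-R", "%d:localhost:%d" % (remote_port, local_port),
        server,
    ]


def keep_tunnel_open(server, retry_delay=10):
    """Restart the tunnel whenever ssh exits (network error, server reboot, ...)."""
    while True:
        subprocess.call(tunnel_command(server))
        time.sleep(retry_delay)  # do not hammer the server while it is down


if __name__ == "__main__":
    print(" ".join(tunnel_command("your.server")))
```

Calling `keep_tunnel_open("your.server")` on the home PC keeps the tunnel up; from the server, `ssh -p 1337 localhost` then reaches the machine behind the NAT.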

By default the opened port is just bound to 127.0.0.1 (the server’s loopback interface), so you can only send packets from the server itself (or need some more network hacking). To have it listen on 0.0.0.0 (all interfaces), set GatewayPorts yes in /etc/ssh/sshd_config on your.server and restart the daemon.

#### Method three...

…might be the most elegant. Set up a VPN! But that’s too much for now. Ask 3dfxatwork for some explanations, he’s your OpenVPN guy, and take a look at Dirty NAT tricks to get a VPN to work with clients also numbered in the private address space.

So you see, no hasty prejudices ;-)

## 2 applications for 1 port

One of my PCs sits behind a firewall and just one port is open. I want to serve SSH and HTTPS, but as you know it’s not easy to get both listening on the same port, so what should I do?

Of course one possibility is to pick the more important application and forget about the other. But there is another solution! First of all, though, let’s have a look at both protocols.

If you connect to an SSH server it immediately greets you with the SSH version it is running, for example:

Here it is SSH-2.0-OpenSSH_5.5p1 Debian-6. So your client connects and just waits for an answer from the server. In contrast, the HTTP protocol doesn’t greet:

The server is programmed to just answer requests. So if we ask for anything it will give some feedback:

You see, the web server responds with code 200 , indicating everything is fine.

These differences in both protocols can be used to set up a proxy. If the client starts to send something it seems to speak HTTP, otherwise the client seems to wait for some SSH greetings. Depending on the client behavior the proxy should forward the packets to the relevant application. There is a nice Perl module to implement this easily: Net::Proxy .

First of all, both applications need to be configured not to use the open port. Without loss of generality let’s assume port 443 is opened by the firewall, SSH listens on its default port 22 and your webserver is configured to listen on 8080. The following piece of code will split the requests:
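The decision rule itself (does the client talk first or does it wait?) fits in a few lines. The sketch below is Python and not the Net::Proxy setup; the name `choose_backend`, the two-second timeout and the loopback backend addresses are assumptions for illustration, and the actual forwarding of bytes between client and backend is left out.

```python
import select
import socket

# Backends from the example in the text (addresses are assumed to be local).
SSH_BACKEND = ("127.0.0.1", 22)
HTTP_BACKEND = ("127.0.0.1", 8080)


def choose_backend(client, timeout=2.0):
    """Pick a backend for a freshly accepted client socket.

    A client that sends data within `timeout` seconds speaks first, so it
    is treated as HTTP(S). A silent client is assumed to be waiting for a
    server greeting and is handed to SSH.
    """
    readable, _, _ = select.select([client], [], [], timeout)
    return HTTP_BACKEND if readable else SSH_BACKEND
```

A proxy listening on port 443 would call `choose_backend` on every accepted connection, open a connection to the returned address and then copy data in both directions. The timeout is the price of the trick: every SSH login is delayed by that long.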

Some notes:

• To listen on ports < 1024 you need to be root!
• On Debian you need to install libnet-proxy-perl.
• Some protocols that wait for the client: HTTP, HTTPS
• Some protocols that greet the client: SSH, POP3, IMAP, SMTP

## Hello VG WORT - blogger here

Yes, this post was written in German! Why? Today I registered with the Verwertungsgesellschaft WORT (VG WORT) ;-)

## What is VG WORT?

VG WORT administers the royalties from secondary exploitation rights in written works. Companies that sell copiers, printers, CDs or the like have to pay a certain amount to VG WORT. VG WORT collects the money and passes it on to journalists, writers, publishers etc. So if someone publishes something that is publicly accessible, they can collect a small contribution from VG WORT. Of course this does not only apply to big works (diploma thesis, doctoral thesis or journal article), bloggers can take part too!

## How does it work?

As a blogger you can get remuneration for texts of at least 1,800 characters. There are two kinds of remuneration:

• Participation in the special distribution (Sonderausschüttung)
• Remuneration based on reader numbers

The special distribution pays much less money, and as soon as you have access to the source code of the text on the web you cannot take part in it anyway. So for us bloggers only the remuneration based on reader numbers remains. For this you get so-called counting marks (Zählmarken) from VG WORT. These are nothing but 1x1 pixel images that you integrate into your article. When a visitor reads the article, they also load this “image” from VG WORT’s server, which then increments the counter for that post by one. By the way, this is also how the Facebook Like button works: if you see it somewhere you can assume that Facebook knows you visited that page. I am aware that I am thereby restricting my readers’ privacy, so here is the solution right away: if you don’t want VG WORT to track you, you should forbid your browser to load images, or configure one of the various plugins to block images whose URL matches http://vg\d{2}.met.vgwort.de/.* . If in doubt, you already know how that works.
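To see what the blocking pattern does and does not match, here is a small Python check. The function name `is_vgwort_pixel` and the sample URLs are hypothetical; the dots are escaped here so that they only match literal dots, which is slightly stricter than the pattern quoted above.

```python
import re

# Same idea as the pattern from the text, with the dots escaped.
VGWORT_PIXEL = re.compile(r"^http://vg\d{2}\.met\.vgwort\.de/.*$")


def is_vgwort_pixel(url):
    """Return True if `url` looks like a VG WORT counting pixel."""
    return VGWORT_PIXEL.match(url) is not None


if __name__ == "__main__":
    for url in ("http://vg05.met.vgwort.de/na/example",   # made-up pixel URL
                "http://binfalse.de/some-image.png"):
        print(url, is_vgwort_pixel(url))
```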

## Where is the money?

As a blogger you first have to register with VG WORT. You can do that in the T.O.M. web interface. Then comes a bit of bureaucracy: fill in forms, sign them, send them off… Once that is done you can order the counting pixels, which arrive immediately by mail. Counting always runs from January 1st to December 31st of the same year. In the following year the remuneration is calculated and you can request a payout. If what I have read on the web is right, you get 15 euros for 1,500 hits, 20 euros for 3,000 visitors, and 30 euros if more than 10,000 browsers fetch your article. I won’t be aiming for the 10k for now, but 1,500 visitors should be possible ;-)

By the way, only visitors from Germany are counted. That is why this article was written in German. Don’t ask me how they determine whether a click comes from Germany or not; I guess in case of doubt it is not counted. Pixels delivered through aggregators such as Google Reader do not count either. So it would be nice if you clicked through to the site every now and then.

## All in all

I regard this as an experiment (this year I will probably not use more than 20 pixels) and do not expect major earnings, but who knows. VG WORT distributes the money anyway, so why shouldn’t I hold out my hands as well!? Anyone who doesn’t collect their remuneration has only themselves to blame ;-)

## Result

Of the listed articles, 4 managed to reach the minimum number of visitors. I received 40 € from VG WORT for them: thanks to all visitors :D In the end it didn’t really pay off, of course, it was more of a small bonus. But it was a nice experiment, which has meanwhile been ended. All counting pixels have been removed in favour of privacy. But feel free to keep clicking anyway!

## R progress indicators

Complicated calculations usually take a lot of time. So how do you track the progress to estimate how much time the program still needs to finish?

So far, I always printed some debugging output. That way I knew how much was done and what was still to do, but it isn’t a nice solution if you plan to share your application with others (the guys in your dev team or the public in general).

The first solution to indicate the status is just printing something like an iteration number:

OK, it works but sucks ;-) Some days ago I read about a Unicode trick to build a clock on the prompt. Using this, the next possibility for status indication is:

It consumes far fewer lines. Of course there is also a lot of room to prettify it, for example:
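The underlying trick is independent of R: print a carriage return instead of a newline so that each update overwrites the previous one. A minimal sketch of a prettified bar, written in Python for illustration (the function names and the bar layout are made up for this example):

```python
import sys
import time


def render_bar(done, total, width=20):
    """Return a one-line bar such as '[=====     ]  50%'."""
    filled = int(width * done / total)
    percent = 100 * done // total
    return "[%s%s] %3d%%" % ("=" * filled, " " * (width - filled), percent)


def run(total=50):
    """Pretend to work and redraw the bar in place after every step."""
    for i in range(1, total + 1):
        time.sleep(0.01)                          # stand-in for real work
        sys.stdout.write("\r" + render_bar(i, total))
        sys.stdout.flush()                        # show the update immediately
    sys.stdout.write("\n")
```

Calling `run()` in a terminal shows a single line that fills up from left to right.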

In order to write this article I searched for some more solutions and found one that more or less equals my last piece of code. txtProgressBar is part of the built-in utils package:

The last progress bar I want to present is a visual one and comes with the package tcltk :