Using MathJax to render math

Some time ago I heard about MathJax and decided to integrate it into my blog. A short time later I had forgotten all about it, but a few days ago I read an article and remembered my plan. So here it is ;-)

Up to now a mathematical formula was converted to an image like this: a^2+b^2=c^2\\a^2=c^2\,

There are some disadvantages, for example you can’t align a number of lines by the equal sign. And also the integration into continuous text is terrible, as you can see in the following sum: \sum i = 5.
Different images have different baselines: i=\circ vs p=\circ. This will destroy any line spacings and it depends on the browser what you see if you zoom into the website.

Here is the same text rendered with MathJax (you need to have JavaScript enabled to see a difference):

There are some disadvantages, for example you can’t align a number of lines by the equal sign. And also the integration into continuous text is terrible, as you can see in the following sum: $\sum i = 5$. Different images have different baselines: $i=\circ$ vs $p=\circ$. This will destroy any line spacings and it depends on the browser what you see if you zoom into the website.

So you see, MathJax remedies these issues. Simple LaTeX code is rendered into web-compatible math symbols. That is done via JavaScript, so your browser has a bit more to do, but I think in times of Web 2.0 that’s negligible. The output is also selectable, so you can copy & paste! But what if a visitor is browsing without JS? I implemented my version with a fallback to the images, so if you disable JS you’ll see the pure output of WP-LaTeX.

I’m very busy at the moment, so there is no time to create and maintain an official WP plugin, but I can offer a how-to so you can tinker with it yourself.

How to?

These instructions are for WP-LaTeX version 1.7. Add a comment if you want an update for a newer version.

First of all you have to download MathJax, you’ll get it here. I installed a copy into WP_PATH/wp-content/plugins/ . Now log into your admin panel and install the plugin WP-LaTeX.

Once this is done, cd to your plugin directory. The only file you have to edit is wp-latex/wp-latex.php . Since we don’t want to destroy the original functionality, we will continue creating images, so there is no need to delete anything. But if JS is enabled, the images should be replaced by MathJax code. How do we find out whether JS is available!? We use JS to replace the images. So if it’s not enabled, the images won’t be replaced ;-)

Since the MathJax library contains a lot of JS, we will only load it if we need it. Most articles don’t require LaTeX, so loading the library anyway would be a waste of resources. We introduce a new variable loadMathJax , indicating whether we need MathJax. Have a look at the code and search for function wp_head() { . This function already outputs some style information, we only need to append some JS code:

function wp_head() {
	if ( !$this->options['css'] )
		return;
	?>
	<style type="text/css">
	/* <![CDATA[ */
	<?php echo $this->options['css']; ?>

	/* ]]> */
	</style>
	<!-- our code start -->
	<script type="text/javascript">
	var loadMathJax = false;
	</script>
	<!-- our code end -->
	<?php
}

loadMathJax is false by default, since we don’t always need the MathJax libraries. That was nothing exciting, here comes the intelligence. You’ll also find a function called shortcode . This function is responsible for image creation, here is the code that is sent to your browser:

$latex_object = $this->latex( $latex, $atts['background'], $atts['color'], $atts['size'] );

$url = clean_url( $latex_object->url );
$alt = attribute_escape( is_wp_error($latex_object->error) ? $latex_object->error->get_error_message() . ": $latex_object->latex" : $latex_object->latex );

return "<img src='$url' alt='$alt' title='$alt' class='latex' />";

Nice, isn’t it!? We now need to add a piece of code to replace this image with MathJax source code. We change the code to append a small piece of JS:

$latex_object = $this->latex( str_replace("&", "", $latex), $atts['background'], $atts['color'], $atts['size'] );

$url = clean_url( $latex_object->url );
$alt = attribute_escape( is_wp_error($latex_object->error) ? $latex_object->error->get_error_message() . ": $latex_object->latex" : $latex_object->latex );

$id = "latex".md5($url.microtime ());
$start = "\$";
$end = "\$";
if ($latex[strlen($latex)-1] == ",")
{
	$start = "\\\\begin{align}";
	$end = "\\\\end{align}";
}
$mathjaxcode = "<script type='text/javascript'>
if (document.createElement && document.getElementById){
	loadMathJax = true;
	var img = document.getElementById('" . $id . "');
	if (img){
		var tex = document.createTextNode(\"" . $start . str_replace("\\", "\\\\", $latex) . $end . "\");
		img.parentNode.replaceChild(tex, img)
	}
};
</script>";

return "<img src='$url' alt='$alt' title='$alt' class='latex' id='" . $id . "'/>".$mathjaxcode;

Ok, let me briefly explain this. First we have to remove all & from the LaTeX code that is passed to the image renderer (line 1). There is a small issue with the WP-LaTeX plugin: you can’t align multiple lines, because & isn’t allowed. This workaround is my way to create multiline MathJax formulas nevertheless. In line 6 I create a random id, so we can address this specific element just by its id. I additionally defined a trailing , as the indicator for multiple lines. You just have to add e.g. \, (a small space) to the end of the last line, and the code treats the formula as multiline. It will then be centered and aligned at every & . Afterwards the piece of JS follows. You don’t have to understand it, it just looks for an image with the specified id and replaces it with the LaTeX code. It additionally sets the variable loadMathJax to true . As soon as one image is replaced this variable becomes true! If no image is replaced it stays false.
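
To make the delimiter logic easier to follow, here is a minimal JavaScript sketch of what the PHP above decides for each formula. It is only an illustration and not part of the plugin: the function names wrapForMathJax and stripAlignmentMarks are made up for this example.

```javascript
// Illustrative port of the PHP delimiter logic. Both function names are
// hypothetical and do not exist in WP-LaTeX or MathJax.

// A trailing comma (e.g. from a final "\,") marks a multiline formula.
function wrapForMathJax(latex) {
	const multiline = latex.endsWith(',');
	const start = multiline ? '\\begin{align}' : '$';
	const end = multiline ? '\\end{align}' : '$';
	return start + latex + end;
}

// The image renderer cannot handle "&", so it gets a copy without them.
function stripAlignmentMarks(latex) {
	return latex.split('&').join('');
}

const formula = 'a^2+b^2&=c^2\\\\a^2&=c^2\\,';
console.log(wrapForMathJax(formula));
console.log(stripAlignmentMarks(formula));
```

A formula without the trailing comma is wrapped in $ ... $ and rendered inline, everything else ends up in an align environment.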

Last but not least the browser has to load the libraries. Since we want to know whether there is LaTeX code on the page, we can’t load them early in the header section. We have to evaluate loadMathJax in the footer section. Add the following to the init () function:

add_action( 'wp_footer', array( &$this, 'wp_footer' ) );

And append a new function to the end of the class:

function wp_footer ()
{
	?>
	<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		showProcessingMessages: false,
		messageStyle: "none",
		extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
		jax: ["input/TeX", "output/HTML-CSS"],
		tex2jax: {
			inlineMath: [ ['$','$'], ["\\(","\\)"] ],
			displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
			multiLine: true
		},
		"HTML-CSS": { availableFonts: ["TeX"] }
	});
	</script>
	<script type="text/javascript">
	if (loadMathJax)
	{
		var head= document.getElementsByTagName('head')[0];
		var script= document.createElement('script');
		script.type= 'text/javascript';
		script.src="/wp-content/plugins/MathJax/MathJax.js";
		head.appendChild(script);
	}
	</script>

	<?php
}

The first script section adds the MathJax configuration to the page. Take a look at the documentation to learn more. The second script appends a new DOM node to the head section via JS, if and only if loadMathJax is true and JS is available. If you installed MathJax to a directory other than WP_PATH/wp-content/plugins/ you have to edit the script.src line.

This should work, at least it does for me ;-) Right-click on one of these formulas and choose Settings -> Zoom Trigger -> Click, and each time you click on a formula you’ll see a zoomed version. Very smart I think!

Btw. even if it sounds like I’m arguing against the image variant, I’m not! It’s a very good method and the displayed formula looks the same in every browser. Even Wikipedia uses this technique.

Here is a nice last example, based on the sample of WP-LaTeX:

Here is the code for the above formula:

\displaystyle P_\nu^{-\mu}(z)&=\frac{\left(z^2-1\right)^{\frac{\mu}{2}}}{2^\mu \sqrt{\pi}\Gamma\left(\mu+\frac{1}{2}\right)}\int_{-1}^1\frac{\left(1-t^2\right)^{\mu -\frac{1}{2}}}{\left(z+t\sqrt{z^2-1}\right)^{\mu-\nu}}dt\\&=a^2+\pi\cdot x_\infty\\&\approx42\,

Go out and produce smart-looking, intelligent web pages! I’m looking forward to reading some scientific articles on your websites!

Why false is sometimes true

We have some specialists in our admin staff who are only able to administer LDAP via phpLDAPadmin. But for several days the connection to the LDAP servers was read-only. It took some time to figure out why.

The configuration in /etc/phpldapadmin/config.php is very extensive, so I kept overlooking the mistake in the following line:

$servers->setValue ('server', 'read_only', 'false');

Do you spot the mistake? If I comment the line out, the session isn’t read-only anymore. First we suspected a bug and I started to check the source code, but after some more thought I had an idea. You might know that PHP (like some other languages) treats every string as true unless it is empty or '0' . So 'hello' , 'true' and 'false' are all true ;-) What was meant here is the boolean false , written without quotes.
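
JavaScript is one of those other languages, so the effect is easy to reproduce there. The following sketch is only an illustration (it is not phpLDAPadmin code, and parseBooleanSetting is a hypothetical helper). Note one difference to PHP: in JavaScript even the string '0' counts as true.

```javascript
// Non-empty strings are truthy in JavaScript, whatever they spell.
function isTruthy(value) {
	return Boolean(value);
}

// Hypothetical helper: interpret a config string the way the author of
// that config line probably expected it to be interpreted.
function parseBooleanSetting(text) {
	return String(text).trim().toLowerCase() === 'true';
}

for (const value of ['hello', 'true', 'false', '', false]) {
	console.log(JSON.stringify(value), '->', isTruthy(value));
}
console.log('parsed:', parseBooleanSetting('false'));
```

The string 'false' passes the truthiness check, while the explicit comparison in parseBooleanSetting gives the value that was intended.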

Don’t know who inserted this line, but sometimes (or generally?) you just work to correct the work of somebody else…

ShortCut[proprietary]: NVIDIA update

Again I installed a new kernel and again X isn’t able to start. Of course the last time I installed the proprietary NVIDIA driver I downloaded it to /tmp , and curiously it’s lost! The funny NVIDIA website is so damn incompatible, you need to have JavaScript or Flash or both to find your driver, no chance to get the driver with e.g. lynx from command line… So you need to have another running system to download the driver and secure copy it, or you need to reboot into the old kernel. (nonstop swearing at proprietary smut) Does this sound familiar? There is an alternative!

Once you have installed the driver, you’ve also installed a tool called nvidia-installer . This tool is able to find the newest driver at nvidia.com and download it itself via FTP. Just type the following:

nvidia-installer --update

Even if you store your driver somewhere persistent, you have to get a new one whenever you install a new X version and your old driver is out of date. So this trick simplifies the world a lot ;-)

Get rid of version grml.02

I frequently get asked about the error:

dpkg-query: warning: parsing file '/var/lib/dpkg/status' near line 5038 package 'linux-image-2.6.33-grml':
 error in Version string 'grml.02': version number does not start with digit
dpkg-query: warning: parsing file '/var/lib/dpkg/status' near line 21699 package 'linux-headers-2.6.33-grml':
 error in Version string 'grml.02': version number does not start with digit
dpkg-query: warning: parsing file '/var/lib/dpkg/status' near line 61359 package 'linux-doc-2.6.33-grml':
 error in Version string 'grml.02': version number does not start with digit

So this article answers all these questions at once.

I don’t know why, but that grml kernel has a version number of grml.02 (other kernel versions are also affected). This version string doesn’t meet the criteria for version numbers because it doesn’t start with a digit, so dpkg is right to warn. The warning is not critical, you can ignore it without any consequences, or you can install a newer kernel. Kernels with corrected version numbers can be found in grml-testing , so add the following to your /etc/apt/sources.list.d/grml.list :

deb     http://deb.grml.org/ grml-testing main

and you’ll find for example the new 2.6.38 kernel. Just for those lazy guys:

aptitude install linux-image-2.6.38-grml linux-headers-2.6.38-grml
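
As an aside, the rule behind the warning can be sketched in a few lines of JavaScript. This is not how dpkg implements it, just a rough illustration under the assumption that only an optional numeric epoch (like 1: ) may precede the first digit. The function name and the version strings other than grml.02 are made up for the example.

```javascript
// Hypothetical helper: does the version start with a digit once an
// optional numeric epoch ("1:") has been removed?
function startsWithDigit(version) {
	const withoutEpoch = version.replace(/^\d+:/, '');
	return /^\d/.test(withoutEpoch);
}

// 'grml.02' is the string from the warning, the others are invented.
for (const version of ['grml.02', '2.6.38-1', '1:2.6.38-1']) {
	console.log(version, startsWithDigit(version) ? 'ok' : 'does not start with digit');
}
```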

So, have fun with your new kernel :-P

Humanizing atan2

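I’m sure every one of you has been driven mad by the return value of the atan2 functions, which lies between -π and π. Here is a fix that returns the angle of the point (x, y) in the range from 0 to 2π instead.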

public double arctan (double x, double y)
{
	double d = Math.atan2 (x, y) % (2 * Math.PI);
	if (d >= 0 && d <= Math.PI / 2)
		return Math.PI / 2 - d;
	else if (d < 0 && d >= -Math.PI)
		return Math.PI / 2 - d;
	else if (d > Math.PI / 2 && d <= Math.PI)
		return 2.5 * Math.PI - d;
	return d;
}

This is Java code, but it is easy to adapt to other languages. And since you are here, a little hint: multiply the result by 180 / Math.PI to get the angle in degrees.
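
For comparison, here is a minimal JavaScript sketch of the same idea, assuming the goal is the angle of the point (x, y) in the range from 0 to 2π. It uses the conventional Math.atan2(y, x) argument order and simply shifts negative results by a full turn, so it is a different route to the same result rather than a translation of the Java code. The function names are chosen for this example.

```javascript
// Angle of the point (x, y) measured counter-clockwise from the positive
// x axis, in the range [0, 2π). Name and signature are illustrative.
function angleOf(x, y) {
	const angle = Math.atan2(y, x); // lies between -π and π
	return angle < 0 ? angle + 2 * Math.PI : angle;
}

// The hint from the text: radians times 180 / π gives degrees.
function toDegrees(radians) {
	return radians * 180 / Math.PI;
}

for (const [x, y] of [[1, 0], [0, 1], [-1, 0], [0, -1]]) {
	console.log(`(${x}, ${y}) ->`, toDegrees(angleOf(x, y)).toFixed(1), 'degrees');
}
```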

Moved to Icinga

I just installed Icinga, it was the right decision!

First of all respect to the Icinga guys, the compatibility with Nagios is great! Moving from Nagios to Icinga is mainly copy and paste. The syntax is the same, the management structure is the same, and you can even use all your previously installed Nagios plugins and the nagios-checker add-on. Well done! Except for the web interface (which looks much more professional) everything feels like Nagios. So I can’t see any reason to stay with Nagios.

Here are some things I had to do:

  • First of all I changed the credentials for the web interface:
  /etc/icinga % mv htpasswd.users htpasswd.users-org
  /etc/icinga % htpasswd -c -s htpasswd.users NewUser
  
  • This new user also needs authorization, so you need to edit /etc/icinga/cgi.cfg and replace icingaadmin with NewUser .
  • The permissions on /var/lib/icinga/rw/ were wrong, www-data wasn’t able to access the directory, so I wasn’t able to schedule manual checks via the web interface. When I changed the permissions everything was fine:
  /etc/icinga % l /var/lib/icinga/rw/
  total 8.0K
  drwx------ 2 nagios www-data 4.0K Apr 18 01:02 .
  drwxr-xr-x 4 nagios nagios   4.0K Apr 18 01:02 ..
  prw-rw---- 1 nagios nagios      0 Apr 18 01:02 icinga.cmd
  /etc/icinga % chmod 750 /var/lib/icinga/rw/
  
  • I added the following to the DirectoryMatch directive of /etc/icinga/apache2.conf , to force myself to use SSL encryption:
  SSLOptions +StrictRequire
  SSLRequireSSL
  
  • I shortened the mail subject of the notifications. By default the subject looks like:
  ** PROBLEM Service Alert: localhost/Aptitude-Updates is WARNING **
  

But I’m just interested in the important parts, so I changed the following in /etc/icinga/commands.cfg :

  [...] /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" [...]
  

to:

  [...] /usr/bin/mail -s "** $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" [...]
  

The notifications now come with a subject like this:

  ** localhost/Aptitude-Updates is WARNING **
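
A short aside on the permission fix from the list above: the listing shows the directory as drwx------ , which corresponds to mode 700, so the group www-data had no access at all. The following JavaScript sketch (the helper modeToString is made up for this example) shows what the octal modes mean.

```javascript
// Hypothetical helper: turn an octal mode like 0o750 into the rwx
// notation that ls prints (without the leading file type character).
function modeToString(mode) {
	const flags = ['r', 'w', 'x'];
	let out = '';
	// Walk over the owner, group and other triplets, highest bits first.
	for (let shift = 6; shift >= 0; shift -= 3) {
		const bits = (mode >> shift) & 0b111;
		flags.forEach((flag, i) => {
			out += (bits & (4 >> i)) ? flag : '-';
		});
	}
	return out;
}

console.log('before:', modeToString(0o700));
console.log('after: ', modeToString(0o750));
```

With 750 the group gets read and execute permission on the directory, which is what the web server needs to reach the command pipe.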
  

All in all I’m glad that I gave Icinga this chance, and I recommend testing it if you are still using Nagios. Maybe I’ll test some further Icinga features and maybe we’ll migrate at the university..

Inspecting Java startups

The developers among you might know that some mechanisms are hooked in when an object is created. Let’s have a look at the order of these processes.

Even beginners should know about constructors. They are called when you create an object of a class, but some things run before that. Here is an example class:

public class Initializing
{
	// static initializer
	static
	{
		System.out.println ("class loaded");
	}
	
	// instance initializer
	{
		System.out.println ("new instance");
	}
	
	// constructor
	public Initializing ()
	{
		System.out.println ("constructor");
	}
	
	public static void main (String [] args)
	{
		System.out.println ("first object");
		new Initializing ();
		System.out.println ("\n\nsecond object");
		new Initializing ();
	}
}

The output is:

class loaded
first object
new instance
constructor


second object
new instance
constructor

As you can see, the static initializer is called first. It is called exactly once, when the class is loaded. Obviously the class has to be loaded before the main () inside it can be executed. The main () then prints a string to indicate the start of that routine and afterwards creates the first object of the type Initializing . This calls the instance initializer before the constructor is executed. The creation of the second object again calls the instance initializer first and then the constructor. That’s the workflow: the first time a certain class is used the static initializer is executed, and each time an object of that class is created the instance initializer is called first and then the constructor. Btw. all of these routines are able to access private members, but notice that the static initializer can only access static fields.

Stupid handycrafts

Today I had to install a new server for some biologists, they want to do some NGS. It took a whole day and all in all we’ll send it back…

These biologists ordered the server without asking us, and I think the salesmen noticed that they don’t have the expertise. The money is provided by the university, so no need to design for efficiency. And that is how it came that they sent the hardware (2 Xeon-DP 5500, 72 GB mem) in a desktop case with a Blu-ray writer, a DVD writer, a GPU with 2.72 TeraFLOPS (afaik they don’t want to use OpenCL) and: NO CHASSIS FANS!! Ouch..

It’s as clear as daylight that this cannot work. Did they think the hot air around the mem (18 slots, each 4GB) leaves the chassis by diffusion?? The processor cooling construction for my Athlon X2 is twice as big as all their fans together…

After setting up a Linux and installing lm-sensors we saw the CPUs running at >75°C, fighting for air. Of course we immediately turned off the hardware! After half an hour we were able to start it again and took a look at the BIOS sensors for the memory. No time to get bored, the temperatures rose to 80°C and more in less than 5 min, see figure 2… Of course time to turn it off again! Don’t want to risk a nuclear meltdown..

They also enclosed a raid controller. I’ve googled that. Less than 100$.. wtf… The controller came with a CD to create a driver disk. But when you boot into the small Linux on the CD it hangs with the message “Searching for CD…”. And there is no driver for us, you are only able to use this controller when you are running Win 2003/XP/Vista or a RHEL or a Suse Enterprise. Other systems are not supported.. Proprietary crap..

What should I say, I’m pissed off. A whole business day is gone for nothing… Just because of some less-than-commodity handycrafts… I strongly recommend asking Micha before plugging such bullshit!

Closer look to Triwave and MSE

After starting to work with the new Synapt G2 HDMS from Waters, a few questions about the working principle of this machine came up. Here I’ll try to explain where the drift time detector is located and how the software can distinguish between fragments produced in the trap cell and in the transfer cell.

As far as I know Waters is the first manufacturer who joined the IMS- and QTOF-technologies to combine all well known benefits from the QTOF instruments plus the advantages of separating ions by their shape and size.

But let’s start at the beginning. Like any other MS instrument the Synapt has an ion source. There are some attractive innovations here as well, but nothing relevant for now, so I won’t go into it. The interesting, magic part starts when the ions pass the quadrupole. To give you a visual impression of my explanations I created an image of the hierarchical ion path:

Assume only one big black ion has entered the machine and found its way to the quadrupole. This ion will now follow the ion path and arrive at the Triwave cell, consisting of a trap cell, the new IMS cell and a transfer cell. Trap and transfer cell are able to fragment the ions, so you can produce fragments before and/or after separation by ion mobility. Producing fragments is nothing new, most of the MS instruments out there are able to do so.

So suppose this big black ion is fragmented in the trap cell into three smaller ions, a blue, a red and a green one:

As the figure indicates, all fragments have different shapes and sizes but at this point share the same velocity. They now enter the magic IMS cell and an IMS cycle starts. This cell is a combination of two chambers. One small chamber at the front of the cell is flooded with helium and operates as a gate. During an IMS cycle no further ions can pass this gate. The main chamber of the IMS cell is filled with nitrogen. The pressures of helium and nitrogen are carefully tuned, such that the nitrogen doesn’t form a counterflow for incoming ions. Here is a smart graphic of both chambers:

These nitrogen molecules represent a barrier, so the passing ions are slowed down while they find their way through this chamber. This is where the awesome trick happens! The bigger the ion’s shape, the bigger the braking force and the slower the ion. A heavy but compact ion might be faster than a lightweight but bulky ion. So the drift time through the IMS cell is independent of the m/z values, it separates the ions by their shape and size. Back to the example: the blue ion is much smaller so it is much faster, in contrast the fat green ion is slowed down a lot by the nitrogen gas:

Hence the blue fragment arrives first at the transfer cell. Here it is fragmented again into smaller components:

Each of the smaller blue ions then reaches the TOF analyzer. Here they have to fly a defined distance, and the time they need is recorded at the TOF detector. Heavier ions are slower, so this is where the m/z values are measured:

The first acquired spectrum is recorded. All blue ions have the same drift time (they were all present when the pusher pushed), but are distinguished by their mass:

At the same time the red ion was able to reach the transfer cell and was fragmented again as well. After passing the transfer cell the fragments reach the pusher, and the next push will make them fly through the flight tube:

All red fragments also have the same drift time, but this time differs from the drift time of the blue ions. Nothing was said about their mass, the m/z isn’t determined before their flight through the TOF analyzer! So it might (and will) happen, that they are lighter than the blue ions. At this point the spectrum will look like this:

The same happens with the green ion: entering the transfer cell, getting fragmented, flying with the next push:

At the end the spectrum produced by the big black ion might look like this:

You see, the drift time is measured without any additional IMS detector! A common MS instrument is just able to record the left m/z spectrum, so if it produces the same seven fragments you are only able to identify five peaks. Since the dark blue and the light red ions have the same mass (they are called isobars) the produced peak is a merge of both ions. Same issue with the dark red and the light green ion. The new IMS technology now enables you to split this peak by the required drift time. Nice, isn’t it?

Ok, so far, back to reality! First of all I have to say the images are not true to scale, I’ve warped the elements for a better representation. The size of the IMS cell is not comparable with the size of the TOF! Fortunately the IMS cell was broken so I was able to look into the machine (figure 1). While the TOF is about 1m the IMS cell is a bit longer than half of a keyboard length, see figure 2. By the way the ions don’t fly a linear way through the TOF analyzer. The Synapt operates with reflectrons and knows two modes: V-TOF (ions are reflected once) and W-TOF (ions are reflected three times).

The energy beam transporting the ions through the IMS cell can be understood as a wave. You can define the wave height and the velocity to effectively separate the ions present. Don’t ask me why they call it height and velocity, and not amplitude and frequency, but whatever ;-) These two parameters are nevertheless very sensitive. If they are not chosen well, an ion might need longer than one IMS cycle to pass the cell, so it enters the transfer cell when the next cycle has already started. I don’t have any empirical knowledge yet, but it seems to be hard to find a good setting.

The complexity of this system of course increases considerably if there isn’t only one big black ion in your machine! So the analysis is not as trivial as my images might suggest. You are also able to enable or disable fragmentation in the trap and transfer cell separately. So your awareness of this process is essential to understand the resulting data.

In reality one IMS cycle takes the time of 200 pushes, but the pusher isn’t synchronized with the IMS gate. What did he say? Time to get confused! If an IMS cycle took exactly the time of 200 pushes, ions that arrive between two pushes (one push of course takes some time) would be lost every time, because they would arrive in the next IMS cycle, exactly 200 pushes later, again between two pushes. This scenario would mean your sensitivity is crap. Theoretically correct, but fortunately we can’t count on our hardware. Even if you tell the pusher to push every 44 μs, the time actually consumed will fluctuate in the real world. So it might need 45 μs for one push and 43.4 μs for the following one. An IMS cycle, in contrast, will always take 44*200=8800 μs, independent of the real time the pusher needs for 200 pushes. So if an IMS cycle starts exactly with a push, the next cycle will probably start between two pushes, and ions that weren’t able to catch a push last time might now get pushed.
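
The timing argument can be played through with a toy model in JavaScript. This is only a sketch under simplifying assumptions: the push durations strictly alternate between the 45 μs and 43.4 μs mentioned above (real jitter is of course irregular), and all names are invented for the example.

```javascript
// Nominal values from the text: one push every 44 µs, 200 pushes per cycle.
const NOMINAL_PUSH_US = 44;
const PUSHES_PER_CYCLE = 200;

// The IMS cycle length is fixed, whatever the pusher really does.
function imsCycleLengthUs(pushUs, pushesPerCycle) {
	return pushUs * pushesPerCycle;
}

// Given the real durations of successive pushes, return how many µs the
// moment timeUs lies behind the start of the most recent push.
function offsetFromLastPush(pushDurationsUs, timeUs) {
	let pushStart = 0;
	for (const duration of pushDurationsUs) {
		if (pushStart + duration > timeUs) break;
		pushStart += duration;
	}
	return timeUs - pushStart;
}

const cycleUs = imsCycleLengthUs(NOMINAL_PUSH_US, PUSHES_PER_CYCLE);
const idealPusher = Array(PUSHES_PER_CYCLE).fill(NOMINAL_PUSH_US);
const jitteryPusher = Array.from({ length: PUSHES_PER_CYCLE }, (_, i) => (i % 2 === 0 ? 45 : 43.4));

console.log('cycle length in µs:', cycleUs);
console.log('ideal pusher, offset at next cycle start:', offsetFromLastPush(idealPusher, cycleUs));
console.log('jittery pusher, offset at next cycle start:', offsetFromLastPush(jitteryPusher, cycleUs).toFixed(1));
```

With the ideal pusher the second cycle starts exactly on a push again, so the offset is zero. With the jittery pusher the second cycle starts a few microseconds into a push, which is exactly the shift that gives previously missed ions a chance.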

All in all you have to agree that this is an absolutely great invention. If you are interested in further information Waters provides some videos to visualize the IMS technology, and here you can find some smart pictures of the Triwave system in a Synapt.

Comparison of compression

I recently wrote an email with an attached LZMA archive. It was immediately answered with something like:

What are you doing? I had to boot linux to open the file!

First of all I don’t care whether users of proprietary systems are able to read open formats, but this answer made me curious about the differences between some compression mechanisms regarding compression ratio and time. So I had to test it!

This is nothing scientific! I just took the standard parameters, you might optimize each method on its own to save more space or time. Just have a look at the parameters -1..-9 of zip. But all in all this might give you a feeling for the methods.

Candidates

I’ve chosen some usual compression methods, here is a short digest (more or less copy&paste from the man pages):

  • gzip: uses Lempel-Ziv coding (LZ77), cmd: tar czf $1.pack.tar.gz $1
  • bzip2: uses the Burrows-Wheeler block sorting text compression algorithm and Huffman coding, cmd: tar cjf $1.pack.tar.bz2 $1
  • zip: analogous to a combination of the Unix commands tar(1) and compress(1) and is compatible with PKZIP (Phil Katz’s ZIP for MSDOS systems), cmd: zip -r $1.pack.zip $1
  • rar: proprietary archive file format, cmd: rar a $1.pack.rar $1
  • lha: based on Lempel-Ziv-Storer-Szymanski-Algorithm (LZSS) and Huffman coding, cmd: lha a $1.pack.lha $1
  • lzma: Lempel-Ziv-Markov chain algorithm, cmd: tar --lzma -cf $1.pack.tar.lzma $1
  • lzop: similar to gzip but favors speed over compression ratio, cmd: tar --lzop -cf $1.pack.tar.lzop $1

All times are user times, measured by the unix time command. To visualize the results I plotted them using R, compression efficiency on the X axis vs. time on the Y axis. The best results are of course located near the origin.
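
The % of original value reported for each result is simply the compressed size divided by the original size. A small JavaScript sketch, with a function name chosen for this example and the gzip result of the binaries data set as input:

```javascript
// Compressed size as a percentage of the original size.
function percentOfOriginal(compressedBytes, originalBytes) {
	return (compressedBytes / originalBytes) * 100;
}

// gzip on the binaries data set: 161.999.804 of 176.753.125 bytes.
const gzipBinaries = percentOfOriginal(161999804, 176753125);
console.log('gzip, binaries:', gzipBinaries.toFixed(2), '% of original');
```

Values above 100 mean that the archive is bigger than the input, which happens with data that is already compressed or encrypted.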

Data

To test the different algorithms I collected different types of data, so one might choose a method depending on the file types.

Binaries

The first category is called binaries. A collection of files in human-not-readable format. I copied all files from /bin and /usr/bin , created a gpg encrypted file of a big document and added a copy of grml64-small_2010.12.iso. All in all 176.753.125 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    161.999.804               91.65          10.18
bzip2   161.634.685               91.45          71.76
zip     179.273.428              101.43          13.51
rar     175.085.411               99.06         156.46
lha     180.357.628              102.04          35.82
lzma    157.031.052               88.84         129.22
lzop    165.533.609               93.65           4.16

Media

This is a bunch of media files: some audio data like Martin Luther King’s I Have a Dream speech and some music, plus video files like The Free Software Song and Clinton’s I did not have sexual relations with that woman. I attached importance to different formats, so here are audio files of the types ogg, mp3, mid, ram, smil and wav, and video files like avi, ogv and mp4. Altogether 95.393.277 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    88.454.002               92.73           6.04
bzip2   87.855.906               92.10          37.82
zip     88.453.926               92.73           6.17
rar     87.917.406               92.16          70.69
lha     88.885.325               93.18          14.22
lzma    87.564.032               91.79          74.76
lzop    90.691.764               95.07           2.28

Office

The next category is office. Here are some PDFs from different journals and office files from LibreOffice and Microsoft’s Office (special thanks to @chschmelzer for providing MS files). The complete size of these files is 10.168.755 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    8.091.876                79.58          0.55
bzip2   8.175.629                80.40          8.58
zip     8.092.682                79.58          0.54
rar     7.880.715                77.50          3.72
lha     8.236.422                81.00          3.29
lzma    7.802.416                76.73          5.62
lzop    8.358.343                82.20          0.21

Pictures

To test the compression of pictures I downloaded 10 files of each of the formats bmp, eps, gif, jpg, png, svg and tif. These are the first ones I found with Google’s image search. In total 29.417.414 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    20.685.809               70.32           1.65
bzip2   18.523.091               62.97          10.71
zip     20.668.602               70.26           1.72
rar     18.052.688               61.37           8.58
lha     20.927.949               71.14           5.97
lzma    18.310.032               62.24          21.09
lzop    23.489.611               79.85           0.57

Plain

This is the main category. As you know, ASCII content is not stored very space-efficiently. Here the tools can riot! I downloaded some books from Project Gutenberg, for example Jules Verne’s Around the World in 80 Days and Homer’s The Odyssey, source code of moon-buggy and OpenLDAP, and copied all text files from /var/log . Altogether 40.040.854 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    11.363.931               28.38           1.88
bzip2    9.615.929               24.02          13.63
zip     12.986.153               32.43           1.6
rar     11.942.201               29.83           8.68
lha     13.067.746               32.64           8.86
lzma     8.562.968               21.39          30.21
lzop    15.384.624               38.42           0.38

Rand

This category is just to test random generators. Compressing random content shouldn’t decrease the size of the files. Here I used two files from random.org and dumped some bytes from /dev/urandom. 4.198.400 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    4.195.646                 99.93         0.23
bzip2   4.213.356                100.36         1.83
zip     4.195.758                 99.94         0.2
rar     4.205.389                100.17         1.65
lha     4.194.566                 99.91         2.04
lzma    4.197.256                 99.97         1.98
lzop    4.197.134                 99.97         0.1

Everything

All files of the previous categories compressed together. Since the categories aren’t of the same size it is of course not really fair. Nevertheless it might be interesting. All files together require 355.971.825 Bytes.

Method  Compressed size (bytes)  % of original  Time in s
gzip    294.793.255              82.81           20.43
bzip2   290.093.007              81.49          141.89
zip     313.670.439              88.12           23.78
rar     305.083.648              85.70          246.63
lha     315.669.631              88.68           64.81
lzma    283.475.568              79.63          258.05
lzop    307.644.076              86.42            7.89

Conclusion

As you can see, the violet lzma dot is always located on the left side, meaning very good compression. But unfortunately it’s also always at the top, so it’s very slow. But if you want to compress files to send them via mail you won’t care about longer compression times, the file size might be the crucial factor. On the other hand black, green and grey (gzip, zip and lzop) are often found at the bottom of the plots, so they are faster but don’t decrease the size that effectively.

All in all you have to choose the method on your own. Also think about compatibility, not everybody is able to unpack lzma or lzop.. My upshot is to use lzma if I want to transfer data through networks and for attachments to advanced people, and to use gzip for everything else like backups of configs or mails to Windows users.