Mount multiple subvolumes of a LUKS encrypted BTRFS through pam_mount

Some days ago, @daftaupe@mamot.fr convinced me on Mastodon to give BTRFS a try. That’s been on my list for some time already, and now that I need to switch PCs at work I’m going for it. However, this post wouldn’t exist if everything had gone straightforward.. ;-)

The Scenario

I have a 1TB SSD that I want to encrypt. It should automatically get decrypted and mounted to certain places when I log in. pam_mount can do that for you, and I’ve already been using it a lot in different scenarios. However, with BTRFS it’s a bit different. With other file systems you would create a partition on the hard drive, which is then LUKS encrypted. This has the drawback that you need to decide on the partition’s size beforehand!

With BTRFS you can just encrypt the whole drive and use so-called subvolumes on top of it. Thus, you’re a bit more flexible: you can create and adjust quotas as required at any point in time (if at all…), and the subvolumes are not visible unless the device is decrypted.

Let’s have a look into that and create the scenario. I assume that the SSD is available as /dev/sdb. Then we can create an encrypted container using LUKS:

root@srv ~ # cryptsetup -y -v --cipher aes-xts-plain64 --key-size 256 --hash sha256 luksFormat /dev/sdb

WARNING!
========
This will overwrite data on /dev/sdb irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase for /dev/sdb: ****
Verify passphrase: ****
Key slot 0 created.
Command successful.

Not sure which cipher or key size to choose? Just run cryptsetup benchmark to see which settings perform best for you. My CPU, for example, comes with hardware support for AES, so the AES ciphers show a significantly higher throughput. If you’re still feeling uncomfortable with that step, I recommend reading the comprehensive article on dm-crypt/Device encryption in the Arch Linux wiki.
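
For example, just run the benchmark without any arguments and look at the aes-xts lines (the actual numbers depend on your hardware, so I’m omitting my output here):

root@srv ~ # cryptsetup benchmark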

We can now open the encrypted device using

root@srv ~ # cryptsetup luksOpen /dev/sdb mydrive
Enter passphrase for /dev/sdb: ****

This will create a node in /dev/mapper/mydrive, which represents the decrypted device.

Next, we’ll create a BTRFS on that device:

root@srv ~ # mkfs.btrfs /dev/mapper/mydrive
btrfs-progs v4.17
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want to force metadata duplication.
Label:              home
UUID:               d1e1e1f9-7273-4b29-ae43-4b9ca411c2ba
Node size:          16384
Sector size:        4096
Filesystem size:    931.51GiB
Block group profiles:
Data:             single            8.00MiB
Metadata:         single            8.00MiB
System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
ID        SIZE  PATH
1   931.51GiB  /dev/mapper/mydrive

That’s indeed super fast, isn’t it!? I couldn’t believe it either.. ;-)

We can now mount the device, for example to /mnt/mountain:

root@srv ~ # mount /dev/mapper/mydrive /mnt/mountain
root@srv ~ # cd /mnt/mountain

So far, the file system is completely empty. But as it’s a BTRFS, we can create some subvolumes. Let’s say we want to create a volume for our $HOME, and as we’re developing this website, we also want to create a volume called www:

root@srv /mnt/mountain # btrfs subvolume create home
Create subvolume './home'

root@srv /mnt/mountain # btrfs subvolume create www
Create subvolume './www'

root@srv /mnt/mountain # btrfs subvolume list .
ID 258 gen 21 top level 5 path home
ID 259 gen 22 top level 5 path www

So we have two subvolumes in that file system: home (id 258) and www (id 259). We could now mount them with

root@srv ~ # mount -o subvol=/home /dev/mapper/mydrive  /home/user
root@srv ~ # mount -o subvol=/www  /dev/mapper/mydrive  /var/www

But we want the system to do that automatically for us when we log in.

So unmount everything and close the LUKS container:

root@srv ~ # umount /mnt/mountain /home/user /var/www
root@srv ~ # cryptsetup luksClose mydrive

PamMount can Decrypt and Mount Automatically

I’ve been using pam_mount for ages! It is super convenient. To get your home automatically decrypted and mounted, you would just need to add the following lines to your /etc/security/pam_mount.conf.xml:

<volume path="/dev/disk/by-uuid/a1b20e2f-049c-4e5f-89be-2fc0fa3dd564" user="YOU"
        mountpoint="/home/user" options="defaults,noatime,compress,subvol=/home" />

<volume path="/dev/disk/by-uuid/a1b20e2f-049c-4e5f-89be-2fc0fa3dd564" user="YOU"
        mountpoint="/var/www" options="defaults,noatime,compress,subvol=/www" />

Given this, PAM tries to mount the respective subvolumes of the disk (identified by the UUID a1b20e2f-049c-...) to /home/user and /var/www as soon as YOU logs in.

Here, I am using UUIDs to identify the disks. You can still use /dev/sdb (or similar), but there is a chance that the disks are recognised in a different order at the next boot (and /dev/sdb may become /dev/sdc or something…). Plus, the UUID is invariant to the system – you can put the disk into any other machine and it will have the same UUID.

To find the UUID of your disk you can use blkid:

root@srv ~ # blkid
[...]
/dev/sdb: UUID="a1b20e2f-049c-4e5f-89be-2fc0fa3dd564" TYPE="crypto_LUKS"
[...]
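
For completeness: pam_mount only kicks in if it is referenced from the PAM stack. Debian’s libpam-mount package takes care of that automatically; done manually, the relevant lines (a sketch for Debian-style /etc/pam.d/common-auth and common-session) would look like this:

auth     optional   pam_mount.so
session  optional   pam_mount.so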

The Problem

As said above, with BTRFS you’ll have your partitions (called subvolumes) right in the filesystem – invisible unless decrypted. So, what is PAM doing? It discovers the first entry in the pam_mount.conf.xml configuration, which basically says

mount a1b20e2f-049c-... with some extra options to /home/user when YOU logs in

PAM is also smart enough to understand that a1b20e2f-049c-... is a LUKS-encrypted device, and it decrypts it using your login password. This will then create a node at /dev/mapper/_dev_sdb, representing the decrypted device. And eventually, PAM mounts /dev/mapper/_dev_sdb to /home/user. So far, so good.

But as soon as PAM discovers the second entry, it tries to do the same! Again it detects a LUKS device and tries to decrypt it. But unfortunately, there already is a /dev/mapper/_dev_sdb!? Thus, opening the LUKS drive fails and you’ll find something like this in your /var/log/auth.log:

(mount.c:72): Messages from underlying mount program:
(mount.c:76): crypt_activate_by_passphrase: File exists
(pam_mount.c:522): mount of /dev/disk/by-uuid/a1b20e2f-049c-... failed

At first it seems annoying that it doesn’t work out of the box, but at least it sounds reasonable that PAM cannot do what you want it to do..

The Solution

… is quite easy, even though it took me a while to figure things out…

As soon as the first subvolume is mounted (and the device is decrypted and available through /dev/mapper/_dev_sdb), we have direct access to the file system! Thus, we do not need to tell PAM to mount /dev/disk/by-uuid/a1b20e2f-049c-..., but we can use /dev/mapper/_dev_sdb. Or even better, we can use the file system’s UUID, to stay independent of the sdb naming. If you run blkid while the device is decrypted, you’ll find an entry like this:

root@srv ~ # blkid
[...]
/dev/sdb: UUID="a1b20e2f-049c-..." TYPE="crypto_LUKS"
/dev/mapper/_dev_sdb: UUID="d1e1e1f9-7273-..." UUID_SUB="..." TYPE="btrfs"
[...]

You see, the new node /dev/mapper/_dev_sdb also carries a UUID, which actually represents the BTRFS :)
This UUID was, by the way, also reported by the mkfs.btrfs call above.

What does that mean for our setup? For the first subvolume of an encrypted drive we need to use the UUID of the parent LUKS container. For every subsequent subvolume we can use the UUID of the file system inside.

Transferred to the above scenario, we’d create a /etc/security/pam_mount.conf.xml like that:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE pam_mount SYSTEM "pam_mount.conf.xml.dtd">
<pam_mount>

  <volume path="/dev/disk/by-uuid/a1b20e2f-049c-4e5f-89be-2fc0fa3dd564" user="YOU"
          mountpoint="/home/user" options="defaults,noatime,subvol=/home" />

  <volume path="/dev/disk/by-uuid/d1e1e1f9-7273-4b29-ae43-4b9ca411c2ba" user="YOU"
          mountpoint="/var/www" options="defaults,noatime,subvol=/www" />

  <mkmountpoint enable="1" remove="true" />

</pam_mount>

Note the different UUIDs? Even though both mounts originate from the same FS :)

Open Problems

Actually, I wanted to have my home on a RAID of two devices, but I don’t know how to tell pam_mount to decrypt two devices so that BTRFS can handle the RAID itself..? The only option seems to be using mdadm to create the RAID, but then BTRFS just sees a single device and, therefore, cannot do its extra RAID magic.

If anyone has an idea on that issue, I’m all ears :)

Thunderbird 60+ is missing calendars

Lightning is a calendar plugin for Thunderbird.

I’m running Thunderbird to read emails on my desktops. And I’m using the Lightning plugin to manage calendars, events, and tasks.

However, since I updated to Thunderbird 60 some weeks ago, Lightning strangely seems to be broken. The Add-ons manager still lists Lightning as properly installed, but the “Events and Tasks” menu is missing, as well as the calendar/tasks tabs and the calendar settings in the preferences. As I’ve been pretty busy with many other things, I didn’t study the problem - hoping that the bug would get fixed in the meantime - but living without the calendar addon is cumbersome. And today it became annoying enough to make me investigate…

There seem to be various issues with calendars in the new Thunderbird version: Mozilla provides an extensive support page dedicated to this topic. Sadly, none of these helped in my case..

I then made sure that the versions of Thunderbird and Lightning are compatible (both are 1:60.0-3~deb9u1 for me):

$ dpkg -l thunderbird
ii  thunderbird       1:60.0-3~deb9u1     amd64     mail/news client with RSS, chat [...]
$ dpkg -l lightning 
ii  lightning         1:60.0-3~deb9u1     all       Calendar Extension for Thunderbird

Eventually, I stumbled upon a thread in the German Debian forums: Thunderbird 60 - Lightning funktioniert nicht (“Lightning does not work”). They figured out that it may be caused by missing language packs for Lightning… Indeed, I have language packs for Thunderbird installed (de and en-gb) that are not installed for Lightning:

$ dpkg -l| egrep "thunderbird|lightning"
ii  lightning                1:60.0-3~deb9u1
ii  thunderbird              1:60.0-3~deb9u1
ii  thunderbird-l10n-de      1:60.0-3~deb9u1
ii  thunderbird-l10n-en-gb   1:60.0-3~deb9u1

And it turns out that this was the problem! Thunderbird apparently won’t run Lightning unless all required language packs are installed. After installing the missing language packs (aptitude install lightning-l10n-de lightning-l10n-en-gb), the extension is fully working again in Thunderbird! How unsatisfactory…

All that may be caused by a missing dependency..? Even though thunderbird recommends lightning, thunderbird-l10n-de (and similar) do not recommend lightning-l10n-de. Not exactly sure how, but maybe the dependencies should be remodelled…?

Native SSH server on LineageOS

I finally trashed my shitty Shift5.2 and got a spare OnePlus One from a good colleague.

tldr: scroll down to Setup of SSH on LineageOS.

I strongly discourage everyone from buying a ShiftPhone. The phone was/is on an Android patch level from 2017-03-05 – which is one and a half years ago! Not to mention that it was running Android 5.1.1 in 2018… With so many bugs and security issues, in my opinion this phone is a danger to the community! And nobody at Shift seemed to really care…

However, I now have a OnePlus One, which is supported by LineageOS - the successor of CyanogenMod. So, first action was installing LineageOS. Immediately followed by installing SU to get root access.

Next, I’d like to have SSH access to the phone. I did love the native SSH server on my Galaxy S2, which used to run CyanogenMod for 5+ years. Using the SSH access I was able to integrate it in my backup infrastructure and it was much easier to quickly copy stuff from the phone w/o a cable :)

The original webpage with a how-to for installing SSH on CyanogenMod has unfortunately vanished. There is a copy available from the WayBackMachine (thanks a lot, guys!!). I still thought dumping an up-to-date step-by-step instruction here may be a good idea :)

Setup of SSH on LineageOS

The setup of the native SSH server on LineageOS seems to be pretty similar to the CyanogenMod version. First you need a shell on the phone, e.g. through adb, and become root (su). Then just follow these three steps:

Create SSH daemon configuration

You do not need to create a configuration file from scratch; you can use /system/etc/ssh/sshd_config as a template. Just copy the configuration file to /data/ssh/sshd_config:

cp /system/etc/ssh/sshd_config /data/ssh/sshd_config

Just make sure you set the following things:

  • PermitRootLogin without-password
  • PubkeyAuthentication yes
  • PermitEmptyPasswords no
  • ChallengeResponseAuthentication no
  • Subsystem sftp internal-sftp

Setup SSH keys

We’ll be using SSH-keys to authenticate to the phone. If you don’t know what SSH keys are, or how to create them, you may go to an article that I wrote in 2009 (!!) or use an online search engine.

First, we need to create /data/.ssh on the phone (note the .!) and give it to the shell user:

mkdir -p /data/.ssh
chmod 700 /data/.ssh
chown shell:shell /data/.ssh

Second, we need to store our public SSH key (probably stored in ~/.ssh/id_rsa.pub on your local machine) in /data/.ssh/authorized_keys on the phone. If that file already exists, just append your public key on a new line. Afterwards, hand the authorized_keys file over to the shell user:

chmod 600 /data/.ssh/authorized_keys
chown shell:shell /data/.ssh/authorized_keys
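
One way to get your public key onto the phone in the first place is adb (just a sketch; the key path is an assumption):

# on your computer: push the public key to the phone
adb push ~/.ssh/id_rsa.pub /sdcard/id_rsa.pub
# in a root shell on the phone: append it to the authorized_keys
cat /sdcard/id_rsa.pub >> /data/.ssh/authorized_keys
rm /sdcard/id_rsa.pub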

Create a start script

Last but not least, we need a script to start the SSH service. There is again a template available in /system/bin/start-ssh. Just copy the script to /data/local/userinit.d/:

mkdir /data/local/userinit.d/
cp /system/bin/start-ssh /data/local/userinit.d/99sshd
chmod 755 /data/local/userinit.d/99sshd

Finally, we just need to update the location of the sshd_config to /data/ssh/sshd_config in our newly created /data/local/userinit.d/99sshd script (in the template it points to /system/etc/ssh/sshd_config; there are 2 occurrences: for running the daemon w/ and w/o debugging).
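
If your ROM’s sed supports in-place editing, something like this should do the trick (otherwise just edit the two lines manually):

sed -i 's|/system/etc/ssh/sshd_config|/data/ssh/sshd_config|g' /data/local/userinit.d/99sshd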

That’s it

You can now run /data/local/userinit.d/99sshd and the SSH server should be up and running :)

Earlier versions of Android/CyanogenMod auto-started the scripts stored in /data/local/userinit.d/ right after boot, but this feature was removed with CM12.. Thus, at the moment it is not that easy to automatically start the SSH server after a reboot of your phone. But having the SSH daemon running all the time may also be a bad idea, in terms of security and battery…

Regain RSS feeds for the University of Rostock

RSS feeds for uni-rostock.de

I’m consuming quite some input from the internet every day. A substantial amount of information arrives through podcasts, but much more essential are the 300+ RSS feeds that I’m subscribed to. I love RSS, it’s one of the best inventions in the world wide web!

However, there are alarming rumors and activities trying to get rid of RSS… We should probably all get our news filtered by Facebook or something..!? The importance of RSS, which allows users to keep track of updates on many different websites, seems to be continuously ignored.. And so does our University’s new website, which doesn’t provide official RSS feeds anymore :(

Apparently, many people have already been asking for RSS feeds of the University’s webpage. At least that’s what they told me when I asked… But the company that built the pages won’t integrate RSS anymore - it probably wasn’t listed in the requirements.. And the University wouldn’t touch the expensive website.

“Fortunately,” they stayed with Typo3 as the CMS, which we’ve been using as well - before we decided to switch. And this Typo3 platform can output the page’s content as an RSS feed out of the box; you just need to know how! ;-)

And… I’ll tell you: Just append ?type=9818 to the URL. That’s it! Really. It’s so easy.

Here are a few examples:
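
The paths below are just placeholders to illustrate the pattern:

# some news page of the University (placeholder path)
https://www.uni-rostock.de/some/news/page/
# the very same page as an RSS feed
https://www.uni-rostock.de/some/news/page/?type=9818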

Sure, it doesn’t work everywhere. If the editors maintain news as static HTML pages, Typo3 fails to export a proper RSS feed. It’s still better than nothing. And maybe it helps a few people…

The RSS icon was adapted from commons:Generic Feed-icon.svg.

Proper Search Engine for a Static Website powered by DuckDuckGo (and similar)

Static websites are great and popular, see for example Brunch, Hexo, Hugo, Jekyll, Octopress, Pelican, and more. They are easy to maintain and their performance is hard to beat. But… as they are static, they cannot dynamically handle user input, which is an obvious requirement for every search engine.

Outsource the task

Lucky for us, there are already others doing the search stuff pretty convincingly. So it’s only plausible not to reinvent the wheel but to make use of their services instead. There are a number of search engines, e.g. Baidu, Bing, Dogpile, Ecosia, Google, StartPage, Yahoo, Yippy, and more (list sorted alphabetically, see also Wikipedia::List of search engines). They all have pros and cons, but typically it boils down to a trade-off between coverage, up-to-dateness, monopoly, and privacy. You probably have your favourite, too. However, it doesn’t really matter: while this guide focuses on DuckDuckGo, the proposed solution is basically applicable to all search engines.

Theory

The idea is that you add a search form to your website, but do not handle the request yourself; instead you redirect to an endpoint of a public search engine. All the search engines have some way to provide the search phrase encoded in the URL. Typically, the search phrase is stored in the GET variable q, for example example.org/?q=something would search for something at example.org. Thus, your form would redirect to example.org/?q=.... However, that would of course start a search for the given phrase on the whole internet! Instead, you probably want to restrict the search results to pages from your domain.

Fortunately, the search engines typically also provide means to limit search results to a domain or similar. In the case of DuckDuckGo it is the site: operator, see also DuckDuckGo’s syntax. That is, for my blog I’d prefix the search phrase with site:binfalse.de.
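
For example, a search for docker on this blog effectively becomes the query:

site:binfalse.de docker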

Technical realisation

Implementing the workaround is no magic, even though you need to touch your webserver’s configuration.

The first thing you need to do is add a search form to your website. That form may look like this:

<form action="/search" method="get">
     <input name="q" type="text" />
     <button type="submit">Search</button>
</form>

As you see, the form just consists of a text field and a submit button. The data will be submitted to /search on your website.

Sure, /search doesn’t exist on your website (if it does, you need to use a different endpoint), but we’ll configure your web server to do the remaining work. The web server needs to do two things: (1) prefix the phrase with site:your.domain and (2) redirect the user to the search engine of your choice. Depending on the web server you’re using, the configuration of course differs. My Nginx configuration, for example, looks like this:

location ~ ^/search {
    return 302 https://duckduckgo.com/?q=site%3Abinfalse.de+$arg_q;
}

So it sends the user to duckduckgo.com, with site:binfalse.de prepended to the submitted search phrase ($arg_q is the q variable of the original GET request). If you’re running an Apache web server, you probably know how to achieve the same over there. Otherwise it’s a good opportunity to look into the manual again ;-)
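
For the sake of completeness, a rough Apache equivalent could look like the following (a sketch using mod_rewrite, untested):

RewriteEngine On
# grab the q parameter from the query string ...
RewriteCond %{QUERY_STRING} (?:^|&)q=([^&]+)
# ... and redirect to DuckDuckGo, restricted to this domain
RewriteRule ^/?search$ https://duckduckgo.com/?q=site%3Abinfalse.de+%1 [R=302,L,NE]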

Furthermore, the results pages of DuckDuckGo can be customised to look more like your site. You just need to send a few more URL parameters with the query, such as kj for the header color or k7 for the background color. The full list of available configuration options is available at DuckDuckGo settings via URL parameters.
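
For instance, the Nginx rule from above could pass a header and a background color along like this (the color values are just examples):

location ~ ^/search {
    return 302 "https://duckduckgo.com/?q=site%3Abinfalse.de+$arg_q&kj=%23990000&k7=%23fafafa";
}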

In conclusion, if you use my search form to search for docker, you’ll be guided to https://binfalse.de/search?q=docker. The Nginx delivering my website will then redirect you to https://duckduckgo.com/?q=site%3Abinfalse.de+docker, try it yourself: search for docker!

This of course also works for dynamic websites with WordPress, Contao or similar…

Run Baïkal through Docker

Baïkal is a quite popular Calendar+Contacts server. It supports CalDAV as well as CardDAV.

I’ve been using it for my calendars and address books for more than 4 years now. However, I initially installed it as a plain PHP application with a MySQL database. The developers announced quite early that they were working on a Docker image, but there is nothing useful as of mid-2018. So far they just provide a quite inconvenient how-to and a list of issues that apparently prevent them from providing a proper Docker image. Thus, I just dockerised the application myself :)

The Docker image

Actually, creating a Docker image for Baïkal was super easy. In the end, it is “only” a PHP application ;-) The corresponding Dockerfile can be found in the root directory of Baïkal’s git repository (at least in my fork). The latest version at the time of writing is:

FROM php:apache
MAINTAINER martin scharm <https://binfalse.de/contact>

# we're working from /var/www, not /var/www/html
# the html directory will come with baikal
WORKDIR /var/www

# install tools necessary for the setup
RUN apt-get update \
 && apt-get install -y -q --no-install-recommends \
    unzip \
    git \
    libjpeg62-turbo \
    libjpeg62-turbo-dev \
    libpng-dev \
    libfreetype6-dev \
    ssmtp \
 && apt-get clean \
 && rm -r /var/lib/apt/lists/* \
 && a2enmod expires headers

# for mail configuration see https://binfalse.de/2016/11/25/mail-support-for-docker-s-php-fpm/


# install php db extensions
RUN docker-php-source extract \
 && docker-php-ext-configure gd --with-freetype-dir=/usr/include/ --with-jpeg-dir=/usr/include/ \
 && docker-php-ext-install -j$(nproc) pdo pdo_mysql \
 && docker-php-source delete

# install composer
RUN php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');" \
 && php -r "if (hash_file('SHA384', 'composer-setup.php') === '544e09ee996cdf60ece3804abc52599c22b1f40f4323403c44d44fdfdd586475ca9813a858088ffbc1f233e9b180f061') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;" \
 && mkdir -p composer/packages \
 && php composer-setup.php --install-dir=composer \
 && php -r "unlink('composer-setup.php');" \
 && chown -R www-data: composer


# prepare destination
RUN rm -rf /var/www/html && chown www-data /var/www/
ADD composer.json /var/www/
ADD Core /var/www/Core/
ADD html /var/www/html/

# install dependencies etc
USER www-data
RUN composer/composer.phar install


USER root

# the Specific dir is supposed to come from some persistent storage
VOLUME /var/www/Specific

So, it basically

  • installs some dependencies through apt-get,
  • installs the PDO-MySQL extension,
  • installs composer,
  • adds the Baikal sources into the image,
  • and finally installs remaining Baikal dependencies through composer.

I distribute the image as binfalse/baikal.

Using the Docker image

Using the image is fairly simple. Basically, you only need to mount some persistent space to /var/www/Specific:

docker run -it --rm -p 80:80 -v /path/to/persistent:/var/www/Specific binfalse/baikal

To start with, you can use the original Specific directory from the Baïkal repository. Then head to your Baikal instance (which will probably redirect to BASEURL/admin/install) and set up your server. All configuration will be stored in the mounted volume at /path/to/persistent.

SSL

To support encrypted connections you would need to mount the certificates as well as a modified Apache configuration into the container. However, I recommend running it behind a reverse proxy, such as binfalse/nginx-proxy, and letting the proxy handle all SSL connections (as for all other containers). This way, you just need one proper SSL configuration.

MySQL

The default SQLite database is perfect for a first test, but it is slow and only allows for a limited number of SQL variables. If you, for example, have more than 999 contacts, the first sync of a clean WebDAV device will result in an exception such as:

PDOException: SQLSTATE[HY000]: General error: 1 too many SQL variables

Thus, for production you may want to switch to a proper database, such as MariaDB. Lucky you, the Docker image supports MySQL! ;-)

To reproducibly assemble both containers, I recommend Docker-Compose. Here is a sample config with two containers baikal and baikal-db:

version: '2'
services:
    baikal:
        restart: always
        image: binfalse/baikal
        container_name: baikal
        volumes:
            - /srv/baikal/config:/var/www/Specific
        links:
            - baikal-db
    baikal-db:
        restart: always
        image: mariadb
        container_name: baikal-db
        volumes:
            - /srv/baikal/database:/var/lib/mysql
        environment:
            MYSQL_ROOT_PASSWORD: roots-difficult-password
            MYSQL_DATABASE: baikal
            MYSQL_USER: baikal
            MYSQL_PASSWORD: baikals-difficult-password

This assumes that your Baikal configuration can be found in /srv/baikal/config. The database will be stored in /srv/baikal/database. Also note the database credentials for configuring Baikal. If you’re not running a reverse proxy in front of the application, you also need to add some port forwarding for the baikal container:

version: '2'
services:
    baikal:
        restart: always
        image: binfalse/baikal
        [...]
        ports:
            - "80:80"
            - "443:443"
        [...]

Mail support

I’m not sure why, but Baikal’s list of issues included support for mail. However, adding mail support should also be fairly easy if needed. I already wrote a How-To for PHP-mail in Docker.

Logging with Docker

In a typical Docker environment you’ll have plenty of containers (probably in multiple networks?) on the same machine. Let’s assume you need to debug a problem with one of the containers, e.g. because it doesn’t send mails anymore.. What would you do? Correct, you’d go and check the logs.

By default, Docker logs the messages of every container into a JSON file. On a Debian-based system you’ll probably find the file at /var/lib/docker/containers/CONTAINERID/CONTAINERID-json.log. However, to properly look into the logs you would use Docker’s logs tool. This will print the logs, just as you would expect cat to dump the logs in /var/log. docker logs can also filter for time spans using --since and --until, and it is able to emulate a tail -f with --follow.
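
For example, to follow everything a container logged during the last hour (the container name is of course just an example):

docker logs --since 1h --follow my-container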

However, the logs are only available for existing containers. That means, if you recreate the application (i.e. you recreate the container), you’ll typically lose the log history… If your workflow includes --rm, you will immediately trash the log of a container when it’s stopped. Fortunately, Docker provides other logging drivers, to e.g. log to AWS, fluentd, GCP, or good old syslog! :)

Here I’ll show how to use the host’s syslog to manage the logs of your containers.

Log to Syslog

Telling Docker to log to the host’s syslog is really easy. You just need to use the built-in syslog driver:

docker run --log-driver syslog [other options etc]

Voilà, the container will log to the syslog and you’ll probably find the messages in /var/log/syslog. Here is an example of an Nginx container that I just started to serve my blog on my laptop:

Feb 21 16:06:32 freibeuter af6dcace59a9[5606]: 172.17.0.1 - - [21/Feb/2018:15:06:32 +0000] "GET /2018/02/21/logging-with-docker/ HTTP/1.1" 304 13333 "http://localhost:81/" "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0" "-"

By default, the syslog driver uses the container’s ID as the syslog tag (here it is af6dcace59a9), but you can further configure the logging driver and, for example, set a proper syslog tag:

docker run --log-driver syslog --log-opt tag=binfalse-blog [other options etc]

This way, it is easier to distinguish between messages from different containers and to track the logs of an application even if the container gets recreated:

Feb 21 16:11:16 freibeuter binfalse-blog[5606]: 172.17.0.1 - - [21/Feb/2018:15:11:16 +0000] "GET /2018/02/21/logging-with-docker/ HTTP/1.1" 200 13333 "http://localhost:81/" "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0" "-"

If you’re using Docker Compose, you can use the logging keyword to configure logging:

version: '2'
services:
  website:
    restart: unless-stopped
    image: nginx
    container_name: website
    volumes:
      - /srv/web/default/:/usr/share/nginx/html
    logging:
      driver: syslog
      options:
        tag: docker/website

Here, I configured an Nginx that just serves the contents from /srv/web/default. The interesting part is, however, that the container uses the syslog driver and the syslog tag docker/website. I always prefix the tag with docker/, to distinguish between log entries of the host machine and entries from Docker containers..

Store Docker logs separately

The workaround so far will probably substantially spam your /var/log/syslog, which may become very annoying… ;-)

Therefore, I recommend writing Docker’s logs to a separate file. If you’re for example using Rsyslog, you may want to add the following configuration:

if $syslogtag contains 'docker/' then /var/log/docker
& ~

Just dump the snippet into a new file /etc/rsyslog.d/docker.conf and restart Rsyslog. This rule tells Rsyslog to write messages that are tagged with docker/* to /var/log/docker, and not to the default syslog file anymore. Thus, your /var/log/syslog stays clean and it’s easier to monitor the Docker containers.
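
Restarting Rsyslog on a systemd-based host boils down to:

systemctl restart rsyslog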

Disentangle the Container logs

Since version 8.25, Rsyslog can also be used to split the docker logs into individual files based on the tag. So you can create separate log files, one per container, which is even cleaner! The idea is to use the tag name of containers to implement the desired directory structure. That means, I would tag the webserver of a website with docker/website/webserver and the database with docker/website/database. We can then tell Rsyslog to allow slashes in program names (see the programname section at www.rsyslog.com/doc/master/configuration/properties.html) and create a template target path for Docker log messages, which is based on the programname:

global(parser.PermitSlashInProgramname="on")

$template DOCKER_TEMPLATE,"/var/log/%programname%.log"

if $syslogtag contains 'docker/' then ?DOCKER_TEMPLATE
&~

Using that configuration, our website will log to /var/log/docker/website/webserver.log and /var/log/docker/website/database.log. Neat, isn’t it? :)

Inform Logrotate

Even though all the individual logfiles will be smaller than a combined one, they will still grow in size. So we should tell logrotate about their existence!

Fortunately, this is easy as well. Just create a new file /etc/logrotate.d/docker containing something like the following:

/var/log/docker/*.log
/var/log/docker/*/*.log
/var/log/docker/*/*/*.log
{
        rotate 7
        daily
        missingok
        notifempty
        delaycompress
        compress
        postrotate
                invoke-rc.d rsyslog rotate > /dev/null
        endscript
}

This will rotate the files ending in *.log in /var/log/docker/ and its subdirectories every day and keep compressed logs for 7 days. Here I’m using a maximum depth of 3 subdirectories – if you need a deeper hierarchy of directories just add another /var/log/docker/*/*/*/*.log etc to the beginning of the file.

Dockerising a Contao website II

In a previous post I explained how to run a Contao website in a Docker infrastructure. That was a good start. However, after running that setup for some time I discovered a few issues…

A central idea of Docker is to install the application in an image and mount persistent files into a running container. Thus, you can just throw away an instance of the app and start a new one very quickly (e.g. with an updated version of the app). Unfortunately, with Contao it’s not that straightforward – at least when using the image described earlier.

Here I’m describing how I tackled these issues:

Issues with Cron

The first issue was Contao’s poor man’s cron. This cron works as follows:

  • The browser requests a file cron.txt, which is supposed to contain the timestamp of the last cron run.
  • If the timestamp is “too” old, the browser will also request a cron.php, which then runs overdue jobs.
  • If a job was run, the timestamp in cron.txt will be updated, so cron.php won’t be run every time.

Good, but that means the cron.txt will only be written if a cron job gets executed. But let’s assume the next job will only run next weekend!? The last cron run time is stored in the database, but the cron.txt won’t exist by default. That means even if the cron.php is run, it will know that there is no cron job to execute and, therefore, exit without creating/updating the cron.txt. Especially when using Docker you will hit such a scenario every time you start a new container.. Thus, every user triggers a 404 error (as there is no cron.txt), which is of course ugly and spams the logs..

I fixed the issue by extending the Contao source code. The patch is already merged into the official release of Contao 3.5.33. In addition, I’m initialising the cron.txt in my Docker image with a time stamp of 0, see the Dockerfile.
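
In the Dockerfile that initialisation is essentially a single line (shown standalone here; in the actual image it is part of a larger RUN chain):

RUN echo 0 > /var/www/html/system/cron/cron.txt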

Issues with Proxies

A typical Docker infrastructure (at least for me) consists of a bunch of containers orchestrated in various networks etc.. Usually, you’ll have at least one (reverse) proxy, which distributes HTTP requests to the container in charge. However, I experienced a few issues with my proxy setup:

HTTPS vs HTTP

While the connection between the client (user, web browser) and the reverse proxy is SSL-encrypted, the proxy and the webserver talk plain HTTP. As it’s the same machine, there is no big need to waste time on encryption. But Contao has a problem with that setup. Even though the reverse proxy properly sends the HTTP_X_FORWARDED_PROTO header, Contao only sees incoming HTTP traffic and uses http:// URLs in all documents… Even if you ignore the mixed-content issue and/or implement a rewrite from HTTP to HTTPS at the web server layer, this will produce twice as many connections as necessary!

The solution is however not that difficult. Contao does not understand HTTP_X_FORWARDED_PROTO, but it recognises the $_SERVER['HTTPS'] variable. Thus, to fix that issue you just need to add the following to your system/config/initconfig.php (see also Issue 7542):

<?php
if (isset ($_SERVER['HTTP_X_FORWARDED_PROTO']) && 'https' === $_SERVER['HTTP_X_FORWARDED_PROTO'])
{
	$_SERVER['HTTPS'] = 1;
}

In addition, this will generate URLs including the port number (e.g. https://example.com:443/etc), but they are perfectly valid. (Not like https://example.com:80/etc or something that I saw during my tests… ;-)

URL encodings in the Sitemap

The previous fix brought up just another issue: the URL encoding in the sitemap breaks when using the port component (:443).. Contao uses rawurlencode to encode all URLs before writing them to the sitemap. However, rawurlencode encodes quite a lot! Among others, it converts colons (:) to %3A. Thus, all URLs in my sitemap looked like this: https://example.com%3A443/etc - which is obviously invalid.

I proposed using htmlspecialchars instead to encode the URLs, but it was finally fixed by splitting the URLs and should be working in release 3.5.34.

Issues with Cache and Assets etc

A more delicate issue is the cache, assets, sitemaps, and the like. Contao’s backend comes with convenient buttons to clear/regenerate these files and to create the search index. Yet, you don’t always want to log in to the backend when recreating the Docker container.. Sometimes you simply can’t - for example, if the container needs to be recreated overnight.

Basically, that is not a big issue. Assets and cache will be regenerated once they are needed. But the sitemaps, for instance, will only be generated when interacting with the backend.

Thus, we need a solution to create these files as soon as possible, preferably in the background after a container is created. Most of the stuff can be done using the Automator tool, but I also have some custom scripts developed by a company that require other mechanisms and are unfortunately not properly integrated into Contao’s hooks landscape. And if we need to touch code anyway, we can also generate all assets and rebuild the search index manually (pre-creating the necessary assets will speed things up for users later on…). To generate all assets (images and scripts etc), we just need to access every single page at the frontend. This will then trigger Contao to create the assets and cache, and subsequent requests from real-life users will be much faster!

The best hack that I came up with so far looks like the following script, which I uploaded to /files/initialiser.php of the Contao instance:

<?php
define ('TL_MODE', 'FE');
require __DIR__ . '/../system/initialize.php';

$THISDIR = realpath (dirname (__FILE__));

$auto = new \Automator ();
// purge stuff
$auto->purgeSearchTables ();
$auto->purgeImageCache ();
$auto->purgeScriptCache();
$auto->purgePageCache();
$auto->purgeSearchCache();
$auto->purgeInternalCache();
$auto->purgeTempFolder();
$auto->purgeXmlFiles ();

// regenerate stuff
$auto->generateXmlFiles ();
$auto->generateInternalCache();
$auto->generateConfigCache();
$auto->generateDcaCache();
$auto->generateLanguageCache();
$auto->generateDcaExtracts();


// get all fe pages
$pages = \Backend::findSearchablePages();

if (isset($GLOBALS['TL_HOOKS']['getSearchablePages']) && is_array($GLOBALS['TL_HOOKS']['getSearchablePages'])) {
	foreach ($GLOBALS['TL_HOOKS']['getSearchablePages'] as $callback) {
		$classname = $callback[0];
		if (!is_subclass_of ($classname, 'Backend'))
			$pages =  (new $classname ())->{$callback[1]} ($pages);
	}
}

// request every fe page to generate assets and cache and search index
$ch=curl_init();
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'conato-cleaner');
# maybe useful to speed up:
#curl_setopt($ch, CURLOPT_MAXCONNECTS, 50);
#curl_setopt($ch, CURLOPT_NOBODY, TRUE);
#curl_setopt($ch, CURLOPT_TIMEOUT_MS, 150);
#curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 150);

foreach ($pages as $page) {
	curl_setopt($ch, CURLOPT_URL, $page);
	curl_exec($ch);
}

The first 3 lines initialise the Contao environment. Here I assume that ../system/initialize.php exists (i.e. the script is saved in the files directory). The next few lines purge existing cache using the Automator tool and subsequently regenerate the cache – just to be clean ;-)

Finally, the script (i) collects all “searchable pages” using the Backend::findSearchablePages() functionality, (ii) enriches this set of pages with additional pages that may be hooked-in by plugins etc through $GLOBALS['TL_HOOKS']['getSearchablePages'], and then (iii) uses cURL to iteratively request each page.

But…

The first part should be reasonably fast, so clients may be willing to wait until the cache stuff is recreated. Accessing every frontend page, however, may require a significant amount of time! Especially for larger web pages.. Thus, I embedded everything in the following skeleton, which advises the browser to close the connection before we start the time-consuming tasks:

<?php
/**
* start capturing output
*/
ob_end_clean ();
ignore_user_abort ();
ob_start() ;


/**
* run the tasks that you want your users to wait for
*/

// e.g. purge and regenerate cache/sitemaps/assets
$auto = new \Automator ();
$auto->purgeSearchTables ();
// ..

/**
* flush the output and tell the browser to close the connection as soon as it received the content
*/
$size = ob_get_length ();
header ("Connection: close");
header ("Content-Length: $size");
ob_end_flush ();
flush ();


/**
* from here you have some free computational time
*/

// e.g. collect pages and request the web sites
// users will already be gone and the output will (probably) never show up in a browser.. (but don't rely on that! it's still sent to the client, it's just outside of content-length)
$pages = \Backend::findSearchablePages();
// ...

Here, the browser is told to close the connection after a certain content size has arrived. I buffer the content that I want to transfer using ob_start and ob_end_flush, so I know how big it is (using ob_get_length). Everything after ob_get_length can safely be ignored by the client, and the connection can be closed.
(You cannot be sure that the browser really closes the connection. I saw curl doing it, but also some versions of Firefox still waiting for the script to finish… Nevertheless, the important content will be transferred quickly enough.)

In addition, I created some RewriteRules for mod_rewrite to automatically regenerate missing files. For example, for the sitemaps I added the following to the vhost config (or htaccess):

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/share/(.*)\.xml.*$ https://example.com/files/initialiser.php?target=sitemap&sitemap=$1 [R=302,L]

That means, if for example /share/sitemap.xml does not exist yet, the user gets automagically redirected to our initialiser.php script! In addition, I added some request parameters (?target=sitemap&sitemap=$1), so that the initialiser.php knows which file was requested. It can then regenerate everything and immediately output the new content! :)

For example, my snippet to regenerate and serve the sitemap looks similar to this:

<?php
// ...

$auto = new \Automator ();
// ...
$auto->generateXmlFiles ();

if ($_GET['target'] == 'sitemap') {
	$sitemaps = $auto->purgeXmlFiles (true);
	$found = false;
	foreach ($sitemaps as $sitemap) {
		if ((!isset ($_GET['sitemap']) || empty ($_GET['sitemap'])) || $_GET['sitemap'] == $sitemap) {
			$xmlfile = $THISDIR . "/../share/" . $sitemap . ".xml";
			
			// if it still does not exists -> we failed...
			if (!file_exists( $xmlfile )) {
				// error handling
			}
			// otherwise, we'll dump the sitemap
			else {
				header ("Content-Type: application/xml");
				readfile ($xmlfile);
			}
			$found = true;
			break;
		}
	}
	if (!$found) {
		// error handling
	}
}

Thus, the request to /share/somesitemap.xml will never fail. If the file does not exist, the client will be redirected to /files/initialiser.php?target=sitemap&sitemap=somesitemap, the file /share/somesitemap.xml will be regenerated, and the new contents will immediately be served. So the client will eventually get the desired content :)

Please be aware that this script is easily DoS-able! Attackers may produce a lot of load by accessing the file. Thus, I added some simple DoS protection to the beginning of the script, which makes sure the whole script is not run more than once per hour (3600 seconds):

<?php
$dryrun = false;
$runcheck = "/tmp/.conato-cleaner-timestamp";

if (file_exists ($runcheck) && filemtime ($runcheck) > time () - 3600) {
	$dryrun = true;
    if (!isset ($_GET['target']) || empty ($_GET['target']))
        die ();
}
else
	touch ($runcheck);

If $dryrun is true, it won’t regenerate cache etc, but still serve the sitemap and other files if requested.. However, if there is also no $_GET['target'] defined, we don’t know what to serve anyway and can die immediately…

You could include the script at the footer of your webpage, e.g. using

<script src="/files/initialiser.php"></script>
</body></html>

(you may want to make sure that the generated output, if any, is valid JavaScript. E.g. embed everything in /*...*/ or something…)

This way you would make sure that every request produces a fully initialised system. However, this will probably also create unnecessary load every hour… You could increase the time span in the DoS-protection hack, but I guess it should be sufficient to run the script only if a missing file is requested. Earlier requests then need to wait for pending assets etc, but to be honest, that should not take too long (or you have a different problem anyway…).

And if your website provides an RSS feed, you could subscribe to it using your default reader, which will regularly make sure that the RSS feed is generated if missing.. (and thus trigger all the other stuff in our initialiser.php) – A feed reader as the poorest-man-cron ;-)

Share

As I said earlier, my version of the script contains plenty of personalised stuff. That’s why I cannot easily share it with you.. :(

However, if you have trouble implementing it yourself just let me know :)

Dockerising a Contao website

I’m a fan of containerisation! It feels much cleaner and systems don’t age that quickly.

The latest project that I am supposed to maintain is a new Contao website. The company that built the website of course just delivered files and a database. The files contain the Contao installation next to Contao extensions next to configuration and customised themes.. all merged into one blob… Thus, it is hard to distinguish between Contao’s own files and user-generated content. So I needed to study Contao’s documentation and reinstall the website to learn which files should go into the Docker image and which files to store outside.

However, I finally came up with a solution that is based on two Contao images :)

A general Contao image

The general Contao image is supposed to contain a plain Contao installation. That is, the recipe just installs dependencies (such as curl, zip, and ssmtp) and downloads and extracts Contao’s sources. The Dockerfile looks like this:

FROM php:apache
MAINTAINER martin scharm <https://binfalse.de/contact/>

# for mail configuration see https://binfalse.de/2016/11/25/mail-support-for-docker-s-php-fpm/

RUN apt-get update \
 && apt-get install -y -q --no-install-recommends \
    wget \
    curl \
    unzip \
    zlib1g-dev \
    libpng-dev \
    libjpeg62-turbo \
    libjpeg62-turbo-dev \
    libcurl4-openssl-dev \
    libfreetype6-dev \
    libmcrypt-dev \
    libxml2-dev \
    ssmtp \
 && apt-get clean \
 && rm -r /var/lib/apt/lists/*

RUN wget https://download.contao.org/3.5/zip -O /tmp/contao.zip \
 && unzip /tmp/contao.zip -d /var/www/ \
 && rm -rf /var/www/html /tmp/contao.zip \
 && ln -s /var/www/contao* /var/www/html \
 && echo 0 > /var/www/html/system/cron/cron.txt \
 && chown -R www-data: /var/www/contao* \
 && a2enmod rewrite

RUN docker-php-source extract \
 && docker-php-ext-configure gd --with-freetype-dir=/usr/include/ --with-jpeg-dir=/usr/include/ \
 && docker-php-ext-install -j$(nproc) zip gd curl mysqli soap \
 && docker-php-source delete

RUN php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');" \
 && php -r "if (hash_file('SHA384', 'composer-setup.php') === '544e09ee996cdf60ece3804abc52599c22b1f40f4323403c44d44fdfdd586475ca9813a858088ffbc1f233e9b180f061') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;" \
 && mkdir -p composer/packages \
 && php composer-setup.php --install-dir=composer \
 && php -r "unlink('composer-setup.php');" \
 && chown -R www-data: composer

The first block installs necessary stuff from the Debian repositories using apt-get. The second block downloads Contao 3.5 from https://download.contao.org/3.5/zip, extracts it to /var/www/, and links /var/www/html to it. It also creates the cron.txt (see github.com/contao/core/pull/8838). The third block installs a few required and/or useful PHP extensions. And finally, the fourth block retrieves and installs Composer to /var/www/html/composer, where the Contao composer plugin expects it.

That’s already it! We have a recipe to create a general Docker image for Contao. Quickly set up an automated build and .. ta-da .. it is available as binfalse/contao.

A personalised Contao image

Besides the plain Contao installation, a Contao website typically also contains a number of extensions. Those are installed through Composer, and they can always be reinstalled. As we do not want to install a load of plugins every time a new container is started, we create a personalised Contao image. All you need is the composer.json that contains the information on which extensions and which versions to install. This file should be copied to /var/www/html/composer/composer.json before Composer is run to install the stuff. Here is an example of such a Dockerfile:

FROM binfalse/contao
MAINTAINER martin scharm <https://binfalse.de/contact/>

COPY composer.json composer/composer.json

USER www-data

# we may need to run this twice... you probably know the error:
# 'Warning: Contao core 3.5.31 was about to get installed but 3.5.31 has been found in project root, to recover from this problem please restart the operation'
# not sure why it doesn't run the necessary steps itself? seems odd to me, but... yes.. we run it twice if it fails...

RUN php composer/composer.phar --working-dir=composer update || php composer/composer.phar --working-dir=composer update

USER root

This image can then be build using:

docker build -t contao-personalised .

The resulting image tagged contao-personalised will contain all extensions required for your website. Thus, it is highly project-specific and shouldn’t be shared..

How to use the personalised Contao image

The usage is basically very simple. You just need to mount a few things inside the container:

  • /var/www/html/files/ should contain files that you uploaded etc.
  • /var/www/html/templates/ may contain your customised layout.
  • /var/www/html/system/config/FILE.php should contain some configuration files. This may include the localconfig.php or a pathconfig.php.

Optionally, you can link a MariaDB container for the database.

Tying it all together using Docker-Compose

Probably the best way to orchestrate the containers is using Docker-Compose. Here is an example docker-compose.yml:

version: '2'
services:

    contao:
      build: /path/to/personalised/
      restart: unless-stopped
      container_name: contao
      links:
        - contao_db
      ports:
        - "8080:80"
      volumes:
        - $PATH/files:/var/www/html/files
        - $PATH/templates:/var/www/html/templates:ro
        - $PATH/system/config/localconfig.php:/var/www/html/system/config/localconfig.php

    contao_db:
      image: mariadb
      restart: always
      container_name: contao_db
      environment:
        MYSQL_DATABASE: contao_database
        MYSQL_USER: contao_user
        MYSQL_PASSWORD: contao_password
        MYSQL_ROOT_PASSWORD: very_secret
      volumes:
        - $PATH/database:/var/lib/mysql

This assumes that your personalised Dockerfile is located at /path/to/personalised/Dockerfile and your website files are stored in $PATH/files, $PATH/templates, and $PATH/system/config/localconfig.php. Docker-Compose will then build the personalised image (if necessary) and create 2 containers:

  • contao based on this image, all user-based files are mounted into the proper locations
  • contao_db a MariaDB to provide a MySQL server

To make Contao speak to the MariaDB server you need to configure the database connection in $PATH/system/config/localconfig.php like this:

$GLOBALS['TL_CONFIG']['dbDriver'] = 'MySQLi';
$GLOBALS['TL_CONFIG']['dbHost'] = 'contao_db';
$GLOBALS['TL_CONFIG']['dbUser'] = 'contao_user';
$GLOBALS['TL_CONFIG']['dbPass'] = 'contao_password';
$GLOBALS['TL_CONFIG']['dbDatabase'] = 'contao_database';
$GLOBALS['TL_CONFIG']['dbPconnect'] = false;
$GLOBALS['TL_CONFIG']['dbCharset'] = 'UTF8';
$GLOBALS['TL_CONFIG']['dbPort'] = 3306;
$GLOBALS['TL_CONFIG']['dbSocket'] = '';

Here, the database should be accessible at contao_db:3306, as set up in the compose file above.

If you’re running Contao with “Rewrite URLs” using an .htaccess, you also need to update Apache’s configuration to allow for rewrites. Thus, you may for example mount the following file to /etc/apache2/sites-available/000-default.conf:

<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html
    <Directory /var/www/>
        AllowOverride All
        Options FollowSymLinks
    </Directory>
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

This tells Apache to allow everything in any .htaccess file in /var/www.
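
Mounting that file is just one more volume entry for the contao service in the compose file (the host path is an assumption):

    contao:
      [...]
      volumes:
        - $PATH/apache/000-default.conf:/etc/apache2/sites-available/000-default.conf:ro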

When everything is up and running, the Contao installation will be available at port 8080 (see the ports definition in the compose file) of the machine hosting the Docker containers.

Mail support

The image above comes with sSMTP installed. If you need support for email with your Contao installation, you just need to mount two more files into the container:

Tell PHP to mail through sSMTP

The following file tells PHP to use the ssmtp binary for mailing. Just mount the file to /usr/local/etc/php/conf.d/mail.ini:

[mail function]
sendmail_path = "/usr/sbin/ssmtp -t"

Configure sSMTP

The sSMTP configuration is very easy. The following few lines may already be sufficient when mounted to /etc/ssmtp/ssmtp.conf:

FromLineOverride=YES
mailhub=mail.server.tld
hostname=php-fpm.yourdomain.tld

For more information read Mail support for Docker’s php:fpm and the Arch Linux wiki on sSMTP or the Debian wiki on sSMTP.

Archiving a (Wordpress) Website

I needed to migrate a lot of tools and projects that we’ve been working on in the SEMS group at the University of Rostock. Among others, the Wordpress website needed to be serialised to get rid of PHP and all the potentially insecure and expensive Wordpress maintenance. I decided to mirror the page using HTTrack and do some subsequent fine-tuning. This is just a small report, maybe interesting if you also need to archive a dynamic web page.

Prepare the page

Some stuff in your (Wordpress) installation is probably useless after serialisation (or has never been working anyway) - get rid of it. For example:

  • Remove the search box - it’s useless without PHP. You may add a link to a search engine instead…?
  • Remove unnecessary trackers like Google analytics and Piwik. You probably don’t need it anymore and users may be unnecessarily annoyed by tracking and/or 404s.
  • Disable unnecessary plugins.
  • Check that manual links (e.g. in widgets) are still up-to-date, also after archiving..
  • Check for unpublished drafts in posts/pages. Those will be lost as soon as you close the CMS.
  • Recreate sitemap and rss feeds (if not created automatically)

I also recommend setting up some monitoring, e.g. using check_link, to make sure all resources are accessible as expected afterwards!

Mirror the website

I decided to mirror the web content using HTTrack. That’s basically quite simple. At the target location you only need to call:

httrack --mirror https://sems.uni-rostock.de/

This will create a directory sems.uni-rostock.de containing the mirrored content. In addition you’ll find logs in hts-log.txt and the cached content in hts-cache/.

However, I tweaked the call a bit and actually executed HTTrack like this:

httrack --mirror '-*trac/*' '-*comments/feed*' '-*page_id=*' -%k --disable-security-limits -%c160 -c20  https://sems.uni-rostock.de/

This ignores all links that match *trac/* (there was a Trac running, but that moved to GitHub and an Nginx will permanently redirect the traffic) as well as *comments/feed* and *page_id=*; in addition, it will keep connections alive (-%k). As I’m the admin of the original site (which I know won’t die too soon, and in the worst case I can just restart it) I increased the speed to a max of 160 connections per second (-%c160) and max 20 simultaneous connections (-c20). For that I also needed to disable HTTrack’s security limits (--disable-security-limits).

That went quite well and I quickly had a copy of the website. However, there were a few issues…

Problems with redirects.

It turns out that HTTrack has problems with redirects. At some point we installed proper SSL certificates, and since then we have been redirecting traffic from port 80 (HTTP) to port 443 (HTTPS). However, some people manually created links that point to the HTTP resources, such as http://sems.uni-rostock.de/home/. If HTTrack stumbles upon such a redirect it will try to remodel that redirect. However, in the case of a redirect from http://sems.uni-rostock.de/home/ to https://sems.uni-rostock.de/home/, the target is the same as the source (from HTTrack’s point of view) and it will redirect to … itself.. -.-

The created HTML page sems.uni-rostock.de/home/index.html looks like this:

<HTML>
<!-- Created by HTTrack Website Copier/3.49-2 [XR&CO'2014] -->

<!-- Mirrored from sems.uni-rostock.de/home/ by HTTrack Website Copier/3.x [XR&CO'2014], Wed, 24 Jan 2018 07:16:38 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=iso-8859-1" /><!-- /Added by HTTrack -->
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=UTF-8"><META HTTP-EQUIV="Refresh" CONTENT="0; URL=index.html"><TITLE>Page has moved</TITLE>
</HEAD>
<BODY>
<A HREF="index.html"><h3>Click here...</h3></A>
</BODY>
<!-- Created by HTTrack Website Copier/3.49-2 [XR&CO'2014] -->

<!-- Mirrored from sems.uni-rostock.de/home/ by HTTrack Website Copier/3.x [XR&CO'2014], Wed, 24 Jan 2018 07:16:38 GMT -->
</HTML>

As you can see, both the link and the meta refresh will redirect to the very same index.html, effectively producing a reload-loop… And as sems.uni-rostock.de/home/index.html already exists it won’t store the content behind https://sems.uni-rostock.de/home/, which will be lost…

I have no idea for an easy fix. I’ve been playing around with the url-hacks flag, but I did not find a working solution.. (see also forum.httrack.com/readmsg/10334/10251/index.html)

What I ended up doing was grepping for this page to find the pages that link to it:

grep "Click here" -rn sems.uni-rostock.de | grep 'HREF="index.html"'

(Remember: some of the Click here pages are legit: They implement proper redirects! Only self-links to HREF="index.html" are the enemies.)

At SEMS, for example, we also had a wrong setting in the calendar plugin, which was still configured for the HTTP version of the website and, thus, generated many of these problematic URLs.

The back-end search helped a lot to find the HTTP links. When searching for http://sems in posts and pages I found plenty of pages that hard-coded the wrong link target.. Also remember that links may also appear in post-excerpts!

If nothing helps, you can still temporarily disable the HTTPS redirect for the time of mirroring.. ;-)

Finalising the archive

To complete the mirror I also rsync’ed the files in wp-content/uploads/, as not all files are linked from the website. Sometimes we just uploaded files and shared them through e-mails or on other websites.
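
That was a plain rsync from the web server into the mirror directory (user and paths are assumptions, of course):

rsync -av user@sems.uni-rostock.de:/var/www/wp-content/uploads/ sems.uni-rostock.de/wp-content/uploads/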

I also manually grabbed the sitemap(s), as HTTrack apparently didn’t see them:

wget --quiet https://sems.uni-rostock.de/sitemap.xml -O sems.uni-rostock.de/sitemap.xml
wget --quiet https://sems.uni-rostock.de/sitemap.xml -O - | egrep -o "https?://[^<]+" | wget --directory-prefix=sems.uni-rostock.de -i -