binfalse
Dockerising a Contao website II
February 20th, 2018
This article is based on Contao 3. There is a new version, see Dockerising Contao 4.
In a previous post I explained how to run a Contao website in a Docker infrastructure. That was a good opening. However, after running that setup for some time I discovered a few issues…
A central idea of Docker is to install the application in an image and mount persistent files into a running container. Thus, you can just throw away an instance of the app and start a new one very quickly (e.g. with an updated version of the app). Unfortunately, with Contao it's not that straightforward – at least when using the image described earlier.
Here is how I tackled the issues:
Issues with Cron
The first issue was Contao's Poor-Man-Cron. This cron works as follows:
- The browser requests a file cron.txt, which is supposed to contain the timestamp of the last cron run.
- If the timestamp is "too" old, the browser will also request cron.php, which then runs overdue jobs.
- If a job was run, the timestamp in cron.txt will be updated, so cron.php won't be run every time.
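The check itself is done by the visitor's browser, but the mechanism can be illustrated with a few shell commands (just a sketch, assuming the site lives at https://example.com and uses Contao 3's default paths – this is not Contao's actual implementation):
#!/bin/sh
# read the timestamp of the last cron run; fall back to 0 if cron.txt is missing
LAST=$(curl -sf https://example.com/system/cron/cron.txt)
[ -z "$LAST" ] && LAST=0
NOW=$(date +%s)
# if the last run is "too" old (here: older than an hour), also request cron.php
if [ $((NOW - LAST)) -gt 3600 ]; then
    curl -s https://example.com/cron.php > /dev/null
fi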
Good, but that means the cron.txt will only be written if a cron job gets executed.
But let's assume the next job will only run next weekend!?
The last cron-run time is stored in the database, but the cron.txt won't exist by default.
That means, even if cron.php is run, it will know that there is no cron job to execute and, therefore, exit without creating/updating the cron.txt.
Especially when using Docker you will hit this scenario every time a new container is started.
Thus, every user triggers a 404 error (as there is no cron.txt), which is of course ugly and spams the logs.
I fixed the issue by extending the Contao source code.
The patch is already merged into the official release of Contao 3.5.33.
In addition, I'm initialising the cron.txt in my Docker image with a timestamp of 0, see the Dockerfile.
Issues with Proxies
A typical Docker infrastructure (at least for me) consists of a bunch of containers orchestrated in various networks etc. Usually, you'll have at least one (reverse) proxy, which distributes HTTP requests to the container in charge. However, I experienced a few issues with my proxy setup:
HTTPS vs HTTP
While the connection between client (user, web browser) and reverse proxy is SSL-encrypted, the proxy and the webserver talk plain HTTP.
As it’s the same machine, there is no big need to waste time on encryption.
But Contao has a problem with that setup.
Even though the reverse proxy properly sends the HTTP_X_FORWARDED_PROTO header, Contao only sees incoming HTTP traffic and uses http:// URLs in all documents…
Even if you ignore the mixed-content issue and/or implement a rewrite of HTTP to HTTPS at the web-server layer, this will produce twice as many connections as necessary!
The solution is however not that difficult.
Contao does not understand HTTP_X_FORWARDED_PROTO, but it recognises the $_SERVER['HTTPS'] variable.
Thus, to fix that issue you just need to add the following to your system/config/initconfig.php (see also Issue 7542):
<?php
if (isset ($_SERVER['HTTP_X_FORWARDED_PROTO']) && 'https' === $_SERVER['HTTP_X_FORWARDED_PROTO'])
{
$_SERVER['HTTPS'] = 1;
}
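To verify that the workaround takes effect, you can simulate the proxy by sending the header yourself and checking which URLs Contao generates (a quick sketch; localhost:8080 and example.com are placeholders for your web server container and domain):
# pretend to be the reverse proxy and look at the generated URLs
curl -s -H "X-Forwarded-Proto: https" http://localhost:8080/ \
    | grep -oE 'https?://example\.com[^"]*' | sort -u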
In addition, this will generate URLs including the port number (e.g. https://example.com:443/etc), but they are perfectly valid. (Not like https://example.com:80/etc or something that I saw during my tests… ;-)
This workaround doesn’t work for Contao 4 anymore! To fix it see Dockerising Contao 4
URL encodings in the Sitemap
The previous fix brought up just another issue: the URL encoding in the sitemap breaks when using the port component (:443).
Contao uses rawurlencode to encode all URLs before writing them to the sitemap.
However, rawurlencode encodes quite a lot! Among others, it converts :s to %3A.
Thus, all URLs in my sitemap looked like this: https://example.com%3A443/etc - which is obviously invalid.
I proposed using htmlspecialchars instead to encode the URLs, but it was finally fixed by splitting the URLs and should be working in release 3.5.34.
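The effect is easy to reproduce on the command line using PHP's CLI:
# rawurlencode percent-encodes the colon of the port component
php -r 'echo rawurlencode("example.com:443"), PHP_EOL;'
# prints: example.com%3A443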
Issues with Cache and Assets etc
A more delicate issue is the cache, the assets, the sitemaps, etc. Contao's backend comes with convenient buttons to clear/regenerate these files and to create the search index. Yet, you don't always want to log in to the backend after recreating the Docker container. Sometimes you simply can't - for example, if the container needs to be recreated over night.
Basically, that is not a big issue. Assets and cache will be regenerated once they are needed. But the sitemaps, for instance, will only be generated when interacting with the backend.
Thus, we need a solution to create these files as soon as possible, preferably in the background after a container is created.
Most of the stuff can be done using the Automator tool, but I also have some personal scripts, developed by a company, that require other mechanisms and are unfortunately not properly integrated into Contao's hooks landscape.
And if we need to touch code anyway, we can also generate all assets and rebuild the search index manually (pre-creating necessary assets will speed things up for users later on…).
To generate all assets (images and scripts etc), we just need to access every single page at the frontend.
This will then trigger Contao to create the assets and cache, and subsequent requests from real-life users will be much faster!
The best hack that I came up with so far looks like the following script, which I uploaded to /files/initialiser.php of the Contao instance:
<?php
define ('TL_MODE', 'FE');
require __DIR__ . '/../system/initialize.php';
$THISDIR = realpath (dirname (__FILE__));
$auto = new \Automator ();
// purge stuff
$auto->purgeSearchTables ();
$auto->purgeImageCache ();
$auto->purgeScriptCache();
$auto->purgePageCache();
$auto->purgeSearchCache();
$auto->purgeInternalCache();
$auto->purgeTempFolder();
$auto->purgeXmlFiles ();
// regenerate stuff
$auto->generateXmlFiles ();
$auto->generateInternalCache();
$auto->generateConfigCache();
$auto->generateDcaCache();
$auto->generateLanguageCache();
$auto->generateDcaExtracts();
// get all fe pages
$pages = \Backend::findSearchablePages();
if (isset($GLOBALS['TL_HOOKS']['getSearchablePages']) && is_array($GLOBALS['TL_HOOKS']['getSearchablePages'])) {
foreach ($GLOBALS['TL_HOOKS']['getSearchablePages'] as $callback) {
$classname = $callback[0];
if (!is_subclass_of ($classname, 'Backend'))
$pages = (new $classname ())->{$callback[1]} ($pages);
}
}
// request every fe page to generate assets and cache and search index
$ch=curl_init();
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'conato-cleaner');
# maybe useful to speed up:
#curl_setopt($ch, CURLOPT_MAXCONNECTS, 50);
#curl_setopt($ch, CURLOPT_NOBODY, TRUE);
#curl_setopt($ch, CURLOPT_TIMEOUT_MS, 150);
#curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 150);
foreach ($pages as $page) {
curl_setopt($ch, CURLOPT_URL, $page);
curl_exec($ch);
}
The first 3 lines initialise the Contao environment.
Here I assume that ../system/initialize.php exists (i.e. the script is saved in the files directory).
The next few lines purge existing cache using the Automator tool and subsequently regenerate the cache – just to be clean ;-)
Finally, the script (i) collects all "searchable pages" using the Backend::findSearchablePages() functionality, (ii) enriches this set of pages with additional pages that may be hooked in by plugins etc. through $GLOBALS['TL_HOOKS']['getSearchablePages'], and then (iii) uses cURL to iteratively request each page.
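With the script in place, a single request is enough to warm up a freshly created container, e.g. from a post-start hook on the Docker host (assuming the script is reachable at https://example.com/files/initialiser.php):
# trigger purging and regeneration of cache, assets, sitemaps, and search index
curl -s https://example.com/files/initialiser.php > /dev/null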
But…
The first part should be reasonably fast, so clients may be willing to wait until the cache stuff is recreated. Accessing every frontend page, however, may require a significant amount of time! Especially for larger websites. Thus, I embedded everything in the following skeleton, which advises the browser to close the connection before we start the time-consuming tasks:
<?php
/**
* start capturing output
*/
ob_end_clean ();
ignore_user_abort ();
ob_start() ;
/**
* run the tasks that you want your users to wait for
*/
// e.g. purge and regenerate cache/sitemaps/assets
$auto = new \Automator ();
$auto->purgeSearchTables ();
// ..
/**
* flush the output and tell the browser to close the connection as soon as it received the content
*/
$size = ob_get_length ();
header ("Connection: close");
header ("Content-Length: $size");
ob_end_flush ();
flush ();
/**
* from here you have some free computational time
*/
// e.g. collect pages and request the web sites
// users will already be gone and the output will (probably) never show up in a browser.. (but don't rely on that! it's still sent to the client, it's just outside of content-length)
$pages = \Backend::findSearchablePages();
// ...
Here, the browser is told to close the connection after a certain content size has arrived.
I buffer the content that I want to transfer using ob_start and ob_end_flush, so I know how big it is (using ob_get_length).
Everything after ob_get_length can safely be ignored by the client, and the connection can be closed.
(You cannot be sure that the browser really closes the connection. I saw curl doing it, but some versions of Firefox still wait for the script to finish… Nevertheless, the important content will be transferred quickly enough.)
In addition, I created some RewriteRules for mod_rewrite to automatically regenerate missing files.
For example, for the sitemaps I added the following to the vhost config (or htaccess):
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/share/(.*)\.xml.*$ https://example.com/files/initialiser.php?target=sitemap&sitemap=$1 [R=302,L]
That means, if for example /share/sitemap.xml does not yet exist, the user gets automagically redirected to our initialiser.php script!
In addition, I added some request parameters (?target=sitemap&sitemap=$1), so that initialiser.php knows which file was requested.
It can then regenerate everything and immediately output the new content! :)
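A quick way to test the rule is to request a sitemap that does not exist yet and inspect the answer (sketch, based on the example.com setup above):
# a missing /share/sitemap.xml should answer with a 302 to the initialiser
curl -s -I https://example.com/share/sitemap.xml | grep -iE '^(HTTP|Location)'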
For example, my snippet to regenerate and serve the sitemap looks similar to this:
<?php
// ...
$auto = new \Automator ();
// ...
$auto->generateXmlFiles ();
if ($_GET['target'] == 'sitemap') {
$sitemaps = $auto->purgeXmlFiles (true);
$found = false;
foreach ($sitemaps as $sitemap) {
if ((!isset ($_GET['sitemap']) || empty ($_GET['sitemap'])) || $_GET['sitemap'] == $sitemap) {
$xmlfile = $THISDIR . "/../share/" . $sitemap . ".xml";
// if it still does not exist -> we failed...
if (!file_exists( $xmlfile )) {
// error handling
}
// otherwise, we'll dump the sitemap
else {
header ("Content-Type: application/xml");
readfile ($xmlfile);
}
$found = true;
break;
}
}
if (!$found) {
// error handling
}
}
Thus, a request to /share/somesitemap.xml will never fail.
If the file does not exist, the client will be redirected to /files/initialiser.php?target=sitemap&sitemap=somesitemap, the file /share/somesitemap.xml will be regenerated, and the new contents will immediately be served.
So the client will eventually get the desired content :)
Please be aware that this script is easily DoS-able! Attackers may produce a lot of load by accessing the file. Thus, I added some simple DoS protection to the beginning of the script, which makes sure the whole script is not run more than once per hour (3600 seconds):
<?php
$dryrun = false;
$runcheck = "/tmp/.conato-cleaner-timestamp";
if (file_exists ($runcheck) && filemtime ($runcheck) > time () - 3600) {
$dryrun = true;
if (!isset ($_GET['target']) || empty ($_GET['target']))
die ();
}
else
touch ($runcheck);
If $dryrun is true, it won't regenerate the cache etc., but it will still serve the sitemap and other files if requested.
However, if there is also no $_GET['target'] defined, we don't know what to serve anyway and can die immediately…
You could include the script at the footer of your webpage, e.g. using
<script src="/files/initialiser.php"></script>
</body></html>
(you may want to make sure that the generated output, if any, is valid JavaScript, e.g. by embedding everything in /*...*/ or something…)
This way you would make sure that every request produces a fully initialised system. However, this will probably also create unnecessary load every hour… You could increase the time span in the DoS-protection hack, but I guess it should be sufficient to run the script only if a missing file is requested. Earlier requests then need to wait for pending assets etc., but to be honest, that should not take too long (or you have a different problem anyway…).
And if your website provides an RSS feed, you could subscribe to it using your default reader, which will regularly make sure that the RSS feed is generated if missing (and thus trigger all the other stuff in our initialiser.php) – a feed reader as the poorest-man-cron ;-)
Share
As I said earlier, my version of the script contains plenty of personalised stuff. That's why I cannot easily share it with you. :(
However, if you have trouble implementing it yourself, just let me know :)
Dockerising a Contao website
January 24th, 2018
This article is based on Contao 3. There is a new version, see Dockerising Contao 4.
I’m a fan of containerisation! It feels much cleaner and systems don’t age that quickly.
The latest project that I am supposed to maintain is a new Contao website. The company that built the website of course just delivered files and a database. The files contain the Contao installation next to Contao extensions next to configuration and customised themes, all merged into one blob… Thus, among the files it is hard to distinguish between Contao-based files and user-generated content. So I needed to study Contao's documentation and reinstall the website to learn which files should go into the Docker image and which files to store outside.
However, I finally came up with a solution that is based on two Contao images :)
A general Contao image
PLEASE NOTE: sSMTP is not maintained anymore! Please switch to msmtp, for example, as I explained in Migrating from sSMTP to msmtp.
The general Contao image is supposed to contain a plain Contao installation. That is, the recipe just installs dependencies (such as curl, zip, and ssmtp) and downloads and extracts Contao's sources. The Dockerfile looks like this:
FROM php:apache
MAINTAINER martin scharm <https://binfalse.de/contact/>
# for mail configuration see https://binfalse.de/2016/11/25/mail-support-for-docker-s-php-fpm/
RUN apt-get update \
&& apt-get install -y -q --no-install-recommends \
wget \
curl \
unzip \
zlib1g-dev \
libpng-dev \
libjpeg62-turbo \
libjpeg62-turbo-dev \
libcurl4-openssl-dev \
libfreetype6-dev \
libmcrypt-dev \
libxml2-dev \
ssmtp \
&& apt-get clean \
&& rm -r /var/lib/apt/lists/*
RUN wget https://download.contao.org/3.5/zip -O /tmp/contao.zip \
&& unzip /tmp/contao.zip -d /var/www/ \
&& rm -rf /var/www/html /tmp/contao.zip \
&& ln -s /var/www/contao* /var/www/html \
&& echo 0 > /var/www/html/system/cron/cron.txt \
&& chown -R www-data: /var/www/contao* \
&& a2enmod rewrite
RUN docker-php-source extract \
&& docker-php-ext-configure gd --with-freetype-dir=/usr/include/ --with-jpeg-dir=/usr/include/ \
&& docker-php-ext-install -j$(nproc) zip gd curl mysqli soap \
&& docker-php-source delete
RUN php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');" \
&& php -r "if (hash_file('SHA384', 'composer-setup.php') === '544e09ee996cdf60ece3804abc52599c22b1f40f4323403c44d44fdfdd586475ca9813a858088ffbc1f233e9b180f061') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;" \
&& mkdir -p composer/packages \
&& php composer-setup.php --install-dir=composer \
&& php -r "unlink('composer-setup.php');" \
&& chown -R www-data: composer
The first block apt-get installs necessary stuff from the Debian repositories.
The second block downloads Contao 3.5 from https://download.contao.org/3.5/zip, extracts it to /var/www/, and links /var/www/html to it. It also creates the cron.txt (see github.com/contao/core/pull/8838).
The third block installs a few required and/or useful PHP extensions.
And finally, the fourth block retrieves and installs Composer to /var/www/html/composer, where the Contao composer plugin expects it.
That's already it! We have a recipe to create a general Docker image for Contao. Quickly set up an automated build and… ta-da… it is available as binfalse/contao.
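You can pull the image from Docker Hub and give it a quick spin (the port mapping 8080:80 is an arbitrary choice):
# pull the general Contao image and start a throw-away instance
docker pull binfalse/contao
docker run --rm -p 8080:80 binfalse/contao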
A personalised Contao image
Besides the plain Contao installation, a Contao website typically also contains a number of extensions.
Those are installed through Composer, and they can always be reinstalled.
As we do not want to install a load of plugins every time a new container is started, we create a personalised Contao image.
All you need is the composer.json that contains the information on which extensions and which versions to install.
This JSON file should be copied to /var/www/html/composer/composer.json before Composer can be run to install the stuff.
Here is an example of such a Dockerfile:
FROM binfalse/contao
MAINTAINER martin scharm <https://binfalse.de/contact/>
COPY composer.json composer/composer.json
USER www-data
# we need to run it twice... you probably know the error:
# 'Warning: Contao core 3.5.31 was about to get installed but 3.5.31 has been found in project root, to recover from this problem please restart the operation'
# not sure why it doesn't run the necessary things itself - seems odd to me, but... yes... we run it twice if it fails
RUN php composer/composer.phar --working-dir=composer update || php composer/composer.phar --working-dir=composer update
USER root
This image can then be built using:
docker build -t contao-personalised .
The resulting image tagged contao-personalised will contain all extensions required for your website. Thus, it is highly project-specific and shouldn't be shared.
How to use the personalised Contao image
The usage is basically very simple. You just need to mount a few things inside the container:
- /var/www/html/files/ should contain the files that you uploaded etc.
- /var/www/html/templates/ may contain your customised layout.
- /var/www/html/system/config/FILE.php should contain some configuration files. This may include the localconfig.php or a pathconfig.php.
Optionally you can link a MariaDB for the database.
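If you prefer plain docker run over Docker-Compose (see the next section), the mounts translate roughly to the following call – a sketch, with /srv/contao as a hypothetical host path and contao-personalised as the image built above:
docker run -d --name contao -p 8080:80 \
    -v /srv/contao/files:/var/www/html/files \
    -v /srv/contao/templates:/var/www/html/templates:ro \
    -v /srv/contao/localconfig.php:/var/www/html/system/config/localconfig.php \
    contao-personalised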
Tying it all together using Docker-Compose
Probably the best way to orchestrate the containers is using Docker-Compose.
Here is an example docker-compose.yml:
version: '2'
services:
contao:
build: /path/to/personalised/Dockerfile
restart: unless-stopped
container_name: contao
links:
- contao_db
ports:
- "8080:80"
volumes:
- $PATH/files:/var/www/html/files
- $PATH/templates:/var/www/html/templates:ro
- $PATH/system/config/localconfig.php:/var/www/html/system/config/localconfig.php
contao_db:
image: mariadb
restart: always
container_name: contao_db
environment:
MYSQL_DATABASE: contao_database
MYSQL_USER: contao_user
MYSQL_PASSWORD: contao_password
MYSQL_ROOT_PASSWORD: very_secret
volumes:
- $PATH/database:/var/lib/mysql
This assumes that your personalised Dockerfile is located in /path/to/personalised/Dockerfile and your website files are stored in $PATH/files, $PATH/templates, and $PATH/system/config/localconfig.php.
Docker-Compose will then build the personalised image (if necessary) and create 2 containers:
- contao, based on this image; all user-based files are mounted into the proper locations
- contao_db, a MariaDB container to provide a MySQL server
To make Contao speak to the MariaDB server you need to configure the database connection in $PATH/system/config/localconfig.php like this:
$GLOBALS['TL_CONFIG']['dbDriver'] = 'MySQLi';
$GLOBALS['TL_CONFIG']['dbHost'] = 'contao_db';
$GLOBALS['TL_CONFIG']['dbUser'] = 'contao_user';
$GLOBALS['TL_CONFIG']['dbPass'] = 'contao_password';
$GLOBALS['TL_CONFIG']['dbDatabase'] = 'contao_database';
$GLOBALS['TL_CONFIG']['dbPconnect'] = false;
$GLOBALS['TL_CONFIG']['dbCharset'] = 'UTF8';
$GLOBALS['TL_CONFIG']['dbPort'] = 3306;
$GLOBALS['TL_CONFIG']['dbSocket'] = '';
Here, the database should be accessible at contao_db:3306, as it is set up in the compose file above.
If you're running Contao with "Rewrite URLs" using an .htaccess, you also need to update Apache's configuration to allow for rewrites. Thus, you may for example mount the following file to /etc/apache2/sites-available/000-default.conf:
<VirtualHost *:80>
ServerAdmin webmaster@localhost
DocumentRoot /var/www/html
<Directory /var/www/>
AllowOverride All
Options FollowSymLinks
</Directory>
ErrorLog ${APACHE_LOG_DIR}/error.log
CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
This tells Apache to allow everything in any .htaccess file in /var/www.
When everything is up and running, the Contao installation will be available at port 8080 (see the ports definition in the compose file) of the machine hosting the Docker containers.
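Bringing the whole stack up is then a single command, run next to the docker-compose.yml:
# build the personalised image if necessary and start both containers
docker-compose up -d --build
# the site should now answer on port 8080 of the Docker host
curl -I http://localhost:8080/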
Mail support
PLEASE NOTE: sSMTP is not maintained anymore! Please switch to msmtp, for example, as I explained in Migrating from sSMTP to msmtp.
The image above comes with sSMTP installed. If you need support for email with your Contao installation, you just need to mount two more files into the container:
Tell PHP to mail through sSMTP
The following file tells PHP to use the ssmtp binary for mailing. Just mount the file to /usr/local/etc/php/conf.d/mail.ini:
[mail function]
sendmail_path = "/usr/sbin/ssmtp -t"
Configure sSMTP
PLEASE NOTE: sSMTP is not maintained anymore! Please switch to msmtp, for example, as I explained in Migrating from sSMTP to msmtp.
The sSMTP configuration is very easy. The following few lines may already be sufficient, when mounted to /etc/ssmtp/ssmtp.conf:
FromLineOverride=YES
mailhub=mail.server.tld
hostname=php-fpm.yourdomain.tld
For more information read Mail support for Docker’s php:fpm and the Arch Linux wiki on sSMTP or the Debian wiki on sSMTP.
Archiving a (Wordpress) Website
January 24th, 2018
I needed to migrate a lot of tools and projects that we've been working on in the SEMS group at the University of Rostock. Among others, the Wordpress website needed to be serialised to get rid of PHP and all the potentially insecure and expensive Wordpress maintenance. I decided to mirror the page using HTTrack and do some subsequent fine tuning. This is just a small report, maybe interesting if you also need to archive a dynamic web page.
Prepare the page
Some stuff in your (Wordpress) installation is probably useless after serialisation (or has never been working anyway) - get rid of it. For example:
- Remove the search box - it’s useless without PHP. You may add a link to a search engine instead…?
- Remove unnecessary trackers like Google analytics and Piwik. You probably don’t need it anymore and users may be unnecessarily annoyed by tracking and/or 404s.
- Disable unnecessary plugins.
- Check that manual links (e.g. in widgets) are still up-to-date, also after archiving.
- Check for unpublished drafts in posts/pages. Those will be lost as soon as you close the CMS.
- Recreate the sitemap and RSS feeds (if they are not created automatically).
I also recommend setting up some monitoring, e.g. using check_link, to make sure all resources are still accessible as expected afterwards!
Mirror the website
I decided to mirror the web content using HTTrack. That’s basically quite simple. At the target location you only need to call:
httrack --mirror https://sems.uni-rostock.de/
This will create a directory sems.uni-rostock.de containing the mirrored content.
In addition you'll find logs in hts-log.txt and the cached content in hts-cache/.
However, I tweaked the call a bit and actually executed HTTrack like this:
httrack --mirror '-*trac/*' '-*comments/feed*' '-*page_id=*' -%k --disable-security-limits -%c160 -c20 https://sems.uni-rostock.de/
This ignores all links that match *trac/* (there was a Trac running, but that moved to GitHub and an Nginx will permanently redirect the traffic); in addition, it will keep connections alive (-%k).
As I'm the admin of the original site (which I know won't die too soon, and in the worst case I can just restart it), I increased the speed to a max of 160 connections per second (-%c160) and max 20 simultaneous connections (-c20).
For that I also needed to disable HTTrack's security limits (--disable-security-limits).
That went quite well and I quickly had a copy of the website. However, there were a few issues…
Problems with redirects.
It turns out that HTTrack has problems with redirects.
At some point we installed proper SSL certificates, and since then we have been redirecting traffic at port 80 (HTTP) to port 443 (HTTPS).
However, some people manually created links that point to the HTTP resources, such as http://sems.uni-rostock.de/home/.
If HTTrack stumbles upon such a redirect, it will try to remodel that redirect.
However, in the case of a redirect from http://sems.uni-rostock.de/home/ to https://sems.uni-rostock.de/home/, the target is the same as the source (from HTTrack's point of view) and it will redirect to… itself. -.-
The created HTML page sems.uni-rostock.de/home/index.html looks like this:
<HTML>
<!-- Created by HTTrack Website Copier/3.49-2 [XR&CO'2014] -->
<!-- Mirrored from sems.uni-rostock.de/home/ by HTTrack Website Copier/3.x [XR&CO'2014], Wed, 24 Jan 2018 07:16:38 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=iso-8859-1" /><!-- /Added by HTTrack -->
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=UTF-8"><META HTTP-EQUIV="Refresh" CONTENT="0; URL=index.html"><TITLE>Page has moved</TITLE>
</HEAD>
<BODY>
<A HREF="index.html"><h3>Click here...</h3></A>
</BODY>
<!-- Created by HTTrack Website Copier/3.49-2 [XR&CO'2014] -->
<!-- Mirrored from sems.uni-rostock.de/home/ by HTTrack Website Copier/3.x [XR&CO'2014], Wed, 24 Jan 2018 07:16:38 GMT -->
</HTML>
As you can see, both the link and the meta refresh redirect to the very same index.html, effectively producing a reload loop…
And as sems.uni-rostock.de/home/index.html already exists, HTTrack won't store the content behind https://sems.uni-rostock.de/home/, which will be lost…
I have no idea for an easy fix. I've been playing around with the url-hacks flag, but I did not find a working solution (see also forum.httrack.com/readmsg/10334/10251/index.html).
What I ended up with was to grep for this page and to find pages that link to it:
grep "Click here" -rn sems.uni-rostock.de | grep 'HREF="index.html"'
(Remember: some of the Click here pages are legit: they implement proper redirects! Only self-links to HREF="index.html" are the enemies.)
At SEMS, for example, we also had a wrong setting in the calendar plugin, which was still configured for the HTTP version of the website and was thus generating many of these problematic URLs.
The back-end search helped a lot to find the HTTP links. When searching for http://sems in posts and pages, I found plenty of pages that hard-coded the wrong link target.
Also remember that links may appear in post excerpts!
If nothing helps, you can still temporarily disable the HTTPS redirect for the time of mirroring… ;-)
Finalising the archive
To complete the mirror I also rsync'ed the files in wp-content/uploads/, as not all files are linked on the website.
Sometimes we just uploaded files and shared them through e-mails or on other websites.
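The rsync call looked roughly like this (a sketch – host name and paths are placeholders for the live server and the Wordpress installation):
# copy all uploads from the live server into the mirrored directory tree
rsync -av webserver.example.com:/var/www/wordpress/wp-content/uploads/ \
    sems.uni-rostock.de/wp-content/uploads/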
I also manually grabbed the sitemap(s), as HTTrack apparently didn’t see them:
wget --quiet https://sems.uni-rostock.de/sitemap.xml -O sems.uni-rostock.de/sitemap.xml
wget --quiet https://sems.uni-rostock.de/sitemap.xml -O - | egrep -o "https?://[^<]+" | wget --directory-prefix=sems.uni-rostock.de -i -
iptables: log and drop
July 17th, 2017
Linux has a sophisticated firewall built right into the kernel: it's called iptables!
I'm pretty sure you've heard about it.
You can do really crazy things with iptables.
But here I just want to document how to log+drop a packet in a single rule.
Usually, you would probably do something like that:
iptables -A INPUT -j LOG --log-level warning --log-prefix "INPUT-DROP:"
iptables -A INPUT -j DROP
Works perfectly, but it dramatically messes up your rules table. Especially if you want to log+drop packets that match a complicated filter, you'll end up with twice as many table entries as desired.
The trick is to instead create a new rule chain that will log+drop in sequence:
iptables -N LOG_DROP
So here I created a new chain called LOG_DROP.
We can now append (-A) two new rules to that chain, which do the actual log+drop:
iptables -A LOG_DROP -j LOG --log-level warning --log-prefix "INPUT-DROP:"
iptables -A LOG_DROP -j DROP
(similar to the first code above, just not for the INPUT chain but for the LOG_DROP chain)
That’s basically it!
If you now need to log+drop a packet you can append a new rule to, e.g., the INPUT chain that routes the packet to the LOG_DROP chain:
iptables -A INPUT [...filter specification...] -j LOG_DROP
You should consider limiting the number of redundant log entries per time interval to prevent flooding of your logs.
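One way to do that is iptables' limit module. For example (the numbers are arbitrary), the LOG rule in the LOG_DROP chain could be restricted to a few entries per minute:
# rate-limited variant of the LOG rule: at most 5 log entries per minute
# (burst of 10); everything beyond that is silently handled by the DROP rule
iptables -A LOG_DROP -m limit --limit 5/min --limit-burst 10 -j LOG --log-level warning --log-prefix "INPUT-DROP:"
iptables -A LOG_DROP -j DROP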
For more documentation you should consult the manual of iptables(8).
Common Name vs Subject Alternative Names
May 19th, 2017
You probably heard about the conflict between the fields Common Name (CN) and Subject Alt Names (subjectAltName) in SSL certificates.
It seemed best practice for clients to compare the CN value with the server's name.
However, RFC 2818 already advised against using the Common Name, and Google now takes the gloves off.
Since Chrome version 58 they do not support the CN anymore, but throw an error:
Subject Alternative Name Missing
Good potential for some administrative work ;-)
Check for Subject Alternative Names
You can use OpenSSL to obtain a certificate, for example for binfalse.de:
openssl s_client -showcerts -connect binfalse.de:443 </dev/null 2>/dev/null
Here, openssl will connect to the server behind binfalse.de at port 443 (the default port for HTTPS) to request the SSL certificate and dump it to your terminal.
openssl can also print the details of a certificate; you just need to pipe the certificate into:
openssl x509 -text -noout
Thus, the whole command including the output may look like this:
openssl s_client -showcerts -connect binfalse.de:443 </dev/null | openssl x509 -text -noout
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
03:a1:4e:c1:b9:6c:60:61:34:a2:e1:9f:ad:15:2b:f9:fd:f0
Signature Algorithm: sha256WithRSAEncryption
Issuer: C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
Validity
Not Before: May 12 07:11:00 2017 GMT
Not After : Aug 10 07:11:00 2017 GMT
Subject: CN = binfalse.de
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (4096 bit)
Modulus:
00:ae:8d:6a:74:0b:10:4e:8e:07:1e:c8:3e:b8:83:
11:4f:b0:af:2b:eb:49:61:82:4f:6f:73:30:0c:d6:
3e:0a:47:bc:72:55:df:84:8c:56:1a:4a:87:ec:d4:
72:8d:8c:3d:c4:b3:6c:7a:42:e2:f4:6e:c0:5e:50:
e4:c0:9c:63:6c:0b:e0:12:15:0c:28:2d:4f:67:ad:
69:9a:b4:ee:dc:12:b1:02:83:00:b7:22:22:60:13:
a6:7d:e3:8a:e5:0c:f3:15:17:69:5e:fe:de:af:ea:
1e:71:b4:90:df:97:fe:d2:1b:ef:58:d5:43:35:8b:
81:e1:62:d6:6b:eb:18:e5:5b:a8:5c:da:f8:39:be:
8b:9a:34:c1:54:d2:5c:bc:22:85:6b:2e:30:8c:d8:
fa:dd:2c:9d:ae:5e:c9:21:43:86:d5:f8:dc:aa:d6:
d4:2c:a8:0b:ca:d8:16:cb:98:d3:c9:c8:c0:a3:6c:
1e:2f:9d:6f:5b:d3:09:1f:4e:1b:a7:48:99:25:84:
ef:5f:5a:db:c1:19:82:fd:8c:9e:b2:68:da:1b:98:
b8:60:49:62:82:8e:75:ea:03:be:0d:df:e1:8c:40:
8a:10:48:f4:c0:f8:89:02:29:9b:94:3f:6d:68:72:
42:e8:2e:ad:e6:81:cd:22:bf:cd:ff:ce:40:89:73:
2e:1e:b7:94:3f:f1:9e:36:89:37:4a:04:81:80:70:
8f:39:fe:b2:90:b5:5e:cb:93:7e:71:e3:e1:2a:bc:
21:9a:ef:a6:e2:2b:1c:8c:da:53:bf:79:37:7d:6e:
0e:eb:de:c3:aa:9f:64:f6:c9:58:35:d2:32:ab:4f:
f7:8d:6e:a1:7f:7a:de:d4:48:cd:0d:18:b7:20:84:
b5:8c:d8:f5:b1:ac:e3:b4:66:9f:9f:ab:01:22:c8:
f2:f8:09:36:f1:c5:90:ff:d3:a4:80:8e:f4:c4:05:
c5:4f:7f:ca:f3:fd:42:ec:25:b7:38:42:af:fd:37:
da:5e:2f:a8:c4:23:fe:24:d2:72:16:1e:96:50:45:
05:cb:39:6c:95:69:a0:39:48:73:72:a4:d5:c0:a0:
b3:9a:cb:27:fe:7c:87:b8:53:3b:52:50:b6:5d:11:
ea:b5:42:1a:80:07:4d:4c:b4:79:59:7c:b9:4b:2f:
0b:b4:2e:57:a6:6c:5f:45:c6:4d:20:54:9d:e3:1b:
82:0c:16:65:a0:fa:e9:cb:98:6d:59:3c:a5:41:22:
22:e8:38:38:b6:fe:05:d5:e5:34:7f:9e:52:ba:34:
4c:ab:9b:8d:e0:32:ce:fa:cd:2b:a3:57:7a:2c:fc:
2c:e7:31:00:77:d7:d1:cd:b5:d2:6a:65:0f:97:63:
b0:36:39
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Server Authentication, TLS Web Client Authentication
X509v3 Basic Constraints: critical
CA:FALSE
X509v3 Subject Key Identifier:
3B:F7:85:9A:2B:1E:1E:95:20:1B:21:D9:2C:AF:F4:26:E8:95:29:BA
X509v3 Authority Key Identifier:
keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1
Authority Information Access:
OCSP - URI:http://ocsp.int-x3.letsencrypt.org/
CA Issuers - URI:http://cert.int-x3.letsencrypt.org/
X509v3 Subject Alternative Name:
DNS:binfalse.de
X509v3 Certificate Policies:
Policy: 2.23.140.1.2.1
Policy: 1.3.6.1.4.1.44947.1.1.1
CPS: http://cps.letsencrypt.org
User Notice:
Explicit Text: This Certificate may only be relied upon by Relying Parties and only in accordance with the Certificate Policy found at https://letsencrypt.org/repository/
Signature Algorithm: sha256WithRSAEncryption
1b:82:51:b3:1c:0d:ae:8c:9f:25:4e:87:1a:4b:e9:b4:77:98:
74:22:f1:27:c5:c1:83:45:7c:89:34:43:fe:76:d8:90:56:c5:
b1:a7:74:78:f1:e4:4c:69:2c:9f:55:d1:a3:c9:ce:f1:b6:4a:
40:e4:18:ae:80:03:76:bd:d5:25:ff:4b:4b:68:cd:98:09:48:
e4:42:07:bc:4a:ad:a3:f7:46:8a:fe:46:c2:6a:b2:28:01:4d:
89:09:2a:31:15:26:c5:aa:14:93:5e:8c:a6:cb:30:af:08:7f:
6f:d8:ef:a2:d7:de:33:3e:f2:c3:17:c6:08:4a:3b:c6:67:05:
07:c0:b8:52:13:e1:c8:13:d4:0e:19:11:0f:54:4e:ea:d0:2b:
c2:3d:93:51:8a:15:da:f7:4b:78:08:cd:c1:d0:f2:f7:e0:98:
f7:0a:bc:13:ca:d0:9b:be:2d:2b:d5:e9:03:29:12:aa:97:ec:
1a:d1:2c:51:7d:21:d1:38:39:aa:1d:9e:a5:98:1d:94:e2:66:
ea:31:c4:18:b6:13:6c:6f:8e:2f:27:77:7b:af:37:e0:0b:86:
4b:b5:cc:7b:96:31:0c:30:c6:9e:12:a2:15:07:29:9f:78:3e:
5e:2a:3f:cf:f8:27:82:30:72:6b:63:64:5a:d1:2d:ed:08:ed:
71:13:a9:0b
As you can see in the X509v3 extensions, this server's SSL certificate does have a Subject Alternative Name:
X509v3 Subject Alternative Name:
DNS:binfalse.de
To quick-check one of your websites, you may want to use the following grep filter:
openssl s_client -showcerts -connect binfalse.de:443 </dev/null | openssl x509 -text -noout | grep -A 1 "Subject Alternative Name"
If that doesn’t print a proper Subject Alternative Name you should go and create a new SSL certificate for that server!
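If you issue certificates manually (clients such as certbot take care of this for you), recent OpenSSL versions (1.1.1 and later) let you add the SAN directly when creating the CSR – a sketch with placeholder names:
# generate a key and a CSR that carries the Subject Alternative Name
openssl req -new -newkey rsa:4096 -nodes \
    -keyout example.com.key -out example.com.csr \
    -subj "/CN=example.com" \
    -addext "subjectAltName=DNS:example.com,DNS:www.example.com"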