Archiving a (Wordpress) Website
I needed to migrate a lot of tools and projects that we’ve been working on in the SEMS group at the University of Rostock. Among others, the WordPress website needed to be serialised to get rid of PHP and all the potentially insecure and expensive WordPress maintenance. I decided to mirror the page using HTTrack, followed by some fine-tuning. This is just a small report, maybe interesting if you also need to archive a dynamic web page.
Prepare the page
Some things in your (WordPress) installation are probably useless after serialisation (or never worked in the first place) - get rid of them. For example:
- Remove the search box - it’s useless without PHP. You may add a link to a search engine instead…?
- Remove unnecessary trackers like Google Analytics and Piwik. You probably don’t need them anymore and users may be unnecessarily annoyed by tracking and/or 404s.
- Disable unnecessary plugins.
- Check that manual links (e.g. in widgets) are up-to-date and will still be valid after archiving.
- Check for unpublished drafts in posts/pages. Those will be lost as soon as you close the CMS.
- Recreate the sitemap and RSS feeds (if they are not created automatically).
I also recommend setting up some monitoring, e.g. using check_link, to make sure all resources are still accessible as expected afterwards!
Mirror the website
I decided to mirror the web content using HTTrack. That’s basically quite simple. At the target location you only need to call:
This will create a directory sems.uni-rostock.de containing the mirrored content.
In addition you’ll find logs in hts-log.txt and HTTrack’s cache next to it.
However, I tweaked the call a bit and actually executed HTTrack like this:
This ignores all links that match *trac/* (there was a Trac running, but it moved to GitHub and an Nginx will permanently redirect the traffic). In addition, it keeps connections alive.
As I’m the admin of the original site (which I know won’t die too soon, and in the worst case I can just restart it), I increased the speed to a maximum of 160 connections per second (-%c160) and a maximum of 20 simultaneous connections.
For that I also needed to disable HTTrack’s security limits.
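Pieced together from the options described above, the tweaked call probably looked something like the following sketch. Only -%c160 is spelled out in the text; -%k, -c20 and --disable-security-limits are the matching options from HTTrack's manual, and the start URL is an assumption derived from the name of the mirror directory. The snippet only prints the command unless RUN_MIRROR=1 is set, so it is safe to try out.

```shell
#!/bin/sh
set -eu

# Assumption: the start URL, derived from the name of the mirror directory.
url='https://sems.uni-rostock.de/'

# Assemble the call as positional parameters:
#   '-*trac/*'                 filter: skip every link matching *trac/*
#   -%k                        keep connections alive
#   -%c160                     at most 160 connections per second
#   -c20                       at most 20 simultaneous connections
#   --disable-security-limits  required to exceed HTTrack's built-in limits
set -- httrack --mirror "$url" '-*trac/*' -%k -%c160 -c20 --disable-security-limits

# Show what would be executed.
preview="$*"
printf '%s\n' "$preview"

# Only start the (heavy) mirror run when explicitly asked to.
if [ "${RUN_MIRROR:-0}" = 1 ]; then
  "$@"
fi
```

Please only use such aggressive limits against a server you are responsible for; the defaults exist to protect other people's sites.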
That went quite well and I quickly had a copy of the website. However, there were a few issues…
Problems with redirects
Turns out that HTTrack has problems with redirects.
At some point we installed proper SSL certificates and since then we have been redirecting traffic from port 80 (HTTP) to port 443 (HTTPS).
However, some people manually created links that point to the HTTP resources.
If HTTrack stumbles upon such a redirect it will try to remodel that redirect.
However, in case of redirects from the HTTP version of a page to its HTTPS counterpart, such as https://sems.uni-rostock.de/home/, the target is the same as the source (from HTTrack’s point of view) and the page will redirect to … itself.. -.-
The created HTML page sems.uni-rostock.de/home/index.html looks like this:
As you can see, both the link and the meta refresh redirect to the very same index.html, effectively producing a reload loop…
And since sems.uni-rostock.de/home/index.html already exists, HTTrack won’t store the content behind https://sems.uni-rostock.de/home/, which will be lost…
I don’t know of an easy fix. I’ve been playing around with the url-hacks flag, but I did not find a working solution (see also forum.httrack.com/readmsg/10334/10251/index.html).
What I ended up with was to grep for this page and to find pages that link to it:
(Remember: some of the Click here pages are legit: they implement proper redirects! Only self-links to HREF="index.html" are the enemies.)
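The two grep steps could look roughly like the following sketch. It assumes that the redirect stubs contain the uppercase HREF="index.html" mentioned above (regular pages use lowercase href) and that they are always named index.html; the function names and the demo files are made up for illustration.

```shell
#!/bin/sh
set -eu

# List redirect stubs whose link target is the stub itself.
# Usage: find_self_redirects MIRROR_DIR
find_self_redirects() {
  grep -rlF --include='index.html' 'HREF="index.html"' "$1" || true
}

# List pages that link to a given mirrored page.
# Usage: find_linking_pages MIRROR_DIR PATH_FRAGMENT
find_linking_pages() {
  grep -rlF --include='*.html' "$2" "$1" || true
}

# Demo on a throw-away directory (file contents are invented).
demo=$(mktemp -d)
mkdir -p "$demo/home" "$demo/old" "$demo/about"
# A broken stub: it points at itself.
printf '<A HREF="index.html"><h3>Click here...</h3></A>\n' > "$demo/home/index.html"
# A legit stub: it points somewhere else.
printf '<A HREF="../new/index.html"><h3>Click here...</h3></A>\n' > "$demo/old/index.html"
# A regular page linking to the broken stub.
printf '<a href="../home/index.html">Home</a>\n' > "$demo/about/index.html"

broken=$(find_self_redirects "$demo")
linking=$(find_linking_pages "$demo" 'home/index.html')

printf 'self-redirecting: %s\n' "$broken"
printf 'linked from:      %s\n' "$linking"
```

On the real mirror you would pass sems.uni-rostock.de as the directory. The pages reported by the second step are only candidates, as the mirror no longer tells whether the original link was HTTP or HTTPS.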
At SEMS we, for example, also had a wrong setting in the calendar plugin, which was still configured for the HTTP version of the website and was thus generating many of these problematic URLs.
The back-end search helped a lot to find the HTTP links. When searching for http://sems in posts and pages I found plenty of pages that hard-coded the wrong link target.
Also remember that links may appear in post excerpts as well!
If nothing helps, you can still temporarily disable the HTTPS redirect for the time of mirroring ;-)
Finalising the archive
To complete the mirror I also rsync'ed the files in wp-content/uploads/, as not all files are linked from the website.
Sometimes we just uploaded files and shared them through e-mails or on other websites.
I also manually grabbed the sitemap(s), as HTTrack apparently didn’t see them: