I have been talking about restoring backups a lot recently. This is because the disk on my trusty bare-metal server died. This gave me the opportunity to reassess my hosting choices, and do the groundwork to move things from where they were to where I want them to be.
One of those changes is moving static website hosting away from an Apache HTTPd, running on an OS I administrate (read: “frequently broke”), to a more focused and hands-off system in the cloud, AWS S3 with a CloudFront CDN (more on this in a later post).
Unfortunately, decades of running Apache have left me with a number of static sites using some on-the-fly templating by relying on Server-side Includes (SSI). Headers, footers, geeky IPv6 and last-modified tags, … none of those work with a truly static host. I needed a solution to render those snippets into full pages.
At first, I thought I’d just write a simple parser in Python. I quickly gave up on the idea, however, when I realised I used included templates with parameters. Pretty nifty stuff, but also not trivial to write a parser for.
Then I realised I already had the perfect parser: Apache. All I needed was to let it render all the pages one last time, and publish those instead! This was quickly packaged into a relatively simple Docker container, with the trusty `wget`. The busy person can find a Gist of the Dockerfile here.
The key is a simple Dockerfile, which
- builds off an `httpd` base (alpine, to keep the container small)
- installs GNU `wget` (to get all the recursive capabilities)
- enables and configures Server-Side Includes in Apache
- creates a small entrypoint script, `/cmd.sh`, which starts the server, and follows with a recursive `wget` to a known output directory.

    # usage:
    #
    #   docker build -t ssi-extractor - < Dockerfile
    #   docker run -v ./www:/usr/local/apache2/htdocs/ -v ./out:/out ssi-extractor

    FROM httpd:alpine

    # GNU wget, for its recursive fetching options
    RUN apk update \
      && apk add wget

    # Enable SSI and content negotiation, and serve index.shtml by default
    RUN sed -i \
        -e 's/#LoadModule include_module/LoadModule include_module/' \
        -e 's/#LoadModule negotiation_module/LoadModule negotiation_module/' \
        -e 's/Options Indexes FollowSymLinks/& Includes MultiViews/' \
        -e 's/#\(Add.*shtml\)/\1/' \
        -e 's/DirectoryIndex index.html/DirectoryIndex index.shtml/' \
        /usr/local/apache2/conf/httpd.conf \
      && rm /usr/local/apache2/htdocs/index.html

    # Entrypoint: start the server in the background, then mirror it into /out
    RUN mkdir /out \
      && echo '#!/bin/sh' > /cmd.sh \
      && echo 'httpd-foreground &' >> /cmd.sh \
      && echo 'sleep 3; cd /out; wget -rl 0 -nH -E --accept-regex "/[^.]*(.html)?$" http://localhost/' >> /cmd.sh \
      && chmod a+x /cmd.sh

    CMD ["/cmd.sh"]

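To see what the `sed` step does without building the image, the same expressions can be run against a few lines written in the style of the stock `httpd.conf`. This is only a sketch: the sample lines are an assumption about what the default configuration looks like, not a copy of it.

```shell
#!/bin/sh
# Sample lines imitating the stock httpd.conf (assumed, not copied from the image).
sample='#LoadModule include_module modules/mod_include.so
#LoadModule negotiation_module modules/mod_negotiation.so
    Options Indexes FollowSymLinks
    #AddType text/html .shtml
    #AddOutputFilter INCLUDES .shtml
    DirectoryIndex index.html'

# Same expressions as in the Dockerfile, minus -i so the result goes to stdout.
rendered=$(printf '%s\n' "$sample" | sed \
    -e 's/#LoadModule include_module/LoadModule include_module/' \
    -e 's/#LoadModule negotiation_module/LoadModule negotiation_module/' \
    -e 's/Options Indexes FollowSymLinks/& Includes MultiViews/' \
    -e 's/#\(Add.*shtml\)/\1/' \
    -e 's/DirectoryIndex index.html/DirectoryIndex index.shtml/')

printf '%s\n' "$rendered"
```

On this sample, every commented-out line comes back uncommented, the `Options` line gains `Includes MultiViews`, and the directory index points at `index.shtml`.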
The heart of it is the `wget` incantation, which recursively fetches everything (`-r`), ad vitam aeternam (`-l 0`), skips using the hostname when creating a directory structure (`-nH`), fixes the extensions according to the served MIME type (`-E`), and only retains HTML files (`--accept-regex`).

    wget -rl 0 -nH -E --accept-regex "/[^.]*(.html)?$" http://localhost/

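To get a feel for what the filter keeps, the same expression can be tried against a few URLs with `grep -E`. This is a rough approximation: it assumes wget's default POSIX regex matching behaves like `grep -E` for this pattern, and the sample URLs and the `would_keep` helper are made up for illustration.

```shell
#!/bin/sh
# The pattern passed to --accept-regex in the wget command.
pattern='/[^.]*(.html)?$'

# Print "keep" if a URL matches the pattern, "skip" otherwise.
# grep -E is used here as a stand-in for wget's own regex matching.
would_keep() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo keep
    else
        echo skip
    fi
}

# Hypothetical URLs, for illustration only.
for url in \
    http://localhost/ \
    http://localhost/about \
    http://localhost/blog/index.html \
    http://localhost/style.css \
    http://localhost/img/logo.png
do
    printf '%s %s\n' "$(would_keep "$url")" "$url"
done
```

Under that assumption, the directory, the extension-less page and the `.html` page are kept, while the stylesheet and the image are skipped.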
As the `usage` comment says at the top, the image build is pretty classic. Newer versions of `docker` will complain about not using `buildx`, like a caveman.

    docker build -t ssi-extractor - < Dockerfile

The extraction can then be run by mounting the source directory, containing the `shtml` files, as a volume at `/usr/local/apache2/htdocs`, and mounting another, presumably empty, output directory at `/out`. The container will do the rest.

    docker run -v ./www:/usr/local/apache2/htdocs/ -v ./out:/out ssi-extractor

The `out` directory will now contain a bunch of HTML files, which hopefully are in the desired form. A few caveats are worth noting:
- Any dynamic server variable will have a pretty arbitrary value: times, dates, and last-modified tags will obviously go stale very quickly, and configuration data such as `SERVER_ADMIN` will be a default placeholder unless the `httpd.conf` is further modified to have an adequate value.
- No attempt is made to fetch error pages, though it could be easily added by, e.g., creating known entry points throwing those errors, or just hitting the
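For the first caveat, one possible way to give `SERVER_ADMIN` a meaningful value is one more `sed` expression targeting the `ServerAdmin` directive. The sketch below runs it on a sample line; both the placeholder `you@example.com` and the address `webmaster@example.org` are assumptions, so check the actual `httpd.conf` in the image first.

```shell
#!/bin/sh
# Sample directive in the style of the stock httpd.conf (assumed placeholder).
sample='ServerAdmin you@example.com'

# Hypothetical address to publish instead.
admin='webmaster@example.org'

# Replace whatever address follows ServerAdmin; in the Dockerfile this would
# become one more -e expression in the existing sed call.
patched=$(printf '%s\n' "$sample" | sed -e "s/^ServerAdmin .*/ServerAdmin $admin/")

printf '%s\n' "$patched"
```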
I was using SSI mainly for templating and ease of maintenance, on sites that are no longer updated and are only kept for archival purposes, so losing this flexibility in favour of truly static pages is not an issue. If the pages were still actively edited, this would probably not have been a practical approach. That said, the Docker container-based approach is sufficiently self-contained to be weaponised into a build step. All the same, I’m glad I don’t have to do that!
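If this ever did become a build step, a cheap sanity check would be to make sure no SSI directive survived rendering. The sketch below assumes unrendered directives would still show up literally as `<!--#` in the output; the `find_unrendered` function and the sample files are made up for illustration, with a throwaway directory standing in for the real `out`.

```shell
#!/bin/sh
# List files under a directory that still contain a raw SSI directive.
# Returns 0 (and prints the offenders) if any are found, non-zero otherwise.
find_unrendered() {
    grep -rl '<!--#' "$1"
}

# Throwaway demo tree standing in for the real ./out directory.
demo=$(mktemp -d)
printf '<html><body>rendered footer</body></html>\n' > "$demo/good.html"
printf '<html><!--#include virtual="/footer.shtml" --></html>\n' > "$demo/bad.html"

if leftovers=$(find_unrendered "$demo"); then
    echo "unrendered SSI directives in:"
    printf '%s\n' "$leftovers"
else
    echo "no unrendered SSI directives found"
fi

# Clean up the demo tree.
rm -rf "$demo"
```

In a real build, the same function would be pointed at the mounted output directory, and the build would fail whenever it reports a file.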