I have been talking about restoring backups a lot recently, because the disk on my trusty bare-metal server died. This gave me the opportunity to reassess my hosting choices, and do the groundwork to move my setup from where it was to where I want it to be.
One of those changes is moving static website hosting away from an Apache HTTPd, running on an OS I administer (read: “frequently broke”), to a more focused and hands-off system in the cloud, AWS S3 with a CloudFront CDN (more on this in a later post).
Unfortunately, decades of running Apache have left me with a number of static sites using some on-the-fly templating by relying on Server-side Includes (SSI). Headers, footers, geeky IPv6 and last-modified tags, … none of those work with a truly static host. I needed a solution to render those snippets into full pages.
At first, I thought I’d just write a simple parser in Python. I quickly gave up on the idea, however, when I realised I used included templates with parameters. Pretty nifty stuff, but also not trivial to write a parser for.
Then I realised I already had the perfect parser: Apache. All I needed was to let it render all the pages one last time, and publish those instead! This was quickly packaged into a relatively simple Docker container, with the help of the trusty wget. The busy person can find a Gist of the Dockerfile here.
If you have a Python string that contains the representation of escaped characters (say, a backslash \, an x, and two hexadecimal digits), and you want to decode those escapes back to the actual character they represent, you can use codecs.escape_decode.
GitHub now lets you expand or collapse all files in a PR diff at once (by pressing Alt while clicking one of the toggles). Unfortunately, there is no similar feature to mark all files as viewed. That would be handy once the meaningful changes have been reviewed and the remaining automatically modified or generated files can be ignored.
In this series, I look at how we load test our platform to ensure its stability during periods of heavy user traffic. For Learnosity, that’s typically during the back-to-school period. This year was different, though, as COVID caused a dramatic global pivot to online assessment in education. Here is what the result of that looked like in terms of traffic.
We expect major growth every year but that kind of hockey stick curve is not something you can easily predict. But, because scalability is one of the cornerstones of our product offering, we were well-equipped to handle it.
This article series reveals how we prepared for that.
In part one (which was, incidentally, pre-COVID), I detailed how we actually created the load by writing a script using Locust. In this post, I’ll go through the process of running the load test. I’ll also look at some system bottlenecks it helped us find.
Let’s kick things off by looking at some important things a good load-testing method should do. Namely, it should:
1. Apply a realistic load, starting from known-supported levels.
2. Determine whether the behaviour under load matches the requirements.
3. If the behaviour is not as desired, you need to identify errors and fix them. These could be in
   - the load-test code (not realistic enough)
   - the load-test environment (unable to generate enough load)
   - the system parameters
   - the application code
4. If the behaviour is as desired, then ramp up the load exponentially.
We used two separate tools for step 1 above (as described in the first part of this series) and tracked the outcomes of step 2 in a spreadsheet.
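The method above boils down to a simple loop, sketched here in Python. This is only an illustration and not the tooling described in this series: ramp_load, meets_requirements, fake_check, and all the numbers are hypothetical, and in practice the check would be a full load-test run followed by a comparison against the target metrics.

```python
from typing import Callable, List


def ramp_load(
    start_users: int,
    max_users: int,
    meets_requirements: Callable[[int], bool],
    factor: int = 2,
) -> List[int]:
    """Return the load levels that passed, stopping at the first failure."""
    if start_users < 1 or factor < 2:
        raise ValueError("start_users must be >= 1 and factor >= 2")
    passed = []
    users = start_users  # start from a known-supported level
    while users <= max_users:
        if not meets_requirements(users):
            # Behaviour is not as desired: stop, investigate, fix, re-run.
            break
        passed.append(users)
        users *= factor  # behaviour is as desired: ramp up exponentially
    return passed


def fake_check(users: int) -> bool:
    """Stand-in for a real run: pretend the system copes below 5000 users."""
    return users < 5000


levels = ramp_load(500, 16000, fake_check)
print(levels)
```

The first level missing from the returned list is where the investigation should start, whether the culprit turns out to be the test code, the test environment, the system parameters, or the application.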
Every now and then, some spurious peaks show up on munin graphs. The peaks are orders of magnitude higher than the expected range of the data. This happens particularly with DERIVE plugins, which are notably used for network interfaces.
The dramatic increase in Learnosity users during the back-to-school period each year challenges our engineering teams to find new approaches to ensuring rock-solid reliability at all times.
Stability is a core part of Learnosity’s offering. Prior to back-to-school (known as “BTS” internally) we load-test our system to handle a 5x to 10x increase on current usage. That might sound excessive, but it accounts for the surge of first-time users that new customers bring into the fold as well as the additional users that existing customers bring.
Since the BTS traffic spike occurs from mid-August to mid-October, we start preparing in March. We test our infrastructure and apps to find and remove any bottlenecks.
Last year, a larger client ramped up their testing. This created a 3x usage increase of our Events API. In the process, several of our monitoring thresholds were breached and the message delivery latency increased to an unacceptable level.
As a result, we poured resources into testing and ensuring our system was stable even under exceptional stress. To detail the process, I’ve broken the post into two parts:
Creating the load with Locust (this piece)
Running the load test (in part two, coming soon).
Here’s a snapshot of what I cover in this post:
Our target metrics.
How we wrote a Locust script to generate load for a Publish/Subscribe system.
Our observations that:
The load test must reflect real user behaviours and interactions
Load testing alone doesn’t validate system behaviour against target metrics. It’s better to measure this separately while the system is under load.
I finally mastered history expansion with command replacement in the shell (be it bash or zsh, but really, this is readline). It took me 19 years and my entire family fortune to gather enough wits to read that part of the manual attentively enough to learn to use it.
Essentially, you can recall previous commands from the history with !number. You can then change some content of the previous command programmatically before running it by adding :s/PATTERN/REPLACEMENT/ or :gs/PATTERN/REPLACEMENT/ (the first one will replace the first occurrence, the second one will replace them all).
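To make the difference between the two modifiers concrete, here is a small Python model of what they do to the recalled command. It is only an analogy, not how the shell implements it: the substitute helper and the sample command are made up, and the model treats PATTERN as a literal string and ignores special characters such as &. In the shell itself, with a hypothetical history entry number 42, the two cases would be typed as !42:s/notes/todo/ and !42:gs/notes/todo/.

```python
def substitute(command: str, pattern: str, replacement: str,
               everywhere: bool = False) -> str:
    """Model :s/PATTERN/REPLACEMENT/ (first match) and :gs/.../ (all matches)."""
    count = -1 if everywhere else 1  # str.replace: -1 means "no limit"
    return command.replace(pattern, replacement, count)


# Hypothetical previous command, as it would be recalled by !number.
previous = "cp notes.txt notes.txt.bak"

first_only = substitute(previous, "notes", "todo")                    # like :s
all_of_them = substitute(previous, "notes", "todo", everywhere=True)  # like :gs

print(first_only)
print(all_of_them)
```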
I’ve been using Kodi (then XBMC) for more than a decade now (yup, “XB” did stand for X Box alright, but now LibreELEC on a WeTek Core). I’ve also had the library in MySQL for more than half of it. Across migrations, it had developed some quirky content, such as duplicate albums, and some rarities, such as this version of 21, by Adèle, where the description reminds us that her previous album, Ixnay on the Hombre, was only moderately successful on launch; go figure…
As suggested pretty much everywhere as the solution for duplicate content in Kodi, I first tried cleaning the library, repeatedly, to no avail. The duplicate albums were still there. One of their noticeable characteristics, though, was that there was always one copy of the album (in Adèle’s case, the one following Ixnay) that did not have any associated tracks. This felt like it could be a good angle to help me clear those up. Enter some SQL.