The dramatic increase in Learnosity users during the back-to-school period each year challenges our engineering teams to find new approaches to ensuring rock-solid reliability at all times.
Stability is a core part of Learnosity’s offering. Prior to back-to-school (known as “BTS” internally) we load-test our system to handle a 5x to 10x increase on current usage. That might sound excessive, but it accounts for the surge of first-time users that new customers bring to the fold as well as the additional users that existing customers bring.
Since the BTS traffic spike occurs from mid-August to mid-October, we start preparing in March. We test our infrastructure and apps to find and remove any bottlenecks.
Last year, a larger client ramped up their testing. This created a 3x usage increase of our Events API. In the process, several of our monitoring thresholds were breached and the message delivery latency increased to an unacceptable level.
As a result, we poured resources into testing and ensuring our system was stable even under exceptional stress. To detail the process, I’ve broken the post into two parts:
Creating the load with Locust (this piece)
Running the load test (in part two, coming soon).
Here’s a snapshot of what I cover in this post:
Our target metrics.
How we wrote a Locust script to generate load for a Publish/Subscribe system.
Our observations that:
The load test must reflect real user behaviours and interactions
Load testing alone doesn’t validate system behaviour against target metrics. It’s better to measure this separately while the system is under load.
To ease the task of viewing the data, each machine runs munin-node, but only a couple of masters do the data collection with munin-update. This works reasonably well, except that machines monitored by more than one server need to work extra time to provide the same data to both.
I finally mastered the shell (beit bash or zsh, but really, this is readline)’s history with command replacement. It took me 19 years and my entire family fortune to gather enough wits to read that part of the manual with enough attention and will as to learn to use it.
Essentially, you can recall previous commands from the history with !number. You can then change some content of the previous command programmatically before running it by adding :s/PATTERN/REPLACEMENT/ or :gs/PATTERN/REPLACEMENT/ (the first one will replace the first occurrence, the second one will replace them all).
I’ve been using Kodi (then XBMC) for more than a decade now (yup, “XB” did stand for X Box alright, but now LibreELEC on a WeTek Core). I’ve also had the library in MySQL for more than half of it. Across migrations, it had developed some quirky content, such as duplicate albums, and some rarities, such as this version of 21, by Adèle, where the description reminds us that her previous album, Ixnay on the Hombre, was only moderately successful on launch; go figure…
As suggested, pretty much everywhere, as the solution for duplicate content in Kodi, I first tried cleaning the library, repeatedly, to no avail. The duplicate albums were still there. One of their noticeable characteristics, though, was that there was always some copy of the album (and in Adèle’s case, the one following Ixnay), that did not have any associated tracks. This felt like it could be a good angle to help me clear those up. Enter some SQL.
What sounded like a fairly straightforward question quickly snowballed into one of those long chat threads that left us none the wiser. A follow-up face-to-face discussion helped us get down to the root of the problem: code and functional reviews on feature branches may leave us exposed to integration issues after non-fast-forward merges, and can only be caught too late for comfort.
We had to consider our Git workflow alongside the lifecycle of our tickets to come up with an improvement. Merge commits remain, but we rebase (and fix conflicts) on the latest main branch before any review in order to make sure we look at the final code.
The rest of this article describes our Git workflow, our ticket lifecycle, their interactions, and how we made them better.
Rebase onto develop before code review (original developer)
Rebase onto develop before functional review (functional reviewer)
Deploy to staging as soon as possible (i.e., all codebases merged for the feature; original developer)