I wrote this article for the Learnosity blog, where it originally appeared. I am reposting it here, with permission, for archival purposes. With thanks to Micheál Heffernan for countless editing passes.
The dramatic increase in Learnosity users during the back-to-school period each year challenges our engineering teams to find new approaches to ensuring rock-solid reliability at all times.
Stability is a core part of Learnosity’s offering. Prior to back-to-school (known as “BTS” internally) we load-test our system to handle a 5x to 10x increase over current usage. That might sound excessive, but it accounts for the surge of first-time users that new customers bring into the fold as well as the additional users that existing customers bring.
Since the BTS traffic spike occurs from mid-August to mid-October, we start preparing in March. We test our infrastructure and apps to find and remove any bottlenecks.
Last year, one of our larger clients ramped up their testing. This created a 3x increase in usage of our Events API. In the process, several of our monitoring thresholds were breached and message delivery latency increased to an unacceptable level.
As a result, we poured resources into testing and ensuring our system was stable even under exceptional stress. To detail the process, I’ve broken the post into two parts:
- Creating the load with Locust (this piece)
- Running the load test (in part two, coming soon).
Here’s a snapshot of what I cover in this post:
- Our target metrics.
- How we wrote a Locust script to generate load for a Publish/Subscribe system.
- Our observations that:
  - The load test must reflect real user behaviours and interactions.
  - Load testing alone doesn’t validate system behaviour against target metrics. It’s better to measure this separately while the system is under load.
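To make the last observation concrete, here is a minimal sketch of checking a target metric separately from the load generator. It assumes a separate probe has already collected message delivery latencies (in milliseconds) while the system was under load. The function names, the 95th-percentile choice, and the sample values are all hypothetical placeholders, not Learnosity's actual metrics or tooling.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_target(latencies_ms, target_ms, pct=95):
    """True if the chosen percentile of observed latencies is within the target.

    Both target_ms and pct are illustrative assumptions.
    """
    return percentile(latencies_ms, pct) <= target_ms


if __name__ == "__main__":
    # Hypothetical latencies gathered by a probe while the load test was running.
    observed = [120, 135, 150, 180, 210, 240, 260, 310, 450, 900]
    print("p95 latency (ms):", percentile(observed, 95))
    print("within 500 ms target:", meets_target(observed, target_ms=500))
```

The point of keeping this check outside the load script is that the load generator only reports what its own simulated users saw, whereas the probe measures the behaviour we actually care about.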