I wrote this article for the Learnosity blog, where it originally appeared. I repost it here, with permission, for archival. With thanks, again, to Micheál Heffernan for countless editing passes.
In this series, I look at how we load test our platform to ensure it stays stable during periods of heavy user traffic. For Learnosity, that’s typically the back-to-school period. This year was different, though, as COVID caused a dramatic global pivot to online assessment in education. Here is what the result of that looked like in terms of traffic.
We expect major growth every year, but that kind of hockey-stick curve is not something you can easily predict. Still, because scalability is one of the cornerstones of our product offering, we were well equipped to handle it.
This article series reveals how we prepared for that.
In part one (which was, incidentally, pre-COVID), I detailed how we actually created the load by writing a script using Locust. In this post, I’ll go through the process of running the load test. I’ll also look at some system bottlenecks it helped us find.
Let’s kick things off by looking at some important things a good load-testing method should do. Namely, it should:
1. Apply a realistic load, starting from known-supported levels.
2. Determine whether the behaviour under load matches the requirements.
3. If the behaviour is not as desired, identify the errors and fix them. These could be in
   - the load-test code (not realistic enough)
   - the load-test environment (unable to generate enough load)
   - the system parameters
   - the application code
4. If the behaviour is as desired, ramp up the load exponentially.
We used two separate tools for step 1 above (as described in the first part of this series) and tracked the outcomes of step 2 in a spreadsheet.
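To make the method above concrete, here is a minimal Python sketch of the ramp-up loop: start at a known-supported level, check the behaviour, and multiply the load while the check passes. It is an illustration only, not our actual tooling. The names `ramp_until_failure` and `run_step` are hypothetical, and `run_step` stands in for whatever launches a load-test run and decides whether the behaviour met the requirements.

```python
from typing import Callable, List, Optional, Tuple


def ramp_until_failure(
    run_step: Callable[[int], bool],
    start_users: int,
    max_users: int,
    factor: int = 2,
) -> Tuple[List[int], Optional[int]]:
    """Ramp the load exponentially until a step fails or max_users is passed.

    run_step is a hypothetical callback: it runs one load test with the
    given number of simulated users and returns True if the observed
    behaviour matched the requirements.

    Returns (levels that passed, first level that failed or None).
    """
    if start_users < 1:
        raise ValueError("start_users must be at least 1")
    if factor < 2:
        raise ValueError("factor must be at least 2 for the load to grow")

    passed: List[int] = []
    users = start_users
    while users <= max_users:
        if not run_step(users):
            # Stop here: investigate and fix before ramping any further.
            return passed, users
        passed.append(users)
        users *= factor  # exponential ramp-up
    return passed, None


if __name__ == "__main__":
    # Stand-in for a real run: pretend the system copes with up to 1000 users.
    def fake_run(users: int) -> bool:
        return users <= 1000

    passed, failed_at = ramp_until_failure(fake_run, start_users=100, max_users=10_000)
    print("passed:", passed, "first failure at:", failed_at)
```

When a step fails, the loop stops rather than pressing on, which mirrors the method: the failing level is where to look for problems in the load-test code, the load-test environment, the system parameters, or the application code.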
TL;DR
- We used Locust to create the load, and a custom application to verify correct behaviour.
- We found a number of configuration-level issues, mainly around limits on file descriptors and open connections.
- Stuff we learned along the way:
  - Record all parameters and their values, change one at a time.
  - Be conscious of system limits, particularly on the allowed number of open files and sockets.
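As an illustration of that last point, here is a small Python sketch that reads the current process's limit on open file descriptors (which also caps open sockets) and compares it with a planned connection count. It assumes a Unix-like system, since the standard `resource` module is not available on Windows. The helper names and the `planned_connections` value are hypothetical, chosen only for the example.

```python
import resource


def describe_limit(value: int) -> str:
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def enough_descriptors(needed: int, soft_limit: int) -> bool:
    """Return True if `needed` descriptors fit within the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return needed <= soft_limit


if __name__ == "__main__":
    # Soft limit: what the process is held to now.
    # Hard limit: the ceiling the soft limit can be raised to.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    planned_connections = 5000  # hypothetical figure for this example

    print("soft limit:", describe_limit(soft))
    print("hard limit:", describe_limit(hard))
    if not enough_descriptors(planned_connections, soft):
        print("The soft limit is too low for the planned number of connections.")
```

The same check is worth running on both sides of a test: on the machines generating the load and on the systems under test.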