What to Test, How to Test, and How to Interpret the Results
This section is a lot more technical than the previous one, which covered why you should performance test your software before you go live and how to communicate that to your stakeholders. Here we go into how to build up your tests, what tooling could help, and how to interpret the results you’re seeing.
What to test
This section covers the technical details of the test scenarios sketched out below.
- Soak test your application for three weeks.
- Restart all components a fair few times.
- Put half a million records in your DB and do a regression test under moderate load.
Let’s begin in the middle.
Restarting all components a few times
This is an easy one. You can do the first round manually. Regardless of the OS, the platform, whether or not it is virtualised, containerised, dockerised, caramelised or pulverised, whether it runs as a systemd service, whether or not there is automation to restart it if it goes down… just stop and start each one in turn.
What to test
Restart all components in a row. All meaning all, including the DB, the webserver, the third-party service that does your authentication, and the one legacy thing that nobody understands.
I know that many of the infra/env/ops folks will say that Docker will take care of that, or systemd, or something custom. Be that as it may, as a professional empirical skeptic, you just want to see it for yourself.
How to test
Once you have done it manually a few times with different stop and start orders, do some scripting and put a sleep between the stop and the start of each process. Use a range of delays like 1 second, 10s, 30s, 60s, 120s, 600s, 1200s and 2400s. Yeah, I know, for some of these you might need to leave things running overnight or over the weekend.
Also, space out the restarts between the components similarly. Do a few rounds of flash restarts where all components are restarted in a row, without delay, and then do stops and starts with a short delay in between, from 1 sec to 200 secs, with similar spacing between the components.
Play around with the order in which you restart them. DB first and DB last are one nice pair. You can also play around with stopping and starting multiple components with some overlap: stop your message queue, stop the DB, stop the data collector, start the message queue, start the data collector, start the DB. You catch my drift.
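If it helps, a minimal sketch of that kind of restart script could look like the one below. The service names are made up and it assumes systemd-managed services; swap in docker, Kubernetes or your own wrappers as needed.

```
#!/usr/bin/env bash
# Restart drill (sketch): stop and start each component with a growing delay
# in between. Service names are placeholders; systemctl is an assumption --
# use docker stop/start or your own scripts if that is what you run.
set -euo pipefail

SERVICES=(message-queue data-collector webapp db)   # try different orders too
DELAYS=(1 10 30 60 120 600 1200 2400)               # seconds between stop and start

for delay in "${DELAYS[@]}"; do
  for svc in "${SERVICES[@]}"; do
    echo "$(date '+%F %T') stopping ${svc}, waiting ${delay}s before start"
    sudo systemctl stop "${svc}"
    sleep "${delay}"
    sudo systemctl start "${svc}"
    sleep 5   # small gap before touching the next component
  done
done
```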
Grep your logs for errors and warnings, and remember, your basic assumption is that your services are always up and running and that basic functionality works. Make sure all your services are up after the restarts, and check functionality with a couple of easy, low-hanging tests. You don’t need to run a full smoke suite, just a handful of easily automated checks that detect signs of life in the system.
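Those signs-of-life checks can be as crude as the sketch below; the service names, log path and health endpoint are invented for illustration, so point them at your own.

```
#!/usr/bin/env bash
# Post-restart signs-of-life check (illustrative names and paths).

# 1. Are the services actually running?
for svc in message-queue data-collector webapp db; do
  systemctl is-active --quiet "${svc}" || echo "DOWN: ${svc}"
done

# 2. Does a basic endpoint still answer? (hypothetical health URL)
curl -fsS -o /dev/null http://localhost:8080/health || echo "health check failed"

# 3. Anything nasty in the logs since the restart round?
grep -Ei 'error|warn' /var/log/myapp/*.log | tail -n 50
```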
One thing that will throw a curveball at the infrastructure folks is restarting, or stopping and starting, the DB(s). The usual thing they say is: oh well, you can’t do that. Well, you can and you should. Life does not really care what you are supposed to do with your DB. Life will just stop your DB on really busy days when your customer is making big bucks. So in fact, you want to know what actually happens on those days. Tell your people that it’s okay, that you are just curious, and that there is still no risk because you are not in prod yet.
How to interpret the results
You might be really lucky: all your services always come back, everything works and is robust. I’ve run this test on a fair few systems and have yet to see such an MVP, but I am hopeful. One day, I will.
If you are not so lucky, make a list of the scenarios where the failures happened. Put them into a little report, and ask around for explanations of why people think this might have happened. Don’t say this is good or bad, just state the facts and ask people for explanations. This should generate good conversations around your results that will solidify the operational aspects of your product or service.
Let’s move on to the next test.
Soak test your application for three weeks
Begin the stakeholder management conversations really early. Six months before the planned launch, start talking to people about needing an environment that is like production, where you can test the software end to end. As it will cost money, put together a case for why it is needed, what the business benefits are if you do it, and what the risks are if you don’t.
Prepare a perfect replica of what you want your customers to run your software on, or, if you are running in the cloud, the exact setup you are planning to sell for money. Either way, have some monitoring on it, not just software stats but hardware metrics as well. A boring Prometheus + Grafana combo is good for a start. Watch those charts like a hawk. (Or, for a really low-tech option, just gauge your process metrics with top every minute or so and pipe the results into a log file.)
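The low-tech version can literally be a loop like this sketch; the process name and log path are placeholders.

```
#!/usr/bin/env bash
# Poor man's resource monitor: one snapshot a minute appended to a log file.
# "myapp" and the log path are placeholders.
while true; do
  snapshot=$(top -b -n 1)
  {
    date '+%F %T'
    head -n 5 <<< "${snapshot}"        # load average, memory and swap summary
    grep -i myapp <<< "${snapshot}"    # per-process lines for your app
    df -h /                            # disk fills up quietly during long soaks
    echo "---"
  } >> /var/log/soak-monitor.log
  sleep 60
done
```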
Why three weeks?
Because most of your competitors have done a day-long or weekend-long test at best. So most of their resource leaks never surfaced. Most of their operational issues never manifested before they went into prod. Then all the teething problems came out with their very first customer, which cost them an arm and a leg and too many concessions at best, and their very first customer at worst.
Three weeks is long enough to identify most resource issues, be it memory leaks, disk space, or the amount of RAM and the number of cores required at startup. Yet it is still short enough to be budgeted for in the project plan, particularly if you asked for it six months out.
Can you get away with just two weeks? Sure. But instead of just stopping at two weeks, let the product launch and keep the experiment running. This way, you will have two weeks’ notice should anything nasty fall out. (Can I get away with three days? Sure, if that is all your project management could get you, take it as if your life depended on it. Because it does. Also, do update your CV, because they are not in the game for the long run.)
Why the traffic?
Because your customers will not be using your software just by looking at it from a distance. They will be putting some data in, and they expect to see some results out. Most of your race condition quirks will only show if there is time and traffic to trigger the unforeseen pathways.
If you can, put this on top of the half-a-million-record DB. That would, by all means, be a proper stability and robustness test, from all angles.
What if you can’t synthesise traffic? You can get away without any automated traffic, but the experiment will be a lot more limited in terms of what you can discover. If traffic is too hard to generate, at least do all your regular testing on that system, and organise one or two mob-testing sessions where all the people with at least one working finger are there testing the product for you.
What kind of traffic should we run on it?
You want to see the kind of traffic on it that you think your first customers will have. It should include occasional mistaken entries: some errors, missing fields, out-of-range values, parameters of the wrong kind. The point is to have some data flowing through the system for an extended period.
Any traffic is better than none. If your app is stateless and you want to run a B2C service for the public, get your JMeter out and shape some bell curves on a daily basis. Then do a simple hump (traffic has one summit, at 10am or 3pm), double-hump (8-9am and 6-7pm) and triple-hump variations (8-9am, 12-1pm and 6-7pm), with an occasional spike or two (Merkating sent out an email campaign).
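For the stateless case, one way to get a crude single hump out of JMeter in CLI mode is sketched below. It assumes a test plan (soak-plan.jmx, a made-up name) whose Thread Group reads its thread count from a users property, e.g. ${__P(users,1)}, and stops itself after roughly an hour; the hourly profile is invented, so shape it to your own expected day.

```
#!/usr/bin/env bash
# Shape a crude single-hump day with JMeter in CLI mode (sketch).
# Assumes the .jmx plan reads ${__P(users,1)} and runs for about an hour.

# Threads per hour of day (00..23): one hump peaking in the early afternoon.
PROFILE=(2 2 2 2 2 3 5 8 12 18 25 30 35 30 25 18 12 8 5 4 3 2 2 2)

while true; do
  hour=$(date +%H)
  users=${PROFILE[10#$hour]}          # 10# avoids octal trouble with 08 and 09
  jmeter -n -t soak-plan.jmx -Jusers="${users}" \
         -l "results-$(date +%F-%H).jtl"
  sleep 10
done
```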
If your product is a stateful piece, map out your most DB-heavy operations. Then put them on an eternal while loop on a separate machine, as a low-end solution. If you want something really sophisticated, get a runner VM and use the JMeter CLI as a generator to shape your mixed traffic, with 1-30% faulty requests, into a pretty bell curve.
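The low-end eternal loop really can be this simple; the URLs, payloads and the roughly 10% faulty ratio below are placeholders for whatever your heavy operations and error mix look like.

```
#!/usr/bin/env bash
# Low-end traffic loop for a stateful product: hammer the DB-heavy operations,
# with the odd deliberately faulty request mixed in. Endpoints are made up.
while true; do
  curl -fsS -o /dev/null "http://test-env:8080/api/reports/monthly-summary"
  curl -fsS -o /dev/null -X POST "http://test-env:8080/api/bookings" \
       -H 'Content-Type: application/json' \
       -d '{"room": "R'$((RANDOM % 500))'", "nights": '$((RANDOM % 14 + 1))'}'

  # Roughly 1 request in 10 is intentionally broken (missing field, bad value).
  if (( RANDOM % 10 == 0 )); then
    curl -s -o /dev/null -X POST "http://test-env:8080/api/bookings" \
         -H 'Content-Type: application/json' -d '{"nights": -1}'
  fi
  sleep 2
done
```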
How much traffic should we run on it?
Remember, you are not stress testing at this stage, so there’s no need to put in millions of transactions. Just imitate regular Joe with regular-Joe traffic on a busy day, non-stop, for three weeks. You can take some time to figure out your saturation-level traffic for the given pattern and run 50-80% of that amount. If you don’t have the capacity to play around with it, ask your manager what the maximum amount of traffic is that you should be expecting. As their estimates usually overshoot real life by a factor of around 20, use their estimate as the peak traffic on your bell curves.
How to interpret the results
If you have good monitoring in place, do check the charts on a daily basis. Any troughs or inexplicable peaks should be investigated. One funny suspect is DB response time getting really bad at 2:30 in the morning. Your DB is running maintenance scripts, so it is not an issue. Unless it’s 2:30 GMT, which is 6:30 PM PST, in which case you should re-tweak the automated cleanup scripts. If you don’t have monitoring, plough through the logs, grep for errors, and just watch them to see if there is anything unusual.
Your assumption is that the whole solution will keep running, uninterrupted and unassisted. If you need to intervene (log rotation, process restarts, Docker disk space, DB connection pools, the dog chewed the network cable), make sure those tasks are automated and scheduled in cron.
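As a sketch, the cron side of that could look like the entries below; the paths and script names are made up, the point is simply that nothing on this list should need a human.

```
# Illustrative crontab entries (edit with: crontab -e). Paths are placeholders.

# Rotate application logs every night at 01:00
0 1 * * *   /usr/sbin/logrotate /etc/logrotate.d/myapp

# Reclaim unused Docker disk space once a week, Sunday 03:30
30 3 * * 0  docker system prune -f >> /var/log/docker-prune.log 2>&1

# Nightly check that all expected services are still up (hypothetical script)
15 2 * * *  /opt/soak/check-services.sh >> /var/log/soak-checks.log 2>&1
```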
The basic rule of thumb is: every time your software crashes, your timer restarts. Fix all bugs until your software can run, with moderate load, for three weeks. If people are unhappy about this, one argument that is hard to debate is that every bug you find is four to five times more expensive to fix in production. So if you spend two days on the investigation and the fix, say calculating with a $/£/EUR 2K engineer day rate, that’s 4K now versus 16-20K in production: you just saved them at least twelve grand. So surely we can wait a little longer to make sure we don’t spend our annual budget on maintenance next year.
Finally, the DB.
Putting half a million records in your DB and doing a regression test under moderate load
How much traffic should we run on it?
By the time you get to this point, you have a pretty good infrastructure in place. In fact, if you play it right, you can use the previous 21 days to get those 500K subs, product items, randomised sensor readings or fake room bookings into your DB. Each day, 25K new data items. That makes it roughly 1K an hour, a new item roughly every three and a half seconds. This should be doable on any production system without a glitch.
The important part is that you should have rich, randomised data in your DB, not sequential IDs from 1 to 525001 with ‘asd’ as the first name and ‘qwerty’ as the surname. Those profiles should have proper random names, addresses, postcodes on different continents, purchase histories and even returned items. Most tools will happily generate random data of any sort, so make use of that capability. Also, do delete some of the stuff that you create: we’re talking random deletion of a few percent of your data, so there are gaps in any sorted or ordered list views that get created.
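A minimal sketch of spreading that seeding over the soak period, with randomised fields and the occasional deletion, could look like this; the endpoint, field names, word lists and the jq-extracted id are all assumptions, and a proper data generator will do a far better job on the names and addresses.

```
#!/usr/bin/env bash
# Seed roughly 25K randomised records a day (one every ~3.5 seconds) and delete
# a few percent again so the data has gaps. Endpoint and fields are made up;
# requires curl and jq, and assumes the API returns the new record's id.
FIRST=(Anna Bola Chen Dmitri Esther Farid Grace Hiro Imani Jonas)
LAST=(Martin Okafor Svensson Tanaka Novak Haddad Silva Kowalski)
CREATED_IDS=/tmp/created-ids.txt

while true; do
  fn=${FIRST[RANDOM % ${#FIRST[@]}]}
  ln=${LAST[RANDOM % ${#LAST[@]}]}
  id=$(curl -fsS -X POST http://test-env:8080/api/customers \
        -H 'Content-Type: application/json' \
        -d '{"firstName":"'"${fn}"'","lastName":"'"${ln}"'","postcode":"'"$((RANDOM % 99999))"'"}' \
        | jq -r '.id')
  echo "${id}" >> "${CREATED_IDS}"

  # Delete roughly 3% of what we create, picked at random from earlier records.
  if (( RANDOM % 100 < 3 )); then
    victim=$(shuf -n 1 "${CREATED_IDS}")
    curl -fsS -o /dev/null -X DELETE "http://test-env:8080/api/customers/${victim}"
  fi
  sleep 3.5
done
```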
What kind of traffic should we run on it?
Start some varied traffic on your instance, CRUD your usual data records in an automated fashion, and manually run through the whole regression suite that you have. No need to rush, take all the time you need because you might want to rerun the same tests with a few variations.
If you are working in a browser, keep the browser’s developer console open and record the network traffic as well. If you are working from a shell, have a separate one open in GNU screen and keep tailing the logs as you test.
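For the shell side, that can be as simple as the commands below; the log path is a placeholder.

```
# Start a named, detachable screen session and tail the logs inside it.
screen -S soak-logs
# Inside the session (log path is a placeholder):
tail -f /var/log/myapp/app.log | grep --line-buffered -Ei 'error|warn|timeout'
# Detach with Ctrl-a d; re-attach later with: screen -r soak-logs
```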
Some of the things you find will not be reproducible, or at least not easily, so there are two things you can do to help: first, make sure logging is set to debug level; second, buy a professional version of Snagit and record your sessions from the moment you start setting up the regression tests. Record everything, from as early as the setup. Don’t just say to yourself, let me set this up and I’ll start recording when I start testing proper. Nope. Start recording, and then you can start the setup. You can thank me l8ter for this.
How to interpret the results
If you can, get a DB monitoring tool in place. If not, you’ll need to mine the logs.
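If your DB happens to be PostgreSQL, for example, the built-in slow-query logging gets you surprisingly far before you need a dedicated tool; the threshold and log location below are illustrative.

```
# PostgreSQL-only example: log every statement slower than 250 ms, then mine
# the log. Threshold and log path are illustrative; adjust to your setup.
psql -U postgres -c "ALTER SYSTEM SET log_min_duration_statement = 250;"
psql -U postgres -c "SELECT pg_reload_conf();"

# Later, during and after the regression run:
grep 'duration:' /var/log/postgresql/postgresql-*.log | tail -n 50
```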
You are keeping an eye out for slow responses, unusual delays, anything that feels different from your previous testing sessions. Many things may just feel different, so ask all the people involved whether they think it is okay – and send them the logs, the timestamps and the video recording as well.
Anything that feels slower than before, or just generically slow, try to reproduce a few times. Is it always slow? Measure how slow it is, and stick a big timer on the recorded screen so it is right there in everyone’s face. Discuss with your product people whether or not this counts as a regression, and discuss with your dev team whether there is something they need to look into in the abyss of the database.
Monitor the logs for errors, because slow responses may not show up in the UI but may still cause issues in the background. Are any of your calls timing out? Is that acceptable? Simple timeouts can escalate into cascading failures given the right amount of pressure, so feel free to press the system harder if you notice something unexpected. Crank up the automated traffic to double the load and run that test again.
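A quick log pass for those invisible failure modes can start from something like this sketch; the patterns and paths are placeholders for your own log format.

```
#!/usr/bin/env bash
# Mine the logs for problems that never make it to the UI.
# Log path and patterns are placeholders -- adjust to your own format.
LOGS=(/var/log/myapp/*.log)

echo "== timeout count per file =="
grep -ci 'timeout' "${LOGS[@]}"

echo "== recent errors, retries and dropped connections =="
grep -Ei 'error|retry|connection (reset|refused)' "${LOGS[@]}" | tail -n 100
```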
So here we are, four thousand words later, ready to take a look at the second scenario: when you already have a few customers and you know you need to look at performance testing the software.