Volume aka Throughput aka Load Testing
One thing that is rather hard to communicate to dev teams is that the assumptions they made about volume when they originally designed the system may no longer hold. If your product was designed for a domestic market in Europe, you are probably counting on customers in the range from tens of thousands to possibly a couple of million (the difference between, say, Hungary and France, or Czechia and the UK or Germany). Now if your sales team suddenly starts courting the US market, say a national player, you have a very different game on your hands. Your range just jumped from millions to tens of millions, possibly a hundred million. Then comes, say, India, where you can quite happily sign up 10 million new subs every month for years.
Not only that, usage patterns are also different. While European customers may go radio silent over the weekend for certain services, Americans will not. While Greeks spend around 5 hours a day online, folks in the US spend over 7, and Colombians over 8.
User expectations with regard to response times will also be different. How long does it take for your new users to click on another search result because your service is not loading fast enough? How do your load times compare against those of your competitors?
Most people in a dev function will significantly underestimate how much hardware may be required for particular customers. There was this largish setup at one of the companies where I used to work. The development team estimated around a hundred DL380s, so they doubled the number just to be on the safe side for our launch partner, a large US company. Oh boy… they eventually needed over 1200 blades, and that was before they merged with a similarly sized company with about 80% of their subscriber numbers.
The moral of the story is that your volume measurements may have far-reaching repercussions for the planning and decisions of the development, sales and management teams. So talk to them like lovers do. What may look awful to you may be just good enough for them. What may be a medium-term objective for you may turn out to be the most burning issue on their agenda right now.
All in all, your load and volume testing experiments will probably serve the following purposes, possibly more:
– Establishing how much load, both in terms of numbers and volumes, a given hardware setup can take.
– Understanding if your existing setup is able to handle the expected load.
– Figuring out how your system can scale, if at all. (No, this is not a given.)
– Establishing a saturation level for a particular setup, and
– Understanding how your system, its components and services degrade.
– You may also be able to supply your marketing and sales teams with a few reasonably reliable figures they can use in their negotiations. Just make sure you caveat all statements with the full technical specs of the environment you used.
Saturation Level traffic
Saturation level here means the borderline between a stable, error-free high load and a stable, slightly higher load that throws occasional errors as the system struggles to handle the volume.
When your system is saturated, some or all of its resources are fully utilised and, at times, it will not have enough left to handle requests. It will start with a few hiccups and may then collapse into cascading failure even without any increase in the volume of traffic it needs to handle. In your experiments, you don't want that; you only want to know with some precision where the saturation level is.
For stateless applications, it may just be a particular number: 310 readers hitting your news site every second. Two fifty is good, two eighty is still good, but around 300/sec you start throwing errors. Readers are getting slow responses peppered with a couple of 4xx failures, and at 350 half of your traffic gets 4xx responses. (If you are getting 5xx errors, you have a security vulnerability that you might want to fix first.)
Load testing tools have a stepwise traffic shaping functionality where you can define the step size and how long the traffic should be held at each level. This is the simplest experiment you can run to establish the level of traffic that saturates your setup, both software and hardware. This volume and mix will be important for a lot of later experiments, so it is worth establishing early.
For stateful applications… well, you need to figure out a realistic mix of traffic and play through a few variations of it. Say your traffic is new users signing up, most users consuming various kinds of content on demand, and a few content creators uploading humongous amounts of data to be processed and consumed by others. You have free-tier users and paid users, and your quality metrics are primarily focused on the paid user experience.
For starters, you probably want an aggregated mix of your regular days / weeks / months. Then figure out whether you recreate that mix synthetically, or just up the volume but reuse anonymized production traffic. Hint: synthetic traffic gives you more freedom to play with the variables and less hassle with anonymization and the legal aspects. What it does not give you is the dirt. You want some grime in your traffic to ensure that your system can handle unexpected input as well.
So figure out the max peaks you are getting on your production servers, and assuming that your hardware has been able to handle them (if it had not, you would have resized it already), make sure you overshoot by 30% to 100%, in a neat stepwise fashion. Five or six steps with five-minute holds are enough for an initial run, so you get the figures within a 15-20% range in half an hour. Make sure you set your experiment to fail on error: at this stage you just want to know the level at which it first failed. Run the same experiment another couple of times. If the results are all similar, you're good to move on to the next stage. If the results were different at lunchtime, at night or at the weekend, repeat it a few more times in a row.
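As a minimal sketch of such a stepped run, assuming Locust as the load generator (JMeter plugins and k6 stages offer the same kind of stepping), a custom load shape might look like the snippet below; the endpoint, user counts and durations are placeholder assumptions, not figures from the text.

    # locustfile.py -- illustrative sketch only; endpoint, step size and durations
    # are placeholder assumptions, not recommendations.
    from locust import HttpUser, LoadTestShape, constant, task

    class Reader(HttpUser):
        wait_time = constant(1)              # roughly one request per user per second

        @task
        def front_page(self):
            self.client.get("/")             # hypothetical endpoint under test

    class StepLoad(LoadTestShape):
        step_users = 50                      # users added at each step
        step_hold = 300                      # hold each step for five minutes
        max_steps = 6                        # five or six steps for the initial run

        def tick(self):
            step = int(self.get_run_time() // self.step_hold) + 1
            if step > self.max_steps:
                return None                  # stop the test after the last step
            return (step * self.step_users, self.step_users)

Run it headless with something like `locust -f locustfile.py --headless --host https://your-test-env` and stop it at the first error, as described above.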
Once you have the rough saturation threshold, refine the setup: a shorter range, smaller steps and longer plateaus. Limit your range to +/-20% of your suspected saturation level and use quarter-hour intervals to hold the traffic at each level. Set the experiment up so it does not fail on the first error; now you actually want to see the service degradation as well. You will notice that the error charts are never linear. They are a bad kind of hockey stick: they go real gentle at first, but then the failure cascades through the system: 0.1% – 0.7% – 1.9% – 5% – 30% – 99%. You might also want to note down the level at which the stick turns sharply upwards; that is the actual total capacity of your particular software-hardware setup. You will need it later for capacity planning.
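If you export the error ratio per load step, a trivial script can turn that curve into a headline capacity figure: the highest step still within whatever error budget you decide to tolerate. A toy sketch, with made-up numbers rather than measurements from the text:

    # Toy sketch: find the highest load level still within a chosen error budget.
    # The (requests/sec, error ratio) pairs and the budget are made-up placeholders.
    steps = [(300, 0.001), (320, 0.007), (340, 0.019),
             (360, 0.05), (380, 0.30), (400, 0.99)]

    ERROR_BUDGET = 0.02      # 2%; pick whatever your SLO actually allows

    def usable_capacity(points, budget):
        """Highest load level whose error ratio is still within the budget."""
        within = [load for load, err in points if err <= budget]
        return max(within) if within else None

    print(usable_capacity(steps, ERROR_BUDGET))   # -> 340 with these made-up figures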
Run this experiment again, with the exact same setup another dozen times:
– at different times of the day,
– on different days of the week,
– at weekends, and
– straight after a physical reboot as well as
– several times in a row without any restarts or cleanups.
See if the results are still consistent around the same levels. If they are, good; if not, figure out what is changing in your setup. Are your weekend numbers looking much better? Your system shares resources with someone else. Are your night runs much worse? Maybe a team in a different timezone is hitting your DB unbeknownst to you. Are your results completely inconsistent? Another test system is feeding off your message queue. (All these examples are from real life.)
For stateful applications, this is just the beginning of a beautiful French chip. Your initial set of experiments was based on your regular traffic at best, or, at worst, on your assumption of what regular traffic should look like. Now you should also have a few runs with various traffic compositions.
If you took the synthetic traffic route, you should have a fairly easy job configuring these scenarios. For the sake of fun as well as educational purposes, you can try the pure traffic scenarios: registration only, account modification only, data upload only… In any case, do not spend too much time on this unless you notice something unexpected. Is one particular kind of traffic slower than expected? Is the saturation level significantly lower than for your original mix? These kinds of scenarios warrant further experiments.
However, you mainly want to focus on mixed traffic with a heavy X or heavy Y component. Focus on transactions that are not just DB reads but DB writes as well. You want to make sure that none of your transactions cause locks that would slow down the rest of the flow. Draw on real life to figure out what these heavily used areas may be. Is marketing doing a big email campaign, blasting millions of people with invitations to visit the site and signing up thousands of users every day? Have you just signed up a heavy B2B user who has not quite figured out how to use your API and is regularly sending erroneous input? Is sales trying to convert free-tier users to paid? Talk to the business folks about their short- and medium-term goals and translate those into traffic patterns.
The second type of traffic you want to focus on is transactions that generate very generic searches in the DB. Some whitebox testing might be necessary: look into the code, or into the slow-query logs of your DB, and figure out what part of the code ran that `SELECT * FROM` monster that took 15 minutes to return (yes, this actually happened in real life). First, your dev team may want to fix that if possible; if not, you need to know how it impacts the response times of the rest of your traffic.
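As a small illustration, assuming a MySQL-style slow query log, a throwaway script along these lines can pull out the worst offenders for the dev team; the log path and the threshold are placeholder assumptions.

    # Throwaway sketch: list the slowest statements from a MySQL-style slow query log.
    # Path and threshold are placeholder assumptions; adjust to your own setup.
    import re

    LOG = "/var/log/mysql/mysql-slow.log"
    THRESHOLD_S = 60.0

    def slow_statements(path, threshold):
        offenders, query_time, statement = [], None, []
        with open(path, errors="replace") as fh:
            for line in fh:
                match = re.match(r"# Query_time:\s*([\d.]+)", line)
                if match:
                    if query_time is not None and query_time >= threshold:
                        offenders.append((query_time, " ".join(statement)))
                    query_time, statement = float(match.group(1)), []
                elif query_time is not None and not line.startswith("#"):
                    statement.append(line.strip())
        if query_time is not None and query_time >= threshold:
            offenders.append((query_time, " ".join(statement)))
        return sorted(offenders, reverse=True)

    for seconds, sql in slow_statements(LOG, THRESHOLD_S)[:10]:
        print(f"{seconds:8.1f}s  {sql[:120]}")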
Make sure your experiment setups and results are documented along with the traffic mix descriptions and an explanation of why you chose those traffic compositions. Your future self will be eternally grateful for the why, what and how, for the metrics, and for the actual results, all of them, not just the aggregates.
One thing about performance testing results is that your most important results are not what fits on the bell curve, but in fact all the outliers. Things on the lower end usually signal a problem with the experiment setup. Things on the tail usually signal a systemic weakness.
In your report, seek to explain why the different profiles have different saturation levels and what it may mean for developers and for business stakeholders.
Error tolerance
Why on earth would you want to performance test nonsense traffic? Good question. Most applications were designed and built by genuinely good people. But some people just want to see the world burn, and others wield the power of ignorance or negligence. In any case, the real world is full of dirty traffic, and you want to make sure that error handling does not slow down your responses to otherwise perfectly legitimate, healthy traffic. Error handling logic is almost never tested as thoroughly as features, so you have little or no experience of how the business logic behaves under errors, how slow logging may become, or how much more bloated your logfiles can get.
So start injecting increasing amounts of faulty data that generates errors into your traffic. The easiest way is to use two independent traffic runners, say two pods that can run JMeter from the CLI. One would carry your regular stepwise plain vanilla traffic, just at two thirds of the volume, stepping up to your saturation level.
The other runner would have a generous mix of:
– Faulty API calls, syntactically correct but sent to the wrong endpoint
– Faulty data sent to the right API endpoint: missing fields, a wrong token, a string where an int is expected, additional non-existent fields
– Total gibberish sent to whatever input mechanism there is: a JPG file for an int, XML for JSON, binary for XML.
This other runner should start at say 1% of the startup traffic and then keep increasing volume to roughly one third of the total.
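A minimal sketch of what that second runner could look like, assuming a JSON API at a placeholder base URL and a plain Python script; in practice you would build the same thing as a JMeter test plan or in whatever load tool you already use, and the endpoints, payloads and ramp figures below are made up for illustration.

    # Sketch of the "dirty" runner: a rotating menu of malformed requests,
    # ramping up slowly. Endpoints, payloads and rates are made-up placeholders.
    import itertools, json, random, time
    import requests

    BASE = "https://test-env.example.com"          # hypothetical system under test

    BAD_REQUESTS = [
        ("POST", "/api/v1/registrations", json.dumps({"name": "Jane"})),                      # missing fields
        ("POST", "/api/v1/registrations", json.dumps({"name": "Jane", "age": "forty-two"})),  # string for an int
        ("POST", "/api/v1/no-such-endpoint", json.dumps({"name": "Jane"})),                   # wrong endpoint
        ("POST", "/api/v1/registrations", "<xml>not json</xml>"),                             # XML where JSON is expected
        ("POST", "/api/v1/registrations", random.randbytes(512)),                             # binary gibberish (Python 3.9+)
    ]

    def run(duration_s=1800, start_rps=1, end_rps=30):
        start = time.time()
        for method, path, body in itertools.cycle(BAD_REQUESTS):
            elapsed = time.time() - start
            if elapsed > duration_s:
                break
            rps = start_rps + (end_rps - start_rps) * elapsed / duration_s   # linear ramp
            try:
                requests.request(method, BASE + path, data=body, timeout=5,
                                 headers={"Content-Type": "application/json"})
            except requests.RequestException:
                pass                               # errors are the whole point here
            time.sleep(1 / rps)

    if __name__ == "__main__":
        run()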
Alternatively, you can just use OWASP ZAP and run a fully automated injection scan against every endpoint there is. Then you get a security scan as well as a truckload of nonsense thrown at your app. Make sure you set your delays to something your app can handle, because ZAP can easily overwhelm unsuspecting apps. You also need to do this with no cloud WAF in front of your application, because Cloudflare will happily block the 1.5 million calls as suspicious traffic, which kinda defeats the purpose.
What to look out for
In addition to your existing metrics of response times, throughput volumes and error ratios, you also want to monitor your system resources more actively: CPU and memory, disk consumption, and the increase in the number of files generated.
If your logfiles are rotated on size but never cleaned up, a constantly increasing number of files in /var/log (or any folder, for that matter) will eventually slow down your system. Thirty thousand files make it painfully slow, 160K over a narrow-bandwidth ssh shell make it inaccessible, and 2 million files practically kill your OS and your application with it.
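A tiny sidecar sketch for keeping an eye on this during a run; the folder and the sampling interval are placeholder assumptions, so point it at whatever your app actually writes to.

    # Tiny sidecar sketch: sample file count and free disk space while a run is going.
    # Folder and interval are placeholder assumptions.
    import datetime, os, shutil, time

    WATCHED = "/var/log/myapp"      # hypothetical log folder
    INTERVAL_S = 60

    while True:
        file_count = sum(len(names) for _, _, names in os.walk(WATCHED))
        usage = shutil.disk_usage(WATCHED)
        print(f"{datetime.datetime.now():%Y-%m-%d %H:%M:%S} "
              f"files={file_count} free={usage.free // 2**20} MiB")
        time.sleep(INTERVAL_S)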
Overall, you expect no performance degradation just because your traffic is not as pristine as fresh snow. In any case, compare your saturation levels to those of the plain traffic and see if there is anything like a >5% decrease. Up to five percent, you can sigh and just include it in your report as a would-be-nice-to-fix. Over five percent for saturation would be an issue of concern.
Also keep an eye on end-user metrics like response times. The reason you are using two runners is that you still want to be able to see your 9Xth percentile responses at and below the saturation level. If they are worse than before, you might want to talk to the product and dev people about how it should be fixed. (Dev might say it will take 6 months till they get there; PM might say let's fix it by throwing more hardware at it; they might also say they can live with it for now. Just remember, more hardware is no fix, and it increases your TCO and OPEX as well.)
Conclusions or something
Your existing performance data is your most reliable source of market information. You need to figure out what is worth measuring. Performance (testing) always implies speed if you are selling it, but resilience when you are buying it. Keep your metrics simple and meaningful. You are setting yourself up for failure if your metrics require half a page of explanation and a degree in Quantum Mechanics. Here are a few heuristics to guide you along the way.
Use the initial values as absolute benchmarks: if you had a reliable 600 ms response time to all your API queries, set that as the absolute threshold for all future numbers. Otherwise, you might fall into the sliding-window trap. Do not use sliding windows for comparisons, like "oh, it's just 1.4% slower than last time", because 10 releases later you have degraded almost 15% without noticing it. Use absolute benchmarks over long periods. This is your rock bottom. Always compare current performance to those values.
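A back-of-the-envelope illustration of the trap, plus the kind of absolute gate you might wire into a CI check; the 600 ms baseline is the example above, while the tolerance and the p95 input are made-up assumptions.

    # The sliding-window trap in numbers: "just 1.4% slower than last time",
    # ten releases in a row, against the 600 ms example baseline.
    baseline_ms = 600.0
    current = baseline_ms
    for _ in range(10):
        current *= 1.014                      # each release looks harmless on its own
    print(f"after 10 releases: {current:.0f} ms, "
          f"{(current / baseline_ms - 1) * 100:.1f}% over the absolute benchmark")
    # -> roughly 689 ms, about 15% worse, while every single diff looked fine

    def gate(p95_ms, absolute_threshold_ms=600.0, tolerance=0.05):
        """Fail the run if the p95 drifts more than 5% above the absolute benchmark."""
        return p95_ms <= absolute_threshold_ms * (1 + tolerance)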
Use the last measured values as another benchmark. This is your aspirational threshold. It is always a pleasure to see that we can do better than last time, and it will motivate your developers to refactor and improve. Don't let them obsess about it, but let it be a point of reflection and a motivational factor for them.
What is reliable?
As in, what is a reliable measurement? At some point you will want to create a dashboard, or rather your management will want you to show a live dashboard (that is always green) to the whole floor. You can only do that if you know your experiment results are reliable. You can also only do proper reporting if your numbers are reliable, in terms of both meaning and measurement.
Reliable in meaning: the underlying assumption of your business people will be that your metric is meaningful in their language, that it always means the same thing, and that, because it is quantifiable, bigger is always better.
If your metric just says "throughput", it is probably not a very clear one. Throughput of what? What kind of traffic? What is the unit of measurement? Users per eon, or gigabits per interface per second?
Your definition needs to be very specific, and it needs to align with your stakeholders' expectations: it is Registration Throughput per minute. It refers to the number of users who successfully registered a new account in a minute, as an average taken from a 60-minute sample. They will also assume that your experiments used realistic data. So in this sense, reliable means that you did not use a single "john doe" as the first name and surname for every registration, but that your registration data is randomly generated, contains all the likely fields, including address, social security number, shoe size, whathaveyou, and that all these fields contain valid, realistic data.
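As a sketch of what "randomly generated but realistic" can mean in practice, assuming the Faker library and a made-up registration schema:

    # Sketch: realistic-looking registration payloads instead of a single "john doe".
    # The field names are a made-up example schema, not a real API contract.
    from faker import Faker

    fake = Faker()

    def registration_payload():
        return {
            "first_name": fake.first_name(),
            "last_name": fake.last_name(),
            "email": fake.email(),
            "address": fake.address(),
            "ssn": fake.ssn(),                          # test environments only, obviously
            "shoe_size": fake.random_int(min=35, max=48),
        }

    if __name__ == "__main__":
        for _ in range(3):
            print(registration_payload())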
Let’s try to tackle the reliable in measurement part now.
Reliable in measurement: most non-testing folks assume that your experiments always yield the same result. Most also assume that once you have measured a metric, it can be stated as a fact and waved around freely, without risk or consequence. Now, anyone who has provided an early result to one such person can testify how hard it is to go back to the same people and explain that the initial results do not hold, that there is a new set of numbers and that they look X% worse. (Funnily enough, explaining better results is not half as bad.)
The numerical differences can stem from any number of sources. Mine, at the beginning of a new project, usually stem from a lack of understanding, or a misunderstanding, of how things work, how they are connected or what they actually indicate. My second biggest source of unreliable data is shared resources. My third is the actual race conditions that make performance testing a lot of fun.
Lack of knowledge: testing is a way of learning about a system. You discover how it actually works versus how you thought it was supposed to work. There is an inevitable readjustment of your mental maps and a search for why your mental map was faulty. This is a natural process, and it always takes place when you are new to a system. So try to refrain from announcing your results very early. You can say they look okay for now, but resist the temptation of sharing them with non-technical people and announcing them as the naked truth. Take your time, refine your experiments, and use the scientific method to invalidate your assumptions, not to try to prove them. Try various, radically differing approaches, something like simulated annealing, to see if your results are similar and show significant correlation. If they do, you can start taking measurements of a handful of metrics. Take ten, or a hundred if you can, and while you may accept the averages, your interest should focus on the outliers. Do not discard them unless you have solid evidence that the difference is attributable to external factors and not to the nature of the software you are testing.
Shared resources have been the most entertaining source of headaches in my testing career. First, there is the whack-them-on-the-head-with-an-oopsie: "You installin' somethin' on cluster 9?" "Yeah, buddy. Why you askin'?" "I was testing on it. Oopsie!" Make sure, if you can, that you have dedicated and ringfenced hardware for your performance testing. Make sure it says in big red letters that you are working on it and everyone should stay well clear of the fence.

The next variation is when you do as you are told. I was told to use the testing accounts on a dedicated VM (I know, you work with what you've got, and tell people about the limitations and the risks… sigh…). Anyhow, I start at night and all looks good. Then I carry on in the morning, when we hear screams from the direction of the C-suite: "Did someone forget to turn something off, or what the hell is going on?", just in a mildly nicer tone. It turns out that I had created an import queue of 900K items. Not because I did anything wrong or anything I was not supposed to, but simply because of shared resources.

Then there is the why-is-it-different-now nightmare. This one was simply because I was not aware that someone else was also using a system that was feeding mine. It took us three days to figure out why our results went from pretty good to awful over a lunchbreak. When we turned off that other consumer, our results were back to wonderful in no time.

Finally, there was the flip-flop: Ops forgot to disconnect the old system in a blue-green deployment after the upgrade, so both systems were taking items from the same message queue. Sometimes everything worked fine, sometimes nothing did. We were aging very rapidly until someone asked the dumb question: did you guys switch off the old system?
Finally, some of your results will be weird because your system runs into a race condition. This is why you are interested in the outliers. Those results can help you prevent a lot of pain in production. Investigate them with the dev team and make sure those issues get onto a performance improvement roadmap.
So run a lot of the same tests, and variations on what you think ought to be measuring the same metrics, and only then accept your numbers as reliable. Here's a way to go about it: put the same experiment on a loop and keep it running over a long weekend. If it runs without fail each time and the result figures are in the same ballpark, you are probably okay to share them. If it failed twice and 10% of the results are in ranges you have never seen before… well, Mr Watson, let's see what's in those logfiles.
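A sketch of what that "same ballpark" check could look like, assuming you keep one headline figure per run (say the p95 latency) plus a count of failed runs; the thresholds are arbitrary placeholders you would tune to your own context.

    # Sketch: decide whether a weekend's worth of repeated runs is "in the same ballpark".
    # Thresholds are arbitrary placeholders; tune them to your own context.
    import statistics

    def ballpark_ok(results_ms, failed_runs, max_cv=0.10, max_failed=0):
        """results_ms holds one headline figure (e.g. p95 latency) per completed run."""
        if failed_runs > max_failed or len(results_ms) < 2:
            return False
        cv = statistics.stdev(results_ms) / statistics.mean(results_ms)
        return cv <= max_cv              # coefficient of variation under 10%

    runs = [612, 598, 605, 620, 601]          # made-up p95 values in ms
    print(ballpark_ok(runs, failed_runs=0))   # -> True for this made-up batch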
Final Caveat – as much as we love our metrics and numbers, even performance testing results are more than what mere numbers can express – read this 4-minute piece https://www.linkedin.com/pulse/little-tester-tobias-fornaroli-borella/ for a really good point on why quantitative results need to be coupled with qualitative ones.