Benchmarking and Metrics in Performance Testing Series

Posted on 09/04/2023 (updated 10/06/2023) by lefty

Warning: you may find that a lot of the stuff that I write about is in the operations, site reliability and devops domain, and you may legitimately say: dude, I am a performance tester, I only test the software. Cool. This is not your blog. No need to read on. Thanks for the visit.

This is a blog for performance testing with an extreme ownership mindset. While it may not be your fault, everything is your responsibility. So performance testing here means making sure the whole solution works, come rain or shine.

How fast is fast?

How do you know how good your product is? Or if it is any good at all? I mean your competitors will not just send you their complete performance testing reports. Nor is there a Big Book of Performance Metrics for Your Industry. And none of your business users are likely to be able to say something like, dude, with a constant 5 user registrations per second, we need to get the 99th percentile within 200 ms. That somehow is rather unlikely to happen.

What is more likely is that people will turn to you to ask how good is good and how fast is very fast. So this is a little personal journey into what you can do to get reference numbers for yourself.

Benchmark against Yourself

If you have an existing product with a few customers, your existing performance is a good guideline. It may not even be acceptable, but it is something that you can measure and improve upon.

Benchmarks against Explicit or Implicit Requirements

Contracts or sales portfolios may contain explicit requirements or implicit promises with regard to how much, how fast and how long. Ask around, from the C-suite to sales, product and operations, to see if anyone has, has seen, or has heard of someone who might have a few tentative numbers.

Benchmark against Your Competitors

This may not be possible or feasible in your line of business or industry. However, in places, such as public trading of various kinds, where there are public information sources that log all events and make them available to all participants, you can look into how long transactions took for someone who is not you.

Benchmark against adjacent industries

Your industry may not have any public information or regulation with regards to nonfunctional requirements, but a similar industry, or an adjacent one may have explicitly stated requirements. You can use those as tentative guidelines.

Let’s look at each in more detail.

Using Your Existing Performance Metrics

The title raises two questions:
– What do we define as performance?
– What are the metrics that best capture our software's performance based on that definition?

What is performance testing in your context?

You run a search for articles and definitions of performance testing and you get the same bookish answer: do everything, measure everything. You speak to your stakeholders and your C-suite, and the top-of-the-food-chain folks will dream about a perfect, fully automated CI/CD pipeline with a dozen performance experiments and a gigantic green-red dashboard full of metrics, percentages and regression flags. Your development team leads will dream about an easy-to-deploy performance testing harness that profiles all their code and tells them about all the bottlenecks and O(n^2) time complexity issues before the code is written.

But the hard facts of life say otherwise. What you ought to do with those ideas is turn them into formalised requirements and put them in the Road Map section of your Performance Strategy – but more on that in another post later.

For now, assess the capacity, the infrastructure and the skills that you have, do as little as you have time and capacity for, and make the most of it at the same time.

Look at your customers, talk to your support team, read up on the performance-related tickets if you are B2B, and search around on social media if you are B2C. Chances are some of those people already know about your most painful performance issues. Their problem is that they have fables and lack the precision and repeatability of scientific discovery. What they do have, though, is what really hurts. Nobody will bother opening a ticket about a non-issue. If they took the time to write it up, it means they care. If they care, so should you.

Just run a search on “online service was extremely slow”. All those rants are indicators that those companies badly need decent performance testing. Or look up any of the categories on Downdetector. The Finance section is full of banks that have serious issues with their online availability. Do customers care why they can’t log in online? Nope. Should you? Hell yeah. You want to know every intimate detail: was it the web server, the network, was it localised or global, was it a firewall issue, an issue with the third-party authentication service, or was it just the DB cleanup script causing the glitch?

What should my metrics be?

You can look at it from your business’ or your stakeholders’ perspectives. Reach out to them, compile a list and add it to your roadmap. Then reach out to support and figure out what best captures what your customers care about. What makes them tick? Very often, they will actually use the metric to describe the phenomenon.

Their complaints tend to centre around a few areas.

Speed aka responsiveness: The results take forever. I click and it is very slow. When the US office on the East Coast wakes up at 2, the database just loads for hours.

Robustness, reliability, availability: The app keeps crashing. The service was not available yesterday at 4. My colleagues have problems connecting to your service all the time. They have to keep trying several times until they succeed.

Volume aka throughput: I tried to upload a 4K image and I got an error. We had a busy day and it took 9 hours to import all our transactions, and the system was just really slow for all our users. I don’t understand why my search results take a quarter of an hour to load.

Honourable mention – Error tolerance: It took the system 20 minutes to provide a totally useless error code saying that I had an error in my XML, so I gave up debugging after the second try. It was well formed, so I have no way of figuring out what to change.

What metrics make sense?

Speed and responsiveness

We are talking response and round-trip times, pings, database response times, UI and API response times. Define which one exactly you mean, and from where to where. There is a difference between how long it takes from receiving a trigger to firing a response in your lab, and your customer in Frankfurt clicking a submit button to upload a 3 MB XML file to your app running in Oregon on AWS and waiting for a reasonably fast response with a single euro value. The first is significantly simpler to measure and will be misleading. The second is a lot more complex to measure and to solve, but in the end what your customers experience is the only thing that matters. You need to clear that with the powers that be at the start.
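As a rough illustration, here is a minimal Python sketch of timing the client-perceived round trip rather than a lab-local server timing; the endpoint URL and payload are made-up placeholders, not anything from this post.

```python
# Minimal sketch: time the round trip from the client's side, connection setup,
# upload and all, rather than just the server-side processing time.
import time
import requests

ENDPOINT = "https://example.invalid/api/quotes"   # hypothetical endpoint

def timed_post(payload: dict) -> float:
    """Return the client-perceived round-trip time in milliseconds."""
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json=payload, timeout=10)
    elapsed_ms = (time.perf_counter() - start) * 1000
    response.raise_for_status()
    return elapsed_ms

if __name__ == "__main__":
    samples = [timed_post({"amount_eur": 1}) for _ in range(20)]
    print(f"min={min(samples):.1f} ms  max={max(samples):.1f} ms")
```

Run something like this from where your customers actually sit, not from the rack next to the app server, and the two numbers in the paragraph above will stop being interchangeable.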

Define the load: 100 – 1,000 – 10K transactions per hour, minute or second. Is your machine stateless or stateful (excellent lil talk on this from the guy who created Gatling: )? Is your product B2C or B2B? Does the traffic consist mainly of GET requests reading from the DB or content store, or are there all kinds of GETs, POSTs, PUTs and DELETEs exercising the DB vigorously? Does the data flow in one direction only, or is it heavy both upstream and downstream? Start with something simple but realistic.
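Writing the answers down explicitly helps. A minimal sketch of what such a load definition could look like, with invented rates, shares and endpoint names:

```python
# Minimal sketch of spelling out the load and the traffic mix explicitly.
# All numbers and endpoint names are invented placeholders.
LOAD_PROFILE = {
    "rate": "100/s",           # start simple: one number, one unit
    "mix": {                   # shares must add up to 1.0
        ("GET", "/search"):    0.70,  # read-heavy traffic from the content store
        ("POST", "/orders"):   0.20,  # writes that exercise the DB
        ("PUT", "/orders"):    0.07,
        ("DELETE", "/orders"): 0.03,
    },
    "stateful": False,         # does a virtual user carry session state?
}
assert abs(sum(LOAD_PROFILE["mix"].values()) - 1.0) < 1e-9
```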

Define the pattern: Do your customers flock to you in batches, creating bell curves of sorts? Are there special customers who make 5K transactions in a few minutes, creating spikes? Does your marketing team send out hundreds of thousands of emails in batches of 10K, creating hedgehogs? Do your first customers wake up at 7 in the Eurozone, then a second wave hits at lunchtime, followed by the rise and shine on the East Coast, then the West Coast, then the lunches again? Shape that traffic for each scenario, and look at real-life stats if you can get access to them.
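If you want something executable rather than a hand-drawn curve, here is a minimal Python sketch that composes a daily arrival-rate profile out of a few bell-shaped waves; the wave centres and peak values are invented for illustration, not measured data.

```python
# Minimal sketch of shaping a daily traffic profile out of a few "waves".
import math

def wave(hour: float, centre: float, width: float, peak: float) -> float:
    """A single bell-shaped burst of traffic centred on `centre` (UTC hour)."""
    return peak * math.exp(-((hour - centre) ** 2) / (2 * width ** 2))

def requests_per_second(hour: float) -> float:
    """Sum of the waves: Eurozone morning, Eurozone lunch, US East, US West."""
    return (
        wave(hour, centre=6.0, width=1.5, peak=40)     # Eurozone wakes up
        + wave(hour, centre=11.0, width=1.0, peak=25)  # Eurozone lunch
        + wave(hour, centre=13.0, width=1.5, peak=60)  # US East Coast
        + wave(hour, centre=16.0, width=1.5, peak=35)  # US West Coast
    )

if __name__ == "__main__":
    for h in range(24):
        print(f"{h:02d}:00  ~{requests_per_second(h):5.1f} req/s")
```

Feed a schedule like this into whichever load tool you use, so each scenario replays the shape and not just the total volume.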

Define the expectations for percentiles: Say you want your API to respond northbound within 150 ms for the 90th percentile, 200 ms for the 95th percentile and 300 ms for the 99th, with max values always under 400 ms. You can either pick some arbitrary numbers, or make a couple of measurements and use the results from those as your guides.
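A minimal sketch of turning those expectations into an automated check, assuming you already have a list of measured response times; the thresholds mirror the example above and the sample data is random.

```python
# Minimal sketch: check measured response times against percentile targets.
import random
from statistics import quantiles

TARGETS_MS = {90: 150, 95: 200, 99: 300}
MAX_MS = 400

def check(samples_ms: list[float]) -> bool:
    ok = True
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = quantiles(samples_ms, n=100)
    for pct, limit in TARGETS_MS.items():
        value = cuts[pct - 1]
        ok &= value <= limit
        print(f"p{pct}: {value:.1f} ms (target {limit} ms)")
    ok &= max(samples_ms) <= MAX_MS
    print(f"max: {max(samples_ms):.1f} ms (target {MAX_MS} ms)")
    return ok

if __name__ == "__main__":
    check([random.gauss(120, 40) for _ in range(1000)])
```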

In any case, make sure you document these in a Performance Benchmarking Guide. This doc should have a reference section where the metrics are defined, a methodology section that describes how the measurements are done and explains why certain choices and decisions were made, and finally a section where the benchmark values are recorded. This doc is there for posterity, so people, including your future self, have easy access to performance reference data for comparison.

Robustness, reliability, availability:

Bit of a detour around availability, particularly for Cloud services

Sooner or later, you will be pulled into these conversations. Whether from a legal perspective, a support perspective or a sales perspective does not matter. You’ll be asked: so what is our uptime? What should we put in our SLAs? What can we commit to when we make promises? And because everyone else has a tendency to overpromise and underdeliver, and because you are not at a stage where you actually have years’ worth of reliable metrics to back up your statements, you’d better be real careful. Measure a lot, run multiple scenarios, run each several times on various environments and keep good records. Give legal estimates based on the worst possible scenarios and the most pessimistic numbers, give dev the most likely downtime forecasts all compounded, and give a realistic estimate to sales.

Unless you make flint tools the old-fashioned way, chances are that your availability will be limited by your third-party providers’ availability. Your hosting provider, your compute provider, your authentication service, your storage and whatever other service you may rely on will each have their own SLA, so think long and hard about how much availability you put in your legally binding agreements. There’s a really good post from Google on composite cloud availability calculations; it is well worth the read.

A heuristic on availability
People with manager and chief in their titles are likely to come to you asking if you could guarantee, promise or deliver 99.9% availability. The answer in most cases is a straightforward no, but you need to be able to explain that even if your software and deployment are perfect, even 99% is a big stretch.

– If provider one offers you 99% availability a month, provider two 98% and provider three 95%, then your availability cannot be better than 92%. Not 97%, not 92.1%. Your downtime as a result of their downtime is not given by the product of their availabilities but by the total sum of their downtimes.

“But you’re wrong, that’s not what the big fat textbooks say!”

Sure. If you feel more comfortable with the book approach, knock yourself out. I am a Fat Tony type of guy. Cascading failures and the fact that a network outage will also take out your access to storage are treated as irrelevant, and two of those services being out at the same time will definitely not increase your overall uptime.

Every provider will use their downtime with no regard for the others. From your perspective these are all independent. If your storage is down for half an hour each month, your compute is down for half an hour, and so are your network and your web server, then you are down for two hours. As simple as that. So if all your providers are at 99.9%, it means you cannot be.
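The arithmetic behind this heuristic, next to the textbook product, in a few lines of Python:

```python
# Minimal sketch of the heuristic above: treat provider downtimes as
# non-overlapping (worst case) and compare with the textbook product.
provider_availability = [0.99, 0.98, 0.95]

# Textbook approach: multiply, assuming independent, overlapping outages.
product = 1.0
for a in provider_availability:
    product *= a

# Heuristic here: your downtime is the sum of their downtimes.
worst_case = 1.0 - sum(1.0 - a for a in provider_availability)

print(f"textbook (product): {product:.2%}")     # ~92.17%
print(f"worst case (sum):   {worst_case:.2%}")  # 92.00%
```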

If your software is not perfect, here’s a rundown of what you might want to factor into your calculations.

What kinds of things you should consider in your availability calculations

Remember, at this stage you have no historical data to rely on. Half a year’s or a whole year’s worth of data is awesome as a reference number, but my palms would be sweaty, knees weak, arms heavy if I nodded okay to sign a binding contract based on those numbers alone. If you do, check whether your graphs show a trend, for better or worse, and use as large a time window as possible. Six months’ worth of operational data will not expose all the potential downtime scenarios that your legal team needs to cover. So to keep the spaghetti off the shirt, here’s my rundown of things to factor in.

I also know that no company is going to give you the time and money to test everything on the list below. It is more like a menu to go through, picking and choosing the ones that represent the biggest, most catastrophic risks in your world.

Startup Times

Your systems or components are not immediately available the moment the code starts. So add time for startups.

What to test and what to measure

Test start times from power-up to the completion of the first successful transactions. The timestamps you are interested in on bare metal run from power switch to OS ready, platform ready, container ready, databases and reference data loaded, connections established, first heartbeats coming in, last component checked in, first transaction complete. In virtual environments, you ought to measure how long it takes for the hypervisor to come online, how long Docker takes to be up and running, and how long your containers and pods take to be up; the rest of the software checkpoints should be the same.

First transactions are often ignored, but they are the true sign that you are not just up, but also running. One such test that I used to run on customer production systems was a negative test: I pinged the customer’s API with an invalid key and expected the 401 that we used to mask the server-side error.
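A minimal sketch of such a first-transaction probe, with a hypothetical endpoint and header name; it times how long it takes from the start of polling until the expected 401 comes back.

```python
# Minimal sketch of a "first transaction" readiness probe, modelled on the
# negative test described above. The URL and header name are hypothetical.
import time
import requests

PROBE_URL = "https://example.invalid/api/v1/orders"  # hypothetical endpoint

def seconds_until_first_response(timeout_s: int = 900) -> float:
    """Poll with an invalid API key until the expected 401 comes back."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            r = requests.get(PROBE_URL, headers={"X-Api-Key": "invalid"}, timeout=5)
            if r.status_code == 401:      # the app is up *and* running
                return time.monotonic() - start
        except requests.RequestException:
            pass                           # not up yet, keep polling
        time.sleep(5)
    raise TimeoutError("no valid response within the startup window")
```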

OS Updates and Upgrades

You may need to schedule some time off for OS updates (OS here meaning some kind of *nix flavour), which normally do not require much downtime, unless a kernel update is necessary, in which case you need to restart the darn thing. If you are forced to migrate from a sunset version, it’s a different kettle of fish, although you could do the switchover with practically no downtime. In any case, factor in restart times across the entire cluster. This may differ significantly from a simple test-setup start time.

What to test and what to measure

Ideally, you run in a blue-green setup, so you can use the secondary to run these experiments and take the times for all 30 servers, not just the three in the lab. In a less ideal world, you have probably already done this, so you need to go back to your logs: /var/log/messages and last are your friends, along with a generous amount of grepping. In either case, measure from the issuing of the shutdown to the completion of all startup processes.
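If you do end up grepping, a minimal sketch along these lines can turn the log into a number; the log path, the marker strings and the timestamp format are assumptions you would need to adapt to whatever your syslog actually writes.

```python
# Minimal sketch: pull the shutdown and "everything up" timestamps out of a
# log file and compute the gap for the most recent restart.
from datetime import datetime
import re

LOG = "/var/log/messages"
SHUTDOWN_MARK = "Stopping Journal Service"   # assumed shutdown marker
READY_MARK = "Startup finished"              # assumed "everything up" marker
TS = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8})")  # e.g. "Sep  4 07:32:10"

def timestamp(line: str) -> datetime:
    # assumes classic syslog timestamps at the start of the line
    return datetime.strptime(TS.match(line).group(1), "%b %d %H:%M:%S")

shutdown_at = ready_at = None
with open(LOG) as fh:
    for line in fh:
        if SHUTDOWN_MARK in line:
            shutdown_at, ready_at = timestamp(line), None
        elif READY_MARK in line and shutdown_at and not ready_at:
            ready_at = timestamp(line)

if shutdown_at and ready_at:
    print("restart took", ready_at - shutdown_at)
```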

If you have a prod-like lab, keep running the same restart-everything test from various initial states. Restart when there is no load, restart when the DB is empty, restart when you already have half a million records in the DB, restart when you have multiple traffic runners putting generous amounts of load on the system. You are not interested in averages. You are interested in the worst results. (Once you have the benchmark numbers, you should go back to those results and figure out the whats and the whys: what system resource is my bottleneck? Which component took the longest to come up? Why was it so slow? Why was the DB not responding? … The answers are vital for the ops and dev teams.)
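A minimal harness for that experiment might look like the sketch below; the restart and readiness hooks are hypothetical stand-ins for your own environment (ssh scripts, an orchestration API, the probe from earlier), not real library calls.

```python
# Minimal sketch of timing repeated restarts and keeping the worst result.
import time
from typing import Callable

def measure_restart(
    restart_cluster: Callable[[], None],       # hypothetical hook: issue the restart
    first_transaction_ok: Callable[[], bool],  # hypothetical hook: readiness probe
    poll_s: float = 5.0,
) -> float:
    """Time from issuing the restart to the first successful transaction."""
    start = time.monotonic()
    restart_cluster()
    while not first_transaction_ok():
        time.sleep(poll_s)
    return time.monotonic() - start

def worst_case(durations: list[float]) -> float:
    # You are not interested in the average; keep the worst result per scenario.
    return max(durations)
```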

Software Upgrades

If you have zero-downtime deployment, good for you. For the rest of us, schedule double the time normally required for upgrades. If you finish early, nobody will pat you on the shoulder; if you don’t, at least you have enough buffer not to miss your SLAs.

What to test and what to measure
Installability is one of those performance testing areas that is only listed in the textbooks but otherwise never actually an issue. Oh, wait… I used to work on a project where each upgrade was a gruelling three-day manual process and the QA cycle was also three days.
– Test installation in a production like environment.
– Test upgrades in product-like environments.
Put an actual timer on how long it takes from start to finish. Not an estimate, an actual timer.
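A minimal sketch of what an actual timer can look like, wrapping each upgrade step during a rehearsal run; the step bodies here are placeholders.

```python
# Minimal sketch of an "actual timer": wrap each upgrade step and keep the log,
# instead of writing down an estimate afterwards.
import time
from contextlib import contextmanager

durations: dict[str, float] = {}

@contextmanager
def step(name: str):
    start = time.monotonic()
    try:
        yield
    finally:
        durations[name] = time.monotonic() - start
        print(f"{name}: {durations[name]:.0f} s")

# usage during a rehearsal upgrade in a production-like environment:
with step("stop services"):
    time.sleep(0.1)   # placeholder for the real step
with step("run installer"):
    time.sleep(0.1)   # placeholder for the real step
```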

Heuristics for red flags that your numbers will not hold
– XZY is a special customer. We have a special deal with them.
– MNO is such a custom-built project. It has a lot of custom code that is not in your main repo.
– ABC has some custom config. This usually means that config is a mess, so your config migration script won’t work.
– QWE is an on-premise deployment where the customer has also installed some of their own special scripts. You are dealing with a dirty machine install/upgrade, and you have no idea how much of the actual resources are available to your application.

All these mean that your default measurements will give you estimates that are off by 100% to 1000%.

Database Upgrades and Migrations, Schema Changes

For DB migrations and schema changes, budget four times as long as it took you when you tested it. I remember one instance when there was a major DB upgrade and the guys asked for an 8-hour window. They got 4 hours. Eventually it took 48. I was watching 150 salespeople prowling the floor because they could not make a single call for two days. I also remember the DBAs looking about 60 years older 48 hours later.

What to test and what to measure

See if you can get an anonymized copy of the prod database. If you can, use that for all your DB testing. Make sure all steps are automated and that your complete automated regression test set runs on it without a glitch.

If you have a green and a blue, also test how long it takes to roll back in case the upgrade fails. If your setup is different, test how long a backup and restore takes.

If you can’t do a restore, book 36 hours over the weekend, order pizza and fizzy drinks, and may fortune be ever in your favour. And make sure legal puts a clause into the contracts saying that DB upgrades are outside the scope of binding SLAs.

With longer maintenance windows, you can be creative. First, your contracts should explicitly state that you may need to have them from time to time. Second, 8 hours may not fit within your monthly SLA metrics, but a quarterly or annual window of that size should be manageable: an 8-hour slot is roughly 0.37% downtime in a quarter and about 0.09% per annum.
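The arithmetic, spelled out:

```python
# The arithmetic behind the claim above: an 8-hour window as a share of
# monthly, quarterly and yearly uptime budgets (average-length periods).
WINDOW_H = 8
for label, hours in [("month", 730.5), ("quarter", 2191.5), ("year", 8766.0)]:
    print(f"{label:>7}: {WINDOW_H / hours:.2%} downtime")
# month: 1.10%, quarter: 0.37%, year: 0.09%
```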

Hardware failures

Factor in hardware faults and replacement times. We had one major upgrade that was to be run on a cluster of 936 Dell 380 blades. We knew that disk controllers and disks are vulnerable to restarts, as in some of them may not come up again. And if that happened during the upgrade, we’d be doomed. As those nodes had been running for over two years, and because of a kernel update, we had to reboot them during the upgrade. So as a preventive measure, before the upgrade started, we rebooted the whole lot in batches; eventually we only had to replace one disk controller, and the upgrade went fine. I guess my main message is: hardware fails, count on it. Your infra guys will probably know how long it takes before pieces start to crumble, so give yourself time for that as well. They will also know whether parts can be hot-swapped. Also, know your RAID config, as it will help you figure out how much time you may need for a replacement.

What to test and what to measure
Talk to ops first. Ask if you have enough redundancy so that no single hardware failure will take down your whole system. You expect to hear about a proper N+1, fully redundant system for bare iron. But even for virtualised setups, you’d really like to hear that all components are decoupled, run in their own independent containers, and that there’s a hot-hot secondary on a separate physical server.

But for those of us who are not that lucky… I guess we’ll just have to work harder on the following:

Disaster Recovery-type of Thingies

Though these would not be covered by your SLA, they are good to know.

– How long does a complete new setup take?
Remember, if you have read this far, you are the guy who has no database backup and no fully redundant deployment. So talk to ops to see how long it takes them to provision the hardware to the right specs. Document the hardware specs for reference. Find out how long the OS and the platform take to install and configure. Find out if there is a default config that runs out of the box or whether you have to manually fidget in some arcane cfg for hours. Ask specifically about firewalls, load balancers and networks. (There is hardly anything more fun than having to set up a system in two subnets with a firewall in between, while the customer is drumming on the server room window and your C-suite and PM want separate half-hourly written reports. Just kidding. It could be worse.) At this stage, there is hardly anything you can do other than documenting how long it would take to get the env up and ready for the install, and, or rather AND, communicating this to everyone you meet in the corridor. This is a major risk, so you want people to know about it. Alternatively, you can also just raise it in a meeting or an email; responses can be rather harsh or terse, or both. YMMV.

To comfort you, at least you have on paper how long a complete setup takes from bare iron to the first successful transaction – this will be super useful in your Business Continuity Plans for Operational Disaster Recovery.

– How long does it take to replace parts that fail regularly?
Talk to ops about the disks: what is the RAID setup in prod and what is it in the lab? Ask them how long it takes to replace a disk. Double that time for when it can’t be hot-swapped.

– Hardware Dependencies for Virtualised Envs (i.e. Docker)
I know all the Docker people have been going: yay, luckily I don’t need to worry ’bout hardware, do I? Well, you do. We had this Python application that was bleeding memory through the nose and the containers were growing like crazy. We had one test setup that was growing by 2 gigs an hour. As the software fix required a platform fix, which was not going to happen in this century, systems would be regularly restarted.

Also, in order not to have to do daily restarts, the available disk space was tripled so the containers had room to grow. The bitter pill came when we realised that the containers had not been taken down when the hardware disk space was increased under the virtualisation layer, so the mapping between the hardware layer and the virtual layer was done in. Resizing under the live system messed up the references and the systems got stuck between the two worlds. We could not take them down because the references were wrong, but they were still running out of space. It required quite some dark magic to restore those instances into this world. So anything that involves resizing your virtualisation layer may require downtime if you don’t have enough physical space provisioned for you.

To make matters worse, if your app is designed in a way that it can only run as a single instance, what happens when you have an event like Sandy? When Sandy, the hurricane, hit NYC, we had a hosting provider whose server room was in a basement… Boy, the fish were swimming among the blades. We had servers standing in six feet of water.

Factor in how long it takes for you to recover. This is one of those topics that people prefer not to even mention, never mind discuss openly and honestly. Most companies consider it taboo in commercial and legal conversations, and even dev and ops will tend to wave it away with a don’t-happen-much-and-it’s-real-quick-anyway while they stare at their shoes. So you need to be honest and tactful. You need to quantify the not much and the real quick. Take several samples of how long it takes your complete system to start up, as well as how long it takes to reboot.

Now do the same in a resource-constrained environment: cut back on memory, disk, cores, coffee, whathaveyou. Why? Because when your system crashes you want it back up ASAP, so you will not start looking for things to fix. You want it up, and then you do the fix. So you need to know at what levels of resource starvation your system will no longer start up. Then you need to know how much longer it takes to start up when things are just barely acceptable.
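A minimal sketch of running that experiment with Docker resource limits; the image name is a placeholder and the readiness check assumes the image defines a HEALTHCHECK, so adjust both to your own setup.

```python
# Minimal sketch: rerun the startup measurement under shrinking resource limits
# to find where startup slows down and where it stops working at all.
import subprocess
import time

def start_constrained(memory: str, cpus: str, timeout_s: int = 600) -> float:
    """Start the container with limits and time until it reports healthy."""
    subprocess.run(
        ["docker", "run", "-d", "--rm", "--name", "perf-probe",
         "--memory", memory, "--cpus", cpus, "yourapp:latest"],  # placeholder image
        check=True,
    )
    start = time.monotonic()
    try:
        while time.monotonic() - start < timeout_s:
            status = subprocess.run(
                ["docker", "inspect", "-f", "{{.State.Health.Status}}", "perf-probe"],
                capture_output=True, text=True,
            ).stdout.strip()
            if status == "healthy":
                return time.monotonic() - start
            time.sleep(5)
        raise TimeoutError(f"did not become healthy with {memory} / {cpus} cpus")
    finally:
        subprocess.run(["docker", "stop", "perf-probe"],
                       check=False, capture_output=True)

for memory, cpus in [("2g", "2"), ("1g", "1"), ("512m", "0.5")]:
    print(memory, cpus, f"{start_constrained(memory, cpus):.0f} s")
```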

Sidenote for Docker people: on hardware pinning and automatic failover to different hardware

In the cloud, “virtualised” only means you have leased your hardware problems to someone else. Fair enough, it is a special skill set and good people are hard to find, so it does make sense. But it does not mean that your system is now completely hardware independent. If someone says so, they need serious retraining. Send them to your operations team and make sure they work for ops for six months.

As a consequence, you normally would not run performance tests, particularly not load, volume and stress tests, on virtualised environments. The best you can do is comparative experiments, but even those might be totally unreliable, because you have no guarantees about the hardware layer underneath your virtual setup. Sometimes there may be an exception: you can request that particular hardware be pinned to your VMs. Then you can try. You will still be impacted by how busy other systems on the same hardware may be, over which you may or may not have some control.

Similarly, if you have backup systems, ideally, you’d prefer them to be virtualised on a different piece of hardware, far far away from the primary instance.

Next up is Volume testing proper. The whys, whats and hows.
