Performance vs Diagnostic Metrics
The value of any tracked metric is in the actions it inspires. Numbers on a wall or a slide deck don’t mean anything on their own; they’re signposts and signals that guide behaviour. Making a metric go up or down can be very rewarding. Some metrics, however, aren’t there to be made as big or as small as possible.
These I’ll call Diagnostic Metrics. We choose Diagnostic Metrics to act as flags, signalling that something is broken. They often have a threshold (or two, forming a healthy range) that, once crossed, prompts you to act. If the metric is in its healthy zone, the best thing you can do is leave it alone.
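To make that concrete, here’s a minimal sketch of a diagnostic check in Python. The bounds and values are illustrative assumptions, not anything from a real monitoring system:

```python
def in_healthy_zone(value: float, low: float, high: float) -> bool:
    """True while the metric sits inside its healthy range.

    A one-sided diagnostic (e.g. "stay below 60%") simply pins
    one of the bounds.
    """
    return low <= value <= high

# Illustrative: a metric with a 0-0.6 healthy band.
if not in_healthy_zone(0.72, low=0.0, high=0.60):
    print("Metric outside healthy zone - time to act")
else:
    print("Healthy - leave it alone")
```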
Why it matters: There’s a temptation to believe that every metric can be optimised forever. Sales can keep going up; cost of acquisition can keep coming down. For performance metrics, that makes sense. Trying to optimise diagnostic metrics, however, is a waste of effort. If we’re happy for the CPU load on our server to average below 60%, driving it down towards 0% achieves nothing.
- Diagnostic metrics exist at a personal level as well. Don’t read too much into the medical sense of “diagnostic” here, but many blood markers have a similar healthy range. If your iron levels sit in the healthy band, no change is needed.
- Diagnostic Metrics won’t give you an immediate fix. They are there to provide a diagnosis, an indication that something is wrong. They’re not there to give you the cure.
- Take average CPU load as an example. If the server spends a day at 95% load, there’s no lever we can pull to drop it back down (short of spending more on CPU). The alert sparks an investigation; once we find the cause, we make the change and the alarm disappears. A rough sketch of this kind of alert follows below.
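As a sketch of the alert itself (the 60% ceiling comes from the earlier example; the sampling and the notification are assumptions, not a real monitoring API), note that all the code can do is flag the problem and prompt a person to investigate:

```python
import statistics

CPU_CEILING = 0.60  # healthy ceiling from the example above

def should_alert(load_samples: list[float]) -> bool:
    """Fire when the day's average load drifts above the ceiling.

    The alert carries no fix: it only tells us to go looking for
    the cause (a runaway process, a traffic spike, a slow query).
    """
    return statistics.mean(load_samples) > CPU_CEILING

# A hypothetical day stuck around 95% load:
samples = [0.93, 0.95, 0.97, 0.95]
if should_alert(samples):
    print("CPU average above 60% for the day - open an investigation")
```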
How to classify a metric: You may be wondering: is my metric a Diagnostic Metric or a Performance Metric? The answer, perhaps frustratingly, is that it depends. Any metric or measure can be Diagnostic or Performance depending on how we use it.
- Let’s say this quarter we’re focused on software performance, using Response Time as our guide. We’ve not measured it before for this endpoint, but customer feedback suggests it’s too slow. For the next quarter, Response Time is a Performance Metric: we work on it to see how low it can go. We make some progress, complaints stop, and we’re happy.
- Next quarter rolls around and Response Time is already on the dashboard. Complaints are low enough that we don’t want to dedicate more time to it, but we certainly don’t want it to regress. Response Time has become a Diagnostic Metric: we pick a threshold, and as long as we don’t drift above it, we’re happy focusing effort elsewhere.
- A good time to switch from Performance to Diagnostic is when we see diminishing returns on new work. We can drive up our click-through rate for a quarter or two, but eventually we’ll hit a ceiling. We then decide whether a much smaller shift is worth the same effort, or whether resources would be better spent elsewhere. A sketch of the switch follows below.
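To show how little actually changes when a metric flips from Performance to Diagnostic, here’s a sketch using the Response Time story above (the names, values, and threshold are all illustrative): the measurement is identical, only the question we ask of it differs.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    value_ms: float

def performance_view(m: Metric) -> str:
    # Performance mode: every millisecond shaved off counts.
    return f"{m.name} at {m.value_ms:.0f} ms - keep pushing it down"

def diagnostic_view(m: Metric, ceiling_ms: float) -> str:
    # Diagnostic mode: only a regression past the threshold matters.
    if m.value_ms > ceiling_ms:
        return f"{m.name} regressed past {ceiling_ms:.0f} ms - investigate"
    return f"{m.name} healthy - leave it alone"

response_time = Metric("response_time", 180.0)
print(performance_view(response_time))                   # this quarter
print(diagnostic_view(response_time, ceiling_ms=250.0))  # next quarter
```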