5 Things to Consider When Dealing with Availability of Software Systems
Are you involved in monitoring and application performance management topics? Are you maybe even responsible for some monitoring topics in your company? Then you might be familiar with the following situation, which we often see in our customer projects:
Your manager or the manager of your manager rushes into your office with the following request:
“I need a report about the availability of our systems. I need this done by Friday next week!”
You are terrified. But quickly you realize that you are a lucky guy. Remember the great application performance management (APM) solution you set up some months ago, something like AppDynamics, Dynatrace, Instana, NewRelic or maybe even an open source solution? You would think that your APM system should be able to generate this report at the press of a button, especially after all the customizing you did. But are you sure these numbers are the availability statistics you are looking for? Well, think again.
Let’s think about the meaning of availability …
First, let’s consider the definition of availability from Wikipedia:
In other words, availability is the portion of time your system is up in relation to a considered time window. Easy, right?
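Expressed as code, this basic definition is just a ratio. The following is a minimal sketch; the function name `availability` and the 30-day window with 3 hours of downtime are assumptions made up for illustration, not figures from a real system.

```python
def availability(uptime_seconds: float, window_seconds: float) -> float:
    """Return the fraction of the time window during which the system was up."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= uptime_seconds <= window_seconds:
        raise ValueError("uptime_seconds must lie between 0 and window_seconds")
    return uptime_seconds / window_seconds


# Hypothetical numbers: a 30-day window with 3 hours of downtime in total.
window = 30 * 24 * 3600
downtime = 3 * 3600
print(f"{availability(window - downtime, window):.4%}")  # prints 99.5833%
```

Note that the function says nothing about what "up" means or which system is being observed. That is exactly where the trouble starts.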
Now, let’s look into typical software systems and consider some simple examples in which the term availability becomes less trivial than its definition:
Example 1: Amazon’s Recommendation Feature
Take the Amazon online shop as an example. An Amazon user browses to a product details page. You might know Amazon’s product recommendation feature, which shows you “customers who ordered this item also viewed these items …”. Let’s assume the product details page is working fine except that the recommended items are not loaded properly.
So, what would you say, is the page available? — Yes? — No? — Maybe?
Example 2: Buying Process
For the sake of simplicity, let’s stay with Amazon for the second example as well. Consider a simple buying process at an online shopping platform like Amazon. Customers usually spend most of their time searching and browsing through different products and their details before they place an order. For simplicity, let’s assume a proportion of 99% browsing to 1% placing an order. Now, consider the case that everything works fine except for the interaction step where customers place the order. Following the definition of availability, the system has an availability of 99%, right? But actually, the usefulness of the system is zero for both the customers and the online shop, as the ordering process cannot be completed.
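The gap between the two views can be sketched in a few lines. The step names, the dictionary layout and both helper functions below are hypothetical. Only the 99% / 1% split comes from the example above. Treating the process as a chain in which every step must succeed is one possible way to model it, not the only one.

```python
# Hypothetical per-step data based on the 99% browsing / 1% ordering assumption.
steps = {
    "browse": {"traffic_share": 0.99, "availability": 1.0},
    "place_order": {"traffic_share": 0.01, "availability": 0.0},
}


def request_weighted_availability(steps: dict) -> float:
    """Availability as seen per request: each step weighted by its traffic share."""
    return sum(s["traffic_share"] * s["availability"] for s in steps.values())


def process_availability(steps: dict) -> float:
    """Availability of the whole process: every step has to work."""
    result = 1.0
    for s in steps.values():
        result *= s["availability"]
    return result


print(f"per request: {request_weighted_availability(steps):.0%}")  # prints 99%
print(f"per process: {process_availability(steps):.0%}")           # prints 0%
```

Both numbers are "correct". They simply answer different questions.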
Example 3: Seasonal Usage Patterns
Let’s consider an application with a seasonal usage pattern. This is often the case, for example, with internal business applications that are used intensively during business hours but not at all at night. Let’s assume there is a periodic synthetic test running that checks the availability of such an application. Quiz question: what exactly is the meaning of 90% availability in such a case? Is availability measured all the time or just during business hours? Is it really an issue if the application is not available during the night, when no user is affected by the downtime? Not really, right?
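To make the difference tangible, here is a small sketch that evaluates the same synthetic check results twice: once over the full week and once restricted to business hours. Everything in it is an assumption for illustration: the hourly check interval, the business hours (Monday to Friday, 8:00 to 18:00), and the nightly outage between 0:00 and 4:00.

```python
from datetime import datetime, timedelta


def in_business_hours(ts: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """Assumed business hours: Monday to Friday, start_hour up to (excluding) end_hour."""
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour


def availability(checks) -> float:
    """Share of successful checks; each check is a (timestamp, ok) tuple."""
    checks = list(checks)
    if not checks:
        raise ValueError("no checks in the selected window")
    return sum(1 for _, ok in checks if ok) / len(checks)


# One week of made-up hourly synthetic checks, starting on a Monday.
start = datetime(2024, 1, 1)
checks = []
for i in range(7 * 24):
    ts = start + timedelta(hours=i)
    ok = not (0 <= ts.hour < 4)  # assumed nightly outage from 0:00 to 4:00
    checks.append((ts, ok))

overall = availability(checks)
business = availability(c for c in checks if in_business_hours(c[0]))
print(f"all day, all week: {overall:.1%}")   # prints 83.3%
print(f"business hours:    {business:.1%}")  # prints 100.0%
```

The same measurements yield roughly 83% or 100%, depending only on which time window you decide to count.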
Things to consider when dealing with availability of software systems
Though availability has a clear, general definition, the examples above show that the semantics of this term depend heavily on the concrete context, the scope and especially the questions the availability metrics are supposed to answer. You should consider the following aspects when measuring, deriving and analysing availability statistics:
- Align goals
The examples above show that availability statistics (and statistics in general) can be interpreted (or even manipulated) in different ways.
Make sure that there is no conflict between the actual goal of the statistics and the personal interests of the ones who collect the statistics.
Example: You can imagine what your availability numbers would look like if your salary were tied to these numbers and you were also in charge of collecting those statistics ;-).
- Be clear about underlying measurement methods
Availability is a metric that is derived from some measurements. The semantics of the resulting availability numbers highly depend on the method of measuring availability. For instance, availability of a digital service from an end-user perspective can be measured either through synthetic monitoring or by passive monitoring of the real users. While both methods provide availability metrics, they can be significantly different. For instance, if synthetic tests yield 100% availability, it does not necessarily mean that none of the users encounter availability issues.
Thus be clear about the measurement method and the origin of the availability metrics.
- Availability is relative
The examples above clearly show that there is no generic, out-of-the-box meaning of availability in the context of software systems. Availability is always relative to a target system or aspect, such as an individual machine, software component, service, end-user interaction or a whole process like in Example 2. When you are asked to provide availability statistics, make sure there is a common understanding between all stakeholders about what the questions are that should be answered by those statistics.
- End-User first
When thinking about availability, start thinking from the end-user and process perspective, because that’s what counts in the end. From the business perspective, it’s quite unimportant if a machine or service fails as long as no end-users or processes are affected by this (assuming that the failure is recovered before users are actually affected). Of course, technical availability of services and underlying infrastructure may be interesting for failure forecasting, capacity planning, etc.
- Relate to business relevance
It’s always a good idea to set availability into relation with the importance of the corresponding business transactions and processes. If business-critical services and transactions are unavailable, their resolution should be treated with higher urgency and priority than that of less critical services.
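To come back to the point about measurement methods: the sketch below compares a synthetic view with a real-user view of the same day. All numbers are invented for illustration. It assumes 288 synthetic probes (one every five minutes) that all succeed, and 10,000 passively monitored real-user requests of which 300 fail, for instance because only a certain browser or region is affected.

```python
def success_ratio(outcomes) -> float:
    """Share of successful outcomes in a sequence of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no measurements available")
    return sum(outcomes) / len(outcomes)


# Made-up data: synthetic probes only exercise one path from one location ...
synthetic_checks = [True] * 288
# ... while real users hit the system in ways the probes do not cover.
real_user_requests = [True] * 9_700 + [False] * 300

synthetic_availability = success_ratio(synthetic_checks)
real_user_availability = success_ratio(real_user_requests)
print(f"synthetic monitoring: {synthetic_availability:.1%}")  # prints 100.0%
print(f"real-user monitoring: {real_user_availability:.1%}")  # prints 97.0%
```

Neither number is wrong. The report just has to state which of the two it shows.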
What is your experience with measuring and analysing availability statistics? Have you experienced similar challenges? Leave a comment here and let’s further discuss what the challenges are when dealing with availability statistics.