When an outage or degradation occurs, users demand action.

They lodge support tickets and expect operations or product teams to spring into action: investigating and determining the root cause, and either remediating it themselves or logging the fault with an upstream provider to pursue resolution.

The counterintuitive response is for operations and product teams to look at all this and ‘do nothing’. Yet, if they understand their end-to-end service delivery chain in detail, doing nothing may be the quickest and most effective way of dealing with the problem.

The world today operates as a series of complex delivery chains. Consider a retail environment: warehouse management and logistics systems ensure that distribution centres carry appropriate amounts of stock and that it is allocated to trucks appropriately.

More systems ensure that the stock makes its way onto shelves, that it can be scanned and paid for at a point-of-sale (POS) terminal, and that sales information is fed back to the logistics side of the business for demand planning and restocking purposes. Still more systems and APIs are needed to ensure all these systems interconnect and can operate together seamlessly. Everything needs to work, or the service delivery chain will start to falter.

Different failures will have different impacts. The failure of a mapping service, for example, could break route optimisation for truck transport between warehouses and stores, or between stores and residences for home delivery. A failure of the POS terminal, or of the communications network it routes transactions over, may leave customers stranded at the register and/or break automated restocking processes because sales data cannot be transmitted to the ordering system.

In all of these situations, the retailer has “options”. They might switch route optimisation systems to a backup maps service; have a wireless broadband failover that POS can operate on in case the primary fixed-line connection to the store fails; or be able to manually download data from the POS and load it into the ordering system as a stopgap measure to ensure shelves remain stocked the following day.

The “options” available depend on the possible scenarios the retailer has planned for. It’s not practical to have a ready backup for every possible outage condition they might encounter. There might be a (hot, cold or warm) backup option for the most obvious outage types. This can be engaged in outage scenarios where the response is a simple binary choice: ‘if the primary goes down, then switch to the secondary or backup’.
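To make that concrete, a minimal sketch of such binary failover logic might look like the following; the health-check endpoints, service names and probe are illustrative assumptions, not any particular retailer’s setup.

```python
# Minimal sketch of a binary failover decision: use the primary mapping
# service unless its health check fails, in which case switch to the
# secondary. URLs and the probe are hypothetical placeholders.
import urllib.request

PRIMARY = "https://maps-primary.example.com/health"
SECONDARY = "https://maps-backup.example.com/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_route_service() -> str:
    """Simple binary choice: primary if healthy, otherwise the backup."""
    return PRIMARY if is_healthy(PRIMARY) else SECONDARY
```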

But an incident rarely benefits from such a set-in-stone response. Complex end-to-end service delivery chains demand a more nuanced consideration of the outage conditions.

In the mapping service example, a quick call to the service provider might confirm the fault has been isolated to a bad software update that’s in the process of being rolled back, and that normal service will resume within minutes. Knowing that, it would be pointless to switch to a backup map service and recalculate truck routes, because the primary service is likely to recover before any of that can occur anyway. The quickest solution would be to do nothing.

It’s a situation that operations and product teams everywhere increasingly encounter: the most effective path to resolution is determined by understanding how all the components in the delivery chain function and having visibility into each one right across the chain, so that an informed decision can be made and cool heads can prevail at the very moment level-headed thinking is most needed.

Making the right call

We’ve established that teams benefit from having more information and context available during an outage or degradation scenario, because it allows for more nuanced decision-making, including the option of not reacting at all if that would bring about the best outcome.

But not all information is created equal. A payment gateway, for example, may appear to be available and reachable while still being unable to process payments. Specific information is therefore needed to determine the root cause of an issue.

Being able to test the functionality of a problem component, and pinpoint exactly where the fault lies, shows how complex the issue will be to fix. This, in turn, influences the continuity or recovery path chosen: whether the response is to do something, or to do nothing and wait it out.
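As an illustration of the gap between ‘reachable’ and ‘working’, the sketch below contrasts a network-level probe with a functional test of a payment gateway. The endpoints, the test-mode charge and the response fields are assumptions made for the example, not any real provider’s API.

```python
# Illustrative sketch: reachability vs. functionality of a payment gateway.
# The URLs and the JSON shapes are hypothetical.
import json
import urllib.request

GATEWAY = "https://payments.example.com"

def is_reachable() -> bool:
    """Network-level check: does the gateway answer HTTP at all?"""
    try:
        with urllib.request.urlopen(f"{GATEWAY}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def can_process_payments() -> bool:
    """Functional check: can the gateway actually authorise a test-mode charge?"""
    payload = json.dumps({"amount": 100, "currency": "AUD", "test": True}).encode()
    req = urllib.request.Request(
        f"{GATEWAY}/v1/charges", data=payload,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp).get("status") == "authorised"
    except OSError:
        return False

# A gateway can pass the first check and still fail the second; only the
# functional test narrows down where in the chain the fault actually lies.
```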

In addition, what we’ve discussed so far concerns determining the best course of manual intervention, whereas what operations teams are really working towards is automated or semi-automated responses.

Skilled and knowledgeable teams, supported by context-aware visibility into the end-to-end service delivery chain, are best-placed to determine which parts of incident response can be automated, and then under what circumstances an automated response should be allowed to trigger.

Whether a fault condition is significant enough to trigger an automated or manual intervention, or a wait-and-see “do nothing” approach, depends on the severity and nature of the fault detected.
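A simplified sketch of that kind of decision policy is shown below; the severity levels, the recovery-time estimate and the failover-cost threshold are all assumptions for illustration, not a prescribed model.

```python
# Illustrative sketch: choosing between automated failover, manual
# escalation, and a deliberate "do nothing" wait, based on the severity
# of the fault and the provider's estimated time to recovery.
# All thresholds and categories here are assumed for the example.
from dataclasses import dataclass

FAILOVER_COST_MINUTES = 15   # assumed time to switch over and recalculate

@dataclass
class Fault:
    severity: str                 # "minor" | "major" | "critical"
    estimated_recovery_min: int   # e.g. taken from the provider's status page

def choose_response(fault: Fault) -> str:
    if fault.severity == "critical":
        # A pre-approved, well-understood scenario: safe to automate.
        return "automated failover"
    if fault.estimated_recovery_min <= FAILOVER_COST_MINUTES:
        # The primary will likely recover before a failover could complete.
        return "do nothing (wait it out)"
    # Anything in between is a nuanced call best left to the team.
    return "escalate for manual decision"
```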

What’s clear is this: teams that have the visibility to see into and map out a problem are better-placed to weigh up courses of action and make the right decision in more circumstances.

They also have a documented evidence trail after the fact, where they can ‘show their workings’ to demonstrate why a particular call was made, and whether it was successful.

This, in turn, can improve future decision-making, and potentially train an automated decision-making model to handle the response if history repeats.

Mike Hicks is principal solutions analyst at Cisco ThousandEyes.