Datacentres go down. It doesn’t matter who runs them or how much redundancy you build into them: eventually something happens that takes them down.

It could be an earthquake. It could be a flood. Most often, it is human error. You may be able to mitigate such things, but you can’t prevent them entirely. The Cloud doesn’t change that.

Puff. One of Amazon’s datacentres goes down and it’s as if the entire Cloud has vanished. Listen to the outcry. “See, we told you so: you can’t trust the Cloud,” they all say. “If Amazon can’t run a reliable service, then how can you trust anyone else?”

It’s not that simple. Datacentres go down. The Cloud doesn’t change that.

Here are some of the other things that the Cloud doesn’t change:

SLAs based on individual components are meaningless.  

Most components are useless on their own: they deliver value by interacting with other components. That’s pretty well the definition of a system. Yet most vendors only give Service Level Agreements for individual components – the storage or the network or the compute engines. 

To deliver an acceptable end-to-end service, you need to build additional redundancy between and across these components. That’s the client’s responsibility and always has been, whether in the Cloud or not.
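To see why component-level SLAs say so little about the end-to-end service, it helps to do the arithmetic. The sketch below is illustrative only: the SLA figures are hypothetical, not any vendor's published numbers, and it assumes that components fail independently, which real systems often don't.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Every component must be up for the service to work, so availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """The service is up if at least one replica is up (assumes independent failures)."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical component SLAs, chosen only for illustration.
component_slas = {"storage": 0.999, "network": 0.999, "compute": 0.9995}

# A single stack is only as available as the product of its parts.
single_stack = serial_availability(component_slas.values())

# Two complete stacks in separate sites, either of which can serve traffic.
two_stacks = redundant_availability([single_stack, single_stack])

print(f"single stack: {single_stack:.6f}, "
      f"{downtime_hours_per_year(single_stack):.1f} hours down per year")
print(f"two stacks:   {two_stacks:.6f}, "
      f"{downtime_hours_per_year(two_stacks):.2f} hours down per year")
```

The chained service comes out worse than any of its individual components, and that gap is what the client has to close with redundancy of their own. The second figure is optimistic, because it ignores the failover mechanism itself and any failure that hits both sites at once.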

It’s extremely hard to eliminate every failure mode.

Cloud marketing emphasises the redundancy that the vendors have built into their services. Too many people assume that this means they’ve eliminated all single points of failure.

But there’s still plenty of scope for failure at the integration points and in the service interactions. And, as the Amazon outage shows, there may still be weaknesses within the service provider’s systems. 

Identifying and eliminating every possible failure mode requires a lot of expertise. And it’s hard work, doubly so on dynamically reconfigurable systems such as the Cloud – you have to check the bounds of all the possible reconfigurations, and the transitions between them. 

Resilient design is complex. 

It takes a lot of time, thought and skill to design highly resilient systems, even for relatively simple configurations. Adding redundancy always adds complexity, so you need to check that this hasn’t introduced new failure points. 

The more dynamic the system is, the greater the number of options you need to consider. Nor is technology the only factor – the operational processes, escalation paths, testing modes, etc, all need to be designed for resilience too. The Cloud doesn’t change any of this.

Resilient systems are expensive. 

You need to pay for redundant components. You need to pay for all the additional storage, network transfers, etc, that will be required to keep sites in sync with each other.

You’ll need to do a lot more testing. You’ll need to train people for the potential operational scenarios. Above all, you’ll need to pay for expertise, both to design the systems and to provide support cover for them.

The reality is that these costs simply can’t be justified for many organisations – it may make perfect business sense to accept the occasional outage. That’s fine, so long as it’s a conscious and considered decision. 

You get what you pay for. 

Cloud has changed the economic model. It’s introduced some attractive new price points, but it hasn’t changed this fundamental truth: vendors that offer much higher service levels charge more for them.

The Cloud doesn’t change any of these fundamentals of resilient systems design. Some of them, it makes harder. For example, Cloud service providers rarely give much transparency into the configuration of the technology underpinning their services. (If they did, it would restrict their ability to dynamically reallocate resources as demand changes.)

Without transparency, it’s a lot harder to identify all the possible failure modes. All you can do is design on the basis of worst-case assumptions, i.e. assume that the vendor’s service will fail completely at some point, and design your systems to deal with it.
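In code, designing for the worst case can be as simple in outline as refusing to depend on any single provider. The following is a minimal sketch, not a production pattern: the provider functions, the names "primary" and "standby", and the ServiceUnavailable exception are all hypothetical stand-ins for whatever your real service calls look like.

```python
class ServiceUnavailable(Exception):
    """Raised by a provider call when the underlying service is down."""


def call_with_failover(providers, request):
    """Try each (name, handler) pair in order and return the first success.

    Assumes every provider can serve the request on its own, i.e. the data
    it needs has already been replicated to each site.
    """
    errors = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except ServiceUnavailable as exc:
            # Record the failure and move on to the next provider.
            errors.append((name, str(exc)))
    raise ServiceUnavailable(f"all providers failed: {errors}")


def primary_region(request):
    # Simulates the worst case: the vendor's service has failed completely.
    raise ServiceUnavailable("primary region is down")


def standby_region(request):
    return f"handled {request!r}"


served_by, response = call_with_failover(
    [("primary", primary_region), ("standby", standby_region)],
    "GET /status",
)
print(served_by, response)
```

The loop is the easy part. The hard and expensive parts are the ones the sketch assumes away: keeping the standby's data current, detecting failure reliably, and testing the switch-over before you need it.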

The Cloud also makes some things easier. Cheap, usage-based pricing makes it a lot easier to afford redundancy, for a start. That’s what’s really happening here: the Cloud has shifted the price point for redundancy. 

So organisations which weren’t able to justify the cost of resilient systems are starting to ask if they can afford them now. Some of them can, but it’s not a free option.

Some Amazon customers continued to deliver service throughout the recent outage. Look at the case studies and you’ll see that they’d thought about resilience and designed their systems to deal with failures in the underlying components.

They were able to exploit the benefits of the Cloud, without simply assuming that it would deliver resilience for free.