When it comes to building and operating the infrastructure to power some of the most popular services on the internet, it’s no surprise that many companies have decided that Amazon can do a better, more cost-effective job than they can.
Amazon’s AWS infrastructure-as-a-service offerings, of course, aren’t perfect. Outages, some of them quite large and ugly, do happen.
In outages past, it was easy to give Amazon a pass. After all, Amazon has always been clear that AWS customers should plan for failures and architect their applications so that they use multiple regions, something that many companies don’t do. But following the most recent outage, which degraded or knocked out popular services including Netflix, Instagram and Pinterest, it would appear that Amazon may not be able to wash its hands so easily this time around.
Why? Some of the biggest companies affected apparently did everything Amazon suggested, and still found themselves facing hours of downtime.
As detailed by Wired’s Robert McMillan, tweets from Netflix and Instagram employees seem to indicate that Amazon’s Elastic Load Balancing (ELB) service, which was supposed to ensure that requests were distributed across Netflix’s AWS resources in multiple Availability Zones, wasn’t working properly. Amazon tells customers that they:
…can build fault tolerant applications by placing your Amazon EC2 instances in multiple Availability Zones. To achieve even more fault tolerance with less manual intervention, you can use Elastic Load Balancing. You get improved fault tolerance by placing your compute instances behind an Elastic Load Balancer, as it can automatically balance traffic across multiple instances and multiple Availability Zones and ensure that only healthy Amazon EC2 instances receive traffic.
Had ELB worked as expected when the Ashburn, Virginia availability zone went down, traffic for the affected Netflix and Instagram applications would have been routed to the other availability zones in which the companies had deployed those applications. In short, Netflix and Instagram were apparently doing everything right by Amazon’s standards, but Amazon still failed them.
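To make the expected behavior concrete, here is a minimal sketch of health-aware load balancing across zones. It is not Amazon’s implementation: the `Instance` and `HealthAwareBalancer` names, the round-robin policy and the zone labels are all assumptions chosen for illustration. It shows only the idea in Amazon’s description, namely that instances in a failed zone stop receiving traffic while the remaining zones keep serving.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical compute instance living in one availability zone."""
    name: str
    zone: str
    healthy: bool = True


class HealthAwareBalancer:
    """Illustrative round-robin balancer that skips unhealthy instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0  # running counter used for round-robin selection

    def mark_zone_down(self, zone):
        # Simulate a zone-wide outage: every instance in that zone fails
        # its health check.
        for inst in self.instances:
            if inst.zone == zone:
                inst.healthy = False

    def route(self):
        # Only instances currently passing health checks are candidates.
        healthy = [inst for inst in self.instances if inst.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances in any zone")
        chosen = healthy[self._next % len(healthy)]
        self._next += 1
        return chosen


if __name__ == "__main__":
    pool = [
        Instance("web-1", "zone-a"),
        Instance("web-2", "zone-b"),
        Instance("web-3", "zone-c"),
    ]
    balancer = HealthAwareBalancer(pool)
    balancer.mark_zone_down("zone-a")  # the zone that lost power
    for _ in range(4):
        target = balancer.route()
        print(target.name, target.zone)
```

In this toy model, requests keep flowing to the two surviving zones after `zone-a` fails. The outage reports suggest that this is the step that did not happen as customers expected.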
Obviously, the apparent failure of Amazon’s ELB service is quite disconcerting, but there are growing questions about AWS in general. The Ashburn, Virginia data center failed during a severe weather event, but as McMillan notes, “A storm shouldn’t have taken out Amazon’s backup generators.” That may or may not be a fair assessment, and we won’t know until Amazon explains what happened. In any case, one thing is increasingly clear: what we often expect cloud providers to withstand and what they can actually withstand are two very different things.
At the end of the day, there can be little doubt that companies looking to host their applications reliably in the cloud will need to build redundancy through multiple vendors, not multiple data centers with the same vendor.