On Thursday, April 21st, Amazon experienced a large outage that took down hundreds of websites, including popular ones such as Foursquare, Reddit, Springpad, Hootsuite, BigDoor, and Quora. Service was fully restored only three days later.
Amazon released a full description of what happened. In a nutshell, Amazon shifted traffic in one of its zones from one network to another in order to upgrade the first. The traffic was mistakenly routed to a lower-capacity network that could not handle the load. As a result, Amazon EBS volumes (Elastic Block Store, persistent storage typically used for databases and file systems) in one zone of the US East Region became unable to serve read and write operations.
Besides the many websites taken down during the outage, it turned out that 0.07% of the EBS volumes in the affected zone could not be recovered. Chartbeat told its customers that it had lost 11 hours of historical data, calling the loss ‘irrecoverable’.
Since then, a lot has been written about the dangers of the cloud. Amazon’s outage seems to have lent weight to those who raise security concerns against EDA’s cloud aspirations. What should we make of Synopsys’ recent announcement that it will provide SaaS on Amazon Web Services?
But many companies that rely entirely on Amazon’s cloud services were not significantly affected by the outage. Twilio did not go down. More notably, Netflix, which runs its massive infrastructure entirely in Amazon’s cloud, did not go down either.
Why were some websites taken down while others were unaffected? Because some systems are designed to be resilient to all sorts of failures. Here is how Netflix describes its experience of moving its infrastructure to AWS:
“One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
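To make the idea concrete, here is a minimal sketch of a Chaos Monkey style drill. It is a toy in-memory simulation, not Netflix's tool and not an AWS API: the `Instance`, `Cluster`, `chaos_monkey`, and `run_drill` names are all hypothetical, and "killing" an instance just flips a flag. The point it illustrates is that a service which fails over and replaces dead nodes keeps answering requests while instances are killed at random.

```python
import random


class Instance:
    """A stand-in for a server (hypothetical, not a real AWS object)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Cluster:
    """A group of redundant instances behind a naive failover router."""

    def __init__(self, size):
        self.instances = [Instance(f"node-{i}") for i in range(size)]
        self._next_id = size

    def serve(self, request):
        # Try each instance in turn, skipping the ones that are down.
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance available")

    def heal(self):
        # Replace every dead instance with a fresh one; return the count.
        dead = [i for i in self.instances if not i.alive]
        for instance in dead:
            self.instances.remove(instance)
            self.instances.append(Instance(f"node-{self._next_id}"))
            self._next_id += 1
        return len(dead)


def chaos_monkey(cluster, rng):
    """Kill one randomly chosen live instance; return its name, or None."""
    live = [i for i in cluster.instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim.name


def run_drill(rounds=20, size=3, seed=0):
    """Kill an instance each round and check the cluster still serves."""
    rng = random.Random(seed)
    cluster = Cluster(size)
    served = 0
    for n in range(rounds):
        chaos_monkey(cluster, rng)
        cluster.serve(f"request-{n}")  # raises if the kill caused an outage
        served += 1
        cluster.heal()
    return served, cluster
```

Calling `run_drill()` completes every round as long as the cluster has more than one instance. With `size=1` the first kill leaves nothing to fail over to and `serve` raises, which is exactly the kind of weakness such a drill is meant to expose before a real outage does.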
In short, being in the cloud does not make you inherently safer or more exposed. It means that you have to design your system so that it can recover from defects, any defect. As G. Reese put it at O’Reilly:
“If your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.”
And I’ll stand by this too.