Lightning Storms in the Cloud (yeah, not funny if you're in Ops)
The technology community and cloud enthusiasts were reminded over the weekend, once again, that you need to architect your technology for failure if you're going to use a cloud hosting provider like Amazon. While there are days when AWS feels like magic to me, every system has its bad moments, and Sunday was AWS EU's turn.
Lightning struck in the cloud on Sunday. Writing that, I can't help but snicker, as Amazon's EU data center was actually struck by lightning, causing a major power outage in one of their availability zones. The strike was powerful enough to take down even their backup power generators.
Here's what Amazon told customers:
“We understand at this point that a lightning strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We’ve now restored power to the Availability Zone and are bringing EC2 instances up. We’ll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT.”
If you happened to home your applications in the affected availability zone, or had a critical single point of failure there, you were down. And if you didn't have capacity outside of Amazon EU, you may have struggled too, as everyone affected tried to spin up instances in the region's other zones at once, making extra capacity hard to come by.
At Mashery, we've been an AWS customer for a long time. While we are still big fans, we also need to run a highly available service, so our product and service are designed to tolerate AWS failures. We did lose some servers in the affected availability zone, but our network automatically routed traffic to alternate locations. Our service, and more importantly our customers' APIs, remained continuously available.
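The core of that kind of failover can be sketched very simply: probe each location's health endpoint in preference order and send traffic to the first one that answers. This is a minimal illustration, not Mashery's actual routing layer; the endpoint URLs and the single-threaded probe loop are hypothetical, and a real system would check zones in parallel and cache results.

```python
import urllib.request

# Hypothetical health-check URLs for the same service in three
# locations, listed in preference order. Illustrative only.
ENDPOINTS = [
    "https://eu-west-a.api.example.com/health",
    "https://eu-west-b.api.example.com/health",
    "https://us-east.api.example.com/health",
]

def first_healthy(endpoints, timeout=2):
    """Return the first endpoint whose health check succeeds, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            # Zone unreachable (e.g. power or network outage):
            # fall through and try the next location.
            continue
    return None
```

If the preferred zone stops answering, traffic simply lands on the next healthy location in the list, which is the behavior that kept our customers' APIs up on Sunday.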
But it did remind us that lightning does strike, and you need to be prepared for a rainy day in Ops. (Sorry, couldn't help myself.)
Rule #1 in Ops is never talk about an outage because you will somehow cause one.
After I wrote this yesterday, but before it was posted, Amazon EC2 US East failed for 31 minutes, by our records, due to a major network connectivity outage. Once again, our network managed itself by rerouting traffic and we retained our high-availability posture. Unfortunately, some of our customers who single-home their APIs on EC2 East went down, and many high-profile websites (http://tcrn.ch/oqOcrQ) were also affected. Not a fun week in ops shops all over the web, and we commiserate with those still recovering.