The cold reality of the Kinesis Incident
It was a systemic failure, not a random event. AWS must do better.
This holiday season, give your non-technical friends and family the gift of finally understanding what you do for a living. The Read Aloud Cloud is suitable for all ages and is 20% off at Amazon right now!
One of my pet hobbies is collecting examples of complex systems that people spend lots of time making resilient to random error, but which are instead brought down by systematic error that nobody saw coming.
Systematic error - as opposed to a random, isolated fault - is error that infects and skews every aspect of the system. In a laboratory, it could be a calibration mistake in the equipment itself, making all observations useless.
Outside the laboratory, systematic error can mean the difference between life and death, sometimes literally. The hull of the Titanic comprised sixteen “watertight” compartments that were supposed to seal off a breach, preventing individual failures from spreading. That’s why the shipbuilders bragged that the Titanic was unsinkable. Instead, systemic design error let water spill from one compartment to the next. The moment an iceberg compromised one section of the hull, the whole ship was doomed.
Or take election forecasting. In 2016, pretty much every poll predicted a comfortable win for Hillary Clinton — a chorus of consensus that seemed safely beyond any mistake in an individual poll’s methodology. But it turned out that pollsters systematically underestimated enthusiasm for Donald Trump in key states — maybe because of shy Trump voters or industry bias, nobody really knows. If they had known, then maybe every poll wouldn’t have made the same mistakes. But they did. Four years later, nobody trusts the election modeling industry anymore.
That brings us to Wednesday. Several AWS services in us-east-1 took a daylong Thanksgiving break due to what shall be henceforth known around my house as The Kinesis Incident — which sounds like a novel in an airport bookstore, if Clive Cussler wrote thrillers about ulimit.
Please read AWS’s excellent blow-by-blow explanation for the full postmortem, but to sum up quickly: on Wednesday afternoon, AWS rolled out some new capacity to the Kinesis Data Streams control plane, which breached an operating system thread limit; because that part of KDS was not sufficiently architected for high availability, it went down hard and took a long time to come back up; and in the meantime several other AWS services that depend on Kinesis took baths of varying temperature and duration.
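To make the failure mode concrete: the postmortem describes each front-end server keeping an operating system thread open for every other server in the front-end fleet, so the per-host thread count grows with fleet size until it quietly crosses an OS-level ceiling. Here is a back-of-the-envelope sketch of that dynamic; every number below is made up for illustration, not an actual AWS limit or fleet size.

```python
# Back-of-the-envelope sketch of the triggering condition.
# All numbers are hypothetical, not AWS's actual fleet sizes or OS settings.

OS_THREAD_LIMIT = 10_000   # e.g. a per-process ulimit or kernel thread ceiling
BASE_THREADS = 500         # threads each host needs for its own request handling

def threads_per_host(fleet_size: int) -> int:
    # One peer-communication thread per *other* front-end host,
    # plus the host's own baseline workload.
    return BASE_THREADS + (fleet_size - 1)

for fleet_size in (8_000, 9_000, 9_501, 10_000):
    used = threads_per_host(fleet_size)
    status = "OK" if used <= OS_THREAD_LIMIT else "LIMIT BREACHED"
    print(f"fleet={fleet_size:>6}  threads/host={used:>6}  {status}")
```

Notice that adding capacity, normally the safe and boring remediation, is exactly what pushes every host toward the same ceiling at the same time. That is systematic error in miniature.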
Hot takes vs cold reality
Plenty of hot takes have been swirling around AWS Twitter over the long weekend, just as they did after the great S3 outage of 2017. Depending on who you listen to, The Kinesis Incident is …
A morality tale about OS configs! (I mean, sure, but that’s not the interesting part of this.)
Yet another argument for multi-cloud! (No, it isn’t; multi-cloud at the workload level remains expensive nonsense. Please see the previous Cloud Irregular for an explanation of when multi-cloud makes sense at the organization level.)
An argument for building multi-region applications! (This sounds superficially more reasonable, but probably isn’t. Multi-region architectures — and I’ve built a few! — are expensive, have lots of moving parts, and limit your service options almost as much as multi-cloud. Multi-region is multi-cloud’s creepy little brother. Don’t babysit it unless you have to.)
An argument for *AWS’s internal service architectures* being multi-region! (I have no idea how this would work compliance-wise. And I think it would just make everything worse, weirder, and more confusing for everyone.)
Forget the hot takes. Here’s the cold reality: The Kinesis Incident is not a story of independent, random error. It’s not a one-off event that we can put behind us with a config update or an architectural choice.
It’s a story of systemic failure.
The cascade of doom
Reading between the lines of the AWS postmortem, Scott Piper has attempted to map out the internal dependency tree of last week’s affected services.
The graphic in Scott’s tweet actually understates the problem — for example, no Kinesis also means no AWS IoT, which in turn meant a bad night for Ben Kehoe and his army of serverless Roombas, not to mention malfunctioning doorbells and ovens and who knows what else.
Now, IoT teams understand that their workloads are deeply intertwined with Kinesis streams. But who would have expected a Kinesis malfunction to wipe out AWS Cognito, a critical but seemingly unrelated service? The Cognito-Kinesis integration happens under the hood; the Cognito team apparently uses KDS to analyze API usage patterns. There’s no reason a customer would ever need to know that … until someone has to explain why Kinesis took down Cognito.
But it gets worse. According to the postmortem, the Cognito team actually had some caching in place to guard against Kinesis disappearing; it just didn’t work quite right in practice. So these individual service teams are rolling their own fault-tolerance systems to mitigate unexpected behavior from upstream dependencies that they may not fully understand. What do you want to bet Cognito isn’t the only service whose failsafes aren’t quite perfect?
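For what it’s worth, the general shape of what Cognito apparently tried is the familiar “telemetry is best effort, the critical path must survive” fallback. Here is a minimal sketch of that pattern; the class and names below are hypothetical, not Cognito’s actual internals.

```python
import logging
import queue

log = logging.getLogger("usage-analytics")

class UsageEventPublisher:
    """Best-effort publisher: auth traffic must not block on analytics."""

    def __init__(self, kinesis_client, stream_name: str, buffer_size: int = 10_000):
        self.kinesis = kinesis_client          # e.g. an injected boto3 Kinesis client
        self.stream_name = stream_name
        # Bounded local buffer: if Kinesis is unreachable, we would rather
        # drop telemetry than block or crash the caller's sign-in path.
        self.buffer = queue.Queue(maxsize=buffer_size)

    def publish(self, event: bytes, partition_key: str) -> None:
        try:
            self.kinesis.put_record(
                StreamName=self.stream_name,
                Data=event,
                PartitionKey=partition_key,
            )
        except Exception:
            # Degraded mode: park the event locally and move on.
            try:
                self.buffer.put_nowait((event, partition_key))
            except queue.Full:
                log.warning("usage buffer full; dropping event")
```

Writing this code is the easy part. Proving it behaves correctly during a real, multi-hour upstream outage is the hard part, and that is exactly where Cognito’s version fell short.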
(This is not a story of random error, this is a story of systemic failure.)
The more, the scarier
The edges in AWS’s internal service graph are increasing at a geometric rate as new higher-level services appear, often directly consuming core services like Kinesis, DynamoDB, and so on. Some bricks in this Jenga tower of dependencies will be legible to customers, like IoT’s white-labeling of Kinesis; others will use internal connectors and middleware that nobody sees until the next outage.
Cognito depends on Kinesis. AppSync integrates with Cognito. Future high-level services will no doubt use AppSync under the hood. Fixing one config file, hardening one failure mode, doesn’t shore up the entire tower.
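To see why point fixes don’t help much, think of blast radius as the transitive set of dependents rather than the direct ones. Here is a toy model built only from the chains named above plus one hypothetical future service; the real graph is far larger and mostly invisible to customers.

```python
# Illustrative only: which service depends on which, per the chains above.
# "FutureHighLevelService" is hypothetical.
deps = {
    "Cognito": {"Kinesis"},
    "AWS IoT": {"Kinesis"},
    "AppSync": {"Cognito"},
    "FutureHighLevelService": {"AppSync"},
}

def blast_radius(core: str) -> set[str]:
    # Everything that transitively depends on `core`.
    affected, frontier = set(), {core}
    while frontier:
        hit = {svc for svc, uses in deps.items()
               if uses & frontier and svc not in affected}
        affected |= hit
        frontier = hit
    return affected

print(sorted(blast_radius("Kinesis")))
# ['AWS IoT', 'AppSync', 'Cognito', 'FutureHighLevelService']
```

Every new edge added near the top of the graph silently inherits every fragility below it.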
The only conclusion is that we should expect future Kinesis Incidents, and we should expect them to be progressively bigger in scope and harder to resolve.
What’s the systemic failure here? Two-pizza teams. “Two is better than zero.” A “worse is better” product strategy that prioritizes shipping new features over cross-functional collaboration. These are the principles that helped AWS eat the cloud. They create services highly resilient to independent failures. But it’s not clear that they are a recipe for systemic resilience across all of AWS. And over time, while errors in core services become less likely, the probability builds that a single error in a core service will have snowballing, Jenga-collapsing implications.
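A quick bit of made-up arithmetic shows why. Even if each dependent integration has only a small, independent chance of mishandling a core-service fault, the chance that at least one of them does grows quickly with the number of integrations.

```python
def p_snowball(n_dependents: int, p_mishandle: float = 0.10) -> float:
    # Chance that at least one of n dependent integrations mishandles a
    # core-service fault, assuming an illustrative, independent 10% each.
    return 1 - (1 - p_mishandle) ** n_dependents

for n in (5, 20, 80):
    print(f"{n:>3} dependent integrations -> chance a fault snowballs ~ {p_snowball(n):.2f}")
# 5 -> 0.41, 20 -> 0.88, 80 -> 1.00
```

With these illustrative numbers, five dependents snowball a core fault about 40% of the time; eighty dependents essentially always do.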
Really, the astonishing thing is that these cascading outages don’t happen twice a week, and that’s a testament to the outstanding engineering discipline at AWS as a whole.
But still, as the explosion of new, higher-level AWS services continues (‘tis the season — we’re about to meet a few dozen more at re:Invent!) and that dependency graph becomes more complex, more fragile, we should only expect cascading failures to increase. It’s inherent in the system.
Unless?
AWS’s own postmortem, when it’s not promising more vigilance around OS thread hygiene, does allude to ongoing efforts to “cellularize” critical services to limit blast radius. I don’t fully understand how that protects against bad assumptions made by dependent services, and I’d be willing to bet that plenty of AWS PMs don’t either. But it’s time to build some trust with customers about exactly what to expect.
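For readers who haven’t met the term: cellularization generally means pinning each customer to one of several independent copies of a service, so a fault or risky deployment in one cell only touches a fraction of customers. The sketch below is a generic illustration of that routing idea, not AWS’s actual design, and the comment in it is my real question: containment only helps if every dependent service is cell-aware too.

```python
import hashlib

NUM_CELLS = 8  # hypothetical

def cell_for(customer_id: str) -> int:
    # Stable hash so a given customer always lands in the same cell.
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_CELLS

def handle_request(customer_id: str, healthy_cells: set[int]) -> str:
    cell = cell_for(customer_id)
    if cell not in healthy_cells:
        # Blast radius is limited to this cell's customers; everyone else is
        # unaffected, provided dependent *services* are also cell-aware,
        # which is exactly the open question.
        return f"customer {customer_id}: degraded (cell {cell} unhealthy)"
    return f"customer {customer_id}: served by cell {cell}"

print(handle_request("acct-1234", healthy_cells={0, 1, 2, 3, 4, 5, 7}))
```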
I’ve called for AWS to release a full, public audit of their internal dependencies on their own services, as well as their plan to isolate customers from failures of services the customer is not using. Maybe everything’s fine. Maybe the Kinesis Incident was an anomaly, and AWS won’t suffer another outage of this magnitude for years. But right now I don’t see a reason to believe that, and I’m sure I’m not alone.
Links and events
On that note, re:Invent starts on Monday! It’s three weeks long! It’s free, it’s virtual, it’s just. so. much. I’m going to try to send out a short executive summary of each day via email at A Cloud Guru. Make sure you follow the blog over there for lots of analysis and new feature deep dives from me, other AWS Heroes, and even some special AWS service team guests. I’ll probably pop up in a few other places as well.
Irish Tech News has a nice review out of The Read Aloud Cloud. “For the most part, the rhyming works”, they concede. I’ll take it!