The cloud billing risk that scares me most as a developer
Because there's no good way to protect against it
Wanna hear a cloud billing story that will make your hair stand on end? Ask the nearest cloud developer to tell you about a time when they accidentally let a serverless function run recursively.
That’s a Google Cloud horror story I’ve linked (I’m trying to be fair since I work there), but serverless functions that spiral out of control and rack up huge bills are distressingly common across all the clouds.
AWS calls it the recursive runaway problem. I call it the Hall of Infinite Functions - imagine a roomful of mirrors reflecting an endless row of Lambda invocations. It’s pretty much the only cloud billing scenario that gives me nightmares as a developer, for two reasons:
It can happen so fast. It’s the flash flood of cloud disasters. This is not like forgetting about a GPU instance and incurring a few dollars per hour in linearly increasing cost. You can go to bed with a $5 monthly bill and wake up with a $50,000 bill - all before your budget alerts have a chance to fire.
There’s no good way to protect against it. None of the cloud providers has built mechanisms to fully insulate developers from this risk yet.
That second point hit me hard over the past few weeks as I was writing the billing safety tips for the new Cloud Resume Challenge Guidebooks. These books are designed for new serverless developers, and so I wanted to make sure they recommend safeguards to keep people from making dangerous billing mistakes. It’s straightforward to protect yourself from spending too much on continuously-priced services like VMs: create budget alerts, automate test project cleanup, etc. But there’s just no good way to guarantee you won’t get clobbered by a surprise bill from functions or other usage-billed services.
The existing mitigations (as far as they go)
I talked to half a dozen senior cloud developers about this problem to see what they are doing to protect themselves. All agreed it’s an unsolved problem so far. Their best ideas for reducing the risk:
Turn down your concurrency limits on functions. This will at least protect you from a recursive “fork-bomb”-style scenario where functions scale out indefinitely. But you can do plenty of damage with even a single function gone rogue. Imagine a function that triggers on an upload to a cloud storage bucket, updates the file in the bucket - and thus creates another upload event which triggers the function again, and again, and again, uploading and downloading expensive data each time. That could be a five-figure overnight mistake without ever exceeding a concurrency of 1.
Monitor your functions for unusual invocation spikes. Again, you can do this, and it might help - but there are two reasons it’s not enough. One, you could still rack up quite a sizable bill before you’re able to respond to the alert, determine that the usage increase isn’t the good kind of usage increase, isolate the problem, and fix it. Two, the obvious-in-hindsight nature of these infinite loop bugs means that they tend to show up disproportionately in development and test accounts, where you’re less likely to have comprehensive monitoring set up.
Just write better code. A good aspiration, but not a reasonable defense strategy. Four of the six experienced engineers I talked to have had to beg their cloud provider for billing forgiveness after some sort of runaway resource loop, because nobody writes perfect code all the time. If there were a programming language that physically melted your motherboard any time your code entered an infinite loop, you wouldn’t just “write better code” - you’d never use that language at all. And yet a broken laptop would be inexpensive compared to the downside risk of creating a hall of mirrors between usage-billed cloud services.
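To make the storage-bucket loop from the first mitigation concrete, here is a minimal local simulation in Python. Nothing in it talks to a real cloud: `FakeBucket`, both handlers, and the `processed` metadata marker are hypothetical names invented for illustration, and the write budget exists only so the simulation stops. It shows a handler that re-triggers itself with every write, next to one that checks a marker before writing again.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeBucket:
    """In-memory stand-in for a bucket that fires an event on every write."""
    handler: Optional[Callable] = None
    objects: dict = field(default_factory=dict)
    writes: int = 0
    max_writes: int = 50  # safety valve for this simulation only

    def upload(self, name, data, metadata=None):
        self.writes += 1
        if self.writes > self.max_writes:
            raise RuntimeError("runaway loop: simulated write budget exhausted")
        self.objects[name] = (data, dict(metadata or {}))
        if self.handler:
            self.handler(self, name)  # the "upload event" trigger


def naive_handler(bucket, name):
    """Rewrites the object in place, which fires the trigger again."""
    data, meta = bucket.objects[name]
    bucket.upload(name, data.upper(), meta)


def guarded_handler(bucket, name):
    """Skips objects it has already processed (hypothetical marker)."""
    data, meta = bucket.objects[name]
    if meta.get("processed") == "true":
        return
    bucket.upload(name, data.upper(), {**meta, "processed": "true"})


def count_writes(handler):
    """Upload one file and report how many writes the handler caused."""
    bucket = FakeBucket(handler=handler)
    try:
        bucket.upload("report.txt", "hello")
    except RuntimeError:
        pass  # the simulation's safety valve tripped
    return bucket.writes


print(count_writes(guarded_handler))  # 2: the original upload plus one rewrite
print(count_writes(naive_handler))    # 51: stopped only by the safety valve
```

Note that the naive handler never runs more than one invocation at a time, so a concurrency limit of 1 would not have saved it. The guard also only works if you remember to write it, which is exactly the "just write better code" problem.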
What could help even more?
Realistically, what ends up happening in a lot of these scenarios is that the cloud providers forgive the bills. That is nice of them. But nicer would be:
Near real-time billing
Again, budget alerts are great, but they’re not useful today to catch a sudden bill explosion because all the cloud providers send them on a delay of 24 hours or more. I do appreciate there are all sorts of technical and business hurdles to making billing information available in near real-time, so a nice stopgap would be -
Hard caps on cloud spend
This doesn’t exactly require real-time billing. It would be amazing if any of the cloud providers let you click a button that says “I don’t ever want to exceed my monthly budget in this test account, so please just shut down all my resources if I do.”[1] If billing isn’t real-time, the provider might have to eat some cost until they can detect and shut down over-budget resources - which, in turn, seems like a good incentive to make billing more real-time.
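Here is a rough sketch of why a hard cap and billing delay interact the way I describe. It is a toy simulation, not any provider's mechanism: `simulate_hard_cap` is a hypothetical function, and the spend figures are made-up units chosen only to keep the arithmetic easy. The cap can only act on spend that billing has already reported, so everything spent during the delay is overshoot that somebody has to absorb.

```python
def simulate_hard_cap(hourly_spend, budget, billing_delay_hours):
    """Return (shutdown_hour, actual_spend_at_shutdown).

    The cap only sees cumulative spend from `billing_delay_hours` ago.
    Returns (None, total_spend) if the cap never trips.
    """
    actual = 0.0
    history = []  # cumulative actual spend at the end of each hour
    for hour, spend in enumerate(hourly_spend):
        actual += spend
        history.append(actual)
        visible_index = hour - billing_delay_hours
        visible = history[visible_index] if visible_index >= 0 else 0.0
        if visible >= budget:
            return hour, actual  # resources shut down here
    return None, actual


runaway = [10] * 10  # a steady runaway costing 10 units per hour
budget = 30

print(simulate_hard_cap(runaway, budget, billing_delay_hours=0))  # (2, 30.0)
print(simulate_hard_cap(runaway, budget, billing_delay_hours=3))  # (5, 60.0)
```

With no delay, the cap trips right at the budget. With a three-hour delay, the same workload spends twice the budget before the shutdown fires, and that gap grows with both the delay and the burn rate.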
Better automated anomaly detection and remediation for recursive workloads
Rubbing some ML on this problem would make everything better, right? Seriously though, there have to be some characteristic burst patterns of serverless workloads undergoing a recursive runaway, not so different from the signals used by all the automated security anomaly detection tools out there. It wouldn’t be perfect, but even catching a subset of runaway functions in development accounts would be a great step forward - as long as the cure wasn’t more expensive than the disease.
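You don't need ML to see what a "characteristic burst pattern" might look like. Below is a deliberately crude heuristic in Python: flag the first interval where the invocation count jumps far above a trailing median. The function name, window size, multiplier, and minimum count are all assumptions I picked for illustration, and the sample counts are invented. A real detector would need to do much better than this, not least at telling a runaway apart from the good kind of traffic spike.

```python
from statistics import median


def find_spike(counts, window=5, factor=10, min_count=100):
    """Return the index of the first suspicious interval, or None.

    An interval is suspicious when its count exceeds both `min_count`
    and `factor` times the median of the previous `window` intervals.
    """
    for i in range(window, len(counts)):
        baseline = median(counts[i - window:i])
        threshold = max(min_count, factor * max(baseline, 1))
        if counts[i] > threshold:
            return i
    return None


# Invented per-minute invocation counts for a quiet dev account.
quiet = [3, 4, 2, 5, 3, 4, 6, 5, 4]
runaway = [3, 4, 2, 5, 3, 4, 6, 900, 1800]

print(find_spike(quiet))    # None
print(find_spike(runaway))  # 7
```

The median baseline keeps one noisy minute from dragging the threshold up, and the `min_count` floor keeps a jump from 1 to 20 invocations in a sleepy test account from paging anyone.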
More esoterically, I sometimes wonder if it would be feasible to implement a configurable recursive depth limit feature for cloud functions. You can theoretically track and stop recursive invocations by hand using instrumentation tools like Yan Cui’s, but if we’re going to treat the cloud like an OS, we should expect the cloud providers to give us built-in protections on the “call stack” of our resource abstractions, right?
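To show what tracking the "call stack" by hand can look like, here is a minimal Python sketch that carries a depth counter in the event payload and refuses to run past a limit. This is my own illustration, not Yan Cui's implementation or any provider's feature: the `_call_depth` field, the `limit_depth` decorator, and the `invoke` callback are all hypothetical, and the "functions" are plain local calls rather than real cloud invocations.

```python
import functools

DEPTH_KEY = "_call_depth"  # hypothetical payload field carrying the depth


class RecursionLimitExceeded(Exception):
    """Raised when a chain of invocations gets deeper than allowed."""


def limit_depth(max_depth=3):
    """Wrap a handler so every downstream invocation carries depth + 1."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, invoke):
            depth = event.get(DEPTH_KEY, 0)
            if depth >= max_depth:
                raise RecursionLimitExceeded(f"depth {depth} reached the limit")

            def tracked_invoke(target, payload):
                # Stamp the next hop with an incremented depth counter.
                return invoke(target, {**payload, DEPTH_KEY: depth + 1})

            return handler(event, tracked_invoke)
        return wrapper
    return decorator


calls = []


@limit_depth(max_depth=3)
def resize_image(event, invoke):
    calls.append(event.get(DEPTH_KEY, 0))
    invoke(resize_image, {"key": event["key"]})  # the bug: it calls itself


def local_invoke(target, payload):
    """Local stand-in for an asynchronous cloud function invocation."""
    return target(payload, local_invoke)


try:
    resize_image({"key": "cat.png"}, local_invoke)
except RecursionLimitExceeded as err:
    print("stopped:", err)

print(calls)  # [0, 1, 2]
```

The catch is that this only works when every hop cooperates and passes the counter along. A loop that goes through a storage bucket or a queue loses the counter unless that service preserves it too, which is why a built-in limit from the provider would be so much more useful.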
What I do know is this: the Hall of Infinite Functions may not be that terrifying for large corporate cloud customers, but it’s a spooky place for new cloud learners. All the cloud providers need better handrails as we invite the next generation of builders inside.
[1] Microsoft Azure actually does let you set a spending limit for certain free and promotional account types, but once you graduate to a for-real pay-as-you-go plan you don’t have that option anymore.
Google Cloud also lets you cap daily usage of billable APIs; there’s some wiggle room at the top of the cap but the quotas seem to fire in fairly real time. I find this feature interesting, and I’m not sure I totally have my head around it, but I think if you tried to use it to protect against runaway functions it would have mostly the same limitations as relying on concurrency limits.