The cloud billing risk that scares me most as a developer

Because there's no good way to protect against it

Jul 19, 2022


Wanna hear a cloud billing story that will make your hair stand on end? Ask the nearest cloud developer to tell you about a time when they accidentally let a serverless function run recursively.

That’s a Google Cloud horror story I’ve linked (I’m trying to be fair since I work there), but serverless functions that spiral out of control and rack up huge bills are distressingly common across all the clouds.

AWS calls it the recursive runaway problem. I call it the Hall of Infinite Functions - imagine a roomful of mirrors reflecting an endless row of Lambda invocations. It’s pretty much the only cloud billing scenario that gives me nightmares as a developer, for two reasons:

It can happen so fast. It’s the flash flood of cloud disasters. This is not like forgetting about a GPU instance and incurring a few dollars per hour in linearly increasing cost. You can go to bed with a $5 monthly bill and wake up with a $50,000 bill - all before your budget alerts have a chance to fire.
There’s no good way to protect against it. None of the cloud providers has built mechanisms to fully insulate developers from this risk yet.

That second point hit me hard over the past few weeks as I was writing the billing safety tips for the new Cloud Resume Challenge Guidebooks. These books are designed for new serverless developers, and so I wanted to make sure they recommend safeguards to keep people from making dangerous billing mistakes. It’s straightforward to protect yourself from spending too much on continuously-priced services like VMs: create budget alerts, automate test project cleanup, etc. But there’s just no good way to guarantee you won’t get clobbered by a surprise bill from functions or other usage-billed services.

The existing mitigations (as far as they go)

I talked to half a dozen senior cloud developers about this problem to see what they are doing to protect themselves. All agreed it’s an unsolved problem so far. Their best ideas for reducing the risk:

Turn down your concurrency limits on functions. This will at least protect you from a recursive “fork-bomb”-style scenario where functions scale out indefinitely. But you can do plenty of damage with even a single function gone rogue. Imagine a function that triggers on an upload to a cloud storage bucket, updates the file in the bucket - and thus creates another upload event which triggers the function again, and again, and again, uploading and downloading expensive data each time. That could be a five-figure overnight mistake without ever exceeding a concurrency of 1.
Monitor your functions for unusual invocation spikes. Again, you can do this, and it might help - but there are two reasons it’s not enough. One, you could still rack up quite a sizable bill before you’re able to respond to the alert, determine that the usage increase isn’t the good kind of usage increase, isolate the problem, and fix it. Two, the obvious-in-hindsight nature of these infinite loop bugs means that they tend to show up disproportionately in development and test accounts, where you’re less likely to have comprehensive monitoring set up.
Just write better code. A good aspiration, but not a reasonable defense strategy. Four of the six experienced engineers I talked to have had to beg their cloud provider for billing forgiveness after some sort of runaway resource loop, because nobody writes perfect code all the time. If there were a programming language that physically melted your motherboard any time your code entered an infinite loop, you wouldn’t just “write better code” - you’d never use that language at all. And yet a broken laptop would be inexpensive compared to the downside risk of creating a hall of mirrors between usage-billed cloud services.
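To put rough numbers on why a concurrency limit alone doesn't save you, here is a back-of-the-envelope sketch of the storage-bucket loop running at a concurrency of 1. Every figure in it (seconds per loop, gigabytes moved, price per gigabyte, price per invocation) is a made-up assumption for illustration, not any provider's actual pricing.

```python
def runaway_loop_cost(hours, seconds_per_loop, gb_per_loop,
                      usd_per_gb, usd_per_invocation):
    """Estimate the cost of one function re-triggering itself back to back.

    Concurrency never exceeds 1: each run finishes before the next starts.
    All prices are hypothetical placeholders, not real provider rates.
    """
    loops = int(hours * 3600 / seconds_per_loop)
    data_cost = loops * gb_per_loop * usd_per_gb
    invocation_cost = loops * usd_per_invocation
    return loops, data_cost + invocation_cost


# Assumed scenario: an 8-hour night, 2 seconds per loop, 5 GB moved per loop
loops, cost = runaway_loop_cost(
    hours=8, seconds_per_loop=2, gb_per_loop=5,
    usd_per_gb=0.10, usd_per_invocation=0.000002,
)
print(f"{loops} invocations, roughly ${cost:,.0f}")
```

Notice that the concurrency limit never shows up in the formula. The loop is sequential, so the bill grows with elapsed time and data moved rather than with parallelism, and the per-invocation charge is a rounding error next to the data cost.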

What could help even more?

Realistically, what ends up happening in a lot of these scenarios is that the cloud providers forgive the bills. That is nice of them. But nicer would be:

Near real-time billing

Again, budget alerts are great, but they’re not useful today to catch a sudden bill explosion because all the cloud providers send them on a delay of 24 hours or more. I do appreciate there are all sorts of technical and business hurdles to making billing information available in near real-time, so a nice stopgap would be -

Hard caps on cloud spend

This doesn’t exactly require real-time billing. It would be amazing if any of the cloud providers let you click a button that says “I don’t ever want to exceed my monthly budget in this test account, so please just shut down all my resources if I do.” [1] If billing isn’t real-time, the provider might have to eat some cost until they can detect and shut down over-budget resources - which, in turn, seems like a good incentive to make billing more real-time.

Google Cloud does better than some here by letting you disable billing on a project in response to budget alerts, but again, the alert might not fire as soon as you’d want. [2]
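For a sense of what that budget-alert hook involves, here is a minimal sketch of the decision logic only. It assumes the alert arrives as base64-encoded JSON with `costAmount` and `budgetAmount` fields, which is how I understand Google Cloud's budget notifications to be shaped; check the current docs before relying on that. The `disable_billing` callback is a hypothetical stand-in for whatever real billing API call you would make.

```python
import base64
import json


def should_disable_billing(pubsub_data_b64):
    """Return True if the budget notification reports cost above budget.

    Assumes base64-encoded JSON carrying `costAmount` and `budgetAmount`;
    verify the field names against your provider's current documentation.
    """
    payload = json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))
    return payload["costAmount"] > payload["budgetAmount"]


def handle_budget_alert(pubsub_data_b64, disable_billing):
    """Invoke the (hypothetical) disable_billing callback when over budget."""
    if should_disable_billing(pubsub_data_b64):
        disable_billing()
        return "billing disabled"
    return "no action"


# Usage with a fake notification and a stand-in callback
fake = base64.b64encode(
    json.dumps({"costAmount": 120.0, "budgetAmount": 100.0}).encode("utf-8")
)
actions = []
print(handle_budget_alert(fake, lambda: actions.append("disabled")))
```

The handler can only react once a notification shows up, so the alert delay is still the weak link. The code just removes the human from the response step.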

Better automated anomaly detection and remediation for recursive workloads

Rubbing some ML on this problem would make everything better, right? Seriously though, there have to be some characteristic burst patterns of serverless workloads undergoing a recursive runaway, not so different from the signals used by all the automated security anomaly detection tools out there. It wouldn’t be perfect, but even catching a subset of runaway functions in development accounts would be a great step forward - as long as the cure wasn’t more expensive than the disease.
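You don't need ML to see what a "characteristic burst pattern" might look like. The sketch below is plain thresholding, not anomaly detection in any sophisticated sense: it flags a minute when invocations jump well above the average of the previous few minutes. The window size, multiplier, minimum count, and the traffic series are all assumptions I picked for illustration.

```python
from collections import deque
from statistics import mean


def find_spikes(counts_per_minute, window=5, factor=4, min_count=100):
    """Flag minutes whose invocation count dwarfs the recent baseline.

    A minute is flagged when its count is at least `min_count` and more
    than `factor` times the mean of the previous `window` minutes.
    """
    recent = deque(maxlen=window)
    spikes = []
    for minute, count in enumerate(counts_per_minute):
        if len(recent) == window:
            baseline = max(mean(recent), 1)  # avoid a zero baseline
            if count >= min_count and count > factor * baseline:
                spikes.append(minute)
        recent.append(count)
    return spikes


# Made-up series: quiet traffic, then a runaway doubling every minute
series = [3, 4, 2, 5, 3, 4, 8, 16, 32, 64, 128, 256, 512]
print(find_spikes(series))
```

A steady high-traffic function stays quiet under this rule because its baseline is high too. The catch is the one the post already names: a legitimate traffic surge looks the same, so an automatic kill switch built on this would need a human-approved escape hatch.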

More esoterically, I sometimes wonder if it would be feasible to implement a configurable recursive depth limit feature for cloud functions. You can theoretically track and stop recursive invocations by hand using instrumentation tools like Yan Cui’s, but if we’re going to treat the cloud like an OS, we should expect the cloud providers to give us built-in protections on the “call stack” of our resource abstractions, right?
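Here is roughly what a hand-rolled depth limit could look like, as a sketch under my own assumptions rather than a description of any existing tool. The idea is to carry a hop counter in each event and refuse to run once it passes a limit. The `call_depth` field, `MAX_DEPTH`, and the in-memory queue standing in for the cloud's event bus are all hypothetical.

```python
MAX_DEPTH = 3  # hypothetical limit; tune per workload


class RecursionLimitExceeded(Exception):
    pass


def next_event(incoming_event, payload):
    """Build an outgoing event that carries the hop count one level deeper."""
    depth = incoming_event.get("call_depth", 0) + 1
    return {"call_depth": depth, "payload": payload}


def guard(incoming_event):
    """Refuse to run when the event has already passed through too many hops."""
    depth = incoming_event.get("call_depth", 0)
    if depth >= MAX_DEPTH:
        raise RecursionLimitExceeded(
            f"call depth {depth} reached limit {MAX_DEPTH}"
        )


def handler(event, emit):
    """A deliberately buggy handler that always re-triggers itself."""
    guard(event)
    emit(next_event(event, event.get("payload")))


# Simulate the event bus with a simple in-memory queue
queue = [{"payload": "upload"}]
runs = 0
try:
    while queue:
        handler(queue.pop(0), queue.append)
        runs += 1
except RecursionLimitExceeded as err:
    print(f"stopped after {runs} runs: {err}")
```

The weakness is that the counter only survives if every hop copies it forward. Events generated by the platform itself, like a bucket upload notification, won't carry your custom field unless you smuggle it in through metadata, which is exactly why a built-in version from the provider would be more trustworthy.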

What I do know is this: the Hall of Infinite Functions may not be that terrifying for large corporate cloud customers, but it’s a spooky place for new cloud learners. All the cloud providers need better handrails as we invite the next generation of builders inside.

[1] Microsoft Azure actually does let you set a spending limit for certain free and promotional account types, but once you graduate to a for-real pay-as-you-go plan you don’t have that option anymore.

[2] Google Cloud also lets you cap daily usage of billable APIs; there’s some wiggle room at the top of the cap but the quotas seem to fire in fairly real time. I find this feature interesting, and I’m not sure I totally have my head around it, but I think if you tried to use it to protect against runaway functions it would have mostly the same limitations as relying on concurrency limits.

Arne Babenhauserheide

In my lecture on distributed systems, I’ve been asked: “You keep telling us that distributed systems are harder, so why go for a peer-to-peer approach if we can just rent a server?”

This year I let the students calculate the cost of different solutions. The result: If you don’t have an average payoff per function invocation, cloud services are a really bad deal. Why did many big changes originate in peer-to-peer? Because that’s where you can actually scale up without having to monetize to the brink. And if your function runs amok, your users just restart your program and some of them might even report a bug.


Petr

Sorry Forrest, but I would call this a cowboy approach.

As an engineering director at several startups in the past, I would say that the risk of putting our company on the brink of bankruptcy is too high. This is just irresponsible.

If there is no way to limit spending, then this technology is too immature.

On Azure there was a Dedicated (App Service) Plan, so I could have a hard limit on what I could spend.

The "infinite" scalability is rarely needed. I need decent performance, and I'm OK with service degradation under the Slashdot effect.

Business values predictability much more than an opportunity that can rarely be used.
