OK so the big news today is that I accidentally started a company. “Influencers-as-a-service” is the short version. My cofounder Emily and I have spent a lot of time on both sides of this market as underpaid tech creators and under-supported technical marketers. If you fall into either of those camps, you should definitely chat with us.
On to today’s Good (and not-so-good) Tech Things…
Today we’re looking at two concurrent issues that affected customers on AWS and Google Cloud. The way the providers responded reflects deep differences in their DNA … and in their trustworthiness.
A tale of two clouds
AWS isn’t what they used to be. “It’s feeling a lot like Day Two over here,” friends at AWS often tell me, shaking their heads sadly. A sweaty pursuit of generative AI has clouded their historic focus on IaaS building blocks. A myopic insistence on RTO continues to force out top talent. They’re even deprecating services now. (My 168 AWS Services song from 2020 references at least four services, not counting SimpleDB, that have since been or soon will be entirely scrubbed from the docs and SDKs: WorkDocs, Honeycode, Snowmobile, and Alexa for Business.)
But you know what AWS still does better than anybody else? They actually listen to their customers and bend over backwards to help them out.
Case in point: on April 29th, a software engineer named Maciej Pocwierz discovered an unusual behavior in Amazon S3 that left him on the hook for a $1,300 bill.
Cliff’s Notes explanation of the behavior:
S3 buckets all exist in a single global namespace.
Anybody can try to access any S3 bucket, with or without an AWS account, if they happen to guess the bucket name.
Assuming the bucket is not publicly accessible, the unauthorized user will get a generic HTTP 4xx error code.
In this case, the unauthorized requests (over 100 million of them!) were coming from the makers of a widely-used open-source tool, who had accidentally shipped a default configuration that collided with Maciej’s bucket name. They quickly rolled out a fix, and the problem stopped.
So far, so reasonable. What surprised Maciej was that each of those 100 million unauthorized requests was billed to the bucket owner. In other words, it was possible to perform a “denial-of-wallet” attack simply by spamming bad requests at your enemy’s bucket. And even though AWS Support was kind enough to refund this particular bill, there was nothing Maciej could do to protect against this attack in the future. It was just a built-in risk of using S3.
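To make the mechanics concrete, here’s a minimal sketch of what an attacker (or a misconfigured tool) effectively does. The bucket name is hypothetical, and the request is deliberately unsigned, so no AWS account is involved on the sender’s side:

```python
# Minimal sketch: probe a private S3 bucket anonymously.
# The bucket name below is hypothetical.
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from botocore.exceptions import ClientError

# No AWS credentials required: send the request completely unsigned.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

try:
    # If "some-guessable-bucket-name" exists and is private, S3 answers
    # with 403 Access Denied -- and, until the change described below,
    # the bucket *owner* paid for the request.
    s3.list_objects_v2(Bucket="some-guessable-bucket-name")
except ClientError as err:
    print(err.response["ResponseMetadata"]["HTTPStatusCode"])  # e.g. 403
```

Multiply that loop by 100 million and you have Maciej’s bill.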
Now, there were several ways AWS could have responded once Maciej’s blog about the incident gained wide attention.
They could have said “This is expected behavior; just use randomized bucket names that are harder for attackers to guess.”
They could have pointed out that it’s quite resource-intensive to spin up enough of these requests to make a meaningful dent in enterprise wallets, so this probably isn’t going to be a big risk for big customers.
They could have just stayed quiet about the whole thing and waited for everyone to forget about it.
But they didn’t do any of that. Instead, AWS’s official Voice of the Customer, Jeff Barr, acknowledged within 24 hours that people shouldn’t have to pay for unauthorized S3 requests they didn’t initiate. Within a week, the S3 team was working on a fix. And as of May 13th, customers will no longer incur request or bandwidth charges for requests that return HTTP 403 (Access Denied) errors and are initiated from outside their own account.
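The billing fix doesn’t make the probing traffic invisible, though, and it’s still worth knowing if someone is hammering your buckets. One way to watch for it (a sketch, not the only option) is S3’s optional request metrics in CloudWatch, which expose a per-bucket 4xxErrors count. This assumes you’ve already enabled a request-metrics configuration on the bucket with the filter ID “EntireBucket”; the bucket name is hypothetical:

```python
# Sketch: count 4xx responses on a bucket over the last 24 hours using
# S3 request metrics in CloudWatch. Assumes request metrics are enabled
# on the bucket with a filter ID of "EntireBucket".
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="4xxErrors",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-bucket-name"},  # hypothetical
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,           # one datapoint per hour
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```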
I just want to pause for a second on how remarkable this is. S3 is an enormous, mature product with all sorts of complex billing dimensions. They employ a lot of engineers, but I’m quite sure every one of them was already busy with a full roadmap of other priorities for the quarter. And yet … they found time to roll out a nontrivial behavior change based on the complaint of one person at a small Polish consulting company who had incurred a total bill of $1,300. Elapsed turnaround time from complaint to rollout: two weeks.
That, my friends, is what customer obsession looks like. As long as AWS can still do that, they are the cloud to beat.
Meanwhile, in the other cloud…
At the exact same time all this was unfolding, Google Cloud was having troubles of its own.
On May 2nd, the CEO of a $135 billion Australian pension fund called Unisuper acknowledged a service outage affecting its 647,000 customers—many of them retirees whose life savings are tied up in the company. The whole thread of outage updates is worth a read. It starts out fairly generic and, as the days go by with no resolution in sight, becomes increasingly bewildering. Unisuper hastens to assure its members that ransomware is not involved, that nobody’s data is at risk … but that getting all this resolved is going to take quite a bit of time, and apparently Google Cloud is right in the middle of it all.
On the 8th of May, a full week into the outage, Unisuper issued a bizarre statement on its website purporting to come jointly from its own CEO and Google Cloud CEO Thomas Kurian. The statement placed full responsibility on Google Cloud for “deleting Unisuper’s private cloud subscription” and said that the only reason Unisuper had any chance of getting things back online was that they had squirreled away some backups on an entirely different cloud.
This statement was so weird that I, among others, immediately went on record with doubts. For one thing, Google themselves had been completely silent on this. For another, there’s no such logical boundary in Google Cloud as a “subscription”; that’s Azure terminology. And in the ~3 years I was at Google I never once heard Thomas Kurian make a personal public statement taking the blame for a customer outage. The whole thing sounded suspiciously like Unisuper was making things up to deflect blame for their $135 billion fumble.
But then—incredibly—Gergely Orosz was able to get confirmation from Google Cloud PR that the Unisuper statement was legit. Google Cloud really did delete one of the biggest pension funds in Australia, and left their customer to handle all the fallout.
Questions arise.
In a statement presumably lawyered to death by both companies, why did Google use conspicuously inaccurate language to describe its own services? The only thing I can come up with is that Google Cloud preferred not to reveal exactly which of their services ate Unisuper. Given the oblique reference to a private cloud, we can speculate it might have been VMware Engine. But other customers of that service have no way to know for sure.
How, exactly, does a public cloud accidentally delete a major customer so irrevocably that it takes them eleven days and off-cloud backups to get back online? I do not know. I do not know how this is possible. It raises all sorts of unpleasant questions about what other safeguards Google Cloud’s ops team is missing behind the scenes. This is the sort of problem that cries out for a full, detailed, public root-cause analysis. Don’t leave it to your customer to post a vague assurance that “this was a one-off mistake and we really super promise it won’t happen again.” This is a bad, bad, bad problem. A confidence-shaking, existential problem. You have to make a direct statement demonstrating that you know what went wrong and why it won’t happen again.
But Google Cloud didn’t. They usually don’t. And customers yet again are left wondering, waiting, and considering if they should make off-cloud backups of their own.
That is the opposite of customer obsession. It is customer obfuscation. If Google Cloud wants to be taken seriously as a competitor to AWS, they should learn the lesson of the S3 denial-of-wallet attack. Customer trust is not earned through shiny AI demos and inexplicable musical performances. It is earned the hard way, one support case at a time.
Cartoon of the day
Today’s cartoon is brought to you by Tidelift. They’d love to see you at Upstream, their conference digging into the health and security of open source on June 5.