Digging into that Pokémon GO architecture
Why did they build it on Google Cloud?
The other day, Google Cloud’s consistently excellent Priyanka Vergadia released this interesting interview with an engineering manager at Niantic Labs, the people behind Pokémon GO.
My favorite thing about that interview is that … well okay, my favorite thing is that it’s been transcribed for text, which I find much easier to consume than video. But my second favorite thing is that it gives you a peek into how you can run something in the cloud that requires simply stupid levels of global scale, while also keeping interactive latency down to a level approximating “real life”. There’s a reason that a lot of the historical innovations in computing have come out of gaming. This is a next-level problem and I recommend reading/watching the interview multiple times to pull out the nuances of what Niantic’s James Prompanya is explaining.
Hours after the interview came out, a reader of this newsletter reached out to me with the following question:
“I'd love to see a tweet thread from you decrypting [the above design], mapping the various parts to equivalent AWS services. [To be honest,] I've never really taken Google Cloud very seriously because I've never seen the architecture of a major app like this running on GCP - they always seem to be leaning on AWS more. It would be instructive, if you have the time.”
They do say adults learn primarily by analogy, and I’m right there with ya, reader - my background is all AWS, and when I see an architecture like this I immediately try to relate it to what I already know.
So why did Niantic choose to build this app on Google Cloud? Could you do it using equivalent services on AWS? What tradeoffs are they making here?
It’s hard to give a truly serious analysis based only on the high-level information available, but here are four things that jump out at me. Let’s see if we can use them to shed some new light on both clouds.
Spanner
Let’s start with Google Cloud Spanner, the boldest choice made in this design. Is there a comparable service to it on AWS?
We know that global low latency, ACID transactions, and relational semantics are important to Niantic. They’re pumping a brain-bleeding amount of traffic through this thing as well, so it has to stay fast at giant scale. What AWS database could do this? Prompanya says they evolved past Google Datastore, so DynamoDB is probably not going to work for them either.1
The closest AWS-native service to GO’s requirements would probably be Aurora, but Aurora is not going to give you the same global write-anywhere/replicate-everywhere abilities as Spanner. (You can have multi-master clusters in Aurora now, but all the instances have to be in the same region, and you can’t enable cross-region replication on them.) Think of Aurora as Postgres or MySQL on cloud steroids, but still fundamentally bound by the traditional database rules of clustering, failover, and replication. Whereas Spanner is this shapeshifting Paxos-based voodoo creature that globally distributes your data at the row level.
The tradeoff with Spanner, one that Niantic apparently was okay with, is that you’re not going to get a familiar engine to build on - it speaks SQL, but it’s a bespoke system with its own libraries and drivers. It’s worth noting that Spanner finally got Postgres compatibility in preview about two weeks ago, though.
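To make the “bespoke libraries and drivers” point concrete, here’s roughly what a strongly consistent SQL read looks like through Spanner’s own Python client. This is a minimal sketch, not anything from Niantic’s codebase - the project, instance, database, and table names are all made up.

```python
from google.cloud import spanner

# Hypothetical project/instance/database names, for illustration only.
client = spanner.Client(project="my-game-project")
instance = client.instance("game-instance")
database = instance.database("game-db")

# A strongly consistent, SQL-flavored read, issued through Spanner's
# own client library rather than a stock Postgres/MySQL driver.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT player_id, xp FROM players WHERE team = @team",
        params={"team": "Mystic"},
        param_types={"team": spanner.param_types.STRING},
    )
    for row in results:
        print(row)
```

The SQL itself looks familiar; it’s the surrounding client, session, and transaction machinery that’s Spanner-specific.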
Bottom line: I don’t think you could reproduce this piece of the app one-to-one on AWS services. If you really want a Spanner-like experience on AWS, your best bet might be to pull in a third-party service: specifically CockroachDB, which approximates Spanner’s “TrueTime” design in software to get around Google’s reliance on atomic clocks.
GKE
The Kubernetes cluster in that diagram is a bit of a black box as presented; we’re given tantalizing hints about caching and event-driven optimizations, but we don’t really know how all these services are talking to each other.
The part of my brain that used to be a Serverless Hero says “see, this is what’s so great about serverless! If you’d just built your thing out of functions, all the complexity would be forced onto the cloud infrastructure diagram, and we wouldn’t have to guess about it!”
But, of course, I’m not saying this architecture could or should have been built on functions. The near-real-time latency requirements alone might rule that out. Instead, per the interview, we get “thousands” of containers - and if you’re going to run that much Kubernetes, you probably want to let a cloud provider manage it for you.
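Just to give a sense of what “thousands” of containers means at the API level, here’s a quick sketch that counts running pods across a cluster with the official Python kubernetes client. It assumes your kubeconfig already points at the GKE cluster in question - it’s obviously not how you’d actually operate a fleet like this, just a way to poke at one.

```python
from kubernetes import client, config

# Assumes kubeconfig already targets the cluster, e.g. via
# `gcloud container clusters get-credentials <cluster> --region <region>`.
config.load_kube_config()

v1 = client.CoreV1Api()
pods = v1.list_pod_for_all_namespaces(watch=False)

running = sum(1 for p in pods.items if p.status.phase == "Running")
print(f"{running} running pods out of {len(pods.items)} total")
```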
My bias here is to say that, given the choice, GKE is a better all-around option than AWS EKS. It’s simply the most fully-featured and most automated managed K8s service of the three major cloud providers, as you can see in this bake-off I helped put together for ACG a few months ago2. I’d be curious to know what Niantic’s compute bill looks like, though.
Cloud load balancing
For fairness’ sake, I should call out something here that’s been an unpleasant surprise for me while picking up Google Cloud: even a simple static website requires you to put an always-on Cloud Load Balancer in front of the storage bucket just to get HTTPS. I don’t want to rack up 24/7, compute-style charges just to serve static assets!
Obviously what’s going on in the Pokémon GO architecture is more complex than a static site. But still, it appears the Cloud Load Balancer / CDN / Storage part of the architecture could be replaced by CloudFront and S3 without much conceptual tweaking.
Data and monitoring
The one place where we actually get to see a bit of message passing happening in this architecture is the Pub/Sub notification (Google Cloud’s version of AWS SNS) that triggers the decoupled data warehousing / analytics stuff. An AWS equivalent of the Bigtable → Dataflow → BigQuery pipeline in the diagram would be a streaming ETL job, something like DynamoDB → Kinesis → Glue → Redshift. Your opinions of Redshift may vary, but I suspect Niantic is quite happy using BigQuery for their data lake needs.
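For a flavor of how that decoupling works in practice, here’s a minimal sketch of publishing a gameplay event to a Pub/Sub topic with the standard Python client. The project, topic, and event payload are invented for illustration - Niantic’s actual event schema isn’t public.

```python
import json
from google.cloud import pubsub_v1

# Hypothetical project and topic names.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-game-project", "gameplay-events")

# Fire-and-forget publish: the game servers move on immediately, while the
# downstream Dataflow -> BigQuery pipeline consumes the event asynchronously.
payload = json.dumps({"event": "capture", "player_id": "1234"}).encode("utf-8")
future = publisher.publish(topic_path, data=payload, event_type="capture")
print(future.result())  # message ID, once the publish is acknowledged
```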
One last thing. Did you notice that Prompanya keeps mentioning how happy they are with using Google Cloud Monitoring as their primary pane of glass to keep an eye on all this stuff? Coming from the AWS world, that really jumped out at me. When’s the last time you heard of someone monitoring a massive-scale app in AWS using nothing but CloudWatch - not to mention evidently being quite proud of that fact?
Now, to be fair, Prompanya never says that Google Cloud Monitoring is their only window into how the app behaves, and there could be some editorial bowdlerizing going on here. But still I’m getting the spidey-sense that, like with IAM and environments, there may be some fundamental quality-of-life difference between monitoring on AWS and on Google Cloud.
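For context, writing a custom metric into Cloud Monitoring from application code looks roughly like this - a minimal sketch with the google-cloud-monitoring Python client, using an invented project and metric name.

```python
import time
from google.cloud import monitoring_v3

# Hypothetical project; the metric name is made up too.
client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-game-project"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/game/active_sessions"
series.resource.type = "global"

now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

# One write; dashboards and alerting policies then live entirely in Cloud Monitoring.
client.create_time_series(name=project_name, time_series=[series])
```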
Maybe in the next installment of this newsletter, I’ll have to do a deep dive into Google Cloud Monitoring and see what all the fuss is about.
Disclaimer, as always, that I’m new to Google Cloud and could be totally off on something here. If so, let me know and I’ll fix it.
1. Could DynamoDB theoretically fulfill the latency and data model requirements of Pokémon GO? I suspect Rick Houlihan would say yes. He also might be the only person who could pull it off.
2. Note: that article was released before the announcement of GKE Autopilot, which abstracts away even more of the K8s management responsibilities.
On the load balancing comparison - I think CloudFront would only replace the static asset CDN? You still need to load balance incoming requests to GKE. Paying for an always-on rule isn’t great, but you can’t think of it as compute, because it’s not an instance the way an AWS ALB is - the GCP load balancer is serverless. Plus it’s global; you would need several AWS regional ALBs to do this, and you would have a pre-warming/scaling nightmare on your hands with AWS.