Cringey, But True: How Uber Tests Payments In Production

Well-run payment systems are developed by engineers who understand the best use of their time: catching unknown unknowns, and doing it fast.

Aug 07, 2024

You are wasting most of the time you spend testing.

And I get it. You’d rather test things in staging, because it gives you a sense of control. Most engineers even cringe at the idea of testing in production. But I think that's because they think it’s either/or. It’s not.

You can test before deployment as much as you’d like. And then, you can test in production as much as you can. The trick is when to switch.

You aren’t testing in production early enough.

That’s why what Uber does is so intriguing.

But first, why is testing in production even necessary? Can’t we just try to write code as correctly as we can inside a no-stakes environment, rather than risking real money and real impact on real users?

You could. But if you’re doing your job right, you’ll quickly run out of bugs to find in such an environment. The easy ones. If you haven’t already.

Look, the systems you maintain, that most of us maintain, have been around for a few years. Some, even decades. That is because software is not like other machines. Most machines, in time, rot and decay. But software is just information: if it’s correct, it stays that way. Hardware does need replacement, but the correct software that runs on it keeps running.

Software, if you’re doing your job right, gets better over time.

What an amazing machine.

You probably call those systems legacy with disdain, because software gets more difficult to maintain over time. But there’s a reason legacy software is scary to change. We want it to keep doing what it's currently doing.

Hate it all you want, but legacy software works, even when it’s a mess.

There’s pretty much one way to produce high quality software. Use it, and fix all the bugs you can find in it. In time, the easy bugs are gone.

Unlike the human body, old software is remarkably healthy. If using software and fixing its bugs is the only way to produce high quality software, then the only illnesses you’re going to find in old software are the exotic ones.

Payment systems are an extreme version of this. Moving money has always been the most obvious business use case for computers. And so, money software has been around for a long time.

That’s the kind of software that Uber, or any other merchant, has to deal with. What I find fascinating is that Uber is doing it in a way that makes many engineers cringe.

Uber tests its payment systems in production. And in this article, I’m going to tell you how they do it, and why it’s a great idea.

I’m Alvaro Duran, and this is The Payments Engineer Playbook. Scroll for five minutes on YouTube and you’ll find tons of tutorials that show you payment system designs that’ll help you pass interviews. But there’s not much that teaches you how to build this critical software for real users and real money.

I know this because I’ve built and maintained systems that handle close to 100,000 payments a day. And I’ve been able to see all types of interesting conversations about what works and what doesn't for payment systems behind closed doors.

These conversations are what inspired this newsletter.

In The Payments Engineer Playbook, we investigate the technology that transfers money. And we do that by cutting off one sliver of it and extracting tactics from it.

If you don’t want to test payments in production, your only choice is to use a staging environment.

And what do you have to do to set up a good one? Two things.

First, you have to copy all production data. It’s expensive, and a reckless breach of privacy and security, but it’s doable. And second, you must emulate all user activity. Staging must become a believable version of your production systems.

That reminds me of a Potemkin village.

Staging environments are not as useful as you think they are. It is unrealistic to try to erect sophisticated replicas of the real world. Your tests will only be as good as your replica is complete.

I really like how Charity Majors put it: “staging is just a glorified laptop”. Only production is production.

In fact, payment providers only do so much in their sandbox environments. As soon as you start digging deeper, you’ll notice a big gap between how sandbox behaves and all the surprises that production has for you.

But the lesson isn’t to demand better sandbox applications from your provider. They’re not going to comply, because it just doesn’t make a business difference to them.

Instead, the lesson should be this: test your payment systems in sandbox for a reasonable amount of time. And not a second more.

That’s how Uber does it.

Uber has outgrown the idea that defects can be completely eliminated in staging.

Rather than stressing out over a perfect release, Uber has put in place tools to detect production failures as early as possible, and to roll back to a known safe state quickly and easily.

These tools correspond to three key concepts: rolling out against business metrics, carefully selecting a first rollout region, and rolling out progressively.

Rolling Out Against Business Metrics

Before he started The Pragmatic Engineer, Gergely Orosz worked as an engineering manager at Uber Money. In 2018, he gave a talk on how Uber rolled out new payment methods.

For Orosz, building and rolling out a new payment method are two sides of the same coin.

Sandbox testing is something you do early on, because it speeds up development. But as soon as you can, you move into real payments with real cards.

And for that, the right debugging tools are critical.

Orosz mentions Uber’s internal tools, Cerberus and Deputy, which are responsible for two important tasks when testing in production:

Making requests to real systems in a transparent way
Channeling the responses into your own laptop

But to me, the most important point of his talk is this:

For Uber, every deployment is an experiment

What Orosz means is that Uber recognizes that nobody really knows how any deployment is going to turn out. Every release is a guess, and your job is to make it an educated one.

Therefore, every deployment is a hypothesis about what certain business metrics will look like from the moment it goes live.

Which metrics to measure, and which monitors to put in place to do that, varies from company to company. But a payment method that doesn’t help your company earn more or spend less is a wasted effort.
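
Here is a minimal sketch of what “a deployment is a hypothesis” could look like in code. It is not Uber’s tooling: the metric names, the thresholds, and the `MetricHypothesis` and `violated` helpers are all assumptions of mine, for illustration only. The idea is to write down, before going live, the range you expect each business metric to stay in, and then compare that against what production reports.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricHypothesis:
    """Expected range for one business metric after a deployment goes live."""
    name: str
    lower: float
    upper: float

    def holds(self, observed: float) -> bool:
        return self.lower <= observed <= self.upper


def violated(hypotheses: List[MetricHypothesis], observed: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or outside their expected range."""
    failures = []
    for h in hypotheses:
        value = observed.get(h.name)
        if value is None or not h.holds(value):
            failures.append(h.name)
    return failures


# Hypothetical metrics and thresholds, for illustration only.
HYPOTHESES = [
    MetricHypothesis("authorization_rate", lower=0.90, upper=1.00),
    MetricHypothesis("uncollected_rate", lower=0.00, upper=0.02),
]

healthy = {"authorization_rate": 0.94, "uncollected_rate": 0.01}
broken = {"authorization_rate": 0.93, "uncollected_rate": 0.15}

print(violated(HYPOTHESES, healthy))
print(violated(HYPOTHESES, broken))
```

A missing metric counts as a violation here: if you can’t measure it, you can’t confirm the hypothesis.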

Carefully Select a First Rollout Region

I’m not going to name names but some large successful unicorns in this city still deploy all their Java WAR files to production at once. At once. And they have a reputation for going down a bunch. I have no idea why.
— Charity Majors, Engineering Large Systems When You're Not Google Or Facebook

Brazilians always get the new stuff from Facebook first.

This is by design. One of the corollaries of “every deployment is an experiment” is that you should mitigate any potential problems by exposing the change to the smallest subset of users that is still significant. Only when you don’t see anything wrong do you expose it to more users.

The first experiment region is how you do that in the beginning. A way to contain the impact of a potential screw up.

When Uber rolled out Google Pay, they decided to focus their monitoring on Portugal.

It was a country:

Small, but not tiny: Rolling out incrementally to 100% of the country’s traffic would already be a significant number
In a time zone close to where the team was based (Amsterdam): This made live monitoring so much more convenient for them.
With representative users: Most Portuguese pay on Uber through the Authorize flow, just like pretty much everyone globally.
Where the provider’s dependencies are minimized: In Portugal, the old Android Pay had close to no penetration.

Selecting a first experiment region can do wonders if you accept payments globally.
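
One way to contain a rollout to a first region is a flag keyed on country, plus a stable per-user bucket. This is a sketch under my own assumptions, not how Uber implements it: the `ROLLOUT_PERCENT` table, the `google_pay` feature name, and the 10% figure are all hypothetical.

```python
import hashlib
from typing import Dict

# Hypothetical rollout table: country code -> percentage of users exposed.
# Countries that are not listed get 0%.
ROLLOUT_PERCENT: Dict[str, int] = {"PT": 10}


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100), so exposure doesn't flip between requests."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, country: str,
               rollout: Dict[str, int] = ROLLOUT_PERCENT) -> bool:
    """Expose the feature only in listed countries, and only to the configured share of users."""
    percent = rollout.get(country, 0)
    return bucket(user_id, feature) < percent


# A user outside the first region never sees the feature.
print(is_enabled("google_pay", "user-123", "ES"))
# Raising the region to 100% exposes every user in it.
print(is_enabled("google_pay", "user-123", "PT", rollout={"PT": 100}))
```

Hashing the feature name together with the user ID means the same user always lands in the same bucket for a given feature, so raising the percentage only ever adds users and never swaps them.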

Canary Deployments

Done well, canary deployments make rollbacks more frequent, not less.

Like canaries in a coalmine, deploying to a subset of users is meant to be done frequently. You assume that the moment something doesn’t look good, you will pull back fast.
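
That “pull back fast” loop can be sketched in a few lines. The stage percentages and the `is_healthy` callback are hypothetical; in a real system that callback would be backed by the monitors you put on your business metrics.

```python
from typing import Callable, List, Tuple

# Hypothetical percentages of traffic exposed at each stage.
STAGES = [1, 5, 25, 100]


def progressive_rollout(stages: List[int],
                        is_healthy: Callable[[int], bool]) -> Tuple[int, List[int]]:
    """Walk through the stages in order; roll back to 0% at the first unhealthy one.

    Returns the final traffic percentage and the history of percentages visited.
    """
    history: List[int] = []
    current = 0
    for percent in stages:
        current = percent
        history.append(current)
        if not is_healthy(current):
            history.append(0)  # roll back to the known safe state
            return 0, history
    return current, history


# Everything looks fine: the rollout reaches 100%.
print(progressive_rollout(STAGES, lambda percent: True))
# Metrics degrade at 25% of traffic: the rollout is pulled back to 0%.
print(progressive_rollout(STAGES, lambda percent: percent < 25))
```

The rollback is not the exceptional path in this sketch. It is one of the two normal ways a rollout ends.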

No deployment strategy is going to make each individual deployment safer. What canary deployments give you is the opportunity to trade SEV-1 outages for a few SEV-3s and SEV-4s.

And guess what? That’s exactly what happened to Uber when they were rolling out Google Pay. Numbers didn’t add up in the beginning!

Rolling out cautiously in Portugal was a smart decision. It surfaced bugs in Google Pay.

Our uncollected rate was huge. And we first just said “all right, are we stupid? Are we missing something here?” But no, everything seemed fine. Seemed like no mistakes.
So we searched it all with Google. First, we rolled back and we talked with Google. And, you know, It turns out there were some issues on their end, and there were some issues on our end, and we certainly fixed it. But it took quite a while.
— Gergely Orosz, Payments Integration at Uber: A Case Study

How on Earth are you going to find bugs in Google Pay from a staging environment?

You can’t. You just can’t.

And that’s what’s fascinating to me. On the one hand, you’ve got engineers who take every precaution possible before rolling out because payments are something you should never break.

And on the other hand, you’ve got companies like Uber, who take every precaution possible after rolling out because they understand that the game is resiliency, not never failing at anything.

That’s the lesson that I’m taking away from how Uber does things. Testing before production is fine, but returns diminish sharply.

At some point, you’re better off checking your work against real users, and real money.

This reminds me of algorithmic trading. You can develop a strategy that performs great with backtest data, in a no-stakes environment. But the real test is the real thing with real stakes. Nothing else compares to it.

You should think of deployments as experiments. Only production is production. Anything else is a prelude.

All right, that’s it for this article of The Payments Engineer Playbook. See you next week.

PS: Before you go, I have to be completely honest with you: this article and the one on Stripe took A LOT of work. I’m not sure if I’m going to keep making these kinds of articles anymore.

I need to know from you.

Are these articles useful? Then, you can do two things.

First, I want you to leave a comment, no matter if it’s negative or positive. Either way, I want you to let me know: do you want me to keep making these articles?

It’s a lot of work, but I’ll keep doing it if I know it’s making an impact on you. Otherwise, these ideas will stay private inside of my team.

The second thing I need from you is to tell a colleague. If you’re reading this, you probably work with someone who builds payments for a living. And you’ve been reading this newsletter long enough to tell if it’s going to be useful for them too.

And if you got this article from a colleague, do me a favor and subscribe. It’s a flex to be a reader of a well-known publication before it was cool.

I bet that’s how it feels to be the VC who led a startup’s Series A, on the day of its IPO.

Make a bet on The Payments Engineer Playbook. I’ll see you around.

Viktor Miroshnikov

Aug 7, 2024

Loved it. Especially the emphasis on resilience vs attempting to "never fail".

Stephen Humphrey

I’m not in payments and yet, after reading this brilliant article, I subscribed to your channel. You touch on bedrock engineering principles which transcend your vital niche, methodically demonstrating how those principles apply to payments, but also hinting at how they apply throughout well-engineered systems. Well done.
