36 Comments

Loved it. Especially the emphasis on resilience vs attempting to "never fail".

I’m not in payments and yet, after reading this brilliant article, I subscribed to your channel. You touch on bedrock engineering principles which transcend your vital niche, methodically demonstrating how those principles apply to payments, but also hinting at how they apply throughout well-engineered systems. Well done.

Stephen, this made my day. Thank you so much for your kind words. I'll do my best to live up to the expectations in future articles :D

This is one of the better eng articles I’ve read lately. Keep it up!

Hope you stay tuned for more!

It is an amazing article. Thanks!

A colleague shared this article with me and I loved reading it. Your work on writing it was much appreciated. Thanks for that!

Great article! Super insightful! Keep it going!

Great article, super informative!

We have a similar situation at my work. Our system handles card processing. We can gather real-time data and use it in a staging environment, but nothing, I mean nothing, is the same as real-time. We found a bug and have stressed over how best to test the fix. I keep saying we just need to release it, watch the results, and roll back quickly if it fails. Your article was very timely. Thank you!

Good luck and have fun with your rollout!

super interesting

Love your work. I think Charity Majors is a legend, but the impact of you making these points from your position within Uber is huge. Even more so given you're speaking about payments, which is where s**t gets real. Never underestimate that impact.

I was pivoting to public cloud back when Netflix was scaling. People like Adrian Cockcroft talking about their very real problems and unique solutions has been instrumental to the way I think about resilient, scalable distributed systems. You continue that fine tradition & it's awesome. Thanks :)

Hey Andrew, thanks for this! Watch out for next week's post on Airbnb, I'm sure you'll like it :)

This was a great read, subscribed!

Awesome article. It is helpful and it increases my curiosity around getting that playbook.

This is total vindication for all the times I argued against spending vast engineering effort on the 'perfect' staging system. This approach makes more sense, especially when dealing with 3rd party APIs which often have garbage dev/staging environments.

I would add some nuance to what I said about third party sandboxes. They're not going to improve, but that's because they're built so that you can do API integration testing.

Which is a great first step, and much needed. But there's more to it than just making sure that schemas conform to the API.

I've worked on financial transaction systems that are way lower traffic than Uber, and it was still essential to test in production. It almost seems self-evident - the more critical the software is, the more important it is to test the final, live product properly.

I think the message of how your operations supports this is the real takeaway here.

Thanks for writing!

I couldn't agree more, Dave. And yet, testing in prod makes most people cringe. What do you think is the reason?

I've worked in payments (not in this depth, but similarly, for an online game) about 15 years ago. Now I am working on developing a college course about the difference between prototyping and production. Thank you so much for providing such a useful reference on this topic as it pertains to payments!

Let me shamelessly plug my recent talk on the topic of prototype vs production: https://news.alvaroduran.com/p/enterprise-python

Hope it's useful!

Found this article through Hacker News and followed instantly, since it talks about a field I’ve always been fearful to engage in. Handling payments sounds terrifying: you have to deal with both legacy software and behemoth institutions when money is so visibly on the line. This article helped temper my fears though. If we design with failure as an expectation, we can fail more gracefully. There is no need to fear because there is no risk. The stakes are only high if a failure has no backup plan.

Hi Nate, I appreciate your comment, especially the last sentence. It captures the zeitgeist of what I'm going for with this post beautifully!

I believe this post resonated with a lot of people precisely because of its emotional component. We're just afraid to test stuff in prod. It's trite, but the solution is precisely leaning into that fear.
