What would you do differently if you didn’t have a sandbox environment?
My biggest takeaway from How Uber Tests Payments in Production hitting the front page of Hacker News was that people don’t really get how reliability works for payments. Engineers have been building money software for a long time. Doing the books was one of the first applications of software.
Most of that software is still around.
Old software works. The world would have noticed otherwise. But old software is unwieldy, because it has been around since before modern testing practices existed. Like, for example, sandbox environments.
Which means that payment systems gain very little from sandbox testing.
Why? Because, over time, we’ve managed to debug those old systems with heavy and consistent use. The easy bugs are all gone. The hard bugs too. What remains is a fuzzy cloud of unreproducible, often-works-but-sometimes-doesn’t kind of problems that no amount of “testing in preprod” will help you find.
People call these zebra bugs. And they all live in production.
Engineers who are used to sandboxes know that you can push the limits when you’re not touching live systems, where errors have real consequences and data mistakes often mean fixing rows in the database by hand.
So how do you catch these bugs? That’s what this series is going to be about.
This is The Payments Engineer Playbook. Since its very inception, one of the questions readers ask the most is “how do you test payments?”. That’s because, if you’ve spent even a minute working on software that deals with money, you’ve noticed how crucial it is to get things right all the time:
Introducing errors in money software is as stressful as in any other kind of system, but it also leads to lower auth rates, increased fraud, and unnecessary errors shown to customers, who start to suspect your site is scammy, hurting both actual and potential revenue.
Customers won’t recommend a system that makes mistakes adding and subtracting money. Tiny bugs in money software impact growth.
Manual fixes don’t scale, and money software is designed to be scalable from the start. Mistakes, when found, have to be fixed immediately, slowing development of new features.
In this series, we’re going to explore a few techniques to make your payment system more reliable.
I’m expanding from a 15 minute talk I gave a few weeks ago at Kiwi.com’s Engineering Open House in Barcelona, titled Unconventional Approaches to Payment System Reliability.
What I like most about [Fintech] is the responsibility. Money is on the line. When things go well, multiple years of your salary are going to be [earned] by a very good release. And when things go wrong, multiple years of your salary are going down the drain because of a bad release.
— Me, at Kiwi.com’s Engineering Open House
This series will expand on the 4 concepts that I described in that talk, namely:
Production Testing (this article)
Redundancy
Orchestration
Fallbacks
Enough intro, let’s dive in.
Everything is Failing All The Time
Payments belongs to that subset of the software industry that is thoroughly regulated.
It’s not that you couldn’t plug your systems straight into Visa or Mastercard. It’s that they won’t allow you to, unless you painstakingly go through an endless compliance process, part designed to make sure that you have your act together, part scar tissue grown over decades of bureaucratic complexity, gaming of government regulation, and the occasional scandal.
In practice, you can’t process payments alone. The good news is that there’s a cottage industry of payment providers out there which, under different business models and orchestration approaches, allow you to accept payments from your customers.
The bad news is that payment providers fail all the time.
You on Sandbox vs You on Production
If you sell access to services via API, sandbox is marketing.
Engineers treat it like the real thing, and to some extent, it’s OK. The endpoints look the same, the happy path looks the same. And if everything goes well, both sandbox and production behave in the same manner.
But when it doesn’t go well, then sandbox is a very bad place to start debugging.
Payment providers use sandbox to deceive you in the same way that Potemkin was trying to deceive Empress Catherine II:
In 1787, as a new war was about to break out between Russia and the Ottoman Empire, Catherine II, with her court and several ambassadors, made an unprecedented six-month trip to New Russia. One purpose of this trip was to impress Russia's allies prior to the war. To help accomplish this, Potemkin was said to have set up "mobile villages" on the banks of the Dnieper River. As soon as the barge carrying the Empress and ambassadors arrived, Potemkin's men, dressed as peasants, would populate the village. Once the barge left, the village was disassembled, then rebuilt downstream overnight.
When you send a request to a sandbox environment, what receives it isn’t an exact copy of the provider’s production system.
Instead, what’s often the case is that a separate, minimal, independent service acknowledges that request, and based on a predefined set of parameters (usually the credit card number or the cardholder name), a prearranged response is sent back.
No production-like database was used to service that request. Perhaps some logging, and the request payload was validated as if it were a production one, but that’s about it.
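To make this concrete, here’s a minimal sketch of what such a sandbox endpoint often boils down to. Everything in it is hypothetical: the function name, the test card numbers, the response shapes. The point is that the answer comes from a lookup table, not from a production-like system.

```python
# Hypothetical sketch of a provider's sandbox endpoint: no issuer, no card
# network, no production-like database. Just payload validation and a
# prearranged response keyed on "magic" test values.
CANNED_RESPONSES = {
    "4000000000000002": {"status": "declined", "reason": "card_declined"},
    "4000000000009995": {"status": "declined", "reason": "insufficient_funds"},
}

def sandbox_charge(payload: dict) -> dict:
    # Validate the request roughly like production would...
    if "card_number" not in payload or "amount" not in payload:
        return {"status": "error", "reason": "invalid_request"}
    # ...then return a canned answer, defaulting to an approval.
    return CANNED_RESPONSES.get(
        payload["card_number"],
        {"status": "approved", "auth_code": "SANDBOX"},
    )
```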
The reason, of course, is that it is pointless to have a production-like system up and running to help you integrate with the API. It is also very expensive.
A production-like system in sandbox doesn’t make a business difference.
If payments demand third-party providers, and those providers are hard to test against, then a payment system’s biggest bottleneck is the reliability of its providers.
Which means that payment system reliability is not an infrastructure concern. Fine-tune Kubernetes all you like, but in the end, what decides whether you can accept payments or not happens beyond your own services.
You can get away with a fairly pedestrian Kubernetes config if you engage in the right set of strategies.
Because the right strategies to improve payment system reliability are code-first.
If you can’t rely on sandbox, the game can’t be thorough testing.
The diminishing returns of testing your system against a Potemkin provider mean that every scenario, good or bad, is only truly validated when it goes through the provider’s production system. With real money, and real users1.
The game is resilience, not never failing at anything.
Which is why observability is the ultimate tool for testing payment systems, and money software more broadly. You have to know what happens under any scenario when a real user goes through it.
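As a rough illustration, the pattern looks something like the sketch below. The `provider.charge()` client and the field names are placeholders, not a real SDK; what matters is that every real attempt leaves behind enough signal to reconstruct what happened.

```python
import logging
import time

logger = logging.getLogger("payments")

def charge_with_observability(provider, request: dict) -> dict:
    """Wrap a (hypothetical) provider call so every real outcome is visible."""
    started = time.monotonic()
    outcome = "exception"
    try:
        response = provider.charge(request)
        outcome = response.get("status", "unknown")
        return response
    finally:
        # Latency, outcome and amount per attempt: the raw material for
        # debugging scenarios you could never reproduce in sandbox.
        logger.info(
            "provider=%s outcome=%s latency_ms=%.0f amount=%s",
            getattr(provider, "name", "unknown"),
            outcome,
            (time.monotonic() - started) * 1000,
            request.get("amount"),
        )
```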
Merge And See What Happens
When I talk about production testing, I sometimes say, tongue in cheek, that we “merge and see what happens”.
Kinda is? But not exactly.
Because observability, metrics and the like are only the first step when it comes to production testing. Knowing that a release may well go wrong, one of the obvious goals is to minimize the impact of bad releases.
So let me list a few ways in which you can do that, including a few resources to read if you want to dive a little bit deeper.
Canary Deployments
Rather than routing all traffic straight away to a new release, you can route a fraction of it through a specific subset of your system, either a few instances only, or instances plus a database partition.
This is called a partial rollout, or canary deployment. Facebook has always been one of the biggest proponents of this strategy (not for payments, but in general), and my favorite resource on this is Christian Legnitto’s Facebook’s Mobile Release Process (back when the company was still called Facebook), where he describes a multi-canary process in which the new release is progressively rolled out to larger slices of traffic.
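The mechanics can be as simple as deterministic bucketing. The sketch below is not Facebook’s system, just the gist under my own assumptions: each user hashes into a stable bucket, and the canary’s share of traffic grows stage by stage.

```python
import hashlib

# Hypothetical rollout stages: the share of traffic served by the new release.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]

def routes_to_canary(user_id: str, canary_share: float) -> bool:
    """Deterministically bucket a user so they stick to the same release."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < canary_share

# Example: during the first stage, roughly 1% of users hit the canary.
print(routes_to_canary("user-42", ROLLOUT_STAGES[0]))
```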
Feature Flags
A similar way to make sure that bad releases only impact a small subset of your users is to decouple the deployment of code from the release of a feature. You do that with feature flags: a piece of configuration that programmatically allows or blocks a chunk of code from running.
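In its simplest form, the pattern looks like the sketch below. The in-memory dictionary is a stand-in for whatever flag service you actually use (more on that in a moment); the only point is that the new code path ships dark and is switched on by configuration, not by a redeploy.

```python
# Hypothetical flag store; in real life this lives in a flag service and can
# be flipped (or killed) without redeploying anything.
FLAGS = {"new-capture-flow": False}

def is_enabled(flag: str, default: bool = False) -> bool:
    return FLAGS.get(flag, default)

def capture_payment(order: dict) -> str:
    if is_enabled("new-capture-flow"):
        return f"captured {order['id']} via the new flow"   # gated, new path
    return f"captured {order['id']} via the legacy flow"    # known-good path

print(capture_payment({"id": "ord_123"}))  # legacy flow until the flag flips
```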
Feature flags are great, but also dangerous when forgotten. Knight Capital famously blew up when a bad release combined with a forgotten feature flag made its automated trading system acquire roughly $7 billion in unintended long and short positions, a mistake that cost the firm around $440 million in under an hour.
Never build your own feature flag system. A battle-tested, third-party feature flag service is your friend here.
Small, Easy to Revert Changes
I said that reliability is code-first, but I didn’t say that DevOps practices should be ignored.
In fact, many of the tenets of the book Accelerate still apply:
Frequent deployments: indicating smaller, more comprehensible changes
Fast commit-to-release times: indicating a mature CI/CD pipeline
Low Mean Time To Restore (MTTR): indicating fast detection and an easy revert process
Low Change Failure Rate: indicating that most releases have no adverse effects or degradations (this one is the most difficult to measure, as failures may be detected days or weeks after the fact).
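If you want to put rough numbers on the last two, a back-of-the-envelope calculation over your own deployment history is enough to start. The records below are made up for illustration; in practice they would come from your CI/CD system and incident tracker.

```python
from datetime import timedelta

# Hypothetical deployment history: did the change fail, and if so, how long
# did it take to restore service?
deployments = [
    {"failed": False, "time_to_restore": None},
    {"failed": True,  "time_to_restore": timedelta(minutes=18)},
    {"failed": False, "time_to_restore": None},
    {"failed": True,  "time_to_restore": timedelta(minutes=42)},
]

failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)
mttr = sum((d["time_to_restore"] for d in failures), timedelta()) / len(failures)

print(f"Change failure rate: {change_failure_rate:.0%}")  # 50%
print(f"MTTR: {mttr}")                                    # 0:30:00
```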
Many payments engineers have little exposure to DevOps practices, and Accelerate is too research-heavy as a starting point, so I would recommend giving its co-author’s novel The Phoenix Project a go instead. If you’ve already read The Goal, you’ll notice that it’s a shameless copy applied to an IT project. Still, a good read.
Not To Get It Right First Time, But To Contain Damage
The biggest shift is still one of mindset: stop thinking in terms of correctness and start thinking in terms of resilience.
The easy bugs are going to get caught in sandbox, yes. I’m not advocating for recklessly going to prod without reading some docs first.
But sandbox is like the Tutorial land in most videogames. If you want to make further progress, you’ll have to get out of your comfort zone and accept that the real magic (and the real bugs) lives in Production land.
This is it for this week’s article of The Payments Engineer Playbook. I’ll see you next week.
PS: Before you go, I have to be completely honest with you: this series has taken a lot of work to prepare.
But it’s taking me time away from other topics that you may be more interested in reading about.
I need to know from you.
Are these articles useful? Then, you can do two things.
First, I want you to leave a comment, no matter if it’s negative or positive. Either way, I want you to let me know: do you want me to keep making these articles?
It’s a lot of work, but I’ll keep doing it if I know it’s making an impact on you. Otherwise, I’ll move on to something else.
The second thing I need from you is to tell a colleague. If you’re reading this, you probably work with someone who builds payments for a living. And you’ve been reading this newsletter long enough to tell if it’s going to be useful for them too.
And if you got this article from a colleague, do me a favor and subscribe. Yes, it is a paid publication, but I publish a lot of free content that is just as valuable as the articles that are paywalled.
The quality is the same. What paid subscribers get is more insights, and more frequently.
Make a bet on The Payments Engineer Playbook. I’ll see you around.
You may be thinking “so why don’t we ask engineers to pay through the platform as a way of testing?”. This is a grey area that I’m not qualified to speak about (I’m not a lawyer), but it may be illegal. Make sure you check with HR, Legal, or someone with expertise in this area.