How to build 99.999% uptime payment systems
In order to keep pace with state-of-the-art, money software must work for all possible databases
In May next year, I’ll be in WebExpo, in Prague, to give a talk on how to build payment systems, and I’d love for you to be there. You can find all the info here.
It takes only a few years for customers to demand what today seems impossible.
At the beginning of 2024, Stripe announced that its systems had achieved 99.999 percent uptime in 2023. This means that they had only 5 minutes of downtime throughout the whole year.
This is the state of the art. Now it’s your turn to live up to it.
Unbelievably high levels of availability can only be achieved with modern storage techniques and cleverly designed distributed systems. For example, many of us are following with tremendous interest the development of TigerBeetle, the database built specifically for financial data.
They claim that it is 1000 times faster than the mainstream ledgers. What’s mainstream, what you use today, is now very slow.
However, TigerBeetle’s main problem is adoption. Or, as one savage reader of The Playbook puts it, “talent is too thinly spread in fintechs these days to do a roll-your-own at acceptable medium term risk levels”.
Companies building money software would be way better off using TigerBeetle. But they often build payment systems that are too rigid at the database level, and therefore risky to migrate.
In order to keep pace, payments engineers need to be able to switch databases faster.
They need persistence ignorance.
I’m Alvaro Duran, and this is The Payments Engineer Playbook. If you scroll for 5 minutes on Youtube, you’ll find many tutorials showing you how to pass software interviews where the interviewer asks you to design a payment system. But if you want to build this critical piece of software for real users, and real money, you’re pretty much on your own.
I know this because I’ve built and maintained payment systems for almost ten years. I’ve been able to see all kinds of interesting conversations about what works, and what doesn’t.
But it was behind closed doors. Lately, I’ve decided to make these conversations public. This is how The Payments Engineer Playbook was born.
One reader said that The Playbook is about “deep dives on the stack behind the magic”. We investigate the technology to transfer money, so that you become a smarter, more skillful and more successful payments engineer.
And we do that by cutting off one sliver of it, and extract tactics from it. Today, we’re looking at isolation levels, and how to remove the restrictions in the way of achieving 5 minutes per year of downtime.
Let’s dive in.
Nowadays, relational databases are so deeply embedded that most engineers assume them right out of the gate.
Even when they know there are non-relational databases out there.
The biggest reason is that relational databases are the most sensible approach. We’re familiar with them—we know how they work, and especially how they stop working, and why.
But the fact is that there are a few trade-offs that are implicit in choosing a relational database that could become an obstacle to scalability.
This week, I’m going to focus on one particular trade-off: isolation levels. Next week, we’ll be looking at consistency models.
Isolation is the I in ACID compliance. It means that relational databases are designed to preserve a particular kind of fiction: that you’re the only one interacting with it.
With relational databases, you’re very rarely aware that other users are making requests to the same machine.
But this condition is what makes them less powerful, less scalable and, compared to specialized databases like TigerBeetle, a thousand times slower.
Here’s the kicker: payment systems could benefit a lot from more relaxed levels of isolation. They could very well achieve 5 minutes a year of downtime that way.
In order to preserve the isolation fiction, relational databases are serializable. It means that when two transactions start on the same database at roughly the same time, the outcome we’ll see after that would be as if they were executed sequentially.
Even if they happen at once, the end result would be as if first, we did one, then we did the other.
This is a great guarantee to have when we hit the like button on a Substack newsletter like this one. If two readers hit that button at the same time, what we want to see is two likes, not one.
When two users send money to the same account in parallel, we want everything to add up.
But payments is a special kind of money software. When you acquire payments from your customers, most of the interactions are already interleaved.
It is incredibly rare for a customer to pay twice in a matter of milliseconds. In fact, when it happens, it is almost always a double charge.
Why does that matter? Because, if payments are almost always interleaved with each other, we can trade a little bit of isolation in exchange for better scalability.
And, being the optimization nerd that I am, the question is: can we go all the way? Can we remove all restrictions and be safe, without burdening the server with too much complexity?
I believe we can.
Weakening commitments
Dirty reads, non-repeatable reads, and phantom reads. Those are the problems that serializable databases protect you from.
Dirty reads happen when other clients read data that we haven’t yet committed. Non-repeatable reads happen when data from before and after an update are combined, producing a paradoxical result. And phantom reads lead other clients to make update decisions based on outdated data.
If we can design payment systems that avoid these problems, then we may have a chance at emulating Stripe’s 5-minutes-of-downtime-per-year services.
The good news is that we can. With event locking.
Event Locking
To use an append-only, independent table as a server-driven mutual exclusion mechanism
Last week, when I published How Modern Treasury Invented Event Locking, you didn’t think I would use event locking only for ledgers, did you?
A mutual exclusion mechanism that’s append-only and server-driven relieves the database from the responsibility of making sure that isolation problems don’t happen.
That responsibility can be placed onto the server. Event locking can be used in payments to achieve persistence ignorance.
Stripe gets 99.999% uptime on top of a MongoDB-like database. I believe that, if we want to be able to do that too, we need to build servers that talk to the database as if it were MongoDB (even if it’s not).
Just like isolation levels preserve a particular kind of fiction in the database, we can do away with them, and preserve a similar fiction on the server.
To do exactly that, we can use events, and event locking.
Notice that dirty reads and repeatable reads are problems that are specific to database updates. That is, we can go around these problems by building a locking mechanism that is append-only.
Creating events, one by one, is the trick. Whoever creates the first event wins—the rest, if any, have to wait. This prevents dirty reads and repeatable reads without depending on any lock on the database itself.
Append-only events make it pretty easy for the server to ensure that conflicting payment operations, those that refer to the same user and the same payment, are blocked, while interleaving those that belong to different users.
But what about phantom reads? What if I check the database for events, find that I’m good to go, and then immediately after another client creates the first event?
Again, you’re good. Append-only events are like git branches. If you try to merge a pull request that isn’t rebased to the latest, you’re going to have conflicts.
Preserving the Fiction
Interleaved payments make strict isolation levels an obstacle, not a guardrail.
Strict isolation is one of the trade-offs that prevent you from keeping up with Stripe’s 5 minutes of downtime.
But the worst part isn’t that you can’t beat Stripe. The worst is that what Stripe is doing now is what your customers are going to expect from you in a few years.
Of course, they’re not going to spit it out that clearly. “Hey, your SLAs are falling behind” is not a common customer complaint.
Instead, they’re going to say “it hangs”.
Instead, they’re going to think “why is it taking this long to pay?”.
I know because that’s what I often wonder when I pay online now. I believe you do, too.
We’re engineers, after all, and we’re ahead of the curve when it comes to understanding availability and latency. As the quote goes, the future is already here—it’s just not very evenly distributed.
Preserving customers is a certain kind of race, not necessarily to be the first, but to at least avoid being the last. 5 minutes of downtime sounds unfeasible, until you realize, just like Shopify did years ago, that today's Black Friday is tomorrow’s Base Load.
At the database level, the task is to always drive safe. But as fast as you can.
That’s it for this article of The Payments Engineer Playbook. See you next week.
PS: I have a quote for you by Hillel Wayne:
All "best practices" and thought leader advice is geared towards the baseline generic system in that field.
The baseline generic distributed system, the baseline enterprise app, the baseline agile team, the baseline test suite.
No project is "baseline generic". All of them have complications, whether technical, historical, or domain-related, that make them special.
— Hillel Wayne, LinkedIn post
Generic advice yields generic outcomes. This is fine for mature industries, where processes evolve slowly and steadily, and optimal approaches can be deduced.
Payments isn’t one of those industries.
With The Payments Engineer Playbook, I set out to write what I needed back when I started in payments. What would’ve made me a better engineer in this specific domain, and not in any domain generically.
What would’ve protected me from lots of costly mistakes along the way.
There are a few newsletters many of us know about. But they are focused on one of two things: system design job interviews, and soft skills.
And I get it! These are probably the only topics that are broadly applicable for software engineers. System design interviews are mostly the same, regardless if you’re interviewing for one company or another, and soft skills are valuable everywhere.
These are areas where optimal approaches have already been deduced.
System design interviews will get you the job, and soft skills will help you keep it. But what’s going to put you ahead is doing your damned job, and doing it right.
If you build software that moves money around, being a paid subscriber of The Playbook pays for itself.
You may know that this newsletter is going to become a paid publication in 2025. Before then, you can pledge a subscription for $15 a month, or $149 a year. As soon as the year is over, the price will increase.
If you know somebody who can benefit from this newsletter, can I ask you to forward this email to them?
Even if they can’t afford it themselves, most companies have a training budget for their employees (ask your manager). These things move slowly though, so if you need approval from your company, better hurry up before the window closes.
And if someone you respect shared this article with you, do me a favor and subscribe. Every week I feel I’m getting better at this. That means that my best articles on how to build payment systems are probably yet to be written.
You can only find out if you subscribe to The Payments Engineer Playbook. I’ll see you around.
Your discussion of isolation level is about performance; what do they have to do with uptime?