How Modern Treasury Invented Event Locking

Race conditions are the nastiest problem in money software. They've forced engineers to make wrong design choices. Not anymore.

Nov 20, 2024

Sometimes, good discoveries are like buses. You wait a long time for them to show, and then suddenly, two of them come along at once.

This happened to calculus, to natural selection and to computability. Often, it’s because open collaboration and debate make seeds of an idea spread. Usually, it’s because it’s about damn time.

A few weeks ago, I read about an elegant idea, buried inside an old Modern Treasury post.

It was one that I had thought was my own.

I’m Alvaro Duran, and this is The Payments Engineer Playbook. If you scroll for 5 minutes on YouTube, you’ll find many tutorials showing you how to pass software interviews where the interviewer asks you to design a payment system. But if you want to build this critical piece of software for real users, and real money, you’re pretty much on your own.

I know this because I’ve built and maintained payment systems for almost ten years. I’ve been able to see all kinds of interesting conversations about what works, and what doesn’t.

But it was behind closed doors. Lately, I’ve decided to make these conversations public. This is how The Payments Engineer Playbook was born.

One reader said that The Playbook is about “deep dives on the stack behind the magic”. We investigate the technology used to transfer money, so that you become a smarter, more skillful and more successful payments engineer.

And we do that by cutting off one sliver of it and extracting tactics from it. Today, we’re looking at an elegant variation of optimistic locking that lets you engineer payment systems without the need for enums.

Let’s dive in.

Image: Ethan Hunt optimistically trying to get into a locked plane.

I hold a grudge against enums.

They are the programmers’ laundry lists. But often, they’re used as a class status, or type, or kind. Every engineer I know has used them—I have, too.

But grudgingly.

I dislike enums as class attributes because they attract problems like honey attracts flies.

Nothing is as critical to an object’s behavior as status, kind or type. But using enums forces you to validate that the attribute equals the one you wish you had every time you implement a behavior.

if status == 'pending':
    pass  # do this

if status == 'canceled':
    pass  # do that

This design creates change amplification. If you need to refactor some function that validates these attributes, how can you be sure that it doesn’t create problems elsewhere? And if you add a new item to the enum set, how can you be sure that every function that validates that attribute is refactored appropriately?

You test, of course. But testing becomes overwhelming, because enums, like concurrency, are prone to state space explosion:

So while concurrency might be difficult to reason about, I don't think it's because of a fault in our brains.
In my opinion, a better basis is state space explosion. Concurrency is hard because concurrent systems can be in a lot of different possible states, and the number of states grows much faster than anyone is prepared for.
— Hillel Wayne, What makes concurrency so hard?

Enums have similar dynamics: on every change, the number of things you have to check grows even more, to the point where you’re overwhelmed.

Past that point, it’s all bugs and delays.

That’s why I’ve been doing a small experiment on my own. Instead of having a single class with a type attribute, I’ve been prototyping a payment system where I have multiple classes that are similar, even equivalent, except for a different prefix in their naming.

So, instead of a single Payment class, I’ve built a system where I have type classes such as AuthenticatedPayment, AuthorizedPayment and CapturedPayment.
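Here is a minimal sketch of what such type classes could look like. The three class names come from the paragraph above; the fields (payment_id, amount_cents) and the authorize and capture methods are assumptions made for illustration, not the actual prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthenticatedPayment:
    payment_id: str      # hypothetical field
    amount_cents: int    # hypothetical field

    def authorize(self) -> "AuthorizedPayment":
        # Only an authenticated payment can be authorized.
        return AuthorizedPayment(self.payment_id, self.amount_cents)


@dataclass(frozen=True)
class AuthorizedPayment:
    payment_id: str
    amount_cents: int

    def capture(self) -> "CapturedPayment":
        # Only an authorized payment can be captured.
        return CapturedPayment(self.payment_id, self.amount_cents)


@dataclass(frozen=True)
class CapturedPayment:
    payment_id: str
    amount_cents: int


payment = AuthenticatedPayment("pay_1", 1500)
captured = payment.authorize().capture()
```

Each class only exposes the behavior that makes sense for it, so there is no status attribute to validate. Calling capture on an AuthenticatedPayment fails because the method doesn't exist, and a type checker can flag it before the code runs.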

What I’ve discovered while building this prototype is that removing enum attributes has a subtle database trade-off.

And the way around it is a solution that Modern Treasury engineers had figured out years ago.

The Callback Event Condition

You might know that payment providers often implement notifications, where they let clients specify a callback URL that is meant to receive event messages.

We’ve already discussed that sending these kinds of messages only once is impossible:

Crypto enthusiasts will tell you all about The Byzantine Generals Problem, but this is what it boils down to: from your end, you can’t say if the other party has received your message. What you can do, though, is send your message multiple times, and hope that one of them reaches its destination.
— Exactly-Once Payments at Airbnb

To get around this limitation, payment providers keep sending callback requests until you’ve sent some acknowledgement response.

This is good! Sometimes, your application has crashed, or you’re having downtime during a deployment, and your system’s availability is taking a hit. Without retrying, you would have never received that message. Providers have designed their system to prevent this scenario from happening.

The problem is that you often get notified more than once.

This, combined with your Kubernetes cluster of stateless services talking to the same database, leads to a very nasty problem: race conditions.

Your system, which stores data for one payment expecting an authorization event, suddenly receives two of them. Two processes, spawned from two independent servers, retrieve that payment data, neither of them aware of the other process’s existence.

Maybe both of them subtract the amount from the ledger. Maybe they mark two installments paid, instead of one. Maybe they send twice as many emails to the customer.

Just like buses.
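The double-processing scenario can be reproduced without any real concurrency, just by interleaving the steps by hand. This is a toy sketch: the in-memory dictionaries stand in for the shared database, and every name in it (ledger, payment, read_payment, apply_authorization) is made up for the example.

```python
# Shared state, standing in for the database both servers talk to.
ledger = {"balance_cents": 10_000}
payment = {"id": "pay_1", "amount_cents": 2_500, "authorized": False}


def read_payment():
    # Each process works on its own snapshot of the row.
    return dict(payment)


def apply_authorization(snapshot):
    # Check-then-act on a snapshot that may already be stale.
    if snapshot["authorized"]:
        return
    ledger["balance_cents"] -= snapshot["amount_cents"]
    payment["authorized"] = True


# The same authorization event is delivered twice.
# Both processes read before either one writes.
snapshot_a = read_payment()
snapshot_b = read_payment()

apply_authorization(snapshot_a)
apply_authorization(snapshot_b)  # still believes the payment is unauthorized
```

Both handlers pass their own check, so the amount is subtracted twice and the ledger ends at 5,000 cents instead of 7,500.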

Optimistic Ledgers

To prevent this problem, you use a locking mechanism.

That’s the subtle trade-off. While enums as class attributes are stored in the database, there’s nothing about the type classes there.

You can flag a specific row as locked while you’re updating the enum column. But if the table you’re locking isn’t the same as the one you’re changing, flagging a specific row won’t work. You’d have to lock the entire table.

This is the problem that I had. And it’s the same problem that Modern Treasury engineers had with their ledger.

We’ve discussed ledgers in the past.

Ledgers are a blend of accounting and engineering. Accounting, because they rely on an elaborate system that predates computers by a few thousand years. Engineering, because they sit at the center of all your company’s finances.

They are hard to get right though:

I used to work for a startup that, on every transaction, simply lost track of a couple of cents. As if they fell from our pockets every time we pulled out our wallets.
At this startup, a stock trading platform, the engineering team had followed the mantra of “make it work, make it right, make it fast”; we refused to build a double-entry accounting system. [...]
We could’ve taken the time to build it right. We could’ve done things better. But we didn’t.
— Engineers Do Not Get To Make Startup Mistakes When They Build Ledgers

Double-entry accounting involves having an Accounts table, where each row is associated with many rows in the Journal Entries table.

How do ledgers prevent race conditions? With either a pessimistic or an optimistic lock.

A pessimistic lock assumes that conflicting transactions happen frequently, and locks the specific row for reads or writes, halting any attempt at checking the information contained in it by any concurrent process.
An optimistic lock, on the other hand, assumes that conflicting transactions happen rarely, and so it doesn’t prevent reads by processes running in parallel. The lock instead happens when a commit happens: the fastest process will succeed, while the others will be rejected.

Optimistic locks work because a version number is incremented every time a row in the table changes, so that any attempt to update a row using an outdated version is flagged as stale. Optimistic locking is often the preferred option, because it doesn’t prevent other processes from reading data from the database while the system adds new entries.
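A minimal sketch of that version check, using Python's built-in sqlite3 module. The accounts table, its columns and the helper functions are assumptions for illustration; the two "processes" are simulated by reading twice before writing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, balance_cents INTEGER, lock_version INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES ('acct_1', 10000, 0)")
conn.commit()


def read_account(conn, account_id):
    """Return (balance_cents, lock_version) without taking any lock."""
    return conn.execute(
        "SELECT balance_cents, lock_version FROM accounts WHERE id = ?",
        (account_id,),
    ).fetchone()


def debit(conn, account_id, amount_cents, seen_balance, seen_version):
    """Write only if the row still carries the version we read."""
    cursor = conn.execute(
        "UPDATE accounts SET balance_cents = ?, lock_version = lock_version + 1 "
        "WHERE id = ? AND lock_version = ?",
        (seen_balance - amount_cents, account_id, seen_version),
    )
    conn.commit()
    # Zero rows updated means someone else changed the row first.
    return cursor.rowcount == 1


# Two processes read the same row; both see version 0.
balance_a, version_a = read_account(conn, "acct_1")
balance_b, version_b = read_account(conn, "acct_1")

first_succeeded = debit(conn, "acct_1", 2500, balance_a, version_a)
second_succeeded = debit(conn, "acct_1", 2500, balance_b, version_b)
final_balance, final_version = read_account(conn, "acct_1")
```

The first write bumps the version to 1, so the second write, still holding version 0, matches no row and is rejected. The caller then decides whether to retry with fresh data or drop the duplicate.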

But it’s not as straightforward to implement.

Event Locking

Modern Treasury engineers soon realized that none of the out-of-the-box approaches work if the row you’re locking belongs to the Accounts table, and you want to prevent the creation of new rows in the Journal Entries table.

And that’s when I realized that the problem they had, and the one I was having with my prototype, was equivalent.

This is how they solved it: by separating the versioning from the ledger.

The engineers at Modern Treasury realized that the version number didn’t need to live inside the Accounts table. It could have its own, separate Account Version table, related one-to-one to the Accounts, with a column named “lock_version”.

In this approach, new Journal Entries aren’t created in isolation. They’re wrapped inside a transaction with the operation that increments the value inside lock_version. This is like an optimistic lock, except it is now the server, not the database, that decides if the lock_version is correct, or if the transaction needs to be rolled back entirely.
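To make the idea concrete, here is a sketch of it in Python with sqlite3. It is my reading of the approach, not Modern Treasury's actual code: only the lock_version column name comes from the description above, while the table layouts, the StaleVersion exception and the helper functions are assumptions.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode,
# so transactions are opened and closed explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(
    """
    CREATE TABLE accounts (id TEXT PRIMARY KEY);
    CREATE TABLE account_versions (
        account_id TEXT PRIMARY KEY REFERENCES accounts(id),
        lock_version INTEGER NOT NULL
    );
    CREATE TABLE journal_entries (
        id INTEGER PRIMARY KEY,
        account_id TEXT NOT NULL REFERENCES accounts(id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO accounts VALUES ('acct_1');
    INSERT INTO account_versions VALUES ('acct_1', 0);
    """
)


class StaleVersion(Exception):
    """Raised when another process bumped the version first."""


def read_version(conn, account_id):
    return conn.execute(
        "SELECT lock_version FROM account_versions WHERE account_id = ?",
        (account_id,),
    ).fetchone()[0]


def append_entry(conn, account_id, amount_cents, seen_version):
    """Insert a journal entry and bump lock_version in one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO journal_entries (account_id, amount_cents) VALUES (?, ?)",
            (account_id, amount_cents),
        )
        bumped = conn.execute(
            "UPDATE account_versions SET lock_version = lock_version + 1 "
            "WHERE account_id = ? AND lock_version = ?",
            (account_id, seen_version),
        )
        # The server, not the database, decides the version is stale.
        if bumped.rowcount != 1:
            raise StaleVersion(account_id)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # the journal entry is discarded too
        raise


# The same event is handled twice; both handlers read version 0.
seen_a = read_version(conn, "acct_1")
seen_b = read_version(conn, "acct_1")

append_entry(conn, "acct_1", -100, seen_a)
try:
    append_entry(conn, "acct_1", -100, seen_b)
    duplicate_applied = True
except StaleVersion:
    duplicate_applied = False

entry_count = conn.execute("SELECT COUNT(*) FROM journal_entries").fetchone()[0]
```

The accounts row is never touched, and no table is locked. The second handler's insert is rolled back along with its failed version bump, so only one journal entry survives. On a database with real concurrent writers, the two version updates would contend for the same account_versions row, which is what serializes them.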

The trade-off is that the server has more responsibility for ensuring the integrity of the data. But in my case, I had already moved that responsibility to the server with my types!

Now, I can have a Payment Version table, one-to-one to a Payments table. And I can do the same trick every time the provider sends a new event.

Modern Treasury didn’t come up with a good name for this approach. So let me call it event locking.

Event Locking: To use an append-only, independent table as a server-driven mutual exclusion mechanism.

One Step Closer To Eliminating The Enum

I’m surprised that such an approach isn’t a well-known standard.

But, in science, independent discoveries happen so frequently that I shouldn’t be too surprised. It’s not for lack of resources out there.

I believe it’s the cacophony.

Without specific advice, we’re drowning in pointless content. Tantalizing, but void of value, like a Fortune 500 CEO speech or a LinkedIn made-up story.

That creates opportunity.

Related: Why Payments Engineers Should Avoid State Machines, by Alvaro Duran (September 25, 2024)

When I published Why Payments Engineers Should Avoid State Machines, some people on Hacker News went crazy about it. “This is of course wrong” was the sentiment of the majority of the comments.

And yet, save some sentences where I could’ve made my point clearer, I believe I was right, and they were wrong.

Approaches like Modern Treasury’s tell me that avoiding enum-based state attributes is possible. This opens a world of possibilities for payment systems.

I can use the type system more effectively, and write less buggy code. More correct code.

I can go faster, once I spend less time fixing mistakes.

I can focus on building, and not as much on correcting.

I can build at pace, with consistency.

This has been The Payments Engineer Playbook. I’ll see you next week.

PS: I want to ask you a question.

You’re in the middle of a multi-year long software project. You’ve nailed some things, but you also made some mistakes along the way. You certainly have a clearer picture of what the project needed in the beginning, and what was superfluous.

This is my question: How much would you have paid, back in the early stages of the project, to have access to someone with that insight?

I believe that the biggest reason software projects fail is not a lack of technical knowledge, but a lack of domain knowledge.

But because most companies hire for technical skills, those engineers who have a deep understanding of the domain they’re in (read: they’ve made the same mistakes before) are seen as better, indispensable. They’re promoted faster. They get more attention. They have influence and leverage.

Building payment systems, I’ve already nailed some things and made some mistakes. Thanks to it, I have a clearer picture of what payment systems need, and what is superfluous.

And I’m offering you access.

Every Wednesday, you’ll get an article, just like this one, full of insight, lessons learned and common mistakes, about a particular aspect of payment systems. All tailored to engineers who build money software. Payments engineers.

For $15 a month, or $149 a year, you can read them all.

If being able to avoid the mistakes your competitors are making when building payment systems is worth it to you, I suggest you pledge a subscription for 2025.

Because at the beginning of next year, the price is going up. And I don’t want you to miss the chance to lock in the price as an early subscriber.

And if someone you respect shared this article with you, do me a favor and subscribe. Every week I feel I’m getting better at this. That means that my best articles on how to build payment systems are probably yet to be written.

You can only find out if you subscribe to The Payments Engineer Playbook. I’ll see you around.

Ryan Johnson

Nov 21

Why is it that using enums requires validation?

Olek Gornostal

Nov 23

| But if the table you’re locking isn’t the same as the one you’re changing, flagging a specific row won’t work. You’d have to lock the entire table.

What do you think about the example below, where it locks a row FOR UPDATE in one table and updates other tables? It should give you those transactional guarantees, lock just a single row, and allow concurrent reads.

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; -- to allow concurrent reads
SELECT * FROM accounts WHERE account_id = X FOR UPDATE;
INSERT INTO transaction_ledger (account_id, transaction_date, debit, credit, description)
  VALUES (X, NOW(), 100, 0, 'Purchase of item Y');
UPDATE inventory SET count = count - 1 WHERE item_id = Y; -- or do any other business-related updates
UPDATE accounts SET total_balance = total_balance - 100 WHERE account_id = X;
COMMIT;
