Why Payments Engineers Should Avoid State Machines
Event-driven money software is replayable and pull-based. Why are state machines still prevalent in payments?
There are two ways to represent movement.
Say you’re chasing someone in the dark, Marco Polo style. To find where they are, you shout “Marco”, and they scream “Polo” back. Like a radar, you move to where the voices are. And when you tag somebody, you stop.
Notice, though, that Google Maps doesn’t work that way.
Asking for directions is the other way to move. The path is ahead, and you just keep straight until a new direction is given to you. Like a car driving at night, you can see only what’s right in front of you. And when you aren’t given directions anymore, you stop.
Both approaches work. But if, once you were at your destination, I asked you which path you took, which of the two approaches would be more useful?
The first approach is a state machine; the second is an event-driven system.
A state machine cannot reconstruct the past. It can only move forward.
Payments Engineers must avoid state machines.
I’m Alvaro Duran, and this is The Payments Engineer Playbook. Scroll for five minutes on YouTube and you’ll find tons of tutorials that show you how to pass software design interviews that use payment systems. But there’s not much that teaches you how to build this critical software for real users and real money.
I know this because I’ve built and maintained payment systems for almost ten years. Behind closed doors, I’ve seen all kinds of interesting conversations about what works and what doesn’t.
And I thought, “you know what? It’s time we have these conversations in public”.
In The Payments Engineer Playbook, we investigate the technology that transfers money. All to help you become a smarter, more skillful and more successful payments engineer. And we do that by cutting off one sliver of it and extracting tactics from it.
It makes sense to think in terms of state machines.
State machines are undeniably easy to design: you draw boxes with names, and arrows that make all possible transitions explicit. The diagram forces you to think about every one of them.
But to code that way? In payments? That’s probably a mistake.
The first reason is replayability.
Replayability means being able to reconstruct what happened using previously recorded data.
Replayability is not only useful when debugging payment systems, it is also required when a customer disputes a payment. Merchants win those disputes only when they can prove that all the obligations toward the payer were met.
Since finality doesn’t really exist in payments (a payment can be disputed long after it went through), replayability is key.
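To make replayability concrete, here’s a minimal sketch in Python. The event names (authorized, captured, refunded), the Event class and the replay function are all made up for illustration; a real system would record far richer data. The only point is that an append-only list of events is enough to rebuild, step by step, what happened to a payment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A recorded fact about a payment (hypothetical shape)."""
    payment_id: str
    kind: str    # assumed names: "authorized", "captured", "refunded"
    amount: int  # minor units, e.g. cents


def replay(events, payment_id):
    """Walk the log in order and return (timeline, net captured amount)."""
    timeline = []
    balance = 0
    for event in events:
        if event.payment_id != payment_id:
            continue
        if event.kind == "captured":
            balance += event.amount
        elif event.kind == "refunded":
            balance -= event.amount
        timeline.append(f"{event.kind} {event.amount} -> net captured {balance}")
    return timeline, balance


# The log is append-only: nothing is ever updated in place.
log = [
    Event("pay_1", "authorized", 5000),
    Event("pay_1", "captured", 5000),
    Event("pay_2", "authorized", 1200),
    Event("pay_1", "refunded", 2000),
]

timeline, balance = replay(log, "pay_1")
for line in timeline:
    print(line)
```

The timeline is the kind of evidence you’d want in a dispute. A single status column can’t give you that, because every update overwrites the previous one.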
The second reason is that state machines are a bottleneck for scalability.
In case you haven’t noticed, payment systems are used pretty heavily at most companies. Most of the problems of scaling money software come from the fact that it has to be strongly consistent (everything has to be accounted for) and highly available (every second of downtime is a second when the company is not selling).
State machines are an obstacle to that because the work they do scales with the number of clients.
State machines are a push-based system for clients.
Push-based means that every client sends its requests to the server: the work is “pushed” to the server that handles the state machine. That’s what REST APIs are all about. Push-based systems are common, and they’re force-fed to engineers when they’re undergraduates.
But there’s another kind of system, the antithesis of push-based: pull-based.
This is when the server leaves a breadcrumb on an intermediary, ready to be reconstructed, much like we reconstruct our way to where we want to go on Google Maps by driving.
The app has already cached all the instructions on our phone. Which is why you can use Google Maps in airplane mode.
The server is then responsible for pushing all breadcrumbs to the intermediary, usually a durable queue. And the client is responsible for pulling all that data, and reconstructing the current state from it.
Pull-based systems are replayable. And payment systems are asked for data frequently, by many services, internal and external.
In order to have payment systems that are pull-based and replayable, we must describe changes in state in terms of events.
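Here’s a rough sketch of that division of labor, in Python. EventLog is an in-memory stand-in for a durable queue, and Client, read_from and poll are names I made up for this example; none of this is a real broker’s API. The server only appends. Each client remembers its own offset, pulls at its own pace, and rebuilds the current state from what it reads.

```python
class EventLog:
    """In-memory stand-in for a durable queue (a real one persists to disk)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # The server's only job: leave a breadcrumb.
        self._events.append(event)

    def read_from(self, offset, limit=100):
        # Reading never removes anything, so any client can re-read from 0.
        return self._events[offset:offset + limit]


class Client:
    """Pulls events and reconstructs the latest status of each payment."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.status = {}  # payment_id -> latest status seen

    def poll(self):
        batch = self.log.read_from(self.offset)
        for payment_id, status in batch:
            self.status[payment_id] = status
        self.offset += len(batch)
        return len(batch)


log = EventLog()
log.append(("pay_1", "authorized"))
log.append(("pay_1", "captured"))

ledger_service = Client(log)
email_service = Client(log)

ledger_service.poll()              # catches up on both events
log.append(("pay_1", "refunded"))
email_service.poll()               # pulls all three at once, at its own pace
ledger_service.poll()              # pulls only the one it had not seen
```

Adding a tenth client costs the server nothing extra on the write path, and a brand new client can start from offset 0 and replay everything.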
You keep using that word…
The truth is, I don’t like the word event. This is the best definition I could find:
An event is a statement that something interesting has occurred
— Randy Shoup, Large-Scale Architecture: The Unreasonable Effectiveness of Simplicity
Here’s the thing: this definition is at the core of what makes events tricky.
First, the fact that something “has occurred” means that there is an inherent synchronization problem when the client handles events.
Remember the Marco Polo game? For all its inconveniences, requesting the current state from the server is an idempotent operation by design. Repeatedly shouting "Marco" won't make the person responding with "Polo" feel like they've moved.
Compare that with Google Maps telling you to turn left at the first corner and then right, only to find out that it should have been the other way around.
With state machines, the server is responsible for the reconstruction of the current state (duh!). But event-driven servers push that responsibility to the client.
You’re getting scalability by forcing the client to accept more responsibility.
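One common way for a client to shoulder that responsibility is to have every event carry a per-payment sequence number. The sketch below assumes exactly that; the (seq, status) tuples and the apply_in_order function are hypothetical, not something a particular provider gives you. Duplicates are skipped, out-of-order deliveries are sorted, and a gap makes the client wait instead of guessing.

```python
def apply_in_order(state, last_seq, incoming):
    """Apply (seq, status) events in sequence order.

    Skips anything already applied and stops at the first gap,
    so the caller can retry once the missing event shows up.
    """
    for seq, status in sorted(incoming):
        if seq <= last_seq:
            continue      # duplicate, or already applied
        if seq != last_seq + 1:
            break         # gap: an earlier event has not arrived yet
        state = status
        last_seq = seq
    return state, last_seq


# Delivered out of order and with a duplicate: still ends up correct.
state, last_seq = apply_in_order(
    "created", 1, [(3, "captured"), (2, "authorized"), (2, "authorized")]
)

# Event 3 is missing: apply what we can and hold event 4 back.
partial_state, partial_seq = apply_in_order(
    "created", 1, [(4, "refunded"), (2, "authorized")]
)
```

Applying the same batch twice leaves the state untouched, which gives events the idempotency that shouting “Marco” has for free.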
Second, the fact that “something interesting” has occurred when an event is created means that there’s a degree of domain knowledge that is imposed on the client.
In other words: reconstructing state from a stream of events needs a modicum of domain knowledge.
That’s what state machines shield clients from! If you have a payment system that only exposes the state of a payment, the client only needs to make sure that whenever the state changes to “finalized” or “paid” or “success”, it does what the client is meant to do.
State machine payment systems hide as much information as possible from the client.
However, I don’t think they should.
I think what’s missing from state machine payment systems is a consistent definition of what it means for a payment to be in a certain state. What steps were taken to get it to where it is, so to speak.
Reconstructing a payment’s state is the same thing as defining what it means to be in that state.
Rather than hiding that information from the client, payments engineers should build common libraries that make the process of state reconstruction consistent across the client-server divide.
Copy pasting code works, yes. Until it doesn’t.
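What could such a common library look like? Here’s a minimal sketch in Python, assuming events are plain dictionaries with a type and an amount. Every name in it (HANDLERS, INITIAL, apply_event, reconstruct, the statuses) is invented for the example. The idea is that server and clients import the same reconstruct function, so “paid” means the same thing on both sides.

```python
from functools import reduce

# Hypothetical state shape; amounts are in minor units.
INITIAL = {"status": "created", "authorized": 0, "captured": 0}


def _authorized(state, event):
    return {**state, "status": "authorized", "authorized": event["amount"]}


def _captured(state, event):
    return {**state, "status": "paid",
            "captured": state["captured"] + event["amount"]}


def _refunded(state, event):
    captured = state["captured"] - event["amount"]
    status = "refunded" if captured == 0 else "partially_refunded"
    return {**state, "status": status, "captured": captured}


# One handler per event type: this table is the shared definition
# of what each state means.
HANDLERS = {
    "authorized": _authorized,
    "captured": _captured,
    "refunded": _refunded,
}


def apply_event(state, event):
    handler = HANDLERS.get(event["type"])
    # Events the library does not know about leave the state unchanged.
    return handler(state, event) if handler else state


def reconstruct(events):
    """Fold a payment's events, oldest first, into its current state."""
    return reduce(apply_event, events, INITIAL)


events = [
    {"type": "authorized", "amount": 5000},
    {"type": "captured", "amount": 5000},
    {"type": "webhook_sent", "amount": 0},   # unknown type, ignored
    {"type": "refunded", "amount": 2000},
]
current = reconstruct(events)
```

Each handler returns a new dictionary instead of mutating the old one, so the same events always produce the same state, no matter who runs the fold.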
The Nick of Time
“‘Where did you go to, if I may ask?' said Thorin to Gandalf as they rode along.
‘To look ahead,' said he.
‘And what brought you back in the nick of time?'
‘Looking behind,' said he.”
― J.R.R. Tolkien, The Hobbit
Integrating with payments APIs is difficult and error-prone. And I believe that’s because most of them are state-machine based.
It doesn’t matter if Stripe’s goal is to “abstract away the complexity of payments”. In the end, you either have 7 lines of code that are obvious, but too simple, or a PaymentIntent API that’s adequate, but no longer friendly.
What I find most useful about events is that they are individually obvious, and collectively powerful.
Events are linked to a specific action by one of the services involved: when something specific happens, a specific event gets created.
But they also stack into a story, one that can be debugged, understood, and reported.
Plus, state can be reconstructed from events, but not the other way around.
You can have events on a pull-based system, but you can also have a push-based API that exposes the state, reconstructed from all the collected events. And, if that reconstruction is lengthy and resource intensive, you can cache it, right until you collect a new event.
Not only does caching work, it is also straightforward to tell when the cache is no longer valid.
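A small sketch of why invalidation is easy, in Python. PaymentView and its methods are hypothetical names, and the state here is just the latest status string to keep things short. The cache stores how many events it has already folded in; if the log has grown since, the cache is stale, and only the new events need to be applied.

```python
class PaymentView:
    """Exposes current state, reconstructed from events and cached."""

    def __init__(self):
        self._events = {}  # payment_id -> list of statuses, oldest first
        self._cache = {}   # payment_id -> (events_applied, state)
        self.folds = 0     # counts how much reconstruction work was done

    def record(self, payment_id, status):
        self._events.setdefault(payment_id, []).append(status)

    def state(self, payment_id):
        events = self._events.get(payment_id, [])
        applied, state = self._cache.get(payment_id, (0, "created"))
        # Stale exactly when the log is longer than what the cache has seen.
        for status in events[applied:]:
            state = status
            self.folds += 1
        self._cache[payment_id] = (len(events), state)
        return state


view = PaymentView()
view.record("pay_1", "authorized")
view.record("pay_1", "captured")

first = view.state("pay_1")    # folds two events
second = view.state("pay_1")   # served from cache, no extra work
view.record("pay_1", "refunded")
third = view.state("pay_1")    # folds only the new event
```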
Events force clients to be smarter because they have to reconstruct the state of the payment. But payments engineers can accept that responsibility with a common library!
I don’t think there are many good reasons to keep using state machines in payments. Events are scalable and replayable, and their problems can be mitigated.
It’s a matter of giving clients what they need, rather than making sure they stay inside your walled gardens.
But that’s not a software problem anymore.
PS: Can I ask you a massive favor?
These articles, they take a lot of work to write. Well, not the writing itself, that’s just banging on the keyboard for a few hours, and I’m done.
It’s all the research behind that eats most of my time and energy.
For example, this post on ledgers took me a full month of reading articles on different ways to explain accounting principles to engineers.
So, if you really want to read more of this, you can do two things.
First, you can leave a comment on this very article. Let me know what you liked most about it. Let me know if you want more like this. And if you do, what do you want them to be about?
And of course, if you hated this one, you hate The Payments Engineer Playbook, you hate me, let me know that too.
And second, tell a colleague. A bunch of you kept my article on how Uber tests in production on the front page of Hacker News for 12 hours. Others told me that they’ve shared it with everyone at work, and others went to their social media and posted the link with some insightful comments.
Somebody even inserted that article in their company’s Confluence page.
If you’re reading this, I’m sure you already have a pretty good idea who might find this article useful. Go share it with them!
And if you got this article from a colleague, do me a favor and subscribe. It’s a flex to be a reader of a well-known publication before it was cool.
Make a bet on The Payments Engineer Playbook. I’ll see you around.
It appears that there are multiple levels of abstraction addressed in this article. At the foundational level, a financial system necessitates an append-only database, ledger, or register to ensure data integrity and immutability. Building upon this foundation, we handle events, transactions, and behaviors that represent the dynamic aspects of the system. Beyond this layer, we enter a domain where certain situations may permit the updating of existing data or records, corresponding to your state machine abstraction. Finally, either above or parallel to these layers, communication interfaces such as REST APIs can be utilized to translate these financial situations into client-server interactions.
I believe that all these levels are essential in designing a robust financial system. I agree that people often assume simple solutions will suffice, but in reality, they do not account for the complexity and nuances inherent in financial systems. In the article you mentioned, you articulated one of the most significant financial arguments: "It is impossibly difficult to retroactively 'make them right'." This underscores the importance of designing systems that prevent errors upfront rather than attempting to correct them after the fact. I think this could be the central point of our fascination with financial, governmental, and similar systems, where the cost of retroactive corrections can be prohibitively high.