State Machines Are Probably a Bad Idea
Engineers are better off collecting Events and figuring out State afterwards
One year and take ago, Queen Elisabeth II passed away. Over the following 10 days, a carefully coreographed plan codenamed London Bridge was set in motion, which included a lying in state, the widespread tradition of placing the deceased head of state in a major building for everyone to pay their respects. It’s not something idiosyncratic to Brits. Countries that follow this tradition include Canada, North Korea, Russia, United States and, of course, Vatican City.
Why we do that became evident during the cholera epidemics of the nineteenth century. The general fear of premature burial led to the invention of many safety devices which could be incorporated into coffins.
Lying in state is a very old mechanism to make sure that whoever was pronounced dead was, indeed, dead. Waiting a few days served as proof that the doctor hadn’t make a mistake, and the social gathering was the perfect shroud with which to wrap the ceremony.
Human beings don’t die; they’re pronounced dead.
Seeing Like a State Machine
Most web applications contain several examples of state, a property of an instance that is subjected to some form of lifecycle. Invoices can be issued, and then paid; blog posts can be drafted, published and archived; Payments can be authorized, captured and refunded.
The most obvious way of dealing with this functionality is to implement a state machine, which boils down to having a field—a column in the database table, an attribute on the data class—that contains the current status, subject to change under certain conditions.
Most programming languages implement packages to implement state machines. Python, in particular, has automata.
Sometime ago, Shopify went as far as publishing a post on their blog claiming that developers weren’t using state machines often enough, and that they should be “force-fed” to them. This should’ve given all pause to think about state machines as a whole. If developers are wary of using state machines, it is likely that there’s something about them that makes them a bad choice.
Intuitively, engineers use state machines for very simple cases: CMS implement blog data models tend to include variables such as created_at, published_at and archived_at to reflect all the possible combinations of a blog state, without the need of a explicit state machine.
I’ve never used the automata library in my entire career. It’s not that it doesn’t do the job right—I presume it does. It’s that state machines lead to nasty and tech-debt heavy implementations of the state of entities in the domain.
It all comes back to the idea of being dead vs being pronounced dead. In order to transition to a new state, there is almost always an algorithm involved. Someone checks the pulse, the pupils, the breath. Then CPR. Only then, someone might have passed away.
State is almost always the result of an algorithm; it is therefore not an inherent property, because the conditions on which we considered the entity having a certain state will change in response to changes to that algorithm.
Modeling Payments
Say you have a payment system, where you keep track of a Payment object, which can be authorized, captured and refunded. Engineers at Shopify would certainly say that a state machine could be useful to model the behavior of the payment.
The problem, though, is that most payment systems are initially developed by engineers with a weak understanding of the domain they’re building software for. This problem is pervasive, and is probably the result of how easy it is for coders to learn about the domain compared to domain experts learning about writing code.
State machines are very difficult to change, because adding a new state involves ensuring backwards compatibility with all the states that were in place before. In the Payment example, I cheekily forgot to tell you that what we mean by refunds in the real world are, in practice, two different things, depending on whether the Payment was captured (refund) or simply authorized (void).
Now go change all Payments that were authorized, but not captured, and then eventually refunded, and change them to voided. Do not make mistakes, though: voiding a payment is cheaper than refunding one, and accounting for fees is going to be a nightmare.
And, months into the project, someone is going to realize that any of these states require some intermediary state, because most payment providers take some time to authorize or capture a payment—there’s fraud and other kinds of risks involved, and want to be sure about it.
Using state machines for modelling, and then keeping track of transitions with an adjacent entity is a corollary bad idea. In this example, and in many others, you’ll be better off if you keep track of what has happened rather than what is the situation after.
In other words, rather than focusing on the transition, engineers should focus on events.
A woman stopped breathing; someone checked, and she didn’t have a pulse; someone performed CPR, then checked pulse again.
The country paraded in front of her coffin for two days. We can pronounce her dead.