This is a chapter from Code-First Reliability.
There’s a certain kind of joy that you can experience when you’re on call.
It’s that palms-are-sweaty, knees-weak, arms-heavy feeling, but with no vomit on your sweater already. The joy that comes from proving that you’re competent and have poise under pressure. And I’m fully aware of the rapacious practices of some companies when it comes to unpaid overtime. But still.
Being on call can be a joyful experience.
No matter how good your payment systems are, they depend on external providers. Being “always-on” is not just about writing defensive code or running elastic infrastructure. Accepting payments 24/7 is the result of code that’s always running, but also of people who are always vigilant.
As a result, and unlike other domains, money software requires you to be on call not only because your code can fail, but because it depends on systems that are beyond your control. And the decision about what to do can’t easily be delegated to a machine, because whether a provider is “down” isn’t exactly a rules-based question.
When nuance is required, the decision maker has to be human.
I’m Alvaro Duran, and this is The Payments Engineer Playbook. Over the last month, I’ve expanded on the engineering practices that make payment systems reliable. For many, it is a shock to realize that these practices have little to do with infrastructure. To have reliable money software, you have to engage in practices that are code-first.
That’s because payment systems can’t get away from depending on payment providers. These are the Stripes, Adyens and PayPals of the world, companies authorized by the payment networks to process payments “for the rest of us”.
We’ve covered why payment systems get tested in production, why retrying on a different provider often works, and how tokenization is the key to seamless retries and agentic commerce.
But none of this matters when providers are down. And that’s when the engineers on call have to jump in. Because, when it comes to payments, every second counts.
This article will cover:
What a reasonable on call arrangement looks like
Why automated incident response seldom works
Why being good at on call is the mark of a great payments engineer
How to write code to make being on call easy (and what’s paradoxical about that)
What being on call really is all about (hint: not heroics)
And some mental tricks to cope with on call induced burnout
Enough intro, let’s dive in.
(Over)Time is Money
An on call rotation is like a market for bonds.
There is a primary market, where the company has a constant need for engineers to be ready to open their laptops when there’s an alarm. And there’s a secondary market, where you trade your spot in the rotation with another engineer who may be able to take it.
This means that there is money involved in an on call rotation by design.
I’ve never heard of engineers buying and selling on call shifts from each other. But there is no functional on call rotation when the employer refuses to pay engineers for their time on call.
A good on call rotation has the following properties:
It’s voluntary, and paid
Engineers can pick their teammates’ shifts when they’re offered
Alarms go off rarely
Well documented Standard Operating Procedures (SOPs) are available
Most importantly, being on call must feel like Free Money. That sounds ambiguous and subjective, but if you’ve ever been on call, you know it’s remarkably easy to tell.
After all, you’re giving up certain things you could do with your free time, and you’re committing to carrying your laptop everywhere you go. Plus the nagging feeling, however small, that at any moment the alarm may go off.
If on call shifts don’t feel like Free Money, either because the alarms go off too often or because the problems are too hard to fix, you get what in bond markets would be called a liquidity shock: not enough engineers willing to take on call shifts voluntarily, whether from the employer or from each other.
When that happens, this market can self-correct in two ways: either the employer works at making the frequency of alarms go down, or it increases the money paid to engineers on call.
Are you convinced now that an on call rotation is like a bond market?
Wait, Can’t You Use a Circuit Breaker and Call it a Day?
If only.
Aside from failing all the time, payment providers are very cagey with their outages.
It’s not that they don’t report that they’re experiencing problems. It’s that they err on the side of “let’s wait it out before letting everyone know that we’ve screwed up”.
It’s a very human thing to do. Payment providers have contracts with explicit Service Level Agreements. These are often measured in only a few minutes per month, sometimes even per year.
Acknowledging an outage can cost a lot of money if done too early. The status page, as a result, is rarely the place that will get updated first.
But a few 500s don’t necessarily mean that the provider is down. It may simply mean that their Kubernetes cluster is acting up momentarily, or that they’re doing some database change that incurs some tiny amount of downtime.
A circuit breaker isn’t helpful when it comes to determining whether a payment provider is down. That requires nuance, and nuance is inherently human.
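To make that concrete, here’s a minimal sketch of the kind of error-rate breaker you might reach for (the class name, thresholds and window are all made up for illustration):

```python
import time
from collections import deque


class NaiveCircuitBreaker:
    """Opens when the recent failure rate crosses a fixed threshold.

    This is the part a machine can do well. What it can't do is tell a
    30-second blip of 500s during a database change apart from a real
    outage: both look identical inside this sliding window.
    """

    def __init__(self, window_seconds=60, failure_threshold=0.5, min_requests=20):
        self.window_seconds = window_seconds
        self.failure_threshold = failure_threshold
        self.min_requests = min_requests
        self.results = deque()  # (timestamp, succeeded) pairs

    def record(self, succeeded: bool) -> None:
        now = time.monotonic()
        self.results.append((now, succeeded))
        # Drop anything that has fallen out of the sliding window.
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()

    def is_open(self) -> bool:
        if len(self.results) < self.min_requests:
            return False  # not enough data to judge either way
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results) >= self.failure_threshold
```

Whatever values you pick for `window_seconds` and `failure_threshold`, some transient glitch will trip the breaker and some slow-burning outage will stay under it. Deciding which one you’re actually looking at is the judgment call.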
That’s why on calls aren’t going away.
Selecting Out For Great Payments Engineers
When the alarm goes off, it feels a little bit like that moment in the Harry Potter series1 when Harry goes back in time and, as Dementors circle around past (future?) Harry and Sirius, he waits. He waits because he’s expecting the cavalry to save him any minute.
But there is no cavalry. No one is coming. It’s silence unless he does something.
Alarms going off in payment systems depend largely on which payment providers you’ve integrated into your platform.
You’d guess that the expensive ones fail less than the cheaper ones…but that’s not the case.
Payment providers fail all the time. Period.
This inescapable condition of money software has one important consequence for the engineers who maintain it: they must get good at being on call.
And there’s only one way to become good at it: flight time. Good payment systems are reliable, not only because the software is sound and well maintained, but also because, when some provider experiences a problem, engineers know what to do, and can navigate production effectively.
If you’re new to a payments engineering team, try to be on call as much as you can. Initially, shadow one of the veterans; then proactively ask to be the first responder, on the condition that you’ll escalate to them when you don’t know what to do.
This works for seniors and juniors alike. And nothing increases my trust in a new team member faster.
The Paradox of Observable Code
If navigating production effectively makes you a better payments engineer, how can you make that easier?
How do you write code that makes navigating production easier?
By giving yourself the tools to make it easy to see what users are doing.
In Code-First Reliability in Payment Systems, I suggested three approaches to minimize the impact of bad releases: Canary deployments, Feature Flags and Small Changes.
But, once your code is running, you want to know as soon as possible that something bad is starting to happen. And, for that, you want to have meaningful data (there’s a small sketch after this list):
You want visible and business-related metrics that alert you when things aren’t going according to plan. For example, an authentication rate that’s below normal could be a glitch, but it could also mean that your provider is acting up. Metrics tell you what normal looks like.
You want complete tracing of the code that runs as a result of any user’s action. You want a few important events in your persistent storage, and the rest in some temporary log store that can be retrieved in real time, or a few days later, for debugging purposes. Which data belongs in which bucket will vary from system to system, but the crucial thing is to have both. Traces tell you the whole story.
You want exceptions to be complete and accurate. Exceptions are the most valuable piece of data you can get from the use of your system; when they happen, you want to know everything about them. Exceptions challenge your assumptions.
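Here’s a minimal sketch of the first two, using only the standard library (the baseline, thresholds and event names are invented for illustration): a rolling authentication-rate check that knows what “normal” looks like, plus timestamped, structured events you can read back later.

```python
import json
import logging
import time
from collections import deque

logger = logging.getLogger("payments")


class AuthRateMonitor:
    """Business-level metric: rolling authentication rate vs. a known baseline."""

    def __init__(self, baseline=0.92, tolerance=0.05, window=500):
        self.baseline = baseline    # what "normal" looks like for your traffic
        self.tolerance = tolerance  # how far below normal before you worry
        self.outcomes = deque(maxlen=window)

    def record(self, authenticated: bool) -> None:
        self.outcomes.append(authenticated)

    def looks_wrong(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline - self.tolerance


def emit_event(name: str, **fields) -> None:
    """Structured, timestamped event: cheap to write now, invaluable at 2am."""
    logger.info(json.dumps({"event": name, "ts": time.time(), **fields}))


# Hypothetical usage inside the charge path:
monitor = AuthRateMonitor()

def on_authentication_result(payment_id: str, provider: str, ok: bool) -> None:
    monitor.record(ok)
    emit_event("authentication_result", payment_id=payment_id,
               provider=provider, ok=ok)
    if monitor.looks_wrong():
        # This is where your alert goes off and someone opens a laptop.
        emit_event("auth_rate_below_baseline", provider=provider)
```

In a real system the metric would go to your metrics backend and the events to your log store; the shape of the data, not the tooling, is the point.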
Here’s the paradoxical thing about observable code: when done right, there’s less to debug. When you’ve made the effort to collect, assess and refine the data you capture from the use of your system, the amount of debugging you have to do is lower, because you’re more effective.
It won’t completely erase production failures (nothing will), but at 2am on a Saturday, I accept all the help I can get. Especially if it makes my job easier.
No Heroics Here
If you have to think when an alarm goes off, then there’s room for improvement.
Production incidents have a tendency to happen at the worst possible time. Often, in the middle of the night. At that moment, chances are your thinking won’t be at its best. That, combined with the adrenaline of your fight-or-flight response, makes you more likely to make a silly mistake.
Unless someone has already done the thinking for you.
The more closely you follow an agreed-upon script, playbook, or SOP, the better. It doesn’t matter whether the number of steps is small or large, only that they’re easy to follow.
It’s a lot like having to watch flight attendants perform the safety demonstration before the plane takes off. It’s not that the steps are hard, but that, even at your worst, you can still do them.
If you’re going to be on call, and you don’t have a script to follow when something bad (but common) happens, write one down.
Even if it’s inaccurate, or incomplete, or if it involves praying to God for the problem to go away. Do it. And then, check with the rest of your team to get alignment.
A draft SOP is better than none. It brings wrong assumptions into the open.
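For reference, this is roughly the shape a first draft can take. Every step, name and number here is hypothetical; yours will look different:

```
SOP: authentication failures spike on Provider X

1. Acknowledge the alarm in the paging tool.
2. Check the last deploy. Did anything ship in the past hour? If yes, roll it back first.
3. Check our own error-rate dashboard, then the provider's status page (remember it lags).
4. If failures are isolated to one provider for more than N minutes, flip the routing
   feature flag to the fallback provider.
5. Write down what you did, and when, in the incident Slack channel.
6. If unsure at any point, escalate to the secondary on call. That is not a failure.
```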
Externalize your memory
At 2am, it’s better to leave breadcrumbs.
Before you follow each step in the appropriate SOP (did I already mention that you should have one?), make sure that you write down what you’re doing, while you’re doing it, in some Slack channel that any of your teammates can see.
This is going to save everyone’s time.
First, because if you have to escalate to someone else, it’s best if they can fire up their laptop and see what you’ve been doing so far, without having to ask you. That’s less time to get up to speed, which shortens the time to the eventual fix and limits the damage caused by the problem.
And second, because it is really easy to write postmortems with this information. Everything is timestamped; the document practically writes itself.
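If writing things down while firefighting feels like friction, make it a one-liner. Here’s a minimal sketch using Slack’s incoming webhooks (the environment variable and the message are placeholders; typing directly into the channel works just as well):

```python
import datetime
import os

import requests  # third-party HTTP client

# An incoming-webhook URL you create for your incident channel.
SLACK_WEBHOOK_URL = os.environ["INCIDENT_WEBHOOK_URL"]


def breadcrumb(message: str) -> None:
    """Post a timestamped note of what you're doing, as you're doing it."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    requests.post(SLACK_WEBHOOK_URL, json={"text": f"[{now}] {message}"}, timeout=5)


# Mid-incident, it looks like this:
# breadcrumb("Auth rate on provider X down to 71%. Flipping traffic to provider Y.")
```

The habit costs seconds either way; the point is that it happens as you act, not after.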
Overall, I’d suggest that you dial the frequency of comms depending on who you’re communicating with. Your teammates: go for as much as possible. Managers and executives: every 10 to 15 minutes will be enough in most cases, and even less frequently if the issue goes on for longer.
The key is to spread the right amount of information to the right people.
It’s going to be OK
On call can be a joyful experience, but for many engineers, it is miserable. That’s rarely their fault.
And when something isn’t your fault, there’s only so much you can do to address its root cause. That’s why miserable on call rotations often lead to burnout.
Look, mental health is not my area of expertise. What I can give you is the kind of advice that a good friend may give you, and it’s this:
It’s going to be OK.
When I was starting out as a software engineer, I was part of a team that took the whole idea of “sprint” really seriously. Not making it, that is, not moving all the tickets agreed upon at the beginning of the two weeks to “done”, was a huge deal.
And one week, I got stuck with something. And we didn’t make it.
It sounds silly, but I took it really badly. “Am I going to get fired?” was a question I was constantly asking myself over that weekend.
Lucky for me, I was friends with someone more senior in the org. We grabbed some beers that Saturday, and I asked him about it.
“It’s going to be OK”, he said.
It’s not that what you do doesn’t matter. It does. But if you do something that costs your company money, then it’s an investment in your education. Even if it’s a lot of money.
Production failures happen, and they’re going to be costly. But being defensive about them doesn’t make you grow as an engineer, and that’s, I believe, what hurts people the most.
We’re all humans, and we’re all going to make mistakes.
So we might as well make the most of them.
This has been The Payments Engineer Playbook. I’ll see you next week.
Yeah, well, I’m a Millennial.