Scale is Overrated in Payments. You Should Focus on Scalability Instead.
Black Friday And Making Payment Systems Scale
This is a bit weird to admit, but when I speak at conferences, I am terrified.
My most reliable tool for managing this anxiety has been to be over prepared. When I’m onstage, I have a script in front of me that I deviate very little from.
I know when to click the next slide, when to raise my voice, when to pause for dramatic effect.
I try to leave very little to chance, so that I can play offense.
But Q&A sessions can’t be planned. Q&As are about playing defense.
Some speakers ask organizers not to allow questions at the end. I think that is a mistake. Very often, these questions are the only honest feedback a speaker gets.
That’s why I put myself through them, despite the anxiety and terror. Q&As are a chance to learn.
This is exactly what happened when I spoke at PyCon SK two weeks ago, delivering Pyments: How to Design Payment Applications in Python.
Someone asked some version of this: Does this design scale during Black Friday?
This is an important question. Black Friday is one of the most important shopping events of the year. A payments application that malfunctions during Black Friday is very harmful to the company. Payments must always work, especially when it’s table stakes.
However, scale has nothing to do with Black Friday. Scale is in fact irrelevant when it comes to payments.
I don’t think that is controversial. Rather than scale, engineers building payment applications should focus on scalability: being able to add capacity on demand, rather than predicting the capacity that will be needed.
Designing for scale is not the same as designing for scalability.
Welcome to Money In Transit, the newsletter bridging the gap between payments strategy and execution. I’m Alvaro Duran.
Recently, we’ve looked at the advice I wish I had been given when I started in payments, a primer on the domain of payment applications, and the limitations of building money software on top of relational databases, among others. They’re all free to read.
Behind every software design decision, there is performance. Scalability is how performance is improved by adding hardware.
A scaled system is performant under heavy load. A scalable system is one that allows engineers to add or remove hardware in the face of load, and get stable performance with optimal hardware use.
As in “when load doubles, we can double the hardware and forget about it”.
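A minimal sketch of what "double the load, double the hardware" means in practice. It assumes capacity adds up linearly across identical servers, which is the ideal case for a horizontally scalable system; the function name and the requests-per-second figures are hypothetical.

```python
import math


def servers_needed(load_rps: float, per_server_rps: float, minimum: int = 1) -> int:
    """Servers required to absorb load_rps, assuming capacity adds up linearly."""
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    # Round up: a fraction of a server still means one more server.
    return max(minimum, math.ceil(load_rps / per_server_rps))


# Hypothetical numbers: each server comfortably handles 100 requests per second.
baseline = servers_needed(400, per_server_rps=100)
doubled = servers_needed(800, per_server_rps=100)
print(baseline, doubled)
```

Real systems rarely scale perfectly linearly, because shared resources such as the database eventually become the bottleneck. The closer a design gets to this ideal, the more scalable it is.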
This difference is important, because hardware can be “added” in two ways: by having more powerful servers (vertical scaling) or by having more of the same servers in parallel (horizontal scaling).
To put it simply, a scaled system has been scaled vertically. A scalable system scales horizontally on demand.
You may see where I’m going with this. For payment applications, it makes a lot more sense to build for horizontal scalability rather than deploying on a powerful machine. The key is that it is way easier to add or remove capacity when you scale horizontally than when you scale vertically.
When load is becoming unbearable, adding more servers is simpler than failing over to a more powerful machine.
In fact, engineering a system that scales horizontally is also easier, because you are not that worried about how performant a single machine is. This is vital in payment applications: the code has to stay understandable enough that a payments specialist can read it and tell you when you’re wrong.
The Scale Imperative
However, the impulse is to build for scale. And who can blame an engineer who thinks that? Payments are meant to run at scale! Aren’t your customers supposed to make lots of payments?
Payments applications should accommodate a level of scale from the get go, right?
Wrong.
The problem with that line of thinking is that anticipating load involves making the system difficult to understand. A scaled system trades off understandability for single-machine performance.
Plus, you might get it wrong, because in the beginning, you have no idea where the performance bottlenecks are going to be. Premature optimization is the root of all evil.
Rather than building a system able to endure heavy load, engineers building payment applications should get good at handling the problems with having multiple servers in parallel. In payments, the true problem is concurrency, not scale.
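To make the concurrency problem concrete, here is a small sketch of what can go wrong when several servers receive the same capture request, and one common way to guard against it. This is not the design from the talk. `PaymentStore`, `handle_capture`, and the status names are hypothetical, and the in-memory lock stands in for an atomic conditional update in a shared database.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PaymentStore:
    """Stand-in for a shared database; the lock mimics an atomic conditional update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def create(self, payment_id):
        with self._lock:
            self._status[payment_id] = "pending"

    def transition(self, payment_id, expected, new):
        """Compare-and-set: move to `new` only if the current status is `expected`."""
        with self._lock:
            if self._status.get(payment_id) != expected:
                return False
            self._status[payment_id] = new
            return True

    def status(self, payment_id):
        with self._lock:
            return self._status.get(payment_id)


def handle_capture(store, payment_id, ledger):
    """Any worker may call this; only the one that wins the transition records the capture."""
    if store.transition(payment_id, "pending", "captured"):
        ledger.append(payment_id)  # side effect happens exactly once
        return True
    return False  # another worker already handled it


store = PaymentStore()
store.create("pay_1")
ledger = []

# Eight workers race to capture the same payment, as parallel servers might.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda _: handle_capture(store, "pay_1", ledger), range(8)))

print(results.count(True), ledger)
```

However many workers race, only one transition succeeds, so the payment is captured once. Adding servers does not change that guarantee, which is what lets capacity grow without the code needing to know how many servers there are.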
A Complex System That Works Is Invariably Found To Have Evolved From A Simple System That Worked.
— John Gall, Systemantics
OPEX and Scalability
Scalable systems are often easier to sell to the non-engineering side of the organization because resources, and therefore operating costs, go hand in hand with load.
If load were predictable and stable, then a vertically scaled system would work best, because costs would be predictable and stable too.
But load is never predictable or stable.
Say your company sells primarily in the US, perhaps also in the UK and the EU. What happens when it’s 10am in Beijing? A scaled application would be sitting there, idle, whereas a scalable one would be running on its baseline number of servers, at a much lower cost.
That kind of money adds up.
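A back-of-the-envelope sketch of that difference, under loudly stated assumptions: the load profile is invented, each server handles a hypothetical 100 requests per second, the fixed fleet is sized for the daily peak, and autoscaling is treated as instantaneous and free. Only the shape of the comparison matters, not the numbers.

```python
import math


def server_hours_fixed(hourly_load, per_server_rps):
    """Fixed fleet sized for the busiest hour, running around the clock."""
    peak_servers = math.ceil(max(hourly_load) / per_server_rps)
    return peak_servers * len(hourly_load)


def server_hours_autoscaled(hourly_load, per_server_rps, baseline=1):
    """Fleet resized every hour, never dropping below a baseline."""
    return sum(
        max(baseline, math.ceil(load / per_server_rps)) for load in hourly_load
    )


# Hypothetical day: 16 quiet hours at 50 rps, 8 busy hours at 400 rps.
hourly_load = [50] * 16 + [400] * 8

fixed = server_hours_fixed(hourly_load, per_server_rps=100)
autoscaled = server_hours_autoscaled(hourly_load, per_server_rps=100)
print(fixed, autoscaled)
```

With these made-up numbers, the fixed fleet burns 96 server-hours a day and the autoscaled one 48. The gap widens as the quiet hours get quieter or the peaks get taller.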
Conversely, what happens when there’s a peak event like Black Friday? Most scaled applications are built to handle just above the average yearly load, which means they will also be under stress that day. A scalable application can add new servers and stay responsive.
To the extent that peak situations become a much bigger piece of the revenue pie, payment applications should be engineered to operate at lower levels of scale throughout the rest of the year, and only add new servers temporarily.
Decision Logic as MutEx
Does this design scale during Black Friday?
Here’s the kicker: it already does.
The design I presented in the talk is already built for horizontal scalability.
Why? Because the decision logic that prevents payment providers from overwhelming the application also allows each server to operate independently of the others.
Very few companies get the same level of load from New York and New Zealand.
It wouldn’t make sense that your payment application was designed to operate that way.