Boring Is Good: How Shopify Prepares for Black Friday
The “today’s Black Friday is tomorrow’s base load” attitude has powered the evolution of Shopify's multi-tenant data architecture.
Shopify achieved massive scale while changing almost nothing about its codebase.
They’re at the epicenter of Black Friday every single year. I would have expected a few clever tricks, or an overwhelmingly complex design.
And yet, it’s all very boring. That’s what’s clever.
Shopify becomes humongous for one weekend every year: Black Friday. Wouldn’t you expect cutting-edge database technology and a blazingly fast programming language to pull that off?
I was surprised to learn that Shopify runs one of the oldest and biggest Ruby on Rails applications out there.
Now, here’s the trick: Shopify is a SaaS company. And they can get away with doing a few things that are inaccessible to Google or Facebook, even at scale.
This is good news for payment orchestrators. Most engineers working in payment orchestration dream of, and fear, the day their system reaches overwhelming scale. “We must invest in complex technology upfront,” they often say. “We’re going to need it.”
Well, you aren’t gonna need it. And in this article, I’m going to tell you why.
I’m Alvaro Duran, and this is The Payments Engineer Playbook. Use your most elaborate prompts on ChatGPT and you’ll be able to deploy a basic payment system. But no LLM is going to teach you how to support and scale this critical software for real users and real money.
I know this because I’ve built and maintained payment systems for almost ten years. And I’ve seen, behind closed doors, all kinds of interesting conversations about what works and what doesn’t.
These conversations are what inspired this newsletter.
In The Payments Engineer Playbook, we investigate the technology that transfers money, all to help you become a smarter, more skillful, and more successful payments engineer. And we do that by cutting off one sliver of it and extracting tactics from it.
What is the most valuable asset of any SaaS? The trust of its customers.
Unlike Microsoft’s customers in the 1990s, who licensed its software, customers of a SaaS company pay rent. They’re tenants.
This distinction is crucial, because none of the FAANGs are SaaS companies. And many of the lessons on achieving scale come from software companies with a completely different business model.
Engineers building payment systems should pay more attention to Shopify than to any of the FAANGs. Any team building payments, whether for a single merchant or as a payment orchestration company, behaves a lot like a SaaS company.
The goal isn’t to offer a one-size-fits-all service at scale, like the FAANGs do. To build a payment system is to build, for as many customers as possible, something bespoke to each of them. Just like building a SaaS.
That’s what Shopify’s engineers got right: they don’t run a single, centralized view of all their clients. Each one gets a separate piece of the pie.
And it all starts with separate schemas.
What you can do is have a Separate Schema architecture: one schema per customer. It’s like having folders in your database. Queries will never mix one customer’s data with another’s. The isolation is built in.
With a library like django-tenants, it takes very little to get it up and running. Plus, you’re still handling a single database, with shared connections, buffers and memory.
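To make that concrete, here’s a minimal sketch of what a django-tenants setup might look like. The app and database names are placeholders, not anything Shopify runs:

```python
# settings.py -- a minimal django-tenants configuration (names are illustrative)
SHARED_APPS = [
    "django_tenants",   # must come before any tenant app
    "customers",        # hypothetical app holding the tenant and domain models
]
TENANT_APPS = [
    "django.contrib.contenttypes",
    "payments",         # hypothetical app whose tables exist once per schema
]
INSTALLED_APPS = SHARED_APPS + [app for app in TENANT_APPS if app not in SHARED_APPS]

# One PostgreSQL database; django-tenants switches schemas per request
DATABASES = {
    "default": {
        "ENGINE": "django_tenants.postgresql_backend",
        "NAME": "payments_db",
    }
}
DATABASE_ROUTERS = ("django_tenants.routers.TenantSyncRouter",)

# Resolves the tenant (and therefore the schema) from the request's hostname
MIDDLEWARE = [
    "django_tenants.middleware.main.TenantMainMiddleware",
    # ...the rest of your middleware
]

TENANT_MODEL = "customers.Client"
TENANT_DOMAIN_MODEL = "customers.Domain"
```

From there, every query that goes through Django’s ORM is scoped to the schema of the tenant that made the request.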
The downside is that recovery after a failure is very painful.
Let’s say one tenant’s data is corrupted, and you need to restore it from a backup. You can’t restore the whole database: all the other tenants would have their data rolled back, even though their data wasn’t corrupted.
Your only realistic option is to restore that tenant’s backup into a temporary database, copy over whatever tenant data is still intact, and import the merged result into the production database.
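Here’s a sketch of that dance, assuming one PostgreSQL schema per tenant and a custom-format pg_dump backup. Database names, schema names, and paths are all made up:

```python
# restore_tenant.py -- a sketch, not Shopify's tooling. Assumes one
# PostgreSQL schema per tenant and a custom-format pg_dump backup.
import subprocess

TENANT_SCHEMA = "tenant_1042"         # the corrupted tenant's schema
BACKUP_FILE = "nightly_backup.dump"   # full-database backup
TEMP_DB, PROD_DB = "tenant_restore_tmp", "payments_db"

def run(cmd, **kwargs):
    print("$", " ".join(cmd))
    return subprocess.run(cmd, check=True, **kwargs)

# 1. Restore only that tenant's schema into a scratch database
run(["createdb", TEMP_DB])
run(["pg_restore", "--dbname", TEMP_DB, "--schema", TENANT_SCHEMA, BACKUP_FILE])

# 2. ...here you'd copy over whatever tenant data survived the corruption,
#    reconciling it with the backup...

# 3. Swap the repaired schema into production
run(["psql", PROD_DB, "-c", f'DROP SCHEMA "{TENANT_SCHEMA}" CASCADE;'])
dump = run(["pg_dump", "--schema", TENANT_SCHEMA, TEMP_DB], capture_output=True)
run(["psql", PROD_DB], input=dump.stdout)

run(["dropdb", TEMP_DB])
```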
Complicated, and time-consuming. That’s why Shopify built an automated process, so it can do this all the time.
Shopify has leaned heavily on multi-tenancy to achieve the elasticity it needs to scale for Black Friday and support a 2023 peak of $4.2 million in sales per minute.
Shopify probably started like any SaaS: with a separate schema per customer, all contained in the same database.
Separate schemas are the best compromise between simplicity and performance.
However, in time, successful companies have to move beyond a single database.
For Shopify, this moment arrived very soon. Not necessarily because of the growth of its customer base, but because a few of its merchants were extremely fast-growing and ran flash sales.
Online merchants make the most of the attention they get on social media with extremely quick campaigns: special offers, in a limited time window, often for a limited amount of stock. And that put a lot of pressure on Shopify’s infrastructure almost from the get-go.
Especially on Black Friday.
But Shopify didn’t rewrite their codebase when they started to feel the pains of growing fast.
If you believe that Shopify should scale like Facebook or Google, that is surprising.
But Shopify started like any other SaaS. It’s likely that it relied on a Separate Schema architecture early on. To keep growing, it would make more sense to make copies of Shopify’s system, each serving its own set of merchants. As if each copy were a mini-version of the whole.
And that is roughly what Shopify did.
Not your friendly Kubernetes pod
When Shopify engineers use the word pod, they don’t mean a Kubernetes pod.
According to Bart de Water, a Shopify pod is a mini-version of the whole Shopify application, holding all the data for a few of its merchants. This architecture follows what de Water calls the Tenant Isolation Principle: each shop on the platform doesn’t know about any other shop.
That is a version of the Separate Schema data architecture, but at scale.
In order to accommodate the growing number of shops powered by Shopify, the infrastructure scales the number of pods horizontally: more shops, more pods.
But in order to accommodate the growth of each merchant, they rebalance the pods: merchants with a lot of traffic get paired with those with less traffic.
And the way they rebalance looks a lot like restoring from a backup.
When a request hits any Shopify tenant’s URL, it reaches the OpenResty service, a dispatcher built on top of NGINX. This dispatcher keeps a routing table of which tenants live on which pod, and forwards the request accordingly.
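In production that logic is Lua running inside OpenResty, but conceptually the dispatcher is little more than a lookup table. A toy version in Python, with made-up shops and pod addresses:

```python
# A conceptual sketch of the dispatcher's job; the real implementation
# is Lua inside OpenResty/NGINX. Shops and pods here are made up.
ROUTING_TABLE = {
    "knives.example.com": "pod-17.internal",
    "sneakers.example.com": "pod-04.internal",
}

def route(request_host: str) -> str:
    """Return the pod that owns this tenant's data."""
    pod = ROUTING_TABLE.get(request_host)
    if pod is None:
        raise LookupError(f"unknown tenant: {request_host}")
    return pod  # the dispatcher proxies the request to this pod

print(route("knives.example.com"))  # pod-17.internal
```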
Bart de Water gives a thorough explanation in this video, but the process looks like this (with a code sketch after the steps):
1. Rebalancing starts when the tenant’s data gets replicated into a second pod. Both the stored data and the binlog are sent to the new pod to guarantee there is no data loss.
2. When all the data is copied, the active pod gets locked for writes. New requests get queued until the process ends. This usually lasts only a couple of seconds.
3. OpenResty updates the routing table to point to the new pod. The new pod starts receiving the queued requests, and is then live.
4. The now deactivated pod gets the tenant’s data deleted. An asynchronous job cleans up the deactivated data to reclaim space.
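Here’s that choreography sketched with toy in-memory pods. Every name is invented; the real process is Shopify-internal automation:

```python
# Toy model of a pod rebalance. Everything here is illustrative.
class Pod:
    def __init__(self, name):
        self.name = name
        self.data = {}          # tenant -> rows
        self.writable = set()   # tenants currently accepting writes

def rebalance(tenant, source, target, routing_table, queued_writes):
    # 1. Replicate the stored data (the binlog catch-up is elided here)
    target.data[tenant] = list(source.data[tenant])

    # 2. Lock the tenant for writes; new requests queue at the dispatcher.
    #    In production this window lasts only a couple of seconds.
    source.writable.discard(tenant)

    # 3. Flip the routing table; the new pod drains the queue and goes live
    routing_table[tenant] = target.name
    target.writable.add(tenant)
    target.data[tenant].extend(queued_writes)
    queued_writes.clear()

    # 4. Delete the stale copy (asynchronously, in the real system)
    del source.data[tenant]

pod1, pod2 = Pod("pod-1"), Pod("pod-2")
pod1.data["shop-a"], pod1.writable = ["order-1"], {"shop-a"}
table = {"shop-a": "pod-1"}
rebalance("shop-a", pod1, pod2, table, queued_writes=["order-2"])
print(table, pod2.data)  # {'shop-a': 'pod-2'} {'shop-a': ['order-1', 'order-2']}
```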
For components that need to work across pods, such as Shop Pay, Tenant Isolation introduces a lot of complexity.
But inside the pod, Tenant Isolation is how engineers work in a scale agnostic environment. Shopify’s growth, both in the number of shops and in their size, doesn’t get reflected in more complex application code.
And that, for a company whose systems need to resist peaks of demand, is a lifesaver.
How To Load Test Like The Best
Even the company best at elastic scale does thorough load testing.
Here’s a chart that Bart de Water shared in “Shopify’s Architecture to Handle the World’s Biggest Flash Sales”:
Shopify did two important things to be ready for Black Friday 2021.
The first thing was having two kinds of load testing.
Shopify first tests its newly added components against the previous year’s Black Friday scale. These are called Architecture tests: they make sure that engineers haven’t introduced performance bottlenecks, without pushing the whole application to its limits.
Architecture tests check that what didn’t exist during Black Friday the year before could have endured that load seamlessly.
And once the engineers have tested all functionality at that scale, they start dialing the load up with Scale tests.
Architecture tests ask “can the system endure what happened?”. Scale tests ask “how much can the system endure?”.
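A toy Scale test might look like the sketch below: keep dialing the request rate up until the system under test stops keeping up. The endpoint and thresholds are made up:

```python
# scale_test.py -- a toy Scale test: ramp the load step by step until
# the system degrades. Endpoint and thresholds are illustrative.
import asyncio
import time
import aiohttp

ENDPOINT = "https://staging.example.com/checkout"  # hypothetical target

async def one_request(session):
    try:
        async with session.get(ENDPOINT) as resp:
            return resp.status < 500
    except aiohttp.ClientError:
        return False

async def fire(session, n):
    t0 = time.monotonic()
    results = await asyncio.gather(*(one_request(session) for _ in range(n)))
    return sum(results) / n, time.monotonic() - t0

async def scale_test():
    async with aiohttp.ClientSession() as session:
        rate = 100
        while True:
            ok, elapsed = await fire(session, rate)
            print(f"{rate} requests: {ok:.0%} ok in {elapsed:.1f}s")
            if ok < 0.99:              # the system stopped keeping up
                break
            rate = int(rate * 1.5)     # dial it up

asyncio.run(scale_test())
```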
The second thing was testing failure, not just load.
Dependencies often fail. Especially at Black Friday. For that reason, Shopify performs a special kind of testing that reminds me of Netflix’s Chaos Monkey.
They call them game days.
The best way to be prepared for an outage is to know in advance how to detect it when it happens, and what will be required to fix it.
That’s why Shopify uses resiliency matrices. These are just spreadsheets that map every dependency in your system to the expected user experience when it fails.
This is the one that de Water shares in the video:
Resiliency matrices inform the scenarios that will be tested on game days. For example, if the Redis database that holds the sessions of every user logged into Shopify fails, what will users experience? That goes on the resiliency matrix.
You’ll agree with me that we don’t want them to see an error screen.
But if that component is not available, what should they see? They’re probably best served if the system logs them out. A degraded system is better than one that’s unavailable.
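Expressed in code, a resiliency matrix is just a mapping from each dependency to the degraded experience users should get. A made-up miniature:

```python
# A toy resiliency matrix. Dependencies and behaviours are illustrative,
# not Shopify's actual entries.
RESILIENCY_MATRIX = {
    "sessions_redis":  "log the user out; the storefront keeps working",
    "payment_gateway": "checkout shows 'try again later'; browsing unaffected",
    "search_cluster":  "hide the search bar; category pages still render",
}
```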
And how should engineers react in the face of a component failure? Often, they don't have to. Shopify has a circuit breaker library called Semian that can define rescue scenarios.
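Semian itself is a Ruby gem, so here’s a sketch in Python of the pattern it implements, not its actual API: after a few consecutive failures the circuit opens, calls fail fast into a fallback, and the dependency gets room to recover. Thresholds and names are illustrative:

```python
# A sketch of the circuit-breaker pattern Semian implements; this is
# not Semian's API. Thresholds and the fallback are illustrative.
import time

class CircuitBreaker:
    def __init__(self, error_threshold=3, recovery_timeout=10.0):
        self.error_threshold = error_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                return fallback()      # fail fast; don't touch the dependency
            self.opened_at = None      # half-open: give the dependency a try
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.error_threshold:
                self.opened_at = time.monotonic()   # open the circuit
            return fallback()
        self.failures = 0
        return result

def load_session_from_redis():
    raise ConnectionError("sessions Redis is down")  # stand-in for a real call

# If the sessions store is down, degrade to "logged out" instead of erroring
breaker = CircuitBreaker()
session = breaker.call(load_session_from_redis, fallback=lambda: None)
print(session)  # None -> treat the user as logged out
```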
Game days are meant to check two things:

1. The tested outage triggers the right set of alarms and puts the right rescue scenarios in motion.
2. Upon being notified, the processes guide developers to respond properly.
This is what they do for Black Friday, but the benefits span the whole year. Nowadays, digital-first companies launch their products with flash sales, hyped in advance to their massive social media followings.
Shopify lives in a perennial state of Black Friday. That’s why their engineers are so good at it. Every year, millions of people buy from one of Shopify’s merchants, and every year is more frantic, more demanding.
There’s a saying inside Shopify, and it’s this:
Today’s Flash Sale Is Tomorrow’s Base Load
And that’s my biggest takeaway from this article. The only way you can convince the executive team that Black Friday’s scale is achievable once a year is to build the engineering excellence to support it.
And that only makes business sense if the event is planned out and anticipated throughout the whole year.
But what’s most impressive to me is that the system design is… boring.
My manager often says, and I very much agree, that “boring is good”. If you’ve been dealing with money software long enough, I’m sure it resonates.
That’s it for The Payments Engineer Playbook. I’ll see you next week.
PS: I have to confess, this has been the article I’ve enjoyed the most in a while.
If you had a productive time reading this article, I would love for you to do me a little favor.
And that is talking to a colleague at work about it.
That post on Uber testing in production drew a lot of attention. But for you, it could’ve been a one-time thing. Just another newsletter you casually read.
I bet that at this point you’ve made up your mind about this newsletter, and about who could benefit the most from reading it.
Make a bet on The Payments Engineer Playbook. I’ll see you around.