Stay tuned for an important announcement on Friday, right in your mail inbox.
Now onto today’s free article.
Having gone through a database migration screwup is the mark of a good engineer.
It's like a rite of passage. The better you are, the more likely you are to be handed a migration just complex enough for you to make one mistake, and cause an outage.
After one of those, you never see your database the same way anymore.
Code can be reviewed, but most importantly it is version controlled, and can be rolled back. Data, not so easy. You may revert to a past snapshot, but what about the correct data that was created between that snapshot and now?
Unlike code, reverting data involves data loss. And you probably can’t afford that.
Which is why database wartime stories are so funny. Everyone has one and, whenever I hear a good one, I feel a mix of commiseration and pure, unadulterated joy.
Here’s one of mine, if you’re wondering.
Back when I was working at a Django shop, I was co-responsible for a greenfield project that the company wanted to spin up really quickly. My partner in crime and I were extremely naive and junior. Especially me.
One week, we were waiting for QA to give the thumbs up to a release candidate we had worked on for the past two weeks, and I was bored. So I decided to do some refactoring that would make the codebase cleaner.
I’m wincing right now just remembering.
One of the things I did was renaming a File field. Django automatically provides a way to generate a migration file from changes you made in the models, and I was blindly creating those files all the time.
You should rarely, if ever, need to edit migration files by hand.
— Django official documentation, for real.
So, renaming a field. If I had only changed the name of the field in the model, the generated migration would have renamed that column in the database. However, the model already specified the column name, and it was only natural to refactor that name too in one go.
Django isn't that clever, as I was about to discover. If the renamed field didn't match the database column name, the generated migration wasn't going to be a rename, but a full drop of that column, and the creation of a new, empty one.
# A db_column mismatch requires a prior noop AlterField for the
# subsequent RenameField to be a noop on attempts at preserving the
# old name.

— A perfectly comprehensible comment in the Django codebase
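To make the difference concrete, here is roughly the SQL that the two paths boil down to (the table and column names are made up for illustration, not the real ones from that project):

-- What I thought I was asking for: a rename, data preserved.
ALTER TABLE documents RENAME COLUMN contract_file TO contract;

-- What the generated migration amounted to instead, because of the
-- db_column mismatch: drop the old column (and every file path in it),
-- then create a brand new, empty one.
ALTER TABLE documents DROP COLUMN contract_file;
ALTER TABLE documents ADD COLUMN contract varchar(100);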
Migration files generated. Applied on sandbox. Let QA test it. I even remember running the new migration with my manager next to me, who was asking “is this what you expect?” as the command line casually advised that a column had been dropped, and a new one created.
“Yeah, this was the output on sandbox”, I said, technically correct, but also full of undeserved confidence. None of us were really paying attention to what seemed like a banal column rename.
A few hours later, our lawyer came to our desk, asking where the files she had saved were. The application was giving her Not Found errors.
And soon, it became totally clear. I had screwed up. All files were gone.
I’m Alvaro Duran, and this is The Payments Engineer Playbook. Since that Django renaming incident, I’ve become extremely paranoid about database migrations. Most of us learn how to write them the hard way. Either a more senior teammate writes a bunch of snarky comments in one of our PRs, or you naively cause an outage, Little Bobby Tables style.
I don’t want that for you. But I’m pretty sure that you know the basics by now.
So, today, I’m going to share with you 3 very common mistakes that engineers make when they try to apply migrations at scale. You know, the kind of database changes that have completely different outcomes on sandbox and production.
As in “sandbox went well, production went down”.
The shocking thing is that these 3 scenarios are very common: you routinely add new indexes to the database; you sometimes add a foreign key constraint; and from time to time you change the type of a column.
These tasks sound harmless, but they require finesse and a little bit of knowledge of Postgres to get right when tables are bigger than your petty pet project.
This article will cover:
What are database locks, and why they exist
Why scale matters in migrations
One little trick to create indexes without causing an outage
How to add foreign key constraints at scale
Why I’m strongly against the very idea of enums in databases
And my favorite: a very weird corner case in which ORDER BY doesn’t give you a sorted outcome (hint: it has to do with locks)
Caveat: these are tricks that apply to PostgreSQL. They might apply elsewhere, but if they do, they probably need a different syntax. Be advised.
That was a long intro already, so here we go.
How To Maintain Order
Think of each table in your database as a nightclub: people come out when you read; people get in when you write.
Locks are the bouncers that maintain order.
For instance, you may have a narrow corridor after the door, and you have instructed the bouncers to never let people in until there's nobody trying to get out, and to never let people out until there's nobody trying to get in. That's how ACCESS EXCLUSIVE works: a lock that makes everything else wait until it's released.
You probably knew it already, or at least guessed it: every operation that creates, alters or drops a table (collectively known as DDL, or data definition language) often acquires an ACCESS EXCLUSIVE lock in Postgres.
Not always, but it’s a sensible assumption.
ACCESS EXCLUSIVE is a table-level lock. There are less restrictive locks (and I'm about to talk about one of them in a minute), but they all operate the same way: they block certain other operations, which have to wait until the lock gets released.
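If you want to see these locks with your own eyes, Postgres exposes them in the pg_locks view. A minimal sketch, assuming a hypothetical payments table:

-- Who holds, or is waiting for, a lock on payments right now?
SELECT l.pid, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'payments'::regclass;

A row with granted = false is someone stuck in the corridor.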
Long Queueing Is Indistinguishable from an Outage
Try to imagine what the nightclub would look like inside if there were an infinite amount of people trying to get in.
A busy one? Think again. It would be one in which people inside are waiting an infinite amount of time, because the queue of people getting in never runs out.
That, in essence, is what a migration outage looks like. If you're running a DDL statement that takes too long, its ACCESS EXCLUSIVE lock prevents anything else from running. Everything else queues forever, or until it times out.
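You can watch that queue form from another session. A sketch using pg_blocking_pids, available since Postgres 9.6:

-- Every query currently stuck behind someone else's lock,
-- and the backend(s) it is waiting on.
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;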
Let's say you want to create a new index on a table. Indexes are a bit like a cache for a database table, and they get “invalidated” every time there's an update. For that reason, Postgres normally locks the table to be indexed against writes (not reads), and then builds the index in one go.
Which means that a seemingly harmless index you decided to add to make queries more performant blocks all writes while you apply the migration.
Solution? Postgres provides a CONCURRENTLY option to index creation:
When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment.
CONCURRENTLY trades the write lock for a migration that takes longer to apply, plus more CPU and I/O load on the database (read: slower reads while the index builds).
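In SQL, the difference is a single keyword (the index and table names are made up, and the two statements are alternatives, not a sequence):

-- Blocks all writes on payments until the index is fully built:
CREATE INDEX payments_created_at_idx ON payments (created_at);

-- Builds the same index without blocking writes, at the cost of two
-- table scans and a longer build. It cannot run inside a transaction.
CREATE INDEX CONCURRENTLY payments_created_at_idx ON payments (created_at);

One caveat worth knowing: if a concurrent build fails midway, it leaves behind an invalid index that you have to drop before retrying. And if you're on Django, django.contrib.postgres ships migration operations for this; the migration has to be non-atomic, precisely because CONCURRENTLY can't run inside a transaction.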
Maybe you knew that one already. I have 2 more for you.
Add Foreign Keys In 2 Steps
Most constraints require an ACCESS EXCLUSIVE lock. But not ADD FOREIGN KEY.
When you add a foreign key, two less restrictive locks get taken, both SHARE ROW EXCLUSIVE: one on the table you're adding the foreign key to, and one on the referenced table.
SHARE ROW EXCLUSIVE locks prevent concurrent DDL and writes, but not reads. Which makes sense, because adding a foreign key needs to check, for the whole column, that every key exists in the referenced table. Adding or modifying rows in the table mid-check would make the task impossible, and so would deleting rows from the referenced table.
How do you get around this lock? Again, this is such a common case that Postgres already has an option available: NOT VALID.
When you add a foreign key with NOT VALID, rather than scanning the whole table to validate that all rows reference an existing row in the foreign table, it sets everything up so that new and updated rows get validated as if the foreign key already existed. The old rows aren’t checked.
As you can imagine, this is extremely fast, but the job is not finished.
However, this first step leaves the easiest part of the job undone: checking the rows that were already there, which doesn't need a restrictive lock at all.
We can then run VALIDATE CONSTRAINT, which will check these rows, one by one, taking only a very unrestrictive ROW SHARE lock on the referenced table (if more restrictive locks are already held, it simply waits for them to be released), to make sure that the old rows conform to the constraint:
With NOT VALID, the ADD CONSTRAINT command does not scan the table and can be committed immediately. After that, a VALIDATE CONSTRAINT command can be issued to verify that existing rows satisfy the constraint. The validation step does not need to lock out concurrent updates, since it knows that other transactions will be enforcing the constraint for rows that they insert or update; only pre-existing rows need to be checked. Hence, validation acquires only a SHARE UPDATE EXCLUSIVE lock on the table being altered.
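Put together, the two steps look roughly like this (table and constraint names are, again, made up):

-- Step 1: instant. From now on, new and updated rows are checked
-- against the constraint; pre-existing rows are left alone.
ALTER TABLE payments
    ADD CONSTRAINT payments_customer_id_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    NOT VALID;

-- Step 2: scans the pre-existing rows, without blocking writes.
ALTER TABLE payments
    VALIDATE CONSTRAINT payments_customer_id_fk;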
A lot of people know about the first step, but not the second. If you leave things midway, the constraint stays flagged as not validated in the table metadata, which can impact query performance, among other things.
I’ve covered a lot of ground already, but I have energy for one more.
Do Not Use Enum Types
It is tempting to use enums, rather than plain varchar or text, to further restrict the possible values that a column may take.
Resist this impulse.
To me, this restriction comes with few benefits. Yes, your database will reject any new row with an invalid string. But this restriction isn't visible to the server your application runs on. You'll have to define that enum there anyway. So why do it twice? Aren't all insert requests coming from your server anyway?
Yes, there's a level of paranoia we can't reach if we leave columns as varchar rather than enum. But that's a level I've never needed.
And, to make matters worse, enum types are unmanageable in Postgres:
[T]here is support for adding new values to an existing enum type, and for renaming values (see ALTER TYPE). Existing values cannot be removed from an enum type, nor can the sort ordering of such values be changed, short of dropping and re-creating the enum type.
So, on top of the tiny benefits that enum types provide, if you make a mistake, you can’t remove any enum value without recreating the whole enum from scratch, which (you guessed it) is an operation that requires an ACCESS EXCLUSIVE lock.
Bottom line: just don’t. Use varchar or text types instead, and validate enums on your application. You were going to do it anyway.
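For the record, here's the shape of both options, with made-up status values:

-- The enum version: values can be added or renamed, but never removed,
-- short of dropping and recreating the type.
CREATE TYPE payment_status AS ENUM ('pending', 'settled', 'refused');

-- The boring version: a plain varchar, validated by the application.
CREATE TABLE payments (
    id bigserial PRIMARY KEY,
    status varchar(32) NOT NULL
);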
Locks Knowledge Will Make You Look Like A Database Guru
Knowing which lock applies in which scenario takes a lifetime to master.
There’s the docs, yes, but knowing that you need to check the docs is the hard part. SQL syntax makes it look a bit too simple.
Here's some advice: whenever you can, test your migrations. There are a few libraries out there, but the real test is attempting the migration on a separate database of realistic size.
Which runs against my advice of not relying too much on sandbox for payments. That's because database migrations don't need any third party provider to work well, and getting them right can be tested in isolation.
Testing migrations will reinforce your database knowledge. Tasks do take longer when extra testing is involved, so don't feel obliged to do it every single time.
Just promise me that you will tell me your database wartime stories afterwards.
In preparation for this article, I found myself constantly checking a few references, like PayPal's PostgreSQL at Scale and The Art of PostgreSQL (buy the book, it's awesome!). But at the end of the day, I was double checking everything against the PostgreSQL docs. They're comprehensive, and fairly easy to follow, with many examples. At work, I often check the SQL Cookbook, though I didn't use it for this article (it has good tricks for your day-to-day work, though).
That’s it for The Payments Engineer Playbook. I’ll see you on Friday for an important announcement.