Exactly-Once Payments At Airbnb
Eventually consistent databases make it really hard to ensure that payments are made only once. Idempotency and clever retries make it possible.
Processing payments exactly once in distributed systems is impossible.
Then how did Airbnb lean into a service-oriented architecture and still manage to pull it off?
By faking it.
But first, I want to tell you why Airbnb even tried to handle payments in-house in the first place. It all starts with its charismatic founder Brian Chesky.
Chesky has lots of funny stories about what Airbnb was like before they handled payments. Back in 2008, for example, he attended the South by Southwest festival. He stayed, of course, in an Airbnb. And he repeatedly forgot to bring cash to pay his host every single day of the stay.
Can you imagine? Every morning, Chesky had to promise that he would bring the money that day. And every afternoon, after having a blast at the festival, he would arrive at the Airbnb, suddenly remembering the promise he made that morning.
The awkwardness was unbearable. It was then that he realized that Airbnb had to handle payments in-house. Otherwise, the experience for Airbnb guests was going to be eerily similar to a brothel: exchanging cash with strangers in a bedroom.
The problem is that online payments are dimensionally different from cold hard cash. In real life, money literally changes hands. But online, what gets exchanged is a message.
And messages can get lost.
I’m Alvaro Duran, and this is The Payments Engineer Playbook. There’s a ton of content out there that shows you how to get hired for your system design skills with examples of payments systems. But there’s not much that teaches you how to build this critical software for real users and real money.
I know this because I’ve built and maintained systems that handled close to 100,000 payments a day. And I’ve been able to see, behind closed doors, all types of interesting conversations about what works and what doesn’t for payment systems.
These conversations are what inspired this newsletter.
In The Payments Engineer Playbook, we investigate the technology that transfers money. And we do that by cutting off one sliver of it and extracting tactics from it. So let’s get into this article about faking exactly-once delivery by coordinating the design of clients and servers.
Crypto enthusiasts will tell you all about the Byzantine Generals Problem, but the part that matters here is its simpler cousin, the Two Generals Problem: from your end, you can’t tell whether the other party has received your message. What you can do, though, is send your message multiple times, and hope that one of them reaches its destination.
But if your message is meant to be read once, like a payment request, then sending a message more than once means that the receiver is going to make more transfers than you wanted to.
See where I’m going with this? Exactly-once delivery is technically impossible in distributed systems communicating over an unreliable network.
So how are online payments possible? With a client that retries payment requests, and a server that makes sure that only one of them gets processed.
These are called idempotent requests.
Retries are straightforward, although there are some caveats. But first, let me start with the way Airbnb configures its servers to achieve idempotency.
The Caching and Mutex Aspects of Idempotency
You’ve probably heard that the standard way to implement idempotency is to have the client come up with an idempotency key that is passed in the request.
That’s because there’s a caching aspect to idempotency. When the server gets a request, it checks if the key is already stored in the database, and based on that it either returns the cached response, or calls the downstream service.
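Here is a minimal sketch of that caching aspect, and only that aspect. Everything in it is an assumption made for illustration: the `CachingServer` class, the `charge` function standing in for the downstream service, and the in-memory dict standing in for the database table.

```python
import uuid


class CachingServer:
    """Toy server that shows only the caching aspect of idempotency."""

    def __init__(self, downstream):
        self._downstream = downstream  # hypothetical callable that moves the money
        self._responses = {}           # stands in for a table keyed by idempotency key

    def handle(self, idempotency_key, request):
        # Key already stored: return the cached response, do not charge again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        # Key not seen before: call the downstream service and remember the answer.
        response = self._downstream(request)
        self._responses[idempotency_key] = response
        return response


charges = []


def charge(request):
    """Hypothetical downstream service; records every charge it performs."""
    charges.append(request["amount"])
    return {"status": "succeeded", "charge_number": len(charges)}


server = CachingServer(charge)
key = str(uuid.uuid4())  # the client generates the key once per payment
first = server.handle(key, {"amount": 100})
retry = server.handle(key, {"amount": 100})
```

The retry gets the same response as the first request, and `charges` holds a single entry. Notice that nothing here protects against two requests arriving at the same time.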
Most engineers know about this already. But if they only do this, they’re missing an important scenario that can lead to double charges.
You see, saving the idempotency key and its corresponding response can’t happen at the same time. Processing the request takes time. Which means that there’s a moment when the server has received the request and stored its idempotency key, but the response hasn’t been produced.
What if there is a retry request during that time?
If engineers only focus on the caching aspect, such a retry would cause problems. What should the server do with it, when there’s a matching key, but no cached response to provide?
To ensure that all requests get processed correctly, there’s an often forgotten aspect of idempotency. The mutual exclusion aspect: two requests associated with the same key must not be processed concurrently.
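To make the mutual exclusion aspect concrete, here is a sketch that uses one in-process lock per key. This is an assumption for illustration, not how any particular payments platform does it: a real server would need a lock that is shared across machines, such as a database row lock. The names `MutexServer` and `slow_charge` are hypothetical.

```python
import threading
import time


class MutexServer:
    """Toy server: caching plus mutual exclusion per idempotency key."""

    def __init__(self, downstream):
        self._downstream = downstream
        self._responses = {}
        self._key_locks = {}
        self._registry_lock = threading.Lock()  # guards the dict of per-key locks

    def _lock_for(self, key):
        with self._registry_lock:
            return self._key_locks.setdefault(key, threading.Lock())

    def handle(self, key, request):
        # Only one request per key gets past this line at a time.
        with self._lock_for(key):
            if key in self._responses:
                return self._responses[key]
            response = self._downstream(request)
            self._responses[key] = response
            return response


calls = []


def slow_charge(request):
    """Hypothetical downstream call that is slow enough for retries to overlap."""
    time.sleep(0.05)
    calls.append(request["amount"])
    return {"status": "succeeded"}


server = MutexServer(slow_charge)
results = []


def send():
    results.append(server.handle("key-1", {"amount": 100}))


threads = [threading.Thread(target=send) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Five overlapping requests with the same key produce five identical responses, but `slow_charge` runs only once. The four latecomers wait on the lock and then read the cached response.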
How Airbnb Processes Idempotent Requests
Jon Chew and Ninad Khisti, from Airbnb Payments, have an informative post on how Airbnb handles idempotency. Airbnb takes into account both aspects of idempotency by splitting each request into 3 phases: pre-RPC, RPC and post-RPC.
> We’ve learned the hard way that network calls (RPCs) during the Pre and Post-RPC phases are vulnerable and can result in bad things like rapid connection pool exhaustion and performance degradation. Simply put, network calls are inherently unreliable. Because of this, we wrapped Pre and Post-RPC phases in enclosing database transactions.
First, at pre-RPC, the server checks whether the key already exists in the database, and locks that row in the database table to ensure that no other request with that key gets processed. Then, at RPC, the downstream service gets called. And finally, the post-RPC phase takes the response and saves it.
While this is happening, retries have to wait for the first request to finish. And after that, if there’s a stored response, it will be returned without entering the RPC phase again.
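The three phases can be sketched with SQLite from Python’s standard library. Treat this as a rough illustration under stated assumptions, not Airbnb’s implementation: the table layout, the function names, and the `charge` downstream are all made up, and instead of making a concurrent retry wait on a row lock, this sketch simply rejects it while the first request is in flight.

```python
import json
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idempotency (key TEXT PRIMARY KEY, response TEXT)")
    return conn


def pre_rpc(conn, key):
    """Pre-RPC: one transaction that claims the key or finds earlier work."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT response FROM idempotency WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO idempotency (key, response) VALUES (?, NULL)", (key,)
            )
            return "new", None
    if row[0] is None:
        return "in_flight", None  # key claimed, response not stored yet
    return "done", json.loads(row[0])


def post_rpc(conn, key, response):
    """Post-RPC: one transaction that stores the response next to the key."""
    with conn:
        conn.execute(
            "UPDATE idempotency SET response = ? WHERE key = ?",
            (json.dumps(response), key),
        )


def handle(conn, key, request, downstream):
    state, cached = pre_rpc(conn, key)
    if state == "done":
        return cached
    if state == "in_flight":
        raise RuntimeError("a request with this key is still being processed")
    response = downstream(request)  # RPC phase: no database transaction is open
    post_rpc(conn, key, response)
    return response


calls = []


def charge(request):
    """Hypothetical downstream service."""
    calls.append(request)
    return {"status": "succeeded", "amount": request["amount"]}


conn = open_store()
first = handle(conn, "key-1", {"amount": 100}, charge)
retry = handle(conn, "key-1", {"amount": 100}, charge)
```

The network call sits between two short database transactions and never inside one. The sketch does not cover a crash between phases, which would leave a key stuck in flight; a real system needs some way to recover from that.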
The server is meant to store all successful responses. But some failed responses are meant to be stored too!
Look, some failures are not going to change no matter how many times you retry. Bad Requests, for instance, are permanent, just like most 4XX HTTP codes. However, a request can also fail for reasons that are transient, like bad connectivity, the server being down or timeouts. These failed responses are meant to be retried.
That’s why Airbnb categorizes failed responses into retryable and non-retryable.
That information is sent to the client, who should be able to arrange for a retry only when there’s been a retryable failure.
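A sketch of that classification might look like the following. The exact rule is my assumption, not Airbnb’s: here every 5XX status is treated as transient, along with 408 and 429, and everything else is treated as final. The `Response` type and both function names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: the few 4XX codes that describe a transient condition.
RETRYABLE_4XX = {408, 429}


@dataclass(frozen=True)
class Response:
    status_code: int
    body: str
    retryable: bool  # sent back so the client knows whether to try again


def classify(status_code, body=""):
    """Label a downstream result as retryable or non-retryable."""
    retryable = status_code >= 500 or status_code in RETRYABLE_4XX
    return Response(status_code, body, retryable)


def should_store(response):
    """Successes and permanent failures are stored; transient failures are not,
    so that a retry with the same key can run the request again."""
    return not response.retryable


bad_request = classify(400, "missing currency")
unavailable = classify(503, "upstream down")
success = classify(200, "charged")
```

With this rule, `bad_request` and `success` get stored and replayed to any retry, while `unavailable` is left out of the store and flagged for the client to retry.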
Which brings me to the client side of this design.
How Stripe Recommends Handling Retries
Airbnb is thorough on the server’s design.
But double charges can’t be avoided by working on the server only. The client must shoulder some of the responsibility. Recklessly retrying requests, for example with a fresh idempotency key on every attempt, can lead to double charges even when the server is idempotent.
Therefore, clients must be smarter when it comes to making payment requests to Airbnb’s payment platform.
That, however, is all Airbnb has to say about the client.
Luckily, Stripe has 3 memorable principles that all smart clients have to follow to handle failed requests to idempotent servers. These are:
Consistency: Clients must retry failed and retryable requests to make sure data is consistent across services.
Safety: Use idempotency keys so that the server can identify duplicate requests as retries.
Responsibility: Use techniques like exponential backoff and jitter to avoid overwhelming the server. Denis Isaev from Yandex made some important comments on how to do this effectively, and I recommend you read that.
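Put together, the three principles could look like this on the client. It is a sketch under assumptions, not Stripe’s client code: `RetryableError`, `pay_with_retries`, and the `send` callable are hypothetical names, and the backoff uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap.

```python
import random
import time
import uuid


class RetryableError(Exception):
    """Hypothetical error raised when the server reports a retryable failure."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def pay_with_retries(send, request, max_attempts=5, sleep=time.sleep):
    # Safety: one key for the whole payment, reused on every attempt.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(key, request)
        except RetryableError:
            # Consistency: keep retrying retryable failures until attempts run out.
            if attempt == max_attempts - 1:
                raise
            # Responsibility: wait longer each time, with jitter to spread the load.
            sleep(backoff_delay(attempt))


attempt_keys = []


def flaky_send(key, request):
    """Hypothetical transport that times out twice before succeeding."""
    attempt_keys.append(key)
    if len(attempt_keys) < 3:
        raise RetryableError("timeout")
    return {"status": "succeeded"}


# The sleep is stubbed out so the example runs instantly.
result = pay_with_retries(flaky_send, {"amount": 100}, sleep=lambda seconds: None)
```

All three attempts carry the same key, so the server can recognize the second and third as retries of the first. Any other exception, such as one for a non-retryable failure, passes straight through without a retry.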
All right, we’ve covered a lot in this article from the perspective of Airbnb payments. Distributed systems is a huge topic. There are a bunch of avenues that we could probably explore in later articles. And I think I might do it if you want me to.
That’s a big if, though.
But let me wrap up first. One of the most important takeaways for me has been the degree of coordination that client and server need to have to get idempotency right. Anything less than that, and you’ll start to see double charges popping up at reconciliation.
Idempotency is also something you can’t really rush. Most engineers have seen some version of it, but there’s a limit to how much software we can build by shipping something to production and seeing what happens. To do idempotency right, you’re better off writing on a whiteboard than in an IDE.
In some sense, this article has been a nice counterbalance to the one I wrote on Uber’s testing in production (check it out, it’s awesome!).
There’s no way you can come up with a correct idempotency mechanism in a weekend. Which is why Airbnb made a library for it.
It totally makes sense! Now that Airbnb has a library on a problem that is common across domains, and requires some forethought and care, any engineer can spin up a new service, install this library, and be confident that it’s going to be fine.
“Just works” still has tremendous value.
That’s it for this article of The Payments Engineer Playbook. See you next week.
PS: One more thing! I said that I might explore how distributed systems work in later articles if you want me to because writing them is A LOT of work. And I’m only going to keep making them if you, my dear reader, want more.
So, if you really want to read more of this, you can do two things.
First, you can go to the LinkedIn post on this article, and leave a comment. Let me know if you want more articles like this. And if you do, what do you want them to be about?
And of course, if you hated this one, you hate The Payments Engineer Playbook, you hate me, I want to hear from you too.
Let me know what you think, and I will either stop or keep going. It all depends on your feedback.
And second, tell a colleague. A bunch of you kept this post on the front page of Hacker News for 12 hours, others told me that they’ve shared it with everyone at work, and others went to their social media and posted the link with some insightful comments.
If you’re reading this, I’m sure you already have a pretty good idea who might find this article useful.
And if you got this article from a colleague, do me a favor and subscribe. It’s a flex to be a reader of a well-known publication before it was cool.
Make a bet on The Payments Engineer Playbook. I’ll see you around.
> It all starts with his charismatic founder Brian Chesky.
Airbnb is not a man.
> The awkwardness was unbearable. It was then when he realized that Airbnb had to handle payments in-house.
This story makes no sense. Why did he agree to pay cash every day instead of paying up front? Why couldn't he write a check? Why was it awkward? And most importantly, what does this have to do with bringing payments in house? Why couldn't Airbnb just integrate a payment processor?
Maybe try asking ChatGPT to help you write/proofread this. I think it would come out much better.
> Then how does Airbnb have leaned into a service oriented architecture and still managed to pull it off?
"Then how did Airbnb lean into a service oriented architecture and still manage to pull it off?"
Although it's still not a great sentence, since you talk about distributed systems in the previous paragraph and service oriented architecture in this one, it feels like you're trying to throw buzzwords around. You could simplify it to "Then how did Airbnb pull it off?"