Facebook scaled to ~1b daily actives on MySQL + InnoDB. There was *lots* of engi...

9dev · on July 6, 2023

> And we backed it all up, every day, in < 24 hours, using an unmodified mysqldump.

But… how? Considering all the transactions in flight, and everything? And did you ever Test disaster recovery with that setup?

I’ve worked on relatively big projects, but FAANG engineering is like some entirely different field of software engineering. Fascinating.

yuliyp · on July 7, 2023

(a) make each individual database small, but have a lot of them (b) There are lots of transactions in flight, but there is a well-ordered sequence of mutations (the binlog) that defines what has and has not been committed. So applying a backup means taking the full backup + replaying the binlogs. (c) testing can be done by just bringing up a slave from the backup and then comparing consistency with normal replicas.

ninkendo · on July 7, 2023

To expand on this question, I'm wondering how useful daily backups even are for a site like Facebook. I mean, of _course_ you need them, but also, something about reverting all of FB to a state 24 hours ago seems disastrous even if it works. I can't imagine that it's an acceptable thing in anything but an absolute emergency. Imagine every single facebook user got rewound in time to the previous day, every message sent over the past day was lost, etc.

It was a lifetime ago I ever did DB administration (postgres in my case), but the write-ahead-logs being replicated out independently was extremely important for point-in-time recovery, such that you could always take the latest backup, zip forward through the WAL, and recover to any arbitrary point in time you want, so long as the WALs were available. I wonder how much something like this would have been done at FB scale.

ericbarrett · on July 7, 2023

We live-streamed the binlogs (what MySQL calls its WAL) to the backup infrastructure, check out the ORC article I posted elsewhere.

ericbarrett · on July 7, 2023

What yuliyp wrote is basically it. Although the individual shards weren't really small, even by modern standards.

> Considering all the transactions in flight, and everything?

If I remember, we used --flush-logs --master-data=2 --single-transaction, giving it a consistent point-in-time dump of the schemas, with a recorded starting point for binlog replays, enabling point-in-time and up-to-the-minute restores. Nowadays you have GTIDs so these flags are obsolete (except --single-transaction).

--single-transaction does put extra load on the database—I think it was undo logs? it's been a minute—which caused some hair-pulling, and I believe they eventually moved to xtrabackup, before RocksDB. But binary backups were too space-hungry when I was there, so we made it work.

Another unexpected advantage of mysqldump vs. xtrabackup, besides size, was when a disk error caused silent data corruption on the underlying file system. Mysqldump often read from InnoDB's buffer cache, which still had the good data in it. Or if the bad block was paged back in, it wouldn't have a valid structure and mysqld would panic, so we knew we had to restore.

> And did you ever Test disaster recovery with that setup?

Yes! I wrote the first version of ORC. This blog post is from long after I left, but it's a good summary of how it worked: https://engineering.fb.com/2016/10/28/data-infrastructure/co...

It wasn't the best code (sorry Divij)—the main thing I'm proud of was the recursive acronym and the silly Warcraft theme. But it did the job.

Two things I remember about developing ORC:

1) The first version was an utter disaster. I was just learning Python, and I hit the global interpreter lock super hard, type-error crashes everywhere, etc. I ended up abandoning the project and restarting it a few months later, which became ORC. In the interim I did a few other Python projects and got somewhat better.

2) Another blocker the first version had was that the clients updated their status by doing SELECT...FOR UPDATE to a central table, en masse, which turns out to be really bad practice with MySQL. The database got lock-jammed, and I remember Domas Mituzas walking over to my desk demanding to know what I was doing. Hey, I never said I was a DBA! Anyway, that's why ORC ended up having the Warchief/Peon model—the Peons would push their status to the Warchief (or be polled, I forgot), so there was only a single writer to that table, requiring no hare-brained locking.

fmajid · on July 11, 2023

More impressive was how Facebook managed so many MySQL instances with such a small DBA team. The average regional bank probably has more Oracle DBAs managing a handful of databases.

aestetix · on July 6, 2023

It sounds like you were involved in this. Since you were working there so long ago, would you be willing to write up a technical account of the things you did? I'd be interested in learning more about it. I figure the tech from 10 years ago is outdated enough that it wouldn't cause any issues if you made it public.

ericbarrett · on July 7, 2023

Appreciate the interest. Honestly, most of the cool stuff was getting to play with toys that all the other talented engineers developed; I had a relatively narrow piece of the pie. I did write up a bit in a sibling reply.