Blog - andy tran

LSM: log-structured merge tree

Say you store your data as one big sorted array on disk.

reads are good: it's sorted, so you could Binary Search to find anything fast
writes are bad: to insert a new item in order, you have to move everything over by one

That doesn't seem fun for writes, so let's flip the story.

Say you store data as one big log and append writes to the end.

writes are good: appending is fast on disk
reads are bad: nothing is sorted, so a naive search is an O(n) scan

Let's take a closer look at the iops layer scaled to 1 op = 40ms.

WRITES

LOG (0 ops)

SORTED (0 ops)

READS (repeated worst case)

LOG (0 ops)

SORTED (0 ops)

Appends to a log happen on a 1 op per write basis. For a sorted array, it costs k ops per write to maintain the order. Notice how we can only make reads so smart, but writes can be as dumb as 1 op. One can build a strong case to keep these simple writes.

A log strucutred merge tree is a trick that tries to get both good writes and good reads. The game is that you want fast writes, so you're going to append to a log. But you also need reads to not be terrible. What's the first thing you'd try to get some order back; without giving up fast appends?

Maybe the name suggests storing the logs as a tree to preserve order? Then search the tree?

Not a bad answer. That's basically what a B-Tree is. But if you store the logs as a tree, then you'll have sacrificed the log writes. To put a new item in the tree in the right spot, you have to go find that spot on disk and write into it. We call this a random write which is the scenario the log avoids.

Additionally, a spot can be full and the tree has to split nodes and shuffle connections around. This is sometimes called write amplification, meaning a write can produce cascading effects throughout the data structure.

So how do you keep the log sequential write append, but still end up with sorted data you can search fast.

Append now? Then structure later?

Even if you hallucinated this idea, it happens to be the mental model of an LSM. To get sorted data without ever writing into the middle, you do the sorting somewhere cheap: in memory.

Every write now goes two places at once:

a log on disk, appended to the end
a small sorted structure in memory, which we call a memtable

WRITES

log (disk) 0 disk ops ≈ 0

MEMTABLE (RAM, sorted) 0 memory ops ≈ 0

We write to the log first. Memtables are fast in-memory copies of these writes that gets sorted, typically abstracted as batches. The memtables are in memory lands, which makes sorting there faster because it avoids doing it in Disk. When a memtable fills up, it gets flushed into a single file, written to disk start-to-finish in one sequential pass. Then you never touch that file again. They become read-only immutable files. Move onto filling the next memtable.

WRITES

log (disk) 0 disk ops ≈ 0

MEMTABLE (RAM, sorted) 0 memory ops ≈ 0

SORTED FILES (disk, frozen)

We kept the log which gives us fast appends. Some amount of sorting happens in RAM and was delivered as one big sequential write. No random writes, or shuffling, or splitting of nodes. We have order without paying the B-Tree Tax.

We are on the path to earning reads back, but we paid for them in complexity. Let's sympathize with the read path. What do they read? They read your writes. A problem with our mental model is that writes can overlap. You could write a value to the same ID N times, so the same key can live in several files at once, each holding a different value. This means your read has to be smarter about finding the key. It has to reconcile with which one to trust.

So how do you trust scattered frozen files? A naive approach is to remember that it's a log, so you can just trust the latest ID update. This is not far from how an LSM handles this, but know there is still room for a more complete tenable solution. If an ID gets rewritten N times and you only trust the Nth ID, then what do you suppose you do with a graveyard of N-1 deprecated IDs. You could forget them, but somewhere in your mental model, you know it demands an additional mechanism in the interest of space. Remember the current trajectory is append now. then structure later. Each frozen file is sorted within; but the files don't know about each other. There's no unified sort across the pile. While you serve reads and writes, the LSM merges these files: many small sorted files become fewer big ones, with duplicates reconciled along the way.

WRITES (with duplicates)

log (disk)

MEMTABLE (RAM, sorted)

SORTED FILES (disk, frozen)

We call this compaction. Here our duplicate ID of interest is 5. Notice how when it overwrites a 5, the old ones lost their meaning. Compaction reads the immutable files and sees a graveyard of N-1 dead 5s. It coalesces these files by branding a new one and deleting the old ones. This maintains the space of duplicates and also friction on the read path. By betting on "then structure later" you have to reconcile with the compaction layer contending for the same resources as your reads & writes. It's all the same Disk, I/O, and CPU. This is a particularly challenging game of tradeoffs. Compact aggressively and you threaten the available I/O. Compact lazily and the files start to pile, which adds drag to the read path.

This added complexity opens a lot of channels for efficiency. Compaction schemes come in different flavors and often depend on the workload. You kept fast sequential writes and earned back reads, but now you've convoluted the separation of concerns. There is now a whole new scope of work on the read path because you can now speculate ways to skip files to get a key. Those frozen files that get written from the memtables are also called Sorted String Tables (SS-Tables). Each one carries a small sparse index. If you want smarter reads, you could find ways to skip files. LevelDB for example use Bloom filter per SSTable basis, but Know that every database has its own operational territory where it can optimize.

Don't forget about the write path. We kept them fast, but remember how we noted that writes fan out to two places: the memtable in RAM and also the log. What happens if your compute node dies in the middle of a batch write? Our log we've been appending to this whole time is called A Write Ahead Log or WAL^[1]. A good log needs to do two things right and only do those two things. The first is that it holds the canon sequential record of writes. And the second is the first, but in a way that makes it easy to replicate. The log is deterministic, so it has a "commit history" that allows someone to replicate the work and arrive at the same final state. In essence, it holds the democracy of durability. If your compute node dies, you could replay the WAL to regenerate the memtables that got lost in RAM. That's the gist, but the WAL deserves its own blog.

It's important to distinguish that not all ops are made equal. Sequential and Random iops for Memory and Disk have different latencies and throughput. We are interested in the separation of responsibility; in short, who is doing what and who is lacking. This visual keeps our naive journey somewhat pragmatic. It happens to be common database knowledge to separate the read and write paths in support of simplicity. This isn't exactly what the wisdom means, but it's on the right track. Writes to a WAL also aren't 1 op, but close in nature. A lot of databases will force these writes in the form of fsync, which multiplies some factor contant on the 1 op per write basis. Other databases opt out of fsync and have other ways of dealing with the durability.

I wrote this mainly as an ode to the LSM and to practice writing about database structures. It was invented in 1996 and perhaps fragments of it as early as the 60s. Databases are monolithic buildings of state (at least before sharding) that we use or indirectly use when we go on the internet. This often makes database primitives fairly attractive in their elegance. The LSM in particular is a fun intellectual nugget to share with your friends because you can start with some naive operations and ladder up the complexity as you discover more requirements. Even if they do not know much about computers, these problems feel closer to physics. Maybe one day I will write more on this, but there are already tons of resources on this topic that could explain it better than me. I meant for this to be an exercise to the reader anyway.