Say you store your data as one big sorted array on disk.
Say you store data as one big log and append writes to the end.
Let's take a closer looks at the iops layer scaled to 1 op = 40ms.
Appends to a log happen on a 1 op per write basis. For a sorted array, it costs k ops per write to maintain the order. Notice how we can only make reads so smart, but writes can be as dumb as 1 op. One can build a strong case to keep these simple writes.
A log strucutred merge tree is a trick that tries to get both good writes and good reads. The game is that you want fast writes, so you're going to append to a log. But you also need reads to not be terrible. What's the first thing you'd try to get some order back; without giving up fast appends?
Maybe the name suggest storing the logs as a tree to preserve order? Then search the tree?
Not a bad answer. That's basically what a B-Tree is. But if you store the logs as a tree, then you'll have sacrificed the log writes. To put a new item in the tree in the right spot, you have to go find that spot on disk and write into it. We call this a random write which is the scenario the log avoids.
Additionally, a spot can be full and the tree has to split nodes and shuffle connections around. This is sometimes called write amplification, meaning a write can produce cascading effects throughout the data structure.
So how do you keep the log sequential write append, but still end up with sorted data you can search fast.
Append now? Then structure later?
Even if you hallucinated this idea, it happens to be the mental model of an LSM. To get sorted data without ever writing into the middle, you do the sorting somewhere cheap: in memory.
Every write now goes two places at once:
We write to the log first. Memtables are fast in-memory copies of these writes that gets sorted, typically abstracted as batches. The memtables are in memory land, so you are faster at sorting near RAM and CPU than you would be sorting in disk. When a memtable fills up, it gets flushed into a single file, written to disk start-to-finish in one sequential pass. Then you never touch that file again. It's frozen. Move onto filling the next memtable.
We kept the log which gives us fast appends. Some amount of sorting happens in RAM and was delivered as one big sequential write. No random writes, or shuffling, or splitting of nodes. We have order without paying the B-Tree Tax.
We are on the path to earning reads back, but we paid for them in complexity. Let's sympathize with the read path. What do they read? They read your writes. A problem with our mental model is that writes can overlap. You could write a value to the same ID N times, so the same key can live in several files at once, each holding a different value. This means your read has to be smarter about finding the key. It has to reconcile with which one to trust.
So how do you trust scattered frozen files? A naive approach is to remember that it's a log, so you can just trust the latest ID update. This is not far from how an LSM handles this, but know there is still room for a more complete tenable solution. If an ID gets rewritten N times and you only trust the Nth ID, then what do you suppose you do with a graveyard of N-1 deprecated IDs. You could forget them, but somewhere in your mental model, you know it demands an additional mechanism in the interest of space. Remember the current trajectory is append now. then structure later. Each frozen file is sorted within; but the files don't know about each other. There's no unified sort across the pile. So while you serve reads and writes, the LSM merges these files: many small sorted files become fewer big ones, with the duplicates reconciled along the way.
We call this compaction. It cleans the graveyard of N-1 dead IDs and coalesces the files. You now monitor the space of duplicates and also friction on the read path. By betting on "then structure later" you have to reconcile with the compaction layer contending for the same resources as your reads & writes. It's all the same Disk, I/O, and CPU. This is a particularly challenging game of tradeoffs. Compact aggressively and you threaten the available I/O. Compact lazily and the files start to pile, which adds drag to the read path.This added complexity opens a lot of channels for efficiency. Compaction schemes come in different flavors and often depend on the workload. You kept fast sequential writes and earned back reads, but now you've convoluted the separation of concerns. There is now a whole new scope of work on the read path because you can now speculate ways to skip files to get a key. Those frozen files that get written from the memtables are also called Sorted String Tables (SS-Tables). Each one carries a small sparse index, meaning every Nth key maps to a byte offset. If you want smarter reads, you can guess that you want to minimize scanning whole files. Bloom Filters for example do something quite neat; they can tell you if a key is definitely not inside a given SSTable, so you can skip it entirely. LevelDB does this. Every database has its own operational territory where it can optimize.
Don't forget about the write path. We kept them fast, but remember how we noted that writes fan out to two places: the memtable in RAM and also the log. What happens if your compute node dies in the middle of a batch write? Our log we've been appending to this whole time is called A Write Ahead Log or WAL[1]. A good log needs to do two things right and only do those two things. The first is that it holds the canon sequential record of writes. And the second is the first, but in a way that makes it easy to replicate. The log is deterministic, so it has a "commit history" that allows someone to replicate the work and arrive at the same final state. In essence, it holds the democracy of durability. If your compute node dies, you could replay the WAL to regenerate the memtables that got lost in RAM. That's the gist, but the WAL deserves its own blog.
I wrote this mainly as an ode to the LSM and to practice writing about database structures. It was invented in 1996 and perhaps fragments of it as early as the 60s. Databases are monolithic buildings of state (at least before sharding) that we use or indirectly use when we go on the internet. This makes the mechanisms that defend databases attractive in their elegance. The LSM in particular is a fun intellectual nugget to share with your friends because you can start with some naive operations and ladder up the complexity as you discover more requirements. Maybe one day I will write more on this, but there are already tons of resources on this topic that could explain it better than me. I meant for this to be an exercise to the reader anyway.
You should know that not all ops are made equal. Sequential R/W and Random R/W have different latencies. We are interested in the separation of responsibility; in short, who is doing what and who is lacking. This visual keeps our naive journey somewhat pragmatic. It happens to be common database knowledge to separate the read and write paths in support of simplicity. This isn't exactly what the wisdom means, but it's on the right track.
C'est Tout
Andy Tran | 2026