Journaling cannot guarantee data or filesystem integrity if your hardware is lying to you. If you send a flush to an SSD and it reports "ok, your data is on persistent storage" while actually keeping it in DRAM buffers (to get higher numbers on benchmarks), and your power goes down, shit ensues. This is surprisingly common behavior.
Wow, this jogged a memory from when Brad Fitzpatrick (bradfitz on HN) had to write a utility to ensure the hard drives running LiveJournal didn't lie about successfully completing fsync().[1] IIRC, the behavior caused fairly serious database corruption after a power outage.
Went back and found the link. To my surprise, it was 15 years ago. To my greater surprise, the original post, the Slashdot article, and the utility all remain available.
And hard drives (or their NVMe successors) still lie.
Anecdotally, consumer NVMe SSDs actually tend not to lie about it. Every time I've benchmarked a consumer NVMe SSD under Windows both with and without the "Write Cache Buffer Flushing" option, toggling it has had a profound impact on the measured performance of the SSD. I have not observed a comparable performance impact for SATA SSDs, so I suspect Microsoft's description of what that option does is inaccurate for at least one type of drive, though it is at least possible that ignoring flushes is extremely common for consumer SATA SSDs but uncommon for consumer NVMe SSDs.
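For anyone who wants to try something similar on Linux, here's a minimal sketch of the same idea (the file name and iteration count are arbitrary choices of mine): time a loop of small writes, each followed by fdatasync(), which the kernel turns into a device cache flush for drives that advertise a volatile write cache. A drive that genuinely flushes shows per-op latencies orders of magnitude higher than one that ignores the flush (or has a power-loss-protected cache).

    /* Rough flush-latency probe; a sketch, not a proper benchmark. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const int iters = 1000;              /* arbitrary */
        char buf[4096];
        memset(buf, 0xAB, sizeof buf);

        int fd = open("flushtest.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            if (pwrite(fd, buf, sizeof buf, (off_t)i * sizeof buf) != (ssize_t)sizeof buf) {
                perror("pwrite"); return 1;
            }
            if (fdatasync(fd) != 0) {        /* flush the drive's volatile cache */
                perror("fdatasync"); return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d write+flush ops, %.1f us/op\n", iters, secs / iters * 1e6);
        close(fd);
        return 0;
    }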
Sure, although the proper way to test it would be to write a lot of data to the drive, issue an fsync, and cut power in the middle of the operation. Rinse and repeat a (few) hundred times for each drive.
There's a guy on btrfs' LKML (also the author of [0]) who is diligent enough to do these tests on much of the hardware he gets, and his experience does not sound good for consumer drives.
> although the proper way to test it would be to write a lot of data to the drive, issue an fsync, and cut power in the middle of the operation. Rinse and repeat a (few) hundred times for each drive.
This isn't quite right. You have to ensure that the drive returned completion of a flush command to the OS before the plug was pulled, or else the NVMe spec does allow the drive to return old data after power is restored. Without confirming receipt of a completion queue entry for a flush command (or equivalent), this test as described is mainly checking whether the drive has a volatile write cache—and there are much easier ways to check that.
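If anyone wants to automate that, here's a rough sketch of such a harness, assuming Linux, a scratch file on the drive under test, and that the "acked" output is captured on a second machine (e.g. over ssh or a serial console) so it survives the power cut. A sequence number is printed only after fdatasync() has returned, i.e. after the flush was acknowledged; after power is restored, every acked record must read back intact, otherwise something below the syscall layer lied.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define REC_SIZE 4096

    int main(int argc, char **argv)
    {
        int fd = open("powerfail.bin", O_RDWR | O_CREAT, 0644);  /* arbitrary scratch file */
        if (fd < 0) { perror("open"); return 1; }
        uint64_t rec[REC_SIZE / sizeof(uint64_t)];

        if (argc > 2 && strcmp(argv[1], "verify") == 0) {
            /* After reboot: pass the last acked sequence number captured on
             * the other machine; every record up to it must still be there. */
            uint64_t last_acked = strtoull(argv[2], NULL, 10);
            for (uint64_t n = 0; n <= last_acked; n++) {
                if (pread(fd, rec, REC_SIZE, (off_t)(n * REC_SIZE)) != REC_SIZE ||
                    rec[0] != n) {
                    printf("record %llu missing or corrupt: acked data was lost\n",
                           (unsigned long long)n);
                    return 1;
                }
            }
            printf("all acked records intact\n");
            return 0;
        }

        /* Write mode: run until the plug gets pulled. */
        for (uint64_t n = 0; ; n++) {
            memset(rec, 0, REC_SIZE);
            rec[0] = n;
            if (pwrite(fd, rec, REC_SIZE, (off_t)(n * REC_SIZE)) != REC_SIZE) {
                perror("pwrite"); return 1;
            }
            if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
            printf("acked %llu\n", (unsigned long long)n);  /* only now is n acked */
            fflush(stdout);
        }
    }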
TLDR: Very few drives don't implement flush correctly. Notice that he mainly uses hard disks, not SSDs/NVMe. Failure often occurs when two (usually rare) things occur at once. E.g. remapping an unreadable sector while power-cycling.
But as long as you write the journal entry first and the device guarantees flushes for writes in the order they are queued, there should be no inconsistent state at all?
> and the device guarantees flushes for writes in the order they are queued
NVMe does not require such a guarantee, nor does it provide a way for drives to signal such a guarantee.
(Part of the reason is that NVMe devices have multiple queues, and the standard tries to avoid imposing unnecessary timing or synchronization requirements between commands that aren't submitted to the same queue.)
Assuming NVMe queuing works like SATA or SCSI queuing (which I believe it does), queue entries are basically unordered [1]; the device is free to process them in any order. If you (as in, the person implementing a block layer or file system in an OS kernel, or some fancy kernel-bypass stuff) want requests A and B to be ordered before request C, then you must do something like
1. Issue A and B.
2. Wait for A and B to complete.
3. Issue a FLUSH operation (to ensure that A and B are written from the drive cache to persistent storage), and wait for it to complete.
4. Issue C with the FUA (force unit access) bit set.
5. Wait for C to complete.
Alternatively, if the device doesn't support FUA, for writing C you must instead do
4b. Issue C.
5b. Wait for C to complete.
6b. Issue FLUSH, and wait for the FLUSH to complete.
Now, like wtallis already said, NVMe additionally has multiple queues per device, but these are independent from each other. If you somehow want ordering between different queues, you must implement that in higher-level software. (A rough userspace sketch of the flush/FUA sequence above follows after the footnote.)
[1] The SCSI spec has an optional feature to enable ordered tags. But apparently almost no devices ever implemented it, and AFAIK Linux and Windows never use that feature either.
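To make that concrete, here's a rough userspace analogue of the FUA variant, assuming Linux and a made-up journal file layout: fdatasync() is what the kernel eventually turns into a device FLUSH, and a write through an O_DSYNC file descriptor is what it may issue with FUA set (or emulate with write + flush on drives without FUA). It's a sketch of the ordering discipline, not the literal block-layer commands.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char rec_a[]  = "journal record A";
        const char rec_b[]  = "journal record B";
        const char commit[] = "commit record C";

        /* Steps 1-2: issue A and B, wait for both to complete. */
        int jfd = open("journal.bin", O_WRONLY | O_CREAT, 0644);
        if (jfd < 0) { perror("open"); return 1; }
        if (pwrite(jfd, rec_a, sizeof rec_a, 0) < 0 ||
            pwrite(jfd, rec_b, sizeof rec_b, 4096) < 0) {
            perror("pwrite"); return 1;
        }

        /* Step 3: FLUSH -- A and B go from the drive's cache to media. */
        if (fdatasync(jfd) != 0) { perror("fdatasync"); return 1; }

        /* Steps 4-5: issue C so that it is durable before the call returns.
         * O_DSYNC lets the kernel use FUA where supported, or fall back to
         * a write followed by another flush. */
        int cfd = open("journal.bin", O_WRONLY | O_DSYNC);
        if (cfd < 0) { perror("open O_DSYNC"); return 1; }
        if (pwrite(cfd, commit, sizeof commit, 8192) < 0) {
            perror("pwrite commit"); return 1;
        }

        close(cfd);
        close(jfd);
        return 0;
    }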
It's worth noting that log-structured and copy-on-write filesystems (which I've seen described as two types of journaling) like btrfs and F2FS log data as part of their normal operation, without the double-write cost of a separate journal, so you always get a consistent view of the filesystem (barring bugs in the FS code or fsync-is-not-really-an-fsync treachery from your hardware).
You don’t get a guarantee of write ordering from the disk (pretty much any kind) or from the OS IO scheduler.
Journaling filesystems can still implement atomic appends with only metadata journaling.
Updating in place is generally not atomic because of the way writeback works for buffered IO.
If you use unbuffered IO you bypass the OS scheduler, but the disk can still reorder things if you don’t use write barriers, and even then regular writes aren’t guaranteed to be atomic.
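A small sketch of that distinction, assuming Linux (file names are made up): the buffered write returns once the page cache has the data, and writeback may defer and reorder it until an explicit fdatasync; the O_DIRECT | O_DSYNC write bypasses the page cache and waits for the device to report the data stable (FUA or write+flush underneath). Neither path makes a multi-block overwrite atomic.

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Buffered: lands in the page cache; ordering/durability only
         * after an explicit fdatasync (FLUSH underneath). */
        int bfd = open("buffered.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (bfd < 0) { perror("open buffered"); return 1; }
        const char msg[] = "hello, page cache";
        if (write(bfd, msg, sizeof msg) < 0) { perror("write"); return 1; }
        if (fdatasync(bfd) != 0) { perror("fdatasync"); return 1; }
        close(bfd);

        /* Unbuffered: O_DIRECT skips the page cache (buffer, offset and
         * length must be block-aligned); O_DSYNC makes each write wait
         * until the device reports it stable. */
        int dfd = open("direct.bin",
                       O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT | O_DSYNC, 0644);
        if (dfd < 0) { perror("open O_DIRECT"); return 1; }
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n"); return 1;
        }
        memset(buf, 0, 4096);
        strcpy(buf, "hello, device");
        if (pwrite(dfd, buf, 4096, 0) != 4096) { perror("pwrite O_DIRECT"); return 1; }
        free(buf);
        close(dfd);
        return 0;
    }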