Journaling cannot guarantee data or filesystem integrity if your hardware is lying to you. If you send a flush to an SSD and it reports "ok, your data is on persistent storage" while actually keeping it in DRAM buffers (to get higher numbers on benchmarks), and your power goes down, shit ensues. This is surprisingly common behavior.
Wow, this jogged a memory from when Brad Fitzpatrick (bradfitz on HN) had to write a utility to ensure the hard drives running LiveJournal didn't lie about successfully completing fsync().[1] IIRC, the behavior caused fairly serious database corruption after a power outage.
Went back and found the link. To my surprise, it was 15 years ago. To my greater surprise, the original post, the Slashdot article, and the utility all remain available.
And hard drives (or their NVMe successors) still lie.
Anecdotally, consumer NVMe SSDs actually tend not to lie about it. Every time I've benchmarked a consumer NVMe SSD under Windows both with and without the "Write Cache Buffer Flushing" option, toggling it has had a profound impact on the measured performance of the SSD. I have not observed a comparable performance impact for SATA SSDs, so I suspect Microsoft's description of what that option does is inaccurate for at least one type of drive, though it is at least possible that ignoring flushes is extremely common for consumer SATA SSDs but uncommon for consumer NVMe SSDs.
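For anyone who wants to try something similar on Linux, here's a minimal sketch of the same idea (the file name and iteration count are arbitrary choices of mine): time a loop of small writes, each followed by fdatasync(), which the kernel turns into a device cache flush for drives that advertise a volatile write cache. A drive that genuinely flushes shows per-op latencies orders of magnitude higher than one that ignores the flush (or has a power-loss-protected cache).

    /* Rough flush-latency probe; a sketch, not a proper benchmark. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const int iters = 1000;              /* arbitrary */
        char buf[4096];
        memset(buf, 0xAB, sizeof buf);

        int fd = open("flushtest.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            if (pwrite(fd, buf, sizeof buf, (off_t)i * sizeof buf) != (ssize_t)sizeof buf) {
                perror("pwrite"); return 1;
            }
            if (fdatasync(fd) != 0) {        /* flush the drive's volatile cache */
                perror("fdatasync"); return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d write+flush ops, %.1f us/op\n", iters, secs / iters * 1e6);
        close(fd);
        return 0;
    }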
Sure, although the proper way to test it would be to write a lot of data to the drive, issue an fsync, and cut power in the middle of the operation. Rinse and repeat a (few) hundred times for each drive.
There's a guy on btrfs' LKML (also the author of [0]) who is diligent enough to do these tests on much of the hardware he gets, and his experience does not sound good for consumer drives.
> although the proper way to test it would be to write a lot of data to the drive, issue an fsync, and cut power in the middle of the operation. Rinse and repeat a (few) hundred times for each drive.
This isn't quite right. You have to ensure that the drive returned completion of a flush command to the OS before the plug was pulled, or else the NVMe spec does allow the drive to return old data after power is restored. Without confirming receipt of a completion queue entry for a flush command (or equivalent), this test as described is mainly checking whether the drive has a volatile write cache—and there are much easier ways to check that.
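If anyone wants to automate that, here's a rough sketch of such a harness, assuming Linux, a scratch file on the drive under test, and that the "acked" output is captured on a second machine (e.g. over ssh or a serial console) so it survives the power cut. A sequence number is printed only after fdatasync() has returned, i.e. after the flush was acknowledged; after power is restored, every acked record must read back intact, otherwise something below the syscall layer lied.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define REC_SIZE 4096

    int main(int argc, char **argv)
    {
        int fd = open("powerfail.bin", O_RDWR | O_CREAT, 0644);  /* arbitrary scratch file */
        if (fd < 0) { perror("open"); return 1; }
        uint64_t rec[REC_SIZE / sizeof(uint64_t)];

        if (argc > 2 && strcmp(argv[1], "verify") == 0) {
            /* After reboot: pass the last acked sequence number captured on
             * the other machine; every record up to it must still be there. */
            uint64_t last_acked = strtoull(argv[2], NULL, 10);
            for (uint64_t n = 0; n <= last_acked; n++) {
                if (pread(fd, rec, REC_SIZE, (off_t)(n * REC_SIZE)) != REC_SIZE ||
                    rec[0] != n) {
                    printf("record %llu missing or corrupt: acked data was lost\n",
                           (unsigned long long)n);
                    return 1;
                }
            }
            printf("all acked records intact\n");
            return 0;
        }

        /* Write mode: run until the plug gets pulled. */
        for (uint64_t n = 0; ; n++) {
            memset(rec, 0, REC_SIZE);
            rec[0] = n;
            if (pwrite(fd, rec, REC_SIZE, (off_t)(n * REC_SIZE)) != REC_SIZE) {
                perror("pwrite"); return 1;
            }
            if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
            printf("acked %llu\n", (unsigned long long)n);  /* only now is n acked */
            fflush(stdout);
        }
    }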
TLDR: Very few drives don't implement flush correctly. Notice that he mainly uses hard disks, not SSDs/NVMe. Failure often occurs when two (usually rare) things occur at once. E.g. remapping an unreadable sector while power-cycling.
But as long as you write the journal entry first and the device guarantees flushes for writes in the order they are queued, there should be no inconsistent state at all?
> and the device guarantees flushes for writes in the order they are queued
NVMe does not require such a guarantee, nor does it provide a way for drives to signal such a guarantee.
(Part of the reason is that NVMe devices have multiple queues, and the standard tries to avoid imposing unnecessary timing or synchronization requirements between commands that aren't submitted to the same queue.)
Assuming NVMe queuing works like SATA or SCSI queuing (which I believe it does), queue entries are basically unordered [1]; the device is free to process them in any order. If you (as in, the person implementing a block layer or file system in an OS kernel, or some fancy kernel-bypass stuff) want requests A and B to be ordered before request C, then you must do something like
1. Issue A and B.
2. Wait for A and B to complete.
3. Issue a FLUSH operation (to ensure that A and B are written from the drive cache to persistent storage), and wait for it to complete.
4. Issue C with the FUA (force unit access) bit set.
5. Wait for C to complete.
Alternatively, if the device doesn't support FUA, for writing C you must instead do
4b. Issue C.
5b. Wait for C to complete.
6b. Issue FLUSH, and wait for the FLUSH to complete.
Now, like wtallis already said, NVMe additionally has multiple queues per device, but these are independent from each other. If you somehow want ordering between different queues, you must implement that in higher-level software. (A rough userspace sketch of the flush/FUA sequence above follows after the footnote.)
[1] The SCSI spec has an optional feature to enable ordered tags. But apparently almost no devices ever implemented it, and AFAIK Linux and Windows never use that feature either.
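To make that concrete, here's a rough userspace analogue of the FUA variant, assuming Linux and a made-up journal file layout: fdatasync() is what the kernel eventually turns into a device FLUSH, and a write through an O_DSYNC file descriptor is what it may issue with FUA set (or emulate with write + flush on drives without FUA). It's a sketch of the ordering discipline, not the literal block-layer commands.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char rec_a[]  = "journal record A";
        const char rec_b[]  = "journal record B";
        const char commit[] = "commit record C";

        /* Steps 1-2: issue A and B, wait for both to complete. */
        int jfd = open("journal.bin", O_WRONLY | O_CREAT, 0644);
        if (jfd < 0) { perror("open"); return 1; }
        if (pwrite(jfd, rec_a, sizeof rec_a, 0) < 0 ||
            pwrite(jfd, rec_b, sizeof rec_b, 4096) < 0) {
            perror("pwrite"); return 1;
        }

        /* Step 3: FLUSH -- A and B go from the drive's cache to media. */
        if (fdatasync(jfd) != 0) { perror("fdatasync"); return 1; }

        /* Steps 4-5: issue C so that it is durable before the call returns.
         * O_DSYNC lets the kernel use FUA where supported, or fall back to
         * a write followed by another flush. */
        int cfd = open("journal.bin", O_WRONLY | O_DSYNC);
        if (cfd < 0) { perror("open O_DSYNC"); return 1; }
        if (pwrite(cfd, commit, sizeof commit, 8192) < 0) {
            perror("pwrite commit"); return 1;
        }

        close(cfd);
        close(jfd);
        return 0;
    }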
It's worth noting that log-structured and copy-on-write filesystems (which I've seen described as two types of journaling) like btrfs and F2FS log data as part of their normal operation, without the double-write cost of a separate journal, so you always get a consistent view of the filesystem (barring bugs in the FS code or fsync-is-not-really-an-fsync treachery from your hardware).
You don’t get a guarantee of write ordering from the disk (pretty much any kind) or from the OS IO scheduler.
Journaling filesystems can still implement atomic appends with only metadata journaling.
Updating in place is generally not atomic because of the way writeback works for buffered IO.
If you use unbuffered IO you bypass the OS scheduler, but the disk can still reorder things if you don’t use write barriers, and even then regular writes aren’t guaranteed to be atomic.
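A small sketch of that distinction, assuming Linux (file names are made up): the buffered write returns once the page cache has the data, and writeback may defer and reorder it until an explicit fdatasync; the O_DIRECT | O_DSYNC write bypasses the page cache and waits for the device to report the data stable (FUA or write+flush underneath). Neither path makes a multi-block overwrite atomic.

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Buffered: lands in the page cache; ordering/durability only
         * after an explicit fdatasync (FLUSH underneath). */
        int bfd = open("buffered.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (bfd < 0) { perror("open buffered"); return 1; }
        const char msg[] = "hello, page cache";
        if (write(bfd, msg, sizeof msg) < 0) { perror("write"); return 1; }
        if (fdatasync(bfd) != 0) { perror("fdatasync"); return 1; }
        close(bfd);

        /* Unbuffered: O_DIRECT skips the page cache (buffer, offset and
         * length must be block-aligned); O_DSYNC makes each write wait
         * until the device reports it stable. */
        int dfd = open("direct.bin",
                       O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT | O_DSYNC, 0644);
        if (dfd < 0) { perror("open O_DIRECT"); return 1; }
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n"); return 1;
        }
        memset(buf, 0, 4096);
        strcpy(buf, "hello, device");
        if (pwrite(dfd, buf, 4096, 0) != 4096) { perror("pwrite O_DIRECT"); return 1; }
        free(buf);
        close(dfd);
        return 0;
    }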