The Venn diagram of "people who want a modern copy-on-write filesystem with snapshots to manage large quantities of data" and "people who want a massive pool of fault-tolerant storage" (e.g. building a NAS) has some pretty significant overlap.
The latter is where BTRFS is still hobbled: while the RAID-0, RAID-1, & RAID-10 modes work absolutely fine, the RAID-5 & RAID-6 modes are still broken, with an explicit warning at mkfs time (and in the manpages) that the feature is still experimental and should not be used to hold data that you care about retaining. This has bitten, and continues to bite, people, with terabytes of data loss (backups are important, people!). That then sours them on every other aspect of ever using BTRFS again.
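For anyone who hasn't run into it, the warning shows up with something like the following (placeholder devices; the exact wording varies by btrfs-progs version):

```sh
# raid5 data profile across three hypothetical disks; mkfs.btrfs warns that
# the raid5/raid6 profiles are not recommended for data you care about.
mkfs.btrfs -d raid5 -m raid1 /dev/sda /dev/sdb /dev/sdc
```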
> If you ignore explicit warnings at mkfs time and then get upset the warning was accurate, you can't really fully blame the file system for it.
Oh, no doubt. I agree.
> Just raid on a lower layer and btrfs on top.
That has its own set of problems. The conventional RAID solution on Linux (MD) also has some pretty terrifying corruption edge cases with RAID-5 and RAID-6 (as I explained in [1]) which will bite you if you're not aware of them and how to work around them.
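For context, the layering being suggested is roughly this (array, device, and mountpoint names are placeholders):

```sh
# Hypothetical four-disk RAID-6 handled by MD, with btrfs on top seeing only
# the single /dev/md0 device (so btrfs itself provides no redundancy).
mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.btrfs /dev/md0
mount /dev/md0 /mnt/data
```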
A robust filesystem purpose-built for the task can only really be found in ZFS.
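As a rough sketch of what that looks like (pool name and devices are placeholders):

```sh
# raidz2 survives two disk failures; because ZFS owns both the checksums and
# the parity, a scrub can repair a bad block from parity automatically.
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde
zpool scrub tank        # verify everything and self-heal what it can
zpool status -v tank    # lists any devices with checksum errors
```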
Won't silent corruption on the raid level be detected by the integrity checks in btrfs? It won't be able to automatically repair it, but it should give errors at least, right?
Yeah, that would be the "error detection at a higher level" (than MD) part. It'd still be on you to pull one drive at a time from the array until the errors go away (then you know which drive has the corrupted block in that stripe, and can remove the mdadm metadata from it and re-add it to the array so that the kernel forces a clean resync, reconstructing the good block from the parity).

Running the MD "repair" action instead would recompute the parity from the now-corrupted data, overwriting your good parity, and you would have no means of recovering. MD can't know whether the data is bad or the parity is bad because it doesn't know what the data is supposed to look like; even if btrfs does have a checksum for it, that's on a higher, disconnected layer. All filesystems on top of a parity MD suffer from this same vulnerability; some of them won't even be able to tell you when a file has become corrupted (e.g. FAT32), leading to that corruption being persisted into backups.
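Roughly, that dance looks like this; array, mountpoint, and device names are placeholders, so adapt before running anything:

```sh
# Scrub to confirm btrfs is still reporting checksum errors.
btrfs scrub start -B /mnt/data
# Drop one suspect member out of the array.
mdadm /dev/md0 --fail /dev/sdc --remove /dev/sdc
# Scrub again while degraded: reads now come from data+parity reconstruction,
# so if the errors disappear, /dev/sdc held the corrupted block(s).
btrfs scrub start -B /mnt/data
# Wipe its MD metadata and re-add it, forcing a full resync that rebuilds its
# contents from the surviving data and parity.
mdadm --zero-superblock /dev/sdc
mdadm /dev/md0 --add /dev/sdc
```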
If it were only one data block in one stripe I'd be confident re-adding the same drive (and have done so); that is overwhelmingly likely to be a transient error (e.g. bit rot on the platter, or a bit flip while writing, whether in the drive itself or the machine's main memory) that won't recur.
The MD "check" action can confirm this (it will iterate every stripe and report all parity/data mismatches, so if it only reports one ...) and some distributions ship a cronjob that automatically does this on a monthly basis.
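If you'd rather kick it off by hand than wait for the distro's timer, the sysfs interface is (md0 being a placeholder):

```sh
# Start a read-only consistency pass over every stripe in the array.
echo check > /sys/block/md0/md/sync_action
# Watch progress, then read how many sectors didn't match their parity.
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
```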
If it were a corrupt parity block in a stripe (i.e. a filesystem with strong error detection reports no errors but the MD check action still reports a data/parity mismatch), that usually indicates a lost write during a re-write operation (e.g. the machine was powered off in the middle of updating the contents of a stripe), as the parity is written last -- i.e. the parity would be for the old data in that stripe, not the data as it is now.
The MD "repair" action (if you are ABSOLUTELY CERTAIN that it is the parity that is bad) will correct this, and you should do so: if a disk holding a data block in that stripe later fails, the stale parity will reconstruct incorrect data, which will then start showing up as filesystem errors (if you're fortunate enough to be using such a filesystem).
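The corresponding knob, only to be used once you're sure it's the parity that is stale:

```sh
# Recomputes parity from the current data blocks and rewrites it.
# If the data is actually what's corrupt, this destroys your only good copy.
echo repair > /sys/block/md0/md/sync_action
```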
Of course, all of the usual caveats about checking SMART statistics apply in determining whether a drive is still suitable for continued use. If the same drive kept showing up with the same problems, I'd retire it; if it started reporting an increasing reallocated sector count, I'd retire it; and so on.
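e.g. via smartctl (the attribute names vary somewhat between vendors):

```sh
# Overall health verdict plus the attributes most predictive of failure.
smartctl -H /dev/sdc
smartctl -A /dev/sdc | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
```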