Considering the size of disks nowadays, the chance of bit rot is high. And (I don't have the original source) on SSDs, the probability of bit rot is reportedly higher still. So ZFS and Btrfs checksum both metadata and data. From what I've read, XFS may have metadata checksumming, but nothing on the data side of things.
I consider checksumming important. Do others? What is the solution? What other file systems offer that sort of capability?
Snapshotting is a second go-to function. Particularly when it is integrated into the LXC container creation process. (There was a comment elsewhere here which said LXC is on its way out.... huh? what?)
There are many different ways that storage can be layered, and depending on your use case, you can put various advanced features (snapshots, checksums/data integrity, encryption, etc.) in different places in the storage stack. You can put functionality at the block device layer (e.g., LVM, dm-thin, dm-verity), you can put functionality into the file system, you can put functionality into the cluster file system (if you have such a thing), or you can put it in at the application level.
Depending on the requirements of your use case, different choices will make more sense. It's important to remember that RHEL is aimed at enterprise customers, and what might be common in the enterprise world might not be common in yours, and vice versa. Certainly, if you are using a cluster file system, it makes no sense to do checksum protection at the disk file system level, because you will be using some kind of erasure coding (e.g., Reed-Solomon error-correcting codes) to protect against node failure. This will also take care of bit flips.
If you are using cloud VMs, or if you are using Docker / Kubernetes, then LXC won't make sense. It all depends on your technology choices, and so it's important to look at the big picture, not just at the individual file system's features.
Given a stock (or additional packages?) RHEL 7.4 install on non-clustered storage, what would be the best combination to detect & correct bit rot at the filesystem level and below?
One good thing about ZFS integrity checking is that when it finds an error it can repair the bit rot from another disk if you have parity or mirroring. Can dm-integrity do that?
dm-integrity will only operate on a single disk so no, not on its own.
It does however return an error if the integrity check fails, so if you put mdadm on top, mdadm can repair the erroneous block. I've tested this and am currently running it on a 32TB array.
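The behavior described above can be sketched as a toy simulation. This is not dm-integrity's or mdadm's actual code, just a model of the interaction under the assumption stated in the comment: the integrity layer turns a checksum mismatch into a read error, and the mirror layer above retries the other leg and rewrites the failed one.

```python
import zlib

class IntegrityLeg:
    """Toy model of a dm-integrity device: keeps a checksum alongside
    the data and turns a mismatch into a read error (None here)."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.tag = zlib.crc32(data)

    def read(self):
        buf = bytes(self.data)
        return buf if zlib.crc32(buf) == self.tag else None

    def write(self, data: bytes):
        self.data = bytearray(data)
        self.tag = zlib.crc32(data)

def mirror_read(legs):
    """Toy md RAID-1 on top: a failed read on one leg falls through to
    the other leg, and the good data is written back as a repair."""
    for leg in legs:
        out = leg.read()
        if out is not None:
            for other in legs:
                if other is not leg and other.read() is None:
                    other.write(out)  # repair the damaged leg
            return out
    raise IOError("all mirror legs failed their integrity check")

a, b = IntegrityLeg(b"sector data"), IntegrityLeg(b"sector data")
a.data[0] ^= 0x01                    # silent bit flip on leg A
assert mirror_read([a, b]) == b"sector data"
assert a.read() == b"sector data"    # leg A was rewritten from leg B
```

The key point is that the integrity layer converts *silent* corruption into a *visible* error, which is exactly the signal a plain mirror needs in order to know which leg to trust.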
Yes, it can detect errors, but it can't continue to function correctly (read: return the correct data), because without checksums it doesn't know which copy of the differing data is damaged.
Moreover, if it doesn't always read both copies of the data (which it may well not, for performance reasons), then you have the possibility of silently propagating damaged data to all mirrors in the case that damaged data is returned to an application and the application then rewrites said data.
Compare that to a filesystem with checksums, which, in addition to being able to detect such a problem, could also continue to function completely correctly in the face of it.
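The difference can be shown in a few lines (a toy model, not any real filesystem's on-disk format): a plain mirror only knows its copies disagree, while a checksum stored at write time identifies the intact copy unambiguously.

```python
import hashlib

good = b"important data"
bad = bytearray(good)
bad[0] ^= 0x08                      # simulated bit flip
copies = [bytes(bad), good]         # a two-way mirror after bit rot

# Plain RAID-1: the two copies differ, but that is all we know --
# there is no way to tell which one is damaged.
assert copies[0] != copies[1]

# Checksumming filesystem: the checksum written alongside the data
# picks out the intact copy.
stored = hashlib.sha256(good).hexdigest()
valid = [c for c in copies if hashlib.sha256(c).hexdigest() == stored]
assert valid == [good]
```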
Yep. "What happens if you read all the disks successfully but the redundancy doesn't agree?" is a great question.
Mirrors and RAID5: there's obviously no way that `md` software RAID can help, since it doesn't know which is correct. What about RAID6 though? Double parity means `md` would have enough information to determine which disk has provided incorrect data. Surely it does this, right?
Wrong. In the event of any parity mismatch, `md` assumes the data disks are correct and rewrites the parity to match. See the "Scrubbing and Mismatches" section in `man 4 md`.
If you scrub a RAID 6 array with a disk that returns bad data, `md` helpfully overwrites your two disks of redundancy in order to agree with the one disk that's wrong. Array consistent, job done, data... eaten.
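The claim that double parity carries enough information to pinpoint a single corrupted data disk can be checked with a small GF(2^8) sketch. This is the arithmetic RAID 6 parity is built on, not md's actual code (which, as noted, just rewrites parity instead):

```python
# Minimal GF(2^8) arithmetic (the field RAID-6 P/Q parity lives in).
def gf_mul(a, b):
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D            # reduce by x^8 + x^4 + x^3 + x^2 + 1
        b >>= 1
    return p

# exp/log tables over the generator g = 2
EXP, LOG = [0] * 256, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = gf_mul(x, 2)

def pq(data):
    """P = XOR of all disks; Q = sum of g^i * disk_i (RAID-6 style)."""
    P = Q = 0
    for i, d in enumerate(data):
        P ^= d
        Q ^= gf_mul(EXP[i], d)
    return P, Q

data = [0x11, 0x22, 0x33, 0x44]     # one byte per data disk
P, Q = pq(data)                     # parity as originally written

read = list(data)
read[2] ^= 0x5A                     # disk 2 silently returns bad data

sP = P ^ pq(read)[0]                # P syndrome = the error byte e
sQ = Q ^ pq(read)[1]                # Q syndrome = g^z * e
z = (LOG[sQ] - LOG[sP]) % 255       # dividing syndromes yields z
assert z == 2                       # the bad disk is identified...
assert read[z] ^ sP == data[2]      # ...and its data is recoverable
```

So the information to locate and correct a single bad disk really is sitting in the P/Q syndromes; the complaint above is that md's scrub path doesn't use it.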
When the file SomeFile is read into memory, the read will be distributed among the disks for performance reasons (and it will probably need to span a multiple of the stripe size).
OK, the file is read into memory, including the bit-rotted part from disk B.
Now we write the file blocks back - as one does.
Voila! Both disks now contain the bit rot. And mdadm will not complain - disks A and B are identical in the area of file SomeFile.
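The steps above can be sketched as a toy simulation (not md's actual read-balancing logic, which is more sophisticated, but the failure mode is the same):

```python
disks = [bytearray(b"AAAA"), bytearray(b"AAAA")]  # mirrored file data
disks[1][1] ^= 0x20                 # bit rot on disk B ('A' -> 'a')

def read_round_robin(disks, n):
    """Toy RAID-1 read path: alternate legs for throughput, with no
    checksum to notice that one leg returned damaged data."""
    return bytes(disks[i % 2][i] for i in range(n))

buf = read_round_robin(disks, 4)    # picks up the rotted byte from B
for d in disks:                     # the application rewrites the file
    d[:4] = buf

assert disks[0] == disks[1]         # mirrors agree again...
assert bytes(disks[0]) == b"AaAA"   # ...on the corrupted contents
```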
Moreover, even if you don't read the file, and the bit rot is discovered during the monthly compare, at least on Linux the disk that is considered correct will be chosen at random. So you need at least three disks to have some semblance of protection. Have you guys seen many laptops that come with three or more drives?
Just use ZFS. Even on a single disk setup you will at least not get silent bit rot.
Actually it's better to just do mirrors. Avoid RAIDZ at all costs if you care about performance and the ability to resilver in a reasonable amount of time.
Two sets of, say, 5 disks in mirrored RAIDZ1 would still fail if a disk in one set failed and a disk in the other set failed. I guess you could do a striped setup of 5 sets of 2 disks in mirrors. Still, it seems wicked risky to me. I do agree, though, that mirroring has been the best for speed, but a lot of that changes with nicer SSDs, especially NVMe ones.
I was curious what a "nested" mirror really is. What exactly is nested?
I'd set up a large pool with mirror vdevs, i.e. n sets of 2 disks per mirror.
My half-remembered reasoning was that backups manage the risk you'll lose data. But replacing a disk in a mirror vdev is much easier, and faster, than doing so with RAIDZ.
The risk of RAIDZ is that resilvering touches every remaining drive in the vdev, is much more intensive than a simple mirror resilver, and thus the probability that additional drives will fail is much higher.
Here's a blog post that I definitely read the last time I was reading up on this:
I wonder if resilvering is still an issue with SSDs. But I cede your point: a pool of two-disk mirror vdevs makes sense. It still doesn't sit well with me, but it makes sense.
A Google search turns up a number of sources. Another claim that could use justification is whether SSDs bit rot more over the long term than spinning disks do. I heard that somewhere as well.