Considering the size of disks nowadays, the chance of bit rot is high. And (I don't have the original source) on SSDs, the probability of bit rot is reportedly higher still. So ZFS and Btrfs checksum both metadata and data. From what I've read, XFS may have metadata checksumming, but nothing on the data side of things.
I consider checksumming important. Do others? What is the solution? What other file systems offer that sort of capability?
Snapshotting is a second go-to function. Particularly when it is integrated into the LXC container creation process. (There was a comment elsewhere here which said LXC is on its way out.... huh? what?)
There are many different ways that storage can be layered, and depending on your use case, you can put various advanced features (snapshots, checksums/data integrity, encryption, etc.) in different places in the storage stack. You can put functionality at the block device layer (e.g., LVM, dm-thin, dm-verity), you can put functionality into the file system, you can put functionality into the cluster file system (if you have such a thing), or you can put it in at the application level.
Depending on the requirements of your use case, different choices will make more sense. It's important to remember that RHEL is aimed at enterprise customers, and what might be common in the enterprise world might not be common in yours, and vice versa. Certainly, if you are using a cluster file system, it makes no sense to do checksum protection at the disk file system level, because you will be using some kind of erasure coding (e.g., Reed-Solomon error-correcting codes) to protect against node failure. This will also take care of bit flips.
If you are using cloud VMs, or if you are using Docker / Kubernetes, then LXC won't make sense. It all depends on your technology choices, and so it's important to look at the big picture, not just at the individual file system's features.
Given a stock (or additional packages?) RHEL 7.4 install on non-clustered storage, what would be the best combination to detect & correct bit rot at the filesystem level and below?
One good thing about ZFS integrity checking is that when it finds an error it can repair the bit rot from another disk if you have parity or mirroring. Can dm-integrity do that?
dm-integrity will only operate on a single disk so no, not on its own.
It does however return an error if the integrity check fails, so if you put mdadm on top, mdadm can repair the erroneous block. I've tested this and am currently running it on a 32TB array.
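The behavior described above can be sketched as a toy simulation. This is not dm-integrity's or mdadm's actual code, just a model of the interaction under the assumption stated in the comment: the integrity layer turns a checksum mismatch into a read error, and the mirror layer above retries the other leg and rewrites the failed one.

```python
import zlib

class IntegrityLeg:
    """Toy model of a dm-integrity device: keeps a checksum alongside
    the data and turns a mismatch into a read error (None here)."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.tag = zlib.crc32(data)

    def read(self):
        buf = bytes(self.data)
        return buf if zlib.crc32(buf) == self.tag else None

    def write(self, data: bytes):
        self.data = bytearray(data)
        self.tag = zlib.crc32(data)

def mirror_read(legs):
    """Toy md RAID-1 on top: a failed read on one leg falls through to
    the other leg, and the good data is written back as a repair."""
    for leg in legs:
        out = leg.read()
        if out is not None:
            for other in legs:
                if other is not leg and other.read() is None:
                    other.write(out)  # repair the damaged leg
            return out
    raise IOError("all mirror legs failed their integrity check")

a, b = IntegrityLeg(b"sector data"), IntegrityLeg(b"sector data")
a.data[0] ^= 0x01                    # silent bit flip on leg A
assert mirror_read([a, b]) == b"sector data"
assert a.read() == b"sector data"    # leg A was rewritten from leg B
```

The key point is that the integrity layer converts *silent* corruption into a *visible* error, which is exactly the signal a plain mirror needs in order to know which leg to trust.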
Yes, it can detect errors, but it can't continue to function correctly (read: return the correct data), because without checksums it doesn't know which copy of the differing data is damaged.
Moreover, if it doesn't always read both copies of the data (which it may well not, for performance reasons), then you have the possibility of silently propagating damaged data to all mirrors in the case that damaged data is returned to an application and the application then rewrites said data.
Compare that to a filesystem with checksums, which, in addition to being able to detect such a problem, could also continue to function completely correctly in the face of it.
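The difference can be shown in a few lines (a toy model, not any real filesystem's on-disk format): a plain mirror only knows its copies disagree, while a checksum stored at write time identifies the intact copy unambiguously.

```python
import hashlib

good = b"important data"
bad = bytearray(good)
bad[0] ^= 0x08                      # simulated bit flip
copies = [bytes(bad), good]         # a two-way mirror after bit rot

# Plain RAID-1: the two copies differ, but that is all we know --
# there is no way to tell which one is damaged.
assert copies[0] != copies[1]

# Checksumming filesystem: the checksum written alongside the data
# picks out the intact copy.
stored = hashlib.sha256(good).hexdigest()
valid = [c for c in copies if hashlib.sha256(c).hexdigest() == stored]
assert valid == [good]
```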
Yep. "What happens if you read all the disks successfully but the redundancy doesn't agree?" is a great question.
Mirrors and RAID5: there's obviously no way that `md` software RAID can help, since it doesn't know which is correct. What about RAID6 though? Double parity means `md` would have enough information to determine which disk has provided incorrect data. Surely it does this, right?
Wrong. In the event of any parity mismatch, `md` assumes the data disks are correct and rewrites the parity to match. See the "Scrubbing and Mismatches" section in `man 4 md`.
If you scrub a RAID 6 array with a disk that returns bad data, `md` helpfully overwrites your two disks of redundancy in order to agree with the one disk that's wrong. Array consistent, job done, data... eaten.
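The claim that double parity carries enough information to pinpoint a single corrupted data disk can be checked with a small GF(2^8) sketch. This is the arithmetic RAID 6 parity is built on, not md's actual code (which, as noted, just rewrites parity instead):

```python
# Minimal GF(2^8) arithmetic (the field RAID-6 P/Q parity lives in).
def gf_mul(a, b):
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D            # reduce by x^8 + x^4 + x^3 + x^2 + 1
        b >>= 1
    return p

# exp/log tables over the generator g = 2
EXP, LOG = [0] * 256, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = gf_mul(x, 2)

def pq(data):
    """P = XOR of all disks; Q = sum of g^i * disk_i (RAID-6 style)."""
    P = Q = 0
    for i, d in enumerate(data):
        P ^= d
        Q ^= gf_mul(EXP[i], d)
    return P, Q

data = [0x11, 0x22, 0x33, 0x44]     # one byte per data disk
P, Q = pq(data)                     # parity as originally written

read = list(data)
read[2] ^= 0x5A                     # disk 2 silently returns bad data

sP = P ^ pq(read)[0]                # P syndrome = the error byte e
sQ = Q ^ pq(read)[1]                # Q syndrome = g^z * e
z = (LOG[sQ] - LOG[sP]) % 255       # dividing syndromes yields z
assert z == 2                       # the bad disk is identified...
assert read[z] ^ sP == data[2]      # ...and its data is recoverable
```

So the information to locate and correct a single bad disk really is sitting in the P/Q syndromes; the complaint above is that md's scrub path doesn't use it.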
When the file SomeFile is read into memory, the read will be distributed among the disks for performance reasons (and it will probably need to span a multiple of the stripe size).
OK, the file is read into memory, including the bit-rotted part from disk B.
Now we write the file blocks back - as one does.
Voila! Both disks now contain the bit rot. And mdadm will not complain - disks A and B are identical in the area of file SomeFile.
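The steps above can be sketched as a toy simulation (not md's actual read-balancing logic, which is more sophisticated, but the failure mode is the same):

```python
disks = [bytearray(b"AAAA"), bytearray(b"AAAA")]  # mirrored file data
disks[1][1] ^= 0x20                 # bit rot on disk B ('A' -> 'a')

def read_round_robin(disks, n):
    """Toy RAID-1 read path: alternate legs for throughput, with no
    checksum to notice that one leg returned damaged data."""
    return bytes(disks[i % 2][i] for i in range(n))

buf = read_round_robin(disks, 4)    # picks up the rotted byte from B
for d in disks:                     # the application rewrites the file
    d[:4] = buf

assert disks[0] == disks[1]         # mirrors agree again...
assert bytes(disks[0]) == b"AaAA"   # ...on the corrupted contents
```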
Moreover, even if you don't read the file, and the bit rot is discovered during the monthly compare, at least on Linux the disk that is considered correct will be chosen at random. So you need at least three disks to have some semblance of protection. Have you guys seen many laptops that come with three or more drives?
Just use ZFS. Even on a single disk setup you will at least not get silent bit rot.
Actually it's better to just do mirrors. Avoid RAIDZ at all costs if you care about performance and the ability to resilver in a reasonable amount of time.
Two sets of, say, 5 disks in mirrored RAIDZ1 would still fail if a disk in one set failed and a disk in the other set failed. I guess you could do a striped setup of 5 sets of 2 disks in mirrors. Still, it seems wicked risky to me. I do agree, though, that mirroring has been the best for speed, but a lot of that changes with nicer SSDs, especially NVMe ones.
I was curious what a "nested" mirror really is. What exactly is nested?
I'd set up a large pool with mirror vdevs, i.e. n sets of 2 disks per mirror.
My half-remembered reasoning was that backups manage the risk you'll lose data. But replacing a disk in a mirror vdev is much easier, and faster, than doing so with RAIDZ.
The risk of RAIDZ is that resilvering touches every remaining drive in the vdev, is much more intensive than a simple mirror resilver, and thus the probability that additional drives will fail is much higher.
Here's a blog post that I definitely read the last time I was reading up on this:
I wonder if resilvering is still an issue with SSDs. But I cede your point: a pool of two-disk mirror vdevs makes sense. It still doesn't sit well with me, but it makes sense.
A Google search turns up a number of sources. Another claim that could use justification is whether SSDs bit rot more over the long term than spinning disks do. I heard that somewhere as well.