On Mon, Feb 18, 2019 at 1:14 PM Sébastien Luttringer <seblu@xxxxxxxxx> wrote:
>
> On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote:
> > On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@xxxxxxxxx> wrote:
> >
> > FYI: This only does full stripe reads, recomputes parity and overwrites the
> > parity strip. It assumes the data strips are correct, so long as the
> > underlying member devices do not return a read error. And the only way they
> > can return a read error is if their SCT ERC time is less than the kernel's
> > SCSI command timer. Otherwise errors can accumulate.
> >
> > smartctl -l scterc /dev/sdX
> > cat /sys/block/sdX/device/timeout
> >
> > The first must be a lesser value than the second. If the first is disabled
> > and can't be enabled, then the generally accepted assumed maximum time for
> > recoveries is an almost unbelievable 180 seconds; so the second needs to be
> > set to 180, and it is not persistent. You'll need a udev rule or startup
> > script to set it at every boot.
>
> All my disks' firmware doesn't allow ERC to be modified through SCT.
>
> # smartctl -l scterc /dev/sda
> smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build)
> Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control command not supported
>
> I was not aware of that timer. I needed time to read and experiment on this.
> Sorry for the long response time. I hope you didn't time out. :)
>
> After simulating several errors and timeouts with scsi_debug[1],
> fault_injection[2], and dmsetup[3], I don't understand why you suggest this
> could lead to corruption. When a SCSI command times out, the mid-layer[4]
> does several error recovery attempts. These attempts are logged into the
> kernel ring buffer, and at worst the device is put offline.

No. At worst, what happens if the SCSI command timer is reached before the
drive's SCT ERC timeout is that the kernel assumes the device is not
responding and does a link reset. That link reset obliterates the entire
command queue on SATA drives. And that means it's no longer possible to
determine which sector is having a problem, and therefore not possible to
fix it by overwriting that sector with good data. This is a problem for
Btrfs raid, as well as md and LVM.

> From my experiments, the md layer has no timeout, and waits as long as the
> underlying layer doesn't return, either during check or normal read/write
> attempts.
>
> I understand the benefit of keeping the disk's time to recover from errors
> below the HBA timeout. It prevents the disk from being kicked out of the
> array.

The md driver tolerates a fixed number or rate (I'm not sure which) of read
errors before a drive is marked faulty. The md driver, I think, tolerates
only one write failure, and then the drive is marked faulty. So far there is
no faulty concept in Btrfs; there are patches upstream for this, but I don't
know their merge status.

> However, I don't see how this could lead to a difference between check and
> repair in the md layer, or even trigger some corruption between the chunks
> inside a stripe.

It allows bad sectors to accumulate, because they never get repaired. The
only way they can be repaired is if the drive itself gives up on a sector and
reports a discrete uncorrected read error along with the sector LBA. That's
the only way the md driver knows which md chunk is affected, and where to get
a good copy, read it, and then overwrite the bad copy on the device with the
read error. The linux-raid@ list is full of examples of this.
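Since your drives won't accept SCT ERC changes, the only knob left is the
kernel's command timer. A rough sketch of the kind of startup script I mean
(untested; the 180 second value is the rule of thumb from above, and the
device names are placeholders for your actual array members):

  # run at every boot (rc.local, a systemd oneshot, whatever you prefer);
  # the sysfs setting does not survive a reboot
  for d in sda sdb sdc sdd; do
      echo 180 > /sys/block/$d/device/timeout
  done

A udev rule that writes the same device/timeout attribute works too, and has
the advantage of also covering hotplugged drives. With the timer raised, the
drive gets a chance to return a discrete read error instead of being hit by
a link reset, and that error is what lets md (or Btrfs) rewrite the bad
sector.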
That accumulation of unrepaired bad sectors does sometimes lead to the loss
of the array, in particular in the case of parity arrays, where such read
errors tend to be colocated. A read error in a stripe is functionally
identical to a single device loss for that stripe. So if the bad sector
isn't repaired, only one more error is needed and you get a full stripe
loss, and it's not recoverable. If the lost stripe is (user) data only, then
you just lose a file. But if the lost stripe contains file system metadata,
it can mean the loss of the file system on that md array.

> After reading the whole md(5) manual, I realize how bad it is to rely on the
> md layer to guarantee data integrity. There is no mechanism to know which
> chunk is corrupted in a stripe.

Correct. There is a tool that's part of mdadm that will do this if it's a
raid6 array.

> I'm wondering if using btrfs raid5, despite its known flaws, is not safer
> than md.

I can't point to a study that'd give us the various probabilities to answer
this question. In the meantime, I'd say all raid5 is fraught with peril the
instant there's any unhandled corruption or read error. And it's a very
common misconfiguration to have consumer SATA drives that lack configurable
SCT ERC, so the drive takes longer to produce a read error than the SCSI
command timer waits before causing a link reset.

> > Further, if the mismatches are consistently in the same sector range, it
> > suggests the repair scrub returned one set of data, and the subsequent
> > check scrub returned different data - that's the only way you get
> > mismatches following a repair scrub.
>
> It was the same range. That was my understanding too.
>
> I finally got rid of these errors by removing a disk, wiping the
> superblock, and adding it back to the raid. Since then, no check errors
> (tested twice).

*shrug* I'm not super familiar with all the mdadm features. It's vaguely
possible your md array is using the bad block mapping feature, and perhaps
that's related to this behavior. Something in my memory is telling me that
this isn't really the best feature to have enabled in every use case; it's
really strictly for continuing to use drives that have all reserve sectors
used up, which means bad sectors result in write failures. The bad block
mapping allows md to do its own remapping so there won't be write failures
in such a case.

Anyway, raids are complicated, and they are something of a Rube Goldberg
contraption. If you don't understand all the possible outcomes, and aren't
prepared for failures, it can lead to panic. And I've read about a lot of
panic-induced data loss on linux-raid. Really common is people doing Google
searches first, getting bad advice like recreating an array, and then
wondering why their array is wiped... *shrug* My advice is: don't be in a
hurry to fix things when they go wrong. Collect information. Do things that
don't write changes anywhere. Post all information to the proper mailing
list, working from the bottom (start) of the storage stack to the top (the
file system), and trust their advice.

> > If it's bad RAM, then chances are both copies of metadata will be
> > identically wrong and thus no help in recovery.
>
> RAM is not ECC. I tested the RAM recently and no error was found.

You might check the archives for the various memory testing strategies. A
simple hour-long test often won't find the most pernicious memory errors. At
least do it over a weekend. A quick search for "austin hemmelgarn memory
test compile" found this thread:

Re: btrfs ate my data in just two days, after a fresh install. ram and disk
are ok. it still mounts, but I cannot repair
Wed, May 4, 2016, 10:12 PM
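If I remember that thread right, the gist was that a sustained, realistic
workload such as repeated large parallel compiles will shake out marginal
RAM that a short memtest pass misses. A rough sketch of the idea (the source
tree path and the loop count are just placeholders):

  cd /usr/src/linux   # any big source tree will do
  for i in $(seq 1 50); do
      make -s clean
      make -s -j"$(nproc)" || { echo "build $i failed"; break; }
  done

A build failure in that loop that you can't reproduce afterwards points at
marginal hardware (RAM, or possibly cooling), not at the compiler.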
> But, I needed more RAM to rsync all the data w/ hardlinks, so I added a
> swap file on my system disk (an SSD). The file system on it is also btrfs,
> so I used a loop device to work around the hole issue.
> I can find some link resets on this drive at the time it was used as a
> swap file. Maybe this could be a reason.

Yeah, if there is a link reset on the drive, the whole command queue is
lost. It could cause a bunch of i/o errors that look scary but are one-time
errors related to the link reset. So you really don't want the link resets
happening. Conversely, many applications get mad if there really is a 180
second hang while a consumer drive does deep recovery. So it's a catch-22,
and a question of whether your use case can tolerate it. But hopefully you
only rarely have bad sectors anyway.

One nice thing about Btrfs is that you can do a balance, and it causes
everything to be written out, which "refreshes" the sector data with a
stronger signal. You probably shouldn't have to do that too often, maybe
once every 12-18 months. Otherwise, too many bad sectors is a valid warranty
claim.

> I think I will remove the md layer and use only BTRFS to be able to recover
> from silent data corruption.

Btrfs on top of md will still repair corrupted metadata if the metadata
profile is DUP. And in the case of (user) data corruption, it's still not
silent: Btrfs will tell you which file is corrupt, and you can recover it
from a backup. I can't tell you that Btrfs raid5 with a missing/failed drive
is any more reliable than md raid5. In a way it's simpler, so that might be
to your advantage; it really depends on your comfort and experience with the
user space tools.

If you do want to move to strictly Btrfs, I suggest raid5 for data but raid1
for metadata instead of raid5. Metadata raid5 writes can't really be assured
to be atomic; using raid1 metadata is less fragile. No matter what, keep
backups up to date and always be prepared to have to use them. The main idea
of any raid is just to give you some extra uptime in the face of a failure.
And the uptime is for your applications.

> But I'm curious to be able to repair a broken BTRFS without moving all the
> dataset to another place. It's the second time it has happened to me.
>
> I tried:
> # btrfs check --init-extent-tree /dev/md127
> # btrfs check --clear-space-cache v2 /dev/md127
> # btrfs check --clear-space-cache v1 /dev/md127
> # btrfs rescue super-recover /dev/md127
> # btrfs check -b --repair /dev/md127
> # btrfs check --repair /dev/md127
> # btrfs rescue zero-log /dev/md127

Wrong order. It's not obvious that it's the wrong order, either; the tools
don't do a great job of telling us what order to do things in. Also, all of
these involve writes. You really need to understand the problem first.

zero-log means some last-minute writes will be lost, and it should only be
used if there's difficulty mounting and the kernel errors point to a problem
with log replay. --clear-space-cache is safe; the cache is recreated at the
next mount, so it might result in a slow initial mount after use.
super-recover is safe by itself or with -v. It should be safe with -y, but
-y does write changes to disk. --init-extent-tree is about the biggest
hammer in the arsenal; it fixes only a very specific problem with the extent
tree, and usually it doesn't help, it just makes things worse. --repair
should be safe, but even in the 4.20.1 tools you'll see the man page says
it's dangerous and that you should ask on the list before using it.
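For what it's worth, a sketch of the kind of read-only look I mean before
reaching for any of the write paths (assuming the file system is still on
/dev/md127; none of these should modify the device):

  # structural check without writing; --readonly is the default mode anyway
  btrfs check --readonly /dev/md127

  # dump the superblock for inspection
  btrfs inspect-internal dump-super -f /dev/md127

  # without -y, super-recover should prompt before changing anything
  btrfs rescue super-recover -v /dev/md127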
> The detailed output is here [6]. But none of the above allowed me to drop
> the broken part of the btrfs tree to move forward. Is there a way to repair
> (by losing the corrupted data) without needing to drop all the correct
> data?

Well, at this point, if you ran all those commands, the file system is
different, so you should refresh the thread by posting the current kernel
messages from a normal mount (no options); and also 'btrfs check' output
without --repair; and also the output from btrfs-debug-tree. If the problem
is simple enough and a dev has time, they might get you a file system
specific patch to apply, and it can be fixed. But it's really important that
you stop making changes to the file system in the meantime. Just gather
information. Be deliberate.
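Something along these lines is what I mean by gathering information (the
output file names are just examples):

  # kernel messages from a plain mount attempt (useful even if the mount fails)
  mount /dev/md127 /mnt
  dmesg > dmesg-mount.txt

  # read-only check, no --repair
  btrfs check /dev/md127 > check.txt 2>&1

  # tree dump for the list; it can be large, so compress it
  btrfs-debug-tree /dev/md127 2>&1 | gzip > debug-tree.txt.gz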
--
Chris Murphy