On Mon, Jan 6, 2020 at 10:32 AM Stephen Conrad <conradsd@xxxxxxxxx> wrote:

> In January 2019 the file system remounted read-only. I have retained the dmesg
> logs and notes on the actions I took at that time. Summary is that I saw a
> bunch of SATA "link is slow to respond" and "hard resetting link" messages

Is this raid1? I can't tell. Anyway, this is common when the drive's SCT ERC
duration is longer than the SCSI command timer (which defaults to 30s).
Consumer drives often ship with SCT ERC disabled, and you'd have to dig
through the specs to find out what recovery time that implies, but with HDDs
it is often well over 30s. The idea is single-drive "deep recovery": it's
better for the drive to become slow than to report EIO for a bad sector. That
may be specious on its own, but it's definitely not a good idea for RAID.

First choice is to set SCT ERC to less than 30s; 70 deciseconds is common for
NAS and enterprise drives. Second choice, if SCT ERC cannot be configured, is
to raise the SCSI command timer to something like 120-180s, which I know is
crazy but there you go. More here:
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

Note that SCT ERC is a drive firmware setting, whereas the SCSI command timer
is a kernel timer that applies per (whole) block device. Both reset to their
defaults on power cycle, so you can use a udev rule for either of them (you
only need one; again, SCT ERC is preferred if available). SCT ERC is set with
smartctl, and the SCSI command timer is set via sysfs, so you can just echo
the value you want.
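Off the top of my head, with sdX standing in for your actual drives (untested
here, so double check the output before relying on it), the one-shot versions
look something like this:

# smartctl -l scterc /dev/sdX               (report the current SCT ERC setting)
# smartctl -l scterc,70,70 /dev/sdX         (set read/write ERC to 7.0 seconds)
# cat /sys/block/sdX/device/timeout         (current SCSI command timer, in seconds)
# echo 180 > /sys/block/sdX/device/timeout  (only if SCT ERC can't be enabled)

And a rough sketch of a udev rule to make the smartctl variant stick across
power cycles (adjust the match to your devices):

ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"

If smartctl reports that SCT ERC is unsupported on these drives, fall back to
the command timer approach instead.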
> followed by BTRFS errors, "BTRFS: Transaction aborted (error -5)", and a stack
> trace. I tried a few things (unsuccessfully) and had to power down the
> hardware. After a power cycle the file system mounted R/W as normal. I then
> ran a btrfs check and btrfs scrub which showed no errors.

Make sure the timeout mismatch doesn't exist, and if you make a change, redo
the scrub to be sure. You really want the drive to give up as fast as
possible so that Btrfs is informed of the bad read, including the bad
sector's LBA. Btrfs can then look up what's stored there, find a good copy on
the mirror, and write over the bad sector, which fixes it up.

> In October 2019 I had a similar event. Logs show SATA "hard resetting link"
> for both drives as well as BTRFS "lost page write due to IO error" messages.

Yeah, it's virtually certain this is a misconfiguration. Very common. Bad
sectors start out recovering fairly quickly, but get slower and slower,
eventually hitting the 30s SCSI command timer, at which point the kernel does
a link reset. And then these things never get fixed up, because when the link
is reset the entire drive command queue is lost. That means no discrete read
error, and no LBA for the bad sector, which Btrfs needs in order to find a
good copy of that data and write it back over the bad sector, thereby fixing
it.

> The logs show "forced readonly", "BTRFS: Transaction aborted (error -5)" and a
> stack trace. After hard power cycle I ran a "btrfs check -p" which resulted
> in a stream messages like "parent transid verify failed on 2418006753280
> wanted 34457 found 30647" and then the following:
> parent transid verify failed on 2417279598592 wanted 35322 found 29823

Yuck. That's a lot of missing generations. I have no idea how that's
possible, especially if this is raid1 (or DUP) metadata.

> # btrfs dev stat /mnt/
> [/dev/mapper/K1JG82AD].write_io_errs 15799831
> [/dev/mapper/K1JG82AD].read_io_errs 15764242
> [/dev/mapper/K1JG82AD].flush_io_errs 4385

That is incredibly bad. That drive has dropped many millions of writes, and
the flush errors could mean it is failing to adhere to proper write ordering.
Maybe Qu can tell us.

> At this point I was still seeing occasional log entries for "parent transid
> verify failed" and "read error corrected" so I decided to upgrade from Debian9
> to Debian10 to get more current tools. Running a scrub with Debian10 tools I
> saw errors detected and corrected... I also saw sata link issues during the
> scrub...

Right. You'll also find this discussed a ton on the linux-raid@ list for the
very same reason; it's not unique to Btrfs.

> # date
> Mon 09 Dec 2019 10:29:05 PM EST
> # btrfs scrub start -B -d /dev/disk/by-uuid/X
> scrub device /dev/mapper/K1JG82AD (id 1) done
> scrub started at Sun Dec 8 23:06:59 2019 and finished after 05:46:26
> total bytes scrubbed: 2.80TiB with 9490467 errors
> error details: verify=1349 csum=9489118
> corrected errors: 9490467, uncorrectable errors: 0, unverified errors: 0

Cool! BTW, you'll want to reset those device stats with -z so that you can
tell whether these errors start happening again.

> 1) How should I interpret these errors? Seems that btrfs messages are telling
> me that there are an abundance of errors everywhere, but that they are all
> correctable... Should I panic? Should I proceed?

Never panic. Panicking usually just leads to misunderstanding the problem and
making bad decisions. The drive with all the errors has a bunch of bad
sectors; maybe by now they've been repaired, but that won't be clear until
you make sure the timeout mismatch is addressed. Then the scrub will be
reliable, even if there are stalls on bad sectors that take a damn long time
to either read successfully or fail with a proper error. What you really want
are consumer drives with configurable SCT ERC, or "NAS" drives, which already
default to 70 deciseconds.

> 2) Is my file system broken? Is my data corrupted? Should I be able to scrub
> etc to get back to operation without scary log messages? Can I trust data that
> I copy out now, or need to fall back on old/incomplete backups?

Btrfs will return EIO rather than propagate corrupt data to user space, so
you can trust the data you copy out. That assumes all of the data has
checksums; if you're using nocow for anything, there can be silent
corruption.

> 3) What steps are recommended to backup/offload/recover data? I am considering
> installing the disks into a different machine, then mounting the array read-
> only, and then pulling a full copy of the data...
>
> 4) What steps should I take to clean up the file system errors/messages? Start
> fresh after full backup, (though I hate the idea of migrating off a redundant
> array onto a single disk in the process)? Scrub etc? Evaluate each disk
> independently and rebuild one from the other?

I think all of these are addressed, but let me know if something isn't clear.
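To put it into a rough order of operations, using your /mnt mount point and
only the commands already discussed above (adjust to taste):

1. Fix the timeout mismatch on both drives (SCT ERC via smartctl if the
   drives support it, otherwise the sysfs command timer).
2. # btrfs device stats -z /mnt
3. # btrfs scrub start -B -d /mnt
4. Check dmesg and btrfs device stats afterward. If the counters stay at
   zero you're in reasonable shape; if they start climbing again, that drive
   is still growing bad sectors and deserves another look.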
--
Chris Murphy