On 23.06.20 at 11:00, Russell Coker wrote:
> On Tuesday, 23 June 2020 4:03:37 PM AEST Nikolay Borisov wrote:
>>> I have a USB stick that's corrupted, I get the above kernel messages when
>>> I
>>> try to copy files from it. But according to btrfs dev sta it has had 0
>>> read and 0 corruption errors.
>>>
>>> root@xev:/mnt/tmp# btrfs dev sta .
>>> [/dev/sdc1].write_io_errs 0
>>> [/dev/sdc1].read_io_errs 0
>>> [/dev/sdc1].flush_io_errs 0
>>> [/dev/sdc1].corruption_errs 0
>>> [/dev/sdc1].generation_errs 0
>>> root@xev:/mnt/tmp# uname -a
>>> Linux xev 5.6.0-2-amd64 #1 SMP Debian 5.6.14-1 (2020-05-23) x86_64
>>> GNU/Linux
>> The read/write io err counters are updated when even repair bio have
>> failed. So in your case you had some checksum errors, but btrfs managed
>> to repair them by reading from a different mirror. In this case those
>> aren't really counted as io errs since in the end you did get the
>> correct data.
>
> In this case I'm getting application IO errors and lost data, so if the error
> count is designed to not count recovered errors then it's still not doing the
> right thing.

In this case yes, however that was not at all clear from your initial
email; in fact it seems you have omitted quite a lot of information. So
let's step back and start afresh. First, give information about your
current btrfs setup by providing the output of:

btrfs fi usage /path/to/btrfs

From the output provided it seems the affected mirror is '1', which to
me implies you have at least one other disk containing the same data. So
unless you have errors in mirror 0 as well, those errors should be
recoverable simply by reading from that mirror.

>
> # md5sum *
> md5sum: 'Rise of the Machines S1 Ep6 - Mega Digger-qcOpMtIWsrgN.mp4': Input/
> output error
> md5sum: 'Rise of the Machines S1 Ep7 - Ultimate Dragster-Ke9hMplfQAWF.mp4':
> Input/output error
> md5sum: 'Rise of the Machines S1 Ep8 - Aircraft Carrier-Qxht6qMEwMKr.mp4':
> Input/output error
> ^C

You are trying to md5sum 3 distinct files...

> # btrfs dev sta .
> [/dev/sdc1].write_io_errs 0
> [/dev/sdc1].read_io_errs 0
> [/dev/sdc1].flush_io_errs 0
> [/dev/sdc1].corruption_errs 0
> [/dev/sdc1].generation_errs 0
> # tail /var/log/kern.log
> Jun 23 17:48:40 xev kernel: [417603.547748] BTRFS warning (device sdc1): csum
> failed root 5 ino 275 off 59580416 csum 0x8941f998 expected csum 0xb5b581fc
> mirror 1
> Jun 23 17:48:40 xev kernel: [417603.609861] BTRFS warning (device sdc1): csum
> failed root 5 ino 275 off 60628992 csum 0x8941f998 expected csum 0x4b6c9883
> mirror 1
> Jun 23 17:48:40 xev kernel: [417603.672251] BTRFS warning (device sdc1): csum
> failed root 5 ino 275 off 61677568 csum 0x8941f998 expected csum 0x89f5fb68
> mirror 1

... yet here all the errors happen in a single inode, namely 275, so the
md5sum commands do not correspond to those errors specifically. Also
provide the name of inode 275.

And for good measure also provide the output of "btrfs check /dev/sdc1" -
this is a read-only command, so if there is some metadata corruption it
will at least not make it any worse.
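In case it is not obvious how to get that name: assuming the stick is
still mounted at /mnt/tmp, as in your earlier output, something along
the lines of

# btrfs inspect-internal inode-resolve 275 /mnt/tmp

should resolve the inode number to its path(s). Note that "btrfs check"
should be run against an unmounted filesystem, so unmount the stick
before running it.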
> # uname -a
> Linux xev 5.6.0-2-amd64 #1 SMP Debian 5.6.14-1 (2020-05-23) x86_64 GNU/Linux
>
> On Tuesday, 23 June 2020 4:17:55 PM AEST waxhead wrote:
>> I don't think this is what most people expect.
>> A simple way to solve this could be to put the non-fatal errors in
>> parentheses if this can be done easily.
>
> I think that most people would expect a "device stats" command to just give
> stats of the device and not refer to what happens at the higher level. If a
> device is giving corruption or read errors then "device stats" should tell
> that.

That's a fair point.

>
> On Tuesday, 23 June 2020 5:11:05 PM AEST Nikolay Borisov wrote:
>> read_io_errs. But this leads to a different can of worms - if a user
>> sees read_io_errs should they be worried because potentially some data
>> is stale or not (give we won't be distinguishing between unrepairable vs
>> transient ones).
>
> If a user sees errors reported their degree of worry should be based on the
> degree to which they use RAID and have decent backups. If you have RAID-1 and
> only 1 device has errors then you are OK. If you have 2 devices with errors
> then you have a problem.
>
> Below is an example of a zpool having errors that were corrected. The DEVICE
> had an unrecoverable error, but the RAID-Z pool recovered it from other
> devices. It states that "Applications are unaffected" so the user knows the
> degree of worry that should be involved.

BTRFS' internal structure is very different from ZFS', so we don't have
the notion of a vdev consisting of multiple child devices, where each
physical device and the vdev itself can be reported on as a separate
device. So again, without extending the on-disk format, i.e. introducing
new items or changing the format of existing ones, we can't accommodate
exactly the same kind of report. And while the on-disk format can be
changed (which of course comes with added complexity), there should be a
very good reason to do so.

Clearly something is amiss in your case, but I would like to properly
root cause it first before jumping to conclusions.

>
> # zpool status
>   pool: pet630
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://zfsonlinux.org/msg/ZFS-8000-9P
>   scan: scrub repaired 380K in 156h39m with 0 errors on Sat Jun 20 13:03:26
> 2020
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pet630      ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             sdf     ONLINE       0     0     0
>             sdq     ONLINE       0     0     0
>             sdd     ONLINE       0     0     0
>             sdh     ONLINE       0     0     0
>             sdi     ONLINE      41     0     1
>
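One more thing that might help with the root-causing (a suggestion,
again assuming the stick is still mounted at /mnt/tmp): a read-only
scrub will walk the whole device and count every checksum error without
attempting to correct anything, which should tell us whether the damage
is confined to inode 275 or is more widespread:

# btrfs scrub start -Bdr /mnt/tmp

Here -B keeps the scrub in the foreground and prints the statistics when
it finishes, -d prints per-device statistics and -r makes it read-only.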
