Thanks for the quick replies. It took me a while to put this reply together, though.

I realized that I had made a huge mistake by relying on a backup strategy that syncs valuable data between two computers at two sites, while completely ignoring the possibility of a disk failure at both sites within the same "btrfs scrub" interval. That is what has happened now. (We are talking about more than 6 months, which is of course a long period. Not monitoring the filesystem for that long is obviously my fault; I accept that, and lessons learned: https://github.com/ceremcem/monitor-btrfs-disk)

My first action was to determine which files are corrupted. I had been wondering whether insisting on running CouchDB on BTRFS would eventually cause a failure, so this list of corrupted files might help shed light on the cause: https://gist.github.com/ceremcem/b507be2669682857f37039eb9655d7ad

My second action, as there is only one disk present at the moment, is to convert the Single data profile to DUP in order to be able to fix any further corruption (so far I couldn't, due to "Input/output error"s). In the meantime I'll replace the disk with two new disks and set up RAID-1 on them.

While searching for "converting to DUP profile", I noticed that the btrfs man page explicitly states:

> In any case, a device that starts to misbehave and repairs from the DUP copy should be replaced! DUP is not backup.

Based on that, uncorrectable errors (in the Single profile) also mean that we should replace the misbehaving disk.

> Try 'smartctl -t long', then wait some minutes (it will give you an
> estimate of how many), then look at the detailed self-test log output from
> 'smartctl -x'. The long self-test usually reads all sectors on the disk
> and will quantify errors (giving UNC sector counts and locations).

I tried this one, however I couldn't interpret the results.
Here is the `smartctl -a /dev/sda` output: https://gist.github.com/1a741135af10f6bebcaf6175c04594df

> You need to look at the specific error counts individually, as they
> indicate different problems. There are 5 kinds of uncorrectable
> error:

`btrfs scrub` isn't giving us those kinds of details, or is it? How can we get such a detailed report?

Thank you all for those detailed answers.

On Wed, 11 Dec 2019 at 19:00, Adam Borowski <kilobyte@xxxxxxxxxx> wrote:
>
> On Wed, Dec 11, 2019 at 04:11:05PM +0300, Cerem Cem ASLAN wrote:
> > This is the second time after a year that the server's disk throws
> > "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> > along with some corrected errors. However, "smartctl -x" displays
> > "SMART overall-health self-assessment test result: PASSED".
> >
> > Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> > "time to replace the disk" or are those unrelated events?
>
> "btrfs scrub" operates on a higher layer, and can detect more errors, some
> of which may have a cause elsewhere. For example, dodgy memory very often
> corrupts data this way; you can retry the scrub to see if the corruption
> happened during write (so the data is lost) or during read (so retrying
> should work). In that case, you may want to test and/or replace your
> memory, motherboard, processor, etc.
>
> Or, the disk's firmware may fail to detect errors. It's supposed to verify
> disk's internal checksum but detecting errors is another place where a dodgy
> manufacturer can shave some costs -- either intentionally, or by neglecting
> testing.
>
> Or, some buggy software (which may even include btrfs itself, albeit
> unlikely) might scribble on wrong areas of the disk.
>
> Or...
>
>
> Anyway, all you know for sure that you have _some_ breakage, which a
> filesystem without data checksums would fail to detect, allowing silent data
> corruption. Finding the cause is another story.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀ A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol,
> ⣾⠁⢠⠒⠀⣿⡁ 1kg raspberries, 0.4kg sugar; put into a big jar for 1 month.
> ⢿⡄⠘⠷⠚⠋⠀ Filter out and throw away the fruits (can dump them into a cake,
> ⠈⠳⣄⠀⠀⠀⠀ etc), let the drink age at least 3-6 months.
