Hi,

I’ve been using a btrfs RAID1 of two hard disks since early 2012 on my home server. The machine has been working well overall, but recently some problems with the file system surfaced. Since I do have backups, I am not worried about the data, but I am posting here to better understand what happened. It is also possible that my case is useful in some way to btrfs development.
First some information about the system:

root@mim:~# uname -a
Linux mim 4.6.0-1-amd64 #1 SMP Debian 4.6.3-1 (2016-07-04) x86_64 GNU/Linux
root@mim:~# btrfs --version
btrfs-progs v4.7.3
root@mim:~# btrfs fi show
Label: none  uuid: 2da00153-f9ea-4d6c-a6cc-10c913d22686
        Total devices 2 FS bytes used 345.97GiB
        devid    1 size 465.29GiB used 420.06GiB path /dev/sda2
        devid    2 size 465.29GiB used 420.04GiB path /dev/sdb2
root@mim:~# btrfs fi df /
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
root@mim:~# dmesg | grep -i btrfs
[    4.165859] Btrfs loaded
[    4.481712] BTRFS: device fsid 2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 1 transid 2075354 /dev/sda2
[    4.482025] BTRFS: device fsid 2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 2 transid 2075354 /dev/sdb2
[    4.521090] BTRFS info (device sdb2): disk space caching is enabled
[    4.628506] BTRFS info (device sdb2): bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[    4.628521] BTRFS info (device sdb2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[   18.315694] BTRFS info (device sdb2): disk space caching is enabled
The disks themselves have been turning for almost 5 years by now, but their SMART health is still fully satisfactory.
I noticed that something was wrong because printing stopped working. So I ran a scrub, which detected 0 "correctable" and 6 "uncorrectable" errors. The relevant bits from kern.log are:
Jan 11 11:05:56 mim kernel: [159873.938579] BTRFS warning (device sdb2): checksum error at logical 180829634560 on dev /dev/sdb2, sector 353143968, root 5, inode 10014144, offset 221184, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
Jan 11 11:05:57 mim kernel: [159874.857132] BTRFS warning (device sdb2): checksum error at logical 180829634560 on dev /dev/sda2, sector 353182880, root 5, inode 10014144, offset 221184, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
Jan 11 11:28:42 mim kernel: [161240.083721] BTRFS warning (device sdb2): checksum error at logical 260254629888 on dev /dev/sda2, sector 508309824, root 5, inode 9990924, offset 6676480, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:28:42 mim kernel: [161240.235837] BTRFS warning (device sdb2): checksum error at logical 260254638080 on dev /dev/sda2, sector 508309840, root 5, inode 9990924, offset 6684672, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:37:21 mim kernel: [161759.725120] BTRFS warning (device sdb2): checksum error at logical 260254629888 on dev /dev/sdb2, sector 508270912, root 5, inode 9990924, offset 6676480, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:37:21 mim kernel: [161759.750251] BTRFS warning (device sdb2): checksum error at logical 260254638080 on dev /dev/sdb2, sector 508270928, root 5, inode 9990924, offset 6684672, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
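
To make the pattern in these warnings easier to see, here is a minimal Python sketch that groups them by logical address and lists which devices reported each one. The log text is the six warnings above with the syslog prefix trimmed; the regular expression and the function name group_by_logical are my own illustration and only assume the message layout shown here.

```python
import re
from collections import defaultdict

# The six scrub warnings from kern.log, with the timestamp prefix removed.
LOG = """\
BTRFS warning (device sdb2): checksum error at logical 180829634560 on dev /dev/sdb2, sector 353143968, root 5, inode 10014144, offset 221184, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
BTRFS warning (device sdb2): checksum error at logical 180829634560 on dev /dev/sda2, sector 353182880, root 5, inode 10014144, offset 221184, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
BTRFS warning (device sdb2): checksum error at logical 260254629888 on dev /dev/sda2, sector 508309824, root 5, inode 9990924, offset 6676480, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
BTRFS warning (device sdb2): checksum error at logical 260254638080 on dev /dev/sda2, sector 508309840, root 5, inode 9990924, offset 6684672, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
BTRFS warning (device sdb2): checksum error at logical 260254629888 on dev /dev/sdb2, sector 508270912, root 5, inode 9990924, offset 6676480, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
BTRFS warning (device sdb2): checksum error at logical 260254638080 on dev /dev/sdb2, sector 508270928, root 5, inode 9990924, offset 6684672, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
"""

# Assumed message layout: "... logical <n> on dev <device>, ... (path: <path>)"
PATTERN = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>[^,\s]+),"
    r".*?\(path: (?P<path>[^)]+)\)"
)


def group_by_logical(log_text):
    """Map each logical address to the set of devices that reported it."""
    devices = defaultdict(set)
    for match in PATTERN.finditer(log_text):
        devices[int(match.group("logical"))].add(match.group("dev"))
    return dict(devices)


groups = group_by_logical(LOG)
for logical, devs in sorted(groups.items()):
    print(logical, sorted(devs))
```

Each logical address shows up once per mirror, which is what makes independent bad blocks on the two disks an unlikely explanation.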
As you can see, each disk has the same three errors, and there are no other errors. Random bad blocks cannot explain this situation. I asked on #btrfs and someone suggested that these errors are likely due to RAM problems. This may indeed be the case, since the machine has no ECC. I managed to fix these errors by replacing the broken files with good copies. Scrubbing shows no errors now:
root@mim:~# btrfs scrub status /
scrub status for 2da00153-f9ea-4d6c-a6cc-10c913d22686
        scrub started at Sat Jan 14 12:52:03 2017 and finished after 01:49:10
        total bytes scrubbed: 699.17GiB with 0 errors

However, there are further problems. When trying to archive the full filesystem I noticed that some files/directories cannot be read. (The problem is localized to some ".git" directory that I don’t need.) Any attempt to read the broken files (or to delete them) does not work:
$ du -sh .git
du: cannot access '.git/objects/28/ea2aae3fe57ab4328adaa8b79f3c1cf005dd8d': No such file or directory
du: cannot access '.git/objects/28/fd95a5e9d08b6684819ce6e3d39d99e2ecccd5': Stale file handle
du: cannot access '.git/objects/28/52e887ed436ed2c549b20d4f389589b7b58e09': Stale file handle
du: cannot access '.git/objects/info': Stale file handle
du: cannot access '.git/objects/pack': Stale file handle

During the above command the following lines were added to kern.log:
Jan 16 09:41:34 mim kernel: [132206.957566] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.957924] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.958505] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.958971] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.959534] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.959874] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.960523] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.960943] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
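
For anyone who wants to list such unreadable entries systematically rather than reading du's error output, a rough Python sketch follows. It walks a directory tree, stats every entry, reads every regular file to the end, and records the errno name of each failure (for example ENOENT or ESTALE). The function name find_unreadable is hypothetical, and this is only an illustration of the "read everything" check, not something btrfs itself provides.

```python
import errno
import os
import stat


def find_unreadable(root):
    """Return (path, errno name) for each entry under root that fails to stat or read."""
    failures = []

    def record(path, exc):
        # Translate the numeric errno into its symbolic name where known.
        failures.append((path, errno.errorcode.get(exc.errno, str(exc.errno))))

    def on_walk_error(exc):
        # Called by os.walk when a directory itself cannot be listed.
        record(exc.filename, exc)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
                # Only read regular files; skip symlinks, FIFOs, devices.
                if stat.S_ISREG(info.st_mode):
                    with open(path, "rb") as handle:
                        while handle.read(1 << 20):
                            pass
            except OSError as exc:
                record(path, exc)
    return failures
```

It could be run as find_unreadable(".git") from the directory containing the damaged repository. Reading every file also forces the kernel to verify data checksums on the copy it reads, but it exercises only one mirror per read, so it complements a scrub rather than replacing it.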
So I tried to repair the file system by running "btrfs check --repair", but this doesn’t work:
(initramfs) btrfs --version
btrfs-progs v4.7.3
(initramfs) btrfs check --repair /dev/sda2
UUID: ...
checking extents
incorrect offsets 2527 2543
items overlap, can't fix
cmds-check.c:4297: fix_item_offset: Assertion `ret` failed.
btrfs[0x41a8b4]
btrfs[0x41a8db]
btrfs[0x42428b]
btrfs[0x424f83]
btrfs[0x4259cd]
btrfs(cmd_check+0x1111)[0x427d6d]
btrfs(main+0x12f)[0x40a341]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd98859d2b1]
btrfs(_start+0x2a)[0x40a37a]

I now have the following questions:

* So scrubbing is not enough to check the health of a btrfs file system? It’s also necessary to read all the files?
* Any ideas what could have caused the "stale file handle" errors? Is there any way to fix them? Of course RAM errors can in principle have _any_ consequences, but I would have hoped that even without ECC RAM it’s practically impossible to end up with an unrepairable file system. Perhaps I simply had very bad luck.
* I believe that btrfs RAID1 is considered reasonably safe for production use by now. I want to replace that home server with a new machine (still without ECC). Is it a good idea to use btrfs for the main file system? I would certainly hope so! :-)
Thanks for your time,
Christoph
