Hello all,

I have been running Rockstor 3.8.16-8 on an older Dell Optiplex for about a month. The system has four drives separated into two RAID1 filesystems ("pools" in Rockstor terminology). A few days ago I restarted it and noticed that the services (NFS, Samba, etc.) weren't working. Looking at dmesg, I saw:

    kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121

and sure enough, one of the subvolumes on my main filesystem is corrupted. By corrupted I mean it can't be accessed, deleted, or even listed:

    # ls -l
    kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121
    kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121
    ls: cannot access /mnt2/Primary/Movies: Input/output error
    total 16
    drwxr-xr-x 1 root      root         100 Dec 29 02:00 .
    drwxr-xr-x 1 root      root         208 Jan  3 12:05 ..
    drwxr-x--- 1 kbogert   root         698 Feb  6 08:49 Documents
    drwxr-xrwx 1 root      root         916 Jan  3 12:54 Games
    drwxr-xrwx 1 xenserver xenserver   2904 Jan  3 12:54 ISO
    d????????? ? ?         ?              ?            ? Movies
    drwxr-xrwx 1 root      root      139430 Jan  3 12:53 Music
    drwxr-xrwx 1 root      root       82470 Jan  3 12:53 RawPhotos
    drwxr-xr-x 1 root      root          80 Jan  1 04:00 .snapshots
    drwxr-xrwx 1 root      root          72 Jan  3 13:07 VMs

The Input/output error is given for any operation on Movies.

Luckily there has been no data loss that I am aware of. As it turns out, I have a snapshot of the Movies subvolume taken a few days before the incident. I was able to simply cp -a all files off of the entire filesystem with no reported errors, and I verified a handful of them. Note that the transid error in dmesg alternates between sdb and sda5 after each startup.
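For reference, the recovery step above amounts to an archive copy out of the pre-incident read-only snapshot. A minimal sketch follows; the temp directories are stand-ins for my actual snapshot and destination paths so the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch of the recovery step: archive-copy everything out of a read-only
# snapshot. In my case the source was a snapshot under /mnt2/Primary/.snapshots
# and the destination was a separate drive; temp directories stand in here.
SNAP=$(mktemp -d)   # stands in for the pre-incident snapshot of Movies
DEST=$(mktemp -d)   # stands in for the destination drive
echo "movie data" > "$SNAP/example.mkv"
# cp -a preserves permissions, ownership, timestamps, and symlinks
cp -a "$SNAP"/. "$DEST"/
```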
SETUP DETAILS

    # uname -a
    Linux ironmountain 4.8.7-1.el7.elrepo.x86_64 #1 SMP Thu Nov 10 20:47:24 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

    # btrfs --version
    btrfs-progs v4.8.3

    # btrfs dev scan
    kernel: BTRFS: device label Primary devid 1 transid 83461 /dev/sdb
    kernel: BTRFS: device label Primary devid 2 transid 83461 /dev/sda5

    # btrfs fi show /mnt2/Primary
    Label: 'Primary'  uuid: 21e09dd8-a54d-49ec-95cb-93fdd94f0c17
        Total devices 2  FS bytes used 943.67GiB
        devid    1 size 2.73TiB used 947.06GiB path /dev/sdb
        devid    2 size 2.70TiB used 947.06GiB path /dev/sda5

    # btrfs dev usage /mnt2/Primary
    /dev/sda5, ID: 2
       Device size:             2.70TiB
       Device slack:              0.00B
       Data,RAID1:            944.00GiB
       Metadata,RAID1:          3.00GiB
       System,RAID1:           64.00MiB
       Unallocated:             1.77TiB

    /dev/sdb, ID: 1
       Device size:             2.73TiB
       Device slack:              0.00B
       Data,RAID1:            944.00GiB
       Metadata,RAID1:          3.00GiB
       System,RAID1:           64.00MiB
       Unallocated:             1.80TiB

    # btrfs fi df /mnt2/Primary
    Data, RAID1: total=944.00GiB, used=942.60GiB
    System, RAID1: total=64.00MiB, used=176.00KiB
    Metadata, RAID1: total=3.00GiB, used=1.07GiB
    GlobalReserve, single: total=512.00MiB, used=0.00B

This server sees very light use; however, I do have a number of VMs in the VMs subvolume, exported over NFS, that are used by a Xenserver. These are not marked nocow, though they probably should have been. At the time of the restart no VMs were running.

I have deviated from Rockstor's default setup a bit. They take an "appliance" view and try to enforce btrfs partitions that cover entire disks. I installed Rockstor onto /dev/sda4, created the Primary pool on /dev/sdb using Rockstor's GUI, then on the command line added /dev/sda5 to it and converted it to RAID1. As far as I can tell, Rockstor is just CentOS 7 with a few updated utilities and a bunch of Python scripts providing a web interface to btrfs-progs.

I have it set up to take monthly snapshots and do monthly scrubs, with the exception of the Documents subvolume, which gets daily snapshots. These are all read-only and go in the .snapshots directory.
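For clarity, the command-line add-and-convert step mentioned above was the usual two-command btrfs procedure. The sketch below only prints the commands rather than running them (they require root and the live mounted filesystem); the device and mount point are the ones from my setup:

```shell
#!/bin/sh
# Sketch of the add-then-convert procedure I used on the command line.
# Commands are printed, not executed: they need root and a mounted btrfs
# filesystem. Paths match the setup described above.
MNT=/mnt2/Primary
NEWDEV=/dev/sda5
CMDS="btrfs device add $NEWDEV $MNT
btrfs balance start -dconvert=raid1 -mconvert=raid1 $MNT"
printf '%s\n' "$CMDS"
```

The balance with convert filters is what rewrites the existing single-device chunks so that both data and metadata end up mirrored across the two devices.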
Rockstor automatically deletes old snapshots once a limit is reached (7 daily snapshots, for instance).

Side note: btrfs-progs 4.8.3 apparently has problems with CentOS 7's glibc: https://github.com/rockstor/rockstor-core/issues/1608 . I have confirmed that bug in my own compiled version of 4.8.3, and that 4.9.1 does not have it.

WHAT I'VE TRIED AND RESULTS

First off, I have created an image with btrfs-image that I can make available (though it is large; I believe it was a few GBs, and the filesystem is 3 TB).

* btrfs-zero-log had no discernible effect.

At this point I compiled btrfs-progs 4.9.1. The following commands were run with that version:

* btrfs check

  This exits in an assert fairly quickly:

    checking extents
    cmds-check.c:5406: check_owner_ref: BUG_ON `rec->is_root` triggered, value 1
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x42139b]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x421483]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x430529]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x43160c]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x435d6f]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x43ab71]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x43b065]
    /mnt/usb/btrfs-progs-bin/bin/btrfs(cmd_check+0xbbc)[0x441b82]
    /mnt/usb/btrfs-progs-bin/bin/btrfs(main+0x12b)[0x40a734]
    /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff6fa7b35]
    /mnt/usb/btrfs-progs-bin/bin/btrfs[0x40a179]

  The full backtrace is attached as btrfsck_debug.log.

* btrfs check --mode=lowmem

  This outputs a large number of errors before finally segfaulting. The full backtrace is attached as btrfsck_lowmem_debug.log.

* btrfs scrub completes with no errors.

* Memtest86 completed more than 6 passes with no errors (I left it running for a day).

* No SMART errors, and btrfs device stats shows no errors. The drives the filesystem is on are brand new.

* I have tried to recreate the problem by installing Rockstor into a number of VMs and redoing my steps, with no luck.
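As an aside, the snapshot retention described under SETUP DETAILS (keep the newest 7 dailies, delete anything older) behaves roughly like the stand-in sketch below. This is not Rockstor's actual code; plain temp directories stand in for the read-only snapshots so it runs anywhere:

```shell
#!/bin/sh
# Stand-in sketch of the retention behaviour (not Rockstor's actual code):
# keep the newest KEEP snapshots, remove the older ones. Timestamped names
# sort chronologically, so a reverse name sort puts the newest first.
KEEP=7
SNAPDIR=$(mktemp -d)   # stands in for /mnt2/Primary/.snapshots
for i in 01 02 03 04 05 06 07 08 09 10; do
    mkdir "$SNAPDIR/Documents_201701$i"
done
# Newest first, skip the first KEEP, delete the rest (the oldest).
ls "$SNAPDIR" | sort -r | tail -n +$((KEEP + 1)) | while read -r s; do
    rmdir "$SNAPDIR/$s"
done
```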
The main Rockstor partition (btrfs) was not affected, nor was the other RAID1 pool on completely separate drives.

I can provide any other logs requested. Help would be greatly appreciated!

Kenneth Bogert
Attachments:
btrfsck_lowmem_debug.log
btrfsck_debug.log
