On Thu, Jul 13, 2017 at 12:17:16PM -0600, Chris Murphy wrote:
> Well I'd say it's a bug, but that's not a revelation. Is there a
> snapshot being deleted in the approximate time frame for this? I see a

Yep :) I run btrfs-snaps and it happens right around that time. It
creates a snapshot and deletes the oldest one. There is likely a race
condition when you delete one or more snapshots just after creating one
on the same subvolume, although this has worked for about 3 years up to
now.
http://marc.merlins.org/perso/btrfs/post_2014-03-21_Btrfs-Tips_-How-To-Setup-Netapp-Style-Snapshots.html
http://marc.merlins.org/linux/scripts/btrfs-snaps

Sure, I can start adding sleeps between creation and deletion, but I
haven't had to so far.

> snapshot is being cleaned up and chunks being removed. So I wonder if
> this can be avoided or intentionally triggered by manipulating
> snapshot deletion coinciding with the workload? Maybe it's a race, and
> that's why it hits EEXIST, and if so then it's just getting confused
> and needs to start from scratch - if true then it's OK to just umount
> and mount (rw) again and continue on.

Which is what I've been doing.

> There are some changes in the code between 4.9.36 and 4.12.1 (not sure
> when the change was introduced, or if it alters whether you hit this
> bug)

I don't know whether I hit the bug with 4.11 or 4.12, since I didn't
stay on them long enough to be sure. (I don't think I hit it on 4.11,
but given the corruption issues I had, which I'm still not sure were due
to the kernel or other factors, I rolled back as discussed earlier.)

On my biggest system, I'm still debugging an issue where 3 of my 8
drives get pseudo-randomly kicked out after returning corrupted data
for a few seconds.
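For reference, the create-then-rotate cycle described above boils down to
something like the sketch below. This is not the actual btrfs-snaps script:
the subvolume path, the `hourly_` naming scheme, and the retention count are
made up for illustration, and the sleep is the workaround mentioned above,
not something the real script currently does.

```shell
#!/bin/sh
# Minimal sketch of a create-then-rotate snapshot cycle.
# Assumed values: /mnt/btrfs_pool1/home, "_hourly_" naming, keep=3.
set -e
vol=${1:-/mnt/btrfs_pool1/home}   # subvolume to snapshot (assumed path)
keep=${2:-3}                      # how many snapshots to retain

rotate_snapshots() {
    # Create the new read-only snapshot first...
    btrfs subvolume snapshot -r "$vol" \
        "${vol}_hourly_$(date +%Y%m%d_%H%M%S)"
    # ...then pause before deleting, to sidestep the suspected race
    # between snapshot creation and deletion on the same subvolume.
    sleep 5
    # Delete everything older than the newest $keep snapshots
    # (GNU "head -n -N" prints all lines except the last N).
    ls -d "${vol}"_hourly_* | head -n -"$keep" | while read -r snap; do
        btrfs subvolume delete "$snap"
    done
}

# Only act if the subvolume actually exists on this machine.
if [ -d "$vol" ]; then
    rotate_snapshots
fi
```

Because snapshot names sort lexically by timestamp, plain `ls` ordering is
enough to find the oldest ones.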
I'm pretty sure it's not an issue with the drives, but I'm not sure
whether it's the disk carrier/enclosure, the cables, or the actual ports
on the SAS card (I'm working through the option matrix to find out).

> Another thing I'm not certain of is if the dm-2 reference is just how
> it's referring to the file system, or if it's to be taken literally as
> an issue with this device. My understanding of the code is really
> weak, but I think this whole trace is within Btrfs logical block
> handling, in which case it wouldn't know of a problem with a
> particular device. It knows that it's in the weeds, but has no idea
> what golf course it's on.

dm-2 is correct, it does refer to the right device:

gargamel:~# dmsetup status -v dshelf1
Name:              dshelf1
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 2
Number of targets: 1
UUID: CRYPT-LUKS1-3cd9bbafa2bb44a587a658a77487ee73-dshelf1_unformatted
0 46883102704 crypt

gargamel:~# l /dev/mapper/dshelf1 /dev/dm-2
brw-rw---- 1 root disk 253, 2 Jul 14 06:30 /dev/dm-2
lrwxrwxrwx 1 root root      7 Jul 14 06:30 /dev/mapper/dshelf1 -> ../dm-2

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   |   PGP 1024R/763BE901
