Thanks for the reply!

On 16.01.19 at 01:41, Chris Murphy wrote:
> The relevant error messages are:
>
> unable to find ref byte
> errno=-2 No such entry
>
> Somehow a reference byte has been corrupted and inserted into multiple
> locations in the tree and it's not repairable: i.e. neither a correct
> value can be inferred from other available information, nor do the
> tools have a good way to just trim out the item that contains bad key
> pointers - part of the problem with just cutting out the bad parts is
> it's not clear the problem is made even worse or how far the
> corruption extends.
>
> What's further troubling though is the idea that this corruption might
> have propagated to a separate volume via snapshot send receive. Either
> of the file systems might still be useful for a developer, it seems to
> me important to have some kind of check to make sure it's not possible
> for corruption to propagate in this manner.
>
> In the meantime, I think it's a good idea to do a memory test. There's
> some information in the archives about how to do this in a more
> reliable way than just memtest86 type tests, but if you can run even a
> memtest86 over a weekend it might confirm there's a memory problem.
> Unfortunately a pass doesn't necessarily mean there aren't rare
> transient problems.

There are some things which do not quite match up with a broken-memory
explanation, unless my understanding is wrong. I'll try to explain more
concisely:

- The broken file system is on an external USB drive (SMR, sadly!) and
  was used as a backup target for btrfs send of snapshots.
- The machine sending data there does not have a corrupted filesystem.
  It scrubs perfectly fine. The disk was only connected to that machine
  for backups, from time to time.
- To salvage data from the broken FS, I have now mounted it read-only
  (to prevent btrfs-cleaner from kicking in) and sent all snapshots
  (via btrbk archive) to a fresh filesystem (on a non-SMR disk).
While the broken filesystem was mounted read-only, no corruption error
showed up in syslog. Checking the new filesystem, which has received all
snapshots, with "btrfs check --readonly" shows no corruption either.

So I have to conclude that the corruption was not part of any snapshot
that was sent, which would mean it is confined to a subvolume pending
cleanup by btrfs-cleaner. The only way corruption could have crept in
from the machine's memory would then have been during the actual
send / receive. Also, since sending from the corrupted FS worked, I
presume this corruption only affects subvolumes marked for deletion,
which can't be deleted because of the corruption.

It *might* have happened that during the reboot after the kernel upgrade
(after which the corruption appeared), the disk was not unmounted
properly (while btrfs-cleaner was running). Unmounting that SMR disk
while deferred activities are going on may take many minutes, and
something may have timed out during shutdown. I can't exclude this, and
since btrfs-cleaner continued after the reboot, it is indeed pretty
likely.

Is an interrupted btrfs-cleaner run a possible explanation for this
issue? This would also explain why the re-sent snapshots all seem fine.

The filesystem itself holds 1.2 TB of personal content. If there is a
way to extract just the bits that matter to the developers and strip
anything about the actual content, of course I can do that.

Cheers,
Oliver

>
>
> Chris Murphy
>
