Re: btrfs goes read-only when btrfs-cleaner runs

Thanks for the reply!

On 16.01.19 at 01:41, Chris Murphy wrote:
> The relevant error messages are:
> 
> unable to find ref byte
> errno=-2 No such entry
> 
> Somehow a reference byte has been corrupted and inserted into multiple
> locations in the tree and it's not repairable: i.e. neither a correct
> value can be inferred from other available information, nor do the
> tools have a good way to just trim out the item that contains bad key
> pointers - part of the problem with just cutting out the bad parts is
> it's not clear the problem is made even worse or how far the
> corruption extends.
> 
> What's further troubling though is the idea that this corruption might
> have propagated to a separate volume via snapshot send receive. Either
> of the file systems might still be useful for a developer, it seems to
> me important to have some kind of check to make sure it's not possible
> for corruption to propagate in this manner.
> 
> In the meantime, I think it's a good idea to do a memory test. There's
> some information in the archives about how to do this in a more
> reliable way than just memtest86 type tests, but if you can run even a
> memtest86 over a weekend it might confirm there's a memory problem.
> Unfortunately a pass doesn't necessarily mean there aren't rare
> transient problems.

There are some things which do not quite match up with a broken-memory explanation,
unless my understanding is wrong.

I'll try to explain more concisely:
- The broken file system is on an external USB drive (SMR sadly!) and
  was used as a backup target for btrfs send of snapshots.
- The machine sending data there does not have a corrupted filesystem. 
  It scrubs perfectly fine. The disk was only connected to that machine for backups, 
  from time to time. 
- To salvage data from the broken FS, I have now mounted it read-only (to prevent btrfs-cleaner from kicking in)
  and sent all snapshots (via btrbk archive) to a fresh filesystem (on a non-SMR disk). 
  While the broken filesystem was mounted read-only, no corruption error was shown in syslog.
  Checking the new filesystem, which has received all snapshots, with "btrfs check --readonly"
  shows no corruption.
  So I must deduce the corruption was not part of any snapshot which was sent, which would mean
  the corruption is only part of a subvolume pending cleanup by btrfs-cleaner.
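
As a rough sketch, the salvage steps above correspond to something like the
following. The device names and mount points are placeholders (assumptions, not
the actual ones), and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the salvage steps. All device names and paths below are
# hypothetical placeholders; adjust them before running for real.
BROKEN_DEV=${BROKEN_DEV:-/dev/sdX1}    # damaged backup filesystem (assumed name)
NEW_DEV=${NEW_DEV:-/dev/sdY1}          # fresh btrfs filesystem (assumed name)
BROKEN_MNT=${BROKEN_MNT:-/mnt/broken}
NEW_MNT=${NEW_MNT:-/mnt/new}
DRY_RUN=${DRY_RUN:-1}                  # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

salvage() {
    # Mount read-only so that btrfs-cleaner does not kick in.
    run mount -o ro "$BROKEN_DEV" "$BROKEN_MNT"
    run mount "$NEW_DEV" "$NEW_MNT"
    # Copy all snapshots to the new filesystem via send/receive.
    run btrbk archive "$BROKEN_MNT" "$NEW_MNT"
    # Unmount the target and check it without modifying anything.
    run umount "$NEW_MNT"
    run btrfs check --readonly "$NEW_DEV"
}

salvage
```

With the default DRY_RUN=1 this just lists the five commands in order, which
makes it easy to review them before touching either disk.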

So the only way corruption could have crept in from the machine's memory would have been
during actual send / receive. Also, since sending from the corrupted FS worked, I presume this corruption
only affects subvolumes marked for deletion, which can't be deleted due to the corruption. 

It *might* have happened that during the reboot after the kernel upgrade (after which the corruption appeared),
the disk was not properly unmounted (while btrfs-cleaner was running). Unmounting that SMR disk while deferred
activities are going on may take many minutes, and something may have timed out during shutdown.
I can't exclude this, and since btrfs-cleaner continued after the reboot, that's indeed pretty likely.

Is an interrupted btrfs-cleaner execution a possible explanation for this issue? 
This would also explain why the re-sent snapshots all seem fine. 

The filesystem itself holds 1.2 TB of personal content. If there is a way to extract just the bits that matter to the developers
and strip out anything about the actual content, of course I can do that.

Cheers,
	Oliver

> 
> 
> Chris Murphy
> 
