On Mon, Jan 11, 2016 at 03:20:36PM -0700, Chris Murphy wrote:
> On Mon, Jan 11, 2016 at 3:10 PM, Hugo Mills <hugo@xxxxxxxxxxxxx> wrote:
> > On Mon, Jan 11, 2016 at 02:31:41PM -0700, Chris Murphy wrote:
> >> On Mon, Jan 11, 2016 at 2:03 AM, Hugo Mills <hugo@xxxxxxxxxxxxx> wrote:
> >> > On Sun, Jan 10, 2016 at 05:13:28PM -0700, Chris Murphy wrote:
> >> >> On Sat, Jan 9, 2016 at 2:04 PM, Hugo Mills <hugo@xxxxxxxxxxxxx> wrote:
> >> >> > On Sat, Jan 09, 2016 at 09:59:29PM +0100, cheater00 . wrote:
> >> >> >> OK. How do we track down that bug and get it fixed?
> >> >> >
> >> >> > I have no idea. I'm not a btrfs dev, I'm afraid.
> >> >> >
> >> >> > It's been around for a number of years. None of the devs has, I
> >> >> > think, had the time to look at it. When Josef was still (publicly)
> >> >> > active, he had it second on his list of bugs to look at for many
> >> >> > months -- but it always got trumped by some new bug that could cause
> >> >> > data loss.
> >> >>
> >> >>
> >> >> Interesting. I did not know of this bug. It's pretty rare.
> >> >
> >> > Not really. It shows up maybe on average once a week on IRC. It
> >> > gets reported much less on the mailing list.
> >>
> >> Is there a pattern? Does it only happen at a 2TiB threshold?
> >
> > No, and no.
> >
> > There is, as far as I can tell from some years of seeing reports of
> > this bug, no correlation with RAID level, hardware, OS, kernel
> > version, FS size, usage of the FS at failure, or allocation level of
> > either data or metadata at failure.
> >
> > I haven't tried correlating with the phase of the moon or the
> > losses on Lloyds Register yet.
>
> Huh. So it's goofy cakes.
>
> This is specifically where btrfs_free_extent produces errno -28 no
> space left, and then the fs goes read-only?

The symptoms I'm using for a diagnosis of this bug are that the FS
runs out of (usually data) space when there's still unallocated space
remaining that it could use for another block group.
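
To make that diagnostic test concrete, here is a minimal sketch in
Python. It is only an illustration of the rule of thumb: the names
(FsSnapshot, looks_like_spurious_enospc, min_block_group) are
hypothetical, the 1 GiB block-group size is an assumption, and the
numbers would have to be read by hand from whatever usage report you
have. It also ignores per-device layout, so it may be too optimistic
for multi-device RAID profiles.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class FsSnapshot:
    """Hypothetical summary of the filesystem at the moment a write failed."""
    device_size: int       # total bytes across all devices
    device_allocated: int  # bytes already assigned to block groups


def looks_like_spurious_enospc(snap: FsSnapshot, got_enospc: bool,
                               min_block_group: int = GIB) -> bool:
    """Return True if the failure matches the symptom described above.

    The symptom: an out-of-space error was reported even though enough
    unallocated space remained for at least one more block group.
    min_block_group is an assumed size, not a value taken from btrfs.
    """
    unallocated = snap.device_size - snap.device_allocated
    return got_enospc and unallocated >= min_block_group


if __name__ == "__main__":
    # Made-up numbers: a 100 GiB device with 60 GiB allocated.
    snap = FsSnapshot(device_size=100 * GIB, device_allocated=60 * GIB)
    print(looks_like_spurious_enospc(snap, got_enospc=True))
```

A fully allocated filesystem that reports no space left would not
match, since that is a genuine out-of-space condition.
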
Forced RO isn't usually a symptom, although the FS can get into a
state where you can't modify it (as distinct from being explicitly
read-only).

Block-group level operations, like balance, device delete, and device
add, sometimes seem to have some kind of (usually small) effect on the
point at which the error occurs. If you hit the problem and run a
balance, you might end up making things worse by a couple of
gigabytes, or making things better by the same amount, or having no
effect at all.

Hugo.

-- 
Hugo Mills             | "What are we going to do tonight?"
hugo@... carfax.org.uk | "The same thing we do every night, Pinky. Try to
http://carfax.org.uk/  | take over the world!"
PGP: E2AB1DE4          |
