Re: btrfs filled up dm-thin and df%: shows 8.4TB of data used when I'm only using 10% of that.

On 2/21/20 7:01 PM, Marc MERLIN wrote:
On Fri, Feb 21, 2020 at 06:43:45PM -0500, Josef Bacik wrote:
Well, crap, see how 'used' went from 445.73GiB to 8.42TiB after balance?

Wtf?  Can you do btrfs filesystem usage on that fs?  I'd like to see the
breakdown.  I'm super confused about what's happening there.

You and me both :)
gargamel:/mnt/btrfs_pool2/backup/ubuntu# btrfs fi df .
Data, single: total=8.40TiB, used=8.40TiB
System, DUP: total=8.00MiB, used=912.00KiB
Metadata, DUP: total=17.00GiB, used=16.33GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Looks like used is back to 8.4TB there too.

Man, this is bizarre. Does fsck say anything useful? I wonder if the block groups are messed up and reporting the wrong value for used. You said du shows only ~400GiB of space actually used, right? I'm curious to see what fsck says. If it comes back clean, I'll write something up to go and figure out where the space is.
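
For a quick cross-check of the numbers quoted above, the `btrfs fi df` output can be parsed and compared against what du reports. This is only a minimal sketch, assuming the output format pasted in this thread; `to_bytes`, `parse_fi_df`, and the rounded 400GiB du figure are illustrative names and values, not part of btrfs-progs.

```python
import re

# Binary unit multipliers as printed by `btrfs fi df` (e.g. 8.40TiB, 912.00KiB, 0.00B).
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def to_bytes(text):
    """Convert a size string such as '8.40TiB' to an (approximate) byte count."""
    m = re.fullmatch(r"([\d.]+)([KMGT]iB|B)", text)
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * UNITS[m.group(2)])

# Matches lines like "Data, single: total=8.40TiB, used=8.40TiB".
LINE = re.compile(r"^(\w+), (\w+): total=(\S+), used=(\S+)$")

def parse_fi_df(output):
    """Return {block group type: (total_bytes, used_bytes)}."""
    result = {}
    for line in output.splitlines():
        m = LINE.match(line.strip())
        if m:
            kind, _profile, total, used = m.groups()
            result[kind] = (to_bytes(total), to_bytes(used))
    return result

# Output copied from the message above.
sample = """\
Data, single: total=8.40TiB, used=8.40TiB
System, DUP: total=8.00MiB, used=912.00KiB
Metadata, DUP: total=17.00GiB, used=16.33GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

usage = parse_fi_df(sample)
du_bytes = 400 * 1024**3  # assumption: the "~400GiB" du figure, rounded
ratio = usage["Data"][1] / du_bytes
print(f"Data 'used' is {ratio:.1f}x what du reports")
```

A ratio far above 1 only says that the block group accounting and the file-level view disagree; it does not say which of the two is wrong.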




And now for extra points, this also damaged a 2nd of my filesystems on the same VG :(
[64723.601630] BTRFS error (device dm-17): bad tree block start, want 5782272294912 have 0
[64723.628708] BTRFS error (device dm-17): bad tree block start, want 5782272294912 have 0
[64897.028176] BTRFS error (device dm-13): parent transid verify failed on 22724608 wanted 10005 found 10001
[64897.080355] BTRFS error (device dm-13): parent transid verify failed on 22724608 wanted 10005 found 10001

This will happen if the transaction aborts. Does it still happen after you
unmount and remount?  Thanks,

The problematic filesystem mounts fine, but that doesn't mean it's
clean.
The one that I'd very much like not to be damaged, I'm not touching
until I can get my VG back to the 50% of free space it needs to
have; at 99.9x% full, it's not safe to use anything on it.
But thanks for the heads up that my other filesystem may be ok. I'll run
a btrfs check on it later when it's safe.

Back to dm-13, it's now hung on umount, I'm getting a string of these:
[67980.657803] BTRFS info (device dm-13): the free space cache file (4344624709632) is invalid, skip it
[67991.562812] BTRFS info (device dm-13): the free space cache file (4447703924736) is invalid, skip it
[67991.755262] BTRFS info (device dm-13): the free space cache file (4448777666560) is invalid, skip it
[68000.379059] BTRFS info (device dm-13): the free space cache file (4518570885120) is invalid, skip it
[68013.462077] BTRFS info (device dm-13): the free space cache file (4574405459968) is invalid, skip it
[68015.286730] BTRFS info (device dm-13): the free space cache file (4589437845504) is invalid, skip it
[68015.318239] BTRFS info (device dm-13): the free space cache file (4589437845504) is invalid, skip it
[68016.212246] BTRFS info (device dm-13): the free space cache file (4596954038272) is invalid, skip it
[68016.730826] BTRFS info (device dm-13): the free space cache file (4602322747392) is invalid, skip it
[68020.547135] BTRFS info (device dm-13): the free space cache file (4634535002112) is invalid, skip it
[68021.812820] BTRFS info (device dm-13): the free space cache file (4646346162176) is invalid, skip it
[68037.173441] BTRFS info (device dm-13): the free space cache file (4768752730112) is invalid, skip it
[68039.559383] BTRFS info (device dm-13): the free space cache file (4778416406528) is invalid, skip it
[68040.531083] BTRFS info (device dm-13): the free space cache file (4781637632000) is invalid, skip it
[68050.184300] BTRFS info (device dm-13): the free space cache file (4843914657792) is invalid, skip it
[68074.134080] BTRFS info (device dm-13): the free space cache file (4988869804032) is invalid, skip it
[68078.943126] BTRFS info (device dm-13): the free space cache file (5015713349632) is invalid, skip it
[68099.512978] BTRFS info (device dm-13): the free space cache file (5151004819456) is invalid, skip it
[68100.575692] BTRFS info (device dm-13): the free space cache file (5160668495872) is invalid, skip it
[68100.689222] BTRFS info (device dm-13): the free space cache file (5161742237696) is invalid, skip it
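
To get a sense of how many distinct cache files are affected, as opposed to repeated messages for the same one, the numbers can be pulled out of the log. A small sketch, assuming the message format shown above; the function name is illustrative, and reading the number in parentheses as a block group offset is an assumption rather than something stated in this thread.

```python
import re

# Matches the kernel message format seen in this thread.
PATTERN = re.compile(
    r"BTRFS info \(device ([\w-]+)\): the free space cache file \((\d+)\) is invalid"
)

def invalid_cache_offsets(log_text):
    """Return {device: sorted unique numbers} from 'free space cache file' messages."""
    found = {}
    for device, offset in PATTERN.findall(log_text):
        found.setdefault(device, set()).add(int(offset))
    return {dev: sorted(offs) for dev, offs in found.items()}

# Three lines copied from the log above; the first two repeat the same number.
sample = """\
[68015.286730] BTRFS info (device dm-13): the free space cache file (4589437845504) is invalid, skip it
[68015.318239] BTRFS info (device dm-13): the free space cache file (4589437845504) is invalid, skip it
[68016.212246] BTRFS info (device dm-13): the free space cache file (4596954038272) is invalid, skip it
"""

offsets = invalid_cache_offsets(sample)
for dev, offs in offsets.items():
    print(dev, len(offs), "distinct entries")
```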

I knew that filling up a btrfs filesystem was bad, but filling it the
normal way makes it slow down enough that you usually notice and fix it.
Filling it by having an underlying dm-thin deny writes is much worse (I
expected it wouldn't be pretty, which is why I had a cronjob to catch
this before it happened, but I missed it due to the df bug).
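
A check of the kind such a cronjob might run can be sketched as follows, under stated assumptions: it reads the thin pool's own fill level from `lvs` rather than relying on filesystem-level `df`. `data_percent` and `metadata_percent` are existing `lvs` report fields; the 80% limit, the function names, and the sample LV names are hypothetical, and this is not the script actually used here.

```python
import subprocess

def parse_lvs(output):
    """Parse output of:
    lvs --noheadings --separator , -o lv_name,data_percent,metadata_percent
    Returns {lv_name: (data_percent, metadata_percent)} for LVs that report usage.
    """
    pools = {}
    for line in output.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3 or not fields[1]:
            continue  # ordinary LV: data_percent is empty
        name, data, meta = fields
        # metadata_percent may be empty (e.g. for thin volumes); treat as 0.
        pools[name] = (float(data), float(meta) if meta else 0.0)
    return pools

def over_threshold(pools, limit=80.0):
    """Names of LVs whose data or metadata usage is at or above `limit` percent."""
    return sorted(n for n, (data, meta) in pools.items() if max(data, meta) >= limit)

def read_lvs(vg):
    """Run lvs for one volume group (needs root); not called in this sketch."""
    cmd = ["lvs", "--noheadings", "--separator", ",",
           "-o", "lv_name,data_percent,metadata_percent", vg]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Hypothetical sample output: one nearly full thin pool, one ordinary LV.
sample = "  thinpool2,99.95,45.10\n  root,,\n"
alerts = over_threshold(parse_lvs(sample))
if alerts:
    print("thin pool usage above limit:", ", ".join(alerts))
```

In a real cron job, `read_lvs` would supply the text instead of `sample`, and a non-empty `alerts` list would trigger a mail or a non-zero exit status.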


Yeah, I'm curious about this too. It was my understanding that thinp would just return an error, which should trigger a transaction abort, and then you should come back to a completely valid file system.

I sort of wonder if there's a different failure case that allowed some writes to complete while others did not, which resulted in this bad file system state. I'll put it on my list of things to investigate, because if that's the case we're likely missing some error condition that doesn't trigger a transaction abort properly.

We for sure bang the hell out of the "disk starts throwing errors" path. There are several xfstests for it and I've spent the last month running a bunch of them in a loop, so I know that for full failures we're doing the right thing. Thanks,

Josef


