Re: Issues with "no space left on device" maybe related to 3.13

On 13.01.2014 08:25, Duncan wrote:
> [This mail was also posted to gmane.comp.file-systems.btrfs.]
> 
> Thomas Kuther posted on Mon, 13 Jan 2014 00:05:25 +0100 as excerpted:
>>
> 
> [ Rearranged to standard quote/reply order so replies are in context.  
> Top-posting is irritating to try to reply to.]
Oops, sorry. It was simply too late yesterday when I wrote that second mail.

> 
>> On 12.01.2014 21:24, Thomas Kuther wrote:
>>>
>>> I'm experiencing an interesting issue with the BTRFS filesystem on my
>>> SSD drive. It first occurred some time after the upgrade to kernel
>>> 3.13-rc (-rc3 was my first 3.13-rc), but I'm not sure whether it is
>>> related.
>>>
>>> The obvious symptoms are that services on my system started crashing
>>> with "no space left on device" errors.
>>>
>>> └» mount |grep "/mnt/ssd"
>>> /dev/sda2 on /mnt/ssd type btrfs
>>> (rw,noatime,compress=lzo,ssd,noacl,space_cache)
>>>
>>> └» btrfs fi df /mnt/ssd
>>> Data, single: total=113.11GiB, used=90.02GiB
>>> System, DUP: total=64.00MiB, used=24.00KiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, DUP: total=3.00GiB, used=2.46GiB
> 
> This shows only half the story, tho.  You also need the output of btrfs
> fi show /mnt/ssd.  Btrfs fi show displays how much of the total
> available space is chunk-allocated; btrfs fi df displays how much of
> the chunk allocation for each type is actually used.  Only with both
> of them is the picture complete enough to actually see what's going on.

└» sudo btrfs fi show /mnt/ssd
Label: none  uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
        Total devices 1 FS bytes used 93.22GiB
        devid    1 size 119.24GiB used 119.24GiB path /dev/sda2

Btrfs v3.12
└» sudo btrfs fi df /mnt/ssd
Data, single: total=113.11GiB, used=90.79GiB
System, DUP: total=64.00MiB, used=24.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=3.00GiB, used=2.43GiB

So, this looks like it's really full.
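Doing the math on the above: 113.11GiB of data chunks plus 2 x 3.00GiB
of DUP metadata plus roughly 0.13GiB of system chunks comes out at about
119.24GiB, so the whole device is already chunk-allocated even though
only ~93GiB of it is actually used.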

>>> I use snapper on two subvolumes of that BTRFS volume (/ and /home),
>>> each keeping 7 daily snapshots and up to 10 hourlies.
>>>
>>> When I saw those errors I started to delete most of the older
>>> snapshots, and the issue went away instantly, but that can't really
>>> be a solution, nor even a workaround.
>>>
>>> I do, though, have a "usual suspect" on that BTRFS volume: a KVM
>>> disk image of a Win8 VM (I _need_ Adobe Lightroom).
>>>
>>> » lsattr /mnt/ssd/kvm-images/
>>> ---------------C /mnt/ssd/kvm-images/Windows_8_Pro.qcow2
>>>
>>> So the image has CoW disabled. Now comes the interesting part:
>>> I'm trying to copy the image off to my raid5 array (BTRFS on top of
>>> an mdraid 5 - absolutely no issues with that one), but the cp process
>>> seems to be stalled.
>>>
>>> After one hour the size of the destination copy is still 0 bytes.
>>> iotop almost constantly shows values like
>>>
>>>  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>>>  4636 be/4 tom        14.40 K/s    0.00 B/s  0.00 %  0.71 % cp
>>> /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 .
>>>
>>> It tries to read the file at some 14K/s and writes absolutely
>>> nothing.
>>>
>>> Any idea what's going wrong here, or suggestions on how to get that
>>> qcow file copied off? I do have a backup, but honestly that one is
>>> quite aged - so simply rm'ing it would be the very last thing I'd
>>> like to try.
> 
> OK.  There's a familiar known-troublesome pattern here that your 
> situation fits... with one difference that I had previously /thought/ 
> would ameliorate the problem, but either you didn't catch the problem 
> soon enough, or the root issue is more complex than I at first
> understood (quite possible, since while I'm a regular on the list and
> thus see the common issues posted, I'm just a btrfs user/admin, not a
> dev, btrfs or otherwise).
> 
> The base problem is that btrfs is normally a copy-on-write filesystem, 
> and frequently internally-rewritten files (as opposed to
> sequential-write, append-only or write-once, read-many files) are in
> general a COW filesystem's worst case.  The larger the file and the
> more frequently it's partially rewritten, the worse it gets, since
> every small internal write will COW the area being written elsewhere, 
> quickly fragmenting large, routinely internally-rewritten files such
> as VM images into hundreds of thousands of extents!  =:^(
> 
> In general, btrfs has two methods to help deal with that.  For smaller 
> files the autodefrag mount option can help.  For larger files
> autodefrag can be a performance issue in itself due to write
> magnification (each small internal write triggering a rewrite of the
> entire multi-gig file), but there's the NOCOW extended-attribute, which
> is what /has/ been recommended for these things as it's supposed to
> tell the filesystem to do in-place rewrites instead of COW.  That
> doesn't seem to have worked for you, which is the interesting bit, but
> it's possible that's an artifact of how it was handled.  Additionally,
> there's the snapshot aspect throwing further complexity into the works,
> as described below.
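> 
> (For completeness: the autodefrag mentioned above is just a mount
> option; with your current options it would read something along the
> lines of rw,noatime,compress=lzo,ssd,autodefrag in fstab.  That's only
> an illustration, not a recommendation for a VM-image workload, for the
> write-magnification reason above.)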
> 
> OK, so the file has NOCOW (the +C xattribute) set, which is good.
> *BUT*, when/how did you set it?  On btrfs that can make all the
> difference!
> 
> The caveat with NOCOW on btrfs is that in order to be properly 
> effective, NOCOW must be set on the file when it's first created,
> before there's actually any data in it.  If the attribute is not set
> until later, when the file is not zero-size, behavior isn't what one
> might expect or desire -- simply stated, it doesn't work.
> 
> The simplest way to ensure that a file gets the NOCOW attribute set
> while it's still empty is to set the attribute on the parent directory
> before the file is created in the first place.  Any newly created files
> will then automatically inherit the directory's attribute, and thus
> will be set NOCOW from the beginning.

I created the subvolume /mnt/ssd/kvm-images and set +C on it. Then I
moved the VM image in there. So the attribute for the file was inherited
from the parent directory at creation time, yes.
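
Roughly like this, if I remember correctly (paths are from memory, so
treat them as illustrative only):

└» sudo btrfs subvolume create /mnt/ssd/kvm-images
└» sudo chattr +C /mnt/ssd/kvm-images
└» mv <old location>/Windows_8_Pro.qcow2 /mnt/ssd/kvm-images/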

> 
> A second method is to do it manually by first creating the zero-length 
> file using touch, then setting the NOCOW attribute using chattr +C, and 
> only /then/ copying the content into it.  However, this is rather 
> difficult for files created by other processes, so the directory 
> inheritance method is generally recommended as the simplest method.
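> 
> A minimal sketch of that manual method (file names purely
> illustrative):
> 
>   touch new-image.qcow2
>   chattr +C new-image.qcow2
>   cat old-image.qcow2 > new-image.qcow2
> 
> Anything that creates the destination file first and only then fills
> it with content works the same way.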
> 
> So now the question is, the file has NOCOW set as recommended, but was
> it set before the file had content in it as required, or was NOCOW only
> set later, on the existing file with its existing content, thus in
> practice nullifying the effect of setting it at all?
> 
> Meanwhile, the other significant factor here is the snapshotting.  In
> VM-image cases *WITHOUT* the NOCOW xattr properly set, heavy
> snapshotting of a filesystem with VM images is a known extreme
> worst-case of the worst-cases, with *EXTREMELY* bad behavior
> characteristics that don't scale well at all, such that attempting to
> work with that file will tie up the filesystem in huge knots such that
> very little forward progress can be made, period.  We're talking days
> or even weeks to do what /should/ have taken a few minutes, due to the
> *SEVERE* scaling issues.  They're working on the problem, but it's a
> tough one to solve, and its scale only recently became apparent.

I do not have any snapshots of that specific kvm-images subvolume, for
exactly those reasons. There are some snapshots of other subvolumes (/
and /home), but only a handful dating back a few days.

> 
> Actually, the current theory is that the recent changes to make defrag 
> snapshot-aware may have triggered the severe scaling issues we're
> seeing now.  Before that, the situation was bad, but apparently not
> horribly terribly broken to the point of not working at all, as it is
> now.
> 
> But as I said, the previous recommendation has been to NOCOW the file
> to prevent the problem from ever appearing in the first place.
> 
> Which you have apparently done, and yet the problem is still there,
> except that we don't know yet whether you set NOCOW effectively,
> probably using the inheritance method, or not.  If you set it
> effectively, then the problem is worse, MUCH worse, than thought, since
> the recommended workaround doesn't actually work around anything.  But
> if you set it too late to be effective, then the problem is simply
> another instance of the already known issue.

So it seems I hit the worst case.

> 
> As for how to manage the existing file, you seem to have figured that
> out already, below...
> 
>>> PS: please reply-to-all, I'm not subscribed. Thanks.
> 
> OK.  I'm doing so here, but please remind me in every reply.
> 
> FWIW, I read and respond to the list as a newsgroup using gmane.org's 
> list2news service and normally reply to the "newsgroup", which gets 
> forwarded to the list.  So I'm not actually using a mail client but a 
> news client, and replying to both author and newsgroup/list isn't 
> particularly easy, nor do I do it often, so reminding with every reply 
> does help me remember.

Hmm, using NNTP is a good idea, actually.

> 
>> I did some more digging, and I think I have two maybe unrelated issues
>> here.
>>
>> The "no space left on device" could be caused by the amount of
>> metadata used. I defragmented the KVM image and other parts, ran a
>> "balance start -dusage=5", and now it looks like
>>
>> └» btrfs fi df /
>> Data, single: total=113.11GiB, used=88.83GiB
>> System, DUP: total=64.00MiB, used=24.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=3.00GiB, used=2.40GiB
> 
> Just as a hint, you can get rid of that extra system chunk (the empty 
> single one) by doing a balance -sf (system, force, force necessary when 
> balancing system chunks only, not as part of metadata).   Since that's 
> only a few KiB of actual system data, it should go fast, and you won't 
> have that second system chunk display any more. =:^)

OK, will do. Thanks!
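(Presumably that's something along the lines of

└» sudo btrfs balance start -f -s /mnt/ssd

though I'll check the manpage for the exact filter syntax before running
it.)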

> 
>> The issue with copying/moving off the KVM image still remains. Using
>> "cp" or "mv" hangs. Interestingly, what did work was using "qemu-img
>> convert -O raw ..." so now I have a fresh backup at least. The VM
>> works just fine with the original image file. I really wonder what
>> goes wrong with cp and mv.
> 
> They're apparently getting caught up in that 100k-extents snapshot 
> scaling morass...

Even when the subvolume in question has no snapshots and never had any?

> 
> But *THANKS* for the qemu-img convert idea.  I haven't set up any VMs
> here, so I didn't know about that at all.  At least now I can pass on
> something that should actually let people get a backup to work with.
> =:^)
> 
> 
> Meanwhile...
> 
>> And I stumbled over a third issue with my raid5 array:
>> └» df -h|grep /mnt/btrfs
>> /dev/md0        5,5T    3,4T  2,1T   63% /mnt/btrfs
>> └» sudo btrfs fi df /mnt/btrfs/
>> Data, single: total=3.33TiB, used=3.33TiB
>> System, DUP: total=8.00MiB, used=388.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=56.12GiB, used=5.14GiB
>> Metadata, single: total=8.00MiB, used=0.00
> 
> Again, you can use balance to get rid of those unused single chunks.  
> They're currently an artifact from the creation of the filesystem due
> to how mkfs.btrfs works at present, so I've started doing a balance 
> immediately after first mount to deal with them, before there's
> anything on the filesystem so the balance goes real fast. =:^)  3+ TiB
> of data is a little late for that, but you can balance metadata (and
> system) only, at least.
>  
>> The array has been grown quite a while ago using "btrfs filesystem
>> resize max", but "btrfs fi df" still shows the old data size. How
>> could that happen?
> 
> As hinted at above, btrfs fi df <mntpnt> is only half the story, 
> displaying how much of currently allocated chunks are used and for what 
> (data/metadata/system/shared/etc).  What it does *NOT* display is how 
> much of the total filesystem size is actually allocated in the first 
> place.  That's where btrfs fi show <mntpnt> comes in.  (Just btrfs fi 
> show, without the <mntpnt> parameter, works fine if you've only a
> single btrfs or maybe a couple, but once you get a half dozen or so,
> adding the <mntpnt>, just as you do for df, is useful to display just
> the one.)
> 
> Consider: On a single device btrfs, data is single mode by default,
> with data chunks normally 1 GiB each, metadata is dup mode by default,
> with metadata chunks normally 1/4 GiB (256 MiB), but due to dup mode,
> two of them are allocated at a time, so half a GiB.
> 
> Given that, how do you represent unallocated space that could be 
> allocated as either data (single, takes the space of the size of the 
> data, or a bit less when compression is on) or metadata (dup, takes
> twice as much space as the size of the actual metadata as there's two
> copies of it), depending on what is needed?
> 
> Of course btrfs can be used on multiple devices in various raid modes
> as well, complicating the picture further, particularly in the future
> when each subvolume can have its own single/dup/raid policy applied so
> they're not the same.
> 
> The way btrfs deals with this question is that btrfs fi show displays 
> allocated vs. total space (with the space that doesn't show up as 
> allocated obviously being... unallocated! =:^), while btrfs fi df
> displays the usage detail of only the /allocated/ space.

OK, now I got it.

└» sudo btrfs fi show /mnt/btrfs
Label: none  uuid: 939f2547-176a-4942-b8d6-8883fed68973
        Total devices 1 FS bytes used 3.34TiB
        devid    1 size 5.46TiB used 3.44TiB path /dev/md0

No issues on that array, just PEBKAC.
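(5.46TiB total size minus 3.44TiB allocated leaves roughly 2TiB still
unallocated, which lines up with the ~2.1T that plain df reports as
available.)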

> 
> Meanwhile, plain df (not btrfs df, just df) currently doesn't work
> particularly well for btrfs, because the rules it uses to display used
> vs. available space, which work on most filesystems, don't really apply
> to btrfs in the same way, and it doesn't know to apply different rules
> to btrfs or what they might be if it did.  (There's an effort to teach
> df to know about btrfs and similar filesystems, but it's early stage
> ATM, as there's some very real questions to settle on exactly what a
> sensible kernel API might look like for that, first, with the
> assumption being that if the interface is designed correctly, other
> filesystems will be able to make use of it in the future as well.)
> 
>> This is becoming a "collection of maybe unrelated BTRFS funny tales"
>> thread... still, I'd be happy about suggestions regarding any of the
>> issues.
> 
> Some of this stuff, including discussion of the issues surrounding
> space used and left, is covered on the btrfs wiki, here (bookmark it!
> =:^) :
> 
> https://btrfs.wiki.kernel.org
> 
> In particular, see FAQ items 4.4-4.10 (documentation, faq...) covering 
> space questions, but it's worth reading pretty much all the User level 
> (as opposed to developer) documentation.
> 

Will do. The last time I went through the wiki must have been at least 2
or 3 years ago, I guess. And obviously I wasn't really aware of the
difference between btrfs fi show and btrfs fi df.

Thanks for your detailed input, and for the little slap on the back of
the head regarding df vs. show :-)

Regards,
Tom
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



