On 13.01.2014 08:25, Duncan wrote:
> [This mail was also posted to gmane.comp.file-systems.btrfs.]
>
> Thomas Kuther posted on Mon, 13 Jan 2014 00:05:25 +0100 as excerpted:
>>
>
> [ Rearranged to standard quote/reply order so replies are in context.
> Top-posting is irritating to try to reply to.]
Oops, sorry. It was already too late in the evening when I wrote that
second mail yesterday.
>
>> On 12.01.2014 21:24, Thomas Kuther wrote:
>>>
>>> I'm experiencing an interesting issue with the BTRFS filesystem on my
>>> SSD drive. It first occurred some time after the upgrade to kernel
>>> 3.13-rc (-rc3 was my first 3.13-rc) but I'm not sure if it is
>>> related.
>>>
>>> The obvious symptoms are that services on my system started crashing
>>> with "no space left on device" errors.
>>>
>>> └» mount |grep "/mnt/ssd"
>>> /dev/sda2 on /mnt/ssd type btrfs
>>> (rw,noatime,compress=lzo,ssd,noacl,space_cache)
>>>
>>> └» btrfs fi df /mnt/ssd
>>> Data, single: total=113.11GiB, used=90.02GiB
>>> System, DUP: total=64.00MiB, used=24.00KiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, DUP: total=3.00GiB, used=2.46GiB
>
> This shows only half the story, though. You also need the output of btrfs
> fi show /mnt/ssd. Btrfs fi show displays how much of the total
> available space is chunk-allocated; btrfs fi df displays how much of
> the chunk allocation for each type is actually used. Only with both
> of them is the picture complete enough to actually see what's going on.
└» sudo btrfs fi show /mnt/ssd
Label: none uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
Total devices 1 FS bytes used 93.22GiB
devid 1 size 119.24GiB used 119.24GiB path /dev/sda2
Btrfs v3.12
└» sudo btrfs fi df /mnt/ssd
Data, single: total=113.11GiB, used=90.79GiB
System, DUP: total=64.00MiB, used=24.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=3.00GiB, used=2.43GiB
So, this looks like it's really full.
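If I add it up: 113.11 GiB of data chunks + 2 x 3.00 GiB of DUP metadata
+ 2 x 64 MiB + 4 MiB of system chunks comes to roughly 119.24 GiB -
exactly the device size reported by fi show above. So every last bit of
the device is already chunk-allocated, even though only ~91 GiB of the
data chunks are actually used.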
>>> I use snapper on two subvolumes of that BTRFS volume (/ and /home) -
>>> each keeping 7 daily snapshots and up to 10 hourlies.
>>>
>>> When I saw those errors I started to delete most of the older
>>> snapshots, and the issue went away instantly, but that can't be a
>>> solution, nor even a workaround.
>>>
>>> I do, though, have a "usual suspect" on that BTRFS volume: a KVM
>>> disk image of a Win8 VM (I _need_ Adobe Lightroom).
>>>
>>> » lsattr /mnt/ssd/kvm-images/
>>> ---------------C /mnt/ssd/kvm-images/Windows_8_Pro.qcow2
>>>
>>> So the image has CoW disabled. Now comes the interesting part:
>>> I'm trying to copy the image off to my raid5 array (BTRFS on top of a
>>> mdraid 5 - absolutely no issues with that one), but the cp process
>>> seems like it's stalled.
>>>
>>> After one hour the size of the destination copy is still 0 bytes.
>>> iotop almost constantly shows values like
>>>
>>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
>>> 4636 be/4 tom 14.40 K/s 0.00 B/s 0.00 % 0.71 % cp
>>> /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 .
>>>
>>> It tries to read the file at some 14 K/s and writes absolutely
>>> nothing.
>>>
>>> Any idea what's going wrong here, or suggestions on how to get that
>>> qcow file copied off? I do have a backup, but honestly that one is
>>> quite old - so simply rm'ing the image would be the very last thing
>>> I'd like to try.
>
> OK. There's a familiar known-troublesome pattern here that your
> situation fits... with one difference that I had previously /thought/
> would ameliorate the problem, but either you didn't catch the problem
> soon enough, or the root issue is more complex than I at first
> understood (quite possible, since while I'm a regular on the list and
> thus see the common issues posted, I'm just a btrfs user/admin, not a
> dev, btrfs or otherwise).
>
> The base problem is that btrfs is normally a copy-on-write filesystem,
> and frequently internally-rewritten files (as opposed to sequential,
> append-only or write-once, read-many files) are in general a COW
> filesystem's worst case. The larger the file and the more frequently
> it is partially rewritten, the worse it gets, since every small
> internal write will COW the area being written elsewhere, quickly
> fragmenting large, routinely internally-rewritten files such as VM
> images into hundreds of thousands of extents! =:^(
>
> In general, btrfs has two methods to help deal with that. For smaller
> files the autodefrag mount option can help. For larger files
> autodefrag can be a performance issue in itself due to write
> magnification (each small internal write triggering a rewrite of the
> entire multi-gig file), but there's the NOCOW extended-attribute, which
> is what /has/ been recommended for these things as it's supposed to
> tell the filesystem to do in-place rewrites instead of COW. That
> doesn't seem to have worked for you, which is the interesting bit, but
> it's possible that's an artifact of how it was handled. Additionally,
> there's the snapshot aspect throwing further complexity into the works,
> as described below.
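(Noting for myself: autodefrag would just be a mount option, i.e. adding
it to the options in fstab or, if I'm not mistaken, something like
"mount -o remount,autodefrag /mnt/ssd" - but with the big VM image on
that filesystem I'll leave it off, as you describe.)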
>
> OK, so the file has NOCOW (the +C xattr) set, which is good.
> *BUT*, when/how did you set it? On btrfs that can make all the
> difference!
>
> The caveat with NOCOW on btrfs is that in order to be properly
> effective, NOCOW must be set on the file when it's first created,
> before there's actually any data in it. If the attribute is not set
> until later, when the file is not zero-size, behavior isn't what one
> might expect or desire -- simply stated, it doesn't work.
>
> The simplest way to ensure that a file gets the NOCOW attribute set
> while it's still empty is to set the attribute on the parent directory
> before the file is created in the first place. Any newly created files
> will then automatically inherit the directory's attribute, and thus
> will be set NOCOW from the beginning.
I created the subvolume /mnt/ssd/kvm-images and set +C on it. Then I
moved the VM image in there. So the attribute for the file was inherited
from the parent directory at creation time, yes.
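For reference, the sequence back then was roughly this (the source path
of the mv is just an example, I don't remember the exact location):

└» sudo btrfs subvolume create /mnt/ssd/kvm-images
└» sudo chattr +C /mnt/ssd/kvm-images
└» mv /some/old/path/Windows_8_Pro.qcow2 /mnt/ssd/kvm-images/  # source path only illustrative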
>
> A second method is to do it manually by first creating the zero-length
> file using touch, then setting the NOCOW attribute using chattr +C, and
> only /then/ copying the content into it. However, this is rather
> difficult for files created by other processes, so the directory
> inheritance method is generally recommended as the simplest method.
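(For the archives: spelled out, that manual method would be something
along these lines - the file names here are just placeholders:

└» touch /mnt/ssd/kvm-images/new-image.raw
└» chattr +C /mnt/ssd/kvm-images/new-image.raw
└» cat /path/to/source-image.raw > /mnt/ssd/kvm-images/new-image.raw  # placeholder paths

i.e. create the empty file, mark it NOCOW, and only then fill it with
data.)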
>
> So now the question is, the file has NOCOW set as recommended, but was
> it set before the file had content in it as required, or was NOCOW only
> set later, on the existing file with its existing content, thus in
> practice nullifying the effect of setting it at all?
>
> Meanwhile, the other significant factor here is the snapshotting. In
> VM-image cases *WITHOUT* the NOCOW xattr properly set, heavy
> snapshotting of a filesystem with VM images is a known extreme
> worst-case of the worst-cases, with *EXTREMELY* bad behavior
> characteristics that don't scale well at all, such that attempting to
> work with that file ties the filesystem in huge knots and very little
> forward progress can be made, period. We're talking days or even
> weeks to do what /should/ have taken a few minutes, due to the
> *SEVERE* scaling issues. They're working on the problem, but it's a
> tough one to solve and its scale only recently became apparent.
I do not have any snapshots of that specific kvm-images subvolume, for
exactly those reasons. There are some snapshots of other subvolumes (/
and /home), but only a handful dating back a few days.
>
> Actually, the current theory is that the recent changes to make defrag
> snapshot-aware may have triggered the severe scaling issues we're
> seeing now. Before that, the situation was bad, but apparently not
> so horribly broken to the point of not working at all, as it is
> now.
>
> But as I said, the previous recommendation has been to NOCOW the file
> to prevent the problem from ever appearing in the first place.
>
> Which you have apparently done and the problem is still there, except
> that we don't know yet whether you set NOCOW effectively, probably
> using the inheritance method, or not. If you set it effectively, then
> the problem is worse, MUCH worse, than thought, since the recommended
> workaround doesn't actually work around it. But if you set it too late to be
> effective, then the problem is simply another instance of the already
> known issue.
So it seems I hit the worst case.
>
> As for how to manage the existing file, you seem to have figured that
> out already, below...
>
>>> PS: please reply-to-all, I'm not subscribed. Thanks.
>
> OK. I'm doing so here, but please remind me in every reply.
>
> FWIW, I read and respond to the list as a newsgroup using gmane.org's
> list2news service and normally reply to the "newsgroup", which gets
> forwarded to the list. So I'm not actually using a mail client but a
> news client, and replying to both author and newsgroup/list isn't
> particularly easy, nor do I do it often, so reminding with every reply
> does help me remember.
Hmm, using NNTP is a good idea, actually.
>
>> I did some more digging, and I think I have two maybe unrelated issues
>> here.
>>
>> The "no space left on device" could be caused by the amount of
>> metadata used. I defragmented the KVM image and other parts, ran a
>> "balance start -dusage=5", and now it looks like
>>
>> └» btrfs fi df /
>> Data, single: total=113.11GiB, used=88.83GiB
>> System, DUP: total=64.00MiB, used=24.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=3.00GiB, used=2.40GiB
>
> Just as a hint, you can get rid of that extra system chunk (the empty
> single one) by doing a balance -sf (system, forced; the force is
> necessary when balancing system chunks on their own rather than as
> part of metadata). Since that's only a few KiB of actual system data,
> it should go fast, and you won't see that second system chunk
> displayed any more. =:^)
OK, will do. Thanks!
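If I read the balance usage correctly, that should be something along
the lines of (syntax from memory, I'll double-check the man page before
running it):

└» sudo btrfs balance start -f -s /mnt/ssd

i.e. balance only the system chunks, with -f to force it, since a
system-only balance is refused without the force flag, as you say.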
>
>> The issue with copying/moving off the KVM image still remains. Using
>> "cp" or "mv" hangs. Interestingly, what did work was using "qemu-img
>> convert -O raw ..." so now I have a fresh backup at least. The VM
>> works just fine with the original image file. I really wonder what
>> goes wrong with cp and mv.
>
> They're apparently getting caught up in that 100k-extents snapshot
> scaling morass...
Even when the subvolume in question has no snapshots and never had any?
>
> But *THANKS* for the qemu-img convert idea. I haven't set up any VMs
> here, so I didn't know about that at all. At least now I can pass on
> something that should actually let people get a backup to work with.
> =:^)
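For the record, the command I used was of the general form (the
destination path here is just a placeholder, not the exact one I used):

└» qemu-img convert -O raw /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 /mnt/btrfs/backups/Windows_8_Pro.raw  # destination is a placeholder

i.e. qemu-img reads the qcow2 through its own block layer and writes
out a plain raw image at the destination.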
>
>
> Meanwhile...
>
>> And I stumbled over a third issue with my raid5 array:
>> └» df -h|grep /mnt/btrfs
>> /dev/md0 5,5T 3,4T 2,1T 63% /mnt/btrfs
>> └» sudo btrfs fi df /mnt/btrfs/
>> Data, single: total=3.33TiB, used=3.33TiB
>> System, DUP: total=8.00MiB, used=388.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=56.12GiB, used=5.14GiB
>> Metadata, single: total=8.00MiB, used=0.00
>
> Again, you can use balance to get rid of those unused single chunks.
> They're currently an artifact from the creation of the filesystem due
> to how mkfs.btrfs works at present, so I've started doing a balance
> immediately after first mount to deal with them, before there's
> anything on the filesystem, so the balance goes really fast. =:^) 3+ TiB
> of data is a little late for that, but you can balance metadata (and
> system) only, at least.
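(Noted - so for the array that would presumably be something like

└» sudo btrfs balance start -m /mnt/btrfs

to rewrite just the metadata (plus the forced system-only balance from
above for the system chunk), which should also drop those empty 8 MiB
single chunks. I'll give that a try.)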
>
>> The array was grown quite a while ago using "btrfs filesystem
>> resize max", but "btrfs fi df" still shows the old data size. How
>> could that happen?
>
> As hinted at above, btrfs fi df <mntpnt> is only half the story,
> displaying how much of currently allocated chunks are used and for what
> (data/metadata/system/shared/etc). What it does *NOT* display is how
> much of the total filesystem size is actually allocated in the first
> place. That's where btrfs fi show <mntpnt> comes in. (Just btrfs fi
> show, without the <mntpnt> parameter, works fine if you've only a
> single btrfs or maybe a couple, but once you get a half dozen or so,
> adding the <mntpnt>, just as you do for df, is useful to display just
> the one.)
>
> Consider: on a single-device btrfs, data is single mode by default,
> with data chunks normally 1 GiB each; metadata is dup mode by default,
> with metadata chunks normally 1/4 GiB (256 MiB), but due to dup mode
> two of them are allocated at a time, so half a GiB.
>
> Given that, how do you represent unallocated space that could be
> allocated as either data (single, taking space equal to the size of
> the data, or a bit less when compression is on) or metadata (dup,
> taking twice as much space as the actual metadata since there are two
> copies of it), depending on what is needed?
>
> Of course btrfs can be used on multiple devices in various raid modes
> as well, complicating the picture further, particularly in the future
> when each subvolume can have its own single/dup/raid policy applied so
> they're not the same.
>
> The way btrfs deals with this question is that btrfs fi show displays
> allocated vs. total space (with the space that doesn't show up as
> allocated obviously being... unallocated! =:^), while btrfs fi df
> displays the usage detail on only /allocated/ space.
OK, now I get it.
└» sudo btrfs fi show /mnt/btrfs
Label: none uuid: 939f2547-176a-4942-b8d6-8883fed68973
Total devices 1 FS bytes used 3.34TiB
devid 1 size 5.46TiB used 3.44TiB path /dev/md0
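And the numbers add up, too: 3.33 TiB of data chunks plus 2 x 56.12 GiB
(~0.11 TiB) of DUP metadata plus the tiny system chunks comes to the
~3.44 TiB that fi show reports as used (i.e. allocated), leaving roughly
2 TiB of the 5.46 TiB device unallocated - which matches what plain df
reports as available.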
No issues on that array, just PEBKAC.
>
> Meanwhile, plain df (not btrfs df, just df) currently doesn't work
> particularly well for btrfs, because the rules it uses to display
> used vs. available space, which work on most filesystems, don't
> really apply to btrfs in the same way, and it doesn't know to apply
> different rules
> to btrfs or what they might be if it did. (There's an effort to teach
> df to know about btrfs and similar filesystems, but it's early stage
> ATM, as there are some very real questions to settle on exactly what a
> sensible kernel API might look like for that, first, with the
> assumption being that if the interface is designed correctly, other
> filesystems will be able to make use of it in the future as well.)
>> This is becoming a "collection of maybe unrelated BTRFS funny tales"
>> thread... still, I'd be happy about suggestions regarding any of the
>> issues.
>
> Some of this stuff, including discussion of the issues surrounding
> space used and left, is covered on the btrfs wiki, here (bookmark it!
> =:^) :
>
> https://btrfs.wiki.kernel.org
>
> In particular, see FAQ items 4.4-4.10 (documentation, faq...) covering
> space questions, but it's worth reading pretty much all the User level
> (as opposed to developer) documentation.
>
Will do. The last time I went through the wiki was at least 2 or 3
years ago, I guess. And obviously I wasn't really aware of the
difference between btrfs fi show and df.
Thanks for your detailed input and the little slap on the back of the
head regarding df vs. show :-)
Regards,
Tom
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html