Re: Strange performance degradation when COW writes happen at fixed offsets

Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:

> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@xxxxxxx> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks.  As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different erase
>> block.
> 
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...

I take it you looked at TRIM/discard, then, as well?  In theory and for 
some SSD firmware, it works well at helping to alleviate the problem by 
informing the SSD of data areas that it no longer needs to care about 
(empty space), thus allowing more effective management of those erase-
blocks.

Reality, however, is not quite so simple.  It doesn't help much with 
some SSDs, and there's a potential performance issue when doing the 
discard, especially on earlier devices, since TRIM is an unqueueable 
command in the earlier standards (I've read it's defined as queueable in 
the latest standards, however), thus forcing a flush of all activity in 
the queue before the discard and potentially triggering I/O-freeze 
behavior.  Additionally, when run on top of dm-crypt there's a potential 
security issue: examination of the raw undecrypted storage reveals 
whether there's data there or not, and possibly the filesystem type used 
based on patterns.  That's a deniability issue, in that an observer knows 
the data is there, tho it doesn't affect the strength of the encryption 
itself.

So, since on a lot of firmware it doesn't make much difference anyway, 
and there are a couple of downsides, the btrfs ssd mount option does NOT 
enable discard as well.  However, there *IS* a separate discard mount 
option that you can experiment with if you like, and it probably WILL 
help with erase-block handling on SOME firmware.
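
If you want to experiment with it, it's just another mount option; the 
device and mount point here are only placeholders for whatever you're 
actually using:

  mount -o ssd,discard /dev/sdX /mnt/test

(or the equivalent options field in fstab).  And if your kernel and 
util-linux are new enough to support it on btrfs, a periodically-run 
batched trim is another thing to try, since it avoids the per-delete TRIM 
overhead described above:

  fstrim -v /mnt/test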

See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more.  
I've really covered what it says, above, but there's a link to the 
encryption-security-vs-TRIM research there, for instance.  And for 
whatever reason the discard mount option isn't listed in the mount 
options documentation, or at least I didn't see it there, only in the 
FAQ.

http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

Bottom line, if it is indeed an erase-block issue, the discard mount 
option MIGHT help, or it might not, depending on your device firmware.  
It's an experiment-and-see thing.

>> As you increase the successive COW generation count, the file's
>> filesystem/write blocks will be spread thru more and more erase-blocks
>> (basically fragmentation, but of the SSD-critical type), thus affecting
>> modification and removal time but not read time.
> 
> OK, so write time would increase due to fragmentation, which makes
> sense now (though I don't see why small writes would affect this, but
> my concerns aren't writes anyway), but why would cp --reflink time
> increase so much?  Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it?  I figured its metadata would be kept
> in one place.  I figure the only thing BTRFS would do on cp
> --reflink=always is:
> 1. Take the collection of extents owned by the source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
> 
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?

FWIW... I am by no means an expert on this.  I /think/ I understand 
enough of it to somewhat guide trial-and-error testing toward a 
reasonable if not best-case config for any setup I might deal with, and 
well enough to hopefully point you in the right direction for your own 
research and testing.  But I'm not going to claim to be able to explain 
the whys of individual cases, or even necessarily to understand them 
myself; I just understand enough to know of the issue, and to trial-and-
error my way to a hopefully reasonable setup on any hardware I might 
run.

However, I could speculate (enough to guide my own testing were I 
troubleshooting here) that it's one of several things or more likely a 
combination of them.   

One, I'm not sure whether the metadata ends up being COWed as well, but 
if it is, then your test case is fragmenting it too, which would explain 
the reflink-copy issue.  And keep in mind that by default, btrfs uses DUP 
for metadata, so there's TWO copies of it written, thereby DOUBLING the 
performance effects of anything affecting metadata!
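
If you want to see what profiles your filesystem is actually using, btrfs 
filesystem df will list each chunk type with its profile (the mount point 
is just a placeholder):

  btrfs filesystem df /mnt/test

And if you want to test whether the doubled metadata writes matter in 
your case, you could build a throwaway test filesystem with single 
metadata instead, assuming your mkfs.btrfs accepts it:

  mkfs.btrfs -m single /dev/sdX

Obviously that's only for testing; I'm not suggesting giving up the 
metadata redundancy for data you care about.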

Two, see the FAQ deduplication question/answer a couple of questions 
below the TRIM/discard one mentioned above.  I'm rather fuzzy on the 
filesystem implications of this myself, but it seems to me that our COW 
assumptions might be wrong, because they presume deduplication effects 
that simply aren't the way btrfs works at present, as it hasn't 
implemented deduplication.  Admittedly, this is at best a handwavy 
black-box factor, but it's the best I can do with it presently.  I guess 
it at least gives you another place to do additional research, if it 
comes to that.  (In this regard I do wish the COW subsection of the 
sysadmin guide page on the wiki were written; it's simply punted ATM.  
There's a fair chance that a good explanation there would cover the 
filesystem-viewpoint differences between full deduplication and the COW 
that btrfs does, perhaps clearing up some misconceptions people, 
including me, may have about it.)

Three, as evident in the discussion of the nodatacow and autodefrag 
options I mentioned before, there are known issues with some use cases 
involving large files and rewrites of data at random locations within 
them.  But I'm not sure whether those known issues are simply the ones 
we've been discussing, or whether there are other factors I'm unaware of 
in this regard.  Knowing more about just what those known issues are, and 
the specific scenarios under which they occur, could go a long way toward 
resolving the situation for you.

But I'm only a recent list regular, joining a few weeks ago as part of my 
own research into btrfs (FWIW my use case involves N-way mirroring, with 
N=3-4; since only no-mirroring and N=2 are available today, and 3-way/N-
way is planned to layer on top of raid5/6, itself planned for kernel 
3.4/3.5, I'm now waiting for that... while continuing to stay current on 
the list), so whatever research or test cases led to the remarks on the 
wiki regarding large files with random data rewrites predate my 
involvement, likely by quite some time.

Four, there are additional block-alignment issues having to do with the 
alignment of the partition on the physical storage, as it relates to 
read-, write- and erase-block sizes and alignment.  On SSDs, erase-block 
sizes are the biggest, so the optimum alignment would be to erase-block 
size.  Getting it wrong can result in multiple block writes and/or erases 
where proper alignment would require only one.  This phenomenon is called 
write-amplification (and less commonly, erase-amplification).  However, 
depending on what you used to create the partition on which the 
filesystem resides (and loopback files do tend toward worst-case), it's 
quite possible you don't have block-alignment level control at all.

FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A 
option, since it lets you align the allocation within the partition as 
necessary, regardless of how the partition itself is aligned.
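
For instance, something along these lines (the device, partition, and the 
2 MiB figure are purely illustrative; you'd substitute your drive's 
actual erase-block size, assuming your mkfs.btrfs build has the option):

  mkfs.btrfs --alloc-start 2097152 /dev/sdX1   # 2 MiB, given in bytes

starts btrfs' allocation 2 MiB into the partition; in practice you'd pick 
an offset that makes the allocation start land on an erase-block boundary 
of the physical device, whatever the partition start happens to be.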

FWIW2, gptfdisk (a gpt partitioner as opposed to the old mbr style) has 
reasonable alignment defaults of 1 MiB on disks without an existing 
partition layout, and attempts 8-sector (4 KiB) alignment even on 
existing layouts, for disks >=300 GB at least.  That's what I've been 
using for the last few years, having converted to gpt-based partitioning 
for everything, even USB-thumb-drives, if partitioned.  (GPT was designed 
for EFI, but can be used on BIOS based systems as well, which is what I'm 
doing.  Grub2 understands gpt well and puts to good use any reserved BIOS 
partition it finds, and there are options in the kernel for it that need 
to be enabled as well.)

Block alignment is DEFINITELY something you can play with, in terms of 
testing whether it makes a difference on your drives, SSD or "spinning 
rust".

There are probably other factors involved of which I'm unaware, as well.

>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.

>> In addition to nodatacow, see the note on the autodefrag option.

> Unless I am wrong, this would disable COW completely and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshots complicate the design, and deduplication copy time is the
> same as (or not much better than) raw copy.

> As I mentioned above, the COW is the crucial component of our system,
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operations. The system does occasional "rotation of rust"
> on large files in a way that a version control system would (large files
> are modified and then used as a new baseline).

Pardon me, I think I might have been too vague with that "rotating rust" 
allusion and lost you.  Either that, or you're taking the allusion even 
further and have potentially lost me! =;^0

I meant spinning magnetic media with that "rotating rust" reference, the 
"rotating rust" bit being a double entendre allusion both to the iron 
oxide (rust) used as the data storage layer, and to the fact that many 
view rotating magnetic media as a legacy technology (rusting out) 
compared to SSDs. =:^)  As it happens, I saw that double-meaning word-
play used elsewhere recently with the same two allusions attached, and 
liked it enough to use it myself, when I got the chance.  Only I'm not 
sure you got the reference, because...

You used it quite differently, referring to file rotation.  So either you 
saw my reference and upped the ante, so to speak, leaving me to pick up 
the pieces, or I lost you with the original reference, one of the two.

But I guess we should be on the same page knowing each other's meaning, 
now.  Meanwhile...

[I do see your followup mentioning that it doesn't actually disable /all/ 
COW, and that you tested it, without significant change in the results...]

FWIW, I wasn't so much SUGGESTING those options as noting the 
INFORMATION contained in their descriptions: the bit about random writes 
to large db files and their effect on btrfs.  But testing (which you did) 
is a good idea, just to see what difference it makes.  In your case it 
made little, so either the nocow option isn't disabling COW for your 
specific use (cp --reflink), or the COW isn't the problem at all.


While you're at testing, tho, the question occurred to me of whether 
simply using btrfs' snapshotting would make a difference.  (I did say I 
don't claim a full understanding, and that trial-and-error testing would 
be my method here, that I really only understand enough to hopefully 
guide me a bit in what to test...)  Snapshotting by definition uses the 
COW capabilities, but it occurs to me that since it's doing it on a 
filesystem-wide basis instead of a single-file basis, that might allow 
more efficiency in metadata handling.

Note that I don't necessarily expect that snapshotting would be a 
workable final solution for you, but if in testing you discover that the 
speed stays reasonable with the snapshot method (still only changing the 
single file between snapshots), while it degrades (as you've found) with 
the single cp --reflink method, then that's important data for the test 
case, and given btrfs' state of development, it could well lead to major 
optimizations of the single-file cp --reflink case as well, which you 
presumably COULD use in final deployment.
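
As a rough sketch of that comparison (all paths here are placeholders for 
your actual setup, and $N is whatever generation counter your test script 
already uses), something like this in the test loop would show whether 
the two degrade differently over successive generations:

  # the single-file reflink copy you're timing now
  time cp --reflink=always /mnt/test/vol/bigfile /mnt/test/vol/bigfile.gen$N

  # versus a snapshot of the whole subvolume containing the file
  time btrfs subvolume snapshot /mnt/test/vol /mnt/test/snap.gen$N

with the random rewrites applied to the file in between, as before.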


> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.

Talking of which... since you mentioned 3.2-rc5, you do seem aware that 
btrfs is still very much experimental, in active development, and that 
staying current on the kernel matters.

However, unless your testing is for a system with actual deployment 
scheduled for say a year or more out, I'd question btrfs as a reasonable 
solution in any case.  One of the things that a lot of people don't seem 
to realize is just how much active btrfs development is still going on, 
and that it's NOT just corner-case use cases such as the multi-mirror 
raid1 that I'm waiting on ATM; there are still data-corruption issues 
being traced and fixed, etc.

IOW, btrfs isn't something I'd recommend on either a production system or 
even a general user's system, for the time being.  If the intent is to 
test btrfs, and you fill it with data that you're not only prepared to 
see destroyed but actually expect to lose, such that you either have 
backups or simply don't value the data enough to be worth backing up, and 
you're counting on the btrfs copy as nothing but experimental "garbage" 
data expected to be lost in testing, then that's FINE.  Such testing, and 
hopefully bug reporting, and patching where possible, is what btrfs is 
out there for, ATM.

But if the intent is to actually put production data on the filesystem, 
or to use it as the primary copy of data you don't want to lose, btrfs 
isn't an appropriate choice at this point, and I'd say it probably won't 
be until say Q4, or even next year.  So if your production deployment is 
scheduled for before that, really, you shouldn't be looking at btrfs for 
it; it's not fit for that purpose ATM and isn't likely to be for another 
year or so.  (And even then it'll be suitable only for the early 
adopters; the cautious folk will wait another year or more after that, 
just as many of them are only now warming to ext4 as opposed to ext3.)

I just don't want to see you back here as one of those folks asking 
questions about recovering data from a screwed filesystem, because they 
had no backups or the backups weren't kept current, because they were 
using btrfs for real-life use beyond testing purposes.  That's simply not 
the sort of use btrfs is designed for, or can properly deliver, at this 
point!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


