On Fri, Oct 24, 2014 at 01:05:39AM +0000, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 23 Oct 2014 07:39:28 -0400 as
> excerpted:
>
> > On 2014-10-23 05:19, Miao Xie wrote:
> >>
> >> Now my colleague and I is implementing the scrub/replace for RAID5/6
> >> and I have a plan to reimplement the balance and split it off from the
> >> metadata/file data process. the main idea is
> >> - allocate a new chunk which has the same size as the relocated one,
> >>   but don't insert it into the block group list, so we don't allocate
> >>   the free space from it.
> >> - set the source chunk to be Read-only
> >> - copy the data from the source chunk to the new chunk
> >> - replace the extent map of the source chunk with the one of the new
> >>   chunk(The new chunk has the same logical address and the length as
> >>   the old one)
> >> - release the source chunk
> >>
> >> By this way, we needn't deal the data one extent by one extent, and
> >> needn't do any space reservation, so the speed will be very fast even
> >> [if] we have lots of snapshots.
> >>
> > Even if balance gets re-implemented this way, we should still provide
> > some way to consolidate the data from multiple partially full chunks.
> > Maybe keep the old balance path and have some option (maybe call it
> > aggressive?) that turns it on instead of the new code.
>
> IMO:
>
> * Keep normal default balance behavior as-is.
>
> * Add two new options, --fast, and --aggressive.
>
> * --aggressive behaves as today and is the normal default.
>
> * --fast is the new chunk-by-chunk behavior.  This becomes the default if
> the convert filter is used, or if balance detects that it /is/ changing
> the mode, thus converting or filling in missing chunk copies, even when
> the convert filter was not specifically set.  Thus, if there's only one
> chunk copy (single or raid0 mode, or raid1/10 or dup with a missing/
> invalid copy) and the balance would result in two copies, default to
> --fast.
> Similarly, if it's raid1/10 and switching to single/raid0,
> default to --fast.  If no conversion is being done, keep the normal
> --aggressive default.

My pet peeve:  if balance is converting profiles from RAID1 to single,
the conversion should be *instantaneous* (or at least small_constant *
number_of_block_groups).  Pick one mirror, keep all the chunks on that
mirror, delete all the corresponding chunks on the other mirror.

Sometimes when a RAID1 mirror dies we want to temporarily convert
the remaining disk to single data / DUP metadata while we wait for
a replacement.  Right now if we try to do this, we discover:

    - if the system reboots during the rebalance, btrfs now sees a
      mix of single and RAID1 data profiles on the disk.  The rebalance
      takes a long time, and a hardware replacement has been ordered,
      so the probability of this happening is pretty close to 1.0.

    - one disk is missing, so there's a check in the mount code path
      that counts missing disks like this:

        - RAID1 profile: we can tolerate 1 missing disk so just
          mount rw,degraded

        - single profile: we can tolerate zero missing disks,
          so we don't allow rw mounts even if degraded.

That filesystem is now permanently read-only (or at least it was in 3.14).
It's not even possible to add or replace disks any more since that
requires mounting the filesystem read-write.

> * Users could always specify the behavior they want, overriding the
> default, using the appropriate option.
>
> * Of course defaults may result in some chunks being rebalanced in fast
> mode, while others are rebalanced in aggressive mode, if for instance
> it's 3+ device raid1 mode filesystem with one device missing, since in
> that case there'd be the usual two copies of some chunks and those would
> default to aggressive, while there'd be one copy of chunks where the
> other one was on the missing device.  However, users could always specify
> the desired behavior using the last point above, thus getting the same
> behavior for the entire balance.
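
The per-chunk defaults proposed above can be sketched roughly as follows.
This is only an illustration in Python, not btrfs code: --fast and
--aggressive are options suggested in this thread rather than existing
balance options, and COPIES, default_mode, live_copies and convert_to
are hypothetical names chosen for the example.

```python
# Hypothetical sketch of the proposed per-chunk default; not btrfs code.
# Number of copies each profile is supposed to keep of a chunk.
COPIES = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2, "raid10": 2}


def default_mode(profile, live_copies, convert_to=None):
    """Return the proposed default balance mode for one chunk.

    profile     -- current profile of the chunk (a key of COPIES)
    live_copies -- how many copies of the chunk are still readable
    convert_to  -- target profile if a convert filter was given, else None
    """
    if convert_to is not None:
        # A convert filter was given: the proposal defaults to --fast.
        return "fast"
    if live_copies < COPIES[profile]:
        # A copy is missing or invalid, so balance would fill it in.
        return "fast"
    # No conversion and nothing to repair: keep today's behavior.
    return "aggressive"


# 3-device raid1 with one device missing: some chunks still have both
# copies, others lost the copy that lived on the missing device.
print(default_mode("raid1", live_copies=2))
print(default_mode("raid1", live_copies=1))
```

In the 3+ device raid1 case with one device missing, chunks that still
have both copies come out as "aggressive" and chunks that lost a copy
come out as "fast", which is the mixed result described in the quoted
paragraph.
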
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
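
For reference, the degraded-mount check described earlier in this message
can be sketched roughly like this.  It is written in Python rather than
kernel C, the names TOLERATED_MISSING and can_mount_rw are made up for
illustration and do not correspond to kernel symbols, and letting the
least tolerant profile on the filesystem decide is an assumption based
on the behavior described for 3.14, not a reading of the actual code.

```python
# Hypothetical sketch of the mount-time check described in this message.
# Missing devices each profile tolerates; only the two profiles discussed
# here are listed.
TOLERATED_MISSING = {"raid1": 1, "single": 0}


def can_mount_rw(profiles_on_disk, missing_devices, degraded):
    """Return True if a read-write mount would be allowed.

    profiles_on_disk -- set of profiles that have chunks on the filesystem
    missing_devices  -- number of devices that could not be found
    degraded         -- True if the 'degraded' mount option was given
    """
    if missing_devices == 0:
        return True
    if not degraded:
        return False
    # Assumption: the least tolerant profile present anywhere decides.
    limit = min(TOLERATED_MISSING[p] for p in profiles_on_disk)
    return missing_devices <= limit


# Pure RAID1 with one dead disk: rw,degraded is allowed.
print(can_mount_rw({"raid1"}, missing_devices=1, degraded=True))
# Balance to single was interrupted by a reboot: mixed profiles.
print(can_mount_rw({"raid1", "single"}, missing_devices=1, degraded=True))
```

The second call is the interrupted-conversion case: as soon as a single
chunk with the single profile exists, one missing disk is already too
many, and the filesystem can no longer be mounted read-write to finish
the balance or add a replacement disk.
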
