Re: Uncorrectable errors with RAID1

On 2017-01-16 23:50, Janos Toth F. wrote:
BTRFS uses a 2 level allocation system.  At the higher level, you have
chunks.  These are just big blocks of space on the disk that get used for
only one type of lower level allocation (Data, Metadata, or System).  Data
chunks are normally 1GB, Metadata 256MB, and System depends on the size of
the FS when it was created.  Within these chunks, BTRFS then allocates
individual blocks just like any other filesystem.

This always seems to confuse me when I try to get an abstract idea
about de-/fragmentation of Btrfs.
Can meta-/data be fragmented on both levels? And if so, can defrag
and/or balance "cure" both levels of fragmentation (if any)?
But how? Maybe several defrag and balance runs, repeated until the
returns diminish (or at least until you consider them meaningless and/or
unnecessary)?
Defrag operates only at the block level. It won't allocate chunks unless it has to, and it won't remove chunks unless they become empty as a result of it moving things around (although that's not likely to happen most of the time). Balance functionally operates at both levels, but it doesn't really do any defragmentation. Balance _may_ merge extents sometimes, but I'm not sure of this. It will compact allocations and therefore functionally defragment free space within chunks (though not necessarily at the chunk level itself).
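If you want to see those two levels side by side, the usage reporting commands show how much space is allocated to chunks versus how much is actually used inside them (the mount point /mnt/data below is just a placeholder):

    # Per-type totals: "total" is chunk-level allocation, "used" is block-level usage.
    btrfs filesystem df /mnt/data
    # More detailed view, including the still-unallocated space per device.
    btrfs filesystem usage /mnt/data

A big gap between total and used for a given chunk type is exactly the slack that a usage-filtered balance can pack down and return to the unallocated pool.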

Defrag run with the same options _should_ have no net effect after the first run, the two exceptions being if the filesystem is close to full or if the data set is being modified live while the defrag is happening. Balance run with the same options will eventually hit a point where it doesn't do anything (or only touches one chunk of each type but doesn't actually give any benefit). If you're just using the usage filters or doing a full balance, this point is the second run. If you're using other filters, it's functionally not possible to determine when that point will be without low-level knowledge of the chunk layout.

For an idle filesystem, if you run defrag and then a full balance, that will get you a near-optimal layout. Running them in the reverse order will get you a different layout that may be less optimal, because defrag may move data in such a way that new chunks get allocated. Repeated runs of defrag and balance will in more than 95% of cases provide no extra benefit.
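As a concrete sketch of that ordering (assuming the filesystem is mounted at /mnt/data and is otherwise idle):

    # 1. Defragment file data first (block level).
    btrfs filesystem defragment -r /mnt/data
    # 2. Then repack partially-filled chunks and free empty ones (chunk level).
    btrfs balance start /mnt/data

Keep in mind that a full balance re-writes everything, so on a large filesystem the second step can take many hours.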


What balancing does is send everything back through the allocator, which in
turn back-fills chunks that are only partially full, and removes ones that
are now empty.

Doesn't this have a potential chance of introducing (additional)
extent-level fragmentation?
In theory, yes. IIRC, extents can't cross a chunk boundary. Beyond that packing constraint, balance shouldn't fragment things further.
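If you want to check whether a balance (or defrag) run actually changed anything at the extent level, filefrag from e2fsprogs will show a file's extent list before and after (note that on btrfs, compressed files are reported as many 128KiB extents, so the raw count needs some interpretation):

    # Show the extent layout of one file; compare the extent count before and after.
    filefrag -v /mnt/data/some-large-file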

FWIW, while there isn't a daemon yet that does this, it's a perfect thing
for a cronjob (a rough crontab sketch follows the list below).  The general
maintenance regimen that I use for most of my filesystems is:
* Run 'btrfs balance start -dusage=20 -musage=20' daily.  This will complete
really fast on most filesystems, and keeps the slack space relatively
under control (and has the nice bonus that it helps defragment free space).
* Run a full scrub on all filesystems weekly.  This catches silent
corruption of the data, and will fix it if possible.
* Run a full defrag on all filesystems monthly.  This should be run before
the balance (the reasons are complicated and require more explanation than
you probably care for).  On HDDs I would run this at least weekly though, as
they tend to be more negatively impacted by fragmentation.
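A rough /etc/cron.d style sketch of that schedule (the mount point /mnt/data and the exact times are placeholders, adjust to taste):

    # Daily: cheap balance of mostly-empty chunks to keep slack space down.
    15 3 * * *   root   btrfs balance start -dusage=20 -musage=20 /mnt/data
    # Weekly (Sunday): full scrub to catch and repair silent corruption.
    30 3 * * 0   root   btrfs scrub start -B /mnt/data
    # Monthly (on the 1st), scheduled ahead of that day's balance: full defrag.
    0 1 1 * *    root   btrfs filesystem defragment -r /mnt/data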

I wonder if one should always run a full balance instead of a full
scrub, since balance should also read (and thus theoretically verify)
the meta-/data (does it though? I would expect it to check the
checksums, but who knows...? maybe it's "optimized" to skip that
step?) and also perform the "consolidation" of the chunk level.
Scrub uses fewer resources than balance. Balance has to read _and_ re-write all data in the FS regardless of the state of the data. Scrub only needs to read the data when it's good, and when it's bad it only has to re-write (for raid1) the bad replica, not both copies. In fact, the only practical reason to run balance on a regular basis at all is to compact allocations and defragment free space, which is why I only have it balance chunks that are less than 1/5 full.
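Scrub also reports afterwards whether it found (and fixed) anything, which balance won't summarize for you. For example:

    # Summary of the last/ongoing scrub: bytes scrubbed, errors found and corrected.
    btrfs scrub status /mnt/data
    # Per-device error counters (read/write/csum errors) accumulated since the last reset.
    btrfs device stats /mnt/data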

I wish there was some more "integrated" solution for this: a
balance-like operation which consolidates the chunks and also
de-fragments the file extents at the same time, while passively
uncovering (and fixing if necessary and possible) any checksum mismatches
/ data errors, so that balance and defrag can't work against
each other and the overall work is minimized (compared to several full
runs or many different commands).
More than 90% of the time, the performance difference between the absolute optimal layout and the one generated by just running defrag and then balancing is so small that it's insignificant. The closer to the optimal layout you get, the lower the returns for optimizing further (and this in fact applies to any filesystem). In essence, it's a bit like the traveling salesman problem: any arbitrary solution probably isn't optimal, but it's generally close enough not to matter.

As far as scrub fitting into all of this, I'd personally rather have a daemon that slowly (less than 1% bandwidth usage) scrubs the FS over time in the background and logs and fixes errors it encounters (similar to how filesystem scrubbing works in many clustered filesystems) instead of always having to manually invoke it and jump through hoops to keep the bandwidth usage reasonable.
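Until something like that exists, a rough approximation is to start the scrub in the idle I/O scheduling class so it mostly yields to foreground traffic (how well this throttles it depends on the workload and the I/O scheduler in use):

    # -c 3 selects the idle ioprio class for scrub's I/O (see ionice(1)).
    btrfs scrub start -c 3 /mnt/data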