Re: btrfs-scrub: slow scrub speed (raid5)

On Thu, Feb 06, 2020 at 07:13:41PM +0100, Sebastian Döring wrote:
> (oops, forgot to reply all the first time)
> 
> > Is RAID5 stable? I was under the impression that it wasn't.
> >
> > -m
> 
> Not sure, but AFAIK most of the known issues have been addressed.
> 
> I did some informal testing with a bunch of usb devices, ripping them
> out during writes, remounting the array with a device missing in
> degraded mode, then replacing the device with a fresh one, etc. Always
> worked fine. Good enough for me. The scary write hole seems hard to
> hit, power outages are rare and if they happen I will just run a scrub
> immediately.

It's quite hard to hit the write hole for data.  A bunch of stuff has
to happen at the same time:

	- Writes have to be small.  btrfs actively tries to prevent this,
	but it can be defeated by a workload that uses fsync().  Big writes
	get their own complete RAID stripes, so there is no write hole.
	Writes smaller than a RAID stripe (64K * (number_of_disks - 1);
	see the sketch after this list) will be packed into smaller gaps
	in the free space map, more of which will be colocated in RAID
	stripes with previously committed data.  This is good for
	collections of big media files, and bad for databases, VM images,
	and build trees.

	- The filesystem has to have partially filled RAID stripes, as
	the write hole cannot occur in an empty or full RAID stripe. [1]
	A heavily fragmented filesystem has more of these.  An empty
	filesystem (or a recently balanced one) has fewer.

	- Power needs to fail (or the host needs to crash) *during* a
	write that meets the other criteria.  Hosts that spend only 1%
	of their time writing will have write hole failures at 10x
	lower rates than hosts that spend 10% of their time writing.
	The vulnerable interval for write hole is very short--typically
	less than a millisecond--but if you are writing to thousands of
	RAID stripes per second, then there are thousands of write hole
	windows per second.

	- Write hole can only affect a system in degraded mode, so after
	all of the above, you are still only _at risk_ of a write hole
	failure--you also need a disk fault to occur before a scrub gets
	a chance to repair the parity.
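
To put rough numbers on the first and third points above, here is a
back-of-the-envelope sketch (plain Python; the write rate, window length,
and write duty cycle are illustrative assumptions, not measurements):

    # Rough arithmetic for RAID5 stripe width and write hole exposure.
    # All inputs below are illustrative assumptions, not measured values.

    STRIPE_ELEMENT = 64 * 1024                  # data bytes per disk per stripe

    def stripe_width(number_of_disks):
        """Data bytes in one full RAID5 stripe: 64K * (number_of_disks - 1)."""
        return STRIPE_ELEMENT * (number_of_disks - 1)

    def vulnerable_seconds_per_day(stripes_per_second,
                                   window_seconds=0.001,    # ~1 ms per stripe write
                                   writing_fraction=0.10):  # host writes 10% of the time
        """Crude upper bound on time per day spent inside a write hole window.

        Concurrent stripe writes overlap their windows, so the real figure
        is lower, but it scales linearly with the writing duty cycle."""
        return 86400 * writing_fraction * stripes_per_second * window_seconds

    for disks in (3, 4, 6):
        print(f"{disks} disks: full stripe holds {stripe_width(disks) // 1024} KiB of data")

    # e.g. thousands of partial-stripe writes per second:
    print(f"{vulnerable_seconds_per_day(1000):.0f} seconds/day of write hole windows")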

It is harder to meet these conditions for data, but they are the common
case for metadata.  Metadata is all 16K writes, and the 'nossd' allocator
perversely prefers partially-filled RAID stripes.  btrfs spends a _lot_
of its time doing metadata writes.  This maximizes all the prerequisites
above.  btrfs has zero tolerance for uncorrectable metadata loss, so
raid5 and raid6 should never be used for metadata.  Adding the 'nossd'
mount option will make total filesystem loss even faster.
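
If an existing array already has raid5/raid6 metadata, the usual fix is
a metadata-only convert balance.  A minimal sketch (Python wrapping
btrfs-progs; the mountpoint is a placeholder):

    import subprocess

    def convert_metadata_to_raid1(mountpoint):
        """Rewrite every metadata block group as raid1.

        Equivalent to running: btrfs balance start -mconvert=raid1 <mountpoint>"""
        subprocess.run(
            ["btrfs", "balance", "start", "-mconvert=raid1", mountpoint],
            check=True,
        )

    convert_metadata_to_raid1("/mnt")    # placeholder mountpoint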

In practice, btrfs raid5 data with raid1 metadata fails (at least one
block unreadable) about once per 30 power failures or crashes while
running a write stress test designed to maximize the conditions listed
above.  I'm not sure if all of that failure rate is due to write hole
or other currently active btrfs raid5 bugs--we'd have to fix the other
bugs and measure the change in failure rate to know.

If your workload doesn't meet the above criteria then the failure rates
will be lower.  A light-duty SOHO file sharing server will probably
last 5 years between data losses with a disk fault every 2.5 years;
however, if you put a database or VM host on that server then it might
have losses on almost every power failure.  The rate you will experience
depends on your workload.  As long as metadata is raid1, 10, 1c3 or 1c4,
the data losses which do occur due to write hole will be small, and can
be recovered by deleting and replacing the damaged files from backups.
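
Finding those files from userspace is simple, if slow: read everything
back and record what fails (a sketch only; the path below is a
placeholder, and the kernel log will also point at the affected blocks):

    import os

    def find_unreadable_files(root):
        """Walk a tree and return paths that fail to read back (e.g. EIO)."""
        bad = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while f.read(1 << 20):      # read in 1 MiB chunks
                            pass
                except OSError as err:
                    bad.append((path, err.errno))
        return bad

    for path, errno_ in find_unreadable_files("/mnt/data"):    # placeholder path
        print(path, errno_)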


[1] except in nodatacow and prealloc files; however, nodatacow is
basically a flag that says "please allow my data to be corrupted as much
as possible without intentionally destroying it," so this is expected.
Prealloc prevents the allocator from avoiding partially filled RAID
stripes because it forces logically consecutive writes to be physically
consecutive as well.
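
For reference, 'prealloc' here means extents created with fallocate();
from userspace that looks like the following (sketch only; the path and
size are arbitrary):

    import os

    # Preallocate 1 GiB of extents for a file with fallocate().  On btrfs
    # these become PREALLOC extents, whose on-disk layout is fixed up
    # front as described above.  Path and size are arbitrary examples.
    fd = os.open("/mnt/data/vm.img", os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, 1 << 30)
    finally:
        os.close(fd)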
