On Thu, Feb 06, 2020 at 07:13:41PM +0100, Sebastian Döring wrote:
> (oops, forgot to reply all the first time)
>
> > Is RAID5 stable? I was under the impression that it wasn't.
> >
> > -m
>
> Not sure, but AFAIK most of the known issues have been addressed.
>
> I did some informal testing with a bunch of usb devices, ripping them
> out during writes, remounting the array with a device missing in
> degraded mode, then replacing the device with a fresh one, etc. Always
> worked fine. Good enough for me. The scary write hole seems hard to
> hit, power outages are rare and if they happen I will just run a scrub
> immediately.

It's quite hard to hit the write hole for data. A bunch of stuff has
to happen at the same time:

- Writes have to be small. btrfs actively tries to prevent this, but
  can be defeated by a workload that uses fsync(). Big writes will get
  their own complete RAID stripes, therefore no write hole. Writes
  smaller than a RAID stripe (64K * (number_of_disks - 1)) will be
  packed into smaller gaps in the free space map, more of which will be
  colocated in RAID stripes with previously committed data. This is
  good for collections of big media files, and bad for databases, VM
  images, and build trees.

- The filesystem has to have partially filled RAID stripes, as the
  write hole cannot occur in an empty or full RAID stripe. [1] A
  heavily fragmented filesystem has more of these. An empty filesystem
  (or a recently balanced one) has fewer.

- Power needs to fail (or the host crash) *during* a write that meets
  the other criteria. Hosts that spend only 1% of their time writing
  will have write hole failures at 10x lower rates than hosts that
  spend 10% of their time writing. The vulnerable interval for write
  hole is very short--typically less than a millisecond--but if you are
  writing to thousands of RAID stripes per second, then there are
  thousands of write hole windows per second.

- Write hole can only affect a system in degraded mode, so after all
  the above, you're still only _at risk_ of a write hole failure--you
  also need a disk fault to occur before you can repair the parity
  with a scrub.

It is harder to meet these conditions for data, but it's the common
case for metadata. Metadata is all 16K writes, and the 'nossd'
allocator perversely prefers partially-filled RAID stripes. btrfs
spends a _lot_ of its time doing metadata writes. This maximizes all
the prerequisites above. btrfs has zero tolerance for uncorrectable
metadata loss, so raid5 and raid6 should never be used for metadata.
Adding the 'nossd' mount option will make total filesystem loss even
faster.

In practice, btrfs raid5 data with raid1 metadata fails (at least one
block unreadable) about once per 30 power failures or crashes while
running a write stress test designed to maximize the conditions listed
above. I'm not sure if all of that failure rate is due to write hole
or to other currently active btrfs raid5 bugs--we'd have to fix the
other bugs and measure the change in failure rate to know.

If your workload doesn't meet the above criteria then the failure
rates will be lower. A light-duty SOHO file sharing server will
probably last 5 years between data losses with a disk fault every 2.5
years; however, if you put a database or VM host on that server then
it might have losses on almost every power failure. The rate you will
experience depends on your workload.
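To put rough numbers on the stripe and exposure arithmetic above, here
is a small Python sketch. The disk count, the 50% partial-stripe
fraction and the 1%/10% duty cycles are made-up illustrative inputs,
not measured btrfs figures; the only number taken from the description
above is the 64K per-disk stripe element.

    # Back-of-the-envelope sketch of relative write hole exposure.
    # All workload numbers below are illustrative assumptions.

    STRIPE_ELEMENT = 64 * 1024  # per-disk stripe element, 64K as above

    def full_stripe_width(num_disks):
        # Data bytes in one complete raid5 stripe: 64K * (number_of_disks - 1).
        # Writes at least this large get their own stripes and avoid the hole.
        return STRIPE_ELEMENT * (num_disks - 1)

    def relative_exposure(write_duty_cycle, partial_stripe_fraction):
        # Unitless measure of how often a crash can land inside a write
        # that touches a partially filled raid stripe.
        return write_duty_cycle * partial_stripe_fraction

    print(full_stripe_width(4))   # 196608 (192K) on a 4-disk raid5

    light = relative_exposure(0.01, 0.5)   # host writing 1% of the time
    heavy = relative_exposure(0.10, 0.5)   # host writing 10% of the time
    print(heavy / light)          # 10.0, the 10x difference described above

In the same spirit, the degraded-mode requirement means you would
further multiply by the chance of a disk fault happening before the
next scrub repairs the parity.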
As long as metadata is raid1, 10, 1c3 or 1c4, the data losses which do
occur due to write hole will be small, and can be recovered by deleting
and replacing the damaged files from backups (a rough sketch of that
restore step is at the end of this mail).

[1] Except in nodatacow and prealloc files; however, nodatacow is
basically a flag that says "please allow my data to be corrupted as
much as possible without intentionally destroying it," so this is
expected. Prealloc prevents the allocator from avoiding partially
filled RAID stripes because it forces logically consecutive writes to
be physically consecutive as well.
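For completeness, here is a minimal sketch of that delete-and-restore
step, assuming the list of damaged paths has already been collected
(e.g. from the checksum/read error messages the kernel logs during a
scrub) and that a backup tree is mounted somewhere. The paths and the
backup location below are hypothetical placeholders.

    # Minimal sketch: replace files reported damaged with copies from backup.
    # BACKUP_ROOT and DAMAGED_FILES are hypothetical placeholders; in
    # practice the damaged paths come from scrub / kernel log output.
    import os
    import shutil

    BACKUP_ROOT = "/mnt/backup"
    DAMAGED_FILES = [
        "/data/vm/disk0.img",
        "/data/photos/2019/img_1234.jpg",
    ]

    for path in DAMAGED_FILES:
        backup_copy = os.path.join(BACKUP_ROOT, path.lstrip("/"))
        os.remove(path)                  # delete the damaged file
        shutil.copy2(backup_copy, path)  # then restore it from backup
        print("restored", path, "from", backup_copy)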
