Re: Western Digital Red's SMR and btrfs?

On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote:
> On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell
> <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > Some kinds of RAID rebuild don't provide sufficient idle time to complete
> > the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
> > down too much, the kernel times out on IO, and reports that the drive
> > has failed.  The RAID system running on top thinks the drive is faulty
> > (a false positive failure) and the fun begins (hope you don't have two
> > of these drives in the same array!).
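
For context: the kernel's SCSI command timer defaults to 30 seconds, far
below the recovery times discussed here, and the per-device value can be
read from sysfs.  A minimal Python sketch (the device name "sda" is just
a placeholder):

    #!/usr/bin/env python3
    # Minimal sketch: read the kernel's per-device IO (SCSI command)
    # timeout in seconds.  The kernel default is 30.
    from pathlib import Path

    dev = "sda"  # placeholder; substitute the drive in question
    timeout = Path(f"/sys/block/{dev}/device/timeout").read_text().strip()
    print(f"{dev}: kernel IO timeout = {timeout} seconds (default is 30)")
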
> 
> This came up on linux-raid@ list today also, and someone posted this
> smartmontools bug.
> https://www.smartmontools.org/ticket/1313
> 
> It notes, in part, this error, which is not a timeout.

Uhhh...wow.  If that's not an individual broken disk, but the programmed
behavior of the firmware, that would mean the drive model is not usable
at all.

> [20809.396284] blk_update_request: I/O error, dev sdd, sector
> 3484334688 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
> 
> An explicit write error means the drive is defective. But even
> slowdowns that result in link resets are a defect. The marketing of
> DM-SMR says it's suitable for use without local customizations to
> account for the drive being SMR.
> 
> 
> > Desktop CMR drives (which are not good in RAID arrays but people use
> > them anyway) have firmware hardcoded to retry reads for about 120
> > seconds before giving up.  To use desktop CMR drives in RAID arrays,
> > you must increase the Linux kernel IO timeout to 180 seconds or risk
> > false positive rejections (i.e. multi-disk failures) from RAID arrays.
> 
> I think we're way past the time when all desktop-oriented Linux
> installations should have overridden the kernel default, using
> 180-second timeouts instead. Even in the single-disk case. The system
> is better off failing safe with a slow response, rather than link
> resets and a subsequent face plant. But these days almost every laptop
> and desktop's sysroot is on an SSD of some kind.
> 
> 
> > Now here is the problem:  DM-SMR drives have write latencies of up to 300
> > seconds in *non-error* cases.  They are up to 10,000 times slower than
> > CMR in the worst case.  Assume an additional 120 seconds for error
> > recovery on top of the non-error write latency, add an extra 50% for
> > safety, and the SMR drive should be configured with a 630-second
> > timeout (10.5 minutes) in the Linux kernel to avoid false positive
> > failures.
> 
> Incredible.
> 
> 
> -- 
> Chris Murphy


