On Thu, Dec 26, 2019 at 7:05 AM Leszek Dubiel <leszek@xxxxxxxxx> wrote:
>
> Dec 22 19:20:11 wawel kernel: [ 5912.116874] ata1.00: exception Emask
> 0x0 SAct 0x1f80 SErr 0x0 action 0x0
> Dec 22 19:20:11 wawel kernel: [ 5912.116878] ata1.00: irq_stat 0x40000008
> Dec 22 19:20:11 wawel kernel: [ 5912.116880] ata1.00: failed command:
> READ FPDMA QUEUED
> Dec 22 19:20:11 wawel kernel: [ 5912.116882] ata1.00: cmd
> 60/00:38:00:00:98/0a:00:45:01:00/40 tag 7 ncq dma 1310720 in
> Dec 22 19:20:11 wawel kernel: [ 5912.116882] res
> 43/40:18:e8:05:98/00:04:45:01:00/40 Emask 0x409 (media error) <F>
> Dec 22 19:20:11 wawel kernel: [ 5912.116885] ata1.00: status: { DRDY
> SENSE ERR }
> Dec 22 19:20:11 wawel kernel: [ 5912.116886] ata1.00: error: { UNC }
> Dec 22 19:20:11 wawel kernel: [ 5912.153695] ata1.00: configured for
> UDMA/133
> Dec 22 19:20:11 wawel kernel: [ 5912.153707] sd 0:0:0:0: [sda] tag#7
> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Dec 22 19:20:11 wawel kernel: [ 5912.153709] sd 0:0:0:0: [sda] tag#7
> Sense Key : Medium Error [current]
> Dec 22 19:20:11 wawel kernel: [ 5912.153710] sd 0:0:0:0: [sda] tag#7
> Add. Sense: Unrecovered read error - auto reallocate failed
> Dec 22 19:20:11 wawel kernel: [ 5912.153711] sd 0:0:0:0: [sda] tag#7
> CDB: Read(16) 88 00 00 00 00 01 45 98 00 00 00 00 0a 00 00 00
> Dec 22 19:20:11 wawel kernel: [ 5912.153712] print_req_error: I/O error,
> dev sda, sector 5462556672
> Dec 22 19:20:11 wawel kernel: [ 5912.153724] ata1: EH complete
> Dec 22 19:21:28 wawel kernel: [ 5989.527853] BTRFS info (device sda2):
> found 8 extents
Weird. This is not expected. I see a discrete read error with the LBA
reported for the device, and yet Btrfs shows no attempt to correct it
(using the raid1 metadata copy), nor does it report the path to the
file affected by this lost sector. I'd expect to see one of those two
outcomes, given the profiles in use.
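It's worth double checking which profiles are actually in use and
whether Btrfs has recorded any device errors. Assuming the file system
is mounted at /mnt (adjust for your setup), something like:

    # show the data and metadata block group profiles
    btrfs filesystem df /mnt

    # per-device read/write/corruption error counters
    btrfs device stats /mnt

If data is single and metadata is raid1, a lost data sector can only
be reported, not repaired from a second copy.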
>
> Dec 23 00:08:20 wawel kernel: [23201.188424] INFO: task btrfs:2546
> blocked for more than 120 seconds.
Multiple instances of these, but no coinciding read errors or SATA
link resets. This suggests bad sectors in deep recovery, and that
would explain why the copies are so slow. It's not a Btrfs problem per
se. Since you've chosen to keep only one copy of the data,
self-healing of data isn't possible. The file system itself is fine,
but slow, because the affected drive is slow to recover these bad
sectors.
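You can get a rough idea of how many marginal sectors the drive is
dealing with from its SMART attributes. A minimal check, assuming a
SMART-capable SATA drive at /dev/sda:

    # look at Current_Pending_Sector and Reallocated_Sector_Ct
    smartctl -A /dev/sda

A non-zero pending sector count means the drive has sectors it
couldn't read and is waiting for a write (or a successful
deep-recovery read) to resolve them.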
Again, lowering the SCT ERC timeout for the drives would result in
faster error recovery when encountering bad sectors. It also increases
the chance of data loss (though not metadata loss, since that's raid1).
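For reference, a sketch of how that's typically done, assuming the
drive supports SCT ERC and sits at /dev/sda:

    # read the current SCT ERC read/write timeouts
    smartctl -l scterc /dev/sda

    # set both read and write recovery limits to 7.0 seconds
    # (values are in units of 100 ms)
    smartctl -l scterc,70,70 /dev/sda

The setting usually doesn't survive a power cycle, so it has to be
reapplied at boot. And the kernel's SCSI command timer
(/sys/block/sda/device/timeout, 30 seconds by default) should stay
comfortably above the ERC value, so the drive reports the unreadable
sector before the kernel resets the link.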
--
Chris Murphy