Can't repair raid 1 array after drive failure

Hi again,
I'm still running into problems with btrfs. For testing purposes, I
created a raid 1 filesystem yesterday and let the computer copy a ton
of data onto it overnight:

Label: 'BTRFS1'  uuid: 61e5aba9-6811-46ae-9396-35a72d3b1117
        Total devices 3 FS bytes used 1.15TiB
        devid    1 size 5.46TiB used 1.16TiB path /dev/sdc1
        devid    3 size 698.64GiB used 10.00GiB path /dev/sdf
        devid    4 size 1.82TiB used 1.15TiB path /dev/sde

Today I started a scrub and looked at the status some hours later,
which reported tens of millions of errors on drive 4:

root@OMV:/var# btrfs scrub status /srv/dev-disk-by-label-BTRFS1/
scrub status for 61e5aba9-6811-46ae-9396-35a72d3b1117
        scrub started at Fri May  1 11:37:36 2020, running for 04:37:48
        total bytes scrubbed: 1.58TiB with 75751000 errors
        error details: read=75751000
        corrected errors: 0, uncorrectable errors: 75750996, unverified errors: 0

(The per-device breakdown is not shown above, but the read errors were
all on drive 4; see the command below.)
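For reference, the per-device breakdown can be printed directly with
the -d flag (output omitted here):

root@OMV:/var# btrfs scrub status -d /srv/dev-disk-by-label-BTRFS1/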

Then I found that the drive had gone missing:

Label: 'BTRFS1'  uuid: 61e5aba9-6811-46ae-9396-35a72d3b1117
        Total devices 3 FS bytes used 1.15TiB
        devid    1 size 5.46TiB used 1.16TiB path /dev/sdc1
        devid    3 size 698.64GiB used 10.00GiB path /dev/sdf
        *** Some devices missing

Canceled scrub:
root@OMV:/var# btrfs scrub cancel /srv/dev-disk-by-label-BTRFS1/
scrub cancelled

Device stats show lots of errors on sde, which is the missing drive:
root@OMV:/var# btrfs device stats /srv/dev-disk-by-label-BTRFS1/
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
[/dev/sdf].write_io_errs    0
[/dev/sdf].read_io_errs     0
[/dev/sdf].flush_io_errs    0
[/dev/sdf].corruption_errs  0
[/dev/sdf].generation_errs  0
[/dev/sde].write_io_errs    154997860
[/dev/sde].read_io_errs     77170574
[/dev/sde].flush_io_errs    310
[/dev/sde].corruption_errs  0
[/dev/sde].generation_errs  0
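
Side note: these counters persist across reboots; once the array is
healthy again they can be zeroed with -z to get a clean baseline:

root@OMV:/var# btrfs device stats -z /srv/dev-disk-by-label-BTRFS1/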


I tried to replace the failed drive:
root@OMV:/var# btrfs replace start 2 /dev/sdb /srv/dev-disk-by-label-BTRFS1/ &
[1] 1809
root@OMV:/var# ERROR: '2' is not a valid devid for filesystem '/srv/dev-disk-by-label-BTRFS1/'

--> That's inconsistent with the device remove syntax, which
apparently accepts a non-existent number (see below)? I try again
using the /dev/sdX syntax, but since sde is gone I rescan, and the
drive now shows up as sdi!
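
(In hindsight, the devid has to be one of those listed by fi show,
i.e. 1, 3 or 4 here; assuming a missing device can be addressed by its
devid, the call would presumably have been:

root@OMV:/var# btrfs replace start 4 /dev/sdb /srv/dev-disk-by-label-BTRFS1/

I only realized that later, so I continued with the path syntax.)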

root@OMV:/var# btrfs replace start /dev/sdi /dev/sdb /srv/dev-disk-by-label-BTRFS1/
root@OMV:/var# ERROR: target device smaller than source device (required 2000398934016 bytes)

--> OK, there is a restriction on drive size. I'm not sure this is
really necessary, because the new drive is big enough to hold all the
data in the raid 1 profile, but OK if it's implemented like that. I
have no larger drive available, so I try to replace the failed drive
by adding another drive and then removing the failed one:

root@OMV:/var# btrfs dev add /dev/sdb /srv/dev-disk-by-label-BTRFS1/
root@OMV:/var# btrfs fi show

Label: 'BTRFS1'  uuid: 61e5aba9-6811-46ae-9396-35a72d3b1117
        Total devices 4 FS bytes used 1.15TiB
        devid    1 size 5.46TiB used 1.16TiB path /dev/sdc1
        devid    3 size 698.64GiB used 10.00GiB path /dev/sdf
        devid    5 size 931.51GiB used 0.00B path /dev/sdb
        *** Some devices missing

root@OMV:/var# btrfs device remove missing /srv/dev-disk-by-label-BTRFS1/
ERROR: error removing device 'missing': no missing devices found to remove

--> But fi show says there are devices missing??
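
My understanding (which may be wrong) is that the kernel can keep
holding the vanished device until the filesystem is remounted, so what
I'd try next, assuming nothing else is in flight, is a degraded
remount followed by the remove:

root@OMV:/var# umount /srv/dev-disk-by-label-BTRFS1/
root@OMV:/var# mount -o degraded /dev/sdc1 /srv/dev-disk-by-label-BTRFS1/
root@OMV:/var# btrfs device remove missing /srv/dev-disk-by-label-BTRFS1/

For now I kept poking at the mounted filesystem: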

root@OMV:/var# btrfs device remove 2 /srv/dev-disk-by-label-BTRFS1/
Killed

--> What does "Killed" mean here?
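
My guess is that the OOM killer got the process; if so, the kernel log
should confirm it (exact message wording varies by kernel version):

root@OMV:/var# dmesg | grep -i -E 'out of memory|oom'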

Try again with a non-existent number:
root@OMV:/var# btrfs device remove 8 /srv/dev-disk-by-label-BTRFS1/
ERROR: error removing devid 8: add/delete/balance/replace/resize
operation in progress

--> Seems like some operation is still going on (presumably the remove
that just got killed?), but fi show doesn't show any progress:

root@OMV:/var# btrfs fi show
Label: 'BTRFS1'  uuid: 61e5aba9-6811-46ae-9396-35a72d3b1117
        Total devices 4 FS bytes used 1.15TiB
        devid    1 size 5.46TiB used 1.16TiB path /dev/sdc1
        devid    3 size 698.64GiB used 10.00GiB path /dev/sdf
        devid    5 size 931.51GiB used 0.00B path /dev/sdb
        *** Some devices missing

--> The used space on devid 5 should be increasing, but it isn't.
There is also no HDD activity LED flashing.
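
One way to watch for chunk movement, assuming the stuck remove is
actually relocating anything, would be the per-device allocation
breakdown:

root@OMV:/var# btrfs fi usage /srv/dev-disk-by-label-BTRFS1/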

I try to start a balance, but balance status reports that no balance was found:

root@OMV:/var# btrfs balance start --bg /srv/dev-disk-by-label-BTRFS1/
root@OMV:/var# btrfs balance status /srv/dev-disk-by-label-BTRFS1/
No balance found on '/srv/dev-disk-by-label-BTRFS1/'
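
If it helps, the kernel log around that time should say why the
balance exited immediately:

root@OMV:/var# dmesg | tail -n 50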

I think the data itself is still OK. There are kernel messages like:

May  1 14:05:44 OMV kernel: [235034.853573] BTRFS warning (device sdc1): i/o error at logical 774427533312 on dev /dev/sde, physical 656521453568, root 471, inode 458, offset 64360448, length 4096, links 1 (path: Filme/Meine erfundene Frau.mkv)

I calculated the sha256 of that file (which is a crappy movie; see the
command below) and compared it with the original: the file reads back
correctly. I have not checked all files, but I'd guess they are
readable. But how do I get back to a healthy and redundant filesystem
now? I'd try a reboot next to see if that helps, but wanted to ask for
further steps first. And are there any log files or other information
that would be useful for the developers?
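
For reference, the check was simply this, with the path taken from the
kernel warning above (root 471 is a subvolume, so the prefix may
differ depending on where it is mounted):

root@OMV:/var# sha256sum '/srv/dev-disk-by-label-BTRFS1/Filme/Meine erfundene Frau.mkv'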

Note: The 2TB drive is not the bad drive that I mentioned in my other
mail, but a drive that had worked without showing problems and was
removed from my old NAS during a capacity upgrade a couple of years
ago. It's possible that there are some hardware issues with the
mainboard, SATA cables or drives, though. It's all old hardware.

Version info:
btrfs-progs v4.20.1
Kernel 5.4.0-0.bpo.4-amd64


