Re: 5.4.20: cannot mount device that blipped off the bus: duplicate device fsid:devid for


On 4/14/20 8:38 AM, Marc MERLIN wrote:
Anand, I had this happen again with 5.5.11, and it was impossible to do
anything to fix it; I had to reboot again.
btrfs device scan --forget
did nothing.

See details:
BTRFS: device label btrfs_space devid 1 transid 35178413 /dev/sde1
BTRFS info (device sde1): use lzo compression, level 0
BTRFS info (device sde1): disk space caching is enabled
BTRFS info (device sde1): has skinny extents
BTRFS info (device sde1): enabling ssd optimizations
sd 6:1:3:0: [sde] tag#642 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=2s
sd 6:1:3:0: [sde] tag#640 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=2s
sd 6:1:3:0: [sde] tag#702 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=2s
sd 6:1:3:0: [sde] tag#702 CDB: Write(16) 8a 00 00 00 00 00 f1 a7 3a 68 00 00 01 f0 00 00
blk_update_request: I/O error, dev sde, sector 4054268520 op 0x1:(WRITE) flags 0x100000 phys_seg 62 prio class 0
sd 6:1:3:0: [sde] tag#701 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=2s
sd 6:1:3:0: [sde] tag#701 CDB: Write(16) 8a 00 00 00 00 00 f1 a7 38 68 00 00 02 00 00 00
blk_update_request: I/O error, dev sde, sector 4054268008 op 0x1:(WRITE) flags 0x104000 phys_seg 64 prio class 0
sd 6:1:3:0: [sde] tag#700 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=2s
sd 6:1:3:0: [sde] tag#700 CDB: Write(16) 8a 00 00 00 00 00 f1 a7 36 68 00 00 02 00 00 00
blk_update_request: I/O error, dev sde, sector 4054267496 op 0x1:(WRITE) flags 0x104000 phys_seg 64 prio class 0
BTRFS error (device sde1): bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
sd 6:1:3:0: [sde] tag#641 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=10s
sd 6:1:3:0: [sde] tag#641 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00


BTRFS info (device sde1): forced readonly

Unfortunately, forcing the filesystem read-only is the only thing we do as of now.

BTRFS warning (device sde1): Skipping commit of aborted transaction.
BTRFS: error (device sde1) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS info (device sde1): delayed_refs has NO entry
btrfs_dev_stat_print_on_error: 244 callbacks suppressed

gargamel:~# dmtail 3
[1887142.765448] BTRFS error (device sde1): bdev /dev/sde1 errs: wr 1038, rd 4529, flush 0, corrupt 0, gen 0
[1887142.795820] BTRFS error (device sde1): bdev /dev/sde1 errs: wr 1038, rd 4530, flush 0, corrupt 0, gen 0
[1887142.826176] BTRFS error (device sde1): bdev /dev/sde1 errs: wr 1038, rd 4531, flush 0, corrupt 0, gen 0

gargamel:~# cat /proc/partitions  |grep sd[ep]
    8      240 3750738264 sdp
    8      241 3750737223 sdp1

So the same device reappears as sdp. But btrfs does not yet close a failed device (patches are on the mailing list), so the old path sde
is still held open in the block layer. I guess /proc/partitions
doesn't show the non-working sde.

gargamel:~# mount | grep sde
/dev/sde1 on /mnt/btrfs_space type btrfs (ro,noatime,compress=lzo,ssd,discard,space_cache,skip_balance,subvolid=5,subvol=/)
/dev/sde1 on /var/local/space type btrfs (ro,noexec,noatime,compress=lzo,ssd,discard,space_cache,skip_balance,subvolid=257,subvol=/varlocalspace)
/dev/sde1 on /var/cache/zoneminder type btrfs (ro,nosuid,nodev,noatime,compress=lzo,ssd,discard,space_cache,skip_balance,subvolid=257,subvol=/varlocalspace/zoneminder)
/dev/sde1 on /var/lib/mysql type btrfs (ro,nosuid,nodev,noatime,compress=lzo,ssd,discard,space_cache,skip_balance,subvolid=3648,subvol=/mysql)


gargamel:~# umount /mnt/btrfs_space; umount /var/local/space; umount /var/cache/zoneminder; umount /var/lib/mysql


gargamel:~# mount | grep sde
It would have been better to grep for sdp here as well.
Also, /proc/self/mounts will be more accurate, as it probes the fs module.
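
As a rough sketch of that check, the helper below lists the btrfs entries in /proc/self/mounts whose source device matches a pattern, so both the old and the new name can be checked at once. The function name btrfs_mounts_for is made up for illustration, and the optional second argument is only there so it can be tried against a saved copy of the mounts file.

```shell
# Print "device mountpoint" for every btrfs entry whose source device
# matches the given regular expression.
# $1: regex matched against the device field, e.g. 'sd[ep]'
# $2: mounts file to read (defaults to /proc/self/mounts)
btrfs_mounts_for() {
    bmf_pattern=$1
    bmf_file=${2:-/proc/self/mounts}
    # Field 1 is the device, field 2 the mount point, field 3 the fs type.
    awk -v pat="$bmf_pattern" '$3 == "btrfs" && $1 ~ pat { print $1, $2 }' "$bmf_file"
}
```

Running btrfs_mounts_for 'sd[ep]' after the umount commands should print nothing if every subvolume mount is really gone.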

gargamel:~# mount /dev/sdp1 /mnt/mnt
mount: /mnt/mnt: mount(2) system call failed: File exists.

gargamel:~# dmtail 2
[1887142.826176] BTRFS error (device sde1): bdev /dev/sde1 errs: wr 1038, rd 4531, flush 0, corrupt 0, gen 0
[1887453.610947] BTRFS warning (device sde1): duplicate device fsid:devid for 727c7ba3-f6f9-462a-8472-453dd7d46d8a:1 old:/dev/sde1 new:/dev/sdp1

The unmount above wasn't successful. Or was it remounted by automount? Just guessing.
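
For anyone collecting these from a long log, here is a small sketch that pulls the fsid, devid, and the old and new paths out of the warning above. It assumes the message format is exactly as shown in this dmesg output; parse_duplicate_warning is a made-up name.

```shell
# Read kernel log lines on stdin and print "fsid devid old new" for each
# btrfs "duplicate device fsid:devid" warning. Other lines are ignored.
parse_duplicate_warning() {
    sed -n 's/.*duplicate device fsid:devid for \([0-9a-f-]*\):\([0-9]*\) old:\([^ ]*\) new:\([^ ]*\).*/\1 \2 \3 \4/p'
}
```

It can be fed from the kernel log, for example: dmesg | parse_duplicate_warning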


gargamel:/usr/local/bin# btrfs device scan --forget
gargamel:/usr/local/bin# mount /dev/sdp1 /mnt/mnt
mount: /mnt/mnt: mount(2) system call failed: File exists.


 Can you please send the complete kernel logs?

After reboot, I made sure sde is not used by anything weird, just simple mounts:
gargamel:~# lsblk  | grep sde
sde                                 8:64   1 931.5G  0 disk
├─sde1                              8:65   1 488.3M  0 part
├─sde2                              8:66   1  14.9G  0 part
├─sde3                              8:67   1    80G  0 part
└─sde4                              8:68   1 836.1G  0 part



So, in summary, the chronological order of events is:

 sde disappears.
 btrfs does not close the device.
 The block layer creates sdp when the disappeared device reappears.
 An unmount of sde was tried, but it might not have completed successfully; we don't have sufficient logs to prove it.
 The mount of sdp fails, and the log indicates that sde is still mounted.

So the thing(s) to fix is/are:
 The root of the issue: when sde fails we need to close the device
 so that the block layer can reuse sde when it reappears (not sdp).
 Once btrfs has closed the failed device, btrfs dev scan --forget
 can work to clean up the stale entries left behind during unmount.

 We can do something better here:
 When two different devices have the same fsid (uuid) and devid, and one
 of them is mounted, we have to fail the scan/mount of the newer device
 for obvious reasons. That's when we get the log - 'duplicate device fsid'.
 But here the case is a bit skewed, in that both are the same device with
 the same major number but a different minor number (sde, sdp). I need to
 figure out a way so that we don't treat these two device paths as
 different devices. We should probably check the guid/wwid assigned by
 the block layer, which should be the same for both of these devices, or
 as a last resort check the SCSI inquiry VPD page and get the serial
 number, but that goes well beyond what a FS should do. Let me check with
 the block layer experts what they suggest.
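
 To illustrate the wwid idea from userspace only (this is not what the kernel fix would look like), here is a minimal sketch that compares the wwid attribute of two disks. It assumes SCSI disks that expose /sys/block/<disk>/device/wwid; the function names are made up, the sysfs root is a parameter only so it can be tried against a fake tree, and the stale node (sde here) may no longer be readable once the device has dropped off the bus.

```shell
# Print the WWID that sysfs reports for a disk such as "sdp".
# Prints nothing if the attribute is missing or unreadable.
# $1: disk name, $2: sysfs root (defaults to /sys)
disk_wwid() {
    dw_root=${2:-/sys}
    cat "$dw_root/block/$1/device/wwid" 2>/dev/null || true
}

# Return 0 only when both disks report the same non-empty WWID.
# $1, $2: disk names, $3: sysfs root (defaults to /sys)
same_physical_disk() {
    spd_a=$(disk_wwid "$1" "${3:-/sys}")
    spd_b=$(disk_wwid "$2" "${3:-/sys}")
    [ -n "$spd_a" ] && [ "$spd_a" = "$spd_b" ]
}
```

 For example: same_physical_disk sde sdp && echo "same disk under two names"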

 We might need a workaround tool to force-clean a given FSID to avoid
 a reboot.

Still unknown:
Was the unmount successful? The mount logs show that device sde still exists in btrfs.

Sorry, I was diverted to other things when you reported this last time; let me take a fresh look.

Thanks, Anand


Any ideas?

Marc



