Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed

On 2017-10-06 19:33, Liu Bo wrote:
> On Thu, Oct 05, 2017 at 07:07:44AM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-10-04 16:11, Liu Bo wrote:
>>> On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
>>>> From: Anand Jain <Anand.Jain@xxxxxxxxxx>
>>>>
>>>> Write and flush errors are critical errors, upon which the device fd
>>>> must be closed and marked as failed.
>>>
>>> Can we defer the job of closing the device until umount?
>>>
>>> We can mark the device as failed and skip it while doing reads and
>>> writes, and umount can do the cleanup work.
>>>
>>> That way we don't need a dedicated thread looping around to detect a
>>> rare situation.
>> If BTRFS doesn't close the device, then it's guaranteed that, if the
>> device reconnects, it will show up under a different device node.  It
>> would also mean that the device node stays visible when there is in
>> fact no device connected to it, which is a pain from a monitoring
>> perspective.
>
> I see, you're assuming that these errors are due to disconnection of
> disks, but they could also be bad sectors (although almost impossible
> on enterprise hard disks) or other errors elsewhere in the stack.
And we need to handle any of these cases. If BTRFS is going to claim to handle device failures, it needs to handle every type of device failure it reasonably can, without making assumptions about what kind of failure occurred.
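
To make that concrete, the behavior I'm arguing for boils down to something like the sketch below. This is only a minimal userspace model of the logic, not the actual patch code, and every name in it (model_device, model_record_error, and so on) is invented for illustration:

    /*
     * Toy userspace model of failure marking; NOT btrfs code.
     * Build with: cc -std=c11 model.c
     */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum model_io_op { MODEL_OP_WRITE, MODEL_OP_FLUSH };

    struct model_device {
        const char *name;
        atomic_int write_errs;
        atomic_int flush_errs;
        atomic_bool failed;
    };

    /*
     * Record an I/O error.  The point being argued: we never try to
     * guess *why* the write or flush failed (disconnect, bad sector,
     * transport error); any critical error marks the device failed.
     */
    static void model_record_error(struct model_device *dev,
                                   enum model_io_op op)
    {
        if (op == MODEL_OP_WRITE)
            atomic_fetch_add(&dev->write_errs, 1);
        else
            atomic_fetch_add(&dev->flush_errs, 1);

        /* Write and flush errors are treated as critical unconditionally. */
        atomic_store(&dev->failed, true);
    }

    int main(void)
    {
        struct model_device dev = { .name = "sdb" };

        model_record_error(&dev, MODEL_OP_FLUSH);
        printf("%s failed=%d write_errs=%d flush_errs=%d\n",
               dev.name, atomic_load(&dev.failed),
               atomic_load(&dev.write_errs),
               atomic_load(&dev.flush_errs));
        return 0;
    }

The important part is that model_record_error() never tries to classify the cause; any critical write or flush error marks the device failed.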

> I do agree that cleanup needs to be done if the disk gets disconnected,
> but rather than doing the cleanup here, a udev rule is needed to handle
> such an event.
And why exactly does it need to involve userspace beyond notifying it that something bad happened? We know the failure happened; why should we tell userspace about it and then require userspace to tell us to handle it? This isn't a performance-critical path, since it won't happen very often, and the primary concern is data safety, not performance. So why do we need to push something that isn't a policy decision (and arguably isn't a decision at all: if we're not using a device, we shouldn't be holding it open, period) out to userspace?
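
For comparison, the two cleanup strategies in this thread reduce to roughly the following. Again, this is a toy userspace sketch with invented names, not kernel code:

    /* Toy model of the two cleanup strategies; NOT btrfs code. */
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    struct model_device {
        const char *path;
        int fd;       /* -1 once released */
        bool failed;
    };

    /* Close the device as soon as it is marked failed, so a
     * reconnected disk is not pinned behind a stale device node. */
    static void mark_failed_and_close(struct model_device *dev)
    {
        dev->failed = true;
        if (dev->fd >= 0) {
            close(dev->fd);
            dev->fd = -1;
        }
    }

    /* The alternative: only set the flag, and leave the descriptor
     * open until unmount-time teardown gets around to closing it. */
    static void mark_failed_defer_close(struct model_device *dev)
    {
        dev->failed = true;
    }

    int main(void)
    {
        /* /dev/null stands in for the real disks in this model. */
        struct model_device a = { .path = "/dev/null", .fd = -1 };
        struct model_device b = { .path = "/dev/null", .fd = -1 };

        a.fd = open(a.path, O_WRONLY);
        b.fd = open(b.path, O_WRONLY);

        mark_failed_and_close(&a);
        printf("a: failed=%d fd=%d (released immediately)\n", a.failed, a.fd);

        mark_failed_defer_close(&b);
        printf("b: failed=%d fd=%d (still open until umount)\n", b.failed, b.fd);
        close(b.fd);
        return 0;
    }

With the deferred variant, the descriptor stays open until unmount, which is exactly what keeps a stale device node around and forces a reconnected disk onto a new one.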

Looking at it a bit differently though, every time I've seen write or flush errors, either the connection to the device was unstable or the storage device was failing, and neither case should require userspace to decide how to handle things.

To be entirely honest though, from a system administrator's perspective, I would actually expect a device getting disconnected (or otherwise disappearing off the bus) to end up in a 'Missing' state, not a 'Failed' state. I see no reason to treat it any differently than if it had disappeared before the volume was mounted, other than the requirement to actually handle it. That type of handling _will_ require userspace involvement (I seriously doubt there will ever be proper notification from the block layer to holders that a device is gone), but that's also largely orthogonal to this patch set; there's no reason we need to handle write or flush errors the same way.
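
As a rough sketch of that distinction (the state names and the transition table here are hypothetical, not anything btrfs currently implements):

    /* Toy model of a Missing-vs-Failed state machine; NOT btrfs API. */
    #include <stdio.h>

    enum model_dev_state {
        MODEL_DEV_OK,
        MODEL_DEV_MISSING,  /* device disappeared from the bus */
        MODEL_DEV_FAILED,   /* device present but returning errors */
    };

    enum model_event {
        MODEL_EV_DISCONNECT,
        MODEL_EV_WRITE_ERROR,
        MODEL_EV_FLUSH_ERROR,
    };

    static enum model_dev_state model_next_state(enum model_event ev)
    {
        switch (ev) {
        case MODEL_EV_DISCONNECT:
            /* Same state as a device absent at mount time. */
            return MODEL_DEV_MISSING;
        case MODEL_EV_WRITE_ERROR:
        case MODEL_EV_FLUSH_ERROR:
            return MODEL_DEV_FAILED;
        }
        return MODEL_DEV_OK;
    }

    int main(void)
    {
        printf("disconnect -> %d, write error -> %d\n",
               model_next_state(MODEL_EV_DISCONNECT),
               model_next_state(MODEL_EV_WRITE_ERROR));
        return 0;
    }

A disconnect lands the device in the same state it would have had if it were absent at mount time, while write and flush errors mark a device that is still present but failing.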