Re: [ogfs-dev]Recovery Race conditions
Updated list after sig.
Greg
--
Greg Freemyer
======
Recovery comprises:
Cluster Lock Recovery
Journal Replay
Granting of queued lock requests to waiting nodes.
Abort and retry any ongoing mounts (if required)
Any other major steps above?
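To make the intended ordering concrete, here is a minimal sketch that treats the steps listed above as strictly sequential phases and rejects any phase attempted out of order. The `Phase` and `Recovery` names are made up for illustration and are not ogfs code; the strict sequencing is itself an assumption drawn from races 1.b and 1.c below.

```python
from enum import IntEnum


class Phase(IntEnum):
    # Order mirrors the list of recovery steps; names are illustrative only.
    LOCK_RECOVERY = 1
    JOURNAL_REPLAY = 2
    GRANT_QUEUED_LOCKS = 3
    RETRY_MOUNTS = 4


class Recovery:
    """Tracks recovery for one failed node and rejects out-of-order phases."""

    def __init__(self):
        self.completed = []

    def run(self, phase):
        if len(self.completed) == len(Phase):
            raise RuntimeError("recovery already complete")
        expected = Phase(len(self.completed) + 1)
        if phase is not expected:
            raise RuntimeError(
                f"{phase.name} attempted before {expected.name}"
            )
        # The real work for the phase would happen here.
        self.completed.append(phase)


def recover():
    """Run every phase in order and return the names of completed phases."""
    r = Recovery()
    for phase in Phase:
        r.run(phase)
    return [p.name for p in r.completed]
```

Calling `Recovery().run(Phase.JOURNAL_REPLAY)` on a fresh object raises `RuntimeError`, which is the 1.b race expressed as a check.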
Recovery Races:
1) Single node failure:
1.a) Journal Replay by multiple nodes may occur
Status: ogfs must address
Current Solution: ???
Potential Solution: Deadman Locks (See below)
Potential Problems: What would this do in "no-lock" mode?
Do we still want to support "no-lock" mode?
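For illustration only, one way to picture the "single replayer" requirement in 1.a is a non-blocking try-lock on the failed node's journal plus a "replayed" flag. This is not the Deadman Lock proposal, and the local `threading.Lock` below merely stands in for a cluster-wide lock that would also have to survive the failure of its holder. `Journal`, `try_replay`, and `simulate` are hypothetical names.

```python
import threading


class Journal:
    """Stand-in for a failed node's journal; not an OpenGFS structure."""

    def __init__(self):
        self._replay_lock = threading.Lock()
        self.replayed = False
        self.replay_count = 0

    def try_replay(self, node_id):
        """Return True only if this caller performed the replay."""
        # Non-blocking acquire: a node that loses the race skips the replay
        # instead of queueing up to replay the same journal again.
        if not self._replay_lock.acquire(blocking=False):
            return False
        try:
            if self.replayed:
                return False  # another node already finished the replay
            self.replay_count += 1
            self.replayed = True
            return True
        finally:
            self._replay_lock.release()


def simulate(nodes=4):
    """Let several 'nodes' (threads) race to replay one journal."""
    journal = Journal()
    results = []
    threads = [
        threading.Thread(
            target=lambda n=n: results.append(journal.try_replay(n))
        )
        for n in range(nodes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return journal.replay_count, results.count(True)
```

However the threads interleave, the journal is replayed once and exactly one caller reports having done it.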
1.b) Lock Recovery must occur prior to journal replay
Status: ogfs must address
Current Solution: ???
Potential Solution: Deadman Locks (See below)
1.c) Journal replay must occur prior to granting queued locks
Status: ogfs must address
Current Solution: ???
Potential Solution: Persistent DLM locks
1.d) A failed node holds a lock for which no one is waiting. After failure
and lock recovery, but prior to journal recovery, a different node may
request and be granted the lock.
Status: ogfs must address; the lock should not be granted until after
journal replay.
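A toy model of 1.c and 1.d, assuming a single failed node at a time: lock recovery releases the dead node's locks but fences them off, and any request for a fenced lock is deferred until journal replay is reported complete. The `LockManager` class and its method names are invented for this sketch and do not correspond to the ogfs lock module or any DLM API.

```python
class LockManager:
    """Toy lock table mapping lock name -> holder node (or None)."""

    def __init__(self):
        self.holders = {}
        self.needs_replay = set()  # locks last held by a failed, unreplayed node
        self.waiting = []          # (node, lock) requests not yet granted

    def request(self, node, lock):
        if lock in self.needs_replay:
            # Race 1.d: the lock looks free, but the journal is not replayed.
            self.waiting.append((node, lock))
            return "deferred"
        if self.holders.get(lock) is None:
            self.holders[lock] = node
            return "granted"
        self.waiting.append((node, lock))
        return "queued"

    def node_failed(self, node):
        # Lock recovery: drop the dead node's holds, but fence those locks.
        for lock, holder in self.holders.items():
            if holder == node:
                self.holders[lock] = None
                self.needs_replay.add(lock)

    def journal_replayed(self):
        # Replay finished: lift the fence and re-run waiting requests in order.
        self.needs_replay.clear()
        pending, self.waiting = self.waiting, []
        return [(n, l, self.request(n, l)) for n, l in pending]
```

For example, if node A holds `inode7` and fails, a request from node B comes back as `"deferred"` and is only granted by the later `journal_replayed()` call.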
1.e) Mounting of new nodes should not occur during journal replay
Status: If all nodes participate in the mounting process, then
ogfs must address.
Does ogfs mount do this? Is there another mount/replay conflict?
Current Solution: ???
Potential Solution:
The mount could abort and allow recovery to complete and then retry
the mount.
We would need a single mount/recovery lock for the whole FS I assume.
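A rough sketch of the abort-and-retry idea, assuming the single FS-wide mount/recovery lock mentioned above. It only covers the simpler case where a mount starts while recovery already holds the lock; aborting a mount that is already in progress when a node fails would need more machinery. The `mount` function, its parameters, and the local `threading.Lock` used in place of a cluster lock are all hypothetical.

```python
import threading
import time


def mount(fs_lock, retries=5, delay=0.01):
    """Try to mount; back off and retry while recovery holds fs_lock.

    Returns the attempt number on which the mount succeeded.
    """
    for attempt in range(1, retries + 1):
        if fs_lock.acquire(blocking=False):
            try:
                # The actual mount work would happen here, under the lock.
                return attempt
            finally:
                fs_lock.release()
        # Recovery is in progress: give up this attempt and retry later.
        time.sleep(delay)
    raise TimeoutError("recovery did not finish in time")
```

With the lock free, `mount(threading.Lock())` succeeds on the first attempt; while a recovery thread holds the lock, the mount keeps backing off until the lock is released or the retries run out.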
1.f) Normal FS activity should be blocked during journal replay
Status: ??? Is this an ogfs problem?
> This depends on the specific Cluster FS implementation. If an FS
> had inodes sharing an FS block, then blocking activity and flushing
> and invalidating the block would be required, so that the node
> doing log replay gets valid data and, after log replay, the other
> nodes see the new updated data. There might be other things
> that require activity to be blocked. I'm not sure if this is
> a problem for ogfs.
>
Current Solution: ???
2) Multiple node failure:
2.a) Multiple Journal Recoveries should not occur simultaneously.
Status: ??? (Mentioned by Jeffrey Orlin, I don't understand the
problem)
Current Solution: ???
Deadman Locks Explained:
<To be assembled from previous e-mails>
_______________________________________________
Opengfs-devel mailing list
Opengfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opengfs-devel