Re: [ogfs-dev]Recovery Race conditions
Third Pass at full doc:
Note: In Evolution, how do I set the line wrap to 80 chars? The text below
is hard to read.
=========
Recovery
=========
Recovery consists of (items marked * are new or modified):
* Fencing (STOMITH or STONITH)
* Cluster Lock Recovery of non-persistent locks
  Journal Replay
* Notify DLM to clean up persistent locks
  Granting of queued lock requests to waiting nodes
  Abort and retry any ongoing mounts (if required)
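The sequence above can be sketched as an ordered set of steps. This is only an illustration in Python, not OGFS code: the step names and the run_recovery() helper are made up here, and the only thing it encodes is the list order above.

```python
from enum import IntEnum


class Step(IntEnum):
    # Hypothetical names; the order mirrors the recovery list.
    FENCE = 1
    RECOVER_NONPERSISTENT_LOCKS = 2
    REPLAY_JOURNAL = 3
    CLEAN_UP_PERSISTENT_LOCKS = 4
    GRANT_QUEUED_LOCKS = 5
    RETRY_MOUNTS = 6


def run_recovery(handlers):
    """Run every step in list order; refuse to start if a handler is missing."""
    missing = [s.name for s in Step if s not in handlers]
    if missing:
        raise ValueError("no handler for: " + ", ".join(missing))
    done = []
    for step in Step:  # an Enum iterates in definition order
        handlers[step]()
        done.append(step)
    return done


if __name__ == "__main__":
    log = []
    handlers = {s: (lambda s=s: log.append(s.name)) for s in Step}
    run_recovery(handlers)
    print(" -> ".join(log))
```

Checking for missing handlers up front means a recovery never starts fencing and then discovers it has no way to replay the journal.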
Recovery Races:
1) Single node failure:
1.a) Fencing must occur prior to recovery beginning
Status: ???
Current Solution: locking module invokes STOMITH as required
Potential Solutions: I hope that the appropriate lock modules will handle
fencing??? If not, definitely handle it inside the new lock/cluster interface.
1.b) Journal Replay by multiple nodes may occur
Status: ogfs must address
Current Solution: The TRANSaction lock is used to ensure only one replay at
a time.
Potential Solution: Deadman Locks (see below) in addition to the TRANSaction
lock. At the conclusion of the replay, the journal is empty, so other nodes
simply get the Deadman lock and replay the "empty" journal.
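A minimal sketch of why serialized replay is harmless when several nodes attempt it: the first node through the lock replays the entries and leaves the journal empty, so the rest replay nothing. Everything here is a toy. The Journal class and replay() function are hypothetical, a plain threading.Lock stands in for the TRANSaction lock held exclusively, and the Deadman lock is not modeled at all.

```python
import threading


class Journal:
    """Toy journal of a failed node: a list of pending entries."""

    def __init__(self, entries):
        self.entries = list(entries)


def replay(journal, trans_lock, applied, node):
    """Replay the journal under the lock; return how many entries were applied."""
    # trans_lock stands in for the TRANSaction lock held in exclusive mode.
    with trans_lock:
        count = len(journal.entries)
        for entry in journal.entries:
            applied.append((node, entry))
        journal.entries.clear()  # the journal is empty once replay completes
        return count


if __name__ == "__main__":
    journal = Journal(["e1", "e2", "e3"])
    lock = threading.Lock()
    applied, counts = [], {}

    def worker(node):
        counts[node] = replay(journal, lock, applied, node)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B", "C")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Exactly one node replays the 3 entries; the others see an empty journal.
    print(sorted(counts.values()))
```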
1.c) Lock Recovery must occur prior to journal replay
Status: ogfs must address
Info: Ben thinks this applies only to locks that were held by the dead node
in *exclusive* (write) mode. This is based on looking at lock recovery in
memexp. His understanding may be incomplete.
Current Solution: ???
Potential Solution: Deadman Locks (See below)
1.d) Journal replay must occur prior to granting queued locks
Status: ogfs must address
Current Solution: The current implementation uses the memexp *lock module*
(one in each node) as the (distributed?) cluster manager, with the help of
the memexpd central lock *storage* server. All of the "brains" for cluster
management are in the lock module ... memexpd simply stores the current
state, to share among all nodes.
Potential Solution: For persistent DLM locks, ogfs to invoke cleanup after
journal replay.
1.e) Failed node holds lock for which no one is waiting. After failure
and lock recovery, but prior to journal recovery, a different node may
request and be granted the lock.
Status: ???
1.f) Mounting of new nodes should not occur during journal replay
Status: It appears this is an ogfs issue
Current Solution: The TRANSaction lock in exclusive mode seems to
handle this.
Info:
The current interface between the filesystem and the locking module makes
provision for the first node to mount to replay all journals, then allow
other nodes to mount. After that, I don't know that there is any problem in
allowing new nodes to mount while another node is doing a journal replay ...
as long as the new node does not write anything to disk (and it won't,
because the node doing recovery will own the TRANSaction lock in exclusive
mode), I don't think it would be troublesome for the new node to go ahead
and mount (does anyone have insight otherwise?).
OGFS currently locks a LIVE lock in shared mode when mounting, and unlocks
it when unmounting. I haven't been able to see how this is actually
used/effective, however. Nothing seems to try to lock it in exclusive mode
or anything. Perhaps something in memexp looks at it (doesn't seem quite
right, though).
OGFS currently locks a MOUNT lock in exclusive mode during the mount
process, and unlocks it when the mount process is complete. See
_ogfs_read_super() (src/fs/arch_linux_2_4/super_linux.c). Also described in
the ogfs_locking doc.
Potential Solution:
Continue using the TRANSaction lock in exclusive mode.
1.g) Normal FS activity should be blocked during journal replay
Status: It appears this is an ogfs issue
Info:
This depends on the specific Cluster FS implementation. If an FS
had inodes sharing an FS block, then blocking activity and flushing
and invalidating the block would be required, so that the node
doing log replay gets valid data and, after log replay, the other
nodes see the newly updated data. There might be other things
that require activity to be blocked. I'm not sure if this is
a problem for ogfs.
Current Solution:
OGFS currently uses a cluster-wide OGFS_TRANS_LOCK "transaction" lock to
protect journal replay from any other node writing to disk. When a node
starts a transaction, it grabs the lock in shared mode, so all nodes can
create transactions simultaneously ... When a node wants to start a journal
replay, it asks for the lock in exclusive mode, and won't begin journal
replay until all transactions have (completed?/suspended?). So, a journal
replay gets exclusive access to the entire filesystem, with no fear of
interference from anything else writing to disk.
Proposed Solution: Continue current solution
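The shared/exclusive behavior described for OGFS_TRANS_LOCK can be illustrated with a small in-process lock. This is a sketch of the general technique only, assuming nothing about how OGFS or the lock module implements it. The SharedExclusiveLock class and its method names are invented for the example, and it works between threads in one process rather than between cluster nodes.

```python
import threading


class SharedExclusiveLock:
    """Toy stand-in for a lock with shared and exclusive modes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of shared holders (transactions)
        self._exclusive = False   # True while a replay holds the lock

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            if self._shared == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._exclusive or self._shared:
                self._cond.wait()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()

    def try_acquire_exclusive(self):
        """Non-blocking variant, used below to show the exclusion."""
        with self._cond:
            if self._exclusive or self._shared:
                return False
            self._exclusive = True
            return True


if __name__ == "__main__":
    trans = SharedExclusiveLock()
    trans.acquire_shared()                 # node A starts a transaction
    trans.acquire_shared()                 # node B starts one as well
    print(trans.try_acquire_exclusive())   # replay cannot start yet: False
    trans.release_shared()
    trans.release_shared()
    print(trans.try_acquire_exclusive())   # all transactions finished: True
```

Note that this toy does not stop new shared holders from arriving while a replay is waiting. Whether the real lock does is exactly the (completed?/suspended?) question above.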
2) Multiple node failure:
2.a) Multiple Journal Recoveries should not occur simultaneously.
Status: It appears this is an ogfs issue
This may relate:
It might not be the log replay part itself. You do have
to make sure that 2 nodes do not try to replay the same
log simultaneously. I've seen problems in other steps of
recovery (log replay being just one step) where having
another node messing around simultaneously could cause problems.
Knowing you are the only node messing with the file system during
recovery can simplify the programming. Inodes sharing the same
fs block is a good example of something that could not easily handle
multiple log replays simultaneously, but that does not seem to be a
problem for ogfs.
Current Solution: See 1.g above
Proposed Solution: Continue current solution
2.b) Multiple Persistent Lock cleanups should not occur simultaneously
Status: Per Daniel McNeil, this can be a problem for some DLMs, so ogfs
should address
Current Solution: ???
Proposed Solution: Invoke persistent lock cleanup from within the exclusive
OGFS_TRANS_LOCK synchronization block. (See 1.g)
=============
OpenDLM Internal Recovery Process:
OpenDLM, as it exists, will "recover" locks by:
1) internal to the DLM, remastering locks that were mastered on a
failed node.
2) internal to the DLM, recovering the queues of lock requests.
3) externally visible, granting non-persistent locks that were held by
failed holders (e.g., processes on the failed node) based on queued
requests.
If the locks are persistent (aka "orphan") locks, then those locks
remain "held" by the failed holders. The DLM does only the internal steps
for these.
So at this point, the DLM clients are either granted locks, if they had
been blocked by the now-failed holder, or they remain blocked on
persistent locks.
OpenDLM has no particular interest in when it's told to clean up
persistent locks; it will do the internal cleanup automagically upon
failure notification. Once the client level has determined the locks
should be "freed", someone has to issue the dlm_purge() call to either:
- free all persistent locks held by all failed clients on a specified
node, or
- free only the persistent locks held by a specific failed client.
(dlm_purge() can also be used by a client to immediately free all of the
locks it holds, but as this isn't a failed client, these aren't orphaned
locks.)
OpenDLM doesn't care what the clients do prior to calling dlm_purge().
It's up to the user to coordinate recovering the resources it controls
with when it informs the DLM that persistent locks can go back into
normal circulation.
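To make the difference between non-persistent and persistent ("orphan") locks concrete, here is a toy model in Python. It is not the OpenDLM API. The ToyDLM class and its methods are invented, purge_node() only stands in for the effect of a node-wide purge, and the model ignores lock modes, remastering, and queued requests made by the failed node itself.

```python
from collections import deque


class ToyDLM:
    """Toy model only; not the OpenDLM API."""

    def __init__(self):
        self.holder = {}    # resource -> (node, persistent)
        self.waiters = {}   # resource -> deque of waiting nodes

    def lock(self, resource, node, persistent=False):
        """Grant the lock if free; otherwise queue the request."""
        if resource not in self.holder:
            self.holder[resource] = (node, persistent)
            return True
        self.waiters.setdefault(resource, deque()).append(node)
        return False

    def _grant_next(self, resource):
        # Drop the current holder and hand the lock to the next waiter, if any.
        del self.holder[resource]
        queue = self.waiters.get(resource)
        if queue:
            self.holder[resource] = (queue.popleft(), False)

    def node_failed(self, node):
        # Non-persistent locks of the failed node go to queued requests;
        # persistent ("orphan") locks stay held by the failed node.
        for res, (owner, persistent) in list(self.holder.items()):
            if owner == node and not persistent:
                self._grant_next(res)

    def purge_node(self, node):
        # Stand-in for a purge: free the orphan locks of a failed node.
        for res, (owner, persistent) in list(self.holder.items()):
            if owner == node and persistent:
                self._grant_next(res)


if __name__ == "__main__":
    dlm = ToyDLM()
    dlm.lock("inode-7", "n1", persistent=True)
    dlm.lock("inode-9", "n1")
    dlm.lock("inode-7", "n2")   # queued behind n1
    dlm.lock("inode-9", "n2")   # queued behind n1
    dlm.node_failed("n1")
    print(dlm.holder["inode-9"][0], dlm.holder["inode-7"][0])  # n2 n1
    dlm.purge_node("n1")        # e.g. once the client has replayed the journal
    print(dlm.holder["inode-7"][0])                            # n2
```

In this model the waiter on the persistent lock stays blocked until the purge, which is the hook the client can use to order journal replay ahead of granting.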
Deadman Locks Explained:
<To be assembled from previous e-mails>
_______________________________________________
Opengfs-devel mailing list
Opengfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opengfs-devel