
Re: [ogfs-dev]Recovery Race conditions



From: Greg Freemyer <freemyer-ml@NorcrossGroup.com>
> To: opengfs-devel@lists.sourceforge.net <opengfs-devel@lists.sourceforge.net>
> Subject: Re: [ogfs-dev]Recovery Race conditions
> Date: 01 Aug 2003 01:35:42 -0400
> 
> Updated list after sig.
> 
> Greg
> -- 
> Greg Freemyer
> 
> ======
> 
> 
> Recovery is comprised of:
> 
> Cluster Lock Recovery
> Journal Replay
> Granting of queued lock requests to waiting nodes.
> Abort and retry any ongoing mounts  (if required)
> 
> Any other major steps above?
STONITH? It's the cluster manager's duty to fence the failed node. We may
not need to care about it when using OpenDLM plus HA Heartbeat.
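To make the required ordering explicit, here is a toy sketch of the recovery sequence as listed above. All the function names are invented for illustration; they are not actual OGFS entry points.

```python
# Toy sketch of the recovery sequence discussed above.
# Invented names; only the ordering matters.

steps = []

def fence_node(node):            steps.append("fence")
def recover_cluster_locks(node): steps.append("lock_recovery")
def replay_journal(node):        steps.append("journal_replay")
def grant_queued_locks(node):    steps.append("grant_locks")
def retry_aborted_mounts():      steps.append("retry_mounts")

def recover_failed_node(node):
    fence_node(node)              # STONITH: make sure the dead node stays dead
    recover_cluster_locks(node)   # lock recovery must precede replay (1.b)
    replay_journal(node)          # replay the failed node's journal
    grant_queued_locks(node)      # queued locks only after replay (1.c)
    retry_aborted_mounts()        # abort/retry any mounts that raced with failure

recover_failed_node("node3")
```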

> 
> Recovery Races:
> 
> 1) Single node failure:

[snip]
> 1.b)  Lock Recovery must occur prior to journal replay
>         Status: ogfs must address
> 	Current Solution: ???
> 	Potential Solution: Deadman Locks (See below)
> 
> 1.c)  Journal replay must occur prior to granting queued locks
>         Status: ogfs must address
> 	Current Solution: ???
> 	Potential Solution: Persistent DLM locks
For 1.b): The current implementation also uses the locking server as the
cluster manager, so lock recovery occurs prior to journal replay.
For 1.c): The current implementation resets the expired locks after
journal replay completes (at the end of the ogfs_recover_journal()
function).
I believe persistent DLM locks would address this issue very well :)

> 
> 1.d)  Failed node holds lock for which no one is waiting.  After failure
> and lock recovery, but prior to journal recovery, a different node may
> request and be granted the lock.
>          ogfs must address, lock should not be granted until after
> journal replay.
Would persistent DLM locks work under this condition?
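One way to picture the expired-lock handling, and how the same mechanism could cover 1.d as well: locks held by the dead node are marked expired during lock recovery, grant requests against expired locks are refused, and the marks are cleared only once journal replay finishes (as ogfs_recover_journal() reportedly does). A toy model, with invented names, just to show the blocking behaviour:

```python
# Toy model of the "expired lock" guard described above.
# Invented names; only the ordering/blocking behaviour matters.

class LockSpace:
    def __init__(self):
        self.expired = set()   # locks held by a failed node

    def mark_expired(self, locks):
        # done during cluster lock recovery
        self.expired.update(locks)

    def try_grant(self, lock):
        # 1.c / 1.d: refuse to grant while the lock is still expired,
        # i.e. before journal replay has run
        return lock not in self.expired

    def journal_replayed(self):
        # equivalent to the reset at the end of ogfs_recover_journal()
        self.expired.clear()

ls = LockSpace()
ls.mark_expired({"inode_42"})
assert not ls.try_grant("inode_42")   # blocked before replay
ls.journal_replayed()
assert ls.try_grant("inode_42")       # granted after replay
```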


> 2) multiple node failure
> 
> 2.a) Multiple Journal Recoveries should not occur simultaneously.
>         Status: ???  (Mentioned by Jeffrey Orlin, I don't understand the
> problem)
> 	Current Solution: ???
> 
2.b) The failure of a node that is replaying a journal.
	Current Solution: ??? (Since the failed node holds a non-expired
	lock, I don't know how memexpd will deal with it.)
	Potential Solution: Deadman locks

It seems the deadman lock is really a useful idea :) 
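As I understand the deadman idea from the earlier mails (so this is my reading, not an OGFS or OpenDLM API): the node replaying a journal holds a lock that it must renew within a timeout; if that node itself dies mid-replay (case 2.b), renewal stops, the lock expires, and another node can take over the replay. A deterministic sketch with simulated time:

```python
# Hedged sketch of a deadman lock for case 2.b.  Entirely my reading
# of the idea; time is passed in explicitly so the example is
# deterministic.

class DeadmanLock:
    def __init__(self, timeout):
        self.timeout = timeout
        self.holder = None
        self.last_renew = None

    def acquire(self, node, now):
        # Succeeds if the lock is free, or if the current holder has
        # gone silent for longer than the timeout (presumed dead).
        if self.holder is None or now - self.last_renew > self.timeout:
            self.holder = node
            self.last_renew = now
            return True
        return False

    def renew(self, node, now):
        # The replaying node calls this periodically to stay "alive".
        if self.holder == node:
            self.last_renew = now

lock = DeadmanLock(timeout=5)
assert lock.acquire("node1", now=0)      # node1 starts journal replay
assert not lock.acquire("node2", now=3)  # node1 still within its timeout
# node1 dies mid-replay and stops renewing...
assert lock.acquire("node2", now=10)     # node2 takes over the replay
```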
 
> 
> Deadman Locks Explained:
> 
> <To be assembled from previous e-mails>
> 
> 
> 
> 
> 
-- 
Opinions expressed are those of the author and do not represent Intel
Corporation
"gpg --recv-keys --keyserver wwwkeys.pgp.net E1390A7F"
{E1390A7F:3AD1 1B0C 2019 E183 0CFF  55E8 369A 8B75 E139 0A7F}



_______________________________________________
Opengfs-devel mailing list
Opengfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opengfs-devel
