[ogfs-dev] RE: [Opendlm-devel] ODLM/OGFS Recovery



Comments below.

> -----Original Message-----
> From: opendlm-devel-admin@xxxxxxxxxxxxxxxxxxxxx 
> [mailto:opendlm-devel-admin@xxxxxxxxxxxxxxxxxxxxx] On Behalf 
> Of Cahill, Ben M
> Sent: Friday, April 23, 2004 12:41 PM
> To: opendlm-devel@xxxxxxxxxxxxxxxxxxxxx
> Cc: opengfs-devel@xxxxxxxxxxxxxxxxxxxxx
> Subject: RE: [Opendlm-devel] ODLM/OGFS Recovery
> 
> 
>  
> 
> > -----Original Message-----
> > From: opendlm-devel-admin@xxxxxxxxxxxxxxxxxxxxx
> > [mailto:opendlm-devel-admin@xxxxxxxxxxxxxxxxxxxxx] On Behalf 
> > Of Stanley Wang
> > Sent: Friday, April 23, 2004 4:20 AM
> > To: opendlm-devel@xxxxxxxxxxxxxxxxxxxxx
> > Cc: opengfs-devel@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [Opendlm-devel] ODLM/OGFS Recovery
> > 
> > Hi Ben,
> > 
> > Thanks for your docs, it really helps a lot!
> > 
> > Current solution in my mind:
> > 
> > Once a node dies, all other nodes are notified by deadman
> > locks. After being notified, every node's lock module holds all
> > lock requests (puts them in a wait queue) except NOEXP lock
> > requests.
> 
> That's the basic idea I've been thinking about, also, 
> (inspired by the work you've already done, and CA's 
> comments), but I think there might be some problems with 
> timing.  There's no guarantee of processing order when doing 
> lock recovery.  The deadman lock will be granted during 
> ODLM's lock recovery process, but might be granted after many 
> other locks have been granted (and therefore mistakenly 
> passed to OGFS).  That is, there's nothing special about the 
> deadman lock (from OpenDLM's point of view) that would cause 
> it to be processed first.
> 
That is correct.

> CA uses the DLM_VALNOTVALID invalid LVB status as a possible 
> earlier indicator.  I haven't convinced myself that that is 
> early enough, or conclusive enough (discussion was in the 
> attachment, "false positive/negative"), to serve as the 
> trigger for withholding locks from OGFS (but it might be, and 
> I might not understand it well enough!). Plus, it forces us 

When a node dies I see three possible scenarios:
1)  a lock is granted withOUT the DLM_VALNOTVALID flag set
  - to me this indicates the master was/is on a live node, so no data corruption could have occurred
2)  a lock is granted with the DLM_VALNOTVALID flag set
  - to me this indicates the master of that resource was forcefully removed (most likely it died), the new master is now the current lock, and its data is probably corrupted.  Therefore rebuild from the journal.
3)  a deadman lock is granted.  Enough said, rebuild from the journal.
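
To make the three cases concrete, here is a minimal Python sketch of the decision (illustration only, not OpenDLM or OpenGFS code).  The `Grant` record, the `Action` enum and `classify` are made-up names; only the DLM_VALNOTVALID status and the deadman lock come from the scenarios above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    PASS_TO_FS = auto()      # scenario 1: hand the lock to the filesystem
    REPLAY_JOURNAL = auto()  # scenarios 2 and 3: rebuild from the journal


@dataclass
class Grant:
    """Hypothetical summary of one lock grant as seen by the lock module."""
    is_deadman: bool   # the granted lock is a deadman lock
    lvb_invalid: bool  # DLM_VALNOTVALID was reported with the grant


def classify(grant: Grant) -> Action:
    # Scenario 3: a deadman lock was granted, so a node died.
    if grant.is_deadman:
        return Action.REPLAY_JOURNAL
    # Scenario 2: the value block is invalid; treat it as "master lost",
    # accepting false positives (better safe than sorry).
    if grant.lvb_invalid:
        return Action.REPLAY_JOURNAL
    # Scenario 1: nothing suspicious was reported with this grant.
    return Action.PASS_TO_FS
```

Note that scenarios 2 and 3 end in the same action; the sketch keeps them separate only to mirror the list.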

As for your false positive, this is scenario 2 described above.  If it accidentally marks the value block as invalid, does it really matter?  It is a case of better safe than sorry.  It is very difficult to determine what mode the lock on the dead node was in before it died.  Unless the locks are designed to be used in such a way that they are never in EX/PW/CW, you can never really tell whether you have a false positive.  Any notification is better than no notification.
As for your false negative, if no one knows about the resource, then yes, the resource is destroyed.  However, the scenario you propose seems unlikely.  Before opendlm returns to the run state, I believe it grants all the newly acquired locks (in this case, the deadman locks).  Then it reaches the run state and processes new locks.  If those new locks (say, on the aforementioned resource) are granted before the deadman locks, then I would think opendlm has some issues.
I could probably run some test cases early next week, as we are currently testing a 3/4 node cluster while pulling its power cord.
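
For reference, the hold-back idea from the quoted proposal (queue every grant except NOEXP until journal recovery is done) could be sketched roughly as below.  This is a Python illustration with invented names (`GrantGate`, `node_died`, `recovery_complete`), not the lock module's real interface.  It only shows the bookkeeping; it does not address the ordering concern, i.e. whether the trigger arrives before other grants have already been forwarded.

```python
from collections import deque


class GrantGate:
    """Holds back lock grants while journal recovery is pending (hypothetical)."""

    def __init__(self, deliver):
        self._deliver = deliver   # callback that hands a grant to the filesystem
        self._recovering = False
        self._held = deque()      # grants withheld during recovery, in arrival order

    def node_died(self):
        # Trigger: e.g. a deadman lock was granted or an invalid LVB was seen.
        self._recovering = True

    def on_grant(self, lock_name, noexp=False):
        # NOEXP requests are let through so journal recovery itself can run.
        if self._recovering and not noexp:
            self._held.append(lock_name)
        else:
            self._deliver(lock_name)

    def recovery_complete(self):
        # Journal replay finished: release the withheld grants in order.
        self._recovering = False
        while self._held:
            self._deliver(self._held.popleft())
```

Any grant delivered before `node_died()` is called has already reached the filesystem, which is exactly the window the test cases would need to probe.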


> to use LVBs with every lock (maybe not a huge problem, but 
> nice to avoid if possible).  And, it prevents us from using 
> the LKM_INVVALBLK flag!  (but I don't think there's a need 
> to, so not a problem).

I don't know how OpenGFS works, but I thought there would be a wrapper function around the opendlm calls, such that adding code for the lock value blocks would be quick and painless.
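
What I have in mind is something like the following Python sketch.  I am assuming such a wrapper exists or could be added; `raw_lock`, `make_lvb_wrapper` and the 16-byte block size are placeholders for illustration, not actual OpenGFS or OpenDLM names or values.

```python
LVB_SIZE = 16  # assumed value block size, for illustration only


def make_lvb_wrapper(raw_lock):
    """Wrap a lock call so every request carries a lock value block.

    `raw_lock(resource, mode, lvb)` is a hypothetical stand-in for the
    underlying DLM request; it may fill `lvb` in place and returns a status.
    """
    def lock_with_lvb(resource, mode):
        lvb = bytearray(LVB_SIZE)           # fresh value block per request
        status = raw_lock(resource, mode, lvb)
        return status, bytes(lvb)           # caller sees status and block contents
    return lock_with_lvb
```

With one choke point like this, callers would not need to change; only the wrapper would know that a value block is attached to every request.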

> 
> The other tricky part is, if we withhold *all* locks from 
> OpenGFS (not just the ones that were blocked by the dead 
> node), will surviving nodes be able to finish their write 
> transactions (and release their CR transaction locks)?  They 

A write while holding a read lock??

> need to do this, or else the node that attempts the journal 
> recovery will never get the (EX) transaction lock ....
> 
>  .... Again (false positive/negative), how do we conclusively 
> tell which locks were blocked by dead node vs. not?
> 
> 
> > After the node that replays the journal completes its work, it
> > notifies (through a dedicated "recover_complete" lock?) all the
> > others that the blocked requests can continue now. And then the
> > OGFS cluster resumes.
> 
> Yes, "recover_complete" locks might work.  The good thing 
> about this is that each lock would be filesystem-specific (a 
> concern if the user mounts multiple OGFS filesystems).  I 
> think that each lock would need to be node- or 
> journal-specific as well (multiple nodes might need to 
> recover multiple journals before continuing operation).  I'd 
> rather use locks than the callbacks, etc., ... I'll keep 
> working on this.
> 
> More comments/suggestions from anyone????

I was just curious how opengfs retrieved the memexp storage area of a dead node?  
This seems counter-intuitive, unless every node had a copy of each other's storage area.  

Best Regards,
Don


> 
> -- Ben --
> 
> Opinions are mine, not Intel's
> 
> 
> > 
> > Any comments?
> > 
> > Best Regards,
> > Stan
> > 
> > Cahill, Ben M wrote:
> > 
> > >Hi all,
> > >
> > >Attached please find some discussion I'm trying to write to
> > >understand the problems with OpenDLM/OpenGFS recovery after a node
> > >dies.  Memexp had this all integrated, but OpenDLM works quite
> > >differently.
> > >
> > >Please take a look and comment ... I'm hoping that I'm wrong about
> > >some of the things I wrote, especially about DLM_VALNOTVALID.
> > >Please give me a sanity check.
> > >
> > >I'm thinking about using a list within the lock module to retain
> > >locks that get granted during lock recovery, but must not be
> > >forwarded to OpenGFS yet.
> > >
> > >There's more that I need to write, about timing and recovery
> > >notification ... Running the journal recovery after lock recovery
> > >is complete, then notifying all nodes when journal recovery is done.
> > >
> > >I'm thinking (just an idea) about using a callback from OpenDLM to
> > >the lock module to indicate:
> > >
> > >-- when lock recovery begins (and which node died)
> > >-- when lock recovery ends
> > >-- when journal recovery ends (in response to a call from the lock
> > >module), perhaps using a new BARRIER state in the ODLM recovery
> > >state machine, to wait until all nodes' journal recovery ("client"
> > >recovery, more generically) had occurred.
> > >
> > >Also wrestling with multiple filesystems ... Each must recover
> > >before moving on.  Memexp had separate instances and storage areas
> > >for each filesystem, but there's just one ODLM.   Arrgh.
> > >
> > >Stan's been thinking about this also, but I haven't captured his
> > >thoughts ... Please add/comment ... And anyone else, too.
> > >
> > >-- Ben --
> > >
> > >Opinions are mine, not Intel's
> > >
> 
> 
> 
> _______________________________________________
> Opendlm-devel mailing list
> Opendlm-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.sourceforge.net/lists/listinfo/opendlm-devel
> 
> 


_______________________________________________
Opengfs-devel mailing list
Opengfs-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/opengfs-devel
