RE: [ogfs-dev]Clustered mmap algorithm



Hi Daniel,

This sounds interesting.

I don't know much about mmapping, but I've been looking at the locking infrastructure a lot recently, and there may be some tools that can help.  If you haven't already, you might wish to read the ogfs-locking document (and let us know if you have any comments).  It's not perfectly organized, but I hope you find it quicker to read than looking through the code.

You may want to consider adding a new lock type, e.g. LM_TYPE_MMAP, that would have associated glops to handle the "necessary work" (cache writeout, etc.) of transitions.  glops currently handle such things for existing lock types at various points throughout the lifetime of a lock.

The callback mechanism in glock.c does not currently have any hooks to proactively cause *this* node's *filesystem* code to release a lock.  Filesystem code currently works on a cooperative basis in which a file operation grabs a lock, does its job, then releases the lock to the glock layer.  Note that I'm not talking about the glock cache here: the callback *will* proactively get the cache to release a lock that is not currently locked by filesystem code.

mmaps would likely hold the lock for a much longer time than typical file operations, though, as Dominik commented in the ogfs-locking doc.  This would violate the "cooperative" nature of the current glock layer.  However, an extra hook would be easy to add, as a new glops operation, e.g. go_request() or go_callback() or go_needs(), to request that this node give up the mapping and the lock.  This hook could be invoked from within the callback function ogfs_glock_cb() as soon as a NEEDS callback comes in.  The new LM_TYPE_MMAP might be the only type to fill in a function for the go_request() glop, but that's fine.
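
To make the shape of that hook concrete, here is a toy model in Python (not kernel C, and not existing OpenGFS code).  It assumes a per-type operations table in which only the proposed LM_TYPE_MMAP type fills in go_request(); Glock, GLOPS, glock_cb_needs() and the "released"/"deferred" return values are invented for the sketch.

```python
# Toy model only: names follow the suggestions above, but none of this
# is existing OpenGFS code.
LM_TYPE_MMAP = "mmap"      # proposed new lock type (hypothetical)
LM_TYPE_INODE = "inode"    # stand-in for an existing lock type


class Glock:
    def __init__(self, lock_type, glops):
        self.lock_type = lock_type
        self.glops = glops                       # per-type operations table
        self.held_by_fs = True                   # filesystem code holds it
        self.mapped = lock_type == LM_TYPE_MMAP  # long-lived mapping exists


def mmap_go_request(gl):
    """Proposed hook: give up the mapping, then give up the lock."""
    gl.mapped = False
    gl.held_by_fs = False
    return "released"


GLOPS = {
    LM_TYPE_INODE: {},                              # go_request left empty
    LM_TYPE_MMAP: {"go_request": mmap_go_request},  # only type that needs it
}


def glock_cb_needs(gl):
    """Model of a NEEDS callback arriving for lock gl."""
    hook = gl.glops.get("go_request")
    if hook is not None:
        return hook(gl)    # proactively ask the holder to let go
    # Existing cooperative behaviour: wait for the file operation to finish.
    return "deferred" if gl.held_by_fs else "released"
```

Lock types that leave go_request() empty keep today's cooperative behaviour, so the extra hook would not change anything for them.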

I hope these may be helpful ideas.

Perhaps you could write up a DESIGN-mmap doc, to be a living (changing) document to hold your latest ideas?  I'd be happy to add it to the web page, if you would like.

-- Ben --

Opinions are mine, not Intel's


> -----Original Message-----
> From: Daniel Phillips [mailto:phillips@arcor.de]
> Sent: Tuesday, August 19, 2003 6:27 PM
> To: opengfs-devel@lists.sourceforge.net
> Subject: [ogfs-dev]Clustered mmap algorithm
> 
> 
> Hi Everybody,
> 
> As you may know, I've set out to tackle the problem of adding a
> clustered, writable mmap feature to OpenGFS.  The object is to prove
> that the changes proposed for the VFS to support clustered mmap are in
> fact correct, by showing a correct implementation in OpenGFS.  This
> work will also provide a model for mmap implementations in other
> clustered filesystems.
> 
> The strategy for a clustered, writable mmap is simple.  There are two
> basic cluster-wide states:
> 
>    A) (Exclusive) One node may have a particular file page mapped RW in
>    one or more of its page tables, and no other node may map that page.
>    If any memory access is attempted on another node, a fault will
>    occur, and the necessary work will be done to put the cluster into
>    state (B) below in the case of a read, or the ownership of the
>    exclusive will be changed in the case of a write.
> 
>    B) (Shared) More than one node in a cluster may map the same file
>    page, and all page table entries are RO.  If any memory write
>    operation is attempted, a fault occurs and the necessary work is done
>    to put the cluster into state (A), then the write operation is
>    allowed to proceed.
> 
> This can be further simplified by implementing only the exclusive
> state, and my initial implementation will work that way.  This
> sacrifices some performance in certain circumstances in return for
> considerably reducing the number of states and transitions that need
> to be debugged.  Later, when I add support for the second state, it
> will be under control of an ifdef for debugging purposes.  That is,
> enabling the shared state should simply give increased performance,
> not any new functionality.
> 
> While the above description is in terms of page granularity, this
> implementation will be at file granularity, because OpenGFS global
> locking is currently done this way, and because I'd rather not break
> new ground here just at the moment.  The above description doesn't
> have to change much to accommodate this simplification:
> 
>    A) (Exclusive) One node may have a particular file's pages mapped RW
>    in one or more of its page tables, and no other node may map those
>    pages.  If any memory access is attempted on another node, a fault
>    will occur, and the necessary work will be done to put the cluster
>    into state (B) below in the case of a read, or the ownership of the
>    exclusive will be changed in the case of a write.
> 
>    B) (Shared) More than one node in a cluster may map the same file's
>    pages, and all page table entries are RO.  If any memory write
>    operation is attempted, a fault occurs and the necessary work is done
>    to put the cluster into state (A), then the write operation is
>    allowed to proceed.
> 
> The "necessary work" to make the transitions between shared and
> exclusive states consists of:
> 
>   - cache writeout
>   - cache invalidation
>   - page table invalidation
>   - page table write protect
> 
> This work will be performed mainly by a local daemon in response to
> requests from the central lock manager, which in turn responds to
> requests from nodes on which page faults occur.
> 
> The specific events and transitions are:
> 
>    write fault: (do_no_page)
>       have shared lock:
>          attempt to upgrade to exclusive.  If this fails (because
>          another node is also trying to upgrade to exclusive) then
>          invalidate all page table entries for this inode and drop the
>          shared lock, then request the exclusive lock.
>       have no lock:
>          obtain the exclusive
>             If some other node already holds the exclusive, it must
>             flush its dirty pages and inode state to disk, and
>             invalidate its page table entries and cache for this file.
>       continue as with normal do_no_page
> 
>    read fault:
>       have no lock:
>          obtain shared lock
>             If some other node already holds the exclusive, it must
>             flush its dirty pages and inode state to disk, and
>             invalidate its page table entries and cache for this file,
>             as above, and additionally, write-protect any
>             already-mapped pages.
>       now we hold exclusive or shared access
>       continue as with normal do_no_page.  If we hold shared access,
>       the page will be mapped write-protected.
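
The fault transitions above can be traced with a small single-threaded model in Python.  It is a sketch under stated assumptions, not the proposed implementation: the Cluster class, its log, and the upgrade_contended flag (which stands in for losing a race with another node) are invented for illustration, and a former exclusive holder is assumed to keep a shared lock after a downgrade.

```python
class Cluster:
    """File-granularity model: one exclusive holder, or any number shared."""

    def __init__(self):
        self.exclusive = None   # node holding the exclusive lock, if any
        self.shared = set()     # nodes holding the shared lock
        self.log = []           # work requested from other nodes, in order

    def _release_all_but(self, node):
        """Ask every other holder to release (flush, invalidate, drop)."""
        others = set(self.shared)
        if self.exclusive is not None:
            others.add(self.exclusive)
        others.discard(node)
        for other in sorted(others):
            self.log.append((other, "release"))
            self.shared.discard(other)
        if self.exclusive != node:
            self.exclusive = None

    def write_fault(self, node, upgrade_contended=False):
        if self.exclusive == node:
            return "RW"
        if node in self.shared and upgrade_contended:
            # Upgrade failed: drop our own mappings and the shared lock,
            # then request the exclusive as if we held nothing.
            self.log.append((node, "invalidate own page tables"))
            self.shared.discard(node)
        self._release_all_but(node)
        self.shared.discard(node)   # a held shared lock becomes exclusive
        self.exclusive = node
        return "RW"

    def read_fault(self, node):
        if self.exclusive == node:
            return "RW"             # exclusive access, no write protection
        if node not in self.shared:
            if self.exclusive is not None:
                # Assumption: the old holder downgrades and stays shared.
                self.log.append((self.exclusive, "downgrade"))
                self.shared.add(self.exclusive)
                self.exclusive = None
            self.shared.add(node)
        return "RO"                 # shared access is mapped write-protected
```

The return value is the protection the faulting node ends up with, and the log records which other nodes had to do "necessary work" along the way.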
> 
> Lock daemon request handling:
>    Release a lock:
>       Block new page faults for the file (by changing the state of the
>          lock it owns)
>       Invalidate any page table mappings of this file, by traversing
>          the list of shared mappings for the file
>       Write out any dirty pages (if in shared state, there can be no
>          dirty pages, so check this)
>       Remove all cached pages from the page cache
>       Acknowledge to the lock manager that the lock was released
> 
>    Downgrade an exclusive lock to shared:
>       This is the same as releasing it, but pages are not removed from
>       the page cache.
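
As a rough illustration of the two daemon requests, the Python sketch below walks one node's state for a file through the listed steps.  LocalFile and its fields are hypothetical bookkeeping, the "releasing" lock state stands in for "block new page faults", and sets of page indexes stand in for page tables and the page cache.

```python
class LocalFile:
    """One node's view of one file (toy bookkeeping, not kernel state)."""

    def __init__(self, lock, dirty=()):
        self.lock = lock            # "exclusive", "shared", "releasing", None
        self.mapped = {0, 1}        # pages with page table entries
        self.cached = {0, 1, 2}     # pages in the page cache
        self.dirty = set(dirty)     # cached pages not yet on disk
        self.written = []           # pages written out, in order
        self.acked = False          # lock manager has been told


def _quiesce(f):
    was_shared = f.lock == "shared"
    f.lock = "releasing"            # new faults on this file now block
    f.mapped.clear()                # invalidate page table mappings
    if was_shared:
        # The check asked for above: shared holders cannot have dirty pages.
        assert not f.dirty, "dirty pages while holding only a shared lock"
    f.written.extend(sorted(f.dirty))
    f.dirty.clear()


def release(f):
    _quiesce(f)
    f.cached.clear()                # remove all cached pages
    f.lock = None
    f.acked = True                  # acknowledge to the lock manager
    return "released"


def downgrade(f):
    _quiesce(f)                     # same as release ...
    f.lock = "shared"               # ... but the page cache is kept
    f.acked = True
    return "downgraded"
```

Sharing _quiesce() between the two requests mirrors the statement that a downgrade is a release that leaves the page cache alone.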
> 
> Interaction between read/write and mmap: precautions are needed to
> avoid deadlock when writing from a mmapped file to a file on the same
> clustered filesystem, or likewise, when reading from a file to a
> mmapped file.  The deadlock possibility arises because two files have
> to be locked to complete these operations; two such operations
> simultaneously must be careful not to take the locks in opposite order.
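
One common way to avoid taking two locks in opposite order is to always acquire them in a fixed global order.  The Python sketch below shows that technique with ordinary thread locks standing in for cluster locks; ordering by inode number is an assumption made for the example, not something the proposal specifies, and FileLock, lock_pair(), unlock_pair() and copy() are invented names.

```python
import threading


class FileLock:
    """Stand-in for a per-file cluster lock, keyed by inode number."""

    def __init__(self, inode_number):
        self.inode_number = inode_number
        self._lock = threading.Lock()


def lock_pair(a, b):
    """Acquire both locks in ascending inode order; return that order."""
    first, second = sorted((a, b), key=lambda f: f.inode_number)
    first._lock.acquire()
    if second is not first:         # same file on both sides: lock once
        second._lock.acquire()
    return first, second


def unlock_pair(first, second):
    """Release in the reverse of the acquisition order."""
    if second is not first:
        second._lock.release()
    first._lock.release()


def copy(src, dst, done):
    """Model of an operation that needs both files locked."""
    first, second = lock_pair(src, dst)
    try:
        done.append((src.inode_number, dst.inode_number))
    finally:
        unlock_pair(first, second)
```

Because every caller sorts the pair the same way, a copy from X to Y and a simultaneous copy from Y to X both take the lower-numbered lock first, so neither can end up holding one lock while waiting for the other.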
> 
> This is rough and incomplete.  I didn't discuss how the mmap is set up
> initially, or what happens on file close, munmap, loss of a node, etc.
> My intention at this point is just to focus on the core algorithm.
> 
> Discussion, and/or flames welcome :-)
> 
> Regards,
> 
> Daniel
> 

