
Re: [ogfs-dev]Clustered mmap algorithm



On Wed, Aug 20, 2003 at 12:27:28AM +0200, Daniel Phillips wrote:
> Hi Everybody,
> 
> As you may know, I've set out to tackle the problem of adding a clustered, 
> writable mmap feature to OpenGFS.  The object is to prove that the changes 
> proposed for the VFS to support clustered mmap are in fact correct, by 
> showing a correct implementation in OpenGFS.  This work will also provide a 
> model for mmap implementations in other clustered filesystems.
> 
> The strategy for a clustered, writable mmap is simple.  There are two basic 
> cluster-wide states:
> 
>    A) (Exclusive) One node may have a particular file page mapped RW in
>    one or more of its page tables, and no other node may map that page.
>    If any memory access is attempted on another node, a fault will occur,
>    and the necessary work will be done to put the cluster into state (B)
>    below in the case of a read, or the ownership of the exclusive will be
>    changed in the case of a write.

How does this mix with normal file system operations?  For
example, Node 1 has mmap'ed the file shared R/W (state A).

 - Node 2 wants to read the file.  The current code blocks until
   the lock request is granted.

   If I understand your proposal correctly, the shared lock would
   never be dropped except to allow access by another node that
   has mmap'ed the file shared writable?  How can another process
   read the file (normal file access)?

 - Node 3 wants to write the file.  Blocks too.

Another problem:  How do you distinguish between "normal" locks
and the mmap locks?  For example, process A has mmap'ed a file
shared writable.  Process B on the same node reads the file.  Who
decides whether the lock can be dropped, and in which context is
that done?

>    B) (Shared) More than one node in a cluster may map the same file
>    page, and all page table entries are RO.  If any memory write operation
>    is attempted, a fault occurs and the necessary work is done to
>    put the cluster into state (A), then the write operation is allowed
>    to proceed.
> 
> This can be further simplified by implementing only the exclusive 
> state, and my initial implementation will work that way.  This sacrifices some 
> performance in certain circumstances in return for considerably reducing the 
> number of states and transitions that need to be debugged.  Later, when I add 
> support for the second state, it will be under control of an ifdef for 
> debugging purposes.  That is, enabling the shared state should simply give 
> increased performance, not any new functionality.
> 
> While the above description is in terms of page granularity, this 
> implementation will be at file granularity, because OpenGFS global locking is 
> currently done this way, and because I'd rather not break new ground here 
> just at the moment.  The above description doesn't have to change much to 
> accommodate this simplification:
> 
>    A) (Exclusive) One node may have a particular file's pages mapped RW in
>    one or more of its page tables, and no other node may map those pages.
>    If any memory access is attempted on another node, a fault will occur,
>    and the necessary work will be done to put the cluster into state (B)
>    below in the case of a read, or the ownership of the exclusive will be
>    changed in the case of a write.
> 
>    B) (Shared) More than one node in a cluster may map the same file's
>    pages, and all page table entries are RO.  If any memory write operation
>    is attempted, a fault occurs and the necessary work is done to
>    put the cluster into state (A), then the write operation is allowed
>    to proceed.
> 
> The "necessary work" to make the transitions between shared and exclusive 
> states consists of:
> 
>   - cache writeout
>   - cache invalidation
>   - page table invalidation
>   - page table write protect
> 
> This work will be performed mainly by a local daemon in response to requests 
> from the central lock manager, which in turn responds to requests from nodes 
> on which page faults occur.
> 
> The specific events and transitions are:
> 
>    write fault: (do_no_page)
>       have shared lock:
>          attempt to upgrade to exclusive.  If this fails (because another
>          node is also trying to upgrade to exclusive) then invalidate all
>          page table entries for this inode and drop the shared lock, then
>          request the exclusive lock.

There is a potential race condition here.  Assume node A and
node B both hold a shared lock and try to write at the same time.
The lock manager denies both upgrade requests.  Both nodes then
drop their locks.  At that point, another node may come along and
grab the lock while the file is in an inconsistent state.
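
The interleaving can be replayed with a toy lock table.  This is only
a sketch: the LockManager class, its methods and the third node "C"
are hypothetical names invented for illustration, not OpenGFS code,
and it shows just the window in which nobody holds the lock, not what
is or is not on disk at that moment.

```python
from dataclasses import dataclass, field


@dataclass
class LockManager:
    """Toy per-file lock table: node name -> "shared" or "exclusive"."""
    holders: dict = field(default_factory=dict)

    def try_upgrade(self, node):
        # An upgrade is granted only if no other node holds any lock.
        if any(n != node for n in self.holders):
            return False
        self.holders[node] = "exclusive"
        return True

    def drop(self, node):
        self.holders.pop(node, None)

    def try_exclusive(self, node):
        # Granted only when the file is completely unlocked.
        if self.holders:
            return False
        self.holders[node] = "exclusive"
        return True


def replay_race():
    lm = LockManager({"A": "shared", "B": "shared"})
    events = []
    # Both writers fault at once; each upgrade is denied because of the other.
    events.append(("A upgrade", lm.try_upgrade("A")))
    events.append(("B upgrade", lm.try_upgrade("B")))
    # Both fall back to "drop shared, then request exclusive".
    lm.drop("A")
    lm.drop("B")
    # Window: nobody holds the lock, so an unrelated node C can win it.
    events.append(("C exclusive", lm.try_exclusive("C")))
    events.append(("A exclusive", lm.try_exclusive("A")))
    return events, dict(lm.holders)
```

In this replay node C ends up with the exclusive lock although A and
B were the ones that faulted first.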


>       have no lock:
>          obtain the exclusive
>             If some other node already holds the exclusive, it must flush
>             its dirty pages and inode state to disk, and invalidate its page
>             table entries and cache for this file.
>       continue as with normal do_no_page
> 
>    read fault:
>       have no lock:
>          obtain shared lock
>             If some other node already holds the exclusive, it must flush
>             its dirty pages and inode state to disk, and invalidate its page
>             table entries and cache for this file, as above, and additionally,
>             write-protect any already-mapped pages.
>       now we hold exclusive or shared access
>       continue as with normal do_no_page.  If we hold shared access, the page
>       will be mapped write-protected.
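
The fault transitions listed above can be written down as a small
table-like function.  A minimal sketch, assuming one lock per file and
a single local lock state; the step labels and the upgrade_ok flag are
made up for illustration and do not correspond to real kernel or
OpenGFS functions.

```python
from enum import Enum


class Lock(Enum):
    NONE = "none"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


def on_fault(kind, local, upgrade_ok=True):
    """Return (steps, new_local_lock) for one page fault on this node.

    kind is "read" or "write"; local is the lock this node holds now.
    upgrade_ok models whether the lock manager grants a
    shared -> exclusive upgrade.  Step names are illustrative labels.
    """
    steps = []
    if kind == "write":
        if local is Lock.SHARED:
            steps.append("try_upgrade")
            if not upgrade_ok:
                # Fallback described in the proposal for a failed upgrade.
                steps += ["invalidate_ptes", "drop_shared", "request_exclusive"]
        elif local is Lock.NONE:
            steps.append("request_exclusive")
        local = Lock.EXCLUSIVE
    elif kind == "read":
        if local is Lock.NONE:
            steps.append("request_shared")
            local = Lock.SHARED
    else:
        raise ValueError(kind)
    # Under a shared lock the page is mapped write-protected.
    steps.append("map_write_protected" if local is Lock.SHARED else "map_normal")
    return steps, local
```

For example, on_fault("read", Lock.NONE) yields the steps
request_shared and map_write_protected and leaves the node holding
the shared lock.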
> 
> Lock daemon request handling:
>    Release a lock:
>       Block new page faults for the file (by changing the state of the
>          lock it owns)
>       Invalidate any page table mappings of this file, by traversing the
>          list of shared mappings for the file
>       Write out any dirty pages (if in shared state, there can be no dirty
>          pages, so check this)
>       Remove all cached pages from the page cache
>       Acknowledge to the lock manager that the lock was released
> 
>    Downgrade an exclusive lock to shared:
>       This is the same as releasing it, but pages are not removed from
>       the page cache.
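
The release and downgrade sequences differ only in the page cache
step, which the following sketch makes explicit.  The function and the
step labels are hypothetical; they only restate the ordering given
above and are not taken from any existing daemon.

```python
def release_steps(state, downgrade=False):
    """Ordered lock daemon steps, following the list above.

    state is "shared" or "exclusive"; downgrade=True keeps the page cache.
    """
    if state not in ("shared", "exclusive"):
        raise ValueError(state)
    if downgrade and state != "exclusive":
        raise ValueError("only an exclusive lock can be downgraded")
    steps = ["block_new_faults", "invalidate_page_tables"]
    if state == "exclusive":
        steps.append("write_out_dirty_pages")
    else:
        # In shared state there should be no dirty pages; only verify.
        steps.append("check_no_dirty_pages")
    if not downgrade:
        steps.append("drop_page_cache")
    steps.append("ack_lock_manager")
    return steps
```

Calling release_steps("exclusive", downgrade=True) gives the same
list as a full release minus drop_page_cache.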
> 
> Interaction between read/write and mmap: precautions are needed to avoid 
> deadlock when writing from a mmapped file to a file on the same clustered 
> filesystem, or likewise, when reading from a file to a mmapped file.  The 
> deadlock possibility arises because two files have to be locked to complete 
> these operations; two such operations simultaneously must be careful not to 
> take the locks in opposite order.
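
One common way to keep two such operations from taking the locks in
opposite order is to always acquire them in a fixed global order, for
instance by a per-file key.  The proposal does not say which method
will be used, so this is an assumption: the sketch below uses local
threads and integer keys as stand-ins for cluster nodes and inode
numbers, and all names in it are hypothetical.

```python
import threading
from contextlib import contextmanager


@contextmanager
def lock_both(a, b):
    """Acquire two (key, lock) pairs in ascending key order."""
    if a[0] == b[0]:
        # Same file on both sides: take its lock only once.
        with a[1]:
            yield
        return
    first, second = sorted((a, b), key=lambda pair: pair[0])
    with first[1]:
        with second[1]:
            yield


def copy_between(src, dst, log):
    # Stand-in for "write from a mmapped file to another file".
    with lock_both(src, dst):
        log.append((src[0], dst[0]))


def demo(rounds=200):
    f1 = (1, threading.Lock())
    f2 = (2, threading.Lock())
    log = []

    def worker(src, dst):
        for _ in range(rounds):
            copy_between(src, dst, log)

    # Two workers copy in opposite directions at the same time.
    t1 = threading.Thread(target=worker, args=(f1, f2))
    t2 = threading.Thread(target=worker, args=(f2, f1))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return len(log)
```

Because both workers lock key 1 before key 2 regardless of copy
direction, neither can hold one lock while waiting for the other in
reverse order.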
> 
> This is rough and incomplete.  I didn't discuss how the mmap is set up 
> initially, or what happens on file close, munmap, loss of a node, etc.  My 
> intention at this point is just to focus on the core algorithm.
> 
> Discussion, and/or flames welcome :-)
> 
> Regards,
> 
> Daniel

Bye

Dominik ^_^  ^_^


