
Re: [Fwd: [ogfs-dev]Clustered mmap algorithm]



David B. Zafman wrote:

What do you think about this?


------------------------------------------------------------------------

Subject: [ogfs-dev]Clustered mmap algorithm
From: Daniel Phillips <phillips@arcor.de>
Date: Wed, 20 Aug 2003 00:27:28 +0200
To: opengfs-devel@lists.sourceforge.net


Hi Everybody,


As you may know, I've set out to tackle the problem of adding a clustered, writable mmap feature to OpenGFS. The object is to prove that the changes proposed for the VFS to support clustered mmap are in fact correct, by showing a correct implementation in OpenGFS. This work will also provide a model for mmap implementations in other clustered filesystems.

The strategy for a clustered, writable mmap is simple. There are two basic cluster-wide states:

   A) (Exclusive) One node may have a particular file page mapped RW in
   one or more of its page tables, and no other node may map that page.
   If any memory access is attempted on another node, a fault will occur,
   and the necessary work will be done to put the cluster into state (B)
   below in the case of a read, or the ownership of the exclusive will be
   changed in the case of a write.

   B) (Shared) More than one node in a cluster may map the same file
   page, and all page table entries are RO.  If any memory write operation
   is attempted, a fault occurs and the necessary work is done to
   put the cluster into state (A), then the write operation is allowed
   to proceed.
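
The two states above can be sketched as a toy state machine for a single file page. This is only an illustration, not OpenGFS or kernel code: the names `State`, `PageState` and `access` are hypothetical, and the initial UNMAPPED state and the choice to keep the former exclusive owner as a read-only holder after a remote read are assumptions the description does not spell out.

```python
from enum import Enum


class State(Enum):
    UNMAPPED = 0   # assumption: nothing mapped yet (not described in the text)
    EXCLUSIVE = 1  # state (A): one node maps the page RW
    SHARED = 2     # state (B): any number of nodes map the page RO


class PageState:
    """Cluster-wide state of one file page (hypothetical model)."""

    def __init__(self):
        self.state = State.UNMAPPED
        self.holders = set()

    def access(self, node, write):
        """Apply a memory access by `node`; return True if it faulted."""
        if self.state is State.EXCLUSIVE and node in self.holders:
            return False  # the exclusive owner reads and writes freely
        if self.state is State.SHARED and node in self.holders and not write:
            return False  # an RO mapping satisfies a read
        # Every other access faults and forces a transition.
        if write:
            # Write: ownership of the exclusive moves to the faulting node.
            self.state = State.EXCLUSIVE
            self.holders = {node}
        else:
            # Read: the cluster moves to (or stays in) the shared state.
            self.state = State.SHARED
            self.holders = self.holders | {node}
        return True
```

In this model a write from any node other than the exclusive owner always ends in state (A) with that node as the sole holder, which is the property the transitions are meant to guarantee.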

Given my understanding of the 2.4 Linux VM, the only way pages will be mapped read-only is if the mapping is private. (do_mmap_pgoff() currently turns "shared" read-only mappings into private mappings.)


If you want to have multiple shared writable mappings that are used "mostly read", then you need to change the way the mapping and the fault code work. You will have to ensure that do_no_page() maps pages read-only on read faults. (Probably the easiest way to do this is to change how do_mmap_pgoff() sets the page protections.) do_wp_page() currently will COW the pages that you have mapped read-only. You'll have to add some code to prevent this, probably as a new vm op.
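
As a rough illustration of the behaviour being asked for here, the two decisions can be written as plain functions. This is a model of the desired outcome only, not kernel code; `no_page_protection` and `wp_fault_action` are hypothetical names, and the string results stand in for what a real fault handler would do.

```python
def no_page_protection(write_fault):
    """Protection wanted when faulting in a page of a shared writable mapping.

    Models the requested do_no_page() behaviour: a read fault maps the
    page read-only, so a later write still faults and can trigger the
    cluster-wide transition to the exclusive state.
    """
    return "RW" if write_fault else "RO"


def wp_fault_action(shared_mapping):
    """What a write to a read-only page should lead to.

    For a private mapping, copy-on-write remains correct. For a shared
    mapping the page must not be copied; the node upgrades its lock and
    the existing page is then made writable.
    """
    if shared_mapping:
        return "upgrade-to-exclusive"
    return "copy-on-write"
```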


This can be further simplified by implementing only the exclusive state, and my initial implementation will work that way. This sacrifices some performance in certain circumstances in return for considerably reducing the number of states and transitions that need to be debugged. Later, when I add support for the second state, it will be under control of an ifdef for debugging purposes. That is, enabling the shared state should simply give increased performance, not any new functionality.

I honestly don't think this will simplify things too much.



While the above description is in terms of page granularity, this implementation will be at file granularity, because OpenGFS global locking is currently done this way, and because I'd rather not break new ground here just at the moment. The above description doesn't have to change much to accommodate this simplification:


   A) (Exclusive) One node may have a particular file's pages mapped RW in
   one or more of its page tables, and no other node may map those pages.
   If any memory access is attempted on another node, a fault will occur,
   and the necessary work will be done to put the cluster into state (B)
   below in the case of a read, or the ownership of the exclusive will be
   changed in the case of a write.

   B) (Shared) More than one node in a cluster may map the same file's
   pages, and all page table entries are RO.  If any memory write operation
   is attempted, a fault occurs and the necessary work is done to
   put the cluster into state (A), then the write operation is allowed
   to proceed.

The "necessary work" to make the transitions between shared and exclusive states consists of:

  - cache writeout
  - cache invalidation
  - page table invalidation
  - page table write protect

This work will be performed mainly by a local daemon in response to requests from the central lock manager, which in turn responds to requests from nodes on which page faults occur.

The specific events and transitions are:

   write fault: (do_no_page)
      have shared lock:
         attempt to upgrade to exclusive.  If this fails (because another
         node is also trying to upgrade to exclusive) then invalidate all
         page table entries for this inode and drop the shared lock, then
         request the exclusive lock.
      have no lock:
         obtain the exclusive
            If some other node already holds the exclusive, it must flush
            its dirty pages and inode state to disk, and invalidate its page
            table entries and cache for this file.
      continue as with normal do_no_page

   read fault:
      have no lock:
         obtain shared lock
            If some other node already holds the exclusive, it must flush
            its dirty pages and inode state to disk, and invalidate its page
            table entries and cache for this file, as above, and additionally,
            write-protect any already-mapped pages.

This is confusing. If you didn't have the lock, how can any pages have already been mapped?


      now we hold exclusive or shared access
      continue as with normal do_no_page.  If we hold shared access, the page
      will be mapped write-protected.
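
The fault-time transitions above can be sketched with a toy, single-threaded lock manager for one file. Everything here is hypothetical (`LockManager`, `write_fault`, `read_fault`, the log strings). As a simplifying assumption, an upgrade succeeds only when the faulting node is the sole shared holder, which stands in for the "another node is also trying to upgrade" failure case; a former exclusive holder is assumed to stay on as a shared holder after a downgrade.

```python
class LockManager:
    """Toy central lock manager for a single file (hypothetical)."""

    def __init__(self):
        self.exclusive = None  # node holding the exclusive lock, if any
        self.shared = set()    # nodes holding the shared lock
        self.log = []          # (node, work) requests sent to lock daemons

    def try_upgrade(self, node):
        # Simplification: only a sole shared holder can upgrade in place.
        if self.shared == {node}:
            self.shared.clear()
            self.exclusive = node
            return True
        return False

    def acquire_exclusive(self, node):
        # Any other holder must give up its lock first.
        if self.exclusive is not None and self.exclusive != node:
            self.log.append((self.exclusive, "flush+invalidate"))
        for other in sorted(self.shared - {node}):
            self.log.append((other, "invalidate"))
        self.shared.clear()
        self.exclusive = node

    def acquire_shared(self, node):
        # An exclusive holder elsewhere is downgraded to shared.
        if self.exclusive is not None and self.exclusive != node:
            self.log.append((self.exclusive, "flush+downgrade"))
            self.shared.add(self.exclusive)
            self.exclusive = None
        self.shared.add(node)


def write_fault(lm, node):
    if lm.exclusive != node:
        if node in lm.shared and not lm.try_upgrade(node):
            # Upgrade failed: drop our own mappings and lock, then re-request.
            lm.log.append((node, "invalidate own PTEs"))
            lm.shared.discard(node)
            lm.acquire_exclusive(node)
        elif lm.exclusive != node:
            lm.acquire_exclusive(node)
    return "RW"  # continue as with normal do_no_page


def read_fault(lm, node):
    if lm.exclusive != node and node not in lm.shared:
        lm.acquire_shared(node)
    # Under the shared lock the page is mapped write-protected.
    return "RW" if lm.exclusive == node else "RO"
```

For example, if nodes "a" and "b" both read the file and "a" then writes, the model logs an invalidation on "b" and leaves "a" as the exclusive holder.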

Lock daemon request handling:
   Release a lock:
      Block new page faults for the file (by changing the state of the
         lock it owns)
      Invalidate any page table mappings of this file, by traversing the
         list of shared mappings for the file
      Write out any dirty pages (if in shared state, there can be no dirty
         pages, so check this)
      Remove all cached pages from the page cache
      Acknowledge to the lock manager that the lock was released

   Downgrade an exclusive lock to shared:
      This is the same as releasing it, but pages are not removed from
      the page cache.
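
The release and downgrade steps above can be modelled on one node's view of a file. This is a sketch under assumptions, not daemon code: `NodeFileState`, `release` and `downgrade` are hypothetical names, pages are plain integers, and "blocking new faults" is represented by parking the lock in a transient "releasing" state.

```python
class NodeFileState:
    """One node's local view of a file (hypothetical model)."""

    def __init__(self, lock, cached=(), dirty=(), mapped=()):
        self.lock = lock            # "exclusive", "shared", "releasing" or None
        self.cached = set(cached)   # pages in the page cache
        self.dirty = set(dirty)     # cached pages not yet written to disk
        self.mapped = set(mapped)   # pages present in some page table
        self.written = []           # pages written out, in order


def release(f, keep_cache=False):
    """Give up the lock; with keep_cache=True this is a downgrade to shared."""
    held = f.lock
    f.lock = "releasing"            # block new page faults for the file
    f.mapped.clear()                # invalidate page table mappings
    if held == "shared" and f.dirty:
        raise RuntimeError("dirty pages found while holding only a shared lock")
    f.written.extend(sorted(f.dirty))  # write out any dirty pages
    f.dirty.clear()
    if not keep_cache:
        f.cached.clear()            # remove all cached pages
    f.lock = "shared" if keep_cache else None
    return "ack"                    # acknowledge to the lock manager


def downgrade(f):
    """Same as release, but pages stay in the page cache."""
    return release(f, keep_cache=True)
```

Note that in this model the downgrade unmaps pages rather than write-protecting them in place, so later accesses fault them back in under the shared lock.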

In this case you would need to write-protect existing mappings, but simply unmapping them and allowing them to be faulted in again works and requires less new code.



Interaction between read/write and mmap: precautions are needed to avoid deadlock when writing from a mmapped file to a file on the same clustered filesystem, or likewise, when reading from a file to a mmapped file. The deadlock possibility arises because two files have to be locked to complete these operations; two such operations simultaneously must be careful not to take the locks in opposite order.
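
One common way to avoid taking two locks in opposite order is to impose a global order on them. The sketch below uses ascending inode number as that order, which is an assumption for illustration and not something the proposal specifies; `FileLock`, `lock_pair`, `unlock_pair` and `copy` are hypothetical names, and local thread locks stand in for cluster-wide file locks.

```python
import threading


class FileLock:
    """A per-file lock identified by inode number (hypothetical)."""

    def __init__(self, ino):
        self.ino = ino
        self.lock = threading.Lock()


def lock_pair(a, b):
    """Acquire both locks in ascending inode order, whatever the call order."""
    first, second = sorted((a, b), key=lambda f: f.ino)
    first.lock.acquire()
    if second is not first:
        second.lock.acquire()
    return first, second


def unlock_pair(first, second):
    """Release in the reverse of acquisition order."""
    if second is not first:
        second.lock.release()
    first.lock.release()


def copy(src, dst, done):
    """Stand-in for a read or write that needs both files locked."""
    first, second = lock_pair(src, dst)
    try:
        done.append((src.ino, dst.ino))
    finally:
        unlock_pair(first, second)
```

Because both directions of the copy acquire the lower-numbered file first, two simultaneous operations on the same pair of files cannot each hold one lock while waiting for the other.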


This is rough and incomplete. I didn't discuss how the mmap is set up initially, or what happens on file close, munmap, loss of a node, etc. My intention at this point is just to focus on the core algorithm.

Discussion, and/or flames welcome :-)

Regards,

Daniel

John Byrne




