Re: [Fwd: [ogfs-dev]Clustered mmap algorithm]
David B. Zafman wrote:
What do you think about this?
------------------------------------------------------------------------
Subject: [ogfs-dev]Clustered mmap algorithm
From: Daniel Phillips <phillips@arcor.de>
Date: Wed, 20 Aug 2003 00:27:28 +0200
To: opengfs-devel@lists.sourceforge.net
Hi Everybody,
As you may know, I've set out to tackle the problem of adding a clustered,
writable mmap feature to OpenGFS. The object is to prove that the changes
proposed for the VFS to support clustered mmap are in fact correct, by
showing a correct implementation in OpenGFS. This work will also provide a
model for mmap implementations in other clustered filesystems.
The strategy for a clustered, writable mmap is simple. There are two basic
cluster-wide states:
A) (Exclusive) One node may have a particular file page mapped RW in
one or more of its page tables, and no other node may map that page.
If any memory access is attempted on another node, a fault will occur,
and the necessary work will be done to put the cluster into state (B)
below in the case of a read, or the ownership of the exclusive will be
changed in the case of a write.
B) (Shared) More than one node in a cluster may map the same file
page, and all page table entries are RO. If any memory write operation
is attempted, a fault occurs and the necessary work is done to
put the cluster into state (A), then the write operation is allowed
to proceed.
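The two states and the transitions between them can be sketched as a small model. This is only an illustration of the description above, not OpenGFS or kernel code; the class name `PageState`, its fields, and the node names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PageState:
    """Hypothetical model of one file page's cluster-wide state."""
    mode: str = "unmapped"                      # "unmapped", "shared" or "exclusive"
    holders: set = field(default_factory=set)   # nodes that currently map the page

    def access(self, node: str, write: bool) -> str:
        """Apply a memory access and return the resulting cluster-wide mode."""
        if write:
            # A write always ends in state (A): exactly one node maps the page RW.
            self.mode, self.holders = "exclusive", {node}
        elif self.mode == "exclusive" and self.holders == {node}:
            # The exclusive owner reads through its own RW mapping; no fault.
            pass
        else:
            # A read from any other node ends in state (B): every mapping is RO.
            self.mode = "shared"
            self.holders.add(node)
        return self.mode
```

The model only tracks which state the cluster ends up in; the flushing and invalidation needed to get there is left out on purpose.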
Given my understanding of the 2.4 Linux VM, the only way pages will be
mapped read-only is if it is a private mapping. (do_mmap_pgoff()
currently turns "shared" read-only mappings into private mappings.)
If you want to have multiple shared writable mappings that are used
"mostly read", then you need to change the way the mapping and the fault
code works. You will have to ensure that do_no_page() maps the pages
only read-only for read faults. (Probably the easiest way to do this is
to change how do_mmap_pgoff() sets the page protections.) do_wp_page()
currently will COW the pages that you have mapped read-only. You'll have
to add some code to prevent this; probably as a new vm op.
This can be further simplified by implementing only the exclusive
state, and my initial implementation will work that way. This sacrifices some
performance in certain circumstances in return for considerably reducing the
number of states and transitions that need to be debugged. Later, when I add
support for the second state, it will be under control of an ifdef for
debugging purposes. That is, enabling the shared state should simply give
increased performance, not any new functionality.
I honestly don't think this will simplify things too much.
While the above description is in terms of page granularity, this
implementation will be at file granularity, because OpenGFS global locking is
currently done this way, and because I'd rather not break new ground here
just at the moment. The above description doesn't have to change much to
accommodate this simplification:
A) (Exclusive) One node may have a particular file's pages mapped RW in
one or more of its page tables, and no other node may map those pages.
If any memory access is attempted on another node, a fault will occur,
and the necessary work will be done to put the cluster into state (B)
below in the case of a read, or the ownership of the exclusive will be
changed in the case of a write.
B) (Shared) More than one node in a cluster may map the same file's
pages, and all page table entries are RO. If any memory write operation
is attempted, a fault occurs and the necessary work is done to
put the cluster into state (A), then the write operation is allowed
to proceed.
The "necessary work" to make the transitions between shared and exclusive
states consists of:
- cache writeout
- cache invalidation
- page table invalidation
- page table write protect
This work will be performed mainly by a local daemon in response to requests
from the central lock manager, which in turn responds to requests from nodes
on which page faults occur.
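The request flow just described (faulting node, then lock manager, then the daemon on each conflicting node) might look roughly like the sketch below. The `LockManager` class, the request names "release" and "downgrade", and the callback-per-node arrangement are assumptions made for illustration; they are not the OpenGFS lock manager interface.

```python
class LockManager:
    """Hypothetical central lock manager for a single file."""

    def __init__(self, daemons):
        self.daemons = daemons   # node name -> callable run by that node's daemon
        self.holders = {}        # node name -> "shared" or "exclusive"

    def request(self, node, mode):
        """Grant `mode` to `node` after asking conflicting holders to step aside."""
        for other, held in list(self.holders.items()):
            if other == node:
                continue
            if mode == "exclusive":
                # Nobody else may keep any lock while one node is exclusive.
                self.daemons[other]("release")
                del self.holders[other]
            elif held == "exclusive":
                # A reader only needs the exclusive holder to drop to shared.
                self.daemons[other]("downgrade")
                self.holders[other] = "shared"
        self.holders[node] = mode
```

A real manager would wait for each daemon's acknowledgement before granting the lock; here the callback returning stands in for that.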
The specific events and transitions are:
write fault: (do_no_page)
have shared lock:
attempt to upgrade to exclusive. If this fails (because another
node is also trying to upgrade to exclusive) then invalidate all
page table entries for this inode and drop the shared lock, then
request the exclusive lock.
have no lock:
obtain the exclusive
If some other node already holds the exclusive, it must flush
its dirty pages and inode state to disk, and invalidate its page
table entries and cache for this file.
continue as with normal do_no_page
read fault:
have no lock:
obtain shared lock
If some other node already holds the exclusive, it must flush
its dirty pages and inode state to disk, and invalidate its page
table entries and cache for this file, as above, and additionally,
write-protect any already-mapped pages.
This is confusing. If you didn't have the lock, how can any pages have
already been mapped?
now we hold exclusive or shared access
continue as with normal do_no_page. If we hold shared access, the page
will be mapped write-protected.
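The fault-side decisions above can be sketched as one function. `StubLock` and its methods (`try_upgrade`, `drop`, `acquire`) are hypothetical stand-ins for the cluster lock on an inode, and `invalidate_ptes` is a placeholder callback; none of these names come from OpenGFS or the kernel. Mapping the page RW whenever the exclusive lock is held is also an assumption of the sketch.

```python
class StubLock:
    """Hypothetical stand-in for the cluster lock on one inode."""

    def __init__(self, held=None, upgrade_ok=True):
        self.held = held              # None, "shared" or "exclusive"
        self.upgrade_ok = upgrade_ok  # whether an upgrade attempt will succeed
        self.trace = []               # lock operations, in order, for inspection

    def try_upgrade(self):
        self.trace.append("try_upgrade")
        if self.upgrade_ok:
            self.held = "exclusive"
        return self.upgrade_ok

    def drop(self):
        self.trace.append("drop")
        self.held = None

    def acquire(self, mode):
        self.trace.append("acquire " + mode)
        self.held = mode              # a real lock would block until granted


def handle_fault(lock, write, invalidate_ptes):
    """Return the protection ("rw" or "ro") for the page being faulted in."""
    if write:
        if lock.held == "shared" and not lock.try_upgrade():
            # Lost the upgrade race: give up our mappings and the shared lock.
            invalidate_ptes()
            lock.drop()
        if lock.held != "exclusive":
            lock.acquire("exclusive")
    elif lock.held is None:
        lock.acquire("shared")
    # From here the normal do_no_page path would continue.
    return "rw" if lock.held == "exclusive" else "ro"
```

For example, `handle_fault(StubLock(held="shared", upgrade_ok=False), True, lambda: None)` walks the failed-upgrade path: try the upgrade, drop the shared lock, then request the exclusive.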
Lock daemon request handling:
Release a lock:
Block new page faults for the file (by changing the state of the
lock it owns)
Invalidate any page table mappings of this file, by traversing the
list of shared mappings for the file
Write out any dirty pages (if in shared state, there can be no dirty
pages, so check this)
Remove all cached pages from the page cache
Acknowledge to the lock manager that the lock was released
Downgrade an exclusive lock to shared:
This is the same as releasing it, but pages are not removed from
the page cache.
In this case you would need to write-protect existing mappings, but
simply unmapping them and allowing them to be faulted in again works
and requires less new code.
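The daemon's two request handlers can be sketched as follows, taking the simpler route of unmapping on downgrade rather than write-protecting in place. `FileState`, its fields, and the "changing" lock state are illustrative assumptions; the text does not say when blocked faults are released, so the sketch simply leaves the lock in its final state.

```python
from dataclasses import dataclass, field


@dataclass
class FileState:
    """Hypothetical per-file state on one node; field names are illustrative."""
    lock: str                                     # "shared" or "exclusive"
    mapped: set = field(default_factory=set)      # page indexes in page tables
    cached: set = field(default_factory=set)      # page indexes in the page cache
    dirty: set = field(default_factory=set)       # cached pages not yet on disk
    written: list = field(default_factory=list)   # stands in for disk writes


def handle_request(f: FileState, request: str) -> str:
    """Handle "release" or "downgrade" and return the acknowledgement."""
    if request not in ("release", "downgrade"):
        raise ValueError(request)
    held = f.lock
    f.lock = "changing"                  # new page faults would block on this
    f.mapped.clear()                     # invalidate all page table entries
    if held == "shared":
        assert not f.dirty, "shared state must not have dirty pages"
    f.written.extend(sorted(f.dirty))    # write out dirty pages
    f.dirty.clear()
    if request == "release":
        f.cached.clear()                 # drop the page cache for this file
        f.lock = "none"
    else:
        f.lock = "shared"                # cache is kept; pages fault back in RO
    return "ack " + request
```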
Interaction between read/write and mmap: precautions are needed to avoid
deadlock when writing from a mmapped file to a file on the same clustered
filesystem, or likewise, when reading from a file to a mmapped file. The
deadlock possibility arises because two files have to be locked to complete
these operations; two such operations simultaneously must be careful not to
take the locks in opposite order.
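One common way to avoid taking two locks in opposite order is to impose a fixed global order on them. The sketch below orders by inode number, which is an assumption for illustration and not something the proposal specifies; local `threading.Lock` objects stand in for the cluster-wide file locks, and the dict layout with `"ino"` and `"lock"` keys is made up for the example.

```python
import threading
from contextlib import contextmanager


@contextmanager
def lock_both(a, b):
    """Hold the locks of files a and b, always taking the lower inode first."""
    first, second = sorted((a, b), key=lambda f: f["ino"])
    with first["lock"]:
        if second is first:
            # Source and destination are the same file: one lock is enough.
            yield
        else:
            with second["lock"]:
                yield
```

With this, a write from file X into mmapped file Y and a simultaneous write from Y into X both lock the lower-numbered inode first, so neither can hold one lock while waiting on the other.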
This is rough and incomplete. I didn't discuss how the mmap is set up
initially, or what happens on file close, munmap, loss of a node, etc. My
intention at this point is just to focus on the core algorithm.
Discussion, and/or flames welcome :-)
Regards,
Daniel
John Byrne