[ogfs-dev][RFC] Mmap race followup
Hi all,
I revisited my previously posted patches under subjects "[RFC][PATCH]
Eliminate mmap race" and "[RFC][PATCH] do_no_page hook" and it seems nothing
needs changing. The main thing on my mind was that I thought I might have
been a little hasty in removing the BKLs and I was considering putting them
back just to be on the safe side. But a quick read through the code shows
that there are really only two things going on: ogfs_glock_i/ogfs_unglock_i
and ogfs_readpage. The former is full of spinlocks that look as though they
are meant to work, and ought to work. The latter function is covered
completely by lock/unlock_kernel, so it ogfs_do_no_page seems to be
completely covered. The BKLs in ogfs_readpage may be redundant too, since
most of the interesting work happens in ogfs_get_block, which has its own
BKLs. And even ogfs_get_block might be OK without BKL, though I didn't read
enough to get a good feeling for that.
In general I saw enough redundant BKL's to think that there must be a lot
more. Over time these should be pruned away, slowly with lots of torture
testing so that if any critical regions are accidently uncovered it's easy to
revert patches until the race goes away.
I also looked critically at the ogfs_glock/gunlock_i in ogfs_readpage, which
takes a shared lock recursively underneath the exclusive lock in
ogfs_do_no_page. This redundant locking can't just be removed because
->readpage is also called from generic_file_readahead. I don't have a good
suggestion re what to do about that, however this will continue to bother me
until I come up with something. It's not hurting anything, it's just
redundant.
Anyway, I decided there's enough justification to leave the BKLs out of
ogfs_do_no_page. I hope my little writeup here will provide some insight
into how I came to that conclusion, and how to hunt for other redundant
locking, of which I'm sure there's plenty.
I worried about John Byrne's point regarding erroneously unmapping COW'd pages
of private mappings. To fix this properly needs a change to zap_page_range
or some other similar core kernel tweak, which I have not done. The easier
thing is to just skip all the private mappings during invalidation, which the
current code seems to do. This actually gives Posix-compliant behaviour: the
client won't necessarily see the most up-to-date version of the mapped file
pages that it hasn't yet written, but Posix allows that. Linus, however, has
stricter standards and has stated he wants tighter-than-Posix semantics. So
that is in the pipeline but I haven't done it yet.
The next thing I got interested in is how anybody might trigger the race in
question, since there are only a few program steps between the time the glock
(in the old version) is dropped and when the new page is entered into the
page table by do_no_page. To trigger the race, another client has to be able
to come in, grab the glock and invalidate the page table entry during the
small window of the race, which sounds practically impossible. However, one
of those program steps is the retaking of the ->page_table_lock spinlock,
which the other client could easily grab as soon as the first client drops
the glock. So this makes the race a whole lot more likely and also makes me
think that nobody will see the race on a uniprocessor kernel. I'm waiting
for some test results to confirm that theory.
Speaking of testing, I wrote a little test program, included below, to
convince myself that mmapped file pages actually bounce between clients as
they should. It just mmaps a file then repeatedly pokes the first byte of it
twice a second. I ran this simultaneously on two clients, put printks in the
ogfs_do_no_page, and sure enough I see the do_no_page function get invoked
repeatedly on both clients. Stop the test on one of the clients and
do_no_page stops getting invoked on the other, as expected. This convinces
me that the invalidation is working the way it's supposed to.
I plan to improve my poke-test over time. For one thing, I want it to check
for correct file contents as well. A simple check is to have each poke-test
increment a counter at a different offset in the file, and check to see that
the counter in the file always matches the counter in the program. It would
be good to run the test on several files in parallel, which is easily
accomplished by running several poke-tests in parallel. Likewise I could run
poke-tests on more clients, if I had more clients. I should also run the
test a lot faster; there isn't actually a reason to have a delay at all. The
goal is to turn up races as fast as possible so I can discover them as soon
as I create them instead of going through the much longer and more
embarrassing process of having users complain about them.
All of the above is my long-winded way of saying, I think these patches are
correct. If nobody has any objections, I'd like to commit these two to cvs
and move on to other things.
Regards,
Daniel
-----------------
file: poke-test.c
-----------------
#include <stdio.h>
#include <time.h>
#include <sys/fcntl.h>
#include <sys/mman.h>
int main(int argc, char *argv[])
{
char *name = argv[1], *mapped;
int fd;
if (argc < 2) {
printf("usage: %s <filename>\n", argv[0]);
return 1;
}
if ((fd = open(name, O_RDWR)) < 0) {
printf("%s not found\n", name);
return 1;
}
if ((mapped = mmap(0, 4096, PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
printf("writeable mmap of %s not allowed\n", name);
return 1;
}
while (1) {
printf("poke %p\n", mapped);
*mapped = 'x';
nanosleep(&(struct timespec){ 0, 500 * 1000 * 1000 }, NULL);
}
return 0;
}
-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community? Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/
_______________________________________________
Opengfs-devel mailing list
Opengfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opengfs-devel
[Kernel]
[Security]
[Bugtraq]
[Photo]
[Yosemite]
[MIPS Linux]
[ARM Linux]
[Linux Clusters]
[Linux RAID]
[Yosemite Hiking]
[Linux Resources]