[ogfs-dev][RFC] Mmap race followup

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi all,

I revisited my previously posted patches under subjects "[RFC][PATCH] 
Eliminate mmap race" and "[RFC][PATCH] do_no_page hook" and it seems nothing 
needs changing.  The main thing on my mind was that I thought I might have 
been a little hasty in removing the BKLs and I was considering putting them 
back just to be on the safe side.  But a quick read through the code shows 
that there are really only two things going on: ogfs_glock_i/ogfs_unglock_i 
and ogfs_readpage.  The former is full of spinlocks that look as though they 
are meant to work, and ought to work.  The latter function is covered 
completely by lock/unlock_kernel, so it ogfs_do_no_page seems to be 
completely covered.  The BKLs in ogfs_readpage may be redundant too, since 
most of the interesting work happens in ogfs_get_block, which has its own 
BKLs.  And even ogfs_get_block might be OK without BKL, though I didn't read 
enough to get a good feeling for that.

In general I saw enough redundant BKL's to think that there must be a lot 
more.  Over time these should be pruned away, slowly with lots of torture 
testing so that if any critical regions are accidently uncovered it's easy to 
revert patches until the race goes away.

I also looked critically at the ogfs_glock/gunlock_i in ogfs_readpage, which 
takes a shared lock recursively underneath the exclusive lock in 
ogfs_do_no_page.  This redundant locking can't just be removed because 
->readpage is also called from generic_file_readahead.  I don't have a good 
suggestion re what to do about that, however this will continue to bother me 
until I come up with something.  It's not hurting anything, it's just 
redundant.

Anyway, I decided there's enough justification to leave the BKLs out of 
ogfs_do_no_page.  I hope my little writeup here will provide some insight 
into how I came to that conclusion, and how to hunt for other redundant 
locking, of which I'm sure there's plenty.

I worried about John Byrne's point regarding erroneously unmapping COW'd pages 
of private mappings.  To fix this properly needs a change to zap_page_range 
or some other similar core kernel tweak, which I have not done.  The easier 
thing is to just skip all the private mappings during invalidation, which the 
current code seems to do.  This actually gives Posix-compliant behaviour: the 
client won't necessarily see the most up-to-date version of the mapped file 
pages that it hasn't yet written, but Posix allows that.  Linus, however, has 
stricter standards and has stated he wants tighter-than-Posix semantics.  So 
that is in the pipeline but I haven't done it yet.

The next thing I got interested in is how anybody might trigger the race in 
question, since there are only a few program steps between the time the glock 
(in the old version) is dropped and when the new page is entered into the 
page table by do_no_page.  To trigger the race, another client has to be able 
to come in, grab the glock and invalidate the page table entry during the 
small window of the race, which sounds practically impossible.  However, one 
of those program steps is the retaking of the ->page_table_lock spinlock, 
which the other client could easily grab as soon as the first client drops 
the glock.  So this makes the race a whole lot more likely and also makes me 
think that nobody will see the race on a uniprocessor kernel.  I'm waiting 
for some test results to confirm that theory.

Speaking of testing, I wrote a little test program, included below, to 
convince myself that mmapped file pages actually bounce between clients as 
they should.  It just mmaps a file then repeatedly pokes the first byte of it 
twice a second.  I ran this simultaneously on two clients, put printks in the 
ogfs_do_no_page, and sure enough I see the do_no_page function get invoked 
repeatedly on both clients.  Stop the test on one of the clients and 
do_no_page stops getting invoked on the other, as expected.  This convinces 
me that the invalidation is working the way it's supposed to.

I plan to improve my poke-test over time.  For one thing, I want it to check 
for correct file contents as well.  A simple check is to have each poke-test 
increment a counter at a different offset in the file, and check to see that 
the counter in the file always matches the counter in the program.  It would 
be good to run the test on several files in parallel, which is easily 
accomplished by running several poke-tests in parallel.  Likewise I could run 
poke-tests on more clients, if I had more clients.  I should also run the 
test a lot faster; there isn't actually a reason to have a delay at all.  The 
goal is to turn up races as fast as possible so I can discover them as soon 
as I create them instead of going through the much longer and more 
embarrassing process of having users complain about them.

All of the above is my long-winded way of saying, I think these patches are 
correct.  If nobody has any objections, I'd like to commit these two to cvs 
and move on to other things.

Regards,

Daniel

-----------------
file: poke-test.c
-----------------

#include <stdio.h>
#include <time.h>
#include <sys/fcntl.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	char *name = argv[1], *mapped;
	int fd;

	if (argc < 2) {
		printf("usage: %s <filename>\n", argv[0]);
		return 1;
	}

	if ((fd = open(name, O_RDWR)) < 0) {
		printf("%s not found\n", name);
		return 1;
	}

	if ((mapped = mmap(0, 4096, PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
		printf("writeable mmap of %s not allowed\n", name);
		return 1;
	}

	while (1) {
		printf("poke %p\n", mapped);
		*mapped = 'x';
		nanosleep(&(struct timespec){ 0, 500 * 1000 * 1000 }, NULL);
	}
	return 0;
}



-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community?  Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/
_______________________________________________
Opengfs-devel mailing list
Opengfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opengfs-devel

[Index of Archives]     [Kernel]     [Security]     [Bugtraq]     [MIPS Linux]     [ARM Linux]     [Linux Clusters]     [Linux RAID]     [Yosemite Hiking]
  Powered by Linux