Hi all,

I revisited my previously posted patches under the subjects "[RFC][PATCH] Eliminate mmap race" and "[RFC][PATCH] do_no_page hook", and it seems nothing needs changing. My main worry was that I had been a little hasty in removing the BKLs, and I considered putting them back just to be on the safe side. But a quick read through the code shows that there are really only two things going on: ogfs_glock_i/ogfs_unglock_i and ogfs_readpage. The former is protected internally by spinlocks that look correct and ought to work. The latter is covered completely by lock/unlock_kernel, so ogfs_do_no_page appears to be fully covered.

The BKLs in ogfs_readpage may themselves be redundant, since most of the interesting work happens in ogfs_get_block, which takes its own BKLs. Even ogfs_get_block might be OK without the BKL, though I didn't read enough to get a good feeling for that. In general I saw enough redundant BKLs to suspect there are many more. Over time these should be pruned away, slowly and with lots of torture testing, so that if any critical regions are accidentally uncovered it is easy to revert patches until the race goes away.

I also looked critically at the ogfs_glock/gunlock_i in ogfs_readpage, which takes a shared lock recursively underneath the exclusive lock in ogfs_do_no_page. This redundant locking can't simply be removed, because ->readpage is also called from generic_file_readahead. I don't have a good suggestion for what to do about that, and it will continue to bother me until I come up with something. It's not hurting anything, it's just redundant.

Anyway, I decided there is enough justification to leave the BKLs out of ogfs_do_no_page. I hope this little writeup provides some insight into how I came to that conclusion, and into how to hunt for other redundant locking, of which I'm sure there's plenty.
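To make the nesting easier to see, here is a rough call-tree sketch of the locking as I read it. The function names are the ones discussed above, but the exact call order within each function is my reconstruction, not a verbatim trace:

    ogfs_do_no_page()
        ogfs_glock_i(ip, EXCLUSIVE)      // exclusive glock for the fault
        do_no_page()
            ->readpage == ogfs_readpage()
                lock_kernel()            // BKL -- possibly redundant
                ogfs_glock_i(ip, SHARED) // recursive shared lock under the
                                         // exclusive one; needed only for the
                                         // generic_file_readahead path
                ogfs_get_block()
                    lock_kernel()        // its own BKL
                    ...
                    unlock_kernel()
                ogfs_gunlock_i(ip)
                unlock_kernel()
        ogfs_gunlock_i(ip)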
I worried about John Byrne's point regarding erroneously unmapping COW'd pages of private mappings. Fixing this properly needs a change to zap_page_range or some other similar core kernel tweak, which I have not done. The easier thing is to just skip all the private mappings during invalidation, which the current code seems to do. This actually gives POSIX-compliant behaviour: the client won't necessarily see the most up-to-date version of mapped file pages that it hasn't yet written, but POSIX allows that. Linus, however, has stricter standards and has stated that he wants tighter-than-POSIX semantics. So that is in the pipeline, but I haven't done it yet.

The next thing I got interested in is how anybody might actually trigger the race in question, since there are only a few program steps between the time the glock (in the old version) is dropped and the time the new page is entered into the page table by do_no_page. To trigger the race, another client has to come in, grab the glock and invalidate the page table entry during that small window, which sounds practically impossible. However, one of those program steps is the retaking of the ->page_table_lock spinlock, which the other client can easily grab as soon as the first client drops the glock. This makes the race a whole lot more likely, and also makes me think that nobody will ever see it on a uniprocessor kernel. I'm waiting for some test results to confirm that theory.

Speaking of testing, I wrote a little test program, included below, to convince myself that mmapped file pages actually bounce between clients as they should. It just mmaps a file, then repeatedly pokes the first byte of it twice a second. I ran this simultaneously on two clients, put printks in ogfs_do_no_page, and sure enough, I saw the do_no_page function get invoked repeatedly on both clients. Stop the test on one of the clients, and do_no_page stops getting invoked on the other, as expected.
Seeing do_no_page fire and stop firing like that convinces me that the invalidation is working the way it's supposed to.

I plan to improve my poke-test over time. For one thing, I want it to check for correct file contents as well. A simple check is to have each poke-test increment a counter at a different offset in the file, and verify that the counter in the file always matches the counter in the program. It would be good to run the test on several files in parallel, which is easily accomplished by running several poke-tests in parallel. Likewise, I could run poke-tests on more clients, if I had more clients. I should also run the test a lot faster; there isn't actually any reason to have a delay at all. The goal is to turn up races as fast as possible, so I can discover them as soon as I create them, instead of going through the much longer and more embarrassing process of having users complain about them.

All of the above is my long-winded way of saying: I think these patches are correct. If nobody has any objections, I'd like to commit these two to cvs and move on to other things.

Regards,

Daniel

----------------- file: poke-test.c -----------------

#include <stdio.h>
#include <time.h>
#include <sys/fcntl.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	char *name = argv[1], *mapped;
	int fd;

	if (argc < 2) {
		printf("usage: %s <filename>\n", argv[0]);
		return 1;
	}
	if ((fd = open(name, O_RDWR)) < 0) {
		printf("%s not found\n", name);
		return 1;
	}
	if ((mapped = mmap(0, 4096, PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
		printf("writeable mmap of %s not allowed\n", name);
		return 1;
	}
	/* poke the first mapped byte twice a second, forever */
	while (1) {
		printf("poke %p\n", mapped);
		*mapped = 'x';
		nanosleep(&(struct timespec){ 0, 500 * 1000 * 1000 }, NULL);
	}
	return 0;
}

-------------------------------------------------------