10.04.2012 22:45, Jeff Layton wrote:
>>>> This check is expensive (as you mentioned), but has to be done only once, on NFS server start.

>>> Well, no. The subtree check happens every time nfsd processes a filehandle -- see nfsd_acceptable(). Basically we have to turn the filehandle into a dentry and then walk back up to the directory that's exported to verify that it is within the correct subtree. If that fails, then we might have to do it more than once if it's a hardlinked file.

>> Wait. Looks like I'm missing something. This subtree check has nothing to do with my proposal (if I'm not mistaken). This option and its logic remain the same. My proposal was to check the directories to be exported on NFS server start. And if any of the passed exports intersects with any export already shared by another NFSd, then shut down NFSd and print an error message. Am I missing the point here?

> Sorry, I got confused with the discussion. You will need to do something similar to what subtree checking does in order to handle your proposal, however.
Agreed. But this check should be performed only once, on NFS server start (not on every fh lookup).
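As a rough illustration of the start-time check discussed above, here is a minimal userspace sketch in Python. It assumes every export path is absolute, already normalized, and resolved against one common root, so it does not address containers with their own chroot. The names `exports_intersect` and `find_conflicts` are hypothetical and are not part of nfsd.

```python
from pathlib import PurePosixPath


def exports_intersect(a: str, b: str) -> bool:
    """Return True if one export path equals or contains the other.

    Assumption: both paths are absolute and free of ".." components;
    PurePosixPath does not resolve symlinks or "..".
    """
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    # Compare each path against the ancestors of the other, loosely
    # analogous to walking from a dentry up to the exported directory.
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_conflicts(new_exports, active_exports):
    """List (new, active) pairs that overlap; an empty list means no conflict."""
    return [(n, a) for n in new_exports for a in active_exports
            if exports_intersect(n, a)]
```

A server start routine could refuse to start whenever `find_conflicts` returns a non-empty list. The comparison is done per path component, so `/srv/a` and `/srv/ab` are not treated as overlapping.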
>>>> With this solution, the grace period can be simple, and no support from the exporting file system is required. But the main problem here is that such intersections can be checked only in the initial file system environment (containers with their own roots, gained via chroot, can't handle this situation). So it means that there has to be some daemon (kernel or user space) which will handle such requests from different NFS server instances... Which in turn means that some way of communication between this daemon and the NFS servers is required. And Unix sockets (of any kind) don't suit here, which makes this problem more difficult.

>>> This is a truly ugly problem, and unfortunately parts of the nfsd codebase are very old and crusty. We've got a lot of cleanup work ahead of us no matter what design we settle on. This is really a lot bigger than the grace period. I think we ought to step back a bit and consider this more "holistically" first. Do you have a pointer to an overall design document or something?

>> What exactly are you asking about? The overall design of containerization?

> I meant containerization of nfsd in particular.
If you are asking about some kind of white paper, then I don't have one. But here are the main visible targets:
1) Move all network-related resources to per-net data (caches, grace period, lockd calls, transports, your tracking engine).
2) Make the nfsd filesystem superblock per network namespace.
3) The service itself will be controlled the way Lockd is (one pool for all, per-net resources allocated on service start).
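To make targets 1) and 3) a bit more concrete, the following is a toy Python model, not kernel code. It assumes a single shared service object and one state record per network namespace, created when the service is started in that namespace. All names (`PerNetNfsd`, `NfsdService`, `net_id`) are hypothetical, and setting `in_grace` on start is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PerNetNfsd:
    """Hypothetical per-network-namespace state for nfsd."""
    caches: dict = field(default_factory=dict)
    transports: list = field(default_factory=list)
    in_grace: bool = False


class NfsdService:
    """One shared service; per-net resources are allocated on start."""

    def __init__(self):
        self._per_net = {}  # net namespace id -> PerNetNfsd

    def start(self, net_id: int) -> PerNetNfsd:
        # Allocate per-net resources only the first time this
        # namespace starts the service (assumed to enter grace).
        if net_id not in self._per_net:
            self._per_net[net_id] = PerNetNfsd(in_grace=True)
        return self._per_net[net_id]

    def stop(self, net_id: int) -> None:
        # Drop this namespace's resources; other namespaces are untouched.
        self._per_net.pop(net_id, None)

    def lookup(self, net_id: int) -> PerNetNfsd:
        # Raises KeyError if the service was never started in net_id.
        return self._per_net[net_id]
```

The point of the model is only that caches, transports and grace state are reached through the namespace key, so nothing is shared between namespaces by accident.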
>>> One thing that puzzles me at the moment. We have two namespaces to deal with -- the network and the mount namespace. With nfs client code, everything is keyed off of the net namespace. That's not really the case here since we have to deal with a local fs tree as well. When an nfsd running in a container receives an RPC, how does it determine what mount namespace it should do its operations in?

>> We don't use mount namespaces, so that's why I wasn't thinking about it... But if we have 2 types of namespaces, then we have to tie the mount namespace to the network one. I.e. we can get the desired mount namespace from the per-net NFSd data.

> One thing that Bruce mentioned to me privately is that we could plan to use whatever mount namespace mountd is using within a particular net namespace. That makes some sense since mountd is the final arbiter of who gets access to what.
Could you, please, give some examples? I don't get the idea.
>> But please don't ask me what will happen if two or more NFS servers share the same mount namespace... Looks like this case should be forbidden.

> I'm not sure we need to forbid sharing the mount namespace. They might be exporting completely different filesystems after all, in which case we'd be forbidding it for no good reason.
Actually, if we make the file system responsible for grace period control, then yes, there is no reason to forbid a shared mount namespace.
> Note that it is quite easy to get lost in the weeds with this. I've been struggling to get a working design for a clustered nfsv4 server for the last several months and have had some time to wrestle with these issues. It's anything but trivial. What you may need to do in order to make progress is to start with some valid use-cases for this stuff, and get those working while disallowing or ignoring other use cases. We'll never get anywhere if we try to solve all of these problems at once...
Agreed. So, my current understanding of the situation can be summarized as follows:
1) The idea of making the grace period (and its internals) per network namespace stays the same. But its implementation affects only the current "generic grace period" code.
2) Your idea of making the grace period per file system looks reasonable. And maybe this approach (using the filesystem's export operations, if available) should be used by default. But I suggest adding a new option to exports (say, "no_fs_grace") which will disable this new functionality. With this option, the system administrator becomes responsible for any problems with a shared file system.
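The selection rule proposed in 2) can be sketched as follows. This is a toy Python model under stated assumptions: `fs_has_grace_ops` stands in for "the filesystem provides suitable export operations", the returned labels are illustrative only, and "no_fs_grace" is the option name suggested above, not an existing exports option.

```python
from dataclasses import dataclass, field


@dataclass
class Export:
    """Hypothetical description of a single export entry."""
    path: str
    options: set = field(default_factory=set)
    fs_has_grace_ops: bool = False  # assumption: fs offers grace handling


def grace_handler(export: Export) -> str:
    """Pick which grace-period mechanism applies to this export."""
    # Default: let the filesystem control the grace period when it can.
    if export.fs_has_grace_ops and "no_fs_grace" not in export.options:
        return "filesystem"
    # Otherwise fall back to the generic per-network-namespace grace
    # period; with no_fs_grace the administrator takes responsibility.
    return "per-net generic"
```

For example, an export on a filesystem with grace support would get "filesystem" by default and "per-net generic" once "no_fs_grace" is set.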
Any objections?

--
Best regards,
Stanislav Kinsbursky