On Wed, Nov 11, 2009 at 8:19 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 11 Nov 2009, Chris Mason wrote:
>
>> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
>> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>> >
>> > > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > > > Hi all,
>> > > >
>> > > > This is an alternative approach to atomic user transactions for btrfs.
>> > > > The old start/end ioctls suffer from some basic limitations, namely
>> > > >
>> > > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
>> > > >    way through the transaction, and
>> > > >  - The process may die (seg fault, SIGKILL) part way through the
>> > > >    transaction.  Currently when that happens the partial transaction will
>> > > >    commit.
>> > > >
>> > > > This patch implements an ioctl that lets the application completely
>> > > > specify the entire transaction in a single syscall.  If the process gets
>> > > > killed or seg faults part way through, the entire transaction will still
>> > > > complete.
>> > > >
>> > > > The goal is to atomically commit updates to multiple files, xattrs,
>> > > > directories.  But this is still a file system: we don't get rollback if
>> > > > things go wrong.  Instead, do what we can up front to make sure things
>> > > > will work out.  And if things do go wrong, optionally prevent a partial
>> > > > result from reaching the disk.
>> > >
>> > > Why not snapshot respective root (doesn't work if transaction spans
>> > > multiple file-systems, but this doesn't look like a real-world
>> > > limitation), run txn against that snapshot and rollback on failure
>> > > instead? Snapshots are writable, cheap, and this looks like a real
>> > > transaction abort mechanism.
>> >
>> > Good question. :)
>> >
>> > I hadn't looked into this before, but I think the snapshots could be used
>> > to achieve both atomicity and rollback.
>> > If userspace uses an rw mutex to quiesce writes, it can make sure all
>> > transactions complete before creating a snapshot (commit).  The problem
>> > with this currently is the create snapshot ioctl is relatively slow...
>> > it calls commit_transaction, which blocks until everything reaches disk.
>> > I think to perform well this approach would need a hook to start a
>> > commit and then return as soon as it can guarantee that any subsequent
>> > operation's start_transaction can't join in that commit.
>> >
>> > This may be a better way to go about this, though.  Does that sound
>> > reasonable, Chris?
>>
>> Yes, we could do this, but I don't think it will perform very well
>> compared to your multi-operation ioctl.  It really does depend on how
>> often you need to do atomic ops (my guess is very).
>
> The thing is, I'm not sure using snaps is that different from what I'm
> doing now.  Currently the ioctl transactions don't hit disk until each
> full commit (flushoncommit, no fsync).  Unless the presence of a snapshot
> adds additional overhead (to the commit, or to cleaning up the slightly
> longer-living snapped roots), the difference would be that starting
> transactions would need to be blocked by the application instead of
> wait_current_trans in start_transaction, and (currently at least) they
> would wait longer (the extra writes between blocked = 0 and commit_done =
> 1 in commit_transaction).
>
> The key, as now, is keeping the full fs syncs infrequent.  And, if
> possible, reducing the duration of the blocked == 1 period during
> commit_transaction.

It took me some time to associate you with the Ceph project and to recall
what Ceph is, so my original snapshot suggestion was out of context.
When put into the Ceph context, it looks too heavy-weight and may turn out
to be overkill. Chris's write-ahead logging idea looks much more realistic
for your use case.
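
For reference, the userspace quiescing Sage describes in the quoted text
(transactions hold an rw mutex shared, the snapshot/commit step takes it
exclusively) might look roughly like the sketch below. Everything in it is
an assumption made for illustration: RWLock, TxnGate and the take_snapshot
callback are hypothetical names, and the callback merely stands in for
whatever actually creates the btrfs snapshot. No real ioctl is called.

```python
import threading


class RWLock:
    """Minimal reader-writer lock; a pending writer blocks new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._writer = True          # stop new transactions from starting
            while self._readers > 0:     # wait for in-flight ones to finish
                self._cond.wait()

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class TxnGate:
    """Runs transactions under the shared lock, snapshots under the exclusive one."""

    def __init__(self, take_snapshot):
        self._lock = RWLock()
        self._take_snapshot = take_snapshot  # hypothetical snapshot hook

    def run_transaction(self, ops):
        self._lock.acquire_shared()
        try:
            for op in ops:
                op()
        finally:
            self._lock.release_shared()

    def commit(self):
        # No transaction is half-applied while the snapshot is taken.
        self._lock.acquire_exclusive()
        try:
            return self._take_snapshot()
        finally:
            self._lock.release_exclusive()
```

Note that new transactions stall for as long as commit() holds the lock,
which is exactly the cost discussed in the quoted text: the slower the
snapshot call, the longer every writer waits.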
>
>
>> Honestly you'll get better performance with a simple write-ahead log
>> from userland:
>
> There actually is a log, but it's optional and not strictly write-ahead...
> it's only used to reduce the commit latency:
>
>  1- apply operations to fs (grouped into atomic transactions)
>  2- (optionally) write and flush log entry
>  ...repeat...
>  3- periodically sync the fs, then trim the log.  or sync early if a
>     client explicitly requests it.
>
> But
>
> 1- I don't want to make the log required.  Sometimes you're more concerned
> about total throughput, not latency, and the log halves your write bw
> unless you add more spindles.

The log-induced latency penalty is the price of transactional consistency :).
The traditional mitigation recipe involves a low-latency log device (NVRAM
and, recently, SLC flash). Since you specifically target distributed
systems, you also have a distributed in-memory logging option.

Regards,
Andrey

>
> 2- I don't want it strictly write-ahead because (in the absence of atomic
> ops) it means you have to wait for the log to sync before applying the ops
> to the fs (to ensure the fs doesn't get a partial transaction ahead of the
> log).  This marries atomicity with your schedule for durability, which
> isn't necessarily what you want.  (e.g., Ceph makes a distinction between
> serialized and committed ops, allowing limited sharing of data before it
> hits disk.  That's the nice thing about this ioctl... it's pretty common
> that atomicity is the only requirement.)
>
> With the optional (write-behind?) log and transaction ioctls, IF you want
> low latency commits, enable the log and ideally give it its own spindle,
> and infrequently sync btrfs to get good layout and low overhead.
>
>
> Unless you think I'm missing something with the snapshot approach, I can
> give that a try and see how it does.  It requires explicit management of
> the sync/commit schedule, but in my case at least I'm doing that already.
> A transaction ioctl is simpler for userland and would be more generically
> useful for other apps (particularly those who don't want to manage
> commits), but will always have some small possibility of partial
> failure/abort without rollback.
>
> sage
>
>
>>
>> step1: write redo log somewhere in the FS, with enough information to
>>        bring all the objects you're about to touch to a consistent state.
>> step2: fsync the log
>> step3: do your operations
>> step4: append a record to the undo log that invalidates the last log
>>        op, or just truncate it to zero.
>> step5: fsync the log.
>>
>> The big advantage of the log is that you won't be tied to btrfs, but
>> it's two fsyncs where the big transaction framework does none.  This
>> should allow you to turn on the fast fsync log again, but I think the
>> multi-operation ioctl would do that as well.
>>
>> -chris
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
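
Chris's five steps quoted above can be sketched in userland roughly as
follows. This is a minimal illustration under stated assumptions, not
anyone's actual implementation: the record format (one JSON line listing
whole-file overwrites), the function names, and the recover() replay
routine are all hypothetical. The sketch takes the "truncate it to zero"
variant of step4. It also fsyncs each file written in step3, which is an
extra assumption beyond the quoted steps, made so the log is never
invalidated before the data it protects is on disk.

```python
import json
import os


def fsync_file(f):
    """Flush Python's buffer, then ask the kernel to write the file out."""
    f.flush()
    os.fsync(f.fileno())


def apply_ops(ops):
    """Apply whole-file overwrites; idempotent, so replay after a crash is safe."""
    for path, data in ops:
        with open(path, "w", encoding="utf-8") as f:
            f.write(data)
            fsync_file(f)  # extra assumption: make data durable before step4


def run_transaction(log_path, ops):
    """ops is a list of (path, data) pairs (hypothetical operation format)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"ops": ops}) + "\n")  # step1: write redo record
        fsync_file(log)                             # step2: fsync the log
    apply_ops(ops)                                  # step3: do the operations
    with open(log_path, "w", encoding="utf-8") as log:  # step4: truncate to zero
        fsync_file(log)                             # step5: fsync the log


def recover(log_path):
    """After a crash, replay every complete record, then clear the log."""
    if not os.path.exists(log_path):
        return 0
    replayed = 0
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # torn record: step2 never finished, so step3 never began
            apply_ops(record["ops"])
            replayed += 1
    with open(log_path, "w", encoding="utf-8") as log:
        fsync_file(log)
    return replayed
```

A caller would run recover(log_path) once at startup and then use
run_transaction(log_path, [(path, data), ...]) for each atomic group. The
cost Chris mentions is visible here: every transaction pays for the log
fsyncs in step2 and step5.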
