Re: [RFC] big fat transaction ioctl

On Wed, 11 Nov 2009, Chris Mason wrote:

> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> > 
> > > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > Hi all,
> > > >
> > > > This is an alternative approach to atomic user transactions for btrfs.
> > > > The old start/end ioctls suffer from some basic limitations, namely
> > > >
> > > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
> > > > way through the transaction, and
> > > >  - The process may die (seg fault, SIGKILL) part way through the
> > > > transaction.  Currently when that happens the partial transaction will
> > > > commit.
> > > >
> > > > This patch implements an ioctl that lets the application completely
> > > > specify the entire transaction in a single syscall.  If the process gets
> > > > killed or seg faults part way through, the entire transaction will still
> > > > complete.
> > > >
> > > > The goal is to atomically commit updates to multiple files, xattrs,
> > > > directories.  But this is still a file system: we don't get rollback if
> > > > things go wrong.  Instead, do what we can up front to make sure things
> > > > will work out.  And if things do go wrong, optionally prevent a partial
> > > > result from reaching the disk.
> > > 
> > > Why not snapshot respective root (doesn't work if transaction spans
> > > multiple file-systems, but this doesn't look like a real-world
> > > limitation), run txn against that snapshot and rollback on failure
> > > instead? Snapshots are writable, cheap, and this looks like a real
> > > transaction abort mechanism.
> > 
> > Good question.  :)
> > 
> > I hadn't looked into this before, but I think the snapshots could be used 
> > to achieve both atomicity and rollback.  If userspace uses an rw mutex to 
> > quiesce writes, it can make sure all transactions complete before creating 
> > a snapshot (commit).  The problem with this currently is the create 
> > snapshot ioctl is relatively slow... it calls commit_transaction, which 
> > blocks until everything reaches disk.  I think to perform well this 
> > approach would need a hook to start a commit and then return as soon as it 
> > can guarantee that any subsequent operation's start_transaction can't join 
> > in that commit.
> > 
> > This may be a better way to go about this, though.  Does that sound 
> > reasonable, Chris?
> 
> Yes, we could do this, but I don't think it will perform very well
> compared to your multi-operation ioctl.  It really does depend on how
> often you need to do atomic ops (my guess is very).

The thing is, I'm not sure using snaps is that different from what I'm 
doing now.  Currently the ioctl transactions don't hit disk until each 
full commit (flushoncommit, no fsync).  Unless the presence of a snapshot 
adds additional overhead (to the commit, or to cleaning up the slightly 
longer-living snapped roots), there would be two differences.  First, 
starting transactions would need to be blocked by the application instead 
of by wait_current_trans in start_transaction.  Second, (currently at 
least) they would wait longer, because of the extra writes between 
blocked = 0 and commit_done = 1 in commit_transaction.

The key, as now, is keeping the full fs syncs infrequent.  And, if 
possible, reducing the duration of the blocked == 1 period during 
commit_transaction.
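
A minimal sketch of what application-side blocking could look like, 
assuming the rw-mutex idea quoted above: in-flight transactions act as 
readers and the snapshot step acts as the writer.  TransactionGate and the 
take_snapshot callback are hypothetical names for illustration only; the 
callback stands in for whatever actually creates the snapshot, and nothing 
here is btrfs API.

```python
import threading


class TransactionGate:
    """Hypothetical userspace gate: transactions are readers, commit is the writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0          # transactions currently in flight
        self._committing = False  # True while a snapshot/commit is pending

    def begin(self):
        with self._cond:
            # New transactions are held back while a commit is pending.
            while self._committing:
                self._cond.wait()
            self._active += 1

    def end(self):
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def commit(self, take_snapshot):
        with self._cond:
            # Allow only one committer at a time.
            while self._committing:
                self._cond.wait()
            self._committing = True
            # Quiesce: wait for every in-flight transaction to finish.
            while self._active:
                self._cond.wait()
        try:
            # take_snapshot is an assumed callback supplied by the caller.
            return take_snapshot()
        finally:
            with self._cond:
                self._committing = False
                self._cond.notify_all()
```

Each transaction would be wrapped in begin()/end(), and the sync schedule 
would call commit() with the snapshot step.  How long begin() blocks 
depends entirely on how quickly the snapshot call returns, which is the 
concern raised above.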


> Honestly you'll get better performance with a simple write-ahead log
> from userland:

There actually is a log, but it's optional and not strictly write-ahead... 
it's only used to reduce the commit latency:

1- apply operations to fs (grouped into atomic transactions)
2- (optionally) write and flush log entry
...repeat...
3- periodically sync the fs, then trim the log; sync early if a 
client explicitly requests it.

But

1- I don't want to make the log required.  Sometimes you're more concerned 
about total throughput, not latency, and the log halves your write bw 
unless you add more spindles.

2- I don't want it strictly write-ahead because (in the absence of atomic 
ops) it means you have to wait for the log to sync before applying the ops 
to the fs (to ensure the fs doesn't get a partial transaction ahead of the 
log).  This marries atomicity with your schedule for durability, which 
isn't necessarily what you want.  (e.g., Ceph makes a distinction between 
serialized and committed ops, allowing limited sharing of data before it 
hits disk.  That's the nice thing about this ioctl... it's pretty common 
that atomicity is the only requirement.)

With the optional (write-behind?) log and transaction ioctls, IF you want 
low latency commits, enable the log and ideally give it its own spindle, 
and infrequently sync btrfs to get good layout and low overhead. 
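
The optional log scheme described above could be sketched roughly as 
follows.  This is an illustration under stated assumptions, not the actual 
implementation: WriteBehindStore, submit, sync_every and sync_fs are 
made-up names, apply_txn is assumed to apply one transaction atomically 
(e.g. via the proposed ioctl), and sync_fs defaults to a whole-system 
os.sync() as a stand-in for syncing the file system.

```python
import os


class WriteBehindStore:
    """Hypothetical sketch: apply first, optionally log, sync and trim periodically."""

    def __init__(self, log_path=None, sync_every=100, sync_fs=os.sync):
        # The log is optional; without it only the periodic sync gives durability.
        self.log = open(log_path, "ab") if log_path else None
        self.sync_every = sync_every  # assumed policy: sync after N transactions
        self.sync_fs = sync_fs        # assumed stand-in for a file system sync
        self.pending = 0

    def submit(self, apply_txn, log_entry=b""):
        # 1- apply the operations to the fs (assumed atomic as a group)
        apply_txn()
        # 2- optionally write and flush a log entry (after, not ahead of, the fs)
        if self.log is not None:
            self.log.write(log_entry)
            self.log.flush()
            os.fsync(self.log.fileno())
        self.pending += 1
        if self.pending >= self.sync_every:
            self.sync()

    def sync(self):
        # 3- sync the fs, then trim the log; may also be called early on request
        self.sync_fs()
        if self.log is not None:
            self.log.truncate(0)
            os.fsync(self.log.fileno())
        self.pending = 0
```

Leaving log_path unset gives the throughput-oriented mode with no log 
writes at all; setting it (ideally on a separate device) gives the 
low-latency mode.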


Unless you think I'm missing something with the snapshot approach, I can 
give that a try and see how it does.  It requires explicit management of 
the sync/commit schedule, but in my case at least I'm doing that already.  
A transaction ioctl is simpler for userland and would be more generically 
useful for other apps (particularly those who don't want to manage 
commits), but will always have some small possibility of partial 
failure/abort without rollback.

sage


> 
> step1: write redo log somewhere in the FS, with enough information to
> bring all the objects you're about to touch to a consistent state.
> step2: fsync the log
> step3: do your operations
> step4: append a record to the undo log that invalidates the last log
> op, or just truncate it to zero.
> step5: fsync the log.
> 
> The big advantage of the log is that you won't be tied to btrfs, but
> it's two fsyncs where the big transaction framework does none.  This
> should allow you to turn on the fast fsync log again, but I think the
> multi-operation ioctl would do that as well.
> 
> -chris
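
The five quoted steps could be sketched along these lines.  This is a 
minimal illustration, assuming the "truncate it to zero" variant of step 4; 
run_with_redo_log, redo_record and operations are hypothetical names, and 
crash recovery (replaying a non-empty log on startup) is not shown.

```python
import os


def run_with_redo_log(log_path, redo_record, operations):
    """Hypothetical sketch of a userland write-ahead (redo) log around a batch of ops."""
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        # step 1: write the redo record (assumed small enough for one write)
        os.write(fd, redo_record)
        # step 2: fsync the log so the record is durable before any op runs
        os.fsync(fd)
        # step 3: do the operations
        for op in operations:
            op()
        # step 4: invalidate the record, here by truncating the log to zero
        os.ftruncate(fd, 0)
        # step 5: fsync the log again
        os.fsync(fd)
    finally:
        os.close(fd)
```

If an operation raises, the log is left non-empty, which is what a 
recovery pass would look for.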
