On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > Hi all,
> >
> > This is an alternative approach to atomic user transactions for btrfs.
> > The old start/end ioctls suffer from some basic limitations, namely
> >
> > - We can't properly reserve space ahead of time to avoid ENOSPC part
> > way through the transaction, and
> > - The process may die (seg fault, SIGKILL) part way through the
> > transaction. Currently when that happens the partial transaction will
> > commit.
> >
> > This patch implements an ioctl that lets the application completely
> > specify the entire transaction in a single syscall. If the process gets
> > killed or seg faults part way through, the entire transaction will still
> > complete.
> >
> > The goal is to atomically commit updates to multiple files, xattrs,
> > directories. But this is still a file system: we don't get rollback if
> > things go wrong. Instead, do what we can up front to make sure things
> > will work out. And if things do go wrong, optionally prevent a partial
> > result from reaching the disk.
>
> Why not snapshot respective root (doesn't work if transaction spans
> multiple file-systems, but this doesn't look like a real-world
> limitation), run txn against that snapshot and rollback on failure
> instead? Snapshots are writable, cheap, and this looks like a real
> transaction abort mechanism.
Good question. :)
I hadn't looked into this before, but I think the snapshots could be used
to achieve both atomicity and rollback. If userspace uses an rw mutex to
quiesce writes, it can make sure all transactions complete before creating
a snapshot (commit). The problem with this currently is the create
snapshot ioctl is relatively slow... it calls commit_transaction, which
blocks until everything reaches disk. I think to perform well this
approach would need a hook to start a commit and then return as soon as it
can guarantee than any subsequent operation's start_transaction can't join
in that commit.
This may be a better way to go about this, though. Does that sound
reasonable, Chris?
sage
>
> Regards,
> Andrey
>
> >
> > A few things:
> >
> > - The implementation just exports the sys_* calls it needs (a popular
> > move, no doubt :). I've looked at using the corresponding vfs_*
> > instructions instead, and keeping a table of struct file *'s instead of
> > fd's to avoid these exports, but this requires a large amount of
> > duplication of semi-boilerplate path lookup, security_path_* hooks, and
> > similar code from fs/namei.c and elsewhere. If we want to go that
> > route, there are some advantages, the main one being that we can verify
> > that every dentry/inode we operate on belongs to the same fs. But the
> > code will be more complex... I'm not sure if I should pursue that just
> > yet.
> >
> > - The application gets to define what defines a failure for each
> > individual op based on its return value.
> >
> > - If the transaction fails, the process can instruct the fs to wedge
> > itself so that a partial result does not commit. This isn't a particuarly
> > elegant approach, but a wedged fs may be preferable to a partial
> > transaction commit. (Alternatively, a failure could branch/jump to
> > another point in the transaction op vector to do some cleanup and/or an
> > explicit WEDGE op to accomplish the same thing?)
> >
> > - This still uses the existing ioctl start transaction call. Depending on
> > how Josef's ENOSPC journal_info stuff works out, I should be able to avoid
> > the current global open_ioctl_trans counter for a cleaner interaction with
> > the btrfs transaction code.
> >
> > - The data space reservation is still missing. I need a way to
> > find which space_info will be used, and pin it for the duration
> > of the entire transaction.
> >
> > - The metadata reservation is a worst case bound. It could be less
> > conservative, but currently each op is pulled out of the user address
> > space individually so we'd either need two passes, a big kmalloc, or
> > further trust the app to get the value right. (Same goes for the data
> > size, actually, although that's easier to get correct.)
> >
> > Thoughts on this?
> >
> > Thanks-
> > sage
> >
> >
> > Signed-off-by: Sage Weil <sage@xxxxxxxxxxxx>
> > ---
> > fs/btrfs/ioctl.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/ioctl.h | 49 ++++++++++++++
> > fs/namei.c | 3 +
> > fs/open.c | 2 +
> > fs/read_write.c | 2 +
> > fs/xattr.c | 2 +
> > 6 files changed, 245 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 136c5ed..4269616 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -37,6 +37,7 @@
> > #include <linux/compat.h>
> > #include <linux/bit_spinlock.h>
> > #include <linux/security.h>
> > +#include <linux/syscalls.h>
> > #include <linux/xattr.h>
> > #include <linux/vmalloc.h>
> > #include "compat.h"
> > @@ -1303,6 +1304,190 @@ long btrfs_ioctl_trans_end(struct file *file)
> > return 0;
> > }
> >
> > +/*
> > + * return number of successfully complete ops via @ops_completed
> > + * (where success/failure is defined by the _FAIL_* flags).
> > + */
> > +static long do_usertrans(struct btrfs_root *root,
> > + struct btrfs_ioctl_usertrans *ut,
> > + u64 *ops_completed)
> > +{
> > + int i;
> > + int *fds;
> > + int err;
> > + struct file *file;
> > + struct btrfs_ioctl_usertrans_op *ops = (void *)ut->ops_ptr;
> > + int fd1, fd2;
> > +
> > + fds = kcalloc(sizeof(int), ut->num_fds, GFP_KERNEL);
> > + if (!fds)
> > + return -ENOMEM;
> > +
> > + for (i = 0; i < ut->num_ops; i++) {
> > + struct btrfs_ioctl_usertrans_op op;
> > + int ret;
> > +
> > + err = -EFAULT;
> > + if (copy_from_user(&op, &ops[i], sizeof(op)))
> > + goto out;
> > +
> > + /* lookup fd args? */
> > + err = -EINVAL;
> > + switch (op.op) {
> > + case BTRFS_IOC_UT_OP_CLONERANGE:
> > + if (op.args[1] < 0 || op.args[1] >= ut->num_fds)
> > + goto out;
> > + fd2 = fds[1];
> > +
> > + case BTRFS_IOC_UT_OP_CLOSE:
> > + case BTRFS_IOC_UT_OP_PWRITE:
> > + if (op.args[0] < 0 || op.args[0] >= ut->num_fds)
> > + goto out;
> > + fd1 = fds[0];
> > + }
> > +
> > + /* do op */
> > + switch (op.op) {
> > + case BTRFS_IOC_UT_OP_OPEN:
> > + ret = -EINVAL;
> > + if (op.args[3] < 0 || op.args[3] >= ut->num_fds)
> > + goto out;
> > + ret = sys_open((const char __user *)op.args[0],
> > + op.args[1], op.args[2]);
> > + fds[op.args[3]] = ret;
> > + break;
> > + case BTRFS_IOC_UT_OP_CLOSE:
> > + ret = sys_close(fd1);
> > + break;
> > + case BTRFS_IOC_UT_OP_PWRITE:
> > + ret = sys_pwrite64(fd1, (const char __user *)op.args[1],
> > + op.args[2], op.args[3]);
> > + break;
> > + case BTRFS_IOC_UT_OP_UNLINK:
> > + ret = sys_unlink((const char __user *)op.args[0]);
> > + break;
> > + case BTRFS_IOC_UT_OP_MKDIR:
> > + ret = sys_mkdir((const char __user *)op.args[0],
> > + op.args[1]);
> > + break;
> > + case BTRFS_IOC_UT_OP_RMDIR:
> > + ret = sys_rmdir((const char __user *)op.args[0]);
> > + break;
> > + case BTRFS_IOC_UT_OP_TRUNCATE:
> > + ret = sys_truncate((const char __user *)op.args[0],
> > + op.args[1]);
> > + break;
> > + case BTRFS_IOC_UT_OP_SETXATTR:
> > + ret = sys_setxattr((char __user *)op.args[0],
> > + (char __user *)op.args[1],
> > + (void __user *)op.args[2],
> > + op.args[3], op.args[4]);
> > + break;
> > + case BTRFS_IOC_UT_OP_REMOVEXATTR:
> > + ret = sys_removexattr((char __user *)op.args[0],
> > + (char __user *)op.args[1]);
> > + break;
> > + case BTRFS_IOC_UT_OP_CLONERANGE:
> > + ret = -EBADF;
> > + file = fget(fd1);
> > + if (file) {
> > + ret = btrfs_ioctl_clone(file, fd2,
> > + op.args[2], op.args[3],
> > + op.args[4]);
> > + fput(file);
> > + }
> > + break;
> > + }
> > + pr_debug(" ut %d/%d op %d args %llx %llx %llx %llx %llx = %d\n",
> > + i, (int)ut->num_ops, (int)op.op, op.args[0],
> > + op.args[1], op.args[2], op.args[3], op.args[4], ret);
> > +
> > + put_user(ret, &ops[i].rval);
> > +
> > + if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) &&
> > + ret != op.rval)
> > + goto out;
> > + if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) &&
> > + ret == op.rval)
> > + goto out;
> > + if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) &&
> > + ret < op.rval)
> > + goto out;
> > + if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) &&
> > + ret > op.rval)
> > + goto out;
> > + if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) &&
> > + ret <= op.rval)
> > + goto out;
> > + if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) &&
> > + ret >= op.rval)
> > + goto out;
> > + }
> > + err = 0;
> > +out:
> > + *ops_completed = i;
> > + kfree(fds);
> > + return err;
> > +}
> > +
> > +long btrfs_ioctl_usertrans(struct file *file, void __user *arg)
> > +{
> > + struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
> > + struct btrfs_trans_handle *trans;
> > + struct btrfs_ioctl_usertrans ut, *orig_ut = arg;
> > + u64 ops_completed = 0;
> > + int ret;
> > +
> > + ret = -EPERM;
> > + if (!capable(CAP_SYS_ADMIN))
> > + goto out;
> > +
> > + ret = -EFAULT;
> > + if (copy_from_user(&ut, orig_ut, sizeof(ut)))
> > + goto out;
> > +
> > + ret = mnt_want_write(file->f_path.mnt);
> > + if (ret)
> > + goto out;
> > +
> > + ret = btrfs_reserve_metadata_space(root, 5*ut.num_ops);
> > + if (ret)
> > + goto out_drop_write;
> > +
> > + mutex_lock(&root->fs_info->trans_mutex);
> > + root->fs_info->open_ioctl_trans++;
> > + mutex_unlock(&root->fs_info->trans_mutex);
> > +
> > + ret = -ENOMEM;
> > + trans = btrfs_start_ioctl_transaction(root, 0);
> > + if (!trans)
> > + goto out_drop;
> > +
> > + ret = do_usertrans(root, &ut, &ops_completed);
> > + put_user(ops_completed, &orig_ut->ops_completed);
> > +
> > + if (ret < 0 && (ut.flags & BTRFS_IOC_UT_FLAG_WEDGEONFAIL))
> > + pr_err("btrfs: usertrans failed, wedging to avoid partial "
> > + " commit\n");
> > + else
> > + btrfs_end_transaction(trans, root);
> > +
> > +out_drop:
> > + mutex_lock(&root->fs_info->trans_mutex);
> > + root->fs_info->open_ioctl_trans--;
> > + mutex_unlock(&root->fs_info->trans_mutex);
> > +
> > + btrfs_unreserve_metadata_space(root, 5*ut.num_ops);
> > +out_drop_write:
> > + mnt_drop_write(file->f_path.mnt);
> > +out:
> > + return ret;
> > +}
> > +
> > long btrfs_ioctl(struct file *file, unsigned int
> > cmd, unsigned long arg)
> > {
> > @@ -1343,6 +1528,8 @@ long btrfs_ioctl(struct file *file, unsigned int
> > case BTRFS_IOC_SYNC:
> > btrfs_sync_fs(file->f_dentry->d_sb, 1);
> > return 0;
> > + case BTRFS_IOC_USERTRANS:
> > + return btrfs_ioctl_usertrans(file, argp);
> > }
> >
> > return -ENOTTY;
> > diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> > index bc49914..f94e293 100644
> > --- a/fs/btrfs/ioctl.h
> > +++ b/fs/btrfs/ioctl.h
> > @@ -67,4 +67,53 @@ struct btrfs_ioctl_clone_range_args {
> > struct btrfs_ioctl_vol_args)
> > #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \
> > struct btrfs_ioctl_vol_args)
> > +
> > +/* usertrans ops */
> > +/* the 'fd' values are _indices_ into a temporary fd table, see num_fds below */
> > +#define BTRFS_IOC_UT_OP_OPEN 1 /* path, flags, mode, fd */
> > +#define BTRFS_IOC_UT_OP_CLOSE 2 /* fd */
> > +#define BTRFS_IOC_UT_OP_PWRITE 3 /* fd, data, length, offset */
> > +#define BTRFS_IOC_UT_OP_UNLINK 4 /* path */
> > +#define BTRFS_IOC_UT_OP_LINK 5 /* oldpath, newpath */
> > +#define BTRFS_IOC_UT_OP_MKDIR 6 /* path, mode */
> > +#define BTRFS_IOC_UT_OP_RMDIR 7 /* path */
> > +#define BTRFS_IOC_UT_OP_TRUNCATE 8 /* path, size */
> > +#define BTRFS_IOC_UT_OP_SETXATTR 9 /* path, name, data, len */
> > +#define BTRFS_IOC_UT_OP_REMOVEXATTR 10 /* path, name */
> > +#define BTRFS_IOC_UT_OP_CLONERANGE 11 /* dst fd, src fd, off, len, dst off */
> > +
> > +/* define what 'failure' entails for each op based on return value */
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE (1<< 1)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ (1<< 2)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT (1<< 3)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT (1<< 4)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE (1<< 5)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE (1<< 6)
> > +
> > +struct btrfs_ioctl_usertrans_op {
> > + __u64 op;
> > + __s64 args[5];
> > + __s64 rval;
> > + __u64 flags;
> > +};
> > +
> > +/*
> > + * If an op fails and we cannot complete the transaction, we may want
> > + * to lock up the file system (requiring a reboot) to prevent a
> > + * partial result from committing.
> > + */
> > +#define BTRFS_IOC_UT_FLAG_WEDGEONFAIL (1<<13)
> > +
> > +struct btrfs_ioctl_usertrans {
> > + __u64 num_ops; /* in: # ops */
> > + __u64 ops_ptr; /* in: usertrans_op array */
> > + __u64 num_fds; /* in: size of fd table (max fd + 1) */
> > + __u64 data_bytes, metadata_ops; /* in: for space reservation */
> > + __u64 flags; /* in: flags */
> > + __u64 ops_completed; /* out: # ops completed */
> > +};
> > +
> > +#define BTRFS_IOC_USERTRANS _IOW(BTRFS_IOCTL_MAGIC, 16, \
> > + struct btrfs_ioctl_usertrans)
> > +
> > #endif
> > diff --git a/fs/namei.c b/fs/namei.c
> > index d11f404..4d53225 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2148,6 +2148,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
> > {
> > return sys_mkdirat(AT_FDCWD, pathname, mode);
> > }
> > +EXPORT_SYMBOL(sys_mkdir);
> >
> > /*
> > * We try to drop the dentry early: we should have
> > @@ -2262,6 +2263,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
> > {
> > return do_rmdir(AT_FDCWD, pathname);
> > }
> > +EXPORT_SYMBOL(sys_rmdir);
> >
> > int vfs_unlink(struct inode *dir, struct dentry *dentry)
> > {
> > @@ -2369,6 +2371,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
> > {
> > return do_unlinkat(AT_FDCWD, pathname);
> > }
> > +EXPORT_SYMBOL(sys_unlink);
> >
> > int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
> > {
> > diff --git a/fs/open.c b/fs/open.c
> > index 4f01e06..15eddfc 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -294,6 +294,7 @@ SYSCALL_DEFINE2(truncate, const char __user *, path, long, length)
> > {
> > return do_sys_truncate(path, length);
> > }
> > +EXPORT_SYMBOL(sys_truncate);
> >
> > static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
> > {
> > @@ -1062,6 +1063,7 @@ SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
> > asmlinkage_protect(3, ret, filename, flags, mode);
> > return ret;
> > }
> > +EXPORT_SYMBOL(sys_open);
> >
> > SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
> > int, mode)
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 3ac2898..75e9f60 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -453,6 +453,8 @@ SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
> >
> > return ret;
> > }
> > +EXPORT_SYMBOL(sys_pwrite64);
> > +
> > #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
> > asmlinkage long SyS_pwrite64(long fd, long buf, long count, loff_t pos)
> > {
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 6d4f6d3..488c889 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -294,6 +294,7 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
> > path_put(&path);
> > return error;
> > }
> > +EXPORT_SYMBOL(sys_setxattr);
> >
> > SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
> > const char __user *, name, const void __user *, value,
> > @@ -523,6 +524,7 @@ SYSCALL_DEFINE2(removexattr, const char __user *, pathname,
> > path_put(&path);
> > return error;
> > }
> > +EXPORT_SYMBOL(sys_removexattr);
> >
> > SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
> > const char __user *, name)
> > --
> > 1.5.6.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>