Re: [PATCH 0/4] btrfs: offline dedupe v2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jun 11, 2013 at 02:31:34PM -0600, Mark Fasheh wrote:
> Hi,
> 
> The following series of patches implements in btrfs an ioctl to do
> offline deduplication of file extents.
> 
> To be clear, "offline" in this sense means that the file system is
> mounted and running, but the dedupe is not done during file writes,
> but after the fact when some userspace software initiates a dedupe.
> 
> The primary patch is loosely based off of one sent by Josef Bacik back
> in January, 2011.
> 
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508
> 
> I've made significant updates and changes from the original. In
> particular the structure passed is more fleshed out, this series has a
> high degree of code sharing between itself and the clone code, and the
> locking has been updated.
> 
> 
> The ioctl accepts a struct:
> 
> struct btrfs_ioctl_same_args {
> 	__u64 logical_offset;	/* in - start of extent in source */
> 	__u64 length;		/* in - length of extent */
> 	__u16 dest_count;	/* in - total elements in info array */
> 	__u16 reserved1;
> 	__u32 reserved2;
> 	struct btrfs_ioctl_same_extent_info info[0];
> };
> 
> Userspace puts each duplicate extent (other than the source) in an
> item in the info array. As there can be multiple dedupes in one
> operation, each info item has it's own status and 'bytes_deduped'
> member. This provides a number of benefits:
> 
> - We don't have to fail the entire ioctl because one of the dedupes failed.
> 
> - Userspace will always know how much progress was made on a file as we always
>   return the number of bytes deduped.
> 
> 
> #define BTRFS_SAME_DATA_DIFFERS	1
> /* For extent-same ioctl */
> struct btrfs_ioctl_same_extent_info {
> 	__s64 fd;		/* in - destination file */
> 	__u64 logical_offset;	/* in - start of extent in destination */
> 	__u64 bytes_deduped;	/* out - total # of bytes we were able
> 				 * to dedupe from this file */
> 	/* status of this dedupe operation:
> 	 * 0 if dedup succeeds
> 	 * < 0 for error
> 	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> 	 */
> 	__s32 status;		/* out - see above description */
> 	__u32 reserved;
> };
> 
> 
> The kernel patches are based off Linux v3.9. At this point I've tested the
> ioctl against a decent variety of files and conditions.
> 
> A git tree for the kernel changes can be found at:
> 
> https://github.com/markfasheh/btrfs-extent-same
> 
> 
> I have a userspace project, duperemove available at:
> 
> https://github.com/markfasheh/duperemove
> 
> Hopefully this can serve as an example of one possible usage of the ioctl.
> 
> duperemove takes a list of files as argument and will search them for
> duplicated extents. If given the '-D' switch, duperemove will send dedupe
> requests for same extents and display the results.
> 
> Within the duperemove repo is a file, btrfs-extent-same.c that acts as
> a test wrapper around the ioctl. It can be compiled completely
> seperately from the rest of the project via "make
> btrfs-extent-same". This makes direct testing of the ioctl more
> convenient.
> 
> 
> Limitations
> 
> We can't yet dedupe within the same file (that is, source and destination
> are the same inode). This is due to a limitation in btrfs_clone().
> 
> 
> Perhaps this isn't a limiation per-se but extent-same requires read/write
> access to the files we want to dedupe.  During my last series I had a
> conversation with Gabriel de Perthuis about access checking where we tried
> to maintain the ability for a user to run extent-same against a readonly
> snapshot. In addition, I reasoned that since the underlying data won't
> change (at least to the user) that we ought only require the files to be
> open for read.
> 
> What I found however is that neither of these is a great idea ;)
> 
> - We want to require that the inode be open for writing so that an
>   unprivileged user can't do things like run dedupe on a performance
>   sensitive file that they might only have read access to.  In addition I
>   could see it as kind of a surprise (non-standard behavior) to an
>   administrator that users could alter the layout of files they are only
>   allowed to read.
> 
> - Readonly snapshots won't let you open for write anyway (unsuprisingly,
>   open() returns -EROFS).  So that kind of kills the idea of them being able
>   to open those files for write which we want to dedupe.
> 
> That said, I still think being able to run this against a set of readonly
> snapshots makes sense especially if those snapshots are taken for backup
> purposes. I'm just not sure how we can sanely enable it.
> 
> 
> 
> Code review is very much appreciated. Thanks,
>      --Mark
> 
> 
> ChangeLog
> 
> - check that we have appropriate access to each file before deduping. For
>   the source, we only check that it is opened for read. Target files have to
>   be open for write.
> 
> - don't dedupe on readonly submounts (this is to maintain 
> 
> - check that we don't dedupe files with different checksumming states
>  (compare BTRFS_INODE_NODATASUM flags)
> 
> - get and maintain write access to the mount during the extent same
>   operation (mount_want_write())
> 
> - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
>   pass them through for re-use on every call to btrfs_extent_same(). (thanks
>   to David Sterba <dsterba@xxxxxxx> for reporting this
> 
> - As the read buffers could possibly be up to 1MB (depending on user
>   request), we now conditionally vmalloc them.
> 
> - removed redundant check for same inode. btrfs_extent_same() catches it now
>   and bubbles the error up.
> 
> - remove some unnecessary printks
> 
> Changes from RFC to v1:
> 
> - don't error on large length value in btrfs exent-same, instead we just
>   dedupe the maximum allowed.  That way userspace doesn't have to worry
>   about an arbitrary length limit.
> 
> - btrfs_extent_same will now loop over the dedupe range at 1MB increments (for
>   a total of 16MB per request)
> 
> - cleaned up poorly coded while loop in __extent_read_full_page() (thanks to
>   David Sterba <dsterba@xxxxxxx> for reporting this)
> 
> - included two fixes from Gabriel de Perthuis <g2p.code@xxxxxxxxx>:
>    - allow dedupe across subvolumes
>    - don't lock compressed pages twice when deduplicating
> 
> - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
>   This should also give us a bit more reserved bytes.
> 
> - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
>   David Sterba <dsterba@xxxxxxx> for reporting this)
> 
> - Some more reserved bytes are now included as a result of some of my
>   cleanups. Quite possibly we could add a couple more.

Ok I'm relatively happy with this set, I just want to have an xfstest on deck to
test it with.  Once I have an xfstest to run to verify its sanity I'll pull it
in.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux