19.05.2019 11:11, Newbugreport wrote:
> I have 3-4 years worth of snapshots I use for backup purposes. I keep
> R-O live snapshots, two local backups, and AWS Glacier Deep Freeze. I
> use both send | receive and send > file. This works well but I get
> massive deltas when files are moved around in a GUI via samba.

Did you analyze whether it is a client or a server problem? If the
client does a file copy (instead of the move you imply), maybe the
simplest solution would be to use a different tool on the client. If
the problem is on the server side, it is something to discuss with the
Samba folks.

> Reorganize a bunch of files and the next snapshot is 50 or 100 GB.
> Perhaps mv or cp with reflink=always would fix the problem but it's
> just not usable enough for my family.
>
> I'd like a solution to the massive delta problem. Perhaps someone
> already has a solution, that would be great. If not, I need advice on
> a few ideas.
>
> It seems a realistic solution to deduplicate the subvolume before
> each snapshot is taken, and in theory I could write a small program
> to do that.

You mean that none of the existing half a dozen tools that perform
deduplication on btrfs fits your requirements? (See the first example
at the end of this message for one of them.)

> However I don't know if that would work. Will Btrfs let me
> deduplicate between a file on the live subvolume and a file on the
> R-O snapshot (really the same file but different path)? If so,

btrfs itself does not care, because it does not perform any
deduplication at all. All tools compute identical file ranges and then
invoke a kernel ioctl (FIDEDUPERANGE) to replace the reference to a
range in the destination file with a reference to the identical range
in the source file. So there is nothing that prevents using read-only
data as the source for deduplication of read-write data. Whether each
of the existing tools supports it (or makes it easy to do) I do not
know.

> will Btrfs send with -p result in a small delta?

Well, if all data is replaced by references to extents that already
exist in some snapshot, then the delta against this snapshot will be
small (see the second example at the end of this message).

> Failing that I could probably make changes to the send data stream,
> but that's suboptimal for the live volume and any backup volumes
> where data has been received.
>
> Also, is it possible to access the Btrfs hash values for files so I
> don't have to recalculate file hashes for the whole volume myself?

Currently btrfs does not compute hashes suitable for deduplication; it
only stores CRC32 checksums. You can access the checksum tree, and at
least one tool makes use of it to speed up scanning, but it then
computes a second hash to avoid false positives. Recently a patch
series was posted to add support for different checksum algorithms (I
believe SHA256 at least); these would be more useful for deduplication
once merged.
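
To make the deduplication part concrete, here is a rough sketch with
duperemove, one of the existing tools (the paths and snapshot names
are invented for the example; check the man page before running this
on real data):

  # Scan the live subvolume together with a read-only snapshot and
  # submit identical ranges to the kernel via FIDEDUPERANGE.
  #   -r  recurse into directories
  #   -d  actually dedupe (without it, duperemove only reports)
  #   -A  open files read-only, so the R-O snapshot can be a source
  #   --hashfile  store block hashes so later rescans are incremental
  duperemove -r -d -A --hashfile=/var/tmp/dedupe.hash \
      /mnt/data /mnt/snapshots/2019-05-12

Running this before taking the next snapshot means the new snapshot
already shares its extents with the old one.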

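And for the -p question, the usual incremental pattern looks like this
(again, the paths are invented for the example):

  # Take a new read-only snapshot, then send only the difference
  # against a previous snapshot the receiving side already has.
  btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/2019-05-19
  btrfs send -p /mnt/snapshots/2019-05-12 /mnt/snapshots/2019-05-19 \
      | btrfs receive /mnt/backup

If the reorganized files were first deduplicated against extents that
already exist in the parent snapshot, this stream should stay small.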