Jun He posted on Tue, 25 Aug 2015 23:04:42 -0500 as excerpted:

> I have been playing with btrfs discard for a while and found that btrfs
> may fail to discard some extents with 'mount -o discard'. I am aware of
> Jeff Mahoney's patches ( https://patchwork.kernel.org/patch/6609491/ ).
> It seems that the patches do not fix the problem. I have seen the same
> problematic behavior for the following versions
>
> - https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/
>   integration-4.3 commit:477594f93c43b1ee685
> - 3.16.0 - 4.2.0-rc7
>
> The problem can be reproduced by writing and fsyncing a 4MB file for 50
> times on a 256MB empty FS (mount option: -o discard). You will find that
> some extents are not discarded (my expected behavior is that, after
> overwriting, an old version of a file extent should be discarded). I use
> several ways to confirm this:
>
> 1. I created a loop device back by a sparse file in tmpfs. After running
> the workload, I found the file is 29MB (ls -lsh). If you fstrim the file
> system, the sparse file will become 4.1MB. This proves that there are a
> lot of data not discarded.
>
> 2. I collected blktrace + blkparse output and plotted the write and
> discard operations in a space-time graph, where you can intuitively see
> some extents are overwritten but not discarded. Here is the space-time
> graph
> https://gist.githubusercontent.com/junhe/b6ce39eeb6de8887e66a/raw/825a3c2946b52a50c2b6032a98d637f5a32bc5c3/integration-4.3.png
>
> Is it a known problem or is it not a problem? If it is a known problem
> and there exists a patch that I am not aware of, can somebody direct me
> to it?
> If it is specifically designed this way, can the designers give the
> rationale of discarding some, but not all of, old extents?

I'm an admin, not a dev, far from an expert on fsync, and didn't pull your
reproducer down from the linked git to check, but...
do the numbers continue to change for some time (nominally 30 seconds)
after the last operation? Do you do a final sync (not fsync) after the
last file write, and does that affect the result?

What I'm getting at is that there's a difference between sync and fsync,
and you mentioned only fsync. After an fsync, the file's own data and
metadata should be reliably synced to the storage device. But unlike
filesystems such as ext3, where (I've read that) an fsync forces a sync of
the entire filesystem, on btrfs the other data and metadata related to the
filesystem are not necessarily synced to the storage device yet. In this
case, that means the discards clearing where the file WAS, but is no
longer due to COW.

In the absence of a full filesystem sync, this outstanding activity may
remain uncommitted until the normal btrfs commit timeout, 30 seconds by
default, tho there's a mount option to change it. Without that sync, a
failure to discard before the commit, up to 30 seconds later, is entirely
expected.

Of course if you're either already doing that full filesystem sync, or
are waiting at least 30 seconds (or whatever you have commit set to if
non-default) before checking to see if the discard has been done, then
indeed, it would appear that something's wrong. But there's no indication
in your post that you're already doing that.

FWIW, if you prefer to sync just the btrfs in question, and not every
other mounted filesystem, btrfs and non-btrfs alike (as a full sync would
do), you can use the btrfs filesystem sync <path> command, as covered in
the btrfs-filesystem manpage. This command can be used in test scripts,
etc, in place of sleeping 30 seconds or invoking a full system sync,
where what's actually on the device counts.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman
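
To make the sync-versus-fsync point above concrete, here is a minimal Python sketch of a test helper. It is a hypothetical stand-in, not the reproducer from the original post (which I haven't seen): the function names, the truncate-and-rewrite pattern, and the payload are all assumptions. The only btrfs-specific piece is the btrfs filesystem sync <path> command mentioned above.

```python
import os
import subprocess


def rewrite_and_fsync(path, size=4 * 1024 * 1024, rounds=50):
    """Rewrite `path` `rounds` times with `size` bytes, fsyncing each time.

    Assumption: the workload truncates and rewrites the whole file on
    every round; the original reproducer may overwrite differently.
    """
    payload = b"x" * size
    for _ in range(rounds):
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            # fsync covers this file's own data and metadata only.
            os.fsync(f.fileno())


def commit_filesystem(mountpoint, btrfs_only=False):
    """Force a commit instead of waiting out the commit timeout.

    btrfs_only=True runs `btrfs filesystem sync <path>`, which needs
    btrfs-progs installed and `mountpoint` on a btrfs filesystem.
    Otherwise fall back to a full system sync.
    """
    if btrfs_only:
        subprocess.run(["btrfs", "filesystem", "sync", mountpoint], check=True)
    else:
        os.sync()
```

In a test script one would call rewrite_and_fsync() on a file under the mountpoint, then commit_filesystem(), and only then measure the backing file with ls -lsh. If the size still doesn't drop after that, the 30-second commit window is ruled out as the explanation.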
