Hi!

On 5/9/20 1:11 PM, Steven Davies wrote:
> For curiosity I'm trying to write a tool which will show me the size of
> data extents belonging to which files in a snapshot are exclusive to
> that snapshot as a way to show how much space would be freed if the
> snapshot were to be deleted, and which files in the snapshot are taking
> up the most space.

Yes, so that means that if a file in that snapshot (subvolume) is using
a data extent, you want to know if there's any file in a *different*
subvolume that is still referencing *any* data from the extent, which
will block the whole thing from getting freed.

Just putting that here to verify that this is actually what you meant,
because I think it is. And it might not be immediately obvious to
others.

> I'm working with Hans van Kranenburg's python-btrfs python library but
> my knowledge of the filesystem structures isn't good enough to allow me
> to figure out which bits of data I need to be able to achieve this. I'd
> be grateful if anyone could help me along with this.
>
> So far my idea is:
>
> for each OS file in a subvolume:

I'd recommend just dumping the complete metadata tree of the subvolume,
using the subvolume ID as the tree number. That way you will see all
FileExtentItem objects of all files in the subvol flying by in the
results.

I see you already do a search with btrfs.ioctl.search_v2. Now make that
a search of the whole tree by removing the min and max that you
currently use to limit it to a single inode number.

Using tree 0 for the search, combined with opening the fs while
pointing at anything inside the subvol you want to inspect, is clever.
Great.

Side note: the path that the fs object was initialized with is also
available as fs.path, so you don't have to drag it around (e.g. the
second arg in inspect_from).

https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.ctree.FileSystem

>   find its data extents

Yes, these are all the FileExtentItem items that you see flying by.
https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.ctree.FileExtentItem

>   for each extent:

FileExtentItem has the attribute disk_bytenr, and I see you're already
using that. Good.

>     find what files reference it #1
>     for each referencing file:
>       determine which subvolumes it lives in #2

For this, we delegate the work to the running Linux kernel and ask it
who's using the extent at this disk_bytenr.

https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.ioctl.logical_to_ino_v2

The main thing you're looking for is the ignore_offset option, which
will give you a list of *any* user of *any* data in that extent,
instead of only the users of the first 4096 bytes in it, which
disk_bytenr itself is part of.

There's some more info in the git commit message that added _v2 (one
day this will end up in tutorial pages):

https://github.com/knorrie/python-btrfs/commit/38cd5528ff1ced0be908e1e697758c7431b92a0d

The show_block_group_data_extent_filenames.py script implements it as
an example. Look at using_v2 in there:

https://github.com/knorrie/python-btrfs/commit/7a0749d567abde425c11a43e5fe9177435d0cf28#diff-87cff59a5b983c0f3408b60d654f4f66

So, you want to end up with bytes_missed 0, otherwise you're ignoring
results.

Now look at the inode objects you get. They have a root attribute,
which is the subvol ID of that inode. If inode.root differs from the
subvol ID you're analyzing, then you know that there's a file in
another subvol that uses the extent. Voila.

>   if all references are within this subvolume:
>     record the OS file path and extents it references

By watching the inode number of each FileExtentItem as it goes by, you
will know when it jumps to the next number (the next file), and then
you can finish up for the previous file.

Oh, I see the FileExtentItem class has no explicit helper attribute to
get the inode number. Ha, I will add that (makes a note). For now, you
can use either item.key.objectid or header.objectid to get it.

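To make the "is this extent exclusive?" decision concrete, here is a
rough sketch of just that check. It deliberately does not touch a real
filesystem: Ref is a hypothetical stand-in for the inode objects that
the lookup returns (only the root attribute is taken from the
description above, the other field names are assumptions), and
extent_is_exclusive is a made-up helper name, not part of python-btrfs.

```python
from collections import namedtuple

# Hypothetical stand-in for the inode references returned by the lookup.
# Only 'root' is relied on below; 'inum' and 'offset' are assumed names.
Ref = namedtuple('Ref', ['inum', 'offset', 'root'])


def extent_is_exclusive(refs, bytes_missed, subvol_id):
    """Return True if every reference to the extent lives in subvol_id."""
    if bytes_missed > 0:
        # The result list is incomplete, so no conclusion can be drawn.
        raise ValueError("incomplete result, retry with a larger buffer")
    # A single reference from another subvolume keeps the extent alive.
    return all(ref.root == subvol_id for ref in refs)


# Both references live in subvolume 260: deleting it frees the extent.
refs_private = [Ref(257, 0, 260), Ref(258, 4096, 260)]
# One reference comes from subvolume 5: the extent would stay allocated.
refs_shared = [Ref(257, 0, 260), Ref(257, 0, 5)]

print(extent_is_exclusive(refs_private, 0, 260))
print(extent_is_exclusive(refs_shared, 0, 260))
```

In real use, refs and bytes_missed would be whatever the
logical_to_ino_v2 call with ignore_offset enabled hands back for one
disk_bytenr. Raising on a non-zero bytes_missed is just one way to make
sure incomplete results are never silently treated as "exclusive".
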
> for each recorded file path
>   find its data extents
>   output its path and the total number of bytes in all recorded extents
>     (those which are not shared)

You already are using item.disk_num_bytes somewhere. This
disk_num_bytes is the size of the extent, so no extra lookup in the
extent tree is needed.

While doing the above loop and looking at the FileExtentItem objects,
you can already gather the information about total disk space that
could be freed.

Now, the last missing part is of course that you have an inode number,
and want to know the file name that belongs to it. For that, you use
the INO_LOOKUP ioctl:

https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.ioctl.ino_lookup

> #1 and #2 are where my understanding breaks down. How do I find which
> files reference an extent and which subvolume those files are in?
>
> Alternatively, if such a script already exists I would be happy to use
> it.

Others will be happy with what you're doing now. :-)

Hans
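
For reference, the bookkeeping described above (adding up
disk_num_bytes per inode while the items fly by) could look roughly
like the sketch below. It is only an illustration on plain tuples: Seen
and freeable_bytes_per_inode are hypothetical names, and the exclusive
flag is assumed to have been worked out already from the reference
lookup. Counting each disk_bytenr only once is a choice made here, so
that an extent referenced twice inside the same subvolume is not added
up twice.

```python
from collections import defaultdict, namedtuple

# Hypothetical record, one per FileExtentItem seen during the tree walk.
# 'exclusive' is assumed to be the outcome of the reference lookup.
Seen = namedtuple('Seen',
                  ['inum', 'disk_bytenr', 'disk_num_bytes', 'exclusive'])


def freeable_bytes_per_inode(items):
    """Sum disk_num_bytes of exclusive extents per inode, once per extent."""
    counted = set()
    totals = defaultdict(int)
    for item in items:
        if item.disk_bytenr == 0:
            # disk_bytenr 0 is a hole, nothing is allocated on disk.
            continue
        if not item.exclusive or item.disk_bytenr in counted:
            continue
        counted.add(item.disk_bytenr)
        # The extent is attributed to the first inode that references it.
        totals[item.inum] += item.disk_num_bytes
    return dict(totals)


items = [
    Seen(257, 1000, 4096, True),
    Seen(257, 2000, 8192, False),   # shared with another subvolume
    Seen(258, 1000, 4096, True),    # same extent as above, counted once
    Seen(258, 3000, 16384, True),
    Seen(258, 0, 0, True),          # hole
]
totals = freeable_bytes_per_inode(items)
print(totals)
print(sum(totals.values()))
```

The inode numbers used as keys are what you would then feed to the
INO_LOOKUP ioctl to turn them into names for the final report.
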
