On Mon, Nov 25, 2019 at 03:31:59PM +0000, Filipe Manana wrote:
> On Sun, Nov 24, 2019 at 1:33 AM Zygo Blaxell
> <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Sat, Nov 23, 2019 at 01:33:33PM +0000, Filipe Manana wrote:
> > > On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
> > > <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@xxxxxxxxxx wrote:
> > To be clear, I'm not doubting that the patch fixes a test case.
> > It's more like I regularly encounter a lot of performance problems from
> > iterate_extent_inodes() and friends, and only a tiny fraction of them
> > look like that test case.
> >
> > TL;DR it looks like LOGICAL_INO has more bugs than I knew about before
> > today,
>
> Which are? Where were they reported?

The first and least important bug is that LOGICAL_INO is slow.  This is
well known, although maybe the magnitude of the problem is not.  In one
dedupe test corpus I can produce (with bees or duperemove) a filesystem
whose extents have only a few thousand references each, but take 45
minutes of kernel CPU time each to process.  On the same filesystem image
I can find the backrefs by brute force, processing 'btrfs inspect
dump-tree' output with a Perl script, in only 8 minutes, and the
problematic extents have only a few hundred to a few thousand references.
Other extents have half a million references and LOGICAL_INO can find
them all in a few seconds (including both CPU and IO time).  I don't know
what the kernel is doing, but whatever it is, it's *far* beyond any
reasonable amount of processing for the amount of metadata involved.

The second bug is a bit more subtle.  I've been mining the last year of
crash and corruption data and tracking about 7 distinct frequent failure
behaviors in 5.0..5.3 kernels.  It turns out that LOGICAL_INO is
implicated in half of these as well.  I've been running tests for the
past month to confirm this and exclude other factors.  I was intending to
post something about it at the conclusion of testing in a few weeks, but
I have some actionable information today, so here we are...

The data so far says that a host in the wild running LOGICAL_INO and
balance at the same time is 150x more likely to crash than a host that
does not run LOGICAL_INO at all, or a host that runs either LOGICAL_INO
or balance but not both at the same time.  Simultaneous LOGICAL_INO +
balance correlates with these three failure modes:

https://www.spinics.net/lists/linux-btrfs/msg89425.html

https://www.mail-archive.com/linux-btrfs@xxxxxxxxxxxxxxx/msg88892.html

[metadata corruption on 5.2 that looks the same as the previously
reported corruption on 5.1, except 5.2 has a write-time corruption
detector that aborts the transaction instead of damaging the filesystem]

Hosts in our fleet that rarely or never run LOGICAL_INO have few or no
problems on any kernel from 5.0 to 5.3.  When crashes do occur on those
hosts, they are mostly random non-btrfs bugs (CIFS, wifi drivers, GPU,
hardware failure, power failure, etc).  We did have a few 5.2.x kernels
hang in testing due to the writeback bug fixed in 5.2.15.

The specific application we are running that uses LOGICAL_INO is bees.
bees spends more time running LOGICAL_INO than any other single activity
(other than sleeping).  Other dedupe tools don't seem to have an impact
on system uptime, which rules out EXTENT_SAME.  If there were a bug in a
POSIX API like open() or read(), we would all know about it.
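For anyone who hasn't used the interface, the per-extent call bees spends
its time in looks roughly like the sketch below.  It's an untested
illustration written against the definitions in linux/btrfs.h (the field
names and the inum/offset/root triplet layout come from there), not a
copy of bees' code, but the 45-minute extents above are extents fed to a
call of this shape:

/*
 * Untested sketch of a per-extent LOGICAL_INO_V2 lookup, using only the
 * definitions in linux/btrfs.h.  fs_fd is any open fd on the filesystem;
 * CAP_SYS_ADMIN is required.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

static int dump_refs(int fs_fd, __u64 bytenr, __u64 buf_size)
{
	struct btrfs_data_container *refs = calloc(1, buf_size);
	struct btrfs_ioctl_logical_ino_args args = {
		.logical = bytenr,
		.size    = buf_size,		/* V2 allows buffers past V1's 64K limit */
		.inodes  = (uintptr_t)refs,	/* result buffer */
	};
	int ret;

	if (!refs)
		return -1;
	ret = ioctl(fs_fd, BTRFS_IOC_LOGICAL_INO_V2, &args);
	if (ret == 0) {
		/* val[] is filled with (inum, offset, root) triplets */
		for (__u32 i = 0; i < refs->elem_cnt; i += 3)
			printf("inum %llu offset %llu root %llu\n",
			       (unsigned long long)refs->val[i],
			       (unsigned long long)refs->val[i + 1],
			       (unsigned long long)refs->val[i + 2]);
		if (refs->bytes_missing)
			fprintf(stderr, "buffer too small: %u more bytes needed\n",
				refs->bytes_missing);
	} else {
		perror("BTRFS_IOC_LOGICAL_INO_V2");
	}
	free(refs);
	return ret;
}

The point of V2 over the original LOGICAL_INO is mainly that the result
buffer can be larger than 64K, which is what the -s discussion further
down is about.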
That process of elimination leaves LOGICAL_INO, TREE_SEARCH_V2, and
INO_PATHS as the three ioctls that bees calls with unusual frequency
compared to other software, but TREE_SEARCH_V2 and INO_PATHS don't show
up in BUG_ON stack traces while LOGICAL_INO does.

I don't have a nice minimal reproducer yet, but "run bees and btrfs
balance start on a big, busy filesystem" triggers some kind of crash
(BUG_ON or transaction abort) for me in less than a day, and up to three
times a day on some hosts.  I think a good reproducer would be a simple
program that walks the extent tree and feeds every extent bytenr in
random order to LOGICAL_INO in multiple threads while a balance is
running, to see if that makes the crashes happen faster.  I have not
tried this yet.

I don't know how the 5.1 and 5.2 metadata corruption is connected to
LOGICAL_INO, other than that the corruption immediately stops happening
when we stop running bees on the test host, or arrange for bees and
balance to run at separate times.

The bug is possibly as old as 4.12--based on earlier crash data, it might
have been the 4th most frequent crash case in kernels prior to 5.0.  The
three most frequent cases in 4.14 were flushoncommit deadlocks, a
reclaim/commit deadlock, and the rename/dedupe deadlock, all fixed by
5.0.  LOGICAL_INO-related failures are now the most common crash case in
my data up to 5.2.  I'm running tests on 5.3 kernels, which are crashing
in new, different, and possibly unrelated ways.  LOGICAL_INO + balance is
not the fastest way to crash a 5.3 kernel, but I don't know yet whether
it is still a *possible* way to crash one.

That's all I really know about this bug so far: there seems to be a bug,
it's not a new bug, and simultaneous LOGICAL_INO + balance seems to
trigger it.

> > > > LOGICAL_INO_V2 only finds 170 extent references:
> > > > # btrfs ins log 1344172032 -P . | wc -l
> > > > 170
> > > Pass "-s 65536", the default buffer is likely not large enough to all
> > > inode/root numbers.
> > The default buffer is SZ_64K.  When I hacked up 'btrfs ins log' to use SZ_4K:

https://github.com/kdave/btrfs-progs/blob/master/cmds/inspect.c#L151

...which is not the SZ_64K I was referring to.  Oops!  OK, my bad.
I forgot to double check this against the hand-rolled tools I usually
use.  With the right buffer size, `btrfs ins log` produces 1794
references, which is the right number given that the extent appears in
2 snapshots.

> > So I'm kind of stumped there.  If it works as you describe above
> > (and the description above is similar to my understanding of how it
> > works), then extent_item.refs should be a *lower* bound on the number
> > of references.  Snapshots will make the real number of references larger
> > than extent_item.refs, but cannot make it smaller (if an EXTENT_ITEM is
> > deleted in one of two subvols, the reference count in the EXTENT_DATA
> > would remain the same--incremented by the CoW, then decremented by
> > the delete).  Yet here is a case where that doesn't hold.
> >
> > Or...another theory is that LOGICAL_INO is broken, or trying to be
> > "helpful" by doing unnecessary work to get the wrong answer (like
> > FIEMAP does).  If I look at this extent again, I see that
>
> Hum, what wrong answers does fiemap gives? Was this reported before
> somewhere?

FIEMAP coalesces physically adjacent extents instead of reporting btrfs
extent item boundaries, and it adjusts the first extent record to match
the beginning of the range parameter given to FIEMAP instead of reporting
the beginning of the extent.
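To make that concrete, the call in question is just the stock
FS_IOC_FIEMAP ioctl; a minimal, untested sketch (not taken from bees)
looks like the following, and the behaviors above show up directly in the
fe_logical/fe_length values it prints:

/*
 * Untested sketch of the FIEMAP query described above.  On btrfs, one
 * fiemap_extent record can cover what the extent tree stores as several
 * separate extent items, and the first record is clipped to fm_start
 * rather than reporting where the underlying extent actually begins.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int dump_fiemap(int fd, __u64 start, __u64 length)
{
	const unsigned int max_extents = 128;
	struct fiemap *fm = calloc(1, sizeof(*fm) +
				   max_extents * sizeof(struct fiemap_extent));
	int ret;

	if (!fm)
		return -1;
	fm->fm_start = start;
	fm->fm_length = length;
	fm->fm_flags = 0;		/* FIEMAP_FLAG_SYNC would flush first */
	fm->fm_extent_count = max_extents;

	ret = ioctl(fd, FS_IOC_FIEMAP, fm);
	if (ret == 0) {
		for (__u32 i = 0; i < fm->fm_mapped_extents; i++) {
			const struct fiemap_extent *fe = &fm->fm_extents[i];

			printf("logical %llu physical %llu len %llu flags 0x%x\n",
			       (unsigned long long)fe->fe_logical,
			       (unsigned long long)fe->fe_physical,
			       (unsigned long long)fe->fe_length,
			       (unsigned)fe->fe_flags);
		}
	} else {
		perror("FS_IOC_FIEMAP");
	}
	free(fm);
	return ret;
}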
That coalescing and range adjustment do interesting things when the
extents are compressed or when you try to iterate over extents in reverse
order.  I presume the behavior is either by design, or an awful
compromise to make the output fit the limited capabilities of the FIEMAP
result format.  That would mean FIEMAP's behavior is not really a bug,
and I haven't reported it as one; however, it does mean FIEMAP is not an
appropriate tool for applications that need accurate information about
physical structure on btrfs.  Computing the 'shared' flag for FIEMAP is
also prohibitively expensive at scale, and there's no way to skip that
work with the current FIEMAP interface if the application isn't
interested in it.

I switched to TREE_SEARCH_V2 to get extent item data:  it's faster, more
accurate, and more useful than FIEMAP, especially for compressed extents.
If anyone's interested in improving the generic FIEMAP interface to
handle btrfs quirks better, I can write up a proposal--it could be handy
to get this data without having to be root, as TREE_SEARCH_V2 requires.
Alternatively, btrfs could have an unprivileged form of TREE_SEARCH_V2
that limits the min/max root/objectid ranges to the inode of the file
descriptor that is passed to the ioctl.
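To show why, here is roughly what a TREE_SEARCH_V2 lookup of a single
extent item looks like.  This is an untested sketch against the
structures in linux/btrfs.h and linux/btrfs_tree.h, not the code I
actually run, but it returns the real extent item boundaries plus the
refs and flags fields straight out of the extent tree, with none of
FIEMAP's post-processing:

/*
 * Untested sketch: fetch the EXTENT_ITEM for a known bytenr with
 * TREE_SEARCH_V2.  fs_fd is any fd on the filesystem; CAP_SYS_ADMIN
 * is required today.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <endian.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>

static int dump_extent_item(int fs_fd, __u64 bytenr)
{
	const size_t buf_size = 65536;
	struct btrfs_ioctl_search_args_v2 *args =
		calloc(1, sizeof(*args) + buf_size);
	int ret;

	if (!args)
		return -1;
	args->buf_size = buf_size;
	args->key.tree_id = BTRFS_EXTENT_TREE_OBJECTID;
	args->key.min_objectid = bytenr;	/* EXTENT_ITEM objectid is the bytenr */
	args->key.max_objectid = bytenr;
	args->key.min_type = BTRFS_EXTENT_ITEM_KEY;
	args->key.max_type = BTRFS_EXTENT_ITEM_KEY;
	args->key.min_offset = 0;		/* offset is the extent length */
	args->key.max_offset = (__u64)-1;
	args->key.min_transid = 0;
	args->key.max_transid = (__u64)-1;
	args->key.nr_items = 16;

	ret = ioctl(fs_fd, BTRFS_IOC_TREE_SEARCH_V2, args);
	if (ret == 0) {
		const char *p = (const char *)args->buf;

		for (__u32 i = 0; i < args->key.nr_items; i++) {
			struct btrfs_ioctl_search_header hdr;
			struct btrfs_extent_item ei;

			memcpy(&hdr, p, sizeof(hdr));	/* headers may be unaligned */
			if (hdr.type == BTRFS_EXTENT_ITEM_KEY && hdr.len >= sizeof(ei)) {
				memcpy(&ei, p + sizeof(hdr), sizeof(ei));
				printf("bytenr %llu len %llu refs %llu flags 0x%llx\n",
				       (unsigned long long)hdr.objectid,
				       (unsigned long long)hdr.offset,
				       (unsigned long long)le64toh(ei.refs),
				       (unsigned long long)le64toh(ei.flags));
			}
			p += sizeof(hdr) + hdr.len;
		}
	} else {
		perror("BTRFS_IOC_TREE_SEARCH_V2");
	}
	free(args);
	return ret;
}

An unprivileged variant would presumably clamp tree_id and the objectid
range to the subvol and inode of the fd passed in, as described above.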