On 2020/3/26 5:53 PM, halfdog wrote:
> Thanks for your lengthy reply!
>
> Zygo Blaxell writes:
>> On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
>>> Hello list,
>>>
>>> It seems the woes really continued ... After trashing the old, corrupted filesystem (see old message below) I started rebuilding the storage. Synchronization from another (still working) storage should roughly have performed the same actions as during the initial build (minus the count and timing of mounts/unmounts, transfer interrupts, ...).
>>>
>>> It does not seem to be a mere coincidence that the corruption occurred when deduplicating the exact same file as last time. While the corruption last time made the disk completely inaccessible, this time it was just remounted read-only with a different error message:
>>>
>>> [156603.177699] BTRFS error (device dm-1): parent transid verify failed on 6680428544 wanted 12947 found 12945
>>> [156603.177707] BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5 IO failure
>>> [156603.177708] BTRFS info (device dm-1): forced readonly
>>> [156603.177711] BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188: errno=-5 IO failure
>>
>> Normally those messages mean your hardware is dropping writes somewhere; however, you previously reported running kernels 5.3.0 and 5.3.9, so there may be another explanation.
>>
>> Try kernel 4.19.x, 5.4.19, 5.5.3, or later. Definitely do not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported fixes are included.
>
> Sorry, I forgot to update on that: I used the old kernel, but also managed to reproduce it on
> ii linux-image-5.4.0-4-amd64 5.4.19-1 amd64 Linux 5.4 for 64-bit PCs (signed)
> Linux version 5.4.0-4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 9.2.1 20200203 (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)
>
>> I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because the later point releases include the EOF dedupe fix. 4.19 avoids the regressions of later kernels.
>
> 5.4.19-1 matches your spec, but since the latest Debian experimental package is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also above your 5.5.3 recommendation, should I try again with that kernel, or even take the "5.5~rc5+1~exp1" config, apply it to yesterday's 5.5.13 stable release and build my own kernel?

Apart from the kernel version, could you mention any other history of the fs? Especially whether the system has always been shut down/rebooted cleanly.

Furthermore, what is the storage stack below btrfs (things like bcache, lvm, dm-raid)? And what is the specific storage hardware (e.g. SATA/SAS HDD with its model name, and the RAID card if involved)?

Have you experienced the same problem on other systems?

Thanks,
Qu

>
>>> As it seems that the bug here is somehow reproducible, I would like to try to develop a reproducer exploit and a fix for that bug as an exercise. Unfortunately the fault occurs only after transferring and deduplicating ~20TB of data.
>>>
>>> Are there any recommendations, e.g. on how to "bisect" that problem?
>>
>> Find someone who has already done it and ask. ;)
>
> Seems I found someone with good recommendations already :)
>
> Thank you!
>
>> Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you want the dedupe fix too). Don't run any kernel in between for btrfs.
>>
>> There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which corrupts metadata.
>> It's a UAF bug, so its behavior can be unpredictable, but quite often the symptom is corrupted metadata or write-time tree-checker errors. Sometimes you just get a harmless NULL dereference crash, or some noise warnings.
>>
>> There are at least two other filesystem-corrupting bugs with lifetimes overlapping that range of kernel versions; however, both of those were fixed by 5.3.
>
> So maybe moving from my 5.4.19-1 to the 5.5+ series sounds recommended anyway?
>
>>> Is there a way (switch or source code modification) to log all internal btrfs state transitions for later analysis?
>>
>> There are (e.g. the dm write logger), but most bugs that would be found in unit tests by such tools have been fixed by the time a kernel is released, and they'll only tell you that btrfs did something wrong, not why.
>
> As the IO seems sane and the reported error "verify failed on 6680428544 wanted 12947 found 12945" does not seem to point to a data structure problem at a sector/page/block boundary (12947 == 0x3293), I would also guess that basic IO/paging is not involved, but that the data structure gets corrupted in memory and is then used directly or written and re-read ... therefore I would deem write logs not the first way to go.
>
>> Also, there can be tens of thousands of btrfs state transitions per second during dedupe, so the volume of logs themselves can present data wrangling challenges.
>
> Yes, that's why I am asking. Maybe someone has already taken up that challenge, as such a tool-chain (generic transaction logging with userspace stream compression and analysis) might be quite handy for such a task, but a hell of an effort to build ...
>
>> The more invasively you try to track internal btrfs state, the more the tools become _part_ of that state, and introduce additional problems. e.g. there is the ref verifier, and the _bug fix history_ of the ref verifier...
>
> That is right. Therefore I hoped that some minimally invasive toolset might already be available for the kernel, or could perhaps be written, e.g.:
>
> * Install an alternative kernel page fault handler.
> * Set breakpoints on btrfs functions.
> * When entering such a function, record the return address, stack and register arguments, and send them to userspace.
> * Strip the write bits from the kernel page table for most pages, except those needed by the page fault handler.
> * Continue execution.
> * On each page fault, flip back to the original page table, send information about the write fault (what, where) to userspace, perform the faulting instruction, then switch back to the read-only page table and continue the btrfs function.
> * When returning from the last btrfs function, also switch back to the standard page table.
>
> By being completely btrfs-agnostic, such a tool should not introduce any btrfs-specific issues through the analysis process itself. Does someone know of such a tool, or a simplified version of it?
>
> Doing something similar via qemu/kernel debugging tools might be easier to implement, but too slow to handle that huge amount of data. (A far simpler, trace-only starting point is sketched right below.)
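>
> A much less ambitious starting point would be to just record which btrfs functions are entered around the failure, e.g. by driving the stock ftrace function tracer from Python. A rough, untested sketch (it assumes tracefs is mounted at /sys/kernel/tracing and the kernel was built with CONFIG_FUNCTION_TRACER; it records function entries only, nothing like the write tracking described above):
>
> #!/usr/bin/env python3
> # Rough sketch: filter the ftrace function tracer to btrfs_* symbols,
> # run the workload, then dump the trace ring buffer for offline analysis.
> import os
>
> TRACEFS = "/sys/kernel/tracing"   # assumption: tracefs mounted here
>
> def tracefs_write(name, value):
>     with open(os.path.join(TRACEFS, name), "w") as f:
>         f.write(value)
>
> def start_btrfs_function_trace():
>     tracefs_write("tracing_on", "0")
>     tracefs_write("current_tracer", "nop")
>     tracefs_write("set_ftrace_filter", "btrfs_*")  # glob: btrfs functions only
>     tracefs_write("current_tracer", "function")
>     tracefs_write("tracing_on", "1")
>
> def stop_and_dump(out_path):
>     tracefs_write("tracing_on", "0")
>     # "trace" is a snapshot of the ring buffer; for continuous capture one
>     # would stream "trace_pipe" instead and compress it in userspace.
>     with open(os.path.join(TRACEFS, "trace"), "rb") as src, \
>          open(out_path, "wb") as dst:
>         dst.write(src.read())
>     tracefs_write("set_ftrace_filter", "")  # opening with O_TRUNC clears the filter
>     tracefs_write("current_tracer", "nop")
>
> if __name__ == "__main__":
>     start_btrfs_function_trace()
>     input("tracing btrfs_* ... press enter after reproducing the problem ")
>     stop_and_dump("/tmp/btrfs-ftrace.log")
>
> That would of course only narrow down which code paths run shortly before the corruption, not who scribbled over the metadata.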
>
>>> Other ideas for debugging that?
>>
>> Run dedupe on a test machine with a few TB test corpus (or whatever your workload is) on a debug-enabled kernel, report every bug that kills the box or hurts the data, and update the kernel to get fixes for the bugs that were reported. Repeat until the box stops crapping itself, then use the kernel it stopped on (5.4.14 in this case). Do that for every kernel upgrade, because regressions are a thing.
>
> Well, that seems like overkill. My btrfs is not haunted by a load of bugs, just one that corrupted the filesystem twice when trying to deduplicate the same set of files.
>
> As described, just creating a btrfs with only that file did not trigger the corruption. If this is not a super-rare coincidence, then something in the other 20TB of transferred files has to have corrupted the filesystem, or at least brought it to a state where the deduplication of exactly that problematic set of files then triggered the final fault.
>
>>> Just creating the same number of snapshots and putting just that single file into each of them did not trigger the bug during deduplication.
>>
>> Dedupe itself is fine, but some of the supporting ioctls a deduper has to use to get information about the filesystem structure triggered a lot of bugs.
>
> To get rid of that, I already ripped out quite a bit of the userspace deduping part. I now do the extent queries in a Python tool using ctypes, split the dedup requests into smaller chunks (to improve logging granularity) and only use the deduper to do the single FIDEDUPERANGE call (I was too lazy to ctype that in Python too; see the P.S. below for roughly what it would look like).
>
> Still, deduplicating the same files caused corruption again.
>
> hd
>
>> ...
>
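> P.S.: For reference, ctyping that FIDEDUPERANGE call in Python would look roughly like the following. This is an untested sketch; the ioctl number and the struct layout are my transcription of struct file_dedupe_range from linux/fs.h, so double-check them before relying on it:
>
> import ctypes
> import fcntl
>
> FIDEDUPERANGE = 0xC0189436          # _IOWR(0x94, 54, struct file_dedupe_range)
> FILE_DEDUPE_RANGE_SAME = 0
>
> class FileDedupeRangeInfo(ctypes.Structure):
>     _fields_ = [("dest_fd", ctypes.c_int64),
>                 ("dest_offset", ctypes.c_uint64),
>                 ("bytes_deduped", ctypes.c_uint64),
>                 ("status", ctypes.c_int32),
>                 ("reserved", ctypes.c_uint32)]
>
> class FileDedupeRange(ctypes.Structure):
>     _fields_ = [("src_offset", ctypes.c_uint64),
>                 ("src_length", ctypes.c_uint64),
>                 ("dest_count", ctypes.c_uint16),
>                 ("reserved1", ctypes.c_uint16),
>                 ("reserved2", ctypes.c_uint32),
>                 ("info", FileDedupeRangeInfo * 1)]  # a single destination
>
> def dedupe_range(src_fd, src_offset, length, dest_fd, dest_offset):
>     """Ask the kernel to share one extent range; returns bytes deduped."""
>     arg = FileDedupeRange(src_offset=src_offset, src_length=length, dest_count=1)
>     arg.info[0].dest_fd = dest_fd
>     arg.info[0].dest_offset = dest_offset
>     buf = bytearray(bytes(arg))
>     fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)      # raises OSError on failure
>     res = FileDedupeRange.from_buffer_copy(buf)
>     if res.info[0].status != FILE_DEDUPE_RANGE_SAME:
>         raise RuntimeError("dedupe status %d" % res.info[0].status)
>     return res.info[0].bytes_deduped
>
> Splitting a large file into smaller chunks would then just mean calling dedupe_range() in a loop over, say, 16MiB pieces, which also gives the per-chunk logging granularity mentioned above.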