Re: FIDEDUPERANGE woes may continue (or unrelated issue?)

On 2020/3/26 5:53 PM, halfdog wrote:
> Thanks for your lengthy reply!
> 
> Zygo Blaxell writes:
>> On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
>>> Hello list,
>>>
>>> It seems the woes really continued ... After trashing the
>>> old, corrupted filesystem (see old message below) I started
>>> rebuilding the storage. Synchronization from another (still
>>> working) storage should have performed roughly the same actions
>>> as during the initial build (minus count and timing of mounts/unmounts,
>>> transfer interrupts, ...).
>>>
>>> It does not seem to be a mere coincidence that the corruption
>>> occurred when deduplicating the exact same file as last time.
>>> While the corruption last time made the disk completely inaccessible,
>>> this time it was just mounted readonly with a different error
>>> message:
>>>
>>> [156603.177699] BTRFS error (device dm-1): parent transid verify failed on 6680428544 wanted 12947 found 12945
>>> [156603.177707] BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5 IO failure
>>> [156603.177708] BTRFS info (device dm-1): forced readonly
>>> [156603.177711] BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188: errno=-5 IO failure
>>
>> Normally those messages mean your hardware is dropping writes
>> somewhere; however, you previously reported running kernels
>> 5.3.0 and 5.3.9, so there may be another explanation.
>>
>> Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do
>> not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported
>> fixes are included.
> 
> Sorry, I forgot to update on that: I used the old kernel but also
> managed to reproduce it on
> ii  linux-image-5.4.0-4-amd64            5.4.19-1                            amd64        Linux 5.4 for 64-bit PCs (signed)
> Linux version 5.4.0-4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 9.2.1 20200203 (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)
> 
>> I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because
>> the later ones include the EOF dedupe fix.  4.19 avoids the
>> regressions of later kernels.
> 
> 5.4.19-1 matches your spec, but as the latest Debian experimental
> kernel is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also close
> to your 5.5.3 recommendation, should I try again with that kernel,
> or rather take the "5.5~rc5+1~exp1" config, apply it to yesterday's
> 5.5.13 release and build my own kernel?

Apart from the kernel version, could you mention any other history
of the fs?

Especially whether the system only ever went through clean
shutdowns/reboots?

Furthermore, what's the storage stack below btrfs?
(Things like bcache, lvm, dmraid)

Also, what is the specific storage hardware (e.g. SATA/SAS HDD with its
model name, and the raid card if involved)?

Have you experienced the same problem on other systems?

Thanks,
Qu

> 
>>> As it seems that the bug here is somehow reproducible, I would
>>> like to try to develop a reproducer/exploit and a fix for that
>>> bug as an exercise. Unfortunately the fault occurs only after
>>> transferring and deduplicating ~20TB of data.
>>>
>>> Are there any recommendations e.g. how to "bisect" that problem?
>>
>> Find someone who has already done it and ask.  ;)
> 
> Seems I found someone with good recommendations already :)
> 
> Thank you!
> 
>> Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you want
>> the dedupe fix too).  Don't run any kernel in between for btrfs.
>>
>> There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which
>> corrupts metadata.  It's a UAF bug, so its behavior can be
>> unpredictable, but quite often the symptom is corrupted metadata
>> or write-time tree-checker errors.  Sometimes you just get a
>> harmless NULL dereference crash, or some noisy warnings.
>>
>> There are at least two other filesystem-corrupting bugs with
>> lifetimes overlapping that range of kernel versions; however,
>> both of those were fixed by 5.3.
> 
> So maybe moving on from my 5.4.19-1 to the 5.5+ series sounds
> recommended anyway?
> 
>>> Is there a way (switch or source code modification) to log
>>> all internal btrfs state transitions for later analysis?
>>
>> There are (e.g. the dm write logger), but most bugs that would
>> be found in unit tests by such tools have been fixed by the
>> time a kernel is released, and they'll only tell you that btrfs
>> did something wrong, not why.
> 
> As IO seems sane and the reported error "verify failed on 6680428544
> wanted 12947 found 12945" does not seem to point to a data structure
> problem at a sector/page/block boundary (12947 == 0x3293), I would
> also guess that basic IO/paging is not involved, but that the data
> structure is corrupted in memory and then used directly or
> written/reread ... Therefore I would also deem write logs not the
> first way to go ..
> 
>> Also, there can be tens of thousands of btrfs state transitions
>> per second during dedupe, so the volume of logs themselves
>> can present data wrangling challenges.
> 
> Yes, that's why I'm asking. Maybe someone has already taken up
> that challenge, as such a tool-chain (generic transaction logging
> with userspace stream compression and analysis) might be quite
> handy for such a task, but a hell of an effort to build ...
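
For the logging side, a rough sketch of that kind of "stream kernel events
through userspace compression" idea (assuming tracefs is mounted at
/sys/kernel/debug/tracing and the kernel exposes the btrfs trace event group;
paths, buffer sizes and the output file name are illustrative, not a tested
tool) could look like:

#!/usr/bin/env python3
# Enable the btrfs tracepoints and stream trace_pipe through gzip so that
# hours of dedupe activity produce a manageable amount of data on disk.
import gzip
import pathlib
import shutil
import sys

TRACEFS = pathlib.Path("/sys/kernel/debug/tracing")

def main(outfile="btrfs-trace.gz"):
    (TRACEFS / "events/btrfs/enable").write_text("1")
    try:
        with open(TRACEFS / "trace_pipe", "rb") as pipe, \
             gzip.open(outfile, "wb", compresslevel=6) as out:
            # copyfileobj() keeps reading until the script is interrupted;
            # 1 MiB buffers keep the syscall count low.
            shutil.copyfileobj(pipe, out, length=1 << 20)
    except KeyboardInterrupt:
        pass
    finally:
        (TRACEFS / "events/btrfs/enable").write_text("0")

if __name__ == "__main__":
    main(*sys.argv[1:])

Tracepoints only cover what the btrfs developers chose to instrument, so this
is no substitute for the full state logging discussed above, but it shows how
cheap the compression side of such a pipeline can be.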
> 
>> The more invasively you try to track internal btrfs state,
>> the more the tools become _part_ of that state, and introduce
>> additional problems. e.g. there is the ref verifier, and the
>> _bug fix history_ of the ref verifier...
> 
> That is right. Therefore I hoped that some minimally invasive
> toolsets might already be available for the kernel, or maybe could
> be written, e.g.
> 
> * Install an alternative kernel page fault handler
> * Set breakpoints on btrfs functions
>   * When entering the function, record return address, stack
>     and register arguments, send to userspace
>   * Strip write bits from the kernel page table for most pages
>     except those needed by the page fault handler
>   * Continue execution
> * For each page fault, the handler flips back to the original
>   page table, sends information about the write fault (what, where)
>   to userspace, performs the faulted instruction before switching
>   back to the read-only page table and continuing the btrfs function
> * When returning from the last btrfs function, also switch back
>   to the standard page table.
> 
> By being completely btrfs-agnostic, such a tool should not introduce
> any btrfs-specific issues due to the analysis process. Does someone
> know about such a tool or a simplified version of it?
> 
> Doing something similar via qemu/kernel debugging tools might be easier
> to implement but too slow to handle that huge amount of data.
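
As a point of comparison, the qemu route can be sketched with GDB's Python
scripting against the guest's gdbstub (qemu-system-x86_64 ... -s, then
attach gdb to :1234 with the guest vmlinux loaded); the traced function and
log path below are only examples of the idea, and as noted it will be far
too slow for 20TB of dedupe:

# btrfs-call-log.py: load with
#   gdb vmlinux -ex "target remote :1234" -x btrfs-call-log.py
# Logs every call to one btrfs function together with its caller,
# without keeping the guest stopped.
import gdb

class CallLogger(gdb.Breakpoint):
    def __init__(self, func, logfile):
        super().__init__(func, gdb.BP_BREAKPOINT, internal=True)
        self.func = func
        self.log = open(logfile, "a")

    def stop(self):
        # Record who called us, then let the guest keep running.
        caller = gdb.newest_frame().older()
        self.log.write("%s called from %s\n"
                       % (self.func, caller.name() if caller else "?"))
        self.log.flush()
        return False

CallLogger("btrfs_run_delayed_refs", "/tmp/btrfs-calls.log")
# Resume the guest once the breakpoint is installed.
gdb.execute("continue")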
> 
>>> Other ideas for debugging that?
>>
>> Run dedupe on a test machine with a few TB test corpus (or
>> whatever your workload is) on a debug-enabled kernel, report
>> every bug that kills the box or hurts the data, update the
>> kernel to get fixes for the bugs that were reported.  Repeat
>> until the box stops crapping itself, then use the kernel it
>> stopped on (5.4.14 in this case).  Do that for every kernel
>> upgrade because regressions are a thing.
> 
> Well, that seems like overkill. My btrfs is not haunted by a
> load of bugs, just one that corrupted the filesystem twice
> when trying to deduplicate the same set of files.
> 
> As described, just creating a btrfs with only that file did
> not trigger the corruption. If this is not a super-rare coincidence,
> then something in the other 20TB of transferred files has to
> have corrupted the filesystem, or at least brought it to a state
> where deduplication of exactly that problematic set of files
> then triggered the final fault.
> 
>>> Just creating the same number of snapshots and putting just
>>> that single file into each of them did not trigger the bug
>>> during deduplication.
>>
>> Dedupe itself is fine, but some of the supporting ioctls a
>> deduper has to use to get information about the filesystem
>> structure triggered a lot of bugs.
> 
> To get rid of that, I already ripped out quite a bit of the userspace
> deduping part. I now do the extent queries in a Python tool
> using ctypes, split the dedupe requests into smaller chunks (to
> improve logging granularity) and just use the deduper to do
> that single FIDEDUPERANGE call (I was too lazy to ctype that
> in Python too).
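
For reference, a minimal sketch of issuing such a FIDEDUPERANGE call directly
from Python via ctypes and fcntl, splitting the range into smaller chunks
(this is not the actual tool from this thread; the 16 MiB chunk size and the
command-line handling are illustrative, and the struct layout follows
include/uapi/linux/fs.h) might look like:

#!/usr/bin/env python3
# Deduplicate `length` bytes of src into dst at the same offsets, one
# FIDEDUPERANGE call per 16 MiB chunk, printing per-chunk results so a
# failure can be narrowed down to a small range.
import ctypes
import fcntl
import os
import sys

FIDEDUPERANGE = 0xC0189436  # _IOWR(0x94, 54, struct file_dedupe_range)

class FileDedupeRangeInfo(ctypes.Structure):
    # struct file_dedupe_range_info
    _fields_ = [("dest_fd", ctypes.c_int64),
                ("dest_offset", ctypes.c_uint64),
                ("bytes_deduped", ctypes.c_uint64),
                ("status", ctypes.c_int32),   # 0 = same, 1 = differs, <0 = -errno
                ("reserved", ctypes.c_uint32)]

class FileDedupeRange(ctypes.Structure):
    # struct file_dedupe_range with a single destination entry
    _fields_ = [("src_offset", ctypes.c_uint64),
                ("src_length", ctypes.c_uint64),
                ("dest_count", ctypes.c_uint16),
                ("reserved1", ctypes.c_uint16),
                ("reserved2", ctypes.c_uint32),
                ("info", FileDedupeRangeInfo * 1)]

def dedupe_chunk(src_fd, dst_fd, offset, length):
    arg = FileDedupeRange(src_offset=offset, src_length=length, dest_count=1)
    arg.info[0].dest_fd = dst_fd
    arg.info[0].dest_offset = offset
    fcntl.ioctl(src_fd, FIDEDUPERANGE, arg)
    return arg.info[0].status, arg.info[0].bytes_deduped

if __name__ == "__main__":
    src, dst, length = sys.argv[1], sys.argv[2], int(sys.argv[3])
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_RDWR)
    chunk = 16 * 1024 * 1024
    for off in range(0, length, chunk):
        status, done = dedupe_chunk(src_fd, dst_fd, off, min(chunk, length - off))
        print("offset=%d status=%d bytes_deduped=%d" % (off, status, done))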
> 
> Still deduplicating the same files caused corruption again.
> 
> hd
> 
>> ...
> 


