Re: Growing number of "invalid tree nritems" errors

On 2020/7/5 4:37 PM, Thilo-Alexander Ginkel wrote:
> Hello everyone,
> 
> one of our servers just started producing loads of "BTRFS error
> (device dm-0): invalid tree nritems" errors and eventually caught a
> hung task (not sure if those are related):
> 
> [...]
> [126990.493897] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=201179545600 nritems=0 expect >0

This means we read a child tree block whose nritems is 0.

That is never valid for a child tree block, so btrfs reports an error
about it.

Unfortunately, the message doesn't output enough info to pin down the
problem any further.

The only good news is that, at this stage, nothing wrong has reached
disk, so the fs should be fine, just as your later btrfs check run shows.
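
To make the check concrete, here is a minimal user-space sketch of the
condition that triggers the message. It is only a model, not the kernel
code: struct toy_tree_block and toy_verify_child() are hypothetical
names made up for illustration, and the fallback value for EUCLEAN is
the Linux one.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, "Structure needs cleaning" */
#endif

/* Hypothetical stand-in for the header fields of a tree block. */
struct toy_tree_block {
	uint64_t bytenr;  /* logical address of the block */
	uint64_t owner;   /* id of the tree that owns the block */
	uint32_t nritems; /* number of items stored in the block */
};

/*
 * Model of the check: if the parent handed us a first key for this
 * child, the child must contain at least one item. An empty child is
 * rejected with -EUCLEAN.
 */
static int toy_verify_child(const struct toy_tree_block *eb, int have_first_key)
{
	if (have_first_key && eb->nritems == 0) {
		fprintf(stderr,
			"invalid tree nritems, bytenr=%llu nritems=0 expect >0\n",
			(unsigned long long)eb->bytenr);
		return -EUCLEAN;
	}
	return 0;
}
```

Calling toy_verify_child() on a block with nritems == 0 and a first key
returns -EUCLEAN, which corresponds to the error lines in your log.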


> [127041.504620] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=204159336448 nritems=0 expect >0
> [127106.733494] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=233554296832 nritems=0 expect >0
> [127125.504302] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=233693298688 nritems=0 expect >0
> [127254.512800] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=299654774784 nritems=0 expect >0
> [127544.739078] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=435922501632 nritems=0 expect >0
> [127544.739190] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=435922714624 nritems=0 expect >0
> [...]
> [129532.769484] INFO: task kcompactd0:64 blocked for more than 120 seconds.
> [129532.769569]       Tainted: G            E    4.15.0-109-generic #110-Ubuntu
> [129532.769651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [129532.769749] kcompactd0      D    0    64      2 0x80000000
> [129532.769751] Call Trace:
> [129532.769756]  __schedule+0x24e/0x880
> [129532.769758]  schedule+0x2c/0x80
> [129532.769759]  io_schedule+0x16/0x40
> [129532.769761]  __lock_page+0xff/0x140
> [129532.769763]  ? page_cache_tree_insert+0xe0/0xe0
> [129532.769765]  migrate_pages+0x91f/0xb80
> [129532.769766]  ? __ClearPageMovable+0x10/0x10
> [129532.769768]  ? isolate_freepages_block+0x3b0/0x3b0
> [129532.769769]  compact_zone+0x681/0x950
> [129532.769770]  kcompactd_do_work+0xfe/0x2a0
> [129532.769772]  ? __switch_to_asm+0x35/0x70
> [129532.769773]  ? __switch_to_asm+0x41/0x70
> [129532.769774]  kcompactd+0x86/0x1c0
> [129532.769775]  ? kcompactd+0x86/0x1c0
> [129532.769778]  ? wait_woken+0x80/0x80
> [129532.769780]  kthread+0x121/0x140
> [129532.769781]  ? kcompactd_do_work+0x2a0/0x2a0
> [129532.769782]  ? kthread_create_worker_on_cpu+0x70/0x70
> [129532.769783]  ret_from_fork+0x35/0x40
> 
> I took the server offline and ran `btrfs check`, which did not bring
> up any errors:
> 
> # btrfs check -p --check-data-csum /dev/mapper/luks
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/luks
> UUID: b5872f47-c87e-47ac-b036-4f2725cf6dc6
> [1/7] checking root items                      (0:00:20 elapsed,
> 12381226 items checked)
> [2/7] checking extents                         (0:05:38 elapsed,
> 5163753 items checked)
> [3/7] checking free space cache                (0:00:12 elapsed, 376
> items checked)
> [4/7] checking fs roots                        (0:41:33 elapsed,
> 5021296 items checked)
> [5/7] checking csums against data              (0:05:35 elapsed,
> 3911047 items checked)
> [6/7] checking root refs                       (0:00:00 elapsed, 28110
> items checked)
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 292229652480 bytes used, no error found
> total csum bytes: 200196548
> total tree bytes: 84142292992
> total fs tree bytes: 82578096128
> total extent tree bytes: 1175896064
> btree space waste bytes: 24570610642
> file data blocks allocated: 245858725888
>  referenced 202896068608
> 
> I will be running memtester to make sure the problems are not RAM-related.

That would be helpful for ruling out RAM-related problems.

> 
> Any ideas?

How reproducible is this?

If it still shows the same symptom after verifying the RAM, would you
please apply this small debug diff on your kernel?

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c27022f13150..92dd9a3e5644 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -406,8 +406,9 @@ int btrfs_verify_level_key(struct extent_buffer *eb, int level,
        /* We have @first_key, so this @eb must have at least one item */
        if (btrfs_header_nritems(eb) == 0) {
                btrfs_err(fs_info,
-               "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
-                         eb->start);
+               "invalid tree nritems, bytenr=%llu owner=%lld nritems=0 expect >0",
+                         eb->start, btrfs_header_owner(eb));
+               WARN_ON_ONCE(1);
                WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
                return -EUCLEAN;
        }



It would provide:
- Which tree owns the offending tree block
  If it's some essential tree, then it should never be empty, and this
  is a real problem rather than a false alert.

- The call trace of the first occurrence
  This may provide some info on how it's happening.
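
If it helps with collecting the results, below is a rough sketch of a
helper that pulls bytenr (and owner, when present) out of a single log
line. It assumes the message format from the diff above and that each
message arrives as one unwrapped line; struct nritems_report and
parse_nritems_line() are hypothetical names, not part of btrfs-progs or
the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Fields recovered from one "invalid tree nritems" log line. */
struct nritems_report {
	unsigned long long bytenr;
	long long owner; /* signed, to match the %lld in the debug diff */
	int has_owner;   /* 1 only if the line carried an owner= field */
};

/*
 * Returns 1 if @line is an "invalid tree nritems" report and fills
 * @out, 0 otherwise. Lines from an unpatched kernel have no owner=
 * field; for those has_owner is 0 and owner is set to 0.
 */
static int parse_nritems_line(const char *line, struct nritems_report *out)
{
	const char *p = strstr(line, "invalid tree nritems, bytenr=");
	int n;

	if (!p)
		return 0;

	out->owner = 0;
	n = sscanf(p, "invalid tree nritems, bytenr=%llu owner=%lld",
		   &out->bytenr, &out->owner);
	if (n < 1)
		return 0;

	out->has_owner = (n == 2);
	return 1;
}
```

Feeding it the kernel log line by line (for example with fgets() on
stdin) and printing the distinct owner values would show quickly
whether a single tree is affected.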

Thanks,
Qu

> 
> Thanks,
> Thilo
> 
