Re: btrfs-transacti hangs system for several seconds every few minutes

On Sat, Mar 28, 2020 at 11:26:56AM -0700, Brad Templeton wrote:
> I have a decent sized 3 disk Raid 1 that I have had on btrfs for many
> years. Over time, a serious problem has emerged, in that from time to
> time all I/O will pause, freezing any programs attempting to use the
> btrfs filesystem.   Performance has degraded over the years as well, so
> that just browsing around in directories with 300 or so files often
> takes many seconds just to autocomplete a filename or do an ls.
> 
> But the big problem is that during periods of active but not heavy use,
> every few minutes the i/o system will hang for periods of 1 to 10
> seconds.   During these hangs, btrfs-transacti is doing very heavy I/O.
>   Programs waiting on I/O block -- the most frustrating is typing in vi
> and having the echo stop.  It's getting close to unusable and may be
> time to leave btrfs after many years for a different FS.
> 
> During these incidents iotop will look like this:
> 
> Total DISK READ :     499.57 K/s | Total DISK WRITE :    1639.00 K/s
> Actual DISK READ:     492.73 K/s | Actual DISK WRITE:       0.00 B/s
>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>   882 be/4 root      499.57 K/s 1604.78 K/s  0.00 % 98.60 %
> [btrfs-transacti]
> 21829 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.23 %
> [kworker/u32:1-btrfs-endio-meta]
> 14662 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.17 %
> [kworker/u32:0-btrfs-endio-meta]
> 22184 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.11 %
> [kworker/u32:3-events_freezable_power_]
> 13063 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.06 %
> [kworker/u32:6-events_freezable_power_]
>   486 be/3 root        0.00 B/s    6.84 K/s  0.00 %  0.00 % systemd-journald
> 22213 be/4 brad        0.00 B/s    6.84 K/s  0.00 %  0.00 % chrome
> --no-startup-window [ThreadPoolForeg]
> 
> A way to reliably trigger it, I have found, is to quickly skim through
> my large video collection looking for a video: I would be hitting
> "next" every second or so -- lots of read, but very little write.
> After about 40 seconds of this, it is sure to hang.
> 
> I am running kernel 5.3.0 on Ubuntu 18.04.4, but have seen this problem
> going back into much older kernels.

PSA:  Get off 5.3.0.  There is a serious bug in kernels 5.1 to 5.4.13 that
can lead to metadata corruption resulting in loss of the filesystem.
Go to 5.4.14 or later, or back to 4.19.y for y > 100 or so.  This advice
applies to all btrfs users, it's not related to latency.

In this case 4.19 might be a better choice than later kernels for latency.
5.0 had some latency-related regressions, and fixes for those are still
in development.
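As a quick self-check, the affected range can be tested against the
running kernel like this (a sketch, not from the original mail; the
version parsing is my own and may need adjusting for your distro's
kernel naming):

```shell
# Return success if a kernel version string falls in the known-bad
# 5.1..5.4.13 range for the metadata corruption bug described above.
in_bad_range() {
    case "$1" in
        5.1.*|5.2.*|5.3.*) return 0 ;;                  # whole series affected
        5.4.*) [ "${1#5.4.}" -lt 14 ] && return 0 ;;    # 5.4.0..5.4.13 affected
    esac
    return 1
}

# strip any distro suffix, e.g. "5.3.0-42-generic" -> "5.3.0"
if in_bad_range "$(uname -r | cut -d- -f1)"; then
    echo "upgrade to 5.4.14+ (or 4.19.y, y > 100) recommended"
fi
```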

> My array looks like this:
> 
> /dev/sda, ID: 2
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID1:              1.79TiB
>    Metadata,RAID1:          8.00GiB
>    Unallocated:             1.84TiB
> 
> /dev/sdg, ID: 1
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID1:              7.21TiB
>    Metadata,RAID1:         14.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             1.87TiB
> 
> /dev/sdh, ID: 3
>    Device size:             7.28TiB
>    Device slack:          344.00KiB
>    Data,RAID1:              5.43TiB
>    Metadata,RAID1:          8.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             1.84TiB
> 
> /dev/sdg on /home type btrfs
> (rw,relatime,space_cache,subvolid=256,subvol=/home)

Two things in the mount options:

1.  PSA:  Upgrade to space_cache=v2.  Unmount the filesystem, then mount
it with '-o clear_cache,space_cache=v2' (remount is not sufficient, you
have to completely umount).  This will take some minutes, but it only
has to be done once.  With space_cache=v1, transactions are quite slow
on a filesystem with ~10000 block groups.  Afterwards, use

	btrfs ins dump-tree -t 10 /dev/vgwaya/root |
		grep 'owner FREE_SPACE_TREE' | wc -l

to verify the space_cache=v2 conversion was done (it should print a
non-zero number).  Although directly relevant to this case, this advice
is also a PSA because it applies to all btrfs users.
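Put together, the whole one-time conversion looks something like this
(device and mountpoint here are examples based on your array; substitute
your own):

```shell
umount /home                                         # remount alone is not enough
mount -o clear_cache,space_cache=v2 /dev/sdg /home   # conversion runs here, may take minutes
# subsequent plain mounts use v2 automatically; verify with:
btrfs inspect-internal dump-tree -t 10 /dev/sdg |
        grep -c 'owner FREE_SPACE_TREE'              # should print a non-zero count
```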

2.  Use noatime instead of relatime.

In the mount man page for 'relatime':

	since Linux 2.6.30, the file's last access time is always updated
	if it is more than 1 day old

If you get this high-latency behavior about once a day, but it's fine
at other times, then this is the likely cause.  Some users need atime
updates, and they're usually OK on small SSD filesystems; however, this
filesystem is neither small nor SSD, and most users don't need atime.
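The switch can be made without downtime; for example (mountpoint
assumed, as above):

```shell
mount -o remount,noatime /home
# and make it persistent in /etc/fstab, e.g.:
# /dev/sdg  /home  btrfs  rw,noatime,space_cache=v2,subvol=/home  0  0
```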

You didn't mention snapshots.  If you don't have snapshots then disregard
the rest of this paragraph.  If you do have snapshots, then each time
you modify a snapshotted subvol (either origin or snapshot, doesn't
matter, what matters is that the metadata is shared), btrfs will be
doing extra writes to unshare shared pages and update reference counts.
Immediately after the snapshot is created, the write multiplication factor
is about 300.  The factor drops rapidly to 1.0, but it can take a few
minutes to get through the first 10000 page updates after a snapshot,
and you can easily get that many by touching 500 files.  Note that the
snapshot could have been made long in the past; its mere existence still
affects the write performance of the filesystem in the present.
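To put rough numbers on that decay (the linear interpolation here is my
own simplification; the only inputs from above are the ~300x starting
factor, the ~1.0 endpoint, and the ~10000-page warmup):

```python
def write_multiplier(pages_done, start=300.0, end=1.0, warmup=10000):
    """Assumed linear decay of write multiplication after a snapshot."""
    if pages_done >= warmup:
        return end
    frac = pages_done / warmup
    return start * (1 - frac) + end * frac

# right after the snapshot, each metadata page update costs ~300 writes:
print(write_multiplier(0))      # 300.0
# halfway through the ~10000-page warmup the factor has roughly halved:
print(write_multiplier(5000))   # 150.5
# once warmed up, writes are back to normal:
print(write_multiplier(20000))  # 1.0
```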

All of the above effects combine:  5.0 and later do not attempt to manage
latency, atime updates throw a lot of writes into the queue at once,
space_cache=v1 makes every write slower to exit the queue, and fresh
snapshots multiply everything else by an order of magnitude.  With all of
those at once, I'm surprised it's as fast as you reported.  Starting with
kernel 5.0 it's not hard to make a btrfs commit take 10 hours.

> I have 16gb of ram with 16gb of swap on a flash drive, the swap is in use
> 
> KiB Mem : 16393944 total,   398800 free, 13538088 used,  2457056 buff/cache
> KiB Swap: 16777212 total,  6804352 free,  9972860 used.  2045812 avail Mem

Check slabtop:

	# slabtop -o | grep btrfs_delayed_ref_head
	105072 105072 100%    0.33K   8756       12     35024K btrfs_delayed_ref_head

Divide the second number (count of btrfs_delayed_ref_head objects in use)
by about 1000 (depends on how fast your disks are, range is about 500 to
10000 for consumer hardware) and the result is roughly the commit latency
in seconds.  It's not the only time spent in a commit, but btrfs spends
orders of magnitude more time on delayed refs than on anything else.
On kernels before 5.0 btrfs kept the delayed ref head count below 10000,
but after 5.0 it is allowed to grow until memory is exhausted.
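That estimate as arithmetic (the divisor is the guess from the text, not
a measured value for your disks):

```python
def est_commit_latency(ref_heads, refs_per_sec=1000):
    """Delayed ref heads divided by an assumed per-disk processing rate."""
    return ref_heads / refs_per_sec

# the slabtop sample above shows 105072 delayed ref heads in use:
print(est_commit_latency(105072))          # ~105 seconds
# bounds for consumer hardware from the text (500..10000 refs/s):
print(est_commit_latency(105072, 10000))   # ~10.5 s best case
print(est_commit_latency(105072, 500))     # ~210 s worst case
```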

The latency fixes currently in development put the latency caps from
4.19 back in, and also add new ones (e.g. for snapshot delete, which
has been able to create unbounded latency since the beginning of btrfs).
5.7 or 5.8 should be better at latency than 4.19.

> What other information would be useful in attempting to diagnose or fix
> this?   I like a number of things about BTFS.  One of them that I don't
> want to give up is the ability to do RAID with different sized disks,
> which seems like the only way it should work.  Switching to ZFS or mdadm
> again would involve disk upgrades and a very large amount of time
> copying this much data, but I'll have to do it if I can't diagnose this.
