Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11

On Sun, Feb 09, 2020 at 06:49:11PM -0700, Chris Murphy wrote:
> On Sat, Feb 8, 2020 at 5:43 PM Zygo Blaxell
> <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Upon reboot, the filesystem reverts to its state at the last completed
> > transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> > be intact, but there is no trace of any update to the filesystem after
> > the transaction 4441796.  The last 'fi usage' logged before the crash
> > and the first 'fi usage' after show 40GB of data and 25GB of metadata
> > block groups freed in between.
> 
> Is this behavior affected by flushoncommit mount option? i.e. do you
> see a difference using flushoncommit vs noflushoncommit? My suspicion
> is the problem doesn't happen with noflushoncommit, but then you get
> another consequence that maybe your use case can't tolerate?

Sigh...the first three things anyone suggests when I talk about btrfs's
ridiculous commit latency are:

	1.  Have you tried sysctl vm.dirty_background_bytes?

	2.  Have you tried turning off flushoncommit?

	3.  Have you tried cgroupsv2 parameter xyz?

as if those were not the first things I'd try, or as if I hadn't set up
a test farm to run random sets of parameter combinations (including
discard, ssd cache modes, etc) to see if any combination of these
parameters made btrfs go faster, at any point in the last five years.

I know what doesn't work:  Very low values of vm.dirty_background_bytes
can certainly harm throughput, but once it's above 100M or so it makes
no difference.  Some SSDs are terrible with discard, others need it
to avoid crippling performance losses every few months.  Writeback SSD
caches get flooded with data thanks to btrfs's already scary-fast write
path, and end up adding additional latency.  cgroupsv2 measures the wrong
things, so it reports io stall pressure of zero in high-priority cgroups
while all writes are blocked and some low-priority cgroup desperately
needs to be throttled.  If you throttle anything on btrfs at the block
level, you get priority inversion, because it's impossible to predict
which thread will end up hosting its very own btrfs transaction commit,
and nobody gets to write anything while one of those is running (well,
on 5.3+, apparently lots of processes can continue to write, but nothing
they write will be persisted after a crash).

When the kernel hits vm.dirty_bytes, none of the other settings matter:
the only difference between flushoncommit and noflushoncommit is the
order of the writes during a commit, and either way the commit dumps to
disk all the dirty pages that the kernel can store in RAM.
noflushoncommit allows the kernel to dump the pages in the wrong order,
but has no performance advantages.  noflushoncommit might even make the
latency a little _worse_.
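
For anyone who wants to see which dirty limit applies on their own
machine, here is a minimal Python sketch.  It is an illustration, not
something from my test setup.  It relies on the documented behaviour
that vm.dirty_bytes and vm.dirty_ratio are mutually exclusive (whichever
was not set reads back as 0).  Using MemAvailable as the base for the
ratio is my assumption and only approximates the kernel's own
"dirtyable memory" figure.  The function names are made up for the
example.

```python
import os


def read_vm_sysctl(name, base="/proc/sys/vm"):
    """Return the integer value of a vm.* sysctl, or None if unreadable."""
    try:
        with open(os.path.join(base, name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def mem_available_bytes(path="/proc/meminfo"):
    """Return MemAvailable in bytes, or None if it cannot be found."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    # The line looks like "MemAvailable:   123456 kB".
                    return int(line.split()[1]) * 1024
    except (OSError, ValueError, IndexError):
        pass
    return None


def dirty_limit_bytes(dirty_bytes, dirty_ratio, available_bytes):
    """Approximate the hard dirty limit in bytes.

    A non-zero dirty_bytes wins; otherwise dirty_ratio is applied as a
    percentage of available_bytes (an approximation, see above).
    """
    if dirty_bytes:
        return dirty_bytes
    return available_bytes * (dirty_ratio or 0) // 100


if __name__ == "__main__":
    limit = dirty_limit_bytes(
        read_vm_sysctl("dirty_bytes"),
        read_vm_sysctl("dirty_ratio"),
        mem_available_bytes() or 0,
    )
    print("approximate dirty limit: %d MiB" % (limit // (1024 * 1024)))
```

The number it prints is a rough upper bound on how much dirty data a
single commit may have to push out once that limit is reached.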

Profiling indicates that btrfs spends most of its time _reading_ the
filesystem during commits.  Roughly half the IO is metadata reads for
extent and csum trees, the other half is writing updated versions of
these, and maybe 1% is writing the data blocks.  While all that's going
on, more and more stuff gets locked, until eventually transactions stop
dead (on kernels up to 5.0) or keep going forever (on kernel 5.3 and later).
Freezing all reading processes helps the commit finish faster, but it
needs crippling levels of throttling (like 0.1% of raw write bandwidth
of the slowest disk in the array, or even less) before making a dent in
the big latency spikes.

I'm not quite sure what's different on 5.4--there were a lot of changes
and I haven't been doing profiling because I've been focused on fixing
the crashing bugs until recently.  5.4 has apparently moved the latency
to different places--ordinary writing processes no longer block at all,
while processes calling rename or fsync can be blocked for entire days.
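
A rough way to watch for this kind of fsync stall is to time a small
write-plus-fsync in a loop.  The Python sketch below is illustrative
only.  The probe file name, the 4K payload, the sample count and the
1-second reporting threshold are all arbitrary choices of mine, and it
measures only fsync, not rename.

```python
import os
import sys
import time


def time_fsync(path, payload=b"x" * 4096):
    """Append payload to path and return the seconds spent in fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        start = time.monotonic()
        os.fsync(fd)  # blocks until the filesystem reports the data durable
        return time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical usage: python3 fsync_probe.py /path/on/the/filesystem
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    probe = os.path.join(directory, "fsync-probe.tmp")
    threshold = 1.0  # seconds; arbitrary cut-off for reporting a sample
    for _ in range(10):
        elapsed = time_fsync(probe)
        if elapsed >= threshold:
            print("%s fsync took %.1fs" % (time.strftime("%H:%M:%S"), elapsed))
        time.sleep(1)
    os.unlink(probe)
```

Note that the probe is itself one of the blocked processes when a stall
happens, so a single sample can take as long as the stall does.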

> 
> -- 
> Chris Murphy
