Re: BTRFS File Delete Speed Scales With File Size?

First and foremost, Zygo, apologies for not having responded to your prior two emails. A colleague alerted me that we'd gotten a response, and I found that, very unfortunately, our MS filter had marked your emails as spam. I've unmarked them, so hopefully this won't happen again.

A few comments on prior emails, and then responses in-line:

40 threads is simply the max due to the nature of our OSD architecture. We can make good use of them in the case of heavy PFS metadata updates (which do not go to BTRFS), but yes, they can overwhelm spinning rust under very heavy deletion workloads. We're looking into task queuing to throttle this.
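
For illustration, the sort of throttling we're considering is just a cap on concurrent deletes per backing device; a minimal sketch with made-up names (delete_sem, MAX_INFLIGHT_DELETES, throttled_unlink), not our actual OSD code:

/* Illustrative sketch only: bound the number of worker threads that may
 * issue unlink() concurrently, so a spinning disk is not flooded with
 * delayed-ref work from dozens of simultaneous deletes. */
#include <semaphore.h>
#include <unistd.h>

#define MAX_INFLIGHT_DELETES 4          /* per-device tunable */

static sem_t delete_sem;                /* sem_init(&delete_sem, 0, MAX_INFLIGHT_DELETES) at startup */

static int throttled_unlink(const char *path)
{
        int ret;

        sem_wait(&delete_sem);          /* block while too many deletes are in flight */
        ret = unlink(path);
        sem_post(&delete_sem);
        return ret;
}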

We are less concerned about the ability to process deletes in O(1) or O(log(n)) time, and more concerned about not holding up foreground I/O while a number of them are processed (ideally in the background).


On 6/15/20 11:56 PM, Zygo Blaxell wrote:
>> 2. We find that speed to delete and fsync the directory on 4.12 is
>> equivalent to delete and then sync the entire filesystem.
>
> I'm not quite sure how fsync() is relevant here--fsync() doesn't do
> anything about delayed refs, unless fsync() accidentally triggers a full
> sync (which I wouldn't expect to happen when fsyncing a simple unlink).

fsync is how we guarantee that either a write or create to a new file, or a delete, has been acknowledged by the underlying FS and will be rolled forward after a power failure. We recognize there may be non-trivial work involved in rolling that forward on reboot, especially on spinners. It was of interest because I had previously (wrongly, at least relative to our use of BTRFS in our PFS) been timing the remove and a full 'sync' together, rather than just the remove and the fsync of the housing directory. I agree that on 4.12 the dominant cost is the remove itself, so switching didn't matter there, but it does matter on 5.7, for reasons you elucidate elsewhere.
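
For reference, the delete path in question boils down to something like the following (a minimal sketch of the pattern, not our production code; durable_unlink is an illustrative name):

/* Sketch: unlink the file, then fsync the containing directory so the
 * removal of the directory entry is acknowledged by the FS and will be
 * rolled forward after a power failure. */
#include <fcntl.h>
#include <unistd.h>

static int durable_unlink(const char *dirpath, const char *filepath)
{
        int dirfd, ret;

        if (unlink(filepath) != 0)
                return -1;

        dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
                return -1;

        ret = fsync(dirfd);             /* persist the directory entry removal */
        close(dirfd);
        return ret;
}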

> 4.12 throttles unlink() to avoid creating huge backlogs of ref updates
> in kernel memory.  During the throttling, previously delayed refs
> (including ref updates caused by other threads) are processed by the
> kernel thread servicing the unlink() system call.
>
> 5.7 doesn't throttle unlink(), so the unlink() system call simply queues
> delayed ref updates as quickly as the kernel can allocate memory.
> Once kernel memory runs out, unlink() will start processing delayed
> refs during the unlink() system call, blocking the caller of unlink().
> If memory doesn't run out, then the processing happens during transaction
> commit in some other thread (which may be btrfs-transaction, or some
> other unlucky user thread writing to the filesystem that triggers delayed
> ref processing).

The above is exactly the explanation I've been looking for.  Thank you!
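
Incidentally, a quick way to see where the stall lands is to time each unlink() call on its own rather than the batch as a whole; a rough sketch, purely illustrative:

/* Illustrative only: time each unlink() to see bursts of fast deletes
 * punctuated by a long stall when one call ends up paying off the
 * accumulated delayed-ref work. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct timespec t0, t1;

        for (int i = 1; i < argc; i++) {
                clock_gettime(CLOCK_MONOTONIC, &t0);
                unlink(argv[i]);
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("%s: %.6f s\n", argv[i],
                       (double)(t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
        }
        return 0;
}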

> In both kernels there will be bursts of fast processing as unlink()
> borrows memory, with occasional long delays while unlink() (or some other
> random system call) pays off memory debt.  4.12 limited this borrowing
> to thousands of refs and most of the payment to the unlink() caller;
> in 5.7, there are no limits, and the debt to be paid by a random user
> thread can easily be millions of refs, each of which may require a page
> of IO to complete.

Are there any user-tunable settings for this in 4.12? We would be extremely interested in bumping the outstanding ref limit in that version if doing so were as simple as a sysctl, hidden mount option, or something similar.

>> Any comments on when major change(s) came in that would impact these
>> behaviors would be greatly appreciated.
>
> Between 4.20 and 5.0 the delayed ref throttling went away, which might
> account for a sudden shift of latency to a different part of the kernel.
>
> The impact is that there is now a larger window for the filesystem to
> roll back to earlier versions of the data after a crash.  Without the
> throttling, unlink can just keep appending more and more ref deletes to
> the current commit faster than the disks can push out updated metadata.
> As a result, there is a bad case where the filesystem spends all of
> its IO time trying to catch up to metadata updates, but never successfully
> completes a commit on disk.
>
> A crash rolls back to the previous completed commit and replays the
> fsync log.  A crash on a 5.0 to 5.7 kernel potentially erases hours of
> work, or forces the filesystem to replay hours of fsync() log when the
> filesystem is mounted again.  This requires specific conditions, but it
> sounds like your use case might create those conditions.

Yes, we might, given the right end-user scenarios and a power-fail timed just right, but we have appropriate states for the PFS to cope with this, which can include a very lengthy bring-up period that already covers mounting BTRFS. Erasing acknowledged fsync'd work would be a breach of POSIX and concerning, but taking time on mount to replay things like this is both expected and something we already do at a higher level in our PFS, so we get it.

> Some details:
>
> 	https://lore.kernel.org/linux-btrfs/20200209004307.GG13306@xxxxxxxxxxxxxx/
>
> 5.7 has some recent delayed ref processing improvements, but AFAIK
> the real fixes (which include putting the throttling back) are still
> in development.

All extremely helpful info, Zygo. Thank you again, and do let me know if there are any tunables we can play with in 4.12 to better mimic the behavior we're seeing in 5.7. We'd be indebted to you if there were.

Best,

ellis


