Hi!
On 6/9/20 5:31 PM, Ellis H. Wilson III wrote:
> Hi folks,
> We have a few engineers looking through BTRFS code presently for answers
> to this, but I was interested to get input from the experts in parallel
> to hopefully understand this issue quickly.
> We find that removes of large amounts of data can take a significant
> amount of time in BTRFS on HDDs -- in fact it appears to scale linearly
> with the size of the file. I'd like to better understand the mechanics
> underpinning that behavior.
Like Adam already mentioned, the amount and size of the individual
extent metadata items that need to be removed is one variable in the big
equation.
The show_file.py example in python-btrfs is a bit outdated, and it prints
info about all referenced extents and a bit more. Alternatively, we can
look only at the file extent items and calculate their count and their
average and total size (on disk):
---- >8 ----
#!/usr/bin/python3
import btrfs
import os
import sys

# Usage: point it at a file on a btrfs filesystem, e.g. /bin/bash
with btrfs.FileSystem(sys.argv[1]) as fs:
    inum = os.fstat(fs.fd).st_ino
    # Walk all EXTENT_DATA items of this inode in the subvolume tree.
    min_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, 0)
    max_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, -1)
    amount = 0
    total_size = 0
    for header, data in btrfs.ioctl.search_v2(fs.fd, 0, min_key, max_key):
        item = btrfs.ctree.FileExtentItem(header, data)
        # Only count regular extents, skipping inline ones.
        if item.type != btrfs.ctree.FILE_EXTENT_REG:
            continue
        amount += 1
        total_size += item.disk_num_bytes
    print("Referencing data from {} regular extents with total size {} "
          "average size {}".format(
              amount, btrfs.utils.pretty_size(total_size),
              btrfs.utils.pretty_size(int(total_size / amount))))
---- >8 ----
If I e.g. point this at /bin/bash (compressed), I get:
Referencing data from 9 regular extents with total size 600.00KiB
average size 66.67KiB
The above code assumes that the real data extents are referenced only
once (no dedupe within the same file etc.), otherwise you'll have to
filter out duplicates (and keep more stuff in memory). And, it ignores
inlined small extents for simplicity. Anyway, you can hack away at it to
e.g. make it look up more interesting things.
https://python-btrfs.readthedocs.io/en/latest/btrfs.html#btrfs.ctree.FileExtentItem
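For example, here's a rough sketch of such filtering, assuming the
disk_bytenr attribute of each regular file extent item (documented at the
link above) can be used to recognize extents we already counted:
---- >8 ----
#!/usr/bin/python3
# Variation on the script above: count every on-disk extent only once,
# even if the file references it multiple times (e.g. reflinked ranges
# within the same file). Rough sketch; it assumes disk_bytenr uniquely
# identifies the underlying data extent.
import btrfs
import os
import sys

with btrfs.FileSystem(sys.argv[1]) as fs:
    inum = os.fstat(fs.fd).st_ino
    min_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, 0)
    max_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, -1)
    seen = set()  # disk_bytenr of extents we already counted
    total_size = 0
    for header, data in btrfs.ioctl.search_v2(fs.fd, 0, min_key, max_key):
        item = btrfs.ctree.FileExtentItem(header, data)
        if item.type != btrfs.ctree.FILE_EXTENT_REG:
            continue
        if item.disk_bytenr == 0 or item.disk_bytenr in seen:
            continue  # skip holes in sparse files and duplicate references
        seen.add(item.disk_bytenr)
        total_size += item.disk_num_bytes
    print("Referencing data from {} distinct extents with total size {}".format(
        len(seen), btrfs.utils.pretty_size(total_size)))
---- >8 ----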
> See the attached graph for a quick experiment that demonstrates this
> behavior. In this experiment I use 40 threads to perform deletions of
> previously written data in parallel. 10,000 files in every case and I
> scale files by powers of two from 16MB to 16GB. Thus, the raw amount of
> data deleted also expands by 2x every step. Frankly I expected deletion
> of a file to be predominantly a metadata operation and not scale with
> the size of the file, but perhaps I'm misunderstanding that. While the
> overall speed of deletion is relatively fast (hovering between 30GB/s
> and 50GB/s) compared with raw ingest of data to the disk array we're
> using (in our case ~1.5GB/s) it can still take a very long time to
> delete data from the drives and removes hang completely until that data
> is deleted, unlike in some other filesystems. They also compete
> aggressively with foreground I/O due to the intense seeking on the HDDs.
What is interesting in this case is to see what kind of IO pattern the
deletes are causing (so, obviously, test without adding data at the same
time): how much read and write throughput, and how many read and write
iops (so that we can easily calculate the average iop size).
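Something like iostat -x 1 shows that. Or, in the spirit of little Python
toys, here's a rough sketch that samples /sys/block/<dev>/stat once per
second and prints read/write iops and throughput for one disk (the field
offsets follow the kernel block layer stat documentation, and the default
device name 'sda' is just a placeholder):
---- >8 ----
#!/usr/bin/python3
# Rough sketch: per-second read/write iops and throughput for one block
# device, sampled from /sys/block/<dev>/stat.
import btrfs
import sys
import time


def read_stat(dev):
    with open('/sys/block/{}/stat'.format(dev)) as f:
        fields = [int(x) for x in f.read().split()]
    # 0 = read ios, 2 = read sectors, 4 = write ios, 6 = write sectors,
    # and a sector is always 512 bytes here.
    return fields[0], fields[2] * 512, fields[4], fields[6] * 512


dev = sys.argv[1] if len(sys.argv) > 1 else 'sda'
prev = read_stat(dev)
while True:
    time.sleep(1)
    cur = read_stat(dev)
    rios, rbytes, wios, wbytes = (c - p for c, p in zip(cur, prev))
    prev = cur
    print("read {} iops {}/s, write {} iops {}/s".format(
        rios, btrfs.utils.pretty_size(rbytes),
        wios, btrfs.utils.pretty_size(wbytes)))
---- >8 ----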
If most time is spent doing random read IO, then hope for a bright
future in which you can store btrfs metadata on solid state and the data
itself on cheaper spinning rust.
If most time is spent doing writes, then what you're seeing is probably
what I call rumination, which is caused by making changes in metadata,
which leads to writing the modified parts of metadata out to free space
again. The extent tree (which has info about the actual data extents on
disk that the file_extent_item ones seen above point to) is one of those,
and there's only one of that kind. It even tracks its own space
allocation, which can lead to the effect of a cat or dog chasing its own
tail. There have definitely been performance improvements in that area; I
don't know exactly where, but when I moved from a 4.9 to a 4.19 kernel,
things improved a bit.
There is a very long story at
https://www.spinics.net/lists/linux-btrfs/msg70624.html about these
kinds of things.
There are unfortunately no easily accessible tools yet that can give live
insight into what exactly is happening when it's writing metadata like
crazy.
> This is with an older version of BTRFS (4.12.14-lp150.12.73-default) so
> if algorithms have changed substantially such that deletion rate is no
> longer tied to file size in newer versions please just say so and I'll
> be glad to take a look at that change and try with a newer version.
That seems to be a SUSE kernel. I have no idea what kind of btrfs is in
there.
> FWIW, we are using the v2 free space cache.
Just to be sure, there are edge cases in which the filesystem says it's
using space cache v2, but actually isn't.
Does /sys/fs/btrfs/<UUID>/features/free_space_tree exist?
Or, of course, a fun little program that just does a bogus search in it,
which explodes if it's not there:
---- >8 ----
#!/usr/bin/python3
import btrfs

with btrfs.FileSystem('/') as fs:
    try:
        # Any search in the free space tree fails if it doesn't exist.
        list(fs.free_space_extents(0, 0))
        print("Yay!")
    except FileNotFoundError:
        print("No FST :(")
---- >8 ----
Space cache v1 will cause filesystem stalls combined with a burst of
writes on every 'transaction commit', and space cache v2 will cause more
random reads and writes, but they don't block the whole thing.
> If any other information is relevant to this just let me know and I'll
> be glad to share.
Suggestions for things to try, and see what difference it makes:
* Organize incoming data in btrfs subvolumes in a way that enables you
to remove the subvol instead of rm'ing the files. If there are a lot of
files and stuff, this can be more efficient, since btrfs knows which
parts to 'cut away' with a chainsaw when removing, instead of being told
to do everything file by file in small steps. It also keeps the size of
the individual subvolume metadata trees down, reducing rumination done by
the cow. If you don't use subvolumes, your single default fs tree with
all file info becomes, relatively speaking, as large as the extent tree.
* Remove stuff in the same order as it got added. Remember, rm * removes
files in alphabetical order, not in the order in which their metadata was
added to disk (the inode number sequence). Deleting in inode order might
cause less jumping around in metadata (see the little sketch after this
list). Using subvolumes instead is still better, because in that case the
whole issue does not exist.
* (not recommended) If your mount options do not show 'ssd' in them and
your kernel does not have this patch:
https://www.spinics.net/lists/linux-btrfs/msg64203.html then you can try
mounting with -o ssd_spread,nossd (read the story in the link above).
Anyway, you're better off moving to something recent instead of trying
all these obsolete workarounds...
If you have a kernel >= 4.14 then you're better off mounting with -o ssd
even for rotational disks, because the metadata writes are grouped
together in bigger parts of free space, leading to less rumination. There
is still a lot of research and improvement to be done in this area
(again, see the long post referenced above).
If you have a kernel < 4.14, then definitely do not mount with -o ssd;
even mount SSDs with -o nossd explicitly. (Yes, that's also a long
story...)
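Here's the little sketch mentioned in the second suggestion above:
removing all files in a directory in inode number order instead of
alphabetical order. Just an illustration; it's non-recursive and assumes
the directory only contains regular files.
---- >8 ----
#!/usr/bin/python3
# Sketch: unlink the files in a directory in inode number order, which
# roughly matches the order in which their metadata was added, instead
# of the alphabetical order that rm * would use.
import os
import sys

path = sys.argv[1]
entries = sorted((entry.inode(), entry.name)
                 for entry in os.scandir(path)
                 if entry.is_file(follow_symlinks=False))
for _, name in entries:
    os.unlink(os.path.join(path, name))
---- >8 ----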
Fun! :) /o\
> Thank you for any time people can spare to help us understand this
> better!
Don't hesitate to ask more questions,
Have fun,
Hans