Austin S. Hemmelgarn wrote:
On 2017-01-16 06:10, Christoph Groth wrote:
root@mim:~# btrfs fi df /
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
Just a general comment on this: you might want to consider
running a full balance on this filesystem. You've got a huge
amount of slack space in the data chunks (over 70GiB),
significant space in the Metadata chunks that isn't accounted
for by the GlobalReserve, and a handful of empty single-profile
chunks, which are artifacts of some old versions of mkfs. This
isn't essential, of course, but keeping ahead of such things
sometimes helps when you run into issues.
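For a quick overview of where that slack lives, "btrfs
filesystem usage" breaks the allocation down per profile and
also reports the unallocated space:

btrfs fi usage /

In the output you posted, the data slack alone is 417.00GiB
total minus 344.62GiB used, i.e. roughly 72GiB.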
Thanks! So slack is the difference between "total" and "used"? I
saw that the manpage of "btrfs balance" explains this a bit in
its "examples" section. Are you aware of any more in-depth
documentation? Or does one have to look at the source at this
level?
I ran
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /
btrfs balance start -dusage=25 -musage=25 /
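(For anyone reading along: the "soft" modifier makes the convert
filters skip chunks that are already in the target profile, so
the first command only rewrote the leftover single-profile
chunks; the usage=25 filters restrict the second balance to
chunks that are less than 25% full, repacking their contents
into fuller chunks and releasing the slack.)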
This resulted in
root@mim:~# btrfs fi df /
Data, RAID1: total=365.00GiB, used=344.61GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=2.00GiB, used=1.35GiB
GlobalReserve, single: total=460.00MiB, used=0.00B
I hope that one day there will be a daemon that silently performs
all the necessary btrfs maintenance in the background when system
load is low!
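In the meantime, a weekly cron job gets most of the way there. A
sketch, with the path and thresholds made up, adjust to taste:

#!/bin/sh
# /etc/cron.weekly/btrfs-maintenance -- illustrative sketch only
# Compact chunks that are less than 25% full; much cheaper than a full balance.
btrfs balance start -dusage=25 -musage=25 /
# Re-read all data and metadata and verify checksums; -B waits for completion.
btrfs scrub start -B /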
* So scrubbing is not enough to check the health of a btrfs
file system? It’s also necessary to read all the files?
Scrubbing checks data integrity, but not the consistency of the
filesystem structure. IOW, you're checking that the data and
metadata match their checksums, but not necessarily that the
filesystem itself is valid.
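For completeness, a scrub pass is driven with:

btrfs scrub start /     # verify all checksums in the background
btrfs scrub status /    # show progress and error counts so far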
I see, but what should one then do to detect problems such as mine
as soon as possible? Periodically calculate hashes for all files?
I’ve never seen a recommendation to do that for btrfs.
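The closest thing to a structural check is "btrfs check", which
is read-only by default but has to run on an unmounted
filesystem (here /dev/sdX stands in for your actual device):

btrfs check /dev/sdX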
There are a few things you can do to mitigate the risk of not
using ECC RAM though:
* Reboot regularly, at least weekly, and possibly more
frequently.
* Keep the system cool, warmer components are more likely to
have transient errors.
* Prefer fewer memory modules when possible. Fewer modules mean
less total area that could be hit by cosmic rays or other
high-energy radiation (the main cause of most transient
errors).
Thanks for the advice, I think I buy the regular reboots.
As a consequence of my problem I think I’ll stop using RAID1 on
the file server, since it only protects against dead disks,
which is evidently only part of the problem. Instead, I’ll make
sure that the laptop that syncs with the server has an SSD big
enough to hold all the data that is on the server as well (1TB
SSDs are affordable now). This way, instead of disk-level
redundancy, I’ll have machine-level redundancy. When something
like the current problem hits one of the two machines, I should
still have a usable second machine with all the data on it.
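The sync itself can stay simple; a sketch with made-up host and
paths (incremental btrfs send/receive of read-only snapshots
would be an alternative):

# pull the server’s data to the laptop, preserving hard links, ACLs and xattrs
rsync -aHAX --delete server:/srv/data/ /srv/data/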