Re: Uncorrectable errors with RAID1

Austin S. Hemmelgarn wrote:
On 2017-01-16 06:10, Christoph Groth wrote:

root@mim:~# btrfs fi df /
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B

Just a general comment on this: you might want to consider running a full balance on this filesystem. You've got a large amount of slack space in the data chunks (over 70 GiB), significant space in the metadata chunks that isn't accounted for by the GlobalReserve, and a handful of empty single-profile chunks, which are artifacts of some old versions of mkfs. This isn't essential, of course, but keeping ahead of such things sometimes helps when you run into issues.

Thanks! So slack is the difference between "total" and "used"? I saw that the manpage of "btrfs balance" explains this a bit in its "examples" section. Are you aware of any more in-depth documentation, or does one have to look at the source at this level?

I ran

btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /
btrfs balance start -dusage=25 -musage=25 /

This resulted in

root@mim:~# btrfs fi df /
Data, RAID1: total=365.00GiB, used=344.61GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=2.00GiB, used=1.35GiB
GlobalReserve, single: total=460.00MiB, used=0.00B
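If it helps anyone else reading along: slack here is just "total" minus "used" per chunk type, and it can be printed directly from the `btrfs fi df` output. A small sketch (the unit handling assumes the GiB/MiB/KiB/B suffixes printed by btrfs-progs):

```shell
# slack(): read `btrfs fi df` output on stdin, print per-type slack in GiB.
slack() {
    awk -F'[=,]' '
    function gib(s,  v) {                        # convert a size string to GiB
        v = s + 0
        if (s ~ /MiB/)       v /= 1024
        else if (s ~ /KiB/)  v /= 1024 * 1024
        else if (s !~ /GiB/) v /= 1024 * 1024 * 1024   # plain bytes ("0.00B")
        return v
    }
    NF >= 5 {
        type = $1 "," $2
        sub(/: total$/, "", type)                # e.g. "Data, RAID1"
        printf "%-24s slack = %7.2f GiB\n", type, gib($3) - gib($5)
    }'
}

# Usage: btrfs fi df / | slack
```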

I hope that one day there will be a daemon that silently performs all the necessary btrfs maintenance in the background when system load is low!
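Until such a daemon exists, a periodic cron job can approximate it. A minimal sketch, assuming a weekly schedule; the script path and usage thresholds below are only examples, not recommendations:

```shell
#!/bin/sh
# /etc/cron.weekly/btrfs-maintenance -- hypothetical location.
# Compact chunks that are less than 25% full, then scrub in the foreground.
btrfs balance start -dusage=25 -musage=25 /
btrfs scrub start -B /
```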

* So scrubbing is not enough to check the health of a btrfs file system? It’s also necessary to read all the files?

Scrubbing checks data integrity, but not the consistency of the filesystem structures. In other words, you're checking that the data and metadata match their checksums, but not necessarily that the filesystem itself is valid.
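(In practice the two checks map onto different tools; a sketch, with the device name as a placeholder:

```shell
# Verify checksums of all data and metadata; runs on a mounted filesystem.
btrfs scrub start -Bd /        # -B: run in foreground, -d: per-device stats

# Verify the validity of the filesystem structures themselves; btrfs check
# must be pointed at an unmounted device and is read-only by default.
btrfs check /dev/sdX           # placeholder device name
```
)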

I see, but what should one do, then, to detect problems such as mine as early as possible? Periodically compute hashes of all files? I've never seen such a recommendation for btrfs.
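(One low-tech option along those lines is to keep a checksum manifest outside the filesystem being monitored and verify it periodically. A sketch using sha256sum; all paths are examples:

```shell
# Build a manifest of all file hashes. Store it off the monitored
# filesystem, e.g. on another machine or disk.
cd /srv/data && find . -type f -print0 | xargs -0 sha256sum > /root/data.sha256

# Later: re-hash everything; sha256sum reports any file whose content changed
# and exits non-zero.
cd /srv/data && sha256sum --quiet --check /root/data.sha256
```
)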

There are a few things you can do to mitigate the risk of not using ECC RAM, though:

* Reboot regularly, at least weekly, and possibly more frequently.
* Keep the system cool; warmer components are more likely to have transient errors.
* Prefer fewer memory modules when possible. Fewer modules means less total area that could be hit by cosmic rays or other high-energy radiation (the main cause of most transient errors).

Thanks for the advice; I think I buy the regular reboots.

As a consequence of my problem, I think I'll stop using RAID1 on the file server, since it only protects against dead disks, which is evidently only part of the problem. Instead, I'll make sure that the laptop that syncs with the server has an SSD big enough to hold all of the server's data as well (1 TB SSDs are affordable now). This way, instead of disk-level redundancy, I'll have machine-level redundancy: when something like the current problem hits one of the two machines, I should still have a usable second machine with all the data on it.



