Re: Self-destruct of btrfs RAID6 array

Paul Loewenstein posted on Thu, 19 Nov 2015 20:11:14 -0800 as excerpted:

> I have just had an apparently catastrophic collapse of a large RAID6
> array.  I was hoping that the dual-redundancy of a RAID6 array would
> compensate for having no backup media large enough to back it up!

Well...

First, while btrfs in general is "stabilizing" and is noticeably better 
than it was a year ago, it remains "not yet fully stable or mature."

There's a sysadmin's rule of backups: if data isn't backed up, you have 
by definition valued it at less than the time/trouble/resources of making 
the backup.  Should the filesystem then fail, whatever data is lost, 
you've still kept what your actions defined as /really/ valuable, the 
time/trouble/resources you saved by not doing the backup, and can thus be 
happy, since you saved the really important stuff.

Because btrfs isn't yet fully stable, having backups is even more 
important than it would be on a fully stable filesystem like xfs, ext*, 
or reiserfs (my previous favorite and what I still use on spinning rust 
and for backups), so that sysadmin's rule of backups applies double.

Of course some distros are choosing to deploy and support btrfs as if 
it's already fully stable, and that's their risk and their business, but 
by the same token, for that you'd get support from them, not from the 
upstream list (here), where btrfs is still considered to be "stabilizing, 
not yet fully stable".

Second, btrfs raid56 mode is much newer than btrfs in general, and isn't 
yet close to even the "stabilizing, good enough provided you have good 
backups or are using throw-away data" level of btrfs as a whole.  Nominal 
code-completion only came with kernel 3.19, and there were very 
significant bugs through 4.0 and into the early 4.1 cycle, tho by the 4.1 
release the worst known bugs were fixed.  As a btrfs user and list 
regular, I and others have repeatedly recommended that people not 
consider btrfs raid56 mode even as "stabilizing-stable" as btrfs in 
general until at least a year (five kernel cycles) after that nominal 
code completion in 3.19, and even then, anyone thinking about btrfs 
raid56 should check the list for recent bugs before deploying it with 
anything but throw-away data (which can mean data that's safely backed 
up) in test mode.  That one-year mark falls at kernel 4.4, which is 
currently in development.

And as it happens, kernel 4.4 has been announced as a long-term-stable 
series, so things look to be working out reasonably well for those 
interested in first-opportunity-stablish btrfs raid56 deployment on it. 
=:^)

Since we're obviously not at 4.4 release yet, and in fact you're 
apparently running 4.1 stable series, that means btrfs raid56 mode must 
still be considered less stable than btrfs as a whole, which as I said is 
itself "still stabilizing, not fully stable and mature", so now we're at 
double-the-already-doubled-strength, 4 times the normal strength, of the 
sysadmin's backup rule.

So it's four-times self-evident that if you didn't have backups for data 
on raid56-mode btrfs, by your actions you placed a *REALLY* low value on 
that data!  So any loss is /very/ trivial, at least compared to the time 
and resources you can be happy you saved by not having a backup. =:^)

That said, there's still hope...

First, because btrfs raid56 mode /is/ so new and not yet stable, you 
really need to be working with the absolute latest tools in order to 
have the best chance at recovery.  That means kernel 4.3 and btrfs-progs 
4.3.1, if at all possible.  You can use earlier, but it might mean losing 
what's actually recoverable using the latest tools.
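
A quick way to check what you're actually running (the exact output will 
of course differ on your system):

  uname -r          # running kernel version
  btrfs --version   # installed btrfs-progs version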

> Any suggestions for repairing this array, at least to the point of
> mounting it read-only?  I am thinking of trying to mount it degraded
> with different devices missing, but I don't know if that will be an
> exercise in futility.
> 
> btrfs fi show still works!
> 
> Label: 'btrfsdata'  uuid: ccde0a00-e50b-4154-977f-ac591ab580a5
>          Total devices 6 FS bytes used 9.62TiB
>          devid   10 size 3.64TiB  used 2.41TiB path /dev/sdg
>          devid   11 size 3.64TiB used 2.41TiB path /dev/sda
>          devid   12 size 3.64TiB used 2.41TiB path /dev/sdb
>          devid   13 size 3.64TiB used 2.41TiB path /dev/sdc
>          devid   14 size 3.64TiB used 2.41TiB path /dev/sdd
>          devid   15 size 3.64TiB used 2.41TiB path /dev/sde
> 
> It spontaneously (I believe it was after it successfully mounted rw on
> boot, but I can't check for sure without looking at the last file
> creation time).  After another reboot it won't mount at all.

You say it won't mount, but there's no hint of which mount options you've 
tried.

If you've not yet read up on the user documentation on the wiki,
https://btrfs.wiki.kernel.org , I suggest you do so.  There's a lot of 
useful background information there, including discussion of mount 
options and recovery.

What you will want to try here if you haven't already is a degraded,ro 
mount, possibly with the recovery option as well (try it without first, 
then with, if necessary).
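
For example, something along these lines, using any one of the member 
devices from your fi show output; the mountpoint here is just an example, 
so substitute your own:

  mount -o degraded,ro /dev/sda /mnt/btrfsdata
  # and only if that fails, add the recovery option:
  mount -o degraded,ro,recovery /dev/sda /mnt/btrfsdata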

If you've not tried a degraded writable mount yet, there's a chance that 
mounting degraded, writable, will work, and if it does, you want to do 
device replaces/deletes to get undegraded as soon as possible, with as 
little other writing to the filesystem as possible in the meantime.  The 
reason: if new chunks must be allocated for further writes, they may be 
allocated in single mode, and there's currently a bug that then refuses a 
later degraded read-write mount, because btrfs sees the single-mode 
chunks on a degraded filesystem and assumes there may be more of them on 
the missing devices, without actually checking.  As a result, you often 
get just one shot at a writable mount to undegrade, and if that doesn't 
work, the filesystem is often only read-only mountable after that.  (This 
bug applies to all redundant/parity raid modes, so to raid1 and raid10 as 
well, not just raid56.)
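
If you do go the degraded-writable route, the sequence would look roughly 
like the sketch below.  The device names, devid and mountpoint are only 
examples (/dev/sdf as a replacement device is purely hypothetical), so 
adjust to your setup:

  mount -o degraded /dev/sda /mnt/btrfsdata
  # replace a missing/failed device (here devid 15, hypothetical) with a
  # new one, rebuilding onto it:
  btrfs replace start 15 /dev/sdf /mnt/btrfsdata
  # or, with no replacement device available and enough devices left:
  btrfs device delete missing /mnt/btrfsdata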

If you /had/ tried degraded mounting, that bug may be why you're now 
unable to mount writable again, but degraded,ro is likely to still work.  
There's actually a patch for the bug that makes btrfs check the actual 
chunk allocation to see whether all chunks are accounted for on the 
present devices, allowing writable mounting if so, but it's definitely 
not in 4.1 or 4.2, tho I think it might have made 4.3.  (If so it could 
possibly be backported to stable-series 4.1 at least, but it's unlikely 
to be there yet.)


If the various degraded,recovery,ro options don't work, the next thing to 
try is btrfs restore.  This works with an unmounted filesystem using the 
userspace code, so a current btrfs-progs, preferably 4.3.0 or 4.3.1, is 
recommended for the best chance at success.

What btrfs restore does is try to read the unmounted filesystem and 
retrieve files from it, writing them to some other, mounted, filesystem 
location.  Newer btrfs restore versions have options to save ownership/
permissions and timestamp data, and to recreate symlinks as well; 
otherwise the files are written as the executing user (root) using its 
umask.  There are also options to restore only selected parts of the 
filesystem, and/or specific snapshots (which are otherwise ignored).  
Obviously you'll need enough space wherever you point restore to hold 
whatever you intend to restore, but if you didn't have a current backup, 
as people considering this option obviously didn't, that's basically the 
space you would otherwise have dedicated to backups, so it's not too 
horrible.
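
In its simplest form that's something like the below; the device and 
destination path are again only examples, and the destination must be a 
separate, already-mounted filesystem with enough free space:

  btrfs restore /dev/sda /mnt/recovery
  # with a new enough btrfs-progs, -m and -S additionally keep
  # ownership/permissions/timestamps and symlinks:
  btrfs restore -m -S /dev/sda /mnt/recovery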

With a bit of luck, restore will work without further trouble.  If it 
doesn't, the damage runs deeper, but btrfs does keep a history of its 
main tree roots, and btrfs-find-root can be used to list them, with btrfs 
restore able to take a specific root by its bytenr via the -t option.  
Here's the wiki page link with further instructions, tho last I looked it 
was a bit dated.

https://btrfs.wiki.kernel.org/index.php/Restore

A hint, in case it's not obvious from the wiki page, generation, and 
transid/transaction-id, are the same thing. =:^)
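
Roughly, the block numbers (bytenrs) that btrfs-find-root reports are 
what restore's -t option wants; prefer the highest generation that still 
works.  Device, destination and <bytenr> below are placeholders:

  btrfs-find-root /dev/sda
  btrfs restore -t <bytenr> /dev/sda /mnt/recovery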

Of course, also see the btrfs-restore manpage, which now actually lists 
the wiki link for more info.  As I said the wiki page was a bit dated 
last I looked, so definitely check the manpage, and pay attention to the 
newer options such as -l (list roots, useful with -t to see whether a 
given root is a good restore candidate), -D (dry run), and -m and -S 
(metadata and symlinks), without which files are restored as the 
executing user (root) using the present umask, with current timestamps, 
and without symlinks.
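
Put together, a cautious invocation might look like this, with device, 
destination and <bytenr> again placeholders:

  btrfs restore -l /dev/sda                                # list tree roots
  btrfs restore -D -v -t <bytenr> /dev/sda /mnt/recovery   # dry run first
  btrfs restore -m -S -t <bytenr> /dev/sda /mnt/recovery   # the real thing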


If btrfs restore fails you, then your only hope is getting a dev 
interested in your specific errors and in patches to fix them.  But of 
course, since you already saved what was most important to you, the time 
and resources you would otherwise have spent doing backups, and since 
what might be lost here is, as explained above, valued at most at 
4X-trivial, you can still be happy that you saved the really important 
stuff and that any loss really /is/ trivial.

(Seriously, when you compare the loss of a bit of data to what those 
folks in France lost recently, or what those Syrian refugees are risking 
and at times losing, their lives, or what the folks in 9/11 lost... in 
perspective, losing a bit of data here really *is* trivial.  The fact 
that we're both here at all, along with the others on the list, 
discussing this, makes us all pretty lucky, all things considered!  
Sometimes it does help to step back and get some /real/ perspective! =:^)

> Looking back in the journal (I shall now be setting up journal
> monitoring), I found lots of errors, starting last September, only a few
> weeks after converting from RAID1 to RAID6.
> Blank lines precede reboots and for the first log indicate the omission
> of over 30K entries!  The first log must represent some software bug,
> because /dev/sdh is NOT a btrfs device!

That very possibly indicates either a different device-detection order, 
and thus device letter assignment, on that boot, such that one of the 
other devices appeared as /dev/sdh, or a device dropping out and 
reappearing as sdh instead of whatever letter it had previously.  On 
today's hardware such device reordering isn't uncommon, thus the switch 
to mounting by UUID or filesystem label, for instance, as opposed to the 
now somewhat unpredictable /dev/sdX device names, since the X can change!
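
For instance, an /etc/fstab line using the UUID or label from your fi 
show output above (the mountpoint and options are only an example):

  UUID=ccde0a00-e50b-4154-977f-ac591ab580a5  /mnt/btrfsdata  btrfs  defaults  0 0
  # or equivalently, by label:
  LABEL=btrfsdata  /mnt/btrfsdata  btrfs  defaults  0 0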

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman




