Re: btrfs fail behavior when a device vanishes

Here is a kludge I hacked up.
Someone who cares could clean this up and start building a proper
test suite or something.

This test script creates a three-disk raid1 filesystem and very slowly
writes a large file onto it while, one by one, each disk is
disconnected and then reconnected in a loop.
It is fairly trivial to trigger data loss when devices are bounced like
this.

You have to run the script as root due to the calls to [u]mount and
iscsiadm.
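
Roughly, the core of it looks like this (a condensed sketch, not the
attached script itself; the IQNs, device paths, mount point, and
timings below are placeholders):

#!/bin/sh
# Sketch: slow writer on a 3-disk btrfs raid1 while each iSCSI-backed
# disk is logged out and back in, one at a time.
PORTAL=127.0.0.1
MNT=/mnt/btrfs-test
DISKS="/dev/sdb /dev/sdc /dev/sdd"
IQNS="iqn.test:disk1 iqn.test:disk2 iqn.test:disk3"

mkfs.btrfs -f -d raid1 -m raid1 $DISKS
# Assumes udev has already scanned all members for btrfs.
mount /dev/sdb "$MNT"

# Slow writer in the background: 1 MB at a time, with pauses.
( i=0
  while [ $i -lt 4096 ]; do
      dd if=/dev/zero bs=1M count=1 2>/dev/null >> "$MNT/bigfile"
      sleep 1
      i=$((i + 1))
  done ) &
WRITER=$!

# Bounce each disk in turn while the write is in flight.
for iqn in $IQNS; do
    iscsiadm --mode node --targetname "$iqn" --portal "$PORTAL" --logout
    sleep 30
    iscsiadm --mode node --targetname "$iqn" --portal "$PORTAL" --login
    sleep 30
done

wait $WRITER
umount "$MNT"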




On Thu, Dec 31, 2015 at 1:23 PM, ronnie sahlberg
<ronniesahlberg@xxxxxxxxx> wrote:
> On Thu, Dec 31, 2015 at 12:11 PM, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>> This is a torture test; no data is at risk.
>>
>> Two devices, btrfs raid1 with some stuff on them.
>> Copy from that array, elsewhere.
>> During copy, yank the active device.
>>
>> dmesg shows many of these:
>>
>> [ 7179.373245] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
>> 652123, rd 697237, flush 0, corrupt 0, gen 0
>
> For automated tests, a good approach could be to build a multi-device
> btrfs filesystem on top of iSCSI: for example, STGT exporting n
> volumes that you then mount via the loopback interface.
> Then you can use tgtadm to add/remove devices in a controlled
> fashion, and to any filesystem it will look exactly as if you had
> pulled the device physically.
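
A rough sketch of that bounce in tgtadm terms (the tid/lun numbers and
the backing file here are placeholders):

# "unplug": drop LUN 1 from target 1; the initiator sees the disk vanish
tgtadm --lld iscsi --mode logicalunit --op delete --tid 1 --lun 1
# "replug": re-add it with the same backing store
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
    --backing-store /var/tmp/disk1.img
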
>
> This allows you to run fully automated, scripted "how long before the
> filesystem goes into total data-loss mode" tests.
>
>
>
> If you want finer control than just plug/unplug on a live filesystem,
> you can use
> https://github.com/rsahlberg/flaky-stgt
> Again, this uses iSCSI, but it lets you script events such as "this
> range of blocks now returns uncorrectable read errors", so you can
> automatically stress-test that the filesystem copes with them.
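
flaky-stgt's own command syntax isn't shown here, but stock
device-mapper can express the same "bad block range" idea with an
error target. A sketch with dmsetup, sector numbers made up:

# sectors 2048-2063 of /dev/mapper/flakydisk fail with I/O errors;
# everything else passes through to /dev/sdb (lengths must add up to
# the real device size)
dmsetup create flakydisk <<'EOF'
0 2048 linear /dev/sdb 0
2048 16 error
2064 2093056 linear /dev/sdb 2064
EOF
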
>
>
> I created this STGT fork so that filesystem testers would have a way
> to automate testing of their failure paths, in particular for BTRFS,
> which still seems to be incredibly fragile when devices fail or
> disconnect.
>
> Unfortunately I don't think anyone cared very much. :-(
> Please, BTRFS devs, please use something like this to test failure
> modes and robustness. Please!
>
>
>
>>
>> Why are the write errors nearly as high as the read errors, when the
>> only activity is a copy from this device?
>>
>> Is Btrfs trying to write the read error count (for dev stats) of
>> sdc1 onto sdc1 itself, and is that what causes the write errors?
>>
>> Also, is there a command to make a block device go away? At least in
>> GNOME Shell, when I eject a USB stick it isn't just unmounted; it no
>> longer appears in lsblk or blkid. So I'm wondering if there's a way
>> to make a misbehaving device vanish so that Btrfs isn't bogged down
>> with a flood of retries.
>>
>> In case anyone is curious, the entire dmesg from device insertion,
>> formatting, mounting, copying to and then from, and device yanking
>> is here (should be permanent):
>> http://pastebin.com/raw/Wfe1pY4N
>>
>> And the copy did complete successfully anyway, and the resulting
>> files have the same hashes as their originals. So, yay, despite the
>> noisy messages.
>>
>>
>> --
>> Chris Murphy

Attachment: test_0100_write_raid1_unplug.sh
Description: Bourne shell script

Attachment: functions.sh
Description: Bourne shell script

