Re: Adventures in btrfs raid5 disk recovery

On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@xxxxxxxxx> wrote:
> > 26.06.2016 00:52, Chris Murphy wrote:
> >> Interestingly enough, so far I'm finding that with full stripe
> >> writes (i.e. 3x raid5, exactly 128KiB data writes), devid 3 is
> >> always parity. That's raid4.
> >
> > That's not what code suggests and what I see in practice - parity seems
> > to be distributed across all disks; each new 128KiB file (extent) has
> > parity on new disk. At least as long as we can trust btrfs-map-logical
> > to always show parity as "mirror 2".
> 
> 
> tl;dr Andrei is correct: there's no raid4 behavior here.
> 
> Looks like mirror 2 is always parity, more on that below.
> 
> 
> >
> > Do you see consecutive full stripes in your tests? Or how do you
> > determine which devid has parity for a given full stripe?
> 
> I do see consecutive full stripe writes, but it doesn't always happen,
> and not checking whether the writes were actually consecutive is where
> I became confused.
> 
> [root@f24s ~]# filefrag -v /mnt/5/ab*
> Filesystem type is: 9123683e
> File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456128..   3456159:     32:             last,eof
> /mnt/5/ab128_2.txt: 1 extent found
> File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456224..   3456255:     32:             last,eof
> /mnt/5/ab128_3.txt: 1 extent found
> File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456320..   3456351:     32:             last,eof
> /mnt/5/ab128_4.txt: 1 extent found
> File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456352..   3456383:     32:             last,eof
> /mnt/5/ab128_5.txt: 1 extent found
> File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456384..   3456415:     32:             last,eof
> /mnt/5/ab128_6.txt: 1 extent found
> File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456416..   3456447:     32:             last,eof
> /mnt/5/ab128_7.txt: 1 extent found
> File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456448..   3456479:     32:             last,eof
> /mnt/5/ab128_8.txt: 1 extent found
> File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456480..   3456511:     32:             last,eof
> /mnt/5/ab128_9.txt: 1 extent found
> File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456096..   3456127:     32:             last,eof
> /mnt/5/ab128.txt: 1 extent found
> 
> Starting with the bottom file, then from the top, so they're in
> 4096-byte block order; the second column is the difference from the
> previous value:
> 
> 3456096
> 3456128 32
> 3456224 96
> 3456320 96
> 3456352 32
> 3456384 32
> 3456416 32
> 3456448 32
> 3456480 32
> 
> So the first two files are consecutive full stripe writes. The next
> two aren't. The next five are. They were all copied at the same time.
> I don't know why they aren't always consecutive writes.
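The delta arithmetic above can be checked mechanically. A short sketch, with the offsets transcribed from the filefrag output:

```python
# Physical offsets (in 4096-byte blocks) from the filefrag output,
# sorted into block order: ab128.txt first, then ab128_2..ab128_9.txt.
offsets = [3456096, 3456128, 3456224, 3456320,
           3456352, 3456384, 3456416, 3456448, 3456480]

# Delta between consecutive extents. Each 128KiB file is 32 blocks,
# so a delta of 32 means the next file starts in the very next full
# stripe; a delta of 96 means two full stripes of space were skipped.
deltas = [b - a for a, b in zip(offsets, offsets[1:])]
print(deltas)  # [32, 96, 96, 32, 32, 32, 32, 32]
```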

The logical addresses don't include parity stripes, so you won't find
them with FIEMAP.  Parity locations are calculated after the logical ->
(disk, chunk_offset) translation is done (it's the same chunk_offset on
every disk, but one of the disks is parity while the others are data).
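As a toy illustration of that computation (a sketch only, assuming a simple round-robin parity rotation; the real logic lives in the kernel's btrfs chunk-mapping and raid56 code, and the actual rotation may start on a different disk):

```python
def raid5_map(logical, num_disks, stripe_len=64 * 1024,
              chunk_logical=0, chunk_physical=0):
    """Toy logical -> (data disk, parity disk, physical) mapping for
    RAID5 with rotating parity. The point: the parity disk is derived
    from the full-stripe number at map time, not stored anywhere.
    Hypothetical layout, for illustration only."""
    data_disks = num_disks - 1
    off = logical - chunk_logical
    stripe_nr = off // (stripe_len * data_disks)   # full-stripe number
    parity_disk = stripe_nr % num_disks            # rotates each stripe
    within = off % (stripe_len * data_disks)
    # Data occupies the disks that are not parity, in device order.
    data_disk = [d for d in range(num_disks)
                 if d != parity_disk][within // stripe_len]
    physical = chunk_physical + stripe_nr * stripe_len + within % stripe_len
    return data_disk, parity_disk, physical

# Three consecutive 128KiB full-stripe writes on 3 disks:
# parity lands on a different disk each time.
print([raid5_map(i * 128 * 1024, 3)[1] for i in range(3)])  # [0, 1, 2]
```

Note that the logical address space only covers the data strips, which is exactly why FIEMAP never shows a parity location.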

> [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
> mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
> mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
> mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
> mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
> mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
> mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
> mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
> mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
> mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
> mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
> mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
> mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
> mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
> mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
> mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
> mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
> mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
> mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
> 
> 
> To confirm or deny that mirror 2 is parity: the 128KiB file is 64KiB
> of "a" and 64KiB of "b", so the expected parity is 0x03. (If a file
> were 128KiB of a single repeated value, its parity would be 0x00,
> which can be confused with unwritten free space and lead to mistakes.)
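As a quick sanity check of that expected value: "a" is ASCII 0x61 and "b" is 0x62, and XORing the two halves of the stripe gives 0x03:

```python
# 'a' is ASCII 0x61, 'b' is 0x62; parity is the XOR of the two
# 64KiB data strips.
a_half = b"a" * 65536   # first 64KiB of the file
b_half = b"b" * 65536   # second 64KiB of the file
parity = bytes(x ^ y for x, y in zip(a_half, b_half))
assert parity == b"\x03" * 65536
# A file made of one repeated byte would XOR to all 0x00 instead,
# indistinguishable from unwritten space.
print(hex(0x61 ^ 0x62))  # 0x3
```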
> 
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> 
> OK, so in particular for the last five, parity is on devices b, c, a,
> b, c - that suggests parity is being distributed across consecutive
> full stripe writes.
> 
> Where I became confused is that there's not always a consecutive
> write, and that's what causes parity to land on one device less
> often. In the above example, parity goes 4x VG/a, 3x VG/c, and 2x
> VG/b.
> 
> Basically it's a bad test. The sample size is too small. I'd need to
> increase the sample size by a ton in order to know for sure if this is
> really a problem.
> 
> 
> > This
> > information is not actually stored anywhere; it is computed based on
> > block group geometry and logical stripe offset.
> 
> I think you're right. A better test would be a scrub or balance on a
> raid5 that's exhibiting slowness: find out whether there's disk
> contention on that system, and whether it's the result of parity not
> being distributed evenly.
> 
> 
> > P.S. usage of "stripe" to mean "stripe element" actually adds to
> > confusion when reading code :)
> 
> It's confusing everywhere. mdadm chunk = strip = stripe element. And
> then LVM introduces -i/--stripes, which means "data strips": if you
> choose -i 3 with the raid6 segment type, you get 5 strips per stripe
> (3 data, 2 parity). It's horrible.
> 
> 
> 
> 
> -- 
> Chris Murphy
> 
