On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 3: 2931712.. 2931715: 4: last,eof
Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)
nohostname:~ # filefrag -v /mnt/1
Filesystem type is: 9123683e
File size of /mnt/1 is 3072 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 269376.. 269376: 1: last,eof
/mnt/1: 1 extent found
But!
nohostname:~ # filefrag -v /etc/passwd
Filesystem type is: 9123683e
File size of /etc/passwd is 1527 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 4095: 0.. 4095: 4096: last,not_aligned,inline,eof
/etc/passwd: 1 extent found
nohostname:~ #
Why does it work for one filesystem but not for the other?
...
>
> So at the old address, it shows the "aaaaa..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk.
You misunderstand the nature of the problem. What is put at risk is data
that is already on disk and "shares" parity with new data.
As an example, here are the first 64K in several extents on a 4-disk
RAID5 with, so far, a single data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15491 itemsize 176
chunk length 3221225472 owner 2 stripe_len 65536
type DATA|RAID5 num_stripes 4
stripe 0 devid 4 offset 9437184
dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
stripe 1 devid 3 offset 9437184
dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
stripe 2 devid 2 offset 1074790400
dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
stripe 3 devid 1 offset 1094713344
dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
where devid 1 = sdb1, 2 = sdc1 etc.
Now let's write some data (I created several files) up to 64K in size:
mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
mirror 2 logical 1103380480 physical 9519104 device /dev/sde1
Note that btrfs allocates 64K on the same device before switching to
the next one. What is a bit misleading here is that both lines are
labeled "mirror": sdc1 is data and sde1 is parity (you can see it in
the checksum tree, where only items for sdc1 exist).
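
The address arithmetic behind this output can be sketched in a few lines
of Python. This is only an illustration, not the kernel code: the
rotation rule (data stripe index and parity position both shifted by the
row number) is an assumption, chosen because it reproduces the
btrfs-map-logical numbers in this mail. The device name for devid 3
(sdd1) is inferred from "1 = sdb1, 2 = sdc1 etc.", and map_raid5 is a
hypothetical helper name.

```python
# Sketch only: values are copied from the chunk item above; the rotation
# rule is an assumption that happens to match the quoted output.
CHUNK_START = 1103101952
STRIPE_LEN = 65536

# (device, dev extent offset), in chunk item order: stripe 0..3
STRIPES = [
    ("/dev/sde1", 9437184),     # devid 4
    ("/dev/sdd1", 9437184),     # devid 3 (device name inferred)
    ("/dev/sdc1", 1074790400),  # devid 2
    ("/dev/sdb1", 1094713344),  # devid 1
]


def map_raid5(logical):
    """Return ((data_dev, data_phys), (parity_dev, parity_phys))."""
    num_stripes = len(STRIPES)
    data_stripes = num_stripes - 1          # RAID5: one parity per row
    offset = logical - CHUNK_START
    stripe_nr, in_stripe = divmod(offset, STRIPE_LEN)
    row, data_index = divmod(stripe_nr, data_stripes)
    row_base = row * STRIPE_LEN             # same offset on every device
    d_dev, d_off = STRIPES[(row + data_index) % num_stripes]
    p_dev, p_off = STRIPES[(row + data_stripes) % num_stripes]
    return ((d_dev, d_off + row_base + in_stripe),
            (p_dev, p_off + row_base + in_stripe))


if __name__ == "__main__":
    for logical in (1103364096, 1103368192, 1103429632):
        print(logical, map_raid5(logical))
```

Feeding it 1103364096 and 1103429632 gives data on sdc1 and sdb1
respectively, and the same parity address on sde1 for both, which is the
point of the next step.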
Now let's write the next 64K and see what happens:
nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1
See? btrfs now allocates a new stripe on sdb1; this stripe is at the
same offset as the previous one on sdc1 (64K) and so shares the same
parity stripe on sde1. If you compare the 64K on sde1 at offset 9502720
before and after, you will see that it has changed. IN PLACE. Without
CoW. This is exactly what puts existing data on sdc1 at risk: if sdb1
is updated but sde1 is not, an attempt to reconstruct data on sdc1 will
either fail (if we have checksums) or result in silent corruption.
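
To make the failure mode concrete, here is a toy model in Python. It is
a sketch under stated assumptions, not btrfs code: blocks are 8 bytes
instead of 64K, the third data stripe of the row (sdd1) is assumed to be
all zeroes, sha256 stands in for the filesystem's data checksum, and the
helper names (xor_blocks, simulate) are made up for the illustration.

```python
import hashlib


def xor_blocks(*blocks):
    """XOR equally sized byte blocks together (RAID5-style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def simulate(parity_updated):
    """Write new data to sdb1, then lose sdc1 and rebuild it from the rest."""
    empty = bytes(8)
    d_sdc = b"aaaaaaaa"                      # old data, already on disk
    d_sdd = empty                            # assumed unused stripe
    csum_sdc = hashlib.sha256(d_sdc).digest()
    parity = xor_blocks(d_sdc, d_sdd, empty)  # sde1 before the new write

    d_sdb = b"gggggggg"                      # new data in the same row
    if parity_updated:
        parity = xor_blocks(d_sdc, d_sdd, d_sdb)
    # else: crash between the sdb1 write and the sde1 write

    rebuilt = xor_blocks(d_sdb, d_sdd, parity)  # reconstruct lost sdc1
    csum_ok = hashlib.sha256(rebuilt).digest() == csum_sdc
    return rebuilt, csum_ok


if __name__ == "__main__":
    print("parity updated:", simulate(True))
    print("parity stale:  ", simulate(False))
```

With the stale parity the rebuilt block is no longer the old "aaaaaaaa",
even though nothing ever wrote to sdc1. The checksum comparison is the
only thing that turns silent corruption into a detected error.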