Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 3/20/19 11:40 PM, Zygo Blaxell wrote:
On Wed, Mar 20, 2019 at 10:40:07PM +0800, Qu Wenruo wrote:


On 2019/3/20 下午10:00, Anand Jain wrote:

   Also any idea why the generation number for the extent data is not
   incremented [2] when -o nodatacow and notrunc option is used, is it
   a bug? the dump-tree is taken with the script as below [1]
   (this corruption is seen with or without generation number is
   being incremented, but as another way to fix for the corruption we can
   verify the inode EXTENT_DATA generation from the same disk from which
   the data is read).

For the generation part, it's the generation when data is written to
disk.

Truncation/nocow overwrite shouldn't really change the generation of
existing file extents.

So I'm afraid you can't use that generation to do the check.

  Any idea why it shouldn't change? Albeit there isn't new allocation
  due to nodatacow and notrunc overwrite, but sure data is overwritten.

The references to the extent in the subvol trees hold a copy of the
extent's generation, so if the extent's generation is modified, all the
references to the extent in all the subvol trees have to be modified too,
or they fail transid verification later on.


 Extents with refs > 1 will cow , so gen # is incremented on the
 overwritten extents. But my scenario is when refs = 1. As one of
 the idea I am mulling is to track/record missed generation numbers
 on the disks and quarantine those reads on those disks accordingly.


Compared to pure datacow
(nodatasum, compress=none), this would be the same number of iops, only
fragmentation can be saved because of block overwrites (and even that
isn't saved all the time).

  If that's the case then I would guess there will be bug in send receive
  as well.

Send requires a read-only snapshot, > and the snapshot's reference to
the nodatacow extents automatically turns on datacow for those extents.
Thus, send behaves correctly because nodatacow is disabled.

The nodatacow flag is advisory.  It doesn't prevent btrfs from relocating
data when needed.

 Oh yes. I forgot, there is rdonly snapshot prerequisite for the sends.

Thanks, Anand

I'm not sure about the send part.

On the other hand, if btrfs is going to update the generation of
nodatacow file extent overwrite, it should cause pretty big performance
degradation.

The idea of nodatacow is to skip all the expensive csum, extent
allocation (maybe not that expensive) and the race of subvol tree.

nodatacow also skips RAID data integrity checks.  Generally, applications
and admins should plan for any data put in a nodatacow file to be silently
corrupted at any time, i.e. the same situation as an ext4-on-mdadm setup.
This is the price for minimizing the overhead for a write to a nodatacow
extent.

If we're going to update file extents for such case, we're re-introduce
performance impact to users who don't want that impact at all.
I don't believe it's worthy at all.

Thanks,
Qu


Thanks, Anand

Thanks,
Qu


[1]
   umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
   mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
   echo 1st write: && \
   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
conv=fsync,notrunc && sync && \
   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
   echo --- && \
   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
   echo 2nd write: && \
   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
conv=fsync,notrunc && sync && \
   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
   echo --- && \
   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"


1st write:
      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
          generation 6 transid 6 size 4096 nbytes 4096
          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
          sequence 1 flags 0x3(NODATASUM|NODATACOW)
          atime 1553058460.163985452 (2019-03-20 13:07:40)
          ctime 1553058460.163985452 (2019-03-20 13:07:40)
          mtime 1553058460.163985452 (2019-03-20 13:07:40)
          otime 1553058460.163985452 (2019-03-20 13:07:40)
---
      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
          generation 6 type 1 (regular)
          extent data disk byte 13631488 nr 4096
          extent data offset 0 nr 4096 ram 4096
          extent compression 0 (none)
2nd write:
      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
          generation 6 transid 7 size 4096 nbytes 4096
          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
          sequence 2 flags 0x3(NODATASUM|NODATACOW)
          atime 1553058460.163985452 (2019-03-20 13:07:40)
          ctime 1553058460.189985450 (2019-03-20 13:07:40)
          mtime 1553058460.189985450 (2019-03-20 13:07:40)
          otime 1553058460.163985452 (2019-03-20 13:07:40)
---
      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
          generation 6 type 1 (regular)   <----- [2]
          extent data disk byte 13631488 nr 4096
          extent data offset 0 nr 4096 ram 4096
          extent compression 0 (none)


Thanks, Anand








[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux