Re: uncorrectable errors after btrfs replace

hi Chris

thanks for your reply. I was unable to save the filesystem. Even after deleting all but 4 GB I still had too many errors, so I just reformatted the device.  I'm glad that it was my backups and not my data.

On 18/08/13 23:43, Chris Murphy wrote:
On Aug 18, 2013, at 1:12 PM, Stuart Pook <slp644161@xxxxxxx> wrote:

6  btrfs filesystem resize 580g .

You first shrank a 2TB btrfs file system on dmcrypt device to 590GB.
But then you didn't resize the dm device or the partition?

no, I had no need to resize the dm device or partition.  I had just read that when doing a replace the new device must be no smaller than the old device.  So I shrank the filesystem on the old device using "btrfs filesystem resize".  Once the resize worked I was able to do the replace; I didn't try to replace before resizing.

This is what btrfs(1) says on Debian: "The targetdev needs to be same size or larger than the srcdev."  I may be confused here.
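
A minimal sketch of that size rule, assuming it amounts to "target size in bytes >= source size in bytes". The helper name replace_target_ok is made up for illustration. Which size btrfs uses for the source (the raw device, or the size the filesystem occupies on it after a resize) is an assumption to verify; blockdev --getsize64 only reports the raw device size.

```shell
# Hypothetical helper: succeeds when the target is at least as large
# as the source. Both arguments are sizes in bytes.
replace_target_ok() {
    src_bytes=$1
    tgt_bytes=$2
    [ "$tgt_bytes" -ge "$src_bytes" ]
}

# Example use (needs root; device names are the ones from this thread):
#   src=$(blockdev --getsize64 /dev/dm-11)
#   tgt=$(blockdev --getsize64 /dev/dm-12)
#   replace_target_ok "$src" "$tgt" || echo "target smaller than source"
```

If the raw source device is larger than the target, the check fails even though the filesystem may already have been shrunk, which is the case described above.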

9  time btrfs balance start -musage=1 -dusage=1 . && time btrfs filesystem resize 580g .

I was surprised that the resize to 580 GB didn't work, so I tried a rebalance before attempting the resize to 580 GB again.  It still didn't work (not enough space), but a resize to 590 GB did.

10  time  btrfs filesystem resize 590g .

this worked

You followed the resize of the fs, but not the underlying devices,
with a balance, then resized it two more times?

The resize to 580 didn't work. So I did a balance.  The resize to 580 still didn't work so I resized to 590.

This is weird, but also makes the sequence difficult to follow.

13  time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups
14  time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups

Why is this command repeated? What's with the numbering system that
skips numbers?

The command is repeated because I cancelled it by mistake by setting the filesystem to readonly.  I'm not sure whether I restarted it by rerunning the replace or just by remounting the filesystem readwrite in another window.

I'll put all of the commands at the end of this list.

Aug 18 12:28:17 kooka kernel: [54139.448029] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Bad connection so libata is dropping the link from 3 Gbps to 1.5 Gbps.
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age Always - 12080

This confirms that both ends of the cable are sensing communication
problems between drive and controller. The cable needs to be
replaced, likely it's the connector not the cable itself.
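
A small sketch for watching that counter, assuming the usual ten-column smartctl attribute table in which the raw value is the tenth field (as in the line quoted above). The function name crc_error_count is invented for illustration.

```shell
# Hypothetical helper: reads smartctl attribute output on stdin and
# prints the raw value of attribute 199 (UDMA_CRC_Error_Count).
crc_error_count() {
    awk '$1 == 199 && $2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Example use (needs root):
#   smartctl -A /dev/sdd | crc_error_count
```

Running it before and after a large copy shows whether the count is still increasing with the current cable or dock.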

I think that I should stop using my SATA dock with the SATA ports on my motherboard which are probably not designed to be hot plugged.

I guess that /disks/backup is mostly dead and that I should just
reformat it.  What do you think?

Well I think I'd try to simplify this drastically and see if you've
got a reproducing bug.

I ran a badblocks scan on the raw device (not the luks device) and didn't get any errors.

The steps you've got I find mostly incoherent, so I can't try to do
what you did to see if it's reproducible.

yes, this was the first time I've tried this.  And just to make this more difficult some commands were typed in a different window.
Next time I'll watch /var/log/syslog but I would have preferred
that "btrfs replace" stop when getting errors.

The errors should be self correcting, but the mere fact they're
happening means that some errors could be occurring but aren't
detected. If the data is corrupting in-transit, but the drive or
controller didn't report a problem, then btrfs has no way of knowing
it was written incorrectly.

The data was written to the WD-Blue (640 GB) disk and then copied off it.  The only errors I saw concerned the WD-Blue.  If the errors were data corruption on writing to or reading from the WD-Blue, then I would have thought that the checksums would have told me that something was wrong.  btrfs didn't give me an IO error until I started to read the files once the data was on the final disk.

Does "btrfs replace" check the checksums as it reads the data from the disk that is being replaced?

Just to be clear. This is the series of btrfs replace I did:

backups : HD204UI -> WD-Blue
/mnt : WD-Black -> HD204UI
backups : WD-Blue -> WD-Black

I guess that my backups were corrupted as they were written to or read from the WD-Blue. Wouldn't the checksums have detected this problem before the data was written to the WD-Black?

There's only so much software can do to overcome blatant hardware problems.

I was hoping to be informed of them.

But, it seems unlikely such a high percent of errors would go
undetected to result in so many uncorrectable errors, so there may be
user error here along with a bug.

I'm not sure how I could have done it better. Does "btrfs replace" check that the data is correctly written to the new disk before it is removed from the old disk?  Should I have used the 2 disks to make a RAID-1 and then done a scrub before removing the old disk?
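
One possible answer to that question, as a rough sketch only: run the replace in the foreground, then scrub the filesystem before the old disk is wiped or reused. The wrapper name replace_then_scrub and the BTRFS variable are made up for illustration. It assumes that "btrfs scrub start -B" exits non-zero when it finds uncorrectable errors, which may depend on the btrfs-progs version.

```shell
# Command to run; override (e.g. BTRFS=echo) for a dry run.
BTRFS=${BTRFS:-btrfs}

# Hypothetical wrapper: replace SRC with TGT on the filesystem mounted
# at MNT, then scrub, and report success only if both steps succeed.
replace_then_scrub() {
    src=$1
    tgt=$2
    mnt=$3
    # -B keeps the replace in the foreground so its exit status is seen.
    "$BTRFS" replace start -B "$src" "$tgt" "$mnt" || return 1
    # -B waits for the scrub to finish, -d prints per-device statistics.
    "$BTRFS" scrub start -Bd "$mnt" || return 1
    echo "replace and scrub finished without a reported failure"
}

# Example use (needs root; names are the ones from this thread):
#   replace_then_scrub /dev/dm-11 /dev/dm-12 /disks/backups
```

The old disk would be kept untouched until the final message appears, so the data can still be copied again if the scrub reports problems.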

Here is the complete list of commands I ran in the main terminal:

    1  cd /disks/backups/
    2  btrfs filesystem df
    3  btrfs filesystem df  ,
    4*
    5  btrfs filesystem df  .
    6  btrfs filesystem resize 580g .
    7  date
    8  btrfs filesystem df  .
    9  time btrfs  balance start -musage=1 -dusage=1 . && time  btrfs filesystem resize 580g .
   10  time  btrfs filesystem resize 590g .
   11  btrfs filesystem show
   12  cryptsetup luksOpen /dev/sdd2 640Gb
   13  time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups
   14  time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups
   15  cd /
   16  btrfs filesystem show
   17  btrfs filesystem show
   18  cryptsetup remove _dev_sdc2
   19  fdisk /dev/sdc
   20  fdisk /dev/sdc
   21  fdisk -c /dev/sdc
   22  fdisk -c=dos /dev/sdc
   23  fdisk /dev/sdc
   24  fdisk -c=dos /dev/sdc
   25  l /mnt
   26  mount /dev/sdb1 /mnt
   27  l /mnt
   28  btrfs subv list /mnt
   29  btrfs filesystem show
   30  #time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups
   31  fdisk -l /dev/sdc
   32  time btrfs replace start  /dev/sdb1  /dev/sdc2 -B /mnt
   33  btrfs filesystem show
   34  btrfs filesystem label  /dev/dm-12
   35   btrfs filesystem label /disks/backups
   36   btrfs filesystem label /disks/backups backups2Tb
   37  btrfs filesystem show
   38   btrfs filesystem label /disks/backups
   39  cryptsetup luksFormat /dev/sdb2
   40  cryptsetup luksAddKey /dev/sdb2
   41  cryptsetup open  /dev/sdb2 newbackups
   42  l /dev/mapper/newbackups
   43  time btrfs replace start  /dev/dm-12  /dev/dm-11 -B /disks/backups
   44  btrfs filesystem show
   45  cryptsetup status 640Gb
   46  cryptsetup remove 640Gb
   47  btrfs filesystem show
   48  btrfs filesystem df /disks/backups/
   49  btrfs filesystem resize max /disks/backups/
   50  btrfs filesystem df /disks/backups/
   51  btrfs filesystem show
   52  vi /etc/cron.daily/storebackup
   53  vi /etc/cron.daily/stuart
   54  /etc/local/backups
   55  mount
   56  mount -o remount,rw /disks/backups/
   57  time  btrfs  scrub start -Bd /disks/backups
   58  smartctl -a   /dev/sdb
   59  smartctl -a   /dev/sdc
   60  smartctl -a   /dev/sdd
   61  smartctl -t short   /dev/sdd
   62  sleep 2m;  smartctl -a   /dev/sdd
   63  history > /tmp/root.commands

Which disk is which?

WD-Black ata-WDC_WD2002FAEX-007BA0_WD-WCAY00589823 -> ../../sdb
HD204UI ata-ST2000DL004_HD204UI_S2H7J90C549571 -> ../../sdc
WD-Blue  ata-WDC_WD6400AAKS-00A7B2_WD-WMASY2546840 -> ../../sdd

please let me know if I can be any clearer, thanks
Stuart