Duncan <1i5t5.duncan@xxxxxxx> wrote:
> covici posted on Fri, 25 Dec 2015 16:14:58 -0500 as excerpted:
>
> > Henk Slager <eye1tm@xxxxxxxxx> wrote:
> >
> >> On Fri, Dec 25, 2015 at 11:03 AM, <covici@xxxxxxxxxxxxxx> wrote:
> >> > Hi. I created a file system using version 4.3.1 of btrfs-progs and
> >> > have been using it for some three days. I have gotten the following
> >> > errors in the log this morning:
>
> >> > Dec 25 04:10:16 ccs.covici.com kernel: BTRFS (device dm-20): parent
> >> > transid verify failed on 51776421888 wanted 4983 found 4981
>
> [Several of these within a second, same block and transids, wanted 4983,
> found 4981.]
>
> >> > The file system was then made read only. I unmounted, did a check
> >> > without repair which said it was fine, and remounted successfully in
> >> > read/write mode, but am I in trouble? This was on a solid state
> >> > drive using lvm.
> >> What kernel version are you using?
> >> I think you might have some hardware error or glitch somewhere,
> >> otherwise I don't know why you have such errors. These kinds of errors
> >> remind me of SATA/cable failures over quite a period of time (multiple
> >> days). Or something with lvm or trim of the SSD.
> >> Anything unusual with the SSD if you run smartctl?
> >> A btrfs check will indeed likely result in an OK for this case.
> >> What about running a read-only scrub?
> >> Maybe running memtest86+ can rule out the worst case.
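
[For anyone wanting to follow those suggestions, a minimal sketch,
assuming the filesystem is mounted at /mnt/btrfs; with lvm in the
picture, smartctl has to be pointed at the underlying physical disk
(e.g. /dev/sda), not at the dm device:

    # SMART attributes of the physical disk under the LVM volume
    smartctl -A /dev/sda

    # read-only scrub: -B stays in the foreground, -r reports but never repairs
    btrfs scrub start -Br /mnt/btrfs

Both are non-destructive, so they are safe to try before anything else.]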
> >
> > I am running 4.1.12-gentoo and btrfs-progs 4.3.1. The same thing
> > happened on another filesystem, so I switched them over to ext4 and
> > have had no troubles since. As far as I know the SSD drives are fine;
> > I have been using them for months. Maybe btrfs needs some more work.
> > I did do scrubs on the filesystems after I went offline and remounted
> > them; they were successful, and I got no errors from the lower layers
> > at all. Maybe I'll try this again in a year or so.
>
> Well, as I seem to say every few posts, btrfs is "still stabilizing, not
> fully stable and mature", so it's a given that more work is needed, tho
> it's demonstrated to be "stable enough" for many in daily use, as long as
> they're generally aware of stability status and are following the admin's
> rule of backups[1] with the increased risk-factor of running "still
> stabilizing" filesystems in mind.
>
> The very close generation/transid numbers, only two commits apart, for
> the exact same block, within the same second, indicate a quite recent
> block-write update failure, possibly only a minute or two old. You could
> tell how recent by reading the generation/transid from the superblock
> (using btrfs-show-super) as close to the time of the error as possible,
> and seeing how far ahead of the logged transid it is.
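
[Concretely, that comparison might look like the following; /dev/dm-20 is
taken from the log line earlier in the thread (the /dev/mapper name works
equally well), and btrfs-show-super prints a "generation" field holding
the transid of the last committed transaction:

    # current superblock generation, to set against "wanted 4983 found 4981"
    btrfs-show-super /dev/dm-20 | grep ^generation

The smaller the gap between that number and the transid in the error, the
more recently the failed write happened.]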
>
> I'd check smartctl -A for the device(s), then run scrub and check it
> again, to see if the raw number for ID5, Reallocated_Sector_Ct (or
> similar for your device) changed. (I have some experience with this.[2])
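
[In command form, that before/after check might look roughly like this,
again with /dev/sda standing in for the physical disk and /mnt/btrfs for
the mount point:

    smartctl -A /dev/sda | grep -i reallocated   # note the RAW_VALUE column
    btrfs scrub start -B /mnt/btrfs              # foreground scrub, repairs what it can
    smartctl -A /dev/sda | grep -i reallocated   # has the raw count changed?

Run as root; the scrub may take a while on a large filesystem.]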
>
> If the raw reallocated sector count goes up, it's obviously the device.
> If it doesn't but scrub fixes an error, then it's likely elsewhere in the
> hardware (cabling, power, memory or storage bus errors, sata/scsi
> controller...). If scrub detects but can't fix the error, the lack of a
> fix is probably due to single mode, with the original error possibly due
> to a bad shutdown/umount or a btrfs bug. If scrub says it's fine, then
> whatever it was was temporary and could be due to all sorts of things,
> from a cosmic-ray-induced memory error, to a btrfs bug, to...
>
> In any case, if scrub fixes or doesn't detect an error, I'd not worry
> about it too much, as it doesn't seem to be affecting operation, you
> didn't get a lockup or backtrace, etc. In fact, I'd take that as
> indication of btrfs normal problem detection and self-healing, likely due
> to being able to pull a valid copy from elsewhere due to raidN or dup
> redundancy or parity.
>
> Tho there's no shame in simply deciding btrfs is still too "stabilizing,
> not fully stable and mature" for you, either. I know I'd still hesitate
> to use it in a full production environment, unless I had both good/tested
> backups and failover in place. "Good enough for daily use, provided
> there's backups if you don't consider the data throwaway", is just that;
> it's not really yet good enough for "I just need it to work, reliably,
> because it's big money and people's jobs if it doesn't."
>
> ---
> [1] Admin's rule of backups: For any given level of backup, you either
> have it, or by your actions are defining the data to be of less value
> than the hassle and resources taken to do the backup, multiplied by the
> risk factor of actually needing that backup. As a consequence,
> after-the-fact protests to the contrary are simply lies: actions spoke
> louder than words, and they defined the time and hassle saved as the
> more valuable thing. What was defined as most valuable was therefore
> saved in any case, and the user should be happy they saved it, even if
> the data got lost.
>
> And of course with btrfs still stabilizing, that risk factor remains
> somewhat elevated, meaning more levels of backups need to be kept, for
> relatively lower value data.
>
> But AFAIK, you've stated elsewhere that you have backups, so this is more
> for completeness and for other readers than for you, thus its footnoting,
> here.
...
...
The show stopper for me was that the file system was put into read-only
mode. Scrub would not run on a read-only filesystem, so even though the
scrub eventually came back clean, I had to unmount the fs, run the check
(which maybe I didn't really need to do), and remount, and that is not
practical for me. So, even though I had no actual data loss, I had to
conclude it was not worth it for the time being.
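
(For the record, the offline sequence in question was roughly the
following, with /mnt/data and /dev/dm-20 standing in for the real mount
point and device:

    umount /mnt/data
    btrfs check /dev/dm-20      # read-only check, no --repair
    mount /dev/dm-20 /mnt/data

Having to take the filesystem offline for that is what made it
impractical here.)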
--
Your life is like a penny. You're going to lose it. The question is:
How do you spend it?
John Covici
covici@xxxxxxxxxxxxxx