Re: Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error)

On 02/01/2020 12:34, Qu Wenruo wrote:
> 
> 
> On 2020/1/2 8:07 PM, Graham Cobb wrote:
>> On 02/01/2020 01:26, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/1/2 7:35 AM, Graham Cobb wrote:
>>>> I have a problem on one BTRFS filesystem. It is not a critical
>>>> filesystem (it is used for backups) and I have not yet tried even
>>>> unmounting and remounting, let alone a "btrfs check".
>>>>
>>>> The problem seems to be that after several iterations of running 'btrfs
>>>> scrub' for 30 minutes, then pausing for a while, then resuming the
>>>> scrub, I got a transaction aborted with an EFBIG error and a warning in
>>>> the kernel log. The fs went readonly, and transid verify errors are now
>>>> reported. The original log extract is available at
>>>> http://www.cobb.uk.net/kern.log.bug-010120 but I have pasted the key
>>>> part below.
>>>
>>> EFBIG in btrfs is very rare, and can only be caused by too many system
>>> chunks.
>>>
>>> The most common reason is the chunk pre-allocation for scrub, which
>>> also matches your situation.
>>>
>>> There is already a fix for it, and it will land in the v5.5 kernel.
>>> It looks like we should backport it.
>>
>> Thanks Qu. I will wait for that kernel, and maybe stop my monthly scrubs
>> (although my several other btrfs filesystems did not have a problem this
>> month fortunately).
> 
> And the problem will normally not impact the fs, as newly created empty
> system chunks will soon be cleaned up.
> 
>>
>> I am getting transid errors:
> 
> This is not good news. In fact, it's normally a deadly problem.

In fact, this was not a real problem: the errors were because the
filesystem was still mounted from the original error and had gone ro so
I guess the in-memory state was different from the on-disk state.  Doh!
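
For anyone reading this in the archive: a minimal sketch of how the
generation numbers in the transid message quoted further down can be
pulled out and compared. It is not part of btrfs-progs; the function
name and the regular expression are illustrative assumptions based only
on the one log line in this thread.

```python
import re

# Pattern modelled on the single kernel message quoted in this thread;
# other kernel versions may word the message differently (assumption).
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def parse_transid_error(line):
    """Return (bytenr, wanted, found) as ints, or None if the line does not match."""
    # Collapse whitespace so a message wrapped by a mail client still matches.
    match = TRANSID_RE.search(" ".join(line.split()))
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# The log line exactly as it appears (wrapped) later in this mail.
sample = (
    "Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device\n"
    "sdc3): parent transid verify failed on 16216583520256 wanted 301800\n"
    "found 301756"
)

bytenr, wanted, found = parse_transid_error(sample)
# "found" lower than "wanted": the block that was read carries an older
# generation than its parent expects.
print(f"block {bytenr}: wanted {wanted}, found {found}, behind by {wanted - found}")
```

On the line from this thread it reports a gap of 44 generations
(301800 wanted, 301756 found).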

A simple umount and mount worked fine, although I then did a btrfs check
which also worked fine:

black:~# btrfs check --readonly -p /dev/sdc3
Opening filesystem to check...
Checking filesystem on /dev/sdc3
UUID: 4d1ba5af-8b89-4cb5-96c6-55d1f028a202
[1/7] checking root items                      (0:06:27 elapsed, 25179174 items checked)
[2/7] checking extents                         (6:34:26 elapsed, 2419791 items checked)
cache and super generation don't match, space cache will be invalidated
[3/7] checking free space tree                 (0:00:00 elapsed)
[4/7] checking fs roots                        (25:44:17 elapsed, 1497725 items checked)
[5/7] checking csums (without verifying data)  (0:54:36 elapsed, 4812627 items checked)
[6/7] checking root refs                       (0:00:00 elapsed, 1067 items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 11946687545430 bytes used, no error found
total csum bytes: 11626743024
total tree bytes: 39628275712
total fs tree bytes: 24636817408
total extent tree bytes: 2363850752
btree space waste bytes: 5422658757
file data blocks allocated: 29159815589888
 referenced 16100593688576
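
The byte counts in the summary are easier to read in binary units. A
small sketch of the conversion, using figures copied from the output
above; the helper name to_tib is hypothetical and nothing here is a
btrfs tool.

```python
TIB = 1024 ** 4  # bytes per tebibyte


def to_tib(num_bytes):
    """Convert a byte count to tebibytes."""
    return num_bytes / TIB


# Values copied from the btrfs check summary above.
summary = {
    "bytes used": 11946687545430,
    "file data referenced": 16100593688576,
    "file data allocated": 29159815589888,
    "total tree bytes": 39628275712,
}

for label, value in summary.items():
    print(f"{label}: {to_tib(value):.2f} TiB")
```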

Thanks again for the help, and for the design which prevented fs
corruption in this case.

I would encourage you to consider backporting the fix for the original
EFBIG problem, as you suggested above.

Graham

> 
>>
>>>> Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device
>>>> sdc3): parent transid verify failed on 16216583520256 wanted 301800
>>>> found 301756
>>
>> I presume 301800 is the transaction which failed and caused the fs to go
>> readonly. I don't suppose there is any way I could revert the whole fs
>> to the state of the last successful transaction, is there?
> 
> This means some tree blocks didn't reach disk.
> It can be deadly, or just a side effect caused by the transaction abort.
> 
>>
>> It is not a big problem: the fs only contains backup snapshots (not my
>> only backups!) although it would be nice to recover the historical
>> snapshots if I could (I used them to research a bug I reported to debian
>> just the other day!).
> 
> I'm afraid this depends on where the corruption is.
> 
> If it's just caused by that EFBIG error, and btrfs check reports no
> error, then it's just a temporary problem caused by the transaction abort.
> 
> 
> If it's in the extent tree, it only affects mount or certain write
> operations, but if you can mount the fs, it should be OK to read out the
> whole fs.
> 
> If it's in the csum tree, it will affect certain data reads, but
> otherwise things are mostly OK.
> 
> If it's in the subvolume trees, some directories/files can't be accessed.
> 
> So, please run a btrfs check on the unmounted fs to find out which case applies.
> 
> Thanks,
> Qu
> 
>>
>> Regards
>> Graham
>>
> 



