On Tue, Aug 09, 2016 at 07:41:42PM +0000, Hugo Mills wrote:
> On Tue, Aug 09, 2016 at 03:22:03PM -0400, Chris Mason wrote:
> >
> >
> > On 08/09/2016 03:11 PM, Hugo Mills wrote:
> > >On Tue, Aug 09, 2016 at 06:27:33PM +0000, Hugo Mills wrote:
> > >>On Tue, Aug 09, 2016 at 02:26:14PM -0400, Chris Mason wrote:
> > >>>On 08/09/2016 02:23 PM, Hugo Mills wrote:
> > >>>> Hi, Chris,
> > >>>>
> > >>>>On Tue, Aug 09, 2016 at 02:02:20PM -0400, Chris Mason wrote:
> > >>>>>On 08/09/2016 01:27 PM, Hugo Mills wrote:
> > >>>>>> Over the weekend, I started doing some maintenance on my server: I
> > >>>>>>upgraded to 4.7.0, and I started deleting a device from my array,
> > >>>>>>preparatory to putting in a larger one. About halfway through the
> > >>>>>>operation, several kernel threads hung up for a while (resulting in
> > >>>>>>"blocked for 120s" messages), and then the delete process seems to
> > >>>>>>have stopped entirely, although several kernel threads are at maximum
> > >>>>>>usage.
> > >>>>>>
> > >>>>>> After a few hours, I rebooted the machine, and left it for a day or
> > >>>>>>so. I tried the delete again this afternoon, and it's done the same
> > >>>>>>thing again. The full log is included below. I have a kworker and a
> > >>>>>>btrfs-transaction pegged at close to 100% of a core each, and a
> > >>>>>>btrfs-cleaner (and the btrfs dev del process) in D state.
> > >>>>>>
> > >>>>>> The FS was not under load at the time of the failure, and it passes
> > >>>>>>scrub. I haven't tried a btrfs check yet.
> > >>>>>
> > >>>>>Thanks Hugo, can you nail down which line of code belongs to:
> > >>>>>
> > >>>>>btrfs_async_run_delayed_refs+0xc6
> > >>>>
> > >>>> I'm having a spot of trouble with this. The btrfs on this kernel is
> > >>>>built-in, and I've lost the contents of the build directory (it's done
> > >>>>by an overnight build script, and it's already built a 4.8-rc1 for one
> > >>>>of my other machines).
> > >>>>
> > >>>>(gdb) file /boot/vmlinuz-4.7.0-dirty
> > >>>>BFD: /boot/vmlinuz-4.7.0-dirty: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
> > >>>>Reading symbols from /boot/vmlinuz-4.7.0-dirty...(no debugging symbols found)...done.
> > >>>>(gdb) list *btrfs_async_run_delayed_refs+0xc6
> > >>>>No symbol table is loaded. Use the "file" command.
> > >>>>
> > >>>> There must be a way of getting this info from here, but I'm not
> > >>>>sure I know what it is. Build a new kernel from 4.7 with this
> > >>>>machine's config and run gdb on the btrfs.o file? Not a problem to do,
> > >>>>but it might take a little while.
> > >>>
> > >>>As long as you use the same gcc and config file, it'll almost always
> > >>>generate the same offsets/code. You can recompile with debug
> > >>>symbols on and it'll be accurate.
> > >>
> > >> OK. Back later.
> > >
> > >(gdb) file fs/btrfs/btrfs.o
> > >Reading symbols from fs/btrfs/btrfs.o...done.
> > >(gdb) list *btrfs_async_run_delayed_refs+0xc6
> > >0x13dae is in btrfs_async_run_delayed_refs (fs/btrfs/extent-tree.c:2915).
> > >2910
> > >2911        btrfs_queue_work(root->fs_info->extent_workers, &async->work);
> > >2912
> > >2913        if (wait) {
> > >2914                wait_for_completion(&async->wait);
> > >2915                ret = async->error;
> > >2916                kfree(async);
> > >2917                return ret;
> > >2918        }
> > >2919        return 0;
> >
> > So it's waiting on the actual delayed ref work but we don't see them
> > in the stack trace.
> >
> > Can you please sysrq-w and sysrq-t?
>
> Not right now -- I wanted to watch a film, and rebooted the machine
> to get NFS working again. It's now refusing to boot. I think that's
> unrelated to this issue (different filesystems are involved for a
> start), but it's stopping me from doing anything else. :(
>
> I'll reproduce the failure, and get back to you with the sysrq
> dumps tomorrow.

   I restarted the device delete this evening, and it's not thrown the
error again. I'll be doing some more device reorganisation over the
next week or two, so it may happen again, but for now, I think we'll
have to write this report off as an undiagnosed transient error.

   Hugo.

--
Hugo Mills | I thought I'd discovered a new colour, but it was
hugo@... carfax.org.uk | just a pigment of my imagination.
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
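
[Editor's note] The symbol lookup quoted above (rebuild with the same gcc and config, debug symbols on, then ask gdb about symbol+offset) can be scripted. What follows is only a minimal sketch in Python, not something anyone in the thread used. The helper names parse_frame and gdb_list_argv are hypothetical, and the optional "/0xSIZE" suffix is handled on the assumption that entries are pasted straight from a kernel stack trace. The sketch only builds the gdb command line; it does not run gdb.

```python
import re

# Matches "symbol+0xOFFSET", optionally followed by "/0xSIZE" as kernel
# stack traces usually print it (the size suffix is an assumption here;
# the thread only quotes "symbol+0xc6").
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-fA-F]+)"
    r"(?:/(?P<size>0x[0-9a-fA-F]+))?"
)


def parse_frame(text):
    """Return (symbol, offset) for the first symbol+offset found in text."""
    match = FRAME_RE.search(text)
    if match is None:
        raise ValueError(f"no symbol+offset found in {text!r}")
    return match.group("symbol"), int(match.group("offset"), 16)


def gdb_list_argv(frame, object_file):
    """Build a gdb batch-mode argv that lists the source around a frame.

    object_file must have been built with debug symbols, from the same
    compiler and config as the running kernel, or the offsets may differ.
    """
    symbol, offset = parse_frame(frame)
    return ["gdb", "-batch", "-ex", f"list *{symbol}+{offset:#x}", object_file]


if __name__ == "__main__":
    argv = gdb_list_argv("btrfs_async_run_delayed_refs+0xc6", "fs/btrfs/btrfs.o")
    print(" ".join(argv))
```

The resulting list can be passed to subprocess.run() on a machine that has gdb and the rebuilt object file. The "list *symbol+offset" expression is the same one typed interactively earlier in the thread.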
