Re: OSD doesn't start
- Subject: Re: OSD doesn't start
- From: Székelyi Szabolcs <szekelyi@xxxxxxx>
- Date: Fri, 06 Jul 2012 01:33:13 +0200
- In-reply-to: <95834053.QbLuzMQ4OG@mranderson>
- Organization: NIIFI
- References: <1563053.ttVafs9Pph@mranderson> <F1FB8F95B3FA4FF19D53AE9F060D88F5@inktank.com> <95834053.QbLuzMQ4OG@mranderson>
- User-agent: KMail/4.8.4 (Linux/3.2.0-26-generic; KDE/4.8.4; x86_64; ; )
On 2012. July 5. 16:12:42 Székelyi Szabolcs wrote:
> On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> > Hrm, it looks like the OSD data directory got a little busted somehow. How
> > did you perform your upgrade? (That is, how did you kill your daemons, in
> > what order, and when did you bring them back up.)
>
> Since it would be hard and long to describe in text, I've collected the
> relevant log entries, sorted by time at http://pastebin.com/Ev3M4DQ9 . The
> short story is that after seeing that the OSDs won't start, I tried to bring
> down the whole cluster and start it up from scratch. It didn't change
> anything, so I rebooted the two machines (running all three daemons), to
> see if it changes anything. It didn't and I gave up.
>
> My ceph config is available at http://pastebin.com/KKNjmiWM .
>
> Since this is my test cluster, I'm not very concerned about the data on it.
> But the other one, with the same config, is dying I think. ceph-fuse is
> eating around 75% CPU on the sole monitor ("cc") node. The monitor about
> 15%. On the other two nodes, the OSD eats around 50%, the MDS 15%, the
> monitor another 10%. No Ceph filesystem activity is going on at the moment.
> Blktrace reports about 1kB/s disk traffic on the partition hosting the OSD
> data dir. The data seems to be accessible at the moment, but I'm afraid
> that my production cluster will end up in a similar situation after
> upgrade, so I don't dare to touch it.
>
> Do you have any suggestion what I should check?
Yes, it definitely looks like it is dying. Besides the symptoms above,
ceph-fuse burns CPU on all clients, there are unreadable files on the fs (tar
blocks on them indefinitely), and the FUSE clients emit messages like
ceph-fuse: 2012-07-05 23:21:41.583692 7f444dfd5700 0 -- client_ip:0/1181
send_message dropped message ping v1 because of no pipe on con 0x1034000
every 5 seconds. I tried to back up the data on it, but the backup blocked
partway through. Since then I have been unable to get any data out of it, not
even by killing ceph-fuse and remounting the fs.
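
In case it helps anyone triaging the same symptom, here is a minimal Python
sketch for tallying these "dropped message" lines per connection in a saved
client log. The regular expression is inferred only from the single sample
line quoted above, not from any documented Ceph log format, and the function
name and the log path in the usage note are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the one ceph-fuse sample line in this message;
# it is an assumption, not a documented Ceph log format.
DROPPED = re.compile(
    r"send_message dropped message (?P<msg>.+?) "
    r"because of no pipe on con (?P<con>0x[0-9a-fA-F]+)"
)


def count_dropped(lines):
    """Return a Counter mapping connection address to dropped-message count."""
    counts = Counter()
    for line in lines:
        match = DROPPED.search(line)
        if match:
            counts[match.group("con")] += 1
    return counts
```

It could be run over a captured log with something like
`count_dropped(open("/path/to/ceph-fuse.log"))`, where the path is a
placeholder for wherever the client output was saved. A count that keeps
growing for one connection address would match the "every 5 seconds" pattern
described above.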
--
cc