
[ogfs-dev] memexp deadlock



This sure has been an interesting exercise.  I've had a few more
panics on the 3-node OSDL setup, and now I'm getting what appears
to be a memexp deadlock.

The nodes were spewing the following output:

node 0:
memexp:  asking for lock (4, 208) action 3 from 2
memexp:  asking for lock (4, 208) action 3 from 2
memexp:  asking for lock (4, 208) action 3 from 2
memexp:  asking for lock (4, 208) action 3 from 2
memexp:  asking for lock (4, 208) action 3 from 2


node 1:
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0


node 2:
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0
memexp:  asking for lock (4, 184) action 3 from 0


These messages come from memexp/lockops.c::kick_holders(), which prints
one message for every 500 failed attempts.  If I'm reading the output
right, it's a cycle: node 0 is waiting on lock (4, 208) held by node 2,
while nodes 1 and 2 are both waiting on (4, 184) held by node 0.  The
messages persisted for a long, long time.  (At least it didn't panic. :-)
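For anyone who hasn't looked at that code path, the shape of it is just a
rate-limited retry loop.  Below is a minimal userspace sketch of the pattern
(try_kick_holder and kick_holders_sketch are made-up names, not the actual
memexp interface), mostly to show why a deadlock shows up as the same line
repeating every 500 tries:

#include <stdio.h>

/*
 * Hypothetical stand-in for the real memexp call that asks the current
 * holder of a lock to release it.  Returns 0 once the holder lets go.
 * Here it "succeeds" after 1600 attempts so the demo terminates; in the
 * deadlock above the real operation presumably never stopped failing.
 */
static int try_kick_holder(unsigned int type, unsigned int num,
                           unsigned int action, unsigned int holder)
{
        static unsigned long calls;
        (void)type; (void)num; (void)action; (void)holder;
        return (++calls >= 1600) ? 0 : -1;
}

/*
 * Sketch of the retry pattern: keep asking the holder, but only print a
 * diagnostic once every 500 failed attempts so the log isn't flooded on
 * every retry.  A deadlock looks like this loop spinning forever.
 */
static void kick_holders_sketch(unsigned int type, unsigned int num,
                                unsigned int action, unsigned int holder)
{
        unsigned long tries = 0;

        while (try_kick_holder(type, num, action, holder) != 0) {
                if (++tries % 500 == 0)
                        printf("memexp:  asking for lock (%u, %u) "
                               "action %u from %u\n",
                               type, num, action, holder);
        }
}

int main(void)
{
        kick_holders_sketch(4, 208, 3, 2);  /* the tuple node 0 was stuck on */
        return 0;
}

Compiled and run, that prints the message three times and exits; in the
deadlocked case the loop never gets out.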


I took down node 2 hoping to break the deadlock, but things didn't
improve.  Then I also took down node 1, but node 0 still kept
generating errors, so I tried to take it down as well.  That produced
an endless stream of:

dmep_tcp: 0: Failed to connect to server (-22)
dmep_tcp: 1:Unable to reconnect to server. Expect hell to break loose.
memexp:  slow heartbeat time (7364269, 7364775, 506)
dmep_tcp: write - sock=c834ef00 at buf=cb5b3c18, size=19 returned -32.
dmep_tcp: 0: Failed to connect to server (-22)
dmep_tcp: 1:Unable to reconnect to server. Expect hell to break loose.
OGFS:  error -22 from locking module on lock (208, 4), retrying...
OGFS:  error -22 from locking module on lock (208, 4), retrying...
dmep_tcp: write - sock=c834ef00 at buf=c8523ec4, size=19 returned -32.


In the end I had to cut power to node 0 as well.


This brings up a curious question.  If memexpd dies (I don't know
whether that happened above, but it seems likely), or the node currently
running memexpd dies, what is the protocol for getting lock traffic going
again?  I did a similar experiment where I killed node 0 (which was
running memexpd) and could never recover the node 1 and node 2 mounts.

-- Joe D. <joe@osdl.org>




