Re: Failover after partial failure because of SAN?
On Fri, Nov 4, 2011 at 11:04 AM, Fajar A. Nugraha <list@xxxxxxxxx> wrote:
>
> On Fri, Nov 4, 2011 at 4:03 PM, Jochen Schneider
> <jochen.schneider@xxxxxxxxx> wrote:
> > Hi,
> >
> > We are setting up a cluster for a storage application with SAN disks managed
> > through HA-LVM and connected through multipath. There are actually two
> > applications which have to run on the same node,
>
> HAVE to run on the same node? Why? Can't they communicate via TCP/IP?

They are already communicating via TCP/IP, so they could be running on
different nodes, you are right. But they work in pairs, so they should
not be distributed randomly over the nodes. Also, we would have to see
what the performance impact of running them on different nodes would be.

> > but only one of them needs
> > the disk. Both of them have clients.
> >
> > The question I have is what should happen when the SAN fails: Should both
> > applications failover to another machine (possibly after a retry) or should
> > the application which doesn't need the disk keep running while the other is
> > shut down?
>
> You're not giving yourself much option. Since you say both application
> HAVE to run on the same node, I assume both are related (e.g. one
> needs the other). In that case, the only viable option is to failover.

The application that does not need disk access can run without the
other, so in case of a SAN failure there could be a degraded mode in
which only the first serves its clients and the other is down.

> Having said that, I'm curious what do you mean by "SAN fails". It's
> rare for a cluster node to be suddenly unable to access a node while
> the other can access it just fine. Usually it's either the SAN
> inaccessible completely (e.g. broken SAN or switches) or a server node
> fails.

I am not sure, actually. I don't have any practical data points of a
"real" SAN failure, only one due to misconfiguration. That's why I find
it hard to decide on our configuration: I'm not sure about the possible
failures, the dependencies between them, and (even rough) probability
estimates. (Has anybody ever come across a document addressing that,
maybe as the failure assumptions behind a clustering package and its
configuration?)

> > I'm not sure how much recovery can come out of a failover in case
> > of a SAN failure, if it's not both network cards of the node which are
> > defective or whatever.
>
> Exactly :)
>
> If no node can access the SAN, then it can't failover anywhere.

If SAN access is more likely to fail on the SAN side than on the node
side, I guess it would be better to keep the application that does not
need the SAN running, i.e., not to fail over. Or maybe failover should
be tried once, and then my service should go into the degraded mode
described above? I'm not sure whether that is possible.

> --
> Fajar

Thanks!
Jochen

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
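
The "try failover once, then fall back to degraded mode" idea raised in the message can be written down as a small decision function. This is only a sketch of the policy under discussion, not a cluster configuration: the names Action, decide, san_ok_here, failovers_tried and max_failovers are all hypothetical and are not part of rgmanager, HA-LVM or multipath. Whether a given cluster manager can express this policy is exactly the open question in the thread.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep both applications on this node"
    FAILOVER = "relocate both applications to another node"
    DEGRADED = "stop the disk application, keep the other one serving"


def decide(san_ok_here: bool, failovers_tried: int, max_failovers: int = 1) -> Action:
    """Pick the next step for the application pair (hypothetical policy).

    san_ok_here     -- True if this node can currently reach the SAN
    failovers_tried -- relocations already attempted for this outage
    max_failovers   -- relocations allowed before giving up on the SAN
    """
    if san_ok_here:
        # Nothing is wrong from this node's point of view.
        return Action.KEEP_RUNNING
    if failovers_tried < max_failovers:
        # The fault may be node-local (HBA, cable, path), so another
        # node might still see the disks: try relocating.
        return Action.FAILOVER
    # Relocation did not help, so the fault is probably on the SAN side.
    # Keep the application that does not need the disk running.
    return Action.DEGRADED
```

With the default of one attempt, decide(False, 0) asks for a failover and decide(False, 1) asks for degraded mode. The single retry reflects the uncertainty expressed above about whether a failure is more likely on the node side or on the SAN side.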