Re: [389-users] 1.2.7.5 process disappearing, replication failing


On 02/02/2011 08:48 PM, Andrew Kerr wrote:
> The one replica still running 1.2.7.5 hasn't crashed since it stopped getting traffic, but it is still getting replicated to.  Too early to determine if that has any relevance.
>
> My single master is still on 1.2.7.5, and has been stable.  It gets the same portion and type of end-user traffic, plus some.  So something still makes me think this has something to do with the master sending bad/mishandled data to the replicas, or something along those lines.  Not based on anything other than educated guesswork though.
>
> The lines in the error log don't have anything unusual in them.  Same thing each time it dies.  The referrals error seems to be ongoing; I haven't looked into that yet, but assume it isn't related.
>
> [02/Feb/2011:09:52:26 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=simplewire,dc=com is coming online; enabling replication
> [02/Feb/2011:09:52:26 -0500] NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica dc=simplewire,dc=com: 32
> [02/Feb/2011:09:52:26 -0500] NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica dc=simplewire,dc=com: 32
> [02/Feb/2011:10:05:38 -0500] NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica dc=simplewire,dc=com: 32
> 	389-Directory/1.2.7.5 B2010.350.198
> 	vdc-prd-ldap-001.simplewire.com:389 (/etc/dirsrv/slapd-vdc-prd-ldap-001)
>
> [02/Feb/2011:11:50:36 -0500] - 389-Directory/1.2.7.5 B2010.350.198 starting up
> [02/Feb/2011:11:50:36 -0500] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> [02/Feb/2011:11:50:36 -0500] - slapd started.  Listening on All Interfaces port 389 for LDAP requests

This sure looks like a crash.  If you are able, I would appreciate it if 
you could follow the steps to enable core files at 
http://directory.fedoraproject.org/wiki/FAQ#Debugging_Crashes
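
To confirm from the logs how often this is happening, here is a minimal Python sketch. It assumes the errors log uses the same line format as the excerpt quoted above and that the "Detected Disorderly Shutdown" message is logged on each startup after an unclean exit. The function name disorderly_restarts is made up for illustration and is not part of 389 Directory Server.

```python
import re
from datetime import datetime

# Marker text copied from the errors log excerpt quoted in this thread.
MARKER = "Detected Disorderly Shutdown"

# Timestamp layout seen in the quoted lines, e.g. [02/Feb/2011:11:50:36 -0500]
STAMP = re.compile(r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def disorderly_restarts(lines):
    """Return the timestamps of startups that followed an unclean shutdown.

    `lines` is any iterable of errors-log lines. Leading "> " quote
    markers are tolerated so that text pasted from an email also works.
    """
    found = []
    for line in lines:
        if MARKER not in line:
            continue
        # Drop email quote markers and surrounding whitespace, if any.
        match = STAMP.match(line.lstrip("> ").strip())
        if match:
            # Month abbreviations are parsed with the current locale (assumed English/C).
            found.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    return found
```

Passing the lines of an instance's errors log to disorderly_restarts() gives one timestamp per restart after a crash, which can then be lined up against replication updates on the master.
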
> -----Original Message-----
> From: Rich Megginson [mailto:rmeggins@xxxxxxxxxx]
> Sent: Wednesday, February 02, 2011 1:14 PM
> To: Andrew Kerr
> Cc: General discussion list for the 389 Directory server project.
> Subject: Re: [389-users] 1.2.7.5 process disappearing, replication failing
>
> On 02/02/2011 10:37 AM, Andrew Kerr wrote:
>> I reinstalled the two replicas that were saying "No such object" and now they work - same exact cut-and-paste process that didn't work before.
>>
>> The good news is that I am back up and running (phew, what a morning!).
>>
>> I left one replica on 1.2.7.5, disabled behind our load balancer, so it is getting replicated to but no production traffic - with the intent of helping figure out what the problem is before others find it.  I'll get a bug report filed since this seems like something new.
>>
>> FYI, these are all virtual machines (on a mix of vmware, kvm, and xen depending on the datacenter) and have very minimal installs, running no other apps, with no selinux or anything either.
> Is the 1.2.7.5 server still crashing?  If so, please post the last few
> lines of the errors log before the crash.
>
> See also here:
> http://directory.fedoraproject.org/wiki/FAQ#Debugging_Crashes
>> -----Original Message-----
>> From: 389-users-bounces@xxxxxxxxxxxxxxxxxxxxxxx [mailto:389-users-bounces@xxxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Andrew Kerr
>> Sent: Wednesday, February 02, 2011 11:44 AM
>> To: Rich Megginson; General discussion list for the 389 Directory server project.
>> Subject: Re: [389-users] 1.2.7.5 process disappearing, replication failing
>>
>> The process is completely gone.  Doesn't show up in ps, and the pid referenced in the pid file doesn't exist.
>>
>> I do have a lot of lines like this in my access log:
>> [02/Feb/2011:10:05:06 -0500] conn=4479 op=-1 fd=161 closed - B1
>>
>> On the positive side, I was able to get some of the replicas downgraded to 1.2.4.  I had been deleting the server from the site under netscaperoot and re-registering it, but I hadn't re-created the replication agreement; I was just re-initializing the existing one.  Deleting it and creating a new one got rid of the error: "Unable to parse the response to the startReplication extended operation.  Replication is aborting".
>>
>> Four of the six systems I put back to 1.2.4 (by removing the RPMs and blowing away all dirsrv relics left behind, reinstalling, and re-configuring).  Two of them I initialize and can see the directory, but when I do an ldapsearch remotely I get "result: 32 No such object".  More random/unpredictable behavior...
>>
>>
>> -----Original Message-----
>> From: Rich Megginson [mailto:rmeggins@xxxxxxxxxx]
>> Sent: Wednesday, February 02, 2011 11:10 AM
>> To: General discussion list for the 389 Directory server project.
>> Cc: Andrew Kerr
>> Subject: Re: [389-users] 1.2.7.5 process disappearing, replication failing
>>
>> On 02/02/2011 09:06 AM, Andrew Kerr wrote:
>>> I'm running a single master with 13 replicas, all CentOS 5.5.  The master, and a few of the slaves, are running 1.2.7.5.  We were previously on 1.2.4, with most replicas still on that version.
>> You might be running into https://bugzilla.redhat.com/show_bug.cgi?id=668619
>> The symptom of that bug is your server will just stop responding to
>> requests, including server-to-server requests like replication.  Your
>> server will still be running.
>>
>> Does ps -ef|grep slapd show your server process is running?
>> Do you see the messages like "op=-1 fd=66 closed - T2" in your access log?
>>> All of a sudden, the slapd process on the 1.2.7.5 replicas started to disappear.  Nothing in the error log with level at 8192.  It's just gone.  I can start it up and it'll last about 5 minutes.  Replication is what seems to be breaking - it seems to go away right after an update.
>>>
>>> I've tried rolling the replicas back to 1.2.4, but when I initialize the consumers I get "Unable to parse the response to the startReplication extended operation.  Replication is aborting".
>>>
>>> Any suggestions on where to go from this point?  It seems 1.2.7.5 is HIGHLY unstable.  But it seems it can't initialize 1.2.4 replicas (??), or maybe it just doesn't work at all.
>>>
>>> I'm not sure what the safe way is to roll back the master from 1.2.7.5.  Can I use "yum downgrade" safely?  At least my master and the replicas on 1.2.4 are working now, and I don't want to risk taking down LDAP completely.
>>>
>>> Is there a good stable version I ought to be at?  I upgraded from 1.2.4 because of a number of other bugs, although none of them as bad as 1.2.7.5 seems to be.
>>>
>>> Thanks - any help is greatly appreciated.
>>>
>>> --
>>> 389 users mailing list
>>> 389-users@xxxxxxxxxxxxxxxxxxxxxxx
>>> https://admin.fedoraproject.org/mailman/listinfo/389-users


