LDAP Stops after full backup
My master Zimbra ldap-only server has stopped working each weekend for the last two weeks. I believe it is related to the full backup that is run each weekend:

0 1 * * 6 /opt/zimbra/bin/zmbackup
0 0 * * * /opt/zimbra/bin/zmbackup -del 1m

It seems that the weekend's backup starts more than one zmslapcat, which makes LDAP unresponsive to one of my store servers, so LMTP fails and users on that server cannot log in (I am using multiple store servers). LDAP then won't shut down until the slapcats are each manually killed. While it is down, I have no problem running a full backup with zmbackupldap, which produces the same ldif files that I find on my secondary ldap server, which is syncrepl'd from the master.
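For what it's worth, the slapcat pile-up is easy to watch for from a shell; a small sketch (the bracketed pattern just keeps grep from matching its own command line):

```shell
#!/bin/sh
# Count running slapcat processes; seeing more than one at a time would match
# the pile-up observed during the failure. '[s]lapcat' keeps this grep's own
# command line out of the results.
n=$(ps axu | grep -c '[s]lapcat')
echo "slapcat processes: $n"
```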
I've read some of the code to trace down what I think is happening: zmbackupldap calls zmslapcat, which seems to call OpenLDAP's standard slapcat command, whose man page says: "slapd(8) should not be running (at least, not in read-write mode) when you do this [slapcat] to ensure consistency of the database". I don't see any commands in zmbackupldap to shut down ldap, though I could have overlooked something.
/var/log/zimbra.log at the time of the incident shows nothing out of the ordinary until I initiated a shutdown, which produced many ZimbraLdapContext ServiceUnavailableException errors until I killed the slapcats and slapd was restarted.
1. Does zmbackup stop Zimbra's LDAP for backups and is this relevant?
2. Is there anywhere else that I should look?
LDAP failure might be triggered by store server full backup
The logging level of LDAP in zldap0 was increased to 16640 as per what the Zimbra wiki described as "good for debug":
LDAP - Zimbra :: Wiki
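For reference, ldap_log_level is the OpenLDAP loglevel bitmask described in slapd.conf(5): 49152 is 32768 ("none") + 16384 (sync), and 16640 is 16384 (sync) + 256 (stats). A throwaway sketch to decode a value, with flag values taken from the man page:

```shell
#!/bin/sh
# Decode an OpenLDAP loglevel bitmask into named flags (a subset of the
# values listed in slapd.conf(5)).
level=16640
for pair in 32768:none 16384:sync 2048:parse 256:stats 128:acl 8:conns 1:trace; do
    bit=${pair%%:*}
    name=${pair##*:}
    if [ $(( level & bit )) -ne 0 ]; then
        echo "$name ($bit)"
    fi
done
# prints: sync (16384) and stats (256)
```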
[zimbra@zldap0 ~]$ zmlocalconfig | grep ldap_log_level
ldap_log_level = 49152
[zimbra@zldap0 ~]$ zmlocalconfig -e ldap_log_level=16640
[zimbra@zldap0 ~]$ zmlocalconfig | grep ldap_log_level
ldap_log_level = 16640

LDAP was restarted, which was the only service-affecting event. I then ran the backup script as it is run by cron on weekends. Aside from finishing quickly and producing the expected ldif file, there was no service interruption, and the logs dropped out for the amount of time it takes to run the backup. It seems that it stopped logging when I ran the backup, because a log level which generated 93 results in the 6 seconds before:

[zimbra@zldap0 ~]$ for x in 0 1 2 3 4 5 6; \
do grep 05:28:0$x /var/log/zimbra.log | wc -l; done

went dark for the amount of time it took to do the backup:

Apr 1 05:28:06 zldap0 slapd: conn=332 fd=26 closed (connection lost)
Apr 1 05:28:12 zldap0 slapd: conn=333 fd=26 ACCEPT from IP=184.108.40.206

Running the slapcat that was revealed by ps during the original system failure directly, i.e. running the following:

-v -d 16640

produced nothing extra in /var/log/zimbra.log, and a set of hex ids which map nearly one-to-one with the DNs in the ldif file:

[zimbra@zldap0 fultonj_test_4_1_01]$ grep dn ldap.bak | wc -l
[zimbra@zldap0 fultonj_test_4_1_01]$ wc -l log.1

I'm not sure how to get more data from LDAP to debug aside from something extremely verbose like strace. I doubt it would be revealing, since I can't seem to break the service by running a slapcat even when the server is up.
My new conjecture is that the full backup from the store server is what is causing the problem; that backup hangs until the slapcats are killed and ldap is restarted on the ldap server. I will run its full backup over the weekend and keep an eye on it and the ldap server. I will also remove the full backup from its crontab and run it by hand, since I'd rather choose when to bring the system down during a scheduled maintenance window. I'll keep this log level on LDAP since I have enough disk space to hold it, though queries will be a little slower.
I'll share my results on this page in hopes that they help someone else. Please post suggestions if you think I'm missing anything.
Multiple slapcat contention
Originally Posted by quanah
Is there any reason that more than one slapcat would be running at the same time?

Thanks for the suggestion. I have not been able to get the database to lock up. During a running full backup on the store server I have run several full backups of the LDAP server and have not seen any slapcats running, even with a 'watch "ps axu | grep slap"', though I see an LDIF file is produced. I see no change in the values returned from db_stat -c during this time (output below). If it locks up again I will paste the output of the ps and db_stat commands here.
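If overlapping backup runs do turn out to be the trigger, one defensive option is to serialize the cron jobs with flock(1). This is my own sketch, not part of stock Zimbra, and the lock path is made up:

```shell
#!/bin/sh
# Hypothetical cron wrapper: take an exclusive, non-blocking lock so a second
# zmbackup invocation exits instead of piling a new slapcat onto a running one.
LOCKFILE=/tmp/zmbackup.lock    # made-up path; pick somewhere persistent
exec 9>"$LOCKFILE"
if ! flock -n 9; then
    echo "zmbackup already running; skipping this run"
    exit 0
fi
echo "lock acquired; safe to run the backup"
# /opt/zimbra/bin/zmbackup "$@" would go here
```

The crontab entries would then call the wrapper instead of zmbackup directly, so the daily and weekly jobs can never run concurrently.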
[zimbra@zldap0 sessions]$ /opt/zimbra/sleepycat/bin/db_stat -c -h /opt/zimbra/openldap-data
886 Last allocated locker ID.
2147M Current maximum unused locker ID.
9 Number of lock modes.
3000 Maximum number of locks possible.
1500 Maximum number of lockers possible.
1500 Maximum number of lock objects possible.
19 Number of current locks.
380 Maximum number of locks at any one time.
91 Number of current lockers.
106 Maximum number of lockers at any one time.
19 Number of current lock objects.
202 Maximum number of lock objects at any one time.
159M Total number of locks requested.
159M Total number of locks released.
0 Total number of lock requests failing because DB_LOCK_NOWAIT was set.
2437 Total number of locks not immediately available due to conflicts.
0 Number of deadlocks.
0 Lock timeout value.
0 Number of locks that have timed out.
0 Transaction timeout value.
0 Number of transactions that have timed out.
1MB 368KB The size of the lock region.
7055 The number of region locks granted after waiting.
273M The number of region locks granted without waiting.
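If it does lock up again, the interesting counters in that db_stat -c output are the deadlock, conflict, and wait lines. A small filter pulls them out; since the real command needs the live Berkeley DB environment, a few sample lines from the output above stand in for it here:

```shell
#!/bin/sh
# Filter the contention-related counters out of `db_stat -c` output.
# On the live server you would pipe in:
#   /opt/zimbra/sleepycat/bin/db_stat -c -h /opt/zimbra/openldap-data
# The heredoc below inlines sample lines from the output above instead.
awk '/deadlocks|conflicts|after waiting|timed out/' <<'EOF'
2437 Total number of locks not immediately available due to conflicts.
0 Number of deadlocks.
0 Number of locks that have timed out.
7055 The number of region locks granted after waiting.
EOF
```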