I increased the LDAP logging level on zldap0 to 16640, the value the Zimbra wiki describes as "good for debug":
LDAP - Zimbra :: Wiki
Code:
[zimbra@zldap0 ~]$ zmlocalconfig | grep ldap_log_level
ldap_log_level = 49152
[zimbra@zldap0 ~]$ zmlocalconfig -e ldap_log_level=16640
[zimbra@zldap0 ~]$ zmlocalconfig | grep ldap_log_level
ldap_log_level = 16640
[zimbra@zldap0 ~]$
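For anyone wondering what these numbers mean: the value is a bitmask. Here is a minimal sketch that decodes it, assuming ldap_log_level is passed straight through to slapd's loglevel and uses the standard OpenLDAP bit values (4096 and 8192 are named as in older releases). decode_ldap_log_level is a hypothetical helper name of mine, not a Zimbra or OpenLDAP tool.
Code:
```shell
# Hypothetical helper: decode an OpenLDAP loglevel integer into flag names.
# Assumes the standard slapd loglevel bits: 1=trace, 2=packets, 4=args, ...
decode_ldap_log_level() {
    level=$1
    bit=1
    out=""
    # Names are listed in bit order, lowest bit (1) first.
    for name in trace packets args conns BER filter config ACL \
                stats stats2 shell parse cache index sync none; do
        if [ $(( level & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "$out"
}

decode_ldap_log_level 16640   # prints: stats sync
decode_ldap_log_level 49152   # prints: sync none
```
If those assumptions hold, the change above swaps "sync none" for "stats sync", i.e. it turns on the per-connection and per-operation stats lines.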
LDAP was restarted, which was the only service-affecting event. I then ran the backup script the same way cron runs it on weekends. It finished quickly and produced the expected ldif file, and there was no service interruption. The only anomaly was that the logs dropped out for as long as the backup took to run. It seems that LDAP stopped logging when I ran the backup, because a log level which generated 93 lines in the 6 seconds before:
Code:
[zimbra@zldap0 ~]$ for x in 0 1 2 3 4 5 6; \
do grep 05:28:0$x /var/log/zimbra.log | wc -l; done
0
3
21
17
15
7
30
[zimbra@zldap0 ~]$
went dark for the amount of time it took to do the backup:
Code:
Apr 1 05:28:06 zldap0 slapd[27922]: conn=332 fd=26 closed (connection lost)
Apr 1 05:28:12 zldap0 slapd[27922]: conn=333 fd=26 ACCEPT from IP=139.147.11.131:55635 (IP=139.147.11.133:389)
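Rather than eyeballing the log for quiet periods like this, the gaps can be listed automatically. This is a rough sketch, assuming the usual syslog layout where the third field is HH:MM:SS. find_log_gaps is a hypothetical helper name, and it does not handle a gap that spans midnight.
Code:
```shell
# Hypothetical helper: find_log_gaps FILE [MIN_GAP_SECONDS]
# Prints every pair of consecutive lines whose timestamps are at least
# MIN_GAP_SECONDS apart (default 5). Assumes field 3 is HH:MM:SS.
find_log_gaps() {
    awk -v min="${2:-5}" '
        {
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
            if (NR > 1 && now - prev >= min)
                printf "%s -> %s (%d s)\n", prev_ts, $3, now - prev
            prev = now
            prev_ts = $3
        }
    ' "$1"
}

# Example: find_log_gaps /var/log/zimbra.log 5
```
On the two log lines above this would report 05:28:06 -> 05:28:12 (6 s). To look only at slapd, grep the slapd lines into a temporary file first and run it on that.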
Directly running the same slapcat command that ps had revealed during the original system failure, i.e. running the following:
Code:
/opt/zimbra/openldap/sbin/slapcat \
-v -d 16640 \
-f /opt/zimbra/conf/slapd.conf \
-l /opt/zimbra/backup/sessions/fultonj_test_4_1_01/ldap.bak.1 \
> /opt/zimbra/backup/sessions/fultonj_test_4_1_01/log.1
produced nothing extra in /var/log/zimbra.log and a set of hex ids which map nearly one-to-one with the DNs in the ldif file:
Code:
[zimbra@zldap0 fultonj_test_4_1_01]$ grep dn ldap.bak | wc -l
7733
[zimbra@zldap0 fultonj_test_4_1_01]$ wc -l log.1
7728 log.1
[zimbra@zldap0 fultonj_test_4_1_01]$
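One caveat on the count above: grep dn matches any line that contains "dn" anywhere, not only the lines that start an entry, so it may overcount slightly. Here is a stricter count as a small sketch. count_ldif_entries is a hypothetical helper name, and it assumes each LDIF entry begins with a line starting with "dn:" (which also covers base64 "dn::" lines).
Code:
```shell
# Hypothetical helper: count LDIF entries by their leading "dn:" line.
# grep -c prints the number of matching lines; '^dn:' anchors the match
# to the start of the line, so attribute values containing "dn" are ignored.
count_ldif_entries() {
    grep -c '^dn:' "$1"
}

# Example: count_ldif_entries ldap.bak
```
I have not rerun this against the backup, so I can't say whether it accounts for the difference between 7733 and 7728.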
I'm not sure how to get more debugging data from LDAP, aside from something extremely verbose like strace. I doubt it would be revealing, since I can't seem to break the service by running a slapcat even while the server is up.
My new conjecture is that the full backup from the store server is what causes the problem: it breaks until the slapcats are killed and LDAP is restarted on the LDAP server. I will run its full backup over the weekend and keep an eye on both it and the LDAP server. I will also remove the full backup from its crontab and run it by hand, since I'd rather choose when to bring the system down, during a scheduled maintenance window. I'll keep this log level on LDAP since I have enough disk space to hold the logs, though queries will be a little slower.
I'll share my results on this page in hopes that they help someone else. Please post suggestions if you think I'm missing anything.