Hi all,
We recently migrated to a larger single-server configuration to support increasing load. The new server is a dual-processor 2.53 GHz Xeon machine with 12 GB of RAM running Ubuntu 8.04 Server. While we were at it, we also upgraded Zimbra from 6.0.1 to 6.0.3. Generally the server runs very well and supports several dozen simultaneous users with aplomb, but a couple of times per day the load average climbs to 10+ and the server becomes temporarily unresponsive. It typically recovers before the underlying protocols time out, but it is quite noticeable when it occurs.
In investigating the issue we followed the tuning guidelines, increasing the InnoDB buffer space to 40% of total RAM and reducing the JVM heap to 20% of total RAM. Unfortunately, while these measures seem to improve the "sunny day" performance, the problem still occurs with the same severity. When it occurs, it is primarily Java processes monopolizing CPU time, with little or no iowait. Sometimes the stats process is running; other times it is not.
I haven't yet found anything in zimbra.log or mailbox.log that seems to correlate, but I have found that we frequently see errors like this:
Code:
zmmtaconfig: Skipping getAllReverseProxyURLs ERROR: service.FAILURE (system failure: ZimbraLdapContext) (cause: javax.naming.CommunicationException zimbra.mydomain.com:389)
zmmtaconfig: gacf ERROR: service.FAILURE (system failure: ZimbraLdapContext) (cause: javax.naming.CommunicationException zimbra.mydomain.com:389)
zmmtaconfig: Skipping getAllMtaAuthURLs ERROR: service.FAILURE (system failure: ZimbraLdapContext) (cause: javax.naming.CommunicationException zimbra.mydomain.com:389)
zmmtaconfig: Sleeping...Key lookup failed.
These occur several times per minute, day in and day out. From what I have read, these messages are usually associated with services failing to start, but in my case everything starts happily and the status looks good:
Code:
$ zmcontrol status
Host zimbra.mydomain.com
antispam Running
antivirus Running
convertd Running
ldap Running
logger Running
mailbox Running
mta Running
snmp Running
spell Running
stats Running
$ ldap status
slapd running pid: 18976
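One way to test whether the zmmtaconfig LDAP errors quoted earlier cluster around the load spikes would be to log the load average with timestamps and bucket the errors per minute, then compare the two. The following is only a rough sketch under assumptions: it requires Python 3, it assumes the log lines begin with a syslog-style "Mmm dd HH:MM:SS" timestamp, and the function names, the /tmp/loadavg.log path, and the 10-second interval are all made up for illustration.

```python
import os
import time
from collections import Counter
from datetime import datetime


def errors_per_minute(lines, needle="CommunicationException"):
    """Count lines containing `needle`, grouped by minute.

    Assumption: each line starts with a syslog-style timestamp
    ("Mmm dd HH:MM:SS"), so the first 12 characters identify the minute.
    """
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:12]] += 1
    return counts


def format_sample(now, load):
    """Return one log line: timestamp plus the 1/5/15-minute load averages."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return "%s %.2f %.2f %.2f" % (stamp, load[0], load[1], load[2])


def sample_load(path="/tmp/loadavg.log", interval=10, samples=360):
    """Append `samples` load-average readings to `path`, one every `interval` seconds.

    The path, interval, and sample count are arbitrary illustrative defaults.
    """
    with open(path, "a") as log:
        for _ in range(samples):
            log.write(format_sample(datetime.now(), os.getloadavg()) + "\n")
            log.flush()
            time.sleep(interval)
```

Feeding `errors_per_minute` an open file handle for whichever log carries the zmmtaconfig messages, and running `sample_load()` alongside it, should show whether the minutes with the most errors line up with the minutes where the load climbs.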
Full disclosure:
* My /etc/hosts file may have been malformed at the time the dummy install and the upgrade were run. Presently it looks like this:
Code:
127.0.0.1 localhost.localdomain localhost
216.110.208.158 zimbra.mydomain.com zimbra
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
* DNS and IP assignments remained static through the transition.
* Architecture was the same on both machines (hence dummy install).
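Since the errors are all CommunicationException against zimbra.mydomain.com:389, one sanity check would be to confirm which address the hosts file maps that name to, and whether port 389 accepts a TCP connection at all. Below is a minimal sketch of that check, assuming Python 3 is available on the box; `parse_hosts` and `can_connect` are hypothetical helper names, and the 3-second timeout is an arbitrary choice.

```python
import socket


def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to the addresses listed for it."""
    table = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, []).append(addr)
    return table


def can_connect(host, port=389, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `parse_hosts(open("/etc/hosts").read()).get("zimbra.mydomain.com")` shows what the hosts file says for the server name, and calling `can_connect("zimbra.mydomain.com")` in a loop during a load spike would show whether slapd stops accepting connections at those moments. A successful TCP connect only proves the port is open, not that LDAP queries are being answered.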
Obviously I'm just looking for advice on how to isolate and resolve this. Thanks in advance for any pointers.