So this morning around 8:45am the main Zimbra mailbox server became unresponsive. Normally when I can't ssh into a *nix system I pop up a terminal on the machine if its local and switch over to a second terminal (ALT+F2) if the first one is frozen for some reason. I was on my way into work at the time but the employees were already causing a fuss, as most employees do when email is down.
Anyways since I wasn't there another admin did the only thing he could think of and restarted the server. When I came in the server was still booting, more on that later, and in took some troubleshooting to get the system back to normal, so here's where I'm at now:
This next part is long......
1) Bravo Gents that was the first outage (read: more than just a few minutes) we've had since we migrated from Exchange last May.
2) Since the system had not been rebooted in such a long time it when fsck when booting, on most systems thats fine, but when checking through 1TB of HD space it takes quite a long time to finish. Is there anyway to disable forced filesystem checks in Linux? (I can probably Google this one)
3) Maybe I never noticed this before but using Ubuntu 6.06 server with a clean install postfix has this setup in rc.d:
This caused the MTA service to not startup as something was already using port 25 and was not owned by zimbra so it could not stop it. Killing the process as root and restart the MTA service brought it back up fine and dandy.Code:/etc/rc0.d/K20postfix /etc/rc1.d/K20postfix /etc/rc2.d/S20postfix /etc/rc3.d/S20postfix /etc/rc4.d/S20postfix /etc/rc5.d/S20postfix /etc/rc6.d/K20postfix
Does anyone else with Ubuntu 6.06 and no LAMP installation also have this setup? If so is it safe to remove? I figure it will have the same problem again the next time it has to reboot.
4) So I'm currently diagnosing what caused the problem and here's where it gets tricky. Since the system was powered off while still unresponsive the log files in /var/log all have create dates of when the server came back online (10:07am). "kern.log" However seems to have maintain its data from previous day but doesn't report anything in terms of a hardware failure between yesterday and this morning. The log files before that end at 6:29am, which is the time they were rotated. This is what I have from the logs stored in /opt/zimbra/log:
From audit.log:
What does this error mean "error=system failure: getDirectContext;" ?Code:2008-04-03 08:50:05,693 INFO [ImapServer-3848] [ip=xxx;] security - cmd=Auth; account=user1@domain.com; protocol=imap; 2008-04-03 08:50:09,081 INFO [ImapServer-3850] [ip=xxx;] security - cmd=Auth; account=user1@domain.com; protocol=imap; 2008-04-03 08:55:25,188 WARN [ImapServer-3876] [ip=xxx;] security - cmd=Auth; account=user2@domain.com; protocol=imap; error=system failure: getDirectContext; 2008-04-03 08:55:51,565 WARN [ImapServer-3879] [ip=xxx;] security - cmd=Auth; account=user3@domain.com; protocol=imap; error=system failure: getDirectContext; 2008-04-03 08:56:29,224 WARN [ImapServer-3883] [ip=xxx;] security - cmd=Auth; account=user4@domain.com; protocol=imap; error=system failure: getDirectContext; 2008-04-03 08:56:29,224 WARN [ImapServer-3884] [ip=xxx;] security - cmd=Auth; account=user5@domain.com; protocol=imap; error=system failure: getDirectContext; 2008-04-03 08:57:08,013 WARN [http-443-Processor18] [ip=xxx;ua=ZimbraWebClient - IE7 (Win);] security - cmd=Auth; account=user5@domain.com; protocol=soap; error=system failure: getDirectContext; 2008-04-03 08:57:17,094 WARN [http-443-Processor21] [ip=xxx;ua=ZimbraWebClient - IE7 (Win);] security - cmd=Auth; account=user5@domain.com; protocol=soap; error=system failure: getDirectContext; 2008-04-03 08:57:38,389 WARN [http-443-Processor29] [ip=xxx;ua=ZimbraWebClient - IE7 (Win);] security - cmd=Auth; account=user5@domain.com; protocol=soap; error=system failure: getDirectContext; 2008-04-03 09:22:27,695 WARN [http-7071-Processor48] [ip=xxx;ua=ZimbraWebClient - FF2.0 (Win)/undefined;] security - cmd=AdminAuth; account=admin; error=system failure: getDirectContext; After system restart 2008-04-03 10:08:06,856 INFO [ImapServer-1] [ip=xxx;] security - cmd=Auth; account=user3@domain.com; protocol=imap; 2008-04-03 10:08:06,862 INFO [ImapServer-2] [ip=xxx;] security - cmd=Auth; account=user6@domain.com; protocol=imap
From mailbox.log:
There wasn't much in any of the other logs. I'm lost on that one so please help me out.Code:2008-04-03 08:53:36,735 INFO [ImapServer-3870] [] imap - [xxx] connected 2008-04-03 08:53:42,116 INFO [ImapServer-3871] [] imap - [xxx] connected 2008-04-03 08:53:50,647 INFO [IndexWritersSweeperThread] [] MailboxIndex - open index writers sweep: before=21, closed=5, after=16 (241ms) 2008-04-03 08:53:50,799 WARN [http-443-Processor43] [] misc - unable to lookup account for log, id: c885be3e-927c-42bc-8a4b-960855dfc60a com.zimbra.common.service.ServiceException: system failure: getDirectContext at com.zimbra.common.service.ServiceException.FAILURE(ServiceException.java:165) at com.zimbra.cs.account.ldap.LdapUtil.getDirContext(LdapUtil.java:200) at com.zimbra.cs.account.ldap.LdapUtil.getDirContext(LdapUtil.java:184) at com.zimbra.cs.account.ldap.LdapProvisioning.getAccountByQuery(LdapProvisioning.java:441) at com.zimbra.cs.account.ldap.LdapProvisioning.getAccountById(LdapProvisioning.java:469) at com.zimbra.cs.account.ldap.LdapProvisioning.getAccountById(LdapProvisioning.java:496) at com.zimbra.cs.account.ldap.LdapProvisioning.get(LdapProvisioning.java:485) at com.zimbra.cs.account.Account.addAccountToLogContext(Account.java:119) at com.zimbra.soap.SoapEngine.dispatch(SoapEngine.java:123) at com.zimbra.soap.SoapEngine.dispatch(SoapEngine.java:85) at com.zimbra.soap.SoapServlet.doPost(SoapServlet.java:220) at javax.servlet.http.HttpServlet.service(HttpServlet.java:709) at com.zimbra.cs.servlet.ZimbraServlet.service(ZimbraServlet.java:152) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:667) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:619) Caused by: javax.naming.CommunicationException: Timeout exceeded while waiting for a connection: 30000ms at com.sun.jndi.ldap.pool.Connections.get(Connections.java:128) at com.sun.jndi.ldap.pool.Pool.getPooledConnection(Pool.java:129) at com.sun.jndi.ldap.LdapPoolManager.getLdapClient(LdapPoolManager.java:310) at com.sun.jndi.ldap.LdapClient.getInstance(LdapClient.java:1572) at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2616) at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:287) at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:175) at com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:193) at com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:136) at com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:66) at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:667) at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:288) at javax.naming.InitialContext.init(InitialContext.java:223) at javax.naming.ldap.InitialLdapContext.<init>(InitialLdapContext.java:134) at com.zimbra.cs.account.ldap.LdapUtil.getDirContext(LdapUtil.java:195) ... 27 more
5) After the main server was back up and running, MTA and all I noticed in the admin gui that only 1 server out of 3 was being reported. Knowing this problem usually stems from "sysklogd" needing a restart I did that on all the servers to no avail. Checking the local syslogs on the 2 other servers showed they were just not reporting anything since around 8:44am:
So to start fresh I ran 'zmcontrol stop' then 'ps auxww | grep zimbra' and got this:Code:Apr 3 08:44:04 relay zimbramon[23889]: 23889:info: 2008-04-03 08:44:01, STATUS: relay.domain.com: snmp: Running Apr 3 08:44:04 relay zimbramon[23889]: 23889:info: 2008-04-03 08:44:01, STATUS: relay.domain.com: stats: Running Apr 3 11:10:04 relay2 zimbramon[31305]: 31305:info: 2008-04-03 11:10:01, STATUS: relay.domain.com: antispam: Running Apr 3 11:10:04 relay zimbramon[31305]: 31305:info: 2008-04-03 11:10:01, STATUS: relay.domain.com: antivirus: Running
Mind you nothing should have been running, killing all of those and restarting Zimbra gave me this:Code:zimbra 12628 99.7 2.6 62520 54344 ? R 2007 220693:12 amavisd (ch12-12628-12) zimbra 24133 0.0 0.0 0 0 ? Zs 08:46 0:00 [zmstatuslog] <defunct> zimbra 24141 0.0 0.2 6872 4204 ? S 08:46 0:00 /usr/bin/perl /opt/zimbra/bin/zmcontrol status zimbra 24155 0.0 0.0 4208 1644 ? S 08:46 0:00 sh -c /opt/zimbra/bin/zmprov -l gs relay.domain.com 2>/dev/null zimbra 24156 0.0 1.0 413968 22004 ? Sl 08:46 0:00 /opt/zimbra/java/bin/java -client -Xmx256m -Dzimbra.home=/opt/zimbra -Djava.library.path=/opt/zimbra/lib -Djava.ext.dirs=/opt/zimbra/java/jre/lib/ext:/opt/zimbra/lib/jars:/opt/zimbra/lib/ext com.zimbra.cs.account.ProvUtil -l gs relay.domain.com zimbra 24188 0.0 0.0 2956 1368 ? S 08:46 0:00 sh -c /opt/zimbra/bin/zmprov -l gacf | grep -v InheritedAttr zimbra 24189 0.0 1.0 414300 22116 ? Sl 08:46 0:00 /opt/zimbra/java/bin/java -client -Xmx256m -Dzimbra.home=/opt/zimbra -Djava.library.path=/opt/zimbra/lib -Djava.ext.dirs=/opt/zimbra/java/jre/lib/ext:/opt/zimbra/lib/jars:/opt/zimbra/lib/ext com.zimbra.cs.account.ProvUtil -l gacf zimbra 24192 0.0 0.0 1636 504 ? S 08:46 0:00 grep -v InheritedAttr
So my question is what are those java processes? It looks like they hung because the main server was querying them and hung itself.Code:zimbra 30882 0.0 0.2 6980 4648 pts/0 S 11:09 0:00 /usr/bin/perl /opt/zimbra/libexec/zmmtaconfig zimbra 30932 0.1 2.4 57440 49836 ? Ss 11:09 0:00 amavisd (master) zimbra 30948 0.0 2.4 58732 50600 ? S 11:09 0:00 amavisd (ch2-avail) zimbra 30949 0.0 2.4 58816 50684 ? S 11:09 0:00 amavisd (ch4-avail) zimbra 30950 0.0 2.4 58716 50672 ? S 11:09 0:00 amavisd (ch2-avail) zimbra 30952 0.0 2.4 59708 51680 ? S 11:09 0:00 amavisd (ch3-avail) zimbra 30953 0.0 2.4 59640 51532 ? S 11:09 0:00 amavisd (ch2-avail) zimbra 30954 0.0 2.3 58168 48892 ? S 11:09 0:00 amavisd (virgin child) zimbra 30955 0.0 2.3 58168 48888 ? S 11:09 0:00 amavisd (virgin child) zimbra 30956 0.0 2.3 58168 48896 ? S 11:09 0:00 amavisd (virgin child) zimbra 30957 0.0 2.3 58168 48892 ? S 11:09 0:00 amavisd (virgin child) zimbra 30958 0.0 2.3 58168 48896 ? S 11:09 0:00 amavisd (virgin child) zimbra 30993 1.5 5.2 119756 108644 ? Ss 11:09 0:07 /opt/zimbra/clamav/sbin/clamd --config-file /opt/zimbra/conf/clamd.conf zimbra 31002 0.0 0.1 5688 3948 pts/0 S 11:09 0:00 /usr/bin/perl /opt/zimbra/libexec/swatch --config-file=/opt/zimbra/conf/swatchrc --use-cpan-file-tail --script-dir=/tmp -t /var/log/zimbra.log zimbra 31014 0.0 0.3 8380 6544 pts/0 S 11:09 0:00 /usr/bin/perl /tmp/.swatch_script.31002 root 31094 0.0 0.0 6384 1688 ? Ss 11:09 0:00 /opt/zimbra/postfix-2.2.9/libexec/master zimbra 31113 0.0 0.0 5584 872 ? Ss 11:09 0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra zimbra 31114 0.0 0.0 5584 468 ? S 11:09 0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra zimbra 31115 0.0 0.0 5584 524 ? S 11:09 0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra zimbra 31117 0.0 0.0 5584 468 ? S 11:09 0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra zimbra 31118 0.0 0.0 5584 468 ? S 11:09 0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra zimbra 31166 0.0 0.1 4924 3296 pts/0 S 11:09 0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-proc zimbra 31182 0.0 0.1 4792 3136 pts/0 S 11:09 0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-cpu zimbra 31184 0.0 0.1 4788 3124 pts/0 S 11:09 0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-vm zimbra 31219 0.0 0.1 4792 3080 pts/0 S 11:09 0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-mtaqueue zimbra 31275 0.0 0.0 1692 576 pts/0 S 11:10 0:00 /usr/bin/vmstat -n -S K 30
6) Lastly I've noticed this happening more frequently ever since a failed backup triggered multiple zmlogger processes that don't die when you run "zmcontrol stop"
As you can see I'm swapping big time, and sometimes the CPU will spike to over 5.00 unless I kill zmlogger, and this is just from it running today. I've checked and repaired the logger tables, and even rebuilt the db recently but it keeps coming back. The server in question has (2) Dual Core Intel Xeon CPU 2.66GHz Cpu's with 4GB of RAM, we only have 75 users though that also includes accounts as HAM, SPAM, Wiki. I'm am also at a loss on this one, as it wasn't happening in early February, but now I deal with it everyday.Code:top - 17:20:15 up 7:55, 2 users, load average: 1.26, 0.79, 0.74 Tasks: 136 total, 2 running, 134 sleeping, 0 stopped, 0 zombie Cpu(s): 12.3% us, 8.6% sy, 0.0% ni, 40.9% id, 37.9% wa, 0.0% hi, 0.2% si Mem: 4147624k total, 3997224k used, 150400k free, 1540k buffers Swap: 1935824k total, 1011920k used, 923904k free, 78724k cached
All in all still must better than using and troubleshooting Exchange.


LinkBack URL
About LinkBacks






