05-11-2010, 06:29 AM
ZCS cluster went down - /opt/zimbra-cluster/bin/zmcluctl failed (returned 1)
Hi All,
I'm running Zimbra 5.0.20 NE on a 2-node cluster of CentOS 4.8 (active/standby). The other day, the cluster decided to fail over to the standby, and I'm trying to determine why. In the logs, I see:
/var/log/messages on node 1 (originally the standby, became the active):
Code:
May 7 06:40:16 wsl-mx1 clurgmgrd[5374]: <notice> Recovering failed service mx.mydomain.com
May 7 06:40:17 wsl-mx1 kernel: kjournald starting. Commit interval 5 seconds
May 7 06:40:17 wsl-mx1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
May 7 06:40:17 wsl-mx1 kernel: EXT3 FS on emcpowera1, internal journal
May 7 06:40:17 wsl-mx1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May 7 06:42:59 wsl-mx1 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
May 7 06:42:59 wsl-mx1 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
May 7 06:43:03 wsl-mx1 clurgmgrd: [5374]: <err> script:zimbra: start of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1)
May 7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> start on script "zimbra" returned 1 (generic error)
May 7 06:43:03 wsl-mx1 clurgmgrd[5374]: <warning> #68: Failed to start service:mx.mydomain.com; return value: 1
May 7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> Stopping service mx.mydomain.com
May 7 06:43:14 wsl-mx1 clurgmgrd: [5374]: <notice> Forcefully unmounting /opt/zimbra-cluster/mountpoints/mx.mydomain.com
May 7 06:43:14 wsl-mx1 clurgmgrd: [5374]: <warning> killing process 7666 (zimbra amavisd /opt/zimbra-cluster/mountpoints/mx.mydomain.com)
...(more killing process messages)
May 7 06:43:20 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com is recovering
May 7 07:46:16 wsl-mx1 clurgmgrd[5374]: <notice> Starting stopped service mx.mydomain.com
May 7 07:46:16 wsl-mx1 kernel: kjournald starting. Commit interval 5 seconds
May 7 07:46:16 wsl-mx1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
May 7 07:46:16 wsl-mx1 kernel: EXT3 FS on emcpowera1, internal journal
May 7 07:46:16 wsl-mx1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May 7 07:48:09 wsl-mx1 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
May 7 07:48:09 wsl-mx1 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
May 7 07:48:13 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com started
/var/log/messages on node 2 (originally the active, became the standby):
Code:
May 7 06:36:20 wsl-mx2 clurgmgrd: [5376]: <err> script:zimbra: status of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1)
May 7 06:36:20 wsl-mx2 clurgmgrd[5376]: <notice> status on script "zimbra" returned 1 (generic error)
May 7 06:36:20 wsl-mx2 clurgmgrd[5376]: <notice> Stopping service mx.mydomain.com
May 7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Service mx.mydomain.com is recovering
May 7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Recovering failed service mx.mydomain.com
May 7 06:37:08 wsl-mx2 kernel: kjournald starting. Commit interval 5 seconds
May 7 06:37:08 wsl-mx2 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
May 7 06:37:08 wsl-mx2 kernel: EXT3 FS on emcpowera1, internal journal
May 7 06:37:08 wsl-mx2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May 7 06:39:55 wsl-mx2 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
May 7 06:39:55 wsl-mx2 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
May 7 06:39:59 wsl-mx2 clurgmgrd: [5376]: <err> script:zimbra: start of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1)
May 7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> start on script "zimbra" returned 1 (generic error)
May 7 06:39:59 wsl-mx2 clurgmgrd[5376]: <warning> #68: Failed to start service:mx.mydomain.com; return value: 1
May 7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> Stopping service mx.mydomain.com
May 7 06:40:10 wsl-mx2 clurgmgrd: [5376]: <notice> Forcefully unmounting /opt/zimbra-cluster/mountpoints/mx.mydomain.com
May 7 06:40:10 wsl-mx2 clurgmgrd: [5376]: <warning> killing process 6870 (zimbra amavisd /opt/zimbra-cluster/mountpoints/mx.mydomain.com)
...(more killing process messages)
May 7 06:40:16 wsl-mx2 clurgmgrd[5376]: <notice> Service mx.mydomain.com is recovering
May 7 06:40:16 wsl-mx2 clurgmgrd[5376]: <warning> #71: Relocating failed service mx.mydomain.com
I didn't see anything particularly interesting in the Zimbra logs, and they were too big to post in this message, so I'll reply with them.
I found two threads that might be related to this: [SOLVED] Clustering problem. RHCS transfer even the ZCS service is well
The former suggests deleting the log directory; the latter suggests increasing the zmcluctl timeout. However, neither indicates whether the suggested fix actually solved the problem.
Any suggestions? Thanks!
05-11-2010, 06:33 AM
zimbra.log
Attached are my zimbra.log snippets from both servers. Thanks!
05-11-2010, 07:26 AM
(Sorry to spam my own post)
I was looking through the zimbra.log on node 2 (originally the active), and noticed this:
Code:
May 7 06:36:20 wsl-mx2 zimbra-cluster[2821]: status - rc=1 from zmcontrol: output=[Host mx.mydomain.com <EOL>, antispam Running <EOL>, antivirus Stopped <EOL>, zmclamdctl is not running <EOL>, imapproxy Running <EOL>, ldap Running <EOL>, logger Running <EOL>, mailbox Running <EOL>, mta Running <EOL>, snmp Running <EOL>, spell Running <EOL>, stats Running ]
May 7 06:36:21 wsl-mx2 zimbra-cluster[3300]: stop - Zimbra stop initiated via zmcluctl
Could this indicate that the cluster failed over because antivirus wasn't running? I don't see anything worthwhile in clamd.log from that time, other than the daemon starting up after the failover:
Code:
Fri May 7 06:20:02 2010 -> SelfCheck: Database status OK.
Fri May 7 06:30:02 2010 -> SelfCheck: Database status OK.
Fri May 7 06:35:56 2010 -> Reading databases from /opt/zimbra/data/clamav/db
Fri May 7 06:38:36 2010 -> +++ Started at Fri May 7 06:38:36 2010
Fri May 7 06:38:36 2010 -> clamd daemon 0.95.1-broken-compiler (OS: linux-gnu, ARCH: i386, CPU: i686)
Fri May 7 06:38:36 2010 -> Log file size limited to 20971520 bytes.
Fri May 7 06:38:36 2010 -> Reading databases from /opt/zimbra/data/clamav/db
Fri May 7 06:38:36 2010 -> Not loading PUA signatures.
Fri May 7 06:41:43 2010 -> +++ Started at Fri May 7 06:41:43 2010
05-11-2010, 08:05 AM
This looks like the way the cluster scripts work: they test every component every X minutes (5 or 10, I can't remember). If one of the components is not working, the cluster switches!
It's supposed to fence the node on which the component failed and switch to the spare node.
If fencing is not working properly, you can end up with both nodes up at the same time. Very, very, very bad.
05-11-2010, 08:13 AM
Ah OK, so you agree that it failed over because AV wasn't running? How would I figure out why AV wasn't running, as I don't see anything in the clamd log file?
And yes, the failover did not work properly. I have it set to fence using DRAC, so it should just power off the failed node and start up the services on the standby. However, that didn't happen. Both servers stayed online, and the Zimbra service was taken down and did not come back up automatically. It had to be brought up manually.
05-11-2010, 08:33 AM
Fencing not working as it should and scripts switching "too quickly" (waiting for an operator to restart antivirus should have been OK) are the reasons some of our customers left RHCS and are now working with "hand clustering".
If a module goes down, it is manually restarted.
If the active server goes down, ZCS is manually switched to the spare server.
In ZCS6, there's a new way of configuring RHCS so that it switches only in case of "hardware" failure.
05-13-2010, 08:45 AM
Got it, RHCS failover is not very reliable. I'll probably just write a wrapper script for it to use instead of zmcluctl, one that returns 1 only if the mailbox service is down or port 25 does not answer, and merely alerts me otherwise.
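A minimal sketch of the decision logic such a wrapper might use, in Python. This is not how zmcluctl works; the function names (`parse_status`, `smtp_port_open`, `decide`) and the `CRITICAL_SERVICES` set are hypothetical, and the status text is assumed to look like the `zmcontrol` output quoted in the zimbra.log line earlier in this thread (one "service Running" or "service Stopped" entry per line).

```python
import socket

# Assumption: only the mailbox service should trigger a failover;
# anything else that is stopped is reported as a warning instead.
CRITICAL_SERVICES = {"mailbox"}


def parse_status(output):
    """Map service name -> True if Running, from zmcontrol-style status text."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        # Service lines have exactly two fields, e.g. "antivirus Stopped".
        # The "Host ..." header and detail lines such as
        # "zmclamdctl is not running" are skipped.
        if len(parts) == 2 and parts[1] in ("Running", "Stopped"):
            states[parts[0]] = parts[1] == "Running"
    return states


def smtp_port_open(host="localhost", port=25, timeout=5.0):
    """Return True if a TCP connection to the SMTP port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def decide(states, smtp_ok):
    """Return (exit_code, warnings).

    exit_code is 1 only if a critical service is stopped or missing from
    the status text, or if the SMTP port did not answer. Stopped
    non-critical services are returned as warnings and leave it at 0.
    """
    down = sorted(name for name, running in states.items() if not running)
    critical_down = [name for name in down if name in CRITICAL_SERVICES]
    missing = CRITICAL_SERVICES - set(states)
    warnings = [name for name in down if name not in CRITICAL_SERVICES]
    failed = bool(critical_down) or bool(missing) or not smtp_ok
    return (1 if failed else 0), warnings
```

A script handed to the cluster would pass the text of the status check to `parse_status`, call `decide(states, smtp_port_open())`, send the warnings wherever alerts should go, and exit with the returned code. Treating a missing mailbox entry as a failure is a deliberate choice here, so that an empty or unreadable status output does not look healthy.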
Any idea why the clamd service might have died, or where I could check to glean more info? Thanks!