My company has been running our new ZCS Network Edition cluster for a few weeks now. I've noticed some disappointing issues with running a clustered environment, and I'm looking for feedback.
-Single-Node cluster setup (two physical servers, 1 shared storage array, active/standby).
-RedHat ES v4 as base OS with RedHat Cluster Suite 4 as the clustering architecture.
The main problem with our cluster, which I didn't see when testing a trial version on a single server, is stability during configuration and maintenance. Specifically, when I run certain commands or make configuration changes in the admin web console, the cluster often decides to fail over to the other node. The failover takes time, so our server is unavailable for a few minutes. This is a pain because users get disconnected and start thinking the mail server is down.
I think the problem is that whenever a config change requires one of the individual services/processes in the suite to be restarted or HUP'd, the cluster thinks something is down and starts failing over to the other box before the service/process gets a chance to come back up.
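To illustrate what I think is going on, here is a toy model in Python. It is only a sketch of my guess, not how RedHat Cluster Suite or Zimbra actually implement their status checks. The function name `first_failover_poll` and the `tolerated_failures` parameter are made up for illustration and do not correspond to any real cluster setting that I know of.

```python
from typing import Iterable, Optional


def first_failover_poll(status_checks: Iterable[bool],
                        tolerated_failures: int = 0) -> Optional[int]:
    """Return the index of the poll that would trigger a failover, or None.

    status_checks: one boolean per monitoring poll (True = service looked up).
    tolerated_failures: hypothetical number of consecutive failed polls the
    monitor ignores before it gives up on the node.
    """
    consecutive = 0
    for index, healthy in enumerate(status_checks):
        if healthy:
            consecutive = 0  # service is back, reset the failure streak
            continue
        consecutive += 1
        if consecutive > tolerated_failures:
            return index  # monitor declares the node failed here
    return None


# A config change restarts one service, so two polls in a row see it down.
polls = [True, True, False, False, True, True]

print(first_failover_poll(polls, tolerated_failures=0))
print(first_failover_poll(polls, tolerated_failures=2))
```

With no tolerance, the first failed poll during the restart is enough to trigger a failover. With a tolerance of two failed polls, the same restart passes unnoticed. If my guess is right, our cluster is behaving like the first case.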
For example, when I log into the administration web console and configure AntiVirus settings by un-checking "Block encrypted archives" and then clicking save, the result is that the cluster fails over to the other node and the entire server is unavailable for about 2 minutes. This wasn't happening when I was testing Zimbra with a trial version using a single server. I also get the same behavior when I run some of the CLI tools.
What is especially annoying is that I have no idea which changes will cause this downtime unless I've already tried them. This means that whenever I have to make even a trivial configuration change, one that shouldn't require restarting any services, I have to plan for an outage and notify all of our users. This is becoming a real pain.
Is anyone else having these kinds of problems? It almost seems like by going with a clustered environment we have ended up with a LESS stable system and more downtime than if we had just gone with a single server.
Any rules of thumb on knowing beforehand what kinds of changes/edits will result in the server being unavailable for a few minutes and which ones are safe?
Any tips/feedback for a newbie Zimbra sysadmin are appreciated.