| Welcome to the Zimbra :: Forums! | |
07-17-2008, 11:53 PM
| | Intermediate Member | |
Posts: 18
| | Cluster Stability Issues
So my company has been using our new ZCS Network Edition cluster for a few weeks now, and I've noticed some disappointing issues with running a clustered environment. I'm looking for feedback.
Setup Info:
-Single-Node cluster setup (two physical servers, 1 shared storage array, active/standby).
-RedHat ES v4 as base OS with RedHat Cluster Suite 4 as the clustering architecture.
The main problem I'm having with our cluster that we didn't have when I was testing a trial version on a single server is stability while doing configuration/maintenance. Specifically, when I run certain commands or make configuration changes in the admin web console the result is often that the cluster wants to fail over to the other node, which takes time to do, so our server is actually unavailable for a few minutes. This is a pain because users get disconnected and start thinking the mail server is down.
I think the problem is that whenever a config change requires one of the suite's services/processes to be restarted or HUP'd, the cluster thinks something is down and starts failing over to the other box before the service/process gets a chance to restart.
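To illustrate the theory above: the sketch below is a toy model, not how RedHat Cluster Suite is actually implemented. It shows why a monitor that reacts to a single failed status check would fail over during a routine restart, while one that waits for several consecutive failures would ride it out. The class name, function name, and the `failures_needed` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Declare a service failed only after several consecutive bad checks."""

    failures_needed: int = 3  # hypothetical threshold, not a real RHCS setting
    consecutive_failures: int = 0

    def observe(self, service_is_up: bool) -> bool:
        """Record one status-check result; return True if failover should start."""
        if service_is_up:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_needed


def should_fail_over(checks, failures_needed):
    """Replay a sequence of status checks and report whether failover triggers."""
    decider = FailoverDecider(failures_needed=failures_needed)
    return any(decider.observe(ok) for ok in checks)


# A service restart looks like a short dip; a real outage stays down.
restart = [True, False, False, True, True]
outage = [True, False, False, False, False]

print(should_fail_over(restart, failures_needed=1))  # trigger-happy monitor
print(should_fail_over(restart, failures_needed=3))  # tolerant monitor
print(should_fail_over(outage, failures_needed=3))
```

The trade-off is that a more tolerant monitor also takes longer to react to a genuine failure.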
For example, when I log into the administration web console and configure AntiVirus settings by un-checking "Block encrypted archives" and then clicking save, the result is that the cluster fails over to the other node and the entire server is unavailable for about 2 minutes. This wasn't happening when I was testing Zimbra with a trial version using a single server. I also get the same behavior when I run some of the CLI tools.
What is especially annoying about this is that I have no idea what types of changes will result in this downtime unless I've already tried them before. This means that whenever I have to make even a trivial configuration change, one that shouldn't require restarting any services, I have to plan for an outage and notify all of our users. This is becoming a real pain.
Is anyone else having these kinds of problems? It almost seems like by going with a clustered environment we have ended up with a LESS stable system and more downtime than if we had just gone with a single server.
Any rules of thumb on knowing beforehand what kinds of changes/edits will result in the server being unavailable for a few minutes and which ones are safe?
Any tips/feedback for a newbie Zimbra sysadmin are appreciated. | 
12-11-2008, 12:29 PM
| | | Wow - that's not good - we're planning to implement the same thing in our environment. | 
12-11-2008, 02:15 PM
| | Trained Alumni | |
Posts: 342
| | We have stopped using clustering for many of the same reasons. It's faster and easier to manually move a service to another node than to deal with the headache and problems Redhat Clustering gave us.
I have no experience with Veritas Clustering and Zimbra so I can't say if it would be any better.
Matt | 
12-11-2008, 05:59 PM
| | | FWIW we have pretty much stopped using clustering like that as well; maintaining a cluster just took a lot more admin time, and we too found the cluster to be "fragile" way too often.
An HA environment for us now is a Zimbra farm on identical server hardware connected to a good SAN (CLARiiON CX4 or similar) with a spare server chassis. If a server dies, we just pop the on-board disks from the dead box into the spare chassis, boot it up, reconfigure all the NICs (the MAC addresses have changed, of course) and we are done. Not too much slower than waiting for the RHEL cluster to fail over.
If very serious HA is required by the client, we'll add a second SAN in a different location and do SAN replication over the (secured) WAN.
Hope that helps,
Mark
__________________
___________________________________ L. Mark Stone, CIO "Uptime. All the time."
477 Congress Street | Portland, ME 04101-3431 | (207) 772-5678
proactive maintenance and monitoring | technology consulting
Zimbra groupware | EMR implementations | private cloud hosting
| 
12-12-2008, 12:44 AM
| | | Same "problem" here with one customer; we'll move off RHCS at the beginning of next year.
However, the cluster has been much quieter and less prone to issues since 5.0.6.
Today, with all the features you get in VM management, I think it's the way to go... | 
12-13-2008, 04:21 PM
| | Intermediate Member | |
Posts: 15
| | Hi all, I might be missing something here, but isn't the whole idea of clustering to "silently" bring down a server for maintenance and bring it back up when all the work is done?
So my question is whether it would be feasible to do something like the following, assuming there are two clustered nodes A & B:
- use firewall to block access to node B
- all traffic will be routed to A thinking B is down
- use CLI or some private tunnel to configure node B, save changes
- restart B if need to, remove firewall blocking
- repeat for node A
Without getting into the details, will this logic work? It seems to me the biggest issue is that the dispatcher thinks a node is alive one second and dead the next, so why not just virtually kill it and bring it up "officially" when ready?
I personally haven't implemented the clustering setup yet, but I'm definitely interested in doing it, so I appreciate your sharing these sorts of issues.
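The steps above can be sketched as a small simulation. This is only a model of the proposed logic, not a tested Zimbra or RHCS procedure: `Node`, `rolling_change`, and the config key are hypothetical names, and it assumes both nodes can serve traffic independently, which may not hold for the active/standby setup with shared storage described in the first post.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    in_rotation: bool = True          # True while the dispatcher may send traffic here
    config: dict = field(default_factory=dict)


def rolling_change(nodes, key, value):
    """Apply one config change node by node, never draining the last live node."""
    log = []
    for node in nodes:
        others_up = [n for n in nodes if n is not node and n.in_rotation]
        if not others_up:
            raise RuntimeError(
                f"refusing to drain {node.name}: no other node in rotation"
            )
        node.in_rotation = False      # stands in for the firewall block
        log.append(f"drain {node.name}")
        node.config[key] = value      # stands in for the CLI / private-tunnel change
        log.append(f"configure {node.name}")
        node.in_rotation = True       # stands in for removing the firewall block
        log.append(f"restore {node.name}")
    return log


nodes = [Node("B"), Node("A")]
for step in rolling_change(nodes, "block_encrypted_archives", False):
    print(step)
```

The guard that refuses to drain the last node in rotation is the part that matters: with only one node able to serve, this approach still means an outage.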
thx | 
12-16-2008, 02:56 PM
| | Advanced Member | |
Posts: 193
| | The clustering seemed a bit of a pain to us as well, and we are using standby machines. We configured our machines to have dual IPs on the same interface and to have access to the same bunch of NetApps. The L4 switch only looks at one set of addresses, and if a mailbox server fails, I can have the hot spare take over in about 2 min. My process is: make sure the failed machine is really failed (power off if necessary), turn on the failed machine's IP on the standby machine, mount the correct network partitions, and start Zimbra on the backup machine. 24/7 monitoring means that we will notice the failure pretty quickly. | 
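As a rough sketch of the manual takeover process described above, the function below only builds the ordered list of commands for the standby machine; it does not run anything. The IP address, interface name, and NFS export/mount point in the usage lines are hypothetical placeholders, and the real steps will depend on your own network and storage layout.

```python
def build_failover_plan(service_ip, prefix_len, interface, mounts, failed_host_confirmed_down):
    """Return the ordered commands for a manual takeover on the standby machine."""
    if not failed_host_confirmed_down:
        # Bringing up the same IP while the old host is still alive risks a conflict.
        raise RuntimeError(
            "confirm the failed machine is really down (power it off if necessary)"
        )
    # 1. Bring up the failed machine's address on the standby's interface.
    plan = [["ip", "addr", "add", f"{service_ip}/{prefix_len}", "dev", interface]]
    # 2. Mount the network partitions the failed machine was using.
    for source, target in mounts:
        plan.append(["mount", source, target])
    # 3. Start Zimbra as the zimbra user.
    plan.append(["su", "-", "zimbra", "-c", "zmcontrol start"])
    return plan


# Hypothetical values, for illustration only.
plan = build_failover_plan(
    service_ip="192.0.2.10",
    prefix_len=24,
    interface="eth0",
    mounts=[("netapp1:/vol/zimbra", "/opt/zimbra")],
    failed_host_confirmed_down=True,
)
for command in plan:
    print(" ".join(command))
```

Keeping the "is it really down?" check as a hard precondition mirrors the first step of the process; everything else is just ordering.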
12-16-2008, 03:13 PM
| | | Quote:
Originally Posted by Vladimir: The clustering seemed a bit of a pain to us as well, and we are using standby machines. We configured our machines to have dual IPs on the same interface and to have access to the same bunch of NetApps. The L4 switch only looks at one set of addresses, and if a mailbox server fails, I can have the hot spare take over in about 2 min. My process is: make sure the failed machine is really failed (power off if necessary), turn on the failed machine's IP on the standby machine, mount the correct network partitions, and start Zimbra on the backup machine. 24/7 monitoring means that we will notice the failure pretty quickly. |
Sounds like most of us replying to this thread are on the same page: in most cases, the costs (cash, maintenance and time) of a RH cluster exceed its benefits compared with other (quasi) HA solutions for hosting Zimbra.
All the best,
Mark