
Thread: Cluster Stability Issues

  1. #1
    node_runner (Intermediate Member, joined Sep 2007)

    Cluster Stability Issues

    So my company has been using our new ZCS Network Edition cluster for a few weeks now, and I've run into some disappointing stability issues with the clustered environment. I'm looking for feedback.

    Setup Info:
    - Single-node cluster setup (two physical servers, one shared storage array, active/standby).
    - Red Hat ES v4 as the base OS, with Red Hat Cluster Suite 4 as the clustering architecture.


    The main problem, which I didn't have while testing a trial version on a single server, is stability during configuration and maintenance. When I run certain commands or make configuration changes in the admin web console, the cluster often decides to fail over to the other node. The failover takes time, so the server is actually unavailable for a few minutes. That's a pain, because users get disconnected and start thinking the mail server is down.

    I think the problem is that whenever a config change requires one of the suite's services or processes to be restarted or HUP'd, the cluster decides something is down and starts failing over to the other box before the service gets a chance to restart.

    For example, when I log into the administration web console, uncheck "Block encrypted archives" in the AntiVirus settings, and click Save, the cluster fails over to the other node and the entire server is unavailable for about two minutes. This didn't happen when I was testing Zimbra on a single trial server. I get the same behavior when I run some of the CLI tools.

    What's especially annoying is that I have no way of knowing which kinds of changes will cause this downtime until I've already tried them. So even for a trivial configuration change that shouldn't require a service restart, I have to plan for an outage and notify all of our users. This is becoming a real pain.

    Is anyone else having these kinds of problems? It almost seems like by going with a clustered environment we've ended up with a LESS stable system and more downtime than if we had just gone with a single server.

    Any rules of thumb for knowing beforehand which kinds of changes/edits will make the server unavailable for a few minutes, and which ones are safe?

    Any tips/feedback for a newbie Zimbra sysadmin is appreciated.

  2. #2
    whatisee1 (Junior Member, joined Mar 2008)

    Wow, that's not good. We're planning to implement the same thing in our environment.

  3. #3
    Chewie71 (Trained Alumni, Illinois, joined Sep 2006)

    We have stopped using clustering for many of the same reasons. It's faster and easier to move a service to another node manually than to deal with the headaches and problems Red Hat Clustering gave us.

    I have no experience with Veritas Clustering and Zimbra so I can't say if it would be any better.

    Matt

  4. #4
    LMStone (Moderator, Portland, ME, joined Sep 2006)

    FWIW, we have pretty much stopped using clustering like that as well; maintaining a cluster just took a lot more admin time, and we too found the cluster to be "fragile" way too often.

    An HA environment for us now is a Zimbra farm on identical server hardware connected to a good SAN (Clarion CX4 or similar) with a spare server chassis. If a server dies, we just pop the on-board disks from the dead box into the spare chassis, boot it up, reconfigure all the NICs (the MAC addresses change, of course), and we're done. Not much slower than waiting for the RHEL cluster to fail over.
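    That NIC reconfiguration step can be scripted. Here's a minimal sketch, assuming RHEL-style ifcfg files; the file path and MAC address in the usage comment are placeholders, not values from this thread:

```shell
# Sketch only: fix_hwaddr rewrites the HWADDR line in an ifcfg file so the
# config matches the NIC that came with the spare chassis.
fix_hwaddr() {
    cfg="$1"   # path to an ifcfg file
    mac="$2"   # MAC address of the replacement NIC
    sed -i "s/^HWADDR=.*/HWADDR=${mac}/" "$cfg"
}

# Typical use after the disk swap (path and MAC are placeholders):
#   fix_hwaddr /etc/sysconfig/network-scripts/ifcfg-eth0 00:16:3e:00:00:01
#   service network restart
```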

    If very serious HA is required by the client, we'll add a second SAN in a different location and do SAN replication over the (secured) WAN.

    Hope that helps,
    Mark

  5. #5
    Klug (Moderator, Beaucaire, France, joined Mar 2006)

    Same "problem" here with one customer; we'll be leaving RHCS at the beginning of next year.

    That said, the cluster has been much quieter and less prone to issues since 5.0.6.

    Today, with all the features you get in VM management, I think that's the way to go...

  6. #6
    sam_gennux (Intermediate Member, joined Jun 2008)

    Hi all, I might be missing something here, but isn't the whole idea of clustering to "silently" bring down a server for maintenance and bring it back up when all the work is done?

    So my question is: would it be feasible to do something like the following, assuming there are two clustered nodes, A and B?

    - Use a firewall to block access to node B.
    - All traffic gets routed to A, since B appears to be down.
    - Use the CLI or a private tunnel to configure node B and save the changes.
    - Restart B if needed, then remove the firewall block.
    - Repeat for node A.

    Without getting into the details, would this logic work? It seems to me the biggest issue is that the dispatcher thinks a node is alive one second and dead the next, so why not just virtually kill it and bring it back up "officially" when it's ready?
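    The drain/restore steps above might be sketched like this. Everything here is illustrative: the ports assume client traffic arrives on SMTP and webmail, and the DRY_RUN wrapper just prints the commands so the sketch is safe to try. One caveat worth noting: the cluster heartbeat interface must not be blocked, or RHCS will declare the node dead and fail over anyway.

```shell
# Sketch only: rules and ports are illustrative; adapt to your topology.
# DRY_RUN (default 1) prints each command instead of running it.
run() { [ "${DRY_RUN:-1}" = "1" ] && echo "$*" || "$@"; }

drain_node() {
    # Drop client traffic so the front end routes everything to node A.
    # Do NOT touch the cluster heartbeat interface.
    run iptables -I INPUT -p tcp --dport 25 -j DROP   # SMTP
    run iptables -I INPUT -p tcp --dport 80 -j DROP   # webmail
}

restore_node() {
    run iptables -D INPUT -p tcp --dport 25 -j DROP
    run iptables -D INPUT -p tcp --dport 80 -j DROP
}
```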

    I personally haven't implemented a clustering setup yet, but I'm definitely interested in doing so, so I appreciate your sharing these sorts of issues.

    thx

  7. #7
    Vladimir (Advanced Member, joined Aug 2007)

    The clustering seemed a bit of a pain to us as well, so we use standby machines. We configured our machines with dual IPs on the same interface and access to the same bunch of NetApps. The L4 switch only looks at one set of addresses, and if a mailbox server fails, I can have the hot spare take over in about two minutes. My process: make sure the failed machine is really dead (power it off if necessary), bring up the failed machine's IP on the standby machine, mount the correct network partitions, and start Zimbra on the backup machine. 24/7 monitoring means we notice a failure pretty quickly.
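    That takeover procedure might be sketched as follows. The IP, interface, and NFS source are placeholders (zmcontrol is Zimbra's real control CLI), and a DRY_RUN guard makes the sketch just echo the commands:

```shell
# Sketch only: IP, interface, and mount source are placeholders.
# DRY_RUN (default 1) prints each command instead of running it.
run() { [ "${DRY_RUN:-1}" = "1" ] && echo "$*" || "$@"; }

takeover() {
    run ip addr add 192.0.2.10/24 dev eth0            # failed server's service IP
    run mount -t nfs filer:/vol/zimbra /opt/zimbra    # shared NetApp volume
    run su - zimbra -c "zmcontrol start"              # bring up Zimbra services
}
```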

  8. #8
    LMStone (Moderator, Portland, ME, joined Sep 2006)

    Quote Originally Posted by Vladimir View Post
    The clustering seemed a bit of a pain to us as well, and we are using stand by machines. We configured our machines to have dual IP's on the same interface and have access to the same bunch of NetApps. The L4 switch only looks at one set of addresses, and if a mailbox server fails, I can have the hot spare take over in about 2min. My process is make sure the failed machine is really failed (power off if necessary), turn on the fail machine IP on the standby machine, mount correct network partitions, start zimbra on backup machine. 24/7 monitoring means that we will notice the failure pretty quickly.
    Sounds like most of us replying to this thread are on the same page: in most cases, the costs (cash, maintenance, and time) of an RH cluster exceed its benefits compared with other (quasi-)HA solutions for hosting Zimbra.

    All the best,
    Mark
