
Thread: Mailbox server clustering questions

  1. #1
    andrew_l is offline Active Member
    Join Date
    Apr 2008
    Posts
    26
    Rep Power
    7

    Default Mailbox server clustering questions

    I have set up a clustered pair of mailbox servers running Zimbra 5.0.2. My configuration largely follows the Zimbra docs on RHEL 4 cluster configuration. The services basically work, but I've noticed two oddities that I am trying to understand.

    When the Zimbra services are running on the normally-active node, the periodic status logs for the cluster services look like this:

    Apr 15 13:22:03 servername1 zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: logger: Running
    Apr 15 13:22:03 servername1 zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: mailbox: Running
    Apr 15 13:22:03 clustername zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: logger: Running
    Apr 15 13:22:03 servername1 zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: spell: Running
    Apr 15 13:22:03 clustername zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: mailbox: Running
    Apr 15 13:22:03 servername1 zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: stats: Running
    Apr 15 13:22:03 clustername zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: spell: Running
    Apr 15 13:22:03 clustername zimbramon[28465]: 28465:info: 2008-04-15 13:22:01, STATUS: clustername.example.com: stats: Running

    When the services have failed over to the standby node, they seem to work properly but the status log looks like this:

    Apr 15 13:32:03 servername2 zimbramon[22532]: 22532:info: 2008-04-15 13:32:01, STATUS: clustername.example.com: logger: Stopped
    Apr 15 13:32:03 servername2 zimbramon[22532]: 22532:info: 2008-04-15 13:32:01, STATUS: clustername.example.com: mailbox: Running
    Apr 15 13:32:03 servername2 zimbramon[22532]: 22532:info: 2008-04-15 13:32:01, STATUS: clustername.example.com: spell: Running
    Apr 15 13:32:03 servername2 zimbramon[22532]: 22532:info: 2008-04-15 13:32:01, STATUS: clustername.example.com: stats: Running

    There are two obvious differences here:

    1) In the first case, there are apparently two instances of each service, one reporting under the cluster name and one under the server name. In the second case, each service has only one instance, under the server name.

    2) In the second case, the logger shows as "Stopped", although it appeared to start when the services initially failed over.

    I'm wondering if anyone can explain these, or give me a hint on how to troubleshoot the logger problem. Thanks,

    -Andrew L

  2. #2
    andrew_l is offline Active Member
    Join Date
    Apr 2008
    Posts
    26
    Rep Power
    7

    Default

    I think I have gotten past these problems. The issue with multiple instances of the services seems to have resolved itself after some failover testing, though I'm not sure what caused it or what corrected it. The logger problem on the standby node turned out to be because I hadn't run the zmsyslogsetup script there. After running it, the logger is now running on the standby node.
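    In case anyone else hits this, here is roughly what I ran on the standby node. Paths and service names are from a stock ZCS 5.0 install on RHEL, so adjust them if your layout differs:

    # as root on the standby node: configure syslog for Zimbra's logger
    /opt/zimbra/libexec/zmsyslogsetup

    # restart syslog, then restart the logger service as the zimbra user
    service syslog restart
    su - zimbra -c "zmloggerctl restart"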

    -Andrew L

  3. #3
    Klug is offline Moderator
    Join Date
    Mar 2006
    Location
    Beaucaire, France
    Posts
    2,322
    Rep Power
    13

    Default

    Welcome to the forum.

    Do you mean you're using your own failover scripts?

  4. #4
    andrew_l is offline Active Member
    Join Date
    Apr 2008
    Posts
    26
    Rep Power
    7

    Default

    Quote Originally Posted by Klug View Post
    Welcome to the forum.

    Do you mean you're using your own failover scripts?
    No, we're using Red Hat Cluster Suite. Although I've solved the two problems mentioned in the OP, I'm having some trouble coming up with a fencing configuration that works properly. This is really a Red Hat Cluster question rather than a Zimbra question, but if anyone is an RHCS expert, I would be interested to know if there is a way to prevent a node that can't reach the router from trying to fence the other node.

    -Andrew L

  5. #5
    Klug is offline Moderator
    Join Date
    Mar 2006
    Location
    Beaucaire, France
    Posts
    2,322
    Rep Power
    13

    Default

    Quote Originally Posted by andrew_l View Post
    No, we're using Red Hat Cluster Suite.
    Reading your post, I thought you were using RHCS but with your own scripts (not the ones provided by ZCS's cluster install tool).

    Quote Originally Posted by andrew_l View Post
    I would be interested to know if there is a way to prevent a node that can't reach the router from trying to fence the other node.
    I'm not sure I understand.

    Nodes are supposed to be in the same subnet (because of the virtual IP), so there should not be any "router" (aka "gateway") between the nodes.

    How do you end up with one node not reaching the router? Network issues? The gateway going down? And why does the gateway going down make your node think it cannot reach the other node?

  6. #6
    andrew_l is offline Active Member
    Join Date
    Apr 2008
    Posts
    26
    Rep Power
    7

    Default

    Quote Originally Posted by Klug View Post
    Nodes are supposed to be in the same subnet (because of the virtual IP), so there should not be any "router" (aka "gateway") between the nodes.

    How do you end up with one node not reaching the router? Network issues? The gateway going down? And why does the gateway going down make your node think it cannot reach the other node?
    The two nodes are in the same subnet and are both connected to the same switch. Cluster communication, as well as general network traffic, goes through that switch. We are using HP iLO for fencing, so the fencing traffic goes through that switch as well. The situation is that if the switch goes down, removing both nodes from the network, each node thinks the other has gone down. They both start trying to fence the other. This is unsuccessful as long as the switch is down (because the iLO traffic needs to go over the network). When the switch comes back up, the nodes fence each other and everything shuts down.

    What I would like is to be able to add a check that basically says, "if the other node seems down, but you can't ping the switch, then don't do anything -- just sit there and wait for the switch to come back online." Or, if there is a general flaw in my design that could be corrected some other way, I'm open to that too.
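    Something like this untested wrapper script is roughly what I have in mind: it refuses to fence while the switch itself is unreachable, and otherwise hands off to the real fence_ilo agent. The switch IP, iLO address, and credentials are made-up placeholders, and the fence_ilo flags are from the man page on our version, so double-check them on yours.

    #!/bin/bash
    # Hypothetical fence wrapper (sketch only): fence the peer via iLO,
    # but only if our uplink switch is still answering pings.
    # SWITCH_IP, ILO_IP, ILO_USER and ILO_PASS are placeholder values.
    SWITCH_IP=192.168.1.250
    ILO_IP=192.168.1.201
    ILO_USER=fenceuser
    ILO_PASS=fencepass

    # If the switch does not answer, the whole network is probably down,
    # so report failure instead of power-cycling the other node.
    if ! ping -c 3 -w 5 "$SWITCH_IP" > /dev/null 2>&1; then
        echo "switch unreachable, refusing to fence" >&2
        exit 1
    fi

    # The network looks fine from here, so the peer really is gone: fence it.
    exec fence_ilo -a "$ILO_IP" -l "$ILO_USER" -p "$ILO_PASS" -o reboot

    As far as I can tell the cluster keeps retrying a failed fence action, so returning an error here should effectively make the node sit and wait for the switch to come back, but I haven't tested this. Thanks,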

    -Andrew L

  7. #7
    Klug is offline Moderator
    Join Date
    Mar 2006
    Location
    Beaucaire, France
    Posts
    2,322
    Rep Power
    13

    Default

    Quote Originally Posted by andrew_l View Post
    The situation is that if the switch goes down, removing both nodes from the network, each node thinks the other has gone down.
    I think I get it.

    In order to avoid this, you need two switches (obviously on different power supplies) connected to each other, two NICs in each node, and then Spanning Tree (or an equivalent technology handled by the switches) set up between the switches.
    If one switch goes down, the other is still there; you've removed one of the SPOFs of the design and you're safe.

    However, there's still a problem with most fence devices (iLO, DRAC, APC, etc.): they have only a single NIC...
    If the switch the fence device is connected to goes down, the related node cannot be fenced off.
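    On the node side, the two NICs are usually joined with Linux channel bonding in active-backup mode, so the cluster still sees a single interface. A rough RHEL 4 example (interface names and addresses are only placeholders, check the Red Hat bonding documentation for your release):

    # /etc/modprobe.conf -- load the bonding driver, active-backup with link monitoring
    alias bond0 bonding
    options bond0 mode=1 miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded interface carries the IP
    DEVICE=bond0
    IPADDR=192.168.1.11
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1) -- enslaved NICs
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

    Plug eth0 into one switch and eth1 into the other; if one switch dies, the bond fails over to the link on the surviving switch.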

  8. #8
    andrew_l is offline Active Member
    Join Date
    Apr 2008
    Posts
    26
    Rep Power
    7

    Default

    Quote Originally Posted by Klug View Post
    I think I get it.
    In order to avoid this, you need two switches (obviously on different power supplies) connected to each other, two NICs in each node, and then Spanning Tree (or an equivalent technology handled by the switches) set up between the switches.
    If one switch goes down, the other is still there; you've removed one of the SPOFs of the design and you're safe.
    Yes, I had considered a design like that and we may just have to do it. Our network engineer was already planning to set up 2 switches for the subnet the cluster is on, plus a third switch on its own subnet for iLO. However, all these switches would connect back to our core router, which would become the SPOF. To avoid that, we would need to set up spanning tree among the switches as you suggested, but our network guy was reluctant to try that.

    I was hoping there was a way to use RHCS to do network checks and have the cluster nodes make decisions based on that. I know that sort of thing can be done with other cluster software such as Linux-HA.

    -Andrew L

  9. #9
    Klug is offline Moderator
    Join Date
    Mar 2006
    Location
    Beaucaire, France
    Posts
    2,322
    Rep Power
    13

    Default

    Quote Originally Posted by andrew_l View Post
    I was hoping there was a way to use RHCS to do network checks and have the cluster nodes make decisions based on that.
    This might exist; I've always gone the "hard network way" 8)
    You should subscribe to the RHCS-Cluster mailing list and ask; the people there know what they're talking about.
