mcgreen --
Here at Brandeis University, we are rolling out a multi-server install with 7000 accounts behind the Cisco ACE load balancer module. Today we're wrapping up an opt-in period and getting underway with migrating the rest of the university. The road to here was relatively painless, but not without a bruise or two along the way.
Our infrastructure consists of 12 servers -- 4 mail stores (with the potential of adding 2 more), 2 LDAP (master and replica), 2 IMAP/POP3 proxies, 2 MXs, and 2 dedicated name servers (for running a split domain). All servers are Xen 3.1.3 virtual machines, running RHEL 5.1 x86. For more details about server infrastructure, please see this post by another member of our team: Zimbra and Xen.
Our load balance configuration consists of the following:
- 3 server farms -- zimbra-web, zimbra-proxy, and zimbra-mx
- The zimbra-web farm consists of the 4 mail stores, with probes on ports 80 and 443
- The zimbra-proxy farm consists of the 2 proxies with probes on 993 and 995
- The zimbra-mx farm consists of the 2 MXs with probes on 25 and 465
Our probes are simple TCP connections to the specified ports, executed every 5 seconds. If a probe fails, the server is immediately marked 'out of service', and a pass detect interval of 15 seconds applies before it can be active in the farm again (a probe must succeed 3 times before the server can rejoin). Down the road we will configure better probes, such as HTTP probes on the mail stores (we'd expect a 302 on port 80 and a 200 on port 443), SMTP probes for the MXs (we can issue a HELO to the port and expect a 250 reply back), etc. There is some danger in doing SMTP probes since the MXs are configured to rate limit connections; however, that can easily be changed.
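To make the probe behavior concrete, here is a minimal Python sketch of the logic described above. It is not Cisco ACE configuration, and the names `tcp_probe` and `ServerHealth` are made up for illustration. It assumes a single failed probe takes a server out of service, that 3 consecutive passes are needed to rejoin, and that a failure while recovering resets the count.

```python
import socket
from dataclasses import dataclass


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class ServerHealth:
    """Tracks whether one real server is active in its farm (hypothetical model)."""
    probe_interval: int = 5        # seconds between probes while in service
    passdetect_interval: int = 15  # seconds between probes while out of service
    passes_to_rejoin: int = 3      # consecutive passes needed to rejoin the farm
    in_service: bool = True
    consecutive_passes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result and return the resulting in-service state."""
        if not probe_ok:
            # A single failure takes the server out immediately.
            self.in_service = False
            self.consecutive_passes = 0
        elif not self.in_service:
            self.consecutive_passes += 1
            if self.consecutive_passes >= self.passes_to_rejoin:
                self.in_service = True
                self.consecutive_passes = 0
        return self.in_service

    def next_probe_delay(self) -> int:
        """Seconds to wait before the next probe."""
        return self.probe_interval if self.in_service else self.passdetect_interval
```

A polling loop would call something like `health.record(tcp_probe("store1.example.com", 443))` and then sleep for `health.next_probe_delay()` seconds; the hostname here is a placeholder.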
SSL termination was a consideration of ours; however, we chose not to proceed. For starters, there are a handful of known issues with terminating SSL on the Cisco ACE -- both security and stability related. Secondly, this was before the web proxy was introduced, and without it this LB scenario would have been extremely complex. For example, you need 1 private key / cert pair for your load balanced VIP, say mail.example.com. Additionally, you'll need a private key / cert pair for each mail store you have, since Zimbra redirects users who do not land on their own mail store (in our case a user had a 1 in 4 chance of landing on the right one). Web proxying would fix all of this, though it was only introduced in v 5.0.5 and documented in v 5.0.6. It is currently delivered with a 'BETA' disclaimer, so I'm reluctant to put it into production. I've given it a shot with minimal success, so we're sticking with a certificate with SANs.
Speaking of SANs (Subject Alternative Names), it was a bit of a to-do to get certificates working properly. SANs basically allow you to specify multiple hostnames in 1 certificate. Our current CA, Thawte, does not sign certificates with SANs -- nor do many of the big CAs. We did have success with DigiCert and were rather impressed with their responsiveness. Verifying our domain was easy (almost too easy) and using their web interface was a snap. One thing to be sure of is to include the CN of the certificate as 1 of the SANs, or else you'll still get a browser warning.
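As a rough way to catch the "CN missing from the SANs" mistake before deploying, the check can be sketched in Python. This is only an illustration: the helper names are made up, and it assumes the certificate has already been decoded into the dictionary shape that Python's `ssl.SSLSocket.getpeercert()` returns. The `store*.example.com` hostnames are placeholders.

```python
def cert_common_name(cert: dict):
    """Return the subject commonName from a getpeercert()-style dict, or None."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def cert_dns_sans(cert: dict) -> set:
    """Return the DNS-type Subject Alternative Names, lowercased."""
    return {
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    }


def uncovered_names(cert: dict, required_hosts) -> list:
    """List names that should be SANs but are not: the CN plus every required host."""
    wanted = {host.lower() for host in required_hosts}
    cn = cert_common_name(cert)
    if cn:
        wanted.add(cn.lower())
    return sorted(wanted - cert_dns_sans(cert))


# Hand-written example in the shape getpeercert() returns (placeholder hostnames).
example_cert = {
    "subject": ((("commonName", "mail.example.com"),),),
    "subjectAltName": (
        ("DNS", "store1.example.com"),
        ("DNS", "store2.example.com"),
    ),
}
missing = uncovered_names(
    example_cert,
    ["store1.example.com", "store2.example.com", "store3.example.com"],
)
```

An empty result means the VIP name and every mail store name are covered; in this example the CN itself and one store are not.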
Installing the certificates was somewhat of a pain, though that was a known bug. Our workaround was to install the certificates manually on all our servers:
Code:
# Copy the certificate files to every server and set permissions on the CA cert
for i in $ZIMBRASERVERS; do scp * root@${i}:/opt/zimbra/ssl/zimbra/commercial/ && ssh root@${i} "chmod 700 /opt/zimbra/ssl/zimbra/commercial/commercial_ca.crt"; done
# Deploy the commercial certificate and CA chain on every server
for i in $ZIMBRASERVERS; do ssh root@${i} "/opt/zimbra/bin/zmcertmgr deploycrt comm /opt/zimbra/ssl/zimbra/commercial/commercial.crt /opt/zimbra/ssl/zimbra/commercial/commercial_ca.crt"; done
# Restart Zimbra on every server so the new certificate is picked up
for i in $ZIMBRASERVERS; do ssh root@${i} "su - zimbra -c 'zmcontrol stop && zmcontrol start'"; done
I believe this bug has been fixed, though I can't say for sure:
Bug 24153
One of the more difficult aspects of running a multi-server install is trying to follow the documentation. Even the multi-server documentation itself gets confusing -- most of it is written without specifying which server in your multi-server environment things should be executed from! Additionally, you won't find everything you need to know in the multi-server docs, so you'll have to touch base with the single-node docs and make your own judgment calls. Not a show stopper, just a few extra steps involved.
Infrastructure aside, migrating existing users has been our primary focus throughout this process. Our original plan was to do the cutover in a weekend, with the hope that we could update DNS to point old IPs (old IMAP server, outbound SMTP, etc.) to the new load balanced VIP so users would not have to update their existing mail clients. Ha! Zimbra recommends imapsync for migrating users; it is basically a Perl script which serially copies e-mails one by one between 2 different mail stores. To help automate our Zimbra account creation, we have created a custom Python script which performs the following:
- Takes in a comma delimited file of uids
- One by one, verifies they have an existing mail account
- Creates a zimbra account and sets their COS
- Sends the user a kick off e-mail letting them know the process is about to begin
- Syncs their e-mail via imapsync
- After the first pass of imapsync, starts forwarding mail to their Zimbra account, makes their old INBOX and folders read-only, then runs a second imapsync pass (this prevents users from making any updates, moving e-mails to different folders, etc., and makes sure we have a perfect replica of their old mail)
- Once successful, creates a new INBOX on the user's old account which contains instructions on how to confirm their account (via another web-app system we developed) as well as instructions on how to update their existing mail clients.
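The steps above can be sketched roughly as follows. This is not our actual script, just a minimal outline in Python. The step names and the `hooks` mapping are hypothetical, stopping at the first failed step is an assumption, and the imapsync options shown (`--host1`, `--user1`, `--passfile1`, `--host2`, `--user2`, `--passfile2`) are only the basic ones; a real run would likely need more.

```python
import csv
import io
from typing import Callable, Dict, List

# Step order taken from the list above; each name is a hypothetical hook.
STEPS = [
    "verify_old_account",
    "create_zimbra_account",
    "send_kickoff_email",
    "imapsync_first_pass",
    "start_forwarding",
    "make_old_mailbox_read_only",
    "imapsync_second_pass",
    "leave_instructions_in_old_inbox",
]


def parse_uids(text: str) -> List[str]:
    """Read comma-delimited uids, ignoring blanks and surrounding whitespace."""
    uids = []
    for row in csv.reader(io.StringIO(text)):
        uids.extend(cell.strip() for cell in row if cell.strip())
    return uids


def imapsync_command(uid: str, old_host: str, new_host: str,
                     passfile1: str, passfile2: str) -> List[str]:
    """Build an argument list for one imapsync pass (basic options only)."""
    return [
        "imapsync",
        "--host1", old_host, "--user1", uid, "--passfile1", passfile1,
        "--host2", new_host, "--user2", uid, "--passfile2", passfile2,
    ]


def migrate(uids: List[str],
            hooks: Dict[str, Callable[[str], bool]]) -> Dict[str, str]:
    """Run every step in order for each uid, one user at a time.

    Returns uid -> "ok", or the name of the first step that failed
    (assumption: a failed step aborts the rest for that user).
    """
    results = {}
    for uid in uids:
        results[uid] = "ok"
        for step in STEPS:
            if not hooks[step](uid):
                results[uid] = step
                break
    return results
```

Each hook would wrap the real work, for instance calling `subprocess.run(imapsync_command(...))` for the two sync passes. Keeping the steps as named hooks makes it easy to rerun a single user from the step that failed.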
To date, we have not had a single complaint about this process. We did this in an incremental fashion, starting with a subset of about 30 people. Once we ironed out the kinks, we opened it up to our technology departments, which was about 140 people. We then expanded our account confirmation system to allow users to opt in. Since we made the opt-in available (2008/05/14), about 300 people have signed up and migrated their accounts. We've received a lot of praise from our users -- both for the user experience during the migration / confirmation process and out of excitement to use a high quality product like Zimbra.
All in all we're extremely happy with our decision to go to Zimbra. Once mail has been completely migrated, we intend to migrate all of our calendar data from Oracle Calendar into Zimbra (see Oracle Calendar to Zimbra pain); if you read that post, you can see we're dealing with a culture shock more than any technical challenges.
I'll be happy to keep you updated once we start the campus migration (scheduled to start 2008/07/07) and let you know the end result.