The general recommendation is to keep within one NUMA zone. Fine. But given current Nehalem CPUs and Intel chipsets, you can build a server both faster and cheaper with two CPU sockets than with one, because then you can populate 6 DIMM slots with 1.3GHz RAM versus 3. The non-local RAM access penalty for Nehalem, as measured with numademo, is only about 16-30%, which is about the gain you get by going from 1GHz to 1.3GHz RAM. And you only pay that penalty for non-local memory access, so on the whole, you win.
Let's say I have a Zimbra server with 24GB RAM and two NUMA nodes. Substantially all memory is used by Jetty and MySQL, which as far as I know never share memory. Is there any real cost to putting them in separate NUMA zones? Do any other processes share memory, including disk buffer cache?
I'm inclined to put LDAP and MySQL in zone 1, and everything else in zone 0. Reasonable?
Relevant tools: numactl --hardware, numactl --preferred=0 -N0, numademo, cat /sys/devices/system/node/node*/numastat
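A minimal sketch of how the split described above might be applied with the numactl flags just listed. The helper name run_on_node and the NUMA_DRY_RUN switch are made up for illustration, and how Zimbra actually launches mysqld, slapd or Jetty is not shown here, so treat the wrapped command as a placeholder.

```shell
#!/bin/sh
# Sketch: run a command on one NUMA node's CPUs, preferring that node's memory.
# run_on_node and NUMA_DRY_RUN are hypothetical names, not part of numactl or Zimbra.
# --cpunodebind is the long form of -N; --preferred falls back to another
# node when the preferred one is full, unlike the stricter --membind.
run_on_node() {
    node="$1"
    shift
    if [ "${NUMA_DRY_RUN:-0}" = 1 ]; then
        # Dry run: only show the command line that would be executed.
        echo numactl --cpunodebind="$node" --preferred="$node" "$@"
    else
        numactl --cpunodebind="$node" --preferred="$node" "$@"
    fi
}

# Example (placeholder command): run_on_node 1 /path/to/mysql/start-script
```

With NUMA_DRY_RUN=1 the function only prints the numactl invocation, which makes it easy to check the binding before touching a production start script.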
So you are running Zimbra on bare metal instead of on a virtual machine?
Regardless of the hypervisor, the ability to snapshot a VM right before an operating system and/or Zimbra upgrade is such a wonderful time saver in case a rollback is needed that we no longer build Zimbra servers on bare metal.
The biggest source of memory pressure we see is Amavis. We compound that pressure by mounting Amavis' tmp directory on a RAM disk to speed up mail processing, but even when the server is maxed out (e.g. a small, single Zimbra server hit by an email flood and processing ~30K emails per hour), the UI in general and even searches continue to perform OK.
The biggest performance bottleneck in our experience is storage I/O (that's why we put Amavis' tmp directory on a RAM disk). %hi and %si in top rarely blip much, so even if you configure a more efficient bare metal memory map by splitting jetty and MySQL onto separate NUMA nodes, I'm not sure you'll actually obtain any noticeable performance improvement.
Lastly, don't forget that memory bus speeds are wicked fast now, so the performance gain from NUMA optimization has been reduced -- even in the virtual world, where vSphere 4.1 and XenServer 5.6 SP2 contain a lot of performance enhancements over previous versions.
Don't get me wrong -- I'm really very interested in this subject and keen to learn more, but since we (on our own Private Cloud) and most of our clients run virtual server farms with N+1 compute heads for HA and fault tolerance, pinning VMs to certain NUMA nodes causes more administrative headaches than performance benefits (for us).
All the best,
My server is bare metal (Dell R710) booting from Compellent SAN.
I still get snapshots at the storage layer, and I can still use VMs for test and DR. I always test upgrades by mounting consistent snapshots of production /opt/zimbra/* as Xen VMs (old version included in RHEL 5.7). When I do this, I/O performance is very noticeably worse than on bare metal, even when the Xen server is, on paper, faster and better connected.
With VMware's acquisition of Zimbra, I had been intending to virtualize there -- VMware I/O is rumored to be better than Xen's, certainly better than Red Hat's build -- but my mail store is 5 terabytes of raw LUNs, which makes our VMware guy uncomfortable. We would have needed to buy a new server anyway, so I ended up with dedicated hardware again. And modern dedicated hardware means NUMA.
I ran some simple tests (for i in `seq 1 10`; do for j in `zmprov -l gaa`; do echo sm $j;echo 'search -t message "in:trash -has:attachment"'; done|zmmailbox -z;done) that showed no NUMA cross-talk (according to numactl --hardware). So there don't seem to be any funky zero-copy handoffs between java and mysqld. There was a modest (7%) improvement for NUMA split versus NUMA interleaved. I didn't expect there to be much difference, but wanted to confirm that it didn't make things worse.
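For anyone who wants to repeat this, here is the same search loop unrolled into a readable script, plus one possible way to watch for cross-node allocations. The function names are invented for illustration, and summing the numa_miss counters from sysfs is an assumption about how to check for cross-talk, not necessarily how the numbers above were obtained.

```shell
#!/bin/sh
# Read account names on stdin and emit zmmailbox commands that select each
# mailbox and search its trash for messages without attachments.
# gen_search_cmds is a hypothetical helper name.
gen_search_cmds() {
    while IFS= read -r acct; do
        [ -n "$acct" ] || continue
        printf 'sm %s\n' "$acct"
        printf '%s\n' 'search -t message "in:trash -has:attachment"'
    done
}

# Sum the numa_miss counters over all nodes. The root directory can be
# overridden so the function can be pointed at saved copies of the files.
numa_miss_total() {
    root="${1:-/sys/devices/system/node}"
    cat "$root"/node*/numastat 2>/dev/null |
        awk '$1 == "numa_miss" { sum += $2 } END { printf "%.0f\n", sum }'
}

# Example run (commented out; needs a Zimbra host):
# before=$(numa_miss_total)
# for i in $(seq 1 10); do
#     zmprov -l gaa | gen_search_cmds | zmmailbox -z
# done
# after=$(numa_miss_total)
# echo "numa_miss grew by $((after - before))"
```

The counters are system-wide, so the difference is only meaningful on an otherwise quiet box.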
While the big story is I/O, we have seen memory-related performance issues, especially in the early 5.0 series, stemming from Java garbage collection. If I can help keep Java heap local to the CPU, I expect at least a small win. Also, I put amavis tmp on RAM disk. While the orders-of-magnitude boost from keeping it off spinning platters is the big win, that RAM disk should be kept NUMA-local, too.
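One way to keep the RAM disk NUMA-local, assuming it is a tmpfs mount and the kernel's tmpfs supports the mpol= mount option: bind its pages to a single node at mount time. The helper name, the 1g size and the mount point below are placeholders, not values from this server, and the function only prints the mount command rather than running it.

```shell
#!/bin/sh
# Print (not run) a tmpfs mount command whose pages are bound to one NUMA node.
# tmpfs_mount_cmd is a hypothetical helper; node, size and directory are
# supplied by the caller.
tmpfs_mount_cmd() {
    node="$1"
    size="$2"
    dir="$3"
    echo mount -t tmpfs -o "size=${size},mpol=bind:${node}" tmpfs "$dir"
}

# Example with placeholder values:
# tmpfs_mount_cmd 0 1g /path/to/amavis/tmp
```

Using mpol=prefer:N instead of mpol=bind:N would let the RAM disk spill onto the other node when the local one fills up, which may be the safer choice during a mail flood.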
My bad; I assumed bare metal server = DASD and no SAN. Snapshot availability either on a SAN or in the hypervisor = normal blood pressure during upgrades for sure! :-)
Originally Posted by Rich Graves
FWIW, one of our clients who runs SAS on optimized bare-metal 2-year-old Dells after having poor virtualized performance just lit up a test server on our new Private Cloud infrastructure. Performance improvements over bare metal ranged from 50% to an order of magnitude. Test job completion times ranged from 15 minutes to four hours (they run BIG jobs...). Our cloud is built on commercial XenServer 5.6 SP2.
We also saw big performance improvements when we upgraded our own and our clients' vSphere servers from 4.0 to 4.1. Both vSphere 4.1 and XenServer 5.6 SP2 were reported to have noticeable performance improvements over previous versions, which our real-world experiences confirmed.
I don't know why your VMware guy is nervous about your 5TB of LUNs in principle, although the migration will take double the current storage... VMware can only "see" LUNs up to 2TB, but can string multiple LUNs together to create Datastores much larger than 2TB (you create Virtual Disks for use by virtual machines only inside Datastores).
Thanks for the follow up!
All the best,