#1
11-29-2007, 09:46 AM
Intermediate Member
Posts: 21
High CPU Load every couple of days

Hi all,

I'm using ZCS Community 4.5.10, and every couple of days (2-4 days) the CPU load skyrockets and the system becomes unresponsive. When this happens, top shows %wa staying at or just below 99%. I suspect either Java (since it uses the most resources) or one of the crontab jobs. For reference, I've attached a screenshot of top and catalina.out to this post.

The system serves 15-20 mailboxes, runs on Ubuntu6, and is virtualized with Microsoft Virtual Server 2005 R2. The other guest systems (Windows) show no similar symptoms and are very stable, so I think we can safely rule out a faulty hard disk. Any idea why this is happening?
Attached Images
File Type: jpg Untitled-1.jpg (50.5 KB, 398 views)
Attached Files
File Type: zip catalina.zip (35.5 KB, 3 views)

Last edited by pornsakb; 11-29-2007 at 09:51 AM.

#2
11-29-2007, 10:57 AM
Moderator
Posts: 1,027

My first question would be about resources--specifically RAM--and those "other guest systems" to which you refer. Is the 1.5 GB of RAM that I see listed on your top screen allocated exclusively to the Ubuntu/Zimbra virtual machine, or is that everything you have on your box?

Windows swaps like crazy on less than 512 MB (XP especially, and 2000 to a lesser extent). If your whole machine has a gig and a half of RAM and runs both Ubuntu/Zimbra and Windows in separate virtual machines, you could well be running out of RAM whenever Windows and Ubuntu try to do RAM-intensive work at the same time. Best practice here generally comes down to 2 GB dedicated to Linux and Zimbra as the sweet spot, even for installations smaller than yours. That 2 GB can then support a fair amount of growth; it's what I'm using on a 32-mailbox system and it is extremely responsive. Below that 2 GB floor, systems run a whole lot less efficiently.

#3
11-29-2007, 11:01 AM
Intermediate Member
Posts: 21

Hi,

The 1.5GB of RAM that you see is dedicated exclusively to the virtualized Ubuntu6 machine. I also tried shutting down all other guest systems, but the load average did not drop.

#4
11-29-2007, 11:10 AM
Moderator
Posts: 1,027

Quote:
Originally Posted by pornsakb
Hi,

The 1.5GB of RAM that you see is dedicated exclusively to the virtualized Ubuntu6 machine. I also tried shutting down all other guest systems, but the load average did not drop.

OK, that makes hardware an unlikely candidate for the problem. I presume you have at least a moderate-horsepower CPU (it doesn't take a killer; I run mine on a single PIII 1.4GHz).

Next time utilization is that high, tail your zimbra.log and mail.log files (both are in /var/log). zimbra.log in particular keeps a timestamped record of what the Zimbra software is doing and can give you some pretty useful insights.
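
A minimal sketch of that check, assuming a POSIX shell. The helper name tail_logs is made up for illustration; the log paths are the ones mentioned above.

```shell
# Print the last N lines of each readable log, with a header per file.
# Usage: tail_logs N FILE...
tail_logs() {
    count=$1
    shift
    for log in "$@"; do
        if [ -r "$log" ]; then
            echo "==> $log <=="
            tail -n "$count" "$log"
        else
            echo "==> $log (not readable, skipped) <=="
        fi
    done
}

# During a spike (run as a user that can read the logs):
#   tail_logs 50 /var/log/zimbra.log /var/log/mail.log
```

Comparing the timestamps in the output against the time the load started climbing is the point of the exercise; the line count of 50 is arbitrary.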

By way of comparison, I ran top on my system and just watched it for the last five minutes. I see java hit the top of the list whenever I log into or do any action in a web client (most of the time my users connect by IMAP, not web client), but the rest of the time it doesn't even show on the list. It does take a fair chunk of RAM when in action, but not a particularly high amount of CPU, and that usage appears to be transient.

I presume, if java is heavily utilized by your system, that most of your users are using the web client? This is not a bad thing; many systems with hundreds or thousands of users use it heavily; just trying to sort out possibilities.

#5
11-29-2007, 11:12 AM
Moderator
Posts: 1,027

Quote:
Originally Posted by pornsakb
I suspect either Java (since it uses the most resources) or one of the crontab jobs.

You can rule out crontab jobs quite easily. Just su - zimbra, run crontab -l, and see whether any job's schedule lines up with the times your problem occurs. Very few of the jobs in the crontab take more than a few seconds to run.
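
As a rough sketch of that, the crontab listing can be trimmed to just the scheduled entries so the times are easier to compare. The function name active_cron_entries is made up for illustration; the zimbra account name comes from the command above.

```shell
# Strip comment lines and blank lines from crontab text on stdin,
# leaving only the scheduled entries.
active_cron_entries() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$'
}

# As root:
#   su - zimbra -c 'crontab -l' | active_cron_entries
```

The first five fields of each remaining line are the schedule (minute, hour, day of month, month, day of week), which is what to match against the time of the spike.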

#6
11-29-2007, 07:32 PM
Intermediate Member
Posts: 21

I think the key here is to try to find out what is waiting for IO (%wa) and take it from there. Any idea how I can do that?

#7
11-30-2007, 08:23 AM
Active Member
Posts: 45

I had a problem like this once, but on Linux, not Windows. Oddly enough, it happened because the server had a lot of RAM.

Linux, by default, allows about 10% of memory to be used for buffering disk writes. The server had 16 gigabytes of memory, so file writes would buffer up to 1.6G and then everything would wait for the disk to catch up. It was a fairly speedy disk array, but it could not keep up under heavy loads.

Anyway, in Linux, you can tune how much memory may be used for buffered writes and how frequently the buffers are written to disk. I ended up doing something like this in /etc/sysctl.conf:

Code:
vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.dirty_expire_centisecs = 50
vm.dirty_writeback_centisecs = 50
Then run sysctl -p to load the new values.

With these settings, unwritten (dirty) file data is limited to about 1% of memory, and the kernel writes buffered data out to disk every half second. Obviously, you have to play with these numbers a bit to figure out what works best for you.
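
To put rough numbers on those percentages, here is a small shell sketch of the arithmetic. The function name dirty_limit_mb is made up for illustration, and it treats the ratio as a share of total RAM, which is only an approximation since the kernel works from available memory rather than the installed total.

```shell
# Approximate amount of dirty data (whole MB) allowed for a given
# RAM size in MB and a ratio in percent.
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 ))
}

dirty_limit_mb 16384 10   # 16 GB server at 10%
dirty_limit_mb 16384 1    # 16 GB server at 1%
dirty_limit_mb 1536 1     # 1.5 GB guest at 1%
```

For the 16 GB server this works out to roughly 1.6 GB at 10% and about 160 MB at 1%, which is why the smaller ratio keeps each flush short.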

I have never used Microsoft Virtual Server, but given the description I have to wonder if file writes are taking too long. Have you benchmarked the disk to see how well it performs? The iostat command can help you see how much I/O the system is doing.
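
A sketch of how that might look during a spike, assuming the procps ps and the sysstat package are installed. The helper name blocked_on_io is made up for illustration. Processes in state D (uninterruptible sleep) are usually the ones waiting on the disk.

```shell
# Keep only rows whose state column starts with D (uninterruptible
# sleep, usually waiting on I/O). Expects "STATE PID COMMAND" rows
# and prints "PID COMMAND" for each match.
blocked_on_io() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# While %wa is high:
#   ps -eo state,pid,comm | blocked_on_io
#   iostat -x 5 3    # extended per-device stats, 3 reports 5 seconds apart
```

Running the ps line a few times in a row helps, since a process that shows up in state D every time is a stronger suspect than one that appears once.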

Good luck!

#8
11-30-2007, 08:39 AM
Intermediate Member
Posts: 21

Attached are the results of hdparm -tT, taken before the CPU spikes. Do the figures look normal?
Attached Images
File Type: jpg hdparm.jpg (111.1 KB, 403 views)

#9
11-30-2007, 09:19 AM
Moderator
Posts: 1,027

I don't know if you have noticed but there are a number of threads on high CPU load that might be worth reviewing in case they give you any insight:

Here's one suggesting that NFS may (or may not, depending on the poster) be part of the problem.

Another traced it to zmmtaconfig syncing to LDAP.

Still another (Bugzilla 15598) was a bug (fixed since 4.5) related to malformed MIME messages.

It also occurs to me just now to ask: is your Ubuntu fully patched? It could also be something with an odd module.

Of course your issue may well be none of the above. Unfortunately finding these things can be somewhat of a needle-in-a-haystack search. Have you checked your zimbra.log file yet?

And if it seems that I'm scattergunning here, I am. Until more information is discovered this could take you any of a zillion ways. . .

#10
11-30-2007, 11:01 AM
Active Member
Posts: 45

This is from our zimbra server. It has a pair of mirrored 750GB SATA drives which are moderately fast.

Code:
[root@email ~]# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   2296 MB in  2.00 seconds = 1145.88 MB/sec
 Timing buffered disk reads:  184 MB in  3.00 seconds =  61.30 MB/sec
[root@email ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3750640AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 01 Lun: 00
  Vendor: ATA      Model: ST3750640AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
I suppose it all depends on what kind of disk you have. Your numbers are not particularly good, but they are also not particularly bad.
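
If you want to compare runs over time, a small sketch like this pulls just the throughput figure out of the hdparm output. The function name buffered_mb_per_sec is made up for illustration, and it assumes the output format shown above.

```shell
# Print the MB/sec figure from the "buffered disk reads" line
# of hdparm -t output read on stdin.
buffered_mb_per_sec() {
    awk '/Timing buffered disk reads/ { print $(NF-1) }'
}

# As root (device name as in the output above):
#   hdparm -t /dev/sda | buffered_mb_per_sec
```

Taking one reading while the system is idle and another while %wa is high would show whether the disk itself slows down during the problem.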

I don't have much experience with Ubuntu on servers, but do you have anything in /var/log/sa? The files in there may provide some useful system information covering the "wait" periods.
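
A rough sketch of reading those files with sar, assuming the sysstat package is collecting data. The helper name first_existing_dir is made up for illustration; /var/log/sysstat is, as far as I know, where Debian-based systems keep the same files, and sa29 is only an example file name (the files are named after the day of the month).

```shell
# Print the first argument that is an existing directory;
# return non-zero if none of them exist.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# CPU report, including %iowait, for the 29th:
#   sa_dir=$(first_existing_dir /var/log/sa /var/log/sysstat) &&
#       sar -u -f "$sa_dir/sa29"
```

If neither directory exists, data collection is probably not enabled, and there will be nothing to look at until it has run through at least one spike.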

Mark

Quote:
Originally Posted by pornsakb
Attached are the results of hdparm -tT, taken before the CPU spikes. Do the figures look normal?