Zimbra :: Forums
06-05-2007, 09:09 AM
Quote:
Originally Posted by padraig
This looks good, but there is a danger this could mask an underlying problem if processes die regularly.

The posted config doesn't work... for a bunch of reasons.
I'm working on a much better version right now... should have it finished today. I'll test for a week and then post back here if it works.
02-28-2008, 07:41 PM
Anyone get this working successfully?
This thread is a little old, but I'm hoping somebody got this to work.
Quote:
Originally Posted by Leesbian
The posted config doesn't work... for a bunch of reasons.
I'm working on a much better version right now... should have it finished today. I'll test for a week and then post back here if it works.
02-28-2008, 08:11 PM
Former Zimbran
Posts: 5,606
There is a very fundamental issue with this workflow that needs to be considered:
If a service stops, it stops for a reason. This workflow does nothing to address that problem.
This means that if there is a larger issue, such as an unhandled exception... well, it's only a matter of time before it goes down again. Since this idea would automatically restart the service, you may never know that you hit an unhandled exception. It also might make things worse...
Zimbra has great handlers. We have our own watchdog proc for things like mta, clam, and java. If those die, it tries to restart them. If there is a condition preventing the restart, it won't restart them.
The moral of the story is that if the server goes down, you really should figure out why, as opposed to just restarting the service.
I do think this is a good idea, which is why I'm saying it's a problem with the workflow itself.
There's a high availability/failover script floating around. You might want to look at that.
02-29-2008, 09:52 AM
Does the watchdog process send an email to the admin if a process dies and it has to restart it or can't restart it? Is there an option to set something like that up? I realize that if a service does die there could be a bigger underlying issue, but I would like an alert telling me it has died and could/couldn't be restarted, rather than finding out from all my customers calling and complaining ;-)
I was just trying to be proactive about being alerted to the issue first if something were to happen.
Thanks for the input.
02-29-2008, 10:05 AM
Former Zimbran
Posts: 5,606
Well, it wouldn't be able to send an e-mail because the server is down, and thus SMTP is down. If e-mail's down, you probably won't get the message anyway.
What I would do is have a script that monitors the services. If a condition is raised where the services go down, you could have it send an HTTP POST to your "support server" or something. If you're using Windows NT, you could whip up a script so that when that POST is received, it uses the Windows messaging service (not MSN Messenger, but the messenger protocol built into Windows NT machines) to send your machine an alert.
Just some thoughts.
Definitely possible.
02-29-2008, 10:07 AM
Former Zimbran
Posts: 5,606
Correction:
SMTP may not be down, but another service could be down. In any case, since this is a disaster-related script, you should plan for the event that SMTP is unavailable.
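One way to plan for SMTP being unavailable, sketched in monit's control-file syntax (monit being the tool this thread is about): have the check run a local program instead of relying on mail. This is only a sketch under assumptions. The script path `/usr/local/bin/notify-support` is hypothetical and stands in for whatever would send the HTTP POST to a separate support server; the pidfile path is the Postfix one posted later in this thread and may differ on your install.

```monit
# Sketch only: run a local program when the SMTP port stops answering.
# /usr/local/bin/notify-support is a hypothetical script that would
# send an HTTP POST to a separate support server.
check process Zimbra.Postfix
    with pidfile /opt/zimbra/data/postfix/spool/pid/master.pid
    if failed port 25 protocol smtp then exec "/usr/local/bin/notify-support"
```

If a check for this process already exists in your control file, add the `exec` line to it rather than defining a second check, since monit service names must be unique.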
03-03-2008, 03:39 AM
multistore is worthwhile to be monitored using monit
Everything you say, John, is right... but:
I have a multistore architecture with store servers WAN-connected to a central hub.
I have a store that dies when the WAN connection to the master goes away. At the moment I don't know of any way to bring it back without using monit. If you can suggest something different, you are welcome!
Any advice would be appreciated.
10-10-2008, 08:23 AM
A working monitor...
Hey there... no one's done anything with this in a while, but I figured I would post my working monitor script. The one thing to note is that the purpose of the script is NOT to restart a failed process, simply to give the administrator a heads up that something is about to go bad (e.g. process hung, running out of resources, process died, etc.).
Code:
    # Host-wide resource thresholds
    check system myhost.local
        if loadavg (1min) > 4 then alert
        if loadavg (5min) > 2 then alert
        if memory usage > 85% then alert
        if cpu usage (user) > 70% then alert
        if cpu usage (system) > 50% then alert
        if cpu usage (wait) > 20% then alert

    # Per-process checks: alert only, no start/stop programs defined
    check process Zimbra.Apache
        with pidfile "/opt/zimbra/log/httpd.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        if failed port 80 protocol http then alert
        group zimbra

    check process Zimbra.Logwatch
        with pidfile "/opt/zimbra/log/logswatch.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        group zimbra

    check process Zimbra.MySQL
        with pidfile "/opt/zimbra/db/mysql.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        if failed port 7306 protocol mysql then alert
        group zimbra

    # Logger database; monit treats it as dependent on the main MySQL check
    check process Zimbra.MySQL_Logger
        with pidfile "/opt/zimbra/logger/db/mysql.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        depends on Zimbra.MySQL
        group zimbra

    check process Zimbra.MTA_Config
        with pidfile "/opt/zimbra/log/zmmtaconfig.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        group zimbra

    check process Zimbra.Mailbox_Java
        with pidfile "/opt/zimbra/log/zmmailboxd_java.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        if failed port 143 protocol imap then alert
        group zimbra

    check process Zimbra.Mailbox_Control
        with pidfile "/opt/zimbra/log/zmmailboxd_manager.pid"
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        group zimbra

    check process Zimbra.ClamAV
        with pidfile /opt/zimbra/log/clamd.pid
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        group zimbra

    check process Zimbra.Cyrus_SASL
        with pidfile /opt/zimbra/cyrus-sasl/state/saslauthd.pid
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        group zimbra

    check process Zimbra.Postfix
        with pidfile /opt/zimbra/data/postfix/spool/pid/master.pid
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        if failed port 25 protocol smtp then alert
        group zimbra

    check process Zimbra.LDAP
        with pidfile /opt/zimbra/openldap/var/run/slapd.pid
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        if failed host myhost.local port 389 protocol ldap3 then alert
        group zimbra

    check process Zimbra.Amavis
        with pidfile /opt/zimbra/log/amavisd.pid
        if children > 255 for 5 cycles then alert
        if cpu usage > 95% for 3 cycles then alert
        group zimbra

So, think of this as an early warning system. Monit can easily be set to use a different SMTP server than your Zimbra server, so it gets around that problem as well.
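As a rough illustration of pointing monit at a mail server other than Zimbra, the global section of the control file might look something like the sketch below. The host `relay.example.com` and the address `admin@example.com` are placeholders, not values from this thread.

```monit
# Sketch only: send monit's alerts through a relay that is not the Zimbra host.
# relay.example.com and admin@example.com are placeholders.
set mailserver relay.example.com port 25
set alert admin@example.com
```

With this in place, alerts raised by the checks above would still go out when the Zimbra MTA itself is the thing that failed, provided the relay is reachable.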
10-10-2008, 09:00 AM
To each his/her own..
Quote:
Originally Posted by jholder
There is a very fundamental issue with this workflow that needs to be considered:
If a service stops, it stops for a reason. This workflow does nothing to address that problem.
This means that if there is a larger issue, such as an unhandled exception... well, it's only a matter of time before it goes down again. Since this idea would automatically restart the service, you may never know that you hit an unhandled exception. It also might make things worse...
Zimbra has great handlers. We have our own watchdog proc for things like mta, clam, and java. If those die, it tries to restart them. If there is a condition preventing the restart, it won't restart them.
The moral of the story is that if the server goes down, you really should figure out why, as opposed to just restarting the service.
I do think this is a good idea, which is why I'm saying it's a problem with the workflow itself.
There's a high availability/failover script floating around. You might want to look at that.

Everyone's requirements are different, so your mileage will vary. I've had processes die, and they can die for many reasons, sometimes even under load from a spam attack.
Depending on your environment, you may not want the service down if, say, it happened at 4am and you get a wakeup call at 8am from irate users. Your investigation time would be limited; you would have to restart the service.
So the real moral of the story: know what you need before you implement. Just leaving a service down is great in theory, as we take our time to exchange pleasantries with Zimbra tech support to get the issue resolved. But that's not always a quick thing.
As someone mentioned later, monit can be configured to send alerts via another SMTP server, so based on your alerts config, you will be notified of a down situation.
You can also comment out the start/stop lines and just have the alerts sent out. Pretty flexible.
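A minimal sketch of what commenting out the start/stop lines could look like for one check. The start and stop script paths are hypothetical placeholders, not real Zimbra commands; the pidfile path is the Postfix one from the config posted above.

```monit
# Sketch only: alert-only form of a check, with restart handling disabled.
# The start/stop script paths are hypothetical placeholders.
check process Zimbra.Postfix
    with pidfile /opt/zimbra/data/postfix/spool/pid/master.pid
    # start program = "/usr/local/bin/zimbra-mta-start"
    # stop program  = "/usr/local/bin/zimbra-mta-stop"
    if failed port 25 protocol smtp then alert
    group zimbra
```

Uncommenting the two program lines and changing `alert` to `restart` would turn the same check into an auto-restart rule, so switching between the two approaches is a small edit.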
10-10-2008, 09:46 AM
Absolutely
Oh, I completely agree... That's the whole point of the monitrc posting that I put up... all it does is let the admin know that either (a) a service has gone down, or (b) the server appears to be struggling with something... either way, they should look into it. The monit script I posted doesn't even have start/stop lines, and that's completely intentional.
The idea behind having the alerts for child processes/memory utilization/load etc. is that the administrator can get in and, worst case scenario, alert the users that the system is going down. In my experience, the anger level of a client is generally inversely proportional to the amount of warning they had. E.g. "You're getting a lot of spam, it looks like it's about to hang the system" is often appreciated more than "The reason you haven't received email in the last 4 hours is because spam clogged the system".
... god I hate spam.