Zimbra offers Open Source email server software and shared calendar for Linux and the Mac
 
Go Back   Zimbra - Forums > Zimbra Collaboration Suite > Administrators

Welcome to the Zimbra - Forums!
Welcome, if you would like to post a comment please register. We also encourage you to explore all things Zimbra with our team and members of the community.

Reply
 
LinkBack Thread Tools Display Modes
  #11 (permalink)  
Old 05-09-2009, 08:27 PM
Senior Member
 
Posts: 58
Default Appears to be a database issue

Hello, all. We are making some progress toward understanding what has happened but still have not fixed the problem. It does not appear that this problem is related to our manually running the cron jobs but is rather a database issue. We have been able to reproduce the issue.

The first time was after we noticed we did not have a clean shutdown of the system. The second, we crashed the system (we are still pre-production and doing some experimentation). After the crash, we saw the same symptoms: It was as if mail had rolled back to a previous point, newly created domains and users were missing, existing users had deleted mail in their inboxes and missing new mails.

When we check the message store, it appears the messages are there but they are not showing up in the database. We also are seeing error messages in the database, e.g., this is when we had our first problems:

Code:
090507 22:27:02  mysqld started
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
090507 22:27:03  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
090507 22:27:05  InnoDB: Started; log sequence number 0 3354455
090507 22:27:05 [Note] /opt/zimbra/mysql/libexec/mysqld: ready for connections.
Followed shortly thereafter with:
Code:
090507 22:52:06 [Note] /opt/zimbra/mysql/libexec/mysqld: ready for connections.
Version: '5.0.67-log'  socket: '/opt/zimbra/db/mysql.sock'  port: 7306  Source distribution
090507 22:59:50090507 22:59:50 [ERROR] Cannot find table mboxgroup16/appointment from the internal data dictionary
of InnoDB though the .frm file for the table exists. Maybe you
have deleted and recreated InnoDB data files but have forgotten
to delete the corresponding .frm files of InnoDB tables, or you
have moved .frm files to another database?
See http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html
how you can resolve the problem.

090507 22:59:50090507 22:59:50 [ERROR] Cannot find table mboxgroup16/appointment from the internal data dictionary
of InnoDB though the .frm file for the table exists. Maybe you
have deleted and recreated InnoDB data files but have forgotten
to delete the corresponding .frm files of InnoDB tables, or you
have moved .frm files to another database?
See http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html
how you can resolve the problem.
We see similar messages tonight. Running zmdbintegrityreport -v gave similar messages and -r did not repair them.

We next followed the database repair procedures described in Mysql Crash Recovery - Zimbra :: Wiki. Now, zmdbintegrityreport -v comes up clean but we still have the same problem.

How does one get the messages from the message stored back into the database and recovery the other information? I'm suspecting but hoping that I will not hear the only way is restoration from backup and losing anything since the last backup. Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #12 (permalink)  
Old 05-10-2009, 12:51 AM
Zimbra Consultant & Moderator
 
Posts: 11,512
Default

Quote:
Originally Posted by gracedman View Post
How does one get the messages from the message stored back into the database and recovery the other information? I'm suspecting but hoping that I will not hear the only way is restoration from backup and losing anything since the last backup. Thanks - John
You can use zmlmtpinject. How much RAM is allocated this Zimbra server?

Please update your forum profile with the output of the following:

Code:
zmcontrol -v
__________________
Regards


Bill
Reply With Quote
  #13 (permalink)  
Old 05-10-2009, 02:35 PM
Senior Member
 
Posts: 58
Default

The server has up to 16GB available to it. It is using a little less than 3 right now.

zmcontrol -v yields:

Release 5.0.16_GA_2921.RHEL5_64_20090429021719 CentOS5_64 NETWORK edition
I'll study zmlmtpinject a little more to make sure I don't misuse it.

Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #14 (permalink)  
Old 05-10-2009, 02:36 PM
Senior Member
 
Posts: 58
Default

Any thoughts about why a crash or unclean shutdown would cause this "data reversion"? I can't imagine it is the case but it's almost as if the transaction for the last several days changes was not committed to the database. Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #15 (permalink)  
Old 05-11-2009, 10:39 AM
Senior Member
 
Posts: 58
Default

The reinjection seemed to work well. We put together a little script to facilitate it in case it helps anyone else:

Code:
#!/bin/bash
if [ $# -ne 1 ];then
        echo "Usage: $(basename $0) <Message store ID number>"
        exit 10
fi
if [ ! -d /opt/zimbra/store/0/${1}/msg/0 ];then
        echo "Could not find message store for ${1}"
        exit 11
fi

ZIMDIR="/opt/zimbra"
cd ${ZIMDIR}/store/0/${1}/msg/0/
for MSG in $(ls)
do
        TO=$(sed -n -e "1,/Original Message/ s/^To: \(.*\)/\1/p" ${MSG})
        FROM=$(sed -n -e "1,/Original Message/ s/^From: \(.*\)/\1/p" ${MSG})
        if [ -n "$(echo ${TO} | grep '<')" ];then
                TO=${TO##*<}
                TO=${TO%%>*}
        fi
        if [ -n "$(echo ${FROM} | grep '<')" ];then
                FROM=${FROM##*<}
                FROM=${FROM%%>*}
        fi
        ${ZIMDIR}/bin/zmlmtpinject -r ${TO} -s ${FROM} ${MSG}
done
We're still perplexed about why we had the problem in the first place. Any suggestions? Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #16 (permalink)  
Old 05-13-2009, 05:12 AM
Senior Member
 
Posts: 58
Default

We're slowly working our way through this issue. It occurs when we shut down the underlying vserver host. Despite our creating a /etc/vservers/zimbra1/apps/vshelper/sync-timeout file and setting it to 900 (seconds) to give the system plenty of time to shutdown gracefully, it is not. If we watch the shutdown sequence, it is considering the zimbra server stopped long before it possibly could be. Thus, in effect, we are crashing the zimbra server.

We have escalated to the vserver list to determine how to solve the problem from the vserver side. However, that doesn't mean there will be other reasons why the zimbra server might crash in the future. I'm a bit surprised by how much damage is done and how consistently. Are there settings which can be tweaked to make zimbra a little more resistant to damage during a crash? Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #17 (permalink)  
Old 05-13-2009, 05:49 AM
Senior Member
 
Posts: 58
Default

I wonder if the severity of our crashes are compounded by the fact we have oodles of memory right now and very little activity. Perhaps we are being bitten by either this (from Performance Tuning Guidelines for Large Deployments - Zimbra :: Wiki):

"Innodb writes out pages in its cache after a certain percent of pages are dirty. The default is 90%. This default setting will minimize the total number of writes, but it would cause a major bottleneck in system performance when 90% is reached and database becomes unresponsive because the disk system is writing out all those changes in one shot. We recommend you set the dirty flush ratio to 10%, which does cause a lot more net total IO, but will avoids spiky write load. "

or this:

"MySQL is configured to store its data in files, and the Linux kernel buffers file IO. The buffering provided by the kernel is not useful to innodb at all because innodb is making its own paging decisions - the kernel gets in the way. Bypass the kernel with"

As I am not eager to keep mangling my system, I am hesitant to test these hypotheses! Can anyone with more experience furnish some guidance?

As someone new to this forum, I do hope it is not bad etiquette to keep such a running post. I am hoping it might help some other poor slob who faces the same issues in the future. Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #18 (permalink)  
Old 06-02-2009, 11:39 AM
Senior Member
 
Posts: 58
Default

We are still struggling with this issue but have narrowed it down quite a bit. We noticed that our data loss is always the data between 4:02AM the morning after zimbra was last restarted and the time of the next zimbra service restart. This happens whether we restart the server, restart the zimbra service, or simply do a zmstorectl restart.

We also noticed that all logging stops at this same 4:02AM point in time. No entries in zimbra.log, mailbox.log, logger_myslow.log, etc.

This is also the time the logrotate daily cron job kicks in. We did notice a problem in /etc/logrotate.d/zimbra. The postrotate routine for zimbra.log tries to restart syslogd but CentOS 5.3 (our platform) uses rsyslogd. Once we edited that call, zimbra.log was able to continually log past 4:02AM but it still did not fix the problem.

We removed the zimbra logrotate file from /etc/logrotate.d/ last night but were able to reproduce the data loss this morning so it does not appear to be something within the zimbra file. We are also told by zimbra tech support that data missing from the logs would not provoke a data rollback / loss / corruption upon service restart so we are assuming this log failure is a parallel symptom rather than a cause.

We will begin removing files from logrotate.d in an attempt to find what is the mystic coincidence between this failure and the cron.daily jobs. Any suggestions about what might be happening? Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #19 (permalink)  
Old 06-05-2009, 05:56 AM
Senior Member
 
Posts: 58
Default data loss and log stop correlation

I believe we have found the correlation between the data loss and the log problems but have not yet found the root cause. I believe data is being lost from the database because the logs are not being written after 4:02AM, specifically the innodb log file.

I am guessing that when mysql restarts and finds the data in the database does not match the data in the innodb log file (because we have ceased writing to it), it rolls back the database. I am deducing this from this mysql_error.log message:

090605 06:32:19 mysqld started
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
090605 6:32:20 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
090605 6:32:23 InnoDB: Started; log sequence number 0 19382319
090605 6:32:23 [Note] /opt/zimbra/mysql/libexec/mysqld: ready for connections.
Version: '5.0.67-log' socket: '/opt/zimbra/db/mysql.sock' port: 7306 Source distribution
090605 6:53:05 [Note] /opt/zimbra/mysql/libexec/mysqld: Normal shutdown

This is after recovering the database and waiting for another 4:02 to pass. Our challenge is still to find out why logging stops and stops only for zimbra. All other logging is working fine past 4:02 AM.

Would it be the case that mysql would roll back to 4:02AM if the innodb log contained no data later than 4:02?

We have not verified that innodb is not updated after 4:02 as we just put these issues together. We'll verify that in the morning. 4:02 is the exact time the cron.daily jobs start. Any clues as to why our logging stops at 4:02AM and stays disabled until Zimbra is restarted? Thanks - John
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote
  #20 (permalink)  
Old 06-09-2009, 05:45 AM
Senior Member
 
Posts: 58
Default Offending process isolated

We have finally narrowed down the problem to the offending process. There is some kind of conflict between mysql/innodb and the vserver hashify function. The problem described previously occurs every time hashify is run either from a cron job (4:00AM every day) or manually.

Since this thread contains a lot of accumulated extraneous information, I'll open a new thread to discuss this specific problem.
__________________
www.spiritualoutreach.com
Making Christianity intelligible to secular society
Reply With Quote

Why Join?

Registering let's you ask questions, makes it easier to search, displays any files attached to posts, and notifies you about replies.

Zimbrablog.com




 

Search Engine Optimization by vBSEO 3.1.0