Dovecot Failed state

Joe · June 1, 2020, 8:29pm

Same thing we’ve discussed above. Improper shutdown, failed start after shutdown.

The master.pid is the PID file for the main Dovecot process (the one that starts all the others). When Dovecot tried to shut down, it did not find a master.pid file to remove. This could mean one of two things: you ran stop or --force stop when the Dovecot master process had already shut down, or Dovecot removed the master.pid during a restart and for some reason failed to write a new one. Probably the former. Basically, it looks like Dovecot trying to shut down when it’s not running.

Opposite situation. Dovecot tried to start and found it was already running.

You could look to see if pid 14483 actually exists (assuming you haven’t killed it since this error). Whatever the current PID is, you’d look at, e.g.:

ps -p 14483

If you have a master.pid but the PID inside does not actually exist, that means Dovecot was shut down improperly somehow (kill -9, maybe --force stop, OOM killer) and was unable to clean up on exit.
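A minimal sketch of that check as a shell function. The function name is made up for illustration, and the PID file path in the usage comment is only the usual default location, so treat it as an assumption. A live PID also does not prove the process is Dovecot, since PIDs can be reused.

```shell
# Report whether a PID file points at a live process.
# Prints one of: missing, stale, running.
# Run as root: for a process owned by another user, kill -0 fails
# for an unprivileged caller even though the process exists.
pidfile_status() {
    pidfile=$1
    [ -r "$pidfile" ] || { echo missing; return; }
    # Keep only the digits, dropping the trailing newline and any stray whitespace.
    pid=$(tr -cd '0-9' < "$pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}

# Usage (path is an assumed default, adjust to your base_dir):
#   pidfile_status /var/run/dovecot/master.pid
```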

Speaking of which, have we already discussed the OOM killer? How much memory do you have? Plenty of free?

adamjedgar · June 1, 2020, 8:34pm

:~# ps -p 14483
PID TTY TIME CMD
14483 ? 00:00:00 dovecot

:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           3955        1359         425         373        2170        1982
Swap:           511         267         244

Joe · June 1, 2020, 8:41pm

OK, we can rule out the OOM killer; you probably have plenty.

adamjedgar · June 1, 2020, 8:43pm

I am thinking of changing to Debian 10.

Is an upgrade of this system to Debian 10 a wise move under the circumstances, or do you think i need to build a new one and move everything across? (this system has a static ip address and does not host dns)

Joe · June 1, 2020, 8:48pm

Upgrading from 9->10 is reasonably painless, I believe. Both are systemd. You’d want to do it when you can afford some downtime, though. If this is a VM (rather than a physical server), it’s probably better to migrate rather than upgrade, though, for a variety of reasons. Migration means you get to start over fresh, can migrate one domain at a time with testing in between, etc. Takes longer, but probably a cleaner resulting system.

We probably have an upgrade guide for 9->10. I’ll check…er, nope. Looks like Eric hasn’t had time to tackle that (he usually writes up the upgrade guides).

adamjedgar · June 1, 2020, 9:31pm

what i am thinking i will do then, is migrate most of the domains to a new system. I have access to dns registries on all but three of the domains. So I’ll move the ones where i have dns registry access to change A and MX records, and leave the other 3 in place and move them later.

adamjedgar · June 2, 2020, 10:15am

hi guys, an update on this issue…

i know i was planning to stop at this point and simply setup a new debian 10 system, however, i have received an idea from another forum that i think might be worth trying.

its interesting that this guy mentioned this, because the thought did cross my mind this morning after i last posted here…is it possible that its trying to start up a second instance of Dovecot because it thinks dovecot is not running when in fact dovecot is still running?

Here is his advice…

Login through SSH, disable dovecot from running as a service then stop it.
After you have done this see if dovecot is running, if it is kill it.
Once you have confirmed that dovecot is not running (by running ps aux | grep -i dov), enable it to run as a service again, but do not start it.
Then login to webmin and start the service from there if possible.

More than likely there are multiple instances of dovecot running and the system is potentially trying to kill one instance, but another immediately starts back up.
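The quoted steps could be sketched roughly as follows. This is an illustration, not a tested procedure: the function names and the DRY_RUN switch are invented here, the unit name dovecot is assumed, and by default the script only prints what it would run.

```shell
# Print the command in dry-run mode (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Disable and stop the unit, kill any stray master process, then
# re-enable the unit without starting it.
reset_dovecot() {
    unit=${1:-dovecot}
    run systemctl disable "$unit"
    run systemctl stop "$unit"
    # Anything still alive after a clean stop is a stray process.
    if pgrep -x dovecot >/dev/null 2>&1; then
        run pkill -x dovecot
    fi
    run systemctl enable "$unit"
    # Deliberately no start here: the advice is to start it from Webmin.
}

# Usage: reset_dovecot              (prints the commands only)
#        DRY_RUN=0 reset_dovecot    (actually runs them, as root)
```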

i had already got the system monitor working again this morning by doing a similar thing to the above. I will wait and see if and when Dovecot crashes again. If that happens, then i will try the above advice to see what happens.

What do you think Joe, is it possible that the reason why my system still delivers email even though the system monitors are saying its failed, is because in fact the entire issue exists because Webmin (or debian?) is trying to start up a second instance of Dovecot without actually realizing its already running?

Is there any way of simulating this? Why would Webmin (or debian) be attempting to start up another instance of Dovecot if its still running? Is there somewhere in the configuration in Webmin or Virtualmin where i could have inadvertently created a configuration whereby this could happen?

I note from this url https://wiki1.dovecot.org/RunningDovecot, its possible to run more than one instance of Dovecot at the same time on a server!

Joe · June 3, 2020, 3:10pm

It wouldn’t, unless you told it to by clicking the “start” button, or if you have a Status monitor that restarts it.

adamjedgar · June 10, 2020, 4:04am

ok so i just thought i would post back here re dovecot failing status. My server appears to have been not showing this error now for almost a week and i have done a few other minor updates and even rebooted it a couple of times…no further problems.

It seems as if it may have just been that the Dovecot process was still running and both debian and virtualmin thought it was not. I assume that because nothing would allow a second instance of dovecot to start while it was actually already running, the virtualmin and debian monitors kept right on thinking dovecot wasnt working and trying to start up a new instance of it.

Now that i have been able to essentially reset the entire system in the right way by forcing the stray Dovecot process to shutdown, once i restarted it again, there dont appear to be any further problems at this time.

it will be interesting to see what happens when the next webmin update comes out as that seems to have been when my system has crapped itself in some form or another.

adamjedgar · June 11, 2020, 11:39am

Ah FFS!!!
according to system monitor dovecot has entered a failed state again.
I notice that one of the virtual server had an SSL certificate update this afternoon. 6 minutes later, dovecot monitor says dovecot was no longer running!

I am sure this has something to do with SSL certificates, i just dont know why or what one is causing it!

emails are still being delivered to my email client apps…it doesnt make any sense how this could be randomly happening like this. its been over a week now without any problems and suddenly this again!

the error logs once again say it cannot restart because its already running.

ok, what is interesting is that in webmin/system/bootup and shutdown, dovecot.service is shown as not running.

i attempt to restart, it says failed!

I run the following in putty (note the output)

~# for i in $(ps aux | grep dovecot | awk '{print $2}'); do kill -9 $i; done
-bash: kill: (803) - No such process
-bash: kill: (25862) - No such process
-bash: kill: (26029) - No such process
-bash: kill: (28571) - No such process
-bash: kill: (28573) - No such process
-bash: kill: (28574) - No such process
-bash: kill: (28575) - No such process
-bash: kill: (28632) - No such process
-bash: kill: (28781) - No such process
-bash: kill: (28782) - No such process
-bash: kill: (31709) - No such process

i go back to webmin/system/bootup and shutdown and “Start Dovecot.service” and it starts up again.

I am not understanding this at all?

Cannot i just add the start command to the system monitor in the event dovecot.service goes down again? Is there anything i can do for this?

Is there any reason why my choosing Restart does not work and Start does? (is that what i am doing wrong here?) How does that have anything to do with the system monitor thinking dovecot has crashed?
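One note on the kill loop above: ps aux | grep dovecot also lists the grep process itself, which has already exited by the time kill runs, so at least one "No such process" line is expected from it. A sketch of a narrower filter that matches on the exact command name instead (the function name is hypothetical, and it assumes the master process shows up with the command name dovecot):

```shell
# Read "PID COMMAND" lines on stdin and print only the PIDs whose
# command name is exactly "dovecot". The grep used in a ps|grep
# pipeline can never match, because its command name is "grep".
dovecot_pids() {
    awk '$2 == "dovecot" { print $1 }'
}

# Usage:
#   ps -eo pid=,comm= | dovecot_pids
# pgrep -x dovecot gives a similar list where pgrep is installed.
```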

Joe · June 11, 2020, 5:22pm

Just to test a theory, can you look at dovecot.conf, and see if you have any ssl_ca directives set, and if so, remove them and restart Dovecot? (You may have to kill dovecot services with kill -9.)

Joe · June 11, 2020, 5:26pm

Note, this might make some clients (probably just mobile clients and old operating system versions) fail to validate the SSL for your domains. We can fix that if this solves the start/stop/status issues. I don’t know if it will, but it’s the only thing I know we’re currently doing wrong in dovecot configs, maybe exacerbated by improved support for domain-based certs in mail configs.

Joe · June 11, 2020, 5:27pm

Oh, also make a note of where your chain cert(s) are when deleting those ssl_ca lines, as we’ll need to bundle them into the ssl_cert if this proves to be the problem.

adamjedgar · June 12, 2020, 3:11am

by ssl_ca directives, do you mean the locations of those certs? See below

!include_try local.conf
local_name domain1.com {
  ssl_cert = </home/domain1.com/ssl.cert
  ssl_key = </home/domain1.com/ssl.key
  ssl_ca = </home/domain1.com/ssl.ca
}
local_name www.domain1.com {
  ssl_cert = </home/domain1.com/ssl.cert
  ssl_key = </home/domain1.com/ssl.key
  ssl_ca = </home/domain1.com/ssl.ca
}

i have the above for all domains/virtual servers on my system for all my clients.

Joe · June 12, 2020, 4:23am

Yes, get rid of every ssl_ca line. Just comment it out with a #, or make a note of them, so you can grab it later (we need to bundle it into the ssl_cert instead, though that only matters for quite old clients and some mobile clients).
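A rough sketch of commenting out the ssl_ca lines in one pass while keeping a record of them. The function name is invented for illustration, and the config path in the usage comment is an assumption; point it at whichever file holds the local_name blocks.

```shell
# Comment out every ssl_ca line in a Dovecot config file.
comment_out_ssl_ca() {
    conf=$1
    # List the lines first so the chain file locations can be noted down.
    grep -n '^[[:space:]]*ssl_ca[[:space:]]*=' "$conf"
    # Prefix each match with '#', preserving indentation.
    # The untouched original is kept alongside as "$conf.bak".
    sed -i.bak -E 's/^([[:space:]]*)(ssl_ca[[:space:]]*=)/\1#\2/' "$conf"
}

# Usage (assumed path), followed by a Dovecot restart:
#   comment_out_ssl_ca /etc/dovecot/dovecot.conf
```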

adamjedgar · June 13, 2020, 10:46pm

My monitoring service has gone down again overnight…and this time, i thought i would just attempt to start dovecot.service through virtualmin gui interface (webmin>system>bootup and shutdown)

It wouldnt start.

I am wondering why i shouldnt just insert the following commands into the system monitor…
for i in $(ps aux | grep dovecot | awk '{print $2}'); do kill -9 $i; done
then,
systemctl start dovecot.service

That is all that i need to do in putty to get the system monitor seeing that dovecot is online again (yeah i know, its not solving the problem and emails are still being delivered anyway).

Having said that, whilst webmin/virtualmin status monitors are all saying dovecot is running, if i check in command shell, it still says dovecot has entered a failed state! This is crazy.
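To make that kind of disagreement visible, the two views can be compared side by side. A small sketch, where the function name and the message wording are made up for illustration:

```shell
# $1: unit state as printed by "systemctl is-active dovecot"
#     (active, inactive, failed, ...)
# $2: number of dovecot master processes actually running
compare_views() {
    state=$1
    procs=$2
    if [ "$state" = "active" ] && [ "$procs" -gt 0 ]; then
        echo "consistent: running"
    elif [ "$state" != "active" ] && [ "$procs" -eq 0 ]; then
        echo "consistent: stopped"
    elif [ "$procs" -gt 0 ]; then
        echo "mismatch: systemd says $state but a dovecot process is running"
    else
        echo "mismatch: systemd says active but no dovecot process exists"
    fi
}

# Usage:
#   compare_views "$(systemctl is-active dovecot)" "$(pgrep -cx dovecot)"
```

The first mismatch case is the one described in this thread: systemd reports a failed unit while an older master process is still serving mail.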

what is comforting: after i went into webmin>system>bootup and shutdown and stopped both the dovecot.service and dovecot.socket processes (which webmin thought were not running anyway), i sent an email from hotmail to an admin account on this server, and after restarting dovecot about 5 minutes later the email got delivered…so i guess this at least means my backup MX record to the other virtualmin system might actually be doing its job? (is there any way to check that from the backup email server?)

OK…so Joe onto what you suggested in your last post…

you have me a little worried…my clients already had lost access to their emails a few weeks ago when this first started. I am worried that if i do this and they lose access again…well poo might hit the fan. I suppose I just need to be able to visualise in my head the pathway here…if i can see where i am going and why…i wont panic so much if something goes wrong.

1. could you elaborate on exactly why i need to do this?
2. is this something that servers that are running correctly have set up by default in virtualmin?
3. if all of the above is bundled into the ssl certificate, what happens when a renewal is issued for ssl at some point in the near future?
4. Is this only for testing/debugging and would be reverted back again after the testing/debugging process is complete?
5. I note that out of all the domains on my system, there is one (that also has the most email usage) where its SSL cert has not been copied to dovecot. Could that be doing this if user email clients are connecting via SSL?

Joe · June 14, 2020, 7:42am

Because we misunderstood this option. It doesn’t do what we all thought it did (and what examples on the web indicated it did and what intuitively it seems like it would). The next version of Virtualmin will not use it and will instead bundle the CA chain certs into the ssl_cert file.

I do not know if this is the cause of the problems you’ve seen. I have never seen it happen on our servers (which use the same type of configuration including the ssl_ca being set). I simply don’t know, that’s why I said please test and let me know what happens (though if Virtualmin puts the ssl_ca back in place when renewing Let’s Encrypt certs, which it very likely will, we probably won’t actually know).

Again, the next version of Virtualmin gets this right. We thought we were getting it right in the past…and it’s never caused a problem in the past (many years worth). But, now that we know it’s not actually right (despite acting right in many cases) we’ve fixed it.

I really don’t know what else to tell you. This config option was used wrong, we know that for sure. Whether it causes the problems you see, I don’t know and won’t know for sure until it’s fixed.

ssl_ca will never be used this way again after the next version of Virtualmin goes out. It is wrong. It isn’t for the CA chain bundle, even though it seems to work like it is sometimes.
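For illustration, bundling the chain into the file that ssl_cert points at amounts to concatenating the server certificate followed by the chain certificates. A minimal sketch, where the function name and the output file name are hypothetical and not necessarily what Virtualmin itself will use:

```shell
# Write the server certificate followed by the CA chain into one file.
# $1: server certificate, $2: CA chain file, $3: output bundle
bundle_cert() {
    cert=$1
    chain=$2
    out=$3
    # "awk 1" copies every line and guarantees each input ends with a
    # newline, so the END/BEGIN markers of adjacent certificates never
    # run together on one line.
    awk 1 "$cert" "$chain" > "$out"
}

# Usage (output name is an assumption):
#   bundle_cert /home/domain1.com/ssl.cert /home/domain1.com/ssl.ca \
#               /home/domain1.com/ssl.combined
# and then in the local_name block:
#   ssl_cert = </home/domain1.com/ssl.combined
```

A bundle built by hand like this is a snapshot, so it would need rebuilding whenever the certificate is renewed.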

I don’t understand this question.

adamjedgar · June 14, 2020, 7:56am

well i have a few domains where their SSL cert is being used in dovecot. however, in one domain (virtual server), which is by far the most used email on my system, Virtualmin does not say that it has assigned the domain SSL cert to dovecot (only usermin)

What cert does dovecot default to in virtualmin when email is enabled, but where the domain (virtual server) letsencrypt ssl cert has not been copied to dovecot?

Its got me confused, because imap and pop3 clients are connecting to this domain (virtual server) via SSL no problems at all!

I had no idea this domains SSL cert wasnt assigned to dovecot in virtualmin until this morning!

philmck · June 14, 2020, 11:31am

I’ve been lurking with interest, because I’ve been suffering repeated but intermittent problems with Dovecot failing on more than one server (all running Virtualmin on Ubuntu). In my case it nearly always seems to happily restart from the front dashboard.

I just thought I’d mention that the OOM killer might be worth more investigation, because for me the problem seems to correlate with weekly backups that run early on Sundays and temporarily take a lot of CPU and bandwidth. But not always.
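One way to look for OOM-killer activity around those backup windows is to filter the kernel log. A small sketch; the function name is invented here, and the pattern only covers the common kernel message wordings, so an empty result is not proof that nothing was killed:

```shell
# Filter a kernel log on stdin down to lines that look like
# OOM-killer activity. Returns success even when nothing matches.
oom_events() {
    grep -Ei 'out of memory|oom-kill' || true
}

# Usage:
#   journalctl -k --since "7 days ago" | oom_events
#   dmesg -T | oom_events
```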

Joe · June 14, 2020, 7:08pm

So…clicking “Copy to Dovecot” does not do what many folks think it does (that’s a UX failure on our part that documentation doesn’t seem to be able to solve). That button goes back to before per-domain certs were possible, and it was intended to set the SSL certificate for Dovecot (and it still sets the SSL certificate for Postfix). Now it only sets the default certificate…any virtual servers that have a certificate will get that certificate configured and used when users connect to the given domain.

We still recommend everyone use one domain for mail connections (which is the one domain you’d use the “Copy to…” buttons for), until Postfix SNI support is complete and widespread. But, Dovecot gets certs set up for domains that have certs.

In short: The “Copy to…” buttons are meant to be pressed once for your “main” domain, the one that everyone will connect to for mail. For now, anyway. Postfix SNI comes in the next release, but will probably be quirky until we work out all the kinks and until we’ve tested across all the various Postfix versions we have to support. It’s an area in flux in the upstream software, not just in Virtualmin, so sometimes when we think we know something (from prior testing) it turns out we don’t or it changed out from under us (like the ssl_ca option).

If you’re using one main domain for mail, you won’t experience those pains and the adventurous folks who don’t mind a little quirkiness will find all the bugs. A couple of months after the next Virtualmin release (probably around about the time of Virtualmin 7) it’ll be a good/safe time to switch.

So, on this one, it’s automatic for Dovecot. You don’t need to “assign” anything if everyone is connecting using a virtual domain name that has a certificate managed by Virtualmin. The “Copy to…” button is for the default cert in Dovecot (and the only cert in Postfix for now). You don’t press it for every domain…it’s nonsensical to do so. It just replaces the default with the new domain every time you press it.
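The default-versus-per-domain behaviour described above can be sketched as a lookup over the config: use the ssl_cert from a matching local_name block if there is one, otherwise fall back to the top-level ssl_cert. This is a rough illustration with a hypothetical function name. It assumes flat local_name blocks with a single name each, written with spaces around the equals sign as in the snippet posted earlier, and it is not how Dovecot itself parses its config.

```shell
# Print the certificate path a given domain would be served with.
# $1: config file, $2: domain name the client connects to
cert_for_domain() {
    conf=$1
    domain=$2
    awk -v want="$domain" '
        $1 == "local_name" { in_block = 1; name = $2; next }
        $1 == "}"          { in_block = 0; next }
        $1 == "ssl_cert" {
            path = $3
            sub(/^</, "", path)          # drop the "<" read-from-file marker
            if (!in_block)         fallback = path   # top-level default cert
            else if (name == want) found = path      # per-domain cert
        }
        END { print (found != "" ? found : fallback) }
    ' "$conf"
}

# Usage (config path is an assumption):
#   cert_for_domain /etc/dovecot/dovecot.conf domain1.com
```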