Dovecot Failed state

Joe,

After you sent me over to this thread from my “Dovecot left in Mixed State” thread, I reviewed the dovecot.conf file on one of the problem servers. Two of the domains which had their certs updated did have CA lines. Perhaps those did leave Dovecot in this mixed state. Lots of folks were logged in and receiving email, but Webmin thought Dovecot was not running and I suspect that new logins were failing.

I know this isn’t much help, but after removing those entries and then updating a couple of certs this morning, there were no issues. Of course that doesn’t really mean a thing at this point.

Again, of note, I run a paid EV cert on my systems. I do have that set up in the 10-ssl.conf for Dovecot, and it does have the ca line. I had originally set this to /etc/pki/tls/certs/STAR_domain_com.ca-bundle, but I see now that it had been changed to what may have been the CA file of the last of those two certs mentioned above. I have now corrected that, but I failed to check the last write date before I did so, so I’m not sure when this happened. Either way, there is a 50/50 chance that the ca listed there was from the last one I updated which had the ca line in dovecot.conf.

John

Are you sure that dovecot is working properly?
Mine is down again (twice) without a reboot. There was an update a few weeks ago, I guess.

On my system, Dovecot still delivers emails. See if yours is delivering even though it says it has entered a failed state.
The reason mine enters a failed state is that the system is trying to start Dovecot when it’s actually already running. However, the system doesn’t know Dovecot is running, hence it’s trying to restart it.
What I don’t know is why. It can run happily for days or weeks, then boom, out of the blue this happens? It’s SSL-related, but I’m not exactly sure how that sets it off. But it keeps delivering emails anyway.
On my system it’s a problem with the system monitoring, I think.

HI guys
I’m also hitting problems with this after running upgrades on a CentOS 8 server last night. It’s now 4:30 am, and email’s been down for a few hours.

It looks like I have several of these entries per domain (of which there are many). Without counting, there could be hundreds of them:
local_name www.example.com {
ssl_cert = </home/example.com/ssl.cert
ssl_key = </home/example.com/ssl.key
#ssl_ca = </home/example.com/ssl.ca
}
I have commented out all the ssl_ca lines.
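
With hundreds of entries, commenting the ssl_ca lines out by hand is tedious. A minimal sketch of doing it in one pass follows; the function name comment_out_ssl_ca is made up for illustration, and it assumes GNU sed (for -i) as shipped on CentOS.

```bash
# Hypothetical helper: comment out every active ssl_ca line in a Dovecot
# config file. A backup is saved next to the file as <file>.bak first.
comment_out_ssl_ca() {
    local conf="$1"
    cp -p -- "$conf" "$conf.bak" || return 1
    # Match optional leading whitespace followed by "ssl_ca =". Lines that
    # are already commented have a "#" before ssl_ca and are left untouched.
    sed -i -E 's/^([[:space:]]*)(ssl_ca[[:space:]]*=)/\1#\2/' "$conf"
}
```

It would be called as comment_out_ssl_ca /etc/dovecot/dovecot.conf. Running doveconf -n afterwards is a quick way to confirm the file still parses before restarting Dovecot.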

But in any case, shouldn’t all this per-domain cert stuff be done under conf.d/ and not directly in dovecot.conf?

I’m also seeing this error:
Jun 18 03:30:54 centos8 dovecot[7488]: config: Warning: /etc/dovecot/dovecot.conf line 682: Global setting ssl_cert won't change the setting inside an earlier filter at /etc/dovecot/dovecot.conf line 166 (if this is intentional, avoid this warning by moving the global setting before /etc/dovecot/dovecot.conf line 166)

That seems to be due to Virtualmin having duplicated domain cert stuff into dovecot.conf (at a quick check this could be many times in some cases).
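
To put a number on the duplication rather than eyeballing it, something like the following could be used. It is only a sketch: duplicate_local_names is a made-up name, and it assumes each section starts with local_name followed by a single unquoted hostname, as in the entries quoted above.

```bash
# Hypothetical helper: print every hostname that has more than one
# local_name section in the given file, prefixed with how often it occurs.
duplicate_local_names() {
    awk '$1 == "local_name" { count[$2]++ }
         END {
             for (name in count)
                 if (count[name] > 1) print count[name], name
         }' "$1" | sort -k2
}
```

For example, duplicate_local_names /etc/dovecot/dovecot.conf prints one line per duplicated hostname and nothing at all if every hostname appears once.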

And whether it’s separate or connected, I’m also getting this error since the earlier upgrade:
dovecot[160359]: imap-login: Error: Failed to initialize SSL server context: Can't load DH parameters: error:1408518A:SSL routines:ssl3_ctx_ctrl:dh key too small: user=<>

I’ve gone to sleep for a bit.
later

l.

I thought the duplicates thing had been fixed. There should only be one per domain. Are you sure you’re running the latest Virtualmin version? (6.09-3)

Hi Joe
thanks for reply.
Sorry, should have posted version info which is:
CentOS 8.2.2004
Dovecot 2.3.8
Virtualmin 6.09
Webmin 1.942

The duplicates thing: You could well be correct. I might be using a backup copy of an older dovecot.conf.

This error:
“Global setting ssl_cert won’t change the setting inside an earlier filter”

seems to be caused by multiple domain certs being copied in that aren’t in a ‘local’ wrapper. I guess that might just have been caused by me hitting the “Copy to Dovecot” button many times. I’d often wondered why it didn’t disable after use! :slight_smile:

The “Can’t load DH parameters” error was caused by upgrading to 2.3.8. I fixed it by generating dh.pem.
In fact, Dovecot spits out this solution in maillog:
dd if=/var/lib/dovecot/ssl-parameters.dat bs=1 skip=88 | openssl dhparam -inform der > /etc/dovecot/dh.pem
but it didn’t work.

the dovecot wiki had a method that worked:
https://doc.dovecot.org/configuration_manual/dovecot_ssl_configuration/#dovecot-ssl-configuration

openssl dhparam 4096 > dh.pem

It warns it will take a long time, and it did!
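
Note that the openssl command above writes dh.pem into whatever directory it is run from. In Dovecot 2.3 the file also has to be referenced through the ssl_dh setting, for example ssl_dh = </etc/dovecot/dh.pem (the path Dovecot’s own log hint uses). A small sketch for checking which file a config actually points at; ssl_dh_path is a made-up helper name.

```bash
# Hypothetical helper: print the file referenced by the first active
# (uncommented) ssl_dh line in a Dovecot config file, or nothing if
# there is none.
ssl_dh_path() {
    sed -n -E 's/^[[:space:]]*ssl_dh[[:space:]]*=[[:space:]]*<[[:space:]]*//p' "$1" | head -n 1
}
```

On a stock layout this would be run against /etc/dovecot/conf.d/10-ssl.conf. An empty result suggests ssl_dh is not set in that file, which would be worth checking if the DH error shows up again.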

I seem to be receiving and sending mail again ok.

Thanks as always

l.

So Joe, can you just make this as simple an answer as possible so I am absolutely clear…

Question 1

in my webmin server /etc/dovecot.conf,

1a. I do not need any of the entries below for any domain at all? I can delete them all, because Virtualmin automatically figures all this out for me when a new SSL cert is requested (either self-signed or a Let’s Encrypt one, etc.)?

ssl_cert = <home/client1/ssl.cert
ssl_key = <home/client1/ssl.key
ssl_ca = <home/client1/ssl.ca

ssl_cert = <home/client2/ssl.cert
ssl_key = <home/client2/ssl.key
ssl_ca = <home/client2/ssl.ca

ssl_cert = <home/client3/ssl.cert
ssl_key = <home/client3/ssl.key
ssl_ca = <home/client3/ssl.ca

1b. I have to ask: if I delete the above references, how does Virtualmin know that the SSL certs are in /home/client1/…? (That is where the certs are currently stored; I can see them in the file manager for each client.)

Question 2

In terms of the copy to buttons for each virtual server/domain…I only do this for my server’s primary domain?

Question 3

I “DO NOT” click any of these copy to buttons for any client domain on a shared server with only a single IP address?

Are these actually just like this in your config file and not in local_name sections? If so, how did they get that way?

Virtualmin creates sections like this:

local_name virtualmin.com {
ssl_cert = </home/virtualmin/ssl.cert
ssl_key = </home/virtualmin/ssl.key
}
local_name www.virtualmin.com {
ssl_cert = </home/virtualmin/ssl.cert
ssl_key = </home/virtualmin/ssl.key
}

It also currently includes ssl_ca but we know that’s wrong, as we’ve discussed at great length above.

If you have these hanging out like this outside of local_name sections, it’s definitely, 100%, absolutely wrong and cannot work the way you think it does. There can only ever be one set of “default” cert/key. You can’t have more than one. It doesn’t make sense. The local_name sections are for setting up SNI. There should only ever be one pair of cert/key for the “default” in Dovecot.
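
To see whether a config has more than one “default” set, the settings sitting outside any braces can be listed with a short awk pass. This is a sketch under the assumption that braces in the file are only used for section nesting; top_level_ssl_settings is a made-up name.

```bash
# Hypothetical helper: list ssl_cert/ssl_key/ssl_ca lines that are NOT
# inside any { ... } section, with their line numbers.
top_level_ssl_settings() {
    awk '
        {
            line = $0
            sub(/#.*/, "", line)    # drop comments before inspecting
            if (depth == 0 && line ~ /^[ \t]*ssl_(cert|key|ca)[ \t]*=/)
                print FNR ": " $0
            # track nesting: +1 per "{", -1 per "}"
            depth += gsub(/[{]/, "{", line) - gsub(/[}]/, "}", line)
        }
    ' "$1"
}
```

If top_level_ssl_settings /etc/dovecot/dovecot.conf prints more than one ssl_cert line, the file has the multiple-defaults problem described above.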

From the local_name sections that Virtualmin automatically sets up for every domain that has a cert and mail enabled.

Yes. That is true of all of the “Copy to…” buttons in the SSL page. Only one can ever be used. If you click Copy to… for multiple domains, it will (or should, though it seems like maybe there’s a bug, if you have a whole bunch of these entries, unless you created them manually) copy over the previous one. Those buttons are going away or getting completely refactored in next version, as they confused the hell out of people even after a doc and label update. This button for Dovecot should result in one (and only 1) default ssl_key/ssl_cert to be configured in Dovecot. It should be the “main” domain for your server, whatever that is.

I currently recommend, and have always recommended, that you use one domain for all mail-related services, and those buttons (for Dovecot and Postfix) would be how you set it. Postfix SNI is coming in the next release, for systems that can use it, but I still think most people should use one domain for mail until things have stabilized a bit and we’ve gotten some widespread testing by experienced users. We’ve never recommended you use different domains for mail, but we do support it for POP/IMAP (but since it is not supported in Postfix, it is wildly confusing to set up because your SMTP server won’t match your POP/IMAP server).

NO. YOU DO NOT. I’ve said many times in this thread (and every other thread about mail and certs), you click it once for the domain that will be the default for mail. Again, we are changing the UI to make it harder to misunderstand what this is about. It was a confusing UI, we know we screwed up the UX here, but you really can believe me when I tell you what it is for and what it does. I would not lie to you about this, and I’m running out of ways to explain it. :wink:

Yes, that is what mine looks like, with the exception that my system has also inserted

ssl_ca = </home/virtualmin/ssl.ca

So I will go and rename/delete all of the ssl_ca lines.


Hello,
My Dovecot entered a failed state again today, with the ssl_ca lines commented out.

I found out that at the same time as Dovecot was killed (12:55), one virtual server had its Let’s Encrypt certificate renewed:

/var/log/syslog

/var/log/mail.err

virtualmin

systemctl status dovecot

Master.pid does not exist:

Email delivery is still working at the moment.

The first time I noticed this issue (a client called me), the situation was similar but worse: Dovecot was not running at all and email delivery was dead.

My guess would be there is a problem in post-tasks (how dovecot is killed/restarted) after letsencrypt certificate is automatically renewed. I didn’t notice problems with manual renewal.

Those are exactly the same symptoms as mine, and your logs and error messages are also identical to mine. The first time, no email was being delivered, and I had to roll back to a server backup from the day before to get it resolved. Mostly now, even though it enters a failed state exactly like yours, mail is still being delivered.
I have not been able to reliably get it working again without, quite often, a hard reset of the system. Even then, Dovecot usually won’t automatically restart on reboot. I’m having to manually start Dovecot after a reboot.

I have even tried killing the processes for Dovecot. The command shell simply says there are none running, even though the same error messages pop up when I do a “systemctl status dovecot”!

I am starting to think there is a synchronisation issue between Dovecot and Debian, and maybe even Virtualmin itself, because sometimes Debian says Dovecot is not running even though Virtualmin says it is, and vice versa.
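
The disagreement described here (systemd saying one thing, the process list saying another) can be made explicit by comparing the two directly. A minimal sketch follows; classify_dovecot_state is a made-up name, and the pgrep pattern in the usage note assumes Dovecot’s child processes carry titles starting with dovecot/, as the pgrep -fx dovecot/imap check used elsewhere in this thread implies.

```bash
# Hypothetical helper: combine what systemd reports with the number of
# Dovecot processes actually found, and name the resulting situation.
# Usage: classify_dovecot_state <unit_state> <process_count>
classify_dovecot_state() {
    local unit_state="$1" procs="$2"
    if [ "$unit_state" = "active" ] && [ "$procs" -gt 0 ]; then
        echo "ok"
    elif [ "$unit_state" != "active" ] && [ "$procs" -gt 0 ]; then
        echo "mixed: processes running but unit is $unit_state"
    elif [ "$unit_state" = "active" ]; then
        echo "mixed: unit is active but no processes found"
    else
        echo "down"
    fi
}
```

On a live box it would be fed real values, for example classify_dovecot_state "$(systemctl is-active dovecot)" "$(pgrep -c -f '^dovecot/')". A “mixed” result matches the state people are reporting in this thread.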

Hi,
Today I found out that starting Dovecot through the control panel does not work.

  • First I manually killed the dovecot process with kill -9
  • Then I tried to start dovecot from the control panel and this happened:
    Processes started (checked with ps axu | grep dovecot) and email was being delivered fine, but the status was failed:
  • Then I stopped dovecot from the control panel
  • and started dovecot from the console with # systemctl start dovecot
    Now the status is active

edit:
I can confirm: the next Let’s Encrypt certificate was just renewed, and the auto-renewal killed Dovecot and failed to start it again.

As mentioned in a different topic, we suffer from the same issues: Dovecot Left in mixed failed state after LetsEncrypt gets a new cert

In order to somewhat mitigate the downtime I’ve created a script that runs every 5 minutes as a cronjob:

#!/bin/bash
# This script tries to recover dovecot when it was unable to restart itself
# https://forum.virtualmin.com/t/dovecot-left-in-mixed-failed-state-after-letsencrypt-gets-a-new-cert/106125/6
# https://forum.virtualmin.com/t/dovecot-failed-state/105718/73

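# pgrep -fx succeeds only if some process has a full command line that is
# exactly "dovecot/imap"; if none is found, try to start the service.
if ! pgrep -fx dovecot/imap > /dev/null; then
    echo "Dovecot is not running, starting dovecot..."
    systemctl start dovecot
fi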

We’ll be using this to keep our uptime while a proper fix is being worked on.

Use script at own risk.

Again, why reinvent the wheel?
Monit ftw.

Because this temporarily serves its purpose and is not meant to last. It was also quicker to write.

Today I found this in the logwatch report of a server with Dovecot apparently down according to the Webmin dashboard. The file /var/run/dovecot/config seems to exist but is zero length, symlinked from /run/dovecot/config, owned by root, chmod 0600.

--------------------- Dovecot Begin ------------------------

Dovecot was killed, and not restarted afterwards.

Dovecot disconnects: 6 Total

Unmatched Entries
dovecot: anvil: Fatal: Error reading configuration: read(/var/run/dovecot/config) failed: read(size=8192) failed: Connection reset by peer: 1 Time(s)
dovecot: master: Dovecot v2.2.33.2 (d6601f4ec) starting up for imap, pop3, pop3 (core dumps disabled): 1 Time(s)
dovecot: master: Error: unlink(/var/run/dovecot/master.pid) failed: No such file or directory (in main.c:518): 1 Time(s)
dovecot: ssl-params: Fatal: Error reading configuration: read(/var/run/dovecot/config) failed: read(size=8192) failed: Connection reset by peer: 1 Time(s)

---------------------- Dovecot End -------------------------
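
One thing that may be worth ruling out before reading too much into the zero length: in a stock Dovecot setup the config entry in the runtime directory is normally a UNIX socket served by the config process rather than a regular file, and a socket always shows a size of zero. The “Connection reset by peer” wording in the errors above would also be consistent with a socket. A small sketch to check what the path really is; describe_path is a made-up name.

```bash
# Hypothetical helper: say what kind of thing a path is. Symlinks are
# followed, so /var/run/... and /run/... give the same answer.
describe_path() {
    local p="$1"
    if [ -S "$p" ]; then
        echo "socket"
    elif [ -f "$p" ] && [ ! -s "$p" ]; then
        echo "empty regular file"
    elif [ -f "$p" ]; then
        echo "regular file"
    elif [ -e "$p" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

If describe_path /var/run/dovecot/config prints socket, the zero size is expected and probably not the cause. An empty regular file there would be more surprising.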

Hi,
I just wanted to comment that I am having the same problem with dovecot that Orao has. I have the same log errors and dovecot always dies when a Let’s Encrypt certificate is renewed. Although sometimes it happens without a Let’s Encrypt renewal.
I tried without ssl_ca lines and it doesn’t make any difference.
If I’m lucky enough, the /usr/bin/dovecot process survives and the email continues to work.

I’m not sure, but I think it started when I upgraded CentOS from version 7.7 to 7.8

You know, about the only time I ever see a master pid error is when one of my clients’ email accounts is full.
So what I am saying is not about the email account, it’s about the cause: full.
I wonder if that file you mention being “zero length” is related to the problem?

Howdy,
my server has been having this issue for over a year and a half. Never thought to report it…my bad.
Every LetsEncrypt Cert update kills Dovecot, but not all processes running under Dovecot. This is the reason that it says running but the Vmin console properly reports that Dovecot is down, because it is.
Running a script that kills all, ALL Dovecot processes and then restarting Dovecot allows things to go back to normal.
My opinion(only that) is the script that Vmin uses after updating certs is broken or incorrect.
FYI, CentOS 7, fully updated, used Vmin script to install from bare OS about 3 years ago
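
The actual script isn’t posted, so the following is only a guess at the general shape of such a “kill everything, then start” workaround. The names, the five second grace period, and the process pattern are all assumptions: the pattern expects the master to run as /usr/sbin/dovecot or /usr/bin/dovecot and the children to have titles beginning with dovecot/. Setting DRY_RUN=1 prints the commands instead of running them.

```bash
# Assumed pattern for the Dovecot master and its children; adjust to
# whatever "ps axu | grep dovecot" shows on the system in question.
DOVECOT_PATTERN='^(/usr/s?bin/)?dovecot(/| |$)'

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop the unit, remove any leftover processes, clear the failed state
# in systemd, then start the service cleanly.
restart_dovecot_clean() {
    run systemctl stop dovecot
    run pkill -TERM -f "$DOVECOT_PATTERN"
    [ "${DRY_RUN:-0}" = "1" ] || sleep 5   # give processes time to exit
    run pkill -KILL -f "$DOVECOT_PATTERN"
    run systemctl reset-failed dovecot
    run systemctl start dovecot
}
```

Calling restart_dovecot_clean with DRY_RUN=1 first shows exactly what would be executed. Since it kills active sessions too, it is a workaround rather than a fix.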

The bug could be reproduced if we could manually run the same script that the automatic Let’s Encrypt renewal runs in the background. When certificates are manually renewed from the control panel, it doesn’t affect Dovecot.

If we could run this script manually, it would be easier to do more debugging. I was already looking into /usr/share/webmin/dovecot/dovecot-lib.pl, and some extra logging could help to find at which point things go wrong.
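
Until extra logging exists, the master process messages already in the mail log can at least be lined up against renewal times. A rough sketch; dovecot_master_events is a made-up name, and the pattern is based only on the log lines quoted in this thread (dovecot: master: … and dovecot[PID]: …), so other syslog formats may need a different expression.

```bash
# Hypothetical helper: show only the Dovecot master process messages
# (startup, master.pid errors and similar) from a log file.
dovecot_master_events() {
    grep -E 'dovecot(\[[0-9]+\])?: master: ' "$1"
}
```

For example, dovecot_master_events /var/log/syslog gives the timestamps at which the master started up or complained, which can then be compared with the time Virtualmin reports for the certificate renewal.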

Ubuntu 18.04.4
webmin: 1.942
virtualmin: 6.09
dovecot: 2.2.33.2 (d6601f4ec)