Dovecot Failed state

adamjedgar · May 15, 2020, 11:38pm

Guys this is insane…as soon as one changes anything to do with a virtual server SSL dovecot fails almost immediately.

I keep getting the same error over and over again…

dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login

I am finding it almost impossible to kill the dovecot process such as dovenull and dovecot anvil. They restart in seconds after typing in systemctl kill dovecot…

the really stupid part is, webmin and virtualmin dont seem to be able to even agree with each other what in fact is the status of dovecot?

Right now Dovecot i think is actually running…

~]# systemctl status dovecot
● dovecot.service - Dovecot IMAP/POP3 email server
Loaded: loaded (/lib/systemd/system/dovecot.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2020-05-16 09:16:06 AEST; 11min ago
Docs: man:dovecot(1)

There is something wrong between virtualmin, webmin, virtual server domains, SSL, and Dovecot. Whatever it is, it is taking out the Dovecot process whenever i play with a domain in any way that involves SSL.

for example, all i did this morning was resolve a “too many redirects” issue on the website for my primary domain this morning and then dovecot stopped working again.
I went to Virtualmin>Server Configuration>website redirects…and removed the virtualmin redirect to https that was causing the error (i already have this redirect in htaccess file in public_html.
i then went to a second virtual server where the redirect from .com.au to .com domain simply refuses to work with SSL and did the same.

After doing the above and nothing more, dovecot went down within exactly the same time frame!

Something is definately not right!

also on another brand new Virtualmin GPL system (debian 10), I cannot get it to auto generate an SSL certificate for the host.tesla.com…even if i create a virtual server called clientdomain.com and use that to obtain SSL for clientdomain.com and host.tesla.com it fails. If i create a virtual server called host.tesla.com and attempt to obtain SSL from letsencrypt it also fails.

To say these issues are unrelated simply cannot be. There is something that has been done recently in webmin 6.09 update that has caused this and clearly the webmin 6.09-3 update has not resolved any of the above problems. My system has been functioning perfectly fine for almost 2 years with only minor issues, however this issue has hovered in the background for a number of months and now its turned into a full blown pandemic seemingly exacerbated by webmin updates!

I have largely gone through all of the updates during that time trouble free…but this, this has got me to the point of being forced to move across to ISPconfig (which i do not want to do…i like Virtualmin). My clients are screaming at me citing unreliability in email over the last week!

My hope is that its something i have done wrong and Im sure that i must be at fault…but if that error turns out to be running webmin updates???

calport · May 16, 2020, 2:15am

This is not to hijack the thread but rather to point out a similar though unrelated issue of SSL certs on a fresh install of Ubuntu 18 LTS on the latest Webmin and Virtualmin GPL - Virtualmin’s Let’s Encrypt fails every time to generate SSL certs for hostname or any new virtual servers that I create. There is certainly something amiss relatd to SSL management, I agree with @adamjedgar and one symptom of this is the issue that he has described here, the other is the one that I have described.

@Joe please help.

Joe · May 16, 2020, 3:31am

@adamjedgar I’m logged into your server and will have a look around. I’ll follow up on your private ticket when I know more.

Joe · May 16, 2020, 3:56am

I don’t see anything wrong with Dovecot status on your system at the moment. Everything (Webmin’s System and Server Status module, Virtualmin’s dashboard, and systemctl status dovecot) agrees it is up and running.

You have an initscript, probably from an OS upgrade, which is down, but you can ignore that. And, there is a dovecot.socket service…that’s for if you’re running dovecot in an on-demand sort of way, and should not be used in a dedicated server where Dovecot is running all the time. (this is systemd’s answer to old school inetd socket activation.)

adamjedgar · May 16, 2020, 11:22pm

deleted as more problems found

adamjedgar · May 16, 2020, 11:32pm

Joe, I note that dovecot failed again last night at 8pm.
I checked in /var/log sys.log (saved/dated 17th april at 06.25 am) and found that following entries for the time of failure 8pm the night before. You can access the log file yourself on my server if you need to look more closely at it to see actual domains etc.

in particular note the entries in bold

May 16 20:00:18 server1 named[576]: network unreachable resolving ‘tesla.com/A/IN’: 2400:b800:1:1::11#53

May 16 20:01:05 server1 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=1.144.105.116, lip=104.156.233.188, TLS handshaking: SSL_accept() failed: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46, session=

and

Also, that ipaddress 45.142.195.15 and .13 etc… (internethosting-ltd@yandex.ru) its hitting my server over and over again. Is it malicious? should i block it?

extract from the relevant log file is in next post…

adamjedgar · May 17, 2020, 12:23am

Hey i think these are the log entries at the time of dovecot failure…(i have substituded my server host.domain.com for tesla.com, client recipient and sender email addresses are changed too)

May 16 20:00:02 server1 postfix/pickup[32559]: B27C07D0AB: uid=0 from=webmin@server1.tesla.com
May 16 20:00:02 server1 postfix/cleanup[3326]: B27C07D0AB: message-id=1589623202.3323.1@server1.tesla.com
May 16 20:00:02 server1 postfix/qmgr[1277]: B27C07D0AB: from=webmin@server1.tesla.com, size=531, nrcpt=1 (queue active)
May 16 20:00:03 server1 postfix/smtp[3377]: warning: live.com.au: dane configured with dnssec lookups disabled
May 16 20:00:05 server1 postfix/smtp[3377]: B27C07D0AB: to=user@live.com.au, relay=apc.olc.protection.outlook.com[104.47.124.33]:25, delay=3.1, delays=0.01/0.01/1.7/1.4, dsn=2.6.0, status=sent (250 2.6.0 1589623202.3323.1@server1.tesla.com [InternalId=84451941941286, Hostname=HK2APC01HT103.eop-APC01.prod.protection.outlook.com] 7547 bytes in 0.391, 18.836 KB/sec Queued mail for delivery -> 250 2.1.5)
May 16 20:00:05 server1 postfix/qmgr[1277]: B27C07D0AB: removed
May 16 20:00:09 server1 postfix/smtpd[6953]: connect from unknown[45.142.195.14]
May 16 20:00:14 server1 postfix/smtpd[3687]: connect from unknown[45.142.195.13]
May 16 20:00:18 server1 named[576]: network unreachable resolving ‘tesla.com/A/IN’: 2a01:8840:be::1#53
May 16 20:00:18 server1 named[576]: network unreachable resolving ‘tesla.com/A/IN’: 2400:b800:1:1::11#53
May 16 20:00:18 server1 named[576]: network unreachable resolving ‘tesla.com/A/IN’: 2400:b800:0:1::11#53
May 16 20:00:18 server1 named[576]: network unreachable resolving ‘tesla.com/DS/IN’: 2a01:8840:bf::1#53
May 16 20:00:18 server1 named[576]: network unreachable resolving ‘tesla.com/DS/IN’: 2a01:8840:c0::1#53
May 16 20:00:19 server1 postfix/smtpd[6953]: warning: unknown[45.142.195.14]: SASL LOGIN authentication failed: authentication failure
May 16 20:00:19 server1 postfix/smtpd[6953]: too many errors after AUTH from unknown[45.142.195.14]
May 16 20:00:19 server1 postfix/smtpd[6953]: disconnect from unknown[45.142.195.14] ehlo=1 auth=0/1 rset=1 commands=2/3
May 16 20:00:22 server1 postfix/smtpd[3738]: connect from unknown[45.142.195.15]
May 16 20:00:28 server1 postfix/smtpd[3687]: warning: unknown[45.142.195.13]: SASL LOGIN authentication failed: authentication failure
May 16 20:00:28 server1 postfix/smtpd[3687]: too many errors after AUTH from unknown[45.142.195.13]
May 16 20:00:28 server1 postfix/smtpd[3687]: disconnect from unknown[45.142.195.13] ehlo=1 auth=0/1 rset=1 commands=2/3
May 16 20:00:29 server1 postfix/smtpd[3738]: warning: unknown[45.142.195.15]: SASL LOGIN authentication failed: authentication failure
May 16 20:00:29 server1 postfix/smtpd[3738]: too many errors after AUTH from unknown[45.142.195.15]
May 16 20:00:29 server1 postfix/smtpd[3738]: disconnect from unknown[45.142.195.15] ehlo=1 auth=0/1 rset=1 commands=2/3
May 16 20:00:29 server1 postfix/smtpd[811]: connect from unknown[45.142.195.8]
May 16 20:00:35 server1 postfix/smtpd[12512]: connect from unknown[45.142.195.14]
May 16 20:00:51 server1 postfix/smtpd[12512]: warning: unknown[45.142.195.14]: SASL LOGIN authentication failed: authentication failure
May 16 20:00:51 server1 postfix/smtpd[12512]: too many errors after AUTH from unknown[45.142.195.14]
May 16 20:00:51 server1 postfix/smtpd[12512]: disconnect from unknown[45.142.195.14] ehlo=1 auth=0/1 rset=1 commands=2/3
May 16 20:01:02 server1 postfix/smtpd[6953]: connect from unknown[45.142.195.14]
May 16 20:01:05 server1 dovecot: imap-login: Login: user=contact@clientdomain.com, method=PLAIN, rip=1.144.105.116, lip=104.156.233.188, mpid=3408, TLS, session=
May 16 20:01:05 server1 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=1.144.105.116, lip=104.156.233.188, TLS handshaking: SSL_accept() failed: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46, session=<s+YTAsGlxTUBkGl0>
May 16 20:01:05 server1 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=1.144.105.116, lip=104.156.233.188, TLS handshaking: SSL_accept() failed: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46, session=
May 16 20:01:06 server1 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=1.144.105.116, lip=104.156.233.188, TLS handshaking: SSL_accept() failed: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46, session=<BjcaAsGl/g4BkGl0>
May 16 20:01:06 server1 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=1.144.105.116, lip=104.156.233.188, TLS handshaking: SSL_accept() failed: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46, session=
May 16 20:01:06 server1 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=1.144.105.116, lip=104.156.233.188, TLS handshaking: SSL_accept() failed: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46, session=
May 16 20:01:12 server1 postfix/smtpd[3687]: connect from unknown[45.142.195.15]

what do the entries in bold above mean? I think this is where the problem lies.

Joe · May 17, 2020, 1:56am

This means something (presumably the mail server) tried to look up tesla.com and the local name server (named=bind) failed to reach the authoritative name server for that name. Dovecot does not generally do name lookups, that I can think of, so it is unrelated to any Dovecot problems. Name service failures can be involved in problems either sending or receiving mail, but that is SMTP, which is handled by Postfix. Dovecot does not send or receive mail (it provides POP/IMAP, which are used to retrieve mail). You can safely and completely ignore named errors when troubleshooting Dovecot.

This is a failed SSL negotiation. It indicates the client (whatever mail clien was being used to make this IMAP connection) could not negotiate a functional SSL connection with the server (Dovecot). This actually looks like a sort of fall-through of a direct OpenSSL error, not even really an error from Dovecot.

I believe you are seeing a mail client trying to use a very old/insecure SSL protocol to connect and Dovecot doesn’t want to do so. Maybe it doesn’t even support it anymore, I’m not sure. We have some discussions on one of the github issue trackers about supporting old and crappy mobile clients, so it may be something that can be worked around.

It should not, however, cause Dovecot to stop. So, you’re saying Dovecot actually shut down? Or is it just one specific client that cannot connect ever? Very different problems to solve. I can see how this would lead to a client being unable to login, but I can’t see any indication that Dovecot shutdown or any clue why it would, if it did.

Joe · May 17, 2020, 1:58am

Here is a very good discussion of the SSL error and suggested solutions: https://serverfault.com/questions/656488/imap-tls-connection-to-dovecot-fails

adamjedgar · May 17, 2020, 5:40am

yes dovecot is actually completely shutting down “exited with error code” and “enterred a failed state”.

the only way i can get dovecot working again is to perform a hard reset of the server using my vps service providers interrface. None of the systemctl kill “name of process im trying to kill” does anything that will allow a restart of dovecot services.

I am having to do this virtually every morning when i wake up. I cannot even predict when its going to fail.

I think i really need you to log into the server directly, check out the log file that was saved at 06.30 17th April. Once in that file, go back to 7.30 - 8pm the night before… the status monitor detected dovecot was offline at the 8pm check.

Here is the webmin>System Logs> Mail error log leading up to the dovecot failure last night
note i have substituted my real domain for tesla.com

i have also substituted any other client informaiton to other names as well. (You cant simply enter the domain information below and get dns resolution because i have changed the names obviously)

i am particularly concerned about the “host1 dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login” error. That seems to always be front and center in this mess!

May 16 09:13:54 host1 dovecot: imap-login: Fatal: master: service(imap-login): child 1497 killed with signal 9
May 16 09:13:54 host1 dovecot: imap(secretary@client2.com): Fatal: master: service(imap): child 1508 killed with signal 9
May 16 09:13:54 host1 dovecot: imap-login: Fatal: master: service(imap-login): child 1704 killed with signal 9
May 16 09:13:54 host1 dovecot: imap(secretary@client2.com): Fatal: master: service(imap): child 1719 killed with signal 9
May 16 09:13:54 host1 dovecot: imap-login: Fatal: master: service(imap-login): child 2592 killed with signal 9 [last ip=1.144.105.177]
May 16 09:13:54 host1 dovecot: auth: Fatal: master: service(auth): child 2593 killed with signal 9
May 16 09:13:54 host1 dovecot: imap-login: Fatal: master: service(imap-login): child 2899 killed with signal 9
May 16 09:13:54 host1 dovecot: ssl-params: Fatal: master: service(ssl-params): child 2900 killed with signal 9
May 16 09:13:54 host1 dovecot: imap-login: Fatal: master: service(imap-login): child 2901 killed with signal 9
May 16 09:13:54 host1 dovecot: imap-login: Fatal: master: service(imap-login): child 2902 killed with signal 9
May 16 09:13:54 host1 dovecot: auth-worker: Fatal: master: service(auth-worker): child 2903 killed with signal 9
May 16 09:13:54 host1 dovecot: auth-worker: Fatal: master: service(auth-worker): child 2906 killed with signal 9
May 16 09:13:54 host1 dovecot: imap(tesla@client.com): Fatal: master: service(imap): child 2907 killed with signal 9
May 16 09:13:54 host1 dovecot: imap(s2members@client2.com): Fatal: master: service(imap): child 2908 killed with signal 9
May 16 09:13:54 host1 dovecot: imap(name@tesla.com): Fatal: master: service(imap): child 2909 killed with signal 9
May 16 09:14:04 host1 dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login
May 16 19:34:32 host1 dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login
May 16 19:34:34 host1 dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login
May 16 19:34:36 host1 dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login
May 16 19:34:38 host1 dovecot: master: Fatal: Dovecot is already running? Socket already exists: /var/run/dovecot/login/login

Joe · May 17, 2020, 6:12am

The socket already exists message just means there is already a process hanging on to that socket (i.e. Dovecot is already running, perhaps in some unknown/broken state). It isn’t a cause of anything…it’s a symptom that the service is trying to start when it’s already running in some form.

I just went through the logs and those errors immediately follow a kill signal from root, I guess you stopped the service and then immediately restared it? That would lead to what you’re seeing…a restart should handle it correctly and wait for a full shutdown before trying to restart (I’m kinda surprised the start doesn’t recognize it’s running and tries to start the service over an existing one.

It takes a little while to stop and start. To stop because it tries to let existing sessions finish before shutting down, and to start because it has to load all the domain certificates. It looks like it was still closing old sessions when it tried to start again.

That doesn’t tell us what was wrong before it was restarted, though. What behavior were you seeing that made you believe it was down?

adamjedgar · May 17, 2020, 6:14am

the system monitor and server status in virtualmin both said dovecot server was down.

sorry Joe i just need to backtrack for a bit… i also found the following in another forum with what could be a similar problem…

ignoring the SELinux part of the thread below, and just looking at the permissions part of Lukas statement…could this be my problem?** (ie dovecot is run by root:root however login/login is dovenull:root)**

https://bugzilla.redhat.com/show_bug.cgi?id=1560704

Lukas Vrabec 2018-05-03 18:11:21 UTC

Hi,

This is more issue on dovecot side than SELinux, problem here is that process dovecot is owned by root:root. Socket /var/run/dovecot/login/ipc-proxy is owned by dovenull:root and SELinux requires proper permissions on filesystem even if root could bypass all permissions. Kernel use ‘group’ for check permissions and mode is 600, and process trying access this socket so it should be changed.

Joe · May 17, 2020, 6:15am

You can’t ignore SELinux with regard to that ticket because it can only be a problem with SELinux running.

adamjedgar · May 17, 2020, 6:20am

but look at the folder permissions on my server in var/run/dovecot/login. Is this how its supposed to be?

Joe · May 17, 2020, 6:23am

Those permissions/ownership are fine. The only reason it didn’t work in the ticket you linked is because SELinux can prevent root from accessing things owned by other users. If you don’t have SELinux, that’s not possible. That ticket has nothing to do with your problem.

And, if the only problem is that the status is wrong, we should be looking at that instead of at Dovecot.

I know that’s what this thread was originally about so maybe we need to go back to that. Was this system upgraded from a non-systemd version of Debian (i.e. 8 or earlier)? It’s possible Webmin is getting confused about that old initscript and thinks that’s the responsible service, when it should be the systemd unit.

Joe · May 17, 2020, 6:26am

So, next time you see the state go weird, don’t panic. Don’t try to start Dovecot or restart it. Just try to fetch mail via POP/IMAP and see if it works. If it does, Dovecot is still running and we have a bogus status for some reason (and I would like to see it). If it does not, then we have a problem with Dovecot, rather than status.

adamjedgar · May 17, 2020, 6:37am

ok will do joe.
its interesting that you suggest double checking whether or not mail is being recived. I note that there are log entries for mail where emails were delivered after that status said dovecot had failed.
having said that, when i go into command shell, and enter systemctl status dovecot, i do see a failed state error so i am fairly certain that dovecot is actually shutting down at least some of the time.
I will do some poking around and next time it fails (which i am sure if its going to happen will do so in the next few hours …or 1 day at the most).

Joe · May 17, 2020, 9:08am

That damned initscript keeps bugging me. And, I think I might know what’s going on.

Move /etc/init.d/dovecot out of /etc/init.d and also make sure Webmin is defaulting to using systemd as the init system, rather than initscripts (in Bootup and Shutdown module, look at the top of the page under the title, it’ll say “Boot system: systemd” or something similar).

I suspect there is some edge case where the existence of two services for dovecot is causing a fight. They can’t both be running…they’d be using all the same resources (ports, sockets), and if something were trying to use the wrong one to start/stop/restart dovecot, it would cause exactly the errors you saw, I’m pretty sure.

adamjedgar · May 17, 2020, 10:59pm

Hi Joe,
I have renamed /etc/init.d/dovecot to /backup-dovecot

a quick check of webmin>system>bootup and shutdown shows Boot system: Systemd

I also made a change in my email client desktop pc app. This may have nothing to do with anything but it is something i changed around the time we started having problems with dovecot…so i changed it back.

in my desktop pc email client app, for the domain on the virtual server where this problem seems to originate, i recently changed my incoming mail server to mail.mydomain.com (this was about a week ago).

Prior to this, i had always set my incoming mail to server1.mydomain.com

I have changed it back to the way it was before and asked one of my clients to do the same with theirs.

i dont know if that dns entry has even remotely anything to do with the issue, however i dont use any dns records that resolve mail.domain.com to server1.mydomain.com for any of my clients virtual servers. So i figure, maybe dovecot might also be grumbling about dns resolution for mail.domain.com when that dns record doesnt exist? (got to try to simplify anything that might not be quite right in dns i figure)*

I will post back either when the system fails again, or in a few days if it doesnt.

Two things i am dreading however,

what happens when a new webmin update comes out…that almost always seems to cause my system to go haywire
the next time an SSL certificate for a domain on my system updates

If i am going to have further problems, then the above will be the test. Lets wait and see.

thanks for your help so far…i wish we had more concrete evidence of the solution.

Joe · May 17, 2020, 11:15pm

Dovecot is not grumbling about name service. Only Postfix would perform lookups like that. That’d be a different issue. Failing DNS lookups can’t possibly cause Dovecot to shutdown (or anything else, really…Dovecot doesn’t really interact with the name server). A name service issue could only impact deliverability, it could not cause stability issues under any circumstances. Forget about name service errors in this conversation; they are not relevant to this problem.

Webmin updates would not have any effect. Virtualmin updates could, and recently did. Webmin and Virtualmin are provided in separate packages (Virtualmin is a Webmin module, but it is maintained in a bunch of separate packages).

We’ve already discussed the cert issues that were present in the Pro version 6.09-2 and 6.09-2 packages on several threads (though I think those issues only impaced Postfix…this is probably not directly related to that, but I suspect the panic that ensued when things broke led to Dovecot getting some unneeded restarts and service changes; perhaps you enabled the Dovecot initscript in the midst of all this). Hopefully, that kind of thing won’t happen again, though maybe bugs will happen again.

That is, again, the bug that was present in 6.09-1 and 6.09-2 and fixed in 6.09-3. It should not recur, though it’s possible there will be other bugs that show up (the issues were related to a major overhaul of cert handling for Postfix). Unless you plan to install 6.09-1 or 6.09-2 again, you won’t see that specific problem again.

Just don’t poke the bear. I don’t think problems will recur. Everything looked fine when I looked at your server (but that initscript could cause trouble if it was trying to start in addition to the systemd unit).