```
systemctl status postfix
● postfix.service - Postfix Mail Transport Agent
     Loaded: loaded (/lib/systemd/system/postfix.service; enabled; vendor preset: enabled)
     Active: active (exited) since Sat 2024-12-21 10:55:09 PST; 3h 28min ago
    Process: 3698507 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 3698507 (code=exited, status=0/SUCCESS)
        CPU: 4ms

Dec 21 10:55:09 {server_name} systemd[1]: Starting Postfix Mail Transport Agent...
Dec 21 10:55:09 {server_name} systemd[1]: Finished Postfix Mail Transport Agent.
```
That’s when it last restarted (3h 28min ago), so maybe it really did stop. I don’t think that time is likely to coincide with automatic updates. Do you have out-of-memory errors in the kernel log? It could be the OOM killer.
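One way to check for OOM-killer activity, as a sketch: the filter below assumes the kernel messages use the usual "Out of memory" or "oom-kill" wording, and the function name `find_oom_events` is hypothetical. It reads from stdin so it can be fed from the journal or a saved dmesg dump.

```shell
# Print kernel log lines that indicate OOM-killer activity.
# Reads log text on stdin; matching is case-insensitive.
find_oom_events() {
  grep -Ei 'out of memory|oom-kill'
}
```

For example: `journalctl -k --since "2024-12-21 10:00" --until "2024-12-21 11:00" | find_oom_events` (reading the kernel journal usually needs root or membership in the systemd-journal group).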
Automatic updates on Debian are handled by the unattended-upgrades package. There could be other things, but that’s the most likely.
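To see whether unattended-upgrades (or a manual apt run) touched Postfix around the time of an alert, a sketch like this prints the start time of every apt run that upgraded a postfix* package. It assumes the Debian default apt history log at /var/log/apt/history.log with its usual "Start-Date:"/"Upgrade:" layout; `postfix_upgrade_times` is a hypothetical helper name.

```shell
# Print the Start-Date of each apt run whose "Upgrade:" line includes a
# package whose name starts with "postfix" (postfix, postfix-mysql, ...).
# Argument: history file to read (default: Debian's apt history log).
postfix_upgrade_times() {
  awk '
    /^Start-Date:/ { start = $2 " " $3 }        # remember the current run
    /^Upgrade:/ && / postfix[^ ]*:/ { print start }
  ' "${1:-/var/log/apt/history.log}"
}
```

Older history is rotated and compressed, so for those files something like `zcat /var/log/apt/history.log.*.gz | postfix_upgrade_times -` should work, since awk treats `-` as stdin.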
Also, what’s your system uptime? The service monitors can fire during a reboot, at a moment when a service is down as a normal part of the reboot (the service either stops before the monitor does or starts after it).
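`uptime -s` prints the time of the last boot. The hypothetical helper below checks whether an alert timestamp falls within a window on either side of that boot time; the 600-second default window is an arbitrary assumption, and parsing relies on GNU `date -d`.

```shell
# Succeed (exit 0) if ALERT is within WINDOW seconds of BOOT, before or after.
# Usage: alert_near_boot BOOT ALERT [WINDOW]; returns 2 on unparseable input.
alert_near_boot() {
  local boot alert diff window=${3:-600}
  boot=$(date -d "$1" +%s) || return 2
  alert=$(date -d "$2" +%s) || return 2
  diff=$(( alert - boot ))
  (( diff < 0 )) && diff=$(( -diff ))   # absolute difference
  (( diff <= window ))
}
```

For example: `alert_near_boot "$(uptime -s)" "2024-12-21 10:55:09" && echo "alert was close to the last boot"`. This only covers the most recent boot.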
Oh, another event that can trigger a restart is a TLS certificate change in Postfix, including automatic Let’s Encrypt renewals of any certificate Postfix is using (or any edit in Virtualmin that affects the Postfix configuration, though not virtual map updates).
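To see which certificate files Postfix actually loads (and so which renewals could trigger a restart), you can filter `postconf -n`. This sketch only looks at the standard Postfix parameters `smtpd_tls_cert_file`, `smtpd_tls_key_file`, `smtpd_tls_chain_files` and `tls_server_sni_maps`; whether this server uses per-domain SNI maps is an assumption to verify, and `tls_cert_settings` is a hypothetical name.

```shell
# From `postconf -n` output on stdin, print only the settings that point
# Postfix at certificate/key files or at a per-domain (SNI) certificate map.
tls_cert_settings() {
  grep -E '^(smtpd_tls_cert_file|smtpd_tls_key_file|smtpd_tls_chain_files|tls_server_sni_maps)[[:space:]]*='
}
```

Usage on the mail server: `postconf -n | tls_cert_settings`.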
@Joe - Yes, at least I cannot see anything that is not working.
This server has been running in its current configuration, on this VM, for years. This message only started in the past month, and since we’ve never seen it before I wanted to ask why.
I hear you on the TLS certificate question, and it is possible that we see the error “around” the time certificates are updated, because there are 38 VMs on this server, so certificate updates are fairly regular occurrences.
Next time we see this notification (if there is a next time), I will check more closely whether a certificate was also updated.
Yesterday one virtual server’s certificate was renewed at 01:04 PM, and at 01:05 PM we received another “postfix server stopped” message. So maybe there is a relation after all?
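To test the relationship over more than one event, a rough sketch: put certificate renewal times in one file and alert times in another (one timestamp per line, collected by hand from the renewal and monitor notification emails, for example), then list the alerts that came within a few minutes after a renewal. The file layout, the 300-second default window and the helper name `alerts_after_renewals` are assumptions; parsing relies on GNU `date -d`.

```shell
# Print each alert time that falls within WINDOW seconds after some renewal.
# Usage: alerts_after_renewals RENEWALS_FILE ALERTS_FILE [WINDOW]
alerts_after_renewals() {
  local renewals=$1 alerts=$2 window=${3:-300}
  local a r a_s r_s
  while IFS= read -r a; do
    [ -n "$a" ] || continue
    a_s=$(date -d "$a" +%s) || continue          # skip unparseable lines
    while IFS= read -r r; do
      [ -n "$r" ] || continue
      r_s=$(date -d "$r" +%s) || continue
      if (( a_s >= r_s && a_s - r_s <= window )); then
        printf '%s\n' "$a"                       # alert follows a renewal
        break
      fi
    done < "$renewals"
  done < "$alerts"
}
```

If most alerts show up in the output and few renewals go without an alert, that points toward the restart explanation.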
Yeah, that’s entirely possible. Postfix definitely restarts when certificates change, and the restart takes a moment (having a lot of TLS certificates to load slows it down quite a lot). It’s harmless, though. Mail is a resilient protocol: if the server doesn’t respond for a few seconds, the mail is just delayed for a few minutes and retried.
I can’t think of a perfect way to stop it notifying about a down server (if you’ve configured notifications for it), since Postfix really does seem to be down at the moment it’s checked in these events. But you probably want to make it alert only after two failures instead of just one.
Probably more certs are making it slower to restart, or the mail server is busier. Lots of things can make Postfix take a little longer to restart. You can restart it manually to see how long it takes, though the queue is dynamic and could be different every time, and restarting twice in quick succession will be faster because the queue will be mostly empty on the second restart. It’s normally pretty fast, though, so maybe you’re seeing clues of something wrong (like a lot of spam coming in or going out), and that may be worth looking into.
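Since restart time can depend on how much is queued, it may help to note the queue size whenever you time a manual restart. This sketch (hypothetical helper `queue_size`) reads `postqueue -p` output and takes the message count from its final "-- N Kbytes in M Requests." summary line, printing 0 when the queue is empty.

```shell
# Print the number of queued messages, given `postqueue -p` output on stdin.
# An empty queue has no summary line, so n stays unset and prints as 0.
queue_size() {
  awk '/^-- .* in [0-9]+ Requests?\.$/ { n = $(NF-1) } END { print n + 0 }'
}
```

Usage: `postqueue -p | queue_size`.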
A peek at the mail log or the journal for the Postfix unit is never a bad idea.
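When looking at the log, the lines that matter for this question are the Postfix master daemon’s own start, stop and reload messages. A minimal filter, assuming the usual "daemon started", "terminating on signal" and "reload" wording from postfix/master (`postfix_lifecycle` is a hypothetical name):

```shell
# Keep only master-daemon start/stop/reload lines from a mail log on stdin.
postfix_lifecycle() {
  grep -E '/master\[[0-9]+]: (daemon started|terminating on signal|reload)'
}
```

On Debian the mail daemon typically runs under the instance unit postfix@-.service rather than the postfix.service wrapper shown in the status output above (whose ExecStart is just /bin/true), so something like `journalctl -u postfix@- --since yesterday | postfix_lifecycle`, or `postfix_lifecycle < /var/log/mail.log` where rsyslog writes that file, should show whether Postfix stopped around 01:05 PM.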
No. It’s not catching it every time a cert is renewed, I’m sure. We’re just talking about a race condition here. The monitor happens to run when Postfix is in the middle of restarting sometimes. I can’t imagine that would happen two times in a row (the checks run every five minutes by default, Postfix will certainly be finished restarting in five minutes).
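The race-condition argument can be put in rough numbers. Assume the restart begins at a random moment relative to the check schedule, the checks run every $T = 300$ s (the five-minute default mentioned above), and Postfix is unreachable for $d$ seconds during a restart; the $d = 10$ s figure below is purely hypothetical.

```latex
\begin{aligned}
P(\text{a check falls inside the outage}) &= \frac{d}{T}, \qquad 0 \le d \le T,\\
d = 10\,\mathrm{s},\; T = 300\,\mathrm{s}:\quad \frac{d}{T} &= \frac{10}{300} = \frac{1}{30} \approx 3.3\%.
\end{aligned}
```

Two consecutive checks are $T$ seconds apart, so both can land inside the same outage only if $d$ is at least $T$, i.e. only if the restart takes five minutes or more.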
To be clear: If Webmin is running, it will run its status checks on schedule. It doesn’t know anything about why a service is down, it just sees it’s down and reports that. If it comes back up by the time of the next check five minutes later, and it only notifies on two failures, it won’t notify. If it doesn’t come back up in five minutes, or is somehow down again at exactly the time of the next check, it’ll notify. Whenever it is seen back up, the count restarts.
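The counting behaviour described here can be modelled in a few lines. This is a hypothetical sketch, not Webmin’s actual code: it reads one check result per line ("up" or "down") and prints the check number at which an alert would go out, assuming an alert is sent once when the run of failures reaches the threshold and the count resets on any successful check.

```shell
# Usage: alert_checks [THRESHOLD] < results   (one "up"/"down" per line)
# Prints the 1-based index of each check that would trigger an alert.
alert_checks() {
  local threshold=${1:-2} fails=0 n=0 result
  while IFS= read -r result; do
    n=$((n + 1))
    if [ "$result" = down ]; then
      fails=$((fails + 1))
      # Alert exactly once, when the failure streak reaches the threshold.
      if [ "$fails" -eq "$threshold" ]; then
        printf '%s\n' "$n"
      fi
    else
      fails=0   # seen back up: the count starts over
    fi
  done
}
```

With a threshold of 2, a restart caught by a single check (`up down up`) produces no alert, while a service that stays down (`up down down down`) alerts at the third check.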
There was an update to the email notification options that made Webmin’s default email address available as an option. Perhaps it is now selected where before it was not?
@Joe - This server has had 38 virtual servers running on it, and that number has not changed for about a year, so the number of certificates seems an unlikely cause, since this only started a month or so ago.
Spam seems to be a roller coaster in general. It looks to me like we go through spells of heavy spam; then maybe those spammers get filtered or shut down, and some time later we see another rise in spam.
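To see whether the recent problems line up with a spam spell, a rough per-day message count may help. This sketch assumes traditional syslog-style timestamps ("Dec 21 10:55:09 host ...", as journalctl prints by default) and counts Postfix cleanup "message-id=" lines, which Postfix logs roughly once per newly queued message; `messages_per_day` is a hypothetical name.

```shell
# Count cleanup "message-id=" lines per day from a mail log on stdin,
# then sort by month name and day number.
messages_per_day() {
  awk '/postfix\/cleanup\[[0-9]+]: [0-9A-Za-z]+: message-id=/ { count[$1 " " $2]++ }
       END { for (d in count) print d, count[d] }' | sort -k1,1M -k2,2n
}
```

For example: `journalctl -u postfix@- --since "30 days ago" | messages_per_day`. Note that syslog timestamps carry no year, so days from different years would be merged.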
Your race-condition explanation makes sense to me, as does the idea that the certificate renewal is coinciding with other activity.
I have noticed that our nightly backups started taking twice as long about a month or two ago. The full backup of 38 virtual servers used to take around 40 minutes and now takes 1 hour 20 minutes. I thought our hosting provider may have moved us to a slower or more congested server. If you think this may be a symptom of something else, or related, let me know if there is something I can check.
@shoulders - If a new email notification option was added, which one was it, and was it enabled by default? Was this new email notification the one @Joe already mentioned?