Service monitor: Postfix Server down

SYSTEM INFORMATION
OS type and version: Debian Linux 11
Virtualmin version: 7.30.2

We are getting email notifications daily at nearly the same time that the postfix server is down:

Monitor on {server_name} for 'Postfix Server' has detected that the service has gone down at 12/21/2024 10:55 AM

When I check, the Postfix server is running, so I presume it had restarted itself.

This started around the last virtualmin/webmin update.

Thanks for any suggestions.

I would assume it never went down.

But, you can check. Look at systemctl status postfix and check how long it’s been up.
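If it helps, the start time can be pulled out of that output with a small filter. This is only a sketch: the function name active_since is made up for illustration, and it assumes the usual English "Active: ... since ...; ... ago" line that systemctl prints. The instance unit name in the last comment is also an assumption about how Debian packages Postfix, so check it on your own system.

```shell
# Hypothetical helper: read `systemctl status` output on stdin and print
# only the timestamp that follows "since" on the "Active:" line.
active_since() {
    sed -n 's/^ *Active: .* since \(.*\); .*$/\1/p'
}

# On a live system (assumes systemd and a unit named postfix):
#   systemctl status postfix | active_since
# On Debian the daemon may run under an instance unit instead (assumption):
#   systemctl status 'postfix@-' | active_since
```

If the timestamp matches the time in the monitor email, the service most likely did restart at that moment.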

It’s possible it did restart if you have automatic updates enabled (every update for a service will cause it to restart).

@Joe - Thank you for your reply.

Here is what I get:

systemctl status postfix
● postfix.service - Postfix Mail Transport Agent
     Loaded: loaded (/lib/systemd/system/postfix.service; enabled; vendor preset: enabled)
     Active: active (exited) since Sat 2024-12-21 10:55:09 PST; 3h 28min ago
    Process: 3698507 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 3698507 (code=exited, status=0/SUCCESS)
        CPU: 4ms

Dec 21 10:55:09 {server_name} systemd[1]: Starting Postfix Mail Transport Agent...
Dec 21 10:55:09 {server_name} systemd[1]: Finished Postfix Mail Transport Agent.

How can I see if automatic updates are enabled?

That’s how long ago it restarted, so maybe it actually did stop. I don’t think that time is likely to coincide with automatic updates. Do you have out of memory errors in the kernel log? Could be the OOM killer.

Automatic updates on Debian are handled by the unattended-upgrades package. There could be other things, but that’s the most likely.
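Both checks can be done from a shell. The following is a rough sketch, not a definitive recipe: oom_lines is a made-up helper name, the grep patterns are just common kernel phrasings, and the two-day window is an arbitrary choice.

```shell
# Hypothetical helper: keep only log lines that look like OOM killer
# activity. The patterns are common kernel phrasings, not a complete list.
oom_lines() {
    grep -i -E 'out of memory|oom-kill|oom_reaper'
}

# On a live system (both need permission to read the kernel log):
#   journalctl -k --since "2 days ago" | oom_lines
#   dmesg | oom_lines
#
# To see whether unattended-upgrades is installed and switched on:
#   dpkg -s unattended-upgrades
#   apt-config dump | grep -i 'APT::Periodic'
```

No output from oom_lines means none of those patterns appeared in the window you searched.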

Also, what’s your system uptime? It’s possible for the service monitors to fire during a reboot at a time when a service would be down as a normal part of the reboot (it either goes down before the monitor or comes up after).

I started to experience the same here…

@Joe - I don’t think it is an OOM matter. Here is our dash info:

Real memory	8.59 GiB used / 1.61 GiB cached / 62.79 GiB total

Here is the uptime info:

System uptime	18 days, 2 hours, 29 minutes

This does not seem related to any recent reboot(s).

Oh, another event that can trigger a restart is a TLS certificate change in Postfix, including an automatic Let’s Encrypt renewal of the certificate that Postfix is using (or any edit in Virtualmin that affects the Postfix configuration, though not virtual map updates).

And, just to be clear: Nothing is actually wrong right? Postfix is working, correct? You’re just trying to figure out why it stopped briefly?

@Joe - Yes, at least I cannot see anything that is not working.

This server has been running in its current configuration and on this VM for years. This message just started in the past month, and since we’ve never seen it before I wanted to ask why.

I hear you on the TLS certificate question. It is possible that we see the error “around” the time certificates are updated: there are 38 VMs on this server, so certificate updates are pretty regular occurrences.
Next time we see this notification, if there is a next time, I will pay closer attention to whether a certificate was also updated.

@Joe - We had some SSL certificates automatically update and the postfix server stopping notification did not happen.

We have not received the notification since my previous message.

@Joe - Well, I was wrong in my last post.

Yesterday one of the virtual servers’ certificates renewed at 01:04 PM, and at 01:05 PM we received another Postfix server stopped message. So maybe there is a relation after all?

Yeah, that’s entirely possible. Postfix definitely restarts when certificates change, and that restart takes a moment (and having a bunch of TLS certs to load slows it down quite a lot). It’s harmless, though. Mail is a resilient protocol: if the server doesn’t respond for a few seconds, mail is just delayed for a few minutes and retried.

I’m not coming up with a perfect solution to make it not notify of a down server, if you’ve configured notifications for it, since it seems like it really is down at the time it’s checked in these events. But, I guess you probably want to make it only alert after two failures instead of just one.

@Joe - If the server is not stopped / down and is just restarting I think it’s no big deal.

Why do you think we only started to receive these messages in the past month or two?

If I set the “Failures Before Reporting” to 2 instead of 1 does that just mean that every 2nd certificate renewal I will receive a notification?

Probably more certs making it slower to restart. Or busier mail server. Lots of things can make Postfix take a little longer to restart. You can restart it yourself manually to see how long it takes (though the queue is dynamic, and could be different every time you restart, and rapidly restarting one after the other will be faster as the queue will be mostly empty on the second restart). It’s normally pretty fast, though, so maybe you’re seeing clues of something wrong (like a lot of spam coming in or going out), and that may be a thing worth looking into.

A peek at the mail log or the journal for the postfix unit is never a bad idea.
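To narrow the log down to start, stop and reload events, something like this might do. It is a sketch under assumptions: postfix_lifecycle is a made-up name, the patterns are the messages Postfix normally logs from master and postfix-script with the default syslog name, and the log path and unit name in the comments are the usual Debian ones but may differ on your system.

```shell
# Hypothetical helper: keep only log lines where Postfix reports that it
# started, stopped or reloaded. Assumes the default "postfix" syslog name.
postfix_lifecycle() {
    grep -E 'postfix/(master|postfix-script)\[[0-9]+\]: (daemon started|terminating on signal|reload|starting the Postfix|stopping the Postfix|refreshing the Postfix)'
}

# On a live system (path and unit name are assumptions):
#   postfix_lifecycle < /var/log/mail.log
#   journalctl -u 'postfix@-' --since yesterday | postfix_lifecycle
```

Comparing the gap between a "stopping" line and the following "daemon started" line with the monitor's notification time should show whether the check landed inside a restart.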

No. It’s not catching it every time a cert is renewed, I’m sure. We’re just talking about a race condition here. The monitor sometimes happens to run when Postfix is in the middle of restarting. I can’t imagine that would happen two times in a row (the checks run every five minutes by default, and Postfix will certainly be finished restarting in five minutes).

To be clear: If Webmin is running, it will run its status checks on schedule. It doesn’t know anything about why a service is down, it just sees it’s down and reports that. If it comes back up by the time of the next check five minutes later, and it only notifies on two failures, it won’t notify. If it doesn’t come back up in five minutes, or is somehow down again at exactly the time of the next check, it’ll notify. Whenever it is seen back up, the count restarts.
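The counting described above can be sketched as a toy simulation. This is not Webmin's actual code: simulate_monitor is a made-up function, each argument after the threshold stands for one scheduled check, and the choice to keep notifying on every check once the threshold is reached is an assumption.

```shell
# Toy model of "Failures Before Reporting" (not Webmin's implementation).
# Usage: simulate_monitor THRESHOLD STATUS...   where STATUS is up or down.
# Prints "notify" or "quiet" for each check, in order.
simulate_monitor() {
    sm_threshold=$1
    shift
    sm_failures=0
    for sm_status in "$@"; do
        if [ "$sm_status" = "down" ]; then
            # Count consecutive failed checks.
            sm_failures=$((sm_failures + 1))
            if [ "$sm_failures" -ge "$sm_threshold" ]; then
                echo "notify"
            else
                echo "quiet"
            fi
        else
            # Seen up again, so the count restarts.
            sm_failures=0
            echo "quiet"
        fi
    done
}

# One check lands mid-restart, the next one finds Postfix up again:
#   simulate_monitor 2 up down up     (no check prints "notify")
#   simulate_monitor 1 up down up     (the second check prints "notify")
```

With a threshold of 2, an isolated failed check between two good ones never produces a notification in this model; only two failed checks in a row do.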

There was an update to email notification Options where the webmin default email would become an option. Perhaps this is now selected whereas before it was not?

@Joe - This server has 38 virtual servers running on it, and this number has not changed for about 1 year, so the number of certificates does not seem a likely cause since this only started a month or so ago.

It seems that spam is a roller coaster in general. To me it looks like we go through spells of heavy spam, then maybe those spammers get filtered or close shop, and sometime later we see another rise in spam.

Your race condition explanation seems plausible to me, with the certificate renewal coinciding with other activity.

I have noticed that our nightly backups started taking twice as long about a month or two ago. The full backup of 38 virtual servers used to take around 40 minutes and now takes 1 hour and 20 minutes. I thought that our hosting provider may have moved us to a slower or more congested server. If you think this may be a symptom of something else and / or related, let me know if there is something I can check.

@shoulders - If a new email notification option was added, which one was it, and was it default enabled? Was this new email notification the one @Joe already mentioned?

here you go

Don’t know