I have been seeing random BIND shut down and restart on Virtualmin GPL install. The shut down is always preceded by some error but random domain or IP entry. these IP addresses and domains are not on my server. The code message is often “error (unexpected RCODE REFUSED) resolving to some domain/IP”
If anyone can point me in the correct direction for finding the cause. I am assuming there is a bad setting somewhere. I am running CentOS6 with Virtualmin GPL
A sample log of a shutdown
Sep 25 06:01:06 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.225.70#53
Sep 25 06:01:07 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.224.69#53
Sep 25 06:01:09 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.224.69#53
Sep 25 06:01:09 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.225.70#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: shutting down
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: stopping command channel on 127.0.0.1#953
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: stopping command channel on ::1#953
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on ::#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 127.0.0.1#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 68.178.130.21#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 68.178.131.108#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 50.62.34.152#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: exiting
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: starting BIND 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6.3 -u named
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-gssapi=yes' '--disable-isc-spnego' '--with-docbook-xsl=/usr/share/sgml/docbook/xsl-stylesheets' '--enable-fixed-rrset' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE'
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: ----------------------------------------------------
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: BIND 9 is maintained by Internet Systems Consortium,
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: corporation. Support and training for BIND 9 are
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: available at https://www.isc.org/support
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: ----------------------------------------------------
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: adjusted limit on open files from 1024 to 1048576
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: found 16 CPUs, using 16 worker threads
Another BIND Shutdown was this one - these IP addresses and domains are not on my server:
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: client 122.136.196.116#63897: query (cache) 'aa3247.com/A/IN' denied
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: client 122.136.196.116#3209: query (cache) 'aa3247.com/A/IN' denied
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: shutting down
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: stopping command channel on 127.0.0.1#953
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: stopping command channel on ::1#953
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on ::#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 127.0.0.1#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 68.178.130.21#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 68.178.131.108#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 50.62.34.152#53
Sep 25 05:46:27 ip-68-178-130-21 named[22235]: exiting
I don’t think the error message and the exiting is directly related. In your first example, there’s over a minute between these events. Can you check how much time passed between each of these incidences?
I’d also think that BIND is hitting some kind of resource limit. You might check the syslog for further errors, maybe OOM messages. Short of a programming bug, it should not simply exit like that. You might check with “tcpdump udp port 53” what the load of your BIND is.
Looks like about 20 items per second - not sure what is normal.
This happens about 4 times a day, but at no particular time. I’ve looked thru the logs, but the only consistency I see is right before BIND closes, there is always an error message before such as:
Looks like about 20 items per second - not sure what is normal.
This happens about 4 times a day, but at no particular time. I’ve looked thru the logs, but the only consistency I see is right before BIND closes, there is always an error message before such as:
Looks like about 20 items per second - not sure what is normal.
This happens about 4 times a day, but at no particular time. I’ve looked thru the logs, but the only consistency I see is right before BIND closes, there is always an error message before such as:
The resolver errors you see are reverse lookups, probably triggered by some service on your machine that wants to know the hostname for an IP address that is trying to connect (e.g. SSHD and Postfix can do that). You might want to check your syslog, auth-log or mail-log for an unusually high amount of connection attempts / login failures, which indicate dictionary attacks.
Still, BIND should not simply exit from something like this. It’s meant to process a considerable amount of requests per second.
Did you check the syslog or kernel log for additional errors like OOMs? Otherwise, I’m fairly certain that it’s a problem with the virtualization software in use for your server. What’s your used and free memory? You might want to use the tool “atop” to record and playback performance data, that should give you an idea if something has a memory spike or similar.
However I did find spamassasin set to lookup DNS test mode (not sure that would matter) and did fine an email account that is trying to login every 3 seconds with an authentication failure in saslauthd
The only thing I have seen pulling a lot of CPU is clamav.
It’s not a busy server - only 100 virtual servers.
It’s not uncommon that we see unusual issues and service that die unexpectedly on OpenVZ-based systems… but typically when that occurs, the “failcnt” field of the user_beancounters file shows some non-0 values in it.
On your system, they all appear to be “0”.
How much RAM is in your server, and what is the current usage?
I can only repeat what I wrote last time, and what Eric said too now: We need to know what your RAM situation is. free and atop are recommended to provide that information, and to check historical memory data when BIND crashes.
This is really odd. Memory should be plenty, and no beancounter violation.
Two things I can think of right now: Intermittent memory problems. To find out about those, install “atop”. Or the BIND version you’re using (being an RC) might be buggy. Ubuntu officially uses 9.8.1, yours is 9.8.2-RC1.
You could try turning on BIND’s own logging, including debug logging. Or you might want to check if there’s an earlier, non-RC version of BIND available. Are you using third-party repositories for your CentOS? Those often contain versions of software that make trouble when used with Virtualmin.
The Centos iss a clean standard install and Virtualmin is the standard. It is a real problem as the server closes down about 6 times a day. Stopped twice again this morning. I searched the logs, and the same RCODE error then a BIND exit. I noticed that there is a bunch of messages all relating to the same PID. Don’t know if it means anything.
So what about my suggestions, like atop, BIND debug log and downgrading to a stable non-RC version? I’m afraid I can’t help you if you don’t reply to my suggestions.
The BIND debug is already on, but it does not seem to be writing to the file. I will be willing to downgrade but the current stable version is 9.8.6 and I have installed 9.8.2-0.17.rc1.el6.3 which is not the latest. I do not have atop installed, but checked it out this morning and will install.
You’d definitely want to limit recursive queries to your local server, and not allow it for any user on the internet. That’s done with the “allow-recursion” directive, or there’s also an edit box in Webmin’s BIND module for it. So you should set “Do fully recursive lookups” to “Default”, and in “Addresses and Topology”, set “Allow recursive queries from” to “Listed…” and enter in the box
localnets
localhost
As for the RAM filling up, did that happen when your BIND crashed? You really need ATOP, to have historical data about processes. It could be a script going haywire and spawning many processes, or BIND itself using too much RAM. But you won’t find that out without ATOP.
When you have it installed, go to /etc/init.d/atop and modify the line “DARGS=”, the number at the end is the interval in seconds that atop uses to store resource data. You might want to lower the default of 10 minutes to e.g. one minute, for your debugging purposes.