Bind Cluster name servers setup doesn't create slave zones anymore but 0 byte files are present

SYSTEM INFORMATION
OS type and version Ubuntu 20.04.4 LTS
Webmin version 1.994

Please help me isolate this problem:

Setup:

ns1.example.com
ns2.example.com
ns3.example.com
ns4.example.com

Then we have have this:

client server #1
client server #2
client server #3

When we create a zone on any of the client servers, it automatically gets replicated using Webmin’s cluster name servers. But it stopped working…

What works:

When I create a zone on any of the client servers, it doesn’t give any errors.

Then all of the slaves I can see the following 0 byte files:

/var/lib/bind]> ls -lah | grep testdomain
-rw-r--r--  1 bind bind    0 Jun 23 06:43 testdomain1.com.hosts
-rw-r--r--  1 bind bind    0 Jun 23 06:47 testdomain2.com.hosts
-rw-r--r--  1 bind bind    0 Jun 23 07:00 testdomain3.com.hosts

rndc reload for any of the test domains give this:

/var/lib/bind]> rndc reload testdomain1.com
WARNING: key file (/etc/bind/rndc.key) exists, but using default configuration file (/etc/bind/rndc.conf)
rndc: 'reload' failed: not found
no matching zone 'testdomain1.com' in any view

The workaround is to do service named reload on all four name servers, at which point replication starts happening and zero byte files are populated with the correct zone data.

Deleting a zone also works, I can see the .hosts files are removed from all four name servers.

Here is post from 2009 that discusses similar problems:

Troubleshooting tips broadly covered in that post are:

  • Check firewall
  • Check allow-transfer
  • Check permissions
  • Check listen-on.

I checked firewall with netstat -an | grep :53
It says bind is listening on both UDP and TCP port 53.

FirewallD is stock so TCP (and UDP) 53 is open.
Also since reload works, at this point I really don’t think it’s the firewall. And it used to work :slight_smile:

listen-on is interesting because the only listen-on* is in /etc/bind/named.conf.options and that is:
listen-on-v6 { any; };

I don’t know where to look next :pensive:

If I had to guest the behind the scenes of “Re-start DNS server” when you add a domain is not succeeding…

Please give me some tips for taking this further?

Kind of bumping here. This appears to be the clue. When I create a new DNS zone on any of the masters, the primary name server gives this message:

Jul 20 19:29:03 ns1 named[442496]: client @0x7fe280013670 192.168.1.1#52459: received notify for zone 'example.com': not authoritative

Indeed, during migration the zones are not authorative, and my registar insists on having consistent name servers before registration. How do I convince Bind that it should still accept the domain in spite of it not being authorative for it?

I am having this same issue, on a bunch of servers. I thought perhaps it was the latest release, but I also have an older server on virtualmin 6 and it is having the same issue. These are all running Ubuntu server (various versions 16.04 esm and 18.04), so there is a commonality there.

The primary server the site is created on doesn’t seem to reload the new bind configuration. Even testing with dig on the server the site was just created on returns no DNS results. I have to manually reload bind (IE service bind9 reload), then the domain comes alive.

Of course the slave servers aren’t able to sync the domains either until a reload is done.

I am going to do further digging to try to figure out why bind isn’t being reloaded.

Ok as an update, I see no attempts to restart the slave DNS servers, which is why their configuration doesn’t go live.

I tested with the latest Virtualmin 7.1 and Webmin 1.994

Here are the debug logs on a DNS slave:

23836 [25/Jul/2022 16:04:03.429761] - MASTER_IP_ADDR - START "script=fastrpc.cgi"
23836 [25/Jul/2022 16:04:03.430581] - MASTER_IP_ADDR - READ "/etc/webmin/miniserv.conf"
23836 [25/Jul/2022 16:04:03.432216] admin_user MASTER_IP_ADDR - READ "/usr/share/webmin/authentic-theme/config"
23836 [25/Jul/2022 16:04:03.438348] admin_user MASTER_IP_ADDR - READ "/etc/webmin/custom-lang"
23836 [25/Jul/2022 16:04:03.443717] admin_user MASTER_IP_ADDR - READ "/usr/share/webmin//defaultacl"
23836 [25/Jul/2022 16:04:03.444644] admin_user MASTER_IP_ADDR - READ "/etc/webmin/root.acl"
23836 [25/Jul/2022 16:04:03.445407] admin_user MASTER_IP_ADDR - READ "/etc/webmin//root.acl"
23836 [25/Jul/2022 16:04:03.447884] admin_user MASTER_IP_ADDR - READ "/usr/share/webmin/version"
23836 [25/Jul/2022 16:04:03.458022] admin_user MASTER_IP_ADDR - STOP "runtime=0"
23838 [25/Jul/2022 16:04:03.558033] admin_user MASTER_IP_ADDR - READ "/etc/webmin/bind8/config"
23838 [25/Jul/2022 16:04:03.559465] admin_user MASTER_IP_ADDR - READ "/usr/share/webmin/bind8/prefs.info"
23838 [25/Jul/2022 16:04:03.568180] admin_user MASTER_IP_ADDR - READ "/etc/webmin/bind8/custom-lang"
23838 [25/Jul/2022 16:04:03.571178] admin_user MASTER_IP_ADDR - READ "/var/webmin/module.infos.cache"
23838 [25/Jul/2022 16:04:03.592956] admin_user MASTER_IP_ADDR - READ "/etc/webmin//admin_user.acl"
23838 [25/Jul/2022 16:04:03.593942] admin_user MASTER_IP_ADDR - READ "/etc/webmin/installed.cache"
23838 [25/Jul/2022 16:04:03.619460] admin_user MASTER_IP_ADDR bind8 READ "/usr/share/webmin/bind8/defaultacl"
23838 [25/Jul/2022 16:04:03.620907] admin_user MASTER_IP_ADDR bind8 READ "/etc/webmin/bind8/root.acl"
23838 [25/Jul/2022 16:04:03.621715] admin_user MASTER_IP_ADDR bind8 READ "/etc/webmin/bind8/admin_user.acl"
23838 [25/Jul/2022 16:04:03.622530] admin_user MASTER_IP_ADDR bind8 READ "/usr/share/webmin/bind8/prefs.info"
23838 [25/Jul/2022 16:04:03.627214] admin_user MASTER_IP_ADDR bind8 READ "/etc/webmin/bind8/zone-names"
23838 [25/Jul/2022 16:04:04.082076] admin_user MASTER_IP_ADDR bind8 WRITE "/var/lib/bind/adminusertest3.com.hosts"
23838 [25/Jul/2022 16:04:04.083565] admin_user MASTER_IP_ADDR bind8 PERMS "file=/var/lib/bind/adminusertest3.com.hosts user=116 group=123 perms="
23838 [25/Jul/2022 16:04:04.128271] admin_user MASTER_IP_ADDR bind8 READ "/etc/bind/named.conf.local"
23838 [25/Jul/2022 16:04:04.167896] admin_user MASTER_IP_ADDR bind8 WRITE "/etc/bind/named.conf.local"
23838 [25/Jul/2022 16:04:04.175352] admin_user MASTER_IP_ADDR bind8 CLOSE "/etc/bind/named.conf.local"
23838 [25/Jul/2022 16:04:04.176233] admin_user MASTER_IP_ADDR bind8 READ "/etc/selinux/config"
23838 [25/Jul/2022 16:04:04.180462] admin_user MASTER_IP_ADDR bind8 CMD "cmd=diff \/etc\/bind\/named\.conf\.local\.webminorig \/etc\/bind\/named\.conf\.local 2>/dev/null"

I had a really old virtualmin server I made the same test to the same slave, and it successfully restarted the bind server.

24565 [25/Jul/2022 16:20:48.450624] - OLD_MASTER_IP_ADDR - START "script=rpc.cgi"
24565 [25/Jul/2022 16:20:48.451545] - OLD_MASTER_IP_ADDR - READ "/etc/webmin/miniserv.conf"
24565 [25/Jul/2022 16:20:48.453453] admin_user OLD_MASTER_IP_ADDR - READ "/usr/share/webmin/authentic-theme/config"
24565 [25/Jul/2022 16:20:48.459859] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/custom-lang"
24565 [25/Jul/2022 16:20:48.465452] admin_user OLD_MASTER_IP_ADDR - READ "/usr/share/webmin//defaultacl"
24565 [25/Jul/2022 16:20:48.466532] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/root.acl"
24565 [25/Jul/2022 16:20:48.467561] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin//root.acl"
24567 [25/Jul/2022 16:20:48.573304] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/bind8/config"
24567 [25/Jul/2022 16:20:48.574732] admin_user OLD_MASTER_IP_ADDR - READ "/usr/share/webmin/bind8/prefs.info"
24567 [25/Jul/2022 16:20:48.584359] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/bind8/custom-lang"
24567 [25/Jul/2022 16:20:48.587178] admin_user OLD_MASTER_IP_ADDR - READ "/var/webmin/module.infos.cache"
24567 [25/Jul/2022 16:20:48.608972] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin//admin_user.acl"
24567 [25/Jul/2022 16:20:48.609785] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/installed.cache"
24567 [25/Jul/2022 16:20:48.573304] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/bind8/config"
24567 [25/Jul/2022 16:20:48.574732] admin_user OLD_MASTER_IP_ADDR - READ "/usr/share/webmin/bind8/prefs.info"
24567 [25/Jul/2022 16:20:48.584359] admin_user OLD_MASTER_IP_ADDR - READ "/etc/webmin/bind8/custom-lang"
24567 [25/Jul/2022 16:20:48.635075] admin_user OLD_MASTER_IP_ADDR bind8 READ "/usr/share/webmin/bind8/defaultacl"
24567 [25/Jul/2022 16:20:48.636348] admin_user OLD_MASTER_IP_ADDR bind8 READ "/etc/webmin/bind8/root.acl"
24567 [25/Jul/2022 16:20:48.636983] admin_user OLD_MASTER_IP_ADDR bind8 READ "/etc/webmin/bind8/admin_user.acl"
24567 [25/Jul/2022 16:20:48.637634] admin_user OLD_MASTER_IP_ADDR bind8 READ "/usr/share/webmin/bind8/prefs.info"
24567 [25/Jul/2022 16:20:49.253993] admin_user OLD_MASTER_IP_ADDR bind8 WRITE "/var/lib/bind/adminusertesttest4.com.hosts"
24567 [25/Jul/2022 16:20:49.255441] admin_user OLD_MASTER_IP_ADDR bind8 PERMS "file=/var/lib/bind/adminusertesttest4.com.hosts user=116 group=123 perms="
24567 [25/Jul/2022 16:20:49.299730] admin_user OLD_MASTER_IP_ADDR bind8 READ "/etc/bind/named.conf.local"
24567 [25/Jul/2022 16:20:49.338900] admin_user OLD_MASTER_IP_ADDR bind8 WRITE "/etc/bind/named.conf.local"
24567 [25/Jul/2022 16:20:49.346145] admin_user OLD_MASTER_IP_ADDR bind8 CLOSE "/etc/bind/named.conf.local"
24567 [25/Jul/2022 16:20:49.346955] admin_user OLD_MASTER_IP_ADDR bind8 READ "/etc/selinux/config"
24567 [25/Jul/2022 16:20:49.351045] admin_user OLD_MASTER_IP_ADDR bind8 CMD "cmd=diff \/etc\/bind\/named\.conf\.local\.webminorig \/etc\/bind\/named\.conf\.local 2>/dev/null"
24567 [25/Jul/2022 16:20:52.312370] admin_user OLD_MASTER_IP_ADDR bind8 READ "/usr/share/webmin/version"
24567 [25/Jul/2022 16:20:52.932768] admin_user OLD_MASTER_IP_ADDR bind8 KILL "signal=HUP pids=30413"

So there is some difference. I think I will attempt a search the latest versions and figure out where the problem begins.

Thank you so much @bjb ! This is incredible work and so useful.

…I see no attempts to restart the slave DNS servers, which is why their configuration doesn’t go live.

Agreed. A big clue is on my system, when it stays restarting slaves, there is no delay anymore. In the past, there was a split second delay.

I tested with the latest Virtualmin 7.1 and Webmin 1.994

My stuff is always at latest. Ironically I’m pretty sure this worked well for over a year, but stopped in the last three or so months.

Here are the debug logs on a DNS slave:

I’m blown away! Now I have quite a few questions here. Then perhaps I can produce my own logs too :slight_smile:

  • Where are those logs? I get so confused with the log files. I have webmin.log, miniserv.log, miniserv.error.
  • Did you have to turn on debugging? What is the flag?
  • Here I thought it would be good enough to just tail /var/log/system | grep named and get to the bottom of it, but after examining your log I realise we’re probably dealing with with Webmin’s inter RPC process communications here…

I’m hoping that a clue wouldn’t be RNDC, but from your log difference I can also see permissions are at play and that could explain on my system creation of zones but 0 bytes files…

Here is a diff of your logs after a bit of cleaning:

➜  Bind Sync Error diff not_working working
1c1
< - START "script=fastrpc.cgi"
---
> - START "script=rpc.cgi"
8,9d7
< - READ "/usr/share/webmin/version"
< - STOP "runtime=0"
15a14,16
> - READ "/etc/webmin/bind8/config"
> - READ "/usr/share/webmin/bind8/prefs.info"
> - READ "/etc/webmin/bind8/custom-lang"
20,22c21,22
< bind8 READ "/etc/webmin/bind8/zone-names"
< bind8 WRITE "/var/lib/bind/adminusertest3.com.hosts"
< bind8 PERMS "file=/var/lib/bind/adminusertest3.com.hosts user=116 group=123 perms="
---
> bind8 WRITE "/var/lib/bind/adminusertesttest4.com.hosts"
> bind8 PERMS "file=/var/lib/bind/adminusertesttest4.com.hosts user=116 group=123 perms="
27a28,29
> bind8 READ "/usr/share/webmin/version"
> bind8 KILL "signal=HUP pids=30413"

I see some differences, for example, on not working (<) you seem to be using fastrpc.cgi whereas on not working (>) you seem to be using rpc.cgi. Some other differences too but not enough to alarm met yet.

Please give a hint which log you are looking at, and how you turned on debugging? Then I can do the same… :slight_smile:

You can turn the webmin debugging on from the GUI.

Webmin tab / Webmin / Webmin Configuration / Debugging Log file

I turned on all logging and increased the log file size to 500MB

Can anyone who’s seeing this issue post the output from Virtualmin when creating a new domain? There should be a message like “Re-starting slave DNS servers …”

Here is what I get when I add a new domain, DNS zone only with Webmin login

Creating administration group testing123test.com ..
.. done

Creating administration user testing123test.com ..
.. done

Creating aliases for administration user ..
.. done

Adding administration user to groups ..
.. done

Creating home directory ..
.. done

Creating mailbox for administration user ..
.. done

Adding new DNS zone ..
.. done

Adding slave zone on server1.x.com server2.x.com server3.x.com ..
.. done

Creating Webmin user ..
.. done

Saving server details ..
.. done

Re-starting DNS server ..
.. done

Re-loading Webmin ..
.. done

Ok I see the bug that causes this - it will be fixed in the next Virtualmin release. You can see the fix here : Use correct field for DNS slave servers https://forum.virtualmin.com/… · virtualmin/virtualmin-gpl@d5c5ac0 · GitHub

I’m super keen for the official fix. I tried manually implementing @Jamie 's fix but things broke down so I reverted.

We’ll be releasing Virtualmin 7.2 with this fix really soon!