We have about 2000 virtual servers on our physical Virtualmin server, each of which is a sub-server of a single parent virtual server. Each time I try to create a new sub-server of the parent, it stalls at the “Updating Webmin user” stage and stays there for 10 minutes. During this time I can see (using htop from the CLI) that the domain_setup.cgi process is using 100% of one, sometimes both, of the CPU cores. It does eventually complete successfully and everything works as expected.
If I create a new parent server, or a sub-server under a different parent, everything completes in the expected amount of time (20-30 seconds). I get the same result if I use the API’s create-domain.pl script.
My hunch is that it has something to do with the number of sub-servers that are using the same parent server’s account. Ideally I would really like to resolve this problem and continue using the same parent server, though worst case I can create a new dedicated parent and start adding them to the new one.
Any assistance or brainstorming ideas would be much appreciated.
I explained your problem to Jamie… he asked if it might be possible for him to log into your system, and troubleshoot the issue to determine what the bottleneck is.
Also, do you have an example of a Sub-Server that you’re trying to add? If you could include the details of some new domains you’d like added, that would help in the troubleshooting process.
This is a production server that is very important to us. Unfortunately I am unable to give out root login details to unauthorised parties. Sorry, company policy.
I am however fairly technically competent so I can carry out any troubleshooting steps that might provide more information about the problem…?
Well, he was hoping to do some code profiling on your system, which is a more involved process than we can really describe.
Without a close look, there’s not likely an easy fix to the problem you’re seeing – it’s likely a code problem, where something isn’t working as efficiently as it should.
What Jamie needs to do in order to fix it is to replicate a setup such as yours, cause the problem, and then determine what in that is running slowly. You have an above-average number of domains there, and systems with that many don’t get much testing.
I’ve explained the issue you’re seeing to him though, and we’ll see if there’s anything we can figure out.
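For anyone wanting to reproduce this on a test box rather than production, here is a rough Python sketch that creates many sub-servers under one parent and records how long each creation takes. It assumes the `virtualmin create-domain` command-line tool with its `--domain` and `--parent` options; the `--web` and `--dns` feature flags, the domain naming scheme and the function names are only illustrative assumptions. It defaults to a dry run that just prints the commands.

```python
import shlex
import subprocess
import time


def build_create_command(domain, parent):
    """Return the argv list for creating one sub-server under `parent`."""
    # The feature flags (--web, --dns) are assumptions; adjust to your setup.
    return ["virtualmin", "create-domain",
            "--domain", domain, "--parent", parent, "--web", "--dns"]


def time_creations(parent, count, suffix="test.example", dry_run=True):
    """Create `count` sub-servers and return a list of (domain, seconds)."""
    timings = []
    for i in range(count):
        domain = f"sub{i:04d}.{suffix}"  # hypothetical naming scheme
        cmd = build_create_command(domain, parent)
        start = time.monotonic()
        if dry_run:
            # Only show what would be run; nothing is created.
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
        timings.append((domain, time.monotonic() - start))
    return timings
```

Running it with `dry_run=False` on a scratch machine and watching where the per-domain time starts to climb should show roughly how many sub-servers it takes before the slowdown appears.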
I ended up just creating a new Parent Virtual Server and changed my script to add new Sub-Servers to that. Everything is working as expected again, except that the services occasionally crash due to too many files being open. This is due to each virtual server having two log files open.
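To check how close a process actually is to its open-file limit, something like the following Python sketch could be used. It assumes a Linux `/proc` filesystem; the function names and the `proc_root` parameter are made up for illustration.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd, one per open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def read_nofile_soft_limit(pid, proc_root="/proc"):
    """Return the soft 'Max open files' limit of a process, or None."""
    with open(os.path.join(proc_root, str(pid), "limits")) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Fields: Max, open, files, <soft>, <hard>, <units>
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None
```

Reading another user's `/proc/<pid>/fd` normally needs root. Comparing the two numbers for the Apache parent process before and after adding a batch of sub-servers would confirm whether the log files are what is using up the descriptors.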
I am in the process of writing a script to change the log file output for each sub server to write to a single file for the parent.
Apache is currently writing directly to the logfiles in /var/virtualmin with each virtual server having its own log files.
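As a rough illustration of that kind of script, here is a Python sketch of one possible variant, not necessarily the approach described above. Instead of pointing every sub-server at a new file, it comments out the CustomLog, ErrorLog and TransferLog lines inside each `<VirtualHost>` block. Apache documents that a virtual host without its own logging directives logs to the main server's logs, so that variant avoids per-vhost log files altogether. The function name and the sample paths in the usage note are hypothetical.

```python
import re

_VHOST_OPEN = re.compile(r"^\s*<VirtualHost\b", re.IGNORECASE)
_VHOST_CLOSE = re.compile(r"^\s*</VirtualHost>", re.IGNORECASE)
_LOG_DIRECTIVE = re.compile(r"^\s*(CustomLog|ErrorLog|TransferLog)\s",
                            re.IGNORECASE)


def comment_out_vhost_logs(conf_text):
    """Comment out per-vhost log directives; leave everything else as-is."""
    out = []
    in_vhost = False
    for line in conf_text.splitlines(keepends=True):
        if _VHOST_OPEN.match(line):
            in_vhost = True
        elif _VHOST_CLOSE.match(line):
            in_vhost = False
        elif in_vhost and _LOG_DIRECTIVE.match(line):
            # Already-commented lines do not match, so reruns are harmless.
            line = "#" + line
        out.append(line)
    return "".join(out)
```

To keep requests from different sub-servers distinguishable in the shared access log, the main server's log format would need to include the virtual host name (the `%v` format field). It would be wise to run the result through `apache2ctl configtest` and keep a backup of the original config before reloading Apache.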
Most days around the same time (8.20am) BIND, Telnet and Webmin/Virtualmin fall over. During this time Apache keeps running. It seems that it is a cron job which is causing this problem.
This is what I see in the /var/log/syslog file just prior to the crash:
May 8 08:17:01 CRON[7339]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 8 08:20:01 CRON[7377]: (root) CMD (/home/bp/bin/ps.sh)
May 8 08:20:01 CRON[7379]: (root) CMD (/etc/webmin/status/monitor.pl)
May 8 08:20:01 CRON[7380]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
I got an alert this morning at exactly 8.20am telling me that BIND, Telnet and Webmin/Virtualmin services were down. I restarted these services and everything has been stable since.
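To narrow down which cron job lines up with the crash, the CRON entries in syslog can be filtered by time of day. Below is a small Python sketch that does this for lines in the format shown above; the function name and the default time window are assumptions, and the hostname field that syslog normally includes is treated as optional.

```python
import re

_CRON_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?:(?P<host>\S+)\s+)?CRON\[(?P<pid>\d+)\]:\s+"
    r"\((?P<user>[^)]+)\)\s+CMD\s+\((?P<cmd>.*)\)\s*$"
)


def jobs_in_window(lines, start="08:15:00", end="08:25:00"):
    """Return (time, user, command) for cron jobs started in the window."""
    jobs = []
    for line in lines:
        m = _CRON_LINE.match(line.strip())
        if m is None:
            continue  # not a cron CMD line
        # Fixed-width HH:MM:SS strings compare correctly as plain text.
        if start <= m.group("time") <= end:
            jobs.append((m.group("time"), m.group("user"),
                         m.group("cmd").strip()))
    return jobs
```

Running this over several days of /var/log/syslog and comparing the results against the alert times should show whether the same job is always started just before the services go down.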