So the plot thickens…
It runs (nearly) fine from the GUI. I have run it manually 3 times over the past few days: it was successful twice and failed once, due to an empty HTTP response while uploading to Rackspace. However, it continued as configured with the rest of the domains and finished cleanly. All 3 manual runs created a log entry.
Each night the backup has been running as scheduled, but it is not creating a log file or email report, and not deleting old backups.
The logs also seem to be getting cut short. Two of them end with:
[…snip…]
… done
Uploading archive to Rackspace's Cloud Files service ..
.. done
… completed in 17 seconds
706 servers backed up successfully, 0 had errors.
That looks OK, but the log from the failed run ends the same way except the "706 servers backed up successfully, 0 had errors." line is missing entirely. I would expect it to be present with an error count of 1.
Also, none of the logs show the old-backup deletion information. When I ran the backup manually and watched in the GUI, this info was shown, and I made a few interesting observations during it. I also captured a copy to a file.
- I wasn't actually expecting it to delete old backups! When running the backup manually, the "Delete old files" and "Schedule and reporting" sections are not shown on the form, so I inferred that those features would not be executed on a manual run. I did not receive the report email, but I did witness the backup-deletion functionality running.
- Backup deletion seemed very slow. Output appeared to have stopped; I don't know whether it stalled on a single domain or was just slow overall. It looked like it had stalled, but when I came back later it had continued to what it seems to have considered completion. While it was "stalled" I ran lsof against the process: it had over 600 HTTPS connections to the Rackspace server in CLOSE_WAIT. I'm not sure if that is expected or what it means, but it sounds high to me.
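For reference, the per-state connection counts can be tallied from lsof output like this (the sample lines below are invented; in practice you'd feed it the output of `lsof -nP -iTCP -p <PID>`). A socket in CLOSE_WAIT means the remote side has closed the connection but the local process has not yet called close() on it:

```shell
# Invented sample of "lsof -nP -iTCP -p <PID>" output; the last field is
# the TCP state. CLOSE_WAIT = remote end closed, local process hasn't.
sample='perl 1234 root 12u IPv4 123 0t0 TCP 10.0.0.1:43210->104.130.0.1:443 (CLOSE_WAIT)
perl 1234 root 13u IPv4 124 0t0 TCP 10.0.0.1:43211->104.130.0.1:443 (CLOSE_WAIT)
perl 1234 root 14u IPv4 125 0t0 TCP 10.0.0.1:43212->104.130.0.1:443 (ESTABLISHED)'

# Tally connections per state (2 CLOSE_WAIT, 1 ESTABLISHED for this sample).
printf '%s\n' "$sample" | awk '{print $NF}' | sort | uniq -c
```

Hundreds of CLOSE_WAIT sockets to the same host usually points at a client that opens a connection per request and never closes them.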
- It doesn't seem to have attempted to delete all of the backups it should. The following is from the log I captured manually from the GUI; it does not appear in the actual log file saved on the server!
[…snip…]
Uploading archive to Rackspace’s Cloud Files service …
… done
… completed in 9 seconds
706 servers backed up successfully, 0 had errors.
Backup is complete. Final size was 42.63 MB.
Deleting backups from %Y-%m-%d-inc in Rackspace container jh2_virtualmin_backup older than 30 days …
Deleting file 2016-06-19-inc/raremaps.co.uk.tar.gz, which is 77 days old ..
.. deleted 1.96 kB.
Deleting file 2016-06-19-inc/rareprints.com.tar.gz, which is 77 days old ..
.. deleted 2.06 kB.
[…snip…]
Deleting file 2016-06-20-inc/3dprototypes.net.tar.gz, which is 76 days old ..
.. deleted 2.07 kB.
Deleting file 2016-06-20-inc/agrobot.co.uk.tar.gz, which is 76 days old ..
.. deleted 2.08 kB.
… deleted 170 old backups, and skipped 0 that were not old enough
As you can see, it is configured to delete backups older than 30 days. It has only deleted 170, and it should be deleting about 700 per day. It's working on the oldest backups, at 76 days old, so it has fallen a couple of months behind. The 170 were a few from 2016-06-19 and a few from 2016-06-20. Also, there are many current backups that should have been skipped; I'm not sure why 0 is shown for that count. The identical backup on the other server shows these counts correctly.
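Since the directory names carry an ISO date prefix (YYYY-MM-DD-inc / YYYY-MM-DD-full), it's easy to count how many are overdue for deletion with a plain string comparison. A rough sketch with invented sample names and a hard-coded cutoff (on the live server you'd loop over /mnt/cloudfuse/jh2_virtualmin_backup/* and compute the cutoff with `date -d '30 days ago' +%F`):

```shell
# is_overdue NAME CUTOFF: strip the -inc/-full suffix and compare the
# remaining ISO date lexically -- ISO dates sort the same way they compare.
is_overdue() {
  day=${1%-*}            # "2016-06-19-inc" -> "2016-06-19"
  [ "$day" \< "$2" ]
}

cutoff=2016-08-05        # stand-in for: date -d '30 days ago' +%F
for d in 2016-06-19-inc 2016-07-01-full 2016-08-20-inc; do
  if is_overdue "$d" "$cutoff"; then echo "overdue: $d"; else echo "keep: $d"; fi
done
```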
Looking at what is present in the Rackspace container, it seems to have stopped well before it should have:
root@jh1:/mnt/cloudfuse/jh2_virtualmin_backup# for f in * ; do echo $f `ls $f|grep gz$|wc` ; done
2016-06-05-full 705 705 16047
2016-06-15-inc 0 0 0
2016-06-20-inc 527 527 12105
2016-06-21-inc 706 706 16067
2016-06-22-inc 706 706 16067
2016-06-23-inc 706 706 16067
2016-06-24-inc 705 705 16045
2016-06-25-inc 706 706 16067
2016-06-26-inc 706 706 16067
2016-06-27-inc 706 706 16067
2016-06-28-inc 706 706 16067
2016-06-29-inc 706 706 16067
2016-06-30-inc 706 706 16067
2016-07-01-full 706 706 16067
2016-07-02-inc 706 706 16067
2016-07-03-inc 706 706 16067
[...snip...]
Most backup days have 706 .gz files in them. You can see it has deleted some from the 20th, but not all. The 15th is also interesting, as it shows no .gz files; looking inside the folder I can see a .dom and a .info file there for one domain, but no .gz file.
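A quick way to list archives whose metadata files survived but whose .tar.gz did not. I'm assuming the metadata sits alongside the archive as `<archive>.info` (adjust the suffix if yours differ); the demo builds a throwaway directory rather than touching the real container:

```shell
# Throwaway directory mimicking a backup dir: one complete backup and one
# whose .tar.gz is missing (names invented for the demo).
dir=$(mktemp -d)
touch "$dir/good.com.tar.gz" "$dir/good.com.tar.gz.info" "$dir/bad.com.tar.gz.info"

# For every .info file, check the matching archive exists.
for info in "$dir"/*.info; do
  gz=${info%.info}                       # strip trailing ".info"
  [ -e "$gz" ] || echo "missing archive: ${gz##*/}"
done
# -> missing archive: bad.com.tar.gz
```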
So, in conclusion, I am beginning to suspect that a network connectivity issue while deleting the backups is causing a crash or timeout that makes the backup process exit uncleanly. Maybe the full directory listing isn't being downloaded? That would explain why it doesn't delete everything and doesn't consider any backups "skipped". I'm not sure it explains why the .info and .dom files exist without a .gz, though!
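On the truncated-listing theory: Cloud Files (OpenStack Swift) caps container listings at 10,000 objects per request, and a client is expected to keep re-requesting with `marker` set to the last name received until an empty page comes back. A client that treats the first page as the whole listing would behave much like what I'm seeing. A toy sketch of the pagination loop (the `list_page` function and object names are invented stand-ins for the real API call):

```shell
# Stand-in for "GET /container?marker=<name>": returns at most 2 object
# names lexically after the marker, from a fixed 5-object "container".
list_page() {
  printf '%s\n' a.tar.gz b.tar.gz c.tar.gz d.tar.gz e.tar.gz |
    awk -v m="$1" 'm == "" || $0 > m' | head -n 2
}

# Correct client: follow the marker until an empty page signals the end.
marker=''
while :; do
  page=$(list_page "$marker")
  [ -z "$page" ] && break
  printf '%s\n' "$page"                  # all 5 objects emerge, 2 per page
  marker=$(printf '%s' "$page" | tail -n 1)
done
```

A buggy client that stopped after the first page would only ever see a.tar.gz and b.tar.gz, which would look exactly like "deleted a fraction, skipped 0".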
I am going to clean up the server a bit; about 400 of the domains can be removed, which will make the backup process much quicker, and I am thinking it may prevent this problem. I am also going to manually remove some of the old backups so Virtualmin doesn't have so many to keep track of and delete. This won't be done for a few more weeks yet, so if you would like me to run any further tests and gather more info so you can get to the bottom of this, I would be willing to do so. Thanks to all involved for an awesome system!
I will let the backup run on schedule tonight and see how many files it has managed to delete in the morning.