Deleting old backups fail: Too many open files (CLOSE_WAIT to racksapce container)

bobemoe · September 2, 2016, 11:01am

UPDATE: I’ve found the cause of this now: Comment #6

One of my virtualmin backups seem to be completing successfully (all domains.tar.gz present), but not saving the log or sending the results email or deleting the old backups.

I have 2 schedules, both to rackspace files. One FULL on the 1st of month, and one INCREMENTAL on 2nd-31st of month. The full backup works as expected, its the incremental one that is showing this issue. I have got past logs of the incremental one as it was working for a couple of months, it seems to have stopped working around 6 July 2016.

I have another server with the same backup schedules which is logging/emailing/removing old backups for both schedules as expected.

Both servers are fully updated. The main difference between servers is there are 700 domains (mainly aliases) on the server in question as opposed to about 20 on the one that works successfully. Maybe this is too many? and something is timing out? I’m not running any process monitors/killers and cant see anything in dmesg of the kernel killing any tasks.

I haven’t set any pre/post backup scripts so its not them interfering.

Obviously I have no backup logs to check since the issue started! There doesn’t seem to be anything in any other logs. Any ideas what I could check next?

Thanks

Eric · September 2, 2016, 3:47pm

Howdy,

Hmm, that’s certainly odd.

So the backup appears to be working – you just don’t see any logs for it?

And just to verify, you’re looking in Backup and Restore -> Backup Logs?

I’d be concerned that anything not listed in there might not be working properly – but maybe there’s a bug of some sort that’s interfering with the logging process.

One thing you might try, is to go into the Scheduled Backup that isn’t working properly, and have it run from there from the GUI in the foreground, so that you can see all the output. That might aid in discovering what’s going on.

-Eric

bobemoe · September 2, 2016, 5:22pm

Hi

Yeah, that is correct, backup folder and files for all domains get created but no trace of the seemingly successful backup can be found in “Backup and Restore -> Backup Logs” or with the actual log files in /etc/webmin/virtual-server/backuplogs.

Additionally the completion email is not being sent (it is configured to) and the backups older than 30 days as configured are not being deleted.

Thanks for the tip… I am running the backup now in the web interface so I can monitor the output of the process. It’ll take a few hours so I will report back.

Cheers

bobemoe · September 5, 2016, 8:39am

So the plot thickens…

It runs (nearly) fine from the GUI… I have run it manually 3 times over the past few days. It was successful twice, and did fail once, due to an empty HTTP response while uploading to rackspace. However it continued as configured with the rest of the domains, and finished cleanly. All 3 manual runs have created a log entry.

Each night the backup has been running as scheduled but not creating a log file, email report or deleting old backups.

The logs do seem to maybe end short. Two end with:
[…snip…] … done

Uploading archive to Rackspace's Cloud Files service ..
.. done

… completed in 17 seconds

706 servers backed up successfully, 0 had errors.

Which looks ok, but the failed one ends the same but without the "706 servers backed up successfully, 0 had errors. " line at all. I would expect it to be present with an error count of 1.

Also none of the logs show the deleting old backups information. When I ran manually and watched in the GUI, this info was shown. And I made a few interesting observations during it. I’ve also captured a copy to a file.

I wasn’t actually expecting it to delete old backups! When manually running the backup the “Delete old files” and “Schedule and reporting” sections are not available on the form. I inferred from this that these features were not going to be executed on a manual run. I did not receive the report email, however I did witness the backup deletion functionality running.
Backup deletion seemed very slow. Output seemed to have stopped, I don’t know if it stalled on just one domain or if they were just overall slow. It looked like it had stalled but when I came back later it had continued to what it seems to have considered completion. While it was “stalled” I took a look at lsof for the process. It had over 600 https connections to the rackspace server at CLOSE_WAIT. I’m not sure if this is expected or what it means? Sounds a bit high to me.
It doesn’t seem to have attempted to delete all of the backups that it should be deleting. This is from the manually captured log file from the GUI. It does not appear in the actual log file saved on the server!
[…snip…]

Uploading archive to Rackspace’s Cloud Files service …
… done

… completed in 9 seconds

706 servers backed up successfully, 0 had errors.

Backup is complete. Final size was 42.63 MB.

Deleting backups from %Y-%m-%d-inc in Rackspace container jh2_virtualmin_backup older than 30 days …

Deleting file 2016-06-19-inc/raremaps.co.uk.tar.gz, which is 77 days old ..
.. deleted 1.96 kB.

Deleting file 2016-06-19-inc/rareprints.com.tar.gz, which is 77 days old ..
.. deleted 2.06 kB.

[…snip…]

Deleting file 2016-06-20-inc/3dprototypes.net.tar.gz, which is 76 days old ..
.. deleted 2.07 kB.

Deleting file 2016-06-20-inc/agrobot.co.uk.tar.gz, which is 76 days old ..
.. deleted 2.08 kB.

… deleted 170 old backups, and skipped 0 that were not old enough

As you can see it is configured to delete backups older than 30 days. It has only deleted 170 and it should be doing about 700 per day. Its working on the oldest backups at 76 days old, so its got a couple of months behind. The 170 were a few from 2016-06-19 and a few from 2016-06-20. Also there are many current backups that should have been skipped, I’m not sure why 0 is shown for that count. The identical backup on the other server shows these counts correctly.

Looking at what is present on the rackspace container it seems to have stopped way before it should have:


root@jh1:/mnt/cloudfuse/jh2_virtualmin_backup# for f in * ; do echo $f `ls $f|grep gz$|wc` ; done
2016-06-05-full 705 705 16047
2016-06-15-inc 0 0 0
2016-06-20-inc 527 527 12105
2016-06-21-inc 706 706 16067
2016-06-22-inc 706 706 16067
2016-06-23-inc 706 706 16067
2016-06-24-inc 705 705 16045
2016-06-25-inc 706 706 16067
2016-06-26-inc 706 706 16067
2016-06-27-inc 706 706 16067
2016-06-28-inc 706 706 16067
2016-06-29-inc 706 706 16067
2016-06-30-inc 706 706 16067
2016-07-01-full 706 706 16067
2016-07-02-inc 706 706 16067
2016-07-03-inc 706 706 16067
[...snip...]

Most backup days have 706 .gz files in them. You can see its deleted some from the 20th but not all. Also the 15th is interesting as it shows no gz files… looking into the folder I can see a .dom and .info files there for a domain but no .gz file.

So, in conclusion I am beginning to suspect network connectivity issue while deleting the backups is causing a crash or timeout that makes the backup process uncleanly exit? Maybe the full directory listing isn’t downloaded? That could be why it doesn’t delete everything and doesn’t consider any to be “skipped”. I’m not sure that explains why the .info and .dom file exists without an .gz!?

I am going to clean up the server a bit, about 400 of the domains can be removed so that will make the backup process much quicker, and I am thinking it may prevent this problem. I am also going to manually remove some of the old backups so virtualmin doesn’t have so many to keep track of and delete. This wont be done for a few more weeks yet, so If you would like me to run any further tests and gather more info so you can get to the bottom of this I would be willing to do so. Thanks to all involved for an awesome system

I will let the backup run on schedule tonight and see how many files it has managed to delete in the morning.

bobemoe · September 5, 2016, 3:05pm

I couldn’t wait till tonight so I have run the backup again manually. This time I attached strace to the backup process, and kept an eye on lsof. During the backup there was only a couple of network connections open at one time. When the deleting started is when the CLOSE_WAIT connections started showing up, and after a while I started to see “Too many open files” all over the place in strace. Soon after this the backup finished, only deleting 170 domains again. Does this confirm my suspicions of too many network connections? I have a copy of the strace but it is big and I haven’t check it for sensitive info so I wont post here unless specific sections are requested.

bobemoe · September 8, 2016, 9:54pm

OK, so I have tracked the cause of this down to a combination of 2 bugs:

First: in rs_http_single_call function in /usr/share/webmin/virtual-server/rs-lib.pl an HTTP connection is opened to rackspace server and an API call is issued. Upon success the connection is closed and the response is returned. However, upon error the function returns an error code but the connection is not closed and remains in CLOSE_WAIT state. There are 5 places this function could return without closing the connection.

Second: in purge_domain_backups function in /usr/share/webmin/virtual-server/backups-lib.pl which loops through all files in the rackspace container. For each file rs_stat_object is called. If the file is a .gz file it is deleted, along with the .dom and .info files. The next loop iteration is the .dom file (which was just deleted) so when rs_stat_object is called a 404 is correctly returned, but this is considered an error by rs_http_single_call which then leaves the connection open as described above. Could this function be re-factored to put the rs_stat_object call in the next conditional block which is not executed for .dom and .info files?

Eric · September 9, 2016, 12:26am

Howdy,

Hmm, if you think you’re seeing a bug, we may want to get Jamie involved to get his thoughts on that.

To do that – could you open up a bug report here in the support tracker:

https://www.virtualmin.com/project/issues#overlay=node/add/project-issue/virtual_server

There, Jamie will be able to comment on that… thanks!

-Eric

bobemoe · September 9, 2016, 8:27am

Done: https://www.virtualmin.com/node/41967
Thanks.