All vhosts are backed up under the same S3 account, but randomly one of them fails.

Eric, I imagine you guys already looked into all the updates that occurred near the date the issue started?

I don't have a Rackspace account, but I'm willing to help. LMK how.

thank you!

Could a firewall be preventing the Amazon connection? We have CSF on all our servers, and we block all non-public ports (SSH, FTP, etc.). We only open port 80 for incoming traffic, plus the email ports. Would Amazon S3 be connecting on specific ports, or triggering some LFD block, like too many connections too quickly? I did search for “amazon” in the LFD log but found nothing there.

I have approximately 10 servers all backing up to Amazon S3 at the same time each night (midnight), so I also wondered whether connecting to the buckets in my one S3 account simultaneously causes some issue at Amazon. Although if it did, then S3 would not be very robust for larger organisations, so it's probably not that.
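For reference, my understanding is that S3 uploads are outbound HTTP/HTTPS connections from the server to Amazon on ports 80/443, so the CSF setting that would matter is the outbound list (TCP_OUT), not the inbound rules. A quick way to check it (the csf.conf line below is a made-up sample so the snippet runs anywhere; on a real server, grep /etc/csf/csf.conf itself):

```shell
# S3 uploads are outbound connections on ports 80/443, so CSF's
# inbound port list shouldn't affect them; check TCP_OUT instead.
# A sample csf.conf line is inlined here so this is self-contained --
# on a real server, run the grep against /etc/csf/csf.conf.
printf 'TCP_OUT = "20,21,25,53,80,110,443"\n' > /tmp/csf.sample
grep '^TCP_OUT' /tmp/csf.sample
```

If 443 (and 80) are missing from that list, CSF would indeed block the upload before it ever reaches Amazon.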

I’ll review the git check-ins to double-check that nothing there changed that might be causing the issue.

Did this start occurring at the same time for everyone, roughly mid-June?

-Eric

June 13-16, at least for two people here…

Howdy,

Thanks for the info.

Reviewing the Virtualmin releases, it looks like the most recent Virtualmin version, which is where the S3 code is implemented, was released on May 14th.

The most recent Webmin version was released on May 22nd.

I suspect since so many folks began seeing this issue in mid-June, that something else may have changed around that time (assuming it’s client-side, and not related to S3 itself… which it very well may be).

I tried to answer this next question by quickly reviewing the posts in this thread, but I just wanted to confirm – is everyone who's having this issue using CentOS? It sounds like there's a mix of CentOS 5 and CentOS 6 systems, but I didn't notice any that were running Ubuntu or Debian.

I don’t imagine any of you still have yum logs from roughly the time when the problem began occurring, where you could check to see what packages were installed/updated around that time?
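For anyone who does still have those logs, a grep like the following pulls out just the mid-June window (sample log lines are inlined here so the snippet is self-contained; on a real box, run the grep against /var/log/yum.log instead):

```shell
# Extract yum.log entries from June 10-17; adjust the pattern for
# other date windows. Sample data stands in for /var/log/yum.log
# so this runs anywhere.
cat > /tmp/yum.sample <<'EOF'
Jun 04 17:12:17 Updated: gnutls-2.8.5-14.el6_5.x86_64
Jun 11 22:54:45 Updated: goaccess-0.8-1.el6.x86_64
Jun 21 09:53:12 Updated: libxml2-2.7.6-14.el6_5.2.x86_64
EOF
grep -E '^Jun 1[0-7] ' /tmp/yum.sample
```

Against the sample data above, only the Jun 11 line matches.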

-Eric

yum.log for June 2014:

Jun 03 09:32:53 Updated: tzdata-2014d-1.el6.noarch
Jun 03 09:33:00 Updated: tzdata-java-2014d-1.el6.noarch
Jun 04 17:12:17 Updated: gnutls-2.8.5-14.el6_5.x86_64
Jun 04 17:12:25 Updated: libtasn1-2.3-6.el6_5.x86_64
Jun 06 00:01:02 Updated: openssl-1.0.1e-16.el6_5.14.x86_64
Jun 06 00:01:05 Updated: openssl-devel-1.0.1e-16.el6_5.14.x86_64
Jun 11 22:54:45 Updated: goaccess-0.8-1.el6.x86_64
Jun 21 09:50:37 Updated: kernel-firmware-2.6.32-431.20.3.el6.noarch
Jun 21 09:51:03 Installed: kernel-2.6.32-431.20.3.el6.x86_64
Jun 21 09:52:58 Updated: kernel-headers-2.6.32-431.20.3.el6.x86_64
Jun 21 09:53:12 Updated: libxml2-2.7.6-14.el6_5.2.x86_64
Jun 21 09:53:13 Updated: libxml2-python-2.7.6-14.el6_5.2.x86_64
Jun 21 09:53:21 Updated: tzdata-2014e-1.el6.noarch
Jun 21 09:53:28 Updated: tzdata-java-2014e-1.el6.noarch
Jun 23 10:54:03 Updated: nodejs-packaging-7-1.el6.noarch
Jun 23 23:03:26 Updated: avahi-libs-0.6.25-12.el6_5.1.x86_64
Jun 23 23:03:33 Updated: kpartx-0.4.9-72.el6_5.3.x86_64
Jun 23 23:03:39 Updated: ql2400-firmware-7.03.00-1.el6_5.noarch
Jun 23 23:03:44 Updated: ql2500-firmware-7.03.00-1.el6_5.noarch
Jun 26 10:44:13 Updated: 1:dovecot-2.0.9-7.el6_5.1.x86_64
Jun 26 10:53:59 Updated: coreutils-libs-8.4-31.el6_5.2.x86_64
Jun 26 10:54:10 Updated: coreutils-8.4-31.el6_5.2.x86_64
Jun 28 08:10:25 Updated: clamav-db-0.98.4-1.el6.x86_64
Jun 28 08:10:35 Updated: clamav-0.98.4-1.el6.x86_64
Jun 28 08:10:37 Updated: clamd-0.98.4-1.el6.x86_64

martin

CentOS 6.

Updates in the week of June 10-17:

Updated mod_security-1:2.8.0-20.el6.art.x86_64 @asl-4.0
Updated libxml2-2.7.6-14.el6_5.1.x86_64
Updated libxml2-python-2.7.6-14.el6_5.1.x86_64

I attach my yum.log for June from one of my servers. Let me know if you want logs from more of them (all 10 have the issue). They are all more or less similar in setup, and all would have been updated at the same time; I try to run yum update once a month on all servers. They are all CentOS 6.

Just to put my +1 to this. Sadly I have now lost a client's database, as things weren't being backed up fully. :frowning:

I had this issue some time ago, and I found that the server time was a little off, which seemed to be causing the failures.

Another note for some: Amazon not so long ago changed their authentication method (IAM). Not sure if it is related or not, but I have changed my credentials now.

Not sure how either of these suggestions has anything to do with “Empty response to HTTP request”. Just clutching at straws really.

It still doesn't get everything across, but I got more servers over than I have in the past. I'm going to tweak retries now and see if that helps.

Hi,

We use Ubuntu 12.04 LTS and we have the same observations.

I contacted AWS support and they are asking for request/response headers of the failed requests. Is there any way for me to find some debugging info in logs that I could pass to AWS team?

The biggest thing I see in common that’s been updated is libxml2.

I'm reluctant to think that's the source of the issue, but I also don't want to rule it out. If one of you wanted to try rolling libxml2 back to the previous version, I'd be curious whether that makes a difference.

I’m a bit more curious about this though –

I’ve been reviewing the code used to push the backups to S3, and I’m wondering if anyone would be so kind to try making a change in /usr/libexec/webmin/virtual-server/s3-lib.pl to enable some additional debugging.

On line 208 of that file is the following:

$err = "Empty response to HTTP request";

Could you change that line to read as follows:

$err = "Empty response to HTTP request: [line: $line], [out: $out]";

That will show a bit more info about what’s really being returned by Amazon when this error occurs.

After making that change, restart Webmin (/etc/init.d/webmin restart).

Then, next time that error is thrown, could you paste in the full error output here? It’s possible those variables will be completely empty. But it’s also possible they’ll contain exactly what we need to determine what’s going on :slight_smile:

Thanks!

-Eric

Thanks Eric, I've made the change. We run backups every night, Europe time. I will respond with an update if anything comes up, unless someone else is faster than me.

Leszek

@T2thec, about the server time… Amazon does indeed check server time. If it's off by 15 minutes or more, it won't connect. I discovered this some time ago when using a plugin for Expression Engine. You can see my report here http://expressionengine.stackexchange.com/questions/10170/assets-stopped-connecting-to-amazon-s3-access-denied-by-target-host, and Amazon's FAQ on the matter at http://aws.amazon.com/articles/1109#04. BUT, my server times are now correct following that issue, so that may not be the problem here.
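To make the 15-minute rule concrete: S3 compares the Date header on the signed request against its own clock and rejects anything outside a roughly 15-minute window (the "RequestTimeTooSkewed" error). A minimal sketch of that check, using a pretend 16-minute offset rather than a real AWS call:

```shell
# Simulate S3's clock-skew check: reject if |server - client| > 900s.
client=$(date -u +%s)
server=$((client + 960))     # pretend Amazon's clock is 16 min ahead
skew=$((server - client))
abs_skew=${skew#-}           # strip a leading minus sign, if any
if [ "$abs_skew" -le 900 ]; then
  echo "request accepted"
else
  echo "RequestTimeTooSkewed"
fi
```

With the simulated 960-second offset this prints "RequestTimeTooSkewed"; keeping ntpd running (or cron-ing ntpdate) keeps real servers well inside the window.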

Two servers updated, will report back tomorrow.

martin

One domain failed, the report contained:

Uploading archive to Amazon’s S3 service …
… upload failed! Empty response to HTTP request: [line: ], [out: ]

martin

Hey. Reporting back.

• Bumped retries up to 10
• Changed to new AWS IAM passkey details
• Checked server time wasn’t out

72 VPSes fully backed up three times without a hitch.

I am a happy man again… For now.

T2thec,

What are the new AWS IAM passkey details?

And everyone: how would server time only be an issue OCCASIONALLY?

thanks

Dear Eric,

Similarly to martinmanyhats I see the same output in my backup logs:

Creating incremental TAR file of home directory ..
.. done

Uploading archive to Amazon's S3 service ..
.. upload failed! Empty response to HTTP request: [line: ], [out: ]

… completed in 1 minutes, 24 seconds

Thanks!

Hi. I had the same problem on a Debian 7 server, starting on May 27th: random failures backing up virtual servers.

I tried to set up a new scheduled backup and it failed every time until I increased the retries from 3 to 10 in the Virtualmin settings.

I’ll do the same on the server with the random failures and see what it does next time.
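For anyone wondering what bumping the retry count actually buys you: it just re-attempts the failed upload before giving up, which papers over intermittent failures like this one. A generic sketch of that pattern (not Virtualmin's actual code; the "upload" here is simulated and succeeds on the third try):

```shell
# Generic retry loop: re-attempt a flaky operation up to $max times.
# simulated_upload fails on attempts 1-2 and succeeds on attempt 3.
simulated_upload() { [ "$1" -ge 3 ]; }

attempt=1; max=10; status=failed
while [ "$attempt" -le "$max" ]; do
  if simulated_upload "$attempt"; then
    status="ok on attempt $attempt"
    break
  fi
  echo "attempt $attempt failed, retrying"
  attempt=$((attempt + 1))
done
echo "upload $status"
```

With retries raised from 3 to 10, an upload that only fails, say, 1 time in 3 becomes very unlikely to exhaust all attempts, which matches what people are seeing here.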

Thank you for sharing the additional output in the backup logs. That does look like it’s truly empty.

It sounds like Amazon is suggesting we review (or pass along to them) the headers being sent to them, and received from them, during the backup process.

I’ll work with Jamie to get a patch that retrieves that information, and we’ll get back with you shortly!

-Eric