Backuping again

David.Strejc · September 9, 2008, 5:02am

Thanks, Joe, for the super response to the Backup thread a few days ago, but I’ve started this new one because my case is a little different.

My scenario is:

I have 3 servers.

The first one is the main server, serving all of our clients’ websites (about 100 Joomla installations).
The second one is the secondary DNS and secondary MX server for the main server.
The third is configured exactly the same as the first one, but it is on a different line (hosting provider) and is "sleeping" (serving only test projects for now).

I back up every day from the main server to 2 different places: to the third box and to the file server in our office (with a thin upload line). Paranoid, I know … I mean Virtualmin virtual server backups.

In the case of a main server failure … (atomic bomb or something like that):

I hope (and have tested) that the secondary MX will hold emails, wait for the first server to come back up, and then deliver them.

Then, while I am panicking and calling for new hardware, I will restore the backups on the third server.

Our clients use our DNS servers for their domains, so we will only need to change the glue record for our primary nameserver. I will restore the first domain (our main domain, with the nameservers), but I don’t know how the second server will behave. Should I delete all of the virtual hosts’ secondary DNS (slave) files on the second server? Then I can start restoring backups and everything should work (if the computer gods allow it ;o)?

I hope that you understand my problem and my scenario, and that someone will find this case helpful in the future. If you don’t understand me or have questions, please help me specify the problem in more detail, so we can all find a solution for it. I know that this can save some poor administrator’s a**.

We aren’t a rich company, so we can’t afford highly available cluster systems at the moment (but I am doing a little research of my own into setting up something like this based on web && virtualmin).

Thanks for any advice. Many thanks.

Joe · September 9, 2008, 5:22pm

Good questions. Though there is no “best” solution…it’s all a delicate balancing act between “fast enough recovery” and “doesn’t cost too much” and “acceptable amount of data loss”.

That said, Jamie and I have been discussing reliability and scalability quite a bit lately, and I think modern Linux and Solaris have some pretty good solutions that are pretty cost effective and can provide “Big Company” levels of reliability and scalability at “Bootstrapping Small Company” cost, as long as you’re willing to put in the effort (and a little money).

One option, but probably not the right option for you, is the “watch that basket” model (as in, “Put all your eggs in one basket, and then watch that basket!”), which can be summed up as “Put all of your important data onto a reliable, fast, scalable file server.” This assumes you have the ability to install arbitrary hardware on the network, and all of your servers will be on the same network segment with fast pipes (gigabit) to the file server. You’d then run Solaris with ZFS or Linux with LVM/RAID in a highly reliable configuration (raidz or RAID 5, respectively) across several disks. ZFS supports live snapshots, which is nice…Linux doesn’t have any live snapshot mechanism that I’m aware of. Regardless, even with a good strong basket, you’ll want a periodic backup of the data.

Anyway, moving parts are the most likely failure point in a server, so spreading across many disks is a win. There’s still the concern of failure in other hardware components in your file server…this is an area where it’s possible to spend varying levels of money to get increasing levels of reliability. Sun makes some pretty impressive systems in this space–they can deal with the loss of pretty much any single component (like a CPU or disk controller) and continue operation. Anyway, when you go down this path, your reliability concerns become known and well-understood quantities, rather than “I have a bunch of random boxes with disparate data and I need to make them reliable”. Making one box reliable is much easier than making many boxes reliable. Obviously, once you have all of your data on one system, you can then easily make a basic disk image, or kickstart configuration, that you run on all of the other systems that allows you to quickly spin it up in the event of a failure. Since no data needs to be restored from backups (it’s all on the reliable file server) bringing up a replacement system could take seconds (if you have a pre-configured spare) or a few minutes (if you have to copy the image and copy in the configuration files for the lost server). This scales relatively nicely, and when performance of disk access becomes an issue, it just means it’s time to add another file server.

If I were building a hosting infrastructure this is probably the way I would do it. Joyent is set up this way (roughly, with the caveat that they have a lot of different configurations because of various hosting options and types they’ve offered over the years), and they have historically excellent reliability…and they do it with less expenditure than most of their competitors, I’d bet. (They are not publicly traded and I don’t have any inside information on their finances, so I don’t know their actual margins, but I get the feeling they’re pretty good.)

Another option is simple backups, which is what we do for Virtualmin.com, and what you’re describing and probably the right choice for your current infrastructure. It’d take us about an hour to bring Virtualmin back online if we lost our server, and it would involve some DNS changes (and for some folks, Virtualmin would be offline for longer than an hour, due to propagation time and DNS caching misconfiguration at some ISPs). But, if you can avoid DNS changes, it’s a win–if a server is dead, why not bring a new one back up on the same IP? We probably don’t have that privilege, since we only have two spare servers and they are in different data centers, and it sounds like you don’t have that privilege either. Of course, if you’re talking atomic bomb scenarios (or flood, fire, earthquake, etc.), that takes down the whole data center of your primary server, then you’re going to have to change DNS and accept some downtime no matter what.

But, you shouldn’t need to delete anything on the secondary DNS server–assuming your new primary is configured the same as the old primary, when you bring in the backups it ought to update the secondary server with the new IP, I think (if not, it’s probably bug-like and ought to be fixed). Updating glue records should be viewed as an absolute last resort…note that DNS is a reliable protocol by design. If one server goes down, it’s OK. The secondary will get all the requests while the primary is out to lunch. It might be a little slower, but it won’t stop service. So don’t touch those glue records if you’ll be bringing the old IP back into service in the future.

It sounds like you’ve already followed the docs on a hold and forward backup MX server, and the DNS Slave auto-configuration, so you should only need to worry about getting the primary back online (either new hardware or recovering the old server) as quickly as possible…so just worry about getting the data in place on the new server and the secondary DNS server pointing to the new IP(s).

The only unavoidable negative in this kind of deployment is that you almost certainly will lose (some) data. Assuming daily backups, the time between your last backup and the failure could mean as much as 23+ hours’ worth of data lost to the void. Averages tell us that it’ll be less than this (the mean I guess would be about 12 hours), of course, but it is an unavoidable bit of data loss…unless the disk isn’t what failed, or the data can otherwise be recovered.
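
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes a failure is equally likely at any moment between two backups, and the function name and the 24-hour interval are made up for the example rather than taken from Virtualmin.

```python
def data_loss_window(backup_interval_hours: float) -> tuple[float, float]:
    """Return (worst_case, mean) hours of data lost for a given backup interval.

    Assumption: the failure time is uniformly distributed over the interval,
    so the mean loss is half the interval.
    """
    if backup_interval_hours <= 0:
        raise ValueError("backup interval must be positive")
    worst_case = backup_interval_hours
    mean = backup_interval_hours / 2
    return worst_case, mean


# Daily backups, as in the scenario discussed in this thread.
worst, mean = data_loss_window(24)
print(f"daily backups: up to {worst:.0f}h lost, about {mean:.0f}h on average")
```

Under the same assumption, halving the backup interval halves both numbers, which is the trade-off against the thin upload line to the office file server.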

In short, don’t be overly concerned about DNS–it’s the most reliable part of your infrastructure, by default. User data is the hard part.

David.Strejc · September 10, 2008, 3:15am

Thanks for the absolutely great description and advice.

Now I will test it ;o))) If I find something that I don’t understand, I will ask you.

Today my lovely datacenter will have a maintenance power shutdown for 5 minutes, so I hope that will be fun.

For now, as you said, our solution is adequate for us. But in the future I am planning that Open Source hosting. (I think I already have more secure hosting for our few clients than many of the commercial hosting companies in our country that I’ve tried, thanks to your advice and great software. Thanks to Webmin I’ve realized how many things work in UNIX, and as I get deeper and deeper into UNIX I now understand that your tool is anything but just a point-and-click solution: it is a set of predefined routines with readable, pretty readable, code and great support on your forum.) Maybe one day I’ll be able to say that I’ve helped the community, and you two with Virtualmin, a little bit.

Did you ever think about writing a book, not about Virtualmin and Webmin (as you already did that), but about the techniques and ways of being a good administrator of open source based systems?

This forum is full of great resources.

Maybe I am talking too much. Now, hands on keyboard, and let’s move from theory to practice ;o)

Thanks a lot. I’ll let you know about my progress with backup theory.

Joe · September 10, 2008, 4:39am

Practice is the only way you’ll ever be comfortable with this stuff. Backups, in particular, require dry runs so that you can be sure your processes are reliable and repeatable. If they aren’t, backups are useless…so, everybody ought to be testing their backup procedures to know that they work.
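
As a small first step toward that kind of dry run, here is a minimal Python sketch that reads a backup archive end to end to confirm it is not truncated or corrupt. It assumes the backups are tar files (optionally compressed); the function name is hypothetical and this is not a Virtualmin tool. It is no substitute for actually restoring onto the spare server.

```python
import tarfile
from pathlib import Path


def check_backup_archive(path: Path) -> int:
    """Read every regular file in a tar archive and return how many were read.

    Any exception raised while opening or reading (for example
    tarfile.TarError, OSError or EOFError) means the archive is damaged.
    """
    file_count = 0
    # "r:*" lets tarfile detect the compression (none, gzip, bzip2, xz).
    with tarfile.open(path, "r:*") as archive:
        for member in archive:
            if member.isfile():
                extracted = archive.extractfile(member)
                # Reading the data forces the whole member to be decompressed.
                while extracted.read(1024 * 1024):
                    pass
                file_count += 1
    return file_count
```

Calling `check_backup_archive(Path("example.tar.gz"))` on each nightly archive (the file name here is only a placeholder) catches broken uploads early, but only a real restore shows that the sites, mail and DNS come back.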

Did you ever think about writing a book, not about Virtualmin and Webmin (as you already did that), but about the techniques and ways of being a good administrator of open source based systems?

I considered it. But, I’m not privileged enough to be independently wealthy at this point in time…and writing technical books does not pay very well. My Webmin book was an accident–it sprang from the documentation of my previous company’s products, and then turned into a real book. Maybe someday I’ll do something similar with our Webmin and Virtualmin wikis–add some best practices sections. But it’s not on my todo list right now. I’ve been feeling a bit overwhelmed with Virtualmin, Inc. lately, so I’m not looking for new projects.