Good questions. Though there is no “best” solution…it’s all a delicate balancing act between “fast enough recovery” and “doesn’t cost too much” and “acceptable amount of data loss”.
That said, Jamie and I have been discussing reliability and scalability quite a bit lately, and I think modern Linux and Solaris have some good solutions that are cost effective and can provide “Big Company” levels of reliability and scalability at “Bootstrapping Small Company” cost, as long as you’re willing to put in the effort (and a little money).
One option, but probably not the right option for you, is the “watch that basket” model (as in, “Put all your eggs in one basket, and then watch that basket!”), which can be summed up as “Put all of your important data onto a reliable, fast, scalable file server.” This assumes you have the ability to install arbitrary hardware on the network, and that all of your servers will be on the same network segment with fast pipes (gigabit) to the file server. You’d then run Solaris with ZFS or Linux with LVM/RAID in a highly reliable configuration (raidz or RAID 5, respectively) across several disks. ZFS supports live snapshots, which is nice; on Linux, LVM can take copy-on-write snapshots too, though they’re clunkier than the ZFS equivalent. Regardless, even with a good strong basket, you’ll want a periodic backup of the data.
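As a rough sketch of what that looks like in practice (device names, sizes, and paths below are placeholders, not a recommendation for your hardware):

```shell
# Solaris: create a raidz pool across four disks, then take cheap live snapshots
zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0
zfs create tank/data
zfs snapshot tank/data@nightly        # e.g. from cron, before each backup run

# Linux: assuming a volume group "vg0" with a logical volume "data",
# create a copy-on-write snapshot and back up from the frozen copy
lvcreate --size 5G --snapshot --name data-snap /dev/vg0/data
mount -o ro /dev/vg0/data-snap /mnt/snap
```

The LVM snapshot only needs enough space to hold blocks that change while it exists, which is why 5G can cover a much larger volume for the duration of a backup.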
Anyway, moving parts are the most likely failure point in a server, so spreading across many disks is a win. There’s still the concern of failure in other hardware components in your file server…this is an area where it’s possible to spend varying levels of money to get increasing levels of reliability. Sun makes some pretty impressive systems in this space–they can deal with the loss of pretty much any single component (like a CPU or disk controller) and continue operation. In any case, when you go down this path, your reliability concerns become known and well-understood quantities, rather than “I have a bunch of random boxes with disparate data and I need to make them reliable”. Making one box reliable is much easier than making many boxes reliable. Obviously, once you have all of your data on one system, you can easily make a basic disk image, or kickstart configuration, for all of the other systems, which lets you quickly spin up a replacement in the event of a failure. Since no data needs to be restored from backups (it’s all on the reliable file server), bringing up a replacement system could take seconds (if you have a pre-configured spare) or a few minutes (if you have to copy the image and copy in the configuration files for the lost server). This scales relatively nicely, and when performance of disk access becomes an issue, it just means it’s time to add another file server.
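The reason replacements are so fast in this model is that the app servers hold nothing important locally–they just mount the filer. Something like this in the base image is all it takes (hostname and paths are made up for illustration):

```
# /etc/fstab on each app server -- all important data lives on the filer
filer.internal:/export/data  /data  nfs  rw,hard,intr  0  0
```

A freshly imaged box with that one mount comes up with all of its data already in place.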
If I were building a hosting infrastructure this is probably the way I would do it. Joyent is set up this way (roughly, with the caveat that they have a lot of different configurations because of various hosting options and types they’ve offered over the years), and they have historically excellent reliability…and they do it with less expenditure than most of their competitors, I’d bet. (They are not publicly traded and I don’t have any inside information on their finances, so I don’t know their actual margins, but I get the feeling they’re pretty good.)
Another option is simple backups, which is what we do for Virtualmin.com, what you’re describing, and probably the right choice for your current infrastructure. It’d take us about an hour to bring Virtualmin back online if we lost our server, and it would involve some DNS changes (and for some folks, Virtualmin would be offline for longer than an hour, due to propagation time and DNS caching misconfiguration at some ISPs). But, if you can avoid DNS changes, it’s a win–if a server is dead, why not bring a new one back up on the same IP? We probably don’t have that privilege, since we only have two spare servers and they are in different data centers, and it sounds like you don’t have that privilege either. Of course, if you’re talking about atomic bomb scenarios (or flood, fire, earthquake, etc.) that take down the whole data center of your primary server, then you’re going to have to change DNS and accept some downtime no matter what.
But, you shouldn’t need to delete anything on the secondary DNS server–assuming your new primary is configured the same as the old primary, when you bring in the backups it ought to update the secondary server with the new IP, I think (if not, it’s probably bug-like and ought to be fixed). Updating glue records should be viewed as an absolute last resort…note that DNS is a reliable protocol by design. If one server goes down, it’s OK. The secondary will get all the requests while the primary is out to lunch. It might be a little slower, but it won’t stop service. So don’t touch those glue records if you’ll be bringing the old IP back into service in the future.
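That built-in reliability is just ordinary zone delegation doing its job. A typical zone head looks something like this (names and timer values are illustrative, not taken from your setup):

```
; example.com zone -- why a dead primary doesn't stop service
@  IN SOA ns1.example.com. hostmaster.example.com. (
        2008060401 ; serial - bump it and the secondary pulls the new data
        3600       ; refresh - how often the secondary checks for changes
        900        ; retry
        1209600    ; expire - secondary keeps answering this long with the primary down
        3600 )     ; negative-caching TTL
   IN NS  ns1.example.com.
   IN NS  ns2.example.com.   ; resolvers try both; one being down just means a retry
```

Note the expire value: a secondary will typically keep serving the zone for days or weeks after losing contact with the primary, which is exactly why glue-record surgery is rarely necessary.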
It sounds like you’ve already followed the docs on a hold and forward backup MX server, and the DNS Slave auto-configuration, so you should only need to worry about getting the primary back online (either new hardware or recovering the old server) as quickly as possible…so just worry about getting the data in place on the new server and the secondary DNS server pointing to the new IP(s).
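For reference, the hold-and-forward arrangement is just two MX records at different priorities (hostnames here are examples, not your actual configuration):

```
; mail keeps flowing while the primary is being rebuilt
example.com.  IN MX 5   mail.example.com.    ; primary mail server
example.com.  IN MX 10  backup.example.com.  ; queues mail, forwards when primary returns
```

Sending servers fall back to the higher-numbered MX automatically, so mail queues on the backup rather than bouncing during the outage.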
The only unavoidable negative in this kind of deployment is that you almost certainly will lose (some) data. Assuming daily backups, the time between your last backup and the failure could mean nearly 24 hours worth of data lost to the void. Averages tell us that it’ll usually be less than this (the mean would be about 12 hours), of course, but it is an unavoidable bit of data loss…unless the disk isn’t what failed, or the data can otherwise be recovered.
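The arithmetic is simple enough to do on a napkin, but here it is spelled out (assuming failures land uniformly at random within the backup interval):

```shell
# Back-of-the-envelope data-loss window for periodic backups
interval_hours=24                  # daily backups
worst_case=$interval_hours         # failure just before the next backup runs
mean=$(( interval_hours / 2 ))     # expected loss: half the interval
echo "worst case: ~${worst_case}h, expected: ~${mean}h of lost data"
```

Halving the interval halves both numbers, which is the usual argument for moving from daily to twice-daily (or hourly) backups once the data matters enough.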
In short, don’t be overly concerned about DNS–it’s the most reliable part of your infrastructure, by default. User data is the hard part.