When the cloud went down! IT Fundamentals 101.
Written by John Virgolino
Thursday, 28 April 2011
Quite a few notable websites and services went down late last week because of the Amazon Web Services outage. There has been plenty of news and blog coverage of it. The question that has predictably surfaced is, “How does this affect the credibility of the cloud?” It doesn’t, and in all likelihood it won’t even affect Amazon in the long run. The right question is, “How prepared are cloud customers for when it happens again?”
The fact is that outages and downtime are a reality in IT. Even the seemingly invincible Google has been down for extended periods. That’s why following proven IT standards, not blaming the cloud vendors, should be the strategy of choice. The focus should be on IT fundamentals: those time-proven policies that, when followed, will protect your business regardless of where your data lives, be it in your server closet, at Amazon in Virginia, or on the moon.
What business owners and IT need to plan for is the level of availability the business needs. It doesn’t matter how small or large the network is. If the main source of applications and data fails, how long can the business operate without suffering a loss of money, reputation, and goodwill with customers? In the financial services industry, the answer is about an hour, if that. For most businesses, it hovers around one day.
Once that question is answered, a plan can be put together for avoiding downtime. The risk to applications and data is mitigated by having redundant services, usually spread out geographically, running either simultaneously or on standby. For instance, a server running in Amazon’s Virginia region last week should have had a similar server running in their West Coast region that could have been switched over to within minutes. According to Amazon, their regions are independent of each other, so this provides geographical isolation. Then again, they also said that the four availability zones in Virginia were isolated from each other, and that clearly didn’t work. The moral of that story is to have cautious faith in any service provider. To sleep better at night, it would be prudent to keep the main servers and data with one vendor and a redundant system with a completely different vendor in an alternate geographic location. This spreads the risk across multiple domains of possible failure.
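To make that concrete, here is a minimal health-check sketch in Python of the kind of monitoring that drives a region-to-region failover. The endpoint URLs, the three-strike threshold, and the fail_over_to_standby() placeholder are all assumptions for illustration; in a real setup the switch would be a DNS or load-balancer change at the provider.

# Minimal health-check sketch for a two-region setup (hypothetical hosts).
# It polls the primary endpoint and flags a failover to the standby when
# the primary stops responding for several checks in a row.
import time
import urllib.request
import urllib.error

PRIMARY = "https://app.example-east.com/health"   # assumed primary region endpoint
STANDBY = "https://app.example-west.com/health"   # assumed standby region endpoint

def is_up(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def fail_over_to_standby():
    """Placeholder: point DNS or the load balancer at the standby region."""
    print("Primary unreachable -- switching traffic to", STANDBY)

if __name__ == "__main__":
    failures = 0
    while True:
        if is_up(PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= 3:          # avoid flapping on a single blip
                fail_over_to_standby()
                break
        time.sleep(60)                 # check once a minute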
Redundancy mitigates risk, but the data still lives with third parties. Once that data is up in the cloud, there is a loss of control. It really doesn’t matter how many servers or how much redundancy Amazon or anyone else has. Google supposedly has over 200,000 servers and a sophisticated redundancy practice in place, but none of that matters if the office connection to the Internet goes down. Internet provider downtime is a lot more common than Amazon or Google going down, which is why having dual Internet connections with multiple providers is critical, especially when there is a cloud dependency.
Now, what if the cloud data becomes inaccessible for an extended period? Backup, another fundamental IT practice, is the appropriate answer. It’s your data, so make sure you have a copy of it somewhere you control completely. How often a backup is made depends on how much data the business can afford to lose. Most businesses are satisfied with one day of loss and do a daily backup. Storage is cheap these days, so keeping a local hard drive with a copy of the data from the cloud will help management sleep better at night. Cloud-to-cloud backups are also worth considering; most of them back up at least once an hour, so they would have the most current data and be the easiest to access. But on-premise backups should never be abandoned.
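As a rough illustration, here is a minimal nightly-backup sketch in Python. The paths, the 30-day retention, and the assumption that the day’s cloud data has already been exported to a local folder are all placeholders; the point is simply that a dated copy lands on a disk the business controls.

# Nightly-backup sketch (assumed paths): copy today's export of cloud data
# onto a locally controlled drive, one dated folder per day, then prune
# copies older than the retention window.
import shutil
import datetime
import pathlib

CLOUD_EXPORT = pathlib.Path("/data/cloud-export")   # assumed local export of cloud data
BACKUP_DRIVE = pathlib.Path("/mnt/backup")          # assumed locally controlled disk
KEEP_DAYS = 30

def run_backup():
    today = datetime.date.today().isoformat()
    target = BACKUP_DRIVE / today
    shutil.copytree(CLOUD_EXPORT, target)            # full copy; fails if target exists
    print("Backed up to", target)

def prune_old():
    cutoff = datetime.date.today() - datetime.timedelta(days=KEEP_DAYS)
    for folder in BACKUP_DRIVE.iterdir():
        try:
            folder_date = datetime.date.fromisoformat(folder.name)
        except ValueError:
            continue                                 # skip anything that isn't a dated folder
        if folder_date < cutoff:
            shutil.rmtree(folder)

if __name__ == "__main__":
    run_backup()
    prune_old()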
Having an active disaster recovery plan is prudent, but just as a backup is useless if periodic restore tests aren’t performed, the plan is useless if it is never exercised. The same goes for server availability: a planned simulation of downtime that tests failovers and the restoration of data assures management that the plan will actually work. How often a test is performed is up to the business, at least once a year, but quarterly is probably better.
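Here is one way such a restore test might look, as a sketch under assumed paths: restore a backup to a scratch directory and verify every file against the original by checksum. The pass/fail report at the end is what management actually needs to see.

# Restore-test sketch (assumed paths): compare a restored backup against the
# source of truth file by file using SHA-256 checksums.
import hashlib
import pathlib

RESTORED = pathlib.Path("/tmp/restore-test")     # assumed restore target
ORIGINAL = pathlib.Path("/data/cloud-export")    # assumed source of truth

def checksum(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore():
    mismatches = []
    for src in ORIGINAL.rglob("*"):
        if src.is_file():
            restored = RESTORED / src.relative_to(ORIGINAL)
            if not restored.exists() or checksum(src) != checksum(restored):
                mismatches.append(src)
    return mismatches

if __name__ == "__main__":
    bad = verify_restore()
    print("Restore test passed" if not bad else f"{len(bad)} files failed verification")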
Of course, there are costs associated with redundancy. The budget cap depends on how much the business would lose without a plan in place. If it costs $10,000 a day to be without systems and the business can sustain a $5,000 (half-day) loss, then the budget caps out at $5,000 to make sure systems aren’t down longer than that. Wall Street spends millions a year to make sure its systems stay up because it can lose millions in an hour. Don’t spend $20,000 on a disaster recovery solution if the net effect is avoiding a $10,000 loss.
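For clarity, here is that budget arithmetic spelled out in a few lines of Python, using the same figures as the example above.

# Worked example of the budget rule: the ceiling on disaster-recovery spend
# is the loss it actually prevents, nothing more.
daily_loss = 10_000          # cost of being down for a full day ($)
tolerable_downtime = 0.5     # downtime the business can absorb (days)

budget_cap = daily_loss * tolerable_downtime   # spend no more than the loss avoided
print(f"Budget cap for keeping downtime under half a day: ${budget_cap:,.0f}")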
This Amazon incident in particular seems to resonate as a wake-up call to many. We have crossed the line into cloud computing being a stated reality with many benefits, but like any dependency, we must follow the fundamentals, ask the right questions, and prepare for the worst. The lesson from last week is that preparation is the antidote, regardless of where the failure originates.
Cloud Computing for Small Business Explained
Amazon's Explanation of the outage