SLAs, availability, and expectations vs. guarantees
I recently wrote a Service Level Agreement-style document to set expectations about the kind of service one can expect from the Eclipse Webmasters and from the Eclipse servers. You can read it here. Writing it wasn’t a particularly pleasant experience, and the document is not groundbreaking. It’s actually quite boring.
I was then introduced to this blog posting, which, as I understand it, essentially states that SLAs are useless when they rely on a blanket metric we call ‘availability’. The author makes an interesting point. I prepared a fairly lengthy reply, but the blog software decided my comment was too long, so I figured I’d post it here.
Interesting article. I’d appreciate seeing an example of an effective SLA that you have authored, and that you are being held accountable for.
I mean, let’s face it — I’d love to guarantee that every time you load a web page, it will come in under 20 seconds, and that all your email will be in your Inbox within 15 minutes. But there are those things that, as you say, are not easy to measure. When will I get hit by the next DoS attack? When will the most important server decide to crash? When will an IT guy make a critical human error and take some key system down? If your systems are connected to the Internet, then you’re open to a whole world of unknowns.
So, how can I guarantee, with absolute certainty, that email will not go down for a full day on the last day of a quarter (which, I agree, could be extremely damaging to a business)? Perhaps we spend millions of dollars on redundant high-end hardware, redundant points of presence, more process for staff, and more staff (to help maintain all this hardware). Easy enough. But now I must raise my prices to afford this wonderful SLA, driving customers away to cheaper competitors, which is equally damaging to my business.
Then again, sometimes spending massive amounts of money in infrastructure isn’t even enough.
So the alternative is to write an “effective” SLA whose service expectation metrics are so forgiving that they make little more sense than the availability metric, and that certainly don’t match users’ expectations any more accurately.
Of course, the odds of a catastrophic failure in a properly executed IT infrastructure are very low, making it easy to set reasonable expectations. You *should* expect your email within 15 minutes. But in between expectation and guarantee is this thing those out-of-alignment IT people call ‘budget’.
Posted December 23rd, 2008 by Denis Roy in category: Uncategorized
2 Responses to “SLAs, availability, and expectations vs. guarantees”


Denis Roy Says:
December 23rd, 2008 at 3:06 pm
Ironically, I cannot seem to post any comments on the blogs.progress.com site where the article is hosted, and I can’t find a single Contact link to tell them that their site is not working as I expect it to.
Jeff McAffer Says:
December 29th, 2008 at 5:58 pm
Hey Denis, as I have mentioned personally, I quite like the SLA you guys put together and am sure it was not easy. IMO the blog post you mention does not say that services have to be expensive (i.e., always up or never failing at critical times) or guaranteed in a hard-to-attain way. It says that “availability” is only one raw operational metric for judging the suitability of a particular set of infrastructure for a particular intended use. This is the key: “intended use”.
This is where expectations come in. The SLA should help shape and set the user community’s expectations for the infrastructure. For example, saying that only released versions of software will be used in the production systems clearly sets a functional expectation (rightly IMHO). Users will quite likely agree with you there, and on the vast majority of the other functional expectations set out in the SLA.
On the operational side, the base “5-in-3” availability metric likely meets user expectations for non-workflow infrastructure (e.g., http://www.eclipse.org). For committer workflow infrastructure (e.g., CVS/SVN and Bugzilla) I’m not sure simple availability is enough to set expectations. For example, is Bugzilla “available” (for its intended use) if each query takes 2 minutes all day long? Does 5-in-3 imply a 36-second expectation for responses? Should users have any expectations? Similarly for CVS response time.
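To make the Bugzilla question concrete, here is a minimal Python sketch of how a plain up/down availability figure and a response-time-aware one can disagree. Everything in it is an assumption for illustration: the `Probe` record, the two function names, and the sample data are hypothetical, and the 36-second threshold is simply borrowed from the question above, not taken from the actual SLA.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One hypothetical monitoring check against a service."""
    ok: bool        # did the request complete successfully?
    seconds: float  # how long the response took


def raw_availability(probes):
    """Fraction of probes that succeeded, ignoring how long they took."""
    return sum(p.ok for p in probes) / len(probes)


def usable_availability(probes, max_seconds):
    """Fraction of probes that succeeded within max_seconds."""
    return sum(p.ok and p.seconds <= max_seconds for p in probes) / len(probes)


# Hypothetical day of hourly probes: every query succeeds, but takes 2 minutes.
slow_day = [Probe(ok=True, seconds=120.0) for _ in range(24)]

print(raw_availability(slow_day))           # 1.0
print(usable_availability(slow_day, 36.0))  # 0.0
```

By the first measure the service never went down; by the second it was never usable. Which threshold to use (if any) is exactly the kind of expectation the SLA would have to state.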
There are similar questions around data integrity. If “critical data” is lost, how long should it be until it is recovered?
None of this is to say that it has to cost more. It’s a matter of setting expectations.
In the end the SLA is a set of guidelines that facilitates understanding between the infrastructure team and the users. The cited blog implies that availability by itself is insufficient for that purpose in the operational context.