Pingdom is an organization focused on system reliability and uptime, so it's not surprising that they dug into the Google Apps service level agreement and found a shocker:

Gmail could be unavailable for more than 21 hours in a day, and Google could still tell you that according to their SLA, the service has had 100% uptime.
How? Well, Google says,
"'Downtime Period' means, for a domain, a period of ten consecutive minutes of Downtime. Intermittent Downtime for a period of less than ten minutes will not be counted towards any Downtime Periods."
Can you imagine? I remember when I was in IT many years ago... it's not like users waited ten minutes to see if a system would come back up before calling the help desk. In this always-on era, downtime is even less tolerated. Yet Google's SLA is written to say that a nine-minute outage, at any time, is no big deal.
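To make the loophole concrete, here is a minimal Python sketch of the clause as quoted. It rests on assumptions that are not in Google's SLA: availability is sampled once per minute, and an outage of at least ten consecutive minutes is counted for its full length while anything shorter is ignored. The names `daily_downtime` and `DOWNTIME_PERIOD_MIN` are hypothetical. The worst case is nine minutes down, one minute up, repeated all day.

```python
# Hypothetical model of the SLA clause quoted above (not Google's actual method):
# an outage only counts once it lasts DOWNTIME_PERIOD_MIN consecutive minutes.
MINUTES_PER_DAY = 24 * 60
DOWNTIME_PERIOD_MIN = 10  # "a period of ten consecutive minutes of Downtime"


def daily_downtime(status):
    """Given one boolean per minute (True = service down), return
    (minutes actually down, minutes counted under the 10-minute rule)."""
    actual = sum(status)
    counted = 0
    run = 0  # length of the current consecutive outage, in minutes
    for down in list(status) + [False]:  # trailing False closes a final outage
        if down:
            run += 1
        else:
            # Assumption: a qualifying outage is counted for its full length.
            if run >= DOWNTIME_PERIOD_MIN:
                counted += run
            run = 0
    return actual, counted


# Worst case: down 9 minutes, up 1 minute, all day long (144 cycles).
pattern = ([True] * 9 + [False]) * (MINUTES_PER_DAY // 10)
actual, counted = daily_downtime(pattern)
sla_uptime = 100 * (1 - counted / MINUTES_PER_DAY)
print(f"actually down: {actual} min ({actual / 60:.1f} h)")
print(f"counted by SLA: {counted} min, reported uptime: {sla_uptime:.1f}%")
```

Under this model the service is down 1,296 minutes (21.6 hours) while the SLA counts zero minutes and reports 100% uptime. A single ten-minute outage, by contrast, would be counted in full, which is why the worst-case pattern stops one minute short.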

That sure makes the claim of a 99.9% uptime SLA more than a bit specious. What other corners are cut to get to that magical $50/user/year price?

Link: Royal Pingdom: Google Apps SLA loophole allows for major downtime without consequences, with more discussion at TechCrunch (Thanks, Rob)


  1. Jim Casale http://www.jimcasale.net

    Ed, you'd be surprised at which companies are actually piloting Gmail ;-) One can only hope they figure out it's not the Holy Grail. I just hope Google doesn't pull an MS move and promise them the moon even though it can't deliver, and the higher-ups believe it.

  2. David Hoff http://www.cloudsherpas.com

    The best way to gauge what's going on is to monitor the service and see for yourself. We monitor Google (and several production Notes servers) from multiple locations around the globe every minute, and Google's uptime has been excellent.

    Yes, they had some downtime, but it was no worse than the issues I ran into with a MIME conversion error in 8.0.1, or any other server bugs that have hit over the years.

  3. Ed Brill http://www.edbrill.com

    @2 Well, you are certainly the expert here, given your company profile. But as a consumer-only user of Google mail, I have lately seen issues like blank screens when I tab over to the Gmail tab, frequent "oops, a system error occurred" messages, and at least two flat-out downtime periods in the last month.

    I can't find anything in your business profile about Notes server monitoring, and I'm not sure which MIME error in 8.0.1 you're referring to, but I appreciate you sharing your experience with Google.

  4. Volker Weber http://vowe.net/about

    What has been the uptime of your Westford Domino servers in 2008 so far?

  5. Ed Brill http://www.edbrill.com

    @4 you mean the "early deployment center"?

    Westford Server Uptime for November

    Median uptime: 99.95%
    Clustered mail: 99.9676%

  6. Volker Weber http://vowe.net/about

    Not for November. For 2008.

  7. Ben Poole http://benpoole.com

    I don't get the pissing contest re uptime. Sure, when our old OS/2 Notes 3 server (which hosted all our apps and email at the time) went down two or three times a day, that was annoying. But that was 1995.

    Pretty much anything nowadays is "good enough," and besides, *it's just email*, it isn't (shouldn't be?) *that* important in this day and age.

    I've used Google for email in one capacity or another for a few years now, and it's just fine -- to the point that I now use Google Apps for Domains, which, again, is perfect for my needs, dodgy uptime stats notwithstanding.

  8. Ed Brill http://www.edbrill.com

    @6 the question is irrelevant. We don't offer our beta testbed servers to customers as a production environment. Some months the beta code hasn't run at 100%.

    @7 "Pretty much anything nowadays is 'good enough'"? Really? There's a difference between *wanting* e-mail to not be mission-critical and it actually continuing to *be* mission-critical. I think if you look at the worst case (on the Pingdom blog), it's a pretty scary potential 21-hours-a-day downtime, and I don't think that's "good enough" at all.

  9. Ben Poole http://benpoole.com

    21 hours a day is crazy. Which is why it's a notional number plucked from reading between the lines in a spotty SLA, nothing more.

    If it actually happens then I'll think again; for now such quotes are barely more than FUD. I'm still not going to go to the expense and hassle of hosting my own infrastructure -- Domino or otherwise -- just for email, and there are thousands of other companies out there which take the same view.

    I appreciate that for the traditional enterprise Google and its ilk aren't valid options, but that doesn't mean they're inappropriate for business *full-stop*: this nit-picking doesn't really achieve anything, and meanwhile thousands of smaller outfits are walking away from the original big players, "21 hours" notwithstanding.

  10. Volker Weber http://vowe.net/about

    Irrelevant. Ah, OK, I did not expect them to be THIS bad.

    Anyway, November looks like three nines to me. That is what Google Apps promises. Four nines is pretty hard to achieve, five nines is the holy grail.

    Judging from the number of times Westford personnel has been unable to access their mail, I would be surprised if it made three nines YTD, let alone every month of the year. Actually there have been months where it would not be two nines.

    Does it matter? You tell. Three nines for $50 a year is actually pretty good. Is there anybody else out there with this service level?

    vowe.net does get over three nines every month of the year, edbrill.com definitely not.

  11. Volker Weber http://vowe.net/about

    That should read "vowe.net does NOT get".

  12. Ed Brill http://www.edbrill.com

    I get it, one red herring didn't work so you throw another distraction in. Vowe.net and edbrill.com are not "mission-critical" mail servers, and in my case I certainly wouldn't ask PSC to run it on a UPS, cluster the server to a remote site, etc. The interjection of their uptime is irrelevant.

    Going back to the Westford example: again, this is a data center where IBM tests early releases, not where IBM charges external customers for a service at $50/user/year. The SLA we get internally from our sometimes-beta, real-world test environment is completely irrelevant to the discussion here, which is about Google claiming a 99.9% SLA while giving itself a giant loophole: outages shorter than ten minutes don't count against it. Uptime should be uptime, no?

  13. Ben Langhinrichs http://www.GeniiSoft.com/showcase.nsf/GeniiBlog

    This is one of those issues that are only easy to quantify by giving an extreme example, but then people quibble with the extreme example rather than the core problem. The problem with the SLA is not a potential 23 hours of downtime - nobody expects that and it is incredibly unlikely. The problem is spotty service, which can plague systems. Spotty service would mean the mail was on and off all day, seldom for more than a couple of minutes, but often enough that it would drive users crazy. I have experienced such issues with different services (never my business mail) over the years, such as my cable internet service one year, and it was maddening. My daughter had this issue with her school e-mail for a semester, and she stopped using it entirely. It probably didn't go out more than three or four times a day, and probably not more than ten minutes at a time, but it seemed to always happen when she needed e-mail.

    In other words, the extreme example cited is not really a concern, but Ed is absolutely correct (in my opinion) in citing this as a serious potential issue with the Google SLA. Whether or not it is a likely issue is different, but I wouldn't run a company of more than about five people on an SLA like that.

  14. Ben Poole http://benpoole.com

    All I know is, I've struggled through multi-page SLAs from all sorts of companies that ultimately have meant naff-all, and make the reader's eyes bleed into the bargain. (And don't get me started on support agreements...)

    If you're a larger company looking to trial Google Apps (and bet your bottom dollar that these beasts *do exist*) chances are you're negotiating a separate SLA in any case. For the rest of us, we have this wee one pager, and if Google renege on it, we walk.

    There's yada yada about how catastrophic it is that *theoretically* there could be 21 hours of outage in one day. Well, that's not in anyone's interest -- least of all Google's, so why worry about it? There are many *actual* threats to modern business that deserve more attention.

  15. Volker Weber http://vowe.net

    Ed, we know Westford goes offline for DAYS, not minutes or hours. Mail for IBM engineers probably is not "mission critical".

    vowe.net and edbrill.com are not red herrings. They are just proof points that it's not about a certain product, but the whole architecture. It's pretty hard to achieve three nines month after month. 30 days are 720 hours. Divide that by 1000 and you get 43 minutes. If anything goes wrong, in a network switch, the power, the server, whatever, you are past that margin. With four nines you are at four and a half minutes, at five nines it's a mere 26 seconds.

    Three nines is pretty good for email and PIM. And not necessary for your site or mine.
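Volker's arithmetic in comment 15 generalizes to any target: allowed downtime is the length of the period times (1 − availability). A minimal sketch, assuming a 30-day month as in his comment; the function name `allowed_downtime_minutes` is hypothetical.

```python
# Downtime allowed per period for a given availability target.
HOURS_PER_MONTH = 30 * 24  # 720 hours, the 30-day month used in comment 15


def allowed_downtime_minutes(availability_pct, hours=HOURS_PER_MONTH):
    """Minutes of downtime permitted over `hours` at `availability_pct` percent."""
    return hours * 60 * (1 - availability_pct / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = allowed_downtime_minutes(pct)
    print(f"{label} ({pct}%): {minutes:.2f} min = {minutes * 60:.0f} s")
```

This gives roughly 43.2 minutes, 4.3 minutes and 26 seconds per month. The same function puts a 99.7% target at 2.16 hours a month (`allowed_downtime_minutes(99.7) / 60`).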

  16. Ed Brill http://www.edbrill.com

    @15 I've never had a multi-day outage on my mail. Again, these are not designed to be the fully redundant, hardened data center that IBM's other 300,000 mailboxes run on -- those mail servers are not down for hours or days, period. Westford is a red herring because it is a development lab, the "Early deployment center", neither staffed nor built for best-in-class uptime, but rather to give us a place to test our products in real-world conditions. So it still is not relevant to discuss Westford in the context of looking at the Google SLA. And Google can only achieve three nines by fudging their SLA terms. That's my point.

  17. Ed Brill http://www.edbrill.com

    Maybe the real point should be this -- I can name several Domino customers who deliver 100% availability to their users, month in and month out. Designing for availability can absolutely be done with Domino...but apparently not with Google.

  18. Volker Weber http://vowe.net/about

    Ed, I think you can cluster newer versions of Domino with older ones, if you want reliability. In this case you don't want to.

    There will be plenty of people served well with three nines of availability. It's a choice. If IBM can provide that for $50 a year, they are good competition. If they can deliver better than three nines for that money, they have an edge.

    You can only deliver 100% availability in hindsight. You cannot guarantee it. Ask anybody in HA. I have had 100% availability YTD so far. But that's not HA.

  19. Ben Poole http://benpoole.com

    "Designing for availability can absolutely be done with Domino... but apparently not with Google."

    And no-one's disputing that. But clustered Domino environments cost significantly more than $50 per user, per year. You pays yer money, you takes yer choice. People whingeing about the odd outage in Google Apps are entirely missing this point and are NOT comparing like-with-like. Google Apps is *not* for everyone, but it works well enough for a lot of people.

    What makes it worse is that many of these individuals seem to be from an IT background: no doubt the same people who, like me, constantly moan about businesses and / or project managers paying no heed to the project triangle.

  20. Ed Brill http://www.edbrill.com

    @18 But Google can't provide three nines for $50 a year...they've given themselves a loophole to deliver far less. So the competitive situation you draw up is, yes, a red herring.

  21. Volker Weber http://vowe.net/about

    With infinite money you can build anything. Including your own mail server. Google Apps is $2,500 for 50 users, or $12,500 for 250 users.

    What would it cost to provide (only) three nines with Domino for that population? Without quota on the mailbox. Or a ridiculously high one.

  22. Ben Langhinrichs http://www.GeniiSoft.com/showcase.nsf/GeniiBlog

    Volker - I'm a bit confused by your point. Ed isn't arguing about whether Google Apps is a "good deal" or not, as that isn't his business. He is, rightly in my estimation, calling attention to part of what you are or are not getting for your money. We live in a world where a lot of attention is paid to the price of things, but not nearly enough to the real cost of those same things. In the US, that means people buying a VCR or flat screen TV from Walmart even if it winds up driving their local store with better service out of business. If I understand Ed's point, it is that it is easy to point at Google Apps and say, "Oh, look, $50 a year!", and not necessarily look too closely at how they manage that price. SLAs are often ignored, but service availability is one of those "premium" items you get that cost extra, along with security for your data and integration with a rich client platform, that make Notes/Domino a different value proposition. I don't expect to pay $50/year for my Domino service, but I do expect to get more value. It is reasonable to point out one of those "hide in plain sight" costs such as an SLA that sounds better than it is. You can refer to three nines or four nines or whatever you like and say it over and over, but it doesn't stop the fact that Google isn't really promising anything of the kind.

  23. Tony Lee

    Vowe is arguing about the number of nines of various services.

    Ed is arguing about the Google SLA.

    They would probably agree if they argued about the same thing.

    @22 That's funny... Google as the Walmart of internet apps. Actually maybe that's not funny.

  24. Dan Coprher

    If the SLA is 99.7 % the calculation formula is 30 d x 24 h = 720 hours / 100 = 13.8 hours. So the service can be down over 13 hours in one month. Please check your contract with G-mail!

    Have a nice day!

  25. Nathan T. Freeman http://nathan.lotus911.com

    "You can only deliver 100% availability in hindsight. You cannot guarantee it."

    Of course you can guarantee it. "100% availability or that month is free." See how easy that was?

    Talk about a red herring.

  26. Nathan T. Freeman http://nathan.lotus911.com

    "If the SLA is 99.7 % the calculation formula is 30 d x 24 h = 720 hours / 100 = 13.8 hours. So the service can be down over 13 hours in one month"

    Huh? (720 hours / 1000)*3 = 2.16 hours.

  27. Flemming Riis

    What's the default SLA on IBM's new 1000+ hosting?

  28. Ed Brill http://www.edbrill.com

    @27 99.5% is the base offering and 99.9% is the high-availability offering.

  29. Flemming Riis

    "99.5% is the base offering and 99.9% is the high-availability offering."

    Thanks.

  30. Ed Brill http://www.edbrill.com

    I'm sure this is unrelated, but I just got the following on my personal gmail account:

    Temporary Error (503)

    We’re sorry, but your Gmail account is temporarily unavailable. We apologize for the inconvenience and suggest trying again in a few minutes.

    I wonder how many minutes?!?!? :)

  31. Ben Poole http://benpoole.com

    Now you're just being silly! Chortle.

    (BTW I had three Lotus Notes "RBDs" in a row this morning -- but it's OK, 'cos I don't use Notes mail :-D )

  32. Charles Robinson http://www.cubert.net

    I'm sorry, I forgot what the question was.

  33. Daniel Silva http://www.dansilva.org

    @15 Westford does *NOT* go offline for DAYS, as you claimed. Ever. Not as far as I can remember, at least. Mail is clustered.