It's quite ironic that I've been meaning to blog about this topic for about a month.  But here goes...

I've been thinking a lot lately about the concept of continuity and reliance.  In IT we often talk about reliability, but why is it so important?  Why do we expend so much energy on "disaster recovery" and planning for contingencies?

My entire career has been influenced by this notion of contingency planning.  In my first job out of university, my manager was almost obsessed with documenting procedures, training backup staff, and having disaster recovery plans.  This always struck me as odd, considering we were a startup unit and had no operational customers to worry about.  I thought the time was better spent on getting our projects going than on worrying about what would happen if they stopped.  But maybe that was just me.  His obsession resulted in a laughable 31-step written procedure for processing a software order, starting with

"1. Phone rings; pick up phone."
Thanks, Dan and Betty.

Fortunately or unfortunately, this early job experience has left its mark on me.  I quite often find myself going down very strange what-if paths:
* What if I leave my luggage at Changi airport, and for some reason, I can't make it back from my weekend in Indonesia?
* What if the man who does the voiceovers for VISA card commercials in the US retires? (His voice is clearly part of their "brand image")
* What if I miscalculated that budget?
* What if my hard disk crashes, my USB memory key has been demagnetized, and my backups are still in boxes -- all while I'm travelling? (OK, it happened once)
So, the question is: can you really anticipate all of life's -- ok, let's focus on IT's -- unexpected happenings?  How much time and money is contingency planning worth, and where is the point of diminishing returns?

I think one of the smartest things Lotus has ever done was to put the clustering engine into the standard Domino server offering.  Domino is still unique in the market in offering a contingency model that has no single point of failure -- nothing to rely on as the breakpoint or bottleneck.  Everything can be made fully redundant -- even across physical sites, operating systems, and versions of the software.

I've written about this a lot over the years, but maybe it's just my obsession.  After all, you never know when things are going to change completely and unexpectedly.
