
Manoa Falls, Oahu
January 11th, 2009
Content Advisory: This Post May Be Offensive To ITIL Advocates
December 18th, 2008
Some observations on ITIL:
The Good
ITIL establishes a common language. Common nomenclature and shared understanding are valuable in a complex environment.
ITIL is comprehensive — it covers everything. The courses force you to think about all the stuff that an effective IT organization needs to do.
The Bad
Looking at ITIL from the perspective of effective organizational leadership, ITIL is overweight.

No, it’s worse than overweight: ITIL is obese — as in can’t-get-off-the-bed-without-heavy-equipment obese.
When teams discuss implementation of ITIL processes, I hear an incessant sucking sound in the back of my mind — the sound of company resources and energy being consumed by the ITIL machine, until nothing is left but a huge ITIL Jabba beast.
I can’t help but feel that ITIL will mark the beginning of the death spiral for more than one company. ITIL will help companies reach the point where more effort is spent holding the ship together than is spent powering the ship to its destination.
Don’t get me wrong — the procedures and practices described by ITIL range from important to critical. But implementation attempts will overburden companies in cost and bureaucracy.
ITIL will become known as a model of organizational bloat.
Just my two cents…
-Donny
Automatic Retries Considered Harmful
November 12th, 2008
With a nod to Edsger Dijkstra and apologies to Eric A. Meyer, auto retries and restarts are shiny rings that lead to pain and suffering!
Auto retries are frequently implemented as work-arounds for processes or requests that fail intermittently. Auto retries address the symptom rather than the cause. They chew up system capacity and hide problems. With every automatic retry you implement, you are adding invisible anchors to your system’s performance.
Resist the temptation to implement auto retries and auto restarts. Identify and address the root cause.
If you feel you *must* implement an auto retry, establish a performance counter that automatically alerts the right team when the auto retry count spikes or trends above normal levels.
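Here is a minimal sketch of that kind of counter, in Python. Everything in it is an assumption for illustration: the names `RetryMonitor` and `call_with_retry`, the threshold and window values, and the `alert` callback, which stands in for whatever paging or ticketing hook your team actually uses.

```python
import time
from collections import deque


class RetryMonitor:
    """Counts automatic retries in a sliding time window and alerts on spikes."""

    def __init__(self, threshold, window_seconds, alert, clock=time.monotonic):
        self.threshold = threshold            # retries tolerated per window (assumed value)
        self.window_seconds = window_seconds  # length of the sliding window
        self.alert = alert                    # hypothetical callback that notifies the right team
        self.clock = clock                    # injectable so the monitor can be tested
        self._events = deque()
        self._alerting = False

    def record_retry(self):
        """Record one retry and return the retry count inside the current window."""
        now = self.clock()
        self._events.append(now)
        # Drop retries that have aged out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        count = len(self._events)
        if count > self.threshold and not self._alerting:
            # Alert once per spike rather than on every retry.
            self._alerting = True
            self.alert(count)
        elif count <= self.threshold:
            self._alerting = False
        return count


def call_with_retry(operation, monitor, max_attempts=3):
    """Run operation, retrying on failure and counting every retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            monitor.record_retry()
```

The point of the design is that no retry is silent: every one lands in the counter, so the hidden cost shows up where someone can see it.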
-Donny
Bus Factor > 1
October 16th, 2008
Your org’s bus factor is the minimum number of people in your org a bus would have to hit to create a critical expertise gap.
If you have critical areas of expertise owned by one person (bus factor == 1), you’re carrying an irresponsible level of risk for your company.
Don’t fool yourself — this is a management problem. Fix it.
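If you want a rough number, here is one way to compute it, sketched in Python. It assumes you can list each critical area alongside the people who can actually cover it. The function name `bus_factor` and the sample areas and names are made up for illustration.

```python
def bus_factor(expertise):
    """Return (bus factor, weakest areas) for a mapping of area -> people who cover it.

    A gap opens in an area once everyone covering it is gone, so the bus
    factor is the smallest coverage count across all critical areas.
    """
    if not expertise:
        raise ValueError("no critical areas listed")
    coverage = {area: len(set(people)) for area, people in expertise.items()}
    factor = min(coverage.values())
    weakest = sorted(area for area, n in coverage.items() if n == factor)
    return factor, weakest


# Hypothetical team data, for illustration only.
team = {
    "billing": ["ana", "raj"],
    "dns": ["ana"],
    "deploy": ["raj", "li", "ana"],
}
factor, weakest = bus_factor(team)
print(factor, weakest)  # prints: 1 ['dns']
```

The list of weakest areas is the useful part: it tells you where to start cross-training.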
-Donny
note: Your prima donnas will resist efforts to address your org’s bus factor issues. They’re under the mistaken impression that their value to the organization is measured solely by their individual contribution. We need to chat about prima donnas in a separate post.
note 2: Not sure who coined the bus factor phrase. A co-worker heard the phrase at a Java development conference where it was used in a session on project success factors.
3C: Characterize, Cause, Corrective Action
September 9th, 2008
On The List Of Things That Don’t Bother Me Too Much
Someone on our team makes a mistake
On The List Of Things That Really, Really Bother Me
We didn’t learn from the mistake
The team made a few mistakes this week. We don’t sweat it too much — mistakes are going to happen. Especially as we work through some big responsibility transitions on our team.
It turns out that mistakes and system problems offer rich opportunities for your team to learn about your systems and how to improve reliability and performance.
Don’t beat your people up over mistakes and don’t let them play the blame game — use 3C (Characterize, Cause, Corrective Action) to facilitate the open discussion, analysis, and action required to prevent a repeat incident.
The 3C process, report outline, and key definitions are described in separate posts.
-Donny
3C Process and Definitions
September 9th, 2008
The Objectives of 3C
1 - Eliminate failure modes
2 - Become smarter as an organization
Put another way, 3C helps our team become smarter with each major incident. In the smartest case, we completely eliminate the failure mode and all risk of a repeat incident. At the very least (but still smarter) — everyone on the team knows how to identify and correct the problem quickly when it happens again. The 3C report is designed to facilitate the conversations and analysis required to capture the critical lessons.
How the 3C Process Works
1) For each incident whose impact meets or exceeds a specified threshold, the 3C owner drives the analysis required to complete the 3C report. In some cases, the 3C owner analyzes the incident and completes much of the report independently. For most major incidents, the 3C owner works with a group of people from different teams to fully establish cause and corrective action.
2) Within 24 hours of the incident, the 3C owner publishes the initial report to our engineering team and the 3C team members. In many cases, the initial report leads to an email thread with additional information, questions, and recommended actions.
3) Once the 3C owner is satisfied that the report is complete (with guidance from the team leads), the 3C owner posts the final report to the 3C report library.
Some Important Definitions
3C team - The cross-functional team of engineers and users involved in the incident and 3C analysis.
Complicating factors - Factors that led to a greater impact or extended incident duration. This may include things like incorrect detection or actions taken while attempting to contain the incident.
Containment instructions (reactive) - Next time this happens, what steps do we take to ensure a quick and safe recovery?
Early detection (predictive) - Next time this happens, how do we detect the problem as early as possible? What are the unambiguous indicators?
Elimination (preventive) - How do we eliminate this failure mode and ensure that this incident never occurs again? Consider manual processes and system changes we can make to eliminate all risk of a repeat incident.
3C owner - Typically assigned by a team lead, the 3C owner works with the 3C team to effectively characterize, establish cause, and determine corrective actions. The 3C owner can be someone close to the incident and containment, but this is not required. The 3C owner must be able to:
- Identify the right 3C team members
- Lead the 3C team to analyse causes thoroughly, differentiating between symptoms, root causes, and complicating factors
- Lead the 3C team to identify effective containment instructions, early detection methods, and elimination actions
- Drive the 3C analysis to close and document the findings in the 3C report
Root causes - The underlying causes that led to the incident. For major incidents, there are often multiple causes. Dig past symptoms with the 5 Whys approach.
Some acks:
- 3C is a modified version of a root cause report used at Dell and other companies – with improvements to address report gaps and the lack of an effective team learning process.
- Thanks to Sean Trainor, a former co-worker from my years at Dell, who shared with me his effective approach to system reliability (reactive, predictive, preventive).
3C Report Outline
September 9th, 2008
Characterize
- Incident data
- Problem description
- User impact (outage duration, etc)
- 3C owner
- 3C team
Cause
- Root cause
- Complicating factors
Corrective action
- Containment instructions (reactive)
- Early detection (predictive)
- Elimination (preventive)
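For teams that keep reports in a tool rather than a document, the outline maps onto a simple record. The sketch below is one possible shape in Python, not a prescribed format. The class name `ThreeCReport`, the field names, and the `missing_sections` helper are all assumptions. The `owner` field holds the person driving the report, the 3C owner role from the definitions post.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ThreeCReport:
    # Characterize
    incident_data: str
    problem_description: str
    user_impact: str                 # outage duration, etc.
    owner: str                       # the person driving the 3C analysis
    team: List[str] = field(default_factory=list)
    # Cause
    root_causes: List[str] = field(default_factory=list)
    complicating_factors: List[str] = field(default_factory=list)
    # Corrective action
    containment_instructions: List[str] = field(default_factory=list)  # reactive
    early_detection: List[str] = field(default_factory=list)           # predictive
    elimination: List[str] = field(default_factory=list)               # preventive

    def missing_sections(self):
        """Return the names of sections that are still empty, in outline order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A check like `missing_sections()` gives the owner a quick view of what is still open before the final report is posted. An empty `complicating_factors` list may be legitimate for a simple incident, so treat the result as a prompt for review, not a hard gate.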
Spotted This Pretty Creature Sunning On My Pool Equipment…
June 19th, 2008
Four Principles of IT Operations
June 7th, 2008
Effective IT operations are:
- Measurable: Consistent and reliable methods of measurement for the key processes
- Repeatable: Key processes are performed the same way each time: no ambiguity, no uncertainty
- Predictable: There is a high degree of confidence in the expected results
- Transparent: Visibility into where work is being performed; where work is stacking up; where problems are occurring; what changes are being made
-Donny
