Archive for September, 2008

3C: Characterize, Cause, Corrective Action

Tuesday, September 9th, 2008

On The List Of Things That Don’t Bother Me Too Much

Someone on our team makes a mistake

On The List Of Things That Really, Really Bother Me

We didn’t learn from the mistake

 

The team made a few mistakes this week. We don’t sweat it too much: mistakes are going to happen, especially as we work through some big responsibility transitions on our team.

It turns out that mistakes and system problems offer rich opportunities for your team to learn about your systems and how to improve reliability and performance.

Don’t beat your people up over mistakes and don’t let them play the blame game — use 3C (Characterize, Cause, Corrective Action) to facilitate the open discussion, analysis, and action required to prevent a repeat incident.

The 3C process, report outline, and key definitions are described in separate posts.

-Donny

3C Process and Definitions

Tuesday, September 9th, 2008

The Objectives of 3C

1 - Eliminate failure modes

2 - Become smarter as an organization

Put another way, 3C helps our team become smarter with each major incident. In the smartest case, we completely eliminate the failure mode and all risk of a repeat incident. At the very least (but still smarter), everyone on the team knows how to identify and correct the problem quickly when it happens again. The 3C report is designed to facilitate the conversations and analysis required to capture the critical lessons.

 

How the 3C Process Works

1) For each incident whose impact meets or exceeds a specified threshold, the 3C owner drives the analysis required to complete the 3C report. In some cases, the 3C owner analyzes the incident and completes much of the report independently. For most major incidents, the 3C owner works with a group of people from different teams to fully establish cause and corrective action.

2) Within 24 hours of the incident, the 3C owner publishes the initial report to our engineering team and the 3C team members. In many cases, the initial report leads to an email thread with additional information, questions, and recommended actions.

3) Once the 3C owner is satisfied that the report is complete (with guidance from the team leads), the 3C owner posts the final report to the 3C report library.
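As a rough illustration of steps 1 and 2, here is a minimal Python sketch of the threshold check and the 24-hour deadline. The `Incident` type, the use of outage minutes as the measure of impact, and the 30-minute threshold are all assumptions made for the example; the process above only says "a specified threshold". The sketch also assumes the 24 hours are counted from the start of the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical value: the process only calls for "a specified threshold".
IMPACT_THRESHOLD_MINUTES = 30

# Step 2: the initial report is published within 24 hours of the incident.
INITIAL_REPORT_WINDOW = timedelta(hours=24)


@dataclass
class Incident:
    description: str
    started_at: datetime
    outage_minutes: int  # assumed measure of user impact


def requires_3c(incident: Incident) -> bool:
    """Step 1: only incidents at or above the threshold get a 3C report."""
    return incident.outage_minutes >= IMPACT_THRESHOLD_MINUTES


def initial_report_due(incident: Incident) -> datetime:
    """Step 2: deadline for the initial report (assumes the clock starts at incident start)."""
    return incident.started_at + INITIAL_REPORT_WINDOW
```

A team could use something like this to flag qualifying incidents and remind the 3C owner of the publishing deadline; the real threshold is whatever the team has agreed on.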

 

Some Important Definitions

3C team - The cross-functional team of engineers and users involved in the incident and 3C analysis.

Complicating factors - Factors that led to a greater impact or extended incident duration. This may include things like incorrect detection or actions taken while attempting to contain the incident.

Containment instructions (reactive) - Next time this happens, what steps do we take to ensure a quick and safe recovery?

Early detection (predictive) - Next time this happens, how do we detect the problem as early as possible? What are the unambiguous indicators?

Elimination (preventive) - How do we eliminate this failure mode and ensure that this incident never occurs again? Consider manual processes and system changes we can make to eliminate all risk of a repeat incident.

3C owner - Typically assigned by a team lead, the 3C owner works with the 3C team to effectively characterize, establish cause, and determine corrective actions. The 3C owner can be someone close to the incident and containment, but this is not required. The 3C owner must be able to:

- Identify the right 3C team members
- Lead the 3C team to analyze causes thoroughly, differentiating between symptoms, root causes, and complicating factors
- Lead the 3C team to identify effective containment instructions, early detection methods, and elimination actions
- Drive the 3C analysis to close and document the findings in the 3C report

Root causes - The underlying causes that led to the incident. For major incidents, there are often multiple causes. Dig past symptoms with the 5 Whys approach.

Some acks:
- 3C is a modified version of a root cause report used at Dell and other companies – with improvements to address report gaps and the lack of an effective team learning process.
- Thanks to Sean Trainor, a former co-worker from my years at Dell, who shared with me his effective approach to system reliability (reactive, predictive, preventive).

3C Report Outline

Tuesday, September 9th, 2008

Characterize
  - Incident data
  - Problem description
  - User impact (outage duration, etc.)
  - 3C owner
  - 3C team

Cause
  - Root cause
  - Complicating factors 

Corrective action
  - Containment instructions (reactive)
  - Early detection (predictive)
  - Elimination (preventive)
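
For teams that keep their reports in a tool rather than a document, the outline above could be sketched as a simple Python record. This is only one possible shape, not a prescribed format: the class name `ThreeCReport`, the field names, the `missing_sections` helper, and the choice to treat complicating factors as optional are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreeCReport:
    # Characterize
    incident_data: str = ""
    problem_description: str = ""
    user_impact: str = ""  # outage duration, etc.
    owner: str = ""        # the person driving the 3C analysis
    team: List[str] = field(default_factory=list)
    # Cause
    root_causes: List[str] = field(default_factory=list)
    complicating_factors: List[str] = field(default_factory=list)
    # Corrective action
    containment_instructions: str = ""  # reactive
    early_detection: str = ""           # predictive
    elimination: str = ""               # preventive

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty.

        Assumption: complicating factors are optional, since not every
        incident has them.
        """
        optional = {"complicating_factors"}
        return [
            name
            for name, value in vars(self).items()
            if name not in optional and not value
        ]
```

A check like `missing_sections()` is one way the 3C owner could confirm that nothing is blank before posting the final report to the library.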