The Objectives of 3C
1 - Eliminate failure modes
2 - Become smarter as an organization
Put another way, 3C helps our team become smarter with each major incident. In the best case, we completely eliminate the failure mode and all risk of a repeat incident. At the very least, everyone on the team knows how to identify and correct the problem quickly if it happens again, which still leaves us smarter than before. The 3C report is designed to facilitate the conversations and analysis required to capture the critical lessons.
How the 3C Process Works
1) For each incident whose impact meets a specified threshold, the 3C owner drives the analysis required to complete the 3C report. In some cases, the 3C owner analyzes the incident and completes much of the report independently. For most major incidents, the 3C owner works with a group of people from different teams to fully establish the cause and the corrective actions.
2) Within 24 hours of the incident, the 3C owner publishes the initial report to our engineering team and the 3C team members. In many cases, the initial report leads to an email thread with additional information, questions, and recommended actions.
3) Once the 3C owner is satisfied that the report is complete (with guidance from the team leads), the 3C owner posts the final report to the 3C report library.
Some Important Definitions
3C team - The cross-functional team of engineers and users involved in the incident and 3C analysis.
Complicating factors - Factors that led to a greater impact or extended the incident's duration. This may include incorrect detection or actions taken while attempting to contain the incident.
Containment instructions (reactive) - Next time this happens, what steps do we take to ensure a quick and safe recovery?
Early detection (predictive) - Next time this happens, how do we detect the problem as early as possible? What are the unambiguous indicators?
Elimination (preventive) - How do we eliminate this failure mode and ensure that this incident never occurs again? Consider manual processes and system changes we can make to eliminate all risk of a repeat incident.
3C owner - Typically assigned by a team lead, the 3C owner works with the 3C team to characterize the incident, establish cause, and determine corrective actions. The 3C owner can be someone close to the incident and containment, but this is not required. The 3C owner must be able to:
- Identify the right 3C team members
- Lead the 3C team to analyze causes thoroughly, differentiating between symptoms, root causes, and complicating factors
- Lead the 3C team to identify effective containment instructions, early detection methods, and elimination actions
- Drive the 3C analysis to closure and document the findings in the 3C report
Root causes - The underlying causes that led to the incident. For major incidents, there are often multiple causes. Dig past symptoms with the 5 Whys approach.
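To make the definitions above concrete, here is a minimal sketch of how a 3C report could be represented as a data structure. This is an illustration only, not the actual report template: the class name, field names, and helper methods are all hypothetical, and the only rule taken from the process itself is the 24-hour window for the initial report.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ThreeCReport:
    """Hypothetical container for one 3C report; all names are illustrative."""

    incident_summary: str
    incident_time: datetime
    owner: str                                            # the 3C owner
    team: list[str] = field(default_factory=list)         # the 3C team
    root_causes: list[str] = field(default_factory=list)
    complicating_factors: list[str] = field(default_factory=list)
    containment_instructions: list[str] = field(default_factory=list)  # reactive
    early_detection: list[str] = field(default_factory=list)           # predictive
    elimination_actions: list[str] = field(default_factory=list)       # preventive

    def initial_report_due(self) -> datetime:
        # The process calls for an initial report within 24 hours of the incident.
        return self.incident_time + timedelta(hours=24)

    def missing_sections(self) -> list[str]:
        """Return the names of analysis sections that are still empty.

        Complicating factors are left out on purpose, since an incident
        may not have any.
        """
        sections = {
            "root_causes": self.root_causes,
            "containment_instructions": self.containment_instructions,
            "early_detection": self.early_detection,
            "elimination_actions": self.elimination_actions,
        }
        return [name for name, items in sections.items() if not items]
```

A helper like missing_sections is only a checklist aid for the 3C owner. Whether a report is actually complete remains a judgment call made by the owner with guidance from the team leads.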
Some acknowledgments:
- 3C is a modified version of a root cause report used at Dell and other companies – with improvements to address report gaps and the lack of an effective team learning process.
- Thanks to Sean Trainor, a former co-worker from my years at Dell, who shared with me his effective approach to system reliability (reactive, predictive, preventive).