
It is a Tuesday afternoon. Your customer-facing chatbot, the one that handles 60 percent of your tier-one support tickets, starts generating responses that contain fragments of other customers' personal data. Your support team notices within an hour. Your engineering team confirms the issue within two. But it takes your leadership team six hours to decide whether to shut the system down, because nobody knows who has the authority to pull the plug on a revenue-critical AI system.
By the time the chatbot goes offline, screenshots of the leaked data are circulating on social media. You have a full-blown crisis, and the root cause was not the model failure itself. It was the absence of a plan for when the model failed.
Every mature engineering organization has an incident response plan for server outages, security breaches, and data leaks. Almost none have an equivalent plan for AI-specific failures. This is a critical gap, because AI failures are fundamentally different from traditional software failures.
A crashed server produces a clear error state. An AI failure can be subtle and ongoing: a recommendation engine slowly drifting toward biased outputs, a fraud detection model gradually increasing its false positive rate, a content moderation system quietly suppressing legitimate speech. These failures do not trigger alerts in your existing monitoring stack because the system is technically "working." It is just working badly.
Your incident response plan needs to address three distinct failure modes. Sudden failures are the easiest to detect: a model crashes, returns errors, or produces obviously nonsensical outputs. These are handled by existing reliability engineering practices.
Drift failures are more dangerous. The model's performance degrades slowly over weeks or months as the input data distribution shifts away from the training data. By the time anyone notices, the accumulated damage to customer trust or decision quality can be substantial.
Adversarial failures are the most complex. A bad actor deliberately manipulates inputs to force the model into producing harmful outputs. This requires detection capabilities that most organizations have not built, and response protocols that cross the boundary between engineering and security teams.
A functional AI incident response plan requires four components. First, a classification matrix that maps failure types to severity levels with pre-defined escalation paths. Not every anomaly is a crisis. Your team needs clear criteria for when to investigate, when to escalate, and when to shut down.
Second, authority protocols. Decide now, not during the incident, who has the power to take an AI system offline. If your revenue-critical model starts misbehaving on a Saturday night, the on-call engineer needs to know they have explicit authorization to shut it down without waiting for three levels of management approval.
Third, a communication template. Under the EU AI Act, providers of high-risk AI systems have reporting obligations for serious incidents. Having pre-drafted notification templates for regulators, affected users, and internal stakeholders saves critical hours during a crisis.
Fourth, a post-mortem process specifically designed for AI failures. Traditional post-mortems focus on "what code broke." AI post-mortems need to ask harder questions: what data caused the drift, what monitoring gap allowed it to reach production, and what organizational incentive prevented someone from flagging the issue earlier.
Building and maintaining an AI incident response capability has real costs. It requires dedicated training for on-call staff, regular simulation exercises, and investment in AI-specific monitoring tools that go beyond traditional application performance management. Your engineering team will resist the additional on-call burden, and leadership will question the ROI of preparing for failures that have not happened yet.
The counter-argument is simple: the cost of an unmanaged AI incident in a regulated industry is measured in regulatory fines, litigation, and permanent reputational damage. The cost of preparation is measured in engineering hours and a few tabletop exercises per quarter.
Run a simple test this week. Walk up to the engineer responsible for your most critical AI system and ask: "If this model started producing harmful outputs right now, what is the step-by-step process to contain it?" If the answer takes more than thirty seconds or involves the phrase "I'd probably," you do not have a plan. You have a hope. Start building the playbook today.