Why recurring line stops rarely have one cause
Production incidents look simpler than they are. A single pressure drop can involve signal drift, a quality hold from two shifts ago, a maintenance note nobody indexed, and a fix that worked once but was never written down. Here is how to separate those layers before you act.
Ask a production engineer what caused last Tuesday's line stop and you will usually get a single-sentence answer. 'The injection unit pressure dropped.' 'The conveyor drive faulted.' 'A sensor went out of tolerance.' These answers are accurate — they describe the alarm that fired, the tag that went red, the moment the line halted. They are also profoundly incomplete.
Almost every significant production incident is a multi-cause event. The visible symptom — the thing that tripped the alarm — is the top of an iceberg. Below the surface are three, five, sometimes eight contributing factors that had been accumulating across shifts, systems, and process steps. None of them, alone, would have stopped the line. Together, they made the stop inevitable.
Understanding why single-cause narratives dominate — and what it takes to get past them — matters directly for anyone trying to reduce unplanned downtime. Not because the real causes are exotic, but because standard investigation methods are structurally unable to find them.
The anatomy of a line stop
Consider a realistic scenario: an injection molding machine on a high-volume automotive trim line faults on mold cavity pressure deviation at 06:42 on a Tuesday morning. The process engineer arrives, resets the fault, adjusts the holding pressure setpoint, and the line restarts within 25 minutes. The CMMS ticket reads: 'Pressure deviation fault — adjusted hold phase setpoint. Line running.'
What the ticket does not capture: barrel temperature Zone 3 had been trending 4°C low since 22:15 the previous evening, well within alarm limits but enough to affect melt viscosity over time. The mold cooling circuit on that cavity was flagged for a cleaning task six days earlier; the task was deferred because the machine was scheduled for a weekend run. A quality hold had been placed on output from that machine 11 hours earlier for flash defects, a different symptom of the same underlying pressure instability, but the hold lived in the quality system and the process engineer was not notified. The material lot changed at the start of the night shift; the incoming lot had a slightly higher melt flow index (MFI) than spec center, requiring a nominal process adjustment that was documented in the shift log but not linked to the machine's active recipe.
Each of those factors is real and verifiable. Each is in a system somewhere — the historian, the quality system, the CMMS, the shift log. None of them triggered an alarm that required action at the time. Together, they created the exact conditions for a pressure deviation fault to occur the moment a cold morning startup coincided with a marginal material lot in a machine with deferred maintenance.
Key observation
The 'cause' of the stop was not the pressure deviation. The pressure deviation was the last domino. The cause was the accumulation of marginal states across four independent systems, most of them in the 11 hours before the stop and one dating back six days, none of which was visible in aggregate until the line stopped.
How evidence layers accumulate
Production incidents follow a consistent temporal structure. The conditions that make a stop likely develop slowly — hours to days before the event. The symptoms that are visible in the data are often present well before the final alarm fires, but they fall below notification thresholds or appear in systems that nobody correlates at that time.
Alarm storms make this worse. When a machine does eventually fault, it often fires 15 to 40 alarms in rapid succession — the process chain reacting to the primary fault, safety interlocks activating, downstream equipment sensing the loss of feed. The operator and engineer arriving at the machine are confronted with a wall of red. The instinct is to find the first alarm or the loudest alarm and treat it as the cause. This is almost always wrong.
The Abnormal Situation Management Consortium has documented that alarm floods are associated with a significant share of major process incidents. In alarm management practice (for example ISA-18.2), a flood is commonly defined as more than 10 alarms in 10 minutes per operator. The alarm flood is not the problem; it is the symptom of a process that was already in a degraded state before the first operator response. The useful information was generated in the hour before the flood, not during it.
- Signal drift: process parameters moving slowly outside their historical operating range, within alarm limits, over multiple shifts
- Deferred maintenance: open work orders that reduced equipment margin without triggering active alerts
- Cross-system quality signals: holds or inspection flags that were not linked to the production machine generating them
- Undocumented process changes: setpoint adjustments, material changes, tooling swaps captured in shift logs but not in the machine's active context
- Environmental factors: temperature, humidity, utility pressure changes not in the machine's alarm schema
Each layer exists in a different system. The historian has the signal drift. The CMMS has the deferred work order. The quality system has the hold. The shift log has the material change. No single interface shows all four simultaneously, and no alarm fires when their combination becomes dangerous.
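Of these layers, signal drift inside alarm limits is the easiest to check mechanically. The following is a minimal sketch, assuming historian samples have already been exported as plain numbers. The function name, the 3-sigma band, the alarm limits, and the sample values are illustrative assumptions, not part of any historian product.

```python
from statistics import mean, stdev

def drift_status(baseline, recent, alarm_low, alarm_high, k=3.0):
    """Classify a recent window against the historical range and the alarm limits.

    baseline: samples from a known-good period (hypothetical historian export)
    recent:   samples from the window under review
    k:        band width in baseline standard deviations (assumed; tune per tag)
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    recent_mean = mean(recent)
    if not (alarm_low <= recent_mean <= alarm_high):
        return "alarm"   # the alarm system would already have caught this
    if abs(recent_mean - mu) > k * sigma:
        return "drift"   # outside the historical range, still inside alarm limits
    return "normal"

# Zone 3 barrel temperature in degrees C (made-up values for illustration)
baseline = [230.2, 229.8, 230.1, 229.9, 230.0, 230.3, 229.7, 230.0]
recent = [226.1, 225.9, 226.0, 226.2]
print(drift_status(baseline, recent, alarm_low=220.0, alarm_high=240.0))  # prints "drift"
```

The point of the middle branch is that a roughly 4°C shift never reaches the alarm limits, yet it sits far outside what the tag normally does. A real deployment would probably compare against a rolling baseline per recipe rather than a fixed list.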
Why 5-why analysis fails without complete context
5-why analysis is the most common structured investigation method in manufacturing, and it is genuinely useful — when applied to problems with clear, single-path causal chains. A bearing failed because it was not lubricated because the PM schedule was missed because the work order was not generated because the CMMS trigger was misconfigured. That chain is clean and 5-why reconstructs it reliably.
Multi-cause incidents break 5-why for a structural reason: the method assumes you are following one causal path. When the incident had four contributing causes, you need four parallel paths, each traced separately, with the interaction between them documented. Most 5-why implementations in manufacturing don't support branching. The investigation form has five rows.
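One way to picture the difference is to store the "whys" as a tree instead of five rows. The sketch below is illustrative only: the `Why` class and `causal_paths` function are hypothetical names, and the intermediate node about melt viscosity is an assumed grouping of the scenario's factors, not a finding from the article.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Why:
    statement: str
    source: str = ""  # system holding the evidence, e.g. "historian"
    causes: List["Why"] = field(default_factory=list)

def causal_paths(node, prefix=()):
    """Return every root-to-leaf path so each branch can be traced separately."""
    path = prefix + (node.statement,)
    if not node.causes:
        return [path]
    paths = []
    for cause in node.causes:
        paths.extend(causal_paths(cause, path))
    return paths

stop = Why("Cavity pressure deviation fault", "alarm log", [
    Why("Melt viscosity shifted", causes=[
        Why("Zone 3 barrel temperature trending 4 C low", "historian"),
        Why("Incoming lot MFI above spec center", "shift log"),
    ]),
    Why("Mold cooling circuit cleaning deferred", "CMMS"),
    Why("Flash-defect quality hold not routed to the process engineer", "quality system"),
])

paths = causal_paths(stop)
for p in paths:
    print(" <- ".join(p))
```

A five-row form can hold any one of these four paths. It cannot hold all of them, or the fact that two of them meet at the same intermediate state.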
The deeper failure is that 5-why is a post-hoc interview method. You ask people who were present what they observed and why they think it happened. This is useful for capturing tacit knowledge, but it is systematically biased by what the people involved can recall, what they were monitoring at the time, and — importantly — what they are willing to attribute to decisions within their own control. Signal drift that started on a previous shift is easy to miss when you are reconstructing from memory 24 hours later. A deferred maintenance task is uncomfortable to surface when the person answering the questions is the one who deferred it.
The evidence problem
5-why is only as good as the evidence available to the person asking the questions. If the investigator does not have the historian trend for the 18 hours before the stop, the quality hold record, and the open work order in front of them before the first 'why,' they are building a causal chain on incomplete foundations. The analysis will be coherent, and it will be wrong.
What separating the layers looks like in practice
Effective multi-cause investigation requires changing when evidence collection happens. Currently, most facilities collect evidence after the investigation starts — someone pulls the historian trend when they need to check a hypothesis. The evidence collection is reactive and selective, shaped by the hypothesis the investigator already holds.
The alternative is to accumulate evidence continuously and surface it at incident time, before anyone picks up a radio. This means: when the alarm fires, the investigation workspace already contains the historian context for that machine over the past 24 hours, the open work orders, the active quality holds, the last material change, and — if available — the most recent similar incident on that machine or line.
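A rough sketch of that assembly step follows, assuming each source system can be reduced to a flat list of timestamped records. The `Record` fields, the `assemble_evidence` function, the machine IDs, and all dates and times are hypothetical; real historian, CMMS, and quality-system integrations would look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Record:
    source: str            # "historian", "cmms", "quality", "shift_log"
    machine: str
    timestamp: datetime
    summary: str
    still_open: bool = False  # open work order or active hold

def assemble_evidence(records, machine, alarm_time, lookback=timedelta(hours=24)):
    """Group one machine's records by source system, oldest first.

    A record is kept if it falls inside the lookback window, or if it is
    older but still open (a deferred work order, an active quality hold).
    """
    start = alarm_time - lookback
    bundle = {}
    for rec in sorted(records, key=lambda r: r.timestamp):
        if rec.machine != machine or rec.timestamp > alarm_time:
            continue
        if rec.timestamp >= start or rec.still_open:
            bundle.setdefault(rec.source, []).append(rec)
    return bundle

alarm = datetime(2024, 5, 14, 6, 42)  # illustrative date
records = [
    Record("historian", "IMM-07", datetime(2024, 5, 13, 22, 15), "Zone 3 trending 4 C low"),
    Record("cmms", "IMM-07", datetime(2024, 5, 8, 9, 0), "Cooling circuit cleaning deferred", True),
    Record("cmms", "IMM-07", datetime(2024, 5, 1, 14, 0), "Closed work order, unrelated"),
    Record("quality", "IMM-07", datetime(2024, 5, 13, 19, 42), "Hold: flash defects", True),
    Record("shift_log", "IMM-07", datetime(2024, 5, 13, 22, 0), "Material lot change"),
    Record("quality", "IMM-09", datetime(2024, 5, 13, 23, 0), "Hold on another machine", True),
]

bundle = assemble_evidence(records, "IMM-07", alarm)
for source, items in bundle.items():
    print(source, [r.summary for r in items])
```

The `still_open` check matters more than it looks: a pure time window would drop the six-day-old deferred cleaning task, which is exactly the kind of item a post-hoc interview tends to miss.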
The investigation then starts from a different baseline. Instead of 'what happened,' the question becomes 'which of these contributing factors was primary, and what was the interaction sequence.' This is not just faster — it changes what the investigation finds. Factors that would never have been recalled or surfaced in a post-hoc interview are visible in the pre-assembled evidence set.
Separating the layers also means being explicit about what each layer is: the immediate trigger (the alarm), the proximate cause (the process state that generated it), the contributing factors (the accumulated conditions that enabled it), and the systemic factors (the operational patterns that allowed those conditions to develop without intervention). These are different questions with different answers, and conflating them produces the single-cause narrative that sends teams back to the same machine for the same incident three months later.
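As a small illustration, the four layers can be made an explicit checklist that an investigation record is tested against. This is a sketch under assumptions: the `Layer` enum and `missing_layers` helper are hypothetical names, and the ticket content is taken from the scenario earlier in this note.

```python
from enum import Enum

class Layer(Enum):
    TRIGGER = "immediate trigger"
    PROXIMATE = "proximate cause"
    CONTRIBUTING = "contributing factors"
    SYSTEMIC = "systemic factors"

def missing_layers(findings):
    """Return the layers with no finding, ordered from trigger to systemic."""
    covered = {layer for layer, _ in findings}
    return [layer for layer in Layer if layer not in covered]

# The original CMMS ticket only records the alarm and the reset.
ticket = [(Layer.TRIGGER, "Pressure deviation fault; hold phase setpoint adjusted")]

for layer in missing_layers(ticket):
    print("No finding recorded for:", layer.value)
```

Run against the ticket from the scenario, the check reports three empty layers, which is the single-cause narrative expressed as data.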
The goal is not a more complicated investigation. It is a more complete one — carried out faster, because the evidence is already assembled when the investigation begins.
Every repeated incident is evidence that the first investigation stopped one layer too shallow.