When the Grid Goes Dark
Risk matrices can't answer causal questions. Causal models can.
The Bottom Line
- The Problem: Utility risk management uses likelihood × impact matrices that treat "transformer failure" as one risk. But "transformer failure on a radial feeder serving a hospital" and "transformer failure in a meshed network with redundancy" are completely different risks with different consequences — and the matrix cannot distinguish them.
- The Insight: The grid is a causal system: weather causes vegetation contact, which causes line faults, which trigger protection relays, which cause cascading outages, which create customer impact, which produce regulatory penalties. Each link in the chain is a point where intervention has a different cost and a different effect.
- The Action: Model the causal structure of failure propagation. Use interventional queries (Rung 2) to optimise maintenance and capital investment. Use counterfactual queries (Rung 3) to answer regulatory questions: "Would this outage have occurred if we had replaced that asset?"
1. The Problem
Risk matrices treat the grid as a spreadsheet. The grid is a network.
A major utility manages tens of thousands of assets — transformers, conductors, poles, switches, substations — spread across thousands of kilometres. Each asset can fail. When one fails, the consequences depend entirely on where it sits in the network and what else is happening at the time. A failed transformer in a meshed urban network triggers automatic switching and nobody loses power. The same failure on a radial rural feeder serving a hospital means a critical outage, an emergency generator deployment, regulatory scrutiny, and potential liability.
The risk matrix cannot distinguish these two scenarios. Both are "transformer failure." Both get the same likelihood score (based on asset age and condition) and the same impact score (based on replacement cost). The matrix produces a single number — say L=3, I=4, score=12 — that tells the planning team nothing about which failure to prevent first.
What the Matrix Misses
| Factor | Why It Matters | Matrix Score |
|---|---|---|
| Network topology | Radial vs meshed feeders determine whether a single failure causes a cascading outage or is absorbed by redundancy | Not captured |
| Load criticality | A hospital, water treatment plant, or data centre on the affected feeder changes the consequence by orders of magnitude | Averaged away |
| Concurrent conditions | Asset failure during a heat wave with peak load creates cascading overloads that the same failure in mild weather does not | Not captured |
| Protection coordination | Whether upstream relays isolate the fault or allow propagation depends on protection settings and topology | Not captured |
| Repair logistics | Failure in a remote area with no spare transformer and limited road access has a 72-hour restoration time; the same failure at an accessible substation takes 8 hours | Same score |
| Regulatory exposure | Failure during a performance-based rate case or after a deferred maintenance decision creates liability that identical failure at other times does not | Not captured |
Every row in this table represents a causal pathway — a chain of variables where the state of one determines the consequence of another. The risk matrix collapses all of them into a single score. The result is a maintenance plan that allocates budget by asset age and generic failure rates, not by the actual risk each asset poses to the system.
The risk matrix asks "How likely is this asset to fail?" and "How expensive is this asset to replace?" A causal model asks "If this asset fails, what happens to the system, and what can we do about it?" These are fundamentally different questions, and they produce fundamentally different investment plans.
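To make the contrast concrete, here is a minimal sketch of two transformers that share the same matrix score but pose very different system risks. Every number below (failure probability, customer counts, the chance that a failure becomes an outage) is an illustrative assumption; only the 72-hour restoration time and the L=3, I=4 score echo figures used earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class Transformer:
    name: str
    likelihood: int                    # 1-5 matrix score from age and condition
    impact: int                        # 1-5 matrix score from replacement cost
    annual_failure_prob: float         # assumed, identical for both units
    customers: int                     # customers downstream (assumed)
    outage_prob_given_failure: float   # topology: meshed absorbs, radial does not
    restoration_hours: float

    def matrix_score(self) -> int:
        """Classic likelihood x impact score."""
        return self.likelihood * self.impact

    def expected_customer_hours(self) -> float:
        """Expected customer-hours interrupted per year via the causal path
        failure -> outage (depends on topology) -> duration (depends on logistics)."""
        return (self.annual_failure_prob
                * self.outage_prob_given_failure
                * self.customers
                * self.restoration_hours)


meshed = Transformer("urban-meshed", 3, 4, 0.02, 5000, 0.05, 1.0)
radial = Transformer("rural-radial-hospital", 3, 4, 0.02, 1200, 1.0, 72.0)

for unit in (meshed, radial):
    print(unit.name, unit.matrix_score(), round(unit.expected_customer_hours(), 1))
```

Both units score 12 on the matrix, yet under these assumptions the radial unit carries several hundred times the expected customer-hours of interruption. The matrix has no field in which that difference could be recorded.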
2. The Causal Structure
The grid as a directed acyclic graph: from weather event to financial loss.
A utility's risk landscape is a causal system with well-understood mechanisms. Weather, asset condition, network topology, operational response, and regulatory context combine to produce outcomes. The causal graph makes these dependencies explicit and computable.
The Primary Causal Chain
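Weather → vegetation contact → line fault → protection relay operation → cascading outage → customer impact → regulatory penalty
But this is only one chain. The full graph includes parallel and interacting pathways: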
| Pathway | Variables | Key Interaction |
|---|---|---|
| Weather → Asset Failure | Wind speed, ice loading, ambient temperature, soil moisture (for pole foundations), lightning density | Weather doesn't just cause vegetation contact — it directly stresses conductors, poles, and transformers through thermal and mechanical loading |
| Asset Condition → Failure Probability | Age, maintenance history, inspection results, loading pattern, manufacturer, installation method | Condition mediates the weather-to-failure relationship: a well-maintained asset survives loads that destroy a degraded one |
| Topology → Consequence | Radial vs meshed configuration, number of tie switches, SCADA automation level, distributed generation capacity | The same initiating event produces different consequences depending on where it occurs in the network |
| Load → Cascading | Time of day, season, demand response capacity, behind-the-meter generation, EV charging load | Peak load at time of failure determines whether the fault cascades to adjacent feeders |
| Response → Duration | Crew availability, spare inventory, access constraints, storm coordination, mutual aid status | Operational readiness determines restoration time, which drives customer minutes interrupted (CMI) and regulatory exposure |
In a risk matrix, these are all separate "risks" with separate scores. In a causal graph, they are connected variables whose joint behaviour determines outcomes. The graph captures what the matrix cannot: that vegetation management, asset replacement, network reconfiguration, and operational readiness are not independent investments — they interact, and the return on one depends on the state of the others.
A risk matrix says: "Rank assets by risk score, replace from the top." A causal graph says: "Trace the failure propagation paths, find the nodes with the highest leverage, invest there." Sometimes the highest-leverage investment isn't replacing the worst asset — it's adding a tie switch that prevents the failure from cascading, or clearing vegetation on a critical corridor that feeds three hospitals. The graph reveals these interventions. The matrix cannot.
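One way to make the graph "explicit and computable" is to write it down as data. The sketch below encodes one plausible reading of the pathways described in this section as a cause-to-effect edge list, checks that it is acyclic with the standard library's `graphlib`, and enumerates failure propagation paths. The node names and the exact edge set are assumptions chosen for illustration, not a validated grid model.

```python
from graphlib import TopologicalSorter

# cause -> direct effects (illustrative edge set, an assumption)
EDGES = {
    "weather": ["vegetation_contact", "asset_failure"],
    "asset_condition": ["asset_failure"],
    "vegetation_contact": ["line_fault"],
    "asset_failure": ["line_fault"],
    "line_fault": ["relay_trip"],
    "relay_trip": ["cascading_outage"],
    "topology": ["cascading_outage"],
    "load": ["cascading_outage"],
    "cascading_outage": ["customer_impact"],
    "response": ["customer_impact"],
    "customer_impact": ["regulatory_penalty"],
}


def predecessors(edges):
    """Invert cause->effects into node->set of direct causes."""
    preds = {}
    for cause, effects in edges.items():
        preds.setdefault(cause, set())
        for effect in effects:
            preds.setdefault(effect, set()).add(cause)
    return preds


def causal_order(edges):
    """Order nodes so every cause precedes its effects.
    Raises graphlib.CycleError if the graph is not acyclic."""
    return list(TopologicalSorter(predecessors(edges)).static_order())


def paths(edges, source, target):
    """Enumerate every directed path from source to target."""
    if source == target:
        return [[target]]
    found = []
    for nxt in edges.get(source, []):
        for tail in paths(edges, nxt, target):
            found.append([source] + tail)
    return found


order = causal_order(EDGES)
weather_paths = paths(EDGES, "weather", "regulatory_penalty")
for p in weather_paths:
    print(" -> ".join(p))
```

Nodes that appear on many propagation paths, such as `line_fault` and `cascading_outage` here, are natural candidates when looking for high-leverage intervention points.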
3. Three Rungs Applied
From "what correlates with outages" to "would this outage have happened if we'd acted differently."
Rung 1 — Seeing: What Correlates with Outages?
This is where most utility analytics lives today. Asset health indices, failure rate curves, weather correlation models, predictive maintenance scores. These tools observe patterns: older transformers fail more often; outages increase during storms; feeders with deferred vegetation management have more faults.
All of this is useful — and all of it is Rung 1. It cannot distinguish correlation from causation, and it cannot tell you what happens if you intervene. For example:
| Rung 1 Finding | The Problem |
|---|---|
| "Feeders with more truck rolls have longer outage durations" | Does slow response cause long outages, or do long outages (caused by severe faults) require more truck rolls? The causal arrow could point either way, and the correlation alone cannot say which. |
| "Assets replaced under the accelerated programme fail less often" | Those assets were replaced because they were in good network positions with easy access. The programme selected for replacement ease, not risk reduction. |
| "Regions with higher vegetation management spend have fewer tree-related faults" | True — but how much of the spending is in the right corridors? Shifting $1M from a low-consequence corridor to a high-consequence one might reduce CMI more than adding $2M of total spend. |
Rung 1 can tell you which assets are failing and what conditions are associated with failure. It cannot tell you which investments will reduce failure consequences most effectively. For that, you need Rung 2.
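The accelerated-replacement example can be worked through with exact arithmetic. The counts below are hypothetical: they assume the programme mostly picked easy-access assets, and that easy-access assets fail less often regardless of replacement. The sketch compares the naive Rung 1 comparison with a stratified (back-door adjusted) comparison, under the further assumption that access is the only confounder.

```python
from fractions import Fraction

# (access, replaced) -> (number of assets, failures); hypothetical counts
COUNTS = {
    ("easy", True): (800, 16),
    ("easy", False): (200, 6),
    ("hard", True): (200, 16),
    ("hard", False): (800, 80),
}


def failure_rate(replaced, access=None):
    """Observed failure rate, optionally within one access stratum."""
    assets = failures = 0
    for (a, r), (n, f) in COUNTS.items():
        if r == replaced and (access is None or a == access):
            assets += n
            failures += f
    return Fraction(failures, assets)


def naive_difference():
    """Rung 1: compare replaced vs not replaced, ignoring how assets were chosen."""
    return failure_rate(True) - failure_rate(False)


def adjusted_difference():
    """Average the within-stratum differences, weighted by stratum size."""
    total = sum(n for n, _ in COUNTS.values())
    diff = Fraction(0)
    for access in ("easy", "hard"):
        stratum_size = sum(n for (a, _), (n, _f) in COUNTS.items() if a == access)
        within = failure_rate(True, access) - failure_rate(False, access)
        diff += Fraction(stratum_size, total) * within
    return diff


print("naive:", float(naive_difference()))       # -0.054
print("adjusted:", float(adjusted_difference())) # -0.015
```

With these counts the naive comparison credits the programme with a 5.4-point drop in failure rate, while the adjusted estimate is 1.5 points. The gap is produced entirely by which assets were selected, not by what replacement did to them.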
Rung 2 — Doing: What Happens If We Act?
Rung 2 asks interventional questions — the questions that actually drive capital and maintenance budgets:
| Question | Why It's Rung 2 |
|---|---|
| "If we shift $5M from substation hardening to vegetation management in the southern corridor, what happens to storm-related CMI?" | Requires modelling the causal effect of vegetation condition on fault rate, and fault rate on outage propagation, accounting for network topology |
| "If we install automated switches on these 12 feeders, how much does expected unserved energy decrease?" | Requires modelling the causal effect of switching speed on outage duration, for the specific topology of each feeder |
| "If we defer transformer replacement on these 200 units by two years, what is the incremental risk to SAIDI/SAIFI targets?" | Requires modelling the causal relationship between asset condition, failure probability, and system-level reliability — not just the asset-level failure rate |
| "Which combination of investments across vegetation, asset replacement, and automation produces the largest CMI reduction per dollar?" | This is causal optimisation: finding the optimal intervention across multiple levers simultaneously, accounting for their interactions |
These are do-calculus questions: P(CMI | do(invest_in_X)). They require a causal model that represents the mechanisms connecting investment to outcome. A risk matrix cannot answer them because it has no concept of mechanism — only scores.
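A minimal sketch of such an interventional query follows. The structural equations (a linear fault reduction per $1M of clearing, and a fixed fault-to-outage probability by topology) and all corridor figures are assumptions invented for illustration. What makes this a do-query is that spend is set directly inside the mechanism, rather than conditioned on how spend happened to be distributed in historical data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corridor:
    name: str
    base_faults: float          # expected tree faults per year at zero spend
    radial: bool                # topology of the feeder served
    customers: int
    minutes_per_outage: float


# Assumed mechanism parameters (hypothetical)
P_OUTAGE_GIVEN_FAULT = {True: 1.0, False: 0.2}   # radial vs meshed
FAULT_CUT_PER_MUSD = 0.15                        # share of base faults removed per $1M


def expected_cmi(corridor, spend_musd):
    """Structural equations: spend -> faults -> outages -> customer minutes."""
    remaining = max(0.0, 1.0 - FAULT_CUT_PER_MUSD * spend_musd)
    faults = corridor.base_faults * remaining
    outages = faults * P_OUTAGE_GIVEN_FAULT[corridor.radial]
    return outages * corridor.customers * corridor.minutes_per_outage


def cmi_under_do(corridors, allocation):
    """E[CMI | do(spend = allocation)] summed over corridors."""
    return sum(expected_cmi(c, allocation.get(c.name, 0.0)) for c in corridors)


corridors = [
    Corridor("south-radial", 10.0, True, 4000, 180.0),
    Corridor("north-meshed", 10.0, False, 4000, 180.0),
]
even = {"south-radial": 2.5, "north-meshed": 2.5}
targeted = {"south-radial": 5.0, "north-meshed": 0.0}

print("even split:", round(cmi_under_do(corridors, even)))
print("targeted:  ", round(cmi_under_do(corridors, targeted)))
```

Under these assumptions the same $5M produces roughly 40% fewer expected customer minutes interrupted when it is concentrated on the radial corridor. A real model would replace the two hand-written equations with mechanisms fitted to topology, fault, and outage data.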
Rung 3 — Imagining: Would This Have Happened?
Rung 3 is where utilities face their hardest questions — usually in regulatory proceedings, after something has gone wrong:
| Question | Why It's Rung 3 |
|---|---|
| "Would the August 14 cascading outage have occurred if we had replaced transformer T-4471 in the spring maintenance cycle?" | Counterfactual: fix the specific individual circumstances of this event, change one variable, propagate |
| "Would the wildfire ignition have occurred if vegetation clearance on Corridor 12 had been completed on schedule?" | Counterfactual with legal and financial consequences: the answer determines liability |
| "Given the storm conditions on March 3, what is the minimum investment that would have prevented the extended outage in District 7?" | Counterfactual optimisation: searching across interventions in a specific factual scenario |
| "If this customer had been on a feeder with automated restoration, would their 14-hour outage have been less than 4 hours?" | Individual-level counterfactual: specific customer, specific event, specific alternative investment |
These questions are not hypothetical. They appear in rate cases, prudency reviews, wildfire litigation, regulatory investigations, and insurance claims. Today, utilities answer them with engineering judgment and after-the-fact narrative. A structural causal model answers them with computation — traceable, auditable, and defensible.
When a regulator asks "Was this outage preventable?", the utility currently offers an engineer's opinion. With a causal model, the utility offers a computation: "Given the conditions on that day, replacing T-4471 reduces the cascading outage probability from 73% to 11%. Here is the model, the data, and the assumptions. You can inspect every edge." The engineer's opinion is debatable. The model is inspectable.
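The computation behind a counterfactual answer follows three steps: abduction (infer the unobserved circumstances of the actual event), action (change one variable), and prediction (propagate). The sketch below applies them to a deliberately tiny model of a single transformer trip. The ratings, the uniform noise range, and the stress value are hypothetical and are unrelated to the 73% and 11% figures quoted above.

```python
U_LOW, U_HIGH = -0.10, 0.10            # latent per-event margin noise, Uniform (assumed)
RATING = {"old": 1.00, "new": 1.12}    # per-unit thermal rating by condition (assumed)


def trips(stress, condition, u):
    """Structural equation: the unit trips when stress exceeds rating plus noise."""
    return stress > RATING[condition] + u


def uniform_prob_below(threshold, low, high):
    """P(U < threshold) for U ~ Uniform(low, high)."""
    if threshold <= low:
        return 0.0
    if threshold >= high:
        return 1.0
    return (threshold - low) / (high - low)


def interventional_trip_prob(stress, condition):
    """Rung 2: P(trip | do(condition)) for a generic event at this stress."""
    return uniform_prob_below(stress - RATING[condition], U_LOW, U_HIGH)


def counterfactual_trip_prob(stress, factual, alternative):
    """Rung 3: P(trip under alternative | a trip was observed under factual)."""
    # 1. Abduction: the observed trip implies u < stress - RATING[factual],
    #    so the posterior on u is Uniform(U_LOW, u_max).
    u_max = min(U_HIGH, stress - RATING[factual])
    if u_max <= U_LOW:
        raise ValueError("the observed trip is impossible under this model")
    # 2. Action: swap the condition, keeping this event's stress and noise.
    threshold = stress - RATING[alternative]
    # 3. Prediction: probability that the same event still trips.
    return uniform_prob_below(threshold, U_LOW, u_max)


stress_on_the_day = 1.05
print(interventional_trip_prob(stress_on_the_day, "new"))        # about 0.15
print(counterfactual_trip_prob(stress_on_the_day, "old", "new")) # about 0.20
```

The two answers differ because the counterfactual conditions on what actually happened: knowing the old unit tripped tells us this event had an unfavourable margin, which raises the chance that a new unit would have tripped too. Every quantity feeding the answer is an explicit, inspectable assumption.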
4. The Dollar Gap
Same $40M budget. Different allocation. $12M more in avoided outage costs.
A regional utility has a $40M annual capital and maintenance budget for reliability improvement. The risk-register approach ranks assets by failure probability × replacement cost and allocates top-down. The causal model traces failure propagation paths through the network and allocates by system-level consequence reduction per dollar.
Two Allocation Strategies
| Category | Risk Register (L×I) | Causal Model (Failure-Path DAG) |
|---|---|---|
| Asset Replacement | $22M — replace the 200 oldest transformers | $14M — replace 90 transformers on critical radial feeders and high-cascade-risk nodes |
| Vegetation Management | $10M — cycle-based clearing across all corridors | $12M — prioritised by corridor consequence: feeders serving critical loads and areas with highest fault-to-outage propagation |
| Network Automation | $3M — SCADA upgrades at selected substations | $9M — automated sectionalising switches on the 40 feeders with highest cascading potential |
| Inspection & Monitoring | $5M — condition-based inspection programme | $5M — same, plus real-time loading sensors on identified cascade-critical corridors |
| Total Spend | $40M | $40M |
Projected Outcomes
| Metric | Risk Register Allocation | Causal Allocation |
|---|---|---|
| SAIDI reduction | 8% (mostly from replacing old assets) | 22% (from closing propagation paths + faster restoration) |
| Expected outage cost avoided | ~$9M per year | ~$21M per year |
| Critical-load outage events | Reduced by ~15% | Reduced by ~45% |
| Regulatory risk reduction | Moderate — can demonstrate asset investment | High — can demonstrate targeted, consequence-based investment with traceable logic |
The risk register replaced more transformers. The causal model replaced fewer, but targeted the ones whose failure propagates through the network. It redirected the savings into automation and vegetation management on the corridors where faults become cascading outages. Same budget, roughly 2.3× the avoided outage cost ($21M versus $9M per year).
The risk register says: "We replaced 200 transformers." The causal model says: "We reduced expected outage costs by $21M, critical-load outages by 45%, and SAIDI by 22% — for the same $40M." The first is an activity report. The second is a business case. Regulators and boards increasingly demand the second.
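The arithmetic behind the comparison can be checked directly from the two tables. The sketch below only restates the figures given above (in $M); the variable names are mine.

```python
# Budget by category, $M, copied from the allocation table
RISK_REGISTER = {"asset_replacement": 22, "vegetation": 10, "automation": 3, "inspection": 5}
CAUSAL = {"asset_replacement": 14, "vegetation": 12, "automation": 9, "inspection": 5}

# Expected outage cost avoided, $M per year, from the outcomes table
AVOIDED = {"risk_register": 9, "causal": 21}


def reallocation(before, after):
    """Change in spend per category when moving from one plan to the other."""
    return {category: after[category] - before[category] for category in before}


shift = reallocation(RISK_REGISTER, CAUSAL)
gap_musd = AVOIDED["causal"] - AVOIDED["risk_register"]
ratio = AVOIDED["causal"] / AVOIDED["risk_register"]

print("totals:", sum(RISK_REGISTER.values()), sum(CAUSAL.values()))  # 40 40
print("shift:", shift)
print("extra avoided cost ($M/yr):", gap_musd)                       # 12
print("ratio:", round(ratio, 2))                                     # 2.33
```

The reallocation nets to zero: $8M leaves asset replacement, $2M goes to vegetation management, and $6M goes to network automation.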
5. Wildfire: The Ultimate Causal Chain
Equipment condition → ignition → terrain → fire spread → liability. $30 billion at stake.
Wildfire is the risk that turned utility risk management from an operational concern into an existential one. Pacific Gas & Electric's equipment caused the 2018 Camp Fire, which killed 85 people and destroyed the town of Paradise. PG&E filed for bankruptcy in 2019 with an estimated $30 billion in wildfire liabilities. The 2023 Maui wildfires raised similar questions for Hawaiian Electric.
The causal structure of utility-caused wildfire is a textbook directed acyclic graph:
Equipment condition → ignition → terrain → fire spread → liability
Additional causal inputs enter at every stage: vegetation clearance affects fuel conditions; public safety power shutoff (PSPS) decisions affect ignition probability; community evacuation infrastructure affects casualty outcomes; inverse condemnation doctrine determines liability allocation.
Why the Risk Matrix Failed
Before the Camp Fire, PG&E's risk models scored wildfire risk using historical ignition data, vegetation proximity, and wind exposure. This is Rung 1: patterns in past data. The models could not answer the Rung 2 question that mattered: "If we replace this specific transmission tower hook on the Caribou-Palermo line, how much does ignition probability decrease in a red-flag weather event?" And after the fire, they could not answer the Rung 3 question the courts demanded: "Would the Camp Fire have occurred if PG&E had replaced that equipment?"
What a Causal Model Provides
| Rung | Question | Value |
|---|---|---|
| Rung 1 | Which equipment is correlated with historical ignitions? | Useful for prioritising inspections, but cannot distinguish equipment that caused ignitions from equipment that happened to be nearby |
| Rung 2 | If we harden the 500 highest-risk spans, how much does system-wide ignition probability decrease? | Answers the capital planning question: where to invest for maximum wildfire risk reduction per dollar |
| Rung 2 | Under what conditions should we call a PSPS event, given the trade-off between ignition risk and customer impact? | Optimal PSPS decision policy: balances wildfire prevention against economic and safety costs of de-energisation |
| Rung 3 | Would this ignition have occurred if we had completed the vegetation clearing on this corridor? | Answers the litigation question: traceable, auditable attribution of cause |
| Rung 3 | Given the weather and fuel conditions on that day, what is the minimum equipment investment that would have prevented the fire? | Answers the prudency question: was the utility's spending adequate given what was knowable? |
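The PSPS row in the table above describes a trade-off that can be written as an expected-cost rule. The sketch below is a deliberately simple version: it assumes a single fire cost and a single de-energisation cost (both hypothetical dollar figures) and ignores everything a real policy would add, such as safety impacts on medically vulnerable customers and uncertainty in the ignition estimate.

```python
# Hypothetical cost figures, for illustration only
FIRE_COST_USD = 2_000_000_000          # expected loss if an ignition becomes a major fire
DEENERGISATION_COST_USD = 4_000_000    # economic and safety cost of one PSPS event


def break_even_ignition_prob(deenergisation_cost, fire_cost):
    """Ignition probability at which shutting off and staying energised cost the same."""
    return deenergisation_cost / fire_cost


def should_deenergise(p_ignition,
                      deenergisation_cost=DEENERGISATION_COST_USD,
                      fire_cost=FIRE_COST_USD):
    """Call a PSPS event when the expected fire loss exceeds the shutoff cost."""
    return p_ignition * fire_cost > deenergisation_cost


threshold = break_even_ignition_prob(DEENERGISATION_COST_USD, FIRE_COST_USD)
print("break-even ignition probability:", threshold)       # 0.002
print("unhardened span, p=0.005:", should_deenergise(0.005))  # True
print("hardened span,   p=0.001:", should_deenergise(0.001))  # False
```

The two example calls show why hardening and PSPS policy interact: under these assumptions, hardening a span moves it below the break-even probability and removes the need to de-energise it in the same weather.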
PG&E's bankruptcy was not caused by a lack of risk awareness — it was caused by a lack of causal reasoning. The utility knew wildfire was a top risk. It had a risk score. What it didn't have was a model that could trace the causal chain from specific equipment decisions to specific fire outcomes, optimise investment across that chain, and defend those decisions when they were challenged. That model is a Bayesian causal network. The cost of not having one was $30 billion.
6. Storm Resilience & Cascading Failure
How a single fault becomes a system-wide outage, and where to break the chain.
Major storm events expose the weakness of asset-by-asset risk assessment most dramatically. A Category 2 hurricane doesn't fail one asset — it stresses thousands simultaneously. The difference between a manageable event and a catastrophic cascading failure depends on the interaction between asset condition, network topology, protection coordination, load levels, and operational response. All of these are causal relationships.
Anatomy of a Cascading Outage
Consider a simplified but realistic scenario: a summer heat wave coincides with a thunderstorm. Ambient temperature has pushed transformer loading to 95% of rating. A tree limb contacts a 69 kV feeder, tripping a breaker. The load transfers to an adjacent feeder, pushing its transformers to 110% of rating. Within 40 minutes, two transformers overheat and trip on thermal protection. The load attempts to redistribute again, but the remaining paths are also at capacity. The result is a cascading outage affecting 45,000 customers, when the initiating event was a single tree contact on a single feeder.
A risk matrix scores the tree contact and each transformer independently. It cannot model the cascade because it has no concept of load flow, thermal dynamics, or protection coordination. The causal model represents all of these:
| Variable | Role in the Graph | Intervention Point |
|---|---|---|
| Vegetation condition | Direct cause of the initiating fault | Targeted clearing on high-consequence corridors |
| Pre-event loading | Determines whether load transfer causes thermal overload | Demand response programmes, pre-event load shedding protocols |
| Transformer thermal margin | Determines how long adjacent feeders survive increased load | Transformer upgrade or dynamic rating systems |
| Protection coordination | Determines whether the fault is isolated or propagates | Protection setting review, addition of sectionalising reclosers |
| Tie switch automation | Determines restoration speed after the cascade stabilises | SCADA-controlled automated switching on critical feeders |
| Spare transformer inventory | Determines extended restoration time | Strategic pre-positioning of mobile transformers |
The causal model computes the joint effect of investing in any combination of these interventions. The risk matrix can only score them independently. In a system where interactions dominate — where the return on vegetation management depends on the state of protection coordination, and the value of automation depends on the loading profile — independent scoring is not just imprecise, it allocates resources to the wrong places.
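The cascade mechanics in the scenario above can be sketched in a few lines. This toy model treats each feeder as a single load and rating, transfers a tripped feeder's load to its tie-connected neighbours, and trips any neighbour pushed past its overload limit. The three-feeder layout, the MW values, and the 115% short-term limit standing in for a dynamic rating system are assumptions chosen to reproduce the 95% and 110% loading figures, not a load-flow study.

```python
def simulate_cascade(loads_mw, ratings_mw, ties, initiating_fault, overload_limit=1.0):
    """Return feeders out of service, in the order they trip."""
    loads = dict(loads_mw)
    out = []
    pending = [initiating_fault] if initiating_fault is not None else []
    while pending:
        feeder = pending.pop(0)
        if feeder in out:
            continue
        out.append(feeder)
        # Transfer the lost feeder's load equally to in-service tie neighbours.
        targets = [t for t in ties.get(feeder, []) if t not in out]
        if targets:
            share = loads[feeder] / len(targets)
            for t in targets:
                loads[t] += share
                if loads[t] > overload_limit * ratings_mw[t]:
                    pending.append(t)   # thermal protection will trip this feeder
        loads[feeder] = 0.0
    return out


# Illustrative network: A ties to B, B ties to C. B and C start at 95% of rating.
LOADS = {"A": 3.0, "B": 19.0, "C": 19.0}
RATINGS = {"A": 20.0, "B": 20.0, "C": 20.0}
TIES = {"A": ["B"], "B": ["C"], "C": []}

baseline = simulate_cascade(LOADS, RATINGS, TIES, "A")
vegetation_cleared = simulate_cascade(LOADS, RATINGS, TIES, None)
dynamic_rating = simulate_cascade(LOADS, RATINGS, TIES, "A", overload_limit=1.15)

print("baseline:", baseline)                       # ['A', 'B', 'C']
print("vegetation cleared:", vegetation_cleared)   # []
print("dynamic rating:", dynamic_rating)           # ['A']
```

Each intervention corresponds to changing one input of the mechanism: clearing removes the initiating fault, while a dynamic rating raises the limit that B can carry after the transfer (22 MW, or 110% of rating). Because the interventions act on the same simulation, any combination can be evaluated together instead of being scored separately.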
In the cascading scenario above, investing $200K in vegetation clearing on that single corridor prevents the initiating fault. Investing $150K in a dynamic transformer rating system gives the adjacent feeders 20 more minutes of headroom — enough for automated load shedding to prevent the cascade. The risk matrix sees two unrelated $200K and $150K investments. The causal model sees a $350K investment that prevents a 45,000-customer outage. The return on the combined investment is many times larger than the sum of the individual returns.
7. The Regulatory Argument
From "trust our judgment" to "inspect our model."
Utility regulators are moving from prescriptive rules ("replace assets older than 40 years") toward performance-based frameworks ("demonstrate that your investment plan is the most effective use of ratepayer funds"). This shift changes the burden of proof. Utilities must now justify why they chose one investment over another — and "the risk matrix scored it higher" is increasingly insufficient.
What Regulators Increasingly Ask
| Regulatory Question | Rung Required | Risk Matrix Answer |
|---|---|---|
| "Why did you prioritise substation X over substation Y?" | Rung 2 (intervention) | "X scored higher." (No mechanism, no traceability) |
| "What is the expected reliability improvement per dollar of your proposed capital plan?" | Rung 2 (intervention) | Cannot compute — scores don't translate to outcomes |
| "Was the maintenance deferral decision on Circuit 47 prudent given information available at the time?" | Rung 3 (counterfactual) | Cannot answer — no model of what would have happened |
| "How does your proposed wildfire mitigation plan reduce ignition probability in each fire threat district?" | Rung 2 (intervention) | Can estimate total spend per district, cannot estimate ignition probability reduction per dollar |
| "Would the extended outage have been avoided with the investment you deferred?" | Rung 3 (counterfactual) | Engineering opinion only |
A causal model answers every question in this table with computation, not opinion. The assumptions are explicit — visible in the graph structure and structural equations. The regulator can challenge any edge, any parameter, any assumption. This is a stronger position than "our engineers believe," not a weaker one, because it replaces undocumented judgment with inspectable reasoning.
The Prudency Standard
Prudency review asks: "Was the utility's decision reasonable given what was known at the time?" This is a counterfactual question. The causal model makes it computable: condition the model on information available at the decision date, compute the expected outcomes under the chosen action and the alternative, and compare. If the model shows the chosen action was optimal given the available information, the utility has a quantitative defence. If it shows the alternative was better, the utility knows before the regulator does — and can explain the reasoning that led to the decision.
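A minimal sketch of that comparison follows, using Bayes' rule to condition on what was known at the decision date. The scenario is hypothetical: an asset passed its last inspection, inspections are imperfect, and the question is whether deferring replacement was reasonable. All probabilities and costs are invented for illustration.

```python
# Hypothetical inputs known at the decision date
PRIOR_DEGRADED = 0.20        # share of this asset class that is degraded
P_PASS_IF_DEGRADED = 0.30    # inspection misses degradation 30% of the time
P_PASS_IF_HEALTHY = 0.90
P_FAIL = {"degraded": 0.25, "healthy": 0.02}   # annual failure probability
P_FAIL_AFTER_REPLACEMENT = 0.005
OUTAGE_COST_USD = 5_000_000
REPLACEMENT_COST_USD = 400_000


def posterior_degraded(prior, p_pass_if_degraded, p_pass_if_healthy):
    """P(degraded | inspection passed) by Bayes' rule."""
    numerator = prior * p_pass_if_degraded
    return numerator / (numerator + (1 - prior) * p_pass_if_healthy)


def expected_cost(action, p_degraded):
    """Expected one-year cost of an action given a belief about condition."""
    if action == "replace":
        return REPLACEMENT_COST_USD + P_FAIL_AFTER_REPLACEMENT * OUTAGE_COST_USD
    p_fail = p_degraded * P_FAIL["degraded"] + (1 - p_degraded) * P_FAIL["healthy"]
    return p_fail * OUTAGE_COST_USD


# Belief at the decision date: the asset had passed inspection.
p_then = posterior_degraded(PRIOR_DEGRADED, P_PASS_IF_DEGRADED, P_PASS_IF_HEALTHY)
defer_then = expected_cost("defer", p_then)
replace_then = expected_cost("replace", p_then)

# Hindsight: the asset is now known to have been degraded.
defer_hindsight = expected_cost("defer", 1.0)

print(round(p_then, 4), round(defer_then), round(replace_then), round(defer_hindsight))
```

Under these assumptions, deferral had the lower expected cost given the passed inspection, even though replacement looks better once the degradation is known. Prudency turns on the first comparison, and the model makes the distinction between the two explicit.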
A utility that presents a causal model to a regulator is saying: "Here are our assumptions, our data, our reasoning, and our conclusions. Challenge any part of it." This is a fundamentally stronger position than presenting a risk score and asking the regulator to trust the process. The model invites scrutiny because it can withstand it. The risk matrix avoids scrutiny because it cannot.
8. What To Do
A practical path from risk registers to causal risk models.
Transitioning from matrix-based risk management to causal modelling doesn't require replacing everything at once. The most effective approach starts with a single high-consequence failure mode and expands from there.
The Path
| Step | Action | Detail |
|---|---|---|
| 1 | Pick one failure mode | Choose the one that keeps your executives up at night — wildfire ignition, cascading outage, critical-load interruption. Build the causal graph for that single failure mode: what causes it, what determines its severity, what interventions exist at each stage. |
| 2 | Map the causal structure | Assemble your engineers, planners, and operators in a room. Draw the graph on a whiteboard. Identify which edges are well-understood (physics-based), which are estimated (expert judgment), and which are unknown. This step alone surfaces hidden assumptions and disagreements that the risk matrix conceals. |
| 3 | Validate on synthetic data | Write structural equations for each edge. Simulate data where the ground truth is known. Run your estimator and confirm it recovers the true effects. This is the generative validation step — it catches modelling errors before real data is involved. |
| 4 | Connect real data | Feed in your actual asset condition data, outage history, weather records, network topology, and maintenance records. Train the parameters. Run sensitivity analysis: which assumptions matter, which don't? |
| 5 | Ask Rung 2 questions | Run the interventional queries your planners actually need: "If we shift $X from programme A to programme B, what happens to expected CMI?" Compare the answers to what the risk matrix recommends. Where they agree, good. Where they diverge, investigate. |
| 6 | Build toward Rung 3 | Once the Rung 2 model is validated and trusted, extend to counterfactual queries: after-the-fact analysis of specific events, prudency documentation, scenario planning for regulatory proceedings. |
| 7 | Scale | Repeat for additional failure modes. Connect the individual models into a system-level risk model that captures interactions between failure modes. This becomes the foundation for integrated resource planning, rate case support, and continuous risk management. |
A causal model of a single critical failure mode — built in 6–8 weeks with your existing data and engineering knowledge — will produce better investment decisions than a risk matrix covering your entire asset base. The goal is not to replace everything at once. The goal is to demonstrate, on one high-stakes problem, that asking the right questions produces different and better answers.
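Step 3 of the path (validate on synthetic data) can be sketched as follows. The structural equations are invented for the test: wind exposure raises both the chance that a corridor is cleared and its fault count, and clearing has a known true effect of -2 faults per year. The estimator under test is a simple stratified (back-door) adjustment, one of several that could be used here.

```python
import random
from statistics import fmean

TRUE_EFFECT = -2.0   # ground-truth effect of clearing on annual faults (by construction)


def simulate(n, seed=0):
    """Generate (exposed, cleared, faults) rows from known structural equations."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        exposed = rng.random() < 0.5
        # Exposed corridors are cleared more often: this creates confounding.
        cleared = rng.random() < (0.8 if exposed else 0.2)
        faults = 5.0 + 4.0 * exposed + TRUE_EFFECT * cleared + rng.gauss(0.0, 1.0)
        rows.append((exposed, cleared, faults))
    return rows


def naive_estimate(rows):
    """Difference in mean faults, cleared minus not cleared, ignoring exposure."""
    return (fmean(y for _, x, y in rows if x)
            - fmean(y for _, x, y in rows if not x))


def adjusted_estimate(rows):
    """Within-exposure differences, averaged with stratum-size weights."""
    estimate = 0.0
    for level in (False, True):
        stratum = [(x, y) for z, x, y in rows if z == level]
        diff = (fmean(y for x, y in stratum if x)
                - fmean(y for x, y in stratum if not x))
        estimate += diff * len(stratum) / len(rows)
    return estimate


rows = simulate(20_000)
print("true effect:", TRUE_EFFECT)
print("naive:", round(naive_estimate(rows), 2))
print("adjusted:", round(adjusted_estimate(rows), 2))
```

In this setup the naive comparison gets the sign wrong, suggesting that clearing increases faults, because cleared corridors are disproportionately the exposed ones. The adjusted estimator lands close to the true value. If it did not, that would point to an error in the graph or the estimator before any real data was involved.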
9. Further Reading
The foundations behind this approach.
| Source | Relevance |
|---|---|
| Pearl, J. Causality (2nd ed, 2009) | The foundational text on structural causal models, do-calculus, and the three rungs of causation |
| Pearl, J. & Mackenzie, D. The Book of Why (2018) | Accessible introduction to causal inference and Pearl's Ladder — the framework underlying this entire approach |
| McElreath, R. Statistical Rethinking (2nd ed, 2020) | Bayesian workflow with generative causal models: simulate first, then estimate |
| Fenton, N. & Neil, M. Risk Assessment and Decision Analysis with Bayesian Networks (2nd ed, 2018) | Practical guide to building Bayesian network risk models — directly applicable to utility asset management and reliability planning |
| IEEE 1366 Guide for Electric Power Distribution Reliability Indices | Standard definitions for SAIDI, SAIFI, CAIDI, and other reliability metrics used throughout this analysis |
| CPUC Wildfire Safety Division reports | California's regulatory framework for utility wildfire risk — demonstrates the shift toward risk-based, evidence-driven safety planning |
| NERC Standard TPL-001-5, Transmission System Planning Performance Requirements | Reliability standards that increasingly require utilities to demonstrate the adequacy of their planning models |
What does your grid's causal structure look like?
A focused conversation about your highest-consequence failure mode is the fastest way to find out whether a causal model changes your investment plan.