When the Grid Goes Dark — Causal Risk Modeling for Utilities

Risk matrices can't answer causal questions. Causal models can.

1. The Problem
2. The Causal Structure
3. Three Rungs Applied
4. The Dollar Gap
5. Wildfire: The Ultimate Causal Chain
6. Storm Resilience & Cascading Failure
7. The Regulatory Argument
8. What To Do
9. Further Reading

The Bottom Line

The Problem: Utility risk management uses likelihood × impact matrices that treat "transformer failure" as one risk. But "transformer failure on a radial feeder serving a hospital" and "transformer failure in a meshed network with redundancy" are completely different risks with different consequences — and the matrix cannot distinguish them.
The Insight: The grid is a causal system: weather causes vegetation contact, which causes line faults, which trigger protection relays, which cause cascading outages, which create customer impact, which produce regulatory penalties. Each link in the chain is a point where intervention has a different cost and a different effect.
The Action: Model the causal structure of failure propagation. Use interventional queries (Rung 2) to optimise maintenance and capital investment. Use counterfactual queries (Rung 3) to answer regulatory questions: "Would this outage have occurred if we had replaced that asset?"

1. The Problem

Risk matrices treat the grid as a spreadsheet. The grid is a network.

A major utility manages tens of thousands of assets — transformers, conductors, poles, switches, substations — spread across thousands of kilometres. Each asset can fail. When one fails, the consequences depend entirely on where it sits in the network and what else is happening at the time. A failed transformer in a meshed urban network triggers automatic switching and nobody loses power. The same failure on a radial rural feeder serving a hospital means a critical outage, an emergency generator deployment, regulatory scrutiny, and potential liability.

The risk matrix cannot distinguish these two scenarios. Both are "transformer failure." Both get the same likelihood score (based on asset age and condition) and the same impact score (based on replacement cost). The matrix produces a single number — say L=3, I=4, score=12 — that tells the planning team nothing about which failure to prevent first.
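The two scenarios can be put side by side in a few lines of Python. This is a minimal sketch: the `TransformerSite` class, its field names, and the rule inside `outage_expected` are illustrative assumptions, and only the L=3, I=4 scores come from the text.

```python
from dataclasses import dataclass


@dataclass
class TransformerSite:
    name: str
    likelihood: int            # 1-5 matrix score from asset age and condition
    impact: int                # 1-5 matrix score from replacement cost
    feeder_type: str           # "radial" or "meshed" (assumed encoding)
    serves_critical_load: bool


def matrix_score(site: TransformerSite) -> int:
    """Classic likelihood x impact score; it never looks at network context."""
    return site.likelihood * site.impact


def outage_expected(site: TransformerSite) -> bool:
    """Hypothetical rule: a meshed feeder absorbs a single transformer failure."""
    return site.feeder_type == "radial"


urban = TransformerSite("urban meshed network", 3, 4, "meshed", False)
rural = TransformerSite("rural radial feeder, hospital", 3, 4, "radial", True)

for site in (urban, rural):
    print(f"{site.name}: score={matrix_score(site)}, "
          f"outage={outage_expected(site)}, critical={site.serves_critical_load}")
```

Both sites print the same score, while the context fields that decide the consequence differ. Those fields are exactly what the matrix has no column for.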

What the Matrix Misses

Factor	Why It Matters	Matrix Score
Network topology	Radial vs meshed feeders determine whether a single failure causes a cascading outage or is absorbed by redundancy	Not captured
Load criticality	A hospital, water treatment plant, or data centre on the affected feeder changes the consequence by orders of magnitude	Averaged away
Concurrent conditions	Asset failure during a heat wave with peak load creates cascading overloads that the same failure in mild weather does not	Not captured
Protection coordination	Whether upstream relays isolate the fault or allow propagation depends on protection settings and topology	Not captured
Repair logistics	Failure in a remote area with no spare transformer and limited road access has a 72-hour restoration time; the same failure at an accessible substation takes 8 hours	Same score
Regulatory exposure	Failure during a performance-based rate case or after a deferred maintenance decision creates liability that identical failure at other times does not	Not captured

Every row in this table represents a causal pathway — a chain of variables where the state of one determines the consequence of another. The risk matrix collapses all of them into a single score. The result is a maintenance plan that allocates budget by asset age and generic failure rates, not by the actual risk each asset poses to the system.

The Core Failure

The risk matrix asks "How likely is this asset to fail?" and "How expensive is this asset to replace?" A causal model asks "If this asset fails, what happens to the system, and what can we do about it?" These are fundamentally different questions, and they produce fundamentally different investment plans.

2. The Causal Structure

The grid as a directed acyclic graph: from weather event to financial loss.

A utility's risk landscape is a causal system with well-understood mechanisms. Weather, asset condition, network topology, operational response, and regulatory context combine to produce outcomes. The causal graph makes these dependencies explicit and computable.

The Primary Causal Chain

Weather Event → Vegetation Contact → Line Fault → Protection Response → Load Redistribution → Cascading Outage → Customer Impact → Regulatory Penalty

But this is only one chain. The full graph includes parallel and interacting pathways:

Pathway	Variables	Key Interaction
Weather → Asset Failure	Wind speed, ice loading, ambient temperature, soil moisture (for pole foundations), lightning density	Weather doesn't just cause vegetation contact — it directly stresses conductors, poles, and transformers through thermal and mechanical loading
Asset Condition → Failure Probability	Age, maintenance history, inspection results, loading pattern, manufacturer, installation method	Condition mediates the weather-to-failure relationship: a well-maintained asset survives loads that destroy a degraded one
Topology → Consequence	Radial vs meshed configuration, number of tie switches, SCADA automation level, distributed generation capacity	The same initiating event produces different consequences depending on where it occurs in the network
Load → Cascading	Time of day, season, demand response capacity, behind-the-meter generation, EV charging load	Peak load at time of failure determines whether the fault cascades to adjacent feeders
Response → Duration	Crew availability, spare inventory, access constraints, storm coordination, mutual aid status	Operational readiness determines restoration time, which drives customer minutes interrupted (CMI) and regulatory exposure

In a risk matrix, these are all separate "risks" with separate scores. In a causal graph, they are connected variables whose joint behaviour determines outcomes. The graph captures what the matrix cannot: that vegetation management, asset replacement, network reconfiguration, and operational readiness are not independent investments — they interact, and the return on one depends on the state of the others.
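A graph like this can be written down directly. The sketch below encodes the primary chain plus a simplified version of the parallel pathways, using only the Python standard library. The node names and the exact edge set are assumptions made for illustration; a real model would come out of the utility's own engineering review.

```python
from graphlib import TopologicalSorter

# parent -> children; an assumed, simplified edge set based on the pathways above
children = {
    "weather_event": ["vegetation_contact", "asset_failure"],
    "asset_condition": ["asset_failure"],
    "vegetation_contact": ["line_fault"],
    "asset_failure": ["line_fault"],
    "line_fault": ["protection_response"],
    "protection_response": ["load_redistribution"],
    "topology": ["load_redistribution"],
    "load_redistribution": ["cascading_outage"],
    "pre_event_load": ["cascading_outage"],
    "cascading_outage": ["customer_impact"],
    "response_readiness": ["customer_impact"],
    "customer_impact": ["regulatory_penalty"],
}


def descendants(node):
    """Every variable that an intervention on `node` can influence."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# TopologicalSorter expects node -> predecessors, so invert the edge map.
parents = {}
for parent, kids in children.items():
    parents.setdefault(parent, set())
    for kid in kids:
        parents.setdefault(kid, set()).add(parent)

# static_order() raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter(parents).static_order())
print(order)
print(sorted(descendants("vegetation_contact")))
```

The topological order doubles as a check that the graph really is acyclic, and `descendants` shows which downstream outcomes a given investment can reach at all.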

Why the Graph Matters for Investment

A risk matrix says: "Rank assets by risk score, replace from the top." A causal graph says: "Trace the failure propagation paths, find the nodes with the highest leverage, invest there." Sometimes the highest-leverage investment isn't replacing the worst asset — it's adding a tie switch that prevents the failure from cascading, or clearing vegetation on a critical corridor that feeds three hospitals. The graph reveals these interventions. The matrix cannot.

3. Three Rungs Applied

From "what correlates with outages" to "would this outage have happened if we'd acted differently."

Rung 1 — Seeing: What Correlates with Outages?

This is where most utility analytics lives today. Asset health indices, failure rate curves, weather correlation models, predictive maintenance scores. These tools observe patterns: older transformers fail more often; outages increase during storms; feeders with deferred vegetation management have more faults.

All of this is useful — and all of it is Rung 1. It cannot distinguish correlation from causation, and it cannot tell you what happens if you intervene. For example:

Rung 1 Finding	The Problem
"Feeders with more truck rolls have longer outage durations"	Does slow response cause long outages, or do long outages (caused by severe faults) require more truck rolls? The causal arrow could point in either direction, and the correlation alone cannot say which.
"Assets replaced under the accelerated programme fail less often"	The programme may have picked assets in good network positions with easy access. If so, it selected for replacement ease, not risk reduction, and the lower failure rate reflects where the assets sit rather than what replacement achieved.
"Regions with higher vegetation management spend have fewer tree-related faults"	True — but how much of the spending is in the right corridors? Shifting $1M from a low-consequence corridor to a high-consequence one might reduce CMI more than adding $2M of total spend.
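The second finding in the table can be reproduced with a few lines of arithmetic. In the sketch below every probability is hypothetical: sites are either "benign" (accessible, low stress) or "harsh", the programme mostly replaces assets at benign sites, and replacement cuts the failure probability by 2 percentage points everywhere.

```python
# All numbers are hypothetical and chosen only to illustrate selection bias.
p_site = {"benign": 0.5, "harsh": 0.5}        # share of assets at each site type
p_replaced = {"benign": 0.8, "harsh": 0.2}    # programme favours easy sites
p_fail = {                                     # P(failure | site type, replaced)
    ("benign", True): 0.03, ("benign", False): 0.05,
    ("harsh", True): 0.18, ("harsh", False): 0.20,
}


def observed_failure_rate(replaced):
    """Failure rate an analyst would see in the replaced / not-replaced group."""
    weights = {
        site: p_site[site] * (p_replaced[site] if replaced else 1 - p_replaced[site])
        for site in p_site
    }
    total = sum(weights.values())
    return sum(weights[s] * p_fail[(s, replaced)] for s in weights) / total


# Rung 1: compare the two groups as observed.
naive_gap = observed_failure_rate(True) - observed_failure_rate(False)

# Adjusting for site type: average the within-stratum effect over all sites.
adjusted_gap = sum(
    p_site[s] * (p_fail[(s, True)] - p_fail[(s, False)]) for s in p_site
)

print(f"naive gap:    {naive_gap:+.3f}")
print(f"adjusted gap: {adjusted_gap:+.3f}")
```

Under these assumed numbers the observed gap is several times larger than the effect built into the example, purely because of which assets were chosen for replacement.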

Rung 1 Summary

Rung 1 can tell you which assets are failing and what conditions are associated with failure. It cannot tell you which investments will reduce failure consequences most effectively. For that, you need Rung 2.

Rung 2 — Doing: What Happens If We Act?

Rung 2 asks interventional questions — the questions that actually drive capital and maintenance budgets:

Question	Why It's Rung 2
"If we shift $5M from substation hardening to vegetation management in the southern corridor, what happens to storm-related CMI?"	Requires modelling the causal effect of vegetation condition on fault rate, and fault rate on outage propagation, accounting for network topology
"If we install automated switches on these 12 feeders, how much does expected unserved energy decrease?"	Requires modelling the causal effect of switching speed on outage duration, for the specific topology of each feeder
"If we defer transformer replacement on these 200 units by two years, what is the incremental risk to SAIDI/SAIFI targets?"	Requires modelling the causal relationship between asset condition, failure probability, and system-level reliability — not just the asset-level failure rate
"Which combination of investments across vegetation, asset replacement, and automation produces the largest CMI reduction per dollar?"	This is causal optimisation: finding the optimal intervention across multiple levers simultaneously, accounting for their interactions

These are do-calculus questions: P(CMI | do(invest_in_X)). They require a causal model that represents the mechanisms connecting investment to outcome. A risk matrix cannot answer them because it has no concept of mechanism — only scores.
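As a toy illustration of an interventional query, the sketch below evaluates expected CMI under two vegetation budgets that the planner sets directly. The corridor data, the linear fault-reduction rule, and the name `expected_cmi` are all assumptions made for the example; a real model would estimate these relationships from data and network topology.

```python
# Hypothetical corridors: baseline faults per year, customers affected per
# fault, average restoration minutes, and fault reduction per $1M of clearing.
corridors = {
    "low_consequence": {"faults": 10.0, "customers": 500, "minutes": 90, "cut_per_musd": 0.10},
    "high_consequence": {"faults": 6.0, "customers": 8000, "minutes": 240, "cut_per_musd": 0.10},
}


def expected_cmi(spend_musd):
    """Expected annual customer minutes interrupted under do(spend = spend_musd)."""
    total = 0.0
    for name, c in corridors.items():
        # Assumed mechanism: each $1M of clearing removes a fixed share of faults.
        remaining = max(0.0, 1.0 - c["cut_per_musd"] * spend_musd.get(name, 0.0))
        total += c["faults"] * remaining * c["customers"] * c["minutes"]
    return total


even_split = expected_cmi({"low_consequence": 2.0, "high_consequence": 2.0})
shifted = expected_cmi({"low_consequence": 1.0, "high_consequence": 3.0})

print(f"even split: {even_split:,.0f} CMI")
print(f"shifted:    {shifted:,.0f} CMI")
```

The total spend is $4M in both cases. Only the allocation changes, and the difference in expected CMI comes entirely from the assumed mechanism linking clearing to faults and faults to customer impact.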

Rung 3 — Imagining: Would This Have Happened?

Rung 3 is where utilities face their hardest questions — usually in regulatory proceedings, after something has gone wrong:

Question	Why It's Rung 3
"Would the August 14 cascading outage have occurred if we had replaced transformer T-4471 in the spring maintenance cycle?"	Counterfactual: fix the specific individual circumstances of this event, change one variable, propagate
"Would the wildfire ignition have occurred if vegetation clearance on Corridor 12 had been completed on schedule?"	Counterfactual with legal and financial consequences: the answer determines liability
"Given the storm conditions on March 3, what is the minimum investment that would have prevented the extended outage in District 7?"	Counterfactual optimisation: searching across interventions in a specific factual scenario
"If this customer had been on a feeder with automated restoration, would their 14-hour outage have been less than 4 hours?"	Individual-level counterfactual: specific customer, specific event, specific alternative investment

These questions are not hypothetical. They appear in rate cases, prudency reviews, wildfire litigation, regulatory investigations, and insurance claims. Today, utilities answer them with engineering judgment and after-the-fact narrative. A structural causal model answers them with computation — traceable, auditable, and defensible.

The Regulatory Difference

When a regulator asks "Was this outage preventable?", the utility currently offers an engineer's opinion. With a causal model, the utility offers a computation: "Given the conditions on that day, replacing T-4471 reduces the cascading outage probability from 73% to 11%. Here is the model, the data, and the assumptions. You can inspect every edge." The engineer's opinion is debatable. The model is inspectable.
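A counterfactual query follows three steps: infer the unobserved conditions of the actual event (abduction), change one variable (action), and recompute the outcome (prediction). The sketch below applies them to a deliberately simple, deterministic thermal equation. The equation, the trip limit, and every number are hypothetical; a real model would be probabilistic and would return figures like the 73% and 11% quoted above.

```python
TRIP_LIMIT_C = 140.0                    # assumed thermal protection setting
OLD_RISE_C, NEW_RISE_C = 80.0, 65.0     # assumed rated hot-spot rise, old vs new unit


def hotspot_temp(load_pu, ambient_c, rated_rise_c, u):
    """Toy structural equation: hot-spot temperature from load, ambient, and noise u."""
    return ambient_c + rated_rise_c * load_pu ** 2 + u


# What was actually observed during the event (hypothetical readings).
observed = {"load_pu": 1.10, "ambient_c": 38.0, "hotspot_c": 142.0}

# 1. Abduction: recover the event-specific noise term that explains the reading.
u = observed["hotspot_c"] - hotspot_temp(
    observed["load_pu"], observed["ambient_c"], OLD_RISE_C, 0.0
)

# 2. Action: swap in the replacement transformer, keep everything else fixed.
# 3. Prediction: recompute the outcome with the same load, ambient, and noise.
counterfactual_c = hotspot_temp(
    observed["load_pu"], observed["ambient_c"], NEW_RISE_C, u
)

factual_trip = observed["hotspot_c"] > TRIP_LIMIT_C
counterfactual_trip = counterfactual_c > TRIP_LIMIT_C

print(f"inferred noise u: {u:.2f} C")
print(f"counterfactual hot-spot: {counterfactual_c:.2f} C")
print(f"tripped in fact: {factual_trip}, would have tripped: {counterfactual_trip}")
```

Keeping the inferred noise term fixed is what makes this a statement about that specific day rather than about an average transformer.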

4. The Dollar Gap

Same $40M budget. Different allocation. $12M more in avoided outage costs.

A regional utility has a $40M annual capital and maintenance budget for reliability improvement. The risk-register approach ranks assets by failure probability × replacement cost and allocates top-down. The causal model traces failure propagation paths through the network and allocates by system-level consequence reduction per dollar.

Two Allocation Strategies

Category	Risk Register (L×I)	Causal Model (Failure-Path DAG)
Asset Replacement	$22M — replace the 200 oldest transformers	$14M — replace 90 transformers on critical radial feeders and high-cascade-risk nodes
Vegetation Management	$10M — cycle-based clearing across all corridors	$12M — prioritised by corridor consequence: feeders serving critical loads and areas with highest fault-to-outage propagation
Network Automation	$3M — SCADA upgrades at selected substations	$9M — automated sectionalising switches on the 40 feeders with highest cascading potential
Inspection & Monitoring	$5M — condition-based inspection programme	$5M — same, plus real-time loading sensors on identified cascade-critical corridors
Total Spend	$40M	$40M

Projected Outcomes

Metric	Risk Register Allocation	Causal Allocation
SAIDI reduction	8% (mostly from replacing old assets)	22% (from closing propagation paths + faster restoration)
Expected outage cost avoided	~$9M per year	~$21M per year
Critical-load outage events	Reduced by ~15%	Reduced by ~45%
Regulatory risk reduction	Moderate — can demonstrate asset investment	High — can demonstrate targeted, consequence-based investment with traceable logic

The risk register replaced more transformers. The causal model replaced fewer, but targeted the ones whose failure propagates through the network. It redirected the savings into automation and vegetation management on the corridors where faults become cascading outages. Same budget, roughly 2.3× the avoided outage cost ($21M versus $9M per year).
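The arithmetic behind the two tables is easy to check. The figures below are taken directly from the tables above (in $M); only the variable names are introduced here.

```python
# Budget lines from the "Two Allocation Strategies" table, in $M.
risk_register = {"asset_replacement": 22, "vegetation": 10, "automation": 3, "inspection": 5}
causal_model = {"asset_replacement": 14, "vegetation": 12, "automation": 9, "inspection": 5}

# Expected outage cost avoided per year, from "Projected Outcomes", in $M.
avoided_cost = {"risk_register": 9, "causal_model": 21}

# How much each budget line moves when switching to the causal allocation.
shift = {line: causal_model[line] - risk_register[line] for line in risk_register}

extra_avoided = avoided_cost["causal_model"] - avoided_cost["risk_register"]
ratio = avoided_cost["causal_model"] / avoided_cost["risk_register"]

print("totals:", sum(risk_register.values()), sum(causal_model.values()))
print("shift by line:", shift)
print(f"extra avoided cost: ${extra_avoided}M per year, ratio {ratio:.1f}x")
```

The shifts sum to zero: $8M leaves asset replacement, with $2M going to vegetation management and $6M to network automation.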

The Board Slide

The risk register says: "We replaced 200 transformers." The causal model says: "We reduced expected outage costs by $21M, critical-load outages by 45%, and SAIDI by 22% — for the same $40M." The first is an activity report. The second is a business case. Regulators and boards increasingly demand the second.

5. Wildfire: The Ultimate Causal Chain

Equipment condition → ignition → terrain → fire spread → liability. $30 billion at stake.

Wildfire is the risk that turned utility risk management from an operational concern into an existential one. Pacific Gas & Electric's equipment caused the 2018 Camp Fire, which killed 85 people and destroyed the town of Paradise. PG&E filed for bankruptcy in 2019 with $30 billion in wildfire liabilities. The August 2023 Maui wildfires raised similar questions for Hawaiian Electric.

The causal structure of utility-caused wildfire is a textbook directed acyclic graph:

Equipment Condition → Ignition Event → Fuel Conditions → Wind & Terrain → Fire Spread → Structure Loss → Liability

With additional causal inputs at every stage: vegetation clearance affects fuel conditions; public safety power shutoff (PSPS) decisions affect ignition probability; community evacuation infrastructure affects casualty outcomes; inverse condemnation doctrine determines liability allocation.

Why the Risk Matrix Failed

Before the Camp Fire, PG&E's risk models scored wildfire risk using historical ignition data, vegetation proximity, and wind exposure. This is Rung 1: patterns in past data. The models could not answer the Rung 2 question that mattered: "If we replace this specific transmission tower hook on the Caribou-Palermo line, how much does ignition probability decrease in a red-flag weather event?" And after the fire, they could not answer the Rung 3 question the courts demanded: "Would the Camp Fire have occurred if PG&E had replaced that equipment?"

What a Causal Model Provides

Rung	Question	Value
Rung 1	Which equipment is correlated with historical ignitions?	Useful for prioritising inspections, but cannot distinguish equipment that caused ignitions from equipment that happened to be nearby
Rung 2	If we harden the 500 highest-risk spans, how much does system-wide ignition probability decrease?	Answers the capital planning question: where to invest for maximum wildfire risk reduction per dollar
Rung 2	Under what conditions should we call a PSPS event, given the trade-off between ignition risk and customer impact?	Optimal PSPS decision policy: balances wildfire prevention against economic and safety costs of de-energisation
Rung 3	Would this ignition have occurred if we had completed the vegetation clearing on this corridor?	Answers the litigation question: traceable, auditable attribution of cause
Rung 3	Given the weather and fuel conditions on that day, what is the minimum equipment investment that would have prevented the fire?	Answers the prudency question: was the utility's spending adequate given what was knowable?
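The PSPS row in the table describes a trade-off that can be written as a break-even rule. The sketch below is a deliberate simplification: it treats the decision as a single comparison of expected fire loss against the cost of de-energisation, and the function names and dollar figures are hypothetical. A real policy would also account for safety impacts of the shutoff itself and for uncertainty in the ignition estimate.

```python
def psps_breakeven_probability(shutoff_cost, fire_loss):
    """Ignition probability above which de-energising has lower expected cost."""
    return shutoff_cost / fire_loss


def call_psps(p_ignition, shutoff_cost, fire_loss):
    """De-energise when the expected fire loss exceeds the cost of the shutoff."""
    return p_ignition * fire_loss > shutoff_cost


# Hypothetical inputs: $20M economic cost of the shutoff, $10B loss if a fire occurs.
SHUTOFF_COST = 20e6
FIRE_LOSS = 10e9

threshold = psps_breakeven_probability(SHUTOFF_COST, FIRE_LOSS)
print(f"break-even ignition probability: {threshold:.4f}")

# The interventional model would supply p_ignition for the forecast conditions.
for p_ignition in (0.0005, 0.01):
    print(p_ignition, call_psps(p_ignition, SHUTOFF_COST, FIRE_LOSS))
```

The causal model's job in this picture is to supply a defensible ignition probability for the specific circuit and forecast; the decision rule on top of it is the easy part.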

The $30 Billion Question

PG&E's bankruptcy was not caused by a lack of risk awareness; it was caused by a lack of causal reasoning. The utility knew wildfire was a top risk. It had a risk score. What it didn't have was a model that could trace the causal chain from specific equipment decisions to specific fire outcomes, optimise investment across that chain, and defend those decisions when they were challenged. That model is a structural causal model, which can be built as a causal Bayesian network with explicit structural equations. The cost of not having one was $30 billion.

6. Storm Resilience & Cascading Failure

How a single fault becomes a system-wide outage, and where to break the chain.

Major storm events expose the weakness of asset-by-asset risk assessment most dramatically. A Category 2 hurricane doesn't fail one asset — it stresses thousands simultaneously. The difference between a manageable event and a catastrophic cascading failure depends on the interaction between asset condition, network topology, protection coordination, load levels, and operational response. All of these are causal relationships.

Anatomy of a Cascading Outage

Consider a simplified but realistic scenario: a summer heat wave coincides with a thunderstorm. Ambient temperature has pushed transformer loading to 95% of rating. A tree limb contacts a 69 kV feeder, tripping a breaker. The load transfers to an adjacent feeder, pushing its transformers to 110% of rating. Within 40 minutes, two transformers overheat and trip on thermal protection. The load attempts to redistribute again, but the remaining paths are also at capacity. The result is a cascading outage affecting 45,000 customers, when the initiating event was a single tree contact on a single feeder.

A risk matrix scores the tree contact and each transformer independently. It cannot model the cascade because it has no concept of load flow, thermal dynamics, or protection coordination. The causal model represents all of these:

Variable	Role in the Graph	Intervention Point
Vegetation condition	Direct cause of the initiating fault	Targeted clearing on high-consequence corridors
Pre-event loading	Determines whether load transfer causes thermal overload	Demand response programmes, pre-event load shedding protocols
Transformer thermal margin	Determines how long adjacent feeders survive increased load	Transformer upgrade or dynamic rating systems
Protection coordination	Determines whether the fault is isolated or propagates	Protection setting review, addition of sectionalising reclosers
Tie switch automation	Determines restoration speed after the cascade stabilises	SCADA-controlled automated switching on critical feeders
Spare transformer inventory	Determines extended restoration time	Strategic pre-positioning of mobile transformers

The causal model computes the joint effect of investing in any combination of these interventions. The risk matrix can only score them independently. In a system where interactions dominate — where the return on vegetation management depends on the state of protection coordination, and the value of automation depends on the loading profile — independent scoring is not just imprecise, it allocates resources to the wrong places.
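The interaction between pre-event loading and load transfer can be seen in a very small simulation. The sketch below is an assumed toy model, not a load-flow study: stranded load is split evenly across the feeders still in service, and any feeder pushed past an overload limit trips in turn. Feeder names, ratings, loads, and the 105% limit are all hypothetical.

```python
def simulate_cascade(loads_mw, ratings_mw, first_trip, overload_limit=1.05):
    """Return (feeders tripped in order, load in MW left with no path)."""
    carried = {name: mw for name, mw in loads_mw.items() if name != first_trip}
    stranded = loads_mw[first_trip]
    tripped = [first_trip]
    while stranded > 0 and carried:
        # Assumed rule: stranded load is shared evenly by surviving feeders.
        share = stranded / len(carried)
        stranded = 0.0
        for name in carried:
            carried[name] += share
        overloaded = sorted(
            n for n in carried if carried[n] > overload_limit * ratings_mw[n]
        )
        for name in overloaded:
            stranded += carried.pop(name)   # its load now needs a new path
            tripped.append(name)
    return tripped, stranded


ratings = {"F1": 10.0, "F2": 10.0, "F3": 10.0, "F4": 10.0}
heat_wave = {name: 9.5 for name in ratings}   # 95% loading before the fault
mild_day = {name: 7.0 for name in ratings}    # same network, lower loading

print(simulate_cascade(heat_wave, ratings, first_trip="F1"))
print(simulate_cascade(mild_day, ratings, first_trip="F1"))
```

The initiating fault is identical in both runs. Only the pre-event loading differs, and that alone decides whether the event stays on one feeder or takes down all four.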

The Compounding Effect

In the cascading scenario above, investing $200K in vegetation clearing on that single corridor prevents the initiating fault. Investing $150K in a dynamic transformer rating system gives the adjacent feeders 20 more minutes of headroom — enough for automated load shedding to prevent the cascade. The risk matrix sees two unrelated $200K and $150K investments. The causal model sees a $350K investment that prevents a 45,000-customer outage. The return on the combined investment is many times larger than the sum of the individual returns.

7. The Regulatory Argument

From "trust our judgment" to "inspect our model."

Utility regulators are moving from prescriptive rules ("replace assets older than 40 years") toward performance-based frameworks ("demonstrate that your investment plan is the most effective use of ratepayer funds"). This shift changes the burden of proof. Utilities must now justify why they chose one investment over another — and "the risk matrix scored it higher" is increasingly insufficient.

What Regulators Increasingly Ask

Regulatory Question	Rung Required	Risk Matrix Answer
"Why did you prioritise substation X over substation Y?"	Rung 2 (intervention)	"X scored higher." (No mechanism, no traceability)
"What is the expected reliability improvement per dollar of your proposed capital plan?"	Rung 2 (intervention)	Cannot compute — scores don't translate to outcomes
"Was the maintenance deferral decision on Circuit 47 prudent given information available at the time?"	Rung 3 (counterfactual)	Cannot answer — no model of what would have happened
"How does your proposed wildfire mitigation plan reduce ignition probability in each fire threat district?"	Rung 2 (intervention)	Can estimate total spend per district, cannot estimate ignition probability reduction per dollar
"Would the extended outage have been avoided with the investment you deferred?"	Rung 3 (counterfactual)	Engineering opinion only

A causal model answers every question in this table with computation, not opinion. The assumptions are explicit — visible in the graph structure and structural equations. The regulator can challenge any edge, any parameter, any assumption. This is a stronger position than "our engineers believe," not a weaker one, because it replaces undocumented judgment with inspectable reasoning.

The Prudency Standard

Prudency review asks: "Was the utility's decision reasonable given what was known at the time?" This is a counterfactual question. The causal model makes it computable: condition the model on information available at the decision date, compute the expected outcomes under the chosen action and the alternative, and compare. If the model shows the chosen action was optimal given the available information, the utility has a quantitative defence. If it shows the alternative was better, the utility knows before the regulator does — and can explain the reasoning that led to the decision.
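The comparison described above can be sketched as an expected-cost calculation run twice: once with the failure probability the utility could have estimated at the decision date, and once with hindsight. All figures and names below are hypothetical, and the single-number expected cost stands in for what would be a full model query in practice.

```python
def expected_cost(action_cost, p_failure, failure_cost):
    """Cost of the action plus the probability-weighted cost of a failure."""
    return action_cost + p_failure * failure_cost


def best_action(p_failure_by_action, action_costs, failure_cost):
    """Return the lowest expected-cost action and the cost of each option."""
    costs = {
        action: expected_cost(action_costs[action], p, failure_cost)
        for action, p in p_failure_by_action.items()
    }
    return min(costs, key=costs.get), costs


ACTION_COSTS = {"defer": 0, "replace": 400_000}   # hypothetical
FAILURE_COST = 5_000_000                          # hypothetical consequence cost

# Failure probabilities as estimated from information at the decision date...
at_decision_date = {"defer": 0.02, "replace": 0.005}
# ...and as they look after the event, with inspection data that came later.
with_hindsight = {"defer": 0.30, "replace": 0.005}

print(best_action(at_decision_date, ACTION_COSTS, FAILURE_COST))
print(best_action(with_hindsight, ACTION_COSTS, FAILURE_COST))
```

Under these assumed numbers, deferral is the lower-cost choice on the information available at the time, even though replacement looks better in hindsight. The prudency question turns on the first calculation, not the second.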

Transparency as Strategy

A utility that presents a causal model to a regulator is saying: "Here are our assumptions, our data, our reasoning, and our conclusions. Challenge any part of it." This is a fundamentally stronger position than presenting a risk score and asking the regulator to trust the process. The model invites scrutiny because it can withstand it. The risk matrix avoids scrutiny because it cannot.

8. What To Do

A practical path from risk registers to causal risk models.

Transitioning from matrix-based risk management to causal modeling doesn't require replacing everything at once. The most effective approach starts with a single high-consequence failure mode and expands from there.

The Path

Step	Action	Detail
1	Pick one failure mode	Choose the one that keeps your executives up at night — wildfire ignition, cascading outage, critical-load interruption. Build the causal graph for that single failure mode: what causes it, what determines its severity, what interventions exist at each stage.
2	Map the causal structure	Assemble your engineers, planners, and operators in a room. Draw the graph on a whiteboard. Identify which edges are well-understood (physics-based), which are estimated (expert judgment), and which are unknown. This step alone surfaces hidden assumptions and disagreements that the risk matrix conceals.
3	Validate on synthetic data	Write structural equations for each edge. Simulate data where the ground truth is known. Run your estimator and confirm it recovers the true effects. This is the generative validation step — it catches modelling errors before real data is involved.
4	Connect real data	Feed in your actual asset condition data, outage history, weather records, network topology, and maintenance records. Train the parameters. Run sensitivity analysis: which assumptions matter, which don't?
5	Ask Rung 2 questions	Run the interventional queries your planners actually need: "If we shift $X from programme A to programme B, what happens to expected CMI?" Compare the answers to what the risk matrix recommends. Where they agree, good. Where they diverge, investigate.
6	Build toward Rung 3	Once the Rung 2 model is validated and trusted, extend to counterfactual queries: after-the-fact analysis of specific events, prudency documentation, scenario planning for regulatory proceedings.
7	Scale	Repeat for additional failure modes. Connect the individual models into a system-level risk model that captures interactions between failure modes. This becomes the foundation for integrated resource planning, rate case support, and continuous risk management.
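Step 3 in the table can be tried in a few lines. The sketch below simulates data from assumed structural equations in which wind drives both vegetation overhang and faults, then checks whether an estimator recovers the effect that was built in. The variable names, coefficients, and the regression-based estimator are illustrative choices, not a prescribed method.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualise(y, z):
    """Remove the part of y that is linearly explained by z."""
    b, mz, my = slope(z, y), mean(z), mean(y)
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]


random.seed(0)
TRUE_EFFECT = 2.0   # ground truth: faults per unit of overhang (assumed)

# Assumed structural equations: wind is a common cause of overhang and faults.
wind = [random.gauss(0, 1) for _ in range(5000)]
overhang = [0.5 * w + random.gauss(0, 1) for w in wind]
faults = [TRUE_EFFECT * o + 3.0 * w + random.gauss(0, 1)
          for o, w in zip(overhang, wind)]

naive = slope(overhang, faults)   # ignores the common cause
adjusted = slope(residualise(overhang, wind), residualise(faults, wind))

print(f"true effect {TRUE_EFFECT}, naive {naive:.2f}, adjusted {adjusted:.2f}")
```

Because the ground truth is known, a gap between the adjusted estimate and the true effect points to a modelling or coding error rather than to something in the data. That is the purpose of running this step before real data is connected.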

Start Small, Start Now

A causal model of a single critical failure mode — built in 6–8 weeks with your existing data and engineering knowledge — will produce better investment decisions than a risk matrix covering your entire asset base. The goal is not to replace everything at once. The goal is to demonstrate, on one high-stakes problem, that asking the right questions produces different and better answers.

9. Further Reading

The foundations behind this approach.

Source	Relevance
Pearl, J. Causality (2nd ed, 2009)	The foundational text on structural causal models, do-calculus, and the three rungs of causation
Pearl, J. & Mackenzie, D. The Book of Why (2018)	Accessible introduction to causal inference and Pearl's Ladder — the framework underlying this entire approach
McElreath, R. Statistical Rethinking (2nd ed, 2020)	Bayesian workflow with generative causal models: simulate first, then estimate
Fenton, N. & Neil, M. Risk Assessment and Decision Analysis with Bayesian Networks (2nd ed, 2018)	Practical guide to building Bayesian network risk models — directly applicable to utility asset management and reliability planning
IEEE 1366 Guide for Electric Power Distribution Reliability Indices	Standard definitions for SAIDI, SAIFI, CAIDI, and other reliability metrics used throughout this analysis
CPUC Wildfire Safety Division reports	California's regulatory framework for utility wildfire risk — demonstrates the shift toward risk-based, evidence-driven safety planning
NERC Standard TPL-001-5 Transmission System Planning Performance Requirements	Reliability standards that increasingly require utilities to demonstrate the adequacy of their planning models

What does your grid's causal structure look like?

A focused conversation about your highest-consequence failure mode is the fastest way to find out whether a causal model changes your investment plan.
