When the Grid Goes Dark — Causal Risk Modeling for Utilities

Risk matrices can't answer causal questions. Causal models can.

The Bottom Line

  • The Problem: Utility risk management uses likelihood × impact matrices that treat "transformer failure" as one risk. But "transformer failure on a radial feeder serving a hospital" and "transformer failure in a meshed network with redundancy" are completely different risks with different consequences — and the matrix cannot distinguish them.
  • The Insight: The grid is a causal system: weather causes vegetation contact, which causes line faults, which trigger protection relays, which cause cascading outages, which create customer impact, which produce regulatory penalties. Each link in the chain is a point where intervention has a different cost and a different effect.
  • The Action: Model the causal structure of failure propagation. Use interventional queries (Rung 2) to optimise maintenance and capital investment. Use counterfactual queries (Rung 3) to answer regulatory questions: "Would this outage have occurred if we had replaced that asset?"
1. The Problem

Risk matrices treat the grid as a spreadsheet. The grid is a network.

A major utility manages tens of thousands of assets — transformers, conductors, poles, switches, substations — spread across thousands of kilometres. Each asset can fail. When one fails, the consequences depend entirely on where it sits in the network and what else is happening at the time. A failed transformer in a meshed urban network triggers automatic switching and nobody loses power. The same failure on a radial rural feeder serving a hospital means a critical outage, an emergency generator deployment, regulatory scrutiny, and potential liability.

The risk matrix cannot distinguish these two scenarios. Both are "transformer failure." Both get the same likelihood score (based on asset age and condition) and the same impact score (based on replacement cost). The matrix produces a single number — say L=3, I=4, score=12 — that tells the planning team nothing about which failure to prevent first.
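To make the contrast concrete, here is a minimal Python sketch. The `Transformer` fields, the failure probability, the restoration time, and the critical-load weight are all illustrative assumptions, not figures from any utility. The only point is that two assets with an identical matrix score can carry very different consequences once topology and load criticality are taken into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformer:
    name: str
    likelihood: int      # 1-5 matrix likelihood score
    impact: int          # 1-5 matrix impact score (replacement-cost based)
    feeder_type: str     # "radial" or "meshed"
    critical_load: bool  # e.g. a hospital on the feeder
    customers: int


def matrix_score(t: Transformer) -> int:
    """Classic likelihood x impact score."""
    return t.likelihood * t.impact


def expected_customer_hours(t: Transformer, annual_failure_prob: float,
                            restore_hours: float) -> float:
    """Toy consequence model (hypothetical rules, for illustration only)."""
    if t.feeder_type == "meshed":
        return 0.0  # assumption: automatic switching absorbs the failure
    weight = 10.0 if t.critical_load else 1.0  # assumed critical-load weight
    return annual_failure_prob * t.customers * restore_hours * weight


urban = Transformer("T-urban", 3, 4, "meshed", False, 5000)
rural = Transformer("T-rural", 3, 4, "radial", True, 1200)

for t in (urban, rural):
    print(t.name, matrix_score(t), expected_customer_hours(t, 0.05, 8.0))
```

Both units score 12 on the matrix, yet only the radial unit contributes any expected customer-hours under these assumptions.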

What the Matrix Misses

Factor | Why It Matters | Matrix Score
Network topology | Radial vs meshed feeders determine whether a single failure causes a cascading outage or is absorbed by redundancy | Not captured
Load criticality | A hospital, water treatment plant, or data centre on the affected feeder changes the consequence by orders of magnitude | Averaged away
Concurrent conditions | Asset failure during a heat wave with peak load creates cascading overloads that the same failure in mild weather does not | Not captured
Protection coordination | Whether upstream relays isolate the fault or allow propagation depends on protection settings and topology | Not captured
Repair logistics | Failure in a remote area with no spare transformer and limited road access has a 72-hour restoration time; the same failure at an accessible substation takes 8 hours | Same score
Regulatory exposure | Failure during a performance-based rate case or after a deferred maintenance decision creates liability that identical failure at other times does not | Not captured

Every row in this table represents a causal pathway — a chain of variables where the state of one determines the consequence of another. The risk matrix collapses all of them into a single score. The result is a maintenance plan that allocates budget by asset age and generic failure rates, not by the actual risk each asset poses to the system.

The Core Failure

The risk matrix asks "How likely is this asset to fail?" and "How expensive is this asset to replace?" A causal model asks "If this asset fails, what happens to the system, and what can we do about it?" These are fundamentally different questions, and they produce fundamentally different investment plans.

2. The Causal Structure

The grid as a directed acyclic graph: from weather event to financial loss.

A utility's risk landscape is a causal system with well-understood mechanisms. Weather, asset condition, network topology, operational response, and regulatory context combine to produce outcomes. The causal graph makes these dependencies explicit and computable.

The Primary Causal Chain

Weather Event → Vegetation Contact → Line Fault → Protection Response → Load Redistribution → Cascading Outage → Customer Impact → Regulatory Penalty

But this is only one chain. The full graph includes parallel and interacting pathways:

Pathway | Variables | Key Interaction
Weather → Asset Failure | Wind speed, ice loading, ambient temperature, soil moisture (for pole foundations), lightning density | Weather doesn't just cause vegetation contact; it directly stresses conductors, poles, and transformers through thermal and mechanical loading
Asset Condition → Failure Probability | Age, maintenance history, inspection results, loading pattern, manufacturer, installation method | Condition moderates the weather-to-failure relationship: a well-maintained asset survives loads that destroy a degraded one
Topology → Consequence | Radial vs meshed configuration, number of tie switches, SCADA automation level, distributed generation capacity | The same initiating event produces different consequences depending on where it occurs in the network
Load → Cascading | Time of day, season, demand response capacity, behind-the-meter generation, EV charging load | Peak load at time of failure determines whether the fault cascades to adjacent feeders
Response → Duration | Crew availability, spare inventory, access constraints, storm coordination, mutual aid status | Operational readiness determines restoration time, which drives customer minutes interrupted (CMI) and regulatory exposure

In a risk matrix, these are all separate "risks" with separate scores. In a causal graph, they are connected variables whose joint behaviour determines outcomes. The graph captures what the matrix cannot: that vegetation management, asset replacement, network reconfiguration, and operational readiness are not independent investments — they interact, and the return on one depends on the state of the others.
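One way to make such a graph "explicit and computable" is to store each variable with its direct causes and let the standard library check that the structure is acyclic. The sketch below is an assumption-laden simplification: the node names and edges are a hypothetical reading of the chain and pathways described above, not a validated model of any network.

```python
from graphlib import TopologicalSorter

# node -> set of direct causes (parents); edges are illustrative assumptions
PARENTS = {
    "weather": set(),
    "asset_condition": set(),
    "topology": set(),
    "pre_event_load": set(),
    "crew_readiness": set(),
    "vegetation_contact": {"weather"},
    "line_fault": {"vegetation_contact", "weather", "asset_condition"},
    "protection_response": {"line_fault", "topology"},
    "load_redistribution": {"protection_response", "topology", "pre_event_load"},
    "cascading_outage": {"load_redistribution"},
    "customer_impact": {"cascading_outage", "crew_readiness"},
    "regulatory_penalty": {"customer_impact"},
}


def causal_order(parents):
    """Return the nodes with causes before effects (raises CycleError if cyclic)."""
    return list(TopologicalSorter(parents).static_order())


def descendants(parents, node):
    """Everything downstream of `node`: the variables an intervention there can affect."""
    children = {}
    for child, causes in parents.items():
        for cause in causes:
            children.setdefault(cause, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


print(causal_order(PARENTS))
print(sorted(descendants(PARENTS, "topology")))
```

Listing the descendants of a node is a quick way to see which outcomes an investment at that point could plausibly influence, which is the question the matrix has no way to pose.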

Why the Graph Matters for Investment

A risk matrix says: "Rank assets by risk score, replace from the top." A causal graph says: "Trace the failure propagation paths, find the nodes with the highest leverage, invest there." Sometimes the highest-leverage investment isn't replacing the worst asset — it's adding a tie switch that prevents the failure from cascading, or clearing vegetation on a critical corridor that feeds three hospitals. The graph reveals these interventions. The matrix cannot.

3. Three Rungs Applied

From "what correlates with outages" to "would this outage have happened if we'd acted differently."

Rung 1 — Seeing: What Correlates with Outages?

This is where most utility analytics lives today. Asset health indices, failure rate curves, weather correlation models, predictive maintenance scores. These tools observe patterns: older transformers fail more often; outages increase during storms; feeders with deferred vegetation management have more faults.

All of this is useful — and all of it is Rung 1. It cannot distinguish correlation from causation, and it cannot tell you what happens if you intervene. For example:

Rung 1 Finding | The Problem
"Feeders with more truck rolls have longer outage durations" | Does slow response cause long outages, or do long outages (caused by severe faults) require more truck rolls? The causal arrow could run in either direction.
"Assets replaced under the accelerated programme fail less often" | Were those assets chosen because they sat in good network positions with easy access? If so, the programme selected for replacement ease, not risk reduction, and the lower failure rate partly reflects that selection.
"Regions with higher vegetation management spend have fewer tree-related faults" | True, but how much of the spending is in the right corridors? Shifting $1M from a low-consequence corridor to a high-consequence one might reduce CMI more than adding $2M of total spend.

Rung 1 Summary

Rung 1 can tell you which assets are failing and what conditions are associated with failure. It cannot tell you which investments will reduce failure consequences most effectively. For that, you need Rung 2.
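The selection problem in the accelerated-programme finding can be shown with a few lines of arithmetic. The counts below are made up for illustration: replaced units are concentrated at benign, easy-access sites, so the pooled comparison overstates the benefit of replacement relative to the like-for-like comparison within each site type.

```python
# (site_type, replaced) -> (units, failures); hypothetical counts, not real data
COUNTS = {
    ("benign", True): (800, 12),
    ("benign", False): (200, 4),
    ("harsh", True): (200, 18),
    ("harsh", False): (800, 80),
}


def failure_rate(replaced, site_type=None):
    """Failure rate for replaced / not-replaced units, pooled or within one site type."""
    units = failures = 0
    for (site, rep), (n, k) in COUNTS.items():
        if rep == replaced and site_type in (None, site):
            units += n
            failures += k
    return failures / units


# Pooled (Rung 1) comparison: mixes the programme's effect with where it was applied
naive_gap = failure_rate(False) - failure_rate(True)

# Like-for-like comparison within each site type
stratified_gaps = {
    site: failure_rate(False, site) - failure_rate(True, site)
    for site in ("benign", "harsh")
}

print(f"pooled gap: {naive_gap:.3f}")
for site, gap in stratified_gaps.items():
    print(f"{site} gap: {gap:.3f}")
```

With these counts the pooled gap is several times larger than the gap within either site type, because most of the apparent benefit comes from where the programme was applied rather than from the replacement itself.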

Rung 2 — Doing: What Happens If We Act?

Rung 2 asks interventional questions — the questions that actually drive capital and maintenance budgets:

Question | Why It's Rung 2
"If we shift $5M from substation hardening to vegetation management in the southern corridor, what happens to storm-related CMI?" | Requires modelling the causal effect of vegetation condition on fault rate, and fault rate on outage propagation, accounting for network topology
"If we install automated switches on these 12 feeders, how much does expected unserved energy decrease?" | Requires modelling the causal effect of switching speed on outage duration, for the specific topology of each feeder
"If we defer transformer replacement on these 200 units by two years, what is the incremental risk to SAIDI/SAIFI targets?" | Requires modelling the causal relationship between asset condition, failure probability, and system-level reliability, not just the asset-level failure rate
"Which combination of investments across vegetation, asset replacement, and automation produces the largest CMI reduction per dollar?" | This is causal optimisation: finding the optimal intervention across multiple levers simultaneously, accounting for their interactions

These are interventional queries, written in do-notation as P(CMI | do(invest_in_X)). They require a causal model that represents the mechanisms connecting investment to outcome. A risk matrix cannot answer them because it has no concept of mechanism, only scores.
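A small worked example shows how an interventional probability can differ from the observed one. The sketch assumes a single confounder (wind exposure of the corridor) that influences both where clearing is done and how often faults occur, and applies the standard backdoor adjustment, P(fault | do(clear)) = Σ_z P(fault | clear, z) · P(z). All probabilities are hypothetical.

```python
# Hypothetical inputs; wind exposure is assumed to be the only confounder.
P_Z = {"low_wind": 0.6, "high_wind": 0.4}              # P(z)
P_CLEAR_GIVEN_Z = {"low_wind": 0.2, "high_wind": 0.8}  # P(clear=1 | z)
P_FAULT = {                                            # P(fault=1 | clear, z)
    (1, "low_wind"): 0.02, (0, "low_wind"): 0.05,
    (1, "high_wind"): 0.15, (0, "high_wind"): 0.30,
}


def p_clear(clear, z):
    """P(clear | z) for clear in {0, 1}."""
    return P_CLEAR_GIVEN_Z[z] if clear == 1 else 1.0 - P_CLEAR_GIVEN_Z[z]


def p_fault_observed(clear):
    """Rung 1: P(fault | clear), weighting strata by P(z | clear)."""
    joint = sum(P_FAULT[(clear, z)] * p_clear(clear, z) * pz for z, pz in P_Z.items())
    marginal = sum(p_clear(clear, z) * pz for z, pz in P_Z.items())
    return joint / marginal


def p_fault_do(clear):
    """Rung 2: P(fault | do(clear)) via backdoor adjustment over wind exposure."""
    return sum(P_FAULT[(clear, z)] * pz for z, pz in P_Z.items())


print("observed:", p_fault_observed(1), p_fault_observed(0))
print("do():    ", p_fault_do(1), p_fault_do(0))
```

Under these assumed numbers, cleared corridors show a higher observed fault rate than uncleared ones, simply because clearing is concentrated in windy corridors, while the adjusted estimate shows clearing roughly halving the fault probability. The adjustment is only valid if the assumed confounder set is complete.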

Rung 3 — Imagining: Would This Have Happened?

Rung 3 is where utilities face their hardest questions — usually in regulatory proceedings, after something has gone wrong:

Question | Why It's Rung 3
"Would the August 14 cascading outage have occurred if we had replaced transformer T-4471 in the spring maintenance cycle?" | Counterfactual: fix the specific individual circumstances of this event, change one variable, propagate
"Would the wildfire ignition have occurred if vegetation clearance on Corridor 12 had been completed on schedule?" | Counterfactual with legal and financial consequences: the answer determines liability
"Given the storm conditions on March 3, what is the minimum investment that would have prevented the extended outage in District 7?" | Counterfactual optimisation: searching across interventions in a specific factual scenario
"If this customer had been on a feeder with automated restoration, would their 14-hour outage have been less than 4 hours?" | Individual-level counterfactual: specific customer, specific event, specific alternative investment

These questions are not hypothetical. They appear in rate cases, prudency reviews, wildfire litigation, regulatory investigations, and insurance claims. Today, utilities answer them with engineering judgment and after-the-fact narrative. A structural causal model answers them with computation — traceable, auditable, and defensible.
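The individual-level question about the 14-hour outage can be sketched with the usual three-step counterfactual recipe: abduction (infer the event-specific noise from what was observed), action (change the variable of interest), and prediction (recompute the outcome). The structural equation and its coefficients below are hypothetical placeholders; a real model would be fitted and validated first.

```python
# Hypothetical structural equation for restoration time (hours):
#   hours = BETA_SEVERITY * severity + MANUAL_SWITCHING_HOURS * (not automated) + u
BETA_SEVERITY = 2.5           # assumed hours per unit of fault severity
MANUAL_SWITCHING_HOURS = 7.0  # assumed extra hours when switching is manual


def outage_hours(severity, automated, u):
    """Structural equation; `u` captures everything specific to this event."""
    manual = 0.0 if automated else 1.0
    return BETA_SEVERITY * severity + MANUAL_SWITCHING_HOURS * manual + u


def counterfactual_outage_hours(severity, observed_automated, observed_hours,
                                cf_automated):
    # 1. Abduction: recover this event's noise term from the observed outcome.
    u = observed_hours - outage_hours(severity, observed_automated, u=0.0)
    # 2. Action: set automation to its counterfactual value.
    # 3. Prediction: re-evaluate the equation with the same noise term.
    return outage_hours(severity, cf_automated, u)


# Observed: severity 2.0, manual switching, 14-hour outage.
cf_hours = counterfactual_outage_hours(2.0, False, 14.0, cf_automated=True)
print(f"counterfactual outage: {cf_hours:.1f} h; under 4 h? {cf_hours < 4.0}")
```

Keeping the noise term fixed is what makes this a statement about this customer and this event rather than about an average feeder.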

The Regulatory Difference

When a regulator asks "Was this outage preventable?", the utility currently offers an engineer's opinion. With a causal model, the utility offers a computation: "Given the conditions on that day, replacing T-4471 reduces the cascading outage probability from 73% to 11%. Here is the model, the data, and the assumptions. You can inspect every edge." The engineer's opinion is debatable. The model is inspectable.

4. The Dollar Gap

Same $40M budget. Different allocation. $12M more in avoided outage costs.

A regional utility has a $40M annual capital and maintenance budget for reliability improvement. The risk-register approach ranks assets by failure probability × replacement cost and allocates top-down. The causal model traces failure propagation paths through the network and allocates by system-level consequence reduction per dollar.

Two Allocation Strategies

Category | Risk Register (L×I) | Causal Model (Failure-Path DAG)
Asset Replacement | $22M: replace the 200 oldest transformers | $14M: replace 90 transformers on critical radial feeders and high-cascade-risk nodes
Vegetation Management | $10M: cycle-based clearing across all corridors | $12M: prioritised by corridor consequence (feeders serving critical loads and areas with highest fault-to-outage propagation)
Network Automation | $3M: SCADA upgrades at selected substations | $9M: automated sectionalising switches on the 40 feeders with highest cascading potential
Inspection & Monitoring | $5M: condition-based inspection programme | $5M: same, plus real-time loading sensors on identified cascade-critical corridors
Total Spend | $40M | $40M

Projected Outcomes

Metric | Risk Register Allocation | Causal Allocation
SAIDI reduction | 8% (mostly from replacing old assets) | 22% (from closing propagation paths + faster restoration)
Expected outage cost avoided | ~$9M per year | ~$21M per year
Critical-load outage events | Reduced by ~15% | Reduced by ~45%
Regulatory risk reduction | Moderate: can demonstrate asset investment | High: can demonstrate targeted, consequence-based investment with traceable logic

The risk register replaced more transformers. The causal model replaced fewer, but targeted the ones whose failure propagates through the network. It redirected the savings into automation and vegetation management on the corridors where faults become cascading outages. Same budget, roughly 2.3× the outage cost avoidance ($21M versus $9M per year).
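The arithmetic behind the comparison is easy to check. The snippet below simply re-adds the two allocations from the tables above and derives the difference, the ratio, and the avoided cost per budget dollar; the dictionary keys are shorthand introduced here, and the figures are the article's illustrative ones.

```python
# Figures in $M, copied from the two tables above (illustrative scenario).
ALLOCATION_MUSD = {
    "risk_register": {"asset_replacement": 22, "vegetation": 10,
                      "automation": 3, "inspection": 5},
    "causal_model": {"asset_replacement": 14, "vegetation": 12,
                     "automation": 9, "inspection": 5},
}
AVOIDED_COST_MUSD = {"risk_register": 9, "causal_model": 21}

totals = {name: sum(lines.values()) for name, lines in ALLOCATION_MUSD.items()}
extra_avoided = AVOIDED_COST_MUSD["causal_model"] - AVOIDED_COST_MUSD["risk_register"]
ratio = AVOIDED_COST_MUSD["causal_model"] / AVOIDED_COST_MUSD["risk_register"]
avoided_per_dollar = {name: AVOIDED_COST_MUSD[name] / totals[name] for name in totals}

print(totals, extra_avoided, round(ratio, 1), avoided_per_dollar)
```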

The Board Slide

The risk register says: "We replaced 200 transformers." The causal model says: "We reduced expected outage costs by $21M, critical-load outages by 45%, and SAIDI by 22% — for the same $40M." The first is an activity report. The second is a business case. Regulators and boards increasingly demand the second.

5. Wildfire: The Ultimate Causal Chain

Equipment condition → ignition → terrain → fire spread → liability. $30 billion at stake.

Wildfire is the risk that turned utility risk management from an operational concern into an existential one. Pacific Gas & Electric's equipment caused the 2018 Camp Fire, which killed 85 people and destroyed the town of Paradise. PG&E filed for bankruptcy in 2019, facing an estimated $30 billion in wildfire liabilities. The August 2023 Maui wildfires raised similar questions for Hawaiian Electric.

The causal structure of utility-caused wildfire is a textbook directed acyclic graph:

Equipment Condition → Ignition Event → Fuel Conditions → Wind & Terrain → Fire Spread → Structure Loss → Liability

With additional causal inputs at every stage: vegetation clearance affects fuel conditions; public safety power shutoff (PSPS) decisions affect ignition probability; community evacuation infrastructure affects casualty outcomes; inverse condemnation doctrine determines liability allocation.

Why the Risk Matrix Failed

Before the Camp Fire, PG&E's risk models scored wildfire risk using historical ignition data, vegetation proximity, and wind exposure. This is Rung 1: patterns in past data. The models could not answer the Rung 2 question that mattered: "If we replace this specific transmission tower hook on the Caribou-Palermo line, how much does ignition probability decrease in a red-flag weather event?" And after the fire, they could not answer the Rung 3 question the courts demanded: "Would the Camp Fire have occurred if PG&E had replaced that equipment?"

What a Causal Model Provides

Rung | Question | Value
Rung 1 | Which equipment is correlated with historical ignitions? | Useful for prioritising inspections, but cannot distinguish equipment that caused ignitions from equipment that happened to be nearby
Rung 2 | If we harden the 500 highest-risk spans, how much does system-wide ignition probability decrease? | Answers the capital planning question: where to invest for maximum wildfire risk reduction per dollar
Rung 2 | Under what conditions should we call a PSPS event, given the trade-off between ignition risk and customer impact? | Optimal PSPS decision policy: balances wildfire prevention against economic and safety costs of de-energisation
Rung 3 | Would this ignition have occurred if we had completed the vegetation clearing on this corridor? | Answers the litigation question: traceable, auditable attribution of cause
Rung 3 | Given the weather and fuel conditions on that day, what is the minimum equipment investment that would have prevented the fire? | Answers the prudency question: was the utility's spending adequate given what was knowable?

The $30 Billion Question

PG&E's bankruptcy was not caused by a lack of risk awareness — it was caused by a lack of causal reasoning. The utility knew wildfire was a top risk. It had a risk score. What it didn't have was a model that could trace the causal chain from specific equipment decisions to specific fire outcomes, optimise investment across that chain, and defend those decisions when they were challenged. That model is a Bayesian causal network. The cost of not having one was $30 billion.
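The PSPS trade-off listed among the Rung 2 questions above reduces, in its simplest form, to comparing two expected costs. The sketch below is a deliberately crude break-even calculation: the cost per customer-hour, the fire loss, and the ignition probabilities are hypothetical, and it treats de-energisation as eliminating ignition risk entirely, which a real policy model would not assume.

```python
def shutoff_cost(customers, hours, cost_per_customer_hour):
    """Monetised customer impact of de-energising (assumed linear)."""
    return customers * hours * cost_per_customer_hour


def breakeven_ignition_probability(cost_of_shutoff, fire_loss):
    """Ignition probability at which staying energised costs as much as shutting off."""
    if fire_loss <= 0:
        raise ValueError("fire_loss must be positive")
    return cost_of_shutoff / fire_loss


def should_deenergise(p_ignition, cost_of_shutoff, fire_loss):
    """Shut off when the expected fire loss exceeds the cost of the shutoff."""
    return p_ignition * fire_loss > cost_of_shutoff


# Hypothetical inputs: 20,000 customers, 12 hours, $25 per customer-hour, $2B loss.
cost = shutoff_cost(20_000, 12, 25.0)
threshold = breakeven_ignition_probability(cost, 2_000_000_000)

print(cost, threshold)
print(should_deenergise(0.01, cost, 2_000_000_000))
print(should_deenergise(0.001, cost, 2_000_000_000))
```

The causal model's contribution is the ignition probability itself, conditioned on equipment state, fuel, and forecast wind for the specific circuit; the decision rule on top of it can stay this simple.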

6. Storm Resilience & Cascading Failure

How a single fault becomes a system-wide outage, and where to break the chain.

Major storm events expose the weakness of asset-by-asset risk assessment most dramatically. A Category 2 hurricane doesn't fail one asset — it stresses thousands simultaneously. The difference between a manageable event and a catastrophic cascading failure depends on the interaction between asset condition, network topology, protection coordination, load levels, and operational response. All of these are causal relationships.

Anatomy of a Cascading Outage

Consider a simplified but realistic scenario: a summer heat wave coincides with a thunderstorm. Ambient temperature has pushed transformer loading to 95% of rating. A tree limb contacts a 69 kV feeder, tripping a breaker. The load transfers to an adjacent feeder, pushing its transformers to 110% of rating. Within 40 minutes, two transformers overheat and trip on thermal protection. The load attempts to redistribute again, but the remaining paths are also at capacity. The result is a cascading outage affecting 45,000 customers, when the initiating event was a single tree contact on a single feeder.

A risk matrix scores the tree contact and each transformer independently. It cannot model the cascade because it has no concept of load flow, thermal dynamics, or protection coordination. The causal model represents all of these:

Variable | Role in the Graph | Intervention Point
Vegetation condition | Direct cause of the initiating fault | Targeted clearing on high-consequence corridors
Pre-event loading | Determines whether load transfer causes thermal overload | Demand response programmes, pre-event load shedding protocols
Transformer thermal margin | Determines how long adjacent feeders survive increased load | Transformer upgrade or dynamic rating systems
Protection coordination | Determines whether the fault is isolated or propagates | Protection setting review, addition of sectionalising reclosers
Tie switch automation | Determines restoration speed after the cascade stabilises | SCADA-controlled automated switching on critical feeders
Spare transformer inventory | Determines extended restoration time | Strategic pre-positioning of mobile transformers

The causal model computes the joint effect of investing in any combination of these interventions. The risk matrix can only score them independently. In a system where interactions dominate — where the return on vegetation management depends on the state of protection coordination, and the value of automation depends on the loading profile — independent scoring is not just imprecise, it allocates resources to the wrong places.
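A toy load-transfer simulation illustrates why pre-event loading and thermal margin cannot be scored independently of the initiating fault. This is not a power-flow solver: it assumes a tripped feeder's load is shared equally among the feeders still in service, ignores timing, and uses invented loads and limits in MW.

```python
def simulate_cascade(loads, limits, initial_trip):
    """Return the set of feeders that trip after `initial_trip` is lost.

    Assumption: a tripped feeder's load is split equally among survivors,
    and any survivor pushed above its limit trips in turn.
    """
    loads = dict(loads)  # work on a copy
    tripped = set()
    pending = [initial_trip]
    while pending:
        feeder = pending.pop()
        if feeder in tripped:
            continue
        tripped.add(feeder)
        survivors = [f for f in loads if f not in tripped]
        if not survivors:
            break
        share = loads[feeder] / len(survivors)
        loads[feeder] = 0.0
        for f in survivors:
            loads[f] += share
        pending.extend(f for f in survivors if loads[f] > limits[f])
    return tripped


# Hypothetical heat-wave state: three feeders already at 95% of a 10 MW limit.
BASE_LOADS = {"A": 3.0, "B": 9.5, "C": 9.5, "D": 9.5}
BASE_LIMITS = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0}

# Two interventions from the table above, modelled crudely:
uprated_limits = {f: 10.6 for f in BASE_LIMITS}             # dynamic rating
shed_loads = {f: 0.9 * mw for f, mw in BASE_LOADS.items()}  # 10% demand response

print(sorted(simulate_cascade(BASE_LOADS, BASE_LIMITS, "A")))
print(sorted(simulate_cascade(BASE_LOADS, uprated_limits, "A")))
print(sorted(simulate_cascade(shed_loads, BASE_LIMITS, "A")))
```

In the base case the loss of the lightly loaded feeder takes down all four; either intervention alone contains the event to the initiating feeder. The value of each measure therefore depends on the loading state and on whether the other measure is already in place, which is exactly the interaction an independent score cannot express.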

The Compounding Effect

In the cascading scenario above, investing $200K in vegetation clearing on that single corridor prevents the initiating fault. Investing $150K in a dynamic transformer rating system gives the adjacent feeders 20 more minutes of headroom, enough for automated load shedding to prevent the cascade. The risk matrix sees two unrelated investments of $200K and $150K. The causal model sees a $350K investment that prevents a 45,000-customer outage. The return on the combined investment is many times larger than the sum of the individual returns.

7. The Regulatory Argument

From "trust our judgment" to "inspect our model."

Utility regulators are moving from prescriptive rules ("replace assets older than 40 years") toward performance-based frameworks ("demonstrate that your investment plan is the most effective use of ratepayer funds"). This shift changes the burden of proof. Utilities must now justify why they chose one investment over another — and "the risk matrix scored it higher" is increasingly insufficient.

What Regulators Increasingly Ask

Regulatory Question | Rung Required | Risk Matrix Answer
"Why did you prioritise substation X over substation Y?" | Rung 2 (intervention) | "X scored higher." (No mechanism, no traceability)
"What is the expected reliability improvement per dollar of your proposed capital plan?" | Rung 2 (intervention) | Cannot compute: scores don't translate to outcomes
"Was the maintenance deferral decision on Circuit 47 prudent given information available at the time?" | Rung 3 (counterfactual) | Cannot answer: no model of what would have happened
"How does your proposed wildfire mitigation plan reduce ignition probability in each fire threat district?" | Rung 2 (intervention) | Can estimate total spend per district, cannot estimate ignition probability reduction per dollar
"Would the extended outage have been avoided with the investment you deferred?" | Rung 3 (counterfactual) | Engineering opinion only

A causal model answers every question in this table with computation, not opinion. The assumptions are explicit — visible in the graph structure and structural equations. The regulator can challenge any edge, any parameter, any assumption. This is a stronger position than "our engineers believe," not a weaker one, because it replaces undocumented judgment with inspectable reasoning.

The Prudency Standard

Prudency review asks: "Was the utility's decision reasonable given what was known at the time?" This is a counterfactual question. The causal model makes it computable: condition the model on information available at the decision date, compute the expected outcomes under the chosen action and the alternative, and compare. If the model shows the chosen action was optimal given the available information, the utility has a quantitative defence. If it shows the alternative was better, the utility knows before the regulator does — and can explain the reasoning that led to the decision.
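The comparison described above can be sketched as an expected-cost calculation that uses only decision-date beliefs. The `Option` class, the probabilities, and the dollar amounts are hypothetical; in practice the failure probabilities would come from the causal model conditioned on what was known at the time, not from hindsight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    upfront_cost: float  # dollars spent if this option is chosen
    p_failure: float     # failure probability as believed at the decision date


def expected_cost(option: Option, failure_consequence: float) -> float:
    """Upfront spend plus probability-weighted consequence."""
    return option.upfront_cost + option.p_failure * failure_consequence


def prudent_choice(options, failure_consequence: float) -> Option:
    """The option with the lowest expected cost given decision-date information."""
    return min(options, key=lambda o: expected_cost(o, failure_consequence))


defer = Option("defer replacement", upfront_cost=0.0, p_failure=0.04)
replace = Option("replace now", upfront_cost=400_000.0, p_failure=0.005)

low_stakes = prudent_choice([defer, replace], failure_consequence=5_000_000.0)
high_stakes = prudent_choice([defer, replace], failure_consequence=20_000_000.0)
print(low_stakes.name, "|", high_stakes.name)
```

The same deferral can be defensible on a low-consequence circuit and indefensible on a high-consequence one, and the calculation never needs to know whether the asset later failed.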

Transparency as Strategy

A utility that presents a causal model to a regulator is saying: "Here are our assumptions, our data, our reasoning, and our conclusions. Challenge any part of it." This is a fundamentally stronger position than presenting a risk score and asking the regulator to trust the process. The model invites scrutiny because it can withstand it. The risk matrix avoids scrutiny because it cannot.

8. What To Do

A practical path from risk registers to causal risk models.

Transitioning from matrix-based risk management to causal modeling doesn't require replacing everything at once. The most effective approach starts with a single high-consequence failure mode and expands from there.

The Path

Step | Action | Detail
1 | Pick one failure mode | Choose the one that keeps your executives up at night: wildfire ignition, cascading outage, critical-load interruption. Build the causal graph for that single failure mode: what causes it, what determines its severity, what interventions exist at each stage.
2 | Map the causal structure | Assemble your engineers, planners, and operators in a room. Draw the graph on a whiteboard. Identify which edges are well-understood (physics-based), which are estimated (expert judgment), and which are unknown. This step alone surfaces hidden assumptions and disagreements that the risk matrix conceals.
3 | Validate on synthetic data | Write structural equations for each edge. Simulate data where the ground truth is known. Run your estimator and confirm it recovers the true effects. This is the generative validation step; it catches modelling errors before real data is involved.
4 | Connect real data | Feed in your actual asset condition data, outage history, weather records, network topology, and maintenance records. Train the parameters. Run sensitivity analysis: which assumptions matter, which don't?
5 | Ask Rung 2 questions | Run the interventional queries your planners actually need: "If we shift $X from programme A to programme B, what happens to expected CMI?" Compare the answers to what the risk matrix recommends. Where they agree, good. Where they diverge, investigate.
6 | Build toward Rung 3 | Once the Rung 2 model is validated and trusted, extend to counterfactual queries: after-the-fact analysis of specific events, prudency documentation, scenario planning for regulatory proceedings.
7 | Scale | Repeat for additional failure modes. Connect the individual models into a system-level risk model that captures interactions between failure modes. This becomes the foundation for integrated resource planning, rate case support, and continuous risk management.

Start Small, Start Now

A causal model of a single critical failure mode — built in 6–8 weeks with your existing data and engineering knowledge — will produce better investment decisions than a risk matrix covering your entire asset base. The goal is not to replace everything at once. The goal is to demonstrate, on one high-stakes problem, that asking the right questions produces different and better answers.
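Step 3 of the path (validate on synthetic data) can be prototyped with the standard library alone. The sketch below simulates from a hypothetical linear model in which wind confounds the relationship between vegetation overgrowth and a continuous fault index, then checks that an adjusted regression recovers the known effect while the naive one does not. The coefficients, variable names, and the choice of a linear model are all assumptions made for illustration.

```python
import random

TRUE_EFFECT = 1.5  # ground-truth effect of vegetation on the fault index (assumed)


def simulate(n, seed=0):
    """Draw (wind, vegetation, faults) from the assumed structural equations."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        wind = rng.gauss(0, 1)
        veg = 0.8 * wind + rng.gauss(0, 1)
        faults = TRUE_EFFECT * veg + 2.0 * wind + rng.gauss(0, 1)
        rows.append((wind, veg, faults))
    return rows


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def residuals(xs, ys):
    """What is left of ys after removing its linear dependence on xs."""
    b = slope(xs, ys)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [y - my - b * (x - mx) for x, y in zip(xs, ys)]


def adjusted_slope(zs, xs, ys):
    """Effect of xs on ys after adjusting for zs (partial regression)."""
    return slope(residuals(zs, xs), residuals(zs, ys))


wind, veg, faults = zip(*simulate(5000))
naive = slope(veg, faults)
adjusted = adjusted_slope(wind, veg, faults)
print(f"true {TRUE_EFFECT}, naive {naive:.2f}, adjusted {adjusted:.2f}")
```

If the estimator cannot recover a known effect from data generated by your own equations, it will not recover an unknown one from operational data, which is the reason to run this check before step 4.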

9. Further Reading

The foundations behind this approach.

Source | Relevance
Pearl, J. Causality (2nd ed., 2009) | The foundational text on structural causal models, do-calculus, and the three rungs of causation
Pearl, J. & Mackenzie, D. The Book of Why (2018) | Accessible introduction to causal inference and Pearl's Ladder of Causation, the framework underlying this entire approach
McElreath, R. Statistical Rethinking (2nd ed., 2020) | Bayesian workflow with generative causal models: simulate first, then estimate
Fenton, N. & Neil, M. Risk Assessment and Decision Analysis with Bayesian Networks (2nd ed., 2018) | Practical guide to building Bayesian network risk models, directly applicable to utility asset management and reliability planning
IEEE Std 1366, IEEE Guide for Electric Power Distribution Reliability Indices | Standard definitions for SAIDI, SAIFI, CAIDI, and other reliability metrics used throughout this analysis
CPUC Wildfire Safety Division reports | California's regulatory framework for utility wildfire risk; demonstrates the shift toward risk-based, evidence-driven safety planning
NERC Standard TPL-001-5, Transmission System Planning Performance Requirements | Reliability standards that increasingly require utilities to demonstrate the adequacy of their planning models

What does your grid's causal structure look like?

A focused conversation about your highest-consequence failure mode is the fastest way to find out whether a causal model changes your investment plan.