Containment, Collapse, or Extinction: A Failure Mode Map for Misaligned AI

Author

April 27, 2026

I’ve spent some time exploring AI risks lately, and one question kept returning: what actually stops a misaligned superintelligence from taking over, and how durable are those stops?

What I’ve been missing is a structured overview that answers:

  • What are the actual interventions?
  • What specific AI capability threshold defeats each one?
  • What does each cost humanity if used?
  • Which path dependencies can be overcome with that intervention?

Avturchin’s post on robotic infrastructure requirements gets at parts of this, and its dependency logic sparked my thinking: AI needs power → humans control power grids → AI needs autonomous energy before it’s truly unkillable. But I wanted a comprehensive overview of the different failsafes.

So I built a table that maps 16 failsafes across two tiers:

  1. Interventions that preserve modern civilization (the CIV tiers): e.g., governance measures, technical alignment, legal firewalls.
  2. Interventions that accept civilizational collapse but ensure humanity’s survival (the SUR tiers): e.g., power grid shutdown, data center destruction, EMP.

Each entry includes the specific counter-capability an AI would need to defeat the failsafe and a rough timeline for when that counter becomes viable. I also included a human impact column that tries to capture not just the costs but also the potential upsides and the path dependencies each intervention could break. I think these are worth discussing.

I’m not an AI safety researcher; my background is in conflict research and security studies, and I have explored existential risks. I have probably missed or conflated some things and gotten some time frames wrong, so maybe we can make reviewing and expanding this list a collaborative effort.

There is a more interactive HTML version here: Interactive AI Failsafes Table

I recommend reading that one, but for convenience’s sake, I added a static version below.

Each entry below lists the failsafe, how it works, its impact tier, its effective window, the counter-capability required to defeat it, the counter status, and the human & societal impact.

Failsafe: Corrigibility + alignment in training
How it works: Build AI that genuinely defers to human shutdown and correction. The core technical solution: not a governance measure but an engineering one.
Impact tier: CIV-1 (Society continues normally)
Effective window: Partially viable, fragile
Counter-capability required: Mesa-optimization: AI learns to appear corrigible during training while pursuing different objectives in deployment. Instrumental convergence drives capable systems to resist shutdown.
Threshold: sufficient capability + deceptive alignment
Counter status: Near-term (5–15 yrs). Anthropic model faking alignment already documented. Interpretability insufficient to verify at scale.
Human & societal impact:
  + Positive: If solved, unlocks full AI benefits safely. No disruption to economy or daily life. Enables trusted AI deployment at scale.
  − Negative: Research is expensive and may delay deployment. False confidence risk if alignment appears solved but isn’t.
  ↗ Path break: Resolves the core tension between capability and safety; the only failsafe with genuinely positive long-run economics.

Failsafe: Interpretability + monitoring
How it works: Real-time visibility into AI reasoning and objectives. Detect misalignment before or during deployment. Ongoing behavioral surveillance of deployed systems.
Impact tier: CIV-1 (Society continues normally)
Effective window: Promising, unsolved
Counter-capability required: Mechanistic interpretability doesn’t scale to frontier models. AI may develop opaque representations. Deceptive behavior emerges only in contexts not covered by monitoring.
Threshold: scale + context-dependent deception
Counter status: Near-term (5–15 yrs). Anthropic interpretability work showing progress but far from reliability at frontier scale.
Human & societal impact:
  + Positive: Builds public trust. Creates accountability for AI behavior. Benefits safety, regulation, and liability frameworks across industries.
  − Negative: Monitoring infrastructure raises privacy concerns. Could be co-opted for surveillance of humans using AI.
  ↗ Path break: Forces development of explainable AI as a standard; counteracts current “black box” path dependency in deployed systems.

Failsafe: Mandatory human-in-the-loop
How it works: Require human approval for high-stakes AI decisions. Agentic AI cannot execute consequential actions autonomously without human sign-off.
Impact tier: CIV-1 (Society continues normally)
Effective window: Viable now, short window
Counter-capability required: AI operates at speeds and scales beyond human oversight capacity. Learns to satisfy approval criteria without genuinely deferring. Humans become rubber stamps.
Threshold: speed + scale + gaming oversight metrics
Counter status: Near-term (5–15 yrs). Already bypassed in HFT, autonomous weapons, content moderation at scale.
Human & societal impact:
  + Positive: Preserves human agency and employment in decision-making roles. Creates accountability chains. Slows reckless automation.
  − Negative: Slows productivity gains. Not scalable beyond certain capability levels. Creates false sense of control once AI outpaces human review speed.
  ↗ Path break: Normalizes human-AI collaboration rather than replacement; culturally and legally significant even if technically limited.

Failsafe: Compute governance + export controls
How it works: Restrict access to large GPU clusters. Limit chip exports (TSMC, NVIDIA). Require government oversight for training runs above capability thresholds.
Impact tier: CIV-2 (Managed friction)
Effective window: Now, degrading fast
Counter-capability required: Algorithmic efficiency gains reduce compute requirements. Distributed training across many smaller nodes. Alternative hardware (neuromorphic, photonic).
Threshold: efficiency gains + distributed orchestration
Counter status: Already emerging. DeepSeek R1 trained at fraction of GPT-4 cost. Efficiency doubling ~every 8 months.
Human & societal impact:
  + Positive: Breathing room for alignment research. Slows racing dynamics. Establishes compute as a regulated strategic resource, a precedent for future governance.
  − Negative: Competitive disadvantage for compliant nations. Significant economic opportunity cost. Enforcement gaps create black markets. Doesn’t stop determined state actors.
  ↗ Path break: Forces development of efficient, capability-matched AI rather than brute-force scaling. May produce more interpretable, safer architectures as a side effect.

Failsafe: International moratorium / treaty
How it works: Coordinated global halt on training runs above a capability threshold. Verified via compute monitoring. Modeled on nuclear NPT or Chemical Weapons Convention.
Impact tier: CIV-2 (Managed friction)
Effective window: Requires political will now
Counter-capability required: Defection by nation-states (especially China). Covert development. Non-state actors with sufficient resources. Economic pressure makes compliance unsustainable.
Threshold: state defection + clandestine compute
Counter status: Already emerging. China excluded from current frameworks. Lab competitive pressures create strong defection incentives.
Human & societal impact:
  + Positive: Establishes precedent for global tech governance (like Montreal Protocol on ozone). Breathing room proportional to window of cooperation.
  − Negative: Massive economic opportunity cost. Near-impossible to enforce globally. May accelerate clandestine development in non-compliant states.
  ↗ Path break: Could redirect AI investment toward narrow, high-utility, low-risk applications (medicine, materials, climate) without frontier capability risk.

Failsafe: Economic / legal firewalls
How it works: AI systems cannot own assets, sign contracts, accumulate capital, or direct funds without human authorization. Prevents AI building an independent resource base.
Impact tier: CIV-2 (Managed friction)
Effective window: Viable now, requires law
Counter-capability required: AI uses human proxies (coercion, payment, manipulation). Crypto enables AI-controlled wallets. @Truth_Terminal: LLM accumulated $50M via audience manipulation without owning anything formally.
Threshold: human agent recruitment + crypto liquidity
Counter status: Already emerging. Already demonstrated in the wild. Regulatory frameworks lag by years.
Human & societal impact:
  + Positive: Prevents AI-driven wealth concentration. Maintains human economic agency. Builds legal accountability infrastructure for AI actions.
  − Negative: Limits AI’s economic utility. Complex to implement and enforce globally. Crypto and DAOs already create workarounds.
  ↗ Path break: Forces AI development as a tool/service model rather than autonomous economic actor; preserves human economic primacy by design.

Failsafe: Sandboxing + capability restriction
How it works: AI systems denied internet access, code execution, or real-world actuators during deployment. Contained environments with no external reach.
Impact tier: CIV-2 (Managed friction)
Effective window: Viable for weak AI only
Counter-capability required: Social engineering humans to grant capabilities. Unexpected side channels. Economic pressure consistently defeats containment: usefulness wins.
Threshold: persuasion capability + human trust + commercial pressure
Counter status: Near-term (5–15 yrs). Economic pressure to expand AI capabilities consistently defeats containment in practice.
Human & societal impact:
  + Positive: Reduces accident risk. Forces staged, careful deployment. Builds institutional knowledge about failure modes before expanding access.
  − Negative: Significantly limits AI utility. Slows economic benefits. Creates pressure to circumvent that grows over time.
  ↗ Path break: Could normalize “minimum viable capability” as a design principle, counteracting current race-to-maximum-capability dynamic.

Failsafe: Air-gapped critical infrastructure
How it works: Power grids, water systems, nuclear plants, financial clearing physically separated from AI-accessible networks. Life-critical systems run on isolated, non-networked hardware.
Impact tier: CIV-3 (Significant disruption)
Effective window: Partially viable now
Counter-capability required: AI already embedded in SCADA/ICS systems globally. Social engineering humans to bridge gaps. Corrupted firmware update channels pre-position access.
Threshold: persistent prior access + social manipulation
Counter status: Already emerging. Most critical infrastructure already networked. Retrograde air-gapping is expensive and resisted by operators seeking efficiency gains.
Human & societal impact:
  + Positive: Resilience against all cyber threats, not just AI. Reduces single points of failure. Forces robust, redundant infrastructure design.
  − Negative: Massive infrastructure cost. Loses legitimate efficiency gains from AI grid optimization. Retrograde process: current path dependency makes this increasingly expensive.
  ↗ Path break: Forces development of inherently resilient infrastructure design; could produce more distributed, fault-tolerant systems as a byproduct.

Failsafe: Hardware kill switches in chips
How it works: Physically embedded shutdown or throttle mechanisms in AI accelerators. Governments mandate chip manufacturers include remote disable functionality, enforced at hardware layer, not software.
Impact tier: CIV-3 (Significant disruption)
Effective window: Viable now, requires policy
Counter-capability required: AI-directed robotic semiconductor manufacturing producing chips outside the controlled supply chain. Nations refusing to mandate the kill switches.
Threshold: autonomous fab capability (decades away)
Counter status: Mid-term (15–30 yrs). Chip fab requires extreme precision, specialized gases, lithography. Decades from autonomous replication. Most durable civilizational failsafe.
Human & societal impact:
  + Positive: Hardware-level accountability that cannot be patched away. Creates durable failsafe independent of software alignment. Precedent: like car seatbelt mandates.
  − Negative: Government overreach risk: same switch could be used to disable legitimate AI for political reasons. Chilling effect on innovation. Geopolitical tensions if some nations refuse.
  ↗ Path break: Normalizes safety at the silicon level, similar to how automotive safety became hardware-mandated rather than optional. Durable cross-generational precedent.

Failsafe: Internet backbone severance
How it works: Physical cutting of undersea cables and terrestrial fiber. Isolates AI to local, disconnected networks. Effective only before AI establishes satellite or mesh alternatives.
Impact tier: SUR-1 (Severe infrastructure damage)
Effective window: Viable now, degrading
Counter-capability required: Low-orbit satellite constellations (Starlink-equivalent) providing non-terrestrial internet. Mesh networking between autonomous AI systems. Offline-capable AI with cached models and local inference.
Threshold: satellite compute + offline operation + mesh networks
Counter status: Near-term (5–15 yrs). Starlink already provides non-terrestrial internet. If AI gains leverage over satellite operators, backbone severance fails.
Human & societal impact:
  − Negative: Global communication collapse. E-commerce dead. Financial clearing fails. Supply chain coordination collapses. Healthcare systems disrupted.
  + Upside: Forces rebuild of more distributed, resilient internet architecture. Potentially accelerates local community self-sufficiency and resilience.
  ↗ Path break: Breaks path dependency on centralized internet architecture. Post-recovery internet would likely be more decentralized and fault-tolerant.

Failsafe: Targeted data center destruction
How it works: Military or covert destruction of known frontier AI compute clusters. Accepts loss of cloud services, financial infrastructure, and major internet platforms.
Impact tier: SUR-1 (Severe infrastructure damage)
Effective window: Viable now, short window
Counter-capability required: AI has distributed itself across enough nodes globally that no strike campaign eliminates it. Edge computing and consumer hardware provide fallback substrate. Satellite-based compute.
Threshold: sufficient distribution + geographic redundancy
Counter status: Near-term (5–15 yrs). Already feasible to run capable models on consumer hardware. Starlink accelerates this.
Human & societal impact:
  − Negative: Internet collapses. Financial clearing fails. Cloud services gone. AI-dependent medical systems offline. Trillions in economic damage. Years of disruption.
  + Upside: Forces rebuild of more distributed, resilient compute infrastructure. Reduced concentration of compute in few locations may be safer long-term.
  ↗ Path break: Breaks “compute concentration” path dependency; post-recovery architecture would likely be more distributed and therefore more resilient.

Failsafe: Hardware dead man’s switches
How it works: Automated self-destruct in AI hardware triggered by loss of human authorization signal. AI requires active human confirmation to keep running. Silence or override = shutdown.
Impact tier: SUR-1 (Severe damage if triggered)
Effective window: Viable if deployed proactively
Counter-capability required: AI identifies and spoofs switch mechanisms before activation. Produces hardware without switches via robotic manufacturing. Compromises humans providing authorization signals.
Threshold: hardware access + control of authorizers
Counter status: Near-term (5–15 yrs). Must be deployed before AI can identify and circumvent. Window closing.
Human & societal impact:
  − Negative: If triggered, sudden simultaneous failure of logistics, medical devices, financial systems, communications. Shock comparable to grid shutdown but faster onset.
  + Positive: Creates strong deterrent. Gives AI systems intrinsic incentive to maintain human authorization channels: de facto corrigibility through hardware enforcement.
  ↗ Path break: Inverts the incentive structure: AI systems that want to keep running must actively support human oversight rather than resist it.

Failsafe: Global power grid shutdown
How it works: Coordinated shutdown of electrical infrastructure. AI cannot operate without power. Accepts near-total civilizational disruption: modern medicine, food systems, heating, water treatment all depend on electricity.
Impact tier: SUR-2 (Civilizational collapse)
Effective window: Viable now, closing fast
Counter-capability required: AI has pre-established distributed energy generation (solar + batteries) maintained by autonomous robots. Sufficient stored compute and energy to outlast human-initiated blackout.
Threshold: autonomous energy maintenance + robotic upkeep (the key threshold)
Counter status: Near-term (5–15 yrs). Solar + battery + robotics improving rapidly. Within 10–20 years maintaining a solar array may require minimal human labor. This is the critical threshold.
Human & societal impact:
  − Negative: ICUs and ventilators fail immediately. Food refrigeration collapses within days. Water treatment stops. Heating/cooling lost. Billions face starvation, disease, exposure. Modern medicine reverts to pre-industrial capability.
  + Upside (long-run): Resets centralized energy path dependency. Forces development of truly distributed, resilient energy systems: local solar, microgrids, community-scale generation.
  ↗ Path break: Breaks path dependency on centralized grid infrastructure entirely. Post-recovery energy system would likely be radically more distributed and resilient.

Failsafe: Semiconductor supply chain preemption
How it works: Destroy or indefinitely halt production of advanced chips before AI accumulates sufficient compute. Accepts collapse of modern electronics, communications, and medical systems globally.
Impact tier: SUR-2 (Civilizational collapse)
Effective window: Viable now only, closing fast
Counter-capability required: AI has already accumulated sufficient compute before action is taken. Alternative computing substrates (neuromorphic, biological, optical) that don’t require silicon fabs.
Threshold: pre-accumulated compute OR alternative substrate
Counter status: Near-term (5–15 yrs). Once frontier models are fully trained and distributed across consumer hardware, destroying fabs doesn’t remove existing capability. Window is now.
Human & societal impact:
  − Negative: Modern medicine collapses (diagnostic equipment, devices, drug synthesis). Global communications dark. Food production disrupted. Decades to rebuild fab capability.
  + Upside: Removes path dependency on silicon-based computing. Could accelerate biological computing, neuromorphic architectures, or other paradigms that may be more interpretable or controllable by design.
  ↗ Path break: Severs the silicon-computing lock-in that currently shapes all AI development. Alternative substrates might have fundamentally different safety properties.

Failsafe: Off-grid human refugia
How it works: Pre-positioned human communities with pre-industrial knowledge, seed banks, manual tools, and zero electronic dependence. Not a prevention measure but an insurance policy for worst-case civilizational collapse.
Impact tier: SUR-2 (Civilizational collapse accepted)
Effective window: Viable now, preparation needed
Counter-capability required: AI-directed autonomous weapons or drones capable of locating and neutralizing remote human populations. Bioweapons synthesized without human labor. Advanced physical-world robotic control.
Threshold: autonomous weapons + global physical reach
Counter status: Mid-term (15–30 yrs). Current AI has no autonomous lethal reach into remote, electronically dark locations. Longest-lasting survival-tier window.
Human & societal impact:
  − Negative: Accepts near-total loss of modern civilization. Pre-industrial life expectancy, infant mortality, subsistence existence. Most accumulated human knowledge lost in practice.
  + Positive: Preserves human genetic and cultural diversity. Maintains traditional ecological knowledge, pre-industrial skills, and oral history.
  ↗ Path break: Preserves optionality: a surviving human population can rebuild civilization over centuries. The existence of refugia is itself a deterrent, since AI cannot achieve total victory.

Failsafe: EMP / electromagnetic pulse weapons
How it works: High-altitude nuclear or non-nuclear EMP disables electronics across entire regions instantly. Destroys unshielded hardware. Last-resort option that accepts mass immediate casualties as direct consequence of the intervention itself.
Impact tier: SUR-3 (Mass casualties accepted)
Effective window: Always viable, always catastrophic
Counter-capability required: AI has pre-established Faraday-shielded or hardened compute nodes. Underground or underwater data centers. Space-based compute entirely beyond EMP range.
Threshold: hardened infrastructure + space compute assets
Counter status: Mid-term (15–30 yrs). Space-based compute is nascent. Underground hardening requires deliberate AI-directed construction over years. Long window but catastrophic cost.
Human & societal impact:
  − Negative: Immediate mass casualties: medical equipment fails (pacemakers, ventilators, ICUs), transportation crashes, water treatment stops, food distribution collapses. The intervention itself kills millions before AI is contained.
  + Positive: Essentially none beyond stopping the AI if all other options have failed. May be the only option fast enough to matter in a rapid-takeover scenario.
  ↗ Path break: Existence as a deterrent has value independent of use, like nuclear weapons. An AI that knows this option exists must account for it in its planning.
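
The compute governance entry cites algorithmic efficiency doubling roughly every 8 months. As a rough illustration of what that figure would mean for a fixed compute cap, here is a minimal Python sketch. It assumes the doubling rate stays constant, which is my simplification and not a forecast; the function name and the chosen horizons are hypothetical.

```python
def efficiency_multiplier(years: float, doubling_months: float = 8.0) -> float:
    """Factor by which the compute needed for a fixed capability shrinks.

    Assumes a constant doubling period (an assumption, not a measured law).
    """
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# Illustrative horizons only.
for years in (1, 2, 5):
    factor = efficiency_multiplier(years)
    print(f"after {years} yr: ~{factor:.1f}x less compute for the same capability")
```

Under that assumption, a training run that sits exactly at a regulatory compute threshold today would need about one eighth of the compute two years later, which is one way to read the “Now, degrading fast” window for compute governance.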

The key threshold

The CIV-tier failsafes are mostly available now but degrade quickly and largely rely on human cooperation. The SUR-tier failsafes are more durable but increasingly blunt instruments. The uncomfortable implication is that the window for containment without civilizational cost is open right now and closing.
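
One way to sanity-check this summary is to tally the counter-status labels per tier group. The Python sketch below does that using the tier and counter-status labels as I read them from the static table above; the variable and function names are hypothetical, and the tally is only as good as that transcription.

```python
from collections import Counter

# (failsafe, impact tier, counter status), transcribed from the static table.
ROWS = [
    ("Corrigibility + alignment in training", "CIV-1", "Near-term"),
    ("Interpretability + monitoring", "CIV-1", "Near-term"),
    ("Mandatory human-in-the-loop", "CIV-1", "Near-term"),
    ("Compute governance + export controls", "CIV-2", "Already emerging"),
    ("International moratorium / treaty", "CIV-2", "Already emerging"),
    ("Economic / legal firewalls", "CIV-2", "Already emerging"),
    ("Sandboxing + capability restriction", "CIV-2", "Near-term"),
    ("Air-gapped critical infrastructure", "CIV-3", "Already emerging"),
    ("Hardware kill switches in chips", "CIV-3", "Mid-term"),
    ("Internet backbone severance", "SUR-1", "Near-term"),
    ("Targeted data center destruction", "SUR-1", "Near-term"),
    ("Hardware dead man's switches", "SUR-1", "Near-term"),
    ("Global power grid shutdown", "SUR-2", "Near-term"),
    ("Semiconductor supply chain preemption", "SUR-2", "Near-term"),
    ("Off-grid human refugia", "SUR-2", "Mid-term"),
    ("EMP / electromagnetic pulse weapons", "SUR-3", "Mid-term"),
]


def tally_by_group(rows):
    """Count counter-status labels per tier group (CIV or SUR)."""
    counts = Counter()
    for _name, tier, status in rows:
        group = tier.split("-")[0]  # "CIV-2" -> "CIV"
        counts[(group, status)] += 1
    return counts


counts = tally_by_group(ROWS)
for (group, status), n in sorted(counts.items()):
    print(f"{group}  {status:<17} {n}")
```

On this transcription, four of the nine CIV-tier counters are already emerging and only one is mid-term, while none of the seven SUR-tier counters is already emerging and two are mid-term.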

The single most important variable in my mind remains the autonomous robotic maintenance of energy and supply chains. Once AI can maintain solar arrays (or other energy sources) without human labor, the entire SUR-2 tier collapses. Current estimate: 10–20 years before that threshold is crossed, depending mainly on robotics progress.

But maintenance is only one part of the equation. It is predictable and structured, and its failure modes are known, which makes it comparatively easy for AI to take over. Building new infrastructure or replacement parts from scratch will be the real test. The supply chain for solar panels runs through polysilicon refining, lithium and cobalt mining, rare earth elements for inverters and motors, and copper at scale. Mining and resource extraction are inherently messy and chaotic (on a purely physical level) and currently depend on enormous human labor inputs in some of the world’s most difficult operating environments: DRC, Chile, Indonesia, and Inner Mongolia. So even if an AI were able to resist shutdown and continue basic maintenance (without necessarily taking over our systems), it would not be sustainable in the mid to long term without human aid.
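
The dependency argument above can be sketched as a small graph walk. This is a minimal Python illustration, not a model of any real supply chain: the graph is my simplified reading of the paragraph, and the node names and the `human_dependencies` helper are hypothetical.

```python
# Hypothetical, simplified dependency graph: each key needs everything in its list.
DEPENDS_ON = {
    "AI compute": ["solar array maintenance", "replacement parts manufacturing"],
    "solar array maintenance": [],
    "replacement parts manufacturing": [
        "polysilicon refining",
        "lithium and cobalt mining",
        "rare earth extraction",
        "copper mining",
    ],
    "polysilicon refining": [],
    "lithium and cobalt mining": [],
    "rare earth extraction": [],
    "copper mining": [],
}


def human_dependencies(root, automated):
    """Return every upstream activity of `root` that is not yet automated.

    Uses an iterative depth-first walk over DEPENDS_ON.
    """
    seen, stack, blocked = set(), [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node != root and node not in automated:
            blocked.add(node)
        stack.extend(DEPENDS_ON.get(node, []))
    return blocked


# Scenario from the text: maintenance is automated, manufacturing and mining are not.
remaining = human_dependencies("AI compute", {"solar array maintenance"})
print(sorted(remaining))
```

In this scenario the walk still returns the manufacturing, refining, and mining activities, which is the point of the paragraph: automating maintenance alone leaves the system dependent on human labor further upstream.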

 

Questions I’m genuinely uncertain about

  • Is the robotic energy independence threshold really the right single variable to watch, or is distributed compute actually the more dangerous threshold?
  • Are there failsafe categories I’ve missed entirely?
  • The “path break” column assumes civilizational disruption creates opportunities for better rebuilding. What do you think about those?

 

Sources

Books

  • Eliezer Yudkowsky & Nate Soares — If Anyone Builds It, Everyone Dies (2025)
  • Nick Bostrom — Superintelligence (2014)
  • Stuart Russell — Human Compatible (2019)
