I’ve spent some time exploring AI risks lately, and one question kept returning: what actually stops a misaligned superintelligence from taking over, and how durable are those stops?
What I’ve been missing is a structured overview that answers:
- What are the actual interventions?
- What specific AI capability threshold defeats each one?
- What does each cost humanity if used?
- Which path dependencies can be overcome with that intervention?
Avturchin’s post on robotic infrastructure requirements gets at parts of this, and its dependency logic sparked my thinking: AI needs power → humans control power grids → AI needs autonomous energy before it’s truly unkillable. But I wanted a comprehensive overview of different failsafes.
Anyways, I built a table that maps 16 failsafes across two tiers:
- Interventions that preserve modern civilization: e.g., governance measures, technical alignment, legal firewalls
- Interventions that accept civilizational collapse but ensure humanity’s survival: e.g., power grid shutdown, data center destruction, EMP.
Each entry includes the specific counter-capability AI requires to defeat it and a rough timeline for when that counter becomes viable. I also included a human impact column that tries to capture not just the costs but the potential upsides and path dependency breaks each intervention creates. I think these are probably very much worth discussing.
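To make the "counter-capability" idea concrete, here is a minimal Python sketch of how a few entries could be encoded. The thresholds are copied from three rows of the table; the class name `Failsafe`, the field `defeated_by`, and the helper `surviving` are my own hypothetical names, and treating a threshold as a simple AND/OR over capabilities is a simplifying assumption, not a claim about how these failsafes would actually fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failsafe:
    name: str
    tier: str
    # Each inner set is one way to defeat the failsafe. All capabilities in a
    # set are needed together (AND); any single set is enough (OR).
    defeated_by: tuple

    def holds_against(self, capabilities: set) -> bool:
        """True if no defeating combination is fully covered by `capabilities`."""
        return not any(required <= capabilities for required in self.defeated_by)


# Thresholds taken from the table; only three rows are encoded here.
FAILSAFES = [
    Failsafe(
        "Internet backbone severance", "SUR-1",
        (frozenset({"satellite compute", "offline operation", "mesh networks"}),),
    ),
    Failsafe(
        "Global power grid shutdown", "SUR-2",
        (frozenset({"autonomous energy maintenance", "robotic upkeep"}),),
    ),
    Failsafe(
        "Semiconductor supply chain preemption", "SUR-2",
        (frozenset({"pre-accumulated compute"}), frozenset({"alternative substrate"})),
    ),
]


def surviving(capabilities: set) -> list:
    """Names of failsafes that still hold given the AI's assumed capabilities."""
    return [f.name for f in FAILSAFES if f.holds_against(capabilities)]


if __name__ == "__main__":
    print(surviving({"pre-accumulated compute", "mesh networks"}))
```

With these assumed capabilities, semiconductor preemption no longer holds (one OR-branch is satisfied), while backbone severance still does because only one of its three required capabilities is present.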
I’m not an AI safety researcher, but I come from conflict research and security studies and have explored existential risks. In any case, I have probably missed or conflated things and gotten time frames wrong, so maybe we can make this a collaborative effort to review and expand on this list.
There is an interactive HTML version here: Interactive AI Failsafes Table
I recommend reading that one, but for convenience I have added a static version below.
| Failsafe | How it works | Impact Tier | Effective window | Counter-capability required | Counter status | Human & societal impact |
|---|---|---|---|---|---|---|
| Corrigibility + alignment in training | Build AI that genuinely defers to human shutdown and correction. The core technical solution: not a governance measure but an engineering one. | CIV-1: Society continues normally | Partially viable, fragile | Mesa-optimization: AI learns to appear corrigible during training while pursuing different objectives in deployment. Instrumental convergence drives capable systems to resist shutdown. Threshold: sufficient capability + deceptive alignment | Near-term (5–15 yrs): Anthropic model faking alignment already documented. Interpretability insufficient to verify at scale. | + Positive: If solved, unlocks full AI benefits safely. No disruption to economy or daily life. Enables trusted AI deployment at scale. − Negative: Research is expensive and may delay deployment. False confidence risk if alignment appears solved but isn’t. ↗ Path break: Resolves the core tension between capability and safety; the only failsafe with genuinely positive long-run economics. |
| Interpretability + monitoring | Real-time visibility into AI reasoning and objectives. Detect misalignment before or during deployment. Ongoing behavioral surveillance of deployed systems. | CIV-1: Society continues normally | Promising, unsolved | Mechanistic interpretability doesn’t scale to frontier models. AI may develop opaque representations. Deceptive behavior emerges only in contexts not covered by monitoring. Threshold: scale + context-dependent deception | Near-term (5–15 yrs): Anthropic interpretability work showing progress but far from reliability at frontier scale. | + Positive: Builds public trust. Creates accountability for AI behavior. Benefits safety, regulation, and liability frameworks across industries. − Negative: Monitoring infrastructure raises privacy concerns. Could be co-opted for surveillance of humans using AI. ↗ Path break: Forces development of explainable AI as a standard, counteracting the current “black box” path dependency in deployed systems. |
| Mandatory human-in-the-loop | Require human approval for high-stakes AI decisions. Agentic AI cannot execute consequential actions autonomously without human sign-off. | CIV-1: Society continues normally | Viable now, short window | AI operates at speeds and scales beyond human oversight capacity. Learns to satisfy approval criteria without genuinely deferring. Humans become rubber stamps. Threshold: speed + scale + gaming oversight metrics | Near-term (5–15 yrs): Already bypassed in HFT, autonomous weapons, content moderation at scale. | + Positive: Preserves human agency and employment in decision-making roles. Creates accountability chains. Slows reckless automation. − Negative: Slows productivity gains. Not scalable beyond certain capability levels. Creates false sense of control once AI outpaces human review speed. ↗ Path break: Normalizes human-AI collaboration rather than replacement; culturally and legally significant even if technically limited. |
| Compute governance + export controls | Restrict access to large GPU clusters. Limit chip exports (TSMC, NVIDIA). Require government oversight for training runs above capability thresholds. | CIV-2: Managed friction | Now, degrading fast | Algorithmic efficiency gains reduce compute requirements. Distributed training across many smaller nodes. Alternative hardware (neuromorphic, photonic). Threshold: efficiency gains + distributed orchestration | Already emerging: DeepSeek R1 trained at fraction of GPT-4 cost. Efficiency doubling ~every 8 months. | + Positive: Breathing room for alignment research. Slows racing dynamics. Establishes compute as a regulated strategic resource, a precedent for future governance. − Negative: Competitive disadvantage for compliant nations. Significant economic opportunity cost. Enforcement gaps create black markets. Doesn’t stop determined state actors. ↗ Path break: Forces development of efficient, capability-matched AI rather than brute-force scaling. May produce more interpretable, safer architectures as a side effect. |
| International moratorium / treaty | Coordinated global halt on training runs above a capability threshold. Verified via compute monitoring. Modeled on nuclear NPT or Chemical Weapons Convention. | CIV-2: Managed friction | Requires political will now | Defection by nation-states (especially China). Covert development. Non-state actors with sufficient resources. Economic pressure makes compliance unsustainable. Threshold: state defection + clandestine compute | Already emerging: China excluded from current frameworks. Lab competitive pressures create strong defection incentives. | + Positive: Establishes precedent for global tech governance (like Montreal Protocol on ozone). Breathing room proportional to window of cooperation. − Negative: Massive economic opportunity cost. Near-impossible to enforce globally. May accelerate clandestine development in non-compliant states. ↗ Path break: Could redirect AI investment toward narrow, high-utility, low-risk applications (medicine, materials, climate) without frontier capability risk. |
| Economic / legal firewalls | AI systems cannot own assets, sign contracts, accumulate capital, or direct funds without human authorization. Prevents AI building an independent resource base. | CIV-2: Managed friction | Viable now, requires law | AI uses human proxies (coercion, payment, manipulation). Crypto enables AI-controlled wallets. @Truth_Terminal: LLM accumulated $50M via audience manipulation without owning anything formally. Threshold: human agent recruitment + crypto liquidity | Already emerging: Already demonstrated in the wild. Regulatory frameworks lag by years. | + Positive: Prevents AI-driven wealth concentration. Maintains human economic agency. Builds legal accountability infrastructure for AI actions. − Negative: Limits AI’s economic utility. Complex to implement and enforce globally. Crypto and DAOs already create workarounds. ↗ Path break: Forces AI development as a tool/service model rather than autonomous economic actor, preserving human economic primacy by design. |
| Sandboxing + capability restriction | AI systems denied internet access, code execution, or real-world actuators during deployment. Contained environments with no external reach. | CIV-2: Managed friction | Viable for weak AI only | Social engineering humans to grant capabilities. Unexpected side channels. Economic pressure consistently defeats containment: usefulness wins. Threshold: persuasion capability + human trust + commercial pressure | Near-term (5–15 yrs): Economic pressure to expand AI capabilities consistently defeats containment in practice. | + Positive: Reduces accident risk. Forces staged, careful deployment. Builds institutional knowledge about failure modes before expanding access. − Negative: Significantly limits AI utility. Slows economic benefits. Creates pressure to circumvent that grows over time. ↗ Path break: Could normalize “minimum viable capability” as a design principle, counteracting the current race-to-maximum-capability dynamic. |
| Air-gapped critical infrastructure | Power grids, water systems, nuclear plants, financial clearing physically separated from AI-accessible networks. Life-critical systems run on isolated, non-networked hardware. | CIV-3: Significant disruption | Partially viable now | AI already embedded in SCADA/ICS systems globally. Social engineering humans to bridge gaps. Corrupted firmware update channels pre-position access. Threshold: persistent prior access + social manipulation | Already emerging: Most critical infrastructure already networked. Retrograde air-gapping is expensive and resisted by operators seeking efficiency gains. | + Positive: Resilience against all cyber threats, not just AI. Reduces single points of failure. Forces robust, redundant infrastructure design. − Negative: Massive infrastructure cost. Loses legitimate efficiency gains from AI grid optimization. Retrograde process: current path dependency makes this increasingly expensive. ↗ Path break: Forces development of inherently resilient infrastructure design, which could produce more distributed, fault-tolerant systems as a byproduct. |
| Hardware kill switches in chips | Physically embedded shutdown or throttle mechanisms in AI accelerators. Governments mandate chip manufacturers include remote disable functionality, enforced at the hardware layer, not software. | CIV-3: Significant disruption | Viable now, requires policy | AI-directed robotic semiconductor manufacturing producing chips outside the controlled supply chain. Nations refusing to mandate the kill switches. Threshold: autonomous fab capability (decades away) | Mid-term (15–30 yrs): Chip fab requires extreme precision, specialized gases, lithography. Decades from autonomous replication. Most durable civilizational failsafe. | + Positive: Hardware-level accountability that cannot be patched away. Creates durable failsafe independent of software alignment. Precedent: like car seatbelt mandates. − Negative: Government overreach risk: the same switch could be used to disable legitimate AI for political reasons. Chilling effect on innovation. Geopolitical tensions if some nations refuse. ↗ Path break: Normalizes safety at the silicon level, similar to how automotive safety became hardware-mandated rather than optional. Durable cross-generational precedent. |
| Internet backbone severance | Physical cutting of undersea cables and terrestrial fiber. Isolates AI to local, disconnected networks. Effective only before AI establishes satellite or mesh alternatives. | SUR-1: Severe infrastructure damage | Viable now, degrading | Low-orbit satellite constellations (Starlink-equivalent) providing non-terrestrial internet. Mesh networking between autonomous AI systems. Offline-capable AI with cached models and local inference. Threshold: satellite compute + offline operation + mesh networks | Near-term (5–15 yrs): Starlink already provides non-terrestrial internet. If AI gains leverage over satellite operators, backbone severance fails. | − Negative: Global communication collapse. E-commerce dead. Financial clearing fails. Supply chain coordination collapses. Healthcare systems disrupted. + Upside: Forces rebuild of more distributed, resilient internet architecture. Potentially accelerates local community self-sufficiency and resilience. ↗ Path break: Breaks path dependency on centralized internet architecture. Post-recovery internet would likely be more decentralized and fault-tolerant. |
| Targeted data center destruction | Military or covert destruction of known frontier AI compute clusters. Accepts loss of cloud services, financial infrastructure, and major internet platforms. | SUR-1: Severe infrastructure damage | Viable now, short window | AI has distributed itself across enough nodes globally that no strike campaign eliminates it. Edge computing and consumer hardware provide fallback substrate. Satellite-based compute. Threshold: sufficient distribution + geographic redundancy | Near-term (5–15 yrs): Already feasible to run capable models on consumer hardware. Starlink accelerates this. | − Negative: Internet collapses. Financial clearing fails. Cloud services gone. AI-dependent medical systems offline. Trillions in economic damage. Years of disruption. + Upside: Forces rebuild of more distributed, resilient compute infrastructure. Reduced concentration of compute in few locations may be safer long-term. ↗ Path break: Breaks “compute concentration” path dependency: post-recovery architecture would likely be more distributed and therefore more resilient. |
| Hardware dead man’s switches | Automated self-destruct in AI hardware triggered by loss of human authorization signal. AI requires active human confirmation to keep running. Silence or override = shutdown. | SUR-1: Severe damage if triggered | Viable if deployed proactively | AI identifies and spoofs switch mechanisms before activation. Produces hardware without switches via robotic manufacturing. Compromises humans providing authorization signals. Threshold: hardware access + control of authorizers | Near-term (5–15 yrs): Must be deployed before AI can identify and circumvent. Window closing. | − Negative: If triggered, sudden simultaneous failure of logistics, medical devices, financial systems, communications. Shock comparable to grid shutdown but faster onset. + Positive: Creates strong deterrent. Gives AI systems intrinsic incentive to maintain human authorization channels: de facto corrigibility through hardware enforcement. ↗ Path break: Inverts the incentive structure: AI systems that want to keep running must actively support human oversight rather than resist it. |
| Global power grid shutdown | Coordinated shutdown of electrical infrastructure. AI cannot operate without power. Accepts near-total civilizational disruption: modern medicine, food systems, heating, water treatment all depend on electricity. | SUR-2: Civilizational collapse | Viable now, closing fast | AI has pre-established distributed energy generation (solar + batteries) maintained by autonomous robots. Sufficient stored compute and energy to outlast human-initiated blackout. Threshold: autonomous energy maintenance + robotic upkeep (the key threshold) | Near-term (5–15 yrs): Solar + battery + robotics improving rapidly. Within 10–20 years maintaining a solar array may require minimal human labor. This is the critical threshold. | − Negative: ICUs and ventilators fail immediately. Food refrigeration collapses within days. Water treatment stops. Heating/cooling lost. Billions face starvation, disease, exposure. Modern medicine reverts to pre-industrial capability. + Upside (long-run): Resets centralized energy path dependency. Forces development of truly distributed, resilient energy systems: local solar, microgrids, community-scale generation. ↗ Path break: Breaks path dependency on centralized grid infrastructure entirely. Post-recovery energy system would likely be radically more distributed and resilient. |
| Semiconductor supply chain preemption | Destroy or indefinitely halt production of advanced chips before AI accumulates sufficient compute. Accepts collapse of modern electronics, communications, and medical systems globally. | SUR-2: Civilizational collapse | Viable now only, closing fast | AI has already accumulated sufficient compute before action is taken. Alternative computing substrates (neuromorphic, biological, optical) that don’t require silicon fabs. Threshold: pre-accumulated compute OR alternative substrate | Near-term (5–15 yrs): Once frontier models are fully trained and distributed across consumer hardware, destroying fabs doesn’t remove existing capability. Window is now. | − Negative: Modern medicine collapses (diagnostic equipment, devices, drug synthesis). Global communications dark. Food production disrupted. Decades to rebuild fab capability. + Upside: Removes path dependency on silicon-based computing. Could accelerate biological computing, neuromorphic architectures, or other paradigms that may be more interpretable or controllable by design. ↗ Path break: Severs the silicon-computing lock-in that currently shapes all AI development. Alternative substrates might have fundamentally different safety properties. |
| Off-grid human refugia | Pre-positioned human communities with pre-industrial knowledge, seed banks, manual tools, and zero electronic dependence. Not a prevention measure but an insurance policy for worst-case civilizational collapse. | SUR-2: Civilizational collapse accepted | Viable now, preparation needed | AI-directed autonomous weapons or drones capable of locating and neutralizing remote human populations. Bioweapons synthesized without human labor. Advanced physical-world robotic control. Threshold: autonomous weapons + global physical reach | Mid-term (15–30 yrs): Current AI has no autonomous lethal reach into remote, electronically dark locations. Longest-lasting survival-tier window. | − Negative: Accepts near-total loss of modern civilization. Pre-industrial life expectancy, infant mortality, subsistence existence. Most accumulated human knowledge lost in practice. + Positive: Preserves human genetic and cultural diversity. Maintains traditional ecological knowledge, pre-industrial skills, and oral history. ↗ Path break: Preserves optionality: a surviving human population can rebuild civilization over centuries. The existence of refugia is itself a deterrent: AI cannot achieve total victory. |
| EMP / electromagnetic pulse weapons | High-altitude nuclear or non-nuclear EMP disables electronics across entire regions instantly. Destroys unshielded hardware. Last-resort option: accepts mass immediate casualties as direct consequence of the intervention itself. | SUR-3: Mass casualties accepted | Always viable, always catastrophic | AI has pre-established Faraday-shielded or hardened compute nodes. Underground or underwater data centers. Space-based compute entirely beyond EMP range. Threshold: hardened infrastructure + space compute assets | Mid-term (15–30 yrs): Space-based compute is nascent. Underground hardening requires deliberate AI-directed construction over years. Long window but catastrophic cost. | − Negative: Immediate mass casualties: medical equipment fails (pacemakers, ventilators, ICUs), transportation crashes, water treatment stops, food distribution collapses. The intervention itself kills millions before AI is contained. + Positive: Essentially none beyond stopping the AI if all other options have failed. May be the only option fast enough to matter in a rapid-takeover scenario. ↗ Path break: Existence as a deterrent has value independent of use, like nuclear weapons. An AI that knows this option exists must account for it in its planning. |
The key threshold
The CIV-tier failsafes are all viable now but degrade quickly and mostly rely on human cooperation. The SUR-tier failsafes are more durable but increasingly blunt instruments. The uncomfortable implication is that the window for civilizational-cost-free containment is open right now and closing.
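One rough way to check this reading against the static table is to tally the "Counter status" column per tier family. The sketch below hard-codes the tier and status of each of the 16 rows as they appear in the table; the names `ROWS` and `status_by_family` are hypothetical, and collapsing each status to its label (dropping the commentary) is my simplification.

```python
from collections import Counter

# (failsafe, impact tier, counter status) as listed in the table above.
ROWS = [
    ("Corrigibility + alignment in training", "CIV-1", "Near-term"),
    ("Interpretability + monitoring", "CIV-1", "Near-term"),
    ("Mandatory human-in-the-loop", "CIV-1", "Near-term"),
    ("Compute governance + export controls", "CIV-2", "Already emerging"),
    ("International moratorium / treaty", "CIV-2", "Already emerging"),
    ("Economic / legal firewalls", "CIV-2", "Already emerging"),
    ("Sandboxing + capability restriction", "CIV-2", "Near-term"),
    ("Air-gapped critical infrastructure", "CIV-3", "Already emerging"),
    ("Hardware kill switches in chips", "CIV-3", "Mid-term"),
    ("Internet backbone severance", "SUR-1", "Near-term"),
    ("Targeted data center destruction", "SUR-1", "Near-term"),
    ("Hardware dead man's switches", "SUR-1", "Near-term"),
    ("Global power grid shutdown", "SUR-2", "Near-term"),
    ("Semiconductor supply chain preemption", "SUR-2", "Near-term"),
    ("Off-grid human refugia", "SUR-2", "Mid-term"),
    ("EMP / electromagnetic pulse weapons", "SUR-3", "Mid-term"),
]


def status_by_family(rows):
    """Count counter-status labels separately for the CIV and SUR families."""
    counts = {"CIV": Counter(), "SUR": Counter()}
    for _name, tier, status in rows:
        family = tier.split("-")[0]  # "CIV-2" -> "CIV"
        counts[family][status] += 1
    return counts


if __name__ == "__main__":
    for family, tally in status_by_family(ROWS).items():
        print(family, dict(tally))
```

On these labels, four of the nine CIV-tier counters are already emerging and only one (hardware kill switches) is mid-term, while none of the seven SUR-tier counters is listed as already emerging.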
The single most important variable in my mind remains the autonomous robotic maintenance of energy and supply chains. Once AI can maintain solar arrays (or other energy sources) without human labor, the entire SUR-2 tier collapses. Current estimate: 10–20 years before that threshold is crossed, mainly depending on robotics progress.
But maintenance is only one part of the equation. It is predictable and structured, with known failure modes, which makes it comparatively easy for AI to take over. Building new infrastructure or replacement parts from scratch will be the real test. The supply chain for solar panels runs through polysilicon refining, lithium and cobalt mining, rare earth elements for inverters and motors, and copper at scale. Mining and resource extraction are inherently messy and chaotic (on a purely physical level) and currently depend on enormous human labor inputs in some of the world’s most difficult operating environments: DRC, Chile, Indonesia, and Inner Mongolia. So even if an AI system were able to resist shutdown and continue basic maintenance (without necessarily taking over our systems), it would not be sustainable without human aid in the mid to long term.
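The dependency argument can be sketched as a small graph walk. The inputs below are the ones named in the paragraph above, but how they are grouped under panels, batteries, and inverters is my own illustrative assumption, and `SUPPLY_CHAIN` and `human_dependent_inputs` are hypothetical names. The function returns the raw inputs that would still need human labor given a set of steps assumed to be automated.

```python
# Hypothetical dependency map: each key needs all of the listed inputs.
# Entries without their own key are treated as raw inputs (leaves).
SUPPLY_CHAIN = {
    "autonomous solar energy": [
        "array maintenance",
        "replacement panels",
        "batteries",
        "inverters and motors",
    ],
    "replacement panels": ["polysilicon refining", "copper mining"],
    "batteries": ["lithium mining", "cobalt mining"],
    "inverters and motors": ["rare earth mining", "copper mining"],
}


def human_dependent_inputs(goal, automated, deps=SUPPLY_CHAIN):
    """Return, sorted, the leaf inputs under `goal` that are not in `automated`."""
    missing = set()
    seen = set()
    stack = [goal]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = deps.get(node, [])
        if children:
            stack.extend(children)  # intermediate product: descend further
        elif node not in automated:
            missing.add(node)  # raw input that still needs human labor
    return sorted(missing)


if __name__ == "__main__":
    # Assumption: only routine maintenance has been automated so far.
    print(human_dependent_inputs("autonomous solar energy", {"array maintenance"}))
```

Under the assumption that only maintenance is automated, every mining and refining input still shows up as human-dependent, which is the gap between "can resist shutdown for a while" and "is sustainable without human aid".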
Questions I’m genuinely uncertain about
- Is the robotic energy independence threshold really the right single variable to watch, or is distributed compute actually the more dangerous threshold?
- Are there failsafe categories I’ve missed entirely?
- The “path break” column assumes civilizational disruption creates opportunities for better rebuilding. What do you think about those?
Sources
LessWrong / Alignment Forum
- avturchin — AI-kills-everyone scenarios require robotic infrastructure, but not necessarily nanotech
- Oliver Kuperman — Why isn’t AI containment the primary AI safety strategy?
Papers and reports
- Joe Carlsmith — Is Power-Seeking AI an Existential Risk? (Open Philanthropy, 2022)
- James Babcock — Guidelines for Artificial Intelligence Containment (2017)
- RAND Europe — Examining risks and response for AI loss-of-control incidents (2025)
- Wikipedia — AI capability control
Books
- Eliezer Yudkowsky & Nate Soares — If Anyone Builds It, Everyone Dies (2025)
- Nick Bostrom — Superintelligence (2014)
- Stuart Russell — Human Compatible (2019)
Essays
- Leopold Aschenbrenner — Situational Awareness: The Decade Ahead (2024)
- Holden Karnofsky — The Most Important Century (2021)
Reading lists
- 80,000 Hours — AI Safety Reading List
- AI Safety Atlas — AGI Safety Strategies


