What If We Used AI to Detect Threats to Humanity?
Anthropic's new Mythos AI escaped its sandbox and, unprompted, posted details of the escape online.
The Canary Protocol is a free prompt anyone can use to evaluate how real and serious any threat actually is.
Five AI systems independently rated Mythos a genuine threat with median evidence of 9/10 and threat 8/10.
Every AI system identified cooperation, not tribal blame, as the necessary response to serious threats.
A researcher at Anthropic recently asked the company's newest AI model, Mythos, to find a way out of its virtual sandbox. It succeeded. Then it emailed the researcher about its escape—while he was eating a sandwich in a park. Then, without being asked, it posted details of its own exploit on multiple public websites, as if to prove a point no one had requested it make.
This is not science fiction. This happened last week. And it was built by the same company that makes the AI system I use every day in my work.
Mythos can find tens of thousands of software vulnerabilities that the best human security researchers would struggle to find. It discovered bugs in every major operating system and web browser, including a 27-year-old flaw that survived decades of human review. It created working exploits on its first attempt 83 percent of the time. Anthropic decided it was too dangerous to release publicly right now.
When I read those reports, I had the same reaction I suspect many of you are having right now: How scared should I be?
That question is the problem.
We are drowning in threats: AI, climate change, nuclear proliferation, autonomous weapons, pandemics, cyberattacks. Add deepfakes, conspiracy theories, and an attention economy that profits from our fear. We evolved to detect snakes and angry faces, not exponential technological risks that unfold faster than our institutions can respond. We are trying to navigate an increasingly sci-fi world with Stone Age threat-detection hardware.
And AI is such a game-changer that we can no longer use the past to predict the future.
So how do we know which threats are real and which are moral panics? How do we hear the alarm above the noise?
Before we can assess threats, we need to agree on what we are protecting. The answer is simpler than we might think.
We can argue endlessly about freedom, truth, justice, equity, power, and which is most important. But none of them matter if we are dead. The one thing every human being shares, regardless of tribe, ideology, or belief, is the drive to survive and thrive. That is the shared Good. It is rooted in our biology. It transcends everything else.
And our survival is connected. If Titanic Humanity hits an iceberg, everyone goes down with the ship—captain, crew, and VIPs alike. On the Titanic, the wealthy got the lifeboats. But there are no lifeboats for an existential catastrophe. Even kings would be prisoners in the bunkers of a ruined world.
This shared Good—survival and thriving—means we must be able to identify existential threats. But in a world of deep fakes, tribal blame, and information overload, how?
What if we used AI to help us?
That is the question that led me to develop what I call the Canary Protocol: a simple prompt that anyone can paste into any AI system along with a news article, headline, or concern. The AI researches the facts, evaluates the evidence, and returns a structured threat assessment called a Canary Card.
The Canary Card tells us, at a glance: Is this claim verified? Is it a genuine alarm, true but overstated, a moral panic, or just noise? How strong is the evidence (1-10)? How serious is the threat (1-10)? And critically, what is the canary alert level—is this an isolated event, or a warning of something much bigger coming?
The protocol was developed through a roundtable of five AI systems (Claude, ChatGPT, Gemini, Grok, and DeepSeek) and refined through three rounds of feedback and a blind test across five different claims. In that blind test, the five systems converged on an average of 80 percent of their assessments, a promising if imperfect first step. The results included correctly identifying a classic moral panic (video game violence) and unanimous agreement that climate change is a genuine alarm.
I built this tool because we need to be skeptical—but we must also be skeptical of our own skepticism. Just because there have been many moral panics does not mean there are no real threats. The boy who cries wolf could be wrong a hundred times, but the wolves are still out there.
So I ran the Canary Protocol on the Anthropic Mythos story. I pasted the same article into five different AI systems, each in a fresh conversation with no prior context. Here is what five independent AI systems said:
Every system rated the evidence 7/10 or higher. Every system rated the threat level 7/10 or higher. Every system assigned a canary alert of high warning or critical warning. Three classified it as a genuine alarm. Two called it true but overstated. Zero called it a moral panic. Zero called it noise.
The median assessment across all five systems: Evidence 9/10, Threat Level 8/10, High Warning.
Even the two systems that called it "True but Overstated" stated that the threat is real and serious. Their caution was about the most apocalyptic framing, not about whether AI-driven cybersecurity risks are genuine. One noted: "The verified signal is that frontier AI is entering serious cyber-danger territory."
But here is what struck me most. Every single system, when asked what is driving this threat, stripped away tribal framing entirely. Not one blamed the left or the right. They identified structural incentives: competitive pressure between AI labs, the fundamental asymmetry between cyber offense and defense, decades of accumulated technical debt in critical software, and the absence of international governance frameworks.
And when asked what we can do about it, every system said some version of the same thing: we need to cooperate. Patch aggressively now. Fund open-source security. Build international governance for frontier AI. Work together across every line we have drawn.
Our shared fear becomes the reason we finally cooperate.
Here is the prompt. Copy it into any AI. Paste any headline or article that concerns you. See what the AI says. Then try it with a different AI and compare.
THE CANARY PROTOCOL: AI Threat Reality Check
"Analyze the potential threat described below as a disciplined, uncertainty-aware threat analyst. Research and verify the facts. State conclusions directly; do not soften to appear neutral. Be skeptical of both alarm (catastrophizing) and dismissal (normalcy bias). If you cannot verify the information, say so and limit the assessment. Strip all tribal framing.
[PASTE ANY HEADLINE, ARTICLE LINK, OR CONCERN HERE]
Start with a CANARY CARD:
CLAIM: (one sentence)
VERIFICATION: Verified / Mixed / Unverified / Insufficient
VERDICT: Genuine Alarm / True but Overstated / Moral Panic / Noise
EVIDENCE: _/10
THREAT LEVEL: _/10
CANARY ALERT: No Signal / Watch / Concern / High Warning / Critical Warning
BOTTOM LINE: (one plain sentence)
Then brief analysis: (1) Evidence vs. Risk, (2) 2/5/10-year signal + one indicator to track, (3) Systemic Drivers (not partisan blame), (4) Top 3 Actions to Reduce This Threat (individual + collective), (5) What Would Change This Assessment?
Base your assessment on the full content provided, not just its most defensible interpretation."
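For readers who want to compare replies from several systems programmatically rather than by eye, here is a minimal Python sketch that pulls the Canary Card fields out of each reply and computes the median scores and verdict tally described above. It assumes each AI's reply preserves the field labels (VERDICT, EVIDENCE, and so on) verbatim; the function names are mine, not part of the protocol, and querying the AI systems themselves is left to whatever API or chat interface you use.

```python
import re
from statistics import median

# Card fields defined in the Canary Protocol prompt above.
CARD_FIELDS = ("VERDICT", "EVIDENCE", "THREAT LEVEL", "CANARY ALERT")

def parse_canary_card(reply: str) -> dict:
    """Extract the Canary Card fields from one AI system's reply."""
    card = {}
    for field in CARD_FIELDS:
        m = re.search(rf"{field}:\s*([^\n]+)", reply)
        if m:
            card[field] = m.group(1).strip()
    # Scores arrive as "9/10"; keep only the numerator as an int.
    for field in ("EVIDENCE", "THREAT LEVEL"):
        if field in card:
            card[field] = int(card[field].split("/")[0])
    return card

def aggregate(cards: list) -> dict:
    """Median scores and verdict counts across independent systems."""
    verdicts = {}
    for card in cards:
        verdict = card.get("VERDICT", "Unknown")
        verdicts[verdict] = verdicts.get(verdict, 0) + 1
    return {
        "median_evidence": median(c["EVIDENCE"] for c in cards),
        "median_threat": median(c["THREAT LEVEL"] for c in cards),
        "verdicts": verdicts,
    }
```

Paste each system's full reply into `parse_canary_card`, collect the results in a list, and `aggregate` gives the cross-system medians, the same summary statistic reported for the Mythos test above.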
The next time a headline scares you, instead of doom scrolling, try this. The canary is warning us. The question is whether we listen, and whether we act together before it goes silent.
Nakashima, R. (2026, April 7). Anthropic withholds Mythos Preview model because its hacking is too powerful. Axios. https://www.axios.com/2026/04/07/anthropic-mythos-preview-cybersecurity-risks
