On April 7, 2026, Anthropic released Claude Mythos Preview. The numbers are unambiguous: 93.9% on SWE-bench Verified, 94.6% on GPQA Diamond, 82% on Terminal-Bench 2.0. Every major coding, reasoning, and agentic benchmark has a new record holder. If you want a broader model landscape comparison, see our Mistral vs. OpenAI vs. Anthropic breakdown.
There is a catch: you cannot access it. Not yet, possibly not this year. Anthropic made the unusual decision to withhold public release because Mythos Preview's autonomous cybersecurity capabilities are judged too powerful to ship without additional safeguards. This article covers what those numbers actually mean, how Mythos compares to Opus 4.6 across every benchmark category, and what the access restriction tells us about where this capability curve is heading.
This is a technical model review. No hype, no speculation about product roadmaps. Just the data, what it implies for engineering teams running LLMs in production, and what to watch for next.
A general-purpose model, not a specialized tool
First point worth stating clearly: Claude Mythos Preview was not trained specifically for cybersecurity or code. It is a general-purpose model, the same architectural family as the Claude versions currently available via the API. Writing, analysis, summarization, multi-step reasoning, code — all of the standard capabilities.
What distinguishes it is that its general capabilities have crossed a threshold where it becomes exceptionally effective at domains that previously required top-tier human expertise. The model did not learn to find security vulnerabilities as a targeted skill. It learned to reason and code well enough that finding vulnerabilities became a natural consequence.
Key point
Mythos Preview is not a hacking tool. It is a general-purpose model whose reasoning and code capabilities are strong enough that offensive security becomes an emergent property — not a designed feature. This distinction matters for how you think about where capability ceilings will appear next.
Code benchmarks: the largest gains
Coding tasks are where the delta is most measurable. SWE-bench variants test the ability to fix real bugs in real open-source projects — not toy exercises. These are actual GitHub issues with existing test suites that must pass after the fix.
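To make the pass/fail criterion concrete, here is a minimal sketch of what a SWE-bench-style evaluation step looks like, assuming a repo already checked out at the task's pinned commit. The repository path, patch, and test command are hypothetical; the real harness isolates each task in its own container.

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: str, patch_text: str, test_command: list[str]) -> bool:
    """Apply a model-generated patch to a checked-out repo and run its test suite.

    A task counts as solved only if the patch applies cleanly and every test passes,
    which is the criterion SWE-bench-style harnesses use."""
    repo = Path(repo_dir)
    patch_file = repo / "model_fix.patch"
    patch_file.write_text(patch_text)

    # Apply the patch; a malformed diff fails the task immediately.
    applied = subprocess.run(["git", "apply", patch_file.name], cwd=repo, capture_output=True)
    if applied.returncode != 0:
        return False

    # Run the issue's associated tests, e.g. ["pytest", "-x", "tests/test_regression.py"].
    tests = subprocess.run(test_command, cwd=repo, capture_output=True)
    return tests.returncode == 0

# Hypothetical usage; path, patch, and test target are illustrative only.
# solved = evaluate_patch("/tmp/task_repo", model_patch, ["pytest", "-x", "tests/"])
```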
| Benchmark | Mythos Preview | Opus 4.6 | Delta |
|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | +13.1 pts |
| SWE-bench Pro | 77.8% | 53.4% | +24.4 pts |
| SWE-bench Multilingual | 87.3% | 77.8% | +9.5 pts |
| SWE-bench Multimodal | 59.0% | 27.1% | +31.9 pts |
| Terminal-Bench 2.0 | 82.0% | 65.4% | +16.6 pts |
What these numbers mean in practice:
- SWE-bench Verified (93.9%): 94 out of 100 real open-source bugs, fixed correctly. Twelve months ago, frontier models were capped around 50% on this benchmark.
- SWE-bench Pro (+24.4 pts): The hardest bugs in the dataset — multi-file, multi-component issues that previously required a senior engineer. Going from 53% to 78% means the model is now solving problems that were genuinely hard for experienced developers.
- SWE-bench Multimodal (+31.9 pts): The model can now process screenshots and mockups to diagnose visual bugs. The largest absolute gain of any benchmark. This is directly relevant to any pipeline doing LLM-assisted code review or test generation.
- Terminal-Bench 2.0 (82%): Tests autonomous terminal operation — the model works like a developer would in a shell session. The score climbs to 92% with extended compute budget, which signals that this ceiling has room to move.
Lesson learned
The +31.9 pt jump on SWE-bench Multimodal is the benchmark to watch. Most production LLM pipelines still treat code and visual context as separate inputs. A model that jointly reasons over both changes the viable design space for automated code review and CI tooling substantially.
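You cannot call Mythos-level joint visual-code reasoning today, but the request shape already exists in the public Messages API. A minimal sketch using the official `anthropic` Python SDK; the model id, screenshot path, and prompt are placeholders:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def review_visual_bug(screenshot_path: str, component_source: str) -> str:
    """Send a UI screenshot and the relevant component source in a single request,
    asking the model to reason over both and propose a fix."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder: use whichever Claude model you have access to
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text",
                 "text": "This screenshot shows a rendering bug. Component source:\n\n"
                         f"{component_source}\n\nIdentify the likely cause and propose a patch."},
            ],
        }],
    )
    return "".join(block.text for block in response.content if block.type == "text")
```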
Reasoning benchmarks: approaching expert-level ceilings
GPQA Diamond and Humanity's Last Exam are designed to test the upper bound of reasoning under expert-level difficulty. GPQA Diamond consists of PhD-level physics, biology, and chemistry questions hard enough that skilled non-experts fail them even with web access and domain experts still miss a meaningful share. HLE goes further — it aggregates the hardest questions researchers across disciplines can devise.
| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| GPQA Diamond | 94.6% | 91.3% |
| Humanity's Last Exam (no tools) | 56.8% | 40.0% |
| Humanity's Last Exam (with tools) | 64.7% | 53.1% |
GPQA Diamond at 94.6% puts Mythos Preview at or above the level of the best human domain experts. The gap versus Opus 4.6 looks narrow in absolute terms (3.3 pts), but at this difficulty tier each incremental point represents a qualitative jump in the problems the model can handle.
Humanity's Last Exam is more revealing. Going from 40% to 56.8% without tools is a 42% relative improvement — on the hardest benchmark currently in use. With tools enabled, the gap widens further: 53.1% to 64.7%. This is directly relevant to agentic workloads where the model calls external APIs or retrieval systems as part of a reasoning chain. See our Agentic RAG and multi-agent orchestration guides for how this plays out in production pipelines.
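The "with tools" rows map directly onto the tool-use loop most agentic stacks already run: the model reasons, requests a tool call, receives the result, and continues. A minimal sketch of that loop against the public Messages API, assuming a hypothetical `search_docs` retrieval tool and a placeholder model id:

```python
import anthropic

client = anthropic.Anthropic()

# One retrieval tool; the name and schema are illustrative, not a benchmark-mandated interface.
tools = [{
    "name": "search_docs",
    "description": "Search an internal document index and return the most relevant passages.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_agentic_query(question: str, search_fn) -> str:
    """Let the model interleave reasoning with tool calls until it returns a final answer."""
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-opus-4-20250514",  # placeholder for whatever model tier you run
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute each requested tool call and feed the results back into the loop.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": search_fn(b.input["query"])}
            for b in response.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```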
Agentic benchmarks: better and more efficient
Agentic benchmarks measure the ability to act autonomously over multiple steps: web navigation, operating system control, chained tool use without human checkpointing.
| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| BrowseComp (web research) | 86.9% | 83.7% |
| OSWorld-Verified (OS control) | 79.6% | 72.7% |
The headline agentic number is not the accuracy delta — it is the efficiency. On BrowseComp, Mythos Preview achieves its higher score while consuming 5x fewer tokens than Opus 4.6. More accurate and radically cheaper per task. For teams running agentic loops at scale, this is the operationally significant result. Token cost is often the binding constraint on how frequently you can run autonomous agents in production.
OSWorld-Verified tests real OS task completion — opening applications, editing files, navigating system menus — entirely without human assistance. 79.6% means the model completes nearly 4 out of 5 arbitrary OS tasks autonomously. Combined with the Terminal-Bench results, this establishes Mythos Preview as the strongest model to date for autonomous agent workloads.
Lesson learned
5x token efficiency on BrowseComp is not a rounding error. If you are running agentic loops today and amortizing cost across many calls, Mythos Preview-level efficiency in a future public release could change the ROI math for use cases that are currently marginal. Factor this into your architecture planning now.
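A quick back-of-the-envelope sketch of why a 5x token reduction changes the ROI math rather than just trimming a line item. Every number here (price, task volume, tokens per task) is an assumption for illustration, not a published figure:

```python
# Back-of-the-envelope cost comparison for an agentic loop. All numbers are assumptions.
PRICE_PER_MTOK = 15.00      # assumed blended $/million tokens for a frontier model
TASKS_PER_DAY = 2_000       # how often the agent runs in your pipeline

def daily_cost(tokens_per_task: int) -> float:
    return TASKS_PER_DAY * tokens_per_task * PRICE_PER_MTOK / 1_000_000

baseline  = daily_cost(400_000)   # assumed Opus-4.6-style consumption per task
efficient = daily_cost(80_000)    # the same task at 5x fewer tokens

print(f"baseline:  ${baseline:,.0f}/day")    # $12,000/day
print(f"efficient: ${efficient:,.0f}/day")   # $2,400/day
```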
Why the model is not publicly available
This is the most operationally unusual part of the announcement. Frontier labs typically race to commercialize record-setting models. Anthropic is doing the opposite.
The reason is the model's autonomous cybersecurity capability. As a side effect of reaching this level of general code and reasoning performance, Mythos Preview can:
- Discover unknown vulnerabilities in widely deployed production software
- Write functional exploits that demonstrate or weaponize those vulnerabilities
- Chain multiple vulnerabilities to construct sophisticated attack paths
- Perform all of the above autonomously, without human guidance at each step
Anthropic's response is to restrict access to Project Glasswing partners — Google, Microsoft, AWS, Apple, Cisco, CrowdStrike, and others — for defensive security use only. The framing is that defenders should be able to use these capabilities to find and patch vulnerabilities before equivalent models become accessible elsewhere.
Anthropic's stated strategy
Give defenders access now to remediate vulnerabilities before equivalent offensive capability reaches the open market. Ship mitigations in a future Claude Opus release before opening this tier of capability to the general developer API. The timeline for that release has not been disclosed.
What the trajectory signals
Scaling is not plateauing
There has been substantial debate over the past year about whether LLM scaling has hit a wall. Mythos Preview ends that conversation, at least for Anthropic's research line. The jump from Opus 4.6 to Mythos Preview is one of the largest inter-generation deltas any major lab has published. SWE-bench Pro went from 53% to 78%. SWE-bench Multimodal went from 27% to 59%. These are not marginal refinements.
For teams currently weighing architecture choices — whether to invest heavily in fine-tuning, RAG, or prompt engineering — the implication is that the baseline model capability ceiling will keep rising. Architectures that depend on compensating for model weakness with retrieval or fine-tuning may need to be revisited as those weaknesses close.
Autonomy is a qualitative shift, not a quantitative one
What makes Mythos Preview categorically different from previous Claude versions is not just that it knows more. It is that it can execute complex, multi-step tasks without human checkpoints. Finding a zero-day in OpenBSD requires reading thousands of lines of C, forming hypotheses about attack surfaces, writing test cases, iterating on failures, and synthesizing results. Mythos does this loop without supervision.
This is the threshold between a model that assists an expert and a model that replaces the expert's execution. If you are building agent-based automation and your current architecture requires human review at every non-trivial decision point, that architecture assumption may not survive the next two model generations. The Model Context Protocol is worth understanding now as a framework for structured agent-tool interfaces that hold up as model autonomy increases.
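As a concrete picture of what a structured agent-tool interface looks like, here is a minimal MCP server sketch using the official Python SDK (`pip install mcp`). The repository-search tool and its stub implementation are hypothetical:

```python
# Minimal MCP server exposing one tool, using the official Python SDK.
# The repository-search tool and its stub implementation are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("repo-tools")

@mcp.tool()
def search_repository(query: str, max_results: int = 5) -> list[str]:
    """Search the codebase for files relevant to the query."""
    # Replace with a real index lookup; a stub keeps the sketch self-contained.
    return [f"src/example_{i}.py" for i in range(max_results)]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so any MCP-capable client can attach
```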
Dangerous capabilities emerge without being designed
Nobody at Anthropic trained Mythos to find security vulnerabilities. Those capabilities emerged as a consequence of improving general code and reasoning performance. This is a significant signal for the industry: as models become more capable, sensitive capability classes emerge whether you intend them or not.
The implication is that "we did not train for X" is no longer a sufficient safety argument. If X is a natural consequence of strong general capability, any lab training at the frontier will encounter it. This changes how AI risk should be evaluated — not just by intended use but by what general capability level enables.
Practical implications for engineering teams
If you are already running Claude in production
Mythos Preview is not available to you today, but the capabilities demonstrated here define the direction of upcoming public Claude releases. The gains in code, reasoning, and autonomous task execution will appear in future Opus and Sonnet versions once Anthropic ships the required safety mitigations. Your current production integrations will improve without architectural changes on your end.
In the meantime: if you are not already using structured outputs and evaluation pipelines to assess model quality as models evolve, build them now. The capability curve means model substitutions will keep happening, and you need the infrastructure to validate them.
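A minimal sketch of that kind of evaluation pipeline, assuming the `anthropic` SDK and a small version-controlled eval set. The cases, checks, and model ids are placeholders; the point is that swapping the model id is the only change a new release requires:

```python
import anthropic

client = anthropic.Anthropic()

# A fixed, version-controlled eval set. The cases and checks here are illustrative.
EVAL_CASES = [
    {"prompt": "Return the ISO 8601 date of the last day of February 2024.",
     "check": lambda out: "2024-02-29" in out},
    {"prompt": "Extract the function name from: def resolve_dns(host: str) -> str:",
     "check": lambda out: "resolve_dns" in out},
]

def run_eval(model_id: str) -> float:
    """Run the fixed eval set against one model version and return the pass rate."""
    passed = 0
    for case in EVAL_CASES:
        response = client.messages.create(
            model=model_id, max_tokens=256,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        output = "".join(b.text for b in response.content if b.type == "text")
        passed += bool(case["check"](output))
    return passed / len(EVAL_CASES)

# Compare versions with a one-line change; model ids are placeholders.
# print(run_eval("claude-opus-4-20250514"), run_eval("claude-sonnet-4-20250514"))
```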
If you are evaluating model selection for a new project
The gap between frontier closed models and open-weight alternatives is widening again. If your use case requires strong autonomous reasoning or complex multi-hop code tasks, the currently available frontier — Opus 4.6, GPT-5, Gemini Ultra — is meaningfully below what Mythos Preview demonstrates. Factor a 12–18 month horizon into your architecture planning. See our guide to deploying LLMs to production for the tradeoffs that matter at different capability levels.
On the security implications
If a restricted-access model can find vulnerabilities in the most hardened software in the world, models with equivalent or near-equivalent capability will eventually be accessible more broadly — either through open weights or through API access at reduced safety thresholds. Engineering teams should treat this as a planning assumption: your codebase will be auditable by AI at a level you cannot currently anticipate. Addressing technical debt and security hygiene now is cheaper than doing it reactively. An AI audit is a reasonable first step to understand where your current stack has exposure.
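At its most basic, "auditable by AI" is already something you can run today. A single-file review sketch assuming the `anthropic` SDK; the paths, prompt, and model id are placeholders, and a real audit would chunk large files and aggregate findings across the repository:

```python
import anthropic
from pathlib import Path

client = anthropic.Anthropic()

AUDIT_PROMPT = (
    "You are reviewing source code for security issues. List likely vulnerabilities "
    "(injection, path traversal, unsafe deserialization, missing auth checks), the "
    "code regions involved, and a one-line suggested fix for each."
)

def audit_file(path: str, model_id: str = "claude-opus-4-20250514") -> str:
    """Run a single-file security review pass with a currently available model."""
    source = Path(path).read_text()
    response = client.messages.create(
        model=model_id, max_tokens=2048,
        messages=[{"role": "user", "content": f"{AUDIT_PROMPT}\n\n<source>\n{source}\n</source>"}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

# Hypothetical usage over a few high-risk entry points:
# for f in ["app/auth.py", "app/upload.py"]:
#     print(audit_file(f))
```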
Talk to an engineer
Thinking through model selection or agent architecture for your stack? We can help.
Summary
Claude Mythos Preview is a step change, not an incremental release:
- 93.9% on SWE-bench Verified — real bug fixes in real codebases, not toy tasks
- +24.4 pts on SWE-bench Pro over Opus 4.6 — a generational jump on the hardest coding benchmark in the suite
- +31.9 pts on SWE-bench Multimodal — joint visual-code reasoning now viable at this scale
- 94.6% on GPQA Diamond — at or above top human domain experts in physics, biology, chemistry
- 5x token efficiency gain on BrowseComp — agentic workloads become substantially cheaper
- Withheld from public access because autonomous offensive security capability emerged as a side effect of general capability improvement
The access restriction is itself a data point. When a lab decides a model is too capable to release, that is a signal about where the capability frontier actually is. Engineering teams building on current public models should plan for that frontier to arrive in their production environment within 12–18 months. Build evaluation infrastructure that will survive model upgrades, and architect for autonomy before it is forced on you.
Further reading
- Mistral vs. OpenAI vs. Anthropic — Full model landscape comparison: capability tiers, pricing, and which lab to pick for which workload.
- Fine-tuning vs. RAG vs. prompting — Decision framework for choosing the right adaptation strategy as model capability baselines shift.
- Agentic RAG — How retrieval becomes dynamic when you hand the retrieval tool to a model with Mythos-level reasoning capability.
- Multi-agent orchestration compared — LangGraph vs. CrewAI vs. AutoGen vs. custom. Directly relevant as autonomous model capability increases.
- Model Context Protocol guide — Structured agent-tool interfaces that hold up as model autonomy increases.
- Advanced prompt engineering for production — Prompting techniques that still matter even as base model capability rises.
- Deploying LLMs to production — Latency, cost, observability: the engineering layer that Mythos Preview will eventually land in.
- Building custom LLM judges — How to build evaluators that survive model version upgrades.
- Structured outputs in production — Schema enforcement and output reliability across model generations.
- LLM integration service — Tensoria's end-to-end service for integrating frontier models into production stacks.
- AI agents service — Designing and deploying autonomous agent pipelines with the right model tier and safety constraints.
Talk to an engineer
Planning an architecture that will hold up through the next model generation? Let's talk.