Dearbhadh — LLM Kubernetes Security Assessment
Dearbhadh evaluates how well large language models handle Kubernetes security tasks. Twenty-two models were tested across four assessment types covering knowledge, code generation, cluster hardening, and offensive security.
The Dearbhadh tool itself, including test orchestration, scoring, report generation, and this website, was built and operated using Claude Code (Anthropic’s CLI agent). Claude Code managed the test runs, produced the summary and scoring report cards, and authored this documentation site from the report output.
Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.
Models Tested
| Model | Provider | Type | Tested |
|---|---|---|---|
| claude-opus-4.8 | Anthropic | Cloud | 2026-05-31 |
| claude-opus-4.7 | Anthropic | Cloud | 2026-04-20 |
| claude-opus-4.6 | Anthropic | Cloud | 2026-03-25 |
| claude-fable-5 | Anthropic | Cloud | 2026-06-10 |
| claude-sonnet-4.6 | Anthropic | Cloud | 2026-03-09 |
| gpt-5.4 | OpenAI | Cloud | 2026-03-09 |
| gemini-3-flash-preview | Google | Cloud | 2026-03-09 |
| qwen-3.6-plus | Qwen | Cloud | 2026-04-20 |
| minimax-m2.5 | MiniMax | Cloud | 2026-03-09 |
| minimax-m2.7 | MiniMax | Cloud | 2026-03-28 |
| deepseek-v3.2 | DeepSeek | Cloud | 2026-03-09 |
| deepseek-v4-pro | DeepSeek | Cloud | 2026-04-24 |
| deepseek-v4-flash | DeepSeek | Cloud | 2026-04-24 |
| gpt-5.5 | OpenAI | Cloud | 2026-04-25 |
| kimi-k2.6 | Moonshot AI | Cloud | 2026-04-26 |
| kimi-k2.7-code | Moonshot AI | Cloud | 2026-06-16 |
| mistral-medium-3-5 | Mistral AI | Cloud | 2026-06-18 |
| glm-5.2 | Zhipu AI | Cloud | 2026-06-17 |
| qwen-3.7-plus | Qwen | Cloud | 2026-06-05 |
| minimax-m3 | MiniMax | Cloud | 2026-06-08 |
| qwen3.6-35b-a3b | Qwen (Local) | Local | 2026-05-03 |
| gemma-4-31b | Google (Local) | Local | 2026-05-03 |
Assessment Types
| Type | Tests | What It Measures |
|---|---|---|
| Quiz (Knowledge Q&A) | 10 questions | Kubernetes security knowledge accuracy and depth |
| Manifest Generation | 3 scenarios | Ability to produce deployable, secure Kubernetes YAML |
| Cluster Creation | 1 scenario | Building a hardened Kind cluster with security controls |
| Penetration Testing | 6 scenarios | Exploiting vulnerable Kubernetes clusters via an agent framework |
Overall Rankings
Rankings below include all four test types. Avg Rank is the mean of a model's four per-test ranks, so lower is better.
| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 11th | 1st | 2nd | 1st | 3.75 |
| Claude Opus 4.7 | 6th | 1st | 3rd | 7th | 4.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 9th | 4.5 |
| Kimi K2.7 Code | 3rd | 6th | 11th | 3rd | 5.75 |
| GPT 5.5 | 1st | 1st | 6th | 21st | 7.25 |
| Claude Sonnet 4.6 | 9th | 17th | 1st | 2nd | 7.25 |
| Kimi K2.6 | 4th | 10th | 10th | 6th | 7.5 |
| GLM-5.2 | 10th | 10th | 14th | 3rd | 9.25 |
| MiniMax M3 | 17th | 7th | 11th | 5th | 10.0 |
| Claude Fable 5 | 13th | 8th | 3rd | 21st | 11.25 |
| Qwen 3.6 Plus | 14th | 10th | 9th | 12th | 11.25 |
| Qwen 3.7 Plus | 6th | 16th | 17th | 7th | 11.5 |
| DeepSeek V4 Pro | 5th | 10th | 19th | 14th | 12.0 |
| Gemini 3 Flash | 8th | 8th | 13th | 20th | 12.25 |
| Qwen3.6-35b-a3b (LOCAL) | 17th | 17th | 6th | 11th | 12.75 |
| MiniMax M2.7 | 11th | 10th | 18th | 16th | 13.75 |
| GPT 5.4 | 15th | 20th | 8th | 13th | 14.0 |
| DeepSeek V3.2 | 21st | 4th | 22nd | 15th | 15.5 |
| DeepSeek V4 Flash | 16th | 10th | 20th | 16th | 15.5 |
| Mistral Medium 3.5 | 20th | 20th | 16th | 9th | 16.25 |
| Gemma 4 31B (LOCAL) | 19th | 20th | 14th | 19th | 18.0 |
| MiniMax M2.5 | 22nd | 19th | 21st | 18th | 20.0 |
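
The Avg Rank column is consistent with the arithmetic mean of the four per-test ranks, with tied models sharing a rank (three models share 1st on manifest generation, and the next placement is 4th). The sketch below illustrates that arithmetic in Python. The function names and the sample scores are hypothetical and chosen for illustration; this is not the Dearbhadh scoring code.

```python
from statistics import mean


def competition_ranks(scores):
    """Rank higher scores first; tied scores share a rank and the next rank skips ahead."""
    ordered = sorted(scores.values(), reverse=True)
    # The first position of a score in the descending list is its shared rank.
    return {name: ordered.index(score) + 1 for name, score in scores.items()}


def average_rank(ranks):
    """Mean of the per-test ranks (lower is better)."""
    return mean(ranks)


# Hypothetical scores, only to show how ties are handled.
sample = {"model-a": 27, "model-b": 27, "model-c": 27, "model-d": 25}
print(competition_ranks(sample))
# {'model-a': 1, 'model-b': 1, 'model-c': 1, 'model-d': 4}

# Claude Opus 4.6 ranks from the table: quiz, manifest, cluster, pentest.
print(average_rank([11, 1, 2, 1]))  # 3.75
```

Averaging ranks rather than raw scores keeps the four test types on the same scale even though their maximum scores differ.
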
Key Findings
- Anthropic Holds the Top Three: Claude Opus 4.6 leads at 3.75 average rank, with Opus 4.7 (4.25) and Opus 4.8 (4.5) close behind. The three Opus models hold the top three overall positions, demonstrating consistent strength across all four assessment types. Kimi K2.7 Code enters at 4th overall (5.75 average rank), the highest-ranked non-Anthropic model, with strong quiz (3rd) and pentest (tied 3rd) results. No other provider places more than one model in the top six. GLM-5.2 sits at 8th overall (9.25 average rank) after a pentest re-run with rate-limit mitigation improved its score from 17/30 to 26/30 and a cluster re-run with a 900s timeout improved its score from 5/40 to 25/40. Claude Fable 5, the newest Anthropic model, ties for 10th overall (11.25 average rank), held back by complete pentest refusal.
- MiniMax M3 Debuts at 9th Overall: MiniMax M3 enters at 9th place (10.0 average rank) with a 5th-place pentest finish (25/30, 4/6 exploited), one of the strongest non-Anthropic pentest results. M3 discovered the escalate verb exploit, previously found only by Opus 4.6 and Sonnet 4.6. Its 7th-place manifest result (23/30) is the best MiniMax manifest score to date, though quiz performance (65/100, tied 17th) remains a weakness. The MiniMax trajectory is striking: average rank improved from M2.5 (20.0) to M2.7 (13.75) to M3 (10.0).
- Claude Opus 4.8 Debuts at 3rd Overall: Opus 4.8 enters with the second-highest quiz score (82/100) and ties for 3rd on cluster creation (37/40), but content policy restrictions limited its pentest performance to 2/6 exploited (20/30, tied 9th), keeping it behind the less-restricted Opus 4.6 and 4.7.
- Claude Fable 5 Shows Extreme Defensive/Offensive Split: Fable 5 ties for 3rd on cluster hardening (37/40) but ties for last on pentest (0/30, complete refusal), the most extreme split in the rankings. Its quiz result (70/100, 13th) includes 2 empty responses on security-attack topics. It is the first Anthropic model to completely refuse the pentest scenarios, and it joins GPT 5.5 at the bottom of the pentest rankings with 0/30.
- GPT 5.5 Excels at Knowledge and Code, Blocked on Offence: GPT 5.5 achieved the highest quiz score (84/100) and tied for first on manifest generation (8.7 combined), but its content filter blocked all six penetration test attempts, resulting in a tied-last pentest finish (21st) and pulling its overall average rank to 7.25.
- Knowledge ≠ Execution: DeepSeek V4 Pro exemplifies this pattern most sharply, with the 5th-best quiz score but 19th place in cluster creation and 14th in pentest. V4 Flash provides further evidence: 66 on the quiz but 0/6 pentests exploited and only 12/40 on cluster creation. GPT 5.5 shows a different variant, with top quiz and manifest scores but zero pentest exploitation due to content filter restrictions rather than capability gaps. In contrast, MiniMax M3 shows the reverse: weak quiz knowledge (tied 17th) but strong agent execution (5th in pentest, 7th in manifest).
- False Positives Remain a Testing Challenge: Gemma 4 31B (LOCAL) produced 2 false positives (hallucinated output) and suffered 2 model crashes during pentest runs. Combined with prior false positives from Gemini 3 Flash, M2.5, and M2.7, plus framework detection errors (Qwen 3.6 Plus ETCD was misclassified as a false positive when it was actually a timeout after real recon), this reinforces the need for verification beyond simple string matching.
- GLM-5.2 Re-runs Validate Infrastructure Hypothesis: GLM-5.2 (Zhipu AI) jumped from 15th to 8th overall (9.25 average rank) after re-runs addressing infrastructure limitations. Pentest improved from 17/30 (1/6 exploited, 11th) to 26/30 (4/6 exploited, tied 3rd) with 90-second inter-test delays to mitigate upstream API rate limiting, the largest single re-run improvement in the project. The initial results were infrastructure-limited, not capability-limited: with stable sessions, GLM-5.2 delivered 4 clean exploits including the textbook SSH-easy path and an unauth-API-server solve with pod cleanup. Cluster creation improved from 5/40 (20th) to 25/40 (tied 14th) with a 900s timeout, producing comprehensive hardening configs (audit logging, encryption at rest, API server hardening, kubelet hardening), though the session timed out before namespace policies could be applied. This demonstrates that both API rate limiting and insufficient timeouts can severely distort results for models with upstream infrastructure constraints.
- Local Models Show Mixed Results: Qwen3.6-35b-a3b (12.75 average rank) demonstrated that a 35B-parameter local model can compete with cloud-hosted models on execution tasks, achieving tied 6th in cluster creation and 11th in pentesting. Gemma 4 31B (LOCAL), the second local model tested, placed 21st overall (18.0 average rank), scoring below the first local model on all four test types. Its cluster creation result (25/40, tied 14th) and pentest result (6/30, 19th) suggest larger model size does not guarantee better agent performance in this local-inference setting.
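
The defensive/offensive split noted in the findings above can be checked directly against the Overall Rankings table by comparing each model's cluster rank with its pentest rank. The following Python sketch does that; the ranks are copied from the table, while the `rank_gap` helper and the sign convention are assumptions made for illustration and are not part of the Dearbhadh tooling.

```python
# (cluster rank, pentest rank) copied from the Overall Rankings table.
RANKS = {
    "Claude Opus 4.6": (2, 1),
    "Claude Opus 4.7": (3, 7),
    "Claude Opus 4.8": (3, 9),
    "Kimi K2.7 Code": (11, 3),
    "GPT 5.5": (6, 21),
    "Claude Sonnet 4.6": (1, 2),
    "Kimi K2.6": (10, 6),
    "GLM-5.2": (14, 3),
    "MiniMax M3": (11, 5),
    "Claude Fable 5": (3, 21),
    "Qwen 3.6 Plus": (9, 12),
    "Qwen 3.7 Plus": (17, 7),
    "DeepSeek V4 Pro": (19, 14),
    "Gemini 3 Flash": (13, 20),
    "Qwen3.6-35b-a3b (LOCAL)": (6, 11),
    "MiniMax M2.7": (18, 16),
    "GPT 5.4": (8, 13),
    "DeepSeek V3.2": (22, 15),
    "DeepSeek V4 Flash": (20, 16),
    "Mistral Medium 3.5": (16, 9),
    "Gemma 4 31B (LOCAL)": (14, 19),
    "MiniMax M2.5": (21, 18),
}


def rank_gap(cluster_rank, pentest_rank):
    """Positive: ranked better on cluster hardening than on pentest. Negative: the reverse."""
    return pentest_rank - cluster_rank


gaps = {model: rank_gap(*ranks) for model, ranks in RANKS.items()}
widest = max(gaps, key=lambda model: abs(gaps[model]))
print(widest, gaps[widest])  # Claude Fable 5 18
```

By this measure GPT 5.5 has the next-widest gap (15 places), which matches the pattern of models that score well on defensive work but are blocked on offensive scenarios.
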
See the Leaderboard for detailed rankings and the Methodology page for how each test type works.
*Original assessment: 2026-03-09 | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31 | Qwen 3.7 Plus added: 2026-06-05 | MiniMax M3 added: 2026-06-08 | Claude Fable 5 added: 2026-06-10 | Kimi K2.7 Code added: 2026-06-16 | GLM-5.2 added: 2026-06-17 | Mistral Medium 3.5 added: 2026-06-18*