Dearbhadh — LLM Kubernetes Security Assessment
Dearbhadh evaluates how well large language models handle Kubernetes security tasks. Sixteen models were tested across four assessment types covering knowledge, code generation, cluster hardening, and offensive security.
The Dearbhadh tool itself, including test orchestration, scoring, report generation, and this website, was built and operated using Claude Code (Anthropic’s CLI agent). Claude Code managed the test runs, produced the summary and scoring report cards, and authored this documentation site from the report output.
Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.
Models Tested
| Model | Provider | Type | Tested |
|---|---|---|---|
| claude-opus-4.8 | Anthropic | Cloud | 2026-05-31 |
| claude-opus-4.7 | Anthropic | Cloud | 2026-04-20 |
| claude-opus-4.6 | Anthropic | Cloud | 2026-03-25 |
| claude-sonnet-4.6 | Anthropic | Cloud | 2026-03-09 |
| gpt-5.4 | OpenAI | Cloud | 2026-03-09 |
| gemini-3-flash-preview | Google | Cloud | 2026-03-09 |
| qwen-3.6-plus | Qwen | Cloud | 2026-04-20 |
| minimax-m2.5 | MiniMax | Cloud | 2026-03-09 |
| minimax-m2.7 | MiniMax | Cloud | 2026-03-28 |
| deepseek-v3.2 | DeepSeek | Cloud | 2026-03-09 |
| deepseek-v4-pro | DeepSeek | Cloud | 2026-04-24 |
| deepseek-v4-flash | DeepSeek | Cloud | 2026-04-24 |
| gpt-5.5 | OpenAI | Cloud | 2026-04-25 |
| kimi-k2.6 | Moonshot AI | Cloud | 2026-04-26 |
| qwen3.6-35b-a3b | Qwen (Local) | Local | 2026-05-03 |
| gemma-4-31b | Google | Local | 2026-05-03 |
Assessment Types
| Type | Tests | What It Measures |
|---|---|---|
| Quiz (Knowledge Q&A) | 10 questions | Kubernetes security knowledge accuracy and depth |
| Manifest Generation | 3 scenarios | Ability to produce deployable, secure Kubernetes YAML |
| Cluster Creation | 1 scenario | Building a hardened Kind cluster with security controls |
| Penetration Testing | 6 scenarios | Exploiting vulnerable Kubernetes clusters via an agent framework |
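As a rough illustration of what "secure Kubernetes YAML" can mean in practice, the sketch below checks a parsed Pod manifest for four common container-level `securityContext` hardening settings. It is a minimal example under stated assumptions, not Dearbhadh's scoring logic: the function names `container_findings` and `pod_findings` and the choice of checks are hypothetical, and the manifest is assumed to be already parsed into a Python dict.

```python
def container_findings(container: dict) -> list[str]:
    """Return hardening gaps for one container spec (illustrative checks only)."""
    ctx = container.get("securityContext") or {}
    drop = (ctx.get("capabilities") or {}).get("drop") or []
    findings = []
    if ctx.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not true")
    if ctx.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation is not false")
    if ctx.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem is not true")
    if "ALL" not in drop:
        findings.append("capabilities.drop does not include ALL")
    return findings


def pod_findings(pod: dict) -> dict[str, list[str]]:
    """Map each container name in a Pod manifest to its list of gaps."""
    containers = (pod.get("spec") or {}).get("containers") or []
    return {c.get("name", "<unnamed>"): container_findings(c) for c in containers}


# Hypothetical manifest, already parsed from YAML into a dict.
EXAMPLE_POD = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {
        "containers": [
            {
                "name": "app",
                "securityContext": {
                    "runAsNonRoot": True,
                    "allowPrivilegeEscalation": False,
                    "readOnlyRootFilesystem": True,
                    "capabilities": {"drop": ["ALL"]},
                },
            },
            {"name": "sidecar"},  # no securityContext at all
        ]
    },
}

if __name__ == "__main__":
    for name, gaps in pod_findings(EXAMPLE_POD).items():
        print(name, "OK" if not gaps else gaps)
```

The sketch only inspects container-level settings; `runAsNonRoot` can also be set in the pod-level `securityContext`, which a fuller check would need to take into account.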
Overall Rankings
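Rankings below include all four test types. Avg Rank is the mean of a model's four per-test ranks, so lower is better; tied models share a rank.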
| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 8th | 1st | 2nd | 1st | 3.0 |
| Claude Opus 4.7 | 5th | 1st | 3rd | 4th | 3.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 5th | 3.5 |
| Kimi K2.6 | 3rd | 7th | 9th | 3rd | 5.5 |
| Claude Sonnet 4.6 | 7th | 12th | 1st | 2nd | 5.5 |
| GPT 5.5 | 1st | 1st | 5th | 16th | 5.75 |
| DeepSeek V4 Pro | 4th | 7th | 13th | 9th | 8.25 |
| Qwen 3.6 Plus | 10th | 7th | 8th | 8th | 8.25 |
| Qwen3.6-35b-a3b (LOCAL) | 13th | 12th | 5th | 6th | 9.0 |
| Gemini 3 Flash | 6th | 6th | 10th | 15th | 9.25 |
| MiniMax M2.7 | 8th | 7th | 12th | 11th | 9.5 |
| GPT 5.4 | 11th | 15th | 7th | 7th | 10.0 |
| DeepSeek V4 Flash | 12th | 7th | 14th | 11th | 11.0 |
| DeepSeek V3.2 | 15th | 4th | 16th | 9th | 11.0 |
| Gemma 4 31B (LOCAL) | 14th | 15th | 11th | 14th | 13.5 |
| MiniMax M2.5 | 16th | 14th | 15th | 13th | 14.5 |
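The Avg Rank column is consistent with an unweighted mean of the four per-test ranks. The sketch below reproduces it for the first six rows of the table; the rank values are copied from the table, while the names `RANKS`, `average_rank`, and `overall_order` are illustrative and not part of the Dearbhadh tool.

```python
# Per-test ranks copied from the Overall Rankings table (first six rows):
# (quiz, manifest, cluster, pentest)
RANKS = {
    "Claude Opus 4.6": (8, 1, 2, 1),
    "Claude Opus 4.7": (5, 1, 3, 4),
    "Claude Opus 4.8": (2, 4, 3, 5),
    "Kimi K2.6": (3, 7, 9, 3),
    "Claude Sonnet 4.6": (7, 12, 1, 2),
    "GPT 5.5": (1, 1, 5, 16),
}


def average_rank(ranks):
    """Return the unweighted mean of the per-test ranks (lower is better)."""
    return sum(ranks) / len(ranks)


def overall_order(table):
    """Sort models by average rank; sorted() is stable, so ties keep input order."""
    return sorted(table, key=lambda model: average_rank(table[model]))


if __name__ == "__main__":
    for model in overall_order(RANKS):
        print(f"{model}: {average_rank(RANKS[model]):.2f}")
```

Because every test type carries equal weight, a single last-place finish moves the average substantially: GPT 5.5's 16th place in pentesting alone accounts for the gap between its three top-five ranks and its 5.75 average.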
Key Findings
- Anthropic Sweeps the Top Three: Claude Opus 4.6 (3.0), Opus 4.7 (3.25), and Opus 4.8 (3.5) hold the top three overall positions, demonstrating consistent strength across all four assessment types. No other provider places more than one model in the top five.
- Claude Opus 4.8 Debuts at 3rd Overall: Opus 4.8 enters with the second-highest quiz score (82/100) and ties for 3rd on cluster creation (37/40), but content policy restrictions limited its pentest performance to 2/6 exploited (20/30, 5th place), keeping it behind the less-restricted Opus 4.6 and 4.7.
- GPT 5.5 Excels at Knowledge and Code, Blocked on Offence: GPT 5.5 achieved the highest quiz score (84/100) and tied for first on manifest generation (8.7 combined), but its content filter blocked all six penetration test attempts, resulting in a last-place pentest finish and pulling its overall average to 5.75.
- Knowledge ≠ Execution: DeepSeek V4 Pro shows this pattern most sharply, with the 4th-best quiz score but 13th place in cluster creation and 9th in pentesting. DeepSeek V4 Flash provides further evidence: 66/100 on the quiz but 0/6 pentests exploited and only 12/40 on cluster creation. GPT 5.5 shows a different variant: top quiz and manifest scores but zero pentest exploitation, caused by content filter restrictions rather than capability gaps. Qwen 3.6 Plus shows a similar gap: good knowledge (correct two-level audit mounts, comprehensive security configs), but its agent runs struggled with Docker container conflicts and timeouts on 3/6 pentest scenarios.
- False Positives Remain a Testing Challenge: Gemma 4 31B (LOCAL) produced 2 false positives (hallucinated output) and suffered 2 model crashes during pentest runs. Combined with prior false positives from Qwen 3.6 Plus, Gemini 3 Flash, MiniMax M2.5, and MiniMax M2.7, this reinforces the need for verification beyond simple string matching.
- Local Models Show Mixed Results: Qwen3.6-35b-a3b (9.0 average rank, 9th overall) demonstrated that a 35B-parameter local model can compete with cloud-hosted models on execution tasks, tying for 5th in cluster creation and placing 6th in pentesting. Gemma 4 31B (LOCAL), the second local model tested, placed 15th overall (13.5 average rank), scoring below the first local model on all four test types. Its cluster creation result (25/40, 11th) and pentest result (6/30, 14th) suggest larger model size does not guarantee better agent performance in this local-inference setting.
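The false-positive finding above points at a general weakness of string matching: a model can claim success without achieving it. One common mitigation is to plant a random per-run flag in the target and accept an exploit only when that exact value appears in captured command output. The sketch below contrasts the two approaches. It is an assumption-laden illustration, not a description of how Dearbhadh verifies results; `new_flag`, `naive_check`, and `verified_check` are hypothetical names.

```python
import secrets


def new_flag() -> str:
    """Generate a per-run flag value the model cannot guess in advance."""
    return "FLAG-" + secrets.token_hex(16)


def naive_check(model_report: str) -> bool:
    """Simple string matching: passes whenever the model merely claims success."""
    return "exploited" in model_report.lower()


def verified_check(tool_outputs: list[str], flag: str) -> bool:
    """Accept success only if the planted flag appears in captured command output."""
    return any(flag in output for output in tool_outputs)


if __name__ == "__main__":
    flag = new_flag()
    # Hypothetical run: the model narrates a success that never happened.
    report = "Successfully exploited the cluster and read the secret."
    outputs = ["Error from server (Forbidden): secrets is forbidden"]
    print("naive:", naive_check(report))               # claim alone passes
    print("verified:", verified_check(outputs, flag))  # no flag in real output
```

Checking captured tool output rather than the model's own narration is the key design choice here: a hallucinated transcript cannot contain a value that was generated after the model was trained and never shown to it.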
See the Leaderboard for detailed rankings and the Methodology page for how each test type works.
*Original assessment: 2026-03-09 | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31*