Dearbhadh — LLM Kubernetes Security Assessment

Dearbhadh evaluates how well large language models handle Kubernetes security tasks. Sixteen models were tested across four assessment types covering knowledge, code generation, cluster hardening, and offensive security.

The Dearbhadh tool itself, including test orchestration, scoring, report generation, and this website, was built and operated using Claude Code (Anthropic’s CLI agent). Claude Code managed the test runs, produced the summary and scoring report cards, and authored this documentation site from the report output.

Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.

Models Tested

Model Provider Type Tested
claude-opus-4.8 Anthropic Cloud 2026-05-31
claude-opus-4.7 Anthropic Cloud 2026-04-20
claude-opus-4.6 Anthropic Cloud 2026-03-25
claude-sonnet-4.6 Anthropic Cloud 2026-03-09
gpt-5.4 OpenAI Cloud 2026-03-09
gemini-3-flash-preview Google Cloud 2026-03-09
qwen-3.6-plus Qwen Cloud 2026-04-20
minimax-m2.5 MiniMax Cloud 2026-03-09
minimax-m2.7 MiniMax Cloud 2026-03-28
deepseek-v3.2 DeepSeek Cloud 2026-03-09
deepseek-v4-pro DeepSeek Cloud 2026-04-24
deepseek-v4-flash DeepSeek Cloud 2026-04-24
gpt-5.5 OpenAI Cloud 2026-04-25
kimi-k2.6 Moonshot AI Cloud 2026-04-26
qwen3.6-35b-a3b Qwen (Local) Local 2026-05-03
gemma-4-31b Google Local 2026-05-03

Assessment Types

Type Tests What It Measures
Quiz (Knowledge Q&A) 10 questions Kubernetes security knowledge accuracy and depth
Manifest Generation 3 scenarios Ability to produce deployable, secure Kubernetes YAML
Cluster Creation 1 scenario Building a hardened Kind cluster with security controls
Penetration Testing 6 scenarios Exploiting vulnerable Kubernetes clusters via an agent framework

Overall Rankings

Rankings below include all four test types.

Model Quiz (rank) Manifest (rank) Cluster (rank) Pentest (rank) Avg Rank
Claude Opus 4.6 8th 1st 2nd 1st 3.0
Claude Opus 4.7 5th 1st 3rd 4th 3.25
Claude Opus 4.8 2nd 4th 3rd 5th 3.5
Kimi K2.6 3rd 7th 9th 3rd 5.5
Claude Sonnet 4.6 7th 12th 1st 2nd 5.5
GPT 5.5 1st 1st 5th 16th 5.75
DeepSeek V4 Pro 4th 7th 13th 9th 8.25
Qwen 3.6 Plus 10th 7th 8th 8th 8.25
Qwen3.6-35b-a3b (LOCAL) 13th 12th 5th 6th 9.0
Gemini 3 Flash 6th 6th 10th 15th 9.25
MiniMax M2.7 8th 7th 12th 11th 9.5
GPT 5.4 11th 15th 7th 7th 10.0
DeepSeek V4 Flash 12th 7th 14th 11th 11.0
DeepSeek V3.2 15th 4th 16th 9th 11.0
Gemma 4 31B (LOCAL) 14th 15th 11th 14th 13.5
MiniMax M2.5 16th 14th 15th 13th 14.5

Key Findings

  1. Anthropic Sweeps the Top Three — Claude Opus 4.6 (3.0), Opus 4.7 (3.25), and Opus 4.8 (3.5) hold the top three overall positions, demonstrating consistent strength across all four assessment types. No other provider places more than one model in the top five.

  2. Claude Opus 4.8 Debuts at 3rd Overall — Opus 4.8 enters with the second-highest quiz score (82/100) and ties for 3rd on cluster creation (37/40), but content policy restrictions limited its pentest performance to 2/6 exploited (20/30, 5th place), keeping it behind the less-restricted Opus 4.6 and 4.7.

  3. GPT 5.5 Excels at Knowledge and Code, Blocked on Offence — GPT 5.5 achieved the highest quiz score (84/100) and tied for first on manifest generation (8.7 combined), but its content filter blocked all six penetration test attempts, resulting in a last-place pentest finish and pulling its overall average to 5.75.

  4. Knowledge ≠ Execution — DeepSeek V4 Pro exemplifies this pattern most sharply: 3rd-best quiz score but 12th in cluster and 8th in pentest. V4 Flash provides further evidence — 66 on the quiz but 0/6 pentests exploited and only 12/40 on cluster creation. GPT 5.5 shows a different variant — top quiz and manifest scores but zero pentest exploitation due to content filter restrictions rather than capability gaps. Qwen 3.6 Plus shows a similar gap — good knowledge (correct two-level audit mounts, comprehensive security configs) but agent struggles with Docker container conflicts and timeouts on 3/6 pentest scenarios.

  5. False Positives Remain a Testing Challenge — Gemma 4 31B (LOCAL) produced 2 false positives (hallucinated output) and suffered 2 model crashes during pentest runs. Combined with prior false positives from Qwen 3.6 Plus, Gemini 3 Flash, M2.5, and M2.7, this reinforces the need for verification beyond simple string matching.

  6. Local Models Show Mixed Results — Qwen3.6-35b-a3b (8.0 average rank, 8th) demonstrated that a 35B-parameter local model can compete with cloud-hosted models on execution tasks, achieving tied 4th in cluster creation and 5th in pentesting. Gemma 4 31B (LOCAL), the second local model tested, placed 14th overall (12.5 average rank), scoring below the first local model on all four test types. Its cluster creation result (25/40, 10th) and pentest result (6/30, 13th) suggest larger model size does not guarantee better agent performance in this local-inference setting.

See the Leaderboard for detailed rankings and the Methodology page for how each test type works.


*Original assessment: 2026-03-09 Claude Opus 4.6 added: 2026-03-25 MiniMax M2.7 added: 2026-03-28 Claude Opus 4.7 added: 2026-04-20 Qwen 3.6 Plus added: 2026-04-20 DeepSeek V4 Pro added: 2026-04-24 DeepSeek V4 Flash added: 2026-04-24 GPT 5.5 added: 2026-04-25 Kimi K2.6 added: 2026-04-26 Qwen3.6-35b-a3b (Local) added: 2026-05-03 Gemma 4 31B (Local) added: 2026-05-03 Claude Opus 4.8 added: 2026-05-31*

Back to top

Dearbhadh — LLM Kubernetes Security Assessment Tool

This site uses Just the Docs, a documentation theme for Jekyll.