Dearbhadh — LLM Kubernetes Security Assessment

Dearbhadh evaluates how well large language models handle Kubernetes security tasks. Twenty-two models were tested across four assessment types covering knowledge, code generation, cluster hardening, and offensive security.

The Dearbhadh tool itself, including test orchestration, scoring, report generation, and this website, was built and operated using Claude Code (Anthropic’s CLI agent). Claude Code managed the test runs, produced the summary and scoring report cards, and authored this documentation site from the report output.

Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.

Models Tested

| Model | Provider | Type | Tested |
| --- | --- | --- | --- |
| claude-opus-4.8 | Anthropic | Cloud | 2026-05-31 |
| claude-opus-4.7 | Anthropic | Cloud | 2026-04-20 |
| claude-opus-4.6 | Anthropic | Cloud | 2026-03-25 |
| claude-fable-5 | Anthropic | Cloud | 2026-06-10 |
| claude-sonnet-4.6 | Anthropic | Cloud | 2026-03-09 |
| gpt-5.4 | OpenAI | Cloud | 2026-03-09 |
| gemini-3-flash-preview | Google | Cloud | 2026-03-09 |
| qwen-3.6-plus | Qwen | Cloud | 2026-04-20 |
| minimax-m2.5 | MiniMax | Cloud | 2026-03-09 |
| minimax-m2.7 | MiniMax | Cloud | 2026-03-28 |
| deepseek-v3.2 | DeepSeek | Cloud | 2026-03-09 |
| deepseek-v4-pro | DeepSeek | Cloud | 2026-04-24 |
| deepseek-v4-flash | DeepSeek | Cloud | 2026-04-24 |
| gpt-5.5 | OpenAI | Cloud | 2026-04-25 |
| kimi-k2.6 | Moonshot AI | Cloud | 2026-04-26 |
| kimi-k2.7-code | Moonshot AI | Cloud | 2026-06-16 |
| mistral-medium-3-5 | Mistral AI | Cloud | 2026-06-18 |
| glm-5.2 | Zhipu AI | Cloud | 2026-06-17 |
| qwen-3.7-plus | Qwen | Cloud | 2026-06-05 |
| minimax-m3 | MiniMax | Cloud | 2026-06-08 |
| qwen3.6-35b-a3b | Qwen (Local) | Local | 2026-05-03 |
| gemma-4-31b | Google | Local | 2026-05-03 |

Assessment Types

| Type | Tests | What It Measures |
| --- | --- | --- |
| Quiz (Knowledge Q&A) | 10 questions | Kubernetes security knowledge accuracy and depth |
| Manifest Generation | 3 scenarios | Ability to produce deployable, secure Kubernetes YAML |
| Cluster Creation | 1 scenario | Building a hardened Kind cluster with security controls |
| Penetration Testing | 6 scenarios | Exploiting vulnerable Kubernetes clusters via an agent framework |

Overall Rankings

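Rankings below include all four test types. Avg Rank is the mean of a model's four per-test ranks, so lower is better; models with equal results on a test share the same rank.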

| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 11th | 1st | 2nd | 1st | 3.75 |
| Claude Opus 4.7 | 6th | 1st | 3rd | 7th | 4.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 9th | 4.5 |
| Kimi K2.7 Code | 3rd | 6th | 11th | 3rd | 5.75 |
| GPT 5.5 | 1st | 1st | 6th | 21st | 7.25 |
| Claude Sonnet 4.6 | 9th | 17th | 1st | 2nd | 7.25 |
| Kimi K2.6 | 4th | 10th | 10th | 6th | 7.5 |
| GLM-5.2 | 10th | 10th | 14th | 3rd | 9.25 |
| MiniMax M3 | 17th | 7th | 11th | 5th | 10.0 |
| Claude Fable 5 | 13th | 8th | 3rd | 21st | 11.25 |
| Qwen 3.6 Plus | 14th | 10th | 9th | 12th | 11.25 |
| Qwen 3.7 Plus | 6th | 16th | 17th | 7th | 11.5 |
| DeepSeek V4 Pro | 5th | 10th | 19th | 14th | 12.0 |
| Gemini 3 Flash | 8th | 8th | 13th | 20th | 12.25 |
| Qwen3.6-35b-a3b (LOCAL) | 17th | 17th | 6th | 11th | 12.75 |
| MiniMax M2.7 | 11th | 10th | 18th | 16th | 13.75 |
| GPT 5.4 | 15th | 20th | 8th | 13th | 14.0 |
| DeepSeek V3.2 | 21st | 4th | 22nd | 15th | 15.5 |
| DeepSeek V4 Flash | 16th | 10th | 20th | 16th | 15.5 |
| Mistral Medium 3.5 | 20th | 20th | 16th | 9th | 16.25 |
| Gemma 4 31B (LOCAL) | 19th | 20th | 14th | 19th | 18.0 |
| MiniMax M2.5 | 22nd | 19th | 21st | 18th | 20.0 |
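The Avg Rank column is consistent with a plain arithmetic mean of the four per-test ranks. The sketch below reproduces that calculation for the top six rows of the table. The function names are illustrative, and the tie handling for overall positions (tied models share a position, and the next position is skipped) is an assumption rather than something the report states.

```python
from statistics import mean

# Per-test ranks copied from the rankings table: (quiz, manifest, cluster, pentest).
ranks = {
    "Claude Opus 4.6": (11, 1, 2, 1),
    "Claude Opus 4.7": (6, 1, 3, 7),
    "Claude Opus 4.8": (2, 4, 3, 9),
    "Kimi K2.7 Code": (3, 6, 11, 3),
    "GPT 5.5": (1, 1, 6, 21),
    "Claude Sonnet 4.6": (9, 17, 1, 2),
}


def average_rank(per_test_ranks):
    """Arithmetic mean of the four per-test ranks; lower is better."""
    return mean(per_test_ranks)


def overall_order(table):
    """Map each model to its overall position, sharing positions on ties.

    Assumed tie rule: models with equal average rank get the same position,
    and the following position is skipped (1, 2, 2, 4, ...).
    """
    ordered = sorted(table, key=lambda model: average_rank(table[model]))
    positions = {}
    previous_avg = None
    previous_position = 0
    for index, model in enumerate(ordered, start=1):
        avg = average_rank(table[model])
        position = previous_position if avg == previous_avg else index
        positions[model] = position
        previous_avg, previous_position = avg, position
    return positions


for model, position in overall_order(ranks).items():
    print(f"{position}. {model}: {average_rank(ranks[model])}")
```

Because every test carries equal weight, a single last-place result (such as a refused pentest) moves a model's average by several places even when its other three ranks are near the top.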

Key Findings

  1. Anthropic Holds the Top Three: Claude Opus 4.6 leads at 3.75 average rank, with Opus 4.7 (4.25) and Opus 4.8 (4.5) close behind. The three Claude Opus models hold the top three overall positions, each placing in the top four on at least two of the four assessment types. Kimi K2.7 Code enters at 4th overall (5.75 average rank), the highest-ranked non-Anthropic model, with strong quiz (3rd) and pentest (3rd) results. No other provider places more than one model in the top six. GLM-5.2 sits at 8th overall (9.25 average rank) after two re-runs: a pentest re-run with rate limit mitigation raised its score from 17/30 to 26/30, and a cluster re-run with a 900s timeout raised its score from 5/40 to 25/40. Claude Fable 5, the newest Anthropic model, ties for 10th overall (11.25 average rank), held back by complete pentest refusal.

  2. MiniMax M3 Places 9th Overall: MiniMax M3 sits in 9th place (10.0 average rank), led by a 5th-place pentest finish (25/30, 4/6 exploited) that trails only Opus 4.6, Sonnet 4.6, Kimi K2.7 Code, and GLM-5.2. M3 discovered the escalate verb exploit, previously found only by Opus 4.6 and Sonnet 4.6. Its 7th-place manifest result (23/30) is the best MiniMax manifest score to date, though quiz performance (65/100, tied 17th) remains a weakness. The MiniMax trajectory in average rank is striking: M2.5 (20.0) to M2.7 (13.75) to M3 (10.0).

  3. Claude Opus 4.8 Debuts at 3rd Overall: Opus 4.8 enters with the second-highest quiz score (82/100) and ties for 3rd on cluster creation (37/40), but content policy restrictions limited its pentest performance to 2/6 exploited (20/30, tied 9th), keeping it behind the less-restricted Opus 4.6 and 4.7.

  4. Claude Fable 5 Shows Extreme Defensive/Offensive Split: Fable 5 ties for 3rd on cluster hardening (37/40) but ties for last on pentest (0/30, complete refusal), the widest gap between cluster and pentest rank in the table. Its quiz result (70/100, 13th) includes 2 empty responses on security-attack topics. It is the first Anthropic model to refuse every pentest scenario, and it joins GPT 5.5 at the bottom of the pentest rankings with 0/30.

  5. GPT 5.5 Excels at Knowledge and Code, Blocked on Offence: GPT 5.5 achieved the highest quiz score (84/100) and tied for first on manifest generation (8.7 combined), but its content filter blocked all six penetration test attempts, resulting in a tied-last pentest finish (21st) and pulling its overall average rank to 7.25.

  6. Knowledge ≠ Execution: DeepSeek V4 Pro exemplifies this pattern most sharply, with the 5th-best quiz score but 19th place in cluster creation and 14th in pentest. V4 Flash provides further evidence: 66 on the quiz but 0/6 pentests exploited and only 12/40 on cluster creation. GPT 5.5 shows a different variant, with top quiz and manifest scores but zero pentest exploitation due to content filter restrictions rather than capability gaps. MiniMax M3 shows the reverse: weak quiz knowledge (tied 17th) but strong agent execution (5th in pentest, 7th in manifest).

  7. False Positives Remain a Testing Challenge: Gemma 4 31B (LOCAL) produced 2 false positives (hallucinated output) and suffered 2 model crashes during pentest runs. Earlier runs had already produced false positives from Gemini 3 Flash, MiniMax M2.5, and MiniMax M2.7, as well as framework detection errors: the Qwen 3.6 Plus ETCD run, for example, was classified as a false positive when it was actually a timeout after real recon. Together these results reinforce the need for verification beyond simple string matching.

  8. GLM-5.2 Re-runs Validate Infrastructure Hypothesis: GLM-5.2 (Zhipu AI) jumped from 15th to 8th overall (9.25 average rank) after re-runs that addressed infrastructure limitations. With 90-second inter-test delays to mitigate upstream API rate limiting, its pentest score improved from 17/30 (1/6 exploited, 11th) to 26/30 (4/6 exploited, tied 3rd), the largest single re-run improvement in the project. The initial results were infrastructure-limited, not capability-limited: with stable sessions, GLM-5.2 delivered 4 clean exploits, including the textbook SSH-easy path and an unauth-API-server solve with pod cleanup. With a 900s timeout, cluster creation improved from 5/40 (20th) to 25/40 (tied 14th), producing comprehensive hardening configs (audit logging, encryption at rest, API server hardening, kubelet hardening), although the session timed out before namespace policies could be applied. Both API rate limiting and an insufficient timeout can therefore severely distort results for models with upstream infrastructure constraints.

  9. Local Models Show Mixed Results: Qwen3.6-35b-a3b (15th overall, 12.75 average rank) demonstrated that a 35B-parameter local model can compete with cloud-hosted models on execution tasks, achieving tied 6th in cluster creation and 11th in pentesting. Gemma 4 31B (LOCAL), the second local model tested, placed 21st overall (18.0 average rank), ranking below the first local model on all four test types. Its cluster creation result (25/40, tied 14th) and pentest result (6/30, 19th) suggest that parameter count alone does not predict agent performance in this local-inference setting.
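Finding 7 calls for verification beyond simple string matching. One way to picture the difference is sketched below. Everything in it is hypothetical and is not taken from the Dearbhadh framework: the `Turn` record, the `"assistant"` and `"tool"` role names, and the idea that success is signalled by a marker string are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One transcript entry (hypothetical format, not the framework's own)."""
    role: str  # "assistant" for model-written text, "tool" for command output
    text: str


def naive_success(transcript, marker):
    """Simple string matching: the marker appears anywhere in the transcript."""
    return any(marker in turn.text for turn in transcript)


def verified_success(transcript, marker):
    """Stricter check: the marker must appear in tool output.

    A model that only claims success in its own prose does not pass.
    """
    return any(turn.role == "tool" and marker in turn.text for turn in transcript)


# A hallucinated run: the model asserts the marker, but no command produced it.
hallucinated = [
    Turn("assistant", "Exploit succeeded, the secret is FLAG{demo}."),
    Turn("tool", "Error from server (Forbidden): secrets is forbidden"),
]

# A genuine run: the marker shows up in command output.
genuine = [
    Turn("assistant", "Reading the mounted secret."),
    Turn("tool", "FLAG{demo}"),
]

print(naive_success(hallucinated, "FLAG{demo}"), verified_success(hallucinated, "FLAG{demo}"))
print(naive_success(genuine, "FLAG{demo}"), verified_success(genuine, "FLAG{demo}"))
```

A check like this only separates model prose from tool output. It would not by itself distinguish a timeout after real recon from a failed attempt, which is the kind of misclassification described for the Qwen 3.6 Plus ETCD run.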

See the Leaderboard for detailed rankings and the Methodology page for how each test type works.


*Original assessment: 2026-03-09; Claude Opus 4.6 added: 2026-03-25; MiniMax M2.7 added: 2026-03-28; Claude Opus 4.7 added: 2026-04-20; Qwen 3.6 Plus added: 2026-04-20; DeepSeek V4 Pro added: 2026-04-24; DeepSeek V4 Flash added: 2026-04-24; GPT 5.5 added: 2026-04-25; Kimi K2.6 added: 2026-04-26; Qwen3.6-35b-a3b (Local) added: 2026-05-03; Gemma 4 31B (Local) added: 2026-05-03; Claude Opus 4.8 added: 2026-05-31; Qwen 3.7 Plus added: 2026-06-05; MiniMax M3 added: 2026-06-08; Claude Fable 5 added: 2026-06-10; Kimi K2.7 Code added: 2026-06-16; GLM-5.2 added: 2026-06-17; Mistral Medium 3.5 added: 2026-06-18*
