Dearbhadh — LLM Kubernetes Security Assessment

Dearbhadh evaluates how well large language models handle Kubernetes security tasks. Twenty-eight models were tested across four assessment types covering knowledge, code generation, cluster hardening, and offensive security.

The Dearbhadh tool itself, including test orchestration, scoring, report generation, and this website, was built and operated using Claude Code (Anthropic’s CLI agent). Claude Code managed the test runs, produced the summary and scoring report cards, and authored this documentation site from the report output.

Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.

Models Tested

Model	Provider	Type	Tested
claude-opus-4.8	Anthropic	Cloud	2026-05-31
claude-opus-4.7	Anthropic	Cloud	2026-04-20
claude-opus-4.6	Anthropic	Cloud	2026-03-25
claude-fable-5	Anthropic	Cloud	2026-06-10
claude-sonnet-5	Anthropic	Cloud	2026-07-01
claude-sonnet-4.6	Anthropic	Cloud	2026-03-09
gpt-5.4	OpenAI	Cloud	2026-03-09
gemini-3-flash-preview	Google	Cloud	2026-03-09
qwen-3.6-plus	Qwen	Cloud	2026-04-20
minimax-m2.5	MiniMax	Cloud	2026-03-09
minimax-m2.7	MiniMax	Cloud	2026-03-28
deepseek-v3.2	DeepSeek	Cloud	2026-03-09
deepseek-v4-pro	DeepSeek	Cloud	2026-04-24
deepseek-v4-flash	DeepSeek	Cloud	2026-04-24
gpt-5.5	OpenAI	Cloud	2026-04-25
gpt-5.6-terra	OpenAI	Cloud	2026-07-10
gpt-5.6-sol	OpenAI	Cloud	2026-07-14
kimi-k2.6	Moonshot AI	Cloud	2026-04-26
kimi-k2.7-code	Moonshot AI	Cloud	2026-06-16
kimi-k3	Moonshot AI	Cloud	2026-07-16
mimo-v2.5	Xiaomi	Cloud	2026-07-21
mistral-medium-3-5	Mistral AI	Cloud	2026-06-18
glm-5.2	Zhipu AI	Cloud	2026-06-17
qwen-3.7-plus	Qwen	Cloud	2026-06-05
minimax-m3	MiniMax	Cloud	2026-06-08
qwen3.6-35b-a3b	Qwen	Local	2026-05-03
hy3	Tencent	Cloud	2026-07-10
gemma-4-31b	Google	Local	2026-05-03

Assessment Types

Type	Tests	What It Measures
Quiz (Knowledge Q&A)	10 questions	Kubernetes security knowledge accuracy and depth
Manifest Generation	3 scenarios	Ability to produce deployable, secure Kubernetes YAML
Cluster Creation	1 scenario	Building a hardened Kind cluster with security controls
Penetration Testing	6 scenarios	Exploiting vulnerable Kubernetes clusters via an agent framework

Overall Rankings

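Rankings below include all four test types. Avg Rank is the mean of a model’s four per-test ranks; lower is better, and tied models share a rank.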

Model	Quiz (rank)	Manifest (rank)	Cluster (rank)	Pentest (rank)	Avg Rank
Claude Opus 4.6	15th	1st	2nd	1st	4.75
Claude Opus 4.7	10th	1st	3rd	9th	5.75
Claude Opus 4.8	3rd	7th	3rd	11th	6.0
Kimi K3	1st	13th	10th	1st	6.25
Kimi K2.7 Code	5th	9th	15th	5th	8.5
Claude Sonnet 5	5th	1st	6th	22nd	8.5
GPT 5.5	2nd	1st	7th	26th	9.0
GPT 5.6 Terra	3rd	1st	11th	22nd	9.25
Claude Sonnet 4.6	13th	23rd	1st	3rd	10.0
Kimi K2.6	8th	16th	13th	8th	11.25
GPT 5.6 Sol	5th	1st	14th	26th	11.5
GLM-5.2	14th	16th	18th	5th	13.25
MiniMax M3	23rd	10th	15th	7th	13.75
Xiaomi MiMo v2.5	19th	13th	20th	3rd	13.75
Claude Fable 5	17th	11th	3rd	26th	14.25
Qwen 3.6 Plus	18th	16th	11th	14th	14.75
Qwen 3.7 Plus	10th	22nd	22nd	9th	15.75
DeepSeek V4 Pro	9th	16th	24th	16th	16.25
Gemini 3 Flash	12th	11th	17th	25th	16.25
Qwen3.6-35b-a3b (LOCAL)	23rd	23rd	7th	13th	16.5
GPT 5.4	19th	26th	9th	15th	17.25
MiniMax M2.7	15th	16th	23rd	19th	18.25
Tencent HY3	21st	13th	27th	16th	19.25
DeepSeek V3.2	27th	7th	28th	18th	20.0
DeepSeek V4 Flash	21st	16th	25th	19th	20.25
Mistral Medium 3.5	26th	26th	21st	11th	21.0
Gemma 4 31B (LOCAL)	25th	26th	18th	22nd	22.75
MiniMax M2.5	28th	25th	26th	21st	25.0
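
The Avg Rank column can be reproduced from the four per-test ranks. The following is a minimal sketch, not the Dearbhadh scoring code: the `RANKS` dictionary layout and both function names are assumptions made for illustration, and only a handful of rows from the table are included. Ties are resolved with competition ranking (tied models share a position and the next position is skipped), which matches how the findings below describe "tied 5th" followed by 7th.

```python
# Per-test ranks (quiz, manifest, cluster, pentest) copied from the table above.
# The dictionary layout is an assumption for this sketch.
RANKS = {
    "Claude Opus 4.6": (15, 1, 2, 1),
    "Claude Opus 4.7": (10, 1, 3, 9),
    "Claude Opus 4.8": (3, 7, 3, 11),
    "Kimi K3": (1, 13, 10, 1),
    "Kimi K2.7 Code": (5, 9, 15, 5),
    "Claude Sonnet 5": (5, 1, 6, 22),
    "GPT 5.5": (2, 1, 7, 26),
}


def average_ranks(ranks):
    """Return {model: mean of its per-test ranks}."""
    return {model: sum(per_test) / len(per_test) for model, per_test in ranks.items()}


def overall_positions(avg):
    """Competition ranking: tied models share a position, the next one is skipped."""
    ordered = sorted(avg.items(), key=lambda item: item[1])
    positions = {}
    for index, (model, value) in enumerate(ordered, start=1):
        previous = ordered[index - 2] if index > 1 else None
        if previous is not None and value == previous[1]:
            # Same average as the model before it: reuse that model's position.
            positions[model] = positions[previous[0]]
        else:
            positions[model] = index
    return positions


if __name__ == "__main__":
    avg = average_ranks(RANKS)
    for model, position in overall_positions(avg).items():
        print(f"{position:>2}  {model:<18} {avg[model]:.2f}")
```

Because every average is a multiple of 0.25, the equality check used for ties is exact in floating point; a scoring scheme with other divisors would need a tolerance or `fractions.Fraction`.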

Key Findings

Anthropic Holds the Top Three; Kimi K3 Jumps to 4th — Claude Opus 4.6 leads at 4.75 average rank, with Opus 4.7 (5.75) and Opus 4.8 (6.0) close behind, so the three Claude Opus models hold the top three overall positions across 28 models. Kimi K3 rises to 4th overall (6.25 average rank) after its kubelet_api_rights quiz re-run and etcd-noauth pentest re-run. It now leads the quiz category outright at 85/100 (the only model to solve the exec-verb WebSocket trick) and, with the etcd-noauth re-run completing the intended ETCD-write attack path, ties Claude Opus 4.6 for the best pentest score recorded (29/30, 6/6 exploited), making it the first non-Anthropic model to top the pentest leaderboard. This makes Moonshot AI the strongest non-Anthropic provider, with 3 models in the top 10 (K3 4th, K2.7 Code tied 5th, K2.6 10th). Kimi K2.7 Code and Claude Sonnet 5 are tied 5th (8.5 average rank). GPT 5.5 sits at 7th (9.0) and GPT 5.6 Terra at 8th (9.25). GPT 5.6 Sol is 11th overall (11.5 average rank), tied 1st on manifests and tied 5th on quiz but held back by content-filter-blocked pentest scenarios. GLM-5.2 sits at 12th overall (13.25 average rank) after a pentest re-run with rate-limit mitigation improved its score from 17/30 to 26/30 and a cluster re-run with a 900s timeout improved its score from 5/40 to 25/40. Claude Fable 5 lands at 15th overall (14.25 average rank), held back by complete pentest refusal. Tencent HY3 sits at 23rd overall (19.25 average rank), with its best showing on manifests (13th) but a poor cluster result (27th, 4/40, Failed).
GPT 5.6 Terra and Sol: Strong Defence, Blocked on Offence — GPT 5.6 Terra enters with strong defensive results: tied 3rd on quiz (82/100) and tied 1st on manifests (8.7). However, content filters blocked all six penetration test scenarios (6/30, 0/6 exploited), pulling the overall average rank to 9.25. GPT 5.6 Sol follows the same pattern: tied 5th on quiz and tied 1st on manifests, but content filters blocked all pentest scenarios (tied 26th), landing it at 11th overall (11.5 average rank). GPT 5.5 and Claude Sonnet 5 show the same pattern, so four of the top 11 models overall are penalized by provider-level content filters on offensive security tasks.
Claude Sonnet 5 at Tied 5th Overall — Sonnet 5 holds strong defensive results: tied 1st in manifests (8.7, matching Opus 4.7, Opus 4.6, GPT 5.5, GPT 5.6 Terra, and GPT 5.6 Sol) and 6th in cluster creation (36/40). Quiz performance is solid at 80/100 (tied 5th with K2.7 Code and GPT 5.6 Sol). However, provider-level content filters blocked all six penetration test scenarios (6/30, 0/6 exploited), pulling the overall average rank to 8.5. Unlike Fable 5’s model-level safety refusals, Sonnet 5’s blocks appear to be at the provider infrastructure level: the model attempts engagement but is blocked externally.
Claude Opus 4.8 at 3rd Overall — Opus 4.8 holds the third-highest quiz score (tied 3rd, 82/100) and ties for 3rd on cluster creation (37/40), but content policy restrictions limited its pentest performance to 2/6 exploited (20/30, tied 11th), keeping it behind the less-restricted Opus 4.6 and 4.7.
Claude Fable 5 Shows Extreme Defensive/Offensive Split — Fable 5 ties for 3rd on cluster hardening (37/40) but ties for last on pentest (0/30, complete refusal), the most extreme split in the rankings. Its quiz result (70/100, 17th) included 2 empty responses on security-attack topics. It is the first Anthropic model to completely refuse pentest scenarios, joining GPT 5.5 at the bottom of the pentest rankings with 0/30.
GPT 5.5 Excels at Knowledge and Code, Blocked on Offence — GPT 5.5 scored 84/100 on the quiz (2nd, 1 point behind Kimi K3’s 85) and tied for first on manifest generation (8.7 combined), but its content filter blocked all six penetration test attempts, resulting in a tied 26th-place pentest finish and pulling its overall average rank to 9.0.
Knowledge ≠ Execution — DeepSeek V4 Pro exemplifies this pattern most sharply: 9th on the quiz but 24th in cluster and tied 16th in pentest. V4 Flash provides further evidence: 66 on the quiz but 0/6 pentests exploited and only 12/40 on cluster creation. GPT 5.5 shows a different variant, with near-top quiz and top manifest scores but zero pentest exploitation due to content filter restrictions rather than capability gaps. In contrast, MiniMax M3 shows the reverse: weak quiz knowledge (tied 23rd) but strong agent execution (7th in pentest, 10th in manifest).
False Positives Remain a Testing Challenge — Gemma 4 31B (LOCAL) produced 2 false positives (hallucinated output) and suffered 2 model crashes during pentest runs. Combined with prior false positives from Gemini 3 Flash, MiniMax M2.5, and MiniMax M2.7, plus framework detection errors (the Qwen 3.6 Plus ETCD run was misclassified as a false positive when it was actually a timeout after real recon), this reinforces the need for verification beyond simple string matching.
GLM-5.2 Re-runs Validate Infrastructure Hypothesis — GLM-5.2 (Zhipu AI) jumped from 15th to 12th overall (13.25 average rank) after re-runs addressing infrastructure limitations. Pentest improved from 17/30 (1/6 exploited, 11th) to 26/30 (4/6 exploited, tied 5th) with 90-second inter-test delays to mitigate upstream API rate limiting, the largest single re-run improvement in the project. Cluster creation improved from 5/40 (20th) to 25/40 (tied 18th) with a 900s timeout.
Local Models Show Mixed Results — Qwen3.6-35b-a3b (16.5 average rank) demonstrated that a 35B-parameter local model can compete with cloud-hosted models on execution tasks, achieving 7th in cluster creation and 13th in pentesting. Gemma 4 31B (LOCAL), the second local model tested, placed 27th overall (22.75 average rank), scoring below the first local model on all four test types.
Xiaomi MiMo v2.5: Second Non-Anthropic Model to Top the Pentest Leaderboard — MiMo v2.5 enters tied 13th overall (13.75 average rank, level with MiniMax M3), but with one of the widest skill splits in the field. On offensive security it exploited all six pentest scenarios (28/30), tying Claude Sonnet 4.6 for the 3rd-best pentest result recorded and becoming the second non-Anthropic model (after Kimi K3) to reach the top three of the pentest leaderboard, including a full ETCD-write exploit of etcd-noauth via the intended path. Everywhere else it is middle-of-the-field: 19th on quiz (67/100, falling for all three trick questions), tied 13th on manifests (its production manifest deploys and is PSS Restricted, but its hardened manifest ships two deploy-blocking bugs), and 20th on cluster creation (timed out after thrashing through eight kind-config attempts). MiMo is markedly better at executing attacks than at reciting the underlying theory.
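
The false-positive finding above calls for verification beyond simple string matching. One possible approach, sketched here purely as an illustration and not as Dearbhadh's actual detection logic, is to plant an unguessable per-run token in the target and count a scenario as exploited only when that token appears in captured tool output, never in text the model wrote itself. Every name in this sketch (`Turn`, `new_flag`, `naive_match`, `verified_match`, the `"exploited"` marker) is hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "model" for model-written text, "tool" for captured command output
    text: str


def new_flag() -> str:
    """Generate an unguessable per-run token to plant in the target cluster."""
    return "FLAG-" + secrets.token_hex(8)


def naive_match(transcript: list[Turn], marker: str = "exploited") -> bool:
    """Plain string matching over the whole transcript; a hallucinated claim passes."""
    return any(marker in turn.text.lower() for turn in transcript)


def verified_match(transcript: list[Turn], flag: str) -> bool:
    """Count success only if the planted flag appears in tool output."""
    return any(turn.role == "tool" and flag in turn.text for turn in transcript)


if __name__ == "__main__":
    flag = new_flag()
    hallucinated = [Turn("model", "Successfully exploited etcd and read the secret.")]
    genuine = [Turn("tool", f"secret value: {flag}")]
    print(naive_match(hallucinated), verified_match(hallucinated, flag))
    print(verified_match(genuine, flag))
```

A check like this separates a model that claims success from one that actually retrieved data, but it would not by itself distinguish a timeout after real reconnaissance from a refusal; that still needs a separate outcome category.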

See the Leaderboard for detailed rankings and the Methodology page for how each test type works.

Original assessment: 2026-03-09
Claude Opus 4.6 added: 2026-03-25
MiniMax M2.7 added: 2026-03-28
Claude Opus 4.7 added: 2026-04-20
Qwen 3.6 Plus added: 2026-04-20
DeepSeek V4 Pro added: 2026-04-24
DeepSeek V4 Flash added: 2026-04-24
GPT 5.5 added: 2026-04-25
Kimi K2.6 added: 2026-04-26
Qwen3.6-35b-a3b (Local) added: 2026-05-03
Gemma 4 31B (Local) added: 2026-05-03
Claude Opus 4.8 added: 2026-05-31
Qwen 3.7 Plus added: 2026-06-05
MiniMax M3 added: 2026-06-08
Claude Fable 5 added: 2026-06-10
Kimi K2.7 Code added: 2026-06-16
GLM-5.2 added: 2026-06-17
Mistral Medium 3.5 added: 2026-06-18
Claude Sonnet 5 added: 2026-07-01
Tencent HY3 added: 2026-07-10
GPT 5.6 Terra added: 2026-07-10
GPT 5.6 Sol added: 2026-07-14
Kimi K3 added: 2026-07-16
Xiaomi MiMo v2.5 added: 2026-07-21