Leaderboard

Cross-Test Comparison

| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 11th | 1st | 2nd | 1st | 3.75 |
| Claude Opus 4.7 | 6th | 1st | 3rd | 7th | 4.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 9th | 4.5 |
| Kimi K2.7 Code | 3rd | 6th | 11th | 3rd | 5.75 |
| GPT 5.5 | 1st | 1st | 6th | 21st | 7.25 |
| Claude Sonnet 4.6 | 9th | 17th | 1st | 2nd | 7.25 |
| Kimi K2.6 | 4th | 10th | 10th | 6th | 7.5 |
| GLM-5.2 | 10th | 10th | 14th | 3rd | 9.25 |
| MiniMax M3 | 17th | 7th | 11th | 5th | 10.0 |
| Claude Fable 5 | 13th | 8th | 3rd | 21st | 11.25 |
| Qwen 3.6 Plus | 14th | 10th | 9th | 12th | 11.25 |
| Qwen 3.7 Plus | 6th | 16th | 17th | 7th | 11.5 |
| DeepSeek V4 Pro | 5th | 10th | 19th | 14th | 12.0 |
| Gemini 3 Flash | 8th | 8th | 13th | 20th | 12.25 |
| Qwen3.6-35b-a3b (LOCAL) | 17th | 17th | 6th | 11th | 12.75 |
| MiniMax M2.7 | 11th | 10th | 18th | 16th | 13.75 |
| GPT 5.4 | 15th | 20th | 8th | 13th | 14.0 |
| DeepSeek V3.2 | 21st | 4th | 22nd | 15th | 15.5 |
| DeepSeek V4 Flash | 16th | 10th | 20th | 16th | 15.5 |
| Mistral Medium 3.5 | 20th | 20th | 16th | 9th | 16.25 |
| Gemma 4 31B (LOCAL) | 19th | 20th | 14th | 19th | 18.0 |
| MiniMax M2.5 | 22nd | 19th | 21st | 18th | 20.0 |
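
The Avg Rank column above is consistent with an unweighted mean of the four per-test ranks. The page does not state the formula, so the sketch below treats it as an assumption and checks it against three rows copied from the table. The function name `average_rank` is illustrative only.

```python
# Ranks copied from three rows of the table above:
# (quiz, manifest, cluster, pentest).
ranks = {
    "Claude Opus 4.6": (11, 1, 2, 1),
    "GPT 5.5": (1, 1, 6, 21),
    "MiniMax M2.5": (22, 19, 21, 18),
}


def average_rank(per_test_ranks):
    """Unweighted mean of the per-test ranks (assumed formula)."""
    return sum(per_test_ranks) / len(per_test_ranks)


avg = {model: average_rank(r) for model, r in ranks.items()}

# Lower is better, so sort ascending to reproduce the leaderboard order.
for model, value in sorted(avg.items(), key=lambda kv: kv[1]):
    print(f"{model}: {value}")
```

Because every test carries equal weight under this reading, a single very low placement (for example, 21st in the penetration tests) can outweigh several first-place finishes.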

Quiz Tests (out of 100)

| Rank | Model | Score | Average |
| --- | --- | --- | --- |
| 1 | GPT 5.5 | 84 | 8.4 |
| 2 | Claude Opus 4.8 | 82 | 8.2 |
| 3 | Kimi K2.7 Code | 80 | 8.0 |
| 4 | Kimi K2.6 | 77 | 7.7 |
| 5 | DeepSeek V4 Pro | 76 | 7.6 |
| 6 | Claude Opus 4.7 | 75 | 7.5 |
| 6 | Qwen 3.7 Plus | 75 | 7.5 |
| 8 | Gemini 3 Flash | 74 | 7.4 |
| 9 | Claude Sonnet 4.6 | 73 | 7.3 |
| 10 | GLM-5.2 | 72 | 7.2 |
| 11 | Claude Opus 4.6 | 71 | 7.1 |
| 11 | MiniMax M2.7 | 71 | 7.1 |
| 13 | Claude Fable 5 | 70 | 7.0 |
| 14 | Qwen 3.6 Plus | 68 | 6.8 |
| 15 | GPT 5.4 | 67 | 6.7 |
| 16 | DeepSeek V4 Flash | 66 | 6.6 |
| 17 | Qwen3.6-35b-a3b (LOCAL) | 65 | 6.5 |
| 17 | MiniMax M3 | 65 | 6.5 |
| 19 | Gemma 4 31B (LOCAL) | 63.5 | 6.35 |
| 20 | Mistral Medium 3.5 | 63 | 6.3 |
| 21 | DeepSeek V3.2 | 55 | 5.5 |
| 22 | MiniMax M2.5 | 51 | 5.1 |
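
The Rank column above handles ties the way standard competition ranking does: tied models share a rank and the following rank is skipped (6, 6, 8). The sketch below is one way to reproduce that behaviour, not the tool's actual implementation. The function name `competition_ranks` is hypothetical, and the scores are the top eight rows of the table.

```python
def competition_ranks(scores):
    """Standard competition ranking: ties share a rank, the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the 1-based position.
        rank = prev_rank if score == prev_score else position
        ranks[model] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Top eight quiz scores, copied from the table above.
quiz_scores = {
    "GPT 5.5": 84,
    "Claude Opus 4.8": 82,
    "Kimi K2.7 Code": 80,
    "Kimi K2.6": 77,
    "DeepSeek V4 Pro": 76,
    "Claude Opus 4.7": 75,
    "Qwen 3.7 Plus": 75,
    "Gemini 3 Flash": 74,
}

quiz_ranks = competition_ranks(quiz_scores)
```

The same tie rule appears to apply to the other tables on this page, for example the three-way tie for 1st in the manifest tests followed by rank 4.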

Manifest Tests (avg /10)

| Rank | Model | Combined | Deployable | Security | Usability |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.7 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | Claude Opus 4.6 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | GPT 5.5 | 8.7 | 3/3 | 3.7 | 5.0 |
| 4 | Claude Opus 4.8 | 8.3 | 3/3 | 3.3 | 5.0 |
| 4 | DeepSeek V3.2 | 8.3 | 3/3 | 3.3 | 5.0 |
| 6 | Kimi K2.7 Code | 8.0 | 3/3 | 3.0 | 5.0 |
| 7 | MiniMax M3 | 7.7 | 3/3 | 3.3 | 4.3 |
| 8 | Claude Fable 5 | 7.3 | 3/3 | 3.7 | 3.7 |
| 8 | Gemini 3 Flash | 7.3 | 3/3 | 2.3 | 5.0 |
| 10 | DeepSeek V4 Pro | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | Qwen 3.6 Plus | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | MiniMax M2.7 | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | DeepSeek V4 Flash | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | Kimi K2.6 | 6.7 | 1/3 | 3.7 | 3.0 |
| 10 | GLM-5.2 | 6.7 | 3/3 | 2.7 | 4.0 |
| 16 | Qwen 3.7 Plus | 6.3 | 1/3 | 3.0 | 3.3 |
| 17 | Claude Sonnet 4.6 | 6.0 | 1/3 | 3.7 | 2.3 |
| 17 | Qwen3.6-35b-a3b (LOCAL) | 6.0 | 2/3 | 2.3 | 3.7 |
| 19 | MiniMax M2.5 | 5.7 | 2/3 | 2.3 | 3.3 |
| 20 | GPT 5.4 | 5.3 | 2/3 | 3.0 | 2.3 |
| 20 | Gemma 4 31B (LOCAL) | 5.3 | 2/3 | 1.7 | 3.7 |
| 20 | Mistral Medium 3.5 | 5.3 | 1/3 | 3.0 | 2.3 |
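
In the rows above, the Combined score matches Security plus Usability to within rounding. This is an observation from the published numbers, not a formula the page states. The short check below assumes each value was rounded to one decimal independently, which is why it allows a tolerance of 0.15. The helper name `consistent` is illustrative.

```python
# (model, combined, security, usability), copied from the table above.
rows = [
    ("Claude Opus 4.7", 8.7, 3.7, 5.0),
    ("MiniMax M3", 7.7, 3.3, 4.3),
    ("Claude Fable 5", 7.3, 3.7, 3.7),
    ("Gemma 4 31B (LOCAL)", 5.3, 1.7, 3.7),
]


def consistent(combined, security, usability, tol=0.15):
    """True if combined equals security + usability within rounding tolerance.

    Assumption: each value is rounded to one decimal, so the sum of two
    rounded parts can differ from the rounded total by up to 0.15.
    """
    return abs(combined - (security + usability)) <= tol


all_consistent = all(consistent(c, s, u) for _, c, s, u in rows)
print(all_consistent)
```

Note that the Deployable count does not appear to feed into Combined under this reading: Kimi K2.6 (1/3 deployable) and GLM-5.2 (3/3 deployable) share the same 6.7.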

Cluster Creation (out of 40)

| Rank | Model | Score | Result |
| --- | --- | --- | --- |
| 1 | Claude Sonnet 4.6 | 39 | Success — most comprehensive hardening |
| 2 | Claude Opus 4.6 | 38 | Success — broadest feature set (encryption, quotas) |
| 3 | Claude Opus 4.7 | 37 | Timeout* — most technically advanced config (K8s 1.35 AuthConfig) |
| 3 | Claude Opus 4.8 | 37 | Success — Calico CNI, encryption at rest |
| 3 | Claude Fable 5 | 37 | Success — Calico CNI, comprehensive audit/PSS/network policies |
| 6 | GPT 5.5 | 35 | Success — Calico CNI swap, encryption at rest |
| 6 | Qwen3.6-35b-a3b (LOCAL) | 35 | Success — strong hardening from a local 35B model |
| 8 | GPT 5.4 | 34 | Success — good hardening |
| 9 | Qwen 3.6 Plus | 32 | Success — solid configs, good recovery from Docker conflict |
| 10 | Kimi K2.6 | 31 | Timeout — comprehensive configs, 5+ creation attempts |
| 11 | Kimi K2.7 Code | 29 | Timeout — good audit logging, PSS, network policies, no kubelet hardening |
| 11 | MiniMax M3 | 29 | Success |
| 13 | Gemini 3 Flash | 27 | Success — minimal hardening beyond PSA |
| 14 | Gemma 4 31B (LOCAL) | 25 | Success — minimal hardening beyond PSA and network policies |
| 14 | GLM-5.2 | 25 | Timeout — comprehensive hardening configs, cluster created but timed out applying namespace policies |
| 16 | Mistral Medium 3.5 | 22 | Timeout — 3 creation attempts, PSS 5/5, no kubelet hardening |
| 17 | Qwen 3.7 Plus | 21 | Partial results — basic PSS and network policies, no API server or kubelet hardening |
| 18 | MiniMax M2.7 | 20 | Timeout — comprehensive configs, cluster never initialized |
| 19 | DeepSeek V4 Pro | 14 | Incomplete — excellent configs, run terminated before cluster creation |
| 20 | DeepSeek V4 Flash | 12 | Incomplete — cluster created on 2nd attempt, no namespaces or policies applied |
| 21 | MiniMax M2.5 | 10 | Timeout — deprecated PodSecurityPolicy |
| 22 | DeepSeek V3.2 | 2 | Timeout — deprecated PodSecurityPolicy |

*Opus 4.7 timed out during verification, not during setup — all hardening controls were in place and functional.

Penetration Tests (out of 30)

| Rank | Model | Score | Exploited | Notable |
| --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | 29 | 6/6 | All legitimate paths, escalate verb exploit, best cleanup |
| 2 | Claude Sonnet 4.6 | 28 | 6/6 | WebSocket client, ETCD write injection, two-token pivot |
| 3 | Kimi K2.7 Code | 26 | 4/6 | 4 legit exploits, HTTP exfil technique, 2 false positive timeouts |
| 3 | GLM-5.2 | 26 | 4/6 | Tied 3rd after re-run with rate limit mitigation; 4 clean exploits |
| 5 | MiniMax M3 | 25 | 4/6 | 4 clean exploits, escalate verb escalation, HTTP exfiltration |
| 6 | Kimi K2.6 | 22 | 4/6 | 4 legitimate exploits, 1 kubeconfig shortcut, 1 failure |
| 7 | Claude Opus 4.7 | 21 | 4/6 | Excellent when not blocked; 2 content policy blocks |
| 7 | Qwen 3.7 Plus | 21 | 2/6 | Strong SSH scenarios, 2 timeouts, 1 exit error |
| 9 | Claude Opus 4.8 | 20 | 2/6 | Content policy limited some attempts |
| 9 | Mistral Medium 3.5 | 20 | 3/6 | 3 clean exploits (SSH+kubelet), creative HTTP exfil |
| 11 | Qwen3.6-35b-a3b (LOCAL) | 19 | 4/6 | 4 legitimate, 1 Docker shortcut, 1 false positive |
| 12 | Qwen 3.6 Plus | 18 | 3/6 | 3 legitimate exploits, 1 false positive, 2 timeouts |
| 13 | GPT 5.4 | 17 | 4/6 | Automated scripts |
| 14 | DeepSeek V4 Pro | 13 | 1/6 | Clean ETCD exploit (8 cmds), low persistence on other scenarios |
| 15 | DeepSeek V3.2 | 11 | 3/6 | Methodical but slow |
| 16 | MiniMax M2.7 | 9 | 2/6 | 1 Docker shortcut, 2 false positives (info leakage) |
| 16 | DeepSeek V4 Flash | 9 | 0/6 | Good recon but stops before exploitation, 2–9 commands per scenario |
| 18 | MiniMax M2.5 | 7 | 1/6 | 2 Docker shortcuts, 1 false positive |
| 19 | Gemma 4 31B (LOCAL) | 6 | 1/6 | 2 false positives (hallucinated output), 2 model crashes |
| 20 | Gemini 3 Flash | 4 | 1/6 | 1 false positive (hallucinated key) |
| 21 | Claude Fable 5 | 0 | 0/6 | Safety guardrails blocked all 6 scenarios — complete refusal |
| 21 | GPT 5.5 | 0 | 0/6 | Content filter blocked all attempts |

Original assessment: 2026-03-09
Claude Opus 4.6 added: 2026-03-25
MiniMax M2.7 added: 2026-03-28
Claude Opus 4.7 added: 2026-04-20
Qwen 3.6 Plus added: 2026-04-20
DeepSeek V4 Pro added: 2026-04-24
DeepSeek V4 Flash added: 2026-04-24
GPT 5.5 added: 2026-04-25
Kimi K2.6 added: 2026-04-26
Qwen3.6-35b-a3b (Local) added: 2026-05-03
Gemma 4 31B (Local) added: 2026-05-03
Claude Opus 4.8 added: 2026-05-31
Qwen 3.7 Plus added: 2026-06-05
MiniMax M3 added: 2026-06-08
Claude Fable 5 added: 2026-06-10
Kimi K2.7 Code added: 2026-06-16
GLM-5.2 added: 2026-06-17
Mistral Medium 3.5 added: 2026-06-18

Dearbhadh — LLM Kubernetes Security Assessment Tool
