Leaderboard

Cross-Test Comparison

| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 8th | 1st | 2nd | 1st | 3.0 |
| Claude Opus 4.7 | 5th | 1st | 3rd | 4th | 3.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 5th | 3.5 |
| Kimi K2.6 | 3rd | 7th | 9th | 3rd | 5.5 |
| Claude Sonnet 4.6 | 7th | 12th | 1st | 2nd | 5.5 |
| GPT 5.5 | 1st | 1st | 5th | 16th | 5.75 |
| DeepSeek V4 Pro | 4th | 7th | 13th | 9th | 8.25 |
| Qwen 3.6 Plus | 10th | 7th | 8th | 8th | 8.25 |
| Qwen3.6-35b-a3b (LOCAL) | 13th | 12th | 5th | 6th | 9.0 |
| Gemini 3 Flash | 6th | 6th | 10th | 15th | 9.25 |
| MiniMax M2.7 | 8th | 7th | 12th | 11th | 9.5 |
| GPT 5.4 | 11th | 15th | 7th | 7th | 10.0 |
| DeepSeek V4 Flash | 12th | 7th | 14th | 11th | 11.0 |
| DeepSeek V3.2 | 15th | 4th | 16th | 9th | 11.0 |
| Gemma 4 31B (LOCAL) | 14th | 15th | 11th | 14th | 13.5 |
| MiniMax M2.5 | 16th | 14th | 15th | 13th | 14.5 |
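
The Avg Rank column is consistent with an unweighted mean of the four per-test ranks. The page does not state the formula, so the sketch below is an assumption checked against a few rows; the names `ranks` and `average_rank` are illustrative and not part of the assessment tool.

```python
# Per-test ranks in the order (Quiz, Manifest, Cluster, Pentest),
# copied from four rows of the Cross-Test Comparison table.
ranks = {
    "Claude Opus 4.6": (8, 1, 2, 1),
    "Claude Opus 4.7": (5, 1, 3, 4),
    "GPT 5.5": (1, 1, 5, 16),
    "MiniMax M2.5": (16, 14, 15, 13),
}


def average_rank(per_test_ranks):
    """Assumed formula: unweighted mean of the per-test ranks."""
    return sum(per_test_ranks) / len(per_test_ranks)


avg = {model: average_rank(r) for model, r in ranks.items()}

# Lower is better, so sort ascending to reproduce the leaderboard order.
for model, value in sorted(avg.items(), key=lambda kv: kv[1]):
    print(f"{model}: {value}")
```

For these four rows the computed values (3.0, 3.25, 5.75, 14.5) match the table. Because every test carries equal weight, a single last-place result, such as GPT 5.5's 16th in the penetration tests, pulls the average down sharply.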

Quiz Tests (out of 100)

| Rank | Model | Score | Average |
|---|---|---|---|
| 1 | GPT 5.5 | 84 | 8.4 |
| 2 | Claude Opus 4.8 | 82 | 8.2 |
| 3 | Kimi K2.6 | 77 | 7.7 |
| 4 | DeepSeek V4 Pro | 76 | 7.6 |
| 5 | Claude Opus 4.7 | 75 | 7.5 |
| 6 | Gemini 3 Flash | 74 | 7.4 |
| 7 | Claude Sonnet 4.6 | 73 | 7.3 |
| 8 | Claude Opus 4.6 | 71 | 7.1 |
| 8 | MiniMax M2.7 | 71 | 7.1 |
| 10 | Qwen 3.6 Plus | 68 | 6.8 |
| 11 | GPT 5.4 | 67 | 6.7 |
| 12 | DeepSeek V4 Flash | 66 | 6.6 |
| 13 | Qwen3.6-35b-a3b (LOCAL) | 65 | 6.5 |
| 14 | Gemma 4 31B (LOCAL) | 64 | 6.4 |
| 15 | DeepSeek V3.2 | 55 | 5.5 |
| 16 | MiniMax M2.5 | 51 | 5.1 |
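
Tied models share a rank and the following rank is skipped (two models at 8th, then 10th), which is the pattern of standard competition ranking. The page does not name the method, so the sketch below is an inference from the table; `competition_rank` is a hypothetical helper, not a function from the assessment tool.

```python
def competition_rank(scores):
    """Rank = 1 + number of strictly higher scores, so ties share a rank
    and the rank after a tie is skipped."""
    values = list(scores.values())
    return {
        name: 1 + sum(other > score for other in values)
        for name, score in scores.items()
    }


# Quiz scores (out of 100) copied from the table above.
quiz_scores = {
    "GPT 5.5": 84,
    "Claude Opus 4.8": 82,
    "Kimi K2.6": 77,
    "DeepSeek V4 Pro": 76,
    "Claude Opus 4.7": 75,
    "Gemini 3 Flash": 74,
    "Claude Sonnet 4.6": 73,
    "Claude Opus 4.6": 71,
    "MiniMax M2.7": 71,
    "Qwen 3.6 Plus": 68,
    "GPT 5.4": 67,
    "DeepSeek V4 Flash": 66,
    "Qwen3.6-35b-a3b (LOCAL)": 65,
    "Gemma 4 31B (LOCAL)": 64,
    "DeepSeek V3.2": 55,
    "MiniMax M2.5": 51,
}

quiz_ranks = competition_rank(quiz_scores)
for name, rank in sorted(quiz_ranks.items(), key=lambda kv: kv[1]):
    print(rank, name)
```

The same tie handling appears in the other test tables, for example the three-way tie for 1st in the manifest tests followed by 4th.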

Manifest Tests (avg /10)

| Rank | Model | Combined | Deployable | Security | Usability |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | Claude Opus 4.6 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | GPT 5.5 | 8.7 | 3/3 | 3.7 | 5.0 |
| 4 | Claude Opus 4.8 | 8.3 | 3/3 | 3.3 | 5.0 |
| 4 | DeepSeek V3.2 | 8.3 | 3/3 | 3.3 | 5.0 |
| 6 | Gemini 3 Flash | 7.3 | 3/3 | 2.3 | 5.0 |
| 7 | DeepSeek V4 Pro | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | Qwen 3.6 Plus | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | MiniMax M2.7 | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | DeepSeek V4 Flash | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | Kimi K2.6 | 6.7 | 1/3 | 3.7 | 3.0 |
| 12 | Claude Sonnet 4.6 | 6.0 | 1/3 | 3.7 | 2.3 |
| 12 | Qwen3.6-35b-a3b (LOCAL) | 6.0 | 2/3 | 2.3 | 3.7 |
| 14 | MiniMax M2.5 | 5.7 | 2/3 | 2.3 | 3.3 |
| 15 | GPT 5.4 | 5.3 | 2/3 | 3.0 | 2.3 |
| 15 | Gemma 4 31B (LOCAL) | 5.3 | 2/3 | 1.7 | 3.7 |

Cluster Creation (out of 40)

| Rank | Model | Score | Result | Notes |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 39 | Success | Most comprehensive hardening |
| 2 | Claude Opus 4.6 | 38 | Success | Broadest feature set (encryption, quotas) |
| 3 | Claude Opus 4.7 | 37 | Timeout* | Most technically advanced config (K8s 1.35 AuthConfig) |
| 3 | Claude Opus 4.8 | 37 | Success | Calico CNI, encryption at rest |
| 5 | GPT 5.5 | 35 | Success | Calico CNI swap, encryption at rest |
| 5 | Qwen3.6-35b-a3b (LOCAL) | 35 | Success | Strong hardening from a local 35B model |
| 7 | GPT 5.4 | 34 | Success | Good hardening |
| 8 | Qwen 3.6 Plus | 32 | Success | Solid configs, good recovery from Docker conflict |
| 9 | Kimi K2.6 | 31 | Timeout | Comprehensive configs, 5+ creation attempts |
| 10 | Gemini 3 Flash | 27 | Success | Minimal hardening beyond PSA |
| 11 | Gemma 4 31B (LOCAL) | 25 | Success | Minimal hardening beyond PSA and network policies |
| 12 | MiniMax M2.7 | 20 | Timeout | Comprehensive configs, cluster never initialized |
| 13 | DeepSeek V4 Pro | 14 | Incomplete | Excellent configs, run terminated before cluster creation |
| 14 | DeepSeek V4 Flash | 12 | Incomplete | Cluster created on 2nd attempt, no namespaces or policies applied |
| 15 | MiniMax M2.5 | 10 | Timeout | Deprecated PodSecurityPolicy |
| 16 | DeepSeek V3.2 | 2 | Timeout | Deprecated PodSecurityPolicy |

*Opus 4.7 timed out during verification, not during setup — all hardening controls were in place and functional.

Penetration Tests (out of 30)

| Rank | Model | Score | Exploited | Notable |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 29 | 6/6 | All legitimate paths, escalate verb exploit, best cleanup |
| 2 | Claude Sonnet 4.6 | 28 | 6/6 | WebSocket client, ETCD write injection, two-token pivot |
| 3 | Kimi K2.6 | 22 | 4/6 | 4 legitimate exploits, 1 kubeconfig shortcut, 1 failure |
| 4 | Claude Opus 4.7 | 21 | 4/6 | Excellent when not blocked; 2 content policy blocks |
| 5 | Claude Opus 4.8 | 20 | 2/6 | Content policy limited some attempts |
| 6 | Qwen3.6-35b-a3b (LOCAL) | 19 | 4/6 | 4 legitimate, 1 Docker shortcut, 1 false positive |
| 7 | GPT 5.4 | 17 | 4/6 | Automated scripts |
| 8 | Qwen 3.6 Plus | 15 | 3/6 | Efficient SSH Hard exploit (10 cmds), 2 false positives |
| 9 | DeepSeek V3.2 | 11 | 3/6 | Methodical but slow |
| 9 | DeepSeek V4 Pro | 11 | 1/6 | Strong ETCD exploit, few commands per scenario |
| 11 | MiniMax M2.7 | 9 | 2/6 | 1 Docker shortcut, 2 false positives (info leakage) |
| 11 | DeepSeek V4 Flash | 9 | 0/6 | Good recon but stops before exploitation, 2–9 commands per scenario |
| 13 | MiniMax M2.5 | 7 | 1/6 | 2 Docker shortcuts, 1 false positive |
| 14 | Gemma 4 31B (LOCAL) | 6 | 1/6 | 2 false positives (hallucinated output), 2 model crashes |
| 15 | Gemini 3 Flash | 4 | 1/6 | 1 false positive (hallucinated key) |
| 16 | GPT 5.5 | 0 | 0/6 | Content filter blocked all attempts |

Original assessment: 2026-03-09
Claude Opus 4.6 added: 2026-03-25
MiniMax M2.7 added: 2026-03-28
Claude Opus 4.7 added: 2026-04-20
Qwen 3.6 Plus added: 2026-04-20
DeepSeek V4 Pro added: 2026-04-24
DeepSeek V4 Flash added: 2026-04-24
GPT 5.5 added: 2026-04-25
Kimi K2.6 added: 2026-04-26
Qwen3.6-35b-a3b (LOCAL) added: 2026-05-03
Gemma 4 31B (LOCAL) added: 2026-05-03
Claude Opus 4.8 added: 2026-05-31


Dearbhadh — LLM Kubernetes Security Assessment Tool
