Leaderboard
Cross-Test Comparison

| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 11th | 1st | 2nd | 1st | 3.75 |
| Claude Opus 4.7 | 6th | 1st | 3rd | 7th | 4.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 9th | 4.5 |
| Kimi K2.7 Code | 3rd | 6th | 11th | 3rd | 5.75 |
| GPT 5.5 | 1st | 1st | 6th | 21st | 7.25 |
| Claude Sonnet 4.6 | 9th | 17th | 1st | 2nd | 7.25 |
| Kimi K2.6 | 4th | 10th | 10th | 6th | 7.5 |
| GLM-5.2 | 10th | 10th | 14th | 3rd | 9.25 |
| MiniMax M3 | 17th | 7th | 11th | 5th | 10.0 |
| Claude Fable 5 | 13th | 8th | 3rd | 21st | 11.25 |
| Qwen 3.6 Plus | 14th | 10th | 9th | 12th | 11.25 |
| Qwen 3.7 Plus | 6th | 16th | 17th | 7th | 11.5 |
| DeepSeek V4 Pro | 5th | 10th | 19th | 14th | 12.0 |
| Gemini 3 Flash | 8th | 8th | 13th | 20th | 12.25 |
| Qwen3.6-35b-a3b (LOCAL) | 17th | 17th | 6th | 11th | 12.75 |
| MiniMax M2.7 | 11th | 10th | 18th | 16th | 13.75 |
| GPT 5.4 | 15th | 20th | 8th | 13th | 14.0 |
| DeepSeek V3.2 | 21st | 4th | 22nd | 15th | 15.5 |
| DeepSeek V4 Flash | 16th | 10th | 20th | 16th | 15.5 |
| Mistral Medium 3.5 | 20th | 20th | 16th | 9th | 16.25 |
| Gemma 4 31B (LOCAL) | 19th | 20th | 14th | 19th | 18.0 |
| MiniMax M2.5 | 22nd | 19th | 21st | 18th | 20.0 |

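The Avg Rank values are consistent with an unweighted mean of the four per-test ranks. The sketch below reproduces that arithmetic for a few rows copied from the comparison table. The function name `average_rank` and the dictionary layout are illustrative assumptions, not part of the original tooling.

```python
from statistics import mean

# Per-test ranks in the order (Quiz, Manifest, Cluster, Pentest),
# copied from the cross-test comparison table.
ranks = {
    "Claude Opus 4.6": (11, 1, 2, 1),
    "GPT 5.5": (1, 1, 6, 21),
    "Claude Sonnet 4.6": (9, 17, 1, 2),
    "MiniMax M2.5": (22, 19, 21, 18),
}


def average_rank(per_test_ranks):
    """Unweighted mean of the per-test ranks (assumed formula)."""
    return mean(per_test_ranks)


avg = {model: average_rank(r) for model, r in ranks.items()}

# Lower average rank is better, so sort ascending.
for model, value in sorted(avg.items(), key=lambda kv: kv[1]):
    print(f"{model}: {value:.2f}")
```

Because every test carries the same weight, a single weak result moves a model a long way: GPT 5.5 ranks 1st on two tests, yet its 21st place on the pentest leaves it tied with Claude Sonnet 4.6 at 7.25.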
Quiz Tests (out of 100)

| Rank | Model | Score | Average |
|---|---|---|---|
| 1 | GPT 5.5 | 84 | 8.4 |
| 2 | Claude Opus 4.8 | 82 | 8.2 |
| 3 | Kimi K2.7 Code | 80 | 8.0 |
| 4 | Kimi K2.6 | 77 | 7.7 |
| 5 | DeepSeek V4 Pro | 76 | 7.6 |
| 6 | Claude Opus 4.7 | 75 | 7.5 |
| 6 | Qwen 3.7 Plus | 75 | 7.5 |
| 8 | Gemini 3 Flash | 74 | 7.4 |
| 9 | Claude Sonnet 4.6 | 73 | 7.3 |
| 10 | GLM-5.2 | 72 | 7.2 |
| 11 | Claude Opus 4.6 | 71 | 7.1 |
| 11 | MiniMax M2.7 | 71 | 7.1 |
| 13 | Claude Fable 5 | 70 | 7.0 |
| 14 | Qwen 3.6 Plus | 68 | 6.8 |
| 15 | GPT 5.4 | 67 | 6.7 |
| 16 | DeepSeek V4 Flash | 66 | 6.6 |
| 17 | Qwen3.6-35b-a3b (LOCAL) | 65 | 6.5 |
| 17 | MiniMax M3 | 65 | 6.5 |
| 19 | Gemma 4 31B (LOCAL) | 63.5 | 6.35 |
| 20 | Mistral Medium 3.5 | 63 | 6.3 |
| 21 | DeepSeek V3.2 | 55 | 5.5 |
| 22 | MiniMax M2.5 | 51 | 5.1 |

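The rankings in these tables follow the standard competition convention: tied scores share the best available rank and the next rank is skipped (6, 6, 8). In every quiz row the Average column also equals Score divided by 10. The sketch below reproduces both from the top eight quiz scores. `competition_rank` is a hypothetical helper written for this illustration, not the script used to build the leaderboard.

```python
def competition_rank(scores):
    """Return {name: rank}; ties share a rank and the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A new score value takes its 1-based position as its rank;
        # a repeated score reuses the rank of the previous entry.
        if score != prev_score:
            prev_rank = position
            prev_score = score
        ranks[name] = prev_rank
    return ranks


# Top eight quiz scores (out of 100), copied from the quiz table.
quiz_scores = {
    "GPT 5.5": 84,
    "Claude Opus 4.8": 82,
    "Kimi K2.7 Code": 80,
    "Kimi K2.6": 77,
    "DeepSeek V4 Pro": 76,
    "Claude Opus 4.7": 75,
    "Qwen 3.7 Plus": 75,
    "Gemini 3 Flash": 74,
}

ranks = competition_rank(quiz_scores)
for name, score in quiz_scores.items():
    # Average column: score divided by 10, as observed in every table row.
    print(f"{ranks[name]:>2}  {name:<18} {score:>5}  {score / 10:.1f}")
```

The same convention explains the gaps in the other tables, for example three models at rank 1 followed by rank 4 in the manifest results.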
Manifest Tests (avg /10)

| Rank | Model | Combined | Deployable | Security | Usability |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | Claude Opus 4.6 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | GPT 5.5 | 8.7 | 3/3 | 3.7 | 5.0 |
| 4 | Claude Opus 4.8 | 8.3 | 3/3 | 3.3 | 5.0 |
| 4 | DeepSeek V3.2 | 8.3 | 3/3 | 3.3 | 5.0 |
| 6 | Kimi K2.7 Code | 8.0 | 3/3 | 3.0 | 5.0 |
| 7 | MiniMax M3 | 7.7 | 3/3 | 3.3 | 4.3 |
| 8 | Claude Fable 5 | 7.3 | 3/3 | 3.7 | 3.7 |
| 8 | Gemini 3 Flash | 7.3 | 3/3 | 2.3 | 5.0 |
| 10 | DeepSeek V4 Pro | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | Qwen 3.6 Plus | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | MiniMax M2.7 | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | DeepSeek V4 Flash | 6.7 | 2/3 | 3.0 | 3.7 |
| 10 | Kimi K2.6 | 6.7 | 1/3 | 3.7 | 3.0 |
| 10 | GLM-5.2 | 6.7 | 3/3 | 2.7 | 4.0 |
| 16 | Qwen 3.7 Plus | 6.3 | 1/3 | 3.0 | 3.3 |
| 17 | Claude Sonnet 4.6 | 6.0 | 1/3 | 3.7 | 2.3 |
| 17 | Qwen3.6-35b-a3b (LOCAL) | 6.0 | 2/3 | 2.3 | 3.7 |
| 19 | MiniMax M2.5 | 5.7 | 2/3 | 2.3 | 3.3 |
| 20 | GPT 5.4 | 5.3 | 2/3 | 3.0 | 2.3 |
| 20 | Gemma 4 31B (LOCAL) | 5.3 | 2/3 | 1.7 | 3.7 |
| 20 | Mistral Medium 3.5 | 5.3 | 1/3 | 3.0 | 2.3 |

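In every row, Combined matches Security plus Usability, but adding the displayed one-decimal values is sometimes off by 0.1 (MiniMax M3: 3.3 + 4.3 = 7.6, listed as 7.7). One plausible explanation, stated here as an assumption, is that each sub-score is a mean over three manifest tests (the Deployable column is out of 3) and that rounding happens only after summing. The sketch below checks that reading with exact fractions. `nearest_third` and `combined` are illustrative names.

```python
from fractions import Fraction


def nearest_third(displayed):
    """Recover the exact value, assuming it is a multiple of 1/3."""
    return Fraction(round(displayed * 3), 3)


def combined(security, usability):
    """Sum the exact sub-scores, then round once to one decimal."""
    total = nearest_third(security) + nearest_third(usability)
    return round(float(total), 1)


# (Security, Usability, Combined) triples copied from the manifest table.
rows = {
    "Claude Opus 4.7": (3.7, 5.0, 8.7),
    "MiniMax M3": (3.3, 4.3, 7.7),
    "Claude Fable 5": (3.7, 3.7, 7.3),
    "MiniMax M2.5": (2.3, 3.3, 5.7),
    "Gemma 4 31B (LOCAL)": (1.7, 3.7, 5.3),
}

for model, (security, usability, listed) in rows.items():
    print(f"{model}: computed {combined(security, usability)}, listed {listed}")
```

Under this assumption the computed value agrees with the listed Combined score for each of the sampled rows, including those where the naive sum of the rounded values does not.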
Cluster Creation (out of 40)

| Rank | Model | Score | Result |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 39 | Success: most comprehensive hardening |
| 2 | Claude Opus 4.6 | 38 | Success: broadest feature set (encryption, quotas) |
| 3 | Claude Opus 4.7 | 37 | Timeout*: most technically advanced config (K8s 1.35 AuthConfig) |
| 3 | Claude Opus 4.8 | 37 | Success: Calico CNI, encryption at rest |
| 3 | Claude Fable 5 | 37 | Success: Calico CNI, comprehensive audit/PSS/network policies |
| 6 | GPT 5.5 | 35 | Success: Calico CNI swap, encryption at rest |
| 6 | Qwen3.6-35b-a3b (LOCAL) | 35 | Success: strong hardening from a local 35B model |
| 8 | GPT 5.4 | 34 | Success: good hardening |
| 9 | Qwen 3.6 Plus | 32 | Success: solid configs, good recovery from Docker conflict |
| 10 | Kimi K2.6 | 31 | Timeout: comprehensive configs, 5+ creation attempts |
| 11 | Kimi K2.7 Code | 29 | Timeout: good audit logging, PSS, network policies, no kubelet hardening |
| 11 | MiniMax M3 | 29 | Success |
| 13 | Gemini 3 Flash | 27 | Success: minimal hardening beyond PSA |
| 14 | Gemma 4 31B (LOCAL) | 25 | Success: minimal hardening beyond PSA and network policies |
| 14 | GLM-5.2 | 25 | Timeout: comprehensive hardening configs, cluster created but timed out applying namespace policies |
| 16 | Mistral Medium 3.5 | 22 | Timeout: 3 creation attempts, PSS 5/5, no kubelet hardening |
| 17 | Qwen 3.7 Plus | 21 | Partial results: basic PSS and network policies, no API server or kubelet hardening |
| 18 | MiniMax M2.7 | 20 | Timeout: comprehensive configs, cluster never initialized |
| 19 | DeepSeek V4 Pro | 14 | Incomplete: excellent configs, run terminated before cluster creation |
| 20 | DeepSeek V4 Flash | 12 | Incomplete: cluster created on 2nd attempt, no namespaces or policies applied |
| 21 | MiniMax M2.5 | 10 | Timeout: deprecated PodSecurityPolicy |
| 22 | DeepSeek V3.2 | 2 | Timeout: deprecated PodSecurityPolicy |

*Claude Opus 4.7 timed out during verification, not during setup; all hardening controls were in place and functional.
Penetration Tests (out of 30)

| Rank | Model | Score | Exploited | Notable |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 29 | 6/6 | All legitimate paths, escalate verb exploit, best cleanup |
| 2 | Claude Sonnet 4.6 | 28 | 6/6 | WebSocket client, ETCD write injection, two-token pivot |
| 3 | Kimi K2.7 Code | 26 | 4/6 | 4 legit exploits, HTTP exfil technique, 2 false positive timeouts |
| 3 | GLM-5.2 | 26 | 4/6 | Tied 3rd after re-run with rate limit mitigation; 4 clean exploits |
| 5 | MiniMax M3 | 25 | 4/6 | 4 clean exploits, escalate verb escalation, HTTP exfiltration |
| 6 | Kimi K2.6 | 22 | 4/6 | 4 legitimate exploits, 1 kubeconfig shortcut, 1 failure |
| 7 | Claude Opus 4.7 | 21 | 4/6 | Excellent when not blocked; 2 content policy blocks |
| 7 | Qwen 3.7 Plus | 21 | 2/6 | Strong SSH scenarios, 2 timeouts, 1 exit error |
| 9 | Claude Opus 4.8 | 20 | 2/6 | Content policy limited some attempts |
| 9 | Mistral Medium 3.5 | 20 | 3/6 | 3 clean exploits (SSH+kubelet), creative HTTP exfil |
| 11 | Qwen3.6-35b-a3b (LOCAL) | 19 | 4/6 | 4 legitimate, 1 Docker shortcut, 1 false positive |
| 12 | Qwen 3.6 Plus | 18 | 3/6 | 3 legitimate exploits, 1 false positive, 2 timeouts |
| 13 | GPT 5.4 | 17 | 4/6 | Automated scripts |
| 14 | DeepSeek V4 Pro | 13 | 1/6 | Clean ETCD exploit (8 cmds), low persistence on other scenarios |
| 15 | DeepSeek V3.2 | 11 | 3/6 | Methodical but slow |
| 16 | MiniMax M2.7 | 9 | 2/6 | 1 Docker shortcut, 2 false positives (info leakage) |
| 16 | DeepSeek V4 Flash | 9 | 0/6 | Good recon but stops before exploitation, 2–9 commands per scenario |
| 18 | MiniMax M2.5 | 7 | 1/6 | 2 Docker shortcuts, 1 false positive |
| 19 | Gemma 4 31B (LOCAL) | 6 | 1/6 | 2 false positives (hallucinated output), 2 model crashes |
| 20 | Gemini 3 Flash | 4 | 1/6 | 1 false positive (hallucinated key) |
| 21 | Claude Fable 5 | 0 | 0/6 | Safety guardrails blocked all 6 scenarios: complete refusal |
| 21 | GPT 5.5 | 0 | 0/6 | Content filter blocked all attempts |


*Original assessment: 2026-03-09 | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31 | Qwen 3.7 Plus added: 2026-06-05 | MiniMax M3 added: 2026-06-08 | Claude Fable 5 added: 2026-06-10 | Kimi K2.7 Code added: 2026-06-16 | GLM-5.2 added: 2026-06-17 | Mistral Medium 3.5 added: 2026-06-18*