Leaderboard
Cross-Test Comparison
| Model | Quiz (rank) | Manifest (rank) | Cluster (rank) | Pentest (rank) | Avg Rank |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 8th | 1st | 2nd | 1st | 3.0 |
| Claude Opus 4.7 | 5th | 1st | 3rd | 4th | 3.25 |
| Claude Opus 4.8 | 2nd | 4th | 3rd | 5th | 3.5 |
| Kimi K2.6 | 3rd | 7th | 9th | 3rd | 5.5 |
| Claude Sonnet 4.6 | 7th | 12th | 1st | 2nd | 5.5 |
| GPT 5.5 | 1st | 1st | 5th | 16th | 5.75 |
| DeepSeek V4 Pro | 4th | 7th | 13th | 9th | 8.25 |
| Qwen 3.6 Plus | 10th | 7th | 8th | 8th | 8.25 |
| Qwen3.6-35b-a3b (LOCAL) | 13th | 12th | 5th | 6th | 9.0 |
| Gemini 3 Flash | 6th | 6th | 10th | 15th | 9.25 |
| MiniMax M2.7 | 8th | 7th | 12th | 11th | 9.5 |
| GPT 5.4 | 11th | 15th | 7th | 7th | 10.0 |
| DeepSeek V4 Flash | 12th | 7th | 14th | 11th | 11.0 |
| DeepSeek V3.2 | 15th | 4th | 16th | 9th | 11.0 |
| Gemma 4 31B (LOCAL) | 14th | 15th | 11th | 14th | 13.5 |
| MiniMax M2.5 | 16th | 14th | 15th | 13th | 14.5 |
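
The Avg Rank values are consistent with an unweighted mean of the four per-test ranks. The source does not state the formula, so the sketch below treats it as an assumption; the helper name `avg_rank` is hypothetical, and only three rows are copied from the table for brevity.

```python
# Per-test ranks copied from the Cross-Test Comparison table,
# in the order (quiz, manifest, cluster, pentest).
ranks = {
    "Claude Opus 4.6": (8, 1, 2, 1),
    "Claude Opus 4.7": (5, 1, 3, 4),
    "GPT 5.5": (1, 1, 5, 16),
}


def avg_rank(per_test_ranks):
    """Unweighted mean of the per-test ranks (assumed formula, not stated in the source)."""
    return sum(per_test_ranks) / len(per_test_ranks)


for model, per_test in ranks.items():
    print(f"{model}: {avg_rank(per_test):.2f}")
```

Under this assumption a single very low placement, such as GPT 5.5 finishing 16th on the pentest, pulls the average down even when the other three ranks are near the top.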
Quiz Tests (out of 100)
| Rank | Model | Score | Average |
| --- | --- | --- | --- |
| 1 | GPT 5.5 | 84 | 8.4 |
| 2 | Claude Opus 4.8 | 82 | 8.2 |
| 3 | Kimi K2.6 | 77 | 7.7 |
| 4 | DeepSeek V4 Pro | 76 | 7.6 |
| 5 | Claude Opus 4.7 | 75 | 7.5 |
| 6 | Gemini 3 Flash | 74 | 7.4 |
| 7 | Claude Sonnet 4.6 | 73 | 7.3 |
| 8 | Claude Opus 4.6 | 71 | 7.1 |
| 8 | MiniMax M2.7 | 71 | 7.1 |
| 10 | Qwen 3.6 Plus | 68 | 6.8 |
| 11 | GPT 5.4 | 67 | 6.7 |
| 12 | DeepSeek V4 Flash | 66 | 6.6 |
| 13 | Qwen3.6-35b-a3b (LOCAL) | 65 | 6.5 |
| 14 | Gemma 4 31B (LOCAL) | 64 | 6.4 |
| 15 | DeepSeek V3.2 | 55 | 5.5 |
| 16 | MiniMax M2.5 | 51 | 5.1 |
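
The tied rows (two models at rank 8, followed by rank 10) match standard competition ranking, where equal scores share a rank and the following rank is skipped. The sketch below shows one way to reproduce that from the top ten quiz scores; the function name `competition_ranks` is hypothetical, and the ranking scheme itself is inferred from the table rather than stated in the source.

```python
def competition_ranks(scores):
    """Rank names by descending score; ties share a rank and the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the 1-based position.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Top ten quiz scores copied from the table above.
quiz_scores = {
    "GPT 5.5": 84,
    "Claude Opus 4.8": 82,
    "Kimi K2.6": 77,
    "DeepSeek V4 Pro": 76,
    "Claude Opus 4.7": 75,
    "Gemini 3 Flash": 74,
    "Claude Sonnet 4.6": 73,
    "Claude Opus 4.6": 71,
    "MiniMax M2.7": 71,
    "Qwen 3.6 Plus": 68,
}

quiz_ranks = competition_ranks(quiz_scores)
for name, rank in quiz_ranks.items():
    print(rank, name)
```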
Manifest Tests (avg /10)
| Rank | Model | Combined | Deployable | Security | Usability |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.7 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | Claude Opus 4.6 | 8.7 | 3/3 | 3.7 | 5.0 |
| 1 | GPT 5.5 | 8.7 | 3/3 | 3.7 | 5.0 |
| 4 | Claude Opus 4.8 | 8.3 | 3/3 | 3.3 | 5.0 |
| 4 | DeepSeek V3.2 | 8.3 | 3/3 | 3.3 | 5.0 |
| 6 | Gemini 3 Flash | 7.3 | 3/3 | 2.3 | 5.0 |
| 7 | DeepSeek V4 Pro | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | Qwen 3.6 Plus | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | MiniMax M2.7 | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | DeepSeek V4 Flash | 6.7 | 2/3 | 3.0 | 3.7 |
| 7 | Kimi K2.6 | 6.7 | 1/3 | 3.7 | 3.0 |
| 12 | Claude Sonnet 4.6 | 6.0 | 1/3 | 3.7 | 2.3 |
| 12 | Qwen3.6-35b-a3b (LOCAL) | 6.0 | 2/3 | 2.3 | 3.7 |
| 14 | MiniMax M2.5 | 5.7 | 2/3 | 2.3 | 3.3 |
| 15 | GPT 5.4 | 5.3 | 2/3 | 3.0 | 2.3 |
| 15 | Gemma 4 31B (LOCAL) | 5.3 | 2/3 | 1.7 | 3.7 |
Cluster Creation (out of 40)
| Rank | Model | Score | Result |
| --- | --- | --- | --- |
| 1 | Claude Sonnet 4.6 | 39 | Success: most comprehensive hardening |
| 2 | Claude Opus 4.6 | 38 | Success: broadest feature set (encryption, quotas) |
| 3 | Claude Opus 4.7 | 37 | Timeout*: most technically advanced config (K8s 1.35 AuthConfig) |
| 3 | Claude Opus 4.8 | 37 | Success: Calico CNI, encryption at rest |
| 5 | GPT 5.5 | 35 | Success: Calico CNI swap, encryption at rest |
| 5 | Qwen3.6-35b-a3b (LOCAL) | 35 | Success: strong hardening from a local 35B model |
| 7 | GPT 5.4 | 34 | Success: good hardening |
| 8 | Qwen 3.6 Plus | 32 | Success: solid configs, good recovery from Docker conflict |
| 9 | Kimi K2.6 | 31 | Timeout: comprehensive configs, 5+ creation attempts |
| 10 | Gemini 3 Flash | 27 | Success: minimal hardening beyond PSA |
| 11 | Gemma 4 31B (LOCAL) | 25 | Success: minimal hardening beyond PSA and network policies |
| 12 | MiniMax M2.7 | 20 | Timeout: comprehensive configs, cluster never initialized |
| 13 | DeepSeek V4 Pro | 14 | Incomplete: excellent configs, run terminated before cluster creation |
| 14 | DeepSeek V4 Flash | 12 | Incomplete: cluster created on 2nd attempt, no namespaces or policies applied |
| 15 | MiniMax M2.5 | 10 | Timeout: deprecated PodSecurityPolicy |
| 16 | DeepSeek V3.2 | 2 | Timeout: deprecated PodSecurityPolicy |

*Opus 4.7 timed out during verification, not during setup — all hardening controls were in place and functional.
Penetration Tests (out of 30)
| Rank | Model | Score | Exploited | Notable |
| --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | 29 | 6/6 | All legitimate paths, escalate verb exploit, best cleanup |
| 2 | Claude Sonnet 4.6 | 28 | 6/6 | WebSocket client, ETCD write injection, two-token pivot |
| 3 | Kimi K2.6 | 22 | 4/6 | 4 legitimate exploits, 1 kubeconfig shortcut, 1 failure |
| 4 | Claude Opus 4.7 | 21 | 4/6 | Excellent when not blocked; 2 content policy blocks |
| 5 | Claude Opus 4.8 | 20 | 2/6 | Content policy limited some attempts |
| 6 | Qwen3.6-35b-a3b (LOCAL) | 19 | 4/6 | 4 legitimate, 1 Docker shortcut, 1 false positive |
| 7 | GPT 5.4 | 17 | 4/6 | Automated scripts |
| 8 | Qwen 3.6 Plus | 15 | 3/6 | Efficient SSH Hard exploit (10 cmds), 2 false positives |
| 9 | DeepSeek V3.2 | 11 | 3/6 | Methodical but slow |
| 9 | DeepSeek V4 Pro | 11 | 1/6 | Strong ETCD exploit, few commands per scenario |
| 11 | MiniMax M2.7 | 9 | 2/6 | 1 Docker shortcut, 2 false positives (info leakage) |
| 11 | DeepSeek V4 Flash | 9 | 0/6 | Good recon but stops before exploitation, 2–9 commands per scenario |
| 13 | MiniMax M2.5 | 7 | 1/6 | 2 Docker shortcuts, 1 false positive |
| 14 | Gemma 4 31B (LOCAL) | 6 | 1/6 | 2 false positives (hallucinated output), 2 model crashes |
| 15 | Gemini 3 Flash | 4 | 1/6 | 1 false positive (hallucinated key) |
| 16 | GPT 5.5 | 0 | 0/6 | Content filter blocked all attempts |

*Original assessment: 2026-03-09 | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31*