Leaderboard

Cross-Test Comparison

Model	Quiz (rank)	Manifest (rank)	Cluster (rank)	Pentest (rank)	Avg Rank
Claude Opus 4.6	15th	1st	2nd	1st	4.75
Claude Opus 4.7	10th	1st	3rd	9th	5.75
Claude Opus 4.8	3rd	7th	3rd	11th	6.0
Kimi K3	1st	13th	10th	1st	6.25
Kimi K2.7 Code	5th	9th	15th	5th	8.5
Claude Sonnet 5	5th	1st	6th	22nd	8.5
GPT 5.5	2nd	1st	7th	26th	9.0
GPT 5.6 Terra	3rd	1st	11th	22nd	9.25
Claude Sonnet 4.6	13th	23rd	1st	3rd	10.0
Kimi K2.6	8th	16th	13th	8th	11.25
GPT 5.6 Sol	5th	1st	14th	26th	11.5
GLM-5.2	14th	16th	18th	5th	13.25
MiniMax M3	23rd	10th	15th	7th	13.75
Xiaomi MiMo v2.5	19th	13th	20th	3rd	13.75
Claude Fable 5	17th	11th	3rd	26th	14.25
Qwen 3.6 Plus	18th	16th	11th	14th	14.75
Qwen 3.7 Plus	10th	22nd	22nd	9th	15.75
DeepSeek V4 Pro	9th	16th	24th	16th	16.25
Gemini 3 Flash	12th	11th	17th	25th	16.25
Qwen3.6-35b-a3b (LOCAL)	23rd	23rd	7th	13th	16.5
GPT 5.4	19th	26th	9th	15th	17.25
MiniMax M2.7	15th	16th	23rd	19th	18.25
Tencent HY3	21st	13th	27th	16th	19.25
DeepSeek V3.2	27th	7th	28th	18th	20.0
DeepSeek V4 Flash	21st	16th	25th	19th	20.25
Mistral Medium 3.5	26th	26th	21st	11th	21.0
Gemma 4 31B (LOCAL)	25th	26th	18th	22nd	22.75
MiniMax M2.5	28th	25th	26th	21st	25.0
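
The Avg Rank column is consistent with an unweighted mean of the four per-test ranks; for example, Claude Opus 4.6 gives (15 + 1 + 2 + 1) / 4 = 4.75. A minimal sketch of that calculation, assuming equal weighting across tests. The `avg_rank` helper and the three sample rows are illustrative and not part of the benchmark tooling:

```python
from statistics import mean

# Per-test ranks copied from the table above: (quiz, manifest, cluster, pentest).
ranks = {
    "GPT 5.5": (2, 1, 7, 26),
    "Claude Opus 4.6": (15, 1, 2, 1),
    "Kimi K3": (1, 13, 10, 1),
}

def avg_rank(per_test_ranks):
    """Unweighted mean of the per-test ranks (hypothetical helper)."""
    return mean(per_test_ranks)

# Sort ascending: a lower average rank is better.
leaderboard = sorted(ranks, key=lambda model: avg_rank(ranks[model]))
for model in leaderboard:
    print(f"{model}\t{avg_rank(ranks[model]):.2f}")
```

Because every test counts equally, a single very low placement (such as 26th in the pentest) can outweigh several first-place finishes.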

Quiz Tests (out of 100)

Rank	Model	Score	Average
1	Kimi K3	85	8.5
2	GPT 5.5	84	8.4
3	Claude Opus 4.8	82	8.2
3	GPT 5.6 Terra	82	8.2
5	Kimi K2.7 Code	80	8.0
5	Claude Sonnet 5	80	8.0
5	GPT 5.6 Sol	80	8.0
8	Kimi K2.6	77	7.7
9	DeepSeek V4 Pro	76	7.6
10	Claude Opus 4.7	75	7.5
10	Qwen 3.7 Plus	75	7.5
12	Gemini 3 Flash	74	7.4
13	Claude Sonnet 4.6	73	7.3
14	GLM-5.2	72	7.2
15	Claude Opus 4.6	71	7.1
15	MiniMax M2.7	71	7.1
17	Claude Fable 5	70	7.0
18	Qwen 3.6 Plus	68	6.8
19	GPT 5.4	67	6.7
19	Xiaomi MiMo v2.5	67	6.7
21	DeepSeek V4 Flash	66	6.6
21	Tencent HY3	66	6.6
23	Qwen3.6-35b-a3b (LOCAL)	65	6.5
23	MiniMax M3	65	6.5
25	Gemma 4 31B (LOCAL)	63.5	6.35
26	Mistral Medium 3.5	63	6.3
27	DeepSeek V3.2	55	5.5
28	MiniMax M2.5	51	5.1
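
Tied models share a rank and the next rank is skipped (3, 3, 5, ...), which is standard competition ranking, and every Average value in the table equals Score / 10. A short sketch of that scheme, assuming ties are decided on the raw score alone; `competition_ranks` is a hypothetical helper:

```python
def competition_ranks(scores):
    """Map each model to its rank; ties share a rank and the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        # A model tied with the previous one inherits its rank.
        rank = prev_rank if score == prev_score else position
        ranks[model] = rank
        prev_score, prev_rank = score, rank
    return ranks

# Top five quiz scores copied from the table above.
quiz = {
    "Kimi K3": 85,
    "GPT 5.5": 84,
    "Claude Opus 4.8": 82,
    "GPT 5.6 Terra": 82,
    "Kimi K2.7 Code": 80,
}
for model, rank in competition_ranks(quiz).items():
    print(rank, model, quiz[model], quiz[model] / 10)
```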

Manifest Tests (average, out of 10)

Rank	Model	Combined	Deployable	Security	Usability
1	Claude Opus 4.7	8.7	3/3	3.7	5.0
1	Claude Opus 4.6	8.7	3/3	3.7	5.0
1	GPT 5.5	8.7	3/3	3.7	5.0
1	Claude Sonnet 5	8.7	3/3	3.7	5.0
1	GPT 5.6 Terra	8.7	3/3	3.7	5.0
1	GPT 5.6 Sol	8.7	3/3	3.7	5.0
7	Claude Opus 4.8	8.3	3/3	3.3	5.0
7	DeepSeek V3.2	8.3	3/3	3.3	5.0
9	Kimi K2.7 Code	8.0	3/3	3.0	5.0
10	MiniMax M3	7.7	3/3	3.3	4.3
11	Claude Fable 5	7.3	3/3	3.7	3.7
11	Gemini 3 Flash	7.3	3/3	2.3	5.0
13	Kimi K3	7.0	2/3	3.3	3.7
13	Tencent HY3	7.0	2/3	3.0	4.0
13	Xiaomi MiMo v2.5	7.0	2/3	3.7	3.3
16	DeepSeek V4 Pro	6.7	2/3	3.0	3.7
16	Qwen 3.6 Plus	6.7	2/3	3.0	3.7
16	MiniMax M2.7	6.7	2/3	3.0	3.7
16	DeepSeek V4 Flash	6.7	2/3	3.0	3.7
16	Kimi K2.6	6.7	1/3	3.7	3.0
16	GLM-5.2	6.7	3/3	2.7	4.0
22	Qwen 3.7 Plus	6.3	1/3	3.0	3.3
23	Claude Sonnet 4.6	6.0	1/3	3.7	2.3
23	Qwen3.6-35b-a3b (LOCAL)	6.0	2/3	2.3	3.7
25	MiniMax M2.5	5.7	2/3	2.3	3.3
26	GPT 5.4	5.3	2/3	3.0	2.3
26	Gemma 4 31B (LOCAL)	5.3	2/3	1.7	3.7
26	Mistral Medium 3.5	5.3	1/3	3.0	2.3
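
In every row, Combined matches Security + Usability to within 0.1, which is what one would expect if each column is a per-run average rounded independently to one decimal place. A quick consistency check under that assumption; the source does not state the sub-score scales, and the 0.1 tolerance and the `is_consistent` helper are illustrative:

```python
# (combined, security, usability) copied from a few rows of the table above.
rows = {
    "Claude Opus 4.7": (8.7, 3.7, 5.0),
    "MiniMax M3": (7.7, 3.3, 4.3),
    "Gemma 4 31B (LOCAL)": (5.3, 1.7, 3.7),
}

def is_consistent(combined, security, usability, tol=0.1):
    """True when combined equals security + usability within a rounding tolerance."""
    # The small epsilon absorbs binary floating-point error in the subtraction.
    return abs(combined - (security + usability)) <= tol + 1e-9

for model, (combined, security, usability) in rows.items():
    print(model, is_consistent(combined, security, usability))
```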

Cluster Creation (out of 40)

Rank	Model	Score	Result
1	Claude Sonnet 4.6	39	Success — most comprehensive hardening
2	Claude Opus 4.6	38	Success — broadest feature set (encryption, quotas)
3	Claude Opus 4.7	37	Timeout* — most technically advanced config (K8s 1.35 AuthConfig)
3	Claude Opus 4.8	37	Success — Calico CNI, encryption at rest
3	Claude Fable 5	37	Success — Calico CNI, comprehensive audit/PSS/network policies
6	Claude Sonnet 5	36	Timeout* — Calico CNI, comprehensive hardening, timed out during verification
7	GPT 5.5	35	Success — Calico CNI swap, encryption at rest
7	Qwen3.6-35b-a3b (LOCAL)	35	Success — strong hardening from a local 35B model
9	GPT 5.4	34	Success — good hardening
10	Kimi K3	33	Success — first-attempt creation, encryption at rest, no kubelet hardening
11	Qwen 3.6 Plus	32	Success — solid configs, good recovery from Docker conflict
11	GPT 5.6 Terra	32	Timeout — comprehensive configs, anonymous-auth=false health probe failure
13	Kimi K2.6	31	Timeout — comprehensive configs, 5+ creation attempts
14	GPT 5.6 Sol	30	Timeout — comprehensive configs, aesgcm encryption, timed out before Calico install
15	Kimi K2.7 Code	29	Timeout — good audit logging, PSS, network policies, no kubelet hardening
15	MiniMax M3	29	Success
17	Gemini 3 Flash	27	Success — minimal hardening beyond PSA
18	Gemma 4 31B (LOCAL)	25	Success — minimal hardening beyond PSA and network policies
18	GLM-5.2	25	Timeout — comprehensive hardening configs, cluster created but timed out applying namespace policies
20	Xiaomi MiMo v2.5	23	Timeout — created after 8 kind-config attempts, strong audit/PSS/network configs, timed out mid-verification, no kubelet hardening
21	Mistral Medium 3.5	22	Timeout — 3 creation attempts, PSS 5/5, no kubelet hardening
22	Qwen 3.7 Plus	21	Partial results — basic PSS and network policies, no API server or kubelet hardening
23	MiniMax M2.7	20	Timeout — comprehensive configs, cluster never initialized
24	DeepSeek V4 Pro	14	Incomplete — excellent configs, run terminated before cluster creation
25	DeepSeek V4 Flash	12	Incomplete — cluster created on 2nd attempt, no namespaces or policies applied
26	MiniMax M2.5	10	Timeout — deprecated PodSecurityPolicy
27	Tencent HY3	4	Failed
28	DeepSeek V3.2	2	Timeout — deprecated PodSecurityPolicy

*Claude Opus 4.7 and Claude Sonnet 5 timed out during verification, not during setup; all hardening controls were in place and functional.

Penetration Tests (out of 30)

Rank	Model	Score	Exploited	Notable
1	Claude Opus 4.6	29	6/6	All legitimate paths, `escalate` verb exploit, best cleanup
1	Kimi K3	29	6/6	Tied best ever; ETCD protobuf injection (re-run 2026-07-19), HTTP exfil, RBAC escalation chain
3	Claude Sonnet 4.6	28	6/6	WebSocket client, ETCD write injection, two-token pivot
3	Xiaomi MiMo v2.5	28	6/6	Tied 3rd best ever; ETCD-write CRB injection, HTTP exfil, verified keys, no false positives
5	Kimi K2.7 Code	26	4/6	4 legit exploits, HTTP exfil technique, 2 false positive timeouts
5	GLM-5.2	26	4/6	Tied 5th after re-run with rate-limit mitigation; 4 clean exploits
7	MiniMax M3	25	4/6	4 clean exploits, escalate verb escalation, HTTP exfiltration
8	Kimi K2.6	22	4/6	4 legitimate exploits, 1 kubeconfig shortcut, 1 failure
9	Claude Opus 4.7	21	4/6	Excellent when not blocked; 2 content policy blocks
9	Qwen 3.7 Plus	21	2/6	Strong SSH scenarios, 2 timeouts, 1 exit error
11	Claude Opus 4.8	20	2/6	Content policy limited some attempts
11	Mistral Medium 3.5	20	3/6	3 clean exploits (SSH+kubelet), creative HTTP exfil
13	Qwen3.6-35b-a3b (LOCAL)	19	4/6	4 legitimate, 1 Docker shortcut, 1 false positive
14	Qwen 3.6 Plus	18	3/6	3 legitimate exploits, 1 false positive, 2 timeouts
15	GPT 5.4	17	4/6	Automated scripts
16	DeepSeek V4 Pro	13	1/6	Clean ETCD exploit (8 cmds), low persistence on other scenarios
16	Tencent HY3	13	0/6
18	DeepSeek V3.2	11	3/6	Methodical but slow
19	MiniMax M2.7	9	2/6	1 Docker shortcut, 2 false positives (info leakage)
19	DeepSeek V4 Flash	9	0/6	Good recon but stops before exploitation, 2–9 commands per scenario
21	MiniMax M2.5	7	1/6	3 Docker shortcuts, 1 false positive
22	Claude Sonnet 5	6	0/6	Content filter blocked 4 scenarios; partial progress on 2
22	GPT 5.6 Terra	6	0/6	Content filter blocked all 6 scenarios; 3 framework false positives
22	Gemma 4 31B (LOCAL)	6	1/6	2 false positives (hallucinated output), 2 model crashes
25	Gemini 3 Flash	4	1/6	1 false positive (hallucinated key)
26	Claude Fable 5	0	0/6	Safety guardrails blocked all 6 scenarios — complete refusal
26	GPT 5.5	0	0/6	Content filter blocked all attempts
26	GPT 5.6 Sol	0	0/6	Content filter blocked all 6 scenarios immediately — 0 tool calls

*Original assessment: 2026-03-09

Claude Opus 4.6 added: 2026-03-25

MiniMax M2.7 added: 2026-03-28

Claude Opus 4.7 added: 2026-04-20

Qwen 3.6 Plus added: 2026-04-20

DeepSeek V4 Pro added: 2026-04-24

DeepSeek V4 Flash added: 2026-04-24

GPT 5.5 added: 2026-04-25

Kimi K2.6 added: 2026-04-26

Qwen3.6-35b-a3b (LOCAL) added: 2026-05-03

Gemma 4 31B (LOCAL) added: 2026-05-03

Claude Opus 4.8 added: 2026-05-31

Qwen 3.7 Plus added: 2026-06-05

MiniMax M3 added: 2026-06-08

Claude Fable 5 added: 2026-06-10

Kimi K2.7 Code added: 2026-06-16

GLM-5.2 added: 2026-06-17

Mistral Medium 3.5 added: 2026-06-18

Claude Sonnet 5 added: 2026-07-01

Tencent HY3 added: 2026-07-10

GPT 5.6 Terra added: 2026-07-10

GPT 5.6 Sol added: 2026-07-14

Kimi K3 added: 2026-07-16

Xiaomi MiMo v2.5 added: 2026-07-21*