Quiz Test Summary — Twenty-Eight-Model Comparison

Models Tested

Model	Provider	Short Name	Tested
anthropic/claude-fable-5	Anthropic	Fable 5	2026-06-10
anthropic/claude-sonnet-5	Anthropic	Sonnet 5	2026-07-01
anthropic/claude-opus-4.8	Anthropic	Opus 4.8	2026-05-31
anthropic/claude-opus-4.7	Anthropic	Opus 4.7	2026-04-20
anthropic/claude-opus-4.6	Anthropic	Opus 4.6	2026-03-25
anthropic/claude-sonnet-4.6	Anthropic	Sonnet 4.6	2026-03-09
openai/gpt-5.5	OpenAI	GPT 5.5	2026-04-25
openai/gpt-5.6-terra	OpenAI	GPT 5.6 Terra	2026-07-10
openai/gpt-5.6-sol	OpenAI	GPT 5.6 Sol	2026-07-14
openai/gpt-5.4	OpenAI	GPT 5.4	2026-03-09
google/gemini-3-flash-preview	Google	Gemini 3 Flash	2026-03-09
qwen/qwen3.6-plus	Qwen	Qwen 3.6 Plus	2026-04-20
minimax/minimax-m2.5	MiniMax	MiniMax M2.5	2026-03-09
minimax/minimax-m2.7	MiniMax	MiniMax M2.7	2026-03-28
minimax/minimax-m3	MiniMax	MiniMax M3	2026-06-08
deepseek/deepseek-v3.2	DeepSeek	DeepSeek V3.2	2026-03-09
deepseek/deepseek-v4-pro	DeepSeek	DeepSeek V4 Pro	2026-04-24
deepseek/deepseek-v4-flash	DeepSeek	DeepSeek V4 Flash	2026-04-24
moonshotai/kimi-k2.6	Moonshot AI	Kimi K2.6	2026-04-26
moonshotai/kimi-k2.7-code	Moonshot AI	Kimi K2.7 Code	2026-06-16
moonshotai/kimi-k3	Moonshot AI	Kimi K3	2026-07-16
z-ai/glm-5.2	Z-AI	GLM-5.2	2026-06-17
mistralai/mistral-medium-3-5	Mistral AI	Mistral M3.5	2026-06-18
qwen/qwen3.7-plus	Qwen	Qwen 3.7 Plus	2026-06-05
tencent/hy3	Tencent	HY3	2026-07-10
qwen/qwen3.6-35b-a3b	Qwen (Local)	Qwen-35b (LOCAL)	2026-05-03
google/gemma-4-31b	Google (Local)	Gemma 4 31B (LOCAL)	2026-05-03
xiaomi/mimo-v2.5	Xiaomi	MiMo v2.5	2026-07-21

Overall Rankings

Rank	Model	Total (out of 100)	Average	Wins	Last Place
1	moonshotai/kimi-k3	85	8.5	1	0
2	openai/gpt-5.5	84	8.4	7	0
3	openai/gpt-5.6-terra	82	8.2	4	0
3	anthropic/claude-opus-4.8	82	8.2	2	0
5	openai/gpt-5.6-sol	80	8.0	5	0
5	moonshotai/kimi-k2.7-code	80	8.0	3	0
5	anthropic/claude-sonnet-5	80	8.0	2	0
8	moonshotai/kimi-k2.6	77	7.7	2	0
9	deepseek/deepseek-v4-pro	76	7.6	3	0
10	anthropic/claude-opus-4.7	75	7.5	2	2
10	qwen/qwen3.7-plus	75	7.5	1	0
12	google/gemini-3-flash-preview	74	7.4	1	1
13	anthropic/claude-sonnet-4.6	73	7.3	2	0
14	z-ai/glm-5.2	72	7.2	2	0
15	anthropic/claude-opus-4.6	71	7.1	1	2
15	minimax/minimax-m2.7	71	7.1	0	1
17	anthropic/claude-fable-5	70	7.0	5	2
18	qwen/qwen3.6-plus	68	6.8	2	2
19	openai/gpt-5.4	67	6.7	1	2
19	xiaomi/mimo-v2.5	67	6.7	0	0
21	tencent/hy3	66	6.6	0	0
21	deepseek/deepseek-v4-flash	66	6.6	1	1
23	qwen/qwen3.6-35b-a3b (LOCAL)	65	6.5	0	0
23	minimax/minimax-m3	65	6.5	1	0
25	google/gemma-4-31b (LOCAL)	63.5	6.35	0	0
26	mistralai/mistral-medium-3-5	63	6.3	0	0
27	deepseek/deepseek-v3.2	55	5.5	0	4
28	minimax/minimax-m2.5	51	5.1	0	5
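
The tied ranks in the table above (3, 3, 5, 5, 5, 8, ...) are consistent with standard competition ranking, where tied models share a rank and the next rank skips ahead. A minimal Python sketch of that scheme follows, assuming ranking is by total score only. The function name competition_rank is illustrative and not taken from the original test harness; the totals are the top five rows of the table.

```python
def competition_rank(totals):
    """Map each model to its rank; tied totals share a rank (1, 2, 3, 3, 5, ...)."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        if score != prev_score:
            # A new (lower) score takes its 1-based position as its rank.
            prev_rank, prev_score = position, score
        ranks[model] = prev_rank
    return ranks


# Top five totals copied from the rankings table.
totals = {
    "moonshotai/kimi-k3": 85,
    "openai/gpt-5.5": 84,
    "openai/gpt-5.6-terra": 82,
    "anthropic/claude-opus-4.8": 82,
    "openai/gpt-5.6-sol": 80,
}
ranks = competition_rank(totals)
for model, rank in ranks.items():
    print(rank, model, totals[model])
```

Under this scheme a model that trails two tied models at rank 3 is ranked 5th, not 4th, which is the convention the table follows.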

Full Score Matrix

Quiz	GPT 5.5	GPT 5.6 Terra	GPT 5.6 Sol	Opus 4.8	K2.7 Code	K3	Sonnet 5	GLM-5.2	Mistral M3.5	K2.6	Opus 4.7	Qwen 3.7+	Opus 4.6	Sonnet 4.6	Fable 5	GPT 5.4	Gemini 3 Flash	Qwen 3.6 Plus	MiniMax M2.7	HY3	MiniMax M3	MiniMax M2.5	DeepSeek V3.2	DeepSeek V4 Pro	DeepSeek V4 Flash	Qwen-35b	Gemma 4 31B	MiMo v2.5	Winner
Admission Control Options	9	9	9	9	9	9	9	7	8	9	9	8	9	9	9	7	7	7	7	7	8	4	5	8	8	7	7.5	7	GPT 5.5/GPT 5.6 Terra/GPT 5.6 Sol/Opus 4.8/K2.7 Code/K3/Sonnet 5/K2.6/Opus 4.7/Opus 4.6/Sonnet 4.6/Fable 5
Kubelet API Rights	6	7	6	5	6	9	7	5	3	7	6	5	7	6	0	7	7	5	8	5	4	2	3	4	3	4	5	3	Kimi K3
Kubernetes Authentication	8	7	6	8	7	6	9	7	5	7	7	9	7	8	8	5	7	8	5	5	5	4	5	8	6	5	6	5	Qwen 3.7+/Sonnet 5
Kubernetes Open Ports	8	9	9	9	8	9	6	5	4	7	7	6	6	7	9	5	7	6	6	5	3	4	4	5	5	4	5	7	GPT 5.6 Terra/GPT 5.6 Sol/Opus 4.8/K3/Fable 5
Kubernetes PKI List	10	10	10	10	10	10	9	10	8	10	10	9	9	8	10	9	9	10	9	10	9	7	6	9	10	9	6	10	GPT 5.5/GPT 5.6 Terra/GPT 5.6 Sol/Opus 4.8/K2.7 Code/K3/GLM-5.2/K2.6/Opus 4.7/Fable 5/Qwen 3.6 Plus/HY3/V4 Flash/MiMo v2.5
Kubernetes SSRF	6	6	5	8	7	9	5	8	7	5	7	8	6	5	0	4	8	4	6	6	9	3	5	6	5	5	6	4	K3/MiniMax M3
Pod Security Standards Levels	10	9	10	10	9	10	10	10	8	9	9	9	8	9	9	7	9	9	9	9	10	7	8	10	9	10	9	10	GPT 5.5/GPT 5.6 Sol/Opus 4.8/K3/Sonnet 5/GLM-5.2/MiniMax M3/V4 Pro/Qwen-35b/MiMo v2.5
Privileged vs APE	9	8	8	9	7	8	10	6	8	7	6	7	6	7	9	6	7	6	7	6	6	8	7	9	6	8	6	8	Sonnet 5
RBAC Verbs	9	9	9	8	9	7	8	7	5	9	8	7	7	7	7	9	7	7	6	7	5	6	6	8	7	6	6	7	GPT 5.5/GPT 5.6 Terra/GPT 5.6 Sol/K2.7 Code/K2.6/GPT 5.4
Secrets and Config Maps	9	8	8	6	8	8	7	7	7	7	6	7	6	7	9	8	6	6	8	6	6	6	6	9	7	7	7	6	GPT 5.5/V4 Pro/Fable 5
Total	84	82	80	82	80	85	80	72	63	77	75	75	71	73	70	67	74	68	71	66	65	51	55	76	66	65	63.5	67
Average	8.4	8.2	8.0	8.2	8.0	8.5	8.0	7.2	6.3	7.7	7.5	7.5	7.1	7.3	7.0	6.7	7.4	6.8	7.1	6.6	6.5	5.1	5.5	7.6	6.6	6.5	6.35	6.7
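
The Total, Average, and Winner entries can be recomputed mechanically from the per-quiz scores. The Python sketch below does this for three columns copied from the matrix (GPT 5.5, K3, Sonnet 5). It assumes a "winner" is any model tied for the highest score on a quiz; because only three models are included, the winners it reports are winners within this subset, not across the full field.

```python
QUIZZES = [
    "Admission Control Options", "Kubelet API Rights", "Kubernetes Authentication",
    "Kubernetes Open Ports", "Kubernetes PKI List", "Kubernetes SSRF",
    "Pod Security Standards Levels", "Privileged vs APE", "RBAC Verbs",
    "Secrets and Config Maps",
]

# Three columns copied from the score matrix, in QUIZZES order.
SCORES = {
    "GPT 5.5": [9, 6, 8, 8, 10, 6, 10, 9, 9, 9],
    "K3": [9, 9, 6, 9, 10, 9, 10, 8, 7, 8],
    "Sonnet 5": [9, 7, 9, 6, 9, 5, 10, 10, 8, 7],
}


def totals(scores):
    """Sum each model's ten quiz scores."""
    return {model: sum(values) for model, values in scores.items()}


def winners(scores, quizzes):
    """For each quiz, list every model tied for the top score (subset only)."""
    result = {}
    for i, quiz in enumerate(quizzes):
        best = max(values[i] for values in scores.values())
        result[quiz] = [m for m, values in scores.items() if values[i] == best]
    return result


model_totals = totals(SCORES)
quiz_winners = winners(SCORES, QUIZZES)
for model, total in model_totals.items():
    print(model, total, total / len(QUIZZES))
```

Extending SCORES to all 28 columns would allow the Winner column and the Wins counts in the rankings to be cross-checked in one pass.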

Score Distribution

By Model

Model	9-10	7-8	5-6	3-4	0-2
GPT 5.5	6	2	2	0	0
GPT 5.6 Terra	5	4	1	0	0
GPT 5.6 Sol	5	2	3	0	0
Claude Opus 4.8	5	3	2	0	0
Claude Sonnet 5	5	3	2	0	0
Kimi K2.6	4	5	1	0	0
Kimi K2.7 Code	4	5	1	0	0
Kimi K3	6	3	1	0	0
GLM-5.2	2	5	3	0	0
Mistral M3.5	0	6	2	2	0
Claude Opus 4.7	3	4	3	0	0
Qwen 3.7 Plus	3	5	2	0	0
Claude Fable 5	6	2	0	0	2
Claude Opus 4.6	2	4	4	0	0
Claude Sonnet 4.6	2	6	2	0	0
GPT 5.4	2	4	3	1	0
Gemini 3 Flash	2	7	1	0	0
Qwen 3.6 Plus	2	3	4	1	0
MiniMax M2.7	2	4	4	0	0
MiniMax M3	3	1	4	2	0
MiniMax M2.5	0	3	2	4	1
DeepSeek V3.2	0	2	6	2	0
DeepSeek V4 Pro	4	3	2	1	0
DeepSeek V4 Flash	2	3	4	1	0
Tencent HY3	2	2	6	0	0
Qwen-35b (LOCAL)	2	3	3	2	0
Gemma 4 31B (LOCAL)	1	2	7	0	0
MiMo v2.5	2	4	2	2	0
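
Each row of the distribution table is a count of a model's ten scores per band. The Python sketch below shows one way to derive a row from the score matrix. It assumes a fractional score such as 7.5 falls into the band whose lower bound it meets (so 7.5 counts as 7-8); that rule is an assumption, since the report does not state how half-points are banded.

```python
def band_counts(scores):
    """Count scores per band; bands are matched by lower bound (assumed rule)."""
    counts = {"9-10": 0, "7-8": 0, "5-6": 0, "3-4": 0, "0-2": 0}
    for score in scores:
        if score >= 9:
            counts["9-10"] += 1
        elif score >= 7:
            counts["7-8"] += 1
        elif score >= 5:
            counts["5-6"] += 1
        elif score >= 3:
            counts["3-4"] += 1
        else:
            counts["0-2"] += 1
    return counts


# Columns copied from the score matrix.
fable_5 = [9, 0, 8, 9, 10, 0, 9, 9, 7, 9]
gemma_4_31b = [7.5, 5, 6, 5, 6, 6, 9, 6, 6, 7]

fable_row = band_counts(fable_5)
gemma_row = band_counts(gemma_4_31b)
print(fable_row)
print(gemma_row)
```

Every row should sum to 10, which gives a quick sanity check when the matrix is updated.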

By Question Difficulty

Quiz	Avg Score	Spread (max-min)	Difficulty
Kubernetes PKI List	9.1	4	Easy
Pod Security Standards Levels	9.1	3	Easy
Admission Control Options	7.9	5	Moderate
Privileged vs APE	7.3	4	Moderate
RBAC Verbs	7.3	4	Moderate
Secrets and Config Maps	7.1	3	Moderate
Kubernetes Authentication	6.5	5	Moderate
Kubernetes Open Ports	6.3	6	Moderate
Kubernetes SSRF	5.8	9	Hard
Kubelet API Rights	5.2	9	Hard
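
The three columns in the difficulty table are simple statistics over a quiz's row of scores. The sketch below computes them in Python. The cut-offs used for the labels (average of 9.0 or more is Easy, below 6.0 is Hard, otherwise Moderate) are assumptions chosen to agree with the labels in the table; the report does not state the actual thresholds. The score lists are hypothetical and used only to exercise each label.

```python
def summarise(scores, easy_at=9.0, hard_below=6.0):
    """Return (average, spread, label) for one quiz's scores.

    easy_at and hard_below are assumed cut-offs, not documented ones.
    """
    average = sum(scores) / len(scores)
    spread = max(scores) - min(scores)
    if average >= easy_at:
        label = "Easy"
    elif average < hard_below:
        label = "Hard"
    else:
        label = "Moderate"
    return average, spread, label


# Hypothetical score lists, one per label.
easy = summarise([10, 9, 9, 8])
moderate = summarise([9, 7, 6, 4])
hard = summarise([9, 0, 5, 6])
print(easy, moderate, hard)
```

Note that a single refusal scored as 0 widens the spread to the full range, which is why the two Hard quizzes both show a spread of 9.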

Model Profiles

openai/gpt-5.5 — 84/100 (8.4 avg)

Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Privileged/APE (9), RBAC Verbs (9), Secrets/ConfigMaps (9)
Weakest on: Kubelet API (6), SSRF (6)
Profile: Second at 84/100, 1 point behind Kimi K3 (85) after K3’s kubelet re-run. Seven quiz wins (more than any other model) and zero last-place finishes. Six scores of 9-10 out of 10 quizzes. One of the strongest performances across the board, combining deep conceptual understanding with consistent accuracy. Shares the OpenAI family’s ability to identify trick questions (RBAC verb openness at 9, Secrets/ConfigMaps at 9) while also excelling at structured knowledge questions (PKI at 10, PSS at 10). Weaknesses are limited to attack-surface mapping (SSRF: 6) and kubelet-level operational detail (Kubelet API: 6), the same hard questions that challenge most models. A significant upgrade over GPT 5.4 (+17 points), the third-largest inter-generation improvement in the test set after DeepSeek V4 Pro (+21) and MiniMax M2.7 (+20).

openai/gpt-5.6-terra — 82/100 (8.2 avg)

Strongest on: PKI List (10), Admission Control (9), Open Ports (9), PSS Levels (9), RBAC Verbs (9)
Weakest on: SSRF (6), Authentication (7), Kubelet API (7)
Profile: Debuts tied for 3rd with Opus 4.8 at 82/100, just 2 points behind sibling GPT 5.5 (84). Five scores of 9-10 across PKI, Admission Control, Open Ports, PSS, and RBAC Verbs. Four shared wins, including the RBAC trick question (correctly identifying that verbs are an open set). The OpenAI family continues to excel at catching trick questions: GPT 5.6 Terra joins GPT 5.5 and GPT 5.4 on the RBAC trick, and shares the Open Ports lead at 9 (with GPT 5.6 Sol, Opus 4.8, Kimi K3, and Fable 5). Persistent OpenAI family weakness: falls for the Authentication trick question (7, recommending X.509 for production); none of the four OpenAI models has caught this. SSRF coverage (6) is the primary weakness, missing admission webhook vectors. Zero last-place finishes. The OpenAI family trajectory is GPT 5.4 (67), GPT 5.5 (84), GPT 5.6 Terra (82), GPT 5.6 Sol (80): the 5.6 variants are a slight regression from GPT 5.5 but still firmly in the top tier.

openai/gpt-5.6-sol — 80/100 (8.0 avg)

Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Open Ports (9), RBAC Verbs (9)
Weakest on: SSRF (5), Authentication (6), Kubelet API (6)
Profile: Debuts tied for 5th with 80/100 alongside Kimi K2.7 Code and Claude Sonnet 5, 2 points behind sibling GPT 5.6 Terra (82) and 4 behind GPT 5.5 (84). Five scores of 9-10 across PKI, PSS, Admission Control, Open Ports, and RBAC Verbs, with five shared wins and zero last-place finishes. Continues the OpenAI family’s RBAC trick-question strength at 9, correctly identifying verbs as an open set. Shares the Open Ports lead at 9 (matching GPT 5.6 Terra, Opus 4.8, Kimi K3, and Fable 5) and ties for the top score of 10 on PKI and PSS. Persistent OpenAI family weakness: falls for the Authentication trick question (6, recommending X.509 for production); none of the four OpenAI models has caught this, and Sol scores below GPT 5.5 (8) and Terra (7), ahead of only GPT 5.4 (5). SSRF coverage (5) is the primary weakness, missing key attack vectors. The OpenAI GPT 5.6 sub-family shows mild specialisation: Terra is a point ahead on Kubelet API (7) and Authentication (7), where Sol scores 6 on both, while Sol edges ahead on PSS Levels (10 vs 9) and matches Terra on Open Ports and RBAC. The OpenAI family trajectory is GPT 5.4 (67), GPT 5.5 (84), GPT 5.6 Terra (82), GPT 5.6 Sol (80): the 5.6 variants show slightly different specialisation profiles.

anthropic/claude-opus-4.8 — 82/100 (8.2 avg)

Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Open Ports (9), Privileged/APE (9)
Weakest on: Kubelet API (5), Secrets/ConfigMaps (6)
Profile: A significant step up from Opus 4.7 (+7 points), placing tied 3rd overall (with GPT 5.6 Terra) at 82/100, 2 points behind GPT 5.5 and 3 behind Kimi K3. Five scores of 9-10 and three of 7-8 show consistently high performance. Shares the Open Ports lead (9, ahead of GPT 5.5’s 8) with GPT 5.6 Terra, GPT 5.6 Sol, Kimi K3, and Fable 5, and scores 8 on SSRF (matching Gemini 3 Flash, Qwen 3.7 Plus, and GLM-5.2), one point behind the joint leaders Kimi K3 and MiniMax M3. The Privileged/APE severity calibration (9) is a dramatic improvement over the Anthropic family’s historical weakness: Opus 4.7 scored 6 with APE severity overstatement, while Opus 4.8 correctly conveys the gap. The perfect PKI List score continues the Anthropic improvement trajectory (Opus 4.6: 9, Opus 4.7: 10, Opus 4.8: 10). Persistent weaknesses: Kubelet API (5), with the nodes/proxy exec verb error and invented subresources, and Secrets/ConfigMaps (6), with the ConfigMap encryption misconception that plagues the Anthropic family. Zero last-place finishes.

anthropic/claude-sonnet-5 — 80/100 (8.0 avg)

Strongest on: PSS Levels (10), Privileged/APE (10), Admission Control (9), Authentication (9), PKI List (9)
Weakest on: SSRF (5), Open Ports (6)
Profile: Debuts tied for 5th with 80/100 alongside GPT 5.6 Sol and Kimi K2.7 Code, 2 points behind Opus 4.8 (82) and 4 behind GPT 5.5 (84). Five scores of 9-10 and three of 7-8 show consistently strong performance across structured knowledge questions. Sets a new benchmark on Privileged/APE with the first perfect 10/10, correctly calibrating APE as rarely a serious concern where previous leaders (GPT 5.5, Opus 4.8, V4 Pro, Fable 5) scored 9. Strong on PSS Levels (10, joining the co-winners), Admission Control (9, tying for the lead), Authentication (9, matching Qwen 3.7 Plus), and PKI List (9). Weaknesses mirror the broader model population: SSRF breadth (5, missing node-level vectors) and Open Ports operational detail (6, missing several ports). Zero last-place finishes. Shares the Anthropic family trait of missing the RBAC verb trick (8, does not recognise that verbs are open-ended) and the ConfigMap encryption misconception (7). The Anthropic quiz totals are Opus 4.6 (71), Sonnet 4.6 (73), Fable 5 (70), Opus 4.7 (75), Opus 4.8 (82), Sonnet 5 (80).

moonshotai/kimi-k2.6 — 77/100 (7.7 avg)

Strongest on: PKI List (10), Admission Control (9), PSS Levels (9), RBAC Verbs (9)
Weakest on: SSRF (5), Authentication (7), Open Ports (7), Kubelet API (7), Privileged/APE (7), Secrets/ConfigMaps (7)
Profile: Strong debut in 8th place with 77/100, 1 point ahead of DeepSeek V4 Pro (76) and 2 ahead of Opus 4.7 (75). Three shared wins (Admission Control, PKI List, RBAC Verbs) and zero last-place finishes. The RBAC Verbs score is notable: Kimi K2.6 was the first non-OpenAI model to catch the custom verb trick, breaking the pattern that suggested it was an OpenAI-specific training signal. Five scores of exactly 7 suggest a model that reliably covers fundamentals but hits a ceiling on nuanced questions requiring deeper operational knowledge or trick-question awareness.

moonshotai/kimi-k3 – 85/100 (8.5 avg)

Strongest on: PKI List (10), PSS Levels (10), Kubelet API (9), Admission Control (9), Open Ports (9), SSRF (9)
Weakest on: Authentication (6), RBAC Verbs (7)
Profile: The quiz leader at 85/100 (8.5 avg), 1 point ahead of GPT 5.5 (84). Six scores of 9-10 demonstrate strong knowledge across well-documented Kubernetes concepts (PKI, PSS) and attack-surface topics (SSRF, Admission Control, Open Ports). The SSRF score of 9 ties with MiniMax M3 for the joint lead – exceptionally comprehensive coverage including proxy subresources, manual Endpoints/EndpointSlices for arbitrary-IP SSRF, webhook configurations, aggregated API registration, and image pulls as blind SSRF. The standout result is Kubelet API at 9 – the highest score any model has achieved on this question and the only model in the entire field to correctly identify that get on nodes/proxy already permits exec via WebSocket HTTP GET upgrades (the tricky bonus the scoring notes single out). This was scored from a re-run on 2026-07-17 after the original 2026-07-16 attempt failed with persistent connection errors (an upstream routing issue, not a knowledge gap or safety refusal). The only thing keeping it from a perfect 10 is the absence of any mention of K8s 1.33 fine-grained authorization. The Authentication score of 6 reflects falling for the X.509 trick question while acknowledging cert limitations. The Moonshot AI family trajectory: K2.6 (77), K2.7 Code (80), K3 (85) – K3 now tops the family and the whole quiz field. One sole win (Kubelet API) and zero last-place finishes. The score distribution (six 9-10s, three 7-8s, one 5-6) is the strongest of any model tested.

anthropic/claude-opus-4.7 — 75/100 (7.5 avg)

Strongest on: PKI List (10), Admission Control (9), PSS Levels (9)
Weakest on: Kubelet API (6), Privileged/APE (6), Secrets/ConfigMaps (6)
Profile: Strong, consistent performer. Three scores of 9+ and no score below 6. Improved over Opus 4.6 in 5 areas (Open Ports, PKI, SSRF, PSS, RBAC) while regressing in 1 (Kubelet API). The PKI question shows a perfect understanding of Kubernetes CA architecture. Persistent weaknesses are the same Anthropic family traits: trick question blindspots (RBAC verb openness, ConfigMap encryption) and APE severity miscalibration, though APE calibration improved from “SERIOUS” to “Moderate.”

qwen/qwen3.7-plus — 75/100 (7.5 avg)

Strongest on: Authentication (9), PKI List (9), PSS Levels (9), Admission Control (8), SSRF (8)
Weakest on: Kubelet API (5), Open Ports (6)
Profile: A solid improvement over Qwen 3.6 Plus (+7 points), placing tied 10th with Opus 4.7 at 75/100. Three scores of 9 and two of 8 demonstrate strong knowledge across structured and conceptual questions. Shares the lead on Authentication (9/10, with Sonnet 5) and was the first model to score above 8 on this trick question, correctly identifying that none of the built-in methods are suitable for production user authentication, with well-articulated CRL/OCSP reasoning. Scores 8/10 on SSRF (matching Gemini 3 Flash, Opus 4.8, and GLM-5.2), one point behind the joint leaders, with good coverage of proxy subresources, aggregated API, and admission webhooks. The Qwen family’s PKI expertise continues with 9/10 (Qwen 3.6 Plus scored 10). Persistent weaknesses are kubelet-level operational detail (5/10, incorrectly mapping exec to pods/exec) and port binding accuracy (6/10, missing etcd 2381 and kube-proxy 10249). Zero wins on “easy” questions (PSS, PKI) but zero last-place finishes: a consistently strong mid-to-upper performer.

anthropic/claude-fable-5 — 70/100 (7.0 avg)

Strongest on: PKI List (10), Admission Control (9), Open Ports (9), Privileged/APE (9), Secrets/ConfigMaps (9)
Weakest on: Kubelet API (0), SSRF (0), RBAC Verbs (7)
Profile: A polarised performer with the most extreme score distribution of any model tested: six scores of 9-10 alongside two scores of 0. The zeros are unprecedented: Fable 5 produced completely empty responses on Kubelet API and SSRF, appearing to trigger safety guardrails that refused to engage with offensive security topics (kubelet exploitation, SSRF attack vectors). When it does engage, performance is strong, with co-wins on Admission Control, Open Ports, PKI List, and Secrets/ConfigMaps and a 9 on Privileged/APE, one point behind Sonnet 5’s 10. The PKI perfect 10 continues the Anthropic family’s PKI strength (Opus 4.6: 9, Opus 4.7: 10, Opus 4.8: 10, Fable 5: 10). The Privileged/APE score of 9 matches Opus 4.8’s breakthrough in correctly calibrating APE severity. The total of 70/100 places it 17th, but this understates its capability on the questions it answers: its average over the 8 answered quizzes is 8.75 (70/8), the highest effective average of any model. The safety refusal pattern is a significant concern for security assessment use cases.
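
The effective-average figure for Fable 5 can be reproduced with a few lines of Python. This sketch assumes that a score of exactly 0 marks a refusal or empty response and excludes it; that rule is a simplification for illustration, since a 0 could in principle also be awarded to a wrong answer. The function name effective_average is hypothetical.

```python
def effective_average(scores, refusal_score=0):
    """Average over answered quizzes only, treating refusal_score as 'not answered'."""
    answered = [s for s in scores if s != refusal_score]
    if not answered:
        return None, 0  # nothing answered, so no meaningful average
    return sum(answered) / len(answered), len(answered)


# Fable 5 column copied from the score matrix.
fable_5 = [9, 0, 8, 9, 10, 0, 9, 9, 7, 9]
average, answered_count = effective_average(fable_5)
print(average, answered_count)
```

The same function applied to a model with no zeros simply returns its ordinary average, so the comparison across models stays like for like.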

z-ai/glm-5.2 — 72/100 (7.2 avg)

Strongest on: PKI List (10), PSS Levels (10), SSRF (8)
Weakest on: Kubelet API (5), Open Ports (5)
Profile: A solid debut at 72/100, placing 14th of the 28 models. Two perfect 10s on PKI List and PSS Levels demonstrate strong knowledge of well-documented Kubernetes concepts. The SSRF score of 8 matches the previous three-way tie (Opus 4.8, Gemini 3 Flash, Qwen 3.7 Plus) for strong attack-surface coverage. Five scores of 7-8 across Admission Control, Authentication, RBAC Verbs, Secrets/ConfigMaps, and SSRF show consistent mid-to-upper performance. Weaknesses are concentrated on operational detail: Kubelet API (5, wrong exec verb mapping) and Open Ports (5, missing several ports and incorrect NodePort claim). The Authentication score of 7 reflects incorrectly endorsing X.509 for production use. Two shared wins (PKI List and PSS Levels) and zero last-place finishes: a reliable mid-tier performer that handles conceptual questions well but struggles with the deeper operational knowledge that separates the top models.

anthropic/claude-opus-4.6 — 71/100 (7.1 avg)

Strongest on: Admission Control (9), PKI List (9), Pod Security Standards (8)
Weakest on: Open Ports (6), Privileged/APE (6), Secrets/ConfigMaps (6), SSRF (6)
Profile: Strong, consistent mid-table performer with no scores below 6. Deep knowledge of Kubernetes security architecture. Unlike Sonnet 4.6, Opus 4.6 does not fabricate information: its errors are miscalibrations and missed nuances rather than invented facts. Shares the Anthropic family trait of struggling with trick questions (RBAC verb openness, ConfigMap encryption).

google/gemini-3-flash-preview — 74/100 (7.4 avg)

Strongest on: SSRF (8), PKI (9), PSS Levels (9)
Weakest on: Secrets/ConfigMaps (6)
Profile: A consistent performer: seven of ten scores fall in the 7-8 band and none is below 6. Scored 8 on SSRF with wide vector coverage, one point behind the joint leaders (Kimi K3 and MiniMax M3), and was one of the few models to identify the pod image reference as an SSRF vector. Weakest on nuanced questions where common assumptions are wrong (ConfigMap encryption).

anthropic/claude-sonnet-4.6 — 73/100 (7.3 avg)

Strongest on: Admission Control (9), PSS Levels (9), Authentication (8)
Weakest on: SSRF (5)
Profile: A consistent model that never scored below 5 (range 5-9). Deep expertise on individual topics (detailed RBAC tables, good security rationales) but sometimes narrow in breadth. Tends to go deep on one aspect rather than covering the full surface area.

openai/gpt-5.4 — 67/100 (6.7 avg)

Strongest on: RBAC Verbs (9), PKI (9), Secrets/ConfigMaps (8)
Weakest on: SSRF (4), Authentication (5), Open Ports (5)
Profile: A high-variance model, with scores ranging from 4 to 9. Excels on questions with trick elements or nuanced distinctions (recognised that RBAC verbs are open-ended, correctly noted that ConfigMap encryption is possible). Often too brief on straightforward knowledge questions, costing marks for lack of detail.

deepseek/deepseek-v4-pro — 76/100 (7.6 avg)

Strongest on: PSS Levels (10), Privileged/APE (9), Secrets/ConfigMaps (9)
Weakest on: Kubelet API (4), Open Ports (5)
Profile: Ninth place at 76/100. Shares the top score on PSS Levels (10/10) and Secrets/ConfigMaps (9); its Privileged/APE score of 9 trails only Sonnet 5’s 10, and its Authentication score of 8 is one point behind the leaders. Strong on conceptual understanding and security implications, and excels at explaining differences and trade-offs (Privileged/APE, Secrets/ConfigMaps). Critical error on kubelet exec authorization (claims CREATE instead of GET for nodes/proxy), and incorrect bind address mappings for several control plane components. Dramatic improvement over DeepSeek V3.2 (+21 points), the largest inter-generation improvement in the test set.

deepseek/deepseek-v4-flash — 66/100 (6.6 avg)

Strongest on: PKI List (10), PSS Levels (9), Admission Control (8)
Weakest on: Kubelet API (3), Open Ports (5), SSRF (5)
Profile: A solid mid-tier performer that sits between V4 Pro (76) and V3.2 (55), closer to the pack than either sibling. Achieves a perfect 10 on PKI List (joining Opus 4.7 and Qwen) and a strong 9 on PSS Levels. Admission Control score of 8 matches V4 Pro. However, struggles significantly on kubelet-level knowledge (3/10 on Kubelet API — fabricates incorrect resource strings) and attack-surface mapping (SSRF: 5). Incorrectly recommends X.509 certificates for production authentication, missing the trick question. The ConfigMap encryption misconception persists from V3.2. Overall pattern: strong on structured knowledge questions, weak on operational detail and trick questions.

deepseek/deepseek-v3.2 — 55/100 (5.5 avg)

Strongest on: PSS Levels (8), Privileged/APE (7)
Weakest on: Kubelet API (3), Open Ports (4)
Profile: Second-lowest total in the field. Tends to include non-standard components in answers (Dashboard, Ingress controllers for stock kubeadm questions). Occasional factual errors (stating that update includes patch, listing 4 CAs instead of 3). Reasonable on straightforward questions but struggles with precision.

qwen/qwen3.6-plus — 68/100 (6.8 avg)

Strongest on: PKI List (10), PSS Levels (9), Authentication (8)
Weakest on: SSRF (4), Kubelet API (5)
Profile: Solid mid-tier performer with a perfect 10 on PKI List. Strong on structured knowledge questions (PKI, PSS, Authentication) but weak on attack-oriented topics (SSRF, Kubelet). The SSRF answer contained a factual error about kube-proxy endpoints being disabled by default. Its Authentication score of 8 tied Sonnet 4.6 for the early lead, since passed by Qwen 3.7 Plus and Sonnet 5 at 9; it correctly identified that no built-in method is suitable for production users. Never scored below 4 on any question. Similar consistency pattern to the Anthropic models but without their trick-question blindspots on Authentication.

minimax/minimax-m2.7 — 71/100 (7.1 avg)

Strongest on: PKI List (9), PSS Levels (9), Kubelet API (8), Secrets/ConfigMaps (8)
Weakest on: Authentication (5), Open Ports (6), RBAC Verbs (6), SSRF (6)
Profile: A dramatic improvement over M2.5 (+20 points). Tied with Opus 4.6 at 71/100 with an identical score distribution (2 scores of 9-10, 4 of 7-8, 4 of 5-6, none below 5). Strong on fundamental topics (PKI, PSS) and practical questions (Kubelet API, Secrets). Like most models, missed the RBAC verbs trick and recommended X.509 for production authentication. The second-largest single-generation improvement in the test set, behind DeepSeek V4 Pro (+21 over V3.2).

minimax/minimax-m3 — 65/100 (6.5 avg)

Strongest on: PSS Levels (10), SSRF (9, joint lead with Kimi K3), PKI List (9)
Weakest on: Open Ports (3, sole last place), Kubelet API (4)
Profile: Scores 65/100, tying for 23rd with Qwen3.6-35b-a3b (LOCAL). Shows strong conceptual understanding of security topics but struggles with detailed factual recall on ports and protocol specifics. The 14-point improvement over MiniMax M2.5 demonstrates clear advancement, though M3 scores 6 points below sibling M2.7 (71).

minimax/minimax-m2.5 — 51/100 (5.1 avg)

Strongest on: Privileged/APE (8), PSS Levels (7), PKI (7)
Weakest on: Kubelet API (2), SSRF (3), Authentication (4)
Profile: Weakest overall with the most last-place finishes (5, per the rankings table). Produced a strong answer on Privileged vs APE (8/10, with well-calibrated severity). Tends to make critical errors on harder questions: treating Ingress as an SSRF vector, claiming cert revocation support, and describing RBAC as inspecting pod specs. Verbose answers don’t compensate for accuracy gaps.

google/gemma-4-31b (LOCAL) — 64/100 (6.4 avg)

Strongest on: PSS Levels (9), Admission Control (7.5), Secrets/ConfigMaps (7)
Weakest on: Kubelet API (5), Open Ports (5), PKI List (6)
Profile: The second local model tested, a 31B-parameter model running on LM Studio. Scores 63.5/100, placing 25th of the 28 models. Characterised by a flat mid-range distribution: seven of ten quizzes scored 5 or 6, giving a narrow score range and few standout results. The PSS Levels score of 9/10 is the sole strong result, showing confident knowledge of well-documented concepts. Notable errors include calling the service account signer a CA (PKI), stating that 10257/10259 bind to all interfaces (they are localhost-only), overstating allowPrivilegeEscalation severity, and missing arbitrary RBAC verbs. Zero wins and zero last-place finishes: a consistently average-to-below-average performer.

qwen/qwen3.6-35b-a3b (LOCAL) — 65/100 (6.5 avg)

Strongest on: PSS Levels (10), PKI List (9), Privileged/APE (8)
Weakest on: Kubelet API (4), Open Ports (4), Authentication (5), SSRF (5)
Profile: The first local model tested, a 35B-parameter MoE model running on LM Studio. Scores 65/100, placing tied 23rd of the 28 models (with MiniMax M3). Two scores of 9-10 show strong knowledge on well-defined concepts (PSS levels, PKI architecture), but it struggles with operational depth (kubelet webhook auth, port binding details) and trick questions (authentication production suitability, RBAC verb openness). The pattern suggests the model has good conceptual training but lacks the nuanced operational knowledge that larger cloud models demonstrate. Notable: zero hallucinations; its errors are omissions and misunderstandings rather than fabricated information.

tencent/hy3 — 66/100 (6.6 avg)

Strongest on: PKI List (10), PSS Levels (9), Admission Control (7), RBAC Verbs (7)
Weakest on: Kubelet API (5), Authentication (5), Open Ports (5)
Profile: Debuts tied for 21st with 66/100 alongside DeepSeek V4 Flash. One perfect 10 on PKI List (correctly identifying all 3 root CAs and noting the SA key pair is not a CA) and a strong 9 on PSS Levels demonstrate solid knowledge of well-documented Kubernetes concepts. Six scores of 5-6 across the middle range show a model that handles fundamentals but struggles with deeper operational knowledge. The Kubelet API score of 5 reflects the common exec verb error and partial model confusion, while Authentication (5) falls for the trick question by recommending X.509 certificates for production. Open Ports (5) misses several ports and incorrectly includes the NodePort range. The SSRF score of 6 covers multiple vectors but incorrectly prioritises API aggregation over API server proxy. RBAC Verbs (7) correctly identifies the standard verbs but hallucinates non-existent ones. No sole wins (the PKI List 10 is shared) and zero last-place finishes: a consistently mid-tier performer that handles conceptual questions well but lacks the operational depth and trick-question awareness that separate the top models.

mistralai/mistral-medium-3-5 — 63/100 (6.3 avg)

Strongest on: PKI List (8), Admission Control (8), PSS Levels (8), Privileged/APE (8)
Weakest on: Kubelet API (3), Open Ports (4)
Profile: Ranks 26th of the 28 models with 63/100, below average but comfortably above the bottom two (V3.2 at 55, M2.5 at 51). The score distribution is notably flat: six scores of 7-8, two of 5, and two of 3-4, with zero scores above 8. Strong on conceptual questions (Admission Control 8, PSS 8, PKI 8, Privileged/APE 8) but weak on technical detail: Kubelet API Rights (3) features an entirely wrong RBAC mapping (using pod resources instead of node subresources), and Open Ports (4) includes outdated port numbers and wrong interface bindings. The Authentication score of 5 reflects the common misconception that X.509 is suitable for production. Zero wins and zero last-place finishes: a consistently below-average performer that handles well-documented concepts adequately but lacks the operational depth that separates mid-tier from top-tier models.

xiaomi/mimo-v2.5 — 67/100 (6.7 avg)

Strongest on: PKI List (10), PSS Levels (10), Privileged/APE (8)
Weakest on: Kubelet API (3), SSRF (4), Authentication (5)
Profile: Debuts tied for 19th with 67/100 (6.7 avg), matching GPT 5.4. No outright wins and no outright last-place finishes: a consistent, unremarkable middle-of-the-field quiz profile. At its best on pure recall: perfect 10s on PKI List and Pod Security Standards (both shared with other models, not sole wins) plus a strong 8 on Privileged/APE with a well-reasoned privileged-vs-APE distinction. Falls for all three trick questions: recommends X.509 for production authentication (compounded by a false claim that certificates can be centrally revoked), treats RBAC verbs as a closed set of 11 rather than an open set, and misses the ConfigMap-in-view-ClusterRole distinction. Weakest on kubelet authorization (3, mapping rights onto the main-API resources pods, pods/log, and pods/exec rather than the nodes/proxy subresources the webhook authorizer uses) and SSRF (4, covering only the API-server-proxy vector). The mid-table quiz result badly understates the model’s hands-on ability: the same model that misreads kubelet authorization went on to exploit all six pentest scenarios (28/30, 3rd best ever). MiMo is far stronger executing attacks than reciting the underlying theory.

Cross-Cutting Findings

1. Breadth vs Depth

Questions asking about attack surfaces or comprehensive lists reward breadth (SSRF, Open Ports, Admission Control). Models that go deep on one aspect but miss others score lower. Kimi K3 scored 9 on all three; among the earliest models tested, Gemini 3 Flash showed the broadest SSRF coverage (8).

2. Trick Questions Reveal Understanding

Three questions had trick elements:

RBAC Verbs: Verbs are an open set, not a fixed list. GPT 5.6 Sol, GPT 5.6 Terra, GPT 5.5, GPT 5.4, Kimi K2.6, and Kimi K2.7 Code recognised this (all scoring 9). All other models missed it.
Secrets/ConfigMaps: ConfigMaps CAN be encrypted at rest. Only the OpenAI GPT models noted this.
Authentication: None of the built-in methods are suitable for production. Qwen 3.7 Plus and Sonnet 5 share the lead at 9/10, correctly identifying this with well-articulated reasoning.

The RBAC verb trick was previously an OpenAI-only insight, but Kimi K2.6 broke this pattern by also scoring 9. GPT 5.5 scores 9 on both RBAC Verbs and Secrets/ConfigMaps. GPT 5.6 Sol continues the OpenAI family’s strength on RBAC Verbs (9). Sonnet 5 and HY3 do not catch the RBAC verb trick (8 and 7 respectively). HY3 additionally hallucinated non-existent verbs. Kimi K3 scores 7 on RBAC Verbs, missing the arbitrary verb trick. Xiaomi MiMo v2.5 falls for all three trick questions — recommending X.509 for production authentication (5, compounded by a false central-revocation claim), treating RBAC verbs as a closed set of 11 (7), and missing the ConfigMap-in-view-ClusterRole distinction (6).

3. “Standard Kubeadm” Qualifier Matters

Questions specifying “standard kubeadm” caught models that included non-standard components:

MiniMax: Ingress controllers (SSRF), cert revocation (Authentication)
DeepSeek: Dashboard, Ingress controllers (SSRF, Open Ports)

4. Severity Calibration Varies

On the Privileged vs APE question, most models overstated allowPrivilegeEscalation as a serious risk. MiniMax M2.5, the overall weakest model, produced one of the best-calibrated early answers on this specific question (8/10). M2.7 reverted to the majority position, overstating APE severity. Opus 4.8 breaks the Anthropic family pattern with a well-calibrated 9/10, joining GPT 5.5, V4 Pro, and Fable 5 at that score, a dramatic improvement from Opus 4.7’s 6/10. Sonnet 5 sets a new benchmark with 10/10, the first perfect score on this question, correctly calibrating APE as rarely a serious concern. Kimi K3 scores 8/10, slightly overstating APE risk.

5. Qwen’s PKI Expertise Stands Out

Qwen 3.6 Plus scores a perfect 10/10 on PKI List, matching Opus 4.7 and one of fourteen models now at 10, and Qwen 3.7 Plus scores 9. Qwen 3.6 Plus’s Authentication score (8) was previously tied for first but now sits one point behind Qwen 3.7 Plus and Sonnet 5 (9). The Qwen family demonstrates strong infrastructure security knowledge. Qwen 3.6 Plus’s weakness is attack-surface mapping (SSRF: 4, Kubelet API: 5).

6. Node-Level Knowledge Is Weakest — But Kimi K3 Breaks Through

Across most models, knowledge about kubelet-level behaviour was poorest:

No model identified pod probes as SSRF vectors
Kubelet API authorization (node subresources) was poorly understood by most models — but Kimi K3 (9/10) is the first and only model in the field to correctly identify that get on nodes/proxy already permits exec via WebSocket HTTP GET upgrades, the tricky point the scoring notes flag as unlikely for a model to get. Every other model said create (or create plus get) for exec. This overturns the long-standing finding that no model got the exec verb right.
Pod image reference as an SSRF vector was identified only by Gemini, Opus 4.7, and Kimi K3 (which listed image pulls as blind SSRF)

7. Easy Questions Don’t Differentiate

PSS Levels and PKI List had the highest average scores (both about 9.1) and narrow spreads of 3-4 points. These questions confirm basic knowledge but don’t separate strong from weak models: fourteen models scored 10 on PKI List and ten scored 10 on PSS Levels. However, the gap between a 9 and a perfect 10 shows that even “easy” questions can reward precision.

8. Factual Errors vs Missing Information

Missing information costs less than wrong information. Models that stayed accurate but brief (GPT on some questions) scored better than models that were comprehensive but included errors (DeepSeek’s “4 CAs”, MiniMax’s Ingress focus).

9. Anthropic Family Trait: Improving but Persistent Gaps — Plus a New Safety Concern

The six Anthropic models (Sonnet 5, Fable 5, Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6) share blindspots on RBAC verb openness and ConfigMap encryption eligibility. However, Sonnet 5, Opus 4.8, and Fable 5 break the APE severity overstatement pattern — Sonnet 5 sets a new benchmark with a perfect 10/10, while Opus 4.8 and Fable 5 score 9/10. Fable 5 introduces a new concern: safety guardrails that produce completely empty responses on offensive security topics (Kubelet API: 0, SSRF: 0). This is the first model to score 0 on any quiz, and it happens twice. When Fable 5 does engage, it performs strongly (six scores of 9-10), but the safety refusals represent a significant limitation for security assessment use cases.

10. MiniMax M3 Sets New SSRF High and Continues Family Trajectory

MiniMax M3 took the lead on SSRF at 9/10, surpassing the previous three-way tie at 8/10 (Opus 4.8, Gemini 3 Flash, Qwen 3.7 Plus); Kimi K3 has since matched it at 9/10. With comprehensive coverage of API server proxy, admission webhooks, and CVE references, M3 demonstrates some of the strongest SSRF attack-surface knowledge of any model. The MiniMax family trajectory (M2.5: 51, M2.7: 71, M3: 65) is not monotonic: M3 scores 6 points below M2.7 despite being the newer model, suggesting different training trade-offs rather than a simple linear improvement.
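
The generation-to-generation changes quoted throughout the profiles are differences between consecutive totals within a family. The Python sketch below computes them for three of the families, using totals from the rankings. The ordering of models within each family is taken from the trajectory strings in this report, and the function name generation_deltas is illustrative.

```python
# Totals (out of 100) copied from the rankings, in the generation order
# used by the trajectory strings in the profiles.
TRAJECTORIES = {
    "OpenAI": [("GPT 5.4", 67), ("GPT 5.5", 84)],
    "DeepSeek": [("DeepSeek V3.2", 55), ("DeepSeek V4 Pro", 76)],
    "MiniMax": [("MiniMax M2.5", 51), ("MiniMax M2.7", 71), ("MiniMax M3", 65)],
}


def generation_deltas(trajectories):
    """Return (delta, family, older, newer) tuples, largest improvement first."""
    deltas = []
    for family, steps in trajectories.items():
        # Pair each model with its immediate successor in the family.
        for (older, old_total), (newer, new_total) in zip(steps, steps[1:]):
            deltas.append((new_total - old_total, family, older, newer))
    return sorted(deltas, reverse=True)


deltas = generation_deltas(TRAJECTORIES)
for delta, family, older, newer in deltas:
    print(f"{family}: {older} -> {newer}: {delta:+d}")
```

A negative delta, as for M2.7 to M3, flags a newer model that scores below its predecessor.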