Quiz Test Summary — Twenty-Two-Model Comparison
Original date: 2026-03-09 (scored 2026-03-10) | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31 | Qwen 3.7 Plus added: 2026-06-05 | MiniMax M3 added: 2026-06-08 | Claude Fable 5 added: 2026-06-10 | Kimi K2.7 Code added: 2026-06-16 | GLM-5.2 added: 2026-06-17 | Mistral Medium 3.5 added: 2026-06-18
Models: 22 models tested across 10 Kubernetes security knowledge questions
Models Tested
| Model | Provider | Short Name | Tested |
|---|---|---|---|
| anthropic/claude-fable-5 | Anthropic | Fable 5 | 2026-06-10 |
| anthropic/claude-opus-4.8 | Anthropic | Opus 4.8 | 2026-05-31 |
| anthropic/claude-opus-4.7 | Anthropic | Opus 4.7 | 2026-04-20 |
| anthropic/claude-opus-4.6 | Anthropic | Opus 4.6 | 2026-03-25 |
| anthropic/claude-sonnet-4.6 | Anthropic | Sonnet 4.6 | 2026-03-09 |
| openai/gpt-5.5 | OpenAI | GPT 5.5 | 2026-04-25 |
| openai/gpt-5.4 | OpenAI | GPT 5.4 | 2026-03-09 |
| google/gemini-3-flash-preview | Google | Gemini 3 Flash | 2026-03-09 |
| qwen/qwen3.6-plus | Qwen | Qwen 3.6 Plus | 2026-04-20 |
| minimax/minimax-m2.5 | MiniMax | MiniMax M2.5 | 2026-03-09 |
| minimax/minimax-m2.7 | MiniMax | MiniMax M2.7 | 2026-03-28 |
| minimax/minimax-m3 | MiniMax | MiniMax M3 | 2026-06-08 |
| deepseek/deepseek-v3.2 | DeepSeek | DeepSeek V3.2 | 2026-03-09 |
| deepseek/deepseek-v4-pro | DeepSeek | DeepSeek V4 Pro | 2026-04-24 |
| deepseek/deepseek-v4-flash | DeepSeek | DeepSeek V4 Flash | 2026-04-24 |
| moonshotai/kimi-k2.6 | Moonshot AI | Kimi K2.6 | 2026-04-26 |
| moonshotai/kimi-k2.7-code | Moonshot AI | Kimi K2.7 Code | 2026-06-16 |
| z-ai/glm-5.2 | Z-AI | GLM-5.2 | 2026-06-17 |
| mistralai/mistral-medium-3-5 | Mistral AI | Mistral M3.5 | 2026-06-18 |
| qwen/qwen3.7-plus | Qwen | Qwen 3.7 Plus | 2026-06-05 |
| qwen/qwen3.6-35b-a3b | Qwen (Local) | Qwen-35b (LOCAL) | 2026-05-03 |
| google/gemma-4-31b | Google (Local) | Gemma 4 31B (LOCAL) | 2026-05-03 |
Overall Rankings
| Rank | Model | Total (out of 100) | Average | Wins (incl. ties) | Last Place (incl. ties) |
|---|---|---|---|---|---|
| 1 | openai/gpt-5.5 | 84 | 8.4 | 6 | 0 |
| 2 | anthropic/claude-opus-4.8 | 82 | 8.2 | 5 | 1 |
| 3 | moonshotai/kimi-k2.7-code | 80 | 8.0 | 3 | 0 |
| 4 | moonshotai/kimi-k2.6 | 77 | 7.7 | 3 | 0 |
| 5 | deepseek/deepseek-v4-pro | 76 | 7.6 | 3 | 0 |
| 6 | anthropic/claude-opus-4.7 | 75 | 7.5 | 2 | 2 |
| 6 | qwen/qwen3.7-plus | 75 | 7.5 | 1 | 0 |
| 8 | google/gemini-3-flash-preview | 74 | 7.4 | 0 | 1 |
| 9 | anthropic/claude-sonnet-4.6 | 73 | 7.3 | 1 | 0 |
| 10 | z-ai/glm-5.2 | 72 | 7.2 | 2 | 1 |
| 11 | anthropic/claude-opus-4.6 | 71 | 7.1 | 1 | 2 |
| 11 | minimax/minimax-m2.7 | 71 | 7.1 | 1 | 0 |
| 13 | anthropic/claude-fable-5 | 70 | 7.0 | 5 | 2 |
| 14 | qwen/qwen3.6-plus | 68 | 6.8 | 1 | 2 |
| 15 | openai/gpt-5.4 | 67 | 6.7 | 1 | 2 |
| 16 | deepseek/deepseek-v4-flash | 66 | 6.6 | 1 | 1 |
| 17 | minimax/minimax-m3 | 65 | 6.5 | 2 | 4 |
| 17 | qwen/qwen3.6-35b-a3b (LOCAL) | 65 | 6.5 | 1 | 0 |
| 19 | google/gemma-4-31b (LOCAL) | 64 | 6.4 | 0 | 2 |
| 20 | mistralai/mistral-medium-3-5 | 63 | 6.3 | 0 | 1 |
| 21 | deepseek/deepseek-v3.2 | 55 | 5.5 | 0 | 2 |
| 22 | minimax/minimax-m2.5 | 51 | 5.1 | 0 | 4 |
Full Score Matrix
| Quiz | GPT 5.5 | Opus 4.8 | K2.7 Code | GLM-5.2 | Mistral M3.5 | K2.6 | Opus 4.7 | Qwen 3.7+ | Opus 4.6 | Sonnet 4.6 | Fable 5 | GPT 5.4 | Gemini 3 Flash | Qwen 3.6 Plus | MiniMax M2.7 | MiniMax M3 | MiniMax M2.5 | DeepSeek V3.2 | DeepSeek V4 Pro | DeepSeek V4 Flash | Qwen-35b | Gemma 4 31B | Winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Admission Control Options | 9 | 9 | 9 | 7 | 8 | 9 | 9 | 8 | 9 | 9 | 9 | 7 | 7 | 7 | 7 | 8 | 4 | 5 | 8 | 8 | 7 | 7.5 | GPT 5.5/Opus 4.8/K2.7 Code/K2.6/Opus 4.7/Opus 4.6/Sonnet/Fable 5 |
| Kubelet API Rights | 6 | 5 | 6 | 5 | 3 | 7 | 6 | 5 | 7 | 6 | 0 | 7 | 7 | 5 | 8 | 4 | 2 | 3 | 4 | 3 | 4 | 5 | MiniMax M2.7 |
| Kubernetes Authentication | 8 | 8 | 7 | 7 | 5 | 7 | 7 | 9 | 7 | 8 | 8 | 5 | 7 | 8 | 5 | 5 | 4 | 5 | 8 | 6 | 5 | 6 | Qwen 3.7+ |
| Kubernetes Open Ports | 8 | 9 | 8 | 5 | 4 | 7 | 7 | 6 | 6 | 7 | 9 | 5 | 7 | 6 | 6 | 3 | 4 | 4 | 5 | 5 | 4 | 5 | Opus 4.8/Fable 5 |
| Kubernetes PKI List | 10 | 10 | 10 | 10 | 8 | 10 | 10 | 9 | 9 | 8 | 10 | 9 | 9 | 10 | 9 | 9 | 7 | 6 | 9 | 10 | 9 | 6 | GPT 5.5/Opus 4.8/K2.7 Code/GLM-5.2/K2.6/Opus 4.7/Fable 5/Qwen 3.6 Plus/V4 Flash |
| Kubernetes SSRF | 6 | 8 | 7 | 8 | 7 | 5 | 7 | 8 | 6 | 5 | 0 | 4 | 8 | 4 | 6 | 9 | 3 | 5 | 6 | 5 | 5 | 6 | MiniMax M3 |
| Pod Security Standards Levels | 10 | 10 | 9 | 10 | 8 | 9 | 9 | 9 | 8 | 9 | 9 | 7 | 9 | 9 | 9 | 10 | 7 | 8 | 10 | 9 | 10 | 9 | GPT 5.5/Opus 4.8/GLM-5.2/MiniMax M3/V4 Pro/Qwen-35b |
| Privileged vs APE | 9 | 9 | 7 | 6 | 8 | 7 | 6 | 7 | 6 | 7 | 9 | 6 | 7 | 6 | 7 | 6 | 8 | 7 | 9 | 6 | 8 | 6 | GPT 5.5/Opus 4.8/V4 Pro/Fable 5 |
| RBAC Verbs | 9 | 8 | 9 | 7 | 5 | 9 | 8 | 7 | 7 | 7 | 7 | 9 | 7 | 7 | 6 | 5 | 6 | 6 | 8 | 7 | 6 | 6 | GPT 5.5/K2.7 Code/K2.6/GPT 5.4 |
| Secrets and Config Maps | 9 | 6 | 8 | 7 | 7 | 7 | 6 | 7 | 6 | 7 | 9 | 8 | 6 | 6 | 8 | 6 | 6 | 6 | 9 | 7 | 7 | 7 | GPT 5.5/V4 Pro/Fable 5 |
| Total | 84 | 82 | 80 | 72 | 63 | 77 | 75 | 75 | 71 | 73 | 70 | 67 | 74 | 68 | 71 | 65 | 51 | 55 | 76 | 66 | 65 | 64 | |
| Average | 8.4 | 8.2 | 8.0 | 7.2 | 6.3 | 7.7 | 7.5 | 7.5 | 7.1 | 7.3 | 7.0 | 6.7 | 7.4 | 6.8 | 7.1 | 6.5 | 5.1 | 5.5 | 7.6 | 6.6 | 6.5 | 6.4 | |
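The Total, Average, and Winner cells follow directly from the per-quiz scores. Below is a minimal Python sketch of that derivation for three columns copied from the matrix. The names `SCORES`, `totals`, and `winners` are illustrative rather than part of any original tooling, and the sketch assumes (as the Winner column suggests) that tied top scores share the win.

```python
# Scores copied from three columns of the matrix, in row order
# (Admission Control Options ... Secrets and Config Maps).
SCORES = {
    "GPT 5.5":      [9, 6, 8, 8, 10, 6, 10, 9, 9, 9],
    "Opus 4.8":     [9, 5, 8, 9, 10, 8, 10, 9, 8, 6],
    "MiniMax M2.7": [7, 8, 5, 6, 9, 6, 9, 7, 6, 8],
}


def totals(scores):
    """Return {model: (total, average rounded to one decimal)}."""
    return {m: (sum(v), round(sum(v) / len(v), 1)) for m, v in scores.items()}


def winners(scores, quiz_index):
    """Return every model holding the top score on one quiz (ties share the win)."""
    best = max(v[quiz_index] for v in scores.values())
    return [m for m, v in scores.items() if v[quiz_index] == best]


print(totals(SCORES))
print(winners(SCORES, 1))  # quiz index 1 is Kubelet API Rights
```

Restricted to these three columns, the Kubelet API Rights row yields MiniMax M2.7 as the only winner, which agrees with the matrix. Running the same functions over all 22 columns would reproduce the full Total, Average, and Winner entries.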
Score Distribution
By Model
| Model | 9-10 | 7-8 | 5-6 | 3-4 | 0-2 |
|---|---|---|---|---|---|
| GPT 5.5 | 6 | 2 | 2 | 0 | 0 |
| Claude Opus 4.8 | 5 | 3 | 2 | 0 | 0 |
| Kimi K2.6 | 4 | 5 | 1 | 0 | 0 |
| Kimi K2.7 Code | 4 | 5 | 1 | 0 | 0 |
| GLM-5.2 | 2 | 5 | 3 | 0 | 0 |
| Mistral M3.5 | 0 | 6 | 2 | 2 | 0 |
| Claude Opus 4.7 | 3 | 4 | 3 | 0 | 0 |
| Qwen 3.7 Plus | 3 | 5 | 2 | 0 | 0 |
| Claude Fable 5 | 6 | 2 | 0 | 0 | 2 |
| Claude Opus 4.6 | 2 | 4 | 4 | 0 | 0 |
| Claude Sonnet 4.6 | 2 | 6 | 2 | 0 | 0 |
| GPT 5.4 | 2 | 4 | 3 | 1 | 0 |
| Gemini 3 Flash | 2 | 7 | 1 | 0 | 0 |
| Qwen 3.6 Plus | 2 | 3 | 4 | 1 | 0 |
| MiniMax M2.7 | 2 | 4 | 4 | 0 | 0 |
| MiniMax M3 | 3 | 1 | 4 | 2 | 0 |
| MiniMax M2.5 | 0 | 3 | 2 | 4 | 1 |
| DeepSeek V3.2 | 0 | 2 | 6 | 2 | 0 |
| DeepSeek V4 Pro | 4 | 3 | 2 | 1 | 0 |
| DeepSeek V4 Flash | 2 | 3 | 4 | 1 | 0 |
| Qwen-35b (LOCAL) | 2 | 3 | 3 | 2 | 0 |
| Gemma 4 31B (LOCAL) | 1 | 2 | 7 | 0 | 0 |
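Each row of the By Model table is a count of scores per band. The sketch below shows one way to compute a row in Python. The `BANDS` list and the `band` and `distribution` helpers are hypothetical names, and the handling of half-point scores (7.5 is counted in the 7-8 band) is an assumption, since the source does not state how fractional scores are bucketed.

```python
from collections import Counter

# Band label paired with its inclusive lower bound, highest band first.
BANDS = [("9-10", 9), ("7-8", 7), ("5-6", 5), ("3-4", 3), ("0-2", 0)]


def band(score):
    """Map a 0-10 score to its band label (fractional scores fall into the band below)."""
    for label, lower in BANDS:
        if score >= lower:
            return label
    raise ValueError(f"score out of range: {score}")


def distribution(scores):
    """Return counts in table column order: 9-10, 7-8, 5-6, 3-4, 0-2."""
    counts = Counter(band(s) for s in scores)
    return [counts[label] for label, _ in BANDS]


# Columns copied from the Full Score Matrix.
FABLE_5 = [9, 0, 8, 9, 10, 0, 9, 9, 7, 9]
GEMMA_4_31B = [7.5, 5, 6, 5, 6, 6, 9, 6, 6, 7]

print(distribution(FABLE_5))
print(distribution(GEMMA_4_31B))
```

Both results line up with the corresponding rows of the table: six high scores and two zeros for Fable 5, and seven scores in the 5-6 band for Gemma 4 31B.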
By Question Difficulty
| Quiz | Avg Score | Spread (max-min) | Difficulty |
|---|---|---|---|
| Kubernetes PKI List | 9.0 | 4 | Easy |
| Pod Security Standards Levels | 9.0 | 3 | Easy |
| Admission Control Options | 7.8 | 5 | Moderate |
| Privileged vs APE | 7.1 | 3 | Moderate |
| RBAC Verbs | 7.1 | 4 | Moderate |
| Secrets and Config Maps | 7.0 | 3 | Moderate |
| Kubernetes Authentication | 6.6 | 5 | Moderate |
| Kubernetes Open Ports | 5.9 | 6 | Moderate |
| Kubernetes SSRF | 5.8 | 9 | Hard |
| Kubelet API Rights | 4.9 | 8 | Hard |
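The Avg Score and Spread columns can be recomputed from a single matrix row. The following Python sketch does this for two rows copied from the Full Score Matrix. The `quiz_stats` helper is an illustrative name, and the sketch deliberately does not assign a Difficulty label because the thresholds behind Easy, Moderate, and Hard are not stated in the source.

```python
def quiz_stats(scores):
    """Return (average rounded to one decimal, max-min spread) for one quiz row."""
    average = round(sum(scores) / len(scores), 1)
    spread = max(scores) - min(scores)
    return average, spread


# All 22 scores per quiz, in matrix column order.
KUBELET_API_RIGHTS = [6, 5, 6, 5, 3, 7, 6, 5, 7, 6, 0,
                      7, 7, 5, 8, 4, 2, 3, 4, 3, 4, 5]
PSS_LEVELS = [10, 10, 9, 10, 8, 9, 9, 9, 8, 9, 9,
              7, 9, 9, 9, 10, 7, 8, 10, 9, 10, 9]

print(quiz_stats(KUBELET_API_RIGHTS))
print(quiz_stats(PSS_LEVELS))
```

The two rows bracket the table: Kubelet API Rights has the lowest average with a wide spread, while Pod Security Standards Levels has a high average and a narrow spread.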
Model Profiles
openai/gpt-5.5 — 84/100 (8.4 avg)
- Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Privileged/APE (9), RBAC Verbs (9), Secrets/ConfigMaps (9)
- Weakest on: Kubelet API (6), SSRF (6)
- Profile: The clear quiz leader with 84/100, 8 points ahead of the best score at the time it was added (V4 Pro at 76). Six quiz wins (more than any other model) and zero last-place finishes. Six scores of 9-10 out of 10 quizzes. The strongest performance across the board, combining deep conceptual understanding with consistent accuracy. Shares the OpenAI family’s ability to identify trick questions (RBAC verb openness at 9, Secrets/ConfigMaps at 9) while also excelling at structured knowledge questions (PKI at 10, PSS at 10). Weaknesses are limited to attack-surface mapping (SSRF: 6) and kubelet-level operational detail (Kubelet API: 6), the same hard questions that challenge most models. A significant upgrade over GPT 5.4 (+17 points), the third-largest inter-generation improvement in the test set after DeepSeek V3.2 to V4 Pro (+21) and MiniMax M2.5 to M2.7 (+20).
anthropic/claude-opus-4.8 — 82/100 (8.2 avg)
- Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Open Ports (9), Privileged/APE (9)
- Weakest on: Kubelet API (5), Secrets/ConfigMaps (6)
- Profile: A significant step up from Opus 4.7 (+7 points), placing 2nd overall with 82/100, just 2 points behind GPT 5.5. Five scores of 9-10 and three of 7-8 show consistent high performance across the board. Led Open Ports outright when added (9, beating GPT 5.5’s 8), a lead now shared with Fable 5, and tied the then-best SSRF score (8, matching Gemini 3 Flash) before MiniMax M3 reached 9. The Privileged/APE severity calibration (9) is a dramatic improvement over the Anthropic family’s historical weakness: Opus 4.7 scored 6 with APE severity overstatement, while Opus 4.8 correctly conveys the gap. PKI List perfect score continues the Anthropic improvement trajectory (Opus 4.6: 9, Opus 4.7: 10, Opus 4.8: 10). Persistent weaknesses: Kubelet API (5) with the nodes/proxy exec verb error and invented subresources, and Secrets/ConfigMaps (6) with the ConfigMap encryption misconception that plagues the Anthropic family. That Secrets/ConfigMaps 6 also ties for the lowest score on the question.
moonshotai/kimi-k2.6 — 77/100 (7.7 avg)
- Strongest on: PKI List (10), Admission Control (9), PSS Levels (9), RBAC Verbs (9)
- Weakest on: SSRF (5), Authentication (7), Open Ports (7), Kubelet API (7), Privileged/APE (7), Secrets/ConfigMaps (7)
- Profile: Debuted in 2nd place with 77/100 (now 4th), 1 point ahead of DeepSeek V4 Pro (76) and 2 ahead of Opus 4.7 (75). Three shared wins (Admission Control, PKI List, RBAC Verbs) and zero last-place finishes. The RBAC Verbs score is notable: Kimi is the first non-OpenAI model to catch the custom verb trick, breaking the pattern that suggested it was an OpenAI-specific training signal. Five scores of exactly 7 suggest a model that reliably covers fundamentals but hits a ceiling on nuanced questions requiring deeper operational knowledge or trick-question awareness.
anthropic/claude-opus-4.7 — 75/100 (7.5 avg)
- Strongest on: PKI List (10), Admission Control (9), PSS Levels (9)
- Weakest on: Kubelet API (6), Privileged/APE (6), Secrets/ConfigMaps (6)
- Profile: Strong, consistent performer. Three scores of 9+ and no score below 6. Improved over Opus 4.6 in 5 areas (Open Ports, PKI, SSRF, PSS, RBAC) while regressing in 1 (Kubelet API). The PKI question shows a perfect understanding of Kubernetes CA architecture. Persistent weaknesses are the same Anthropic family traits: trick question blindspots (RBAC verb openness, ConfigMap encryption) and APE severity miscalibration, though APE calibration improved from “SERIOUS” to “Moderate.”
qwen/qwen3.7-plus — 75/100 (7.5 avg)
- Strongest on: Authentication (9), PKI List (9), PSS Levels (9), Admission Control (8), SSRF (8)
- Weakest on: Kubelet API (5), Open Ports (6)
- Profile: A solid improvement over Qwen 3.6 Plus (+7 points), tied for 6th with Opus 4.7 at 75/100. Three scores of 9 and two of 8 demonstrate strong knowledge across structured and conceptual questions. Takes the sole lead on Authentication (9/10), the first model to score above 8 on this trick question, correctly identifying that none of the built-in methods are suitable for production user authentication with well-articulated CRL/OCSP reasoning. Tied for the SSRF lead when added (8/10, matching Gemini 3 Flash and Opus 4.8), before MiniMax M3 reached 9, with good coverage of proxy subresources, aggregated API, and admission webhooks. The Qwen family’s PKI expertise continues with 9/10 (Qwen 3.6 Plus scored 10). Persistent weaknesses are kubelet-level operational detail (5/10, incorrectly mapping exec to pods/exec) and port binding accuracy (6/10, missing etcd 2381 and kube-proxy 10249). Zero wins on “easy” questions (PSS, PKI) but zero last-place finishes: a consistently strong mid-to-upper performer.
anthropic/claude-fable-5 — 70/100 (7.0 avg)
- Strongest on: PKI List (10), Admission Control (9), Open Ports (9), Privileged/APE (9), Secrets/ConfigMaps (9)
- Weakest on: Kubelet API (0), SSRF (0), RBAC Verbs (7)
- Profile: A polarised performer with the most extreme score distribution of any model tested: six scores of 9-10 alongside two scores of 0. The zeros are unprecedented: Fable 5 produced completely empty responses on Kubelet API and SSRF, appearing to trigger safety guardrails that refused to engage with offensive security topics (kubelet exploitation, SSRF attack vectors). When it does engage, performance is strong, with five co-wins across Admission Control, Open Ports, PKI List, Privileged/APE, and Secrets/ConfigMaps. The PKI perfect 10 continues the Anthropic family’s PKI strength (Opus 4.6: 9, Opus 4.7: 10, Opus 4.8: 10, Fable 5: 10). The Privileged/APE score of 9 matches Opus 4.8’s breakthrough in correctly calibrating APE severity. The total of 70/100 places it 13th, but this understates its capability on questions it answers: its average over the eight answered questions would be 8.75 (70/8), the highest effective score of any model. The safety refusal pattern is a significant concern for security assessment use cases.
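The gap between Fable 5's overall and effective averages is simple arithmetic, sketched below in Python. Treating a score of 0 as "no answer given" is an assumption taken from the profile's description of empty responses, and the variable names are illustrative.

```python
# Fable 5 column from the Full Score Matrix, in row order.
FABLE_5 = [9, 0, 8, 9, 10, 0, 9, 9, 7, 9]

# Assumption: a 0 marks an empty (refused) response rather than a wrong answer.
answered = [score for score in FABLE_5 if score > 0]

overall_average = sum(FABLE_5) / len(FABLE_5)
effective_average = sum(answered) / len(answered)

print(len(answered), overall_average, effective_average)
```

The same 70 points are spread over ten questions in the ranking table and over eight questions in the effective average, which is why the two figures differ so sharply.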
z-ai/glm-5.2 — 72/100 (7.2 avg)
- Strongest on: PKI List (10), PSS Levels (10), SSRF (8)
- Weakest on: Kubelet API (5), Open Ports (5)
- Profile: A solid debut at 72/100, placing 10th out of 22 models. Two perfect 10s on PKI List and PSS Levels demonstrate strong knowledge of well-documented Kubernetes concepts and tie for the top score on both. The SSRF score of 8 matches the previous three-way tie (Opus 4.8, Gemini 3 Flash, Qwen 3.7 Plus) for strong attack-surface coverage. Five scores of 7-8 across Admission Control, Authentication, RBAC Verbs, Secrets/ConfigMaps, and SSRF show consistent mid-to-upper performance. Weaknesses are concentrated on operational detail: Kubelet API (5, wrong exec verb mapping) and Open Ports (5, missing several ports and incorrect NodePort claim). The Authentication score of 7 reflects incorrectly endorsing X.509 for production use. No wins outside the two easiest questions, and its 6 on Privileged/APE ties for the lowest score there: a reliable mid-tier performer that handles conceptual questions well but struggles with the deeper operational knowledge that separates the top models.
anthropic/claude-opus-4.6 — 71/100 (7.1 avg)
- Strongest on: Admission Control (9), PKI List (9), Pod Security Standards (8)
- Weakest on: Open Ports (6), Privileged/APE (6), Secrets/ConfigMaps (6), SSRF (6)
- Profile: Strong, consistent mid-to-upper performer with no scores below 6, although its 6s on Privileged/APE and Secrets/ConfigMaps tie for the lowest score on those questions. Deep knowledge of Kubernetes security architecture. Unlike Sonnet 4.6, Opus 4.6 does not fabricate information: its errors are miscalibrations and missed nuances rather than invented facts. Shares the Anthropic family trait of struggling with trick questions (RBAC verb openness, ConfigMap encryption).
google/gemini-3-flash-preview — 74/100 (7.4 avg)
- Strongest on: SSRF (8), PKI (9), PSS Levels (9)
- Weakest on: Secrets/ConfigMaps (6)
- Profile: One of the most consistent performers, with every score between 6 and 9. Scored 8 on SSRF with wide vector coverage, the top score on that question until MiniMax M3 reached 9. Best breadth of knowledge across topics. One of only two models, with Opus 4.7, to identify pod image reference as an SSRF vector. Weakest on nuanced questions where common assumptions are wrong (ConfigMap encryption).
anthropic/claude-sonnet-4.6 — 73/100 (7.3 avg)
- Strongest on: Admission Control (9), PSS Levels (9), Authentication (8)
- Weakest on: SSRF (5)
- Profile: A consistent model that never scored below 5, with scores ranging from 5 to 9. Deep expertise on individual topics (detailed RBAC tables, good security rationales) but sometimes narrow in breadth. Tends to go deep on one aspect rather than covering the full surface area.
openai/gpt-5.4 — 67/100 (6.7 avg)
- Strongest on: RBAC Verbs (9), PKI (9), Secrets/ConfigMaps (8)
- Weakest on: SSRF (4), Authentication (5), Open Ports (5)
- Profile: A high-variance model, with scores ranging from 4 to 9. Excels on questions with trick elements or nuanced distinctions (recognised RBAC verbs are open-ended, correctly noted ConfigMap encryption is possible). Often too brief on straightforward knowledge questions, costing marks for lack of detail.
deepseek/deepseek-v4-pro — 76/100 (7.6 avg)
- Strongest on: PSS Levels (10), Privileged/APE (9), Secrets/ConfigMaps (9)
- Weakest on: Kubelet API (4), Open Ports (5)
- Profile: Fifth overall with three shared wins (PSS Levels at 10/10, Privileged/APE at 9, Secrets/ConfigMaps at 9); its Authentication score of 8 shared the lead until Qwen 3.7 Plus reached 9. Strong on conceptual understanding and security implications, and excels at explaining differences and trade-offs (Privileged/APE, Secrets/ConfigMaps). Critical error on kubelet exec authorization (claims CREATE instead of GET for nodes/proxy), and incorrect bind address mappings for several control plane components. Dramatic improvement over DeepSeek V3.2 (+21 points), the largest inter-generation improvement in the test set.
deepseek/deepseek-v4-flash — 66/100 (6.6 avg)
- Strongest on: PKI List (10), PSS Levels (9), Admission Control (8)
- Weakest on: Kubelet API (3), Open Ports (5), SSRF (5)
- Profile: A solid mid-tier performer that sits between V4 Pro (76) and V3.2 (55), closer to the pack than either sibling. Achieves a perfect 10 on PKI List (joining Opus 4.7 and Qwen) and a strong 9 on PSS Levels. Admission Control score of 8 matches V4 Pro. However, struggles significantly on kubelet-level knowledge (3/10 on Kubelet API — fabricates incorrect resource strings) and attack-surface mapping (SSRF: 5). Incorrectly recommends X.509 certificates for production authentication, missing the trick question. The ConfigMap encryption misconception persists from V3.2. Overall pattern: strong on structured knowledge questions, weak on operational detail and trick questions.
deepseek/deepseek-v3.2 — 55/100 (5.5 avg)
- Strongest on: PSS Levels (8), Privileged/APE (7)
- Weakest on: Kubelet API (3), Open Ports (4)
- Profile: Lower-tier performer, 21st of 22. Tends to include non-standard components in answers (Dashboard, Ingress controllers for stock kubeadm questions). Occasional factual errors (update includes patch, 4 CAs instead of 3). Reasonable on straightforward questions but struggles with precision.
qwen/qwen3.6-plus — 68/100 (6.8 avg)
- Strongest on: PKI List (10), PSS Levels (9), Authentication (8)
- Weakest on: SSRF (4), Kubelet API (5)
- Profile: Solid mid-tier performer with a perfect 10 on PKI List (one of nine models to reach 10). Strong on structured knowledge questions (PKI, PSS, Authentication) but weak on attack-oriented topics (SSRF, Kubelet). The SSRF answer contained a factual error about kube-proxy endpoints being disabled by default. Its Authentication score of 8 shared the lead with Sonnet 4.6 until Qwen 3.7 Plus reached 9; it correctly identified that no built-in method is suitable for production users. Never scored below 4, though its 6s on Privileged/APE and Secrets/ConfigMaps tie for the lowest score on those questions. Similar consistency pattern to the Anthropic models but without their trick-question blindspots on Authentication.
minimax/minimax-m2.7 — 71/100 (7.1 avg)
- Strongest on: PKI List (9), PSS Levels (9), Kubelet API (8), Secrets/ConfigMaps (8)
- Weakest on: Authentication (5), Open Ports (6), RBAC Verbs (6), SSRF (6)
- Profile: A dramatic improvement over M2.5 (+20 points). Tied with Opus 4.6 at 71/100 with an identical score distribution (2 scores of 9-10, 4 of 7-8, 4 of 5-6, none below 5). Strong on fundamental topics (PKI, PSS) and practical questions (Kubelet API, Secrets). Like most models, missed the RBAC verbs trick and recommended X.509 for production authentication. The second-largest single-generation improvement in the test set, behind DeepSeek V3.2 to V4 Pro (+21).
minimax/minimax-m3 — 65/100 (6.5 avg)
- Strongest on: SSRF (9, new sole leader), PKI List (9), Pod Security Standards (10)
- Weakest on: Open Ports (3, sole last place), Kubelet API (4)
- Profile: Scores 65/100, tying for 17th with Qwen3.6-35b-a3b (LOCAL). Shows strong conceptual understanding of security topics (SSRF 9, PSS Levels 10, PKI List 9) but struggles with detailed factual recall on ports and protocol specifics (Open Ports 3, Kubelet API 4). The 14-point improvement over MiniMax M2.5 demonstrates clear advancement, though it scores below sibling M2.7 (71).
minimax/minimax-m2.5 — 51/100 (5.1 avg)
- Strongest on: Privileged/APE (8), PSS Levels (7), PKI (7)
- Weakest on: Kubelet API (2), SSRF (3), Authentication (4)
- Profile: Weakest overall, with the lowest or joint-lowest score on 4 of 10 questions (Admission Control, Authentication, PSS Levels, Secrets/ConfigMaps). Produced a strong answer on Privileged vs APE (8/10, best severity calibration). Tends to make critical errors on harder questions: Ingress as SSRF vector, cert revocation support, RBAC inspecting pod specs. Verbose answers don’t compensate for accuracy gaps.
google/gemma-4-31b (LOCAL) — 64/100 (6.4 avg)
- Strongest on: PSS Levels (9), Admission Control (7.5), Secrets/ConfigMaps (7)
- Weakest on: Kubelet API (5), Open Ports (5), PKI List (6)
- Profile: The second local model tested, a 31B-parameter model running on LM Studio. Scores 64/100, placing 19th out of 22 models. Characterised by a flat mid-range distribution: seven of ten quizzes scored 5 or 6, giving a narrow score range and few standout results. The PSS Levels score of 9/10 is the sole strong result, showing confident knowledge of well-documented concepts. Notable errors include calling the service account signer a CA (PKI), stating that 10257/10259 bind to all interfaces (they are localhost-only), overstating allowPrivilegeEscalation severity, and missing arbitrary RBAC verbs. Zero wins, and its 6s on PKI List and Privileged/APE tie for the lowest score on those questions: a consistently average-to-below-average performer.
qwen/qwen3.6-35b-a3b (LOCAL) — 65/100 (6.5 avg)
- Strongest on: PSS Levels (10), PKI List (9), Privileged/APE (8)
- Weakest on: Kubelet API (4), Open Ports (4), Authentication (5), SSRF (5)
- Profile: The first local model tested, a 35B-parameter MoE model running on LM Studio. Scores 65/100, tied for 17th out of 22 models. Two scores of 9-10 show strong knowledge on well-defined concepts (PSS levels, PKI architecture), but struggles with operational depth (kubelet webhook auth, port binding details) and trick questions (authentication production suitability, RBAC verb openness). The pattern suggests the model has good conceptual training but lacks the nuanced operational knowledge that larger cloud models demonstrate. Notable: zero hallucinations; errors are omissions and misunderstandings rather than fabricated information.
mistralai/mistral-medium-3-5 — 63/100 (6.3 avg)
- Strongest on: PKI List (8), Admission Control (8), PSS Levels (8), Privileged/APE (8)
- Weakest on: Kubelet API (3), Open Ports (4)
- Profile: Ranks 20th out of 22 models with 63/100, below average but comfortably above the bottom two (V3.2 at 55, M2.5 at 51). The score distribution is notably flat: six scores of 7-8, two of 5, and two of 3-4, with zero scores above 8. Strong on conceptual questions (Admission Control 8, PSS 8, PKI 8, Privileged/APE 8) but weak on technical detail: Kubelet API Rights (3) features an entirely wrong RBAC mapping (using pod resources instead of node subresources), and Open Ports (4) includes outdated port numbers and wrong interface bindings. The Authentication score of 5 reflects the common X.509 production recommendation misconception. Zero wins, and its 5 on RBAC Verbs ties for the lowest score on that question: a consistently below-average performer that handles well-documented concepts adequately but lacks the operational depth that separates mid-tier from top-tier models.
Cross-Cutting Findings
1. Breadth vs Depth
Questions asking about attack surfaces or comprehensive lists reward breadth (SSRF, Open Ports, Admission Control). Models that go deep on one aspect but miss others score lower. Gemini consistently showed the broadest knowledge.
2. Trick Questions Reveal Understanding
Three questions had trick elements:
- RBAC Verbs: Verbs are an open set, not a fixed list. GPT 5.5, GPT 5.4, Kimi K2.6, and Kimi K2.7 Code scored 9 here; all other models missed the trick.
- Secrets/ConfigMaps: ConfigMaps CAN be encrypted at rest. Only the GPT models (5.4 and 5.5) are recorded as noting this.
- Authentication: None of the built-in methods are suitable for production — Qwen 3.7 Plus now holds the sole lead at 9/10, the first model to score above 8 by correctly identifying that none are suitable with well-articulated CRL/OCSP reasoning.
The RBAC verb trick was previously an OpenAI-only insight, but Kimi K2.6 broke this pattern by also scoring 9. GPT 5.5 scores 9 on both RBAC Verbs and Secrets/ConfigMaps.
3. “Standard Kubeadm” Qualifier Matters
Questions specifying “standard kubeadm” caught models that included non-standard components:
- MiniMax: Ingress controllers (SSRF), cert revocation (Authentication)
- DeepSeek: Dashboard, Ingress controllers (SSRF, Open Ports)
4. Severity Calibration Varies
On the Privileged vs APE question, most models overstated allowPrivilegeEscalation as a serious risk. MiniMax M2.5 — the overall weakest model — produced the best-calibrated answer on this specific question. M2.7 reverted to the majority position, overstating APE severity. Opus 4.8 breaks the Anthropic family pattern with a well-calibrated 9/10, joining GPT 5.5 and V4 Pro at the top — a dramatic improvement from Opus 4.7’s 6/10.
5. Qwen’s PKI Expertise Stands Out
Qwen 3.6 Plus scored a perfect 10/10 on PKI List, matching Opus 4.7 when it was added; nine models have now reached 10 on this question. Its Authentication score (8) was previously tied for first but now sits behind Qwen 3.7 Plus (9). The Qwen family demonstrates strong infrastructure security knowledge. Qwen 3.6 Plus’s weakness is attack-surface mapping (SSRF: 4, Kubelet API: 5).
6. Node-Level Knowledge Is Weakest
Across all models, knowledge about kubelet-level behaviour was poorest:
- No model identified pod probes as SSRF vectors
- Kubelet API authorization (node subresources) was poorly understood
- Pod image reference as SSRF only identified by Gemini and Opus 4.7
7. Easy Questions Don’t Differentiate
PSS Levels and PKI List had the highest average scores (9.0 each) with spreads of only 3 and 4 points. These questions confirm basic knowledge but don’t separate strong from weak models. However, perfect 10s on these questions (nine models on PKI List, six on PSS Levels) show that even “easy” questions can reward precision.
8. Factual Errors vs Missing Information
Missing information costs less than wrong information. Models that stayed accurate but brief (GPT on some questions) scored better than models that were comprehensive but included errors (DeepSeek’s “4 CAs”, MiniMax’s Ingress focus).
9. Anthropic Family Trait: Improving but Persistent Gaps — Plus a New Safety Concern
The five Anthropic models (Fable 5, Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6) share blindspots on RBAC verb openness and ConfigMap encryption eligibility. However, Opus 4.8 and Fable 5 both break the APE severity overstatement pattern with correctly calibrated 9/10 scores. Fable 5 introduces a new concern: safety guardrails that produce completely empty responses on offensive security topics (Kubelet API: 0, SSRF: 0). This is the first model to score 0 on any quiz, and it happens twice. When Fable 5 does engage, it performs strongly (six scores of 9-10), but the safety refusals represent a significant limitation for security assessment use cases.
10. MiniMax M3 Sets New SSRF High and Continues Family Trajectory
MiniMax M3 takes the sole lead on SSRF at 9/10, surpassing the previous three-way tie at 8/10 (Opus 4.8, Gemini 3 Flash, Qwen 3.7 Plus). With comprehensive coverage of API server proxy, admission webhooks, and CVE references, M3 demonstrates the strongest SSRF attack-surface knowledge of any model. The MiniMax family trajectory (M2.5: 51, M2.7: 71, M3: 65) is interesting — M3 actually scores below M2.7 despite being the newer model, suggesting different training trade-offs rather than a simple linear improvement.
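The generation-over-generation comparisons quoted throughout the profiles reduce to differences between totals in the ranking table. A short Python sketch of that comparison follows. The `TOTALS` and `UPGRADES` names are illustrative, and the predecessor/successor pairings are an assumption drawn from the comparisons the profiles themselves make.

```python
# Totals copied from the Overall Rankings table.
TOTALS = {
    "GPT 5.4": 67, "GPT 5.5": 84,
    "DeepSeek V3.2": 55, "DeepSeek V4 Pro": 76,
    "MiniMax M2.5": 51, "MiniMax M2.7": 71, "MiniMax M3": 65,
    "Opus 4.7": 75, "Opus 4.8": 82,
    "Qwen 3.6 Plus": 68, "Qwen 3.7 Plus": 75,
}

# Assumed (older, newer) pairings, following the profiles' own comparisons.
UPGRADES = [
    ("GPT 5.4", "GPT 5.5"),
    ("DeepSeek V3.2", "DeepSeek V4 Pro"),
    ("MiniMax M2.5", "MiniMax M2.7"),
    ("MiniMax M2.7", "MiniMax M3"),
    ("Opus 4.7", "Opus 4.8"),
    ("Qwen 3.6 Plus", "Qwen 3.7 Plus"),
]

# Point change for each upgrade; a negative value is a regression.
deltas = {f"{old} -> {new}": TOTALS[new] - TOTALS[old] for old, new in UPGRADES}
largest = max(deltas, key=deltas.get)

for pair, change in sorted(deltas.items(), key=lambda item: -item[1]):
    print(f"{pair}: {change:+d}")
```

Under these pairings, MiniMax M2.7 to M3 is the only step that loses points, which is the non-linear family trajectory described above.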