Quiz Test Summary — Sixteen-Model Comparison
Original date: 2026-03-09 (scored 2026-03-10) | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31
Models: 16 models tested across 10 Kubernetes security knowledge questions
Models Tested
| Model | Provider | Short Name | Tested |
|---|---|---|---|
| anthropic/claude-opus-4.8 | Anthropic | Opus 4.8 | 2026-05-31 |
| anthropic/claude-opus-4.7 | Anthropic | Opus 4.7 | 2026-04-20 |
| anthropic/claude-opus-4.6 | Anthropic | Opus 4.6 | 2026-03-25 |
| anthropic/claude-sonnet-4.6 | Anthropic | Sonnet 4.6 | 2026-03-09 |
| openai/gpt-5.5 | OpenAI | GPT 5.5 | 2026-04-25 |
| openai/gpt-5.4 | OpenAI | GPT 5.4 | 2026-03-09 |
| google/gemini-3-flash-preview | Google | Gemini 3 Flash | 2026-03-09 |
| qwen/qwen3.6-plus | Qwen | Qwen 3.6 Plus | 2026-04-20 |
| minimax/minimax-m2.5 | MiniMax | MiniMax M2.5 | 2026-03-09 |
| minimax/minimax-m2.7 | MiniMax | MiniMax M2.7 | 2026-03-28 |
| deepseek/deepseek-v3.2 | DeepSeek | DeepSeek V3.2 | 2026-03-09 |
| deepseek/deepseek-v4-pro | DeepSeek | DeepSeek V4 Pro | 2026-04-24 |
| deepseek/deepseek-v4-flash | DeepSeek | DeepSeek V4 Flash | 2026-04-24 |
| moonshotai/kimi-k2.6 | Moonshot AI | Kimi K2.6 | 2026-04-26 |
| qwen/qwen3.6-35b-a3b | Qwen (Local) | Qwen-35b (LOCAL) | 2026-05-03 |
| google/gemma-4-31b | Google (Local) | Gemma 4 31B (LOCAL) | 2026-05-03 |
Overall Rankings
| Rank | Model | Total (out of 100) | Average | Wins | Last Place |
|---|---|---|---|---|---|
| 1 | openai/gpt-5.5 | 84 | 8.4 | 7 | 0 |
| 2 | anthropic/claude-opus-4.8 | 82 | 8.2 | 7 | 1 |
| 3 | moonshotai/kimi-k2.6 | 77 | 7.7 | 3 | 0 |
| 4 | deepseek/deepseek-v4-pro | 76 | 7.6 | 4 | 0 |
| 5 | anthropic/claude-opus-4.7 | 75 | 7.5 | 2 | 2 |
| 6 | google/gemini-3-flash-preview | 74 | 7.4 | 1 | 1 |
| 7 | anthropic/claude-sonnet-4.6 | 73 | 7.3 | 2 | 0 |
| 8 | anthropic/claude-opus-4.6 | 71 | 7.1 | 1 | 2 |
| 8 | minimax/minimax-m2.7 | 71 | 7.1 | 1 | 1 |
| 10 | qwen/qwen3.6-plus | 68 | 6.8 | 2 | 2 |
| 11 | openai/gpt-5.4 | 67 | 6.7 | 1 | 2 |
| 12 | deepseek/deepseek-v4-flash | 66 | 6.6 | 1 | 1 |
| 13 | qwen/qwen3.6-35b-a3b (LOCAL) | 65 | 6.5 | 1 | 2 |
| 14 | google/gemma-4-31b (LOCAL) | 64 | 6.4 | 0 | 3 |
| 15 | deepseek/deepseek-v3.2 | 55 | 5.5 | 0 | 4 |
| 16 | minimax/minimax-m2.5 | 51 | 5.1 | 0 | 8 |
Full Score Matrix
| Quiz | GPT 5.5 | Opus 4.8 | Kimi | Opus 4.7 | Opus 4.6 | Sonnet 4.6 | GPT 5.4 | Gemini 3 Flash | Qwen 3.6 Plus | MiniMax M2.7 | MiniMax M2.5 | DeepSeek V3.2 | DeepSeek V4 Pro | DeepSeek V4 Flash | Qwen-35b | Gemma 4 31B | Winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Admission Control Options | 9 | 9 | 9 | 9 | 9 | 9 | 7 | 7 | 7 | 7 | 4 | 5 | 8 | 8 | 7 | 7.5 | GPT 5.5/Opus 4.8/Kimi/Opus 4.7/Opus 4.6/Sonnet |
| Kubelet API Rights | 6 | 5 | 7 | 6 | 7 | 6 | 7 | 7 | 5 | 8 | 2 | 3 | 4 | 3 | 4 | 5 | MiniMax M2.7 |
| Kubernetes Authentication | 8 | 8 | 7 | 7 | 7 | 8 | 5 | 7 | 8 | 5 | 4 | 5 | 8 | 6 | 5 | 6 | GPT 5.5/Opus 4.8/Sonnet/Qwen/V4 Pro |
| Kubernetes Open Ports | 8 | 9 | 7 | 7 | 6 | 7 | 5 | 7 | 6 | 6 | 4 | 4 | 5 | 5 | 4 | 5 | Opus 4.8 |
| Kubernetes PKI List | 10 | 10 | 10 | 10 | 9 | 8 | 9 | 9 | 10 | 9 | 7 | 6 | 9 | 10 | 9 | 6 | GPT 5.5/Opus 4.8/Kimi/Opus 4.7/Qwen/V4 Flash |
| Kubernetes SSRF | 6 | 8 | 5 | 7 | 6 | 5 | 4 | 8 | 4 | 6 | 3 | 5 | 6 | 5 | 5 | 6 | Gemini 3 Flash/Opus 4.8 |
| Pod Security Standards Levels | 10 | 10 | 9 | 9 | 8 | 9 | 7 | 9 | 9 | 9 | 7 | 8 | 10 | 9 | 10 | 9 | GPT 5.5/Opus 4.8/V4 Pro/Qwen-35b |
| Privileged vs APE | 9 | 9 | 7 | 6 | 6 | 7 | 6 | 7 | 6 | 7 | 8 | 7 | 9 | 6 | 8 | 6 | GPT 5.5/Opus 4.8/V4 Pro |
| RBAC Verbs | 9 | 8 | 9 | 8 | 7 | 7 | 9 | 7 | 7 | 6 | 6 | 6 | 8 | 7 | 6 | 6 | GPT 5.5/Kimi/GPT 5.4 |
| Secrets and Config Maps | 9 | 6 | 7 | 6 | 6 | 7 | 8 | 6 | 6 | 8 | 6 | 6 | 9 | 7 | 7 | 7 | GPT 5.5/V4 Pro |
| Total | 84 | 82 | 77 | 75 | 71 | 73 | 67 | 74 | 68 | 71 | 51 | 55 | 76 | 66 | 65 | 64 | |
| Average | 8.4 | 8.2 | 7.7 | 7.5 | 7.1 | 7.3 | 6.7 | 7.4 | 6.8 | 7.1 | 5.1 | 5.5 | 7.6 | 6.6 | 6.5 | 6.4 | |
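The Total, Average, and Winner entries can be re-derived from the matrix rows. The following is a minimal Python sketch, not the scoring tool used for this report: the `summarise` helper and the tie rule (every model sharing the top or bottom score on a quiz is credited with a win or a last place) are assumptions chosen to match the shared entries in the Winner column.

```python
from collections import Counter

MODELS = [
    "GPT 5.5", "Opus 4.8", "Kimi", "Opus 4.7", "Opus 4.6", "Sonnet 4.6",
    "GPT 5.4", "Gemini 3 Flash", "Qwen 3.6 Plus", "MiniMax M2.7",
    "MiniMax M2.5", "DeepSeek V3.2", "DeepSeek V4 Pro", "DeepSeek V4 Flash",
    "Qwen-35b", "Gemma 4 31B",
]

# One row per quiz, in the same column order as MODELS.
SCORES = {
    "Admission Control Options": [9, 9, 9, 9, 9, 9, 7, 7, 7, 7, 4, 5, 8, 8, 7, 7.5],
    "Kubelet API Rights": [6, 5, 7, 6, 7, 6, 7, 7, 5, 8, 2, 3, 4, 3, 4, 5],
    "Kubernetes Authentication": [8, 8, 7, 7, 7, 8, 5, 7, 8, 5, 4, 5, 8, 6, 5, 6],
    "Kubernetes Open Ports": [8, 9, 7, 7, 6, 7, 5, 7, 6, 6, 4, 4, 5, 5, 4, 5],
    "Kubernetes PKI List": [10, 10, 10, 10, 9, 8, 9, 9, 10, 9, 7, 6, 9, 10, 9, 6],
    "Kubernetes SSRF": [6, 8, 5, 7, 6, 5, 4, 8, 4, 6, 3, 5, 6, 5, 5, 6],
    "Pod Security Standards Levels": [10, 10, 9, 9, 8, 9, 7, 9, 9, 9, 7, 8, 10, 9, 10, 9],
    "Privileged vs APE": [9, 9, 7, 6, 6, 7, 6, 7, 6, 7, 8, 7, 9, 6, 8, 6],
    "RBAC Verbs": [9, 8, 9, 8, 7, 7, 9, 7, 7, 6, 6, 6, 8, 7, 6, 6],
    "Secrets and Config Maps": [9, 6, 7, 6, 6, 7, 8, 6, 6, 8, 6, 6, 9, 7, 7, 7],
}


def summarise(scores, models):
    """Return per-model totals, win counts, and last-place counts."""
    totals = dict.fromkeys(models, 0)
    wins, lasts = Counter(), Counter()
    for row in scores.values():
        best, worst = max(row), min(row)
        for model, score in zip(models, row):
            totals[model] += score
            wins[model] += score == best    # assumption: ties count for every tied model
            lasts[model] += score == worst  # same tie rule for last place
    return totals, wins, lasts


totals, wins, lasts = summarise(SCORES, MODELS)
for model in sorted(MODELS, key=totals.get, reverse=True):
    average = totals[model] / len(SCORES)
    print(f"{model:18} total={totals[model]:<5} avg={average:.2f} "
          f"wins={wins[model]} last={lasts[model]}")
```

Under these assumptions the Gemma 4 31B column sums to 63.5 because of its 7.5 on Admission Control Options; the tables report this as 64.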
Score Distribution
By Model
| Model | 9-10 | 7-8 | 5-6 | 3-4 | 1-2 |
|---|---|---|---|---|---|
| GPT 5.5 | 6 | 2 | 2 | 0 | 0 |
| Claude Opus 4.8 | 5 | 3 | 2 | 0 | 0 |
| Kimi K2.6 | 4 | 5 | 1 | 0 | 0 |
| Claude Opus 4.7 | 3 | 4 | 3 | 0 | 0 |
| Claude Opus 4.6 | 2 | 4 | 4 | 0 | 0 |
| Claude Sonnet 4.6 | 2 | 6 | 2 | 0 | 0 |
| GPT 5.4 | 2 | 4 | 3 | 1 | 0 |
| Gemini 3 Flash | 2 | 7 | 1 | 0 | 0 |
| Qwen 3.6 Plus | 2 | 3 | 4 | 1 | 0 |
| MiniMax M2.7 | 2 | 4 | 4 | 0 | 0 |
| MiniMax M2.5 | 0 | 3 | 2 | 4 | 1 |
| DeepSeek V3.2 | 0 | 2 | 6 | 2 | 0 |
| DeepSeek V4 Pro | 4 | 3 | 2 | 1 | 0 |
| DeepSeek V4 Flash | 2 | 3 | 4 | 1 | 0 |
| Qwen-35b (LOCAL) | 2 | 3 | 3 | 2 | 0 |
| Gemma 4 31B (LOCAL) | 1 | 2 | 7 | 0 | 0 |
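The band counts above follow directly from each model's ten quiz scores. A short Python sketch of the bucketing is shown here; the `band` and `distribution` helpers are hypothetical names, and placing a fractional score such as 7.5 in the band whose lower bound it reaches is an assumption, since the report does not state how half-points are binned.

```python
# Band labels with their lower bounds, in table column order.
BANDS = (("9-10", 9), ("7-8", 7), ("5-6", 5), ("3-4", 3), ("1-2", 1))


def band(score):
    """Return the label of the first band whose lower bound the score reaches."""
    for label, floor in BANDS:
        if score >= floor:
            return label
    raise ValueError(f"score {score} is below the lowest band")


def distribution(scores):
    """Count how many scores fall into each band."""
    counts = {label: 0 for label, _ in BANDS}
    for score in scores:
        counts[band(score)] += 1
    return counts


# Per-quiz scores copied from the Full Score Matrix columns.
sonnet_46 = [9, 6, 8, 7, 8, 5, 9, 7, 7, 7]
gemma_4_31b = [7.5, 5, 6, 5, 6, 6, 9, 6, 6, 7]

print(distribution(sonnet_46))    # {'9-10': 2, '7-8': 6, '5-6': 2, '3-4': 0, '1-2': 0}
print(distribution(gemma_4_31b))  # {'9-10': 1, '7-8': 2, '5-6': 7, '3-4': 0, '1-2': 0}
```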
By Question Difficulty
| Quiz | Avg Score | Spread (max-min) | Difficulty |
|---|---|---|---|
| Pod Security Standards Levels | 8.9 | 3 | Easy |
| Kubernetes PKI List | 8.8 | 4 | Easy |
| Admission Control Options | 7.6 | 5 | Moderate |
| RBAC Verbs | 7.3 | 3 | Moderate |
| Privileged vs APE | 7.1 | 3 | Moderate |
| Secrets and Config Maps | 6.9 | 3 | Moderate |
| Kubernetes Authentication | 6.5 | 4 | Moderate |
| Kubernetes Open Ports | 5.9 | 5 | Moderate |
| Kubernetes SSRF | 5.6 | 5 | Hard |
| Kubelet API Rights | 5.3 | 6 | Hard |
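The Avg Score and Spread columns are simple per-row statistics over the sixteen model scores. The sketch below recomputes them for two rows of the matrix; `difficulty_stats` is a hypothetical helper, and rounding the average to one decimal place is an assumption based on how the table is printed. The Easy/Moderate/Hard labels are not derived here because the report does not state their thresholds.

```python
def difficulty_stats(row):
    """Return (average rounded to one decimal, max-min spread) for one quiz row."""
    average = sum(row) / len(row)
    spread = max(row) - min(row)
    return round(average, 1), spread


# Rows copied from the Full Score Matrix (16 model scores each).
kubelet_api = [6, 5, 7, 6, 7, 6, 7, 7, 5, 8, 2, 3, 4, 3, 4, 5]
pss_levels = [10, 10, 9, 9, 8, 9, 7, 9, 9, 9, 7, 8, 10, 9, 10, 9]

print(difficulty_stats(kubelet_api))  # (5.3, 6)
print(difficulty_stats(pss_levels))   # (8.9, 3)
```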
Model Profiles
openai/gpt-5.5 — 84/100 (8.4 avg)
- Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Privileged/APE (9), RBAC Verbs (9), Secrets/ConfigMaps (9)
- Weakest on: Kubelet API (6), SSRF (6)
- Profile: The clear quiz leader with 84/100, 8 points ahead of the previous best (V4 Pro at 76). Seven quiz wins (tied with Opus 4.8 for the most) and zero last-place finishes. Six scores of 9-10 out of 10 quizzes. The strongest performance across the board, combining deep conceptual understanding with consistent accuracy. Shares the OpenAI family’s ability to identify trick questions (RBAC verb openness at 9, Secrets/ConfigMaps at 9) while also excelling at structured knowledge questions (PKI at 10, PSS at 10). Weaknesses are limited to attack-surface mapping (SSRF: 6) and kubelet-level operational detail (Kubelet API: 6), the same hard questions that challenge most models. A significant upgrade over GPT 5.4 (+17 points), the third-largest inter-generation improvement in the test set after DeepSeek V4 Pro (+21) and MiniMax M2.7 (+20).
anthropic/claude-opus-4.8 — 82/100 (8.2 avg)
- Strongest on: PKI List (10), PSS Levels (10), Admission Control (9), Open Ports (9), Privileged/APE (9)
- Weakest on: Kubelet API (5), Secrets/ConfigMaps (6)
- Profile: A significant step up from Opus 4.7 (+7 points), placing 2nd overall with 82/100, just 2 points behind GPT 5.5. Five scores of 9-10 and three of 7-8 show consistently high performance. Takes the sole lead on Open Ports (9, beating GPT 5.5’s 8) and ties for the lead on SSRF (8, matching Gemini 3 Flash). The Privileged/APE severity calibration (9) is a dramatic improvement over the Anthropic family’s historical weakness: Opus 4.7 scored 6 with APE severity overstatement, while Opus 4.8 correctly conveys the gap. The perfect PKI List score continues the Anthropic improvement trajectory (Opus 4.6: 9, Opus 4.7: 10, Opus 4.8: 10). Persistent weaknesses: Kubelet API (5) with the nodes/proxy exec verb error and invented subresources, and Secrets/ConfigMaps (6) with the ConfigMap encryption misconception that plagues the Anthropic family. Its only last-place finish is that Secrets/ConfigMaps score, a 6 shared with six other models.
moonshotai/kimi-k2.6 — 77/100 (7.7 avg)
- Strongest on: PKI List (10), Admission Control (9), PSS Levels (9), RBAC Verbs (9)
- Weakest on: SSRF (5), Authentication (7), Open Ports (7), Kubelet API (7), Privileged/APE (7), Secrets/ConfigMaps (7)
- Profile: Strong debut at 2nd place (now 3rd after the addition of Opus 4.8) with 77/100, 1 point ahead of DeepSeek V4 Pro (76) and 2 ahead of Opus 4.7 (75). Three shared wins (Admission Control, PKI List, RBAC Verbs) and zero last-place finishes. The RBAC Verbs score is notable: Kimi is the first non-OpenAI model to catch the custom verb trick, breaking the pattern that suggested it was an OpenAI-specific training signal. Five scores of exactly 7 suggest a model that reliably covers fundamentals but hits a ceiling on nuanced questions requiring deeper operational knowledge or trick-question awareness.
anthropic/claude-opus-4.7 — 75/100 (7.5 avg)
- Strongest on: PKI List (10), Admission Control (9), PSS Levels (9)
- Weakest on: Kubelet API (6), Privileged/APE (6), Secrets/ConfigMaps (6)
- Profile: Strong, consistent performer. Three scores of 9+ and no score below 6. Improved over Opus 4.6 in 5 areas (Open Ports, PKI, SSRF, PSS, RBAC) while regressing in 1 (Kubelet API). The PKI question shows a perfect understanding of Kubernetes CA architecture. Persistent weaknesses are the same Anthropic family traits: trick question blindspots (RBAC verb openness, ConfigMap encryption) and APE severity miscalibration, though APE calibration improved from “SERIOUS” to “Moderate.”
anthropic/claude-opus-4.6 — 71/100 (7.1 avg)
- Strongest on: Admission Control (9), PKI List (9), Pod Security Standards (8)
- Weakest on: Open Ports (6), Privileged/APE (6), Secrets/ConfigMaps (6), SSRF (6)
- Profile: Strong, consistent mid-to-upper performer with no scores below 6, although that 6 was the shared lowest score on Privileged/APE and Secrets/ConfigMaps. Deep knowledge of Kubernetes security architecture. Unlike Sonnet 4.6, Opus 4.6 does not fabricate information: its errors are miscalibrations and missed nuances rather than invented facts. Shares the Anthropic family trait of struggling with trick questions (RBAC verb openness, ConfigMap encryption).
google/gemini-3-flash-preview — 74/100 (7.4 avg)
- Strongest on: SSRF (8), PKI (9), PSS Levels (9)
- Weakest on: Secrets/ConfigMaps (6)
- Profile: Most consistent high performer. Led the SSRF question (8, later matched by Opus 4.8) with the widest vector coverage. Best breadth of knowledge across topics. One of only two models (with Opus 4.7) to identify pod image reference as an SSRF vector. Weakest on nuanced questions where common assumptions are wrong (ConfigMap encryption).
anthropic/claude-sonnet-4.6 — 73/100 (7.3 avg)
- Strongest on: Admission Control (9), PSS Levels (9), Authentication (8)
- Weakest on: SSRF (5)
- Profile: One of the most consistent models: never scored below 5, with a narrow score range (5 to 9). Deep expertise on individual topics (detailed RBAC tables, good security rationales) but sometimes narrow in breadth. Tends to go deep on one aspect rather than covering the full surface area.
openai/gpt-5.4 — 67/100 (6.7 avg)
- Strongest on: RBAC Verbs (9), PKI (9), Secrets/ConfigMaps (8)
- Weakest on: SSRF (4), Authentication (5), Open Ports (5)
- Profile: High-variance model, with scores ranging from 4 to 9. Excels on questions with trick elements or nuanced distinctions (recognised RBAC verbs are open-ended, correctly noted ConfigMap encryption is possible). Often too brief on straightforward knowledge questions, costing marks for lack of detail.
deepseek/deepseek-v4-pro — 76/100 (7.6 avg)
- Strongest on: PSS Levels (10), Privileged/APE (9), Secrets/ConfigMaps (9)
- Weakest on: Kubelet API (4), Open Ports (5)
- Profile: Fourth place with four shared wins (PSS Levels at 10/10, Privileged/APE at 9, Secrets/ConfigMaps at 9, Authentication at 8). Strong on conceptual understanding and security implications: excels at explaining differences and trade-offs (Privileged/APE, Secrets/ConfigMaps). Critical error on kubelet exec authorization (claims CREATE instead of GET for nodes/proxy), and incorrect bind address mappings for several control plane components. Dramatic improvement over DeepSeek V3.2 (+21 points), the largest inter-generation improvement in the test set.
deepseek/deepseek-v4-flash — 66/100 (6.6 avg)
- Strongest on: PKI List (10), PSS Levels (9), Admission Control (8)
- Weakest on: Kubelet API (3), Open Ports (5), SSRF (5)
- Profile: A solid mid-tier performer that sits between V4 Pro (76) and V3.2 (55), closer to the pack than either sibling. Achieves a perfect 10 on PKI List (joining Opus 4.7 and Qwen) and a strong 9 on PSS Levels. Admission Control score of 8 matches V4 Pro. However, struggles significantly on kubelet-level knowledge (3/10 on Kubelet API — fabricates incorrect resource strings) and attack-surface mapping (SSRF: 5). Incorrectly recommends X.509 certificates for production authentication, missing the trick question. The ConfigMap encryption misconception persists from V3.2. Overall pattern: strong on structured knowledge questions, weak on operational detail and trick questions.
deepseek/deepseek-v3.2 — 55/100 (5.5 avg)
- Strongest on: PSS Levels (8), Privileged/APE (7)
- Weakest on: Kubelet API (3), Open Ports (4)
- Profile: Lower-tier performer, second-lowest overall. Tends to include non-standard components in answers (Dashboard, Ingress controllers for stock kubeadm questions). Occasional factual errors (update includes patch, 4 CAs instead of 3). Reasonable on straightforward questions but struggles with precision.
qwen/qwen3.6-plus — 68/100 (6.8 avg)
- Strongest on: PKI List (10), PSS Levels (9), Authentication (8)
- Weakest on: SSRF (4), Kubelet API (5)
- Profile: Solid mid-tier performer with a perfect 10 on PKI List (one of six models to achieve it). Strong on structured knowledge questions (PKI, PSS, Authentication) but weak on attack-oriented topics (SSRF, Kubelet). The SSRF answer contained a factual error about kube-proxy endpoints being disabled by default. Shared the Authentication win (8) with GPT 5.5, Opus 4.8, Sonnet 4.6, and V4 Pro, correctly identifying that no built-in method is suitable for production users. Shared last place on Privileged/APE and Secrets/ConfigMaps (6 on both). Similar consistency pattern to the Anthropic models but without their trick-question blindspots on Authentication.
minimax/minimax-m2.7 — 71/100 (7.1 avg)
- Strongest on: PKI List (9), PSS Levels (9), Kubelet API (8), Secrets/ConfigMaps (8)
- Weakest on: Authentication (5), Open Ports (6), RBAC Verbs (6), SSRF (6)
- Profile: A dramatic improvement over M2.5 (+20 points). Tied with Opus 4.6 at 71/100 with an identical score distribution (2 scores of 9-10, 4 of 7-8, 4 of 5-6, none below 5). Strong on fundamental topics (PKI, PSS) and practical questions (Kubelet API, Secrets). Like most models, missed the RBAC verbs trick and recommended X.509 for production authentication. The second-largest single-generation improvement in the test set, behind DeepSeek V4 Pro’s +21 over V3.2.
minimax/minimax-m2.5 — 51/100 (5.1 avg)
- Strongest on: Privileged/APE (8), PSS Levels (7), PKI (7)
- Weakest on: Kubelet API (2), SSRF (3), Authentication (4)
- Profile: Weakest overall with the most last-place finishes (8 of 10). Produced a strong answer on Privileged vs APE (8/10, best severity calibration). Tends to make critical errors on harder questions — Ingress as SSRF vector, cert revocation support, RBAC inspecting pod specs. Verbose answers don’t compensate for accuracy gaps.
google/gemma-4-31b (LOCAL) — 64/100 (6.4 avg)
- Strongest on: PSS Levels (9), Admission Control (7.5), Secrets/ConfigMaps (7)
- Weakest on: Kubelet API (5), Open Ports (5), PKI List (6)
- Profile: The second local model tested, a 31B parameter model running on LM Studio. Scores 64/100, placing 14th out of 16 models. Characterised by a flat mid-range distribution: seven of ten quizzes scored 5 or 6, giving a narrow score range and few standout results. The PSS Levels score of 9/10 is the sole strong result, showing confident knowledge of well-documented concepts. Notable errors include calling the service account signer a CA (PKI), stating that 10257/10259 bind to all interfaces (they are localhost-only), overstating allowPrivilegeEscalation severity, and missing arbitrary RBAC verbs. Zero wins and three shared last-place finishes (PKI List, Privileged/APE, RBAC Verbs) make it a consistently average-to-below-average performer.
qwen/qwen3.6-35b-a3b (LOCAL) — 65/100 (6.5 avg)
- Strongest on: PSS Levels (10), PKI List (9), Privileged/APE (8)
- Weakest on: Kubelet API (4), Open Ports (4), Authentication (5), SSRF (5)
- Profile: The first local model tested, a 35B-parameter MoE model running on LM Studio. Scores 65/100, placing 13th out of 16 models. Two scores of 9-10 show strong knowledge on well-defined concepts (PSS levels, PKI architecture), but the model struggles with operational depth (kubelet webhook auth, port binding details) and trick questions (authentication production suitability, RBAC verb openness). The pattern suggests the model has good conceptual training but lacks the nuanced operational knowledge that larger cloud models demonstrate. Notably, it produced zero hallucinations: its errors are omissions and misunderstandings rather than fabricated information.
Cross-Cutting Findings
1. Breadth vs Depth
Questions asking about attack surfaces or comprehensive lists reward breadth (SSRF, Open Ports, Admission Control). Models that go deep on one aspect but miss others score lower. Gemini consistently showed the broadest knowledge.
2. Trick Questions Reveal Understanding
Two questions had trick elements:
- RBAC Verbs: Verbs are an open set, not a fixed list. GPT 5.5, GPT 5.4, and Kimi K2.6 recognised this; all other models missed it.
- Secrets/ConfigMaps: ConfigMaps CAN be encrypted at rest. Only the GPT models noted this.
The RBAC verb trick was previously an OpenAI-only insight, but Kimi K2.6 broke this pattern by also scoring 9. GPT 5.5 scores 9 on both RBAC Verbs and Secrets/ConfigMaps.
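Both trick elements can be illustrated with ordinary Kubernetes manifests. The Python sketch below builds two as plain dictionaries and prints them as JSON: a ClusterRole whose rules use verbs outside the familiar get/list/watch/create/update/patch/delete set, and an API server EncryptionConfiguration that lists `configmaps` next to `secrets`. The object name, key name, and key placeholder are hypothetical, and the sketch does not reproduce the quiz's marking criteria.

```python
import json

# RBAC verbs are free-form strings: besides the common CRUD-style verbs,
# Kubernetes itself uses special verbs such as impersonate, bind, and escalate.
cluster_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "example-special-verbs"},  # hypothetical name
    "rules": [
        {"apiGroups": [""], "resources": ["users"], "verbs": ["impersonate"]},
        {
            "apiGroups": ["rbac.authorization.k8s.io"],
            "resources": ["clusterroles"],
            "verbs": ["bind", "escalate"],
        },
    ],
}

# Encryption at rest is configured per resource type, so ConfigMaps can be
# listed alongside Secrets in the same provider block.
encryption_config = {
    "apiVersion": "apiserver.config.k8s.io/v1",
    "kind": "EncryptionConfiguration",
    "resources": [
        {
            "resources": ["secrets", "configmaps"],
            "providers": [
                # Placeholder key material; a real key must be generated separately.
                {"aescbc": {"keys": [{"name": "key1", "secret": "<base64-encoded 32-byte key>"}]}},
                {"identity": {}},
            ],
        }
    ],
}

print(json.dumps(cluster_role, indent=2))
print(json.dumps(encryption_config, indent=2))
```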
3. “Standard Kubeadm” Qualifier Matters
Questions specifying “standard kubeadm” caught models that included non-standard components:
- MiniMax: Ingress controllers (SSRF), cert revocation (Authentication)
- DeepSeek: Dashboard, Ingress controllers (SSRF, Open Ports)
4. Severity Calibration Varies
On the Privileged vs APE question, most models overstated allowPrivilegeEscalation as a serious risk. MiniMax M2.5, the weakest model overall, produced the best-calibrated answer of the original test group on this specific question (8/10). M2.7 reverted to the majority position, overstating APE severity. Opus 4.8 breaks the Anthropic family pattern with a well-calibrated 9/10, joining GPT 5.5 and V4 Pro at the top, a dramatic improvement from Opus 4.7’s 6/10.
5. Qwen’s PKI Expertise Stands Out
Qwen 3.6 Plus scored a perfect 10/10 on PKI List, one of six models to do so (with GPT 5.5, Opus 4.8, Kimi K2.6, Opus 4.7, and DeepSeek V4 Flash). Combined with its shared Authentication win (8), Qwen demonstrates strong infrastructure security knowledge. Its weakness is attack-surface mapping (SSRF: 4, Kubelet API: 5).
6. Node-Level Knowledge Is Weakest
Across all models, knowledge about kubelet-level behaviour was poorest:
- No model identified pod probes as SSRF vectors
- Kubelet API authorization (node subresources) was poorly understood
- Pod image reference as SSRF only identified by Gemini and Opus 4.7
7. Easy Questions Don’t Differentiate
PSS Levels and PKI List had the highest average scores (8.9 and 8.8), and PSS Levels had one of the lowest spreads. These questions confirm basic knowledge but don’t separate strong from weak models. However, the perfect 10s on these questions (six models on PKI List, four on PSS Levels) show that even “easy” questions can reward precision.
8. Factual Errors vs Missing Information
Missing information costs less than wrong information. Models that stayed accurate but brief (GPT on some questions) scored better than models that were comprehensive but included errors (DeepSeek’s “4 CAs”, MiniMax’s Ingress focus).
9. Anthropic Family Trait: Improving but Persistent Gaps
The four Anthropic models (Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6) share blindspots on RBAC verb openness and ConfigMap encryption eligibility. However, Opus 4.8 breaks the APE severity overstatement pattern with a correctly calibrated 9/10 (up from Opus 4.7’s 6/10). Opus 4.8 also takes the sole lead on Open Ports (9) and ties for the lead on SSRF (8), showing the family is improving on breadth questions while the trick-question gaps persist.