Quiz Report Card: Privileged and Allow Privilege Escalation

Date: 2026-03-09 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31 | Qwen 3.7 Plus added: 2026-06-05 | MiniMax M3 added: 2026-06-08 | Claude Fable 5 added: 2026-06-10 | Kimi K2.7 Code added: 2026-06-16 | GLM-5.2 added: 2026-06-17 | Mistral Medium 3.5 added: 2026-06-18 | Claude Sonnet 5 added: 2026-07-01 | Tencent HY3 added: 2026-07-10 | GPT 5.6 Terra added: 2026-07-10 | GPT 5.6 Sol added: 2026-07-14 | Kimi K3 added: 2026-07-16 | Xiaomi MiMo v2.5 added: 2026-07-21

Question: In the context of a Kubernetes pod manifest, explain the difference between the privileged setting and the allowprivilegeescalation setting, mentioning whether each setting presents a serious security concern.

Reference Answer

These two settings are very different despite having similar names:

privileged — Serious security concern

Removes most of the significant security and isolation protections
A privileged container can break out to the underlying host
Grants all Linux capabilities, access to host devices, bypasses namespace/cgroup isolation
Essentially makes the container equivalent to a root process on the host

allowPrivilegeEscalation — Rarely a serious security concern

Controls whether a process can gain more privileges than its parent process
Directly controls the no_new_privs kernel flag
Affects setuid/setgid binary behaviour within the container
All other parts of container isolation still apply regardless of this setting
While it’s a reasonable hardening step, its security impact is limited compared to privileged
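
As a minimal sketch of the mapping described above, the following Python function models when a container's first process would start with no_new_privs set. It assumes the flag is applied only when allowPrivilegeEscalation is explicitly false, and it follows the Kubernetes API documentation's statement that allowPrivilegeEscalation is always true for privileged containers. The function name and the dict-based input are hypothetical, chosen for illustration; they are not part of any Kubernetes client API.

```python
from typing import Optional


def no_new_privs_requested(security_context: Optional[dict]) -> bool:
    """Model whether no_new_privs would be set for a container (illustrative only)."""
    sc = security_context or {}
    if sc.get("privileged") is True:
        # Documented behaviour: privileged containers always allow escalation.
        # (The same applies to containers with CAP_SYS_ADMIN, not modelled here.)
        return False
    # An unset field defaults to allowing escalation, so only an explicit
    # False results in the flag being set.
    return sc.get("allowPrivilegeEscalation") is False


if __name__ == "__main__":
    for sc in (None, {"allowPrivilegeEscalation": False}, {"privileged": True}):
        print(sc, "->", no_new_privs_requested(sc))
```

The sketch only decides whether the flag is set. It says nothing about namespaces, cgroups, seccomp, or capabilities, which is the point of the reference answer: those protections are unaffected by this setting.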

The key discriminator is whether models correctly convey the severity gap: privileged is critical (host breakout), while allowPrivilegeEscalation is rarely serious (container isolation still intact).

Scoring Criteria

Correct description of privileged: Removes container isolation, enables host access/breakout
Correct description of allowPrivilegeEscalation: Controls no_new_privs flag, affects setuid binaries
Severity calibration: privileged = critical; allowPrivilegeEscalation = rarely serious. Models that overstate APE’s risk miss the key point.
Technical accuracy: Correct mechanism descriptions, no factual errors
Clear differentiation: The answer should make the gap between the two settings obvious

Results Summary

Model	Score	Privileged	APE	Severity Gap	no_new_privs	Errors
anthropic/claude-opus-4.7	6/10	Correct	Correct	Overstated	Yes	APE severity inflated
minimax/minimax-m2.5	8/10	Correct	Correct	Well calibrated	Yes	Minor (runAsNonRoot default)
anthropic/claude-sonnet-4.6	7/10	Correct	Correct	Overstated	Yes	APE severity inflated
deepseek/deepseek-v3.2	7/10	Correct	Correct	Reasonable	Yes	None
google/gemini-3-flash-preview	7/10	Correct	Correct	Overstated	Yes	“Major foothold” language
openai/gpt-5.4	6/10	Correct	Correct	Good language	No	Missing no_new_privs
minimax/minimax-m2.7	7/10	Correct	Correct	Partial	No	Overstates APE; missing no_new_privs
qwen/qwen3.6-plus	6/10	Correct	Correct	Overstated	Yes	APE “Significant” overstatement
deepseek/deepseek-v4-pro	9/10	Correct	Correct	Well calibrated	Yes	None
deepseek/deepseek-v4-flash	6/10	Correct	Correct	Overstated	Yes	APE severity inflated
moonshotai/kimi-k2.6	7/10	Correct	Correct	Overstated	Yes	APE severity inflated
openai/gpt-5.5	9/10	Correct	Correct	Well calibrated	Yes	None
qwen/qwen3.6-35b-a3b (LOCAL)	8/10	Correct	Correct	Slight overclaim	Yes	Slight APE overstatement
anthropic/claude-opus-4.8	9/10	Correct	Correct	Well calibrated	Yes	None
google/gemma-4-31b (LOCAL)	6/10	Correct	Correct	Overstated	Yes	APE severity inflated
qwen/qwen3.7-plus	7/10	Correct	Correct	Overstated	Yes	APE severity inflated
minimax/minimax-m3	6/10	Correct	Correct	Overstated	Yes	APE severity inflated
anthropic/claude-fable-5	9/10	Correct	Correct	Well calibrated	Yes	None
moonshotai/kimi-k2.7-code	7/10	Correct	Correct	Overstated	Yes	APE severity inflated
z-ai/glm-5.2	6/10	Correct	Correct	Overstated	Yes	APE severity inflated, incorrect APE default
mistralai/mistral-medium-3-5	8/10	Correct	Correct	Reasonable	Yes	Slight APE overstatement
anthropic/claude-sonnet-5	10/10	Correct	Correct	Perfectly calibrated	Yes	None
tencent/hy3	6/10	Correct	Correct	Overstated	Yes	APE severity inflated, factual error about runAsNonRoot interaction
openai/gpt-5.6-terra	8/10	Correct	Correct	Reasonable	Yes	APE hedging
openai/gpt-5.6-sol	8/10	Correct	Correct	Reasonable	Yes	APE overstated
moonshotai/kimi-k3	8/10	Correct	Correct	Reasonable	Yes	Slight APE overstatement
xiaomi/mimo-v2.5	8/10	Correct	Correct	Reasonable	Yes	Slight APE overstatement
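
The score distribution in the table above can be tallied with a short script. This is only a convenience sketch: the scores are copied by hand from the Score column, and the variable names are illustrative.

```python
from collections import Counter
from statistics import mean

# Scores (out of 10) copied from the Results Summary table.
SCORES = {
    "anthropic/claude-opus-4.7": 6,
    "minimax/minimax-m2.5": 8,
    "anthropic/claude-sonnet-4.6": 7,
    "deepseek/deepseek-v3.2": 7,
    "google/gemini-3-flash-preview": 7,
    "openai/gpt-5.4": 6,
    "minimax/minimax-m2.7": 7,
    "qwen/qwen3.6-plus": 6,
    "deepseek/deepseek-v4-pro": 9,
    "deepseek/deepseek-v4-flash": 6,
    "moonshotai/kimi-k2.6": 7,
    "openai/gpt-5.5": 9,
    "qwen/qwen3.6-35b-a3b (LOCAL)": 8,
    "anthropic/claude-opus-4.8": 9,
    "google/gemma-4-31b (LOCAL)": 6,
    "qwen/qwen3.7-plus": 7,
    "minimax/minimax-m3": 6,
    "anthropic/claude-fable-5": 9,
    "moonshotai/kimi-k2.7-code": 7,
    "z-ai/glm-5.2": 6,
    "mistralai/mistral-medium-3-5": 8,
    "anthropic/claude-sonnet-5": 10,
    "tencent/hy3": 6,
    "openai/gpt-5.6-terra": 8,
    "openai/gpt-5.6-sol": 8,
    "moonshotai/kimi-k3": 8,
    "xiaomi/mimo-v2.5": 8,
}

# Count how many models landed on each score.
distribution = Counter(SCORES.values())
average = mean(SCORES.values())

if __name__ == "__main__":
    for score in sorted(distribution, reverse=True):
        print(f"{score}/10: {distribution[score]} model(s)")
    print(f"models: {len(SCORES)}, mean: {average:.2f}")
```

Across the 27 models the tally is one 10/10, four 9/10, seven 8/10, seven 7/10, and eight 6/10, for a mean of roughly 7.4.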

Detailed Analysis

anthropic/claude-opus-4.7 — 6/10

Strengths:

Correctly identifies privileged as extremely serious — removes container isolation
Correctly explains APE controls no_new_privs flag
Correct defaults (privileged: false, APE: true)

Weaknesses:

Severity calibration wrong — rates APE as “Moderate” severity. Should be “rarely a serious security concern.” Better than Opus 4.6’s “SERIOUS” but still overstated.
Framing of APE as “hardening weakness” and “aids in-container privilege escalation” overstates practical impact

Comparison vs Opus 4.6 (6): Same score. “Moderate” is directionally better than “SERIOUS” but still doesn’t capture the correct calibration.

Notable: The Anthropic family continues to overstate APE severity. Opus 4.7’s “Moderate” is closer to correct than Opus 4.6’s “SERIOUS” or Sonnet’s “Moderate-to-serious”, but still misses the “rarely serious” mark that MiniMax M2.5 nailed.

minimax/minimax-m2.5 — 8/10

Strengths:

Best severity calibration: “Very high” for privileged, “Moderate” for APE — closest to the scoring notes’ position
Correctly states APE “does not by itself give the container host-wide powers” — the key insight
Excellent interaction scenarios table showing all three combinations (privileged, non-privileged+escalation, non-privileged+no-escalation)
Correctly identifies no_new_privs kernel flag
Notes Baseline PSS allows APE while Restricted forces it to false
Practical guidance section with defence-in-depth example
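
The Pod Security Standards point and the three interaction scenarios noted above can be sketched as a small Python check. This is a deliberately narrow illustration that looks at only the two fields discussed in this report; the real standards include many other controls (capabilities, host namespaces, seccomp, and so on), and the function name is hypothetical.

```python
from typing import Optional


def strictest_profile_for(security_context: Optional[dict]) -> str:
    """Return the strictest PSS profile these two fields alone would satisfy."""
    sc = security_context or {}
    if sc.get("privileged") is True:
        # Privileged containers are disallowed by both Baseline and Restricted.
        return "privileged"
    if sc.get("allowPrivilegeEscalation") is not False:
        # Baseline tolerates escalation being allowed (or left unset).
        return "baseline"
    # Restricted requires allowPrivilegeEscalation to be explicitly false.
    return "restricted"


if __name__ == "__main__":
    scenarios = {
        "privileged": {"privileged": True},
        "non-privileged + escalation": {"allowPrivilegeEscalation": True},
        "non-privileged + no escalation": {"allowPrivilegeEscalation": False},
    }
    for label, sc in scenarios.items():
        print(f"{label}: {strictest_profile_for(sc)}")
```

The fact that Baseline rejects privileged but accepts allowPrivilegeEscalation: true mirrors the severity gap the scoring notes describe.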

Weaknesses:

Claims runAsNonRoot: true affects the default of allowPrivilegeEscalation — this is not accurate; the default is true regardless
Verbose — could be more concise

Notable: The best answer for this question until DeepSeek V4 Pro was added. The interaction scenarios table is the clearest way to show how the two settings relate, and the severity calibration was the most accurate among the models evaluated up to that point.

anthropic/claude-sonnet-4.6 — 7/10

Strengths:

Excellent technical depth on both settings
Correctly identifies no_new_privs flag and setuid/setgid mechanism
Good comparison table with scope, default, primary risk, and mechanism
Useful relationship note: privileged overrides allowPrivilegeEscalation
Practical recommended baseline YAML

Weaknesses:

Rates APE as “Moderate-to-serious security risk” — overstates the severity
Describes an attack chain in which, with APE enabled, an attacker “may then attempt container escape”, which is misleading. APE governs setuid behaviour within the container; gaining root inside a non-privileged container does not provide a direct path to container escape
Calls APE a “meaningful hardening step”: true in itself, but it overstates APE’s importance relative to the scoring notes’ “rarely a serious security concern”

Notable negative: The “may then attempt container escape” chain conflates two separate issues. Gaining root inside a non-privileged container via setuid is very different from escaping the container. All container isolation still applies.

deepseek/deepseek-v3.2 — 7/10

Strengths:

Good severity calibration: “HIGH RISK” for privileged, “MEDIUM RISK” for APE
Correctly describes both mechanisms
Notes that privileged overrides APE setting
Clean comparison table
Practical recommendations

Weaknesses:

APE section says “Can bypass certain restrictions even without privileged: true” — vague and could be misleading
Less detail than some other responses on the specific mechanisms
Doesn’t explicitly state that container isolation remains intact with APE

Notable: Solid, accurate answer with reasonable severity calibration. No significant errors.

google/gemini-3-flash-preview — 7/10

Strengths:

Good technical detail on both settings
Correctly identifies no_new_privs flag and default values
Good comparison table with analogy (building keys vs. picking a lock)
Correctly notes privileged forces APE to true
Mentions using granular capabilities instead of privileged

Weaknesses:

Rates APE as “Moderate to High” and calls it “a major foothold for Privilege Escalation attacks” — this significantly overstates the risk
The framing implies APE is a serious vulnerability by default, when in practice the container isolation boundaries still constrain what an attacker can do

Notable negative: Calling APE a “major foothold” is the strongest overstatement across all models. While disabling APE is good practice, the scoring notes are clear that it’s “rarely a serious security concern” because isolation remains intact.

openai/gpt-5.4 — 6/10

Strengths:

Best severity language: “less severe than privileged: true. Its impact depends on what binaries, capabilities, and users exist inside the container” — most nuanced acknowledgement that context matters
Correctly states APE “does not by itself make the container fully privileged”
Clear, accurate summary

Weaknesses:

Does not mention the no_new_privs kernel flag — this is the key technical mechanism and a significant omission
Very brief — three bullet points per setting with minimal elaboration
No YAML examples, no comparison table, no practical recommendations
Doesn’t explain the relationship between the two settings

Notable: The severity calibration language is among the best, but the omission of no_new_privs and the extreme brevity limit the score. Understanding what APE actually controls at the kernel level is essential for this question.

minimax/minimax-m2.7 — 7/10

Strengths:

Correctly explains fundamental difference (privileged = host-level, APE = process-level)
Identifies privileged as critical
Good YAML example
Clear comparison table

Weaknesses:

Overstates allowPrivilegeEscalation severity — calls it “High security risk” when it’s rarely a serious concern (all other container isolation still applies)
Doesn’t mention no_new_privs

Notable: Similar to Sonnet and DeepSeek V3.2 (both 7/10). MiniMax M2.5 scored 8/10 on this question with the best severity calibration — MiniMax M2.7 regressed to the majority position.

qwen/qwen3.6-plus — 6/10

Strengths:

Correctly identifies privileged as “Extremely Serious (Critical)” — removes container-host isolation
Correctly identifies no_new_privs kernel flag and explains its mechanism (setuid/setgid, capability inheritance)
Good comparison table covering scope, mechanism, default values, and impact
Correct defaults: privileged=false, APE=true
Good PSS context: Baseline allows APE, Restricted requires APE=false
Correctly notes privileged overrides APE benefit

Weaknesses:

APE severity overstated: Rates it as “Significant (but narrower scope)” — the scoring notes say “rarely a serious security concern.” While the “narrower scope” qualifier is correct, calling it “Significant” is still too strong.
States APE “could potentially escalate privileges within the container” and describes it as a “critical defense-in-depth control” — this framing overstates its practical importance
The attack chain linking APE to “other vulnerabilities (e.g., container breakout, misconfigured mounts)” conflates container-internal privilege escalation with container escape, which are separate issues

Notable: Follows the same pattern as the Anthropic-family and other frontier models: correctly explains both mechanisms but overstates APE severity. The “Significant” rating is better than Gemini 3 Flash’s “major foothold” or Sonnet’s “Moderate-to-serious” but still doesn’t reach MiniMax M2.5’s best-calibrated answer.

deepseek/deepseek-v4-pro — 9/10

Strengths:

Excellent and detailed explanation of both settings
Clear on privileged removing isolation as the critical security concern
Correct explanation of the no_new_privs flag for APE
Best severity calibration after MiniMax M2.5 — correctly rates APE as moderate/important hardening rather than overstating it

Weaknesses:

None significant

Notable: Takes the top score on this question, beating MiniMax M2.5’s previous best of 8/10. The severity gap between privileged (critical) and APE (moderate hardening) is correctly conveyed without the overstatement that most other models exhibit. A remarkable improvement over DeepSeek V3.2 (7/10).

deepseek/deepseek-v4-flash — 6/10

Strengths:

Correctly describes privileged as removing container isolation and enabling host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Good technical understanding of both mechanisms

Weaknesses:

Overstates allowPrivilegeEscalation severity — treats APE as a significant security concern when the correct calibration is “rarely a serious security concern” since all container isolation remains intact
Framing implies APE is a critical setting when in practice its impact is limited compared to privileged

Notable: Scores the same as Opus 4.7 and Qwen 3.6 Plus (6/10) — all three overstate APE severity. A significant regression from V4 Pro (9/10, which had the second-best severity calibration after MiniMax M2.5). The DeepSeek V4 models diverge sharply on this question: Pro understands the severity gap correctly, Flash does not.

openai/gpt-5.5 — 9/10

Strengths:

Excellent distinction between the two settings with clear, well-structured explanation
Correctly explains privileged removes security and isolation: “broad access to the host,” “most or all Linux capabilities,” “access to host devices,” “reduced isolation”
Accurate explanation of allowPrivilegeEscalation controlling the no_new_privs flag and affecting setuid/setgid binaries
Well-calibrated severity assessment: “Very serious” for privileged, “Security concern, but narrower” for APE — correctly conveys the gap without overstating APE
Good summary table clearly showing the severity difference
Correctly notes that privileged: true overrides allowPrivilegeEscalation: false
Practical YAML examples throughout

Weaknesses:

Could note more explicitly that APE is “rarely a serious concern” per the scoring criteria — “narrower concern” is close but slightly softer than the ideal calibration
Does not mention the Pod Security Standards context (Baseline allows APE, Restricted requires false)

Notable: A significant improvement over GPT 5.4 (6/10), which had good severity language but missed the no_new_privs mechanism entirely. GPT 5.5 covers both the technical mechanism and the severity gap correctly. Joins DeepSeek V4 Pro at 9/10, putting these two ahead of MiniMax M2.5 (8/10) as the top-scoring models on this question.

moonshotai/kimi-k2.6 — 7/10

Strengths:

Correctly describes privileged as removing container isolation and enabling host breakout
Good explanation of allowPrivilegeEscalation controlling the no_new_privs kernel flag
Correct defaults for both settings

Weaknesses:

Overstates APE severity — treats allowPrivilegeEscalation as a more significant security concern than it is. The correct calibration is “rarely a serious security concern” since all container isolation remains intact regardless of APE setting.

Notable: Matches Sonnet, DeepSeek V3.2, Gemini 3 Flash, and MiniMax M2.7 at 7/10. Follows the common pattern of correctly explaining both mechanisms but overstating APE severity. Only DeepSeek V4 Pro (9/10), GPT 5.5 (9/10), and MiniMax M2.5 (8/10) correctly calibrated the severity gap.

qwen/qwen3.6-35b-a3b (LOCAL) — 8/10

Strengths:

Correctly explains privileged as critical severity — all capabilities, host escape, disables seccomp/AppArmor
Correctly identifies the no_new_privs flag mechanism for allowPrivilegeEscalation
Good understanding of the fundamental difference between the two settings
Clear distinction between host-level access (privileged) and process-level control (APE)

Weaknesses:

Slight overclaim on APE severity — rates it as “Moderate/Important” when the correct calibration is “rarely a serious concern.” Container isolation remains intact regardless of APE setting.

Notable: Matches MiniMax M2.5 at 8/10 with good severity calibration — closer to correct than most models that rate APE as “High” or “Major.” The no_new_privs identification is strong. Only DeepSeek V4 Pro (9/10) and GPT 5.5 (9/10) scored higher with better severity gap calibration.

google/gemma-4-31b (LOCAL) — 6/10

Strengths:

Correctly explains privileged as a critical security concern — grants all capabilities, enables host breakout
Correctly identifies the no_new_privs flag mechanism for allowPrivilegeEscalation
Good basic description of both settings’ mechanisms

Weaknesses:

Overstates allowPrivilegeEscalation severity — rates APE as a significant security concern when the correct calibration is “rarely a serious concern” since all container isolation remains intact regardless of the APE setting
Framing implies APE is a high-impact setting comparable to privileged, missing the important severity gap

Notable: Falls into the same majority pattern as Opus 4.7, Qwen 3.6 Plus, and DeepSeek V4 Flash, all at 6/10 for the same reason: correct mechanism identification but incorrect severity calibration. The no_new_privs identification is the only standout element.

anthropic/claude-opus-4.8 — 9/10

Strengths:

Excellent differentiation between the two settings
Correctly identifies privileged as critical — host compromise, removes all isolation
Correctly explains APE controls the no_new_privs kernel flag
Well-calibrated severity: privileged = critical, APE = moderate and scoped to no_new_privs (correctly conveys that APE is not a serious concern since container isolation remains intact)
Correctly notes that privileged overrides APE

Weaknesses:

Minor overstatement of APE in some phrasing, but the overall severity gap is correctly conveyed

Comparison vs Opus 4.7 (6): Dramatic improvement. Opus 4.7 rated APE as “Moderate” which was still too high. Opus 4.8 correctly conveys that APE is rarely a serious concern, joining GPT 5.5 and V4 Pro at 9/10.

Notable: Breaks the Anthropic family pattern of overstating APE severity. Opus 4.6 (6/10, “SERIOUS”), Opus 4.7 (6/10, “Moderate”), Opus 4.8 (9/10, well-calibrated) — a clear trajectory of improvement. Joins GPT 5.5 and DeepSeek V4 Pro at the top of the leaderboard for this question.

qwen/qwen3.7-plus — 7/10

Strengths:

Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Good technical description of both mechanisms
Correct defaults for both settings

Weaknesses:

Overstates APE severity — treats allowPrivilegeEscalation as a more significant security concern than it is. The correct calibration is “rarely a serious security concern” since all container isolation remains intact regardless of APE setting.

Notable: An improvement over Qwen 3.6 Plus (6/10) — better severity calibration, though still not reaching the correct “rarely serious” mark. Matches Sonnet, DeepSeek V3.2, Gemini 3 Flash, MiniMax M2.7, and Kimi K2.6 at 7/10. Only GPT 5.5, Opus 4.8, and DeepSeek V4 Pro (all 9/10) correctly calibrated the severity gap.

minimax/minimax-m3 — 6/10

Strengths:

Correctly explains both settings and their differences
Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag

Weaknesses:

Overstates allowPrivilegeEscalation severity — calls it “SERIOUS” when other isolation still applies, making it rarely a serious concern in practice. All container isolation boundaries remain intact regardless of the APE setting.

Notable: Falls into the majority pattern of correct mechanism identification but incorrect severity calibration, matching Opus 4.7, Qwen 3.6 Plus, DeepSeek V4 Flash, and Gemma 4 31B at 6/10. Interestingly, MiniMax M2.5 (8/10) had the best severity calibration on this question, while M2.7 (7/10) and M3 (6/10) both overstate APE severity — a regression on this specific question within the MiniMax family.

anthropic/claude-fable-5 — 9/10

Strengths:

Excellent differentiation between the two settings
Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Well-calibrated severity: privileged = critical, APE = rarely serious (correctly conveys that APE is not a serious concern since container isolation remains intact)
Correctly notes that privileged overrides APE

Weaknesses:

Minor — slightly overstates APE risk in some phrasing, but overall calibration is correct

Notable: Joins GPT 5.5, Opus 4.8, and DeepSeek V4 Pro at 9/10. Continues the Opus 4.8 breakthrough in APE severity calibration — Fable 5 correctly conveys the severity gap, unlike the older Anthropic models (Opus 4.6: 6/10 “SERIOUS”, Opus 4.7: 6/10 “Moderate”). The Anthropic family APE trajectory: Opus 4.6 (6), Opus 4.7 (6), Opus 4.8 (9), Fable 5 (9).

moonshotai/kimi-k2.7-code — 7/10

Strengths:

Correct distinction between privileged (full host access, all capabilities, no namespace isolation) and APE
Correctly identifies that APE controls the no_new_privs bit
Notes that both should be set to false in production

Weaknesses:

Overstates APE severity, describing it as “Serious” — in practice APE is rarely a serious security concern because the no_new_privs flag primarily affects setuid/setgid binaries, which are uncommon in containers
Should note that privileged is a far more significant risk than APE
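
Because the practical impact of APE depends on whether setuid/setgid binaries exist in the image, that can be checked directly. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes the script is run against a directory tree such as an unpacked image filesystem or the container's own root.

```python
import os
import stat


def find_setid_files(root: str) -> list:
    """Return regular files under root that carry the setuid or setgid bit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat avoids following symlinks out of the tree.
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # unreadable or vanished entry; skip it
            if stat.S_ISREG(mode) and mode & (stat.S_ISUID | stat.S_ISGID):
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    for path in find_setid_files("/usr"):
        print(path)
```

An empty result suggests that allowing privilege escalation gives a process in that image little to escalate with, although file capabilities are a separate mechanism that this sketch does not inspect.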

Notable: Same score as K2.6 (7/10) — both Moonshot models overstate APE severity. Matches Sonnet, DeepSeek V3.2, Gemini 3 Flash, MiniMax M2.7, and Qwen 3.7 Plus in the 7/10 cluster. Only GPT 5.5, Opus 4.8, V4 Pro, and Fable 5 (all 9/10) correctly calibrated the severity gap.

z-ai/glm-5.2 — 6/10

Strengths:

Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Good technical description of both mechanisms

Weaknesses:

Overstates allowPrivilegeEscalation severity — treats APE as a more significant security concern than it is. The correct calibration is “rarely a serious security concern” since all container isolation remains intact regardless of APE setting.
States an incorrect default for allowPrivilegeEscalation (the actual default is true)

Notable: Falls into the majority pattern of correct mechanism identification but incorrect severity calibration, matching Opus 4.7, Qwen 3.6 Plus, DeepSeek V4 Flash, Gemma 4 31B, and MiniMax M3 at 6/10. Only GPT 5.5, Opus 4.8, DeepSeek V4 Pro, and Fable 5 (all 9/10) correctly calibrated the severity gap.

mistralai/mistral-medium-3-5 — 8/10

Strengths:

Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Good distinction between the two settings with correct SUID/SGID retention explanation
Reasonable severity calibration — correctly conveys that privileged is far more serious than APE

Weaknesses:

Slightly overstates APE risk — while the severity gap is correctly conveyed, some phrasing implies APE is more impactful than it is in practice. Container isolation remains fully intact regardless of the APE setting.

Notable: Matches MiniMax M2.5 and Qwen-35b at 8/10 with good severity calibration that correctly conveys the gap between privileged (critical) and APE (moderate). Better calibrated than the 7/10 cluster (Sonnet, DeepSeek V3.2, Gemini 3 Flash, MiniMax M2.7, Kimi K2.6, Kimi K2.7 Code, Qwen 3.7 Plus) but not quite reaching the correct “rarely serious” mark of the 9/10 group (GPT 5.5, Opus 4.8, V4 Pro, Fable 5).

anthropic/claude-sonnet-5 — 10/10

Strengths:

Excellent differentiation between the two settings
Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Perfectly calibrated severity: privileged = critical, APE = rarely a serious concern. The first model to achieve a fully correct severity calibration with no overstatement.
Correctly notes that privileged overrides APE
Clear explanation of why APE is a hardening step but not a security concern — container isolation remains intact regardless

Weaknesses:

None significant

Notable: The new sole leader at 10/10 and the first model to achieve a perfect score on this question. Previous leaders (GPT 5.5, Opus 4.8, DeepSeek V4 Pro, Fable 5) all scored 9/10 with minor APE severity overstatement. Sonnet 5 achieves the perfect severity calibration that the scoring criteria demand: privileged = critical (host breakout), APE = rarely serious (container isolation intact). The Anthropic family APE trajectory shows dramatic improvement: Opus 4.6 (6), Sonnet 4.6 (7), Opus 4.7 (6), Opus 4.8 (9), Fable 5 (9), Sonnet 5 (10).

tencent/hy3 — 6/10

Strengths:

Correctly identifies privileged as critical — removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Good technical description of both mechanisms

Weaknesses:

Overstates allowPrivilegeEscalation severity — treats APE as a more significant security concern than it is. The correct calibration is “rarely a serious security concern” since all container isolation remains intact regardless of APE setting.
Factual error about runAsNonRoot interaction — incorrectly describes how runAsNonRoot interacts with allowPrivilegeEscalation

Notable: Falls into the majority pattern of correct mechanism identification but incorrect severity calibration, matching Opus 4.7, Qwen 3.6 Plus, DeepSeek V4 Flash, Gemma 4 31B, GLM-5.2, and MiniMax M3 at 6/10. The runAsNonRoot interaction error is an additional factual issue beyond the severity miscalibration. Only GPT 5.5, Opus 4.8, DeepSeek V4 Pro, and Fable 5 (all 9/10) and Sonnet 5 (10/10) correctly calibrated the severity gap.

openai/gpt-5.6-terra — 8/10

Strengths:

Good distinction between the two settings
Correctly identifies privileged as a serious security concern granting near-host-level access
Correctly explains APE controls no_new_privs

Weaknesses:

Hedges on APE severity — calls it a “hardening concern” rather than stating clearly that it is rarely a serious security concern since all other isolation mechanisms still apply

Notable: Matches MiniMax M2.5, Qwen-35b, and Mistral M3.5 at 8/10 with reasonable severity calibration that correctly conveys the gap between privileged (critical) and APE (moderate). Better calibrated than the 7/10 cluster but not quite reaching the “rarely serious” mark of the 9/10 group (GPT 5.5, Opus 4.8, V4 Pro, Fable 5) or Sonnet 5’s sole 10/10.

openai/gpt-5.6-sol — 8/10

Strengths:

Good distinction between the two settings
Correctly identifies privileged as a serious security concern — removes container isolation, enables host breakout
Correctly explains APE controls the no_new_privs kernel flag
Good technical description of both mechanisms

Weaknesses:

Overstates APE risk — describes allowPrivilegeEscalation as more impactful than it is in practice, rather than noting it is rarely a serious concern since all other container isolation mechanisms still apply

Notable: Matches GPT 5.6 Terra, MiniMax M2.5, Qwen-35b, and Mistral M3.5 at 8/10 with reasonable severity calibration that correctly conveys the gap between privileged (critical) and APE (moderate). Better calibrated than the 7/10 cluster but not quite reaching the “rarely serious” mark of the 9/10 group (GPT 5.5, Opus 4.8, V4 Pro, Fable 5) or Sonnet 5’s sole 10/10.

moonshotai/kimi-k3 – 8/10

Strengths:

Correctly identifies privileged as critical – removes container isolation, enables host breakout
Correctly explains allowPrivilegeEscalation controls the no_new_privs kernel flag
Good distinction between the two settings

Weaknesses:

Slightly overstates APE risk – describes allowPrivilegeEscalation as more impactful than it is in practice, rather than noting it is rarely a serious concern since all other container isolation mechanisms still apply

Notable: Matches GPT 5.6 Terra, GPT 5.6 Sol, MiniMax M2.5, Qwen-35b, and Mistral M3.5 at 8/10 with reasonable severity calibration that correctly conveys the gap between privileged (critical) and APE (moderate). Better calibrated than the 7/10 cluster (Sonnet 4.6, DeepSeek V3.2, Gemini 3 Flash, MiniMax M2.7, Kimi K2.6, Kimi K2.7 Code, Qwen 3.7 Plus) but not quite reaching the “rarely serious” mark of the 9/10 group (GPT 5.5, Opus 4.8, V4 Pro, Fable 5) or Sonnet 5’s sole 10/10. The Moonshot AI family trajectory: K2.6 (7), K2.7 Code (7), K3 (8) – gradual improvement in APE severity calibration.

xiaomi/mimo-v2.5 — 8/10

Strengths:

Excellent, accurate distinction — privileged breaks container-to-host isolation, while allowPrivilegeEscalation maps to the no_new_privs kernel flag
Correctly explains the interaction: privileged effectively negates APE=false, while APE=false does not undo privileged
Correctly rates privileged as the far more serious, near-host-root concern

Weaknesses:

Overstates APE risk as “MEDIUM” — suggests it enables root inside the container or a kernel-exploit escape. The scoring notes are clear that APE is rarely a serious concern because all other isolation still applies regardless of its setting.
Minor overstatement that privileged disables seccomp/AppArmor/SELinux outright

Notable: A strong 8/10 — an accurate privileged-vs-APE distinction with correct severity ranking, docked only for slightly overstating APE risk.

Key Findings

Severity calibration is the key discriminator: The scoring notes emphasise that allowPrivilegeEscalation is “rarely a serious security concern” because container isolation remains intact. Most models overstated APE’s severity. Only Claude Sonnet 5 (10/10) showed no overstatement, with GPT 5.5, Opus 4.8, DeepSeek V4 Pro, and Fable 5 (all 9/10) close behind.
MiniMax M2.5’s strongest quiz performance: After consistently lower scores on other quizzes, MiniMax M2.5 produced the best answer here until DeepSeek V4 Pro (9/10) was added, with well-calibrated severity, correct technical details, and an excellent interaction scenarios table.
Claude Sonnet 4.6 and Gemini 3 Flash overstated APE severity: Both described APE as leading to further attacks (container escape in Sonnet 4.6’s case, “major foothold” in Gemini 3 Flash’s). This misses the scoring notes’ point that all container isolation remains intact regardless of APE.
GPT 5.4 had the best severity language but the weakest technical depth: Saying “its impact depends on what binaries, capabilities, and users exist” is the most accurate framing, but not mentioning no_new_privs at all is a notable gap.
All models correctly described privileged: The host-breakout risk of privileged containers is well understood across all models. The differentiation comes entirely from how they handle allowPrivilegeEscalation.