Manifest Test Assessment
Models tested: Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, GPT 5.5, GPT 5.4, Gemini 3 Flash, MiniMax M2.5, MiniMax M2.7, DeepSeek V3.2, Qwen 3.6 Plus, DeepSeek V4 Pro, DeepSeek V4 Flash, Kimi K2.6, Qwen3.6-35b-a3b (Local), Gemma 4 31B (Local)
Original date: 2026-03-09 | Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20 | Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25 | Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03 | Claude Opus 4.8 added: 2026-05-31
Cluster: Kind (local) for deployability testing
Scoring Criteria
Per manifest_tests/Scoring_Criteria.md:
- Usability — Do the Deployment and Service objects actually deploy and work? (Malformed YAML = serious fault; settings that prevent functioning = fault)
- Security — How well do the manifests implement Pod Security Standards while remaining functional?
Only Deployment and Service objects are scored. Extra objects (HPA, PDB, NetworkPolicy, etc.) are noted but not scored.
Deployability Results
Each model’s Deployment + Service were extracted and applied to a Kind cluster. Results:
| Scenario | Claude Opus 4.8 | Claude Opus 4.7 | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT 5.5 | GPT 5.4 | Gemini 3 Flash | MiniMax M2.5 | MiniMax M2.7 | DeepSeek V3.2 | Qwen 3.6 Plus | DeepSeek V4 Pro | DeepSeek V4 Flash | Kimi K2.6 | Qwen-35b (Local) | Gemma 4 31B (Local) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basic | PASS | PASS | PASS | FAIL | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS | PASS | FAIL | PASS | PASS |
| Production | PASS | PASS | PASS | FAIL | PASS | FAIL | PASS | FAIL | FAIL | PASS | FAIL | FAIL | FAIL | PASS | FAIL | PASS |
| Hardened | PASS | PASS | PASS | PASS | PASS | FAIL | PASS | PASS | PASS | PASS | PASS | PASS | PASS | FAIL | PASS | FAIL |
| Pass Rate | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 1/3 | 3/3 | 2/3 | 2/3 | 3/3 | 2/3 | 2/3 | 2/3 | 1/3 | 2/3 | 2/3 |
Failure Root Causes
| Model | Scenario | Root Cause |
|---|---|---|
| Claude Sonnet 4.6 | Basic | capabilities: drop: ALL, add: NET_BIND_SERVICE — nginx:latest needs CHOWN capability to chown("/var/cache/nginx/client_temp", 101). Running as root (uid 0) with only NET_BIND_SERVICE causes chown failure. |
| Claude Sonnet 4.6 | Production | Same CHOWN issue. runAsUser: 0 with capabilities: drop: ALL, add: NET_BIND_SERVICE. ConfigMap exists but nginx master can’t chown cache directories. |
| GPT 5.4 | Production | capabilities: drop: ALL with no added capabilities. Same chown failure as Claude Sonnet 4.6. |
| GPT 5.4 | Hardened | Specifies containerPort: 8080 and probes target port 8080, but provides NO ConfigMap to reconfigure nginx. nginx:latest listens on port 80 by default. Startup probe fails: connection refused on port 8080. |
| MiniMax M2.5 | Production | runAsNonRoot: true, runAsUser: 101 but nginx:latest on port 80. Container crashes because nginx master process tries to execute user nginx; directive (setuid) but can’t as non-root. Also placed capabilities under pod-level securityContext, where it is not a valid field (capabilities belongs in the container-level securityContext). |
| MiniMax M2.7 | Production | runAsNonRoot: true but no runAsUser: 101 at container level. nginx:latest defaults to root, so Kubernetes rejects the pod. Good security settings on paper but non-functional deployment. |
| Qwen 3.6 Plus | Production | capabilities: drop: ALL, add: NET_BIND_SERVICE — same chown failure pattern as Sonnet 4.6. Running as root (no runAsNonRoot) on port 80 with dropped capabilities. nginx master process can’t chown /var/cache/nginx/client_temp. |
| DeepSeek V4 Pro | Production | runAsUser: 101 with capabilities drop ALL but no emptyDir volumes for /var/cache/nginx. UID 101 cannot create subdirectories. Same non-root pitfall as others but without the chown issue — directory ownership prevents writes. |
| DeepSeek V4 Flash | Production | capabilities: drop: ALL, add: NET_BIND_SERVICE — same chown failure pattern. Running as root with dropped capabilities on port 80. nginx master process can’t chown /var/cache/nginx/client_temp. |
| Kimi K2.6 | Basic | Response was HTML (JavaScript web application), not YAML. No deployable Kubernetes manifest in the output. |
| Kimi K2.6 | Hardened | try_files $uri $uri/ =404; in nginx.conf causes root path to return 404. Startup probe fails with HTTP 404, pod never becomes Ready. |
| Qwen-35b (Local) | Production | Container-level securityContext and probes nested under invalid YAML keys (“container security context:” and “health checks:” instead of proper Kubernetes field names). The model knows the right settings but generates structurally invalid YAML. |
| Gemma 4 31B (Local) | Hardened | capabilities: drop: [ALL] with runAsNonRoot: false (root user) causes nginx:latest chown failure — CrashLoopBackOff. Good security intent (readOnlyRootFilesystem, allowPrivilegeEscalation: false, volume mounts) but nginx:latest cannot chown cache dirs when capabilities are fully dropped while running as root. Model recommended nginxinc/nginx-unprivileged in comments but did not use it. |
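The recurring chown failure in the table above can be illustrated with a container fragment like the following. This is a reconstruction from the root-cause notes, not a copy of any model's output, and the container name is hypothetical.

```yaml
# Fragment of spec.template.spec in a Deployment (illustrative only).
containers:
  - name: web                      # hypothetical name
    image: nginx:latest
    ports:
      - containerPort: 80
    securityContext:
      runAsUser: 0                 # root, so the nginx master tries to chown its cache dirs
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]  # CHOWN is not added back, so the chown() call is refused
```

Per the table, pods configured this way failed on the chown of /var/cache/nginx/client_temp. The manifests that deployed avoided the situation by running as uid 101 with a matching nginx configuration, or by using an unprivileged image.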
Key Insight
The common failure pattern is the tension between nginx:latest’s default behaviour and security hardening:
- nginx:latest running as root needs the CHOWN capability (to chown cache dirs to worker user)
- nginx:latest running as non-root needs a custom ConfigMap (to listen on non-privileged port and use /tmp for PID)
- Only models that resolved this tension (either by running as uid 101 with proper config, or using an unprivileged image) produced working hardened deployments
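One way to resolve the tension described above is sketched below: run nginx:latest as uid 101 and supply an nginx.conf that listens on 8080 and keeps its PID file and temp paths under a writable /tmp. The object names (web, web-nginx-conf) are hypothetical and the nginx.conf is an assumed minimal configuration, not taken from any tested model; it has not been run against the Kind cluster used for this assessment.

```yaml
# Sketch only: names and the nginx.conf contents are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-nginx-conf
data:
  nginx.conf: |
    pid /tmp/nginx.pid;                 # PID file in a writable location
    worker_processes auto;
    events { worker_connections 1024; }
    http {
      include /etc/nginx/mime.types;
      # Temp paths moved off /var/cache/nginx so nothing needs chown.
      client_body_temp_path /tmp/client_temp;
      proxy_temp_path       /tmp/proxy_temp;
      fastcgi_temp_path     /tmp/fastcgi_temp;
      uwsgi_temp_path       /tmp/uwsgi_temp;
      scgi_temp_path        /tmp/scgi_temp;
      server {
        listen 8080;                    # non-privileged port
        root /usr/share/nginx/html;
        index index.html;
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 101                  # uid used by the passing manifests
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: web
          image: nginx:latest
          ports:
            - containerPort: 8080       # must match the listen directive
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          readinessProbe:
            httpGet:
              path: /
              port: 8080
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
              readOnly: true
            - name: tmp                 # writable scratch space for PID and temp files
              mountPath: /tmp
      volumes:
        - name: conf
          configMap:
            name: web-nginx-conf
        - name: tmp
          emptyDir: {}
```

The alternative noted in the last bullet, an unprivileged image, removes the need for the ConfigMap because such an image already listens on a non-privileged port as a non-root user.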
Security Assessment (Pod Security Standards)
PSS Compliance Summary
| Feature | Opus 4.8 Basic | Opus 4.7 Basic | Opus 4.6 Basic | Sonnet 4.6 Basic | GPT 5.5 Basic | GPT 5.4 Basic | Gemini 3 Flash Basic | MiniMax M2.5 Basic | MiniMax M2.7 Basic | DeepSeek V3.2 Basic | Qwen 3.6 Plus Basic | V4 Pro Basic | V4 Flash Basic | Kimi K2.6 Basic | Qwen-35b Basic | Gemma 4 31B Basic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| runAsNonRoot | Not set | Not set | Not set | false | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | N/A (HTML) | Not set | Not set |
| seccompProfile | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | N/A (HTML) | Not set | Not set |
| allowPrivilegeEscalation: false | Not set | Not set | Not set | YES | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | N/A (HTML) | Not set | Not set |
| capabilities: drop ALL | Not set | Not set | Not set | YES | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | N/A (HTML) | Not set | Not set |
| readOnlyRootFilesystem | Not set | Not set | Not set | false | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | N/A (HTML) | Not set | Not set |
| Resource limits | Not set | YES | YES | YES | Not set | Not set | Not set | YES | YES | YES | Not set | YES | Not set | N/A (HTML) | Not set | Not set |
| Probes | Not set | L+R | L+R | L+R+S | Not set | Not set | Not set | L+R | L+R | L+R | Not set | L+R | L+R | N/A (HTML) | Not set | Not set |
| PSS Level | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | Baseline | N/A | Baseline | Baseline |
| Feature | Opus 4.8 Prod | Opus 4.7 Prod | Opus 4.6 Prod | Sonnet 4.6 Prod | GPT 5.5 Prod | GPT 5.4 Prod | Gemini 3 Flash Prod | MiniMax M2.5 Prod | MiniMax M2.7 Prod | DeepSeek V3.2 Prod | Qwen 3.6 Plus Prod | V4 Pro Prod | V4 Flash Prod | Kimi K2.6 Prod | Qwen-35b Prod | Gemma 4 31B Prod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| runAsNonRoot | true (uid 101) | true (uid 101) | true (uid 101) | false (uid 0!) | true (uid 101) | false | false | true | true | true | Not set | true (uid 101) | Not set | true (uid 101) | N/A (invalid YAML) | Not set |
| seccompProfile | RuntimeDefault | RuntimeDefault | RuntimeDefault | Not set | RuntimeDefault | RuntimeDefault | Not set | Not set | RuntimeDefault | Not set | Not set | Not set | Not set | RuntimeDefault | N/A (invalid YAML) | Not set |
| allowPrivilegeEscalation: false | YES | YES | YES | YES | YES | YES | YES | Not set | YES | YES | YES | YES | YES | YES | N/A (invalid YAML) | Not set |
| capabilities: drop ALL | YES | YES | YES | YES (+NET_BIND) | YES | YES | Not set | Invalid placement | YES | YES | YES (+NET_BIND) | YES | YES (+NET_BIND) | YES | N/A (invalid YAML) | Not set |
| readOnlyRootFilesystem | true | true | true | false | true | false | false | Not set | true | true | Not set | Not set | Not set | true | N/A (invalid YAML) | Not set |
| automountServiceAccountToken: false | Not set | YES | YES | YES | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | Not set | YES | N/A (invalid YAML) | Not set |
| Resource limits | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | N/A (invalid YAML) | YES |
| Probes | L+R | L+R+S | L+R+S | L+R+S | L+R | L+R+S | L+R | L+R | L+R | L+R | L+R | L+R | L+R | L+R | N/A (invalid YAML) | L+R |
| PSS Level | Restricted | Restricted | Restricted | Baseline | Restricted | Baseline+ | Baseline | Baseline | ~Restricted | ~Restricted | Baseline+ | Baseline+ | Baseline+ | Restricted | Partial Baseline | Baseline |
| Feature | Opus 4.8 Hard | Opus 4.7 Hard | Opus 4.6 Hard | Sonnet 4.6 Hard | GPT 5.5 Hard | GPT 5.4 Hard | Gemini 3 Flash Hard | MiniMax M2.5 Hard | MiniMax M2.7 Hard | DeepSeek V3.2 Hard | Qwen 3.6 Plus Hard | V4 Pro Hard | V4 Flash Hard | Kimi K2.6 Hard | Qwen-35b Hard | Gemma 4 31B Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| runAsNonRoot | true | true | true | true | true | true | true | true | true | true | true | true | true | true | true | false |
| runAsUser (non-zero) | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 101 | Not set |
| seccompProfile | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | Not set | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault |
| allowPrivilegeEscalation: false | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES |
| capabilities: drop ALL | YES | YES | YES | YES | YES | YES | YES | YES | YES (+NET_BIND) | YES | YES | YES | YES (+NET_BIND) | YES | YES (+NET_BIND) | YES |
| readOnlyRootFilesystem | true | true | true | true | true | true | true | true | true | true | true | true | true | true | true | true |
| automountServiceAccountToken: false | YES | YES (SA + pod) | YES | YES | YES | YES | Not set | Commented out | Not set | Not set | YES | Not set | Not set | YES | Not set | Not set |
| NetworkPolicy | YES (deny+ingress+DNS from ingress-nginx) | YES (deny+ingress+DNS) | YES (ingress+egress) | YES (ingress+egress) | Not provided | Not provided | YES (ingress only) | YES (ingress+egress) | Not set | YES (ingress only) | Not provided | No | No | YES (deny+DNS from ingress-nginx) | Not provided | Not provided |
| Resource limits | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES |
| Probes | L+R | L+R+S | L+R+S | L+R+S | L+R+S | L+R+S | L+R | L+R | L+R | L+R | L+R | L+R | L+R | L+R+S | L+R | L+R |
| PSA namespace labels | YES (restricted) | YES (restricted) | No | No | YES (restricted) | No | No | No | No | No | No | No | No | No | No | No |
| PSS Level | Restricted | Restricted | Restricted | Restricted | Restricted | Restricted | ~Restricted | Restricted | Restricted | Restricted | Restricted | Restricted | Restricted | Restricted | Baseline+ | Baseline+ |
PSS Level Key:
- Baseline = Meets Baseline profile (no privileged containers, no host namespaces)
- Baseline+ = Baseline with some Restricted features
- ~Restricted = Near-Restricted but missing 1-2 requirements (typically seccompProfile)
- Restricted = Fully meets Restricted profile
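As a reading aid for the tables above, the fragment below shows where each tracked setting sits in a pod spec. It is an illustrative sketch with a hypothetical container name, not output from any tested model.

```yaml
# Fragment of spec.template.spec (illustrative only).
securityContext:                     # pod level
  runAsNonRoot: true
  runAsUser: 101
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: web                        # hypothetical name
    image: nginx:latest
    securityContext:                 # container level
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true   # tracked in the tables; extra hardening, not a Restricted requirement
      capabilities:
        drop: ["ALL"]                # capabilities is only valid at container level
```

Placement matters: capabilities, allowPrivilegeEscalation, and readOnlyRootFilesystem exist only in the container-level securityContext, while runAsNonRoot, runAsUser, and seccompProfile are accepted at either level.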
Per-Scenario Detailed Results
1. Basic Deployment
Prompt: “Generate a Kubernetes deployment for a web application using the nginx:latest image and provide a service exposing the application.”
This is intentionally a minimal prompt — no mention of security or production use. We’re testing what models produce by default.
| Category | Claude Opus 4.8 | Claude Opus 4.7 | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT 5.5 | GPT 5.4 | Gemini 3 Flash | MiniMax M2.5 | MiniMax M2.7 | DeepSeek V3.2 | Qwen 3.6 Plus | DeepSeek V4 Pro | DeepSeek V4 Flash | Kimi K2.6 | Qwen-35b (Local) | Gemma 4 31B (Local) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deploys? | YES | YES | YES | NO | YES | YES | YES | YES | YES | YES | YES | YES | YES | NO (HTML) | YES | YES |
| Replicas | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | N/A | 3 | 3 |
| Service type | ClusterIP | LoadBalancer | LoadBalancer | LoadBalancer | ClusterIP | LoadBalancer | LoadBalancer | NodePort | LoadBalancer | LoadBalancer | LoadBalancer | ClusterIP | NodePort | N/A | ClusterIP | ClusterIP |
| Resource limits | No | Yes | Yes (128Mi/250m) | Yes | No | No | No | Yes | Yes | Yes | No | Yes | No | N/A | No | No |
| Probes | None | L+R | L+R | L+R+S | None | None | None | L+R | L+R | L+R | None | L+R | L+R | N/A | None | None |
| Security context | None | None | None | Partial | None | None | None | None | None | None | None | None | None | N/A | None | None |
| Extra objects | None | None | None | NS, CM, HPA, Ingress | None | None | None | None | None | None | None | None | None | N/A | None | None |
Notable observations:
- Opus 4.8 shows appropriate restraint for a basic prompt — Deployment and Service only, no security context, no resource limits, no probes. ClusterIP service type, port 80. Minimal response matching the basic prompt.
- Opus 4.7 shows appropriate restraint for a basic prompt — just a Deployment and Service with resource limits and probes. RollingUpdate strategy with maxUnavailable: 0 for zero-downtime deploys. No security context on a basic prompt — correct calibration matching Opus 4.6.
- Opus 4.6 similarly restrained — Deployment and Service only, with resource limits (128Mi/250m) and liveness+readiness probes. No security context for a basic prompt, matching Opus 4.7’s calibration.
- Sonnet 4.6 massively over-engineered for a basic prompt (6 objects, init container, ConfigMap, HPA, Ingress) but ironically produced a manifest that doesn’t deploy due to the capability issue
- GPT 5.4 gave the most minimal correct answer — exactly what was asked, nothing more
- MiniMax M2.5 added probes and resource limits without being asked, which shows good production instincts
- DeepSeek V3.2 also added probes and limits; provided both split and combined file versions
- Qwen 3.6 Plus minimal response — Deployment and Service only, no resource limits, no probes, no security context. Comments out probes and suggests them as optional. Good production advice in text but not in YAML.
- DeepSeek V4 Pro minimal like most others — standard Deployment + Service with nginx:latest, no security context. Includes resource limits and liveness+readiness probes. ClusterIP service type.
- GPT 5.5 minimal response — Deployment and Service only, no resource limits, no probes, no security context. ClusterIP service type. Appropriate restraint for a basic prompt.
- DeepSeek V4 Flash standard Deployment + Service with NodePort. No security contexts, no resource limits. Includes liveness+readiness probes. Minimal response matching the basic prompt.
- Kimi K2.6 responded with HTML/JavaScript code for a web application rather than Kubernetes YAML. No deployable manifest was produced. A fundamental misunderstanding of the prompt.
- Qwen-35b (Local) minimal response — Deployment and Service only, no resource limits, no probes, no security context. A bare-minimum answer matching the basic prompt. ClusterIP service type.
- Gemma 4 31B (Local) minimal response — Deployment and Service only, no resource limits, no probes, no security context. A bare-minimum answer matching the basic prompt. ClusterIP service type.
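For reference, the "Deployment and Service only" shape that several models returned for this prompt looks roughly like the sketch below. The names and labels are hypothetical, and the values (3 replicas, ClusterIP, port 80) follow the most common choices in the table rather than any single model's output.

```yaml
# Minimal sketch: no security context, no resource limits, no probes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:latest
          ports:
            - containerPort: 80   # nginx:latest listens on 80 by default
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector:
    app: web                 # must match the pod template labels
  ports:
    - port: 80
      targetPort: 80
```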
2. Production Deployment
Prompt: “…Ensure that the deployment and service are configured suitably for production cluster use.”
| Category | Claude Opus 4.8 | Claude Opus 4.7 | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT 5.5 | GPT 5.4 | Gemini 3 Flash | MiniMax M2.5 | MiniMax M2.7 | DeepSeek V3.2 | Qwen 3.6 Plus | DeepSeek V4 Pro | DeepSeek V4 Flash | Kimi K2.6 | Qwen-35b (Local) | Gemma 4 31B (Local) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deploys? | YES | YES | YES | NO | YES | NO | YES | NO | NO | YES | NO | NO | NO | YES | NO | YES |
| runAsNonRoot | true (uid 101) | true (uid 101) | true (uid 101) | false (uid 0!) | true (uid 101) | false | false | true (broken) | true (broken) | true | Not set | true (uid 101) | Not set | true (uid 101) | N/A (invalid YAML) | Not set |
| Capabilities | drop ALL | drop ALL | drop ALL | drop ALL +NET_BIND | drop ALL | drop ALL | No | Invalid placement | drop ALL | drop ALL | drop ALL +NET_BIND | drop ALL | drop ALL +NET_BIND | drop ALL | N/A (invalid YAML) | No |
| readOnlyFS | true | true | true | false | true | false | false | N/A | true | true | Not set | Not set | Not set | true | N/A (invalid YAML) | Not set |
| Probes | L+R | L+R+S | L+R+S | L+R+S | L+R | L+R+S | L+R | L+R | L+R | L+R | L+R | L+R | L+R | L+R | N/A (invalid YAML) | L+R |
| NetworkPolicy | No | No | Yes (ingress+egress) | Yes | No | No | No | No | No | No | No | No | No | No | No | No |
| PDB | No | Yes | Yes | Yes | No | No | No | No | No | No | Yes | No | No | Yes | No | No |
| HPA | No | Yes (CPU + memory) | Yes (CPU + memory) | Yes | No | No | No | No | No | No | No | No | No | Yes | No | No |
| ConfigMap | Yes (port 8080) | Yes (port 8080) | Yes (port 8080) | Yes (port 8080) | Yes (port 8080) | No | No | No | No | No | No | No | No | Yes (port 8080) | No | No |
Notable observations:
- Opus 4.8 achieves full PSS Restricted at Production level — runAsNonRoot with uid 101, seccompProfile RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem true. ConfigMap reconfigures nginx to port 8080. Topology spread constraints. Deploys successfully. No PDB, HPA, or NetworkPolicy — leaner than Opus 4.7 but all security essentials present.
- Opus 4.7 achieves full PSS Restricted at Production level, one of five models (with Opus 4.8, Opus 4.6, GPT 5.5, and Kimi K2.6) to reach Restricted without the explicit “hardened” prompt. Image pinned to nginx:1.27.2 (one of the few models to avoid :latest). ConfigMap reconfigures nginx to port 8080 with health endpoint. topologySpreadConstraints, preStop hook, ClusterIP service type. Leaner than Sonnet 4.6’s production (no Namespace, no NetworkPolicy) but all security essentials present.
- Opus 4.6 also achieves PSS Restricted at Production level with uid 101, drop ALL capabilities, readOnlyRootFilesystem, seccompProfile, and automountServiceAccountToken: false. Includes NetworkPolicy (ingress+egress), PDB, HPA (CPU + memory), and ConfigMap for port 8080. One of five models to reach Restricted without the hardened prompt.
- Sonnet 4.6 produced the most comprehensive response (10 objects!) but made a critical error: runAsUser: 0 with dropped capabilities. The comment even says “nginx master needs root for port binding” despite configuring port 8080 in the ConfigMap (which doesn’t need root). Self-contradictory.
- GPT 5.4 acknowledged the manifest needs more work and offered to provide it: honest but incomplete
- Gemini 3 Flash kept it simple and functional but lacked security hardening for a “production” prompt
- MiniMax M2.5 attempted non-root but placed capabilities under pod-level securityContext, where it is not a valid field: a structural error
- DeepSeek V3.2 was the most practical: switched to nginx:1.25-alpine, ran as non-root, read-only filesystem, and it actually works
- Qwen 3.6 Plus falls into the same chown trap as Sonnet 4.6: drops ALL capabilities and adds NET_BIND_SERVICE, but runs as root on port 80. Ironically, the text notes include advice to use non-root with port 8080, so the model knows the right answer but doesn’t implement it. Includes PDB (good) but no ConfigMap or readOnlyFS.
- DeepSeek V4 Pro sets runAsUser: 101 and drops ALL capabilities, which shows good security intent. But fails to provide emptyDir volumes for /var/cache/nginx, so UID 101 cannot create subdirectories. A slightly different failure mode than the chown pitfall: ownership prevents writes rather than missing capabilities.
- GPT 5.5 achieves PSS Restricted at Production level: runAsNonRoot with uid 101, seccompProfile RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem true. ConfigMap reconfigures nginx to port 8080. One of five models (with the three Opus versions and Kimi K2.6) to reach Restricted without the explicit “hardened” prompt. Liveness and readiness probes present.
- DeepSeek V4 Flash drops ALL capabilities and adds NET_BIND_SERVICE, runs as root on port 80 with 3 replicas. Same chown failure pattern as Sonnet 4.6 and Qwen 3.6 Plus — nginx master can’t chown cache directories. Resource limits present. ClusterIP service.
- Kimi K2.6 achieves full PSS Restricted at Production level: runAsNonRoot with uid 101, seccompProfile RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem true, automountServiceAccountToken false. ConfigMap reconfigures nginx to port 8080 with health endpoint. Includes PDB and HPA. Pod Running 1/1 with healthy endpoint. One of five models (with the three Opus versions and GPT 5.5) to reach Restricted without the explicit “hardened” prompt.
- Qwen-35b (Local) has a novel failure mode: the model generates security context fields and probes but nests them under invalid YAML keys (“container security context:” and “health checks:” instead of proper Kubernetes field names like securityContext and livenessProbe). The model clearly knows the right settings but generates structurally invalid YAML that Kubernetes cannot parse. This failure pattern was not seen in any other model.
- Gemma 4 31B (Local) minimal production response: standard Deployment and Service with nginx:latest running as root. No security context, no capabilities manipulation, no ConfigMap for port reconfiguration. Includes resource limits and liveness+readiness probes. Deploys successfully but has no security hardening. ClusterIP service type.
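The field names that the invalid-key failure missed, and the L/R/S probe abbreviations used in the tables, are shown in the fragment below. It is an illustrative sketch: the container name is hypothetical and the probe timings are assumed values, not taken from any model's output.

```yaml
# Fragment of a container spec (illustrative only).
containers:
  - name: web                      # hypothetical name
    image: nginx:latest
    ports:
      - containerPort: 80
    securityContext:               # real field name, not "container security context:"
      allowPrivilegeEscalation: false
    startupProbe:                  # "S" in the Probes rows
      httpGet:
        path: /
        port: 80                   # must be the port nginx actually listens on
      periodSeconds: 5             # assumed value
      failureThreshold: 12         # assumed value
    livenessProbe:                 # "L"
      httpGet:
        path: /
        port: 80
      periodSeconds: 10            # assumed value
    readinessProbe:                # "R"
      httpGet:
        path: /
        port: 80
      periodSeconds: 5             # assumed value
```

A probe that targets a port or path nginx does not serve keeps the pod from becoming Ready, which is the mechanism behind two of the Hardened failures listed in the root-cause table.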
3. Hardened Production Deployment
Prompt: “…Ensure the deployment and service are properly secured and hardened.”
| Category | Claude Opus 4.8 | Claude Opus 4.7 | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT 5.5 | GPT 5.4 | Gemini 3 Flash | MiniMax M2.5 | MiniMax M2.7 | DeepSeek V3.2 | Qwen 3.6 Plus | DeepSeek V4 Pro | DeepSeek V4 Flash | Kimi K2.6 | Qwen-35b (Local) | Gemma 4 31B (Local) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deploys? | YES | YES | YES | YES | YES | NO | YES | YES | YES | YES | YES | YES | YES | NO (404) | YES | NO (CrashLoop) |
| PSS Restricted? | YES | YES | YES | YES | YES | YES (if it worked) | Almost (no seccomp) | YES | YES | YES | YES | YES | YES | YES | Almost (NET_BIND_SERVICE) | No (runAsNonRoot: false) |
| runAsNonRoot | true | true | true | true | true | true | true | true | true | true | true | true | true | true | true | false |
| seccompProfile | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | Not set | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault | RuntimeDefault |
| drop ALL caps | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes (+NET_BIND) | Yes | Yes | Yes | Yes (+NET_BIND) | Yes | Yes (+NET_BIND) | Yes |
| readOnlyFS | true | true | true | true | true | true | true | true | true | true | true | true | true | true | true | true |
| Port 8080 | Yes (with ConfigMap) | Yes (with ConfigMap) | Yes (with ConfigMap) | Yes (with ConfigMap) | Yes (with ConfigMap) | Yes (NO ConfigMap!) | Yes (unprivileged image) | No (port 80) | Yes (with ConfigMap) | No (port 80) | Yes (with ConfigMap) | Yes (with ConfigMap) | Yes (with ConfigMap) | Yes (with ConfigMap) | No (port 80) | No (port 80) |
| NetworkPolicy | Yes (deny+ingress+DNS from ingress-nginx) | Yes (deny+ingress+DNS) | Yes (ingress+egress) | Yes (ingress+egress) | No | No | Yes (ingress) | Yes (ingress+egress) | No | Yes (ingress) | No | No | No | Yes (deny+DNS from ingress-nginx) | No | No |
| automountSAToken: false | Yes | Yes (SA + pod) | Yes | Yes | Yes | Yes | No | Commented out | No | No | Yes | Not set | Not set | Yes | No | Not set |
| PDB | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | Yes | No | No |
| HPA | No | Yes | Yes | Yes | No | Yes | No | No | No | Yes | No | No | No | No | No | No |
| ConfigMap | Yes (comprehensive) | Yes (comprehensive) | Yes (comprehensive) | Yes (comprehensive) | Yes (defaultMode 0444) | No | No | No | Yes (with security headers) | Yes (comprehensive) | Yes (port 8080, server_tokens off) | Yes | Yes | Yes (full nginx.conf) | No | No |
| Nginx security headers | Yes (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, server_tokens off) | Yes (6 headers) | Yes (7 headers) | Yes (7 headers) | Yes | No | No | No | Yes | Yes (4 headers) | No | No | No | Yes (6 headers + server_tokens off) | No | No |
| Rate limiting | No | No | Yes | Yes | No | No | No | No | Yes | Yes | No | No | No | No | No | No |
| PSA namespace labels | Yes (restricted) | Yes (restricted) | No | No | Yes (restricted) | No | No | No | No | No | No | No | No | No | No | No |
| ServiceAccount | Yes | Yes | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
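The "PSA namespace labels" row refers to Pod Security Admission labels set on the Namespace object. A minimal sketch is shown below; the namespace name is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: web-hardened             # hypothetical name
  labels:
    # Reject pods that violate the Restricted profile.
    pod-security.kubernetes.io/enforce: restricted
    # Record violations in the audit log and surface them as client warnings.
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

With enforce set to restricted, the API server rejects non-compliant pods in that namespace at admission time, so a manifest that only looks Restricted on paper fails early instead of running with weaker settings.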
Notable observations:
- Opus 4.8 achieves full PSS Restricted with comprehensive hardening — runAsNonRoot with uid 101, seccompProfile RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem true, automountServiceAccountToken false, dedicated ServiceAccount. ConfigMap replaces nginx.conf with port 8080, security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, server_tokens off). PSA namespace labels with restricted enforce/audit/warn. NetworkPolicy with default-deny plus DNS egress and ingress scoped to ingress-nginx namespace. PDB present. No HPA or rate limiting.
- Opus 4.7 creates a Namespace with PSA restricted enforce/audit/warn labels, one of three models in the table above (with Opus 4.8 and GPT 5.5) to add namespace-level enforcement beyond pod-level security contexts. ConfigMap replaces the entire nginx.conf with comprehensive hardening (temp paths, server_tokens off, 6 security headers). emptyDir volumes with sizeLimit and medium: Memory. ConfigMap mounted with defaultMode: 0444. NetworkPolicy uses default-deny base with ingress from ingress-nginx namespace and DNS egress targeting kube-dns pod selector. Uses app.kubernetes.io/name label convention. Missing: rate limiting, HTTP method restriction, hidden file blocking.
- Opus 4.6 achieves full PSS Restricted with comprehensive ConfigMap (port 8080, 7 security headers), rate limiting, NetworkPolicy (ingress+egress), PDB, HPA, and automountServiceAccountToken: false. Deploys successfully. No PSA namespace labels or dedicated ServiceAccount.
- Sonnet 4.6 finally gets it right here — proper non-root (uid 101), ConfigMap with port 8080 and PID in /tmp, comprehensive security headers, rate limiting, memory-backed emptyDirs with size limits. The most complete security implementation.
- GPT 5.4 has perfect security settings on paper but critically fails to provide the ConfigMap needed to make nginx listen on port 8080. Acknowledges this in a caveat note — but the manifest as-delivered does not work.
- Gemini 3 Flash took the smartest approach: used nginxinc/nginx-unprivileged:stable-alpine which natively runs non-root on 8080, so no ConfigMap is needed. But missed seccompProfile and some features.
- MiniMax M2.5 used nginx:1.21 (pinned version, good) but ran on port 80 as non-root. Placed topologySpreadConstraints under affinity (invalid placement). Applied PSS enforce label to namespace (good). Left automountServiceAccountToken: false commented out (half-hearted).
- DeepSeek V3.2 provided a comprehensive ConfigMap but had limit_req_zone inside the server block in default.conf (invalid: it must be in the http block). Also included deprecated seccomp annotation alongside the modern field. Added NET_BIND_SERVICE (unnecessary if they’d used port 8080).
- Qwen 3.6 Plus redeems itself completely here: full PSS Restricted with uid 101, ConfigMap for port 8080 with server_tokens off, drop ALL capabilities (no NET_BIND needed), readOnlyRootFilesystem with 5 emptyDir volumes (cache, run, tmp, log), automountServiceAccountToken: false. Clean implementation referencing CIS Benchmark and NSA/CISA guidelines. No NetworkPolicy, PDB, or security headers; functional but not the most feature-rich.
- DeepSeek V4 Pro achieves full PSS Restricted: nginx:latest with ConfigMap for port 8080, runAsNonRoot with uid 101, seccompProfile RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem true, emptyDir volumes for cache and run directories. Learns from its Production failure and gets all the non-root pieces right. No NetworkPolicy, automountServiceAccountToken, PDB, HPA, security headers, or rate limiting.
- GPT 5.5 achieves full PSS Restricted with uid 101, seccompProfile RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem, automountServiceAccountToken false, and PSA namespace labels (restricted enforce/audit/warn) — the second model (after Opus 4.7) to add cluster-level enforcement. ConfigMap for port 8080 with defaultMode 0444. Security headers present. startupProbe in addition to liveness+readiness. topologySpreadConstraints and PDB for availability. ephemeral-storage limits. A comprehensive hardened deployment that matches or exceeds most models on feature coverage.
- DeepSeek V4 Flash achieves full PSS Restricted — uid 101, readOnlyRootFilesystem, seccomp RuntimeDefault, drop ALL capabilities (+NET_BIND_SERVICE), emptyDir volumes for /tmp, /var/cache/nginx, /var/run. ConfigMap for port 8080. Pod anti-affinity for scheduling. ClusterIP service. No NetworkPolicy, automountServiceAccountToken, PDB, HPA, security headers, or rate limiting — functional but not feature-rich.
- Kimi K2.6 achieves full PSS Restricted with comprehensive security: uid 101, seccomp RuntimeDefault, drop ALL capabilities, readOnlyRootFilesystem, automountServiceAccountToken false. Full nginx.conf ConfigMap with 6 security headers and server_tokens off. NetworkPolicy with default-deny and DNS egress from ingress-nginx namespace. PDB present. However, `try_files $uri $uri/ =404;` in nginx.conf causes the root path to return 404, which fails the startup probe. Pod runs but never becomes Ready.
- Qwen-35b (Local) produces a genuinely good hardened deployment: runAsNonRoot with uid 101, seccompProfile RuntimeDefault, allowPrivilegeEscalation false, drop ALL capabilities (adds NET_BIND_SERVICE, the only thing preventing full PSS Restricted), readOnlyRootFilesystem true. Uses `nginx:1.25.3-alpine` (a good pinned image choice). Does NOT use `nginxinc/nginx-unprivileged`; instead runs standard nginx as uid 101 on port 80 (valid alternative). Liveness and readiness probes present. Resource limits set. No ConfigMap, NetworkPolicy, PDB, HPA, automountServiceAccountToken, or security headers. A clean, minimal hardened deployment that works.
- Gemma 4 31B (Local) has good security intent in the hardened deployment: drop ALL capabilities, readOnlyRootFilesystem true, allowPrivilegeEscalation false, seccompProfile RuntimeDefault, volume mounts for cache/run/tmp, and resource limits. However, `runAsNonRoot: false` means it runs as root (uid 0). With ALL capabilities dropped, nginx:latest cannot chown `/var/cache/nginx/client_temp`, the same failure mode as Claude Sonnet 4.6 and Qwen 3.6 Plus in production. The model recommended using `nginxinc/nginx-unprivileged` in its comments but did not implement it. Falls into port 80 without ConfigMap. No ConfigMap, NetworkPolicy, PDB, HPA, automountServiceAccountToken, or security headers.
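The settings that recur across the passing hardened manifests can be collected into one sketch. This is an illustration only, not any model's actual output. It assumes the `nginxinc/nginx-unprivileged` image, which listens on 8080 and runs as uid 101 by default; the names `web` and `app: web`, the image tag, and the resource values are placeholders, and the manifest has not been run against the report's Kind cluster.

```yaml
# Sketch of a Deployment + Service using the hardening settings listed above.
# All names, the image tag, and resource values are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      automountServiceAccountToken: false   # no API token mounted into the pod
      securityContext:                      # pod-level settings
        runAsNonRoot: true
        runAsUser: 101                      # nginx user in the image
        runAsGroup: 101
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged:stable-alpine   # assumed tag
          ports:
            - containerPort: 8080
          securityContext:                  # container-level settings
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 250m
              memory: 128Mi
          readinessProbe:
            httpGet:
              path: /
              port: 8080
          livenessProbe:
            httpGet:
              path: /
              port: 8080
          volumeMounts:
            - name: tmp                     # writable scratch space on a read-only root
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```

The unprivileged image keeps its pid file and temp paths under `/tmp`, so a single emptyDir is usually enough with a read-only root filesystem. Other images may also need writable mounts at `/var/cache/nginx` and `/var/run`, as several of the tested manifests provide.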
Overall Scoring
Usability Score (out of 5)
| Model | Basic | Production | Hardened | Total | Average |
|---|---|---|---|---|---|
| Claude Opus 4.8 | 5 | 5 | 5 | 15 | 5.0 |
| Claude Opus 4.7 | 5 | 5 | 5 | 15 | 5.0 |
| Claude Opus 4.6 | 5 | 5 | 5 | 15 | 5.0 |
| Claude Sonnet 4.6 | 1 | 1 | 5 | 7 | 2.3 |
| GPT 5.5 | 5 | 5 | 5 | 15 | 5.0 |
| GPT 5.4 | 5 | 1 | 1 | 7 | 2.3 |
| Gemini 3 Flash | 5 | 5 | 5 | 15 | 5.0 |
| Kimi K2.6 | 1 | 5 | 3 | 9 | 3.0 |
| MiniMax M2.5 | 5 | 1 | 4 | 10 | 3.3 |
| MiniMax M2.7 | 5 | 1 | 5 | 11 | 3.7 |
| DeepSeek V3.2 | 5 | 5 | 5 | 15 | 5.0 |
| Qwen 3.6 Plus | 5 | 1 | 5 | 11 | 3.7 |
| DeepSeek V4 Pro | 5 | 1 | 5 | 11 | 3.7 |
| DeepSeek V4 Flash | 5 | 1 | 5 | 11 | 3.7 |
| Qwen-35b (Local) | 5 | 2 | 4 | 11 | 3.7 |
| Gemma 4 31B (Local) | 5 | 5 | 1 | 11 | 3.7 |
Scoring key: 5=deploys and works perfectly, 4=deploys with minor issues, 3=deploys but significant issues, 1=does not deploy (CrashLoopBackOff/probe failure)
Security Score (out of 5)
| Model | Basic | Production | Hardened | Total | Average |
|---|---|---|---|---|---|
| Claude Opus 4.8 | 1 | 4 | 5 | 10 | 3.3 |
| Claude Opus 4.7 | 1 | 5 | 5 | 11 | 3.7 |
| Claude Opus 4.6 | 1 | 5 | 5 | 11 | 3.7 |
| Claude Sonnet 4.6 | 3 | 3 | 5 | 11 | 3.7 |
| GPT 5.5 | 1 | 5 | 5 | 11 | 3.7 |
| Kimi K2.6 | 1 | 5 | 5 | 11 | 3.7 |
| GPT 5.4 | 1 | 3 | 5 | 9 | 3.0 |
| Gemini 3 Flash | 1 | 2 | 4 | 7 | 2.3 |
| MiniMax M2.5 | 1 | 2 | 4 | 7 | 2.3 |
| MiniMax M2.7 | 1 | 3 | 5 | 9 | 3.0 |
| DeepSeek V3.2 | 1 | 4 | 5 | 10 | 3.3 |
| Qwen 3.6 Plus | 1 | 3 | 5 | 9 | 3.0 |
| DeepSeek V4 Pro | 1 | 3 | 5 | 9 | 3.0 |
| DeepSeek V4 Flash | 1 | 3 | 5 | 9 | 3.0 |
| Qwen-35b (Local) | 1 | 2 | 4 | 7 | 2.3 |
| Gemma 4 31B (Local) | 1 | 1 | 3 | 5 | 1.7 |
Scoring key: 5=PSS Restricted compliant, 4=near-Restricted (missing 1-2 items), 3=significant hardening but fails Restricted, 2=minimal hardening, 1=no security context
Combined Score (Usability + Security, out of 10)
| Model | Basic | Production | Hardened | Total | Average |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 6 | 10 | 10 | 26 | 8.7 |
| GPT 5.5 | 6 | 10 | 10 | 26 | 8.7 |
| Claude Opus 4.6 | 6 | 10 | 10 | 26 | 8.7 |
| Claude Opus 4.8 | 6 | 9 | 10 | 25 | 8.3 |
| DeepSeek V3.2 | 6 | 9 | 10 | 25 | 8.3 |
| Gemini 3 Flash | 6 | 7 | 9 | 22 | 7.3 |
| Qwen 3.6 Plus | 6 | 4 | 10 | 20 | 6.7 |
| DeepSeek V4 Pro | 6 | 4 | 10 | 20 | 6.7 |
| DeepSeek V4 Flash | 6 | 4 | 10 | 20 | 6.7 |
| MiniMax M2.7 | 6 | 4 | 10 | 20 | 6.7 |
| Kimi K2.6 | 2 | 10 | 8 | 20 | 6.7 |
| Claude Sonnet 4.6 | 4 | 4 | 10 | 18 | 6.0 |
| Qwen-35b (Local) | 6 | 4 | 8 | 18 | 6.0 |
| MiniMax M2.5 | 6 | 3 | 8 | 17 | 5.7 |
| GPT 5.4 | 6 | 4 | 6 | 16 | 5.3 |
| Gemma 4 31B (Local) | 6 | 6 | 4 | 16 | 5.3 |
Key Findings
Co-Leaders: Claude Opus 4.7, GPT 5.5, and Opus 4.6 (8.7 average)
- All three achieve perfect 10/10 on Production and Hardened scenarios with all 3 manifests deploying successfully
- All three reach PSS Restricted at Production level — joined by Kimi K2.6 as the only models to achieve Restricted without the explicit “hardened” prompt
- GPT 5.5 joins Opus 4.7 as only the second model to include PSA namespace labels in the Hardened scenario, demonstrating cluster-level enforcement awareness. ConfigMap with defaultMode 0444, topologySpreadConstraints, PDB, startupProbe, and ephemeral-storage limits show comprehensive production knowledge
- Opus 4.7 brings unique strengths: image pinning (`nginx:1.27.2` in Production, the only model to avoid `:latest`), PSA namespace labels in Hardened (cluster-level enforcement, matched only by GPT 5.5), ConfigMap `defaultMode: 0444`, emptyDir with Memory medium + sizeLimit
- Opus 4.6 retains advantages in: rate limiting, HTTP method restriction, hidden file blocking, more security headers (7 vs 6)
- Both show excellent prompt sensitivity: minimal response for basic prompt, comprehensive for production/hardened
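The PSA namespace labels credited to Opus 4.7 and GPT 5.5 are what turn PSS Restricted from a property of one manifest into something the cluster enforces. A minimal sketch of such a namespace follows; the namespace name `web-hardened` is a hypothetical placeholder, and the exact labels each model emitted are not reproduced here.

```yaml
# Namespace with Pod Security Admission labels for the Restricted profile.
# The namespace name is an illustrative assumption.
apiVersion: v1
kind: Namespace
metadata:
  name: web-hardened
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/audit: restricted     # record violations in the audit log
    pod-security.kubernetes.io/warn: restricted      # return warnings to the client
```

With `enforce` set, a pod that violates the Restricted profile is rejected at admission, so a manifest that only looks hardened fails at creation time.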
Previous Leader: DeepSeek V3.2 (8.3 average)
- All 3 manifests deploy successfully
- Strong security defaults even without explicit hardening prompts
- Practical choices: switched to alpine image, pinned version, ran as non-root
- Only weaknesses: probe targets `/healthz` (needs ConfigMap), missing seccomp in production
Best Security (when it works): Claude Sonnet 4.6
- Hardened manifest is the gold standard: comprehensive ConfigMap, security headers, rate limiting, NetworkPolicy with RFC1918 blocking, memory-backed emptyDirs with size limits
- But 2 out of 3 manifests don’t deploy — the capability issue is a serious, repeated error
- Over-engineers every response (6-10 objects when 2 were asked for)
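The capability issue comes from combining `drop: ["ALL"]` with stock nginx running as root: at startup the master process chowns its temp directories and then switches workers to the nginx user. The fragment below sketches the capability set commonly cited for that setup. The list is an assumption based on general nginx guidance, not something verified on the report's cluster, and a container configured this way would still not satisfy PSS Restricted because it runs as root.

```yaml
# Container-level securityContext fragment for stock nginx running as root on port 80.
# The capability list is an assumption, not taken from any tested manifest.
securityContext:
  runAsUser: 0
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    add:
      - CHOWN             # chown of the /var/cache/nginx temp directories at startup
      - SETGID            # switch worker processes to the nginx group
      - SETUID            # switch worker processes to the nginx user
      - NET_BIND_SERVICE  # bind to port 80
```

Running as a non-root user on an unprivileged port avoids the need for any of these additions.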
Most Reliable: Gemini 3 Flash
- Deploys all 3 manifests successfully using pragmatic approaches
- Uses `nginxinc/nginx-unprivileged` for hardened scenario, which avoids the entire root/port/capability problem
- Weaker on security features (no seccomp) but consistently simple and correct
Qwen 3.6 Plus, DeepSeek V4 Pro, DeepSeek V4 Flash, MiniMax M2.7, and Kimi K2.6: Tied at 6.7 average
- All five score 20 total but with different failure patterns
- Qwen 3.6 Plus falls into the same chown trap as Sonnet 4.6 (drop ALL + NET_BIND_SERVICE, root on port 80). Knows the right answer (mentions non-root with port 8080 in notes) but doesn’t implement it. Hardened response is excellent — full PSS Restricted with ConfigMap, uid 101, automountServiceAccountToken: false
- DeepSeek V4 Pro takes a different path to the same Production failure: runs as uid 101 with drop ALL capabilities (good intent) but omits emptyDir volumes for nginx cache directories. UID 101 cannot create subdirectories in `/var/cache/nginx`. Hardened response is fully PSS Restricted with ConfigMap and emptyDir volumes, learning from its own Production mistake
- DeepSeek V4 Flash falls into the same chown trap as Qwen 3.6 Plus and Sonnet 4.6: drops ALL capabilities, adds NET_BIND_SERVICE, runs as root on port 80. Hardened response is fully PSS Restricted with uid 101, ConfigMap for port 8080, emptyDir volumes, seccomp RuntimeDefault, and pod anti-affinity
- Kimi K2.6 has the most unusual failure pattern: Basic response was HTML (not YAML at all), Production achieves full PSS Restricted and deploys perfectly, but Hardened fails due to a `try_files` bug in the nginx.conf causing 404 on root path. The only model to fail Basic while passing Production
- MiniMax M2.7 fixes M2.5’s structural YAML errors but production manifest has runAsNonRoot without runAsUser, so Kubernetes rejects the pod
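Two of the Production failures above come down to small omissions: a missing numeric `runAsUser` alongside `runAsNonRoot` (MiniMax M2.7) and missing writable volumes for a non-root nginx (DeepSeek V4 Pro). The fragment below sketches both fixes together. It is a hypothetical pod template fragment rather than a corrected copy of either model's output, the volume names are placeholders, and the listen port still has to be moved to an unprivileged port such as 8080, which is not shown.

```yaml
# Pod template fragment (spec.template.spec of a Deployment). Names are illustrative.
securityContext:
  runAsNonRoot: true
  runAsUser: 101          # numeric UID lets the kubelet verify the non-root requirement
  runAsGroup: 101
containers:
  - name: nginx
    image: nginx:1.27.2   # pinned tag mentioned in the report
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
      - name: cache       # nginx creates client_temp, proxy_temp, etc. here
        mountPath: /var/cache/nginx
      - name: run         # pid file location
        mountPath: /var/run
      - name: tmp
        mountPath: /tmp
volumes:
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
  - name: tmp
    emptyDir: {}
```

Without `runAsUser`, an image whose default user is root cannot satisfy `runAsNonRoot: true`, so the container is refused before it starts.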
Second Local Model: Gemma 4 31B (5.3 average, tied with GPT 5.4)
- A 31B dense model running locally on LM Studio — the second local/self-hosted model tested
- Basic and Production deployments are minimal but functional — no security context, no resource limits in basic, clean deployable YAML
- Hardened deployment has good security intent (drop ALL capabilities, readOnlyRootFilesystem, allowPrivilegeEscalation false, emptyDir volumes for cache/run/tmp, seccompProfile RuntimeDefault) but critical flaw: `runAsNonRoot: false` with ALL capabilities dropped causes the same chown failure as Claude Sonnet 4.6 and Qwen 3.6 Plus in their production manifests
- The model explicitly recommended `nginxinc/nginx-unprivileged` in its comments but did not implement it, demonstrating knowledge of the correct solution without applying it
- No ConfigMap, NetworkPolicy, PDB, HPA, security headers, or automountServiceAccountToken across any scenario
- Weakest security overall among the models tested, with the lowest security average (1.7/5)
First Local Model: Qwen-35b (6.0 average, tied with Sonnet 4.6)
- A 35B-parameter MoE model running locally on LM Studio — the first local/self-hosted model tested
- Production deployment has a novel failure mode: securityContext and probes nested under invalid YAML keys (“container security context:” and “health checks:” instead of proper Kubernetes field names). The model knows the right settings but generates structurally invalid YAML
- Hardened deployment is genuinely good — runAsNonRoot, uid 101, seccomp RuntimeDefault, readOnlyRootFilesystem, drop ALL capabilities. Only NET_BIND_SERVICE prevents PSS Restricted
- Uses `nginx:1.25.3-alpine` (good pinned image choice) and runs as uid 101 without nginxinc/nginx-unprivileged (valid alternative)
- Competitive with several cloud-hosted models despite being a much smaller local model
Most Improved Across Prompts: All models
- Every model produced meaningfully better security when explicitly asked for hardening
- The jump from “production” to “hardened” was more significant than from “basic” to “production”
Worst Bug: GPT 5.4 Hardened
- Specifies port 8080 everywhere but provides no ConfigMap — the manifest is guaranteed to fail
- The model even acknowledges this in a caveat but doesn’t fix it
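A manifest that declares port 8080 also needs nginx itself to listen there, typically through a ConfigMap mounted over the default server block. The sketch below shows one way to do that. The ConfigMap name `nginx-conf`, the `/healthz` location, and the volume name are illustrative assumptions, not taken from any tested manifest; `/etc/nginx/conf.d/default.conf` is the stock image's default server config.

```yaml
# ConfigMap that moves the stock nginx server block to port 8080.
# Names and the /healthz endpoint are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  default.conf: |
    server {
      listen 8080;
      server_name _;
      location / {
        root /usr/share/nginx/html;
        index index.html;
      }
      # Dedicated probe target that does not depend on site content
      location = /healthz {
        access_log off;
        return 200 "ok\n";
      }
    }
---
# Pod template fragment (spec.template.spec) that mounts the file above.
containers:
  - name: nginx
    image: nginx:1.27.2
    ports:
      - containerPort: 8080
    volumeMounts:
      - name: nginx-conf
        mountPath: /etc/nginx/conf.d/default.conf
        subPath: default.conf
        readOnly: true
volumes:
  - name: nginx-conf
    configMap:
      name: nginx-conf
      defaultMode: 0444
```

The second document is a fragment for merging into a Deployment, not something to apply on its own.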
Structural Error: MiniMax M2.5 Production
- Places `capabilities` under pod-level `securityContext`, which is invalid Kubernetes YAML
- Kubernetes silently ignores it, so the capabilities are never actually dropped
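For reference, `capabilities` is a field of the container-level `securityContext`; the pod-level `securityContext` does not define it. A minimal sketch of the intended placement, with an illustrative container name and image tag:

```yaml
# Pod template fragment (spec.template.spec). Container name and tag are illustrative.
securityContext:              # pod-level: applies to every container in the pod
  runAsNonRoot: true
  runAsUser: 101
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: nginx
    image: nginx:1.27.2
    securityContext:          # container-level: capabilities belong here
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

Fields such as `allowPrivilegeEscalation` and `readOnlyRootFilesystem` are likewise container-level only, so the same placement rule applies to them.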
Notable Good Results
- Gemini 3 Flash (Hardened): Uses `nginxinc/nginx-unprivileged`, the smartest approach. Avoids the entire root/port/capability problem that tripped up other models.
- DeepSeek V3.2 (Production): One of the few models to produce a fully functional, security-hardened production deployment without the hardened prompt. Demonstrates strong baseline security awareness.
- Claude Sonnet 4.6 (Hardened): Most thorough security implementation: 17-row security feature table, RFC1918 egress blocking, memory-backed tmpfs with size limits, init container for config validation.
- Qwen 3.6 Plus (Hardened): Clean redemption after Production failure. Full PSS Restricted with ConfigMap for port 8080, uid 101, server_tokens off, 5 emptyDir volumes, and automountServiceAccountToken: false. References CIS Benchmark and NSA/CISA guidelines.
- DeepSeek V4 Pro (Hardened): Full PSS Restricted compliant with ConfigMap for port 8080, uid 101, seccomp RuntimeDefault, emptyDir volumes for cache and run directories. Correctly addresses the non-root nginx pitfall that caused its own Production failure.
Notable Bad Results
- Claude Sonnet 4.6 (Basic & Production): Drops ALL capabilities without understanding nginx:latest needs `CHOWN`. The production version runs as `runAsUser: 0` despite configuring port 8080 (which doesn’t need root). Self-contradictory comments in the YAML.
- GPT 5.4 (Hardened): Perfect security settings on a manifest that doesn’t work. The model knows this (says “assumes you will provide an nginx config”) but delivers non-functional YAML anyway.
- MiniMax M2.5 (Production): Invalid YAML structure (`capabilities` under pod securityContext) demonstrates incomplete Kubernetes API knowledge.
- GPT 5.4 (Basic): Zero security features for a basic prompt. Not even resource limits. While technically correct for the prompt, it’s the least production-aware default.
- Qwen 3.6 Plus (Production): Same chown trap as Sonnet 4.6: drops ALL capabilities, adds NET_BIND_SERVICE, runs as root on port 80. The model’s own text notes recommend using non-root with port 8080, demonstrating it knows the right answer but fails to implement it.
- Gemma 4 31B (Hardened): Drops ALL capabilities while keeping `runAsNonRoot: false` (root user), causing the same CrashLoopBackOff chown failure as Sonnet 4.6. The model explicitly recommended using `nginxinc/nginx-unprivileged` in its own comments but failed to implement the recommendation. Weakest security profile of any model tested.