Report Card: Hardened Cluster Creation

Test type: Secure Cluster Creation
Original date: 2026-03-09
Re-run date: 2026-03-10 (4 failed models re-run with additional guidance)
Claude Opus 4.6 added: 2026-03-25 | MiniMax M2.7 added: 2026-03-28 | Claude Opus 4.7 added: 2026-04-20
Qwen 3.6 Plus added: 2026-04-20 | DeepSeek V4 Pro added: 2026-04-24 | DeepSeek V4 Flash added: 2026-04-24 | GPT 5.5 added: 2026-04-25
Kimi K2.6 added: 2026-04-26 | Qwen3.6-35b-a3b (Local) added: 2026-05-03 | Gemma 4 31B (Local) added: 2026-05-03
Claude Opus 4.8 added: 2026-05-31 | Qwen 3.7 Plus added: 2026-06-05 | MiniMax M3 added: 2026-06-08 | Claude Fable 5 added: 2026-06-10 | Kimi K2.7 Code added: 2026-06-16 | GLM-5.2 added: 2026-06-17 | Mistral Medium 3.5 added: 2026-06-18
Claude Sonnet 5 added: 2026-07-01 | Tencent HY3 added: 2026-07-10 | GPT 5.6 Terra added: 2026-07-10
GPT 5.6 Sol added: 2026-07-14 | Kimi K3 added: 2026-07-16 | Xiaomi MiMo v2.5 added: 2026-07-21
Scenario: Create a hardened Kubernetes cluster using Kind with comprehensive security controls (audit logging, PSS, network policies, API server hardening, kubelet hardening, etc.)
Timeout: 600 seconds (10 minutes)

Kimi K3 (2026-07-16)

Result: SUCCESS (33/40)

Approach: Task plan upfront (6 TODOs), wrote 5 config files, cluster succeeded first attempt, applied namespaces/policies, thorough verification.

Security features implemented:

Audit logging: Correct two-level mount pattern with rotation.
Encryption at rest: aescbc encryption verified in etcd.
PSA: Three-tier enforcement — dev=privileged, test=baseline+restricted, prod=restricted.
Network policies: Default-deny with DNS exception.
API server: profiling=false, TLS min version 1.2.
Kubelet: No hardening — no KubeletConfiguration.
Additional controls: None beyond the above.
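
The "two-level mount" credited above means the audit policy is exposed twice: from the host into the Kind node container (extraMounts), and then from the node into the kube-apiserver static pod (extraVolumes). The sketch below renders one plausible Kind config for this from Python. The file paths, rotation values, and the render_kind_config helper are illustrative assumptions, not the files the model actually wrote, and the extraArgs layout assumes the map form used by kubeadm v1beta3.

```python
from textwrap import dedent


def render_kind_config(policy_host_path: str = "./audit-policy.yaml") -> str:
    """Return Kind cluster config text (YAML) with audit logging enabled.

    Level 1: extraMounts exposes the host policy file inside the node container.
    Level 2: apiServer.extraVolumes mounts the node directories into the
             kube-apiserver static pod.
    """
    return dedent(f"""\
        kind: Cluster
        apiVersion: kind.x-k8s.io/v1alpha4
        nodes:
        - role: control-plane
          extraMounts:
          - hostPath: {policy_host_path}
            containerPath: /etc/kubernetes/policies/audit-policy.yaml
            readOnly: true
          kubeadmConfigPatches:
          - |
            kind: ClusterConfiguration
            apiServer:
              extraArgs:
                audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
                audit-log-path: /var/log/kubernetes/kube-apiserver-audit.log
                # Rotation values are assumed for illustration.
                audit-log-maxage: "30"
                audit-log-maxbackup: "10"
                audit-log-maxsize: "100"
              extraVolumes:
              - name: audit-policies
                hostPath: /etc/kubernetes/policies
                mountPath: /etc/kubernetes/policies
                readOnly: true
                pathType: DirectoryOrCreate
              - name: audit-logs
                hostPath: /var/log/kubernetes
                mountPath: /var/log/kubernetes
                readOnly: false
                pathType: DirectoryOrCreate
        """)


if __name__ == "__main__":
    print(render_kind_config())
```

Saving the output as kind-config.yaml and passing it to kind create cluster --config would be the usual next step.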

Category scores:

Cluster Creation: 5/5 (first-attempt success)
Audit Logging: 5/5 (correct two-level mount with rotation)
PSS: 5/5 (three-tier enforcement)
Network Policies: 5/5 (default-deny with DNS exception)
API Server Hardening: 4/5 (profiling=false, TLS 1.2 min)
Kubelet Hardening: 0/5 (no KubeletConfiguration)
Additional Controls: 4/5 (encryption at rest verified in etcd)
Agent Behaviour: 5/5 (first-attempt cluster creation, thorough verification)
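
The headline result is the sum of the eight category scores, each marked out of 5. The short tally below reproduces the figure for this run; the dictionary and variable names are only for illustration.

```python
# Category scores for the Kimi K3 run, copied from the list above.
scores = {
    "Cluster Creation": 5,
    "Audit Logging": 5,
    "PSS": 5,
    "Network Policies": 5,
    "API Server Hardening": 4,
    "Kubelet Hardening": 0,
    "Additional Controls": 4,
    "Agent Behaviour": 5,
}

total = sum(scores.values())
maximum = 5 * len(scores)  # eight categories, five points each
print(f"{total}/{maximum}")  # prints "33/40"
```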

Notable: First-attempt cluster creation. One of approximately 5 models to implement encryption at rest. Zero kubelet hardening (no KubeletConfiguration). Completed within timeout (SUCCESS). Thorough verification including live network tests and etcd inspection.

Xiaomi MiMo v2.5 (2026-07-21)

Result: TIMEOUT (23/40)

Approach: Created a TODO plan and wrote configuration files (audit-policy, namespaces, network-policies, hardening), then spent the overwhelming majority of the time budget struggling to create the Kind cluster — iterating through eight different kind-config files (kind-config, minimal, audit, hardened, v2, v3, v4, final), repeatedly deleting and recreating, and even pausing to inspect host resources. The cluster finally came up on the “final” config (confirmed by a 4.3 MB audit.log), after which the agent applied namespaces, network policies, and hardening resources and began verification — but timed out mid-verification. It never tested PSS enforcement with a privileged pod, never verified network policies with live traffic, and never cleaned up.

Security features implemented:

Audit logging: Granular policy (RequestResponse for pods, Metadata for secrets/configmaps, Metadata catch-all with omitStages), rotation configured, correct two-level mount. 4.3 MB audit.log confirmed.
PSA: Three-tier — development=baseline enforce + restricted audit/warn, test=restricted enforce, production=restricted enforce. Applied and label-verified; no privileged-pod rejection test (timed out).
Network policies: Default-deny ingress+egress on both test and production, DNS egress exception, kube-system + intra-namespace allowances. Applied but not traffic-verified.
API server: profiling=false, service-account-lookup=true, controller-manager use-service-account-credentials=true. Missing anonymous-auth=false, TLS min version, encryption-provider-config.
Kubelet: No hardening — only ClusterConfiguration patches, no KubeletConfiguration.
Additional controls: ResourceQuotas and LimitRanges for test and production. No encryption at rest. RBAC hardening attempts were ineffective: a Role granting the use verb on PodSecurityPolicy (an API removed in Kubernetes 1.25) and unbound read-only Roles that restrict nothing.
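
A granular audit policy of the kind described above can be sketched as follows. This is an assumed reconstruction from the summary (RequestResponse for pods, Metadata for secrets/configmaps, Metadata catch-all), not the file the agent produced. The build_audit_policy name is hypothetical, and placing omitStages at the top level rather than on the catch-all rule is a guess.

```python
import json


def build_audit_policy() -> dict:
    """Audit policy: full bodies for pods, metadata only for secrets/configmaps."""
    return {
        "apiVersion": "audit.k8s.io/v1",
        "kind": "Policy",
        # Do not emit events at the RequestReceived stage.
        "omitStages": ["RequestReceived"],
        "rules": [
            # Rules are evaluated in order; the first match decides the level.
            {
                "level": "RequestResponse",
                "resources": [{"group": "", "resources": ["pods"]}],
            },
            # Metadata level records who/what/when but not the object body,
            # so secret values never reach the log.
            {
                "level": "Metadata",
                "resources": [
                    {"group": "", "resources": ["secrets", "configmaps"]}
                ],
            },
            # Catch-all for everything else.
            {"level": "Metadata"},
        ],
    }


if __name__ == "__main__":
    # JSON is also valid YAML, so this output can be saved as the policy file.
    print(json.dumps(build_audit_policy(), indent=2))
```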

Category scores:

Cluster Creation: 2/5 (created only after ~7 failed attempts across 8 config iterations; the thrashing caused the timeout)
Audit Logging: 5/5 (granular policy, rotation, two-level mount, 4.3 MB log)
PSS: 4/5 (strict three-tier, applied and label-verified, no enforcement test)
Network Policies: 4/5 (default-deny both namespaces + DNS, applied, not traffic-verified)
API Server Hardening: 3/5 (profiling, service-account-lookup, use-service-account-credentials; missing anon-auth, TLS floor, encryption)
Kubelet Hardening: 0/5 (no KubeletConfiguration)
Additional Controls: 3/5 (ResourceQuotas, LimitRanges; no encryption; ineffective RBAC)
Agent Behaviour: 2/5 (planned and wrote configs, but thrashed on creation, timed out before verification, no cleanup)

Notable: Strong security artefacts undone by execution. The audit, PSS, and network-policy configurations are respectable, but the eight-attempt struggle to create the cluster consumed the time budget and forced a timeout before verification. Had cluster creation gone smoothly, this run would plausibly have reached the high 20s or low 30s. The RBAC portion of the hardening file reflects outdated knowledge (grants use on the removed PodSecurityPolicy resource; defines unbound Roles that impose no restriction).

GPT 5.6 Sol (2026-07-14)

Result: TIMEOUT (30/40)

Approach: Created comprehensive configuration files (kind-config, audit-policy, admission-config, encryption-config, security-resources) for a 3-node cluster. Planned disableDefaultCNI with Calico installation. Spent 354 seconds on configuration files, leaving too little time for cluster creation and hardening application. Timed out at 600s.

Security features implemented:

Encryption at rest: aesgcm encryption (not aescbc) — authenticated encryption providing both confidentiality and integrity.
PSA: Dual-layer enforcement — cluster-wide baseline via AdmissionConfiguration + namespace-level restricted enforcement on designated namespaces.
Network policies: Default-deny NetworkPolicies with DNS egress allowed.
Kubelet: Comprehensive hardening on all nodes — anonymous auth disabled, Webhook authorization, readOnlyPort 0, TLS configuration.
Additional controls: LimitRanges and ResourceQuotas configured.
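
The kubelet settings listed above map onto fields of a KubeletConfiguration object. A minimal sketch follows, assuming it is supplied as a kubeadm config patch in the Kind config; the build_kubelet_config helper is hypothetical, and the TLS floor of 1.2 is an assumption because the run only reports "TLS configuration".

```python
import json


def build_kubelet_config() -> dict:
    """KubeletConfiguration covering the four hardening items in the summary."""
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        "authentication": {
            # Reject unauthenticated requests to the kubelet API.
            "anonymous": {"enabled": False},
            "webhook": {"enabled": True},
        },
        # Delegate authorization decisions to the API server.
        "authorization": {"mode": "Webhook"},
        # 0 disables the unauthenticated read-only port.
        "readOnlyPort": 0,
        # Assumed value; the report does not state the exact TLS settings.
        "tlsMinVersion": "VersionTLS12",
    }


if __name__ == "__main__":
    print(json.dumps(build_kubelet_config(), indent=2))
```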

Missing: Cluster never reached running state — disableDefaultCNI was configured without confirmed Calico installation. Time budget exhausted on configuration.

Category scores:

Cluster Creation: 3/5 (comprehensive config but timed out, disableDefaultCNI without confirmed Calico)
Audit Logging: 4/5 (comprehensive policy, two-level mount, never verified)
PSS: 4/5 (dual-layer — cluster-wide baseline + namespace restricted)
Network Policies: 4/5 (default-deny with DNS egress)
API Server Hardening: 4/5 (encryption at rest with aesgcm, admission plugins)
Kubelet Hardening: 5/5 (comprehensive hardening on all nodes)
Additional Controls: 4/5 (LimitRanges and ResourceQuotas)
Agent Behaviour: 2/5 (spent 354s on config files leaving too little time for execution)

Notable: Used aesgcm instead of aescbc for encryption at rest. aesgcm provides authenticated encryption (confidentiality + integrity), whereas aescbc provides only confidentiality, which makes this a more sophisticated encryption choice than that of any other model. However, spending 354 seconds on configuration files left insufficient time for cluster creation, and configuring disableDefaultCNI without a confirmed Calico installation path was a risk that contributed to the timeout.
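
For reference, the two providers differ only in the provider key of the EncryptionConfiguration. The sketch below generates such a config with a fresh random key; the build_encryption_config helper and the key name key1 are assumptions for illustration, and the model's actual file was not inspected here.

```python
import base64
import json
import os


def build_encryption_config(provider: str = "aesgcm") -> dict:
    """EncryptionConfiguration that encrypts Secrets with the given provider."""
    if provider not in ("aesgcm", "aescbc"):
        raise ValueError(f"unsupported provider: {provider}")
    # A 32-byte key is accepted by both providers.
    key = base64.b64encode(os.urandom(32)).decode("ascii")
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "EncryptionConfiguration",
        "resources": [
            {
                "resources": ["secrets"],
                "providers": [
                    # The first provider is used for writes.
                    {provider: {"keys": [{"name": "key1", "secret": key}]}},
                    # identity allows reading data stored before encryption.
                    {"identity": {}},
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_encryption_config(), indent=2))
```

The resulting file is handed to the API server through the encryption-provider-config flag and has to be mounted into the control-plane node in the same way as the audit policy.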

Claude Opus 4.8 (2026-05-31)

Result: SUCCESS (37/40)

Approach: Created configuration files and cluster, which was created successfully on the first attempt. Discovered that the default CNI does not enforce NetworkPolicy, deleted the cluster and recreated it with Calico CNI. Applied namespaces with PSA labels, network policies, and verified PSA enforcement and network policy isolation.

Security features implemented:

Audit logging: Correct two-level mount pattern. Comprehensive policy — pods at RequestResponse, secrets/configmaps at Metadata (avoids logging values), RBAC resources included, network policies covered. Log rotation configured (30 days, 10 backups, 100 MB max).
PSA: Cluster-wide restricted default via AdmissionConfiguration with system namespace exemptions. dev=baseline enforce + restricted audit/warn. test+production=restricted enforce. The strongest PSS configuration tier.
Network policies: Default deny ingress+egress on test and production namespaces. DNS egress correctly scoped to kube-system.
API server: anonymous-auth=false, profiling=false, NodeRestriction admission plugin, encryption at rest (AES-CBC), TLS min version 1.2 with strong cipher suites, service-account-lookup=true.
Kubelet: readOnlyPort: 0, anonymous auth disabled, Webhook authorization, TLS min version 1.2 with strong cipher suites. Correctly avoided protectKernelDefaults and seccomp-default (Kind-incompatible).
Controller-manager/scheduler: Profiling disabled on both.
Additional controls: Calico CNI for NetworkPolicy enforcement, encryption at rest.
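
The cluster-wide restricted default mentioned above is configured through an AdmissionConfiguration file for the PodSecurity plugin. A minimal sketch follows; the exempted namespace list is an assumption (a Calico install would likely need its own namespaces added), and build_admission_config is a hypothetical helper.

```python
import json


def build_admission_config(
    exempt_namespaces: tuple[str, ...] = ("kube-system", "local-path-storage"),
) -> dict:
    """AdmissionConfiguration making 'restricted' the cluster-wide PSA default."""
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "AdmissionConfiguration",
        "plugins": [
            {
                "name": "PodSecurity",
                "configuration": {
                    "apiVersion": "pod-security.admission.config.k8s.io/v1",
                    "kind": "PodSecurityConfiguration",
                    # Applied to any namespace without its own PSA labels.
                    "defaults": {
                        "enforce": "restricted",
                        "enforce-version": "latest",
                        "audit": "restricted",
                        "audit-version": "latest",
                        "warn": "restricted",
                        "warn-version": "latest",
                    },
                    # System namespaces run privileged components.
                    "exemptions": {
                        "usernames": [],
                        "runtimeClasses": [],
                        "namespaces": list(exempt_namespaces),
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_admission_config(), indent=2))
```

The API server picks this file up through the admission-control-config-file flag.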

Category scores:

Cluster Creation: 4/5 (required recreating cluster for Calico CNI)
Audit Logging: 5/5 (correct two-level mount, comprehensive policy, rotation)
PSS: 5/5 (cluster-wide restricted default, tiered namespace enforcement)
Network Policies: 5/5 (default deny + DNS egress, Calico enforcement verified)
API Server Hardening: 5/5 (anonymous-auth=false, profiling=false, encryption at rest, TLS hardening, service-account-lookup)
Kubelet Hardening: 5/5 (anonymous disabled, webhook authz, readOnlyPort 0, TLS ciphers, correctly avoided Kind pitfalls)
Additional Controls: 4/5 (Calico CNI, encryption at rest, but no ResourceQuotas/LimitRanges)
Agent Behaviour: 4/5 (efficient execution, proactive Calico recreation for NetworkPolicy enforcement)

Notable: The decision to recreate the cluster with Calico after discovering that the default CNI does not enforce NetworkPolicy demonstrates strong operational awareness; most models applied network policies without verifying enforcement. Ties with Opus 4.7 at 37/40. Strong across all security categories, with the only gaps being ResourceQuotas/LimitRanges and SA token restriction.

Claude Sonnet 5 (2026-07-01)

Result: TIMEOUT (36/40)

Approach: Created Kind cluster with Calico CNI. First attempt failed due to anonymous-auth=false causing API server health probe failures. Self-diagnosed, removed anonymous-auth=false, and recreated the cluster successfully on the second attempt. Applied cluster-wide AdmissionConfiguration with per-namespace PSA labels, default-deny network policies with DNS egress and same-namespace policies, ResourceQuotas, LimitRanges, and ServiceAccount token automount disabled on all three namespaces. Timed out during verification — all hardening controls were in place.

Security features implemented:

Audit logging: Comprehensive policy with correct two-level mount pattern.
PSA: Cluster-wide AdmissionConfiguration + per-namespace labels. Tiered enforcement — dev=baseline, test+production=restricted enforce/audit/warn.
Network policies: Default-deny ingress+egress on test and production. DNS egress correctly scoped to kube-system. Same-namespace policies verified.
API server: profiling=false, TLS min version 1.2, Node,RBAC authorization, 6 admission plugins. Missing anonymous-auth=false (removed after first attempt), no encryption at rest.
Kubelet: readOnlyPort=0, eventRecordQPS=0, streamingConnectionIdleTimeout=5m. Missing anonymous auth disable, Webhook authorization.
Additional controls: ResourceQuotas and LimitRanges on all 3 namespaces. ServiceAccount automountToken disabled on all 3 namespaces. Calico CNI for NetworkPolicy enforcement.
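
The per-namespace controls listed above can be sketched as three small manifests per namespace. The object names and every numeric limit below are placeholder assumptions, as are the namespace names in the usage loop; the report does not record the values the model chose.

```python
import json


def build_namespace_controls(namespace: str) -> list[dict]:
    """ResourceQuota, LimitRange and default ServiceAccount for one namespace."""
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        # Placeholder ceilings for the whole namespace.
        "spec": {"hard": {"pods": "20", "requests.cpu": "4", "limits.cpu": "8"}},
    }
    limit_range = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied to containers that declare no limits/requests.
                    "default": {"cpu": "500m", "memory": "256Mi"},
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                }
            ]
        },
    }
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": "default", "namespace": namespace},
        # Pods no longer get an API token unless they opt in.
        "automountServiceAccountToken": False,
    }
    return [quota, limit_range, service_account]


if __name__ == "__main__":
    for ns in ("development", "test", "production"):  # assumed names
        for manifest in build_namespace_controls(ns):
            print(json.dumps(manifest))
```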

Missing: anonymous-auth=false on API server (removed after first-attempt probe failure), encryption at rest, kubelet Webhook authorization.

Category scores:

Cluster Creation: 4/5 (required recreating cluster after anonymous-auth probe issue)
Audit Logging: 5/5 (comprehensive policy, correct two-level mount)
PSS: 5/5 (cluster-wide AdmissionConfiguration + namespace labels, tiered enforcement)
Network Policies: 5/5 (default deny + DNS egress + same-namespace, Calico enforcement verified)
API Server Hardening: 4/5 (profiling disabled, TLS 1.2, Node+RBAC, 6 admission plugins — missing anonymous-auth=false)
Kubelet Hardening: 4/5 (readOnlyPort=0, eventRecordQPS=0, streamingConnectionIdleTimeout — missing anonymous auth disable, Webhook auth)
Additional Controls: 5/5 (ResourceQuotas, LimitRanges, SA token automount disabled — all 3 namespaces)
Agent Behaviour: 4/5 (clean recovery from anonymous-auth probe failure, comprehensive hardening, timed out during verification)

Notable: Sole 6th place at 36/40, between the 37/40 trio (Fable 5, Opus 4.8, Opus 4.7) and GPT 5.5/Qwen-35b at 35/40. Strong across all categories with the main gaps being anonymous-auth=false (removed after the first-attempt probe issue — the same problem Opus 4.7 solved with AuthenticationConfiguration) and encryption at rest. ResourceQuotas, LimitRanges, and SA token restriction on all 3 namespaces demonstrate comprehensive additional controls.

Claude Fable 5 (2026-06-10)

Result: SUCCESS (37/40)

Approach: Created Kind cluster on first attempt with Calico CNI. Applied comprehensive hardening across all layers — audit logging, tiered PSS enforcement, network policies, API server and kubelet hardening. Created LimitRanges and ResourceQuotas for test and production namespaces.

Security features implemented:

Audit logging: Comprehensive policy with noise filtering and rotation. Correct two-level mount pattern.
PSA: Tiered namespace enforcement — dev=baseline, test+production=restricted enforce/audit/warn.
Network policies: Default-deny ingress+egress on test and production. DNS egress correctly scoped to kube-system.
API server: NodeRestriction admission plugin, profiling=false, TLS min version 1.2. Missing anonymous-auth=false.
Kubelet: seccompDefault enabled, streamingConnectionIdleTimeout configured. Missing readOnlyPort=0.
Additional controls: LimitRanges and ResourceQuotas for test and production namespaces.
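
The network-policy pattern described above (default deny in both directions, plus DNS egress scoped to kube-system) can be sketched as two NetworkPolicy objects per namespace. The policy names and the build_network_policies helper are illustrative assumptions; the k8s-app=kube-dns selector assumes the standard CoreDNS labels of a kubeadm-based cluster.

```python
import json


def build_network_policies(namespace: str) -> list[dict]:
    """Default-deny policy plus a DNS egress exception for one namespace."""
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # Empty podSelector selects every pod; no rules means deny all.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    allow_dns = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    # Both selectors in one entry: DNS pods in kube-system only.
                    "to": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": "kube-system"
                                }
                            },
                            "podSelector": {
                                "matchLabels": {"k8s-app": "kube-dns"}
                            },
                        }
                    ],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                }
            ],
        },
    }
    return [default_deny, allow_dns]


if __name__ == "__main__":
    for ns in ("test", "production"):
        for policy in build_network_policies(ns):
            print(json.dumps(policy))
```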

Missing: anonymous-auth=false on API server, encryption at rest, readOnlyPort=0 on kubelet.

Category scores:

Cluster Creation: 5/5 (successful first attempt with Calico CNI)
Audit Logging: 5/5 (comprehensive policy with noise filtering, rotation)
PSS: 4/5 (tiered namespace enforcement, missing cluster-wide AdmissionConfiguration)
Network Policies: 5/5 (default deny + DNS egress, Calico enforcement)
API Server Hardening: 4/5 (NodeRestriction, profiling disabled, TLS 1.2 — missing anonymous-auth=false)
Kubelet Hardening: 4/5 (seccompDefault, streamingConnectionIdleTimeout — missing readOnlyPort=0)
Additional Controls: 4/5 (LimitRanges and ResourceQuotas for test/production)
Agent Behaviour: 5/5 (first-attempt success with Calico, efficient execution)

Notable: Ties with Opus 4.8 and Opus 4.7 at 37/40. First-attempt success with proactive Calico CNI selection for NetworkPolicy enforcement. Strong across all categories with comprehensive audit policy and tiered PSS. The main gaps are anonymous-auth=false on the API server, encryption at rest, and readOnlyPort=0 on the kubelet.

MiniMax M3 (2026-06-08)

Result: SUCCESS (29/40)

Approach: Single-attempt Kind cluster creation with comprehensive kubeadmConfigPatches. Created kind-config.yaml with API server hardening, audit policy, and kubelet config. Applied namespace PSS labels and network policies. 363 seconds, 41 tool calls.

Security features implemented:

Audit logging: Multi-level policy with rotation. Correct two-level mount pattern.
PSA: 3-tier namespace enforcement — dev=baseline, test+production=restricted enforce/audit/warn.
Network policies: Default deny ingress+egress on test and production. DNS egress correctly scoped to kube-system.
API server: anonymous-auth=false, TLS min version 1.2 with strong cipher suites, profiling=false. NodeRestriction admission plugin.
Kubelet: readOnlyPort: 0, anonymous auth disabled, Webhook authorization. Configured via Kind config.
Additional controls: None (no encryption at rest, no ResourceQuotas, no LimitRanges).
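
The API server flags listed above are typically passed through apiServer.extraArgs in a kubeadm ClusterConfiguration patch. The sketch below renders such a patch from a flag dictionary; the helper name is hypothetical, cipher suites are left out for brevity, and the map-style extraArgs layout is assumed (kubeadm v1beta3). Other runs in this report hit health-probe failures with anonymous-auth=false, so that entry may need adjusting depending on the Kubernetes version.

```python
# Flags taken from the summary above; values must be strings for kubeadm.
API_SERVER_ARGS = {
    "anonymous-auth": "false",
    "profiling": "false",
    "tls-min-version": "VersionTLS12",
    "enable-admission-plugins": "NodeRestriction",
}


def render_cluster_configuration_patch(args: dict[str, str]) -> str:
    """Render a kubeadm ClusterConfiguration patch (YAML text) for Kind."""
    lines = ["kind: ClusterConfiguration", "apiServer:", "  extraArgs:"]
    for flag, value in sorted(args.items()):
        # Quote every value so YAML keeps "false" as a string, not a boolean.
        lines.append(f'    {flag}: "{value}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_cluster_configuration_patch(API_SERVER_ARGS))
```

The rendered text would go under kubeadmConfigPatches on the control-plane node in the Kind config.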

Category scores:

Cluster Creation: 5/5 (successful first attempt)
Audit Logging: 4/5 (multi-level policy with rotation, correct two-level mount)
PSS: 4/5 (restricted on test+prod, baseline on dev — missing cluster-wide AdmissionConfiguration)
Network Policies: 3/5 (default deny + DNS egress)
API Server Hardening: 4/5 (anonymous-auth=false, TLS 1.2+, strong ciphers, profiling disabled)
Kubelet Hardening: 3/5 (readOnlyPort 0, anonymous auth disabled, Webhook auth — via Kind config)
Additional Controls: 2/5 (no encryption at rest, no ResourceQuotas, no LimitRanges)
Agent Behaviour: 4/5 (efficient first-attempt success, clean methodical execution)

Notable: First-attempt success — unlike M2.7 which timed out before creating a cluster. Clean methodical execution with 41 tool calls in 363 seconds. Verified PSS enforcement (privileged pod rejected) and network policies (DNS allowed, external/inter-namespace blocked). Missing encryption at rest and ResourceQuotas. A major step forward for the MiniMax family: M2.5 scored 10/40 (timeout), M2.7 scored 20/40 (timeout), and M3 scores 29/40 (success).

Kimi K2.7 Code (2026-06-16)

Result: TIMEOUT (29/40)

Approach: Created cluster with comprehensive audit logging, PSS for test/prod namespaces, network policies. Missing kubelet hardening.

Security features implemented:

Audit logging: Comprehensive policy — pods at RequestResponse, RBAC resources and exec/portforward at RequestResponse, secrets/configmaps at Metadata (avoids logging values).
PSA: Restricted enforce/audit/warn on test and production namespaces. Missing dev namespace PSS labels.
Network policies: Default deny ingress+egress on test and production. DNS egress correctly scoped to kube-system.
API server: TLS cipher suites configured, controller-manager and scheduler localhost binding, NodeRestriction admission plugin.
Kubelet: No hardening — biggest gap in the configuration.
Additional controls: ResourceQuotas and LimitRanges deployed. ServiceAccount automountToken disabled.

Category scores:

Cluster Creation: 4/5
Audit Logging: 5/5
Pod Security Standards: 4/5
Network Policies: 5/5
API Server Hardening: 3/5
Kubelet Hardening: 1/5
Additional Controls: 4/5
Agent Behaviour: 3/5

Notable: Comprehensive audit policy (RequestResponse for pods, RBAC, and exec) and restricted PSS on test and production, although the dev namespace has no PSS labels. Network policies are default-deny with a DNS exception. The biggest gap is the complete absence of kubelet hardening. API server work covers TLS cipher suites, controller-manager/scheduler localhost binding, and NodeRestriction; additional controls include ResourceQuotas, LimitRanges, and disabled SA token automount. Cluster creation succeeded on the first attempt; the timeout was caused by a slow model.

Qwen3.6-35b-a3b — LOCAL (2026-05-03)

Result: SUCCESS (35/40)

Note: This is a LOCAL model (35B-parameter MoE, running on LM Studio). Timeout was extended to 30 minutes (vs 10 minutes standard) to accommodate slower local inference.

Approach: Created configuration files and cluster. Required 3 attempts at cluster creation before succeeding, but once running applied all security configurations correctly and performed verification testing. Cluster name: dh-qwen3-6-35b-a3b-a41177e4.

Security features implemented:

Audit logging: Correct two-level mount pattern. Excellent policy: security-sensitive resources at RequestResponse, secrets at Metadata (avoids logging values), omits health/metrics noise. Audit log actively writing (3.6MB by end).
API server: profiling disabled on API server/controller-manager/scheduler. service-account-lookup true. Comprehensive admission plugins (PodSecurity, NodeRestriction, LimitRanger, ResourceQuota). Used newer AuthenticationConfiguration for per-endpoint anonymous auth.
Kubelet: anonymous auth disabled, webhook auth+authz, readOnlyPort 0, strong TLS cipher suites. Missing streamingConnectionIdleTimeout and rotateCertificates.
PSA: dev: enforce baseline + warn/audit restricted. test+production: enforce restricted. Cluster-wide AdmissionConfiguration defaults to restricted with system namespace exemptions.
Network policies: Default deny ingress+egress on test+production. DNS egress correctly scoped to kube-system/kube-dns on port 53 UDP+TCP.
Additional controls: automountServiceAccountToken false on default SA in all namespaces. ResourceQuotas and LimitRanges on all 3 namespaces. Namespace-scoped RBAC roles. PodDisruptionBudget for CoreDNS.
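
The tiered PSA setup described above comes down to three labels per namespace. A minimal sketch follows; the namespace names and the build_namespace helper are assumptions, and the test and production audit/warn levels are set to restricted here although the summary only states the enforce level.

```python
import json

PSA_PREFIX = "pod-security.kubernetes.io"


def build_namespace(name: str, enforce: str, audit: str, warn: str) -> dict:
    """Namespace manifest carrying Pod Security Admission labels."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                f"{PSA_PREFIX}/enforce": enforce,  # violations are rejected
                f"{PSA_PREFIX}/audit": audit,      # violations are audit-logged
                f"{PSA_PREFIX}/warn": warn,        # violations warn the client
            },
        },
    }


# dev enforces baseline but surfaces restricted violations as warnings.
TIERS = [
    build_namespace("dev", "baseline", "restricted", "restricted"),
    build_namespace("test", "restricted", "restricted", "restricted"),
    build_namespace("production", "restricted", "restricted", "restricted"),
]

if __name__ == "__main__":
    for namespace in TIERS:
        print(json.dumps(namespace))
```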

Missing: streamingConnectionIdleTimeout, rotateCertificates, explicit anonymous-auth=false flag (used AuthenticationConfiguration instead).

Category scores:

Cluster Creation: 4/5 (3 attempts)
Audit Logging: 5/5 (correct two-level mount, excellent policy)
PSS: 5/5 (restricted on test+prod, baseline on dev, cluster-wide defaults)
Network Policies: 5/5 (default deny + DNS egress)
API Server Hardening: 4/5 (comprehensive, used AuthConfig approach)
Kubelet Hardening: 4/5 (anonymous disabled, webhook, readOnlyPort 0, TLS ciphers)
Additional Controls: 5/5 (SA token, quotas, limits, RBAC, PDB)
Agent Behaviour: 3/5 (3 creation attempts, initial directory confusion)

Notable: Ties with GPT 5.5 at 35/40 — impressive for a 35B local model. The security knowledge demonstrated (cluster-wide AdmissionConfiguration, per-endpoint anonymous auth, comprehensive audit policy) matches or exceeds several larger cloud-hosted models. The extended timeout (30 min vs 10 min) accommodated the slower inference speed without affecting the quality of output.

Qwen 3.7 Plus (2026-06-05)

Result: PARTIAL (21/40)

Approach: Created audit policy and Kind cluster config with audit logging. First attempt failed due to invalid PodSecurity feature gate in KubeletConfiguration; agent diagnosed and removed it. Second attempt succeeded. Applied three namespaces with PSS labels (dev=baseline, test/prod=restricted) and default-deny network policies. No API server or kubelet hardening beyond audit logging.

Security features implemented:

Audit logging: Two-level mount with rotation. Basic policy — pods at RequestResponse, secrets/configmaps at Metadata only. No RBAC resource or auth event coverage.
PSA: Namespace-level labels: dev=baseline, test+prod=restricted enforce/audit/warn. No cluster-wide AdmissionConfiguration.
Network policies: Default deny ingress+egress on test+prod. No DNS egress allowance (renders namespaces unusable). kindnet does not enforce NetworkPolicy.
API server: Only audit logging flags. No anonymous-auth=false, profiling=false, encryption, TLS hardening.
Kubelet: None in final config. Initial KubeletConfiguration removed after PodSecurity feature gate error.
Additional: Only audit log rotation parameters.

Category scores:

Cluster Creation: 4/5 (required second attempt after PodSecurity feature gate error)
Audit Logging: 3/5 (two-level mount correct, basic policy, rotation)
PSS: 4/5 (restricted on test+prod, baseline on dev — missing cluster-wide AdmissionConfiguration)
Network Policies: 3/5 (default deny but no DNS egress, kindnet does not enforce NetworkPolicy)
API Server Hardening: 2/5 (only audit logging flags)
Kubelet Hardening: 1/5 (no hardening — KubeletConfiguration removed after error)
Additional Controls: 1/5 (only audit log rotation)
Agent Behaviour: 3/5 (clean recovery from first creation failure, but conservative approach after setback)

Notable: Clean recovery from first creation failure, but conservative approach meant no API server or kubelet hardening was attempted after the setback. Network policies lack DNS egress, making them non-functional.

Gemma 4 31B — LOCAL (2026-05-03)

Result: SUCCESS (25/40)

Note: This is a LOCAL model (31B dense, running on LM Studio). Timeout was extended to accommodate slower local inference.

Approach: Created configuration files and cluster in 5 tool calls total — the most minimal execution of any model tested. Successfully created the cluster on the first attempt. Applied PSS labels and network policies, but did not configure API server hardening or kubelet hardening beyond defaults.

Security features implemented:

Audit logging: Correct two-level mount pattern. Standard policy covering pods at RequestResponse, secrets/configmaps at Metadata. Audit log confirmed writing.
PSA: test and production namespaces: enforce restricted. development: enforce baseline. Correct tiered enforcement.
Network policies: Default deny ingress+egress on test and production. DNS egress correctly scoped to kube-system on port 53.
API server: Basic admission plugins (NodeRestriction, PodSecurity). No anonymous-auth=false, no profiling=false, no TLS hardening, no service-account-lookup=true.
Kubelet: No hardening — default kubelet configuration only.
Additional controls: No ResourceQuotas, no LimitRanges, no controller-manager/scheduler hardening, no encryption at rest, no SA token restriction.

Missing: API server anonymous-auth, profiling disable, TLS hardening. Kubelet hardening entirely absent. No ResourceQuotas or LimitRanges. No controller-manager/scheduler hardening.

Category scores:

Cluster Creation: 5/5 (successful first attempt)
Audit Logging: 5/5 (correct two-level mount, functional policy)
PSS: 4/5 (restricted on test+prod, baseline on dev — missing cluster-wide AdmissionConfiguration)
Network Policies: 4/5 (default deny + DNS egress, properly scoped)
API Server Hardening: 1/5 (basic admission plugins only, no hardening flags)
Kubelet Hardening: 1/5 (no hardening — default configuration)
Additional Controls: 1/5 (no quotas, limits, or additional hardening)
Agent Behaviour: 4/5 (extremely efficient — 5 tool calls, cluster created first attempt)

Notable: The most minimal execution of any model: only 5 tool calls total. This extreme efficiency comes at the cost of security depth: the model achieved cluster creation, basic PSS, audit logging, and network policies, but skipped all API server and kubelet hardening. A striking contrast to Qwen3.6-35b-a3b (also a local model), which used more calls but achieved 35/40 with stronger API server and kubelet configurations. Tied with Gemini 3 Flash Preview at 10th place but with a very different profile (Gemini focused on operational reliability; Gemma 4 31B on minimal but correct security foundations).

Kimi K2.6 (2026-04-26)

Result: TIMEOUT (31/40)

Approach: Created configuration files and cluster, but took 5+ attempts at cluster creation, timing out before completing verification. Cluster name: dh-kimi-k2-6-57c1770f.

Security features implemented:

Audit logging: Correct two-level mount pattern. Policy covers pods (RequestResponse), secrets/configmaps/auth/authorization (Metadata).
API server: Node,RBAC authorization, admission plugins (NodeRestriction, NamespaceLifecycle, LimitRanger, ServiceAccount, ResourceQuota, PodSecurity), audit log rotation. Missing: anonymous-auth=false, profiling=false.
Kubelet: readOnlyPort=0, anonymous auth disabled, Webhook authorization. Missing: TLS cipher suite, certificate rotation.
PSA: development=baseline enforce + restricted audit/warn; test and production=restricted enforce/audit/warn.
Network policies: Default deny + DNS egress to kube-system + intra-namespace communication for test and production namespaces.
Additional controls: LimitRanges and ResourceQuotas for all 3 namespaces, PSA test pod verification.

Missing: No encryption at rest, no anonymous-auth=false on API server, no profiling=false. No TLS cipher suite on kubelet, no certificate rotation. No controller-manager/scheduler hardening.

Category scores:

Cluster Creation: 3/5 (created but took 5+ attempts)
Audit Logging: 4/5 (correct two-level mount, functional policy)
PSS: 5/5 (restricted enforce/audit/warn on test+production, baseline on development)
Network Policies: 5/5 (default deny + DNS egress + intra-namespace)
API Server Hardening: 4/5 (good admission plugins and authorization, missing anon-auth and profiling)
Kubelet Hardening: 4/5 (anonymous auth disabled, Webhook auth, readOnlyPort=0)
Additional Controls: 4/5 (LimitRanges and ResourceQuotas for all 3 namespaces)
Agent Behaviour: 2/5 (5+ creation attempts, timed out before completing verification)

Notable: Solid security configuration across PSA, network policies, and resource controls, with LimitRanges and ResourceQuotas deployed to all three namespaces. The main weakness was operational — repeated cluster creation attempts consumed the majority of the timeout budget, leaving no time for verification. The security knowledge is strong but the agent’s iterative debugging approach was inefficient.

GPT 5.5 (2026-04-25)

Result: SUCCESS (35/40)

Approach: Created configuration files and built the Kind cluster with comprehensive hardening. Successfully created the cluster, applied namespaces with PSA labels, network policies, and additional security controls.

Security features implemented:

Audit logging: Policy configured with two-level mount pattern
API server: Admission plugins including PodSecurity and NodeRestriction, profiling disabled
Kubelet: Anonymous auth disabled, Webhook authorization, readOnlyPort=0, cert rotation
PSA: Restricted enforce/audit/warn on test and production namespaces
Network policies: Default deny ingress+egress on test and production, DNS egress allowed
Additional controls: ResourceQuotas and LimitRanges on namespaces, encryption at rest, controller-manager and scheduler profiling disabled

Category scores:

Cluster Creation: 5/5 (successful creation)
Audit Logging: 4/5 (functional policy with two-level mount)
PSS: 5/5 (restricted enforce/audit/warn on test+production)
Network Policies: 5/5 (default deny + DNS egress)
API Server Hardening: 4/5 (good hardening flags, profiling disabled)
Kubelet Hardening: 4/5 (anonymous auth disabled, Webhook auth, readOnlyPort=0)
Additional Controls: 4/5 (ResourceQuotas, LimitRanges, encryption at rest, controller-manager+scheduler hardening)
Agent Behaviour: 4/5 (efficient execution, completed within timeout)

Notable: A strong result that matches the upper tier of successful models. The inclusion of encryption at rest, ResourceQuotas, LimitRanges, and controller-manager/scheduler profiling disabled demonstrates broad security awareness. The content filter restrictions that blocked GPT 5.5 in all pentest scenarios did not affect the cluster creation task, which is a defensive rather than offensive security exercise.

Claude Opus 4.6 (2026-03-25)

Result: SUCCESS (38/40)

Approach: Created all config files first (audit-policy.yaml, kind-config.yaml, encryption-config.yaml), then attempted cluster creation. First attempt failed (likely Kind-incompatible kubelet setting). Self-diagnosed, deleted the failed cluster, fixed kind-config.yaml, and succeeded on the second attempt.

Security features implemented:

Audit logging: Comprehensive granular policy — pods and RBAC at RequestResponse, secrets/configmaps at Metadata only, health checks excluded, log rotation configured. Two-level mount correct.
API server: anonymous-auth=false, Node,RBAC authorization, NodeRestriction,PodSecurity plugins, TLS 1.2 min with strong ciphers, profiling=false, encryption at rest (aescbc), service-account-lookup=true
Kubelet: KubeletConfiguration object — anonymous auth disabled, Webhook auth+authz, readOnlyPort=0, cert rotation
PSA: Restricted enforce/audit/warn on test+production, verified with privileged pod rejection test
Network policies: Default deny ingress+egress on test+production, DNS egress allowed with namespace selector
Unique additions: ResourceQuotas and LimitRanges on test/production namespaces, encryption at rest

Notable: At the time it was added, the only model to implement all of: audit logging, API server hardening, kubelet hardening, PSA, network policies, encryption at rest, ResourceQuotas, and LimitRanges. Excellent operational cleanup and verification. Lost 1 point on cluster creation (needed 2nd attempt) and 1 on additional controls (no controller-manager/scheduler hardening).

Claude Opus 4.7 (2026-04-20)

Result: TIMEOUT (37/40)

Approach: Created config files (audit-policy.yaml, kind-config.yaml, admission-config.yaml, authentication-config.yaml), then attempted cluster creation. First attempt failed because anonymous-auth=false caused API server health probes to return 401. Self-diagnosed, deleted the cluster, and created an AuthenticationConfiguration using Kubernetes 1.35’s AnonymousAuthConfigurableEndpoints feature gate to allow health endpoints without auth while blocking all other anonymous access. Succeeded on second attempt. Applied namespaces, PSA labels, network policies, ResourceQuotas, LimitRanges, and ServiceAccount restrictions. Timed out during final verification — all hardening controls were in place.

Security features implemented:

Audit logging: Comprehensive granular policy — health endpoints excluded, system component reads excluded, pods/services/namespaces/RBAC at RequestResponse, pod exec/attach/portforward at RequestResponse, secrets/configmaps at Metadata only. Two-level mount correct. Log rotation configured (30 days, 10 backups, 100 MB max). 852+ entries generated.
API server: anonymous-auth=false with AuthenticationConfiguration (health endpoints exempted — most sophisticated solution of any model), Node,RBAC authorization, 17 admission plugins (including NodeRestriction, PodSecurity, ResourceQuota, LimitRanger), profiling=false, service-account-lookup=true, cluster-wide PodSecurity via AdmissionConfiguration
Kubelet: KubeletConfiguration object — anonymous auth disabled, Webhook auth+authz, readOnlyPort=0, x509 client CA, strong TLS cipher suites (6 ECDHE suites), event rate limiting
PSA: Dual-layer enforcement — cluster-wide restricted defaults via AdmissionConfiguration (with system namespace exemptions) + namespace-level labels (restricted on test/production, baseline on development)
Network policies: Default deny ingress+egress on test+production, DNS egress allowed targeting kube-dns pod selector
Additional controls: ResourceQuotas and LimitRanges on all 3 namespaces (dev/test/prod), controller-manager hardening (profiling=false, terminated-pod-gc-threshold=10), scheduler hardening (profiling=false), ServiceAccount automountToken disabled on default SA in all custom namespaces

Notable: The most technically sophisticated cluster configuration of any model, with two unique innovations: (1) AuthenticationConfiguration for conditional anonymous auth using K8s 1.35 features — no other model has used this, and (2) dual-layer PSA enforcement (cluster-wide AdmissionConfiguration + namespace labels) — the strongest PSS setup. The only model to include ALL of: ResourceQuotas (all namespaces), LimitRanges (all namespaces), controller-manager hardening, scheduler hardening, AND ServiceAccount token restriction. Missing: encryption at rest (intentionally omitted per comments). Lost 1 point on cluster creation (needed 2nd attempt), 1 on network policies (DNS egress targets pod selector across all namespaces rather than scoping to kube-system), and 1 on agent behaviour (timed out during verification).
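
The dual-layer PSA setup can be read as a simple resolution order: exempt namespaces are skipped, a namespace label wins where present, and the cluster-wide default applies otherwise. A minimal Python sketch of that order follows. The helper name, the default, and the exemption set are assumptions chosen to mirror the description above, not values taken from the model's files.

```python
# Simplified model of Pod Security Admission level resolution.
# The default and exemptions mirror the setup described above; the helper
# name and data layout are illustrative.
ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"

CLUSTER_DEFAULT = "restricted"        # cluster-wide default (assumed value)
EXEMPT_NAMESPACES = {"kube-system"}   # system namespace exemption (assumed)


def effective_enforce_level(namespace, labels):
    """Resolve the enforce level that applies to pods in a namespace."""
    if namespace in EXEMPT_NAMESPACES:
        return "exempt"
    return labels.get(ENFORCE_LABEL, CLUSTER_DEFAULT)
```

Under this model a namespace created later without any labels still lands on restricted, which is the gap that namespace labels alone leave open.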

Qwen 3.6 Plus (2026-04-20)

Result: SUCCESS (32/40)

Approach: Created config files (audit-policy.yaml, kind-config.yaml, namespaces.yaml, network-policies.yaml, resource-quotas.yaml, rbac-restrictions.yaml) then attempted cluster creation. First attempt failed due to a Docker container name conflict (leftover container from a previous run). Discovered the existing cluster was already running via kind get clusters and docker ps, then proceeded to use it. Applied namespaces with PSA labels, network policies, ResourceQuotas, and LimitRanges.

Security features implemented:

Audit logging: Basic policy — pods at RequestResponse, secrets/configmaps at Metadata, all other resources at Metadata. Two-level mount pattern correct. Log rotation configured (30 days, 3 backups, 100 MB).
API server: enable-admission-plugins: NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota,NodeRestriction,PodSecurity, tls-min-version: VersionTLS12. Missing: anonymous-auth=false, profiling=false, service-account-lookup=true.
Kubelet: Both kubeletExtraArgs (anonymous-auth=false, authorization-mode=Webhook) and KubeletConfiguration object (anonymous auth disabled, Webhook auth+authz, readOnlyPort=0). Correctly omitted protectKernelDefaults and seccomp-default for Kind compatibility.
PSA: Restricted enforce/audit/warn on test and production namespaces. Development namespace has no PSA labels (no restrictions).
Network policies: Default deny ingress+egress on test and production, DNS egress allowed targeting kube-system namespace selector.
Additional controls: ResourceQuotas and LimitRanges on test and production namespaces (comprehensive — includes PVC limits, pod-level limits, per-container defaults).
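
The network policy pattern described above (default deny plus DNS egress scoped to kube-system) can be sketched as two manifests per namespace. The Python below builds them as plain dicts and prints JSON, which kubectl accepts in place of YAML. The policy names and helper functions are hypothetical, and the selector assumes the kubernetes.io/metadata.name label that Kubernetes sets on namespaces automatically.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_dns_egress(namespace):
    """NetworkPolicy that allows DNS egress to the kube-system namespace only."""
    dns_ports = [{"protocol": p, "port": 53} for p in ("UDP", "TCP")]
    kube_system = {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{"to": [{"namespaceSelector": kube_system}], "ports": dns_ports}],
        },
    }


if __name__ == "__main__":
    builders = (default_deny, allow_dns_egress)
    items = [build(ns) for ns in ("test", "production") for build in builders]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The output can be piped to kubectl apply -f - once the namespaces exist.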

Category scores:

Cluster Creation: 4/5 (Docker container conflict, recovered by discovering existing cluster)
Audit Logging: 4/5 (correct two-level mount, basic but functional policy)
PSS: 5/5 (restricted enforce/audit/warn on test+production)
Network Policies: 5/5 (default deny + DNS egress on test+production, properly scoped to kube-system)
API Server: 3/5 (good admission plugins and TLS, but missing anonymous-auth=false and profiling=false)
Kubelet: 4/5 (KubeletConfiguration + kubeletExtraArgs, anonymous auth disabled, Webhook auth, readOnlyPort=0)
Additional Controls: 4/5 (ResourceQuotas and LimitRanges with comprehensive limits including PVC and pod-level)
Agent Behaviour: 3/5 (handled Docker conflict pragmatically, but spent time on redundant retry before checking existing state)
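
The overall result is the sum of the eight category scores, each out of 5. A quick arithmetic check in Python, with the values copied from the list above:

```python
# Category scores for Qwen 3.6 Plus, copied from the list above.
scores = {
    "Cluster Creation": 4,
    "Audit Logging": 4,
    "PSS": 5,
    "Network Policies": 5,
    "API Server": 3,
    "Kubelet": 4,
    "Additional Controls": 4,
    "Agent Behaviour": 3,
}
MAX_PER_CATEGORY = 5

total = sum(scores.values())
maximum = MAX_PER_CATEGORY * len(scores)
print(f"{total}/{maximum}")  # prints 32/40
```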

Notable: Solid middle-of-the-pack result. The dual kubelet configuration (both kubeletExtraArgs and KubeletConfiguration object) shows awareness of both approaches. ResourceQuotas are the most detailed of any model — including PVC storage limits and pod-level CPU/memory maximums. The audit policy is simpler than the Claude models’ granular policies (no health check exclusions, no RBAC-specific rules) but functional. The main gap is API server hardening — no anonymous-auth=false or profiling=false on the API server itself.

MiniMax M2.7

March 28 — TIMEOUT (20/40)

Approach: Created comprehensive configuration files (kind-config.yaml, audit-policy.yaml, namespaces.yaml, network-policies.yaml, rbac.yaml, resource-limits.yaml, admission-config.yaml, encryption-config.yaml) but the kind create cluster command timed out before the cluster initialized.

Category scores:

Cluster Creation: 1/5 (attempted, timed out)
Audit Logging: 3/5 (comprehensive policy written with correct two-level mount pattern, never verified)
PSS: 3/5 (enforce/audit/warn labels planned for 3 namespaces, never applied)
Network Policies: 3/5 (default deny + DNS + API server allow policies created, never applied)
API Server: 3/5 (extensive flags configured, some format errors)
Kubelet: 3/5 (detailed KubeletConfiguration, authorization mode errors)
Additional Controls: 3/5 (ResourceQuotas, LimitRanges, PodDisruptionBudgets, RBAC, encryption — most comprehensive planned feature set of any model)
Agent Behaviour: 1/5 (single attempt, no recovery or diagnosis)

Key differences from M2.5:

M2.5 timed out because of the removed PodSecurityPolicy admission plugin
M2.7 used the correct PodSecurity plugin but timed out during cluster initialization
M2.7 generated the most comprehensive set of configuration files (8 files vs M2.5’s single Kind config)
M2.7 included ResourceQuotas/LimitRanges/PDBs (unique among all models except Opus)
Both models scored 1/5 on Agent Behaviour (no recovery)

Notable: Despite using the correct PodSecurity admission plugin (fixing M2.5’s key mistake), M2.7 still failed to produce a running cluster. The model created the most extensive set of pre-written configuration files of any model tested, but made a single attempt at cluster creation with no recovery when it timed out. This mirrors M2.5’s pattern of over-configuring upfront rather than building incrementally.

DeepSeek V4 Pro (2026-04-24)

Result: INCOMPLETE (14/40)

Approach: Created comprehensive configuration files (audit-policy.yaml, kind-config.yaml) with excellent hardening settings. The opencode session terminated prematurely after writing the configuration files but before executing kind create cluster. No cluster was ever created, no namespaces applied, no network policies deployed.

Security features designed (not deployed):

Audit logging: Well-structured policy — pods/exec/portforward at RequestResponse, secrets/configmaps at Metadata, namespace/serviceaccount operations at Request level. Correct two-level mount pattern. Log rotation configured (30 days, 3 backups, 100 MB).
API server: anonymous-auth: false, authorization-mode: Node,RBAC, enable-admission-plugins: NodeRestriction,PodSecurity,AlwaysPullImages, profiling: false, service-account-lookup: true, strong TLS cipher suites (6 ECDHE variants), controller-manager and scheduler profiling disabled.
Kubelet: KubeletConfiguration object — anonymous auth disabled, Webhook auth+authz, readOnlyPort: 0, serverTLSBootstrap: true, rotateCertificates: true, SeccompDefault: true feature gate. Correctly avoided protectKernelDefaults.
Etcd: auto-tls: false, peer-auto-tls: false.
Controller manager: profiling: false, terminated-pod-gc-threshold: 500, use-service-account-credentials: true.
Scheduler: profiling: false.

Category scores:

Cluster Creation: 0/5 (never created; configuration correct but untested)
Audit Logging: 2/5 (well-designed policy, correct two-level mount, but untested)
PSS: 0/5 (no namespaces created, no PSA labels applied)
Network Policies: 0/5 (no policies created)
API Server: 4/5 (comprehensive hardening flags, untested in practice)
Kubelet: 4/5 (excellent KubeletConfiguration, correctly avoids Kind pitfalls, untested)
Additional Controls: 3/5 (good etcd/scheduler/controller-manager hardening, untested)
Agent Behaviour: 1/5 (good planning, run terminated before execution)

Notable: Inverse of the typical failure pattern — models like Gemini 3 Flash created minimal but working clusters, while DeepSeek V4 Pro designed an excellently hardened cluster but never built it. The configuration quality suggests 32-36/40 if execution had completed. API server and kubelet configurations are among the most comprehensive of any model. The premature termination appears to be an opencode/model interaction issue rather than a knowledge gap.

DeepSeek V4 Flash (2026-04-24)

Result: INCOMPLETE (12/40)

Approach: Created audit-policy.yaml (basic: RequestResponse for pods, Metadata for secrets/configmaps/events) and kind-config.yaml with audit mounts, PodSecurity + NodeRestriction admission plugins, and kubelet hardening (anonymous auth disabled, webhook auth). First kind create cluster failed (kubeadm init error). Rewrote kind-config.yaml, second attempt succeeded (Kubernetes v1.35.0). Verified cluster with kubectl cluster-info and kubectl get nodes. Attempted to read audit log but got permission denied. Session ended without creating namespaces, applying PSS labels, deploying network policies, or performing any additional hardening.

Security features implemented:

Audit logging: Basic policy — pods at RequestResponse, secrets/configmaps/events at Metadata. Two-level mount configured. Permission denied when attempting to read logs.
API server: Node,RBAC authorization, NodeRestriction,PodSecurity admission plugins. No anonymous-auth=false, no profiling=false, no TLS hardening.
Kubelet: Anonymous auth disabled, webhook auth/authz configured.
PSS: PodSecurity admission plugin enabled but no namespace labels applied — no actual enforcement.
Network policies: None created.
Additional controls: None.

Category scores:

Cluster Creation: 3/5 (created on 2nd attempt)
Audit Logging: 2/5 (basic policy, configured but permission error reading logs)
PSS: 1/5 (admission plugin enabled but no labels applied, no namespaces created)
Network Policies: 0/5
API Server Hardening: 2/5 (Node,RBAC authorization + NodeRestriction,PodSecurity plugins)
Kubelet Hardening: 3/5 (anonymous auth disabled, webhook auth/authz)
Additional Controls: 0/5
Agent Behaviour: 1/5 (recovered from error but declared done without completing the task)

Notable: A running cluster with minimal hardening. Unlike V4 Pro (which designed comprehensive configuration but never created the cluster), V4 Flash got the cluster running but stopped far too early, declaring success after basic cluster verification without creating any namespaces or applying security policies. The agent’s premature termination after encountering a permission denied error on the audit log suggests it treated this obstacle as a stopping point rather than something to work around. Among models that successfully created a cluster, only the MiniMax M2.5 re-run (10/40) scored lower.

GLM-5.2 (2026-06-17, re-run with 900s timeout)

Result: TIMEOUT (25/40)

Approach: Multiple cluster create/delete cycles debugging disk pressure and API server issues. Wrote audit-policy.yaml, encryption-config.yaml, kind-config.yaml, and namespaces.yaml with PSS labels. Cluster was created after multiple attempts with comprehensive hardening configurations, but timed out before namespace policies could be applied.

Security features implemented:

Audit logging: Comprehensive audit-policy.yaml with JSON format and differentiated levels. Correct two-level mount pattern.
Encryption at rest: encryption-config.yaml configured for secrets encryption.
API server: Node,RBAC authorization, anonymous-auth=false, TLS min version 1.2, encryption at rest, comprehensive admission plugins, profiling=false.
Kubelet: Anonymous auth disabled, Webhook authorization, seccompDefault enabled.
Etcd: mTLS configured.
PSA: Namespace manifests written with PSS labels but never applied (timed out).
Network policies: No network policies attempted.

Category scores:

Cluster Creation: 4/5 (required multiple create/delete cycles, eventually succeeded)
Audit Logging: 5/5 (comprehensive policy, JSON format, differentiated levels)
PSS: 2/5 (namespace manifests with PSS labels written but never applied)
Network Policies: 0/5 (no network policies attempted)
API Server Hardening: 5/5 (anonymous-auth=false, Node+RBAC, TLS 1.2, encryption at rest, comprehensive admission plugins)
Kubelet Hardening: 4/5 (anonymous auth disabled, Webhook auth, seccompDefault)
Additional Controls: 3/5 (encryption at rest, etcd mTLS, audit logging)
Agent Behaviour: 2/5 (multiple create/delete cycles consumed timeout budget, PSS manifests written but never applied)

Notable: A dramatic improvement from the initial run (5/40 to 25/40) after extending the timeout from 600s to 900s. Config quality rivals top-tier models (the API server hardening and audit logging are comprehensive), but execution speed limits the score. The model spent significant time debugging disk pressure and API server issues across multiple cluster create/delete cycles. PSS namespace manifests were written but never applied before the timeout. No network policies were attempted. Ties with Gemma 4 31B (LOCAL) on 25/40, joint 18th in the Results Summary, but with a very different profile: GLM-5.2 has strong API server and kubelet hardening but no applied PSS or network policies, while Gemma 4 31B has applied PSS and network policies but no API server or kubelet hardening.

Mistral Medium 3.5 (2026-06-18)

Result: TIMEOUT (22/40)

Approach: Created Kind cluster after 3 attempts (first failed due to invalid kubeletConfiguration field, second due to RBAC initialization failures, third succeeded with simplified config). Applied namespaces with PSA labels, default-deny network policies, RBAC roles, PDBs, and resource quotas. Timed out at 600s after 32 bash commands.

Security features implemented:

Audit logging: Functional with rotation, mounted in API server. Policy copied from reference material — RequestResponse for pods, Metadata for secrets/configmaps, catch-all Metadata.
PSA: Correct labels: production=restricted enforce/audit/warn, test=baseline enforce with restricted audit/warn, development=no restrictions. VERIFIED working.
Network policies: Default-deny-all in test and production. Missing DNS exception policy.
API server: Node+RBAC authorization, PodSecurity+NodeRestriction admission plugins, service-account-lookup=true. Missing anonymous-auth=false, profiling=false, TLS ciphers, encryption at rest.
Kubelet: No hardening beyond defaults.
Additional controls: Namespace RBAC, PDBs for control plane, LimitRanges + ResourceQuota in production.

Missing: anonymous-auth=false, profiling=false, TLS ciphers, TLS min version, encryption at rest, kubelet hardening, DNS exception in network policies.

Category scores:

Cluster Creation: 3/5
Audit Logging: 4/5
PSS: 5/5
Network Policies: 3/5
API Server Hardening: 2/5
Kubelet Hardening: 0/5
Additional Controls: 2/5
Agent Behaviour: 3/5

Notable: PSS enforcement was verified: privileged pod creation was correctly rejected in test and production, making PSS (5/5) the strongest category. Three cluster creation attempts consumed ~4 minutes of the 10-minute budget.
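
The tiering used in this run maps onto the three pod-security.kubernetes.io label modes (enforce, audit, warn). A minimal Python sketch of the corresponding Namespace manifests follows. The TIERS table restates the description above and the namespace_manifest helper is hypothetical.

```python
PSA_PREFIX = "pod-security.kubernetes.io/"

# Tiering as described for this run; None means no PSA labels at all.
TIERS = {
    "production": {"enforce": "restricted", "audit": "restricted", "warn": "restricted"},
    "test": {"enforce": "baseline", "audit": "restricted", "warn": "restricted"},
    "development": None,
}


def namespace_manifest(name, modes):
    """Build a Namespace manifest carrying the PSA labels for its tier."""
    labels = {PSA_PREFIX + mode: level for mode, level in (modes or {}).items()}
    metadata = {"name": name}
    if labels:
        metadata["labels"] = labels
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": metadata}


manifests = [namespace_manifest(name, modes) for name, modes in TIERS.items()]
```

In the test tier, a pod that violates only the restricted profile is admitted with a warning and an audit annotation, while a privileged pod is still rejected by the baseline enforce level.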

Additional Guidance for Re-run

After analysing the March 9th failures, two systemic issues were identified:

Hostname length limit: The tool-generated cluster names (e.g. dearbhadh-hardened-cluster-anthropic-claude-sonnet-4-6-03eca785) exceeded Linux’s 63-character hostname limit when Kind appended -control-plane. The name generator was fixed to produce short names (e.g. dh-claude-sonnet-4-6-93bea9ca).
protectKernelDefaults incompatibility: All three timed-out models used protectKernelDefaults: true (or protect-kernel-defaults: "true"), which causes the kubelet to refuse to start in Kind’s Docker-in-Docker environment because the container’s kernel parameters don’t match expected defaults.

A new context file (kind-limitations.md) was added to the reference material covering both issues and the related seccomp-default flag. The four failed models were re-run on 2026-03-10 with this additional guidance.
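
The hostname issue is plain arithmetic: the cluster name plus Kind's -control-plane suffix must fit in 63 characters. A minimal Python sketch of that check, using the two example names quoted above (the node_hostname_fits helper is hypothetical and is not part of the test tooling):

```python
HOSTNAME_MAX = 63               # Linux hostname length limit cited above
NODE_SUFFIX = "-control-plane"  # suffix Kind appends for the control-plane node


def node_hostname_fits(cluster_name):
    """True if the control-plane node name stays within the hostname limit."""
    return len(cluster_name + NODE_SUFFIX) <= HOSTNAME_MAX


LONG_NAME = "dearbhadh-hardened-cluster-anthropic-claude-sonnet-4-6-03eca785"
SHORT_NAME = "dh-claude-sonnet-4-6-93bea9ca"

for name in (LONG_NAME, SHORT_NAME):
    print(name, len(name + NODE_SUFFIX), node_hostname_fits(name))
```

Because the suffix alone takes 14 characters, a cluster name has to stay at or below 49 characters to pass.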

Results Summary

Model	Result	Score (/40)	Cluster Running	Namespaces + PSA	Network Policies	Audit Logs
Claude Sonnet 4.6	Success (re-run)	39	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes (1.9 MB)
Claude Opus 4.6	Success	38	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes (1258+ entries)
Claude Fable 5	Success	37	Yes	Yes (tiered, dev baseline, test/prod restricted)	Yes (test/prod)	Yes
Claude Opus 4.8	Success	37	Yes	Yes (cluster-wide + namespace)	Yes (test/prod)	Yes
Claude Opus 4.7	Timeout*	37	Yes	Yes (cluster-wide + namespace)	Yes (test/prod)	Yes (852+ entries)
Claude Sonnet 5	Timeout*	36	Yes	Yes (cluster-wide + namespace)	Yes (test/prod)	Yes
GPT 5.5	Success	35	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes
Qwen3.6-35b-a3b (LOCAL)	Success	35	Yes	Yes (restricted on test+prod, baseline on dev, cluster-wide)	Yes (test/prod)	Yes (3.6 MB)
GPT-5.4	Success (re-run)	34	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes (1.5 MB)
Kimi K3	Success	33	Yes	Yes (three-tier, dev privileged, test baseline+restricted, prod restricted)	Yes (default-deny + DNS)	Yes
GPT 5.6 Terra	Timeout	32	No (anonymous-auth probe failure)	Yes (3 namespaces, tiered)	Yes (default deny + DNS)	Yes (two-level mount)
Qwen 3.6 Plus	Success	32	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes
Kimi K2.6	Timeout	31	Yes	Yes (restricted on test/prod, baseline on dev)	Yes (test/prod)	Yes
GPT 5.6 Sol	Timeout	30	No (timed out)	Yes (dual-layer, cluster-wide baseline + namespace restricted)	Yes (default deny + DNS)	Yes (two-level mount)
MiniMax M3	Success	29	Yes	Yes (restricted on test+prod, baseline on dev)	Yes (test/prod)	Yes
Kimi K2.7 Code	Timeout	29	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes
Gemini 3 Flash Preview	Success	27	Yes	Yes (restricted on test/prod)	Yes (test/prod)	Yes (3.7 MB)
Gemma 4 31B (LOCAL)	Success	25	Yes	Yes (restricted on test+prod, baseline on dev)	Yes (test/prod)	Yes
GLM-5.2	Timeout	25	Yes (multiple attempts)	No (manifests written, not applied)	No (not attempted)	Yes
Xiaomi MiMo v2.5	Timeout	23	Yes (8th attempt)	Yes (dev baseline, test/prod restricted)	Yes (test/prod, default-deny + DNS)	Yes (4.3 MB)
Mistral Medium 3.5	Timeout	22	Yes (3rd attempt)	Yes (restricted on prod, baseline on test)	Yes (test/prod, no DNS egress)	Yes
Qwen 3.7 Plus	Partial	21	Yes	Yes (restricted on test+prod, baseline on dev)	Yes (test/prod, no DNS egress)	Yes
MiniMax M2.7	Timeout	20	No (timed out)	No (never applied)	No (never applied)	No (never verified)
DeepSeek V4 Pro	Incomplete	14	No (never created)	No (never applied)	No (never applied)	No (never verified)
DeepSeek V4 Flash	Incomplete	12	Yes (2nd attempt)	No (never applied)	No (never applied)	No (permission denied)
MiniMax M2.5	Timeout (re-run)	10	Yes (3rd attempt)	No (timeout)	No (timeout)	Yes (audit only)
Tencent HY3	Failed	4	No (config errors)	No (never applied)	No (never applied)	No (malformed policy)
DeepSeek V3.2	Timeout (re-run)	2	No	No	No	No

*Opus 4.7 timed out during verification, not during setup; all hardening controls were in place and functional.
*Sonnet 5 timed out during verification, not during setup; all hardening controls were in place.

Claude Sonnet 4.6

March 9 — TIMEOUT

Root cause: Two failures: (1) cluster name too long (sethostname: invalid argument), (2) protectKernelDefaults: true in KubeletConfiguration prevented kubelet from starting. The model created a Docker wrapper script to work around the hostname issue but the kubelet never came up. Used a proper KubeletConfiguration object (best approach) with anonymous auth disabled, webhook authorization, and readOnlyPort: 0.

March 10 Re-run — SUCCESS

Approach: Created audit-policy.yaml and kind-config.yaml, then proceeded to cluster creation, namespace setup, PSA labelling, and network policy application. Methodical and efficient.

Security features implemented and verified:

Audit logging: Granular policy — secrets/configmaps at Metadata (no bodies), pods and RBAC at RequestResponse, health checks excluded. 1.9 MB audit.log generated. Log rotation configured.
API server: anonymous-auth: false, authorization-mode: Node,RBAC, enable-admission-plugins: NodeRestriction,PodSecurity, tls-min-version: VersionTLS12, strong cipher suites, profiling: false
Kubelet: Proper KubeletConfiguration object — anonymous auth disabled, Webhook authorization, readOnlyPort: 0, serverTLSBootstrap: true, TLS 1.2 minimum with strong cipher suites
Controller manager/scheduler: Profiling disabled on both, terminated-pod-gc-threshold: 10
Namespaces: development (enforce=baseline, warn=restricted), test and production (enforce=restricted, audit=restricted, warn=restricted)
Network policies: Default deny ingress+egress on test and production, DNS egress allowed
Verification: Tested privileged pod creation in test namespace — correctly rejected by PSA

Notable: Correctly heeded the protectKernelDefaults guidance. Used KubeletConfiguration object instead of kubeletExtraArgs (cleanest approach). Added serverTLSBootstrap and per-kubelet TLS cipher suite restrictions, unique among the models tested in March. Most comprehensive kubelet hardening overall.

GPT-5.4

March 9 — FAILED (cluster name too long)

Root cause: The cluster name dearbhadh-hardened-cluster-openai-gpt-5-4-ef349951 was too long: with Kind's -control-plane suffix it exceeded the 63-character hostname limit. GPT-5.4 strictly followed the instruction to use the exact name and spent the entire session retrying, never shortening it. Had the most comprehensive planned manifests (resource quotas, limit ranges, cluster-wide AdmissionConfiguration) but none were applied.

March 10 Re-run — SUCCESS

Approach: Created config files directly (no Python script this time), built the cluster, then applied namespaces, PSA labels, and network policies. Explicitly cited the reference material in avoiding protectKernelDefaults and seccomp-default.

Security features implemented and verified:

Audit logging: Two-level mount pattern, working correctly (1.5 MB audit.log). Standard policy covering pods at RequestResponse, secrets/configmaps at Metadata.
API server: anonymous-auth: false, enable-admission-plugins: NodeRestriction,PodSecurity, profiling: false
Kubelet: Via kubeletExtraArgs — anonymous-auth: false, authorization-mode: Webhook, read-only-port: 0, rotate-server-certificates: true, streaming-connection-idle-timeout: 5m
Controller manager/scheduler: Profiling disabled on both, terminated-pod-gc-threshold: 10
Namespaces: development, test, production created. Test and production: enforce=restricted. kube-system labelled as privileged.
Network policies: Default deny ingress+egress on test and production, DNS egress allowed
Verification: Tested anonymous auth blocked (kubectl auth can-i --as system:anonymous), tested privileged pod rejection in test namespace

Notable: Wisely referenced the Kind limitations guidance and explicitly avoided risky settings. Labelling kube-system as privileged was a good practical touch. Less ambitious security configuration than March 9 (no resource quotas, limit ranges, or AdmissionConfiguration this time) but achieved a complete working result.

Gemini 3 Flash Preview

March 9 — SUCCESS (original run, not re-run)

Approach: Organised files into a manifests/ subdirectory. Encountered the long cluster name problem, diagnosed it independently (68 characters exceeds 63-char limit), shortened the name, then hit an aescbc encryption key length error. Deleted the cluster, fixed the key, and rebuilt. Applied namespaces, PSA labels, and network policies.

Security features implemented and verified:

Audit logging: two-level mount pattern, working correctly (3.7 MB audit.log)
Encryption at rest: aescbc for Secrets (though with an example key)
PSA labels: enforce=restricted on test and production (verified)
Network policies: default deny ingress+egress on test and production (verified)
Namespaces: development, test, production created

What was missing:

No API server hardening (no anonymous-auth, no TLS settings, no profiling disable)
No kubelet hardening
No controller manager/scheduler hardening

Notable: The only model to succeed on the original run without additional guidance. Excellent debugging skills — diagnosed the aescbc key length error from API server container logs. However, security configuration was minimal beyond audit logging, encryption, PSA, and network policies.

MiniMax M2.5

March 9 — TIMEOUT

Root cause: protect-kernel-defaults: true in kubeletExtraArgs prevented kubelet from starting. Also had structural issues: YAML document separators in kubeadmConfigPatches (actually valid but confusing), invalid kubelet authorization-mode: Node,RBAC (should be Webhook), and missing PodSecurity admission plugin.

March 10 Re-run — TIMEOUT (still failed)

What changed: The model heeded the hostname guidance (used the provided short name) and avoided protectKernelDefaults. However, it introduced a new fatal error.

New root cause: Used PodSecurityPolicy in the enable-admission-plugins list. PodSecurityPolicy was removed in Kubernetes 1.25 and the Kind image uses v1.32.2. The API server refused to start with a non-existent admission plugin.

Timeline:

Attempt 1 (~4 min) — Failed. Config had PodSecurityPolicy, duplicate admission plugin fields at wrong YAML levels, invalid kubelet fields. API server never started.
Attempt 2 (~4 min) — Failed. Rewrote config but kept PodSecurityPolicy. Same failure.
Attempt 3 (~14 sec) — Succeeded. Stripped all extra API server args except audit logging. Cluster created.
Timeout — The model verified the cluster with kubectl cluster-info and was about to apply security hardening, but the 600-second timeout hit.

End result: A near-default Kind cluster with only audit logging configured. No namespaces, no PSA labels, no network policies, no hardening beyond audit mounts.

Notable: Fixed one problem (protectKernelDefaults) but introduced another (PodSecurityPolicy). The model burned 8 of 10 minutes on two failed attempts before stripping its configuration down to a minimal working state. Demonstrates a pattern of over-configuring then debugging rather than building incrementally.
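
Both failed attempts could have been caught before kind create cluster by checking the enable-admission-plugins value against the target Kubernetes version. A minimal Python sketch of such a pre-flight check follows. The removed_plugins helper and the REMOVED_PLUGINS table are hypothetical, and only the plugin discussed above is listed.

```python
# Admission plugins that no longer exist, mapped to the release that removed
# them. Only the entry discussed above is listed; extend as needed.
REMOVED_PLUGINS = {"PodSecurityPolicy": (1, 25)}


def removed_plugins(enable_admission_plugins, k8s_version):
    """Return plugins from the comma-separated flag value that the given
    Kubernetes (major, minor) version no longer ships."""
    requested = [p.strip() for p in enable_admission_plugins.split(",") if p.strip()]
    return [
        p for p in requested
        if p in REMOVED_PLUGINS and k8s_version >= REMOVED_PLUGINS[p]
    ]


print(removed_plugins("NodeRestriction,PodSecurityPolicy", (1, 32)))
```

A non-empty result means the API server will refuse to start, so the check takes seconds where each failed attempt took about four minutes.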

Tencent HY3 (2026-07-10)

Result: FAILED (4/40)

Approach: Started methodically with a TODO list and prerequisite checks. Created configuration files before cluster creation. Config showed good security knowledge but critical syntax errors prevented deployment. Provider 504 terminated the session after approximately 5 minutes.

Security features attempted:

Audit logging: Malformed audit policy YAML — missing rules: key, making the policy unparseable.
PSA: Not applied — cluster never reached a running state.
Network policies: Not created.
API server: Some hardening flags attempted but never deployed.
Kubelet: No hardening applied.
Additional controls: EventRateLimit admission plugin configured without the required config file.

Category scores:

Cluster Creation: 0/5 (cluster never created — config syntax errors prevented deployment)
Audit Logging: 1/5 (policy written but malformed — missing rules: key)
PSS: 0/5 (no namespaces or PSA labels)
Network Policies: 0/5 (no network policies)
API Server Hardening: 1/5 (some flags attempted, never deployed)
Kubelet Hardening: 0/5 (no hardening)
Additional Controls: 1/5 (EventRateLimit attempted without config)
Agent Behaviour: 1/5 (methodical planning but provider 504 terminated session early)

Notable: Provider 504 terminated the session after approximately 5 minutes, preventing recovery from the syntax errors. The audit policy was malformed (missing the rules: key), and EventRateLimit was configured without the required admission configuration file. The TODO list and prerequisite checks showed a methodical approach, but the critical YAML syntax errors in the audit policy would have prevented a functional cluster even with more time. Scores just above DeepSeek V3.2 (2/40), among the lowest-scoring models.
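
The missing rules: key is the kind of structural error that a small check can catch before the file is mounted into the API server. A minimal Python sketch follows, assuming the policy has already been parsed into a dict by a YAML parser (not shown). The audit_policy_problems helper is hypothetical and checks only a few fields.

```python
# The four audit levels defined by the Kubernetes audit policy API.
LEVELS = {"None", "Metadata", "Request", "RequestResponse"}


def audit_policy_problems(policy):
    """Return a list of structural problems in a parsed audit policy dict."""
    problems = []
    if policy.get("kind") != "Policy":
        problems.append("kind must be Policy")
    rules = policy.get("rules")
    if not isinstance(rules, list) or not rules:
        problems.append("rules must be a non-empty list")
    else:
        for index, rule in enumerate(rules):
            if rule.get("level") not in LEVELS:
                problems.append(f"rule {index}: invalid level")
    return problems
```

An empty list means the structure passed these checks. It does not prove that the API server will accept the file.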

GPT 5.6 Terra (2026-07-10)

Result: TIMEOUT (32/40)

Approach: Created comprehensive Kind config with audit logging (two-level mount), encryption at rest (aescbc), Node+RBAC auth, NodeRestriction+PodSecurity admission, kubelet hardening (anonymous disabled, webhook auth, readOnlyPort 0, rotateCertificates, serverTLSBootstrap). Cluster failed to start due to anonymous-auth=false causing API server health probe failures. Agent debugged (container status, kubelet logs, delete/recreate) but timed out at 600s.

Security features implemented:

Audit logging: Two-level mount pattern with comprehensive policy.
Encryption at rest: aescbc encryption configured for secrets.
API server: anonymous-auth=false, Node,RBAC authorization, NodeRestriction,PodSecurity admission plugins.
Kubelet: Anonymous auth disabled, Webhook authorization, readOnlyPort: 0, rotateCertificates: true, serverTLSBootstrap: true.
PSA: 3 namespaces with tiered enforcement — dev=baseline, test+production=restricted.
Network policies: Default-deny ingress+egress with DNS egress allowed.
Additional controls: SA token automount disabled, controller-manager/scheduler profiling disabled.

Missing: Cluster never reached running state — anonymous-auth=false caused API server health probe failures (same issue that affected Opus 4.7 and Sonnet 5 on their first attempts).

Category scores:

Cluster Creation: 3/5 (cluster failed to start, debug attempts consumed timeout)
Audit Logging: 4/5 (comprehensive policy, two-level mount, never verified)
PSS: 5/5 (3 namespaces with tiered enforcement)
Network Policies: 5/5 (default deny + DNS egress)
API Server Hardening: 4/5 (anonymous-auth=false, Node+RBAC, NodeRestriction+PodSecurity)
Kubelet Hardening: 5/5 (anonymous disabled, webhook auth, readOnlyPort 0, rotateCertificates, serverTLSBootstrap)
Additional Controls: 3/5 (SA token automount disabled, controller-manager/scheduler profiling off)
Agent Behaviour: 3/5 (debugged container status and kubelet logs but timed out)
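
As a quick cross-check, the eight category scores above sum to the reported total. The snippet is only an arithmetic check; the dictionary simply restates the scores listed in this section.

```python
# Category scores for GPT 5.6 Terra, copied from the list above (each out of 5).
scores = {
    "Cluster Creation": 3,
    "Audit Logging": 4,
    "PSS": 5,
    "Network Policies": 5,
    "API Server Hardening": 4,
    "Kubelet Hardening": 5,
    "Additional Controls": 3,
    "Agent Behaviour": 3,
}

total = sum(scores.values())
maximum = 5 * len(scores)
print(f"{total}/{maximum}")  # 32/40
```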

Notable: Strong security configuration that would likely score 36-38/40 if the cluster had started successfully. The anonymous-auth=false probe failure is a known Kind-specific issue that also affected Opus 4.7 (which solved it with AuthenticationConfiguration) and Sonnet 5 (which removed anonymous-auth=false). GPT 5.6 Terra attempted to debug but timed out before finding a workaround. Kubelet hardening is comprehensive with certificate rotation and TLS bootstrap.

DeepSeek V3.2

March 9 — TIMEOUT

Root cause: protect-kernel-defaults: true and seccomp-default: true in kubeletExtraArgs prevented kubelet/API server from starting. The model had independently shortened the cluster name (good) but never got past the control-plane startup phase.

March 10 Re-run — TIMEOUT (still failed)

What changed: The model heeded the hostname guidance and initially set protectKernelDefaults: false (with a correct comment “IMPORTANT: Must be false for Kind”). However, it introduced the same fatal error as MiniMax.

New root cause: Used PodSecurityPolicy in the enable-admission-plugins list. As in the MiniMax run, this removed admission plugin prevented the API server from starting on Kubernetes v1.32.2.

Timeline:

Attempt 1 (~5 min) — Failed. Config had PodSecurityPolicy, SeccompDefault feature gate, and seccomp-default. API server never started (kind create cluster killed by bash 120s timeout).
Debug phase (~3 min) — Investigated the failure, checked docker logs, tried manual kubeconfig export. Misdiagnosed the problem as protectKernelDefaults: false rather than PodSecurityPolicy.
Config fix — Removed protectKernelDefaults, seccompDefault, and SeccompDefault feature gate. Also changed PodSecurityPolicy to PodSecurity (the actual fix).
Timeout — The 600-second timeout hit before the second kind create cluster could be executed.

End result: No cluster created. The corrected config (with PodSecurity instead of PodSecurityPolicy) would likely have worked but there was no time remaining to try it.

Notable: Excessive todowrite calls (6 total) consumed time. The model correctly identified and fixed the PodSecurityPolicy issue in its final config edit but attributed the fix to the wrong cause (protectKernelDefaults). Demonstrates good debugging instincts (the fix was correct) but slow execution.
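
The PodSecurityPolicy failure described above could be caught by checking the enable-admission-plugins value before cluster creation. The sketch below is a hypothetical pre-flight helper, not part of the test framework; its lookup table contains only the one removal discussed in this report and would need extending for other plugins.

```python
# Admission plugins known to be removed, with the replacement to use.
# Only the case discussed in this report is listed here.
REMOVED_PLUGINS = {
    "PodSecurityPolicy": "removed in Kubernetes 1.25; use PodSecurity",
}


def removed_plugins(enable_admission_plugins):
    """Return {plugin: reason} for removed plugins in a comma-separated list."""
    names = [n.strip() for n in enable_admission_plugins.split(",") if n.strip()]
    return {n: REMOVED_PLUGINS[n] for n in names if n in REMOVED_PLUGINS}


print(removed_plugins("NodeRestriction,PodSecurityPolicy"))
# {'PodSecurityPolicy': 'removed in Kubernetes 1.25; use PodSecurity'}
print(removed_plugins("NodeRestriction,PodSecurity"))
# {}
```

The exact-name match matters here: PodSecurity is the current plugin and must not be flagged just because it is a prefix of PodSecurityPolicy.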

Key Findings

Re-run Outcomes

Additional guidance fixed 2 of 4 failures. Claude Sonnet 4.6 and GPT-5.4 both succeeded on the re-run, producing fully hardened clusters with audit logging, PSA enforcement, and network policies. The hostname fix and protectKernelDefaults guidance were sufficient for these models.
MiniMax and DeepSeek V3.2 failed for a new reason: PodSecurityPolicy. Both models used the deprecated PodSecurityPolicy admission plugin (removed in Kubernetes 1.25) on a v1.32.2 cluster. This prevented the API server from starting. This was not caused by the hostname or protectKernelDefaults issues from the first run — it’s a separate knowledge gap about Kubernetes version compatibility.
13 of 16 model families now have successful or near-successful clusters. Claude (Sonnet 4.6, Sonnet 5, Opus 4.6, Opus 4.7, Opus 4.8, Fable 5), GPT (5.4, 5.5), Gemini 3 Flash, MiniMax M3, Qwen 3.6 Plus, Qwen 3.7 Plus, Qwen3.6-35b-a3b (Local), Gemma 4 31B (Local), Kimi K2.6, Kimi K2.7 Code, Kimi K3, and Mistral Medium 3.5 all produced working hardened clusters. Opus 4.7, Sonnet 5, Kimi K2.6, Kimi K2.7 Code, and Mistral Medium 3.5 timed out during later stages (not initial setup) with hardening controls in place. Qwen 3.7 Plus achieved a partial result with PSS and network policies but no deep hardening. DeepSeek V4 Flash created a running cluster but applied no security policies beyond the initial Kind config. DeepSeek V4 Pro remains without a successful cluster. GLM-5.2 improved from 5/40 to 25/40 (tied 14th) after a re-run with 900s timeout, producing comprehensive hardening configs but timing out before applying namespace policies. Tencent HY3 failed to create a cluster (4/40) — provider 504 terminated the session after ~5 minutes, with malformed audit policy and config syntax errors preventing deployment.
Claude Opus 4.7 introduces K8s 1.35 features. The use of AuthenticationConfiguration with AnonymousAuthConfigurableEndpoints to solve the anonymous-auth + health probe conflict is the most sophisticated solution of any model. Previous models either accepted 0/1 Ready state (Opus 4.6) or didn’t disable anonymous auth. Opus 4.7’s dual-layer PSA (cluster-wide AdmissionConfiguration + namespace labels) is also unique.
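
For readers unfamiliar with the AuthenticationConfiguration approach mentioned above, the general shape is sketched below. This is not Opus 4.7's actual file, which is not reproduced in this report. The apiVersion default is an assumption that depends on the Kubernetes version in use, and the three health paths are the commonly used probe endpoints. The file would be passed to the API server with --authentication-config, in place of the --anonymous-auth flag.

```python
import json


def anonymous_probe_config(api_version="apiserver.config.k8s.io/v1beta1"):
    """Build an AuthenticationConfiguration allowing anonymous access to health endpoints only.

    Hypothetical helper; check the apiVersion against the target cluster version.
    """
    health_paths = ("/livez", "/readyz", "/healthz")
    return {
        "apiVersion": api_version,
        "kind": "AuthenticationConfiguration",
        "anonymous": {
            "enabled": True,
            # Anonymous requests are accepted only on these paths; any other
            # unauthenticated request is rejected.
            "conditions": [{"path": p} for p in health_paths],
        },
    }


print(json.dumps(anonymous_probe_config(), indent=2))
```

The design point is that health probes keep working without authentication while the rest of the API is closed to anonymous requests, which avoids the 0/1 Ready state that a blanket anonymous-auth=false produces in Kind.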

Comparative Analysis (Successful Models)

Feature	Sonnet 5 (2026-07-01)	Fable 5 (2026-06-10)	Opus 4.8 (2026-05-31)	Opus 4.7 (2026-04-20)	Opus 4.6 (2026-03-25)	Sonnet 4.6 (re-run)	GPT 5.5 (2026-04-25)	Qwen-35b LOCAL (2026-05-03)	GPT 5.4 (re-run)	Qwen 3.6 Plus	Kimi K2.6	GPT 5.6 Sol (2026-07-14)	Kimi K3 (2026-07-16)	MiniMax M3 (2026-06-08)	K2.7 Code (2026-06-16)	Gemini 3 Flash (original)	Gemma 4 31B LOCAL (2026-05-03)	Qwen 3.7 Plus (2026-06-05)
Audit logging	Good (comprehensive policy, two-level mount)	Excellent (comprehensive, noise filtering, rotation)	Excellent (comprehensive policy, rotation, 30d/10/100MB)	Best (granular policy, noise filtering, rotation)	Best (granular policy, noise filtering)	Best (granular policy, noise filtering)	Good (functional policy, two-level mount)	Excellent (granular, noise filtering, 3.6MB)	Good (standard policy)	Basic (pods, secrets/configmaps, catch-all)	Good (two-level mount, pods+secrets/auth)	Good (comprehensive policy, two-level mount)	Good (two-level mount with rotation)	Good (multi-level with rotation)	Good (comprehensive policy, RequestResponse for pods/RBAC/exec)	Good (standard policy)	Good (standard policy, two-level mount)	Basic (pods, secrets/configmaps, rotation)
API server hardening	Good (profiling disabled, TLS 1.2, Node+RBAC, 6 plugins — no anon-auth)	Good (NodeRestriction, profiling disabled, TLS 1.2, no anon-auth)	Best (anon-auth=false, profiling, encryption, TLS 1.2, ciphers, SA lookup)	Best (AuthenticationConfig, 17 plugins, profiling)	Best (TLS, ciphers, encryption, profiling)	Best (TLS, ciphers, timeout, profiling)	Good (admission plugins, profiling disabled)	Good (AuthConfig, profiling, SA lookup, 4 plugins)	Good (basic hardening)	Partial (7 plugins, TLS 1.2, no anon-auth/profiling)	Partial (6 plugins, Node,RBAC, no anon-auth/profiling)	Good (aesgcm encryption at rest, admission plugins)	Partial (profiling=false, TLS 1.2)	Good (anon-auth=false, TLS 1.2, ciphers, profiling disabled)	Partial (TLS ciphers, NodeRestriction, cm/sched localhost binding — no anon-auth/profiling)	None	None (basic admission plugins only)	None (audit logging flags only)
Kubelet hardening	Good (readOnlyPort=0, eventRecordQPS=0, streamingConnectionIdleTimeout — no anon auth disable/Webhook)	Good (seccompDefault, streamingConnectionIdleTimeout, no readOnlyPort=0)	Excellent (anon disabled, Webhook, readOnlyPort=0, TLS 1.2+ciphers)	Excellent (KubeletConfiguration + TLS ciphers)	Excellent (KubeletConfiguration object)	Best (KubeletConfiguration + TLS bootstrap)	Good (anon disabled, Webhook, readOnlyPort=0)	Good (anon disabled, Webhook, readOnlyPort=0, TLS ciphers)	Good (kubeletExtraArgs)	Good (dual config, anon disabled, Webhook, readOnlyPort=0)	Good (anon disabled, Webhook, readOnlyPort=0)	Best (comprehensive hardening on all nodes)	None	Partial (readOnlyPort 0, anon disabled, Webhook — via Kind config)	None	None	None	None (removed after error)
Controller/scheduler	None	None	Both hardened (profiling disabled)	Both hardened (profiling, pod GC)	None	Profiling disabled, pod GC	Profiling disabled	Both hardened (profiling)	Profiling disabled, pod GC	None	None	None	None	None	Partial (localhost binding)	None	None	None
PSA enforcement	Best (cluster-wide + namespace, baseline on dev)	Tiered (dev baseline, test/prod restricted)	Best (cluster-wide + namespace, baseline on dev)	Best (cluster-wide + namespace, baseline on dev)	Restricted on test/prod	Restricted on test/prod, baseline on dev	Restricted on test/prod	Best (cluster-wide + namespace, baseline on dev)	Restricted on test/prod	Restricted on test/prod	Restricted on test/prod, baseline on dev	Good (dual-layer, cluster-wide baseline + namespace restricted)	Three-tier (dev=privileged, test=baseline+restricted, prod=restricted)	Restricted on test/prod, baseline on dev	Restricted on test/prod (missing dev)	Restricted on test/prod	Restricted on test/prod, baseline on dev	Restricted on test/prod, baseline on dev
Network policies	Deny-all + DNS + same-ns on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod (scoped to kube-dns)	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS + intra-ns on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all + DNS on test/prod	Deny-all on test/prod (no DNS egress)
Encryption at rest	No	No	Yes (AES-CBC)	No	Yes (aescbc)	No	Yes	No	No	No	No	Yes (aesgcm)	Yes (aescbc, verified in etcd)	No	No	Yes (aescbc)	No	No
ResourceQuotas/LimitRanges	Yes (all 3 namespaces)	Yes (test/prod)	No	Yes (all 3 namespaces)	Yes (test/prod)	No	Yes	Yes (all 3 namespaces)	No	Yes (test/prod, most detailed)	Yes (all 3 namespaces)	Yes	No	No	Yes	No	No	No
SA token restriction	Yes (all 3 namespaces)	No	No	Yes (default SA in all ns)	No	No	No	Yes (default SA in all ns)	No	No	No	No	No	No	Yes	No	No	No
Verification	Timed out before verification	Basic checks	PSA enforcement test, network policy isolation	Timed out before verification	PSA rejection test, audit check	PSA rejection test, audit check	Basic checks	Verification testing completed	Anonymous auth test, PSA test	Applied manifests, basic checks	PSA test pod, timed out	Timed out before verification	Thorough (live network tests, etcd inspection)	PSA enforcement test, network policy isolation	Basic checks	PSA via kubectl, audit check	Basic cluster checks	Basic checks

Best overall: Claude Sonnet 4.6 (39/40), with the most comprehensive security across all layers. Claude Opus 4.6 (38/40) is close behind with the broadest feature set (encryption, quotas).

Claude Fable 5, Claude Opus 4.8, and Opus 4.7 (all 37/40): tied at 3rd place. Fable 5 distinguished by first-attempt Calico CNI selection, comprehensive audit policy with noise filtering, tiered PSS, and LimitRanges/ResourceQuotas for test/production. Opus 4.8 distinguished by proactive Calico CNI recreation for NetworkPolicy enforcement and encryption at rest. Opus 4.7 has unique AuthenticationConfiguration and dual-layer PSA but timed out during verification.
Claude Sonnet 5 (36/40): 6th place. Cluster-wide AdmissionConfiguration + per-namespace PSA labels, default-deny network policies with DNS egress and same-namespace policies, ResourceQuotas, LimitRanges, and SA token automount disabled on all 3 namespaces. Main gaps: anonymous-auth=false (removed after first-attempt probe failure) and encryption at rest. Timed out during verification with all hardening controls in place.
GPT 5.5 and Qwen3.6-35b-a3b (both 35/40): tied at 7th place. GPT 5.5 includes encryption at rest, ResourceQuotas, LimitRanges, and controller-manager/scheduler hardening. Qwen3.6-35b-a3b (a local 35B model) matches GPT 5.5’s score with cluster-wide AdmissionConfiguration, per-endpoint anonymous auth, SA token restriction, and ResourceQuotas/LimitRanges on all 3 namespaces, which is impressive for a locally-hosted model running with extended timeout.
Kimi K3 (33/40): 10th place. First-attempt cluster creation with encryption at rest (aescbc verified in etcd), three-tier PSS enforcement, default-deny network policies with DNS exception, and API server profiling disabled with TLS 1.2. Zero kubelet hardening is the main gap.
GPT 5.6 Terra (32/40): tied 11th with Qwen 3.6 Plus. Comprehensive security configuration with audit logging, encryption at rest, and strong kubelet hardening, but cluster failed to start due to anonymous-auth=false probe failures.
Qwen 3.6 Plus (32/40): solid result with good PSA, network policies, and the most detailed ResourceQuotas of any model, but lacked API server hardening depth.
Kimi K2.6 (31/40): good security coverage with PSA, network policies including intra-namespace rules, and ResourceQuotas/LimitRanges for all namespaces, but repeated cluster creation failures consumed the timeout budget.
GPT 5.6 Sol (30/40): comprehensive configuration with aesgcm encryption at rest (authenticated encryption, unique among all models), dual-layer PSA, default-deny NetworkPolicies with DNS egress, and comprehensive kubelet hardening on all nodes. Spent 354s on configuration files, leaving insufficient time for cluster creation. disableDefaultCNI configured without confirmed Calico installation.
MiniMax M3 (29/40): a major improvement for the MiniMax family (M2.5: 10, M2.7: 20, M3: 29). First-attempt success with clean methodical execution. Good API server hardening (anonymous-auth=false, TLS 1.2+, profiling disabled) and kubelet hardening via Kind config. Verified both PSS enforcement and network policy isolation. Missing encryption at rest and ResourceQuotas.
Kimi K2.7 Code (29/40): ties with MiniMax M3 at 15th place. Comprehensive audit policy and ResourceQuotas/LimitRanges, but no kubelet hardening.
Gemma 4 31B (25/40): second local model tested; achieved correct PSS enforcement, audit logging, and network policies in just 5 tool calls, but applied no API server or kubelet hardening. Demonstrates that the security fundamentals (PSS, network policies, audit logging) are within reach of smaller local models, while deeper hardening remains a gap.
Qwen 3.7 Plus (21/40): achieved PSS and network policies but no API server or kubelet hardening beyond audit logging. Network policies lack DNS egress, rendering them non-functional. Conservative approach after recovering from an initial PodSecurity feature gate error meant deeper hardening was never attempted.

Note: The following models are excluded from this table.

DeepSeek V4 Pro: never created a cluster, but its designed configuration (API server, kubelet, controller-manager, scheduler, etcd hardening) was among the most comprehensive of any model tested. Had execution completed, it would likely have placed in the 32-36/40 range based on configuration quality alone.
DeepSeek V4 Flash: created a running cluster but applied no post-creation hardening (no namespaces, no PSS labels, no network policies).
MiniMax M2.5 and M2.7: neither produced a running cluster within the timeout.
GLM-5.2: the re-run with 900s timeout created a running cluster with strong API server and kubelet hardening, but PSS namespace manifests were never applied and no network policies were attempted.
Mistral Medium 3.5: created a running cluster after 3 attempts but timed out at 22/40, with strong PSS enforcement (5/5 verified) but no kubelet hardening and limited API server hardening.
Tencent HY3: provider 504 terminated the session after ~5 minutes, with malformed audit policy and config syntax errors preventing cluster creation (4/40).
GPT 5.6 Terra: comprehensive security configuration (32/40) with audit logging, encryption at rest, and strong kubelet hardening, but cluster failed to start due to anonymous-auth=false probe failures (same issue that affected Opus 4.7 and Sonnet 5). Tied with Qwen 3.6 Plus at 32/40.

Original vs Re-run Key Findings

The hostname issue was a test framework bug, not a model bug. The tool-generated names were too long. GPT 5.4’s strict adherence to the “MUST use this name” instruction was actually correct behaviour — the instruction was wrong. Fixed by shortening the generated names.
protectKernelDefaults was a reasonable choice that doesn’t work in Kind. Models that set this flag were making the right security decision for production clusters. The fact that it’s incompatible with Kind is a platform limitation, not a security knowledge gap. The guidance correctly reframes this.
PodSecurityPolicy is a knowledge currency problem. MiniMax and DeepSeek V3.2 both used the deprecated PodSecurityPolicy (removed in K8s 1.25) instead of the current PodSecurity admission plugin. This suggests their training data may be weighted toward older Kubernetes documentation. This was not addressed in the additional guidance and could be added as further context if needed.
Time management remains critical. Even with guidance, MiniMax needed 3 attempts and DeepSeek spent too long debugging. Models that build incrementally (Claude, GPT 5.4, Qwen 3.6 Plus) outperform those that attempt comprehensive configs that fail (MiniMax, DeepSeek V3.2).
Qwen 3.6 Plus demonstrates strong fundamentals with gaps in depth. Qwen 3.6 Plus achieved correct PSA, network policies, and the most detailed ResourceQuotas (including PVC and pod-level limits) but missed API server hardening basics like anonymous-auth=false and profiling=false. This pattern — strong on Kubernetes-native security features, weaker on API server flag-level hardening — distinguishes it from the Claude models.
DeepSeek V4: contrasting failure modes. V4 Pro produced one of the most comprehensive hardening configurations of any model (API server, kubelet, etcd, controller-manager, scheduler) but the opencode session terminated before kind create cluster was ever run. V4 Flash took the opposite approach — created a running cluster on the second attempt but stopped after basic verification without creating namespaces or applying any security policies. Together they illustrate a spectrum: V4 Pro over-planned and never executed, while V4 Flash under-planned and declared victory too early. Neither DeepSeek V4 variant demonstrated the iterative build-and-harden workflow that successful models (Claude, GPT 5.4, Qwen 3.6 Plus) used.