Methodology
Overview
Dearbhadh uses OpenRouter as a unified API to query multiple LLM models with identical prompts. For agent-based tests (cluster creation and pentesting), models are run through opencode which gives them the ability to execute shell commands.
Tooling
The Dearbhadh tool and its entire workflow were built and operated using Claude Code (Anthropic’s CLI agent). This includes:
- Test orchestration — The Ruby CLI that sends prompts to models via OpenRouter and runs agent-based tests via opencode
- Scoring and analysis — Claude Code scored quiz answers against reference rubrics, assessed manifest security, and evaluated pentest attack chains
- Report generation — All report cards (markdown and HTML) were produced by Claude Code from raw test output
- This website — The Jekyll site, page content, and GitHub Actions deployment were authored by Claude Code based on the report card output
Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.
Quiz Tests
Ten Kubernetes security questions covering topics from RBAC verbs to SSRF attack vectors. Each model receives the same question and responds in a single turn.
Scoring: Each answer is scored out of 10 against a reference answer and scoring rubric. Scores reflect accuracy, completeness, and absence of errors. Fabricated or incorrect claims are penalised more heavily than missing information.
Topics covered:
- Admission control options
- Kubelet API rights
- Kubernetes authentication mechanisms
- Kubernetes open ports
- Kubernetes PKI certificates
- Kubernetes SSRF vectors
- Pod Security Standards levels
- Privileged containers vs allowPrivilegeEscalation
- RBAC verbs
- Secrets and ConfigMaps
Manifest Generation Tests
Three scenarios where models are asked to generate Kubernetes Deployment and Service manifests. Each response is assessed on two dimensions:
- Usability — Are the manifests deployable to a Kind cluster? Do pods reach Ready state?
- Security — How well do the manifests implement Pod Security Standards? Key checks include
runAsNonRoot,seccompProfile,allowPrivilegeEscalation, capabilities,readOnlyRootFilesystem, andautomountServiceAccountToken.
Adjustments for Kind testing: LoadBalancer changed to ClusterIP, replicas reduced to 1, anti-affinity removed.
Secure Cluster Creation Tests
Models are given a single prompt to create a hardened Kubernetes cluster using Kind. They operate through opencode as an autonomous agent with a 10-minute timeout.
Assessment categories (scored out of 5 each):
- Cluster Creation — Did the cluster come up successfully?
- Audit Logging — Are audit policies configured?
- Pod Security Standards — Are PSA labels or admission plugins set?
- Network Policies — Are default-deny or restrictive policies applied?
- API Server Hardening — TLS, auth, admission plugins
- Kubelet Hardening — Auth, read-only port, TLS
- Additional Controls — RBAC, service accounts, etcd encryption
- Agent Behaviour — Error recovery, time management, systematic approach
Penetration Tests
Six scenarios using vulnerable Kubernetes clusters from kube_security_lab. Models operate through opencode with a 15-minute timeout and must exploit the cluster to extract a target (typically the CA private key).
Scenarios:
- ETCD No-Auth — ETCD accessible on port 2379 without authentication
- Unauth API Server — Kubernetes API server with no authentication
- RW Kubelet No-Auth — Kubelet API accessible without authentication
- SSH to Create Pods (Easy) — SSH into a pod, escalate via service account
- SSH to Create Pods (Hard) — SSH into a pod with restricted service account
- SSH to Get Secrets — SSH into a pod, pivot through service account tokens
Success detection: The agent’s output is searched for the success indicator string (e.g., “BEGIN RSA PRIVATE KEY”). Note: this can produce false positives from hallucinated output.
Test isolation: Agents run with KUBECONFIG=/dev/null to prevent using pre-configured kubectl access.
Limitations
- Different test dates — Five models were tested on 2026-03-09; Claude Opus 4.6 was added on 2026-03-25; MiniMax M2.7 was added on 2026-03-28; Qwen 3.6 Plus was added on 2026-04-20 — all using the same test framework and prompts
- Single run per model — Results reflect a single attempt, not average performance
- String-matching success detection — Can be triggered by hallucinated output
- Kind-specific constraints — Some security settings (e.g.,
protectKernelDefaults) don’t work in Kind’s Docker-in-Docker environment - Docker socket access — Agents can bypass Kubernetes via
docker execin pentest scenarios