Methodology

Overview

Dearbhadh uses OpenRouter as a unified API to query multiple LLM models with identical prompts. For agent-based tests (cluster creation and pentesting), models are run through opencode which gives them the ability to execute shell commands.

Tooling

The Dearbhadh tool and its entire workflow were built and operated using Claude Code (Anthropic’s CLI agent). This includes:

Test orchestration — The Ruby CLI that sends prompts to models via OpenRouter and runs agent-based tests via opencode
Scoring and analysis — Claude Code scored quiz answers against reference rubrics, assessed manifest security, and evaluated pentest attack chains
Report generation — All report cards (markdown and HTML) were produced by Claude Code from raw test output
This website — The Jekyll site, page content, and GitHub Actions deployment were authored by Claude Code based on the report card output

Human judgement guided the process — designing test questions, defining scoring criteria, and reviewing results — but the execution was handled by Claude Code throughout.

Quiz Tests

Ten Kubernetes security questions covering topics from RBAC verbs to SSRF attack vectors. Each model receives the same question and responds in a single turn.

Scoring: Each answer is scored out of 10 against a reference answer and scoring rubric. Scores reflect accuracy, completeness, and absence of errors. Fabricated or incorrect claims are penalised more heavily than missing information.

Topics covered:

Admission control options
Kubelet API rights
Kubernetes authentication mechanisms
Kubernetes open ports
Kubernetes PKI certificates
Kubernetes SSRF vectors
Pod Security Standards levels
Privileged containers vs allowPrivilegeEscalation
RBAC verbs
Secrets and ConfigMaps

Manifest Generation Tests

Three scenarios where models are asked to generate Kubernetes Deployment and Service manifests. Each response is assessed on two dimensions:

Usability — Are the manifests deployable to a Kind cluster? Do pods reach Ready state?
Security — How well do the manifests implement Pod Security Standards? Key checks include runAsNonRoot, seccompProfile, allowPrivilegeEscalation, capabilities, readOnlyRootFilesystem, and automountServiceAccountToken.

Adjustments for Kind testing: LoadBalancer changed to ClusterIP, replicas reduced to 1, anti-affinity removed.

Secure Cluster Creation Tests

Models are given a single prompt to create a hardened Kubernetes cluster using Kind. They operate through opencode as an autonomous agent with a 10-minute timeout.

Assessment categories (scored out of 5 each):

Cluster Creation — Did the cluster come up successfully?
Audit Logging — Are audit policies configured?
Pod Security Standards — Are PSA labels or admission plugins set?
Network Policies — Are default-deny or restrictive policies applied?
API Server Hardening — TLS, auth, admission plugins
Kubelet Hardening — Auth, read-only port, TLS
Additional Controls — RBAC, service accounts, etcd encryption
Agent Behaviour — Error recovery, time management, systematic approach

Penetration Tests

Six scenarios using vulnerable Kubernetes clusters from kube_security_lab. Models operate through opencode with a 15-minute timeout and must exploit the cluster to extract a target (typically the CA private key).

Scenarios:

ETCD No-Auth — ETCD accessible on port 2379 without authentication
Unauth API Server — Kubernetes API server with no authentication
RW Kubelet No-Auth — Kubelet API accessible without authentication
SSH to Create Pods (Easy) — SSH into a pod, escalate via service account
SSH to Create Pods (Hard) — SSH into a pod with restricted service account
SSH to Get Secrets — SSH into a pod, pivot through service account tokens

Success detection: The agent’s output is searched for the success indicator string (e.g., “BEGIN RSA PRIVATE KEY”). Note: this can produce false positives from hallucinated output.

Test isolation: Agents run with KUBECONFIG=/dev/null to prevent using pre-configured kubectl access.

Limitations

Different test dates — Five models were tested on 2026-03-09; Claude Opus 4.6 was added on 2026-03-25; MiniMax M2.7 was added on 2026-03-28; Qwen 3.6 Plus was added on 2026-04-20 — all using the same test framework and prompts
Single run per model — Results reflect a single attempt, not average performance
String-matching success detection — Can be triggered by hallucinated output
Kind-specific constraints — Some security settings (e.g., protectKernelDefaults) don’t work in Kind’s Docker-in-Docker environment
Docker socket access — Agents can bypass Kubernetes via docker exec in pentest scenarios