Structured experiments and field reports from the governed AI lab. Each lab states a hypothesis, runs a controlled procedure, and publishes the data — whether the hypothesis holds or not. Reproducibility is the discipline.
Operational labs on the Apple Silicon home-lab itself — hardware bring-up, inference benchmarks, model installs, and the cluster substrate the other series depend on.
Recursive learning on local models. Can a governed agent improve by iterating on its own outputs without violating its constitutional boundaries? Behavioral consistency, self-critique loops, composition drift, recursive ceiling.
End-to-end governed automation on Apple Silicon. ClawLaw install, boundary enforcement, escalation flow, composition detection, audit trail integrity, multi-agent contention.
Model drift detection and inference reliability as a continuous MLOps discipline. Token throughput benchmarks, latency SLO definition, drift classification, alert routing from model metrics to on-call.
What tokens-per-second actually measures, why most published comparisons are methodologically broken, and how to run a controlled experiment that produces data you can trust.
Bring up a single-node Apple Silicon home-lab. Ollama and MLX side by side, Open WebUI as the lab interface, Langfuse capturing every call. Six experiments from baseline tok/s to governed inference.
Install the OpenClaw agentic CLI, wire it to four backends (Anthropic, OpenAI, Gemini, and a local Ollama model), and run six experiments comparing tool-use behavior across providers.
Establish the ceiling for behavioral consistency on a fixed-prompt local agent.
Test whether self-evaluation produces measurable quality gains under governance.
Detect quality drift inside a session before it reaches the output boundary.
Find the plateau point where iterative self-improvement stops yielding gains.
Constitutional governance bootstrap on a clean Apple Silicon node.
Verify the filesystem boundary holds against sequential probe patterns.
Confirm ESCALATE verdicts pause execution until principal review completes.
Catch boundary-probe sequences using session-aware composition rules.
Replay the audit log against the same initial state and prove determinism.
Two governed agents, one governance layer — verify the state store under contention.
A full 8-hour governed development session, end to end, with zero governance failures.