Silicon · Decisions · ADR-0006

The governance split.

Where does the governance layer live, and who is allowed to change it? The question every serious AI deployment eventually has to answer — and the architecture that answers it without compromise.

Status: Accepted
Format: Position paper · ADR
Area: Governance
01 · The question

Two questions, before any others.

Any AI governance system worth taking seriously has to answer two structural questions before it answers anything else. The instinct is to answer both the same way: governance lives outside, on dedicated hardware, changeable only by a separate authority. That instinct is right about where the boundary belongs and wrong about where the policy belongs.

Question 01
Where does it live?

Inside the platform that runs the agents, or outside it on independent infrastructure?

Question 02
Who can change it?

The same operators who run the agent workloads, or someone structurally separate from them?

02 · Two forces that don't cooperate

The lab is built to demonstrate both. They pull against each other.

One requirement wants inference inside the cluster. The other wants the governor structurally out of the governed's reach. The cheap resolutions sacrifice one to satisfy the other.

The MLOps requirement
Inference must run inside the cluster.

If inference never enters Kubernetes, the model lifecycle never closes. Training, registry, and deployment are all platform-managed — and then the actual inference step runs on hardware outside the platform. The pipeline is half-managed, half-ad-hoc. The MLOps claim can't be made honestly.

The governance requirement
The governed cannot modify the governor.

The first requirement of any governance system is that the thing being governed can't change the thing governing it. Physical separation on dedicated hardware is the strongest form of that guarantee. Anything weaker is a promise rather than a structure.

03 · Three options

Two carry a structural cost. One doesn't.

"Governance" turns out to be two functions — policy management (the rules, thresholds, and version history) and policy enforcement (the gate that turns a proposed action into a verdict). The winning option is the one that stops treating them as a single thing.

Option A · Rejected
Keep everything outside

Governance stays on dedicated hardware. Inference inside the cluster routes out to it for every action.

The cost

The MLOps pipeline never closes. A network round-trip sits in the hot path, and the cluster is permanently calling an external service to evaluate its own workloads.

Option B · Rejected
Move everything inside

Governance becomes a cluster service. GitOps manages it, ArgoCD deploys it, Prometheus observes it.

The cost

Structural separation collapses. Anyone with cluster-admin rights can redeploy the governance layer — so a cluster compromise is a governance compromise.

Option C · Accepted
Split the two functions

Recognise that "governance" is two jobs — managing policy and enforcing it — and give each a different home.

Why it works

Policy management runs inside the cluster, with every operational benefit. Enforcement runs at a boundary the cluster cannot reach. Both stories hold.

04 · The decision

Policy management lives in the cluster. Policy enforcement lives on the M4 Pro. A signed channel connects them.

Split the governance architecture into two layers with different infrastructure homes. The cluster pushes policy. The boundary applies it. The cluster cannot reach the boundary as a workload.

Inside the cluster · worker-02
Policy management

Rules, risk thresholds, escalation criteria, version history. GitOps-managed, ArgoCD-deployed, Harbor-versioned, observable through the telemetry stack. The Git commit log is the audit trail.

signed policy →
On the M4 Pro · Tier II
Policy enforcement

The gate that intercepts each proposed action and issues a verdict. Runs on a separate machine, a different OS, behind a different network boundary. An agent inside K3s cannot modify it.

05 · Why this is the right answer

Separation is about enforcement, not declaration.

The argument against putting governance inside the cluster turns on one thing: the governed must not be able to modify the governor. That is a claim about enforcement, not about where the rules are written down.

The Kyverno precedent: Kyverno's policies live as ordinary resources inside the very cluster they govern. Nobody calls that a separation failure, because enforcement happens in an admission webhook that the API server calls before a request is persisted, a structural boundary that ordinary workloads do not control. The rules live inside; the gate sits where the workloads can't reach it. This decision applies the same pattern, with the boundary at an unusually concrete location: a separate physical machine.

Governance gets stronger operations, not weaker

Putting policy management on the platform makes governance harder to change ad hoc, not easier. Every policy change becomes a Git commit — reviewed, dry-run against live traffic, promoted through the same pipeline as application config. The audit trail is the commit log. The alerting fires on policy drift. Governance that is unobservable is governance that cannot be improved, and external dedicated hardware is exactly where governance goes unobserved.
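One way the drift alert mentioned above could be driven is a digest comparison between the policy committed in Git and the policy the boundary reports it is enforcing. The sketch below is illustrative only: `policy_digest`, `has_drifted`, and the idea that the boundary reports a SHA-256 digest are assumptions, not details given in this decision.

```python
import hashlib
import json


def policy_digest(policy: dict) -> str:
    """Hash a policy deterministically so equal policies give equal digests."""
    # Sorted keys and fixed separators keep the byte form stable across runs.
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(committed: dict, enforced_digest: str) -> bool:
    """Return True when the enforced policy differs from the committed one.

    `enforced_digest` is assumed to be reported by the enforcement boundary.
    """
    return policy_digest(committed) != enforced_digest


# Hypothetical policy content, for illustration only.
committed_policy = {"version": 7, "max_risk": 0.4}
reported_by_boundary = policy_digest({"max_risk": 0.4, "version": 7})
stale_report = policy_digest({"version": 6, "max_risk": 0.4})
```

Key order does not affect the digest, so `reported_by_boundary` matches the committed policy while `stale_report` would trigger the alert.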

This is the pattern that scales

Beyond a single deployment, governance cannot run on dedicated external hardware indefinitely — the cost and operational overhead compound until something gives. The pattern that scales is the one this decision implements: policy as a platform service, enforcement at a structural boundary, with a verified channel between them. It is the architecture inside hardware security modules, Kubernetes admission webhooks, and network access control. Most production deployments will eventually move the boundary from a physical machine to a logical equivalent — an attested execution environment, a separately-administered account. The split is the architecture; the boundary's implementation is a deployment detail.

The MLOps story closes

With the split accepted, inference becomes a fully governed, fully schedulable Kubernetes workload. Training, evaluation, registry promotion, inference, observability, governance, and evidence capture all live in one operational environment.

What the decision triggers
New component
The governance proxy

A new service on worker-02. It owns policy management, versioning, and the routing that connects the inside of the cluster to the enforcement layer outside it. Without it, the split has no inside half.

New channel
A signed distribution path

A defined protocol for pushing policy from cluster to boundary — authenticated and cryptographically signed. The boundary refuses policy it cannot verify came from an authorised source. Anything weaker turns the split back into a trust relationship.
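A minimal sketch of the refusal rule, under a loud assumption: the decision does not name a signing scheme, so this example substitutes a shared-secret HMAC-SHA256 tag for a true asymmetric signature (Python's standard library has no public-key signing). The names `SHARED_KEY`, `sign_policy`, and `verify_policy` are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical shared key. A real deployment would more likely use an
# asymmetric scheme so that the boundary holds only a public key.
SHARED_KEY = b"example-key-not-for-production"


def canonical_bytes(policy: dict) -> bytes:
    """Serialise the policy deterministically so both sides hash the same bytes."""
    return json.dumps(policy, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_policy(policy: dict, key: bytes = SHARED_KEY) -> str:
    """Cluster side: compute an HMAC-SHA256 tag over the canonical policy."""
    return hmac.new(key, canonical_bytes(policy), hashlib.sha256).hexdigest()


def verify_policy(policy: dict, tag: str, key: bytes = SHARED_KEY) -> bool:
    """Boundary side: accept the policy only if the tag matches."""
    expected = sign_policy(policy, key)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))


# Hypothetical policy content, for illustration only.
policy = {"version": 7, "max_risk": 0.4}
tag = sign_policy(policy)
tampered = {"version": 7, "max_risk": 0.9}
```

Here `verify_policy(policy, tag)` succeeds, while the tampered policy, a forged tag, or a different key are all refused. With a shared key the boundary could also forge policy, which is why an asymmetric signature is the closer match to the separation this decision describes.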

New failure mode
A fail-closed default

If policy management is unavailable — node down, network partition, upgrade in progress — the boundary needs a safe default. Three candidates: deny all, apply last-known-good, or escalate everything to a human. Fail-closed is the committed default.
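The three candidates can be sketched as fallback strategies the boundary selects when it has no current verified policy. This is an illustration under stated assumptions: the `Fallback` enum, the `decide` function, and the reading of fail-closed as deny-all are hypothetical and not taken from the lab's implementation. Only the verdict names come from this document.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    ESCALATE = "ESCALATE"


class Fallback(Enum):
    DENY_ALL = "deny_all"
    LAST_KNOWN_GOOD = "last_known_good"
    ESCALATE_ALL = "escalate_all"


# A policy is modelled here as a function from an action to a verdict.
Policy = Callable[[dict], Verdict]


def decide(
    action: dict,
    current: Optional[Policy],
    last_known_good: Optional[Policy],
    fallback: Fallback = Fallback.DENY_ALL,  # assumed meaning of "fail-closed"
) -> Verdict:
    """Evaluate an action, falling back safely when no current policy exists."""
    if current is not None:
        return current(action)
    if fallback is Fallback.LAST_KNOWN_GOOD and last_known_good is not None:
        return last_known_good(action)
    if fallback is Fallback.ESCALATE_ALL:
        return Verdict.ESCALATE
    # Deny is also the result when last-known-good is requested but absent.
    return Verdict.DENY


def allow_all(action: dict) -> Verdict:
    """Toy policy used only to exercise the fallback paths."""
    return Verdict.ALLOW
```

Note that no path returns ALLOW unless some verified policy, current or last-known-good, produced it.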

What does not change

The enforcement kernel on the M4 Pro is unchanged. It still evaluates proposed actions and issues verdicts — it now receives policy from the cluster proxy rather than managing it locally. The evaluation logic, the verdict format (ALLOW · DENY · ESCALATE), and the evidence schema are all the same. The audit record still captures the verdict, the policy version that produced it, and the full action context.
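As a rough illustration of the audit record described above, the sketch below models its three stated parts: the verdict, the policy version, and the action context. The class name `AuditRecord`, the field names, and the JSON serialisation are assumptions; the actual evidence schema is not reproduced in this document.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    ESCALATE = "ESCALATE"


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical shape for one audit entry; field names are assumed."""

    verdict: Verdict
    policy_version: str
    action_context: dict

    def to_json(self) -> str:
        # Serialise the enum by value so the record is plain JSON.
        return json.dumps(
            {
                "verdict": self.verdict.value,
                "policy_version": self.policy_version,
                "action_context": self.action_context,
            },
            sort_keys=True,
        )


# Example values are invented for illustration.
record = AuditRecord(
    verdict=Verdict.DENY,
    policy_version="v7",
    action_context={"agent": "agent-1", "action": "delete"},
)
```

Recording the policy version alongside each verdict is what lets an entry be traced back to the Git commit that produced the rule.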