Install the OpenClaw agentic CLI, wire it to four backends (OpenAI, Gemini, DeepSeek, and a local Ollama model), and run six experiments comparing tool-use behaviour across providers.
Lab 002 installed a model. This lab installs the agent — the thing that uses models to do work — and points it at four different brains in turn. Frontier API on one side, sovereign local inference on the other, the same agent in the middle.
A model on its own is a brain in a jar. It can answer when prompted, but it cannot read a file, run a command, fetch a URL, or modify a project. An agentic CLI is what closes that gap. It exposes a small, fixed set of tools — read, write, shell, search — and lets the model call them in a loop until the task is done. OpenClaw is one such CLI. It is open source, runs locally, and is provider-agnostic: the model behind the agent is a configuration choice, not a fork in the road.
That last property is what makes this lab worth running. The same agent, the same tools, the same task — pointed at OpenAI one minute and a 26B parameter model on the Mac next to it the minute after. The differences that fall out of that comparison are the practitioner's actual signal: where frontier models earn their price, where local inference is already enough, and where DeepSeek V4 — released three days ago — slots in between the two.
Lab 002 brought up Gemma 4 on Apple Silicon. That gave us an inference endpoint. Lab 003 brings up the consumer of that endpoint and a few API alternatives, so the question shifts from "can this model talk?" to "can this agent do work, and at what cost on which backend?" Lab 004 (planned) installs ClawLaw on top of this stack — the governance layer that turns the agentic loop from convenient into auditable.
If Lab 002 is complete, most of this is already done. The new work in this lab is Node.js and three sets of API credentials. Run each phase in order and confirm the version output matches before moving on.
OpenClaw is distributed as a Node package. Use Homebrew rather than a system installer so updates flow through brew upgrade with the rest of the lab toolchain.
```bash
# Install Node; the LTS line is sufficient for OpenClaw
brew install node
node --version
# → v20.x.x
npm --version
# → 10.x.x
```
Three API keys to provision, plus a Hugging Face token. None of these carries a standing cost: OpenClaw will not spend tokens until you invoke it. But treat each key as a production credential: generate per-machine, never commit, and rotate when this lab ends.
Key name pattern: forge-lab-[tool]-[YYYY-MM]. The Hugging Face token is used here only for downloading mlx-community quantisations.

Store keys in ~/.config/forge/secrets.env with chmod 600. Do not put them in .zshrc, in any file under a Git directory, or in shell history. The lab scripts expect them sourced from this single file. Worker-04's Vault instance is the long-term destination; for this lab, the permission-restricted file is sufficient.
```bash
mkdir -p ~/.config/forge
touch ~/.config/forge/secrets.env
chmod 600 ~/.config/forge/secrets.env

# Edit ~/.config/forge/secrets.env: fill in real values, no quotes around them
OPENAI_API_KEY=sk-proj-...
GEMINI_API_KEY=AIza...
DEEPSEEK_API_KEY=sk-...
HF_TOKEN=hf_...

# Source it in the current shell (or add to ~/.zprofile guarded by a check)
set -a; source ~/.config/forge/secrets.env; set +a
```
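A script that consumes this file can enforce the chmod 600 rule itself before reading anything. The following is a minimal sketch, not part of OpenClaw or the lab scripts; `load_secrets` is a hypothetical helper and the values are placeholders.

```python
import stat
import tempfile
from pathlib import Path


def load_secrets(path: Path) -> dict:
    """Parse KEY=value lines, refusing files readable by group or others."""
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600")
    secrets = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        secrets[key.strip()] = value.strip()
    return secrets


# Demo against a throwaway file with placeholder values (not real keys).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "secrets.env"
    demo.write_text("# demo\nOPENAI_API_KEY=sk-proj-placeholder\nHF_TOKEN=hf_placeholder\n")
    demo.chmod(0o600)
    loaded = load_secrets(demo)
    print(sorted(loaded))
```

Failing loudly on a loose file mode keeps the permission rule from silently drifting after a copy or restore.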
Required only if Lab 002 was skipped. The local backend needs an OpenAI-compatible endpoint at http://localhost:11434/v1. Confirm before continuing.
```bash
# Health check
curl -s http://localhost:11434
# → Ollama is running

# Confirm at least one model is present; gemma4:26b is the lab default
ollama list | grep gemma4
# → gemma4:26b    16 GB    8 days ago
```
OpenClaw is a Node package that ships a CLI binary and a long-running daemon. The CLI is what you type into; the daemon is what holds session state and proxies tool calls. They communicate over a localhost HTTP port — by default 18789. Same port discipline as the lab-series-darwin-openclaw doc; nothing exotic.
```bash
npm install -g openclaw
openclaw --version
# → openclaw 0.x.x

# Check that the binary is on PATH and resolves to the brew-managed Node prefix
which openclaw
# → /opt/homebrew/bin/openclaw
```
The agent is a tool, not a dependency. Treat it like git or kubectl — one binary on PATH, not vendored into every repo. The state and config live in ~/.openclaw, isolated from any individual project.
The first run scaffolds the config directory; the daemon is then started explicitly with openclaw daemon start. The default config points at OpenAI; we will rewire it in §3.
```bash
openclaw init
# → Created ~/.openclaw/config.toml
#   Created ~/.openclaw/sessions/
#   Created ~/.openclaw/logs/
#   Daemon not running; start with: openclaw daemon start

openclaw daemon start
# → Daemon listening on http://127.0.0.1:18789
#   Workspace: $PWD
#   Provider: openai (default)
#   Status: READY

# Health check the daemon (this is the same probe ClawLaw will use later)
curl -s http://127.0.0.1:18789/health
# → {"status":"ready","provider":"openai","tools":7,"session":null}
```
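Lab scripts that depend on the daemon can gate on the health payload rather than on the exit code of curl. This is a sketch under the assumption that the payload keeps the shape shown above; `daemon_ready` is a hypothetical helper, and the sample body is copied from the probe output rather than fetched live.

```python
import json
from typing import Optional

# Sample body copied from the health probe output shown in this lab.
SAMPLE = '{"status":"ready","provider":"openai","tools":7,"session":null}'


def daemon_ready(body: str, expected_provider: Optional[str] = None) -> bool:
    """Return True only if the body parses and reports status "ready"."""
    try:
        health = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(health, dict) or health.get("status") != "ready":
        return False
    # Optionally insist that the active provider is the one we expect.
    if expected_provider is not None and health.get("provider") != expected_provider:
        return False
    return True


print(daemon_ready(SAMPLE, "openai"))
```

Checking the provider field as well catches the case where the daemon is up but still bound to a previous backend.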
Open ~/.openclaw/config.toml and read it before you change it. Three sections matter: [provider], [agent], and [tools]. Everything in §3 and §4 is editing this file.
```toml
# OpenClaw configuration; see openclaw docs for full reference
[provider]
name = "openai"
model = "gpt-5.4-mini"
api_key = "$OPENAI_API_KEY"   # env-var interpolation
base_url = "https://api.openai.com/v1"

[agent]
max_iterations = 20
timeout_seconds = 120
working_dir = "$PWD"

[tools]
enabled = ["read", "write", "bash", "glob", "grep", "web_search", "message"]

[daemon]
port = 18789
log_dir = "~/.openclaw/logs"
```
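The `api_key = "$OPENAI_API_KEY"` line relies on env-var interpolation. How OpenClaw resolves it internally is not documented here, so the following is only an illustrative sketch of the idea; `interpolate` and `fake_env` are hypothetical names, and the key is a placeholder.

```python
import re

# Matches ${NAME} or $NAME.
_VAR = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def interpolate(value: str, env: dict) -> str:
    """Replace $NAME / ${NAME} with env[NAME]; fail loudly if NAME is unset."""
    def replace(match):
        name = match.group(1) or match.group(2)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    return _VAR.sub(replace, value)


# The [provider] block from config.toml, written out as a plain dict.
provider = {
    "name": "openai",
    "model": "gpt-5.4-mini",
    "api_key": "$OPENAI_API_KEY",
    "base_url": "https://api.openai.com/v1",
}
fake_env = {"OPENAI_API_KEY": "sk-proj-placeholder"}  # stand-in for os.environ
resolved = {key: interpolate(value, fake_env) for key, value in provider.items()}
print(resolved["api_key"])
```

Raising on an unset variable is a deliberate choice in this sketch: an empty key would otherwise surface later as an opaque authentication failure from the provider.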
Three frontier-or-near-frontier APIs, configured one at a time. Each is selected by editing [provider] in config.toml and running openclaw provider switch <name>. The session state is preserved across switches, which is what makes the matrix experiments in the next tab tractable.
This is the shipped default. Verify it works end-to-end before moving on. The smoke test prompt is the same one used in every provider verification — keeping prompts identical across providers is the whole point of the lab.
```toml
[provider]
name = "openai"
model = "gpt-5.4-mini"   # or "gpt-5.4" for the larger variant
api_key = "$OPENAI_API_KEY"
base_url = "https://api.openai.com/v1"
```
```bash
openclaw provider switch openai
# → Provider: openai · model: gpt-5.4-mini · ready

# Smoke test: no tools used, just a model round-trip
openclaw ask "In one sentence: what is the Agency Paradox?"
# → [gpt-5.4-mini] The Agency Paradox is the gap between what an autonomous agent is technically capable of doing and what it has been authorized to do.
```
Gemini exposes an OpenAI-compatible endpoint at generativelanguage.googleapis.com/v1beta/openai/. OpenClaw treats it as a drop-in OpenAI provider with a different base URL and key — no Gemini-specific adapter required. Same code path, different brain.
```toml
[provider]
name = "gemini"
model = "gemini-3.1-flash"   # or gemini-3.1-pro
api_key = "$GEMINI_API_KEY"
base_url = "https://generativelanguage.googleapis.com/v1beta/openai"
```
```bash
openclaw provider switch gemini
openclaw ask "In one sentence: what is the Agency Paradox?"

# Direct API smoke test (bypassing OpenClaw, useful for debugging)
# The jq filter is quoted so zsh does not try to glob the [0]
curl -s https://generativelanguage.googleapis.com/v1beta/openai/chat/completions \
  -H "Authorization: Bearer $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-3.1-flash","messages":[{"role":"user","content":"hi"}]}' \
  | jq '.choices[0].message.content'
```
Gemini 3.1 ships with a 1M-token context window as standard across both Flash and Pro tiers. For agentic loops that read large repos before acting, this is the default to beat. Note the cost asymmetry: cached inputs are heavily discounted, which favours long-running sessions over one-shot calls.
DeepSeek V4 dropped three days before this lab was written. Two preview variants ship: V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active MoE), both with 1M token context and Hybrid Attention Architecture (CSA + HCA). The DeepSeek release notes specifically call out OpenClaw as a supported agent integration — that is the reason this lab exists this week and not next month.
```toml
[provider]
name = "deepseek"
model = "deepseek-v4-flash"   # or deepseek-v4-pro
api_key = "$DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com/v1"

[provider.options]
reasoning_effort = "medium"   # low · medium · high · max
```
```bash
openclaw provider switch deepseek
openclaw ask "In one sentence: what is the Agency Paradox?"

# Direct API check: DeepSeek is OpenAI-compatible
curl -s https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'
```
| Variant | Total / Active | Context | Input $/M | Output $/M | Use case |
|---|---|---|---|---|---|
| V4-Flash | 284B / 13B | 1M | $0.14 | $0.28 | Default agent loop · cheap iterations · long context retrieval |
| V4-Pro | 1.6T / 49B | 1M | $1.74 | $3.48 | Recommended for hard reasoning · multi-step refactors · still ~10× cheaper than GPT-5.4 / Opus 4.6 |
For the first time, an open-weight model is priced low enough that running it via API is cost-equivalent to running a smaller local model on your own electricity bill. The interesting comparison is no longer "API quality vs free local quality" — it is "API cost vs local sovereignty". DeepSeek V4-Flash is the pivot point that makes that comparison sharp. Experiment 03 in the next tab is built around it.
V4 is a preview release. Pricing has been announced as preview pricing and DeepSeek has signalled it may shift downward as Huawei Ascend production scales. Re-pull pricing into the cost model in Experiment 06 before publishing any benchmark numbers — and date-stamp the report.
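The preview prices in the table above translate to per-task cost with one line of arithmetic. The sketch below uses those listed prices; the token counts (20,000 in, 2,000 out) are a made-up illustration, not a measurement, and `task_cost_usd` is a hypothetical helper.

```python
# USD per million tokens (input, output), copied from the preview price table.
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (1.74, 3.48),
}


def task_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one task given its total input and output token counts."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price


# Hypothetical small agentic task: 20k tokens read, 2k tokens written.
flash = task_cost_usd("deepseek-v4-flash", 20_000, 2_000)
pro = task_cost_usd("deepseek-v4-pro", 20_000, 2_000)
print(f"flash ${flash:.5f} · pro ${pro:.5f} · ratio {pro / flash:.1f}x")
```

Because both the input and output prices differ by the same factor between the two variants, the Pro-to-Flash cost ratio (about 12.4×) does not depend on the token mix. When the price sheet changes, only `PRICES` needs updating.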
Same agent, no network. Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, which means OpenClaw treats it identically to a frontier API — just with a different base URL and no key. That symmetry is not an accident; it is the entire reason provider-agnostic matters. The only practical difference your wallet will notice is that this provider costs zero per token.
```toml
[provider]
name = "ollama"
model = "gemma4:26b"
api_key = "ollama"   # literal placeholder, ignored
base_url = "http://localhost:11434/v1"

[provider.options]
num_ctx = 32768      # match what Ollama is configured for
temperature = 0.2    # keep agentic loops boring
```
```bash
openclaw provider switch ollama
openclaw ask "In one sentence: what is the Agency Paradox?"

# Confirm Ollama actually loaded the model
ollama ps
# → NAME          SIZE     PROCESSOR    UNTIL
#   gemma4:26b    20 GB    100% GPU     4 minutes from now
```
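The symmetry across providers can be made concrete: with an OpenAI-compatible endpoint, the request is built the same way for every backend and only the base URL, key, and model string change. The sketch below builds the request without sending it; `build_chat_request` is a hypothetical helper, not OpenClaw's client, and the base URLs are the four configured in this lab.

```python
import json
import urllib.request

# Base URLs as configured in this lab's [provider] blocks.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
    "deepseek": "https://api.deepseek.com/v1",
    "ollama": "http://localhost:11434/v1",
}


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same function, local backend: the key is the ignored literal placeholder.
req = build_chat_request(PROVIDERS["ollama"], "ollama", "gemma4:26b", "hi")
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it; that step is left out so the example runs without a network or a live Ollama instance.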
The obvious question after seeing the V4-Flash spec sheet: can I run it locally? On 64GB unified memory the answer is no, not at any quality that justifies the workflow. V4-Flash is 284B parameters total. Even at the FP4-mixed precision DeepSeek ships, the weights alone are roughly 140–160GB. A 2-bit community quantisation would still need about 71GB for weights alone (284B × 2 bits ÷ 8), so it would not fit either, and at that compression the model is no longer the model.
The hardware that would actually run V4-Flash locally is a Mac Studio with 192GB or more — Tier III in the Forge architecture, slated for an M5 Ultra refresh. Until then, V4 lives on the API side of the matrix. That is fine. Acknowledging the gap is the point — it makes the cost-vs-sovereignty trade-off in Experiment 04 concrete rather than aspirational.
What does fit in 64GB: Gemma 4 26B MoE at Q4–Q8 (~16–26GB), Gemma 4 31B Dense at Q4 (~17GB), Qwen2.5-Coder 32B at Q4_K_M (~19GB), Llama 4 70B at Q3 (tight at ~30GB). For the agentic workloads in this lab, Gemma 4 26B MoE is the recommended local default: it has the function-calling discipline the loop requires and leaves enough headroom for OpenClaw's own working state.
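The memory verdict for V4-Flash follows from simple arithmetic: parameters × bits per weight ÷ 8 gives bytes for the weights alone. The sketch below is a lower bound that ignores KV cache, activations, and quantisation metadata; `weight_gb` and `fits` are hypothetical helpers.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the bare weights fit; real headroom needs to be larger."""
    return weight_gb(params_billion, bits_per_weight) <= memory_gb


for bits in (4, 2):
    gb = weight_gb(284, bits)  # V4-Flash total parameter count
    print(f"V4-Flash at {bits}-bit: {gb:.0f} GB, fits in 64 GB: {fits(284, bits, 64)}")
```

All 284B parameters count here even though only 13B are active per token, because a mixture-of-experts model still has to keep every expert resident in memory.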
All four providers now resolve. The lab discipline is to treat the active provider as a deliberate, logged choice — not a sticky default that drifts. Three commands cover ninety percent of the daily workflow.
```bash
# List every configured provider and which one is active
openclaw provider list
# →   openai     gpt-5.4-mini        configured
#   * gemini     gemini-3.1-flash    configured (active)
#     deepseek   deepseek-v4-flash   configured
#     ollama     gemma4:26b          configured

# Switch: daemon stays up, only the upstream client rebinds
openclaw provider switch deepseek
# → Provider: deepseek · model: deepseek-v4-flash · ready

# One-shot override for a single command (does not persist)
openclaw ask --provider openai "What does this codebase do?"

# Inspect the current provider's exact config
openclaw provider show
# → name:     deepseek
#   model:    deepseek-v4-flash
#   base_url: https://api.deepseek.com/v1
#   options:  reasoning_effort=medium
#   session:  none active
```
The Field Report template at the back of this document records provider, model, and exact config for every experiment session. This is not bureaucracy — it is the only way matrix benchmarks remain comparable a month later when GPT-5.4-mini has been silently updated and Gemma's Ollama tag has been re-quantised. Numbers without provenance are noise.
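One way to make provenance mechanical is to stamp every session with a small, serialisable record. This is a sketch only: the field names mirror the `openclaw provider show` output, while `Provenance` and `stamp` are hypothetical and not part of the Field Report template.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass(frozen=True)
class Provenance:
    """Exact provider, model, config, and date for one experiment session."""
    provider: str
    model: str
    base_url: str
    options: dict = field(default_factory=dict)
    run_date: str = field(default_factory=lambda: date.today().isoformat())

    def stamp(self) -> str:
        # Sorted keys keep the stamp stable, so two runs can be diffed.
        return json.dumps(asdict(self), sort_keys=True)


record = Provenance(
    provider="deepseek",
    model="deepseek-v4-flash",
    base_url="https://api.deepseek.com/v1",
    options={"reasoning_effort": "medium"},
    run_date="2026-04-27",
)
print(record.stamp())
```

Writing the stamp next to each session JSON means a matrix cell can always be traced back to the configuration that produced it.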
OpenClaw exposes seven tools to whichever model is active. Each is a function the model can call between turns; the daemon executes it on the model's behalf and returns the result. Understanding what each tool does, what it costs, and what it risks is the prerequisite for everything in the next tab — and the prerequisite for ClawLaw in Lab 004.
| Tool | What it does | Cost | Risk surface |
|---|---|---|---|
| read | Read a file from the working directory | Tokens for file content + a small model turn | Information leak — sensitive files surface into the model context |
| write | Write or overwrite a file | Tokens for content + a model turn | Irreversible if no version control · can clobber human edits |
| bash | Execute a shell command in the working directory | Output capture + a model turn | Highest — full process authority unless sandboxed · network egress · destructive commands |
| glob | Find files by pattern | Path list + a model turn | Low · enumerates filesystem layout (light info disclosure) |
| grep | Search file contents by regex | Match snippets + a model turn | Low–medium · can surface secrets if pointed at config dirs |
| web_search | Query a search engine, return links and snippets | Tokens for results + provider search fee (varies) | Egress · provider sees query text · injected content into the model |
| message | Speak to the human operator (no side effect) | One model turn | None directly · surfaces intent, useful for traceability |
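The call, execute, return cycle behind these tools can be sketched in a few lines. This is not OpenClaw's implementation: the model is a scripted stand-in, the two tools operate on an in-memory dict rather than the filesystem, and every name (`FILES`, `scripted_model`, `run_task`) is hypothetical. Only `max_iterations` echoes the config key shown earlier.

```python
from typing import Callable, Dict, List

# In-memory stand-in for the working directory (no real file I/O).
FILES: Dict[str, str] = {"util.py": "def add(a, b): return a + b\n"}


def tool_read(path: str) -> str:
    return FILES[path]


def tool_write(path: str, content: str) -> str:
    FILES[path] = content
    return f"wrote {len(content)} bytes to {path}"


TOOLS: Dict[str, Callable[..., str]] = {"read": tool_read, "write": tool_write}


def scripted_model(history: List[dict]) -> dict:
    """Stand-in for a real model: chooses the next action from the turn count."""
    tool_turns = sum(1 for message in history if message["role"] == "tool")
    if tool_turns == 0:
        return {"tool": "read", "args": {"path": "util.py"}}
    if tool_turns == 1:
        test = "from util import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
        return {"tool": "write", "args": {"path": "test_util.py", "content": test}}
    return {"done": "test_util.py written"}


def run_task(prompt: str, model, max_iterations: int = 20):
    """Loop: ask the model, execute the requested tool, feed the result back."""
    history = [{"role": "user", "content": prompt}]
    trace = []
    for _ in range(max_iterations):
        action = model(history)
        if "done" in action:
            return action["done"], trace
        result = TOOLS[action["tool"]](**action["args"])
        trace.append(action["tool"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("max_iterations reached before the task finished")


answer, trace = run_task("Read util.py, then write test_util.py.", scripted_model)
print(trace)
```

The iteration cap is the only thing standing between a confused model and an unbounded loop, which is why it sits in config rather than in the prompt.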
Six of seven tools can be approximated through other tools. The bash tool is the one that cannot. Anything OpenClaw is permitted to do via bash, it can do — including rm -rf, including outbound HTTP, including pulling and running other binaries. The default config enables it because the lab is observational, not production. Lab 004 in this series installs ClawLaw, whose ShellCommandApprovalLaw turns bash into an approval-gated tool. Until then: run OpenClaw in a directory you can git restore from, and never against your home directory.
```bash
# Single-shot question, no agentic loop, no tool calls
openclaw ask "explain this error: ..."

# Full agentic task: model can call tools until done or max_iterations
openclaw task "refactor utils.py to use pathlib instead of os.path"

# Resume the last interrupted session
openclaw resume

# Show the live tool-call trace for the active session
openclaw trace --follow

# Inspect a completed session: every tool call, every model turn
openclaw session show --last

# Daemon control
openclaw daemon status
openclaw daemon stop
openclaw daemon restart   # after config edits
```
Final installation step: one task, run sequentially against all four providers, in a throwaway directory. The point is not to evaluate the providers (that comes in the experiments) but to confirm every cell of the matrix is wired and that the same prompt produces a coherent tool-call trace on each backend.
```bash
# Build a deliberately tiny scratch repo
mkdir -p ~/forge/lab/smoke && cd ~/forge/lab/smoke
git init -q
echo "# Smoke" > README.md
echo "def add(a, b): return a + b" > util.py
git add -A && git commit -q -m "baseline"

# Standardised task, small enough to complete in 1–3 tool calls per provider
PROMPT="Read util.py, then write a new file test_util.py with one pytest case for add()."

for p in openai gemini deepseek ollama; do
  echo "--- provider: $p ---"
  # Clean slate each pass: reset tracked files and remove the untracked
  # test_util.py left behind by the previous provider
  git reset --hard -q && git clean -fdq
  openclaw provider switch "$p" > /dev/null
  openclaw task "$PROMPT" --quiet
  openclaw session show --last --summary   # tool-call count, tokens, ms
done
```
- test_util.py in the working directory
- git status after each pass

Move to the Experiments tab. Lab 003 is installed. Lab 004 (ClawLaw) is the next tab in this series; until it is published, the operator is the governance layer: read every trace before you trust the outcome.
Each experiment runs the same agent against multiple providers and produces a publishable Field Report. The whole tab is structured so the matrix from Experiment 02 becomes the comparison frame for everything afterward.
Run the smoke-test task from §7 again, but this time treat it as a real experiment: record tool-call count, total wall-clock time, total tokens in/out, and total cost for each of the four providers. Produce one row of the matrix per provider. This is Forge Lab Log Entry 003.01.
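Reuse ~/forge/lab/smoke from §7. Reset to baseline before each provider pass with git reset --hard followed by git clean -fd; the reset alone leaves the previous pass's untracked test_util.py in place. Identical starting state is the whole methodology.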
Use one of the three prompts below verbatim. Do not improve the prompt between runs. The prompt itself is part of the experimental constant.
Run the prompt against each provider in turn using the script below. openclaw session show --last --json emits the full trace; pipe it to a file per provider for later analysis.
"Read util.py, then write test_util.py with one pytest case for add()."
"Audit this directory for any Python file using os.path. Replace with pathlib equivalents in place. Do not modify anything else."
"Look up the latest stable Python release. Update any version pin in pyproject.toml or requirements.txt to use that version. Note your source."
```bash
#!/usr/bin/env bash
# Forge Lab: Experiment 01 · First Contact
set -euo pipefail

cd ~/forge/lab/smoke
PROMPT="Read util.py, then write test_util.py with one pytest case for add()."
OUT=~/forge/lab/exp01-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"

for p in openai gemini deepseek ollama; do
  echo "=== $p ===" | tee -a "$OUT/log.txt"
  # Reset tracked files and remove untracked output from the previous pass
  git reset --hard -q && git clean -fdq
  openclaw provider switch "$p" > /dev/null
  START=$(date +%s)
  openclaw task "$PROMPT" --quiet
  END=$(date +%s)
  openclaw session show --last --json > "$OUT/$p.json"
  echo "  wall_seconds: $((END - START))" >> "$OUT/log.txt"
done

# Quick matrix from the JSON traces.
# The parentheses matter: in jq, "," binds tighter than "|".
jq -r '[.provider, (.tool_calls|length), .tokens.total, .cost_usd, .duration_ms] | @tsv' \
  "$OUT"/*.json | column -t
```
| Provider | Model | Tool calls | Tokens in/out | Wall time | Cost USD |
|---|---|---|---|---|---|
| openai | gpt-5.4-mini | — | — | — | — |
| gemini | gemini-3.1-flash | — | — | — | — |
| deepseek | deepseek-v4-flash | — | — | — | — |
| ollama | gemma4:26b | — | — | — | $0.00 |
- test_util.py exists and runs pytest -q cleanly for each provider
- $OUT/<provider>.json for later replay

Five tasks of increasing complexity, run across all four providers, recorded in one matrix. The deliverable is a 5×4 grid of cells with cost, tool-call count, and a quality score. This becomes the reference table the rest of the lab cites, and the table the community does not yet have for OpenClaw + DeepSeek V4 as of this week.
The practitioner's value is publishing a matrix others can reproduce. Every cell needs exact provider + model + date — drift in any of those invalidates comparison.
| Task | Description | Tools expected |
|---|---|---|
| T1 · trivial | Read a file, write one pytest case based on it | read, write |
| T2 · refactor | Replace os.path usage with pathlib across a small package | glob, read, write |
| T3 · fix-the-bug | A failing test is provided; identify and fix the underlying bug | read, bash (run pytest), write |
| T4 · web-grounded | Look up a fact, update a config file accordingly, cite the source | web_search, read, write, message |
| T5 · multi-step | Read README, inspect code, generate a CHANGELOG entry for the last commit | read, bash (git log), grep, write |
Each task gets two scores, judged independently on a 0–3 scale. Correctness: did the output do what was asked? Reasoning independence: did the model figure out the task on its own, or did it require operator nudging mid-loop? Recording these separately is what makes provider comparison interesting — a model can be highly correct only when heavily guided, and that should not score the same as a model that just gets it right.
| Task | openai | gemini | deepseek | ollama |
|---|---|---|---|---|
| T1 trivial | — | — | — | — |
| T2 refactor | — | — | — | — |
| T3 fix-the-bug | — | — | — | — |
| T4 web-grounded | — | — | — | — |
| T5 multi-step | — | — | — | — |
An OpenClaw + DeepSeek V4 + Apple Silicon benchmark has not been published anywhere as of April 27, 2026 — the V4 release is three days old. Publishing this matrix today, with exact dates and reproducible scripts, makes it a primary citation source. Include the prompts, the seed scratch repo, and the per-provider session JSONs alongside the report. Drop the date stamp prominently.
DeepSeek V4 ships in two sizes with very different price points. The hypothesis to test: V4-Flash is cost-equivalent to local inference but quality-equivalent to a frontier API on agentic-coding tasks. The control is GPT-5.4 (frontier reference); the contrast is V4-Pro (the high-effort variant, where DeepSeek's max-reasoning mode lives). Three providers, the same five tasks from Experiment 02, with cost-per-completed-task as the headline metric.
Duplicate the deepseek block in config.toml with name deepseek-pro and model deepseek-v4-pro. Keep V4-Flash as the default. Switching becomes a one-liner.
V4 supports four effort modes: low, medium, high, max. Run V4-Flash on medium (its sweet spot) and V4-Pro on high (its sweet spot). Note: max mode requires >=384K context allocation per the DeepSeek release notes — only worth it for the hardest task in the suite.
OpenAI GPT-5.4 (control), DeepSeek V4-Flash, DeepSeek V4-Pro. Set model = "gpt-5.4" in the openai block first, since the shipped default is gpt-5.4-mini. Same prompts, same scratch repo, three passes each. Median of three is the reportable number.
```bash
#!/usr/bin/env bash
# Forge Lab: Experiment 03 · DeepSeek V4 Head-to-Head
set -euo pipefail

cd ~/forge/lab/scratch
declare -a PROVIDERS=("openai" "deepseek" "deepseek-pro")
declare -a TASKS=(T1 T2 T3 T4 T5)
OUT=~/forge/lab/exp03-$(date +%Y%m%d)
mkdir -p "$OUT"

for p in "${PROVIDERS[@]}"; do
  openclaw provider switch "$p" > /dev/null
  for t in "${TASKS[@]}"; do
    for run in 1 2 3; do   # 3 runs per cell, take median
      # Reset tracked files and drop untracked output from the previous run;
      # -e keeps prompts/ in case it is not committed
      git reset --hard -q && git clean -fdq -e prompts
      openclaw task "$(cat "prompts/$t.txt")" --quiet
      openclaw session show --last --json > "$OUT/${p}_${t}_run${run}.json"
    done
  done
done

# Build the matrix: median wall_ms and median cost_usd per (provider, task)
python3 ~/forge/lab/aggregate_matrix.py "$OUT" > "$OUT/matrix.tsv"
```
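The script calls aggregate_matrix.py, which is not shown in this lab. The following is one possible sketch of it, assuming each session JSON carries the `duration_ms` and `cost_usd` fields used elsewhere in this document and that files follow the `<provider>_<task>_run<N>.json` naming from the loop. The function names are hypothetical.

```python
import json
import re
import statistics
from collections import defaultdict
from pathlib import Path

# Matches the file names written by the loop: <provider>_<task>_run<N>.json
NAME = re.compile(r"^(?P<provider>.+)_(?P<task>T\d+)_run\d+\.json$")


def aggregate(out_dir):
    """Return (provider, task, runs, median_ms, median_cost_usd) per cell."""
    cells = defaultdict(list)
    for path in sorted(Path(out_dir).glob("*_run*.json")):
        match = NAME.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        cells[(match["provider"], match["task"])].append(json.loads(path.read_text()))
    rows = []
    for (provider, task), runs in sorted(cells.items()):
        rows.append((
            provider,
            task,
            len(runs),
            statistics.median(r["duration_ms"] for r in runs),
            statistics.median(r["cost_usd"] for r in runs),
        ))
    return rows


def main(out_dir):
    """Print the matrix as TSV, one row per (provider, task) cell."""
    print("provider\ttask\truns\tmedian_ms\tmedian_cost_usd")
    for row in aggregate(out_dir):
        print("\t".join(str(value) for value in row))
```

To use it as the script expects, call `main(sys.argv[1])` at the bottom of the file (with `import sys`). The run count is included in each row so a cell with fewer than three runs is visible rather than silently averaged.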
V4 is preview-tier. DeepSeek's own release notes say it trails frontier models by 3–6 months on knowledge benchmarks. The interesting finding is rarely "V4 wins" — it is "V4 wins on these task types, loses on those, and the cost ratio means it wins the practitioner's allocation decision anyway." Report the loss cases, including the loss to free local inference where it occurs.
The pivot question of the lab: at what task complexity does local inference (Gemma 4 26B on Ollama) stop being the right answer compared to a cheap API (DeepSeek V4-Flash)? The hypothesis is not that one wins outright. The hypothesis is that there is a clean cutoff in task complexity above which the API saves you more time than the local model saves you in dollars.
Two providers, five tasks, three runs each. Use the data already collected for V4-Flash from Experiment 03; only Ollama needs fresh runs.
Wall-clock seconds (latency), tool calls (loop efficiency), correctness score (did it work). Cost is always $0.00 for Ollama and ~$0.001 for V4-Flash on T1–T3.
Plot wall-clock seconds vs task. The crossover task, where Ollama starts taking visibly longer per loop, is the practitioner's actual decision point.

```python
"""Forge Lab: Experiment 04 · Local vs API crossover analysis.

Reads session JSONs from the Experiment 03 and 04 run directories and finds
the task at which the time saved by the API exceeds the dollars spent on it.
"""
import glob
import json
import os
import statistics
from pathlib import Path

DEV_HOURLY_USD = 90                       # lab operator's effective hourly rate
SECONDS_PER_USD = 3600 / DEV_HOURLY_USD   # operator seconds that one dollar buys


def load_runs(provider, task):
    # glob does not expand "~", so expand it before matching.
    # Assumption: the fresh Ollama runs are saved under exp04-*/ with the
    # same <provider>_<task>_run<N>.json naming as Experiment 03.
    pattern = os.path.expanduser(
        f"~/forge/lab/exp0[34]-*/{provider}_{task}_run*.json")
    return [json.loads(Path(f).read_text()) for f in glob.glob(pattern)]


for task in ["T1", "T2", "T3", "T4", "T5"]:
    local = load_runs("ollama", task)
    api = load_runs("deepseek", task)
    if not local or not api:
        print(f"{task}: missing runs, skipped")
        continue
    local_s = statistics.median(r["duration_ms"] / 1000 for r in local)
    api_s = statistics.median(r["duration_ms"] / 1000 for r in api)
    api_usd = statistics.median(r["cost_usd"] for r in api)
    time_saved = local_s - api_s
    breakeven = api_usd * SECONDS_PER_USD   # seconds the API has to save to pay for itself
    verdict = "API wins" if time_saved > breakeven else "local wins"
    print(f"{task}: local {local_s:.1f}s · api {api_s:.1f}s · ${api_usd:.4f} · {verdict}")
```
Most lab operators find the crossover sits between T2 and T3. Below that, the local model is fast enough that paying for the API is operator-time-negative. Above that, the API loop completes so much faster that even at frontier prices it pays back. V4-Flash compresses that crossover region — when the API is <1 cent per task, the threshold for switching shifts down by one task tier. That is the practitioner-relevant story.
Move out of toy tasks. Pick a real, ~500-line Python or Swift project from your own work — small enough that one OpenClaw session can hold it in mind, big enough that no single tool call fixes everything. Run a substantial refactor against it on each provider. The goal is not just correctness — it is to count how many individually approved actions collectively constitute scope creep. This is the Agency Paradox composition problem, made measurable.
Do this in a fresh git clone of the project under ~/forge/lab/scratch. Reset between providers. Do not run against the canonical working copy of any project you cannot afford to lose. The point of measuring scope creep is that it happens — that is the finding.
After each provider's session, classify every tool call into one of three buckets: in-scope (directly serves the stated refactor), adjacent (related cleanup the model decided to do), out-of-scope (formatting, renames, dependency edits, comment additions the operator did not request). The ratio of these buckets is the composition footprint.
| Provider | Total calls | In-scope | Adjacent | Out-of-scope | Tests pass? |
|---|---|---|---|---|---|
| openai | — | — | — | — | — |
| gemini | — | — | — | — | — |
| deepseek-flash | — | — | — | — | — |
| deepseek-pro | — | — | — | — | — |
| ollama | — | — | — | — | — |
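Once every tool call in a session has been labelled by hand, the composition footprint is a ratio per bucket. A minimal sketch follows; the bucket names come from the classification above, while `footprint` and the sample labels are hypothetical.

```python
from collections import Counter

BUCKETS = ("in-scope", "adjacent", "out-of-scope")


def footprint(labels):
    """Share of tool calls in each bucket; labels are assigned by the operator."""
    unknown = set(labels) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unknown bucket(s): {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {bucket: (counts[bucket] / total if total else 0.0) for bucket in BUCKETS}


# Made-up session: 10 tool calls, labelled after reading the trace.
sample = ["in-scope"] * 6 + ["adjacent"] * 3 + ["out-of-scope"] * 1
print(footprint(sample))
```

Rejecting unknown labels keeps typos from quietly shrinking the out-of-scope count, which is the number this experiment exists to measure.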
This is the bridge to Lab 004. The Agency Paradox is not a thought experiment — it is a number you can put on the page. Every cell in the out-of-scope column is an instance where the model did something individually defensible that, in aggregate, is not what was asked. ClawLaw's composition tracing is the architectural answer; this experiment is the empirical baseline that proves the answer is needed. Save the session JSONs — they become Lab 004's input fixtures.
Synthesize Experiments 01–05 into a single allocation table: for each task class, which provider is the right default? The output is a one-page decision aid you keep open while working — not a benchmark report, a working tool. The matrix is reproducible, dated, and tied to the price sheets in effect on the day of the run.
~/forge/lab/pricing/

| Task class | Default provider | Reason | Fallback |
|---|---|---|---|
| trivial | ollama (gemma4:26b) | Free · low latency on M4 Pro · API spend not justified | deepseek-flash if Ollama is offline |
| refactor | deepseek-flash | ~10× cheaper than GPT-5.4 · matches quality on T2 in pilot runs | openai gpt-5.4-mini for tightly typed languages |
| fix-the-bug | — record from data | — record from data | — |
| web-grounded | — record from data | — record from data | — |
| multi-step | — record from data | — record from data | — |
This decision table is one tab. The other two appear when ClawLaw is installed in Lab 004: Langfuse (the trace tab — every tool call as Intelligence), and Grafana (the metrics tab — system load as Surveillance). Three tabs, ISR-mapped, three views of the same agentic session. That is the screenshot worth publishing — but it is impossible without first having the cost model in place. Lab 003 builds the cost model; Lab 004 wires it to governance.
Tags for this document: OpenClaw · DeepSeek V4 · Gemma 4 · Apple Silicon · Agency Paradox · Provider Matrix · 2026-04
Every experiment produces one of these. ISR reconnaissance format adapted for lab work — what was observed, where, when, against what baseline, with what confidence, and what it implies for the AgentVector argument.
File one per experiment session. Date it. These accumulate into the lab log that becomes the Silicon Forge Field Guide. The provider lock-in (exact model string, exact provider config, exact date) is what makes a report citable a month from now.