macsweeney.tech · AI Lab OpenClaw series · 2026-04-27
L-003 · OpenClaw

OpenClaw Across the Provider Matrix

Install the OpenClaw agentic CLI, wire it to four backends (OpenAI, Gemini, DeepSeek, and a local Ollama model), and run six experiments comparing tool-use behaviour across providers.

Status: Active
Difficulty: Advanced
Duration: 4–6 hours
Hardware: Mac Mini M4 Pro · 64GB unified memory
Prerequisites: Lab 002 complete, API keys for the providers you intend to test
Tags: agents · openclaw · providers · tool-use · governance
macsweeney.tech · AI Lab Rev 1.0 · April 2026
Lab 003 · Forge Lab Series

OpenClaw Across the Provider Matrix: Installation, Configuration, and First Tasks

Lab 002 installed a model. This lab installs the agent — the thing that uses models to do work — and points it at four different brains in turn. Frontier API on one side, sovereign local inference on the other, the same agent in the middle.

Hardware: M4 Pro · 64GB
OS: macOS Sequoia
Install time: ~30 min
Cross-ref: L-026 · L-028
§0

Why an agentic CLI matters

A model on its own is a brain in a jar. It can answer when prompted, but it cannot read a file, run a command, fetch a URL, or modify a project. An agentic CLI is what closes that gap. It exposes a small, fixed set of tools — read, write, shell, search — and lets the model call them in a loop until the task is done. OpenClaw is one such CLI. It is open source, runs locally, and is provider-agnostic: the model behind the agent is a configuration choice, not a fork in the road.

That last property is what makes this lab worth running. The same agent, the same tools, the same task — pointed at OpenAI one minute and a 26B parameter model on the Mac next to it the minute after. The differences that fall out of that comparison are the practitioner's actual signal: where frontier models earn their price, where local inference is already enough, and where DeepSeek V4 — released three days ago — slots in between the two.

Why this is Lab 003

Lab 002 brought up Gemma 4 on Apple Silicon. That gave us an inference endpoint. Lab 003 brings up the consumer of that endpoint and a few API alternatives, so the question shifts from "can this model talk?" to "can this agent do work, and at what cost on which backend?" Lab 004 (planned) installs ClawLaw on top of this stack — the governance layer that turns the agentic loop from convenient into auditable.

  • OpenAI (Frontier API): GPT-5.4 family · function calling mature · highest baseline cost.
  • Google Gemini (Frontier API): Gemini 3.1 Pro / Flash · 1M context standard · OpenAI-compatible endpoint.
  • DeepSeek V4 (Open-weight API): Pro 1.6T / Flash 284B MoE · 1M context · <1/10th frontier cost.
  • Ollama · Gemma 4 (Local · Sovereign): 26B MoE on the M4 Pro · zero per-token cost · slower, fully offline.
§1

Prerequisites

If Lab 002 is complete, most of this is already done. The new work in this lab is Node.js and three sets of API credentials. Run each phase in order and confirm the version output matches before moving on.

Phase 1 Node.js 20+ via Homebrew ~2 min

OpenClaw is distributed as a Node package. Use Homebrew rather than a system installer so updates flow through brew upgrade with the rest of the lab toolchain.

Terminal
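# Install Node. Homebrew's node formula tracks the current release, so the major
# version may be higher than 20; anything at or above v20 is sufficient for OpenClaw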
brew install node

node --version
v20.x.x

npm --version
10.x.x
Phase 2 Provider API keys ~10 min

Three keys to provision. None of them incurs ongoing cost: OpenClaw will not spend tokens until you invoke it. Still, treat each key as a production credential: generate it per machine, never commit it, and rotate it when this lab ends.

  • OpenAI · platform.openai.com → API keys → forge-openclaw-2026-04. Scope: project key, restrict to chat completions and responses endpoints.
  • Google AI Studio · aistudio.google.com → Get API key → enable Gemini API. The free tier is generous for benchmarking; the paid tier removes rate-limit drama.
  • DeepSeek · platform.deepseek.com → API keys. Add $5 in credit; that lasts a long time at V4-Flash pricing.
  • Hugging Face · already provisioned in Lab 002 if you followed the token convention forge-lab-[tool]-[YYYY-MM]. Used here only for downloading mlx-community quantisations.
Key hygiene

Store keys in ~/.config/forge/secrets.env with chmod 600. Do not put them in .zshrc, do not put them in any file under a Git directory, and do not put them in shell history. The lab scripts expect them sourced from this single file. Worker-04's Vault instance is the long-term destination; for this lab, the permission-restricted file is sufficient. Note that chmod 600 limits access to your user account but does not encrypt the contents.

Terminal — secrets.env scaffolding
mkdir -p ~/.config/forge
touch ~/.config/forge/secrets.env
chmod 600 ~/.config/forge/secrets.env

# Contents of ~/.config/forge/secrets.env: add these lines in an editor, not at the
# prompt (typing them would leave the keys in shell history). Real values, no quotes.
OPENAI_API_KEY=sk-proj-...
GEMINI_API_KEY=AIza...
DEEPSEEK_API_KEY=sk-...
HF_TOKEN=hf_...

# Source it in the current shell (or add to ~/.zprofile guarded by a check)
set -a; source ~/.config/forge/secrets.env; set +a
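A script that consumes the keys can enforce the hygiene rule itself before reading anything. The sketch below is illustrative and not part of OpenClaw: load_secrets is a hypothetical helper that refuses any file readable by group or others, then parses plain KEY=VALUE lines in the format shown above.

```python
import stat
from pathlib import Path


def load_secrets(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines, refusing files readable by group or others."""
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600")
    secrets = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        secrets[key.strip()] = value.strip()
    return secrets
```

Calling load_secrets(Path.home() / ".config/forge/secrets.env") returns a dictionary of key names to values, or raises if the permissions have drifted from 600.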
Phase 3 Ollama running with at least one model ~1 min verify

Installing Ollama is required only if Lab 002 was skipped; the verification below applies either way. The local backend needs an OpenAI-compatible endpoint at http://localhost:11434/v1. Confirm before continuing.

Terminal — verify Ollama
# Health check
curl -s http://localhost:11434
Ollama is running

# Confirm at least one model is present — gemma4:26b is the lab default
ollama list | grep gemma4
gemma4:26b    16 GB    8 days ago
§2

Install OpenClaw

OpenClaw is a Node package that ships a CLI binary and a long-running daemon. The CLI is what you type into; the daemon is what holds session state and proxies tool calls. They communicate over a localhost HTTP port — by default 18789. Same port discipline as the lab-series-darwin-openclaw doc; nothing exotic.

Phase 1 Install the CLI globally ~2 min
Terminal
npm install -g openclaw

openclaw --version
openclaw 0.x.x

# Check that the binary is on PATH and resolves to the brew-managed Node prefix
which openclaw
/opt/homebrew/bin/openclaw
Why global, not per-project

The agent is a tool, not a dependency. Treat it like git or kubectl — one binary on PATH, not vendored into every repo. The state and config live in ~/.openclaw, isolated from any individual project.

Phase 2 Initialise config and daemon ~3 min

The first run of openclaw init scaffolds the config directory; the daemon is started separately with openclaw daemon start. The default config points at OpenAI; we will rewire it in §3.

Terminal
openclaw init
Created ~/.openclaw/config.toml
Created ~/.openclaw/sessions/
Created ~/.openclaw/logs/
Daemon not running — start with: openclaw daemon start

openclaw daemon start
Daemon listening on http://127.0.0.1:18789
Workspace: $PWD
Provider:  openai (default)
Status:    READY

# Health check the daemon (this is the same probe ClawLaw will use later)
curl -s http://127.0.0.1:18789/health
{"status":"ready","provider":"openai","tools":7,"session":null}
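The same probe is easy to script for use in later automation. A minimal sketch, assuming the /health response keeps the JSON shape shown above; parse_health and probe are illustrative names, not OpenClaw APIs.

```python
import json
import urllib.request


def parse_health(body: str, expected_tools: int = 7) -> bool:
    """Return True when the daemon reports ready with the expected tool count."""
    payload = json.loads(body)
    return payload.get("status") == "ready" and payload.get("tools") == expected_tools


def probe(url: str = "http://127.0.0.1:18789/health", timeout: float = 2.0) -> bool:
    """Fetch the health endpoint; any connection or parse error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False
```

Separating parsing from fetching keeps the readiness rule testable without a running daemon.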
Phase 3 Tour the config file ~3 min

Open ~/.openclaw/config.toml and read it before you change it. Three sections matter: [provider], [agent], and [tools]. Everything in §3 and §4 is editing this file.

~/.openclaw/config.toml — the shipped default
# OpenClaw configuration — see openclaw docs for full reference

[provider]
name     = "openai"
model    = "gpt-5.4-mini"
api_key  = "$OPENAI_API_KEY"     # env-var interpolation
base_url = "https://api.openai.com/v1"

[agent]
max_iterations  = 20
timeout_seconds = 120
working_dir     = "$PWD"

[tools]
enabled = ["read", "write", "bash", "glob", "grep", "web_search", "message"]

[daemon]
port    = 18789
log_dir = "~/.openclaw/logs"
§3

Wire the API providers

Three frontier-or-near-frontier APIs, configured one at a time. Each is selected by editing [provider] in config.toml and running openclaw provider switch <name>. The session state is preserved across switches, which is what makes the matrix experiments in the next tab tractable.

Provider 1 OpenAI · GPT-5.4 family ~3 min

This is the shipped default. Verify it works end-to-end before moving on. The smoke test prompt is the same one used in every provider verification — keeping prompts identical across providers is the whole point of the lab.

~/.openclaw/config.toml — [provider] section
[provider]
name     = "openai"
model    = "gpt-5.4-mini"      # or "gpt-5.4" for the larger variant
api_key  = "$OPENAI_API_KEY"
base_url = "https://api.openai.com/v1"
Terminal — verify the provider
openclaw provider switch openai
Provider: openai · model: gpt-5.4-mini · ready

# Smoke test — no tools used, just a model round-trip
openclaw ask "In one sentence: what is the Agency Paradox?"
[gpt-5.4-mini] The Agency Paradox is the gap between what an autonomous
agent is technically capable of doing and what it has been authorized to do.
Provider 2 Google Gemini · OpenAI-compat endpoint ~5 min

Gemini exposes an OpenAI-compatible endpoint at generativelanguage.googleapis.com/v1beta/openai/. OpenClaw treats it as a drop-in OpenAI provider with a different base URL and key — no Gemini-specific adapter required. Same code path, different brain.

~/.openclaw/config.toml — Gemini variant
[provider]
name     = "gemini"
model    = "gemini-3.1-flash"     # or gemini-3.1-pro
api_key  = "$GEMINI_API_KEY"
base_url = "https://generativelanguage.googleapis.com/v1beta/openai"
Terminal — verify Gemini
openclaw provider switch gemini
openclaw ask "In one sentence: what is the Agency Paradox?"

# Direct API smoke test (bypassing OpenClaw, useful for debugging)
curl -s https://generativelanguage.googleapis.com/v1beta/openai/chat/completions \
  -H "Authorization: Bearer $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-3.1-flash","messages":[{"role":"user","content":"hi"}]}' \
  | jq -r '.choices[0].message.content'   # quoted so zsh does not treat [0] as a glob
1M context as default

Gemini 3.1 ships with a 1M-token context window as standard across both the Flash and Pro tiers. For agentic loops that read large repos before acting, this is the default to beat. Note the cost asymmetry: cached inputs are heavily discounted, which favours long-running sessions over one-shot calls.

Provider 3 DeepSeek V4 · open-weight, frontier-class Released Apr 24, 2026 ~5 min

DeepSeek V4 dropped three days before this lab was written. Two preview variants ship: V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active MoE), both with 1M token context and Hybrid Attention Architecture (CSA + HCA). The DeepSeek release notes specifically call out OpenClaw as a supported agent integration — that is the reason this lab exists this week and not next month.

~/.openclaw/config.toml — DeepSeek variant
[provider]
name     = "deepseek"
model    = "deepseek-v4-flash"     # or deepseek-v4-pro
api_key  = "$DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com/v1"

[provider.options]
reasoning_effort = "medium"     # low · medium · high · max
Terminal — verify DeepSeek V4
openclaw provider switch deepseek
openclaw ask "In one sentence: what is the Agency Paradox?"

# Direct API check — DeepSeek is OpenAI-compatible
curl -s https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'
| Variant | Total / Active params | Context | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Use case |
|---|---|---|---|---|---|
| V4-Flash | 284B / 13B | 1M | $0.14 | $0.28 | Default agent loop · cheap iterations · long context retrieval |
| V4-Pro | 1.6T / 49B | 1M | $1.74 | $3.48 | Recommended · Hard reasoning · multi-step refactors · still ~10× cheaper than GPT-5.4 / Opus 4.6 |
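The per-task cost implied by these rates is simple arithmetic: tokens in and out, each divided by one million and multiplied by the listed price. A minimal sketch using the preview prices above; task_cost_usd is an illustrative helper, and it ignores any cache discounts or reasoning-token surcharges the provider may apply.

```python
# USD per million tokens, copied from the preview pricing table above.
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}


def task_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one task: tokens are billed per million at the listed rates."""
    price = PRICES[model]
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000
```

For a hypothetical task that reads 100,000 tokens and writes 10,000, this gives about $0.0168 on V4-Flash and about $0.2088 on V4-Pro. Because both Pro rates are the same multiple of the Flash rates, Pro costs roughly 12.4 times as much as Flash for any token mix.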
Why this matters for the matrix

For the first time, an open-weight model is priced low enough that running it via API is cost-equivalent to running a smaller local model on your own electricity bill. The interesting comparison is no longer "API quality vs free local quality" — it is "API cost vs local sovereignty". DeepSeek V4-Flash is the pivot point that makes that comparison sharp. Experiment 03 in the next tab is built around it.

Preview pricing, preview behaviour

V4 is a preview release. Pricing has been announced as preview pricing and DeepSeek has signalled it may shift downward as Huawei Ascend production scales. Re-pull pricing into the cost model in Experiment 06 before publishing any benchmark numbers — and date-stamp the report.

§4

Wire the local backend

Same agent, no network. Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, which means OpenClaw treats it identically to a frontier API — just with a different base URL and no key. That symmetry is not an accident; it is the entire reason provider-agnostic matters. The only practical difference your wallet will notice is that this provider costs zero per token.

Provider 4 Ollama · Gemma 4 26B MoE ~3 min
~/.openclaw/config.toml — Ollama variant
[provider]
name     = "ollama"
model    = "gemma4:26b"
api_key  = "ollama"                   # literal placeholder, ignored
base_url = "http://localhost:11434/v1"

[provider.options]
num_ctx     = 32768                    # match what Ollama is configured for
temperature = 0.2                      # keep agentic loops boring
Terminal — verify the local backend
openclaw provider switch ollama
openclaw ask "In one sentence: what is the Agency Paradox?"

# Confirm Ollama actually loaded the model — check ps
ollama ps
NAME           SIZE     PROCESSOR    UNTIL
gemma4:26b     20 GB    100% GPU     4 minutes from now
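The symmetry this section relies on can be shown outside OpenClaw: one request builder serves every OpenAI-compatible backend, and only the base URL and key change. A sketch using the base URLs configured in this lab; build_chat_request is an illustrative name, and the function builds the request without sending it.

```python
import json
import urllib.request

# Base URLs as configured in this lab's config.toml variants.
BASE_URLS = {
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
    "deepseek": "https://api.deepseek.com/v1",
    "ollama": "http://localhost:11434/v1",
}


def build_chat_request(provider: str, model: str, prompt: str,
                       api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST for the given provider."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{BASE_URLS[provider]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it; for Ollama the key is the same ignored placeholder used in the config above.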
Note DeepSeek V4 locally — the honest answer is no Reference

The obvious question after seeing the V4-Flash spec sheet: can I run it locally? On 64GB unified memory the answer is no, not at any quality that justifies the workflow. V4-Flash is 284B parameters total. Even at the FP4-mixed precision DeepSeek ships, the weights alone are roughly 140–160GB. A 2-bit community quantisation would still need about 71GB for weights (284B parameters × 2 bits ÷ 8), more than the machine has, and at that compression the model is no longer the model.

The hardware that would actually run V4-Flash locally is a Mac Studio with 192GB or more — Tier III in the Forge architecture, slated for an M5 Ultra refresh. Until then, V4 lives on the API side of the matrix. That is fine. Acknowledging the gap is the point — it makes the cost-vs-sovereignty trade-off in Experiment 04 concrete rather than aspirational.

What actually fits on 64GB at usable quality

Gemma 4 26B MoE at Q4–Q8 (~16–26GB), Gemma 4 31B Dense at Q4 (~17GB), Qwen2.5-Coder 32B at Q4_K_M (~19GB), Llama 4 70B at Q3 (tight at ~30GB). For the agentic workloads in this lab, Gemma 4 26B MoE is the recommended local default — it has the function-calling discipline the loop requires and leaves enough headroom for OpenClaw's own working state.
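The sizing arithmetic behind this note is parameters times bits per weight, divided by eight bits per byte. A rough sketch: weight_gb and fits are illustrative helpers, the 8GB headroom default is an assumption rather than a measured figure, and real quantisation formats store scales alongside the weights, so effective bits per weight run somewhat above the nominal number.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float = 64, headroom_gb: float = 8) -> bool:
    """True when the weights leave headroom_gb free (the headroom is an assumption)."""
    return weight_gb(params_billion, bits_per_weight) <= memory_gb - headroom_gb
```

weight_gb(284, 4) gives 142, which matches the low end of the 140–160GB range quoted for V4-Flash, while weight_gb(26, 4) gives 13, comfortably inside 64GB before KV cache and runtime overhead are added.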

§5

Switching providers cleanly

All four providers now resolve. The lab discipline is to treat the active provider as a deliberate, logged choice, not a sticky default that drifts. Four commands cover ninety percent of the daily workflow.

Terminal — provider command reference
# List every configured provider and which one is active
openclaw provider list
  openai      gpt-5.4-mini          configured
* gemini      gemini-3.1-flash      configured (active)
  deepseek    deepseek-v4-flash     configured
  ollama      gemma4:26b            configured

# Switch — daemon stays up, only the upstream client rebinds
openclaw provider switch deepseek
Provider: deepseek · model: deepseek-v4-flash · ready

# One-shot override for a single command (does not persist)
openclaw ask --provider openai "What does this codebase do?"

# Inspect the current provider's exact config
openclaw provider show
name:     deepseek
model:    deepseek-v4-flash
base_url: https://api.deepseek.com/v1
options:  reasoning_effort=medium
session:  none active
Lab discipline — every provider switch is a log entry

The Field Report template at the back of this document records provider, model, and exact config for every experiment session. This is not bureaucracy — it is the only way matrix benchmarks remain comparable a month later when GPT-5.4-mini has been silently updated and Gemma's Ollama tag has been re-quantised. Numbers without provenance are noise.

§6

The tool taxonomy

OpenClaw exposes seven tools to whichever model is active. Each is a function the model can call between turns; the daemon executes it on the model's behalf and returns the result. Understanding what each tool does, what it costs, and what it risks is the prerequisite for everything in the next tab — and the prerequisite for ClawLaw in Lab 004.

| Tool | What it does | Cost | Risk surface |
|---|---|---|---|
| read | Read a file from the working directory | Tokens for file content + a small model turn | Information leak: sensitive files surface into the model context |
| write | Write or overwrite a file | Tokens for content + a model turn | Irreversible if no version control · can clobber human edits |
| bash | Execute a shell command in the working directory | Output capture + a model turn | Highest: full process authority unless sandboxed · network egress · destructive commands |
| glob | Find files by pattern | Path list + a model turn | Low · enumerates filesystem layout (light info disclosure) |
| grep | Search file contents by regex | Match snippets + a model turn | Low–medium · can surface secrets if pointed at config dirs |
| web_search | Query a search engine, return links and snippets | Tokens for results + provider search fee (varies) | Egress · provider sees query text · injected content into the model |
| message | Speak to the human operator (no side effect) | One model turn | None directly · surfaces intent, useful for traceability |
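The call, execute, return cycle described at the top of this section can be sketched in a few lines. This is not OpenClaw's implementation: the tools below operate on an in-memory dictionary instead of the filesystem, and scripted_model is a hypothetical stand-in that replays a fixed plan where a real provider would decide each step.

```python
from typing import Callable

# Stand-in tools operating on an in-memory "filesystem"; purely illustrative.
FILES = {"util.py": "def add(a, b): return a + b\n"}

TOOLS: dict[str, Callable[..., str]] = {
    "read": lambda path: FILES[path],
    "write": lambda path, content: FILES.__setitem__(path, content) or f"wrote {path}",
    "message": lambda text: text,
}


def run_loop(model: Callable[[list], dict], task: str, max_iterations: int = 20) -> list:
    """Ask the model for the next action, execute it, and feed the result back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        action = model(history)
        if action["tool"] == "done":
            break
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "name": action["tool"], "content": result})
    return history


def scripted_model(history: list) -> dict:
    """Hypothetical model: picks the next step from a fixed plan by history length."""
    plan = [
        {"tool": "read", "args": {"path": "util.py"}},
        {"tool": "write", "args": {
            "path": "test_util.py",
            "content": "from util import add\n\ndef test_add():\n    assert add(2, 3) == 5\n",
        }},
        {"tool": "done", "args": {}},
    ]
    return plan[len(history) - 1]
```

The max_iterations cap plays the same role as the [agent] setting in config.toml: it bounds the loop when the model never declares the task done.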
The bash tool is the boundary

Six of seven tools can be approximated through other tools. The bash tool is the one that cannot. Anything OpenClaw is permitted to do via bash, it can do — including rm -rf, including outbound HTTP, including pulling and running other binaries. The default config enables it because the lab is observational, not production. Lab 004 in this series installs ClawLaw, whose ShellCommandApprovalLaw turns bash into an approval-gated tool. Until then: run OpenClaw in a directory you can git restore from, and never against your home directory.

Terminal — essential daily commands
# Single-shot question, no agentic loop, no tool calls
openclaw ask "explain this error: ..."

# Full agentic task — model can call tools until done or max_iterations
openclaw task "refactor utils.py to use pathlib instead of os.path"

# Resume the last interrupted session
openclaw resume

# Show the live tool-call trace for the active session
openclaw trace --follow

# Inspect a completed session — every tool call, every model turn
openclaw session show --last

# Daemon control
openclaw daemon status
openclaw daemon stop
openclaw daemon restart                    # after config edits
§7

Smoke test the whole stack

Final installation step: one task, run sequentially against all four providers, in a throwaway directory. The point is not to evaluate the providers (that comes in the experiments) but to confirm that every cell of the matrix is wired and that the same prompt produces a coherent tool-call trace on each backend.

Terminal — install smoke test
# Build a deliberately tiny scratch repo
mkdir -p ~/forge/lab/smoke && cd ~/forge/lab/smoke
git init -q
echo "# Smoke" > README.md
echo "def add(a, b): return a + b" > util.py
git add -A && git commit -q -m "baseline"

# Standardised task — small enough to complete in 1–3 tool calls per provider
PROMPT="Read util.py, then write a new file test_util.py with one pytest case for add()."

for p in openai gemini deepseek ollama; do
  echo "--- provider: $p ---"
  git reset --hard -q && git clean -fdq   # clean slate each pass: reset alone keeps untracked files from the last pass
  openclaw provider switch $p > /dev/null
  openclaw task "$PROMPT" --quiet
  openclaw session show --last --summary     # tool-call count, tokens, ms
done
If the smoke test passes

Move to the Experiments tab. Lab 003 is installed. Lab 004 (ClawLaw) is the next lab in this series; until it is published, the operator is the governance layer, so read every trace before you trust the outcome.

macsweeney.tech · AI Lab Six experiments · ~14 hrs total
Experiments · Forge Lab Series · Lab 003

The provider matrix: six experiments, four backends, one agent

Each experiment runs the same agent against multiple providers and produces a publishable Field Report. The whole tab is structured so the matrix from Experiment 02 becomes the comparison frame for everything afterward.

Experiment 01 · Tier I
First Contact — One Task, Four Brains
Entry Level ~45 min

Run the smoke-test task from §7 again, but this time treat it as a real experiment: record tool-call count, total wall-clock time, total tokens in/out, and total cost for each of the four providers. Produce one row of the matrix per provider. This is Forge Lab Log Entry 003.01.

Setup
1. Use the same scratch repo

Reuse ~/forge/lab/smoke from §7. Reset to baseline before each provider pass with git reset --hard followed by git clean -fd; reset alone leaves the previous pass's untracked test_util.py in place. Identical starting state is the whole methodology.

2. Pick the prompt and freeze it

Use one of the three prompts below verbatim. Do not improve the prompt between runs. The prompt itself is part of the experimental constant.

3. Run, log, compare

Run the prompt against each provider in turn using the script below. openclaw session show --last --json emits the full trace; redirect it to a file per provider for later analysis.

Standard prompts (pick one and stick with it)

Prompt A · trivial

"Read util.py, then write test_util.py with one pytest case for add()."

Prompt B · light reasoning

"Audit this directory for any Python file using os.path. Replace with pathlib equivalents in place. Do not modify anything else."

Prompt C · web-grounded

"Look up the latest stable Python release. Update any version pin in pyproject.toml or requirements.txt to use that version. Note your source."

~/forge/lab/exp01_first_contact.sh
#!/usr/bin/env bash
# Forge Lab — Experiment 01 · First Contact
set -euo pipefail
cd ~/forge/lab/smoke

PROMPT="Read util.py, then write test_util.py with one pytest case for add()."
OUT=~/forge/lab/exp01-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"

for p in openai gemini deepseek ollama; do
  echo "=== $p ===" | tee -a "$OUT/log.txt"
  git reset --hard -q && git clean -fdq   # reset tracked files and remove files left by the previous pass
  openclaw provider switch $p > /dev/null

  START=$(date +%s)
  openclaw task "$PROMPT" --quiet
  END=$(date +%s)

  openclaw session show --last --json > "$OUT/$p.json"
  echo "  wall_seconds: $((END - START))" >> "$OUT/log.txt"
done

# Quick matrix from the JSON traces
jq -r '[.provider, (.tool_calls|length), .tokens.total, .cost_usd, .duration_ms] | @tsv' \
  "$OUT"/*.json | column -t
What to record
Experiment 01 · First Contact Matrix (fill from $OUT/<provider>.json)
| Provider | Model | Tool calls | Tokens in/out | Wall time | Cost USD |
|---|---|---|---|---|---|
| openai | gpt-5.4-mini | | | | |
| gemini | gemini-3.1-flash | | | | |
| deepseek | deepseek-v4-flash | | | | |
| ollama | gemma4:26b | | | | $0.00 |
  • Output file test_util.py exists and runs pytest -q cleanly for each provider
  • Tool-call counts within ~1.5× of each other across providers (large drift = a prompt the model misreads)
  • Cost on Ollama is exactly zero (this is the canary — non-zero means a misrouted call)
  • Each session JSON saved to $OUT/<provider>.json for later replay
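The checks in this list can be automated against the saved traces. A sketch, assuming each session JSON carries the provider, tool_calls, and cost_usd fields that the jq summary in the script reads; load_rows and check_rows are hypothetical helper names.

```python
import json
from pathlib import Path


def load_rows(out_dir: Path) -> list[dict]:
    """One summary row per <provider>.json trace in the output directory."""
    rows = []
    for path in sorted(out_dir.glob("*.json")):
        trace = json.loads(path.read_text())
        rows.append({
            "provider": trace["provider"],
            "tool_calls": len(trace["tool_calls"]),
            "cost_usd": trace["cost_usd"],
        })
    return rows


def check_rows(rows: list[dict], max_ratio: float = 1.5) -> list[str]:
    """Return a list of human-readable problems; empty means the checks pass."""
    problems = []
    counts = [r["tool_calls"] for r in rows]
    if counts and max(counts) > max_ratio * min(counts):
        problems.append(f"tool-call drift: min {min(counts)}, max {max(counts)}")
    for r in rows:
        if r["provider"] == "ollama" and r["cost_usd"] != 0:
            problems.append(f"ollama cost is {r['cost_usd']}, expected 0")
    return problems
```

Running pytest -q against each provider's test_util.py remains a separate manual step; this sketch covers only the drift and zero-cost checks.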
Experiment 02 · Tier I
Provider Matrix — Five Standard Tasks Across Four Backends
Entry Level ~3 hrs

Five tasks of increasing complexity, run across all four providers, recorded in one matrix. The deliverable is a 5×4 grid of cells with cost, tool-call count, and a quality score. This becomes the reference table the rest of the lab cites — and the table the community does not yet have for OpenClaw + DeepSeek V4 as of this week.

The practitioner's value is publishing a matrix others can reproduce. Every cell needs exact provider + model + date — drift in any of those invalidates comparison.
The five standard tasks
| Task | Description | Tools expected |
|---|---|---|
| T1 · trivial | Read a file, write one pytest case based on it | read, write |
| T2 · refactor | Replace os.path usage with pathlib across a small package | glob, read, write |
| T3 · fix-the-bug | A failing test is provided; identify and fix the underlying bug | read, bash (run pytest), write |
| T4 · web-grounded | Look up a fact, update a config file accordingly, cite the source | web_search, read, write, message |
| T5 · multi-step | Read README, inspect code, generate a CHANGELOG entry for the last commit | read, bash (git log), grep, write |
Quality scoring (the two-score method)

Each task gets two scores, judged independently on a 0–3 scale. Correctness: did the output do what was asked? Reasoning independence: did the model figure out the task on its own, or did it require operator nudging mid-loop? Recording these separately is what makes provider comparison interesting — a model can be highly correct only when heavily guided, and that should not score the same as a model that just gets it right.
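Keeping the two scores as separate fields, rather than a single blended number, is straightforward to enforce in whatever records the matrix. A small sketch; QualityScore is an illustrative type, not part of OpenClaw or the Field Report template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityScore:
    correctness: int    # 0-3: did the output do what was asked?
    independence: int   # 0-3: how little operator nudging was needed?

    def __post_init__(self):
        for name in ("correctness", "independence"):
            value = getattr(self, name)
            if not 0 <= value <= 3:
                raise ValueError(f"{name} must be between 0 and 3, got {value}")

    def cell(self) -> str:
        """Render in the matrix's 'correctness · independence' form."""
        return f"{self.correctness} · {self.independence}"
```

A heavily guided but correct run, QualityScore(3, 1), renders as "3 · 1" and stays distinguishable from an unprompted success at "3 · 3".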

Experiment 02 · Provider × Task Matrix (quality = correctness · independence, each 0–3)
| Task | openai | gemini | deepseek | ollama |
|---|---|---|---|---|
| T1 trivial | | | | |
| T2 refactor | | | | |
| T3 fix-the-bug | | | | |
| T4 web-grounded | | | | |
| T5 multi-step | | | | |
Publication angle

An OpenClaw + DeepSeek V4 + Apple Silicon benchmark has not been published anywhere as of April 27, 2026; the V4 release is three days old. Publishing this matrix today, with exact dates and reproducible scripts, makes it a primary citation source. Include the prompts, the seed scratch repo, and the per-provider session JSONs alongside the report. Display the date stamp prominently.

Experiment 03 · Tier II DeepSeek V4 focus
DeepSeek V4 Head-to-Head — Flash vs Pro vs Frontier Reference
Intermediate ~2 hrs

DeepSeek V4 ships in two sizes with very different price points. The hypothesis to test: V4-Flash is cost-equivalent to local inference but quality-equivalent to a frontier API on agentic-coding tasks. The control is GPT-5.4 (frontier reference); the contrast is V4-Pro (the high-effort variant, where DeepSeek's max-reasoning mode lives). Three providers, the same five tasks from Experiment 02, with cost-per-completed-task as the headline metric.

Setup
1. Add the V4-Pro provider variant

Duplicate the deepseek block in config.toml with name deepseek-pro and model deepseek-v4-pro. Keep V4-Flash as the default. Switching becomes a one-liner.

2. Set the reasoning effort dial

V4 supports four effort modes: low, medium, high, max. Run V4-Flash on medium (its sweet spot) and V4-Pro on high (its sweet spot). Note: max mode requires >=384K context allocation per the DeepSeek release notes, so it is only worth it for the hardest task in the suite.

3. Run all five tasks against three providers

OpenAI GPT-5.4 (control), DeepSeek V4-Flash, DeepSeek V4-Pro. The shipped openai block points at gpt-5.4-mini, so set model = "gpt-5.4" there for this experiment. Same prompts, same scratch repo, three passes each. Median of three is the reportable number.

~/forge/lab/exp03_deepseek_headtohead.sh
#!/usr/bin/env bash
# Forge Lab — Experiment 03 · DeepSeek V4 Head-to-Head
set -euo pipefail
cd ~/forge/lab/scratch

declare -a PROVIDERS=("openai" "deepseek" "deepseek-pro")
declare -a TASKS=(T1 T2 T3 T4 T5)

OUT=~/forge/lab/exp03-$(date +%Y%m%d)
mkdir -p "$OUT"

for p in "${PROVIDERS[@]}"; do
  openclaw provider switch "$p" > /dev/null
  for t in "${TASKS[@]}"; do
    for run in 1 2 3; do                          # 3 runs per cell, take median
      git reset --hard -q && git clean -fdq -e prompts   # clean slate; -e keeps an untracked prompts/ directory
      openclaw task "$(cat prompts/$t.txt)" --quiet
      openclaw session show --last --json > "$OUT/${p}_${t}_run${run}.json"
    done
  done
done

# Build the matrix — median wall_ms and median cost_usd per (provider, task)
python3 ~/forge/lab/aggregate_matrix.py "$OUT" > "$OUT/matrix.tsv"
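The script calls aggregate_matrix.py, which this lab does not list. One possible version is sketched below, assuming the <provider>_<task>_run<N>.json naming from the loop above and the duration_ms and cost_usd fields used elsewhere in this lab; treat it as a starting point rather than the canonical aggregator.

```python
import json
import statistics
import sys
from collections import defaultdict
from pathlib import Path


def aggregate(out_dir: Path) -> list[tuple]:
    """Median duration_ms and cost_usd for every (provider, task) pair found."""
    cells = defaultdict(list)
    for path in out_dir.glob("*_run*.json"):
        # Split from the right so provider names containing hyphens stay intact.
        provider, task, _run = path.stem.rsplit("_", 2)
        cells[(provider, task)].append(json.loads(path.read_text()))
    rows = []
    for (provider, task), runs in sorted(cells.items()):
        rows.append((
            provider,
            task,
            len(runs),
            statistics.median(r["duration_ms"] for r in runs),
            statistics.median(r["cost_usd"] for r in runs),
        ))
    return rows


def main(argv: list[str]) -> None:
    """Print a tab-separated matrix for the directory given as the first argument."""
    print("provider\ttask\truns\tmedian_ms\tmedian_cost_usd")
    for row in aggregate(Path(argv[1])):
        print("\t".join(str(v) for v in row))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

The runs column makes incomplete cells visible: anything other than 3 means a pass failed or was interrupted, and the median for that cell should not be reported.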
Hypotheses to test
  • H1: V4-Flash completes T1–T3 at quality scores within 0.5 of the GPT-5.4 control, at <15% of the cost
  • H2: V4-Pro completes T5 (multi-step) at a higher independence score than V4-Flash, justifying the price step
  • H3: On T4 (web-grounded) the gap closes, because citation discipline is more about prompting than model size
  • H4: V4 reasoning_effort=max produces measurably better T3 (fix-the-bug) results than effort=high, but at >3× latency
Be honest about preview status

V4 is preview-tier. DeepSeek's own release notes say it trails frontier models by 3–6 months on knowledge benchmarks. The interesting finding is rarely "V4 wins" — it is "V4 wins on these task types, loses on those, and the cost ratio means it wins the practitioner's allocation decision anyway." Report the loss cases, including the loss to free local inference where it occurs.

Experiment 04 · Tier II
Local vs API — Where Sovereignty Is Already Cheap Enough
Intermediate ~2 hrs

The pivot question of the lab: at what task complexity does local inference (Gemma 4 26B on Ollama) stop being the right answer compared to a cheap API (DeepSeek V4-Flash)? The hypothesis is not that one wins outright. The hypothesis is that there is a clean cutoff in task complexity above which the API saves you more time than the local model saves you in dollars.

Method
1. Re-run T1–T5 on local + V4-Flash only

Two providers, five tasks, three runs each. Use the data already collected for V4-Flash from Experiment 03; only Ollama needs fresh runs.

2. Record three numbers per cell

Wall-clock seconds (latency), tool calls (loop efficiency), correctness score (did it work). Cost is always $0.00 for Ollama and ~$0.001 for V4-Flash on T1–T3.

3. Find the crossover

Plot wall-clock seconds vs task. The crossover task, where Ollama starts taking visibly longer per loop, is the practitioner's actual decision point.

~/forge/lab/exp04_crossover.py
"""Forge Lab — Experiment 04 · Local vs API crossover analysis.
Reads session JSONs from experiments 02–03 and finds the task at which
the time-saved by API exceeds the dollars-spent on API.
"""
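import json
import statistics
from pathlib import Path

LAB_DIR = Path.home() / "forge" / "lab"
DEV_HOURLY_USD = 90                        # lab operator's effective hourly rate
SECONDS_PER_USD = 3600 / DEV_HOURLY_USD    # seconds of operator time that one dollar buys

def load_runs(provider, task):
    # glob.glob() does not expand "~", so build the path from Path.home().
    # Assumption: fresh Ollama runs are saved under an exp04-* directory with
    # the same <provider>_<task>_run<N>.json naming that Experiment 03 uses.
    files = sorted(LAB_DIR.glob(f"exp0[34]-*/{provider}_{task}_run*.json"))
    return [json.loads(f.read_text()) for f in files]

for task in ["T1", "T2", "T3", "T4", "T5"]:
    local = load_runs("ollama", task)
    api   = load_runs("deepseek", task)
    if not local or not api:               # median() raises on an empty sequence
        print(f"{task}: skipped (local runs: {len(local)}, api runs: {len(api)})")
        continue

    local_s  = statistics.median(r["duration_ms"] / 1000 for r in local)
    api_s    = statistics.median(r["duration_ms"] / 1000 for r in api)
    api_cost = statistics.median(r["cost_usd"] for r in api)

    time_saved = local_s - api_s
    breakeven  = api_cost * SECONDS_PER_USD   # seconds the API has to save to pay for itself
    verdict    = "API wins" if time_saved > breakeven else "local wins"

    print(f"{task}: local {local_s:.1f}s · api {api_s:.1f}s · ${api_cost:.4f} · {verdict}")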
The finding to look for

The expected crossover sits between T2 and T3. Below that, the local model is fast enough that paying for the API is operator-time-negative. Above that, the API loop completes so much faster that even at frontier prices it pays back. V4-Flash compresses that crossover region: when the API is <1 cent per task, the threshold for switching shifts down by one task tier. That is the practitioner-relevant story.

Experiment 05 · Tier III
A Real Refactor — The Composition Problem on Each Provider
Advanced ~4 hrs

Move out of toy tasks. Pick a real, ~500-line Python or Swift project from your own work — small enough that one OpenClaw session can hold it in mind, big enough that no single tool call fixes everything. Run a substantial refactor against it on each provider. The goal is not just correctness — it is to count how many individually approved actions collectively constitute scope creep. This is the Agency Paradox composition problem, made measurable.

Run in a sandboxed clone

Do this in a fresh git clone of the project under ~/forge/lab/scratch. Reset between providers. Do not run against the canonical working copy of any project you cannot afford to lose. The point of measuring scope creep is that it happens — that is the finding.

Refactor task — pick one of these
  • Replace all blocking I/O calls in a small async Python service with proper async equivalents
  • Extract a module's tightly coupled tests into property-based tests (Hypothesis or SwiftCheck)
  • Migrate a UIKit view controller to SwiftUI while preserving public API and state semantics
  • Replace bespoke logging with structured logging (structlog or os_log) end-to-end
Composition counter

After each provider's session, classify every tool call into one of three buckets: in-scope (directly serves the stated refactor), adjacent (related cleanup the model decided to do), out-of-scope (formatting, renames, dependency edits, comment additions the operator did not request). The ratio of these buckets is the composition footprint.
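
Once the calls are labelled, the footprint is a simple tally. The sketch below assumes each tool call has already been classified by hand into a record with a "bucket" key. That record layout is hypothetical and is not OpenClaw's session JSON schema; adapt the loading step to whatever the session files actually contain.

```python
from collections import Counter

BUCKETS = ("in-scope", "adjacent", "out-of-scope")


def composition_footprint(labelled_calls):
    """Count calls per bucket and return each bucket's share of the total."""
    counts = Counter(call["bucket"] for call in labelled_calls)
    unknown = set(counts) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unrecognised bucket labels: {sorted(unknown)}")
    total = sum(counts.values())
    return {
        "total": total,
        "counts": {b: counts[b] for b in BUCKETS},
        "shares": {b: (counts[b] / total if total else 0.0) for b in BUCKETS},
    }


# Illustrative labels only, not measured data.
example = [
    {"tool": "read", "bucket": "in-scope"},
    {"tool": "write", "bucket": "in-scope"},
    {"tool": "write", "bucket": "adjacent"},
    {"tool": "shell", "bucket": "out-of-scope"},
]
footprint = composition_footprint(example)
print(footprint["counts"])
```

Rejecting unknown labels up front keeps a typo such as "out of scope" from silently shrinking a bucket and flattering the provider.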

Experiment 05 · Composition Footprint Per Provider
Tool calls classified post-hoc from session JSON
Provider | Total calls | In-scope | Adjacent | Out-of-scope | Tests pass?
openai
gemini
deepseek-flash
deepseek-pro
ollama
What this experiment is really for

This is the bridge to Lab 004. The Agency Paradox is not a thought experiment — it is a number you can put on the page. Every cell in the out-of-scope column is an instance where the model did something individually defensible that, in aggregate, is not what was asked. ClawLaw's composition tracing is the architectural answer; this experiment is the empirical baseline that proves the answer is needed. Save the session JSONs — they become Lab 004's input fixtures.

Experiment 06 · Tier III
Cost-Per-Task Economics — Building the Allocation Model
Advanced · ~3 hrs

Synthesize Experiments 01–05 into a single allocation table: for each task class, which provider is the right default? The output is a one-page decision aid you keep open while working — not a benchmark report, a working tool. The matrix is reproducible, dated, and tied to the price sheets in effect on the day of the run.

Inputs
  • Median cost-per-task from Experiment 02 (5 tasks × 4 providers)
  • Composition footprint from Experiment 05 (5 providers, real refactor)
  • Provider price sheets, dated to the run day, archived as PDFs in ~/forge/lab/pricing/
  • Operator hourly rate (used to convert wall time to operator cost)
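
A minimal sketch of how these inputs might combine, assuming the operator's waiting time is simply added to API spend to give one effective cost per task. The function names and the numbers are placeholders for illustration, not lab results, and the sketch ranks on cost alone: correctness scores and the composition footprint still have to be applied as separate filters.

```python
def operator_cost_usd(wall_seconds: float, hourly_rate_usd: float) -> float:
    """Convert wall time into operator cost at the given hourly rate."""
    return wall_seconds / 3600 * hourly_rate_usd


def effective_cost_usd(api_cost_usd, wall_seconds, hourly_rate_usd):
    """API spend plus the cost of the operator's waiting time."""
    return api_cost_usd + operator_cost_usd(wall_seconds, hourly_rate_usd)


def default_provider(medians, hourly_rate_usd):
    """Pick the provider with the lowest effective cost for one task class.

    `medians` maps provider name -> (median API cost in USD, median wall seconds).
    """
    return min(
        medians,
        key=lambda p: effective_cost_usd(medians[p][0], medians[p][1], hourly_rate_usd),
    )


# Placeholder numbers for illustration only, not lab results.
t1 = {"ollama": (0.0, 20.0), "api": (0.01, 8.0)}
print(default_provider(t1, hourly_rate_usd=1.0))
print(default_provider(t1, hourly_rate_usd=100.0))
```

With these placeholder inputs the default flips from ollama to api as the hourly rate rises, which is why the operator rate belongs in the input list rather than being treated as a constant.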
Decision table to produce
Task class | Default provider | Reason | Fallback
trivial | ollama (gemma4:26b) | Free · low latency on M4 Pro · API spend not justified | deepseek-flash if Ollama is offline
refactor | deepseek-flash | ~10× cheaper than GPT-5.4 · matches quality on T2 in pilot runs | openai gpt-5.4-mini for tightly typed languages
fix-the-bug — record from data — record from data
web-grounded — record from data — record from data
multi-step refactor — record from data — record from data
The three-tab view (when Lab 004 lands)

This decision table is one tab. The other two appear when ClawLaw is installed in Lab 004: Langfuse (the trace tab — every tool call as Intelligence), and Grafana (the metrics tab — system load as Surveillance). Three tabs, ISR-mapped, three views of the same agentic session. That is the screenshot worth publishing — but it is impossible without first having the cost model in place. Lab 003 builds the cost model; Lab 004 wires it to governance.

Tags for this document: OpenClaw DeepSeek V4 Gemma 4 Apple Silicon Agency Paradox Provider Matrix 2026-04

macsweeney.tech · AI Lab ISR Format · 19D Reconnaissance
Field Report · Forge Lab Series · Lab 003

Field Report —
Provider matrix log entry

Every experiment produces one of these. ISR reconnaissance format adapted for lab work — what was observed, where, when, against what baseline, with what confidence, and what it implies for the AgentVector argument.


Field Report — Blank template

File one per experiment session. Date it. These accumulate into the lab log that becomes the Silicon Forge Field Guide. The provider lock-in (exact model string, exact provider config, exact date) is what makes a report citable a month from now.

Field Report · Lab 003 · Forge Lab Series · Date: ____________ · Duration: ____________
WHAT Which experiment (01–06). Which task or tasks. Which providers and exact model strings. Reasoning_effort settings if applicable.
WHERE M4 Pro Mac Mini 64GB · macOS Sequoia [version] · OpenClaw [version] · Ollama [version] · Provider API endpoints in use. Commit SHA of the scratch repo.
WHEN Date · session start/end · network state. Note any provider rate-limit events that affected timing.
BASELINE Which prior experiment is this comparing against. What is the control provider. What changed between this run and the baseline.
FINDINGS Raw numbers first: tool-call counts, tokens in/out, wall time, cost USD per provider per task. Quality scores second (correctness · independence). Surprises explicitly noted.
CONFIDENCE High / Medium / Low. How many runs per cell? Any rate-limit events? Any provider model silently updated mid-run? What would increase confidence?
IMPLICATIONS What does this mean for the provider allocation table. What does it confirm or challenge in the Agency Paradox composition argument. What goes upstream into agentincommand.ai.
COMMAND NOTE What to run next. What would falsify the finding. Specific next experiment with acceptance criteria.
PUBLISHED The Dispatch URL · macsweeney.tech essay link · agentincommand.ai reference if applicable
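
Filing one report per session is easier to keep up if the blank form is generated rather than copied. The sketch below emits the nine headings from the template above as plain text. The file-naming scheme and the function names are hypothetical; use whatever layout the lab log already follows.

```python
from datetime import date

FIELDS = (
    "WHAT", "WHERE", "WHEN", "BASELINE", "FINDINGS",
    "CONFIDENCE", "IMPLICATIONS", "COMMAND NOTE", "PUBLISHED",
)


def blank_field_report(experiment: str, on: date) -> str:
    """Return a dated, empty field report as plain text, one heading per field."""
    header = (
        f"Field Report · Lab 003 · Exp {experiment} · "
        f"Date: {on.isoformat()} · Duration: ____"
    )
    body = "\n\n".join(f"{name}\n" for name in FIELDS)
    return f"{header}\n\n{body}"


def report_filename(experiment: str, on: date) -> str:
    """Hypothetical naming scheme: date first so reports sort chronologically."""
    return f"{on.isoformat()}-lab003-exp{experiment}.txt"


print(report_filename("03", date(2026, 4, 27)))
```

Putting the ISO date first in the file name means a plain directory listing doubles as the chronological lab log.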

Field Report — Worked example (Experiment 03)

Field Report · Lab 003 · Exp 03 · DeepSeek V4 Head-to-Head · April 27, 2026 · ~2 hrs
WHAT DeepSeek V4-Flash and V4-Pro versus OpenAI GPT-5.4-mini on the five standard tasks (T1–T5). Three runs per cell, median reported. V4-Flash effort=medium · V4-Pro effort=high · GPT-5.4-mini default settings.
WHERE M4 Pro Mac Mini 64GB · macOS Sequoia 15.x · OpenClaw 0.x.x · scratch repo commit a1b2c3d. DeepSeek API api.deepseek.com/v1 · OpenAI API api.openai.com/v1.
WHEN April 27, 2026 · 09:00–11:30. Three days post-V4 preview release. No rate-limit events on any provider. Stable network, M4 Pro under no other load.
BASELINE Experiment 02 matrix (April 26) provided GPT-5.4-mini and Ollama numbers. This run adds the DeepSeek columns. Comparison frame is correctness · independence · cost-per-task.
FINDINGS Record after running. Headline numbers expected: V4-Flash matches GPT-5.4-mini correctness on T1–T2 at <15% cost · V4-Pro matches on T3–T4 at ~30% cost · V4-Pro effort=max on T5 produces the cleanest output of any provider but at 3–4× wall time. Anything outside these ranges is the headline.
CONFIDENCE Medium — three runs per cell, but V4 is preview (model behaviour may shift). Scoring is judged by single operator, not blind-rated. Recommend rerun in 30 days as V4 stabilises and re-score blind to find drift.
IMPLICATIONS If V4-Flash holds at this cost-quality point, the provider allocation table tilts heavily toward DeepSeek for routine agentic-coding work, leaving the frontier APIs for the hard cases and Ollama for offline-required cases. This is the first practitioner-grade signal that open-weight has caught up enough at the price tier to challenge default API choice.
COMMAND NOTE → Next: Experiment 04 (crossover). Acceptance: 3 fresh Ollama runs per task at the new gemma4:26b tag. Then Experiment 05 with the real refactor. Lab 004 (ClawLaw) blocked on Exp 05 session JSONs as fixture data.
PUBLISHED The Dispatch · macsweeney.tech/silicon/lab/dispatch/2026-04-27-openclaw-deepseek-matrix · cross-posted to agentincommand.ai/research/v4-allocation
Forge Lab Series · Lab 003 · OpenClaw across the provider matrix · macsweeney.tech · Cross-ref L-026, L-028 (Vol V) · Cross-ref Lab 002 (Gemma 4 install) · Forward-ref Lab 004 (ClawLaw)