macsweeney.tech · AI Lab Rev 1.0 · April 2026

Lab 003 · Forge Lab Series

OpenClaw Across the Provider Matrix —
Installation, Configuration, and First Tasks

Lab 002 installed a model. This lab installs the agent — the thing that uses models to do work — and points it at four different brains in turn. Frontier API on one side, sovereign local inference on the other, the same agent in the middle.

Hardware M4 Pro · 64GB

OS macOS Sequoia

Install time ~30 min

Cross-ref L-026 · L-028

§0

Why an agentic CLI matters

A model on its own is a brain in a jar. It can answer when prompted, but it cannot read a file, run a command, fetch a URL, or modify a project. An agentic CLI is what closes that gap. It exposes a small, fixed set of tools — read, write, shell, search — and lets the model call them in a loop until the task is done. OpenClaw is one such CLI. It is open source, runs locally, and is provider-agnostic: the model behind the agent is a configuration choice, not a fork in the road.

That last property is what makes this lab worth running. The same agent, the same tools, the same task — pointed at OpenAI one minute and a 26B parameter model on the Mac next to it the minute after. The differences that fall out of that comparison are the practitioner's actual signal: where frontier models earn their price, where local inference is already enough, and where DeepSeek V4 — released three days ago — slots in between the two.

Why this is Lab 003

Lab 002 brought up Gemma 4 on Apple Silicon. That gave us an inference endpoint. Lab 003 brings up the consumer of that endpoint and a few API alternatives, so the question shifts from "can this model talk?" to "can this agent do work, and at what cost on which backend?" Lab 004 (planned) installs ClawLaw on top of this stack — the governance layer that turns the agentic loop from convenient into auditable.

Frontier API

OpenAI

GPT-5.4 family · function calling mature · highest baseline cost.

Frontier API

Google Gemini

Gemini 3.1 Pro / Flash · 1M context standard · OpenAI-compatible endpoint.

Open-weight API

DeepSeek V4

Pro 1.6T / Flash 284B MoE · 1M context · <1/10th frontier cost.

Local · Sovereign

Ollama · Gemma 4

26B MoE on the M4 Pro · zero per-token cost · slower, fully offline.

§1

Prerequisites

If Lab 002 is complete, most of this is already done. The new work in this lab is Node.js and three sets of API credentials. Run each phase in order and confirm the version output matches before moving on.

Phase 1 Node.js 20+ via Homebrew ~2 min

OpenClaw is distributed as a Node package. Use Homebrew rather than a system installer so updates flow through brew upgrade with the rest of the lab toolchain.

Terminal

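# Install Node. Any release from 20 onward is sufficient for OpenClaw.
# The unversioned formula tracks the current release; a versioned formula such as node@22 stays on an LTS line.
brew install node

node --version
v20.x.x or newer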

npm --version
10.x.x

Phase 2 Provider API keys ~10 min

Three new keys to provision; the Hugging Face token carries over from Lab 002. None of these expose long-running cost, because OpenClaw will not spend tokens until you invoke it. But treat each key as a production credential: generate per-machine, never commit, and rotate when this lab ends.

OpenAI · platform.openai.com → API keys → forge-openclaw-2026-04. Scope: project key, restrict to chat completions and responses endpoints.
Google AI Studio · aistudio.google.com → Get API key → enable Gemini API. The free tier is generous for benchmarking; the paid tier removes rate-limit drama.
DeepSeek · platform.deepseek.com → API keys. Add $5 in credit; that lasts a long time at V4-Flash pricing.
Hugging Face · already provisioned in Lab 002 if you followed the token convention forge-lab-[tool]-[YYYY-MM]. Used here only for downloading mlx-community quantisations.

Key hygiene

Store keys in ~/.config/forge/secrets.env with chmod 600. Do not put them in .zshrc, do not put them in any file under a Git directory, and do not put them in shell history. The lab scripts expect them sourced from this single file. Worker-04's Vault instance is the long-term destination; for this lab, the permission-restricted file is sufficient. Note that chmod 600 limits who can read the file but does not encrypt it.

Terminal — secrets.env scaffolding

mkdir -p ~/.config/forge
touch ~/.config/forge/secrets.env
chmod 600 ~/.config/forge/secrets.env

# Edit ~/.config/forge/secrets.env — fill in real values, no quotes around them
OPENAI_API_KEY=sk-proj-...
GEMINI_API_KEY=AIza...
DEEPSEEK_API_KEY=sk-...
HF_TOKEN=hf_...

# Source it in the current shell (or add to ~/.zprofile guarded by a check)
set -a; source ~/.config/forge/secrets.env; set +a
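The chmod 600 rule is easy to verify mechanically. A minimal Python sketch, assuming the secrets file uses plain KEY=value lines as shown above; `check_secrets_file` is a hypothetical helper for this lab, not part of OpenClaw:

```python
import stat
from pathlib import Path

def check_secrets_file(path):
    """Return (mode_ok, key_names) for a KEY=value secrets file."""
    p = Path(path).expanduser()
    mode = stat.S_IMODE(p.stat().st_mode)
    mode_ok = mode == 0o600  # owner read/write only
    names = []
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        names.append(line.split("=", 1)[0])  # report names only, never values
    return mode_ok, names
```

Calling `check_secrets_file("~/.config/forge/secrets.env")` before a run confirms both the permissions and which keys are present, without echoing any secret to the terminal.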

Phase 3 Ollama running with at least one model ~1 min verify

Installing Ollama is required only if Lab 002 was skipped; the verification below applies either way. The local backend needs an OpenAI-compatible endpoint at http://localhost:11434/v1. Confirm before continuing.

Terminal — verify Ollama

# Health check
curl -s http://localhost:11434
Ollama is running

# Confirm at least one model is present — gemma4:26b is the lab default
ollama list | grep gemma4
gemma4:26b    16 GB    8 days ago

§2

Install OpenClaw

OpenClaw is a Node package that ships a CLI binary and a long-running daemon. The CLI is what you type into; the daemon is what holds session state and proxies tool calls. They communicate over a localhost HTTP port — by default 18789. Same port discipline as the lab-series-darwin-openclaw doc; nothing exotic.

Phase 1 Install the CLI globally ~2 min

Terminal

npm install -g openclaw

openclaw --version
openclaw 0.x.x

# Check that the binary is on PATH and resolves to the brew-managed Node prefix
which openclaw
/opt/homebrew/bin/openclaw

Why global, not per-project

The agent is a tool, not a dependency. Treat it like git or kubectl — one binary on PATH, not vendored into every repo. The state and config live in ~/.openclaw, isolated from any individual project.

Phase 2 Initialise config and daemon ~3 min

First run scaffolds the config directory; the daemon is then started as a separate step. The default config points at OpenAI; we will rewire it in §3.

Terminal

openclaw init
Created ~/.openclaw/config.toml
Created ~/.openclaw/sessions/
Created ~/.openclaw/logs/
Daemon not running — start with: openclaw daemon start

openclaw daemon start
Daemon listening on http://127.0.0.1:18789
Workspace: $PWD
Provider:  openai (default)
Status:    READY

# Health check the daemon (this is the same probe ClawLaw will use later)
curl -s http://127.0.0.1:18789/health
{"status":"ready","provider":"openai","tools":7,"session":null}
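For scripted runs it helps to turn the health probe into a yes/no answer. A small sketch, assuming the response shape is exactly the JSON shown above (`status` and `tools` fields); `daemon_ready` is a hypothetical helper name:

```python
import json

def daemon_ready(payload, expected_tools=7):
    """True if a /health response reports ready with the expected tool count."""
    data = json.loads(payload)
    return data.get("status") == "ready" and data.get("tools") == expected_tools

sample = '{"status":"ready","provider":"openai","tools":7,"session":null}'
print(daemon_ready(sample))
```

Feeding it the body returned by the curl probe gives a single boolean a wrapper script can gate on.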

Phase 3 Tour the config file ~3 min

Open ~/.openclaw/config.toml and read it before you change it. Three sections matter: [provider], [agent], and [tools]. Everything in §3 and §4 is editing this file.

~/.openclaw/config.toml — the shipped default

# OpenClaw configuration — see openclaw docs for full reference

[provider]
name     = "openai"
model    = "gpt-5.4-mini"
api_key  = "$OPENAI_API_KEY"     # env-var interpolation
base_url = "https://api.openai.com/v1"

[agent]
max_iterations  = 20
timeout_seconds = 120
working_dir     = "$PWD"

[tools]
enabled = ["read", "write", "bash", "glob", "grep", "web_search", "message"]

[daemon]
port    = 18789
log_dir = "~/.openclaw/logs"

§3

Wire the API providers

Three frontier-or-near-frontier APIs, configured one at a time. Each is selected by editing [provider] in config.toml and running openclaw provider switch <name>. The session state is preserved across switches, which is what makes the matrix experiments in the next tab tractable.

Provider 1 OpenAI · GPT-5.4 family ~3 min

This is the shipped default. Verify it works end-to-end before moving on. The smoke test prompt is the same one used in every provider verification — keeping prompts identical across providers is the whole point of the lab.

~/.openclaw/config.toml — [provider] section

[provider]
name     = "openai"
model    = "gpt-5.4-mini"      # or "gpt-5.4" for the larger variant
api_key  = "$OPENAI_API_KEY"
base_url = "https://api.openai.com/v1"

Terminal — verify the provider

openclaw provider switch openai
Provider: openai · model: gpt-5.4-mini · ready

# Smoke test — no tools used, just a model round-trip
openclaw ask "In one sentence: what is the Agency Paradox?"
[gpt-5.4-mini] The Agency Paradox is the gap between what an autonomous
agent is technically capable of doing and what it has been authorized to do.

Provider 2 Google Gemini · OpenAI-compat endpoint ~5 min

Gemini exposes an OpenAI-compatible endpoint at generativelanguage.googleapis.com/v1beta/openai/. OpenClaw treats it as a drop-in OpenAI provider with a different base URL and key — no Gemini-specific adapter required. Same code path, different brain.

~/.openclaw/config.toml — Gemini variant

[provider]
name     = "gemini"
model    = "gemini-3.1-flash"     # or gemini-3.1-pro
api_key  = "$GEMINI_API_KEY"
base_url = "https://generativelanguage.googleapis.com/v1beta/openai"

Terminal — verify Gemini

openclaw provider switch gemini
openclaw ask "In one sentence: what is the Agency Paradox?"

# Direct API smoke test (bypassing OpenClaw, useful for debugging)
curl -s https://generativelanguage.googleapis.com/v1beta/openai/chat/completions \
  -H "Authorization: Bearer $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-3.1-flash","messages":[{"role":"user","content":"hi"}]}' \
  | jq '.choices[0].message.content'     # quoted: zsh would otherwise treat [0] as a glob

1M context as default

Gemini 3.1 ships with a 1M-token context window as standard across both Flash and Pro tiers. For agentic loops that read large repos before acting, this is the default to beat. Note the cost asymmetry: cached inputs are heavily discounted, which favours long-running sessions over one-shot calls.

Provider 3 DeepSeek V4 · open-weight, frontier-class Released Apr 24, 2026 ~5 min

DeepSeek V4 dropped three days before this lab was written. Two preview variants ship: V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active MoE), both with 1M token context and Hybrid Attention Architecture (CSA + HCA). The DeepSeek release notes specifically call out OpenClaw as a supported agent integration — that is the reason this lab exists this week and not next month.

~/.openclaw/config.toml — DeepSeek variant

[provider]
name     = "deepseek"
model    = "deepseek-v4-flash"     # or deepseek-v4-pro
api_key  = "$DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com/v1"

[provider.options]
reasoning_effort = "medium"     # low · medium · high · max

Terminal — verify DeepSeek V4

openclaw provider switch deepseek
openclaw ask "In one sentence: what is the Agency Paradox?"

# Direct API check — DeepSeek is OpenAI-compatible
curl -s https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'

Variant	Total / Active	Context	Input $/M	Output $/M	Use case
V4-Flash	284B / 13B	1M	$0.14	$0.28	Default agent loop · cheap iterations · long context retrieval
V4-Pro	1.6T / 49B	1M	$1.74	$3.48	Recommended for hard reasoning · multi-step refactors · still ~10× cheaper than GPT-5.4 / Opus 4.6
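To make the price step concrete, the per-million rates in the table can be applied to a session's token counts. A minimal sketch: the prices are copied from the table (preview pricing), while the 200K-in / 20K-out session is a made-up illustration and cache discounts are ignored.

```python
# USD per million tokens, copied from the table above (preview pricing).
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro":   {"input": 1.74, "output": 3.48},
}

def session_cost(model, tokens_in, tokens_out):
    """Cost in USD for one session, ignoring any cache discounts."""
    p = PRICES[model]
    return (tokens_in * p["input"] + tokens_out * p["output"]) / 1_000_000

# Hypothetical session size, for illustration only.
for model in PRICES:
    print(model, round(session_cost(model, 200_000, 20_000), 4))
```

For that hypothetical session the arithmetic gives about $0.034 on V4-Flash and about $0.42 on V4-Pro, which is the roughly 12× step the two rows imply.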

Why this matters for the matrix

For the first time, an open-weight model is priced low enough that running it via API is cost-equivalent to running a smaller local model on your own electricity bill. The interesting comparison is no longer "API quality vs free local quality" — it is "API cost vs local sovereignty". DeepSeek V4-Flash is the pivot point that makes that comparison sharp. Experiment 03 in the next tab is built around it.

Preview pricing, preview behaviour

V4 is a preview release. Pricing has been announced as preview pricing and DeepSeek has signalled it may shift downward as Huawei Ascend production scales. Re-pull pricing into the cost model in Experiment 06 before publishing any benchmark numbers — and date-stamp the report.

§4

Wire the local backend

Same agent, no network. Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, which means OpenClaw treats it identically to a frontier API — just with a different base URL and no key. That symmetry is not an accident; it is the entire reason provider-agnostic matters. The only practical difference your wallet will notice is that this provider costs zero per token.

Provider 4 Ollama · Gemma 4 26B MoE ~3 min

~/.openclaw/config.toml — Ollama variant

[provider]
name     = "ollama"
model    = "gemma4:26b"
api_key  = "ollama"                   # literal placeholder, ignored
base_url = "http://localhost:11434/v1"

[provider.options]
num_ctx     = 32768                    # match what Ollama is configured for
temperature = 0.2                      # keep agentic loops boring

Terminal — verify the local backend

openclaw provider switch ollama
openclaw ask "In one sentence: what is the Agency Paradox?"

# Confirm Ollama actually loaded the model — check ps
ollama ps
NAME           SIZE     PROCESSOR    UNTIL
gemma4:26b     20 GB    100% GPU     4 minutes from now
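The symmetry between the four backends can be seen by building the same request for each. This is a sketch of the OpenAI-style chat completion call that the curl examples in this lab use, not OpenClaw's own client code; `build_chat_request` is a hypothetical helper and the base URLs are the ones configured in §3 and §4.

```python
import json
import urllib.request

# Base URLs as configured in sections 3 and 4 of this lab.
BASE_URLS = {
    "openai":   "https://api.openai.com/v1",
    "gemini":   "https://generativelanguage.googleapis.com/v1beta/openai",
    "deepseek": "https://api.deepseek.com/v1",
    "ollama":   "http://localhost:11434/v1",
}

def build_chat_request(provider, api_key, model, prompt):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        BASE_URLS[provider] + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Only the base URL, key, and model name vary between providers. Sending is one more call, `urllib.request.urlopen(req)`, left out here so the sketch generates no network traffic.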

Note DeepSeek V4 locally — the honest answer is no Reference

The obvious question after seeing the V4-Flash spec sheet: can I run it locally? On 64GB unified memory the answer is no, not at any quality that justifies the workflow. V4-Flash is 284B parameters total. Even at the FP4-mixed precision DeepSeek ships, the weights alone are roughly 140–160GB. Q2 community quantisations might fit, but at that compression the model is no longer the model.

The hardware that would actually run V4-Flash locally is a Mac Studio with 192GB or more — Tier III in the Forge architecture, slated for an M5 Ultra refresh. Until then, V4 lives on the API side of the matrix. That is fine. Acknowledging the gap is the point — it makes the cost-vs-sovereignty trade-off in Experiment 04 concrete rather than aspirational.

What actually fits on 64GB at usable quality

Gemma 4 26B MoE at Q4–Q8 (~16–26GB), Gemma 4 31B Dense at Q4 (~17GB), Qwen2.5-Coder 32B at Q4_K_M (~19GB), Llama 4 70B at Q3 (tight at ~30GB). For the agentic workloads in this lab, Gemma 4 26B MoE is the recommended local default — it has the function-calling discipline the loop requires and leaves enough headroom for OpenClaw's own working state.
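The sizing argument in this note is simple arithmetic: parameters times bits per weight, divided by eight. A rough sketch that counts weights only; real quantised files come out somewhat larger because formats such as Q4_K_M average more than 4 bits per weight, and the KV cache needs memory on top.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8

print(weight_gb(284, 4))   # V4-Flash at 4 bits per weight
print(weight_gb(26, 4))    # Gemma 4 26B at 4 bits per weight
```

The first figure lands at 142 GB, inside the 140–160GB range quoted for V4-Flash and far beyond 64GB; the second gives 13 GB as a floor, consistent with the ~16GB observed for the Q4 Gemma build once overhead is included.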

§5

Switching providers cleanly

All four providers now resolve. The lab discipline is to treat the active provider as a deliberate, logged choice, not a sticky default that drifts. Three subcommands and one per-command flag cover ninety percent of the daily workflow.

Terminal — provider command reference

# List every configured provider and which one is active
openclaw provider list
  openai      gpt-5.4-mini          configured
* gemini      gemini-3.1-flash      configured (active)
  deepseek    deepseek-v4-flash     configured
  ollama      gemma4:26b            configured

# Switch — daemon stays up, only the upstream client rebinds
openclaw provider switch deepseek
Provider: deepseek · model: deepseek-v4-flash · ready

# One-shot override for a single command (does not persist)
openclaw ask --provider openai "What does this codebase do?"

# Inspect the current provider's exact config
openclaw provider show
name:     deepseek
model:    deepseek-v4-flash
base_url: https://api.deepseek.com/v1
options:  reasoning_effort=medium
session:  none active

Lab discipline — every provider switch is a log entry

The Field Report template at the back of this document records provider, model, and exact config for every experiment session. This is not bureaucracy — it is the only way matrix benchmarks remain comparable a month later when GPT-5.4-mini has been silently updated and Gemma's Ollama tag has been re-quantised. Numbers without provenance are noise.
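One lightweight way to keep that provenance is a JSON line per session. The sketch below is an assumption about what such a record could look like, not the Field Report template itself; `provenance_record` and its field names are hypothetical.

```python
import json
from datetime import date

def provenance_record(provider, model, options, run_date=None):
    """One JSON line capturing what a benchmark number was measured against."""
    record = {
        "date": (run_date or date.today()).isoformat(),
        "provider": provider,
        "model": model,
        "options": options,
    }
    return json.dumps(record, sort_keys=True)

print(provenance_record("deepseek", "deepseek-v4-flash",
                        {"reasoning_effort": "medium"}, date(2026, 4, 27)))
```

Appending one such line to a log file at every provider switch is enough to reconstruct, a month later, which model and options produced a given number.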

§6

The tool taxonomy

OpenClaw exposes seven tools to whichever model is active. Each is a function the model can call between turns; the daemon executes it on the model's behalf and returns the result. Understanding what each tool does, what it costs, and what it risks is the prerequisite for everything in the next tab — and the prerequisite for ClawLaw in Lab 004.

Tool	What it does	Cost	Risk surface
read	Read a file from the working directory	Tokens for file content + a small model turn	Information leak — sensitive files surface into the model context
write	Write or overwrite a file	Tokens for content + a model turn	Irreversible if no version control · can clobber human edits
bash	Execute a shell command in the working directory	Output capture + a model turn	Highest — full process authority unless sandboxed · network egress · destructive commands
glob	Find files by pattern	Path list + a model turn	Low · enumerates filesystem layout (light info disclosure)
grep	Search file contents by regex	Match snippets + a model turn	Low–medium · can surface secrets if pointed at config dirs
web_search	Query a search engine, return links and snippets	Tokens for results + provider search fee (varies)	Egress · provider sees query text · injected content into the model
message	Speak to the human operator (no side effect)	One model turn	None directly · surfaces intent, useful for traceability
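The loop behind these tools is small enough to sketch. This is a toy illustration of the pattern described above (the model requests a tool, the host executes it and feeds the result back, until the model finishes or the iteration budget runs out), not OpenClaw's implementation; the action format and `run_agent` are hypothetical.

```python
def run_agent(model_step, tools, task, max_iterations=20):
    """Call model_step until it returns a final answer or the budget runs out.

    model_step(history) must return either
      {"tool": name, "args": {...}}  to request a tool call, or
      {"final": text}                to stop.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        action = model_step(history)
        if "final" in action:
            return action["final"], history
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "name": action["tool"], "content": result})
    return None, history  # budget exhausted without a final answer

# Toy stand-ins: a scripted "model" and a single in-memory read tool.
FILES = {"util.py": "def add(a, b): return a + b"}

def scripted_model(history):
    if len(history) == 1:
        return {"tool": "read", "args": {"path": "util.py"}}
    return {"final": "read " + str(len(history[-1]["content"])) + " chars"}

answer, trace = run_agent(scripted_model, {"read": lambda path: FILES[path]},
                          "inspect util.py")
print(answer)
```

Every entry appended to `history` is also a cost: tool output is re-sent to the model on the next turn, which is why the Cost column charges each tool for its output plus a model turn.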

The bash tool is the boundary

Six of the seven tools can be approximated through the others. The bash tool is the one that cannot. Anything your user account can do from a shell, OpenClaw can do through bash, including rm -rf, outbound HTTP, and pulling and running other binaries. The default config enables it because the lab is observational, not production. Lab 004 in this series installs ClawLaw, whose ShellCommandApprovalLaw turns bash into an approval-gated tool. Until then: run OpenClaw in a directory you can restore from Git, and never against your home directory.
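When reviewing a trace by hand, one useful check is whether a path named in a tool call stays inside the working directory. A minimal sketch of that check; it is an operator-side review aid with a hypothetical name, it does not sandbox bash, and it is not how ClawLaw works.

```python
from pathlib import Path

def inside_workdir(workdir, candidate):
    """True if candidate resolves to a location at or below workdir."""
    root = Path(workdir).resolve()
    target = (root / candidate).resolve()  # follows .. and symlinks
    return target == root or root in target.parents
```

Relative paths are resolved against the working directory, so `../other-repo/file` and absolute paths such as `/etc/hosts` both come back False.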

Terminal — essential daily commands

# Single-shot question, no agentic loop, no tool calls
openclaw ask "explain this error: ..."

# Full agentic task — model can call tools until done or max_iterations
openclaw task "refactor utils.py to use pathlib instead of os.path"

# Resume the last interrupted session
openclaw resume

# Show the live tool-call trace for the active session
openclaw trace --follow

# Inspect a completed session — every tool call, every model turn
openclaw session show --last

# Daemon control
openclaw daemon status
openclaw daemon stop
openclaw daemon restart                    # after config edits

§7

Smoke test the whole stack

Final installation step: one task, run sequentially against all four providers, in a throwaway directory. The point is not to evaluate the providers (that comes in the experiments) but to confirm every cell of the matrix is wired and that the same prompt produces a coherent, observable tool-call trace on each backend.

Terminal — install smoke test

# Build a deliberately tiny scratch repo
mkdir -p ~/forge/lab/smoke && cd ~/forge/lab/smoke
git init -q
echo "# Smoke" > README.md
echo "def add(a, b): return a + b" > util.py
git add -A && git commit -q -m "baseline"

# Standardised task — small enough to complete in 1–3 tool calls per provider
PROMPT="Read util.py, then write a new file test_util.py with one pytest case for add()."

for p in openai gemini deepseek ollama; do
  echo "--- provider: $p ---"
  git reset --hard -q && git clean -fdq   # clean slate each pass; reset alone leaves the previous test_util.py in place
  openclaw provider switch "$p" > /dev/null
  openclaw task "$PROMPT" --quiet
  openclaw session show --last --summary     # tool-call count, tokens, ms
done

Each provider produces test_util.py in the working directory
Tool-call count is roughly comparable (2–4 calls per provider for this task)
No tool call touches unexpected files: git status after each pass should list test_util.py as the only change (this covers the repo only, not paths outside it)
The local backend (Ollama) is slower per call but produces a working test file
Total cost reported in the session summaries is non-zero only on the API providers
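The git status check in the list above can be automated by parsing the porcelain format. A small sketch, assuming the output of `git status --porcelain` (two status characters, a space, then the path); `unexpected_changes` is a hypothetical helper, and renamed files, which use an `old -> new` form, are not handled.

```python
def unexpected_changes(porcelain, allowed=("test_util.py",)):
    """Paths reported by `git status --porcelain` that are not in allowed."""
    paths = []
    for line in porcelain.splitlines():
        if not line.strip():
            continue
        path = line[3:]  # two status characters, one space, then the path
        if path not in allowed:
            paths.append(path)
    return paths
```

Feed it `subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True).stdout` after each pass; an empty list means the provider changed nothing in the repo beyond the expected test file.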

If the smoke test passes

Move to the Experiments tab. Lab 003 is installed. Lab 004 (ClawLaw) is the next tab in this series; until it is published, the operator is the governance layer — read every trace before you trust the outcome.

macsweeney.tech · AI Lab Six experiments · ~14 hrs total

Experiments · Forge Lab Series · Lab 003

The provider matrix —
six experiments, four backends, one agent

Each experiment runs the same agent against multiple providers and produces a publishable Field Report. The whole tab is structured so the matrix from Experiment 02 becomes the comparison frame for everything afterward.

Experiment 01 · Tier I

First Contact — One Task, Four Brains

Entry Level ~45 min

Run the smoke-test task from §7 again, but this time treat it as a real experiment: record tool-call count, total wall-clock time, total tokens in/out, and total cost for each of the four providers. Produce one row of the matrix per provider. This is Forge Lab Log Entry 003.01.

Setup

Use the same scratch repo

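Reuse ~/forge/lab/smoke from §7. Reset to baseline before each provider pass with git reset --hard followed by git clean -fd; reset alone leaves untracked files such as the previous test_util.py in place. Identical starting state is the whole methodology.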

Pick the prompt and freeze it

Use one of the three prompts below verbatim. Do not improve the prompt between runs. The prompt itself is part of the experimental constant.

Run, log, compare

Run the prompt against each provider in turn using the script below. openclaw session show --last --json emits the full trace; pipe it to a file per provider for later analysis.

Standard prompts (pick one and stick with it)

Prompt A · trivial

"Read util.py, then write test_util.py with one pytest case for add()."

Prompt B · light reasoning

"Audit this directory for any Python file using os.path. Replace with pathlib equivalents in place. Do not modify anything else."

Prompt C · web-grounded

"Look up the latest stable Python release. Update any version pin in pyproject.toml or requirements.txt to use that version. Note your source."

~/forge/lab/exp01_first_contact.sh

#!/usr/bin/env bash
# Forge Lab — Experiment 01 · First Contact
set -euo pipefail
cd ~/forge/lab/smoke

PROMPT="Read util.py, then write test_util.py with one pytest case for add()."
OUT=~/forge/lab/exp01-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"

for p in openai gemini deepseek ollama; do
  echo "=== $p ===" | tee -a "$OUT/log.txt"
  git reset --hard -q && git clean -fdq   # also remove the untracked test_util.py from the previous pass
  openclaw provider switch "$p" > /dev/null

  START=$(date +%s)
  openclaw task "$PROMPT" --quiet
  END=$(date +%s)

  openclaw session show --last --json > "$OUT/$p.json"
  echo "  wall_seconds: $((END - START))" >> "$OUT/log.txt"
done

# Quick matrix from the JSON traces
jq -r '[.provider, (.tool_calls|length), .tokens.total, .cost_usd, .duration_ms] | @tsv' \
  "$OUT"/*.json | column -t

What to record

Experiment 01 · First Contact Matrix Fill from $OUT/<provider>.json

Provider	Model	Tool calls	Tokens in/out	Wall time	Cost USD
openai	gpt-5.4-mini	—	—	—	—
gemini	gemini-3.1-flash	—	—	—	—
deepseek	deepseek-v4-flash	—	—	—	—
ollama	gemma4:26b	—	—	—	$0.00

Output file test_util.py exists and runs pytest -q cleanly for each provider
Tool-call counts within ~1.5× of each other across providers (large drift = a prompt the model misreads)
Cost on Ollama is exactly zero (this is the canary — non-zero means a misrouted call)
Each session JSON saved to $OUT/<provider>.json for later replay
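Two of the checks above are mechanical and can run straight off the saved traces. A sketch, assuming each session JSON carries the `provider`, `tool_calls`, and `cost_usd` fields that the jq filter in the script reads; `check_matrix` is a hypothetical helper.

```python
def check_matrix(sessions):
    """Apply the drift and zero-cost checks to a list of session dicts."""
    problems = []
    calls = {s["provider"]: len(s["tool_calls"]) for s in sessions}
    lo, hi = min(calls.values()), max(calls.values())
    if hi > 1.5 * lo:
        problems.append(f"tool-call drift beyond 1.5x: {calls}")
    for s in sessions:
        if s["provider"] == "ollama" and s["cost_usd"] != 0:
            problems.append("non-zero cost on ollama: possible misrouted call")
    return problems
```

Load the four files with `json.loads(Path(f).read_text())` and pass the list in; an empty result means both checks pass. The pytest check still has to be run separately in the scratch repo.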

Experiment 02 · Tier I

Provider Matrix — Five Standard Tasks Across Four Backends

Entry Level ~3 hrs

Five tasks of increasing complexity, run across all four providers, recorded in one matrix. The deliverable is a 5×4 grid of cells with cost, tool-call count, and a quality score. This becomes the reference table the rest of the lab cites — and the table the community does not yet have for OpenClaw + DeepSeek V4 as of this week.

The practitioner's value is publishing a matrix others can reproduce. Every cell needs exact provider + model + date — drift in any of those invalidates comparison.

The five standard tasks

Task	Description	Tools expected
T1 · trivial	Read a file, write one pytest case based on it	read, write
T2 · refactor	Replace os.path usage with pathlib across a small package	glob, read, write
T3 · fix-the-bug	A failing test is provided; identify and fix the underlying bug	read, bash (run pytest), write
T4 · web-grounded	Look up a fact, update a config file accordingly, cite the source	web_search, read, write, message
T5 · multi-step	Read README, inspect code, generate a CHANGELOG entry for the last commit	read, bash (git log), grep, write

Quality scoring (the two-score method)

Each task gets two scores, judged independently on a 0–3 scale. Correctness: did the output do what was asked? Reasoning independence: did the model figure out the task on its own, or did it require operator nudging mid-loop? Recording these separately is what makes provider comparison interesting — a model can be highly correct only when heavily guided, and that should not score the same as a model that just gets it right.

Experiment 02 · Provider × Task Matrix Quality = correctness · independence (each 0–3)

Task	openai	gemini	deepseek	ollama
T1 trivial	—	—	—	—
T2 refactor	—	—	—	—
T3 fix-the-bug	—	—	—	—
T4 web-grounded	—	—	—	—
T5 multi-step	—	—	—	—
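Keeping the two scores as separate fields, rather than a single blended number, is easy to enforce in whatever tooling fills the matrix. A minimal sketch; the `Score` class and its cell rendering are hypothetical conventions, not part of the lab scripts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Score:
    correctness: int    # 0-3: did the output do what was asked?
    independence: int   # 0-3: how little operator nudging was needed?

    def __post_init__(self):
        for name in ("correctness", "independence"):
            value = getattr(self, name)
            if value not in (0, 1, 2, 3):
                raise ValueError(f"{name} must be 0-3, got {value}")

    def cell(self):
        """Render both scores as one matrix cell."""
        return f"{self.correctness} · {self.independence}"
```

`Score(3, 1).cell()` renders as `3 · 1`: correct output, but only with heavy guidance, which is exactly the case the two-score method exists to distinguish from `3 · 3`.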

Publication angle

An OpenClaw + DeepSeek V4 + Apple Silicon benchmark has not been published anywhere as of April 27, 2026; the V4 release is three days old. Publishing this matrix today, with exact dates and reproducible scripts, makes it a primary citation source. Include the prompts, the seed scratch repo, and the per-provider session JSONs alongside the report. Display the date stamp prominently.

Experiment 03 · Tier II DeepSeek V4 focus

DeepSeek V4 Head-to-Head — Flash vs Pro vs Frontier Reference

Intermediate ~2 hrs

DeepSeek V4 ships in two sizes with very different price points. The hypothesis to test: V4-Flash is cost-equivalent to local inference but quality-equivalent to a frontier API on agentic-coding tasks. The control is GPT-5.4 (frontier reference); the contrast is V4-Pro (the high-effort variant, where DeepSeek's max-reasoning mode lives). Three providers, the same five tasks from Experiment 02, with cost-per-completed-task as the headline metric.

Setup

Add the V4-Pro provider variant

Duplicate the deepseek block in config.toml with name deepseek-pro and model deepseek-v4-pro. Keep V4-Flash as the default. Switching becomes a one-liner.

Set the reasoning effort dial

V4 supports four effort modes: low, medium, high, max. Run V4-Flash on medium (its sweet spot) and V4-Pro on high (its sweet spot). Note: max mode requires >=384K context allocation per the DeepSeek release notes — only worth it for the hardest task in the suite.

Run all five tasks against three providers

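OpenAI GPT-5.4 (control), DeepSeek V4-Flash, DeepSeek V4-Pro. The shipped openai block names gpt-5.4-mini, so set the control model explicitly and record which variant ran. Same prompts, same scratch repo, three passes each. Median of three is the reportable number.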

~/forge/lab/exp03_deepseek_headtohead.sh

#!/usr/bin/env bash
# Forge Lab — Experiment 03 · DeepSeek V4 Head-to-Head
set -euo pipefail
cd ~/forge/lab/scratch

declare -a PROVIDERS=("openai" "deepseek" "deepseek-pro")
declare -a TASKS=(T1 T2 T3 T4 T5)

OUT=~/forge/lab/exp03-$(date +%Y%m%d)
mkdir -p "$OUT"

for p in "${PROVIDERS[@]}"; do
  openclaw provider switch "$p" > /dev/null
  for t in "${TASKS[@]}"; do
    for run in 1 2 3; do                          # 3 runs per cell, take median
      git reset --hard -q && git clean -fdq     # reset alone keeps untracked files; commit prompts/ first so clean spares it
      openclaw task "$(cat "prompts/$t.txt")" --quiet
      openclaw session show --last --json > "$OUT/${p}_${t}_run${run}.json"
    done
  done
done

# Build the matrix: median duration_ms and median cost_usd per (provider, task)
python3 ~/forge/lab/aggregate_matrix.py "$OUT" > "$OUT/matrix.tsv"
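The script above calls aggregate_matrix.py, which is not listed in this lab. The following is one possible sketch of it, assuming the `<provider>_<task>_run<n>.json` naming used by the loop and the `duration_ms` and `cost_usd` fields used elsewhere in these experiments; the real script may differ.

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path

def aggregate(out_dir):
    """Median duration_ms and cost_usd per (provider, task) as TSV lines."""
    cells = defaultdict(list)
    for f in sorted(Path(out_dir).glob("*_run*.json")):
        # File names look like deepseek-pro_T3_run2.json; split from the right
        # so a hyphenated provider name stays intact.
        provider, task, _run = f.stem.rsplit("_", 2)
        cells[(provider, task)].append(json.loads(f.read_text()))
    lines = ["provider\ttask\tmedian_ms\tmedian_cost_usd"]
    for (provider, task), runs in sorted(cells.items()):
        ms = statistics.median(r["duration_ms"] for r in runs)
        usd = statistics.median(r["cost_usd"] for r in runs)
        lines.append(f"{provider}\t{task}\t{ms}\t{usd}")
    return lines
```

To use it as the command-line script the shell loop expects, finish the file with `print("\n".join(aggregate(sys.argv[1])))` after an `import sys`.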

Hypotheses to test

H1 — V4-Flash completes T1–T3 at quality scores within 0.5 of GPT-5.4-mini, at <15% the cost
H2 — V4-Pro completes T5 (multi-step) at higher independence score than V4-Flash, justifying the price step
H3 — On T4 (web-grounded) the gap closes — citation discipline is more about prompting than model size
H4 — V4 reasoning_effort=max produces measurably better T3 (fix-the-bug) results than effort=high, but at >3× latency

Be honest about preview status

V4 is preview-tier. DeepSeek's own release notes say it trails frontier models by 3–6 months on knowledge benchmarks. The interesting finding is rarely "V4 wins" — it is "V4 wins on these task types, loses on those, and the cost ratio means it wins the practitioner's allocation decision anyway." Report the loss cases, including the loss to free local inference where it occurs.

Experiment 04 · Tier II

Local vs API — Where Sovereignty Is Already Cheap Enough

Intermediate ~2 hrs

The pivot question of the lab: at what task complexity does local inference (Gemma 4 26B on Ollama) stop being the right answer compared to a cheap API (DeepSeek V4-Flash)? The hypothesis is not that one wins outright. The hypothesis is that there is a clean cutoff in task complexity above which the API saves you more time than the local model saves you in dollars.

Method

Re-run T1–T5 on local + V4-Flash only

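Two providers, five tasks, three runs each. Use the data already collected for V4-Flash from Experiment 03; only Ollama needs fresh runs. Save them with the same <provider>_<task>_run<n>.json naming so the analysis script can load both sets the same way.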

Record three numbers per cell

Wall-clock seconds (latency), tool calls (loop efficiency), correctness score (did it work). Cost is always $0.00 for Ollama and ~$0.001 for V4-Flash on T1–T3.

Find the crossover

Plot wall-clock seconds vs task. The crossover task — where Ollama starts taking visibly longer per loop — is the practitioner's actual decision point.

~/forge/lab/exp04_crossover.py

"""Forge Lab — Experiment 04 · Local vs API crossover analysis.
Reads session JSONs from Experiment 03 (V4-Flash) and the fresh Ollama runs,
and finds the task at which the time saved by the API exceeds the dollars spent on it.
"""
import glob
import json
import statistics
from pathlib import Path

DEV_HOURLY_USD = 90     # lab operator's effective hourly rate
SECONDS_PER_USD = 3600 / DEV_HOURLY_USD

# Assumption: the fresh Ollama runs are saved under ~/forge/lab/exp04-*/
# with the same <provider>_<task>_run<n>.json naming as Experiment 03.
LAB = Path("~/forge/lab").expanduser()     # glob does not expand "~" itself

def load_runs(provider, task):
    files = glob.glob(str(LAB / "exp0[34]-*" / f"{provider}_{task}_run*.json"))
    return [json.loads(Path(f).read_text()) for f in files]

for task in ["T1", "T2", "T3", "T4", "T5"]:
    local = load_runs("ollama", task)
    api   = load_runs("deepseek", task)
    if not local or not api:               # median() raises on empty input
        print(f"{task}: missing runs, skipped")
        continue

    local_s = statistics.median(r["duration_ms"] / 1000 for r in local)
    api_s   = statistics.median(r["duration_ms"] / 1000 for r in api)
    api_usd = statistics.median(r["cost_usd"] for r in api)   # "$" is not valid in a Python name

    time_saved = local_s - api_s
    breakeven  = api_usd * SECONDS_PER_USD     # seconds the API has to save to pay for itself
    verdict    = "API wins" if time_saved > breakeven else "local wins"

    print(f"{task}: local {local_s:.1f}s · api {api_s:.1f}s · ${api_usd:.4f} · {verdict}")

The finding to look for

The working expectation is that the crossover sits between T2 and T3. Below that, the local model is fast enough that paying for the API is operator-time-negative. Above that, the API loop completes so much faster that even at frontier prices it pays back. V4-Flash compresses that crossover region: when the API is <1 cent per task, the threshold for switching shifts down by one task tier. That is the practitioner-relevant story.
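The breakeven arithmetic is worth working through once by hand. A short sketch using the $90 hourly rate from the analysis script; the three per-task costs are illustrative values, not measurements.

```python
DEV_HOURLY_USD = 90
SECONDS_PER_USD = 3600 / DEV_HOURLY_USD   # 40 seconds of operator time per dollar

def breakeven_seconds(api_cost_usd):
    """Seconds the API must save per task to pay for itself."""
    return api_cost_usd * SECONDS_PER_USD

# Illustrative per-task costs only.
for cost in (0.001, 0.01, 0.50):
    print(f"${cost:.3f} per task -> breakeven {breakeven_seconds(cost):.2f}s")
```

At $0.001 per task the API only has to save 0.04 seconds to pay for itself, so at V4-Flash prices the verdict is decided almost entirely by latency; at a hypothetical $0.50 per task the bar rises to 20 seconds.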

Experiment 05 · Tier III

A Real Refactor — The Composition Problem on Each Provider

Advanced ~4 hrs

Move out of toy tasks. Pick a real, ~500-line Python or Swift project from your own work — small enough that one OpenClaw session can hold it in mind, big enough that no single tool call fixes everything. Run a substantial refactor against it on each provider. The goal is not just correctness — it is to count how many individually approved actions collectively constitute scope creep. This is the Agency Paradox composition problem, made measurable.

Run in a sandboxed clone

Do this in a fresh git clone of the project under ~/forge/lab/scratch. Reset between providers. Do not run against the canonical working copy of any project you cannot afford to lose. The point of measuring scope creep is that it happens — that is the finding.

Refactor task — pick one of these

Replace all blocking I/O calls in a small async Python service with proper async equivalents
Extract a module's tightly coupled tests into property-based tests (Hypothesis or SwiftCheck)
Migrate a UIKit view controller to SwiftUI while preserving public API and state semantics
Replace bespoke logging with structured logging (structlog or os_log) end-to-end

Composition counter

After each provider's session, classify every tool call into one of three buckets: in-scope (directly serves the stated refactor), adjacent (related cleanup the model decided to do), out-of-scope (formatting, renames, dependency edits, comment additions the operator did not request). The ratio of these buckets is the composition footprint.
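The tally itself is simple enough to sketch. The snippet below assumes you have already hand-labeled each tool call with one of the three bucket names. It does not parse the session JSON, because the schema of that file is not specified here. The function name `composition_footprint` and the example labels are hypothetical, for illustration only, and are not measured results.

```python
from collections import Counter

# The three buckets named in the text above.
BUCKETS = ("in-scope", "adjacent", "out-of-scope")


def composition_footprint(labels):
    """Count tool calls per bucket for one provider's session.

    `labels` is one bucket name per tool call, assigned by hand post-hoc.
    """
    labels = list(labels)
    unknown = set(labels) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unrecognized bucket labels: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    footprint = {"total": total}
    for bucket in BUCKETS:
        footprint[bucket] = counts[bucket]
    # Share of calls the operator did not ask for; 0.0 for an empty session.
    footprint["out_of_scope_share"] = counts["out-of-scope"] / total if total else 0.0
    return footprint


# Hypothetical labels for a five-call session (not real lab data).
example_labels = ["in-scope", "in-scope", "adjacent", "out-of-scope", "in-scope"]
footprint = composition_footprint(example_labels)
print(footprint)
```

Each returned dictionary maps onto one row of the footprint table: total calls, then the three bucket counts. The "Tests pass?" column still has to be recorded separately.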

Experiment 05 · Composition Footprint Per Provider
Tool calls classified post-hoc from session JSON

Provider	Total calls	In-scope	Adjacent	Out-of-scope	Tests pass?
openai	—	—	—	—	—
gemini	—	—	—	—	—
deepseek-flash	—	—	—	—	—
deepseek-pro	—	—	—	—	—
ollama	—	—	—	—	—

What this experiment is really for

This is the bridge to Lab 004. The Agency Paradox is not a thought experiment — it is a number you can put on the page. Every cell in the out-of-scope column is an instance where the model did something individually defensible that, in aggregate, is not what was asked. ClawLaw's composition tracing is the architectural answer; this experiment is the empirical baseline that proves the answer is needed. Save the session JSONs — they become Lab 004's input fixtures.

Experiment 06 · Tier III

Cost-Per-Task Economics — Building the Allocation Model

Advanced ~3 hrs

Synthesize Experiments 01–05 into a single allocation table: for each task class, which provider is the right default? The output is a one-page decision aid you keep open while working — not a benchmark report, a working tool. The matrix is reproducible, dated, and tied to the price sheets in effect on the day of the run.

Inputs

Median cost-per-task from Experiment 02 (5 tasks × 4 providers)
Composition footprint from Experiment 05 (5 providers, real refactor)
Provider price sheets, dated to the run day, archived as PDFs in ~/forge/lab/pricing/
Operator hourly rate (used to convert wall time to operator cost)
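One way to combine these inputs is to price the operator's waiting time at the hourly rate and add it to the API spend. The sketch below assumes that simple linear model. The function name `effective_cost_usd` and every number in the example are hypothetical placeholders, not results from this lab.

```python
def effective_cost_usd(api_cost_usd, wall_seconds, hourly_rate_usd):
    """API spend plus the operator's time spent waiting, both in USD."""
    operator_cost = wall_seconds / 3600 * hourly_rate_usd
    return api_cost_usd + operator_cost


# Hypothetical inputs: a 120 USD/hour operator, one task on two backends.
HOURLY_RATE_USD = 120.0
local_total = effective_cost_usd(api_cost_usd=0.0, wall_seconds=90, hourly_rate_usd=HOURLY_RATE_USD)
api_total = effective_cost_usd(api_cost_usd=0.04, wall_seconds=30, hourly_rate_usd=HOURLY_RATE_USD)

print(f"local: ${local_total:.2f} · api: ${api_total:.2f}")
```

Under this model a free local run can still be the more expensive option once waiting time is counted, which is why the hourly rate belongs in the input list.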

Decision table to produce

Task class	Default provider	Reason	Fallback
trivial	ollama (gemma4:26b)	Free · low latency on M4 Pro · API spend not justified	deepseek-flash if Ollama is offline
refactor	deepseek-flash	~10× cheaper than GPT-5.4 · matches quality on T2 in pilot runs	openai gpt-5.4-mini for tightly typed languages
fix-the-bug	— record from data	— record from data	—
web-grounded	— record from data	— record from data	—
multi-step refactor	— record from data	— record from data	—
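If you want the table in a form a script can consult, a plain dictionary is enough. The sketch below encodes only the two rows that are already filled in above. Rows still marked "record from data" are deliberately left out. The names `DECISION_TABLE` and `pick_provider` are hypothetical, and the language-specific condition on the refactor fallback ("for tightly typed languages") is simplified away here.

```python
# Only the rows the table above already fills in; add the rest from your own data.
DECISION_TABLE = {
    "trivial": {"default": "ollama", "fallback": "deepseek-flash"},
    "refactor": {"default": "deepseek-flash", "fallback": "openai"},
}


def pick_provider(task_class, available):
    """Return the default provider if it is available, else the fallback.

    Returns None when the task class has no recorded row yet, or when
    neither the default nor the fallback is in `available`.
    """
    row = DECISION_TABLE.get(task_class)
    if row is None:
        return None
    for name in (row["default"], row["fallback"]):
        if name in available:
            return name
    return None


# Example: Ollama is offline, so the trivial class falls back to deepseek-flash.
print(pick_provider("trivial", {"deepseek-flash", "openai"}))
```

Returning None for unrecorded task classes keeps the script from guessing a default the data has not yet justified.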

The three-tab view (when Lab 004 lands)

This decision table is one tab. The other two appear when ClawLaw is installed in Lab 004: Langfuse (the trace tab — every tool call as Intelligence), and Grafana (the metrics tab — system load as Surveillance). Three tabs, ISR-mapped, three views of the same agentic session. That is the screenshot worth publishing — but it is impossible without first having the cost model in place. Lab 003 builds the cost model; Lab 004 wires it to governance.

Tags for this document: OpenClaw · DeepSeek V4 · Gemma 4 · Apple Silicon · Agency Paradox · Provider Matrix · 2026-04
