Foundations — inference on Apple Silicon.

Pick the right hardware and the rest of this track is achievable. Pick the wrong hardware and you spend the next eight modules fighting the floor.

Module: M-01
Status: Published
Paired lab: L-001

Why does the chip matter for inference?

Every AI model you run locally is doing the same fundamental work: multiplying large matrices of floating-point numbers, billions of times per second. Where those multiplications happen — and how the data moves to get there — determines everything about performance and efficiency.

Traditional computers separate memory from compute. The CPU sits here; the RAM sits over there; they communicate over a bus with measurable latency. When you add a discrete GPU, the situation compounds: data must travel from system RAM to the GPU's dedicated VRAM before computation can begin. Large models have large weights. Moving them is expensive.

Apple Silicon collapses this architecture. The CPU, GPU, and Neural Engine are part of one system-on-chip, and the memory sits in the same package, addressable by all three. There is no transfer between separate memory pools. When the GPU reads model weights, it reads from the same physical location the CPU already loaded. This is unified memory architecture, and for inference workloads it is the most significant architectural advantage available on consumer hardware today.

Four primitives to understand before you benchmark

Unified memory architecture

CPU, GPU, and Neural Engine share one memory pool. Model weights loaded once are available to any compute unit without transfer. This is why a machine with 64GB of unified memory can outperform a discrete GPU with more raw TFLOPS once the model no longer fits in that GPU's VRAM. The advantage is not compute; it is the absence of a copy.

Tokens per second

The standard inference metric. One token is roughly 0.75 words. 30 tokens per second is comfortable reading speed. 10 is slow. The number depends on model size, quantisation, and memory bandwidth — not raw TFLOPS. A faster published number on a different model size is not a faster machine; it's a different experiment.
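To make the metric tangible, here is a minimal sketch that converts a tokens-per-second figure into waiting time for an answer of a given length. It assumes the 0.75 words-per-token rule of thumb quoted above, which varies by tokenizer and language; the function name is illustrative only.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; varies by tokenizer and language


def seconds_to_generate(words: int, tokens_per_second: float) -> float:
    """Estimate how long a model takes to generate an answer of `words` words."""
    tokens = words / WORDS_PER_TOKEN
    return tokens / tokens_per_second


# Compare the two speeds mentioned above for a 500-word answer.
for tps in (10, 30):
    print(f"{tps} tok/s: about {seconds_to_generate(500, tps):.1f} s for 500 words")
```

At 30 tokens per second a 500-word answer takes about 22 seconds; at 10 it takes over a minute.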

Quantisation

Full 32-bit weights need four times the memory of their 8-bit equivalents, and quantising down to 8 bits typically costs little quality. Q4_K_M is the practitioner's default: 4-bit, K-quant, medium variant. It offers the best balance of size, speed, and quality for daily use. Higher precision (Q8, FP16) is for evaluation runs where quality is the variable being measured. Lower precision (Q2, Q3) is for memory-constrained boxes where you accept the quality hit.
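The memory arithmetic behind quantisation can be sketched in a few lines. This assumes nominal bit widths (exactly 32, 16, 8, or 4 bits per weight); real quantised files carry extra scale metadata, so they come out somewhat larger than these figures. The function name is hypothetical.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Nominal widths only; actual quantised formats add per-block overhead.
for label, bits in (("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"7B at {label}: {weight_gb(7, bits):.1f} GB")
```

For a 7B model this gives 28, 14, 7, and 3.5 GB respectively, which is the four-to-one ratio between 32-bit and 8-bit described above.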

Memory bandwidth

For inference, the bottleneck is almost never compute — it is how fast you can read weights from memory. Apple Silicon's unified bandwidth (273 GB/s on M4 Pro, higher on M2/M3 Ultra) explains its efficiency better than any other single number. If you remember one thing from this module, remember that.
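A rough ceiling follows directly from this claim. The sketch below assumes a dense model generating one token at a time, where every token requires reading all the weights once, and it ignores the KV cache and runtime overhead. Under those assumptions, tokens per second cannot exceed bandwidth divided by weight size. The weight sizes are nominal 4-bit figures, not measured file sizes.

```python
def decode_ceiling_tps(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token reads all weights once."""
    return bandwidth_gbs / weights_gb


M4_PRO_GBS = 273  # bandwidth figure quoted in the text

for label, weights_gb in (("7B, nominal 4-bit", 3.5), ("70B, nominal 4-bit", 35.0)):
    ceiling = decode_ceiling_tps(M4_PRO_GBS, weights_gb)
    print(f"{label}: at most ~{ceiling:.1f} tok/s")
```

On these assumptions the ceiling is about 78 tokens per second for the 7B case and about 7.8 for the 70B case. Measured numbers will land below the ceiling; the point is that the bound moves with bytes read, not with TFLOPS.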

The comparison-video trap

The trap

The RTX 4090 has roughly 3.7× the memory bandwidth of an M4 Pro (about 1,008 GB/s against 273 GB/s), but its 24GB of VRAM cannot hold a 70B model: the model spills to system RAM and the bandwidth advantage disappears. A 64GB Mac Mini holds the entire model in unified memory and runs it cleanly. For 7B to 70B models locally, Apple Silicon is the correct architecture.

This is the framing the rest of the track is built on. Apple Silicon does not "win" inference because it has more raw compute — it doesn't. It wins because it has more usable memory, addressable by every compute unit, without copies. The architecture decides the ceiling, not the marketing.
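The fit argument can be checked with simple arithmetic. This is a minimal sketch, assuming nominal 4-bit weights and a hypothetical 8 GB of headroom for the KV cache, the runtime, and the operating system; the real headroom depends on context length and on how much memory the system lets the GPU address.

```python
def fits(weights_gb: float, memory_gb: float, headroom_gb: float = 8.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given memory."""
    return weights_gb + headroom_gb <= memory_gb


# 70B parameters at a nominal 4 bits per weight.
weights_70b_q4 = 70e9 * 4 / 8 / 1e9  # 35 GB

for name, memory_gb in (("RTX 4090 VRAM (24 GB)", 24), ("Mac Mini unified memory (64 GB)", 64)):
    verdict = "fits" if fits(weights_70b_q4, memory_gb) else "does not fit"
    print(f"70B at 4-bit, {weights_70b_q4:.0f} GB: {verdict} in {name}")
```

The headroom value is a placeholder to adjust, not a recommendation; the conclusion for these two machines holds for any headroom between 0 and 29 GB.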

If you take away the wrong lesson here, you'll spend the rest of the track confused about why your local 7B model on a Mini posts more tokens per second than a 14B model you watched running on a different machine in a video. A different model size is a different experiment. The lab will make this concrete.

Pick the hardware

If you have an Apple Silicon Mac with 16GB+

Start here. You already have what you need to run the first half of this track. The 7B and 13B model classes fit comfortably when quantised. The 32B class is reachable only with aggressive quantisation or more than 16GB, since even nominal 4-bit weights for 32B parameters take about 16GB. You will see the architecture argument in your own data within an hour.

If you want a dedicated lab machine

The two reasonable options:

Mac Mini M4 Pro · 64GB — the daily-driver inference node. Runs 32B comfortably, 70B with breathing room. Doubles as the governance boundary if you go down the ClawLaw path. Used-market refurbs available.
Mac Studio (M2 Ultra or newer) · 128GB+ — the inference substrate. Runs multiple models simultaneously. Required if you intend to run the full Forge architecture's Tier III.

For the cluster path

Module 02 and beyond use Intel Mac Minis as cluster nodes — they are cheap, silent, and electrically efficient. Don't buy them yet. Read the modules first so you understand why six nodes with specialised roles is the design, not six identical workers.

What to skip

Discrete-GPU PCs for inference under 70B. The bandwidth advantage is real but the memory ceiling makes them awkward for daily use.
Cloud GPUs by the hour for development. Fine for one-off training jobs; expensive and slow for the constant iteration of building a lab.
M1 Macs as primary inference nodes — they are great machines, but the unified memory bandwidth jump from M1 to M2/M3/M4 is meaningful for inference. Use them as cluster nodes (Phase 2) if you have them; don't buy them new for this purpose.

Now go measure it

Reading the module is necessary; running the lab is sufficient. The hypothesis the lab tests is the central claim of this module: unified memory architecture means inference throughput is bounded by memory bandwidth, not compute. A quantised model should run faster roughly in proportion to its smaller size in memory, not in proportion to its parameter count, and performance should degrade gracefully as model size approaches available memory.
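One way to check the hypothesis against your own numbers is to compute the bandwidth each run implies. If throughput is bandwidth-bound, weight size multiplied by tokens per second should come out roughly constant across model sizes and stay below the hardware's rated bandwidth. The sketch below is one possible analysis, not the lab's prescribed method; the function names are hypothetical and the `runs` list is deliberately empty for you to fill in.

```python
def implied_bandwidth_gbs(weights_gb: float, tokens_per_second: float) -> float:
    """GB/s of weight reads implied by a measured decode rate."""
    return weights_gb * tokens_per_second


def spread(values: list[float]) -> float:
    """Ratio of the largest to the smallest value; 1.0 means identical."""
    return max(values) / min(values)


# Fill in with (weights_gb, tokens_per_second) pairs from your own lab runs.
runs: list[tuple[float, float]] = []

if runs:
    implied = [implied_bandwidth_gbs(gb, tps) for gb, tps in runs]
    for (gb, tps), bw in zip(runs, implied):
        print(f"{gb:.1f} GB at {tps:.1f} tok/s -> {bw:.0f} GB/s implied")
    print(f"max/min spread: {spread(implied):.2f}")
```

A spread close to 1.0 supports the bandwidth-bound claim. A large spread suggests something else (compute, memory pressure, swapping) is limiting at least one of the runs.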

The paired lab walks you through three model sizes, records tokens per second at each, and asks you to compare your results against the module's claims after recording — not before. The methodology matters more than the numbers.
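Recording tokens per second consistently matters more than the tool used. Below is a minimal timing harness, assuming a hypothetical `generate` callable that runs one generation and returns the number of tokens produced; how you wire it to your inference runtime is left open. It takes the median of several runs to dampen outliers. The `fake_generate` function is a stand-in so the sketch runs on its own.

```python
import time
from statistics import median
from typing import Callable


def measure_tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Time `generate` several times and return the median tokens/s."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()  # hypothetical: returns tokens produced
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return median(rates)


def fake_generate() -> int:
    """Stand-in for a real model call; sleeps instead of generating."""
    time.sleep(0.01)
    return 10


print(f"{measure_tokens_per_second(fake_generate):.0f} tok/s (stand-in, not a real measurement)")
```

Keep the prompt, output length, and quantisation fixed across the runs you compare; otherwise the numbers describe different experiments.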

Paired lab

L-001 · Inference Benchmark: Apple Silicon vs Discrete GPU →

What tokens-per-second actually measures, why most published comparisons are methodologically broken, and how to run a controlled experiment that produces data you can trust.

Where this module leads

Once you have your inference baseline, Module 02 takes you to the cluster. Intel Mac Minis become the control plane and the workloads. K3s lights up. The same hardware-architecture discipline applies — but now to multi-node operations instead of single-node inference.

If you skip the cluster path and stay on a single Apple Silicon node, Module 06 (Inference) and Module 07 (Governance) are still fully accessible. The track branches; the modules don't gate each other after Module 01.
