Every AI model you run locally is doing the same fundamental work: multiplying large matrices of floating-point numbers, billions of times per second. Where those multiplications happen — and how the data moves to get there — determines everything about performance and efficiency.
Traditional computers separate memory from compute. The CPU sits here; the RAM sits over there; they communicate over a bus with measurable latency. When you add a discrete GPU, the situation compounds: data must travel from system RAM to the GPU's dedicated VRAM before computation can begin. Large models have large weights. Moving them is expensive.
Apple Silicon collapses this architecture. The CPU, GPU, and Neural Engine sit on one system-on-chip, with the memory mounted in the same package and addressable by all three. There is no copy between separate memory pools. When the GPU reads model weights, it reads from the same physical location the CPU already loaded. This is unified memory architecture, and for inference workloads it is the most significant architectural advantage available on consumer hardware today.
CPU, GPU, and Neural Engine share one memory pool. Model weights loaded once are available to any compute unit without transfer. This is why 64GB of unified memory can outperform a discrete GPU with more raw TFLOPS once the model no longer fits in that GPU's VRAM. The advantage is not compute; it is the absence of a copy.
Tokens per second is the standard inference metric. One token is roughly 0.75 words. 30 tokens per second is comfortable reading speed. 10 is slow. The number depends on model size, quantisation, and memory bandwidth, not raw TFLOPS. A faster published number on a different model size is not a faster machine; it's a different experiment.
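To see why 10 tokens per second feels slow, it can help to turn a rate into the wait for a whole answer. The sketch below is illustrative arithmetic only: the 500-token answer length and the function names are assumptions, and the 0.75 words-per-token ratio is the rough figure quoted above.

```python
# Illustrative arithmetic only; the answer length is an assumed example value.
WORDS_PER_TOKEN = 0.75  # rough ratio quoted in the text


def seconds_for_response(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate n_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


def approx_words(n_tokens: int) -> float:
    """Approximate word count for a given token count."""
    return n_tokens * WORDS_PER_TOKEN


if __name__ == "__main__":
    answer_tokens = 500  # hypothetical answer length
    for rate in (10, 30):
        wait = seconds_for_response(answer_tokens, rate)
        words = approx_words(answer_tokens)
        print(f"{rate} tok/s: ~{words:.0f} words in {wait:.1f} s")
```

Under these assumptions a 500-token answer takes 50 seconds at 10 tokens per second and about 17 seconds at 30.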
Full 32-bit weights require four times the memory of their 8-bit equivalents, and quantising to 8-bit costs minimal quality. Q4_K_M is the practitioner's default: 4-bit, K-quantised, medium. It is the best balance of size, speed, and quality for daily use. Higher precision (Q8, FP16) is for evaluation runs where quality is the variable being measured. Lower precision (Q2, Q3) is for memory-constrained boxes where you accept the quality hit.
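The memory arithmetic behind these trade-offs can be sketched in a few lines. This is a rough estimate that assumes nominal bit widths; real quantised formats store extra scaling data alongside the weights, so actual files are somewhat larger. The function and table names are illustrative.

```python
# Nominal bit widths only; real quantised files carry extra scale data,
# so on-disk sizes run somewhat larger than these estimates.
NOMINAL_BITS = {"FP32": 32, "FP16": 16, "Q8": 8, "Q4": 4}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"7B at {name}: ~{weight_gb(7, bits):.1f} GB")
```

For a 7B model this gives roughly 28 GB at 32-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is the four-to-one ratio described above.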
For inference, the bottleneck is almost never compute; it is how fast you can read weights from memory. Apple Silicon's unified bandwidth (273 GB/s on M4 Pro, roughly 800 GB/s on M2 Ultra and M3 Ultra) explains its efficiency better than any other single number. If you remember one thing from this module, remember that.
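A back-of-envelope ceiling follows from this claim. The sketch assumes a dense model generating one token at a time, where each token requires streaming every weight from memory once; it ignores the KV cache, prompt processing, and runtime overhead, so measured numbers should land below it. The model sizes are nominal 4-bit estimates, not measured file sizes.

```python
M4_PRO_BANDWIDTH_GB_S = 273.0  # figure quoted in the text


def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if every token streams all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Assumed nominal sizes: parameters * 4 bits / 8, in GB.
    for label, size_gb in (("7B at 4-bit", 3.5), ("70B at 4-bit", 35.0)):
        ceiling = decode_ceiling_tps(M4_PRO_BANDWIDTH_GB_S, size_gb)
        print(f"{label}: at most ~{ceiling:.1f} tok/s")
```

The ceiling depends only on bandwidth and bytes read, never on TFLOPS, which is the point of the paragraph above.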
The RTX 4090 has roughly 3.7× the memory bandwidth of an M4 Pro (about 1,008 GB/s against 273 GB/s), but its 24GB of VRAM can't hold a 70B model: the weights spill to system RAM and the advantage is lost entirely. A Mac Mini with 64GB holds the entire model in unified memory and runs it cleanly. For 7B to 70B models locally, Apple Silicon is the correct architecture.
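The capacity side of this comparison can be checked with a simple fit test. This is a sketch under loud assumptions: the 4GB overhead allowance and the 0.75 usable fraction are illustrative placeholders, not vendor figures, and the 35GB model size is a nominal 4-bit estimate for 70B parameters. The `Device` class and `fits` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: float  # memory the GPU can address without spilling


def fits(weights_gb: float, device: Device,
         overhead_gb: float = 4.0, usable_fraction: float = 0.75) -> bool:
    """True if weights plus overhead fit in the usable share of memory.

    overhead_gb and usable_fraction are illustrative assumptions.
    """
    return weights_gb + overhead_gb <= device.memory_gb * usable_fraction


RTX_4090 = Device("RTX 4090", 24.0)
MAC_MINI_64 = Device("Mac Mini, 64GB unified", 64.0)

if __name__ == "__main__":
    for label, size_gb in (("7B at 4-bit", 3.5), ("70B at 4-bit", 35.0)):
        for device in (RTX_4090, MAC_MINI_64):
            verdict = "fits" if fits(size_gb, device) else "spills"
            print(f"{label} on {device.name}: {verdict}")
```

Under these assumptions the 7B model fits on both devices, while the 70B model fits only in the 64GB unified pool.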
This is the framing the rest of the track is built on. Apple Silicon does not "win" inference because it has more raw compute — it doesn't. It wins because it has more usable memory, addressable by every compute unit, without copies. The architecture decides the ceiling, not the marketing.
If you take away the wrong lesson here, you'll spend the next seven modules confused about why your local 7B model on a Mini generates faster than the 14B model you watched running on a different machine in a video. A different model size is a different experiment. The lab will make this concrete.
Start here. You already have what you need to run the first half of this track. The 7B and 13B model classes fit comfortably. The 32B class is reachable with quantisation. You will see the architecture argument in your own data within an hour.
The two reasonable options:
Module 02 and beyond use Intel Mac Minis as cluster nodes — they are cheap, silent, and electrically efficient. Don't buy them yet. Read the modules first so you understand why six nodes with specialised roles is the design, not six identical workers.
Reading the module is necessary; running the lab is sufficient. The hypothesis the lab tests is the central claim of this module: unified memory architecture means inference throughput is bounded by memory bandwidth, not compute. Quantised models should speed up roughly in proportion to the bytes of weights read per token, which parameter count alone does not capture, and performance should degrade gracefully as model size approaches available memory.
The paired lab walks you through three model sizes, records tokens per second at each, and asks you to compare your results against the module's claims after recording — not before. The methodology matters more than the numbers.
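A minimal timing harness for this kind of measurement might look like the sketch below. It is not the lab's own tooling: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that sleeps instead of calling a model. In practice you would replace it with a function that runs your inference runtime on a fixed prompt and returns the number of tokens it produced.

```python
import statistics
import time
from typing import Callable, List


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput for one run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def benchmark(generate: Callable[[], int], runs: int = 3) -> List[float]:
    """Time `generate` several times; it must return the tokens it produced."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(tokens_per_second(n_tokens, elapsed))
    return rates


def fake_generate() -> int:
    """Stand-in for a real model call: pretends to emit 20 tokens."""
    time.sleep(0.05)
    return 20


if __name__ == "__main__":
    rates = benchmark(fake_generate, runs=3)
    print(f"median: {statistics.median(rates):.1f} tok/s over {len(rates)} runs")
```

Reporting the median of several runs, with the same prompt and output length for every model size, keeps one slow run from skewing the comparison.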
L-001 · Inference Benchmark: Apple Silicon vs Discrete GPU →
What tokens-per-second actually measures, why most published comparisons are methodologically broken, and how to run a controlled experiment that produces data you can trust.
Once you have your inference baseline, Module 02 takes you to the cluster. Intel Mac Minis host both the control plane and the workloads. K3s lights up. The same hardware-architecture discipline applies, but now to multi-node operations instead of single-node inference.
If you skip the cluster path and stay on a single Apple Silicon node, Module 06 (Inference) and Module 07 (Governance) are still fully accessible. The track branches; the modules don't gate each other after Module 01.