L-001 — Inference Benchmark: Apple Silicon vs Discrete GPU

Why this experiment exists

A video circulated in early 2026 showing a Mac Mini M4 Pro with 64GB of unified memory running a "Qwen 2.5 model" at 38 tokens per second, with a 13-second time-to-first-token. The comparison machine — a Linux box with an NVIDIA RTX 3060 12GB — scored 52 tokens per second. The conclusion drawn: the discrete GPU wins.

The conclusion is wrong. Not because the numbers are false, but because the numbers are measuring different things. The RTX 3060 almost certainly ran a 7B or 14B parameter model that fit within its 12GB of VRAM. The Mac Mini, as that 13-second delay suggests, was almost certainly running a 32B or 72B model that the 3060 cannot hold in VRAM at all.

Comparing tokens-per-second without locking model size and quantization is like comparing lap times without mentioning that one vehicle is a motorcycle and the other is a truck. Faster number, completely different task.

This is not a criticism of the person who made the comparison. It is an endemic problem in AI benchmarking content right now. This lab exists to teach the correct methodology — controlled, reproducible, and honest about what is and is not being measured.

The error this lab corrects

Tokens-per-second is not a hardware performance score. It is a throughput measurement that is only meaningful when the model, quantization level, prompt length, and memory configuration are identical across test subjects. Without those controls, you are not comparing hardware — you are comparing workloads.

Required reading

Complete this section before running the experiment. These are not suggestions. Understanding the underlying concepts is what separates a practitioner running an experiment from someone generating numbers they cannot interpret.

What a Large Language Model actually is

An LLM is a neural network with billions of numerical parameters — weights — stored as floating-point numbers. Inference is the process of loading those weights into memory and performing matrix multiplications against them. The fundamental constraint is whether those weights fit in fast memory. Everything else follows from this.

Quantization: what it is and why it matters

Model weights are originally stored as 16-bit or 32-bit floating point numbers (FP16, FP32). Quantization reduces this precision to shrink the file and memory footprint. A 7B parameter model at FP16 requires ~14GB. The same model at Q4_K_M (4-bit quantization) requires ~4GB. Reduced precision costs some quality; the tradeoff is generally worth it down to Q4, and quality degrades noticeably below that. This experiment uses Q4_K_M and Q8_0 as the primary test quantizations.

FP16 = full precision
Q8_0 = 8-bit, high quality
Q4_K_M = 4-bit, practical sweet spot
Q2_K = 2-bit, degraded quality
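The footprint figures above follow from simple arithmetic: parameters times bits per weight, divided by 8. A minimal sketch, assuming a nominal 7 billion parameters; estimate_gb is a hypothetical helper, not part of Ollama or llama.cpp. Real Q4_K_M and Q8_0 files come out larger than the nominal result because they store per-block scale factors alongside the weights, which is why this lab quotes ~4.1GB and ~7.7GB rather than 3.5GB and 7.0GB.

```shell
# Nominal weight memory: parameters (billions) x bits per weight / 8 = GB.
# estimate_gb is a hypothetical helper name used only for this sketch.
estimate_gb() {
  # $1 = parameter count in billions, $2 = bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 16   # FP16  -> 14.0
estimate_gb 7 8    # 8-bit -> 7.0
estimate_gb 7 4    # 4-bit -> 3.5
```

Treat the result as a lower bound on memory use: the KV cache and runtime buffers sit on top of the weights.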

The critical difference: VRAM vs unified memory

An NVIDIA GPU has dedicated VRAM: fast memory soldered to the graphics card. The RTX 3060 has 12GB. If a model exceeds 12GB, it cannot be held in VRAM and must be partially offloaded to system RAM over the PCIe bus, which is catastrophically slower. This is called the VRAM cliff: performance does not degrade gradually, it collapses.

Apple Silicon's unified memory is a single pool shared by CPU and GPU. On the Mac Mini M4 Pro with 64GB, most of that pool is available for model weights without a PCIe penalty (macOS reserves a share for the system, so the usable figure is somewhat below 64GB). There is no cliff, only a ceiling much higher than any discrete GPU currently available at this price point.
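The cliff can be checked before a single token is generated. The sketch below compares the approximate model sizes this lab uses in its VRAM budget check against a 12GB card. fits_in_vram is a hypothetical helper, and the 1.5GB headroom for KV cache and runtime buffers is an assumed value, not a measured one.

```shell
# fits_in_vram is a hypothetical helper: succeeds when the model plus the
# requested headroom fits within the given VRAM size.
fits_in_vram() {
  # $1 = model size in GB, $2 = VRAM in GB, $3 = headroom in GB
  awk -v m="$1" -v v="$2" -v h="$3" 'BEGIN { if (m + h <= v) exit 0; exit 1 }'
}

VRAM_GB=12        # RTX 3060
HEADROOM_GB=1.5   # assumption; grows with context length

for entry in "7b-q4_K_M:4.1" "14b-q4_K_M:8.4" "32b-q4_K_M:19.8" "7b-q8_0:7.7"; do
  name=${entry%%:*}   # text before the colon
  size=${entry##*:}   # text after the colon
  if fits_in_vram "$size" "$VRAM_GB" "$HEADROOM_GB"; then
    echo "$name ($size GB): fits"
  else
    echo "$name ($size GB): exceeds VRAM budget, skip"
  fi
done
```

A pass here only says the weights should fit; the memory readings taken during inference remain the actual evidence.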

The two speed metrics: TTFT and T/s

Time to First Token (TTFT) is the delay from prompt submission to the appearance of the first output token. It is dominated by model load time (if not already in memory) and prefill computation. A long TTFT is not necessarily a bad sign — it can indicate a very large model being fully loaded. Tokens per second (T/s) is the sustained generation rate after the first token. This is the number most people cite. Both matter; neither is sufficient alone.
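Both metrics can be derived from the nanosecond timing fields that Ollama's API reports with each response (load_duration, prompt_eval_duration, eval_count, eval_duration). A minimal sketch: the helper names are hypothetical, the input values are illustrative rather than measured, and treating TTFT as load time plus prefill time is an approximation that ignores queueing and network overhead.

```shell
# ttft_seconds and tokens_per_second are hypothetical helper names.
ttft_seconds() {
  # $1 = load_duration (ns), $2 = prompt_eval_duration (ns)
  awk -v l="$1" -v p="$2" 'BEGIN { printf "%.2f\n", (l + p) / 1e9 }'
}

tokens_per_second() {
  # $1 = eval_count (generated tokens), $2 = eval_duration (ns)
  awk -v c="$1" -v d="$2" 'BEGIN { printf "%.2f\n", c * 1e9 / d }'
}

# Illustrative values only, not measurements
ttft_seconds 1500000000 250000000    # 1.5s load + 0.25s prefill -> 1.75
tokens_per_second 200 4000000000     # 200 tokens in 4s -> 50.00
```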

Model families and parameter scales: Qwen2.5 as a case study

The Qwen2.5 family from Alibaba spans: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B. Saying "I'm running Qwen 2.5" without specifying which size is like saying "I'm driving a Ford" without specifying a Fiesta or an F-350. The models have completely different memory requirements, capability profiles, and performance characteristics. This experiment uses 7B, 14B, and 32B as its primary subjects because they span the VRAM cliff of the 3060.

MLX vs llama.cpp: Apple Silicon's two inference paths

On Apple Silicon, two primary inference backends exist. llama.cpp with Metal is cross-platform software that includes Metal GPU acceleration for Apple hardware. It works well. MLX is Apple's own machine learning framework, purpose-built for Apple Silicon's unified memory architecture. MLX often outperforms llama.cpp on Apple Silicon, particularly at larger model sizes, because it was designed for the hardware rather than adapted to it. Ollama uses llama.cpp under the hood. For best Apple Silicon numbers, MLX benchmarks should be run separately.

What we are measuring and why

Each metric in this experiment exists for a specific reason. Understanding why a metric is collected is as important as collecting it correctly.

Metric	Unit	Why we measure it
Time to First Token (TTFT)	seconds	Reveals model load behavior and prefill cost. A TTFT above ~5s on a warm model (already in memory) with this lab's short prompt suggests the model is not fully resident in fast memory or is running partly on the CPU. A long TTFT on first run is expected and normal: it is model loading, not inference failure.
Sustained tokens/sec	tok/s	The primary generation rate. Measured after the first token to exclude prefill. Run at least 3 generations and average them — the first run on a cold model will always be slower due to cache warming.
Memory consumed	GB	Confirms the model is fully resident in fast memory. If memory consumption approaches VRAM capacity on the GPU machine, all subsequent numbers are unreliable because part of the model may be spilling into system RAM. This is the variable that most published comparisons fail to report.
GPU utilization	%	Confirms the inference is using the GPU, not falling back to CPU. Ollama and MLX both run on the GPU; neither uses the Neural Engine. A CPU-bound inference run on either platform will appear much slower than hardware-accelerated inference and constitutes a configuration error, not a hardware result.
Power draw during inference	watts	Optional but illuminating. Tokens-per-watt is a real-world metric that matters for always-on edge AI deployments. Apple Silicon's efficiency advantage often appears here even when raw tok/s is comparable.
Output quality (perplexity proxy)	qualitative	Run the same prompt on all configurations and record whether outputs are coherent and on-task. Speed is irrelevant if the model is too quantized to produce useful output. Q4_K_M is generally the lowest acceptable quantization for general use.

The controlled variable rule

In every cross-machine comparison, exactly one variable may differ: the hardware. Model name, model size, quantization level, prompt text, prompt length, context length, and temperature must be identical across all test subjects. If any of these differ, you are not measuring hardware — you are measuring the interaction of hardware with a different workload.
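One way to hold these variables fixed is to send the same request body to Ollama's /api/generate endpoint on both machines, with the sampling and context options pinned. A minimal sketch: build_request is a hypothetical helper, and the specific values (temperature 0, seed 42, num_ctx 4096, num_predict 300) are assumptions chosen for illustration, not settings this lab prescribes.

```shell
# build_request is a hypothetical helper that prints a JSON body for
# Ollama's /api/generate endpoint with the controlled variables pinned.
build_request() {
  # $1 = model tag, $2 = prompt text
  # The prompt must not contain double quotes or backslashes (no escaping here).
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":0,"seed":42,"num_ctx":4096,"num_predict":300}}\n' "$1" "$2"
}

build_request "qwen2.5:7b-instruct-q4_K_M" "Explain entropy in one paragraph."

# To send it (needs a running Ollama server on the default port):
# build_request "$MODEL_TAG" "$PROMPT" | curl -s http://localhost:11434/api/generate -d @-
```

Saving the exact request body alongside your results gives anyone reproducing the run the same controlled inputs.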

Equipment and materials

Machine A · Apple Silicon

Mac Mini M4 Pro (or any Apple Silicon Mac). 16GB minimum. 64GB recommended for 32B+ models. macOS Sequoia or later. Ollama + MLX installed.

Machine B · Discrete GPU

Any NVIDIA GPU machine running Linux. CUDA 12.x. Note VRAM capacity carefully — it determines the model ceiling. Ollama installed. nvidia-smi available.

Software: Ollama

The common interface across both machines. Provides identical API surface and model management. Ensures like-for-like comparison of the same model files. Install at ollama.com.

Software: MLX (Mac only)

Apple's native inference framework. Run separately from Ollama to capture Apple Silicon's full capability. Install via pip install mlx-lm. Compare against Ollama Metal results.

Models to download

qwen2.5:7b-instruct-q4_K_M
qwen2.5:14b-instruct-q4_K_M
qwen2.5:32b-instruct-q4_K_M
qwen2.5:7b-instruct-q8_0

Monitoring tools

Mac: Activity Monitor, sudo powermetrics, iStat Menus.
Linux/NVIDIA: nvidia-smi dmon, nvtop, htop.

Before you begin

Close all non-essential applications on both machines. Reboot both machines before the first benchmark run of the day. Wait 2 minutes after boot before running inference. Run each benchmark at least 3 times and record all values — not just the best.

Procedure

Install and verify Ollama on both machines

Confirm Ollama is running and GPU acceleration is active before pulling any models.

# Both machines — confirm Ollama is running
ollama --version

# Check that GPU acceleration is active: once a model has been run, list loaded models
ollama ps
# Expect: PROCESSOR column reads 100% GPU

# On Linux/NVIDIA — confirm CUDA is visible
nvidia-smi
# Expect: GPU 0 · RTX 3060 · 12288MiB · Driver XX.X

# On Mac — confirm Metal backend
# Activity Monitor → GPU tab should show utilization during inference

Pull the benchmark models

Pull models on both machines. Record the exact model tag — this is your evidence that the test is controlled. On the Linux machine, only pull models that fit within your VRAM. Attempting to run a model that exceeds VRAM will either fail or produce invalid results due to CPU offloading.

# Pull the test models (on the Linux machine, skip any that exceed VRAM)
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5:7b-instruct-q8_0

# RTX 3060 12GB VRAM budget check (approximate sizes; confirm with ollama list):
# 7B  Q4_K_M ~ 4.1GB  fits comfortably
# 14B Q4_K_M ~ 8.4GB  fits with headroom
# 32B Q4_K_M ~ 19.8GB EXCEEDS VRAM — skip on 3060
# 7B  Q8_0   ~ 7.7GB  fits

# Verify model list
ollama list

Set the standard test prompt

Use this exact prompt for every run on every machine. Do not modify it. The prompt is designed to produce a medium-length response (~200 tokens) which gives a stable sustained tok/s reading. Record the prompt in your lab notes verbatim.

# Standard test prompt — use exactly as written
PROMPT="Explain the concept of entropy in thermodynamics. Cover what it means physically, why it always increases in a closed system, and give one concrete everyday example. Be thorough but concise."

# Approximate expected output: 180-220 tokens
# This length gives a stable sustained generation rate reading

Run the benchmark sequence

For each model/quant combination: run the prompt once to warm the model (cold run), then run it three more times and record those values. The Ollama API returns timing data directly. Use the following script to capture structured output.

# Benchmark runner: run on BOTH machines
# Substitute MODEL_TAG for each test configuration
# Requires PROMPT to be set as in the previous step

MODEL_TAG="qwen2.5:7b-instruct-q4_K_M"

# Warm run (discard results): uses the standard prompt so the model is loaded
echo "Warming model: $MODEL_TAG"
ollama run "$MODEL_TAG" "$PROMPT" --verbose > /dev/null 2>&1

# Benchmark runs (record these)
for RUN in 1 2 3; do
  echo "=== Run $RUN ==="
  ollama run "$MODEL_TAG" "$PROMPT" --verbose 2>&1 | grep -E \
    "load duration|prompt eval duration|prompt eval rate|eval rate|total duration"
done

# Key output lines to capture:
# load duration:        X.XXs          ← model load (TTFT component)
# prompt eval duration: X.XXs          ← prefill time (TTFT component)
# prompt eval rate:     X.XX tokens/s  ← prefill speed
# eval rate:            X.XX tokens/s  ← THIS IS YOUR T/S NUMBER
# total duration:       X.XXs

Monitor and record memory usage during inference

This step is non-optional. Memory data is the evidence that determines whether your tok/s numbers are valid. If VRAM is at or near capacity on the GPU machine, the run is invalid — the system is offloading to system RAM and tok/s will be artificially depressed.

# On Linux/NVIDIA — run in a second terminal during inference
watch -n 0.5 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu \
  --format=csv,noheader,nounits
# Record: memory.used at peak during inference
# If memory.used approaches 12000 on RTX 3060 → INVALID RUN

# On Mac — run in a second terminal during inference
sudo powermetrics --samplers gpu_power -i 1000 -n 20
# Also: Activity Monitor → Memory tab → Memory Pressure
# Green = healthy. Any yellow/red = memory pressure, run is suspect
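Watching a live display makes it easy to miss the peak. A small sketch for the Linux machine: log the same nvidia-smi query to a file during inference, then extract the highest memory.used value afterwards. peak_mem_used is a hypothetical helper and the sample lines are illustrative, not measurements.

```shell
# peak_mem_used is a hypothetical helper: reads nvidia-smi CSV samples
# (memory.used, memory.free, utilization.gpu; noheader, nounits) on stdin
# and prints the highest memory.used value seen, in MiB.
peak_mem_used() {
  awk -F', *' 'NF >= 3 && $1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}

# Illustrative samples only, not measurements
printf '%s\n' "4310, 7978, 91" "4398, 7890, 97" "4352, 7936, 95" | peak_mem_used

# During a real run, log once per second in a second terminal, then:
# nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu \
#   --format=csv,noheader,nounits -l 1 > samples.csv
# peak_mem_used < samples.csv
```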

Run MLX benchmarks on Apple Silicon (bonus round)

This step is Mac-only and captures Apple Silicon's native performance ceiling, which Ollama/llama.cpp may not fully realize. Compare these results against your Ollama results to understand the headroom.

# Install MLX inference
pip install mlx-lm

# Download a Qwen2.5 MLX model (4-bit quantized)
# Find at: huggingface.co — search "mlx-community/Qwen2.5-*"

# Run benchmark via MLX
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "$PROMPT" \
  --max-tokens 300
# Generation statistics are printed by default
# Output includes: prompt processing speed + generation speed
# These numbers sit closer to Apple Silicon's native ceiling than the Ollama results

Data recording sheet

Photocopy or transcribe this sheet. Fill in one row per machine-per-model run. Do not average before recording individual runs.

Machine A: Mac Mini M4 Pro · 64GB Unified Memory · Ollama + Metal backend

Model	Quant	Mem used (GB)	TTFT (s)	Run 1 tok/s	Run 2 tok/s	Run 3 tok/s	Avg tok/s	GPU util %
Qwen2.5-7B	Q4_K_M	__	__	__	__	__	__	__
Qwen2.5-7B	Q8_0	__	__	__	__	__	__	__
Qwen2.5-14B	Q4_K_M	__	__	__	__	__	__	__
Qwen2.5-32B	Q4_K_M	__	__	__	__	__	__	__
Qwen2.5-7B	4-bit (MLX)	__	__	__	__	__	__	__
Qwen2.5-14B	4-bit (MLX)	__	__	__	__	__	__	__

Machine B: Linux · RTX 3060 12GB VRAM · Ollama + CUDA backend

Model	Quant	VRAM used (GB)	TTFT (s)	Run 1 tok/s	Run 2 tok/s	Run 3 tok/s	Avg tok/s	GPU util %
Qwen2.5-7B	Q4_K_M	__	__	__	__	__	__	__
Qwen2.5-7B	Q8_0	__	__	__	__	__	__	__
Qwen2.5-14B	Q4_K_M	__	__	__	__	__	__	__
Qwen2.5-32B	Q4_K_M	SKIP — EXCEEDS 12GB VRAM · LEAVE BLANK · THIS IS THE POINT

The blank row is not missing data — it is the finding

The Qwen2.5-32B row for Machine B is intentionally left incomplete. A machine with 12GB of VRAM cannot run a 19.8GB model. This is not a configuration problem to be solved — it is a capability boundary. The Mac Mini row for the same model will have real numbers. That difference is the story this experiment is measuring.

Interpreting your results

Once every applicable cell is filled in (the Machine B 32B row stays blank), read the data as follows. Each rubric item covers a specific pattern you will likely see in your results.

Pattern	What you see in the data	What it means
Machine B faster on 7B and 14B	The RTX 3060 scores 50–70 tok/s while the Mac Mini scores 30–45 tok/s on the same model at the same quant.	Expected and correct. At model sizes that fit in VRAM, the discrete GPU's memory bandwidth (about 360 GB/s on the RTX 3060 against about 273 GB/s on the M4 Pro) beats Apple Silicon. This is the NVIDIA architecture's strength: fast parallel throughput on workloads that fit in VRAM. This is the number that gets cited in "GPU wins" videos.
Machine B blank at 32B; Machine A has a real number	Machine A records a measurable rate at 32B Q4_K_M (on an M4 Pro, memory bandwidth caps it at roughly 273 / 19.8 ≈ 14 tok/s). Machine B has no entry.	This is the capability story. The Mac Mini is slower per token on smaller models but can run a model the 3060 cannot touch. For use cases requiring instruction-following quality or reasoning that only large models provide, the Mac Mini wins by default: not on speed, on capability.
MLX numbers exceed Ollama numbers on Mac	Qwen2.5-7B via MLX scores noticeably higher tok/s than via Ollama.	MLX is better optimized for Apple Silicon's architecture than llama.cpp with Metal. The Ollama numbers represent a reasonable but not maximum estimate of Apple Silicon capability. Publish both, and label the MLX rows as MLX 4-bit, since that is a different quantization scheme from Q4_K_M. When someone cites Ollama numbers to claim "Mac lost," the MLX numbers are the rebuttal.
Q8_0 is noticeably slower than Q4_K_M	On both machines, Q8_0 runs at 60–70% of the tok/s of Q4_K_M for the same model size.	Correct and expected. Higher precision means more data to read per weight. Q8_0 uses nearly twice the memory of Q4_K_M (about 7.7GB against 4.1GB for the 7B model), and because generation is limited by memory bandwidth, throughput falls toward half; fixed per-token overhead keeps the observed ratio somewhat above that. The quality improvement is real but modest for most tasks. Q4_K_M is the practical default for a reason.
Mac TTFT is longer at large models	Qwen2.5-32B on Mac Mini shows 10–15 second TTFT. The 7B model shows 1–3s.	A longer TTFT at larger model sizes is a sign that more work is being done, not that the hardware is failing. A 32B model has roughly 4.5x the weight data of a 7B, so it takes longer to load and longer to prefill. This is normal and expected. Report it accurately: it is a user experience consideration, not a performance defect.
Tokens/sec varies between runs	Your three runs for the same model show values within 5–10% of each other rather than an identical number.	Normal thermal and scheduling variation. Take the arithmetic mean of three runs and report that. If the spread between runs exceeds 20% on a warm model, investigate: background processes, thermal throttling (check CPU/GPU temperatures), or memory pressure are the likely causes. Do not publish a single run.
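The mean and the run-to-run variation can be computed directly from the three recorded values. A minimal sketch: run_stats is a hypothetical helper, and it reads "variance" as spread, meaning (max - min) / mean, which is one reasonable interpretation rather than a definition this lab fixes. The input values are illustrative, not measurements.

```shell
# run_stats is a hypothetical helper: given tok/s values as arguments,
# prints the arithmetic mean and the spread ((max - min) / mean) in percent.
run_stats() {
  printf '%s\n' "$@" | awk '
    NR == 1 { min = $1 + 0; max = $1 + 0 }
    {
      v = $1 + 0
      sum += v
      if (v < min) min = v
      if (v > max) max = v
    }
    END {
      mean = sum / NR
      printf "mean=%.2f spread=%.1f%%\n", mean, (max - min) / mean * 100
    }'
}

# Illustrative values only
run_stats 40 42 44   # mean=42.00 spread=9.5%
```

Record the three individual values on the sheet first; the summary is derived from them, not a replacement for them.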

The finding this experiment produces

When your data sheet is complete, the finding is not "Mac wins" or "GPU wins." That is the wrong frame — and the frame that produces bad benchmark content. The correct finding is a capability-vs-speed profile for each platform.

Expected finding summary

At model sizes that fit in VRAM, discrete GPUs are faster. The RTX 3060 will outperform the Mac Mini on 7B and likely 14B models in raw tok/s. This is the architecture working as designed — high-bandwidth VRAM + CUDA pipeline is purpose-built for this.
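The bandwidth argument can be made concrete. During generation a dense model reads every weight once per token, so memory bandwidth divided by model size gives a rough upper bound on tok/s. A sketch under stated assumptions: max_tok_per_s is a hypothetical helper, the bandwidth figures (360 GB/s for the RTX 3060 12GB, 273 GB/s for the M4 Pro) are published spec numbers rather than measurements from this lab, and the model sizes are this lab's approximate figures.

```shell
# max_tok_per_s is a hypothetical helper: a bandwidth-bound ceiling,
# memory bandwidth (GB/s) divided by model size (GB).
max_tok_per_s() {
  # $1 = memory bandwidth in GB/s, $2 = model size in GB
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

max_tok_per_s 360 4.1    # RTX 3060, 7B Q4_K_M  -> 87.8
max_tok_per_s 273 4.1    # M4 Pro,   7B Q4_K_M  -> 66.6
max_tok_per_s 273 19.8   # M4 Pro,   32B Q4_K_M -> 13.8
```

Measured rates land below these ceilings. A measured figure above the ceiling for a given model is a strong hint that the model or quantization actually tested was not the one reported.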

At model sizes that exceed VRAM, only one machine can run the task. The Mac Mini's 64GB unified memory ceiling is approximately 5x the 3060's VRAM. Qwen2.5-32B runs on one machine and not the other. Qwen2.5-72B (with the M5 Ultra) will extend this further.

The correct question is not which is faster, but which fits your use case. If you need maximum throughput on 7B models and can live with the VRAM ceiling: discrete GPU. If you need to run larger models locally, want sustained inference without VRAM cliff risk, or are building edge AI infrastructure: Apple Silicon.

The video that prompted this experiment reported a number without context. This experiment produces context. That is the difference between a benchmark and a data point, and between a practitioner and a spectator.

Further experiments

Once your baseline data is collected, these extensions add depth to the story.

Ext A · Tokens-per-watt

Add power consumption measurement to your runs. Use a smart plug with power monitoring on both machines. Calculate tok/s / watts. Apple Silicon's efficiency advantage typically becomes decisive here, especially for always-on deployments.
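The calculation itself is a single division; since a watt is a joule per second, tok/s divided by watts is tokens per joule. A minimal sketch: tokens_per_joule is a hypothetical helper and the inputs are illustrative, not measurements from either machine.

```shell
# tokens_per_joule is a hypothetical helper: average tok/s divided by
# average wall power in watts during inference.
tokens_per_joule() {
  # $1 = average tok/s, $2 = average power draw in watts
  awk -v t="$1" -v w="$2" 'BEGIN { printf "%.3f\n", t / w }'
}

# Illustrative values only
tokens_per_joule 40 50   # -> 0.800
```

Use the wall-power reading taken during generation, not the idle figure, and report both inputs alongside the ratio.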

Ext B · Sustained load over 30 minutes

Run continuous inference for 30 minutes. Record tok/s every 5 minutes. Discrete GPU systems under sustained load may exhibit thermal throttling. Apple Silicon systems are designed for sustained performance, but record their trend too rather than assuming it stays flat.

Ext C · MLX vs llama.cpp full matrix

Repeat the entire experiment on the Mac using MLX instead of Ollama. The gap between Ollama and MLX results across model sizes reveals how much of Apple Silicon's capability is currently left on the table by cross-platform inference tools.

Ext D · Concurrent inference

Run two simultaneous inference requests on each machine. Record combined throughput and per-request latency. Unified memory architecture handles concurrency differently than VRAM — this test may reveal an Apple Silicon advantage not visible in single-request benchmarks.

Publishing your results

When you publish this data on the AI Lab, include: hardware specs in full, exact model tag (name + size + quantization), Ollama version, operating system version, ambient temperature, and whether runs were warm or cold. Any published benchmark missing any of these variables is not reproducible and should not be cited as evidence. Hold your own work to this standard.

macsweeney.tech · AI Lab · Lab 001 v1.0 · stephen@agentincommand.ai · March 2026

Run this experiment. Publish the data. Correct the record.