What tokens-per-second actually measures, why most published comparisons are methodologically broken, and how to run a controlled experiment that produces data you can trust.
A video circulated in early 2026 showing a Mac Mini M4 Pro with 64GB of unified memory running a "Qwen 2.5 model" at 38 tokens per second, with a 13-second time-to-first-token. The comparison machine — a Linux box with an NVIDIA RTX 3060 12GB — scored 52 tokens per second. The conclusion drawn: the discrete GPU wins.
The conclusion is wrong. Not because the numbers are false, but because the numbers are measuring different things. The RTX 3060 almost certainly ran a 7B or 14B parameter model that fit within its 12GB of VRAM. The Mac Mini, as that 13-second time-to-first-token suggests, was almost certainly running a 32B or 72B model that the 3060 cannot hold in its VRAM at all.
Comparing tokens-per-second without locking model size and quantization is like comparing lap times without mentioning one car is a motorcycle and the other is a truck. Faster number, completely different task.
This is not a criticism of the person who made the comparison. It is an endemic problem in AI benchmarking content right now. This lab exists to teach the correct methodology — controlled, reproducible, and honest about what is and is not being measured.
Tokens-per-second is not a hardware performance score. It is a throughput measurement that is only meaningful when the model, quantization level, prompt length, and memory configuration are identical across test subjects. Without those controls, you are not comparing hardware — you are comparing workloads.
Complete this section before running the experiment. These are not suggestions. Understanding the underlying concepts is what separates a practitioner running an experiment from someone generating numbers they cannot interpret.
An LLM is a neural network with billions of numerical parameters — weights — stored as floating-point numbers. Inference is the process of loading those weights into memory and performing matrix multiplications against them. The fundamental constraint is whether those weights fit in fast memory. Everything else follows from this.
Model weights are originally stored as 16-bit or 32-bit floating point numbers (FP16, FP32). Quantization reduces this precision to shrink the file and memory footprint. A 7B parameter model at FP16 requires ~14GB. The same model at Q4_K_M (4-bit quantization) requires ~4GB. Reduced precision costs some quality; the tradeoff is generally worth it down to about 4 bits, and quality degrades more noticeably below Q4. This experiment uses Q4_K_M and Q8_0 as the primary test quantizations.
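The arithmetic behind those footprints can be sketched in a few lines of shell. This is a rough estimate only: the function name `weights_gb` is illustrative, and the calculation assumes every weight is stored at the same bit width and ignores KV cache, activations, and file-format overhead.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough weight footprint: parameters (in billions) x bits per weight / 8 = GB.
# Assumption: uniform bit width, no KV cache or format overhead.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7 16   # FP16: prints 14.0
weights_gb 7 4    # nominal 4-bit: prints 3.5
```

Real Q4_K_M files come out somewhat larger than the nominal 4-bit figure because the format keeps some tensors at higher precision, which is consistent with the ~4GB quoted above.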
An NVIDIA GPU has dedicated VRAM: fast memory soldered to the graphics card. The RTX 3060 has 12GB. If a model exceeds 12GB, it cannot be held in VRAM and must be partially offloaded to system RAM over the PCIe bus, which is catastrophically slower. This is called the VRAM cliff: performance does not degrade gradually, it collapses. Apple Silicon's unified memory is a single pool shared by CPU and GPU. On the Mac Mini M4 Pro with 64GB, most of that pool (macOS keeps a share for the system) is available for model weights without a PCIe penalty. There is no cliff, only a ceiling much higher than any discrete GPU currently available at this price point.
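Whether a given model sits on the safe side of the cliff can be checked before pulling it. The sketch below is illustrative: `fits_in_vram` is a hypothetical helper, and the 1.5GB default headroom for KV cache and runtime buffers is an assumption, not a measured value. The model sizes are the approximate figures used in this lab's VRAM budget check.

```shell
# fits_in_vram MODEL_GB VRAM_GB [HEADROOM_GB]
# Prints "fits" when model size plus headroom is within VRAM, else "exceeds".
# HEADROOM_GB defaults to 1.5 (assumed allowance for KV cache and buffers).
fits_in_vram() {
  awk -v m="$1" -v v="$2" -v h="${3:-1.5}" \
    'BEGIN { r = (m + h <= v) ? "fits" : "exceeds"; print r }'
}

# Approximate model sizes checked against a 12GB card
for entry in "7B-Q4_K_M 4.1" "14B-Q4_K_M 8.4" "32B-Q4_K_M 19.8" "7B-Q8_0 7.7"; do
  name=${entry% *}
  size=${entry#* }
  printf '%-12s %5s GB  %s\n' "$name" "$size" "$(fits_in_vram "$size" 12)"
done
```

Longer contexts need more headroom, so treat a "fits" result close to the limit as something to confirm with the memory monitoring step rather than as a guarantee.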
Time to First Token (TTFT) is the delay from prompt submission to the appearance of the first output token. It is dominated by model load time (if not already in memory) and prefill computation. A long TTFT is not necessarily a bad sign — it can indicate a very large model being fully loaded. Tokens per second (T/s) is the sustained generation rate after the first token. This is the number most people cite. Both matter; neither is sufficient alone.
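Both metrics can be derived from the timing fields that Ollama's generate API returns with a completed response (`eval_count`, `eval_duration`, `load_duration`, `prompt_eval_duration`, with durations in nanoseconds). The helpers below are a minimal sketch with illustrative names. Treating TTFT as load time plus prefill time is an approximation that ignores request overhead.

```shell
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS
# Sustained generation rate: tokens generated / generation time in seconds.
tok_per_sec() {
  awk -v n="$1" -v d="$2" 'BEGIN {
    if (d <= 0) { print "n/a"; exit }
    printf "%.2f\n", n / (d / 1e9)
  }'
}

# ttft_seconds LOAD_DURATION_NS PROMPT_EVAL_DURATION_NS
# Approximate TTFT: model load time plus prefill time (assumption).
ttft_seconds() {
  awk -v l="$1" -v p="$2" 'BEGIN { printf "%.2f\n", (l + p) / 1e9 }'
}

# Hypothetical values, for illustration only
tok_per_sec 200 4000000000               # prints 50.00
ttft_seconds 12000000000 1000000000      # prints 13.00
```

The second example shows how a 13-second TTFT can be almost entirely load time, which says nothing about the generation rate that follows.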
The Qwen2.5 family from Alibaba spans: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B. Saying "I'm running Qwen 2.5" without specifying which size is like saying "I'm driving a Ford" without specifying a Fiesta or an F-350. The models have completely different memory requirements, capability profiles, and performance characteristics. This experiment uses 7B, 14B, and 32B as its primary subjects because they span the VRAM cliff of the 3060.
On Apple Silicon, two primary inference backends exist. llama.cpp with Metal is cross-platform software that includes Metal GPU acceleration for Apple hardware. It works well. MLX is Apple's own machine learning framework, purpose-built for Apple Silicon's unified memory architecture. MLX often outperforms llama.cpp on Apple Silicon, particularly at larger model sizes, because it was designed for the hardware rather than adapted to it. Ollama uses llama.cpp under the hood. For best Apple Silicon numbers, MLX benchmarks should be run separately.
Each metric in this experiment exists for a specific reason. Understanding why a metric is collected is as important as collecting it correctly.
| Metric | Unit | Why we measure it |
|---|---|---|
| Time to First Token (TTFT) | seconds | Reveals model load behavior and prefill cost. A TTFT above ~5s on a warm model (already in memory) suggests memory bandwidth limitations. A long TTFT on first run is expected and normal — it is model loading, not inference failure. |
| Sustained tokens/sec | tok/s | The primary generation rate. Measured after the first token to exclude prefill. Run at least 3 generations and average them — the first run on a cold model will always be slower due to cache warming. |
| Memory consumed | GB | Confirms the model is fully resident in fast memory. If memory consumption approaches VRAM capacity on the GPU machine, all subsequent numbers are unreliable because layers may be spilling to system RAM. This is the variable that most published comparisons fail to report. |
| GPU utilization | % | Confirms the inference is using the GPU, not falling back to CPU. On Apple Silicon, llama.cpp and MLX run on the GPU via Metal, not on the Neural Engine. A CPU-bound inference run on either platform will appear much slower than hardware-accelerated inference and constitutes a configuration error, not a hardware result. |
| Power draw during inference | watts | Optional but illuminating. Tokens-per-watt is a real-world metric that matters for always-on edge AI deployments. Apple Silicon's efficiency advantage often appears here even when raw tok/s is comparable. |
| Output quality (perplexity proxy) | qualitative | Run the same prompt on all configurations and record whether outputs are coherent and on-task. Speed is irrelevant if the model is too quantized to produce useful output. Q4_K_M is generally the lowest acceptable quantization for general use. |
In every cross-machine comparison, exactly one variable may differ: the hardware. Model name, model size, quantization level, prompt text, prompt length, context length, and temperature must be identical across all test subjects. If any of these differ, you are not measuring hardware — you are measuring the interaction of hardware with a different workload.
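One way to enforce this rule is to write each run's settings to a small key=value file and compare the two files before trusting any numbers. The sketch below is one possible approach, not part of the original procedure: the file layout, the key names, and the `uncontrolled_keys` helper are all assumptions made for illustration.

```shell
# uncontrolled_keys FILE_A FILE_B
# Each file holds key=value lines describing one run. Prints every key in
# FILE_B whose value differs from FILE_A, except "hardware", which may differ.
uncontrolled_keys() {
  awk -F= '
    NR == FNR { a[$1] = $2; next }
    $1 != "hardware" && a[$1] != $2 { print $1 }
  ' "$1" "$2"
}

# Demo with two hypothetical run manifests
tmp=$(mktemp -d)
cat > "$tmp/mac.env" <<'EOF'
hardware=mac-mini-m4-pro-64gb
model=qwen2.5:7b-instruct-q4_K_M
temperature=0
num_ctx=4096
EOF
cat > "$tmp/linux.env" <<'EOF'
hardware=rtx-3060-12gb
model=qwen2.5:14b-instruct-q4_K_M
temperature=0
num_ctx=4096
EOF
uncontrolled_keys "$tmp/mac.env" "$tmp/linux.env"   # prints: model
rm -rf "$tmp"
```

An empty result means only the hardware differs. The check walks the keys of the second file, so a key present only in the first file is not reported; keep both files on the same template.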
Machine A: Mac Mini M4 Pro (or any Apple Silicon Mac). 16GB minimum. 64GB recommended for 32B+ models. macOS Sequoia or later. Ollama + MLX installed.
Machine B: any NVIDIA GPU machine running Linux. CUDA 12.x. Note VRAM capacity carefully, because it determines the model ceiling. Ollama installed. nvidia-smi available.
Ollama: the common interface across both machines. Provides an identical API surface and model management. Ensures like-for-like comparison of the same model files. Install at ollama.com.
MLX: Apple's native inference framework. Run separately from Ollama to capture Apple Silicon's full capability. Install via pip install mlx-lm. Compare against Ollama Metal results.
qwen2.5:7b-instruct-q4_K_M
qwen2.5:14b-instruct-q4_K_M
qwen2.5:32b-instruct-q4_K_M
qwen2.5:7b-instruct-q8_0
Mac: Activity Monitor, sudo powermetrics, iStat Menus.
Linux/NVIDIA: nvidia-smi dmon, nvtop, htop.
Close all non-essential applications on both machines. Reboot both machines before the first benchmark run of the day. Wait 2 minutes after boot before running inference. Run each benchmark at least 3 times and record all values — not just the best.
Confirm Ollama is running and GPU acceleration is active before pulling any models.

    # Both machines: confirm Ollama is installed and running
    ollama --version

    # Once a model is loaded, check where it is running
    # (the PROCESSOR column should read 100% GPU)
    ollama ps

    # On Linux/NVIDIA: confirm the driver and GPU are visible
    nvidia-smi
    # Expect: GPU 0 · RTX 3060 · 12288MiB · Driver XX.X

    # On Mac: confirm the Metal backend
    # Activity Monitor → Window → GPU History should show utilization during inference

Pull models on both machines. Record the exact model tag — this is your evidence that the test is controlled. On the Linux machine, only pull models that fit within your VRAM. Attempting to run a model that exceeds VRAM will either fail or produce invalid results due to CPU offloading.

    # Pull test models (on the Linux machine, pull only those that fit in VRAM)
    ollama pull qwen2.5:7b-instruct-q4_K_M
    ollama pull qwen2.5:14b-instruct-q4_K_M
    ollama pull qwen2.5:32b-instruct-q4_K_M   # Mac only: exceeds the 3060's VRAM
    ollama pull qwen2.5:7b-instruct-q8_0

    # RTX 3060 12GB VRAM budget check:
    #   7B  Q4_K_M  ~  4.1GB  fits comfortably
    #   14B Q4_K_M  ~  8.4GB  fits with headroom
    #   32B Q4_K_M  ~ 19.8GB  EXCEEDS VRAM, skip on 3060
    #   7B  Q8_0    ~  7.7GB  fits

    # Verify model list
    ollama list

Use this exact prompt for every run on every machine. Do not modify it. The prompt is designed to produce a medium-length response (~200 tokens) which gives a stable sustained tok/s reading. Record the prompt in your lab notes verbatim.

    # Standard test prompt: use exactly as written
    PROMPT="Explain the concept of entropy in thermodynamics. Cover what it means physically, why it always increases in a closed system, and give one concrete everyday example. Be thorough but concise."

    # Approximate expected output: 180-220 tokens
    # This length gives a stable sustained generation rate reading

For each model/quant combination: run the prompt once to warm the model (cold run), then run it three more times and record those values. The Ollama API returns timing data directly. Use the following script to capture structured output.

    # Benchmark runner: run on BOTH machines
    # Requires PROMPT to be set in this shell (the standard test prompt)
    # Substitute MODEL_TAG for each test configuration
    MODEL_TAG="qwen2.5:7b-instruct-q4_K_M"

    # Warm run (discard results)
    echo "Warming model: $MODEL_TAG"
    ollama run "$MODEL_TAG" "hello" --verbose 2>&1 | tail -5

    # Benchmark runs (record these)
    for RUN in 1 2 3; do
      echo "=== Run $RUN ==="
      ollama run "$MODEL_TAG" "$PROMPT" --verbose 2>&1 | grep -E \
        "eval rate|load duration|total duration|prompt eval rate"
    done

    # Key output lines to capture:
    #   load duration:    X.XXs          ← model load (TTFT component)
    #   prompt eval rate: X.XX tokens/s  ← prefill speed
    #   eval rate:        X.XX tokens/s  ← THIS IS YOUR T/S NUMBER
    #   total duration:   X.XXs

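Because a plain grep for "eval rate" also matches the "prompt eval rate" line, it is easy to copy the wrong number into the data sheet. The helper below is a small sketch (the name `eval_rate` is illustrative) that extracts only the generation rate. The sample text imitates the shape of Ollama's verbose statistics; its values are made up.

```shell
# eval_rate: reads `ollama run --verbose` output on stdin and prints the
# generation rate in tokens/s. Matching the first two fields exactly skips
# the "prompt eval rate:" line.
eval_rate() {
  awk '$1 == "eval" && $2 == "rate:" { print $3 }'
}

# Sample in the shape of the verbose statistics (values are made up)
sample='total duration:       5.2s
load duration:        20ms
prompt eval count:    40 token(s)
prompt eval duration: 150ms
prompt eval rate:     266.67 tokens/s
eval count:           200 token(s)
eval duration:        5s
eval rate:            40.00 tokens/s'

printf '%s\n' "$sample" | eval_rate   # prints: 40.00
```

In a real run the same function would sit at the end of the pipeline, for example `ollama run "$MODEL_TAG" "$PROMPT" --verbose 2>&1 | eval_rate`.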
This step is non-optional. Memory data is the evidence that determines whether your tok/s numbers are valid. If VRAM is at or near capacity on the GPU machine, the run is invalid — the system is offloading to system RAM and tok/s will be artificially depressed.

    # On Linux/NVIDIA: run in a second terminal during inference
    watch -n 0.5 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu \
      --format=csv,noheader,nounits
    # Record: memory.used at peak during inference
    # If memory.used approaches 12000 on RTX 3060 → INVALID RUN

    # On Mac: run in a second terminal during inference
    sudo powermetrics --samplers gpu_power -i 1000 -n 20
    # Also: Activity Monitor → Memory tab → Memory Pressure
    # Green = healthy. Any yellow/red = memory pressure, run is suspect

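Reading the peak off a refreshing terminal is error-prone, so it can help to log the samples and compute the peak afterwards. The sketch below assumes the three-column CSV format queried above (memory.used, memory.free, utilization.gpu, no units). The helper name `peak_vram_check` and the 95% cutoff for "at or near capacity" are assumptions chosen for illustration.

```shell
# peak_vram_check VRAM_TOTAL_MIB [THRESHOLD_PCT]
# Reads "memory.used, memory.free, utilization.gpu" CSV lines on stdin and
# reports the peak memory.used. THRESHOLD_PCT defaults to 95 (assumed cutoff).
peak_vram_check() {
  awk -F', *' -v total="$1" -v pct="${2:-95}" '
    $1 + 0 > peak { peak = $1 + 0 }
    END {
      verdict = (peak >= total * pct / 100) ? "INVALID" : "ok"
      printf "peak=%d MiB of %d MiB: %s\n", peak, total, verdict
    }'
}

# Hypothetical samples
printf '%s\n' "4312, 7976, 97" "4420, 7868, 99" "4398, 7890, 98" \
  | peak_vram_check 12288
# prints: peak=4420 MiB of 12288 MiB: ok
```

To collect a log instead of watching the screen, the same query can be run with nvidia-smi's loop option (for example `-l 1`) and redirected to a file, which is then fed to the function with `peak_vram_check 12288 < vram.log`.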
This step is Mac-only and captures Apple Silicon's native performance ceiling, which Ollama/llama.cpp may not fully realize. Compare these results against your Ollama results to understand the headroom.

    # Install MLX inference
    pip install mlx-lm

    # Download a Qwen2.5 MLX model (4-bit quantized)
    # Find at: huggingface.co, search "mlx-community/Qwen2.5-*"

    # Run benchmark via MLX
    # (recent mlx-lm versions print generation statistics by default)
    python -m mlx_lm.generate \
      --model mlx-community/Qwen2.5-7B-Instruct-4bit \
      --prompt "$PROMPT" \
      --max-tokens 300

    # Output includes: prompt processing speed + generation speed
    # These numbers are a closer indication of Apple Silicon's ceiling

Photocopy or transcribe this sheet. Fill in one row per machine-per-model run. Do not average before recording individual runs.

Machine A: Mac Mini M4 Pro (64GB unified memory)

| Model | Quant | Mem used (GB) | TTFT (s) | Run 1 tok/s | Run 2 tok/s | Run 3 tok/s | Avg tok/s | GPU util % |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | Q4_K_M | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-7B | Q8_0 | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-14B | Q4_K_M | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-32B | Q4_K_M | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-7B | 4-bit (MLX) | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-14B | 4-bit (MLX) | __ | __ | __ | __ | __ | __ | __ |

Machine B: Linux with NVIDIA RTX 3060 (12GB VRAM)

| Model | Quant | VRAM used (GB) | TTFT (s) | Run 1 tok/s | Run 2 tok/s | Run 3 tok/s | Avg tok/s | GPU util % |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | Q4_K_M | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-7B | Q8_0 | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-14B | Q4_K_M | __ | __ | __ | __ | __ | __ | __ |
| Qwen2.5-32B | Q4_K_M | SKIP: exceeds 12GB VRAM. Leave blank. This is the point. | | | | | | |
The Qwen2.5-32B row for Machine B is intentionally left incomplete. A machine with 12GB of VRAM cannot run a 19.8GB model. This is not a configuration problem to be solved — it is a capability boundary. The Mac Mini row for the same model will have real numbers. That difference is the story this experiment is measuring.
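Once the three individual runs for a row are recorded, the average column can be filled in with a small helper such as the one sketched below. The name `run_stats` is illustrative. It also prints the minimum and maximum, since a wide spread between runs is itself worth noting.

```shell
# run_stats V1 V2 V3 ...
# Prints the mean, minimum, and maximum of the recorded tok/s values.
run_stats() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | awk '
    {
      v = $1 + 0
      if (NR == 1) { min = v; max = v }
      sum += v
      if (v < min) min = v
      if (v > max) max = v
    }
    END { printf "avg=%.2f min=%.2f max=%.2f\n", sum / NR, min, max }'
}

run_stats 50.0 52.0 54.0   # hypothetical readings
# prints: avg=52.00 min=50.00 max=54.00
```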
When you have data in all cells, read it as follows. Each rubric item covers a specific pattern you will likely see in your results.
When your data sheet is complete, the finding is not "Mac wins" or "GPU wins." That is the wrong frame — and the frame that produces bad benchmark content. The correct finding is a capability-vs-speed profile for each platform.
At model sizes that fit in VRAM, discrete GPUs are faster. The RTX 3060 will outperform the Mac Mini on 7B and likely 14B models in raw tok/s. This is the architecture working as designed — high-bandwidth VRAM + CUDA pipeline is purpose-built for this.
At model sizes that exceed VRAM, only one machine can run the task. The Mac Mini's 64GB unified memory ceiling is approximately 5x the 3060's VRAM. Qwen2.5-32B runs on one machine and not the other. Qwen2.5-72B (with the M5 Ultra) will extend this further.
The correct question is not which is faster, but which fits your use case. If you need maximum throughput on 7B models and can live with the VRAM ceiling: discrete GPU. If you need to run larger models locally, want sustained inference without VRAM cliff risk, or are building edge AI infrastructure: Apple Silicon.
The video that prompted this experiment reported a number without context. This experiment produces context. That is the difference between a benchmark and a data point, and between a practitioner and a spectator.
Once your baseline data is collected, these extensions add depth to the story.
Add power consumption measurement to your runs. Use a smart plug with power monitoring on both machines. Calculate tok/s / watts. Apple Silicon's efficiency advantage typically becomes decisive here, especially for always-on deployments.
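The efficiency figure is a single division, sketched below with an illustrative helper name and hypothetical readings. Since a watt is one joule per second, tok/s divided by watts is equivalent to tokens per joule.

```shell
# tok_per_watt TOK_PER_SEC WATTS
# Efficiency: sustained generation rate divided by wall power during inference.
tok_per_watt() {
  awk -v t="$1" -v w="$2" 'BEGIN {
    if (w <= 0) { print "n/a"; exit }
    printf "%.2f\n", t / w
  }'
}

# Hypothetical readings, for illustration only
tok_per_watt 50 200   # prints 0.25
tok_per_watt 40 40    # prints 1.00
```

Use the power reading taken during generation, not the idle reading, and measure both machines at the wall so that the comparison includes the whole system.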
Run continuous inference for 30 minutes. Record tok/s every 5 minutes. Discrete GPU systems under sustained load may exhibit thermal throttling. Apple Silicon systems are designed for sustained performance.
Repeat the entire experiment on the Mac using MLX instead of Ollama. The gap between Ollama and MLX results across model sizes reveals how much of Apple Silicon's capability is currently left on the table by cross-platform inference tools.
Run two simultaneous inference requests on each machine. Record combined throughput and per-request latency. Unified memory architecture handles concurrency differently than VRAM — this test may reveal an Apple Silicon advantage not visible in single-request benchmarks.
When you publish this data on the AI Lab, include: hardware specs in full, exact model tag (name + size + quantization), Ollama version, operating system version, ambient temperature, and whether runs were warm or cold. Any published benchmark missing any of these variables is not reproducible and should not be cited as evidence. Hold your own work to this standard.
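Most of that checklist can be captured automatically at the time of the run. The function below is a minimal sketch: the name `write_manifest` and the key names are assumptions, and the hardware description and ambient temperature are left as placeholders to fill in by hand because they cannot be read portably.

```shell
# write_manifest MODEL_TAG RUN_STATE
# Prints key=value lines covering the reproducibility checklist.
# RUN_STATE is "warm" or "cold".
write_manifest() {
  local model_tag="$1" run_state="$2" ollama_version="not installed"
  if command -v ollama >/dev/null 2>&1; then
    ollama_version=$(ollama --version 2>/dev/null | tail -n 1)
  fi
  printf 'date_utc=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'os=%s %s\n' "$(uname -s)" "$(uname -r)"
  printf 'arch=%s\n' "$(uname -m)"
  printf 'ollama_version=%s\n' "$ollama_version"
  printf 'model_tag=%s\n' "$model_tag"
  printf 'run_state=%s\n' "$run_state"
  printf 'hardware=%s\n' "FILL IN: CPU, GPU, memory"
  printf 'ambient_temp_c=%s\n' "FILL IN"
}

write_manifest "qwen2.5:7b-instruct-q4_K_M" warm
```

Redirecting the output to a file next to the data sheet (one file per machine per model) keeps the context attached to the numbers.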