M-series architecture. Neural Engine. Metal Performance Shaders. Unified memory. The hardware substrate that makes on-device AI not just possible but inevitable. This is where the mathematics becomes silicon.
Apple Silicon's unified memory architecture eliminates the bottleneck between CPU and GPU. The Neural Engine adds a third compute domain purpose-built for inference. This is not a GPU with extra cores — it's a fundamentally different approach to computation.
CPU: performance and efficiency cores. The control plane. Orchestration, scheduling, sequential logic. Where SwiftVector's kernel runs.
GPU: massively parallel SIMD execution. Metal Performance Shaders. Training, matrix multiplication, convolution. Where the linear algebra lives.
Neural Engine: 16-core dedicated ML accelerator. 15.8 TOPS on M1, scaling to 38 TOPS on M4. Where CoreML models execute at wire speed.
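The three domains above can be collected into a small sketch. The roles and TOPS figures are the ones quoted in this section; treat them as illustrative rather than a full spec sheet:

```python
# The three compute domains described above, with the figures quoted in this section.
domains = {
    "CPU": "control plane: orchestration, scheduling, sequential logic",
    "GPU": "massively parallel SIMD: training, matmul, convolution",
    "Neural Engine": "16-core dedicated ML accelerator",
}

# Neural Engine throughput quoted above (TOPS), illustrative only.
tops_m1, tops_m4 = 15.8, 38.0
scaling = tops_m4 / tops_m1
print(f"M1 -> M4 Neural Engine scaling: {scaling:.1f}x")
```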
In a discrete GPU system, data must be copied between CPU memory and GPU memory over a PCIe bus. This copy is the bottleneck. Apple Silicon eliminates it — CPU, GPU, and Neural Engine share the same memory pool. No copies. No bus transfers. The tensor stays where it is and every compute domain can access it at full bandwidth.
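To put a rough number on the copy bottleneck, here is a back-of-the-envelope sketch. The PCIe bandwidth and model size below are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope: staging data over a discrete-GPU bus vs. unified memory.
# The bandwidth figure is an illustrative assumption, not a measurement.
PCIE_GBPS = 16.0  # assumed effective host-to-device bandwidth, GB/s

def transfer_ms(nbytes: float, gbps: float = PCIE_GBPS) -> float:
    """Milliseconds to copy nbytes across a bus at the given bandwidth."""
    return nbytes / (gbps * 1e9) * 1e3

weights = 7e9 * 2  # a hypothetical 7B-parameter model in fp16, in bytes
print(f"PCIe staging copy: {transfer_ms(weights):.0f} ms")
print("Unified memory:      0 ms (no copy; every domain reads the same pool)")
```

At these assumed numbers, just staging the weights costs on the order of a second before any compute starts; on unified memory that step simply does not exist.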
MPS is Apple's GPU compute framework optimized for the M-series architecture. MPSGraph provides a computation graph abstraction — you define operations, MPS schedules them across GPU cores and Neural Engine automatically. PyTorch's MPS backend makes this accessible from Python.
import torch

# Check MPS availability
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Using Apple Silicon: {device}")
else:
    device = torch.device("cpu")

# Move model to Apple Silicon
model = MyModel().to(device)
tensor = torch.randn(64, 768).to(device)

# Inference runs on GPU + Neural Engine
with torch.no_grad():
    output = model(tensor)

import CoreML
// Load a compiled CoreML model
let config = MLModelConfiguration()
config.computeUnits = .all // CPU + GPU + Neural Engine
let model = try MyModel(configuration: config)
let prediction = try model.prediction(input: inputFeatures)
// The runtime decides which compute unit
// handles each layer — automatically

The canvas below will visualize memory bandwidth utilization across CPU, GPU, and Neural Engine in real time. See how unified memory eliminates the copy bottleneck.
Interactive canvas coming soon. This lab is under active development.
Unified memory utilization across compute domains
SwiftVector's kernel runs on the CPU cores. The constraints it evaluates are pure functions — deterministic, auditable, replayable. But the AI models those constraints govern run on the GPU and Neural Engine. Understanding the hardware is understanding why governance must be separated from inference.
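The separation this paragraph describes can be sketched in a few lines: a constraint is a pure function of its inputs, so the same inputs replay to the same verdict no matter which compute domain produced the model output. The names below are hypothetical illustrations, not SwiftVector's actual API:

```python
# A hypothetical constraint kernel: pure, deterministic, replayable.
# None of these names come from SwiftVector's real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    action: str
    confidence: float

def within_confidence(decision: Decision, floor: float = 0.9) -> bool:
    """Pure constraint: no I/O, no clock, no randomness — same input, same verdict."""
    return decision.confidence >= floor

# By the time the constraint (CPU) sees it, the model's output
# (GPU / Neural Engine) is just data, so any past decision can be replayed:
log = [Decision("approve", 0.97), Decision("approve", 0.62)]
verdicts = [within_confidence(d) for d in log]
print(verdicts)  # identical on every replay
```

Because the constraint touches no mutable state, auditing reduces to re-running it over the logged inputs, independent of the hardware the inference ran on.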
— The forge thesis