M-series architecture. Neural Engine. Metal Performance Shaders. Unified memory. The hardware substrate that makes on-device AI not just possible but inevitable. This is where the mathematics becomes silicon.
Apple Silicon's unified memory architecture eliminates the bottleneck between CPU and GPU. The Neural Engine adds a third compute domain purpose-built for inference. This is not a GPU with extra cores — it's a fundamentally different approach to computation.
CPU: performance and efficiency cores. The control plane. Orchestration, scheduling, sequential logic. Where SwiftVector's kernel runs.
GPU: massively parallel SIMD execution. Metal Performance Shaders. Training, matrix multiplication, convolution. Where the linear algebra lives.
Neural Engine: 16-core dedicated ML accelerator. 15.8 TOPS on M1, scaling to 38 TOPS on M4. Where CoreML models execute at wire speed.
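The three domains above can be collected into a small sketch. The roles and TOPS figures are the ones quoted in this section; treat them as illustrative rather than a full spec sheet:

```python
# The three compute domains described above, with the figures quoted in this section.
domains = {
    "CPU": "control plane: orchestration, scheduling, sequential logic",
    "GPU": "massively parallel SIMD: training, matmul, convolution",
    "Neural Engine": "16-core dedicated ML accelerator",
}

# Neural Engine throughput quoted above (TOPS), illustrative only.
tops_m1, tops_m4 = 15.8, 38.0
scaling = tops_m4 / tops_m1
print(f"M1 -> M4 Neural Engine scaling: {scaling:.1f}x")
```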
In a discrete GPU system, data must be copied between CPU memory and GPU memory over a PCIe bus. This copy is the bottleneck. Apple Silicon eliminates it — CPU, GPU, and Neural Engine share the same memory pool. No copies. No bus transfers. The tensor stays where it is and every compute domain can access it at full bandwidth.
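To put a rough number on the copy bottleneck, here is a back-of-the-envelope sketch. The PCIe bandwidth and model size below are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope: staging data over a discrete-GPU bus vs. unified memory.
# The bandwidth figure is an illustrative assumption, not a measurement.
PCIE_GBPS = 16.0  # assumed effective host-to-device bandwidth, GB/s

def transfer_ms(nbytes: float, gbps: float = PCIE_GBPS) -> float:
    """Milliseconds to copy nbytes across a bus at the given bandwidth."""
    return nbytes / (gbps * 1e9) * 1e3

weights = 7e9 * 2  # a hypothetical 7B-parameter model in fp16, in bytes
print(f"PCIe staging copy: {transfer_ms(weights):.0f} ms")
print("Unified memory:      0 ms (no copy; every domain reads the same pool)")
```

At these assumed numbers, just staging the weights costs on the order of a second before any compute starts; on unified memory that step simply does not exist.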
MPS is Apple's GPU compute framework optimized for the M-series architecture. MPSGraph provides a computation graph abstraction — you define operations, MPS schedules them across GPU cores and Neural Engine automatically. PyTorch's MPS backend makes this accessible from Python.
import torch

# Check MPS availability
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Using Apple Silicon: {device}")
else:
    device = torch.device("cpu")

# Move model to Apple Silicon
model = MyModel().to(device)
tensor = torch.randn(64, 768).to(device)

# Inference runs on GPU + Neural Engine
with torch.no_grad():
    output = model(tensor)

import CoreML
// Load a compiled CoreML model
let config = MLModelConfiguration()
config.computeUnits = .all // CPU + GPU + Neural Engine
let model = try MyModel(configuration: config)
let prediction = try model.prediction(input: inputFeatures)
// The runtime decides which compute unit
// handles each layer — automatically

The canvas below will visualize memory bandwidth utilization across CPU, GPU, and Neural Engine in real time. See how unified memory eliminates the copy bottleneck.
Interactive canvas coming soon. This lab is under active development.
Unified memory utilization across compute domains
SwiftVector's kernel runs on the CPU cores. The constraints it evaluates are pure functions — deterministic, auditable, replayable. But the AI models those constraints govern run on the GPU and Neural Engine. Understanding the hardware is understanding why governance must be separated from inference.
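The separation this paragraph describes can be sketched in a few lines: a constraint is a pure function of its inputs, so the same inputs replay to the same verdict no matter which compute domain produced the model output. The names below are hypothetical illustrations, not SwiftVector's actual API:

```python
# A hypothetical constraint kernel: pure, deterministic, replayable.
# None of these names come from SwiftVector's real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    action: str
    confidence: float

def within_confidence(decision: Decision, floor: float = 0.9) -> bool:
    """Pure constraint: no I/O, no clock, no randomness — same input, same verdict."""
    return decision.confidence >= floor

# By the time the constraint (CPU) sees it, the model's output
# (GPU / Neural Engine) is just data, so any past decision can be replayed:
log = [Decision("approve", 0.97), Decision("approve", 0.62)]
verdicts = [within_confidence(d) for d in log]
print(verdicts)  # identical on every replay
```

Because the constraint touches no mutable state, auditing reduces to re-running it over the logged inputs, independent of the hardware the inference ran on.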
— The forge thesis