The capstone of the Quadrivium. Astronomy trained science to reason under radical observational constraint: to draw precise conclusions from incomplete, indirect, unrepeatable data about subjects you can neither touch, move, nor rerun. That is the defining epistemological condition of every practitioner working with a model they didn't build.
“The astronomer cannot move the star. Every conclusion is inference from light that left its source before the observer was born.”
— the defining condition of the discipline

The astronomer's defining constraint: the stars do not move on command. Every conclusion must be derived from observation of a system that proceeds entirely on its own terms. You can point the telescope in a different direction. You cannot make the Crab Nebula expand faster to test your model of it.
This is the condition of AI evaluation. You cannot observe what a model knows — you can only observe its outputs. Every attribution technique — attention visualisation, probing classifiers, activation patching — is an instrument for observing indirect evidence of something you cannot see directly.
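One such instrument can be sketched in miniature. A probing classifier infers whether a property is encoded in a model's hidden activations by checking whether a simple classifier can decode it from them. The "activations" below are synthetic stand-ins, not drawn from any real model, and the nearest-class-mean rule is about the simplest probe there is; the point is the inference pattern, not the apparatus.

```python
import random

random.seed(0)

# Synthetic stand-in for hidden activations: dimension 0 weakly encodes
# the property of interest, dimension 1 is pure noise.
def fake_activation(label: int) -> list[float]:
    return [label + random.gauss(0, 0.2), random.gauss(0, 1.0)]

data = [(fake_activation(y), y) for y in [0, 1] * 200]

# Fit a nearest-class-mean probe: classify by the closer class centroid.
def mean(vectors: list[list[float]]) -> list[float]:
    return [sum(component) / len(component) for component in zip(*vectors)]

mu0 = mean([x for x, y in data if y == 0])
mu1 = mean([x for x, y in data if y == 1])

def probe(x: list[float]) -> int:
    d0 = sum((a - b) ** 2 for a, b in zip(x, mu0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))
    return 0 if d0 < d1 else 1

# High probe accuracy is indirect evidence that the property is
# linearly decodable from the activations -- never a direct observation.
accuracy = sum(probe(x) == y for x, y in data) / len(data)
print(f"probe accuracy: {accuracy:.2f}")
```

The probe never sees inside the model; it sees only activations, and we infer encoding from decodability, exactly the telescope's relationship to the star.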
In 1929, Hubble observed that light from distant galaxies was systematically shifted toward longer wavelengths, and that the shift grew with distance. Redshift gave him recession velocity; Cepheid variables gave him distance; the correlation between the two gave him the expansion of the universe itself. He could not travel to a galaxy and measure it directly. He read the light.
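The first step of that inference, from wavelength shift to velocity, can be sketched directly. The observed wavelength below is an illustrative value for the hydrogen-alpha line, not one of Hubble's actual measurements.

```python
C_KM_S = 299_792.458  # speed of light, km/s

def recession_velocity(observed_nm: float, rest_nm: float) -> float:
    """Non-relativistic approximation v ~= c * z, valid for small z."""
    z = (observed_nm - rest_nm) / rest_nm
    return C_KM_S * z

# Hydrogen-alpha has a rest wavelength of 656.28 nm; suppose we
# observe the same line at 660.0 nm in a galaxy's spectrum.
v = recession_velocity(660.0, 656.28)
print(f"v ~= {v:.0f} km/s")  # roughly 1700 km/s
```

Everything in the computation is read from the light; nothing is measured at the source.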
An embedding distance works by the same logic. Two tokens separated by a large cosine distance are inferred to be semantically dissimilar. The distance encodes a relationship that was never explicitly taught — it emerged from training on the structure of language itself. We infer meaning from position the way Hubble inferred velocity from wavelength.
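The inference pattern can be made concrete. The three vectors below are tiny hand-made stand-ins for learned embeddings; only the geometry of the argument matters.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity: small distance, similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen so two point in similar directions.
star    = [0.9, 0.8, 0.1]
nebula  = [0.8, 0.9, 0.2]
invoice = [0.1, 0.2, 0.9]

# Smaller distance is read as greater semantic similarity -- an
# inference from position, never a direct observation of meaning.
close = cosine_distance(star, nebula) < cosine_distance(star, invoice)
print(close)  # True
```

No one taught the model that "star" and "nebula" belong together; the geometry is the evidence, and similarity is the inference drawn from it.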
Astronomers at Harvard classified hundreds of thousands of stellar spectra by hand. Annie Jump Cannon developed the OBAFGKM spectral classification system; Cecilia Payne discovered that stars are made mostly of hydrogen. Both results, the taxonomy and the composition, were inferences from spectral lines alone.
The Vera Rubin Observatory will survey the entire visible sky every three nights, producing approximately 20 TB of raw data per night. Every transient must be identified within minutes. Astronomy invented the machine learning pipeline before machine learning existed. The data volume outgrew human classification capacity.
Stellar parallax measures distance by observing apparent shift as Earth orbits the Sun. Two observations, six months apart, from different positions. The angle of displacement encodes the distance. No direct measurement is possible. Only triangulation from known baselines.
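The triangulation reduces to a single definition: with Earth's 1 AU baseline, a parallax angle of one arcsecond corresponds by definition to one parsec, so distance in parsecs is the reciprocal of the parallax in arcseconds.

```python
def distance_parsecs(parallax_arcsec: float) -> float:
    """Distance from annual parallax angle, by definition of the parsec."""
    return 1.0 / parallax_arcsec

# Proxima Centauri's measured parallax is about 0.768 arcseconds,
# placing it roughly 1.3 parsecs away.
d = distance_parsecs(0.768)
print(f"{d:.2f} pc")
```

Two observations, one known baseline, one angle: the distance is computed, never paced out.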
Calibration in machine learning works the same way. Benchmarks are baselines. Evaluation is parallax. The model’s distance from correct behaviour is inferred, never observed.
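The analogy can be sketched as code. The `toy_model` and two-item benchmark below are hypothetical stand-ins; the point is that the error rate is an inference from outputs against a known baseline, not a direct observation of what the model knows.

```python
def inferred_distance(model, benchmark: list[tuple[str, str]]) -> float:
    """Fraction of known-answer items the model gets wrong: an indirect
    estimate of its distance from correct behaviour."""
    errors = sum(model(prompt) != answer for prompt, answer in benchmark)
    return errors / len(benchmark)

# A toy model (a lookup table) and a toy benchmark, for illustration only.
toy_model = {"2+2": "4", "capital of France": "Lyon"}.get
benchmark = [("2+2", "4"), ("capital of France", "Paris")]

gap = inferred_distance(toy_model, benchmark)
print(gap)  # 0.5 -- wrong on one of two items
```

The benchmark is the known baseline; the model's outputs are the observed displacement; the distance from correct behaviour is triangulated, exactly as in parallax.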
Astronomy of Inference is the Quadrivium’s capstone. Interactive H-R diagram visualisation, parallax measurement simulator, and spectral classification exercises are being designed. The Vera Rubin Observatory dataset integration is under exploration.