From Neurons to Neutrons: A Case Study in Interpretability
Authors
Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz, Sokratis Trifinopoulos, Mike Williams
Abstract
Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.
Concepts
The Big Picture
Hand a student a massive table of measurements (thousands of rows of numbers describing atomic nuclei) and ask them to find the patterns. No textbook, no hints, no physics curriculum. Now imagine that student not only learns to predict new measurements accurately, but spontaneously invents concepts that took physicists decades to discover. That is, roughly, what researchers at IAIFI found when they cracked open neural networks trained on nuclear data.
Mechanistic interpretability asks a simple question: when a neural network makes a prediction, how does it actually do it? Prior work showed that networks trained on basic arithmetic can invent surprising internal algorithms, sometimes several at once, depending on random initialization. One reading of that result is pessimistic: maybe neural network internals are too unpredictable to tell us anything useful about the real world. This paper argues the opposite.
A team of physicists and ML researchers trained networks to reproduce experimental nuclear data, then dissected what those networks had learned. What they found wasn’t arbitrary computational machinery. It was nuclear physics, reconstructed from scratch.
Key Insight: Neural networks trained on nuclear data spontaneously learn internal representations that mirror the conceptual structures physicists already use. Interpretability tools can pull genuine scientific knowledge out of these models, not just better predictions.
How It Works
The researchers begin with a warm-up: modular arithmetic, the math of clock faces and remainders. When a small neural network (an MLP, or multi-layer perceptron) learns to add numbers in this system, its embeddings (the internal vectors representing each number) arrange themselves in a perfect circle. The network computes a sum by averaging two points on the circle and reading off which slice the midpoint falls in.
Nobody imposed this circular structure. It emerges purely from training, and it matches how humans naturally teach modular arithmetic.
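To make that concrete, here is a minimal numpy sketch of the mechanism, using idealized circular embeddings rather than learned ones; the modulus p = 13 and the angle-doubling readout are our illustrative choices, not the paper's code.

```python
import numpy as np

p = 13  # modulus; odd, so no two embeddings land exactly opposite each other

def embed(k):
    """Idealized version of the learned embedding: residue k at angle 2*pi*k/p."""
    theta = 2 * np.pi * k / p
    return np.array([np.cos(theta), np.sin(theta)])

def mod_add(a, b):
    """Average the two points, then decode the midpoint. Its angle is
    (theta_a + theta_b)/2 up to a half-turn, so doubling the angle recovers
    the angle of (a + b) mod p unambiguously. (One clean readout; a trained
    network's actual readout circuitry is messier.)"""
    mid = (embed(a) + embed(b)) / 2
    doubled_angle = 2 * np.arctan2(mid[1], mid[0])
    return int(np.round(doubled_angle * p / (2 * np.pi))) % p

# The circular geometry alone suffices for exact modular addition:
assert all(mod_add(a, b) == (a + b) % p for a in range(p) for b in range(p))
```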
This circular example establishes the method: latent space topography (LST). Instead of inspecting individual neurons, the researchers project the network’s high-dimensional internal representations down to their first two or three principal components (PCs), the directions that capture the most variation in the data. Think of it as a topographic survey: sample elevation at a grid of points and reconstruct the terrain.
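In code, the survey step is just principal component analysis. A minimal sketch, assuming the embeddings are stacked into a (num_tokens, dim) array; the helper name is ours:

```python
import numpy as np

def top_principal_components(embeddings, k=3):
    """Project a (num_tokens, dim) embedding matrix onto its top-k principal
    components, the coordinates used for the topographic picture."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T  # (num_tokens, k) coordinates, ready to plot
```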
With that technique validated, they turn to nuclear physics. The setup: train a neural network to predict properties of atomic nuclei, including binding energies (the energy holding a nucleus together), given only the number of protons (Z) and neutrons (N). No shell models, no magic numbers, no physics priors. Just data.
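A minimal PyTorch stand-in for that setup; the class name, sizes, and the plain-MLP architecture are our assumptions for illustration, not the paper's exact model:

```python
import torch
import torch.nn as nn

class NucleusModel(nn.Module):
    """Embed proton and neutron counts, then regress a nuclear property
    (e.g., binding energy) with a small MLP. Only (Z, N) go in; any
    physics has to be learned from the data."""
    def __init__(self, max_z=120, max_n=180, dim=64, hidden=256):
        super().__init__()
        self.z_emb = nn.Embedding(max_z, dim)
        self.n_emb = nn.Embedding(max_n, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, n):
        return self.mlp(torch.cat([self.z_emb(z), self.n_emb(n)], dim=-1))
```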
They extract the learned embeddings for each neutron number and project them onto the first three principal components.
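With the hypothetical names from the two sketches above, that extraction step is a few lines:

```python
model = NucleusModel()  # in practice: the model after training on nuclear data
neutron_embeddings = model.n_emb.weight.detach().cpu().numpy()
coords_3d = top_principal_components(neutron_embeddings, k=3)
# coords_3d[N] is the 3D point for neutron number N, ready for a scatter plot.
```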
Out comes a helix. The embeddings spiral through three-dimensional space in a corkscrew pattern, encoding layers of nuclear structure simultaneously. When the researchers repeat the experiment on synthetic data generated from an established theory (the liquid drop model with shell corrections), a structurally identical helix appears.
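As a cartoon of what "helix" means geometrically (our parametrization, purely to fix intuition, not a fit from the paper): one principal component advances steadily with N while the other two wind around it.

```python
import numpy as np

def cartoon_helix(n_values, pitch=0.1, radius=1.0, omega=0.5):
    """Idealized corkscrew: a steady drift along one axis plus a circular
    sweep in the other two. All parameter values here are arbitrary."""
    t = np.asarray(n_values, dtype=float)
    return np.stack(
        [pitch * t, radius * np.cos(omega * t), radius * np.sin(omega * t)],
        axis=-1,
    )
```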
The helix isn’t decorative. Its geometry encodes specific physical concepts:
- Nuclear shells: Nuclei with certain “magic numbers” of protons or neutrons (2, 8, 20, 28, 50, 82, 126) are unusually stable, much like filled electron shells in chemistry. These show up as distinctive kinks or clustering along the helix.
- Pairing effects: Nuclei with even numbers of neutrons are more stable than odd-neutron nuclei, a quantum effect from nucleon pairing. This produces a systematic wobble in the helix.
- Bulk nuclear matter: The overall scale of the helix encodes the dominant contribution to binding energy, the same term that anchors the liquid drop model physicists derived decades ago (written out in the sketch below).
None of these concepts were given to the network. They fell out of fitting the data.
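For comparison, the hand-built counterpart to those features is the semi-empirical mass formula of the liquid drop model. A minimal Python version, using one common set of textbook coefficients (published fits vary) and omitting the shell-correction terms that produce the magic-number kinks:

```python
def semf_binding_energy(Z, N):
    """Semi-empirical mass formula (liquid drop model + pairing), in MeV."""
    A = Z + N
    a_V, a_S, a_C, a_A, a_P = 15.8, 18.3, 0.714, 23.2, 12.0  # textbook fit
    if Z % 2 == 0 and N % 2 == 0:
        delta = a_P / A**0.5    # even-even: extra binding from pairing
    elif Z % 2 == 1 and N % 2 == 1:
        delta = -a_P / A**0.5   # odd-odd: pairing penalty
    else:
        delta = 0.0
    return (a_V * A                           # bulk term: the dominant scale
            - a_S * A**(2 / 3)                # surface correction
            - a_C * Z * (Z - 1) / A**(1 / 3)  # Coulomb repulsion
            - a_A * (N - Z)**2 / A            # neutron-proton asymmetry
            + delta)                          # pairing: the even/odd wobble
```

The bulk term sets the helix's overall scale and the pairing term its wobble; the network recovered both from data alone. (Sanity check: for iron-56 this gives about 490 MeV against a measured binding energy of roughly 492 MeV.)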
Why It Matters
The standard story about neural network interpretability is cautionary: networks can implement wildly different algorithms depending on initial conditions, so extracting universal lessons is hard. But for scientific data generated by real physical laws, the picture changes. Physical laws constrain what representations are useful, and networks tend to converge on ones that actually reflect the underlying physics.
This reframes what interpretability tools are for. Instead of only asking “how does this model make predictions?”, you can ask “what does this model’s internal structure reveal about the data?” In domains where human understanding is incomplete (exotic nuclei, novel materials, poorly understood particle interactions), a trained network’s internal representations could point toward concepts that theorists haven’t articulated yet.
There is a practical takeaway on the ML side, too. High-dimensional networks, even trained on messy real-world data, can learn low-dimensional representations that are both accurate and interpretable. This is evidence for the manifold hypothesis, the idea that natural data concentrates near low-dimensional surfaces in high-dimensional space, not just as a mathematical nicety but as a description of what networks actually learn.
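A quick, generic way to quantify that claim for any embedding matrix (our helper, not from the paper) is to check how much variance the leading principal directions capture:

```python
import numpy as np

def explained_variance_ratio(embeddings):
    """Fraction of total variance along each principal direction."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)

# If the first handful of entries sum to nearly 1, the representation
# effectively lives on a low-dimensional surface inside the embedding space.
```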
Bottom Line: Applying mechanistic interpretability to nuclear physics shows that neural networks can rediscover established scientific knowledge on their own, turning interpretability from a tool for auditing AI into a tool for doing science.
IAIFI Research Highlights
The work connects mechanistic interpretability with nuclear structure physics, showing that analytical tools developed to understand arithmetic-solving networks can extract established physical concepts from data-driven models.
The paper pushes mechanistic interpretability beyond algorithmic toy problems. Latent space topography and PCA-based analysis reveal scientifically meaningful low-dimensional structure in networks trained on real experimental data.
By recovering nuclear shell structure, magic numbers, and pairing effects from learned representations, all without physics priors, the work opens a route for using neural networks as hypothesis-generating tools in nuclear and particle physics.
Future directions include applying this framework to domains where human theory is incomplete, such as exotic nuclei and strongly coupled systems. The paper is available as [arXiv:2405.17425](https://arxiv.org/abs/2405.17425).