Non-parametric Lagrangian biasing from the insights of neural nets
Authors
Xiaohan Wu, Julian B. Munoz, Daniel J. Eisenstein
Abstract
We present a Lagrangian model of galaxy clustering bias in which we train a neural net using the local properties of the smoothed initial density field to predict the late-time mass-weighted halo field. By fitting the mass-weighted halo field in the AbacusSummit simulations at z=0.5, we find that including three coarsely spaced smoothing scales gives the best recovery of the halo power spectrum. Adding more smoothing scales may lead to 2-5% underestimation of the large-scale power and can cause the neural net to overfit. We find that the fitted halo-to-mass ratio can be well described by two directions in the original high-dimensional feature space. Projecting the original features into these two principal components and re-training the neural net either reproduces the original training result, or outperforms it with a better match of the halo power spectrum. The elements of the principal components are unlikely to be assigned physical meanings, partly owing to the features being highly correlated between different smoothing scales. Our work illustrates a potential need to include multiple smoothing scales when studying galaxy bias, and this can be done easily with machine-learning methods that can take in a high-dimensional input feature space.
Concepts
The Big Picture
Imagine trying to predict where cities will grow on a continent, knowing only the terrain as it existed millions of years ago. You have mountains, valleys, rivers, but the patterns that emerge depend on subtle interactions between all these features in ways that resist simple rules.
Now scale that up to the entire universe. Where galaxies form depends on the initial density ripples laid down just after the Big Bang: tiny fluctuations in how evenly matter was spread across space. The connection between those primordial bumps and the galaxies we observe today involves billions of years of gravity slowly pulling matter together. Getting that connection right is a core problem in cosmology.
Astronomers describe this connection through galaxy bias, a mathematical relationship between the underlying matter distribution and where galaxies actually sit. The traditional approach writes this as a polynomial expansion: galaxy density equals a constant times matter density, plus another constant times density squared, and so on. Clean. Physically motivated.
But as surveys grow more precise, cracks appear. The polynomial can produce nonsensical results, like predicting negative galaxy densities in regions that happen to be emptier than average.
A Harvard-led team has taken a machine-learning approach, using neural networks to learn the bias function directly from simulations without imposing any polynomial form. The network recovers the halo power spectrum, a statistical measure of how clustered dark matter halos are at different scales, to within 1–2% accuracy. It also points toward a surprisingly compact description of how halos form. (Dark matter halos are the massive, gravitationally bound structures of invisible matter that anchor galaxies in place.)
Key Insight: By training a neural network on initial cosmic density fields, the researchers discovered that galaxy formation bias can be captured by just two dominant directions in a high-dimensional feature space, recovering the halo power spectrum to within 1–2% at large scales.
How It Works
The starting point is the Lagrangian framework, which follows individual patches of matter through cosmic time rather than watching fixed points in space. Instead of asking “what’s at this location today?”, you ask “where was this clump of matter at the very beginning, and what was its environment like?”
Each point in that initial field carries a set of descriptors:
- Overdensity δ: how much denser this patch is compared to the cosmic average
- Laplacian ∇²δ: how sharply the density peaks or dips, capturing the curvature of the local density profile
- Tidal shear G₂: how much the surrounding gravitational field is trying to stretch or compress that region
These are computed at multiple smoothing scales. The density map is blurred at a range of resolutions, from fine-grained to coarse, so each particle carries a vector of features describing its environment at several radii.
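On a gridded density field, all three descriptors can be computed with FFTs. A minimal sketch in Python, assuming a Gaussian smoothing window and the common convention G₂ = sᵢⱼsᵢⱼ for the traceless tidal tensor; the paper's exact window and normalization conventions may differ:

```python
import numpy as np

def lagrangian_features(delta, box_size, scales):
    """Compute (delta, laplacian, G2) at several Gaussian smoothing radii.

    delta: 3D overdensity grid; box_size and scales in the same length units.
    Illustrative sketch -- not the paper's exact pipeline.
    """
    n = delta.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=box_size / n)   # angular wavenumbers
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    inv_k2 = np.where(k2 > 0, 1.0, 0.0) / np.where(k2 > 0, k2, 1.0)
    delta_k = np.fft.fftn(delta)

    feats = []
    for R in scales:
        dk = delta_k * np.exp(-0.5 * k2 * R**2)         # Gaussian window
        d = np.fft.ifftn(dk).real                       # smoothed overdensity
        lap = np.fft.ifftn(-k2 * dk).real               # Laplacian of smoothed field

        # Traceless tidal tensor s_ij = (k_i k_j / k^2 - delta_ij / 3) delta_k,
        # then G2 = s_ij s_ij (sum over both indices).
        kvec = (kx, ky, kz)
        G2 = np.zeros_like(d)
        for i in range(3):
            for j in range(3):
                s_ij = np.fft.ifftn(
                    (kvec[i] * kvec[j] * inv_k2 - (i == j) / 3.0) * dk
                ).real
                G2 += s_ij * s_ij
        feats += [d, lap, G2]
    return np.stack(feats)   # shape: (3 * len(scales), n, n, n)
```

With three smoothing radii this yields the nine-dimensional per-particle feature vector the network consumes.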
The neural network learns a weight function: given these initial conditions, how much halo mass forms here? The architecture is modest (three fully connected layers with 20 neurons each) and trained on particles from the AbacusSummit simulation suite at redshift z=0.5.
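An architecture this small can be written down directly. A PyTorch sketch, where the ReLU activations and the single linear output head (one halo-mass weight per particle) are our assumptions, not details taken from the paper:

```python
import torch
from torch import nn

def make_bias_net(n_features: int) -> nn.Sequential:
    """Three fully connected hidden layers of 20 neurons, as described in the text.

    Activation choice and output head are illustrative assumptions.
    """
    return nn.Sequential(
        nn.Linear(n_features, 20), nn.ReLU(),
        nn.Linear(20, 20), nn.ReLU(),
        nn.Linear(20, 20), nn.ReLU(),
        nn.Linear(20, 1),   # predicted halo-mass weight for one particle
    )

model = make_bias_net(9)    # three quantities at three smoothing scales
```

Training would then minimize a loss such as the mean-squared difference between the weighted-particle field and the simulation's mass-weighted halo field.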

Before training, the team applied feature orthogonalization to transform correlated inputs into independent ones. Density fields at different smoothing scales are highly correlated. The field smoothed over 20 Mpc/h and the field smoothed over 30 Mpc/h tell nearly the same story. Transforming these into orthogonal components keeps the network from chewing on the same information twice in disguise.
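One standard way to decorrelate such features is whitening with a Cholesky factor of the feature covariance, which amounts to a scaled Gram-Schmidt pass over the features in order. A sketch under that assumption; the paper's exact orthogonalization procedure may differ:

```python
import numpy as np

def orthogonalize(X):
    """Whiten the columns of X (samples x features) via Cholesky factorization.

    Equivalent to sequential Gram-Schmidt plus rescaling, so earlier
    (e.g. smaller-scale) features keep their original meaning and later
    ones contribute only what is new. Illustrative sketch.
    """
    Xc = X - X.mean(axis=0)
    L = np.linalg.cholesky(np.cov(Xc, rowvar=False))   # cov = L @ L.T
    return Xc @ np.linalg.inv(L).T                     # identity covariance
```

After this transform, the network no longer sees the 20 Mpc/h and 30 Mpc/h fields as two near-copies of the same signal.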
The team then tested different combinations of smoothing scales:
- One scale: Too limited; misses environmental information at different radii
- Three coarsely spaced scales: The sweet spot, recovering the halo power spectrum to within 1–2% at k < 0.1 h/Mpc
- More than three scales: Counterproductive, producing 2–5% underestimation of large-scale power with signs of overfitting
That last point is the surprising one. More information makes things worse. Finely spaced smoothing scales are nearly redundant; they add correlated noise rather than new physics, and the network gets confused.

The Two-Direction Discovery
After training the full network with nine features (three quantities at three scales), the team examined the gradient of the learned function to see which directions in feature space the network actually responds to.
They computed principal components of this gradient field. Think of it as identifying the main axes of an elongated cloud of points. The answer was clean: two components dominate. Two directions, in a nine-dimensional space, capture nearly all the variation in how halos form.
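The same diagnostic can be run on any learned scalar function. A self-contained sketch using finite-difference gradients and an SVD; the paper differentiates the trained network directly, and the rank-2 toy function in the test is purely illustrative:

```python
import numpy as np

def gradient_pca(f, X, eps=1e-4):
    """PCA of the gradient field of a learned scalar function.

    f: callable mapping an (n, d) array of points to (n,) values.
    X: (n, d) sample points at which to probe the gradients.
    Returns principal directions (rows of Vt) and the fraction of
    gradient variance each direction explains. Illustrative sketch.
    """
    n, d = X.shape
    G = np.empty((n, d))
    for j in range(d):
        h = np.zeros(d)
        h[j] = eps
        G[:, j] = (f(X + h) - f(X - h)) / (2 * eps)   # central differences
    _, s, Vt = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
    return Vt, s**2 / (s**2).sum()
```

If the first two explained-variance fractions dominate, the inputs can be compressed by projecting onto those rows, `X @ Vt[:2].T`, before retraining, which is the two-component compression described above.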
When they compressed the nine features down to these two principal components and retrained, performance matched or improved over the full nine-dimensional version. Dimensionality reduction made the model more accurate, not less.
What do these two directions represent physically? The team resists assigning clean labels. The original features are so strongly correlated across scales that the principal components end up being complex mixtures. They don’t map neatly onto familiar concepts like “density” or “tidal shear.” The network has found something real, but it speaks a different language than traditional bias theory.
Why It Matters
Surveys like DESI, Euclid, and the Rubin Observatory are reaching a precision where percent-level accuracy from traditional bias models no longer cuts it. Getting galaxy bias right at the 1% level now matters directly for extracting cosmological information from these datasets.
Here, machine learning does more than predict. It works as a tool for discovering structure in a complex physical system.
The compression to two principal components is telling. Despite the apparent complexity of galaxy formation, the relevant information in the initial conditions seems to live in a low-dimensional subspace. If those two directions can eventually be connected to specific physical processes (halo mass, assembly history, or something not yet named), it would tighten our theoretical picture of structure formation considerably.
The multi-scale approach has practical value too. Traditional bias expansions typically operate at a single smoothing scale. Spanning multiple scales simultaneously, even coarsely, captures physics that single-scale approaches miss. Neural networks handle this naturally, without the memory constraints that limited earlier non-parametric methods to two features at a time.
Bottom Line: A neural network trained on initial cosmic density fields learns galaxy bias in a way that requires just two dominant directions to describe. This compression recovers the halo power spectrum to 1–2% and points to a low-dimensional structure underlying halo formation, one that traditional polynomial bias models cannot easily express.
IAIFI Research Highlights
This work puts machine learning to work on large-scale structure cosmology, using neural networks not just as predictors but as probes of the intrinsic dimensionality of galaxy bias.
Principal component analysis of learned gradients both improves neural network performance and reveals interpretable structure in high-dimensional astrophysical feature spaces.
Achieving 1–2% recovery of the halo power spectrum in a non-parametric framework moves toward the precision modeling needed to constrain inflation physics and dark energy from upcoming galaxy surveys.
Future work will extend this framework to observable galaxies and test whether the two dominant principal components correspond to known physical quantities. The paper is available as [arXiv:2212.08095](https://arxiv.org/abs/2212.08095), part of the AbacusSummit simulation analysis series.