Probabilistic reconstruction of Dark Matter fields from biased tracers using diffusion models
Authors
Core Francisco Park, Victoria Ono, Nayantara Mudur, Yueying Ni, Carolina Cuesta-Lazaro
Abstract
Galaxies are biased tracers of the underlying cosmic web, which is dominated by dark matter components that cannot be directly observed. The relationship between dark matter density fields and galaxy distributions can be sensitive to assumptions in cosmology and astrophysical processes embedded in the galaxy formation models, that remain uncertain in many aspects. Based on state-of-the-art galaxy formation simulation suites with varied cosmological parameters and sub-grid astrophysics, we develop a diffusion generative model to predict the unbiased posterior distribution of the underlying dark matter fields from the given stellar mass fields, while being able to marginalize over the uncertainties in cosmology and galaxy formation.
The Big Picture
Imagine trying to map an entire city using only the locations of coffee shops. They cluster in busy neighborhoods, but the relationship is messy. Some areas have dozens of cafes, others surprisingly few, and the reasons involve rent, foot traffic, cultural history, and factors you can’t fully measure. Now scale that up: the “city” is the universe, the “coffee shops” are galaxies, and what you’re actually trying to map is dark matter, an invisible substance making up about 85% of all matter.
That’s the core problem in modern cosmology. We can observe galaxies with telescopes. We cannot observe dark matter directly. The relationship between the two is tangled up in complex physics: supernova explosions, black hole jets blasting energy into their surroundings, the messy details of star formation. No single model captures all of it.
Previous machine learning approaches sidestepped this by training on a single simulation, baking in one set of assumptions and leaving no room for uncertainty. A team from Harvard, MIT, and the Harvard-Smithsonian Center for Astrophysics took a different approach. They built a model that generates an entire probability distribution of possible dark matter fields rather than a single best-guess map. The output captures what we don’t know just as much as what we do.
Key Insight: By training a diffusion model on over 1,000 simulations with varied cosmological and astrophysical parameters, this model produces probabilistic dark matter reconstructions that honestly reflect uncertainty about galaxy formation, a step needed for unbiased cosmological inference.
How It Works
The core tool is a diffusion generative model, the same class of AI behind image generators like DALL-E and Stable Diffusion, but repurposed for physics. Instead of generating pictures from text prompts, it generates dark matter density maps conditioned on stellar mass maps showing where galaxies are and how massive they are.
Training data comes from CAMELS (Cosmology and Astrophysics with MachinE Learning Simulations), a suite of over 1,000 galaxy formation simulations. These simulations systematically vary both cosmological parameters (like Ω_m, the matter density of the universe, and σ_8, how clumpy matter is) and astrophysical parameters controlling supernova feedback and black hole jet strength. Training across this wide parameter space is what makes the approach work: the model learns to reconstruct dark matter fields without anchoring to any single set of physical assumptions.
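One way to see what training across the whole suite buys is to write the marginalization out explicitly. The expression below is a sketch under stated assumptions rather than a formula taken from the paper: θ is shorthand introduced here for the cosmological and astrophysical parameters (Ω_m, σ_8, feedback strengths), and the network is assumed not to receive θ as an input.

```latex
% Posterior over dark matter fields given the stellar mass field,
% with the simulation parameters \theta integrated out.
p(x_{\mathrm{DM}} \mid x_{\mathrm{stars}})
  = \int p(x_{\mathrm{DM}} \mid x_{\mathrm{stars}}, \theta)\,
         p(\theta \mid x_{\mathrm{stars}})\, \mathrm{d}\theta ,
\qquad
p(\theta \mid x_{\mathrm{stars}}) \propto p(x_{\mathrm{stars}} \mid \theta)\, p(\theta)
```

Here p(θ) is the distribution from which the simulation parameters were drawn. A model trained on (stellar, dark matter) pairs pooled from all simulations, without being told θ, is fit to the left-hand side directly, so the spread of its samples includes the uncertainty in θ.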

The reconstruction works in three stages:
- Start with noise. The model begins with a completely random field, pure static.
- Denoise conditioned on galaxies. Over 250 iterative steps, a U-Net neural network (an hourglass-shaped architecture that compresses information down through layers, then builds it back up) progressively removes that noise. At each step, the observed stellar mass field guides the process. The network learns which dark matter configurations are consistent with the galaxies it sees.
- Repeat to get a posterior. Run this 100 times and you get 100 different dark matter maps, all consistent with the same galaxy observation but spanning the genuine uncertainty in what the underlying dark matter looks like.
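The three stages above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a standard DDPM ancestral sampler with a linear noise schedule; the paper's actual sampler, schedule, and network may differ. The function `toy_eps_model` is a hypothetical stand-in for the trained U-Net, and all names here are chosen for illustration.

```python
import numpy as np

def make_schedule(n_steps=250, beta_min=1e-4, beta_max=2e-2):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_min, beta_max, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_eps_model(x_t, t, stellar, alpha_bars):
    """Hypothetical stand-in for the trained, conditioned U-Net.

    It pretends the clean field equals the stellar map and returns the
    noise implied by that guess. A real network would be learned from data.
    """
    a_bar = alpha_bars[t]
    return (x_t - np.sqrt(a_bar) * stellar) / np.sqrt(1.0 - a_bar)

def sample_dark_matter(stellar, eps_model, n_steps=250, rng=None):
    """Draw one field by iteratively denoising, guided by the stellar map."""
    rng = np.random.default_rng() if rng is None else rng
    betas, alphas, alpha_bars = make_schedule(n_steps)
    x = rng.standard_normal(stellar.shape)       # stage 1: start with pure noise
    for t in range(n_steps - 1, -1, -1):         # stage 2: denoise step by step
        eps_hat = eps_model(x, t, stellar, alpha_bars)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def sample_posterior(stellar, eps_model, n_samples=100, n_steps=250, seed=0):
    """Stage 3: repeat the sampler to build a stack of posterior samples."""
    rng = np.random.default_rng(seed)
    return np.stack([
        sample_dark_matter(stellar, eps_model, n_steps, rng)
        for _ in range(n_samples)
    ])
```

Calling `sample_posterior(stellar_map, toy_eps_model)` returns an array of shape `(100, H, W)`. With the toy denoiser every sample collapses onto the stellar map, so the stack shows no real spread; the sample-to-sample variation described here only appears once a trained network supplies the noise estimate.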
The result is a posterior distribution p(x_DM | x_stars): given these galaxies, what could the dark matter field look like? The posterior mean gives the best single estimate. The posterior standard deviation reveals where the model is confident (near dense galaxy clusters) and where it isn’t. The greatest uncertainty shows up in cosmic filaments, the thread-like bridges of dark matter stretching between clusters that stars don’t directly trace.
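Given a stack of samples, the posterior mean and standard deviation maps are simple per-pixel statistics. The sketch below assumes the samples are stored as an array of shape `(n_samples, H, W)`; the function names and the 10% threshold are illustrative choices, not part of the paper.

```python
import numpy as np

def summarize_posterior(samples):
    """Per-pixel posterior mean and standard deviation.

    samples: array of shape (n_samples, H, W), one dark matter map per draw.
    """
    mean_map = samples.mean(axis=0)
    std_map = samples.std(axis=0, ddof=1)  # sample standard deviation
    return mean_map, std_map

def most_uncertain_pixels(std_map, fraction=0.1):
    """Boolean mask marking the top `fraction` of pixels by uncertainty."""
    threshold = np.quantile(std_map, 1.0 - fraction)
    return std_map >= threshold
```

Overlaying the mask from `most_uncertain_pixels` on the stellar map is one quick way to check whether the least constrained regions line up with filaments rather than clusters.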
Why It Matters

The reconstructions hold up quantitatively. When 100 sampled dark matter fields are compared against the simulation ground truth, their cross-correlation with the true field stays consistently above 0.8, where 1 would mean the structures line up perfectly. The density histograms and power spectra of the generated fields also match the true distributions well, so the samples track the true spatial clustering of dark matter across scales. The uncertainty estimates look well calibrated too: confident where stellar structures directly constrain the dark matter, appropriately uncertain where galaxies provide weak information.
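A common way to compute such a cross-correlation is scale by scale, as r(k) = P_ab(k) / sqrt(P_aa(k) P_bb(k)), where P_aa and P_bb are the power spectra of the two maps and P_ab is their cross spectrum. The sketch below assumes 2D maps on a regular grid and measures k in cycles per pixel; the binning and function names are illustrative, and the paper's exact estimator may differ.

```python
import numpy as np

def binned_spectra(field_a, field_b, n_bins=16):
    """Auto and cross power of two 2D maps, summed in radial |k| bins."""
    fa = np.fft.fftn(field_a - field_a.mean())
    fb = np.fft.fftn(field_b - field_b.mean())
    ky = np.fft.fftfreq(field_a.shape[0])
    kx = np.fft.fftfreq(field_a.shape[1])
    k = np.sqrt(ky[:, None] ** 2 + kx[None, :] ** 2).ravel()
    edges = np.linspace(0.0, k.max(), n_bins + 1)
    # Assign every Fourier mode to a bin; the largest |k| goes in the last bin.
    idx = np.clip(np.digitize(k, edges) - 1, 0, n_bins - 1)

    def bin_sum(values):
        return np.bincount(idx, weights=values.ravel(), minlength=n_bins)

    p_aa = bin_sum(np.abs(fa) ** 2)
    p_bb = bin_sum(np.abs(fb) ** 2)
    p_ab = bin_sum((fa * np.conj(fb)).real)
    return edges, p_aa, p_bb, p_ab

def cross_correlation(field_a, field_b, n_bins=16):
    """Cross-correlation coefficient r(k); NaN where a bin has no power."""
    edges, p_aa, p_bb, p_ab = binned_spectra(field_a, field_b, n_bins)
    denom = np.sqrt(p_aa * p_bb)
    r = np.divide(p_ab, denom, out=np.full(n_bins, np.nan), where=denom > 0)
    k_centers = 0.5 * (edges[:-1] + edges[1:])
    return k_centers, r
```

Because the same number of modes enters the numerator and denominator of each bin, summing instead of averaging leaves r(k) unchanged. Applying `cross_correlation(sample, truth)` to each posterior sample and looking at the spread of the resulting curves gives a per-scale view of how well the samples follow the true field.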
This matters because of what’s coming. The Vera C. Rubin Observatory’s LSST and the Euclid space telescope will map billions of galaxies across cosmic history. Extracting dark matter information from those observations with honest uncertainty quantification is essential for testing fundamental physics and measuring dark energy. A model that accounts for galaxy formation uncertainty, rather than assuming it away, is a prerequisite for that science.
There’s a broader point about how generative AI gets used in science. The real value isn’t just prediction; it’s characterizing the space of possibilities consistent with data. When the gap between observation and theory is filled by unverified assumptions, quantifying that uncertainty is itself a scientific contribution.
Bottom Line: A diffusion model trained on 1,000+ cosmological simulations can reconstruct the invisible dark matter web from galaxy maps. Unlike previous approaches, it tells you how uncertain that answer is, which may prove just as valuable as the reconstruction itself.
IAIFI Research Highlights
This work puts deep generative AI to work in observational cosmology, applying diffusion models to a fundamental quantity that cannot be measured directly: the dark matter density field.
Diffusion models trained across heterogeneous simulation suites can produce calibrated, physically meaningful posterior distributions. This is a proof of concept for uncertainty-aware scientific inference with generative AI.
By simultaneously marginalizing over cosmological and astrophysical uncertainties, this approach addresses a source of systematic bias in dark matter field reconstruction, moving galaxy surveys closer to unbiased constraints on fundamental parameters like Ω_m.
Future extensions could apply this framework to 3D reconstructions and real observational data from Euclid and LSST. The work appeared at the Machine Learning and the Physical Sciences Workshop at NeurIPS 2023 ([arXiv:2311.08558](https://arxiv.org/abs/2311.08558)).
Original Paper Details
Probabilistic reconstruction of Dark Matter fields from biased tracers using diffusion models
arXiv:2311.08558
Core Francisco Park, Victoria Ono, Nayantara Mudur, Yueying Ni, Carolina Cuesta-Lazaro