Debiasing with Diffusion: Probabilistic reconstruction of Dark Matter fields from galaxies with CAMELS

Astrophysics

Authors

Victoria Ono, Core Francisco Park, Nayantara Mudur, Yueying Ni, Carolina Cuesta-Lazaro, Francisco Villaescusa-Navarro

Abstract

Galaxies are biased tracers of the underlying cosmic web, which is dominated by dark matter components that cannot be directly observed. Galaxy formation simulations can be used to study the relationship between dark matter density fields and galaxy distributions. However, this relationship can be sensitive to assumptions in cosmology and astrophysical processes embedded in the galaxy formation models, that remain uncertain in many aspects. In this work, we develop a diffusion generative model to reconstruct dark matter fields from galaxies. The diffusion model is trained on the CAMELS simulation suite that contains thousands of state-of-the-art galaxy formation simulations with varying cosmological parameters and sub-grid astrophysics. We demonstrate that the diffusion model can predict the unbiased posterior distribution of the underlying dark matter fields from the given stellar mass fields, while being able to marginalize over uncertainties in cosmological and astrophysical models. Interestingly, the model generalizes to simulation volumes approximately 500 times larger than those it was trained on, and across different galaxy formation models. Code for reproducing these results can be found at https://github.com/victoriaono/variational-diffusion-cdm

Concepts

diffusion models, posterior estimation, dark matter, cosmological simulation, uncertainty quantification, generative models, galaxy bias, debiasing, inverse problems, simulation-based inference, Bayesian inference, convolutional networks, out-of-distribution detection

The Big Picture

Imagine mapping a city using only coffee shop locations. You’d get a rough picture (clusters near offices, gaps in residential areas), but you’d miss the parks, warehouses, and quiet streets: all the structure that coffee shops don’t care about. Astronomers face a version of this problem at cosmic scales. Galaxies light up the universe, but they trace something far larger and invisible: the dark matter web, which accounts for roughly 85% of all matter.

Dark matter doesn’t emit light, and so far it has revealed itself only through its gravity. That gravity pulls galaxies into clusters along dark matter filaments and halos, so the galaxy distribution carries a scrambled, imperfect imprint of the dark matter beneath it. Galaxies are therefore biased tracers of the underlying dark matter, and correcting for that bias is one of the central challenges in modern cosmology.

Unscrambling that bias could sharpen our maps of the universe, revealing voids and filaments invisible in galaxy surveys alone. A team from Harvard, MIT, and the Flatiron Institute built a machine learning model to do just that. Given a map of where galaxies are, it reconstructs a probabilistic picture of the dark matter underneath, while marginalizing over uncertainties in cosmology and in how galaxies form.

Key Insight: By training a diffusion generative model on thousands of galaxy formation simulations, researchers can reconstruct dark matter density fields from galaxy observations while quantifying the uncertainties introduced by poorly constrained cosmology and astrophysics.

How It Works

The model is a diffusion model, from the same family of generative AI behind image generators like DALL-E and Stable Diffusion. But instead of generating images from text prompts, it generates dark matter density maps from stellar mass maps. It learns to reverse a gradual noise-injection process: sampling starts from pure random noise and refines it step by step toward a plausible dark matter field that matches the observed galaxy distribution.
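
To make the noise-injection idea concrete, here is a minimal NumPy sketch of a variance-preserving forward process and its algebraic inverse. The cosine schedule, the function names, and the random 64×64 stand-in field are illustrative assumptions, not the authors' implementation. In a trained model, a neural network that also sees the stellar mass map would supply the noise estimate.

```python
import numpy as np

def alpha_sigma(t):
    """Assumed cosine schedule with alpha^2 + sigma^2 = 1 for t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def add_noise(x0, t, rng):
    """Forward process: blend the clean field x0 with Gaussian noise at time t."""
    alpha, sigma = alpha_sigma(t)
    eps = rng.standard_normal(x0.shape)
    return alpha * x0 + sigma * eps, eps

def denoise_estimate(z_t, t, eps_hat):
    """Invert the forward blend given a noise estimate eps_hat.

    Here the caller supplies eps_hat; a trained network would predict it
    from z_t, t, and the conditioning stellar mass map.
    """
    alpha, sigma = alpha_sigma(t)
    return (z_t - sigma * eps_hat) / alpha

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))        # stand-in for a normalized dark matter map
z_t, eps = add_noise(x0, 0.5, rng)        # partially noised field
x0_rec = denoise_estimate(z_t, 0.5, eps)  # exact only because we pass the true noise
print(np.allclose(x0_rec, x0))
```

Sampling from a real diffusion model repeats a step like `denoise_estimate` many times, moving from t = 1 toward t = 0 with a learned, imperfect noise estimate at each step.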

Training data comes from CAMELS (Cosmology and Astrophysics with MachinE Learning Simulations), a suite of thousands of hydrodynamical simulations. These are detailed computer models tracking how both ordinary and dark matter behave under gravity and other physical forces. Each simulation uses different cosmological parameters and different sub-grid physics: approximate prescriptions for star formation, supernova explosions, and black hole feedback, all of which occur on scales too small for the simulations to resolve directly.

Training on this diverse set means the model doesn’t learn a single fixed galaxy-to-dark-matter relationship. It learns the full spread of possibilities.

The training setup uses paired 2D maps drawn from three galaxy formation frameworks:

  • Data format: 256×256 pixel projections of stellar mass and dark matter density
  • Scale: Simulation boxes 25 h⁻¹ Mpc per side (roughly 120 million light-years for h ≈ 0.67)
  • Simulation suites: ASTRID, IllustrisTNG, and SIMBA, each making fundamentally different physical assumptions about how galaxies form
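
As a rough illustration of how paired maps like these could be built, the sketch below bins particle masses into a 256×256 grid and log-normalizes the result. The function names, the full-depth projection along one axis, the random toy particles, and the log transform are all assumptions made for illustration; the actual CAMELS map-making pipeline may differ.

```python
import numpy as np

def project_to_map(positions, masses, box_size=25.0, n_pix=256):
    """Sum particle masses along the z axis into an n_pix x n_pix grid."""
    edges = np.linspace(0.0, box_size, n_pix + 1)
    grid, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1], bins=[edges, edges], weights=masses
    )
    return grid

def log_normalize(field, floor=1e-3):
    """Log-transform a map and standardize it to zero mean, unit variance."""
    logf = np.log10(field + floor)
    return (logf - logf.mean()) / logf.std()

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 25.0, size=(10_000, 3))  # toy particles in a 25 h^-1 Mpc box
masses = rng.lognormal(size=10_000)                   # toy particle masses

mass_map = project_to_map(positions, masses)
model_input = log_normalize(mass_map)
print(mass_map.shape)
```

Applying the same projection to the star particles and to the dark matter particles of one simulation yields one training pair.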

The model doesn’t just produce one map. It generates an ensemble of plausible dark matter fields, each consistent with the input galaxy distribution. Multiple dark matter configurations can explain the same galaxy field equally well. The model captures that genuine physical uncertainty rather than papering over it.
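
One way to work with such an ensemble is to summarize it pixel by pixel and check whether the spread is calibrated. The sketch below assumes the samples are stacked in an array of shape (n_samples, height, width). The function names are hypothetical, and the random arrays merely stand in for model outputs and a true field.

```python
import numpy as np

def summarize_ensemble(samples):
    """Per-pixel mean and standard deviation across the sample axis."""
    return samples.mean(axis=0), samples.std(axis=0)

def coverage(samples, truth, lo=16, hi=84):
    """Fraction of pixels whose true value falls inside the [lo, hi] percentile band."""
    lower, upper = np.percentile(samples, [lo, hi], axis=0)
    return np.mean((truth >= lower) & (truth <= upper))

rng = np.random.default_rng(1)
truth = rng.standard_normal((32, 32))         # stand-in for the true dark matter map
samples = rng.standard_normal((100, 32, 32))  # stand-in for 100 posterior samples

mean_map, std_map = summarize_ensemble(samples)
cov = coverage(samples, truth)
print(mean_map.shape, std_map.shape, float(cov))
```

In this toy case the truth and the samples come from the same distribution, so the coverage should land near 0.68. A value far below that would indicate an overconfident ensemble, and a value far above it an underconfident one.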

Why It Matters

The generalization results are unusual. Trained on small simulation boxes, the model was applied to IllustrisTNG-300, a simulation roughly 500 times larger in volume. Reconstructed dark matter fields remained statistically accurate, recovering correct power spectra and probability density functions even on data well outside the training range. That kind of extrapolation is rare, and it suggests the model has learned genuine physical relationships rather than shortcuts tied to its training data.
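
The factor of roughly 500 follows from the box sizes: TNG300 is 205 h⁻¹ Mpc per side (a figure taken from the IllustrisTNG specifications, not from this article), so the volume ratio is (205/25)³ ≈ 551. The sketch below computes that ratio along with an azimuthally averaged 2D power spectrum, the kind of summary statistic used to compare a reconstructed field with the truth. The function name, binning, and normalization convention are assumptions, and the single cosine mode is only a sanity check.

```python
import numpy as np

def power_spectrum_2d(delta, box_size):
    """Azimuthally averaged power spectrum of a square 2D overdensity map."""
    n = delta.shape[0]
    k_fund = 2.0 * np.pi / box_size                  # fundamental wavenumber
    # Assumed convention: P(k) = |FFT|^2 * L^2 / N^4 for a 2D field.
    power = np.abs(np.fft.fft2(delta)) ** 2 * box_size**2 / n**4
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky = np.meshgrid(k1d, k1d, indexing="ij")
    kmag = np.sqrt(kx**2 + ky**2).ravel()
    # Bins of width k_fund centred on integer multiples of k_fund.
    edges = k_fund * (np.arange(n // 2 + 1) + 0.5)
    counts, _ = np.histogram(kmag, bins=edges)
    sums, _ = np.histogram(kmag, bins=edges, weights=power.ravel())
    k_centers = 0.5 * (edges[:-1] + edges[1:])
    return k_centers, sums / counts

volume_ratio = (205.0 / 25.0) ** 3                   # TNG300 volume / CAMELS box volume

# Sanity check: a single cosine mode should peak in the matching k bin.
box, n, mode = 25.0, 64, 5
x = np.arange(n) * box / n
delta = np.cos(2.0 * np.pi * mode * x / box)[:, None] * np.ones((1, n))
k, pk = power_spectrum_2d(delta, box)
k_peak = k[np.argmax(pk)]
print(round(volume_ratio), k_peak / (2.0 * np.pi / box))
```

Running the same function on a reconstructed map and on the true map, then taking the ratio of the two spectra, shows at which scales the reconstruction gains or loses power.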

Transfer across galaxy formation models works too. A diffusion model trained only on IllustrisTNG can reconstruct dark matter fields from SIMBA or ASTRID galaxy distributions with reasonable accuracy, despite different feedback physics and noticeably different galaxy populations. This cross-suite robustness matters for real observational data, where we don’t know which simulation most accurately describes the universe.

The approach connects to two major goals in observational cosmology. First, reconstructed dark matter fields enable field-level cosmological inference: extracting information from the full density field, including cosmic voids and filaments, rather than just clustering statistics. Second, by identifying regions with unusually low stellar-to-dark-matter mass ratios, the reconstruction could help guide searches for observable signatures of dark matter.
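
As a simple illustration of the second point, the sketch below flags the pixels with the lowest stellar-to-dark-matter ratio in a pair of maps. The function name, the 5th-percentile cut, and the random lognormal maps are assumptions for illustration. In practice the dark matter map would be a reconstruction, for example the posterior mean, and its uncertainty would need to be propagated.

```python
import numpy as np

def low_ratio_mask(stellar, dark, percentile=5.0, eps=1e-12):
    """Flag pixels whose stellar-to-dark-matter ratio is in the lowest `percentile`."""
    ratio = stellar / (dark + eps)           # eps guards against division by zero
    threshold = np.percentile(ratio, percentile)
    return ratio <= threshold, ratio

rng = np.random.default_rng(2)
stellar = rng.lognormal(size=(64, 64))       # stand-in stellar mass map
dark = rng.lognormal(size=(64, 64))          # stand-in reconstructed dark matter map

mask, ratio = low_ratio_mask(stellar, dark)
print(mask.mean(), ratio[mask].mean() < ratio.mean())
```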

Current and upcoming surveys such as DESI, Euclid, Roman, and Rubin will deliver galaxy maps of unprecedented size and depth. Turning those observations into constraints on the nature of dark matter will likely require methods like this one.

Bottom Line: A diffusion model trained on thousands of diverse galaxy formation simulations can reconstruct probabilistic dark matter density maps from observed galaxy distributions, generalizing to volumes 500× larger than its training data and across entirely different physical models, bringing field-level cosmology with next-generation surveys closer to reality.

IAIFI Research Highlights

Interdisciplinary Research Achievement
This paper combines generative AI with cosmological simulation science, using diffusion models trained on the CAMELS suite to recover the invisible dark matter web from observable galaxy distributions.
Impact on Artificial Intelligence
In this setting, the diffusion model generalizes well beyond its training distribution, maintaining accuracy on volumes roughly 500× larger than its training data and across fundamentally different galaxy formation models.
Impact on Fundamental Interactions
By producing unbiased posterior distributions of dark matter density fields from galaxy observations, this method enables field-level cosmological inference and targeted searches for dark matter in regions with low stellar-to-dark-matter-mass ratios.
Outlook and References
Next steps include extending to 3D reconstructions and real observational data from DESI and Euclid. Code is available at [github.com/victoriaono/variational-diffusion-cdm](https://github.com/victoriaono/variational-diffusion-cdm) and the paper is on [arXiv:2403.10648](https://arxiv.org/abs/2403.10648).
