AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing
Authors
Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris
Abstract
Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
Concepts
The Big Picture
Imagine searching for a counterfeit coin in a vault of a billion identical-looking ones, except the coins keep changing shape due to temperature and humidity, making “normal” impossible to define.
Scientists at particle colliders, astronomical observatories, and genomics labs deal with this kind of problem every day. Datasets are vast, signals are faint, and the difference between a genuine discovery and a statistical fluke can define a career.
Standard anomaly-detection tools have a fundamental weakness: they can tell you something looks weird, but not how weird in any rigorous scientific sense. A detector that flags anomalies is useful. One that can say “this deviation has a one-in-ten-million chance of occurring by coincidence” is science. That distinction matters when you’re claiming a new particle, a novel astrophysical phenomenon, or an unexpected biological mechanism.
A team of MIT physicists at IAIFI has built AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline that combines powerful pattern-recognition with rigorous statistical testing. The goal: make automated discovery sensitive and scientifically credible.
AutoSciDACT goes beyond flagging anomalies. It quantifies their statistical significance through principled hypothesis testing, producing results that meet the standards of scientific publication.
How It Works
AutoSciDACT runs in two phases that mirror the scientific method: first compress and represent the data, then rigorously test what you find.

Phase 1: Contrastive Pre-Training. Raw scientific data (collision events, galaxy spectra, gene expression arrays) arrives with thousands of variables per data point. Direct analysis at that scale is impractical. AutoSciDACT uses contrastive learning to solve this: a neural network sees pairs of similar data points and pairs of dissimilar ones, and learns an embedding that pulls similar pairs together and pushes dissimilar ones apart, all without labeled data.
The trick is that labeled real data isn’t needed. The pipeline draws on something physicists and astronomers already have in abundance: high-quality simulations. These encode expert knowledge of what “normal” looks like. AutoSciDACT generates data augmentations from simulations (systematic variations like rotations or small perturbations) that teach the model which differences carry physical meaning and which are noise.
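To make the augmentation-pair idea concrete, here is a minimal sketch of one standard contrastive objective (a SimCLR-style NT-Xent loss) in PyTorch. The paper's exact loss and encoder may differ; `encoder` and `augment` in the usage comment are hypothetical placeholders for a domain-specific network and augmentation strategy.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss: two augmented views of the same event
    (z1[i], z2[i]) are pulled together; every other pair in the batch
    is pushed apart. z1, z2: (batch, embed_dim) encoder outputs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2B, D)
    sim = z @ z.T / temperature               # cosine-similarity logits
    sim.fill_diagonal_(float("-inf"))         # never match a point to itself
    B = z1.shape[0]
    idx = torch.arange(B, device=z.device)
    targets = torch.cat([idx + B, idx])       # each row's positive partner
    return F.cross_entropy(sim, targets)

# Hypothetical usage, with `encoder` a domain-specific network and
# `augment` a physics-guided transformation (e.g. a rotation):
#   z1 = encoder(augment(batch)); z2 = encoder(augment(batch))
#   loss = nt_xent_loss(z1, z2)
```

Training on simulated events with domain-guided augmentations is what bakes the expert knowledge into the embedding.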
The result is a compact embedding: a handful of numbers capturing the essential physics of each data point, noise stripped away. This compression is what makes the next phase possible.
Phase 2: NPLM Hypothesis Testing. With embeddings in hand, AutoSciDACT runs the New Physics Learning Machine (NPLM), a neural-network-based statistical test originally developed for high-energy physics. NPLM takes a reference dataset (what you’d expect if nothing unusual is happening) and checks whether the observed data could plausibly have come from the same distribution.
It goes further than a simple comparison, though. NPLM trains a neural network to find the most anomalous region in the compressed data, then computes how unlikely a deviation that large would be by pure chance. The output is a p-value: the probability of seeing a deviation at least this extreme if nothing unusual were happening. The lower the p-value, the harder the result is to dismiss as a statistical fluke, and the stronger the case for investigating further.
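The machinery behind that number can be written down compactly. Below is a schematic of the NPLM objective as published (a small network f(x) parameterizes deviations from the reference distribution); this is a sketch of the general method, not AutoSciDACT's exact implementation.

```python
import torch

def nplm_loss(f_ref, f_data, w_ref):
    """Schematic NPLM objective: f_ref = f(x) evaluated on reference
    embeddings, f_data = f(x) on observed embeddings, w_ref = weights
    scaling the reference sample to the expected data yield.
    Minimizing this loss maximizes an extended likelihood ratio between
    a learned alternative hypothesis and the reference hypothesis."""
    return torch.sum(w_ref * (torch.exp(f_ref) - 1.0)) - torch.sum(f_data)

# The test statistic is t = -2 * (loss at the minimum): the larger t is,
# the bigger the disagreement the network found between data and reference.
```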
The pipeline in steps:
- Collect simulated (reference) and real (observed) data
- Train a contrastive encoder using simulations and domain-guided augmentations
- Embed both datasets into low-dimensional space
- Run NPLM to identify the most anomalous region and compute statistical significance
- Report the deviation with a quantified p-value (see the sketch below)
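The final step, converting a test statistic into a p-value, is simple enough to show end to end. A minimal sketch, assuming the null distribution of the statistic t is estimated from reference-only pseudo-experiments; the chi-squared toys and the observed value below are illustrative stand-ins, not results from the paper.

```python
import numpy as np
from scipy.stats import chi2, norm

# Stand-in null distribution: test statistics from reference-only
# pseudo-experiments. Chi-squared toys are a convenient illustration
# of what such a null distribution can look like.
t_null = chi2.rvs(df=10, size=10_000, random_state=0)
t_obs = 35.0  # hypothetical observed test statistic

# Empirical p-value: fraction of null toys at least as extreme,
# floored at 1/N when no toy exceeds the observation.
p = max(np.mean(t_null >= t_obs), 1.0 / len(t_null))
z = norm.isf(p)  # one-sided Gaussian significance ("sigmas")
print(f"p = {p:.2e}, significance = {z:.2f} sigma")
```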
Why It Matters
The team tested AutoSciDACT on benchmarks spanning astronomical, physical, biological, image, and synthetic datasets, all using the same pipeline architecture. The only domain-specific piece was the augmentation strategy.
AutoSciDACT consistently detected small injections of anomalous data (sometimes a fraction of a percent of the dataset) while keeping false positives under control. The real advantage over prior methods isn’t raw sensitivity; it’s calibration. The test statistics behave as expected under the null hypothesis, so the p-values are trustworthy.
Why do contrastive embeddings outperform simpler approaches? Standard techniques like PCA (which finds the most variable directions in data) or autoencoders (neural networks trained to compress and reconstruct data) tend to lose physically meaningful structure during compression. Contrastive embeddings preserve it. Because the network learned to cluster physically similar things together, NPLM can spot deviations that would be invisible in any naive low-dimensional representation.
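For contrast, here is a hedged sketch of the kind of PCA baseline that comparison has in mind; the data are random placeholders, and the point is only that PCA retains the highest-variance directions, which need not be the physically meaningful ones.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))  # placeholder high-dimensional events

# PCA keeps the directions of largest variance. If an anomaly lives in
# a low-variance direction, it is squeezed out of the embedding before
# any two-sample test ever sees it.
pca = PCA(n_components=4).fit(X)
Z = pca.transform(X)              # the 4-d embedding a test would receive
print("variance kept:", round(pca.explained_variance_ratio_.sum(), 3))
```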
The gap between “anomaly detection” and “scientific discovery” has been a persistent frustration in data-driven science. Dozens of papers propose sophisticated outlier-detection methods. Almost none produce outputs a scientist can act on: a number to cite, a significance to defend at a conference. AutoSciDACT closes that gap by treating statistical rigor as a requirement from the start.
Any field with abundant simulation data (climate science, drug discovery, materials science, cosmology) could plug in its own augmentation strategies and get a statistically trustworthy anomaly scanner. As AI takes on a larger role in scientific workflows, AutoSciDACT shows what it looks like to do it carefully: finding patterns and knowing when to trust them.
AutoSciDACT combines contrastive learning with hypothesis testing in a general-purpose pipeline for statistically credible automated discovery. Anomaly flags become quantified scientific claims, applicable from astronomy and particle physics to biology.
IAIFI Research Highlights
AutoSciDACT fuses representation learning from modern machine learning with statistical hypothesis-testing frameworks developed for high-energy physics, creating a discovery pipeline applicable across multiple scientific domains.
Domain-guided contrastive pre-training, using simulations rather than labels, produces embeddings expressive enough for sensitive distribution-level anomaly detection.
Statistically rigorous automated novelty detection at scale addresses one of particle physics' core problems: finding subtle deviations from Standard Model predictions in enormous, noisy collider datasets.
Future directions include handling systematic uncertainties and streaming data, and integrating agentic AI for automated follow-up hypothesis generation. The paper is available at [arXiv:2510.21935](https://arxiv.org/abs/2510.21935).