Isolating Unisolated Upsilons with Anomaly Detection in CMS Open Data

Experimental Physics

Authors

Rikab Gambhir, Radha Mastandrea, Benjamin Nachman, Jesse Thaler

Abstract

We present the first study of anti-isolated Upsilon decays to two muons ($\Upsilon \to \mu^+ \mu^-$) in proton-proton collisions at the Large Hadron Collider. Using a machine learning (ML)-based anomaly detection strategy, we "rediscover" the $\Upsilon$ in 13 TeV CMS Open Data from 2016, despite overwhelming anti-isolated backgrounds. We elevate the signal significance to $6.4\sigma$ using these methods, starting from $1.6\sigma$ using the dimuon mass spectrum alone. Moreover, we demonstrate improved sensitivity from using an ML-based estimate of the multi-feature likelihood compared to traditional "cut-and-count" methods. Our work demonstrates that it is possible and practical to find real signals in experimental collider data using ML-based anomaly detection, and we distill a readily-accessible benchmark dataset from the CMS Open Data to facilitate future anomaly detection developments.

Concepts

anomaly detection, collider physics, anti-isolated quarkonia, density estimation, likelihood ratio, normalizing flows, hypothesis testing, jet physics, signal detection, classification, new physics searches, simulation-based inference

The Big Picture

Imagine trying to spot a single whispered conversation in a stadium full of screaming fans during the loudest play of the game. That’s roughly the challenge particle physicists face when searching for rare quantum particles buried inside the chaos of high-energy proton collisions at the Large Hadron Collider. Most searches sidestep this problem by filtering for “quiet” events, in which particles emerge cleanly separated from surrounding debris. But what about the signals hiding inside the noise?

The Upsilon (Υ) is a particle made of a bottom quark and its antimatter partner. It forms from the raw energy of collisions and quickly decays into two muons. Physicists have studied it for decades in clean, uncrowded conditions. When an Upsilon forms within a roiling jet of other particles, though, its signal gets drowned out.

That anti-isolated scenario is what a team from MIT and Lawrence Berkeley National Laboratory went after. Nobody had pulled such a signal out of real collider data before. The backgrounds are simply too overwhelming for traditional methods.

Using machine learning-based anomaly detection on publicly available CMS Open Data, the team took a nearly invisible hint and turned it into a clear detection. The signal jumped from a barely perceptible 1.6σ to 6.4σ. In physics, sigma (σ) measures statistical confidence: 5σ is the conventional discovery threshold, meaning the chance of a random fluctuation mimicking the signal is less than one in a million. This result clears that bar.
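To make the sigma scale concrete, here is a minimal sketch that converts a significance into a one-sided Gaussian tail probability. It assumes the usual one-sided convention used for discovery claims; the function name is illustrative and not taken from the paper.

```python
import math

def one_sided_p_value(n_sigma):
    """Probability that a standard normal variable exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

# The three significances quoted in the text: before ML, discovery threshold, after ML.
for z in (1.6, 5.0, 6.4):
    print(f"{z:.1f} sigma -> one-sided p-value {one_sided_p_value(z):.2e}")
```

Under this convention, 5σ corresponds to a p-value of about 2.9 × 10⁻⁷, consistent with the "less than one in a million" statement above.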

Key Insight: Machine learning anomaly detection can find real physics signals in messy, real-world collider data, not just in clean simulations. The same approach could help spot genuinely unknown particles hiding in similarly crowded environments.

How It Works

The analysis starts with a deliberate choice of battlefield. The team pulled data from CMS’s 2016 run at 13 TeV and focused on events where two muons were recorded. At that collision energy, the LHC produces Upsilons many thousands of times over. Rather than requiring the muons to be cleanly separated from surrounding activity, the team reversed the usual requirement: they imposed an anti-isolation criterion that forced muon pairs to be embedded in surrounding particle activity, with non-muon momentum exceeding 55% of the muon momentum within a cone of radius ΔR = 0.4.
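The anti-isolation requirement can be illustrated with a small sketch. It assumes the ratio is built from transverse momenta and that each particle is a (pt, eta, phi) tuple; both the data layout and the helper names are hypothetical, and the actual definition in the analysis may differ in detail.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuthal difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def relative_isolation(muon, others, cone=0.4):
    """Summed pT of non-muon particles inside the cone, divided by the muon pT."""
    pt_mu, eta_mu, phi_mu = muon
    cone_pt = sum(pt for pt, eta, phi in others
                  if delta_r(eta_mu, phi_mu, eta, phi) < cone)
    return cone_pt / pt_mu

def is_anti_isolated(muon, others, threshold=0.55, cone=0.4):
    """True when surrounding activity exceeds `threshold` times the muon pT."""
    return relative_isolation(muon, others, cone) > threshold

# Hypothetical event: two nearby hadrons and one far away from a 20 GeV muon.
muon = (20.0, 0.1, 0.0)
others = [(8.0, 0.2, 0.1), (5.0, 0.0, -0.2), (30.0, 2.0, 3.0)]
print(relative_isolation(muon, others), is_anti_isolated(muon, others))
```

A conventional isolated-muon selection would keep events where this ratio is small; the selection described here keeps the opposite side of the cut.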

Figure 1

This cut dramatically suppresses the signal. Without any cuts, Upsilons shine out at 28σ. With anti-isolation imposed, the dimuon mass spectrum (a plot of how many muon pairs appear at each combined mass, where a real particle shows up as a bump at its known mass) yields a barely perceptible 1.6σ bump. Two dominant backgrounds conspire to bury the Upsilon’s characteristic peaks near 9–10 GeV: uncorrelated hadron decays and Drell-Yan production, where a virtual photon decays into a muon pair, producing a smoothly falling background.
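For readers unfamiliar with how a dimuon mass is computed, the sketch below builds the invariant mass and transverse momentum of a muon pair from (pt, eta, phi) values. The tuple layout and function names are assumptions made for illustration; only the kinematic formulas are standard.

```python
import math

MUON_MASS_GEV = 0.10566  # muon rest mass in GeV

def four_momentum(pt, eta, phi, mass=MUON_MASS_GEV):
    """Convert (pt, eta, phi, mass) to (E, px, py, pz), all in GeV."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz

def dimuon_mass_and_pt(mu1, mu2):
    """Invariant mass and transverse momentum of a muon pair."""
    e1, px1, py1, pz1 = four_momentum(*mu1)
    e2, px2, py2, pz2 = four_momentum(*mu2)
    e, px, py, pz = e1 + e2, px1 + px2, py1 + py2, pz1 + pz2
    mass_sq = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(mass_sq, 0.0)), math.hypot(px, py)

# Two back-to-back 4.73 GeV muons reconstruct to roughly the Upsilon(1S) mass.
mass, pt = dimuon_mass_and_pt((4.73, 0.0, 0.0), (4.73, 0.0, math.pi))
print(f"m = {mass:.2f} GeV, pT = {pt:.2f} GeV")
```

Filling a histogram with this mass for every selected muon pair gives the spectrum in which a resonance such as the Υ appears as a bump.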

To fight back, the researchers used CATHODE (Classifying Anomalies THrough Outer Density Estimation), an anomaly detection technique that requires no prior knowledge of what the signal looks like. The strategy works in five steps:

  1. Define a signal region around the expected Upsilon mass (roughly 8.5–11 GeV) and mask it off
  2. Train a density estimator on sideband data, events just outside the signal region, to learn how the background behaves in auxiliary feature space
  3. Interpolate the background model from the sidebands into the signal region, working in the space of three auxiliary features: the dimuon transverse momentum and the 3D impact parameter of each of the two muons (how far each muon’s track passes from the collision point)
  4. Build an anomaly score comparing actual data to the predicted background, so events that don’t look like background light up as anomalous
  5. Reweight and scan the mass spectrum, looking for a resonant bump that anomaly detection makes visible
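The steps above can be sketched on toy data. This is a deliberately simplified stand-in, not CATHODE itself: a Gaussian fit replaces the normalizing-flow density estimator, a histogram density ratio replaces the trained classifier, and a single invented feature replaces the three auxiliary features. All distributions and yields are made up for illustration.

```python
import math
import random
import statistics

random.seed(0)
SR = (8.5, 11.0)  # step 1: signal-region mass window in GeV, as quoted above

def make_toy_events(n_bkg=20000, n_sig=300):
    """Toy (mass, feature, is_signal) triples; every distribution is invented."""
    events = [(random.uniform(5.0, 16.0), random.gauss(0.0, 1.0), False)
              for _ in range(n_bkg)]
    events += [(random.gauss(9.46, 0.1), random.gauss(2.5, 0.5), True)
               for _ in range(n_sig)]
    return events

def in_sr(mass):
    return SR[0] <= mass <= SR[1]

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def histogram_density(values, lo=-5.0, hi=5.0, n_bins=40):
    """Return a function that estimates the density of `values` with a histogram."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    def density(x):
        if not lo <= x < hi:
            return 0.0
        return counts[min(int((x - lo) / width), n_bins - 1)] / (len(values) * width)
    return density

events = make_toy_events()
sideband_x = [x for m, x, _ in events if not in_sr(m)]
sr_events = [e for e in events if in_sr(e[0])]

# Step 2: "train" the background model on sideband data (here, a Gaussian fit).
mu_bkg = statistics.fmean(sideband_x)
sigma_bkg = statistics.pstdev(sideband_x)

# Step 3 is trivial in this toy because the background feature does not depend on mass.
# Step 4: anomaly score = estimated data density / predicted background density.
p_data = histogram_density([x for _, x, _ in sr_events])

def anomaly_score(x):
    return p_data(x) / gaussian_pdf(x, mu_bkg, sigma_bkg)

# Step 5, simplified: keep the most anomalous 2% of signal-region events.
ranked = sorted(sr_events, key=lambda e: anomaly_score(e[1]), reverse=True)
selected = ranked[: max(1, len(ranked) // 50)]
purity_before = sum(s for _, _, s in sr_events) / len(sr_events)
purity_after = sum(s for _, _, s in selected) / len(selected)
print(f"signal fraction in SR: {purity_before:.3f} -> {purity_after:.3f}")
```

The truth labels are used only to report how signal-enriched the selected events are; the score itself is built without them, which is the point of the method.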

Figure 2

The choice of auxiliary features matters. Dimuon transverse momentum and impact parameters carry information about how the Upsilon was produced: fragmentation-produced Upsilons behave differently from background muon pairs. The team verified that these features don’t sculpt artificial peaks in the mass spectrum. That check is essential, since anomaly scores can preferentially select events at certain masses, creating false bumps.
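One generic way to look for sculpting, offered here as a hedged illustration rather than the authors' actual procedure, is to compare the fraction of background-like events passing a score threshold across mass bins. A roughly flat pass fraction suggests the selection does not favor any particular mass. The function name and the toy numbers are hypothetical.

```python
from bisect import bisect_right

def pass_fraction_by_mass_bin(masses, scores, threshold, edges):
    """Fraction of events with score > threshold in each mass bin."""
    n_bins = len(edges) - 1
    total = [0] * n_bins
    passed = [0] * n_bins
    for m, s in zip(masses, scores):
        if edges[0] <= m < edges[-1]:
            i = bisect_right(edges, m) - 1  # index of the bin containing m
            total[i] += 1
            passed[i] += s > threshold
    return [p / t if t else float("nan") for p, t in zip(passed, total)]

# Toy input: the pass fraction is identical in every bin, so nothing is sculpted.
masses = [6.0, 6.5, 9.0, 9.5, 12.0, 12.5]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
print(pass_fraction_by_mass_bin(masses, scores, 0.5, [5.0, 8.5, 11.0, 16.0]))
```

In practice such a check would be run on a sample expected to contain no resonance, with statistical uncertainties attached to each bin.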

After applying the learned anomaly score, the Upsilon signal jumps from 1.6σ to 6.4σ. The team also compared two ways of using the anomaly score: a simple cut-and-count approach (keeping only events above a chosen threshold) versus likelihood-ratio reweighting, which uses the full continuous distribution of anomaly scores rather than drawing a hard cutoff. Reweighting consistently outperformed cuts.
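A toy calculation shows why using the full score distribution can beat a hard cut. The sketch assumes Gaussian statistics, invented signal and background yields in five anomaly-score bins, and per-bin weights proportional to s/b (the leading term of the log likelihood ratio ln(1 + s/b)). It is not the paper's exact procedure.

```python
import math

def cut_and_count_z(signal, background, first_bin):
    """S / sqrt(B) using only score bins at or above `first_bin`."""
    s = sum(signal[first_bin:])
    b = sum(background[first_bin:])
    return s / math.sqrt(b)

def weighted_z(signal, background, weights):
    """Significance of a weighted sum: sum(w * s) / sqrt(sum(w^2 * b))."""
    num = sum(w * s for w, s in zip(weights, signal))
    var = sum(w * w * b for w, b in zip(weights, background))
    return num / math.sqrt(var)

# Hypothetical expected yields per anomaly-score bin, ordered low score to high score.
signal = [5.0, 10.0, 20.0, 30.0, 35.0]
background = [4000.0, 1500.0, 400.0, 80.0, 20.0]

best_cut = max(cut_and_count_z(signal, background, i) for i in range(len(signal)))
lr_weights = [s / b for s, b in zip(signal, background)]
reweighted = weighted_z(signal, background, lr_weights)
print(f"best cut-and-count Z = {best_cut:.2f}, reweighted Z = {reweighted:.2f}")
```

In this approximation the weighted significance equals the square root of the sum of s²/b over bins, which by the Cauchy-Schwarz inequality is never smaller than the significance of any single cut. A cut is the special case where every weight is either 0 or 1.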

Figure 3

Why It Matters

Anti-isolated quarkonia (heavy particles made of a quark and its antimatter partner) probe QCD fragmentation, the process by which collision energy gradually congeals into massive bound particles. Think of it like cooling steam condensing into water droplets. Understanding how Upsilons form within jets tests the boundary between perturbative QCD, where the underlying math is tractable, and non-perturbative physics, where it isn’t. These measurements could sharpen fragmentation models for bottomonium-in-jets, just as earlier charmonium measurements did.

But the method itself may be the bigger deal. Anomaly detection at the LHC has been tested on synthetic benchmarks and applied by CMS and ATLAS to search for new physics, so far without significant excesses. A previous open data study rediscovered the top quark, but in signal regions people already knew about.

This paper is different. It finds a real signal in real data in a region of phase space (the full range of particle configurations and energies an experiment can probe) that nobody had examined before, using methods that require no signal simulation. That’s the scenario anomaly detection was built for. The authors also published a curated benchmark dataset alongside their analysis code, giving the ML community a real-world testbed for new search algorithms.

Bottom Line: ML anomaly detection works on real collider data, not just in simulation. Pulling a 6.4σ Upsilon signal out of anti-isolated backgrounds that buried it at 1.6σ makes the case convincingly. It’s a working tool for finding new physics.

IAIFI Research Highlights

Interdisciplinary Research Achievement
Machine learning anomaly detection, a technique from the AI community, solves an open problem in experimental particle physics by extracting real signals from regions of phase space where traditional analysis falls short.
Impact on Artificial Intelligence
CATHODE-style density estimation holds up on real, messy experimental data. Likelihood-ratio reweighting outperforms cut-based anomaly score usage, which matters for how future anomaly detectors handle continuous scores.
Impact on Fundamental Interactions
This is the first observation of anti-isolated Υ → μ⁺μ⁻ decays at the LHC, providing a new channel to study QCD bottomonium fragmentation inside jets and to probe the nonperturbative regime of the strong force.
Outlook and References
Future work could apply these methods to search for genuinely unknown resonances in anti-isolated channels. The published CMS Open Data benchmark gives the community a shared testbed for developing anomaly detection tools. The paper is available at [arXiv:2502.14036](https://arxiv.org/abs/2502.14036).

Original Paper Details

Title
Isolating Unisolated Upsilons with Anomaly Detection in CMS Open Data
arXiv ID
2502.14036