Disentangling Quarks and Gluons with CMS Open Data

The Big Picture

Imagine trying to figure out whether a crowd came from two different cities without asking anyone where they’re from. All you can do is observe their behavior, clothes, and speech patterns, then use statistics to infer the mixture. That’s the challenge physicists face when looking at “jets,” the sprays of particles that shoot out when protons smash together inside the Large Hadron Collider.

Every proton collision involves quarks and gluons, bound together by the strong nuclear force as described by quantum chromodynamics (QCD). You can never catch a single quark or gluon in isolation. Color confinement ensures that a quark or gluon knocked free immediately showers and hadronizes into a collimated spray of dozens of other particles. Those sprays are the jets.

When you record a jet in a detector, you know something initiated it, but was it a quark or a gluon? The two types produce different-looking sprays, and telling them apart matters for testing QCD and hunting for new physics. The catch: labeling any given jet as “quark-initiated” or “gluon-initiated” is fundamentally ambiguous. The detector sees the spray, never the source.

Patrick Komiske, Serhii Kryhin, and Jesse Thaler at MIT’s Center for Theoretical Physics went after this problem using publicly available CMS detector data: 2.3 fb⁻¹ of real proton-proton collisions at 7 TeV. They borrowed a technique from text mining, applied it to actual LHC data, and used machine learning to pull apart quark and gluon distributions from samples that only ever show their mixture.

By treating quark and gluon jets like “topics” in a statistical mixture, the researchers extracted individual quark and gluon distributions directly from real LHC data, without simulation labels, using open-access CMS collisions from 2011.

How It Works

The approach is called jet topic modeling, adapted from natural language processing. In text mining, a topic model identifies latent themes in document collections: a document might be 70% politics and 30% economics. A jet sample works the same way. It’s a mixture of quark and gluon jets, and given two samples with different mixing fractions, you can solve for the underlying pure distributions.
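
In symbols, the two-sample mixture can be written as below. This is a minimal statement of the model; the sample labels A and B and the quark fractions f_A and f_B are notation assumed here for illustration, not taken from the paper.

```latex
\begin{aligned}
p_A(x) &= f_A\, p_q(x) + (1 - f_A)\, p_g(x), \\
p_B(x) &= f_B\, p_q(x) + (1 - f_B)\, p_g(x), \qquad f_A \neq f_B .
\end{aligned}
```

Here x is any jet observable. The distributions p_A and p_B are measured, while p_q, p_g, f_A, and f_B are all unknown, so the system can only be solved with an extra assumption such as mutual irreducibility (each pure category dominates somewhere in phase space).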

The two mixed samples come from splitting jets by pseudorapidity (η), a measure of a jet’s angle relative to the beam. Forward jets (|η| > 0.65) tend to be quark-enriched, because they are more often initiated by valence quarks carrying a large share of the proton’s momentum, while central jets (|η| < 0.65) carry more gluons. Neither sample is pure, but their differences are enough to work with.
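
A minimal sketch of that split, assuming jet pseudorapidities are already available as a NumPy array. The function names and the toy values are hypothetical; only the 0.65 boundary comes from the paper's abstract.

```python
import numpy as np

ETA_CUT = 0.65  # boundary between central and forward jets, from the abstract


def pseudorapidity(theta):
    """Pseudorapidity from the polar angle theta (radians) relative to the beam."""
    return -np.log(np.tan(theta / 2.0))


def split_by_eta(eta, cut=ETA_CUT):
    """Return boolean masks (central, forward) for an array of jet eta values."""
    abs_eta = np.abs(eta)
    central = abs_eta < cut
    forward = abs_eta > cut
    return central, forward


# Toy jet pseudorapidities (hypothetical values, for illustration only).
jet_eta = np.array([-2.1, -0.3, 0.0, 0.64, 0.66, 1.8])
central, forward = split_by_eta(jet_eta)
```

A jet perpendicular to the beam (theta = π/2) has η = 0 and lands in the central sample; the two masks never overlap, matching the non-overlapping samples the abstract describes.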

The next step is extracting the reducibility factors: the largest fraction of one sample’s distribution that can be subtracted from the other’s without driving it negative anywhere. These two numbers fix the quark fraction in each sample. The paper presents and compares three approaches:

Anchor bin method (existing): Identify regions of phase space that are nearly 100% quark or gluon and use them as statistical anchors. In practice, those regions are rare, and statistical noise degrades precision fast.
Log-likelihood ratio fit (L-fit): Fit probability distributions across all data using quantile binning, dividing data into equally populated buckets. More data contribute to the estimate, which improves stability.
ROC curve fit (R-fit): The newest of the three. Train a machine learning classifier to distinguish central from forward jets, then analyze the endpoints of the resulting receiver operating characteristic (ROC) curve. These endpoints encode the reducibility factors directly, bypassing binning altogether.
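
To make the role of the reducibility factors concrete, here is a toy sketch of the simplest of the three, the anchor bin idea, on exact binned distributions. It assumes two pure categories that are mutually irreducible and takes each reducibility factor as the minimum bin-by-bin ratio of the two mixed histograms. The function names, the five-bin histograms, and the 0.6 and 0.4 quark fractions are all made up for illustration; this is not the authors' code and it ignores statistical noise.

```python
import numpy as np


def reducibility_factor(p_num, p_den):
    """kappa = minimum over bins of p_num / p_den (bins with p_den > 0 only)."""
    mask = p_den > 0
    return np.min(p_num[mask] / p_den[mask])


def demix(p_a, p_b):
    """Recover two topics and their fractions from two mixed histograms.

    Sample A is taken to be the one richer in topic 1.
    """
    k_ab = reducibility_factor(p_a, p_b)
    k_ba = reducibility_factor(p_b, p_a)
    topic_1 = (p_a - k_ab * p_b) / (1.0 - k_ab)
    topic_2 = (p_b - k_ba * p_a) / (1.0 - k_ba)
    f_a = (1.0 - k_ab) / (1.0 - k_ab * k_ba)  # topic-1 fraction in sample A
    f_b = k_ba * f_a                          # topic-1 fraction in sample B
    return topic_1, topic_2, f_a, f_b


# Hypothetical pure distributions: each vanishes in one bin (the "anchor" bins).
p_q = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
p_g = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

# Hypothetical quark fractions in the two mixed samples.
true_f_a, true_f_b = 0.6, 0.4
p_a = true_f_a * p_q + (1 - true_f_a) * p_g
p_b = true_f_b * p_q + (1 - true_f_b) * p_g

topic_1, topic_2, f_a, f_b = demix(p_a, p_b)
```

With noiseless inputs the pure distributions and fractions come back exactly. With real data the minimum ratio sits in sparsely populated bins, which is the weakness the two fit-based methods are meant to address.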

The R-fit turned out to be the most stable option. By parametrizing only the endpoints of the ROC curve rather than its full shape, it sidesteps sensitivity to training imperfections and statistical fluctuations in the body of the distribution.
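
The link between ROC endpoints and reducibility factors can be illustrated with a toy. For a classifier that thresholds on the likelihood ratio between the two samples, the slope of the ROC curve at each end equals the largest or smallest value of that ratio. The sketch below builds such a ROC curve from exact histograms and reads off the end slopes. The paper's R-fit instead trains a classifier and fits a parametrized form near the endpoints, so treat this only as an idealized illustration with hypothetical names and numbers.

```python
import numpy as np


def roc_from_histograms(p_sig, p_bkg):
    """ROC curve of the likelihood-ratio classifier for two binned distributions.

    Assumes p_bkg > 0 in every bin. Returns (fpr, tpr), both starting at 0.
    """
    ratio = p_sig / p_bkg
    order = np.argsort(-ratio)  # most signal-like bins first
    tpr = np.concatenate([[0.0], np.cumsum(p_sig[order])])
    fpr = np.concatenate([[0.0], np.cumsum(p_bkg[order])])
    return fpr, tpr


def endpoint_slopes(fpr, tpr):
    """Slopes of the first and last ROC segments."""
    slope_start = (tpr[1] - tpr[0]) / (fpr[1] - fpr[0])
    slope_end = (tpr[-1] - tpr[-2]) / (fpr[-1] - fpr[-2])
    return slope_start, slope_end


# Hypothetical mixed samples: A has quark fraction 0.6, B has 0.4.
p_q = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
p_g = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
p_a = 0.6 * p_q + 0.4 * p_g
p_b = 0.4 * p_q + 0.6 * p_g

fpr, tpr = roc_from_histograms(p_a, p_b)  # sample A as "signal", B as "background"
slope_start, slope_end = endpoint_slopes(fpr, tpr)

kappa_ab = slope_end          # min of p_a / p_b
kappa_ba = 1.0 / slope_start  # min of p_b / p_a
f_a = (1.0 - kappa_ab) / (1.0 - kappa_ab * kappa_ba)
f_b = kappa_ba * f_a
```

Only the two end slopes enter the fractions, which is why a method that constrains the endpoints does not need the middle of the curve to be modeled precisely.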

There’s a catch, though. The CMS detector doesn’t respond to jets identically at all pseudorapidities. A gluon jet in the central region gets measured differently than one in the forward region, not because the physics changes, but because the detector does. This sample dependence would introduce spurious differences between the two jet samples, corrupting the topic modeling.

To handle this, the team applied OmniFold, an ML-based unfolding technique that statistically removes detector distortions to recover the underlying particle-level distributions. To the authors’ knowledge, this was the first application of full phase-space unfolding to real collider data. After unfolding, the extracted quark/gluon fractions agreed well with predictions from Pythia, a standard Monte Carlo event generator used at the LHC.
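
The two-step reweighting at the heart of this kind of unfolding can be sketched in one dimension. OmniFold itself is unbinned and estimates likelihood ratios with neural-network classifiers over the full phase space; the toy below swaps those classifiers for simple histogram ratios, which reduces the procedure to binned iterative unfolding. Everything here (function names, Gaussian toy data, smearing width, binning) is an assumption made for illustration, not the paper's setup.

```python
import numpy as np


def bin_index(x, edges):
    """Index of the bin containing each value (overflow goes to the edge bins)."""
    return np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)


def hist_ratio(x_num, w_num, x_den, w_den, edges):
    """Per-bin ratio of two weighted histograms; empty denominators give 1."""
    h_num, _ = np.histogram(x_num, bins=edges, weights=w_num)
    h_den, _ = np.histogram(x_den, bins=edges, weights=w_den)
    out = np.ones_like(h_num, dtype=float)
    return np.divide(h_num, h_den, out=out, where=h_den > 0)


def unfold_binned(t_sim, m_sim, m_dat, edges, n_iter=5):
    """Return particle-level weights for the simulated events.

    t_sim, m_sim: paired particle-level and detector-level values in simulation.
    m_dat: detector-level values in data.
    """
    nu = np.full(len(t_sim), 1.0 / len(t_sim))
    w_dat = np.full(len(m_dat), 1.0 / len(m_dat))
    i_m, i_t = bin_index(m_sim, edges), bin_index(t_sim, edges)
    for _ in range(n_iter):
        # Step 1: reweight simulation to match data at detector level.
        omega = nu * hist_ratio(m_dat, w_dat, m_sim, nu, edges)[i_m]
        # Step 2: turn those weights into a function of particle-level values.
        nu = nu * hist_ratio(t_sim, omega, t_sim, nu, edges)[i_t]
    return nu


# Toy setup: "nature" is shifted by 0.3 relative to simulation; the detector
# adds Gaussian smearing of width 0.5 in both.
rng = np.random.default_rng(0)
n = 50_000
t_sim = rng.normal(0.0, 1.0, n)
m_sim = t_sim + rng.normal(0.0, 0.5, n)
t_dat = rng.normal(0.3, 1.0, n)  # hidden truth, used only to generate m_dat
m_dat = t_dat + rng.normal(0.0, 0.5, n)

edges = np.linspace(-6.0, 6.0, 49)
nu = unfold_binned(t_sim, m_sim, m_dat, edges)
unfolded_mean = np.average(t_sim, weights=nu)
```

In this toy the weighted particle-level mean moves from near 0 toward the hidden shift of 0.3. The appeal of the unbinned version is that the same per-event weights can then be used for any observable, which is what makes it natural to unfold before topic modeling.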

Why It Matters

What comes out is a set of extracted quark and gluon jet distributions (substructure shapes, tagging performance curves, rapidity spectra) derived empirically from data, not assumed from theory or simulation.

One finding offers a clean validation: Casimir scaling of intrinsic dimensionality. The “intrinsic dimensionality” of a jet captures how many independent degrees of freedom describe it, roughly how complex the particle spray is. QCD predicts this quantity should scale with the Casimir factor of the initiating particle: 4/3 for quarks, 3 for gluons.
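
For reference, the two color factors and their ratio follow from the number of colors in QCD. This is standard SU(3) group theory, written out here as a worked step and not quoted from the paper.

```latex
C_F = \frac{N_c^2 - 1}{2 N_c} = \frac{4}{3}, \qquad
C_A = N_c = 3, \qquad
\frac{C_A}{C_F} = \frac{9}{4} = 2.25 \quad (N_c = 3).
```

Casimir scaling of a quantity therefore means a gluon-to-quark ratio of 9/4.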

The intrinsic dimensionality extracted from the unfolded quark and gluon samples exhibits this Casimir scaling. A textbook QCD prediction, borne out with machine learning on publicly available data. No proprietary datasets, no special experimental access.

Jet topic modeling works on real experimental data. Open LHC data can now be a testbed for QCD measurements that previously seemed to require direct quark/gluon labels. On the machine learning side, methods like OmniFold and ROC-curve-based inference hold up against the messiness of real-world data, not just clean simulations.

Combining jet topic modeling with ML-based unfolding on CMS Open Data, the authors separate quark and gluon jets from data alone, validate QCD’s Casimir scaling prediction, and perform what is, to their knowledge, the first full phase-space unfolding of real LHC data.

IAIFI Research Highlights

Interdisciplinary Research Achievement
The work brings statistical topic modeling from computational linguistics into experimental particle physics, extracting fundamental QCD properties from real LHC data without requiring direct quark or gluon labels.

Impact on Artificial Intelligence
The paper introduces a ROC-curve-fit method and uses OmniFold to perform what is, to the authors’ knowledge, the first full phase-space unfolding of real collider data, showing that ML-based inference can handle the statistical and systematic challenges of actual experimental environments.

Impact on Fundamental Interactions
Extracting separate quark and gluon jet distributions from CMS Open Data, and confirming Casimir scaling in their intrinsic dimensionality, gives physicists a new empirical handle on QCD jet physics without relying on Monte Carlo simulation assumptions.

Outlook and References
Future work will extend this approach to include full systematic uncertainty analyses and apply topic modeling to other mixed jet samples at the LHC. The full analysis is publicly available; the paper appears as [arXiv:2205.04459](https://arxiv.org/abs/2205.04459).

Original Paper Details

Title
Disentangling Quarks and Gluons with CMS Open Data

arXiv ID
2205.04459

Authors
Patrick T. Komiske, Serhii Kryhin, Jesse Thaler

Abstract
We study quark and gluon jets separately using public collider data from the CMS experiment. Our analysis is based on 2.3/fb of proton-proton collisions at 7 TeV, collected at the Large Hadron Collider in 2011. We define two non-overlapping samples via a pseudorapidity cut -- central jets with |eta| < 0.65 and forward jets with |eta| > 0.65 -- and employ jet topic modeling to extract individual distributions for the maximally separable categories. Under certain assumptions, such as sample independence and mutual irreducibility, these categories correspond to "quark" and "gluon" jets, as given by a recently proposed operational definition. We consider a number of different methods for extracting reducibility factors from the central and forward datasets, from which the fractions of quark jets in each sample can be determined. The greatest stability and robustness to statistical uncertainties is achieved by a novel method based on parametrizing the endpoints of a receiver operating characteristic (ROC) curve. To mitigate detector effects, which would otherwise induce unphysical differences between central and forward jets, we use the OmniFold method to perform central value unfolding. As a demonstration of the power of this method, we extract the intrinsic dimensionality of the quark and gluon jet samples, which exhibit Casimir scaling, as expected from the strongly-ordered limit. To our knowledge, this work is the first application of full phase space unfolding to real collider data, and one of the first applications of topic modeling to extract separate quark and gluon distributions at the LHC.
