The Pareto Frontier of Resilient Jet Tagging
Authors
Rikab Gambhir, Matt LeBlanc, Yuanchen Zhou
Abstract
Classifying hadronic jets using their constituents' kinematic information is a critical task in modern high-energy collider physics. Often, classifiers are designed by targeting the best performance using metrics such as accuracy, AUC, or rejection rates. However, the use of a single metric can lead to the use of architectures that are more model-dependent than competitive alternatives, leading to potential uncertainty and bias in analysis. We explore such trade-offs and demonstrate the consequences of using networks with high performance metrics but low resilience.
Concepts
The Big Picture
Imagine hiring the world’s best sprinter to run a marathon. On a dry track, they’re unbeatable. In mud or at altitude, a steady ultrarunner finishes while the sprinter falters. In high-energy physics, machine learning classifiers face exactly this dilemma.
At CERN’s Large Hadron Collider, protons smash together millions of times per second, producing tight sprays of particles called jets. These debris cones carry fingerprints of the particle that produced them. Figuring out which kind of particle spawned a given jet is called jet tagging, and it’s central to particle physics: searches for undiscovered particles and precision measurements of known ones both depend on it.
Over the past decade, physicists have turned to increasingly powerful AI architectures, chasing ever-higher scores on standard benchmarks. But a new study from IAIFI-affiliated researchers raises an uncomfortable question: what if those higher test scores are making our physics analyses worse?
Key Insight: The most accurate jet-tagging models are often the least reliable on real data, because they’ve learned quirks of their training simulations rather than genuine physics. The researchers map out a trade-off between raw performance and robustness that constrains all current architectures.
How It Works
The core problem is simulation dependence. Real collisions are too complex for exact equations, so researchers rely on Monte Carlo event generators: software that simulates millions of virtual collisions. No simulator perfectly captures nature. Two leading packages, PYTHIA and HERWIG, make different modeling choices that produce subtly different jet distributions. A classifier trained on PYTHIA may score brilliantly on PYTHIA test data, then degrade when faced with HERWIG data or real detector output.
The researchers introduce resilience as a metric: how much a classifier’s performance drops when it is tested on a different simulator than the one it was trained on. They measure it as the percent difference in AUC (area under the ROC curve, a standard score where 1.0 is perfect and 0.5 is no better than a coin flip) between evaluations on PYTHIA and HERWIG test samples. A resilient model shows little change; a fragile one diverges.
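For concreteness, here is a minimal sketch of how such a resilience score could be computed. The model interface and array names are illustrative assumptions, not code from the paper:

```python
# Sketch of the resilience metric described above. The scikit-learn-style
# model interface and the *_features/*_labels arrays are assumptions for
# illustration, not the paper's actual code.
from sklearn.metrics import roc_auc_score

def resilience(model, pythia_features, pythia_labels,
               herwig_features, herwig_labels):
    """Percent change in AUC when a PYTHIA-trained model is evaluated
    on HERWIG test data instead of PYTHIA test data."""
    scores_pythia = model.predict_proba(pythia_features)[:, 1]
    scores_herwig = model.predict_proba(herwig_features)[:, 1]
    auc_pythia = roc_auc_score(pythia_labels, scores_pythia)
    auc_herwig = roc_auc_score(herwig_labels, scores_herwig)
    # Small percent difference -> resilient; large -> fragile.
    return 100.0 * abs(auc_pythia - auc_herwig) / auc_pythia
```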
They tested five architecture classes of increasing complexity:
- Expert features: hand-crafted physics variables like angularities (how spread out a jet is) and multiplicities (particle count inside the jet)
- Deep Neural Networks (DNNs): 2–10 hidden layers with varying neuron counts
- Particle-Flow Networks (PFNs) and Energy-Flow Networks (EFNs): architectures treating each jet as an unordered particle cloud, built so that results don’t change if particles are reordered (see the sketch after this list)
- Particle Transformer (ParT): the current top-performing attention-based architecture
All models received only raw kinematic information (particle momenta and angles) and were trained on two tasks: quark/gluon discrimination and boosted top-quark tagging.
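The permutation invariance of the particle-cloud architectures is easiest to see in code. Below is a minimal Deep Sets-style sketch in PyTorch; the layer sizes and class name are illustrative assumptions, not the paper’s configuration. A per-particle network feeds a sum over particles, so shuffling the particles cannot change the output:

```python
import torch
import torch.nn as nn

class TinyPFN(nn.Module):
    """Deep Sets-style jet classifier: a per-particle network `phi`, a
    permutation-invariant sum over particles, then a jet-level network `f`.
    Sizes are illustrative, not the paper's configuration."""
    def __init__(self, n_features=3, latent_dim=8):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(n_features, 50), nn.ReLU(),
                                 nn.Linear(50, latent_dim))
        self.f = nn.Sequential(nn.Linear(latent_dim, 50), nn.ReLU(),
                               nn.Linear(50, 2))

    def forward(self, particles):
        # particles: (batch, n_particles, n_features), e.g. (pT, eta, phi).
        # Real implementations also mask zero-padded particle slots.
        latent = self.phi(particles).sum(dim=1)  # the sum erases ordering
        return self.f(latent)
```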

Plot the AUC against resilience and the trade-off jumps out. The frontier traces a clear curve. The Particle Transformer hits the highest raw AUC, but its resilience drops substantially. EFNs and simple expert features sit in the opposite corner: lower peak performance, but far more stable across simulators. Models in the “Pareto-excluded” region are simply inferior, beaten on both metrics by another architecture.
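Identifying the Pareto-excluded models is mechanical once each architecture has an (AUC, resilience) pair. A small illustrative helper, not the paper’s code:

```python
def pareto_frontier(models):
    """Given (name, auc, resilience) tuples, where higher AUC is better and
    lower resilience (percent AUC change) is better, return the models that
    are not dominated on both metrics. Illustrative sketch only."""
    frontier = []
    for name, auc, res in models:
        dominated = any(
            auc2 >= auc and res2 <= res and (auc2, res2) != (auc, res)
            for _, auc2, res2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier
```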
The Knowledge Distillation Attempt
Could training tricks break through the frontier? The team tested knowledge distillation, where a smaller “student” model learns to mimic a larger, more powerful “teacher,” hoping to inherit the teacher’s accuracy while keeping the student’s stability.
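For readers unfamiliar with the technique, the standard Hinton-style distillation objective blends the usual hard-label loss with a soft term that pulls the student toward the teacher’s temperature-scaled predictions. This is a generic sketch of the method, with illustrative hyperparameters, not necessarily the paper’s exact setup:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Knowledge distillation: mix the hard-label cross-entropy with a KL
    term pulling the student's softened predictions toward the teacher's.
    Temperature and alpha are illustrative, not the paper's values."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```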

Students did improve beyond naive linear interpolation between teacher and baseline, so distillation delivered real gains. But no distilled student pushed past the existing Pareto frontier. Whatever sets that boundary seems to run deeper than any training trick can reach.
A Real-World Consequence
The team ran a concrete case study: estimating the flavor mixture fraction κ (the proportion of quark jets versus gluon jets in a mixed sample) using two PFNs at different points on the Pareto frontier.

The large, high-AUC PFN (latent dimension 128, 250 nodes per hidden layer) produces biased estimates of κ when test data comes from a different simulator than training data. The small, resilient PFN (latent dimension 8, 50 nodes per hidden layer) yields less precise estimates under ideal conditions but far more accurate ones when simulator mismatch is present. By conventional metrics, the “better” classifier actively misleads a downstream physics analysis.
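To make the failure mode concrete, here is one standard way such a fraction fit can be set up. This is a simplified template-fit sketch under assumptions of our own (the paper’s procedure may differ): build classifier-score templates from pure quark and gluon simulation, then fit the mixture weight that best matches the mixed sample.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_kappa(scores_mixed, scores_quark, scores_gluon, n_bins=20):
    """Template fit for the quark fraction kappa: model the mixed sample's
    classifier-score histogram as kappa * quark + (1 - kappa) * gluon and
    maximize the binned likelihood. Illustrative sketch only."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    q, _ = np.histogram(scores_quark, bins=bins, density=True)
    g, _ = np.histogram(scores_gluon, bins=bins, density=True)
    counts, _ = np.histogram(scores_mixed, bins=bins)

    def neg_log_likelihood(kappa):
        mix = kappa * q + (1.0 - kappa) * g + 1e-12  # avoid log(0)
        return -np.sum(counts * np.log(mix))

    return minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0),
                           method="bounded").x
```

If a fragile classifier’s score distribution shifts between simulators, the templates no longer describe the mixed sample, and the fitted κ is pulled away from the true value, which is exactly the bias the case study exposes.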
Why It Matters
The LHC’s Run 3 is underway, and the upcoming High-Luminosity LHC will produce data at vastly higher rates. Graph networks, transformers, and foundation models are all moving into production at the experiments. If their performance is still measured purely on simulation-matched test sets, those scores may say nothing about real-world reliability.
The same failure mode shows up across machine learning: models trained and tested under matched conditions turn brittle the moment real-world data shifts even slightly. The Pareto frontier gives this problem a proper framework. Classifier quality isn’t a single number. It’s a curve in a multidimensional space.
Open questions remain. Does the frontier shift as training datasets grow, or does it hit a hard limit set by the information content of jet constituents? Can physics-informed architectures push the boundary? How does the picture change when you fold in detector uncertainties, pile-up effects, and domain shifts between LHC runs?
Bottom Line: Chasing AUC alone builds jet taggers that are fast but fragile. The Pareto frontier shows there’s no free lunch: physicists should treat resilience as a first-class benchmark alongside accuracy, or risk biasing the very measurements they’re trying to make.
IAIFI Research Highlights
- This work imports the Pareto frontier concept from economics and optimization theory into particle physics classifier design, making visible a trade-off that raw performance metrics hide.
- Knowledge distillation provides genuine gains but cannot overcome the performance-robustness trade-off, a cautionary result for practitioners who rely on distillation to compress brittle models.
- High-AUC jet taggers can systematically bias measurements like quark/gluon fraction estimation. This raises questions about standard practices at the LHC and argues for resilience-aware classifier design in precision analyses.
- Future work includes extending the Pareto framework to detector effects, pile-up, and future collider conditions, and exploring whether physics-informed architectures can shift the frontier; the paper is available at [arXiv:2509.19431](https://arxiv.org/abs/2509.19431).
Original Paper Details
The Pareto Frontier of Resilient Jet Tagging
[arXiv:2509.19431](https://arxiv.org/abs/2509.19431)
Rikab Gambhir, Matt LeBlanc, Yuanchen Zhou
Classifying hadronic jets using their constituents' kinematic information is a critical task in modern high-energy collider physics. Often, classifiers are designed by targeting the best performance using metrics such as accuracy, AUC, or rejection rates. However, the use of a single metric can lead to the use of architectures that are more model-dependent than competitive alternatives, leading to potential uncertainty and bias in analysis. We explore such trade-offs and demonstrate the consequences of using networks with high performance metrics but low resilience.