Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model
Authors
Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin
Abstract
A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, the SGD training has no *a priori* preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the noise in the labels is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the noise in the labels spectrum. We reproduce this key insight in controlled settings with different model architectures such as MLP, RNN, and transformers.
Concepts
The Big Picture
Imagine you’re hiking through mountains, looking for the lowest valley. But the terrain is strange: dozens of valleys all sit at exactly the same depth. Do you wander into a wide, gentle bowl, or a narrow, knife-edge trench? The valley you end up in could determine how well you survive the next earthquake. In machine learning, it determines how well your model generalizes to new data.
This question sits at the center of one of deep learning’s most persistent debates. When you train a neural network, the algorithm makes tiny adjustments to millions of numerical settings, searching for configurations that produce fewer errors. The standard tool is stochastic gradient descent (SGD), which uses random subsets of training data rather than the whole dataset at once, navigating what researchers call a loss landscape: a mathematical terrain where every point represents a different network configuration and the height represents how many errors it makes.
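To make the mechanics concrete, here is a minimal mini-batch SGD loop in NumPy; the quadratic loss and synthetic data are illustrative stand-ins, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: labels come from a linear rule plus a little noise.
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

w = np.zeros(10)              # the "configuration" being adjusted
lr, batch_size = 0.05, 16

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random subset of the data
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of the mini-batch squared error
    w -= lr * grad                                # one small step downhill

print("final training loss:", np.mean((X @ w - y) ** 2))
```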
For decades, researchers have believed that SGD has an innate preference for “flat” minima: wide valleys where the loss barely changes even if you nudge the parameters. Flat regions, the intuition goes, are more robust and generalize better.
But the evidence kept undercutting that story. Modern large models routinely converge to “sharp” minima (narrow, steep troughs) and work just fine. So which is it?
A new paper from researchers at MIT and EPFL offers a way through the impasse: an exactly solvable model that exhibits both behaviors, and a precise mathematical answer for when each occurs.
Key Insight: SGD has no inherent preference for flatness. Instead, it minimizes gradient fluctuations. Whether that produces a flat or sharp minimum depends entirely on the geometry of the noise in your training labels.
How It Works
The team builds on a theoretical result that has gained traction in recent years: SGD naturally gravitates toward solutions with small gradient fluctuations, meaning solutions where the error signal used to update the network doesn’t jump around much from one mini-batch to the next. This “minimal-fluctuation” principle is subtly different from seeking flat minima. Flatness is measured by the Hessian, a matrix capturing how steeply the loss curves in every direction. It sounds related to gradient fluctuation, but the two are not interchangeable. They line up only under special conditions.
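To see why the two quantities are genuinely different objects, consider plain linear regression, where both can be computed exactly. In the sketch below (the data and noise scale are made-up illustrations), the Hessian depends only on the inputs, while the gradient fluctuation at the minimum is driven by the label noise:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 512, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=N)  # noisy labels

w = np.linalg.lstsq(X, y, rcond=None)[0]               # least-squares minimizer

# Flatness: Hessian of the mean squared error, H = 2 X^T X / N.
# For a linear model it is constant and independent of the label noise.
H = 2 * X.T @ X / N
sharpness = np.linalg.eigvalsh(H)[-1]                  # largest Hessian eigenvalue

# Gradient fluctuation: covariance of per-example gradients at the minimum.
# This is set by the residuals, i.e., by the label noise.
per_example_grads = 2 * X * (X @ w - y)[:, None]
fluctuation = np.trace(np.cov(per_example_grads.T))

print(f"sharpness (top Hessian eigenvalue): {sharpness:.3f}")
print(f"gradient fluctuation (trace of gradient covariance): {fluctuation:.3f}")
```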
To pin down exactly when they diverge, the researchers constructed an exact test case: a deep linear network (multiple matrix layers chained together) trained on data from a linear teacher with noisy labels. Simple enough to solve in closed form, yet rich enough to produce nontrivial sharpness behavior.
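A sketch of that setup, with illustrative shapes and noise scale rather than the paper's exact configuration: a two-layer linear student trained by mini-batch SGD on data from a noisy linear teacher.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hidden, d_out, N = 8, 16, 4, 1024

# Linear teacher with noisy labels: y = T x + noise.
T = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(N, d_in))
noise_std = np.array([1.0, 1.0, 1.0, 1.0])   # per-output noise scale (isotropic here)
Y = X @ T.T + rng.normal(size=(N, d_out)) * noise_std

# Deep linear student: f(x) = W2 W1 x, initialized small.
W1 = 0.1 * rng.normal(size=(d_hidden, d_in))
W2 = 0.1 * rng.normal(size=(d_out, d_hidden))

lr, batch = 0.01, 32
for step in range(5000):
    idx = rng.choice(N, size=batch, replace=False)
    Xb, Yb = X[idx], Y[idx]
    err = Xb @ W1.T @ W2.T - Yb               # mini-batch residuals
    gW2 = 2 * err.T @ (Xb @ W1.T) / batch     # dL/dW2
    gW1 = 2 * (W2.T @ err.T) @ Xb / batch     # dL/dW1
    W2 -= lr * gW2
    W1 -= lr * gW1

print("distance to teacher:", np.linalg.norm(W2 @ W1 - T))
print("teacher norm:       ", np.linalg.norm(T))
```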

Within the solvable model, networks that start at wildly different sharpness levels converge to the same sharpness: a unique fixed point independent of initialization. That’s exactly what you want from a theory. Clean predictions, not noisy trajectories muddied by starting conditions.
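A scalar caricature of this fixed-point behavior, assuming nothing beyond a 1-D version of a deep linear model (f(x) = u·v·x fitting y = x + noise): start SGD at two minima with the same loss but very different sharpness, and watch u² + v², which sets the Hessian scale along the valley, drift toward the same balanced value.

```python
import numpy as np

rng = np.random.default_rng(4)

def run_sgd(u, v, steps=400_000, lr=0.005, sigma=0.5):
    """1-D deep linear toy: f(x) = u*v*x, teacher y = x + sigma*noise."""
    for _ in range(steps):
        x = rng.normal()
        y = x + sigma * rng.normal()
        r = u * v * x - y                     # residual on this single sample
        u, v = u - lr * 2 * r * v * x, v - lr * 2 * r * u * x
    return u * v, u ** 2 + v ** 2             # u^2 + v^2 tracks the sharpness

# Two initializations on the same zero-loss valley (u*v = 1), different sharpness.
for init in [(3.0, 1.0 / 3.0), (0.5, 2.0)]:
    prod, sharp = run_sgd(*init)
    print(f"init {init}: u*v = {prod:.2f}, u^2 + v^2 = {sharp:.2f}")
```

Both runs should end near u·v ≈ 1 with u² + v² ≈ 2, the balanced (and, in this 1-D case, flattest) point, regardless of where they started.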
The controlling variable turns out to be label noise anisotropy, meaning how evenly noise is distributed across different output dimensions; a short sketch contrasting the two regimes follows the list:
- Isotropic label noise (noise spread evenly across all output dimensions): SGD converges to the flattest possible solution among all equally good minima.
- Anisotropic label noise (noise concentrated in some dimensions more than others): SGD converges to a sharper solution. The sharpness scales with the imbalance, and extreme imbalance can push the model to arbitrarily sharp solutions.
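A small sketch of the two regimes (the covariance values are arbitrary illustrations); one simple way to quantify the imbalance is the eigenvalue ratio of the label-noise covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
d_out, N = 4, 1000

# Isotropic: every output dimension gets the same noise variance.
iso_cov = 0.5 * np.eye(d_out)

# Anisotropic: noise concentrated in one output dimension.
aniso_cov = np.diag([2.0, 0.01, 0.01, 0.01])

for name, cov in [("isotropic", iso_cov), ("anisotropic", aniso_cov)]:
    noise = rng.multivariate_normal(np.zeros(d_out), cov, size=N)
    eig = np.linalg.eigvalsh(cov)
    print(f"{name:>11}: empirical variances {noise.var(axis=0).round(2)}, "
          f"max/min noise eigenvalue = {eig[-1] / eig[0]:.0f}")
```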
The sharpness at convergence is a closed-form function of the data distribution, network depth, and the label noise covariance matrix. No approximations involved.

Does this hold outside the simplified model? The authors tested it on MLPs, RNNs, and transformers. In each case, tuning the anisotropy of label noise shifted the converged sharpness in the direction the theory predicted. Isotropic noise pushed toward flat solutions; anisotropic noise pushed toward sharp ones.
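The following PyTorch sketch mimics the MLP version of such a test; it is not the paper's protocol, and the architecture, noise levels, and the power-iteration sharpness probe are all illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, N = 10, 4, 2048
X = torch.randn(N, d_in)
T = torch.randn(d_out, d_in)                  # linear "teacher" for the targets

def train_and_measure(noise_std):
    """Train a small MLP with per-dimension label-noise stds; return sharpness."""
    Y = X @ T.T + torch.randn(N, d_out) * torch.tensor(noise_std)
    model = nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, d_out))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = nn.MSELoss()
    for step in range(4000):                  # plain mini-batch SGD
        idx = torch.randint(N, (32,))
        opt.zero_grad()
        loss_fn(model(X[idx]), Y[idx]).backward()
        opt.step()
    # Sharpness: largest Hessian eigenvalue, estimated by power iteration
    # on Hessian-vector products.
    params = list(model.parameters())
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(30):
        grads = torch.autograd.grad(loss_fn(model(X), Y), params, create_graph=True)
        hv = torch.autograd.grad(grads, params, grad_outputs=v)
        lam = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return lam

print("isotropic   label noise -> sharpness:", train_and_measure([1.0, 1.0, 1.0, 1.0]))
print("anisotropic label noise -> sharpness:", train_and_measure([2.0, 0.1, 0.1, 0.1]))
```

The prediction is directional: the anisotropic run should settle at a sharper solution, though how visible the gap is at this toy scale will vary from run to run; the paper's controlled experiments are more careful.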

Why It Matters
This reframes a decade of conflicting results. The apparent paradox of SGD sometimes seeking flatness and sometimes seeking sharpness was never a contradiction; researchers were looking at different data regimes without realizing that the geometry of the label noise was the hidden variable.
Real-world datasets have highly structured, anisotropic noise. Think of uneven label uncertainty across object categories in image classification, or lopsided prediction difficulty across tokens in language modeling. Under the minimal-fluctuation framework, this structure directly shapes what kind of minimum SGD finds.
What follows for practice? Training recipes that inject synthetic noise (data augmentation, label smoothing, dropout) implicitly tune noise anisotropy. Per this theory, they also tune the sharpness of the final model. And sharpness-aware methods like SAM, which explicitly push networks toward flat minima, are working either with or against SGD's own bias, depending on whether the label noise is isotropic or anisotropic.
If anisotropic noise can drive SGD to arbitrarily sharp solutions, a harder question follows: is sharpness even the right metric to monitor, or is gradient fluctuation the more fundamental quantity?
Bottom Line: SGD minimizes gradient fluctuation, not flatness. The structure of your data’s noise decides whether those two goals point in the same direction. This exactly solvable result gives the field a clean causal handle on one of deep learning’s most stubborn puzzles.
IAIFI Research Highlights
This work applies the mathematical toolkit of exactly solvable physics models to resolve a foundational question in deep learning theory, putting the IAIFI mission of physics-informed AI research into practice.
The paper provides an exact, causal account of when SGD prefers flat versus sharp minima, replacing a muddled empirical picture with a clear theoretical principle validated across MLP, RNN, and transformer architectures.
By framing optimization dynamics through noise geometry and gradient fluctuation theory, the work treats learning algorithms as physical processes governed by precise mathematical laws, opening new theoretical ground.
Future work could extend the minimal-fluctuation framework to real-world data distributions and explore how noise anisotropy interacts with sharpness-aware training methods. The paper by Xu, Beneventano, Chuang, and Ziyin is available at [arXiv:2602.05065](https://arxiv.org/abs/2602.05065).
Original Paper Details
Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model
2602.05065
Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin
A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, the SGD training has no *a priori* preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the noise in the labels is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the noise in the labels spectrum. We reproduce this key insight in controlled settings with different model architectures such as MLP, RNN, and transformers.