
Formation of Representations in Neural Networks

Foundational AI

Authors

Liu Ziyin, Isaac Chuang, Tomer Galanti, Tomaso Poggio

Abstract

Understanding neural representations will help open the black box of neural networks and advance our scientific understanding of modern AI systems. However, how complex, structured, and transferable representations emerge in modern neural networks has remained a mystery. Building on previous results, we propose the Canonical Representation Hypothesis (CRH), which posits a set of six alignment relations to universally govern the formation of representations in most hidden layers of a neural network. Under the CRH, the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned during training. This alignment implies that neural networks naturally learn compact representations, where neurons and weights are invariant to task-irrelevant transformations. We then show that the breaking of CRH leads to the emergence of reciprocal power-law relations between R, W, and G, which we refer to as the Polynomial Alignment Hypothesis (PAH). We present a minimal-assumption theory proving that the balance between gradient noise and regularization is crucial for the emergence of the canonical representation. The CRH and PAH lead to an exciting possibility of unifying major key deep learning phenomena, including neural collapse and the neural feature ansatz, in a single framework.

Concepts

representation learning, canonical representation hypothesis, polynomial alignment hypothesis, interpretability, feature extraction, neural collapse, eigenvalue decomposition, scalability, loss function design, transfer learning, kernel methods

The Big Picture

Imagine trying to understand how a sculptor thinks. You could study the finished statue, but you’d learn more watching the sculptor’s hands at work, noticing how tools, hands, and stone all move in synchrony. Researchers at MIT and Texas A&M have found something analogous inside neural networks: the internal patterns a network builds, the connection weights between its neurons, and the error signals (called gradients) guiding its learning all snap into alignment during training. This alignment may be a universal law governing how neural networks form their representations.

For decades, neural networks have been black boxes. We know they work well, but not why. What internal structures do they build, and what mathematical laws govern how those structures form? A team from MIT and Texas A&M has proposed a framework called the Canonical Representation Hypothesis that describes representation formation through just six equations.

Key Insight: Neural networks don’t learn representations by accident. Their internal activations, weights, and gradients are driven into mutual alignment during training, producing compact and transferable internal structure.

How It Works

The core idea is simple. Inside any hidden layer of a neural network, three mathematical objects are constantly in play: the latent representations R (what the network “sees” at that layer), the weights W (connection strengths between neurons), and the neuron gradients G (error signals driving learning). The Canonical Representation Hypothesis (CRH) proposes that after training, these three objects don’t merely coexist. They align with each other in a precise mathematical sense.

The team identifies three types of alignment, where covariance measures how two quantities vary together across the network:

  • Representation-Gradient Alignment (RGA): The covariance of activations becomes proportional to the covariance of gradients
  • Representation-Weight Alignment (RWA): The covariance of activations becomes proportional to the weight outer product
  • Gradient-Weight Alignment (GWA): The covariance of gradients becomes proportional to the weight outer product

Each layer has both a “pre-activation” side (the raw input before a neuron fires) and a “post-activation” side (the output after firing). The three relations come in forward and backward versions, yielding six alignment equations in total. When all six hold simultaneously, the network satisfies the CRH.
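One way to make these relations concrete (our illustration, not the paper's formal statements) is to measure how close two matrices are to being proportional, for instance with a cosine similarity between their flattened forms:

```python
import numpy as np

def alignment_score(A, B):
    """Cosine similarity between flattened matrices; 1.0 means A is proportional to B."""
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in tensors; in practice these come from a trained network's hidden layer.
rng = np.random.default_rng(0)
n, d = 500, 16
H = rng.normal(size=(n, d))   # latent representations R (one row per sample)
G = rng.normal(size=(n, d))   # neuron gradients at the same layer
W = rng.normal(size=(d, d))   # layer weight matrix

cov_R = H.T @ H / n           # representation covariance
cov_G = G.T @ G / n           # gradient covariance
WWt = W @ W.T                 # weight outer product

print("RGA:", alignment_score(cov_R, cov_G))
print("RWA:", alignment_score(cov_R, WWt))
print("GWA:", alignment_score(cov_G, WWt))
print("sanity:", alignment_score(cov_R, 3 * cov_R))  # proportional matrices give 1.0
```

Under the CRH, the three scores approach 1 as training converges; here the tensors are random placeholders, so the printed values only demonstrate the diagnostic itself.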

Figure 1

What drives this alignment? The researchers prove it emerges from a balance between two competing forces: gradient noise (the randomness in how training samples are drawn) and regularization (the pressure to keep weights small). Neither alone produces alignment.

Together, they act like opposing forces reaching equilibrium, pushing representations toward a canonical form. The proof requires few assumptions, so it applies broadly across architectures and tasks.
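A degenerate but easy-to-verify instance of this balance (a minimal sketch, not the paper's proof) appears already in the deterministic limit: for any model trained to stationarity with weight decay strength lam, the data-loss gradient must exactly cancel lam times the weights, so gradient and weights end up anti-parallel:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam, lr = 0.1, 0.01           # weight decay strength, learning rate
w = np.zeros(d)
for _ in range(20000):        # plain gradient descent on MSE + L2
    grad_data = X.T @ (X @ w - y) / n
    w -= lr * (grad_data + lam * w)

grad_data = X.T @ (X @ w - y) / n
cos = grad_data @ w / (np.linalg.norm(grad_data) * np.linalg.norm(w))
print(round(cos, 4))          # -1.0: at stationarity, grad_data = -lam * w
```

The full theory concerns stochastic gradients, where minibatch noise plays against this regularizing force; the deterministic run above shows only the regularization side of the equilibrium.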

Figure 2

Real networks are messy, and the CRH doesn’t always hold perfectly. When alignment breaks (in early layers or certain training regimes), the paper predicts something unexpected: R, W, and G don’t simply become uncorrelated. Instead, they follow reciprocal power-law relations, which the team calls the Polynomial Alignment Hypothesis (PAH). Rather than perfect proportionality, you get scaling relationships with characteristic exponents, analogous to phases of matter in physics.
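To illustrate what a PAH-style relation looks like in data (synthetic numbers with an assumed exponent, not results from the paper): if the spectra of two of the matrices obey a power law, the exponent is recovered as the slope of a log-log fit:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical spectra: eigenvalues of cov(R) related to those of W W^T by a power law
lam_W = np.logspace(-2, 1, 30)                 # eigenvalues of W W^T
alpha_true = 1.7                               # assumed scaling exponent
lam_R = lam_W ** alpha_true * np.exp(0.05 * rng.normal(size=30))  # noisy power law

# Fit the exponent by linear regression in log-log space
alpha_fit, _ = np.polyfit(np.log(lam_W), np.log(lam_R), 1)
print(round(alpha_fit, 2))                     # close to 1.7
```

The characteristic exponents that distinguish the different "phases" would be estimated from real network spectra in exactly this way.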

Figure 3

Why It Matters

The CRH turns out to be a Rosetta Stone for neural network theory. Two major phenomena in deep learning had been studied in isolation: neural collapse, where final-layer representations collapse onto a perfectly symmetric geometric structure in classifiers, and the neural feature ansatz, where weight matrices evolve according to gradient outer products. The CRH shows both are special cases of the same alignment mechanism.

Neural collapse emerges when the CRH holds in the penultimate layer of a classifier. The neural feature ansatz is recovered as a particular forward alignment relation. What looked like separate discoveries are really two views of one deeper law.
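Neural collapse is commonly quantified by the ratio of within-class to between-class feature variance, which vanishes as penultimate-layer features collapse onto their class means. A sketch with synthetic, already-collapsed features (hypothetical class means plus tiny noise):

```python
import numpy as np

rng = np.random.default_rng(3)
k, per, d = 4, 50, 8                        # classes, samples per class, feature dim
means = rng.normal(size=(k, d)) * 3         # hypothetical class means
feats = means.repeat(per, axis=0) + 1e-3 * rng.normal(size=(k * per, d))
labels = np.arange(k).repeat(per)

mu_g = feats.mean(axis=0)                   # global mean
within = between = 0.0
for c in range(k):
    fc = feats[labels == c]
    mu_c = fc.mean(axis=0)
    within += ((fc - mu_c) ** 2).sum() / len(feats)      # within-class scatter
    between += per * ((mu_c - mu_g) ** 2).sum() / len(feats)  # between-class scatter

print(round(within / between, 6))           # near 0 for collapsed features
```

Running this diagnostic layer by layer on a real classifier is one way to check where in the network the CRH-driven collapse sets in.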

There are practical implications too. If trained networks obey universal alignment laws, we can predict structural properties of a network’s internals without expensive experiments, guiding choices about architecture, initialization, and training dynamics. The framework could also help with mechanistic interpretability: knowing the geometry that representations converge toward lets us design probes and interventions that exploit that geometry, rather than treating the network as opaque.

Bottom Line: The Canonical Representation Hypothesis is a first candidate for a universal law of representation formation in neural networks, unifying previously disconnected phenomena and explaining why deep learning produces compact, transferable internal structure.

IAIFI Research Highlights

Interdisciplinary Research Achievement
The work applies concepts from statistical physics (phase transitions, scaling laws, thermodynamic balance) to explain how structure emerges in artificial neural networks, connecting AI theory with physics in a way that reflects IAIFI's core mission.
Impact on Artificial Intelligence
The CRH and PAH provide a unified theoretical framework that brings neural collapse, the neural feature ansatz, and emergent power-law scaling under a single set of equations.
Impact on Fundamental Interactions
Universal phases in neural network layers, governed by the balance between gradient noise and regularization, look a lot like physical phase transitions driven by competing forces. The analogy between learning systems and physical systems may be more than superficial.
Outlook and References
Future work could extend the CRH to attention-based architectures and use the framework to guide more efficient and interpretable model design. The paper, by Liu Ziyin, Isaac Chuang, Tomer Galanti, and Tomaso Poggio, appears at ICLR 2025 ([arXiv:2410.03006](https://arxiv.org/abs/2410.03006)).
