← Back to Timeline

I-Con: A Unifying Framework for Representation Learning

Foundational AI

Authors

Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

Abstract

As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

Concepts

representation learning contrastive learning kl divergence unification loss function design self-supervised learning clustering dimensionality reduction information geometry spectral methods embeddings kernel methods multi-task learning

The Big Picture

Imagine you’re a cartographer handed dozens of hand-drawn maps of the same territory, each made by a different explorer, each with its own symbols, projections, and scales. Some show mountains, some rivers, some roads. They all describe the same landscape, but they look nothing alike. Your job is to prove they’re the same map.

That’s roughly the situation machine learning researchers have faced for the past decade. The field of representation learning (teaching computers to extract meaningful structure from raw data) has produced a zoo of competing algorithms.

Some methods teach a model to recognize that two versions of the same image are “the same thing,” even when one is flipped, cropped, or color-shifted. That’s contrastive learning. Others compress complex data into something you can actually visualize: dimensionality reduction. Clustering algorithms sort data into groups automatically. And then there’s supervised learning, the everyday workhorse, where a model learns to match each image to a human-provided label.

Each approach was invented separately, justified on its own terms, and still occupies its own corner of the literature. Connecting them felt almost hopeless.

A new paper from MIT, Google, and Microsoft argues that all these maps describe the same territory, and that a single equation draws all of them at once. The researchers call their framework I-Con, short for Information Contrastive Learning. It doesn’t just unify existing methods theoretically: it uses those connections to build state-of-the-art unsupervised image classifiers.

Key Insight: I-Con shows that over 23 major machine learning algorithms, from t-SNE to CLIP to k-Means, are all minimizing the same underlying objective: a divergence between two probability distributions, one learned and one provided as supervision.

How It Works

The central mathematical object in I-Con is simpler than you’d expect. Given a data point, you can ask: which other data points is it related to, and how strongly? That question defines a conditional probability distribution, a ranked list of neighbors with scores for how closely related each one is. I-Con proposes that every representation learning method does one of three things with such distributions:

  1. Builds a supervisory distribution q based on external signals: class labels, image augmentations, graph structure, or nearest neighbors
  2. Builds a learned distribution p based on the model’s current embeddings, the compact numerical descriptions a neural network assigns to each data point
  3. Minimizes the KL divergence (a measure of dissimilarity between two distributions) between p and q, integrated across all data points

That’s it. One equation.

What makes this powerful is how those distributions get defined. A Gaussian neighborhood, a smooth bell-curve of similarity scores, gives you SNE. Swap in a heavy-tailed Student-T distribution, which spreads scores more broadly to better separate distant points, and you get t-SNE. Use uniform distributions over augmentation pairs and you recover SimCLR. Uniform distributions over class prototypes give you cross-entropy loss. Clustering-based distributions give you k-Means or spectral clustering.

Each choice snaps a different well-known algorithm into place like a puzzle piece.

The authors describe the result as a “periodic table” of representation learning, a grid where rows correspond to different learned distributions and columns to different supervisory signals. Every cell is a real, named algorithm. Some cells were previously empty, hinting at undiscovered methods hiding in the gaps.

Figure 1

The paper includes a wide array of formal proofs connecting specific algorithms to I-Con, spanning from Pearson’s PCA (1901) to CLIP (2021). The team then put the framework to work. By treating image clusters as the learned distribution and contrastive pairs as the supervisory signal (a combination the table exposes as natural but previously unexplored), they built a novel unsupervised classification pipeline.

On ImageNet-1K, one of computer vision’s hardest benchmarks, their method achieves a +8% improvement over the previous state of the art in unsupervised classification, with no labels used during training. The same approach also delivered gains of +3% on CIFAR-100 and +2% on STL-10.

I-Con also exposes a subtle flaw in contrastive learning: representation bias, where a model over-represents frequent patterns at the expense of rare ones. Because I-Con makes the loss structure explicit, the researchers derived a principled debiasing correction directly from the math, not a heuristic patch but a natural consequence of the framework.

Why It Matters

Unification papers are rare in machine learning. The field moves fast and pragmatically. A new trick works, researchers use it, another trick appears, they use that too. Asking why a trick works, and whether it shares DNA with something from a different subfield, takes a different kind of patience. When it pays off, the results can reshape how people think about the whole discipline.

I-Con’s immediate contribution is clarity. Researchers comparing contrastive learning to clustering, or dimensionality reduction to supervised classification, now share a common language. The framework turns “what happens if I borrow this supervisory signal and that learned distribution?” from guesswork into a systematic search. The +8% ImageNet result shows that the gaps in the table contain real treasures.

There’s a longer arc here too. Physics went through similar episodes: Maxwell’s equations unified electricity and magnetism; general relativity unified space, time, and gravity. Those unifications didn’t just simplify textbooks. They revealed new phenomena and enabled new predictions. I-Con hints that representation learning might be at a similar inflection point, where the apparent diversity of techniques was always masking a simpler underlying structure.

Bottom Line: I-Con shows that 23+ machine learning algorithms are all doing the same thing in disguise. By making that structure explicit, it achieves +8% on unsupervised ImageNet classification and opens a systematic path to inventing better algorithms.

IAIFI Research Highlights

Interdisciplinary Research Achievement
I-Con's information-geometric framework echoes the kind of unification sought in theoretical physics: finding a single principle beneath apparent diversity, then applying it to the mathematical foundations of modern AI.
Impact on Artificial Intelligence
By unifying over 23 representation learning algorithms under one KL-divergence objective, I-Con gives researchers a principled basis for designing new loss functions and achieves state-of-the-art unsupervised ImageNet accuracy with a +8% gain over prior methods.
Impact on Fundamental Interactions
The framework's information-theoretic foundation draws on concepts from statistical mechanics and information geometry, exposing mathematical connections between learned representations and physical theories of inference.
Outlook and References
Future work may systematically explore the "empty cells" in I-Con's periodic table to discover novel training objectives; the paper appeared at ICLR 2025 and the project page is at [arXiv:2504.16929](https://arxiv.org/abs/2504.16929).

Original Paper Details

Title
I-Con: A Unifying Framework for Representation Learning
arXiv ID
2504.16929
Authors
["Shaden Alshammari", "John Hershey", "Axel Feldmann", "William T. Freeman", "Mark Hamilton"]
Abstract
As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.