
Synthesis and Analysis of Data as Probability Measures with Entropy-Regularized Optimal Transport

Foundational AI

Authors

Brendan Mallery, James M. Murphy, Shuchin Aeron

Abstract

We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of reference measures given a set of coefficients belonging to the simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of barycenters with respect to the entropy-regularized Wasserstein-2 cost as solutions that correspond to a fixed point of an average of the entropy-regularized displacement maps. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when the measure being analyzed is a barycenter with respect to the entropy-regularized Wasserstein-2 cost. We show that these coefficients, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, and that barycentric coefficients are stable with respect to perturbations in the Wasserstein-2 metric. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.

Concepts

optimal transport, Wasserstein barycenters, barycentric coordinates, dimension-free convergence, classification, density estimation, scalability, dimensionality reduction, kernel methods, Bayesian inference

The Big Picture

Imagine a collection of photographs of handwritten letters: not the pixel arrays, but the ink distribution on each page. Each “A” smears ink differently, some bold and angular, others loopy and soft. Can you build a Platonic ideal of the letter A by averaging these distributions intelligently? And given an unknown ink blob, can you determine how much it resembles each letter in your alphabet?

That’s the problem a team from Tufts University takes on in this paper. Instead of ink blobs, they work with any data modeled as a probability measure: a point cloud, a distribution of particle energies, a histogram of features. The challenge is designing principled tools for synthesis (building new distributions from references) and analysis (decomposing a distribution into its components).

Their approach uses entropy-regularized optimal transport, a framework for measuring geometric distances between distributions. They prove efficiency and stability guarantees for the approach, and show empirically that it beats neural network baselines when labeled data is scarce.

Key Insight: Adding an entropy regularization term to classical optimal transport yields dimension-free convergence rates and fast computation, turning a computationally expensive problem into a practical data analysis tool with strong theoretical backing.

How It Works

Classical optimal transport (OT) measures the “distance” between two probability distributions: the minimum cost to rearrange one pile of dirt into the shape of another. The Wasserstein-2 distance charges the squared Euclidean distance for moving each grain, making it sensitive to the geometry of the distributions. But exact computation scales cubically in the number of samples, and in high dimensions, estimating the distance from data requires exponentially many samples.
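
In symbols (standard notation, not taken from the highlight itself), the squared Wasserstein-2 distance searches over all couplings of the two measures:

```latex
W_2^2(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, d\pi(x, y)
```

where Π(μ, ν) denotes the set of joint distributions whose marginals are μ and ν.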

Entropy regularization fixes this. The entropy-regularized Wasserstein-2 cost adds a penalty proportional to the Kullback-Leibler divergence between the transport plan and the independent (product) coupling of its two marginals. The practical payoff is large: the problem becomes parallelizable via the Sinkhorn-Knopp matrix-scaling algorithm, and the cost can be estimated from data at rates that do not depend on dimension. The Sinkhorn divergence is an unbiased variant that corrects for self-transport costs.
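
Concretely, for a regularization strength ε the entropic cost reads (illustrative notation):

```latex
\mathrm{OT}_\varepsilon(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, d\pi(x, y) \;+\; \varepsilon \, \mathrm{KL}\!\left(\pi \,\middle\|\, \mu \otimes \nu\right)
```

For empirical measures, this cost can be approximated with a few lines of matrix scaling. The sketch below is a minimal, illustrative Sinkhorn-Knopp loop with uniform weights and no log-domain stabilization; it is not the authors' code.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.5, n_iters=200):
    """Entropy-regularized OT between uniform empirical measures on x and y."""
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)      # uniform marginals
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    K = np.exp(-C / eps)                                  # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                              # alternating scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                       # entropic transport plan
    return float((P * C).sum())                           # transport part of the cost

x = np.random.randn(50, 3)
y = np.random.randn(60, 3) + 1.0
print(sinkhorn_cost(x, y))
```

For very small ε the scalings underflow; log-domain stabilization is the standard remedy.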

Figure 1

Two linked problems sit at the center of the paper:

  1. Synthesis (the barycenter problem): Given reference measures and a weight vector summing to one, compute their geometric “average,” a new measure that lies between the references in Wasserstein space.
  2. Analysis (the inverse problem): Given an unknown measure, find the weights that make the barycenter of your references as close as possible to it (both problems are written out symbolically below).

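Written out (in notation chosen here for illustration, not taken from the paper), with reference measures ν₁, …, ν_K, weights λ on the probability simplex Δ, and a target measure ρ:

```latex
\text{Synthesis:}\quad \mu^\star(\lambda) \;=\; \arg\min_{\mu} \; \sum_{k=1}^{K} \lambda_k \, \mathrm{OT}_\varepsilon(\nu_k, \mu),
\qquad
\text{Analysis:}\quad \lambda^\star \;=\; \arg\min_{\lambda \in \Delta} \; W_2^2\!\left(\mu^\star(\lambda), \rho\right)
```
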
The main theoretical result is a fixed-point characterization of entropy-regularized barycenters. A measure is a barycenter if and only if it is a fixed point of a weighted average of entropy-regularized displacement maps. Think of it as an equilibrium condition: let each reference optimally push mass toward the barycenter, take the weighted average of those pushes, and it doesn’t move.
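
Roughly, in equation form (illustrative notation; the precise definition of the maps is in the paper): if T_k denotes the entropy-regularized displacement map associated with reference ν_k, a barycenter μ with weights λ satisfies

```latex
\mu \;=\; \Big( \textstyle\sum_{k=1}^{K} \lambda_k \, T_k \Big)_{\#}\, \mu
```

i.e., pushing μ forward through the λ-weighted average of the maps returns μ itself.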

Figure 2

This leads to a clean practical payoff. When the target measure is itself a barycenter of the references, recovering its coefficients reduces to a finite-dimensional, convex quadratic program with a guaranteed unique global solution. No gradient descent, no iterative black-box optimization. Just a convex problem you can solve exactly.
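
To make the shape of that problem concrete, here is a generic simplex-constrained convex quadratic program solved with an off-the-shelf SciPy routine. The matrix A and vector b are placeholders standing in for the quantities the paper derives from the entropy-regularized displacement maps; the exact construction is in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def solve_simplex_qp(A, b):
    """Minimize 0.5 * lam @ A @ lam - b @ lam over the probability simplex."""
    k = len(b)
    objective = lambda lam: 0.5 * lam @ A @ lam - b @ lam
    grad = lambda lam: A @ lam - b
    constraints = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * k
    lam0 = np.full(k, 1.0 / k)                        # start at the simplex center
    res = minimize(objective, lam0, jac=grad, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x

# Toy usage with a random positive semidefinite A (placeholder data only).
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A, b = M @ M.T, rng.normal(size=4)                    # PSD A keeps the problem convex
coeffs = solve_simplex_qp(A, b)
print(coeffs, coeffs.sum())
```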

The recovered coefficients are stable: small perturbations in the Wasserstein-2 metric produce small changes in the output. Convergence rates don’t depend on dimension, so the method holds up as data dimensionality grows. For scientific applications where distributions live in high-dimensional spaces, this is a critical property.

Why It Matters

Point clouds (collections of 3D coordinates representing object surfaces) show up everywhere in robotics, autonomous vehicles, and scientific simulation. Classifying corrupted or occluded point clouds is hard. Deep learning attacks the problem with complex architectures that need large labeled datasets.

The Tufts team applied their barycentric coefficient framework to corrupted point cloud classification. Using the coefficients as features fed into a standard classifier, they outperformed neural network baselines when training data was limited. With only a handful of labeled examples, the geometric structure captured by optimal transport provides a stronger inductive bias than a network can learn from scratch.
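
A schematic of that pipeline is sketched below; compute_barycentric_coeffs stands in for the analysis step above (the function, the choice of classifier, and the toy data are hypothetical, not taken from the paper).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_barycentric_coeffs(cloud, references):
    """Placeholder for the analysis step: returns a point on the simplex."""
    lam = np.random.rand(len(references))             # stand-in for the QP solution
    return lam / lam.sum()

references = [np.random.randn(100, 3) for _ in range(5)]     # reference point clouds
train_clouds = [np.random.randn(100, 3) for _ in range(20)]  # toy training clouds
train_labels = np.random.randint(0, 2, size=20)              # toy binary labels

X = np.stack([compute_barycentric_coeffs(pc, references) for pc in train_clouds])
clf = LogisticRegression().fit(X, train_labels)               # coefficients as features
```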

The applications go well beyond point clouds. Any domain where data is naturally modeled as distributions stands to benefit: particle physics (energy depositions in detectors), cosmology (galaxy surveys as point processes), materials science (atomic configurations). Because the convergence rates are dimension-free, the framework scales to high-dimensional probability measures without the curse of dimensionality dragging performance down.

Open questions remain. The analysis result currently requires knowing the target measure is itself a barycenter of the references. Extending recovery to approximate barycenters, or to settings where the reference set is learned jointly, would open up many more use cases.

Bottom Line: By combining entropy regularization with a new fixed-point characterization of barycenters, this work delivers a rigorous and computationally tractable framework for distribution-valued data analysis that beats deep learning when labels are scarce.


IAIFI Research Highlights

Interdisciplinary Research Achievement
This work connects mathematical analysis (optimal transport theory, convex optimization) with practical machine learning, providing a principled alternative to neural networks for data that naturally lives in the space of probability measures, including the geometric and field-theoretic data common in fundamental physics.
Impact on Artificial Intelligence
The paper establishes dimension-independent convergence guarantees and a convex quadratic formulation for the analysis problem, giving AI practitioners both theoretical foundations and efficient algorithms for distribution-valued feature extraction.
Impact on Fundamental Interactions
Provable sample efficiency with high-dimensional probability measures gives particle physicists, cosmologists, and materials scientists a new tool for analyzing datasets that are naturally represented as distributions.
Outlook and References
Future work may extend barycentric analysis to approximate settings and learned reference sets. The paper appeared at AISTATS 2025; the preprint is available at [arXiv:2501.07446](https://arxiv.org/abs/2501.07446), and code is available in the authors' repository.
