Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models
Authors
Archer Wang, Emile Anand, Yilun Du, Marin Soljačić
Abstract
Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.
Concepts
The Big Picture
Look at a portrait photograph. Your brain splits it into pieces without thinking: face, lighting, background. You don’t need anyone to tell you where one ends and another begins. Getting an AI to do the same thing, with zero labels, is one of machine learning’s oldest unsolved problems: unsupervised disentanglement.
“Unsupervised” means no human-provided labels. “Disentanglement” means the system has to figure out which hidden factors shaped what it sees. It’s like reconstructing a meal’s ingredients after you’ve already eaten it. Once those factors are identified, they can be recombined: swap the background from one photo onto a different subject, or blend motion patterns from separate robot demonstrations into new trajectories.
A team at MIT has developed a method that makes this recombination far more coherent. The core idea is to add a “judge” network trained to catch implausible blends, then use its feedback to discipline a diffusion model, a generative AI that builds images by progressively refining noise into structure.
Train a discriminator to spot “fake” recombinations. Optimize the diffusion model to fool it. The result: blended outputs that aren’t just pixel-plausible but physically and semantically consistent.
How It Works
The foundation is a factorized latent diffusion model, a generative model that encodes each input into multiple separate hidden “slots,” each capturing a different aspect of the scene. (“Latent” means the model works in a compressed internal representation rather than on raw pixels.) Think of it as assigning separate notebooks to background, lighting, and subject identity. New images come from mixing notebooks across different sources.
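To make the "notebooks" picture concrete, here is a minimal sketch of a factorized encoder in PyTorch. Everything here is illustrative, not the paper's architecture: the class name, slot count, and layer sizes are our assumptions; the point is only that each input maps to K separate latent slots that can be swapped across sources.

```python
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    """Encode an image into K separate latent 'slots' (hypothetical layout)."""
    def __init__(self, num_slots: int = 4, slot_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One projection head per slot, so each slot can specialize.
        self.heads = nn.ModuleList(
            [nn.Linear(64, slot_dim) for _ in range(num_slots)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)                      # (B, 64)
        slots = [head(h) for head in self.heads]  # K tensors of (B, slot_dim)
        return torch.stack(slots, dim=1)          # (B, K, slot_dim)

# Recombination: take slot 0 (say, "background") from image A
# and keep the remaining slots from image B.
enc = FactorizedEncoder()
z_a = enc(torch.randn(1, 3, 64, 64))
z_b = enc(torch.randn(1, 3, 64, 64))
z_mix = z_b.clone()
z_mix[:, 0] = z_a[:, 0]  # swap one "notebook" across sources
```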
The team builds on a prior framework called Decomp Diffusion, which showed promise for unsupervised slot learning. The problem: images produced by swapping components across sources often looked uncanny. Lighting didn't match. Textures clashed. The seams showed.

The fix is an adversarial training signal. A discriminator network learns to tell apart two kinds of outputs (see the sketch after this list):
- Single-source samples: outputs generated entirely from one input’s latent factors
- Recombined samples: outputs generated by mixing factors from different sources
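In code, the two classes of training examples might be assembled like this. This is a hedged sketch continuing the toy interfaces above: `encoder`, `decode`, and the random slot-mixing scheme are placeholders, not the paper's exact procedure.

```python
import torch

def make_discriminator_batch(encoder, decode, x1, x2):
    """Build 'real' (single-source) and 'fake' (recombined) samples.

    encoder: maps images -> (B, K, slot_dim) latent slots
    decode:  maps latent slots -> images (stand-in for the diffusion sampler)
    """
    z1, z2 = encoder(x1), encoder(x2)
    K = z1.shape[1]

    # Single-source: every slot comes from the same input.
    single = decode(z1)

    # Recombined: each slot is drawn at random from source 1 or source 2.
    pick = torch.randint(0, 2, (z1.shape[0], K, 1), device=z1.device)
    z_mix = torch.where(pick.bool(), z1, z2)
    recombined = decode(z_mix)

    # Discriminator targets: 1 = single-source, 0 = recombined.
    labels = torch.cat([torch.ones(len(single)), torch.zeros(len(recombined))])
    return torch.cat([single, recombined]), labels
```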
During training, the diffusion model gets rewarded for fooling the discriminator into accepting recombinations as real. This creates a feedback loop: as the discriminator gets better at spotting inconsistencies, the generator has to produce more physically coherent blends to keep up.
The adversarial signal operates on intermediate denoising predictions rather than waiting for the fully finished output. At each step of generation, the model maintains a best guess about the final image, and the discriminator evaluates that guess directly. Training updates flow through without running the full denoising chain every time.
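One way to read that mechanically, assuming a standard DDPM epsilon-prediction parameterization (the variable names below are ours, not the paper's): given the noisy sample x_t and the network's noise estimate, the model's best guess of the clean image is x̂₀ = (x_t − √(1−ᾱ_t)·ε̂) / √ᾱ_t, and the discriminator scores that guess directly.

```python
import torch
import torch.nn.functional as F

def adversarial_generator_loss(eps_model, discriminator, x_t, t, alpha_bar):
    """Score the intermediate x0-hat prediction instead of a full sample.

    eps_model: predicts the noise added at step t (DDPM convention)
    alpha_bar: (T,) cumulative alpha products from the noise schedule
    """
    eps_hat = eps_model(x_t, t)
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)

    # Standard DDPM identity: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps_hat) / torch.sqrt(a_bar)

    # The generator wants this guess labeled "single-source" (label 1),
    # so gradients flow through x0_hat without running the full chain.
    logits = discriminator(x0_hat)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```

Presumably this term is added alongside the usual denoising objective with some weighting, though the exact balance is a training detail the highlight doesn't specify.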

Results across four standard benchmarks (CelebA-HQ for celebrity faces, Virtual KITTI for driving scenes, CLEVR for colored geometric objects, and Falcor3D for synthetic 3D scenes) all favor the new method. It achieves lower FID scores (Fréchet Inception Distance, a measure of image quality and diversity) and scores better on MIG (Mutual Information Gap) and MCC (Mean Correlation Coefficient), two metrics that capture how cleanly the learned factors line up with true underlying variables.
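For concreteness, here is a hedged sketch of how MIG is commonly computed: discretize the latents, measure mutual information against each ground-truth factor, and take the normalized gap between the two most informative latents. This follows the standard definition, not necessarily the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(latents: np.ndarray, factors: np.ndarray, bins: int = 20) -> float:
    """Mutual Information Gap over (N, D) latents and (N, K) discrete factors."""
    # Discretize continuous latent codes into histogram bins.
    z = np.stack([np.digitize(latents[:, j],
                              np.histogram(latents[:, j], bins)[1][:-1])
                  for j in range(latents.shape[1])], axis=1)
    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        mi = np.array([mutual_info_score(v, z[:, j]) for j in range(z.shape[1])])
        top2 = np.sort(mi)[-2:]        # two most informative latent dimensions
        h_v = mutual_info_score(v, v)  # entropy of the factor
        gaps.append((top2[1] - top2[0]) / max(h_v, 1e-12))
    return float(np.mean(gaps))
```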
Why It Matters
The most surprising result comes not from images but from robots.
The team applied their method to video trajectories from the LIBERO benchmark, a standard testbed for robotic manipulation. Here, the “factors” aren’t visual attributes but action components: reusable motion patterns that recur across task demonstrations. By discovering these components without any labeled examples and recombining them, the system generates entirely new robot trajectories absent from the training set.
These synthetic trajectories dramatically increase state-space coverage, meaning the robot encounters a far wider variety of configurations during exploration. On LIBERO, this translates into more effective reinforcement learning: a policy guided by synthetic demonstrations explores much more of the environment than one relying on original data alone. Real demonstration data is expensive and state coverage is everything. Unsupervised factorization turns out to be a practical augmentation tool, not just an academic exercise.
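A simple proxy for "state-space coverage" is counting how many discrete state bins a set of trajectories visits; the sketch below assumes states can be binned this way, and the paper's actual coverage metric may differ.

```python
import numpy as np

def coverage(states: np.ndarray, bins_per_dim: int = 10) -> int:
    """Count distinct state bins visited (a crude coverage proxy).

    states: (N, D) array of visited states, pooled across trajectories.
    """
    lo, hi = states.min(axis=0), states.max(axis=0)
    # Map each state to a tuple of per-dimension bin indices.
    idx = np.floor((states - lo) / np.maximum(hi - lo, 1e-12) * (bins_per_dim - 1))
    return len({tuple(row) for row in idx.astype(int)})

# The claim, in these terms: pooling original and recombined trajectories
# should visit strictly more bins than the originals alone, e.g.
# coverage(np.vstack([orig, synth])) > coverage(orig)
```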
There’s a deeper tension between representation learning (building useful internal descriptions of data) and generative modeling (creating new data that looks real). The theoretical outlook is discouraging: without structural constraints, fully unsupervised disentanglement is provably impossible. There are infinitely many equally valid ways to factor data, and nothing in the raw data alone picks the “right” one.
This paper doesn’t resolve that impossibility. What it offers is a training signal that nudges the system toward representations that behave as if they’re disentangled, at least for recombination purposes. The adversarial discriminator acts as a proxy for external feedback, the kind that in a richer setting might come from human preferences, physical simulation, or task success. Could it be replaced by reward functions or physics engines? Can it scale to complex real-world video? Those questions now have a concrete working system to test against.
Pairing an adversarial discriminator with a factorized diffusion model produces cleaner disentanglement and more realistic compositional generation across images and robotic video, pointing toward unsupervised factor discovery as a practical tool for boosting exploration in robot learning.
IAIFI Research Highlights
This work ties representation learning theory, including connections to causal inference and independent component analysis, to modern generative modeling. The adversarial feedback mechanism imposes physical consistency on learned compositional structure.
The discriminator-driven diffusion framework sets a new bar for unsupervised disentanglement, achieving better FID, MIG, and MCC scores than prior baselines across four benchmark datasets, all without factor-level supervision.
The robotics application shows that unsupervised latent decomposition can generate physically realistic novel trajectories that substantially increase state-space coverage, providing a new data augmentation approach for embodied AI systems.
Future work could incorporate richer feedback signals like physics simulators and human preferences, and extend the approach to complex real-world video. The preprint ([arXiv:2601.22057](https://arxiv.org/abs/2601.22057)) is available on arXiv.