
Materialistic: Selecting Similar Materials in Images

Foundational AI

Authors

Prafull Sharma, Julien Philip, Michaël Gharbi, William T. Freeman, Fredo Durand, Valentin Deschaintre

Abstract

Separating an image into meaningful underlying components is a crucial first step for both editing and understanding images. We present a method capable of selecting the regions of a photograph exhibiting the same material as an artist-chosen area. Our proposed approach is robust to shading, specular highlights, and cast shadows, enabling selection in real images. As we do not rely on semantic segmentation (different woods or metal should not be selected together), we formulate the problem as a similarity-based grouping problem based on a user-provided image location. In particular, we propose to leverage the unsupervised DINO features coupled with a proposed Cross-Similarity module and an MLP head to extract material similarities in an image. We train our model on a new synthetic image dataset, that we release. We show that our method generalizes well to real-world images. We carefully analyze our model's behavior on varying material properties and lighting. Additionally, we evaluate it against a hand-annotated benchmark of 50 real photographs. We further demonstrate our model on a set of applications, including material editing, in-video selection, and retrieval of object photographs with similar materials.

Concepts

cross-similarity feature weighting, representation learning, self-supervised learning, feature extraction, cross-image material similarity, transformers, embeddings, synthetic-to-real generalization, attention mechanisms, transfer learning, contrastive learning, inverse problems

The Big Picture

Imagine walking into a furniture store, pointing to a chair, and asking the salesperson to find everything else made from the same wood. You wouldn’t be confused by different lighting, the sheen of varnish, or a lamp’s shadow. You’d just know which pieces share the same grain. Computer vision researchers have been trying to teach machines the same trick for decades, and it turns out to be very hard.

The challenge isn’t recognizing a material category like “wood” or “metal.” It’s recognizing this specific wood, with its grain, color, and reflective character, versus that other wood across the room, while ignoring how dramatically lighting changes a surface’s appearance. Shading, specular highlights, and cast shadows transform how a material looks without changing what it is. Standard color-picking tools and AI labeling systems fall apart under these conditions.

Materialistic, from researchers at MIT and Adobe Research, reframes the problem entirely.

Key Insight: By treating material selection as a similarity problem rather than a classification problem, Materialistic identifies matching materials without being fooled by lighting effects, and generalizes to real photographs despite training entirely on synthetic scenes.

How It Works

The core design decision is almost philosophical: don’t classify materials into fixed buckets. Traditional segmentation systems label pixels as “wood,” “metal,” or “fabric,” but coarse labels can’t distinguish two different woods in the same scene. Materialistic asks a different question: given a user-clicked pixel, which other pixels are made of the same stuff?

To answer this, the system builds on DINO (self-distillation with no labels), a pretrained visual model that recognizes meaningful structure in image patches without human annotation. Raw DINO features mix color, texture, shape, and object type together. Materialistic adds a specialized layer on top to tease them apart.
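As a concrete starting point, the snippet below pulls per-patch features from the publicly released DINO ViT-S/16 via `torch.hub`. The checkpoint name and the `get_intermediate_layers` helper come from the facebookresearch/dino repository; the fixed input size and single-scale usage are simplifications of the paper's multi-scale setup:

```python
# Minimal sketch: per-patch DINO features, the kind of backbone
# representation Materialistic builds on. Exact model variant and
# multi-scale handling in the paper may differ.
import torch
from PIL import Image
from torchvision import transforms

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)  # [1, 3, 224, 224]

with torch.no_grad():
    # Last-block token embeddings: [1, 1 + 14*14, 384] for ViT-S/16 at 224px.
    tokens = dino.get_intermediate_layers(img, n=1)[0]

patch_feats = tokens[:, 1:, :]                  # drop the CLS token
feat_map = patch_feats.reshape(1, 14, 14, 384)  # one 384-d vector per 16x16 patch
```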

Figure 1

The pipeline works in three steps:

  1. Feature extraction: DINO processes the image at multiple scales, producing compact numerical summaries for each image region.
  2. Cross-Similarity Feature Weighting: A novel module takes the query pixel’s summary and reweights features across the entire image, asking which properties are most diagnostic for matching the query. This operates at multiple resolutions simultaneously.
  3. MLP scoring head: A lightweight multilayer perceptron takes the modulated, multi-scale summaries and outputs a per-pixel similarity heatmap indicating which regions share the same material. (A code sketch of steps 2 and 3 follows this list.)
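To make the weighting-and-scoring idea concrete, here is a minimal, single-scale sketch in PyTorch. The class name `CrossSimilarityHead`, the layer sizes, and the elementwise reweighting are illustrative stand-ins for the paper's multi-resolution design, not its exact architecture:

```python
# Hedged sketch: reweight every location's features by agreement with the
# query pixel's features, then score with a small MLP.
import torch
import torch.nn as nn

class CrossSimilarityHead(nn.Module):
    def __init__(self, dim=384, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, query_xy):
        # feat_map: [B, H, W, D] per-patch features; query_xy: (row, col) of the click.
        q = feat_map[:, query_xy[0], query_xy[1], :]  # [B, D] query features
        weighted = feat_map * q[:, None, None, :]     # reweight every location by the query
        logits = self.mlp(weighted).squeeze(-1)       # [B, H, W] per-location scores
        return torch.sigmoid(logits)                  # similarity heatmap in [0, 1]

feat_map = torch.randn(1, 14, 14, 384)  # stand-in for the DINO patch features above
heatmap = CrossSimilarityHead()(feat_map, query_xy=(7, 7))  # [1, 14, 14]
```

The key design choice survives the simplification: the query pixel's features modulate every other location's features before scoring, so the network can learn which feature dimensions are diagnostic of material identity rather than comparing raw embeddings directly.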

Training data posed its own problem. Real photographs rarely carry per-pixel material identity labels. The team sidestepped this by building a synthetic dataset: 50,000 rendered images from 100 indoor scenes, drawing on 16,000 physically-based materials. The renderer produced realistic shading, reflections, and shadows while providing exact ground-truth material labels for every pixel.
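Ground-truth material IDs make training pairs cheap to generate. Here is a hedged sketch of one plausible recipe, assuming a per-pixel integer ID map from the renderer; the paper's exact sampling strategy and loss may differ:

```python
# Sketch: pick a random query pixel, and the binary target is
# "every pixel with the same material ID". Names are illustrative.
import numpy as np

def make_training_pair(material_ids, rng):
    # material_ids: [H, W] integer map, one ID per physically-based material.
    h, w = material_ids.shape
    query = (rng.integers(h), rng.integers(w))  # stand-in for the artist's click
    target = (material_ids == material_ids[query]).astype(np.float32)
    return query, target  # supervise the heatmap with e.g. binary cross-entropy

rng = np.random.default_rng(0)
ids = rng.integers(0, 5, size=(64, 64))  # toy material-ID map
query, target = make_training_pair(ids, rng)
```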

Figure 2

Despite training only on synthetic indoor scenes, the model generalizes to real-world photographs, including outdoor images with entirely different lighting and materials. That synthetic-to-real transfer suggests the Cross-Similarity module is picking up on genuine material properties rather than overfitting to rendering artifacts.

No ground-truth material selection benchmark previously existed, so the team hand-annotated 50 real photographs to create one. They also ran controlled ablation studies, varying lighting angle, material glossiness, query pixel location, and image resolution to isolate what each component contributes.
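For scoring a predicted selection against a binary annotation, intersection-over-union is the standard yardstick. A minimal version, with the threshold as an assumed knob rather than the paper's reported protocol:

```python
# Sketch: IoU between a thresholded similarity heatmap and a hand-drawn mask.
import numpy as np

def iou(pred, gt, threshold=0.5):
    # pred: [H, W] similarity heatmap in [0, 1]; gt: [H, W] binary mask.
    p = pred >= threshold
    g = gt.astype(bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union else 1.0  # empty-vs-empty counts as perfect

pred = np.random.rand(64, 64)
gt = (pred > 0.5).astype(np.float32)  # toy "annotation" for demonstration
print(iou(pred, gt))                   # 1.0 by construction
```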

Why It Matters

Material selection sounds niche, but it reaches further than you’d think. The authors show material editing (swap all the upholstery in a scene at once), in-video selection (pick a material in one frame and track it through a whole clip), and image retrieval (search product photos for objects made from a specific material). Think e-commerce, architectural visualization, digital content creation, film production.
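The retrieval application, for instance, falls out of the similarity heatmap almost for free. A hedged sketch, where `predict_heatmap` is a hypothetical wrapper around the trained model and the max-pooling aggregation is one simple choice among several:

```python
# Sketch: rank catalog images by the strongest per-pixel match their
# heatmaps give to the query material.
def rank_by_material(query_image, query_xy, catalog, predict_heatmap):
    scores = []
    for image in catalog:
        heatmap = predict_heatmap(query_image, query_xy, image)  # [H, W] in [0, 1]
        scores.append(heatmap.max())                             # best matching region
    # Indices of catalog images, highest-scoring first.
    return sorted(range(len(catalog)), key=lambda i: -scores[i])
```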

Figure 3

There’s a deeper payoff too. An image isn’t just a pixel grid; it’s the result of geometry, illumination, and materials interacting through the physics of light. Systems that pull apart these components advance inverse rendering (recovering 3D scene properties from 2D photos), physically accurate image synthesis, and AI that reasons about physical properties rather than pixel statistics.

Open questions remain. The method operates at the patch level, which limits spatial precision at fine material boundaries. Highly transparent or translucent materials, where light behavior gets especially complex, are still difficult. Synthetic-to-real generalization works well here, but AI-generated image augmentation might help close the remaining domain gap on harder cases.

Bottom Line: Materialistic shows that material identity can be extracted from photographs without semantic labels, using a query-driven similarity architecture over self-supervised features. It’s a concrete step toward AI that understands the physical world, not just its appearance.

IAIFI Research Highlights

Interdisciplinary Research Achievement
This work straddles computer graphics and machine learning, using physically-based rendering to generate training data for a visual AI system. Light transport physics meets self-supervised deep learning here, solving a problem that resisted either approach on its own.
Impact on Artificial Intelligence
The Cross-Similarity Feature Weighting module shows how to condition patch-level visual features on a user query, opening up similarity tasks that go beyond fixed-class classification. The same design pattern could transfer to other problems in visual computing.
Impact on Fundamental Interactions
By learning to separate material identity from illumination, Materialistic moves toward the scientific goal of inverse rendering: recovering physical scene parameters (geometry, materials, and lights) from raw image data. This is a central problem in physically-grounded computer vision.
Outlook and References
Next steps include finer spatial resolution, better handling of transparent materials, and integration with full inverse rendering pipelines. The synthetic dataset is publicly released. [arXiv:2305.13291](https://arxiv.org/abs/2305.13291)
