Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing
Authors
Ri-Zhao Qiu, Ge Yang, Weijia Zeng, Xiaolong Wang
Abstract
Scene representations using 3D Gaussian primitives have produced excellent results in modeling the appearance of static and dynamic 3D scenes. Many graphics applications, however, demand the ability to manipulate both the appearance and the physical properties of objects. We introduce Feature Splatting, an approach that unifies physics-based dynamic scene synthesis with rich semantics from vision language foundation models that are grounded by natural language. Our first contribution is a way to distill high-quality, object-centric vision-language features into 3D Gaussians, which enables semi-automatic scene decomposition using text queries. Our second contribution is a way to synthesize physics-based dynamics from an otherwise static scene using a particle-based simulator, in which material properties are assigned automatically via text queries. We ablate key techniques used in this pipeline, to illustrate the challenges and opportunities in using feature-carrying 3D Gaussians as a unified format for appearance, geometry, material properties and semantics grounded on natural language. Project website: https://feature-splatting.github.io/
Concepts
The Big Picture
Imagine photographing a vase of flowers on your kitchen table. The image is static, frozen in time. Now imagine typing “make the flowers sway in a gentle breeze” and watching the petals bend and flutter realistically, each stem flexing according to its actual material stiffness, while the ceramic vase sits immovably rigid beneath them. No animator needed. No physics degree required. Just a sentence.
That’s what Feature Splatting does. Developed by researchers at UC San Diego and MIT’s IAIFI, it goes after a stubborn problem: 3D scene representations, no matter how photorealistic, are dumb about the world. They capture what things look like but know nothing about what things are or how they behave. A digital flower and a digital rock look different but, to the computer, are equally inert.
Feature Splatting combines three previously separate technologies: a method for building precise 3D scenes from ordinary photographs, AI models trained to understand both images and language, and a physics engine that simulates how real materials move and deform. Feed it a set of photos, and it builds a 3D scene you can animate with plain English.
Key Insight: Feature Splatting embeds language-grounded meaning directly into 3D scene geometry. Text queries can both identify objects in a scene and automatically assign them physically realistic material properties, turning static captures into dynamic simulations.
How It Works
Feature Splatting builds on 3D Gaussian Splatting (GS), a technique that represents a scene as a cloud of millions of fuzzy ellipsoids (Gaussians), each carrying color and opacity information. Not a mesh. Not a neural network. GS renders fast and looks sharp, but those Gaussians are semantically blind. Feature Splatting’s first trick is to make them see.
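In code, a single Gaussian primitive can be pictured as a small record of position, scale, rotation, opacity, and color. The numpy sketch below is illustrative; the field names are assumptions, not the paper's actual data structures, and real GS stores view-dependent color as spherical harmonics rather than plain RGB.

```python
import numpy as np

# A minimal sketch of one 3D Gaussian Splatting primitive.
# Field names are illustrative, not taken from the paper's code.
class Gaussian:
    def __init__(self, mean, scale, rotation, opacity, color):
        self.mean = np.asarray(mean, dtype=np.float64)          # 3D center
        self.scale = np.asarray(scale, dtype=np.float64)        # per-axis extent
        self.rotation = np.asarray(rotation, dtype=np.float64)  # unit quaternion (w, x, y, z)
        self.opacity = float(opacity)                           # in [0, 1]
        self.color = np.asarray(color, dtype=np.float64)        # RGB stand-in

    def covariance(self):
        """Sigma = R diag(s)^2 R^T, the anisotropic 3D covariance of the ellipsoid."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S @ R.T

# a flat, disc-like Gaussian: wide in x/y, thin in z
g = Gaussian(mean=[0, 0, 0], scale=[0.1, 0.1, 0.02],
             rotation=[1, 0, 0, 0], opacity=0.9, color=[0.8, 0.2, 0.2])
cov = g.covariance()  # symmetric 3x3 covariance, diag(0.01, 0.01, 0.0004) here
```

A scene is then just a few million of these records; rendering splats each ellipsoid onto the image plane and alpha-composites front to back.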

The team augments each Gaussian with a high-dimensional feature vector, a long list of numbers encoding what that point in the scene semantically “means.” These features come from two large 2D vision foundation models: CLIP (which connects images to language) and DINOv2 (which captures rich visual structure). The catch is that both models produce low-resolution, noisy outputs when analyzing a photo. Naively projecting those outputs onto 3D Gaussians creates artifacts.
To clean this up, the team runs SAM (Segment Anything Model) to produce part-level masks (coherent regions like "petal," "stem," or "vase body") and pools features within those masks before distillation. The result is a much cleaner feature field baked directly into the 3D structure.
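The pooling step can be sketched as averaging dense per-pixel features within each mask, so every pixel in a part shares one clean feature before being distilled into the Gaussians. The function below is a hypothetical illustration, assuming a dense 2D feature map and boolean SAM-style masks:

```python
import numpy as np

def masked_pool(feature_map, masks):
    """Average noisy per-pixel features within each part-level mask.

    feature_map: (H, W, D) dense 2D features (e.g., upsampled CLIP/DINOv2 outputs)
    masks: list of (H, W) boolean arrays (e.g., SAM part masks)
    Returns an (H, W, D) map where every pixel inside a mask carries
    that mask's mean feature; pixels outside all masks are untouched.
    """
    pooled = feature_map.copy()
    for m in masks:
        mean_feat = feature_map[m].mean(axis=0)  # (D,) average over masked pixels
        pooled[m] = mean_feat
    return pooled

# toy example: a 4x4 map with 2-D features, one mask covering the left half
H = W = 4
rng = np.random.default_rng(0)
fmap = rng.random((H, W, 2))
mask = np.zeros((H, W), dtype=bool)
mask[:, :2] = True
clean = masked_pool(fmap, [mask])
# every pixel inside the mask now shares a single feature vector
```

In the actual pipeline the pooled 2D features supervise the per-Gaussian feature vectors during optimization; this sketch only shows the 2D cleaning step.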

With features embedded, scene decomposition becomes a text query. A user types “a vase with flowers,” and the system finds the Gaussians whose feature vectors are closest in meaning to that description. This is open-vocabulary segmentation: the system doesn’t need to have been trained on “vase” or “flowers” as labeled categories. It inherits that knowledge from CLIP’s pretraining.
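The query itself reduces to cosine similarity between each Gaussian's distilled feature and a text embedding from CLIP's text encoder. A toy sketch, with made-up feature dimensions and an assumed similarity threshold:

```python
import numpy as np

def select_gaussians(gauss_feats, text_feat, threshold=0.6):
    """Indices of Gaussians whose feature vector is close in cosine
    similarity to a text embedding (e.g., CLIP's encoding of "a vase
    with flowers"). The 0.6 threshold is an illustrative choice.
    """
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = g @ t                      # cosine similarity per Gaussian
    return np.where(sims > threshold)[0]

# toy data: 3 Gaussians in a 4-D feature space; the query aligns with the first two
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
idx = select_gaussians(feats, query)  # -> indices [0, 1]
```

Because the comparison happens in CLIP's embedding space, any phrase the text encoder understands works as a query; no fixed label set is involved.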
The second half of the pipeline handles what happens after segmentation: physics. Feature Splatting integrates an MPM (Material Point Method) physics engine, a particle-based simulator well-suited to materials that deform, flow, and break. Two ideas make this work:
- Material assignment via language: The same text-query mechanism that decomposes scenes also identifies material categories. The system queries whether a segmented region is “rigid,” “elastic,” “viscous,” and so on, then looks up corresponding physical parameters (Young’s modulus, Poisson’s ratio, yield stress) from a predefined table.
- Infilling for volume-dependent physics: GS represents surfaces, not volumes. MPM needs particles throughout an object’s interior to simulate effects like squishing or bouncing. The team developed an infilling procedure to populate object interiors with physics particles before simulation begins.
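The material-assignment step can be pictured as a nearest-text-embedding lookup into a parameter table. Everything below (the category names, parameter values, and toy text encoder) is an illustrative assumption, not the paper's actual table:

```python
import numpy as np

# Illustrative material table; the paper uses a predefined table of MPM
# parameters, but these exact categories and values are assumptions.
MATERIAL_TABLE = {
    "rigid":   {"young_modulus": 1e9, "poisson_ratio": 0.30},
    "elastic": {"young_modulus": 1e5, "poisson_ratio": 0.45},
    "viscous": {"young_modulus": 1e3, "poisson_ratio": 0.49},
}

def assign_material(region_feat, text_encoder):
    """Pick the material category whose text embedding best matches a
    segmented region's pooled feature, then look up its MPM parameters."""
    best, best_sim = None, -1.0
    r = region_feat / np.linalg.norm(region_feat)
    for name in MATERIAL_TABLE:
        t = text_encoder(name)
        t = t / np.linalg.norm(t)
        sim = float(r @ t)            # cosine similarity, features pre-normalized
        if sim > best_sim:
            best, best_sim = name, sim
    return best, MATERIAL_TABLE[best]

# stand-in text encoder: maps each category word to a fixed toy embedding
toy_embeddings = {"rigid":   np.array([1.0, 0.0, 0.0]),
                  "elastic": np.array([0.0, 1.0, 0.0]),
                  "viscous": np.array([0.0, 0.0, 1.0])}
name, params = assign_material(np.array([0.1, 0.9, 0.0]),
                               lambda w: toy_embeddings[w])
# name -> "elastic", so the region gets elastic MPM parameters
```

The same similarity machinery that segments the scene thus doubles as a material classifier; only the query strings change.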
As the simulation runs, Gaussian primitives must deform realistically. The team uses a local affine transformation tied to nearby MPM particles, tracking how each small region stretches, rotates, and shifts. Each Gaussian's position, rotation, and scale update continuously, handling large deformations better than earlier mesh-based approaches.
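One way to sketch this update: fit a local affine transform F from the displacements of nearby particles by least squares, then advect the Gaussian's mean and transform its covariance as Sigma' = F Sigma F^T. This is a hypothetical illustration of the idea, not the paper's implementation:

```python
import numpy as np
from itertools import product

def deform_gaussian(mean, cov, particle_pos, particle_disp, k=4):
    """Advect one Gaussian using its k nearest MPM particles.

    A local affine transform F is fit from particle displacements by
    least squares, then applied to the Gaussian:
      mu'    = mu + mean displacement of the neighbors
      Sigma' = F Sigma F^T
    Hypothetical sketch, assuming positions/displacements as (N, 3) arrays.
    """
    d = np.linalg.norm(particle_pos - mean, axis=1)
    nn = np.argsort(d)[:k]                               # k nearest particles
    P = particle_pos[nn] - particle_pos[nn].mean(axis=0) # centered rest positions
    Q = particle_pos[nn] + particle_disp[nn]
    Q = Q - Q.mean(axis=0)                               # centered deformed positions
    # least-squares affine fit: Q ~ P @ F^T  =>  F^T = pinv(P) @ Q
    F = (np.linalg.pinv(P) @ Q).T
    new_mean = mean + particle_disp[nn].mean(axis=0)
    new_cov = F @ cov @ F.T
    return new_mean, new_cov

# toy check: particles at the 8 corners of a cube, stretched 2x along x
pos = np.array(list(product([-1.0, 1.0], repeat=3)))
disp = np.zeros_like(pos)
disp[:, 0] = pos[:, 0]                                   # x -> 2x
mu, cov = deform_gaussian(np.zeros(3), 0.01 * np.eye(3), pos, disp, k=8)
# the isotropic Gaussian stretches along x: cov -> diag(0.04, 0.01, 0.01)
```

Decomposing F (e.g., via polar decomposition) would recover the updated rotation and per-axis scales that the renderer stores explicitly.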
Why It Matters
Feature Splatting sits at an unusual intersection. On the AI side, it shows that knowledge inside large 2D vision-language models, trained on internet-scale image-text pairs, can be transferred into explicit 3D representations without retraining those models from scratch. You get the benefit of billion-parameter pretraining without the cost.
The feature-carrying Gaussian format holds appearance, geometry, semantics, and physics in one data structure. Computer vision researchers have wanted something like this for a long time.
On the physics side, the connection to MPM matters. Material Point Methods have a rich history, from snow simulation in Disney’s Frozen to soft robotics to geomechanics. Grounding MPM in natural language means a non-expert can specify “this is rubber, this is glass” and get physically plausible results. That kind of accessible simulation used to require serious domain expertise.
Open questions remain. The material property lookup table is hand-curated; future work might learn parameters directly from video observations. Feature quality degrades in cluttered scenes with heavy occlusion, and the pipeline requires reasonably clean multi-view captures. The paper’s ablation studies isolate which components matter most and where the biggest gains are still available.
Bottom Line: Feature Splatting proves that a single 3D representation can simultaneously encode what a scene looks like, what objects are in it, and how those objects behave physically, letting anyone animate a photograph with a sentence.
IAIFI Research Highlights
Feature Splatting combines large-scale language-vision AI with classical physics simulation, showing that semantic knowledge from internet-trained foundation models can drive physically correct material dynamics in reconstructed 3D scenes.
The work introduces a method for distilling noisy 2D vision-language features into clean, object-centric 3D Gaussian representations using SAM-pooled masks. This allows open-vocabulary 3D scene understanding without retraining foundation models.
By coupling language-grounded segmentation with an MPM physics engine, the system makes particle-based simulation of real-world material behavior controllable through natural language, putting physically realistic dynamic scene synthesis within reach of non-experts.
Future directions include learning material parameters directly from observed dynamics rather than lookup tables, and extending to more complex multi-object interactions; the work is available at https://feature-splatting.github.io/ ([arXiv:2404.01223](https://arxiv.org/abs/2404.01223)).
Original Paper Details
Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing
[arXiv:2404.01223](https://arxiv.org/abs/2404.01223)
Ri-Zhao Qiu, Ge Yang, Weijia Zeng, Xiaolong Wang