RelitLRM: Generative Relightable Radiance for Large Reconstruction Models
Authors
Tianyuan Zhang, Zhengfei Kuang, Haian Jin, Zexiang Xu, Sai Bi, Hao Tan, He Zhang, Yiwei Hu, Milos Hasan, William T. Freeman, Kai Zhang, Fujun Luan
Abstract
We propose RelitLRM, a Large Reconstruction Model (LRM) for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations from sparse (4-8) posed images captured under unknown static lighting. Unlike prior inverse rendering methods requiring dense captures and slow optimization, often causing artifacts like incorrect highlights or shadow baking, RelitLRM adopts a feed-forward transformer-based model with a novel combination of a geometry reconstructor and a relightable appearance generator based on diffusion. The model is trained end-to-end on synthetic multi-view renderings of objects under varying known illuminations. This architecture design enables the model to effectively decompose geometry and appearance, resolve the ambiguity between material and lighting, and capture the multi-modal distribution of shadows and specularity in the relit appearance. We show our sparse-view feed-forward RelitLRM offers relighting results competitive with state-of-the-art dense-view optimization-based baselines while being significantly faster. Our project page is available at: https://relit-lrm.github.io/.
Concepts
The Big Picture
Imagine you snap a few quick photos of a ceramic vase on your kitchen counter on an overcast afternoon. Now imagine handing those photos to a computer system that, within seconds, returns a perfect 3D model you can drop into a sun-drenched desert scene, a neon-lit nightclub, or a candlelit room, and have it look exactly right. Shadows fall where they should. Glossy highlights flare at the correct angles. The vase belongs.
This has been a holy grail of computer vision for decades. The core problem is brutal: lighting and material properties are hopelessly entangled in any photograph. When you see a bright spot on that vase, is it because the ceramic is shiny, or because the light source is intense? A camera can’t tell you.
Untangling this so you can re-render an object under entirely new lighting (a task called relighting) has traditionally demanded hundreds of images taken under controlled, laboratory-grade illumination, followed by hours of computational optimization. Impressive, but impractical outside a professional visual-effects pipeline.
Researchers from MIT, Stanford, Cornell, and Adobe Research have built RelitLRM, a system that produces relighting quality on par with those slow, data-hungry methods from just 4–8 photographs in roughly one second.
Key Insight: RelitLRM separates the job of figuring out where an object is in 3D space from figuring out how light plays across it. It uses two fundamentally different AI approaches for each task, which lets it handle the ambiguity that defeats simpler systems.
How It Works
The trick at RelitLRM’s core is architectural: don’t try to solve the whole problem at once. Split it into two sequential tasks, each suited to a different machine-learning paradigm.
- Reconstruct the geometry. A transformer-based geometry reconstructor identifies how parts of the scene relate in 3D space. (Transformers are the same architecture behind modern language models, adapted here for visual data.) The output is the object’s structure encoded as 3D Gaussian Splatting (3DGS) primitives: tiny translucent blobs, each with a position, size, orientation, and opacity. 3DGS has become the field’s go-to compact, fast-to-render 3D representation since 2023.
- Generate the appearance under new light. Once geometry is locked in, RelitLRM hands off to a diffusion-based appearance generator from the same family of models behind Stable Diffusion. Why diffusion? Because relighting is fundamentally uncertain. A glossy surface can produce specular highlights (intense bright spots where light bounces directly toward the camera) that could appear in several plausible positions depending on tiny, unknown surface details. A regression model trained to predict a single “best” output averages over those possibilities, producing a blurry smear. A diffusion model instead samples from the full distribution of plausible appearances, picking one sharp, physically coherent answer.

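To make the first stage's output concrete, here is a minimal sketch of a single 3DGS primitive. The field names are illustrative, not the authors' actual data layout; full 3DGS also stores spherical-harmonic coefficients for view-dependent color.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GaussianSplat:
    """One translucent 3D blob; fields are illustrative, not the paper's API."""
    position: np.ndarray  # (3,) center in world space
    scale: np.ndarray     # (3,) per-axis extent of the ellipsoid
    rotation: np.ndarray  # (4,) unit quaternion orientation
    opacity: float        # transparency in [0, 1]
    color: np.ndarray     # (3,) RGB; full 3DGS uses SH coefficients instead


splat = GaussianSplat(
    position=np.zeros(3),
    scale=np.full(3, 0.01),
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity rotation
    opacity=0.9,
    color=np.array([0.8, 0.6, 0.4]),
)
```

A renderer rasterizes many of these blobs back-to-front, blending them by opacity, which is what makes the representation fast to render.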
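The second point, that regression smears multi-modal outputs while a generative sampler stays sharp, can be seen in a toy one-dimensional example (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "highlight position": equally likely at x = -1 or x = +1, standing in
# for two physically plausible locations of a specular highlight.
plausible = rng.choice([-1.0, 1.0], size=10_000)

# A regressor trained with squared error converges to the conditional mean:
# a midpoint near 0.0 that matches neither real highlight -- the "blurry smear".
mse_optimal = plausible.mean()

# A generative sampler instead draws one mode: a sharp, coherent answer.
one_draw = rng.choice([-1.0, 1.0])
```

The same averaging argument applies in high dimensions, where the "midpoint" of many plausible relit images is a washed-out blur.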
The two stages are trained end-to-end on a large synthetic dataset of 3D objects rendered under many known illuminations. The target lighting condition is fed in as an additional input token, telling the system not just what the object looks like but what kind of light to simulate. Inference takes about one second on a single A100 GPU.
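Put together, one inference pass might flow as in the schematic below. Everything here is a toy stand-in under assumed names; the real stages are a large transformer and a diffusion network, not these stubs.

```python
import numpy as np


def geometry_reconstructor(images, poses):
    """Stage 1 (toy): deterministically regress N Gaussian centers and scales."""
    rng = np.random.default_rng(0)  # fixed seed: geometry is regressed, not sampled
    n = 64
    return {"centers": rng.standard_normal((n, 3)),
            "scales": np.full((n, 3), 0.02)}


def appearance_sampler(gaussians, light_token, rng):
    """Stage 2 (toy): stochastically draw per-Gaussian RGB, conditioned on lighting."""
    n = gaussians["centers"].shape[0]
    base = float(np.clip(light_token.mean(), 0.0, 1.0))  # crude lighting conditioning
    return np.clip(base + 0.1 * rng.standard_normal((n, 3)), 0.0, 1.0)


images = [np.zeros((8, 8, 3))] * 4   # 4 sparse posed input views
poses = [np.eye(4)] * 4
light_token = np.full(16, 0.7)       # stands in for the encoded target environment map

gaussians = geometry_reconstructor(images, poses)          # same result every call
rgb = appearance_sampler(gaussians, light_token,
                         np.random.default_rng(1))         # differs per sample
```

The division of labor mirrors the paper's design: geometry is a deterministic function of the input views, while appearance is a conditional sample that depends on the lighting token.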

This design also sidesteps shadow baking, a persistent flaw in earlier methods. Per-scene optimization tends to accidentally embed the original lighting (shadows, highlights, all of it) directly into the learned surface materials. Move the sun in your relit scene, and the old shadow stubbornly stays put. Training end-to-end across diverse illuminations forces RelitLRM to pull apart lighting and material properties from the start.
Why It Matters
The obvious applications span digital content creation, gaming, and augmented reality: anywhere you need 3D assets that look correct under different lighting. The paper shows this directly. Photograph a real object, run RelitLRM, insert it into a synthetic 3D scene. It looks like it belongs.
The deeper significance is methodological. RelitLRM makes the case that the right answer to a physically ambiguous inverse problem isn’t always to push harder on optimization. Sometimes it’s to reframe the problem as generative sampling. The material-lighting ambiguity that has dogged inverse rendering for two decades is a multi-modal uncertainty: many equally valid answers exist. Diffusion models handle exactly that kind of uncertainty well, and the same logic should apply wherever physics produces genuinely ambiguous observations.
Open questions remain. The system trains on synthetic data, and real-world photographs introduce domain mismatches that the authors acknowledge. The geometry stage still relies on images where the camera position is roughly known. And the diffusion-based appearance generator, while good at capturing high-frequency specular effects, introduces some randomness that regression-based methods avoid.
Where does the field go from here? Fully unconstrained casual capture without known camera poses, larger real-world training datasets, and tighter integration between the geometry and appearance stages.
Bottom Line: RelitLRM turns a problem that once required a photography studio and a supercomputer into one that runs on a single GPU in about a second, by being smart about which parts of the problem need deterministic answers and which need probabilistic ones.
IAIFI Research Highlights
This work sits at the intersection of physically-based rendering (rooted in optics and light-transport physics) and modern deep learning, using transformer and diffusion architectures to implicitly learn how light interacts with surfaces.
RelitLRM shows that combining regression-based geometry estimation with diffusion-based appearance synthesis outperforms either approach alone on ambiguous inverse problems, advancing feed-forward generative 3D reconstruction.
The system disentangles geometry, material properties, and illumination from sparse observations, performing learned inverse rendering that recovers physically meaningful scene properties without explicit physical simulation at inference time.
Future work includes training on larger real-world datasets to close the synthetic-to-real gap and extending the framework to full scene relighting beyond individual objects; the paper is available on [arXiv:2410.06231](https://arxiv.org/abs/2410.06231) and at relit-lrm.github.io.
Original Paper Details
RelitLRM: Generative Relightable Radiance for Large Reconstruction Models
2410.06231