
Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science

Theoretical Physics

Authors

Charlotte Loh, Thomas Christensen, Rumen Dangovski, Samuel Kim, Marin Soljacic

Abstract

Deep learning techniques have been increasingly applied to the natural sciences, e.g., for property prediction and optimization or material discovery. A fundamental ingredient of such approaches is the vast quantity of labelled data needed to train the model; this poses severe challenges in data-scarce settings where obtaining labels requires substantial computational or labor resources. Here, we introduce surrogate- and invariance-boosted contrastive learning (SIB-CL), a deep learning framework which incorporates three “inexpensive” and easily obtainable auxiliary information sources to overcome data scarcity. Specifically, these are: 1) abundant unlabeled data, 2) prior knowledge of symmetries or invariances and 3) surrogate data obtained at near-zero cost. We demonstrate SIB-CL's effectiveness and generality on various scientific problems, e.g., predicting the density-of-states of 2D photonic crystals and solving the 3D time-independent Schrödinger equation. SIB-CL consistently results in orders of magnitude reduction in the number of labels needed to achieve the same network accuracies.

Concepts

contrastive learning, self-supervised learning, surrogate modeling, data-scarce learning, transfer learning, symmetry preservation, representation learning, semi-supervised learning, fine-tuning, data augmentation, embeddings, inverse problems

The Big Picture

Imagine trying to teach a child to recognize dogs with only three photographs. Three examples won’t cut it; the child needs hundreds before the concept clicks. Now scale that problem to physics: you want a neural network to predict the quantum properties of a material, but each calculation takes days of supercomputer time. You can afford maybe 50 labeled examples. That’s it.

This is the brutal reality facing researchers who apply deep learning to fundamental science. Unlike computer vision, where millions of labeled images can be crowdsourced from the internet, generating labeled data in physics means running expensive simulations or conducting painstaking experiments. Every data point costs real time and money.

A team at MIT has developed SIB-CL (Surrogate- and Invariance-Boosted Contrastive Learning), a framework that cuts the number of labeled examples needed to train accurate models by orders of magnitude.

Key Insight: SIB-CL exploits three cheap, readily available information sources (unlabeled data, physical symmetries, and approximate “surrogate” calculations) to teach neural networks about physical systems using a tiny fraction of the labeled data otherwise required.

How It Works

The core idea is that even when precise labels are scarce, science problems are rarely information-starved. They come packaged with structural knowledge that conventional supervised learning ignores. SIB-CL puts all of it to use.

The framework draws on three auxiliary information sources:

  1. Unlabeled data: raw inputs (say, the geometry of a photonic crystal) that haven’t been paired with expensive computed outputs. These are essentially free to generate.
  2. Invariances and symmetries: prior knowledge that certain transformations leave the label unchanged. If rotating a crystal by 90° doesn’t change its optical properties, the model should know that upfront rather than learning it from scratch across thousands of examples (a minimal augmentation sketch follows this list).
  3. Surrogate data: approximate labels computed using faster, cheaper methods. Not perfectly accurate, but they capture the essential structure at a fraction of the cost.
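To make item 2 concrete, here is a minimal sketch of what a physics-informed augmentation could look like: random symmetry operations on a unit-cell image (90° rotations, mirror flips, and periodic translations) that leave the physical label unchanged. This is an illustrative assumption written in PyTorch, not the authors' exact augmentation pipeline; the function name and the image-grid representation are placeholders.

```python
import torch

def symmetry_augment(unit_cell: torch.Tensor) -> torch.Tensor:
    """Apply a random label-preserving symmetry operation to a unit-cell image.

    unit_cell: (..., H, W) tensor, e.g. a permittivity map of a 2D photonic crystal.
    Assumes the target property (e.g. the DOS) is invariant under these operations.
    """
    k = int(torch.randint(0, 4, (1,)))              # rotate by 0/90/180/270 degrees
    x = torch.rot90(unit_cell, k, dims=(-2, -1))
    if torch.rand(1).item() < 0.5:                  # random mirror flip
        x = torch.flip(x, dims=(-1,))
    # Periodic translation: rolling a periodic unit cell describes the same crystal
    shifts = tuple(int(torch.randint(0, s, (1,))) for s in x.shape[-2:])
    return torch.roll(x, shifts=shifts, dims=(-2, -1))
```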

SIB-CL combines these sources through a two-stage training pipeline. In the first stage, the network undergoes contrastive pre-training, a self-supervised technique where the model learns useful internal representations without any labeled answers. The specific method used is SimCLR.

The model sees two differently transformed versions of the same input and learns to place them close together in a shared representational space, while pushing representations of different inputs far apart. The transformations aren’t arbitrary. They’re chosen to reflect the actual physical symmetries of the problem. A rotation that leaves a crystal’s properties invariant becomes a principled training signal. The physics does the work.
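A minimal sketch of this contrastive objective, in the NT-Xent form used by SimCLR. The implementation below is a generic one assumed for illustration; the temperature value, tensor shapes, and naming are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """SimCLR's NT-Xent loss: pull the two views of each input together, push all others apart.

    z1, z2: (N, d) projections of two augmented views of the same N inputs.
    """
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d), unit-normalized
    sim = z @ z.T / temperature                                # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # a view is not its own positive
    # The positive for view i is the other view of the same input: i <-> i + N (mod 2N)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Each row of `sim` is treated as a classification problem whose correct “class” is the matching view, which is exactly the pull-together, push-apart behaviour described above.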

Interleaved with this contrastive step is surrogate transfer learning: the model pre-trains on the cheaper approximate dataset before fine-tuning on the small, precious set of high-fidelity labels. This gives the network’s encoder, the part that compresses raw inputs into a compact representation, a meaningful head start. It arrives at fine-tuning already understanding the problem’s structure, even if the surrogate’s answers aren’t exact.
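Putting the two pieces together, the first stage might look roughly like the loop below, which reuses the `symmetry_augment` and `nt_xent_loss` sketches above. The tiny stand-in networks, random data, equal loss weighting, and single combined loop are all illustrative assumptions; the paper's actual architectures and interleaving schedule differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in networks; the real encoder is a CNN over unit-cell images.
encoder   = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128), nn.ReLU())
proj_head = nn.Linear(128, 64)   # projection head used only for the contrastive loss
surr_head = nn.Linear(128, 1)    # prediction head trained on cheap surrogate labels
params = [*encoder.parameters(), *proj_head.parameters(), *surr_head.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(100):
    x_unlab = torch.rand(16, 1, 32, 32)   # unlabeled unit cells: essentially free
    x_surr  = torch.rand(16, 1, 32, 32)   # unit cells with approximate (surrogate) labels
    y_surr  = torch.rand(16, 1)

    # Contrastive update: two symmetry-augmented views of the same unlabeled batch.
    z1 = proj_head(encoder(symmetry_augment(x_unlab)))
    z2 = proj_head(encoder(symmetry_augment(x_unlab)))
    loss = nt_xent_loss(z1, z2)

    # Surrogate update: supervised regression on the cheap labels, same encoder.
    loss = loss + F.mse_loss(surr_head(encoder(x_surr)), y_surr)

    optimizer.zero_grad(); loss.backward(); optimizer.step()
```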

The final stage is conventional fine-tuning. A prediction head is attached to the pre-trained encoder and trained on the available accurate labels. Because the encoder already carries physics-informed representations, it needs far fewer expensive examples to converge.
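The fine-tuning stage, continuing the sketch above. Discarding the contrastive projection head, attaching a fresh prediction head, and giving the pretrained encoder a smaller learning rate are common conventions assumed here; only the broad two-stage structure comes from the paper.

```python
pred_head = nn.Linear(128, 1)    # fresh head for the expensive target property
ft_optimizer = torch.optim.Adam([
    {"params": encoder.parameters(),   "lr": 1e-4},  # gentle updates to pretrained encoder
    {"params": pred_head.parameters(), "lr": 1e-3},
])

x_label = torch.rand(50, 1, 32, 32)   # the few expensive, high-fidelity examples
y_label = torch.rand(50, 1)

for epoch in range(200):
    loss = F.mse_loss(pred_head(encoder(x_label)), y_label)
    ft_optimizer.zero_grad(); loss.backward(); ft_optimizer.step()
```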

The team validated SIB-CL on two demanding test cases. The first: predicting the density-of-states (DOS) of 2D photonic crystals, periodic structures that control light propagation and show up in lasers and optical computing. The input is the geometry of the crystal’s unit cell (its smallest repeating building block); the output is a complex spectral function requiring full electromagnetic simulation to compute accurately.
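For orientation, the density-of-states takes the standard form of a Brillouin-zone integral over the photonic band frequencies $\omega_n(\mathbf{k})$; the precise normalization and smoothing used in the paper may differ:

$$\mathrm{DOS}(\omega) \;=\; \sum_{n} \int_{\mathrm{BZ}} \delta\!\bigl(\omega - \omega_n(\mathbf{k})\bigr)\, \frac{\mathrm{d}^2\mathbf{k}}{(2\pi)^2}$$

Every frequency point of this spectrum depends on the full band structure, which is why even a single accurate label requires a complete electromagnetic eigenvalue calculation.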

The second: solving the 3D time-independent Schrödinger equation for quantum systems, a central problem in quantum chemistry and materials design. In both cases, SIB-CL matched standard supervised accuracy with a small fraction of the labeled data.
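The second task targets the time-independent Schrödinger equation in its standard form, written here for a single particle of mass $m$ in a potential $V(\mathbf{r})$:

$$-\frac{\hbar^{2}}{2m}\,\nabla^{2}\psi(\mathbf{r}) + V(\mathbf{r})\,\psi(\mathbf{r}) = E\,\psi(\mathbf{r})$$

Here the network maps the potential to quantities such as the ground-state energy, labels that would otherwise come from a numerical eigensolver.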

Why It Matters

The practical payoff is immediate. Tasks that would require thousands of expensive simulations might now be feasible with dozens. That’s a project shrinking from months to days, and problems opening up to groups without massive compute budgets.

There’s a deeper point here, though. SIB-CL acts on a principle that physicists have always relied on but machine learning has historically ignored: science problems are not naked datasets. They come embedded in a web of known structure, from symmetries and conservation laws to approximate solutions and limiting cases.

Every physics student learns that when you can’t solve a hard problem, you solve an easier one first and work outward. SIB-CL formalizes exactly that strategy for neural networks. As computational science leans more heavily on deep learning, reducing the label bottleneck will widen the range of questions that are actually tractable.

Open questions remain. The current demonstrations focus on well-defined problems where symmetries are known analytically. Extending the approach to messier domains (experimental data from telescopes or particle detectors, for instance) will require new thinking about how to define useful invariances when the underlying symmetry group isn’t clean. SIB-CL also inherits contrastive learning’s sensitivity to batch size and augmentation strategy.

Bottom Line: The rich structural knowledge embedded in physics problems (symmetries, approximate models, unlabeled structure) can be put to work to reduce labeled-data requirements by orders of magnitude, making deep learning viable for some of science’s most expensive computational problems.

IAIFI Research Highlights

Interdisciplinary Research Achievement
This work combines contrastive self-supervised learning with physics, building a framework that directly encodes physical symmetries and exploits the hierarchy between approximate and exact scientific calculations to overcome data scarcity.
Impact on Artificial Intelligence
SIB-CL shows that domain-specific physical invariances can replace generic image augmentations in self-supervised pre-training, producing far more data-efficient models when those invariances are grounded in real structure.
Impact on Fundamental Interactions
By making accurate property prediction possible with far fewer labeled simulations, this work opens up photonics and quantum mechanics problems that were previously too expensive to tackle computationally.
Outlook and References
Future work will likely extend SIB-CL to experimental datasets and to problems where invariances must be discovered rather than specified. The full paper is available at [arXiv:2110.08406](https://arxiv.org/abs/2110.08406).

Original Paper Details

Title
Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science
arXiv ID
2110.08406
Authors
Charlotte Loh, Thomas Christensen, Rumen Dangovski, Samuel Kim, Marin Soljacic
Abstract
Deep learning techniques have been increasingly applied to the natural sciences, e.g., for property prediction and optimization or material discovery. A fundamental ingredient of such approaches is the vast quantity of labelled data needed to train the model; this poses severe challenges in data-scarce settings where obtaining labels requires substantial computational or labor resources. Here, we introduce surrogate- and invariance-boosted contrastive learning (SIB-CL), a deep learning framework which incorporates three “inexpensive” and easily obtainable auxiliary information sources to overcome data scarcity. Specifically, these are: 1) abundant unlabeled data, 2) prior knowledge of symmetries or invariances and 3) surrogate data obtained at near-zero cost. We demonstrate SIB-CL's effectiveness and generality on various scientific problems, e.g., predicting the density-of-states of 2D photonic crystals and solving the 3D time-independent Schrödinger equation. SIB-CL consistently results in orders of magnitude reduction in the number of labels needed to achieve the same network accuracies.