AI-Assisted Discovery of Quantitative and Formal Models in Social Science
Authors
Julia Balla, Sihao Huang, Owen Dugan, Rumen Dangovski, Marin Soljacic
Abstract
In social science, formal and quantitative models, such as ones describing economic growth and collective action, are used to formulate mechanistic explanations, provide predictions, and uncover questions about observed phenomena. Here, we demonstrate the use of a machine learning system to aid the discovery of symbolic models that capture nonlinear and dynamical relationships in social science datasets. By extending neuro-symbolic methods to find compact functions and differential equations in noisy and longitudinal data, we show that our system can be used to discover interpretable models from real-world data in economics and sociology. Augmenting existing workflows with symbolic regression can help uncover novel relationships and explore counterfactual models during the scientific process. We propose that this AI-assisted framework can bridge parametric and non-parametric models commonly employed in social science research by systematically exploring the space of nonlinear models and enabling fine-grained control over expressivity and interpretability.
The Big Picture
What drives economic growth across nations? Economists have decades of data (total output, capital stock, workforce size, savings rates) but the underlying equation is hidden, buried in noise.
You could fit a statistical model and call it a day, or you could trust your theoretical intuitions. Neither approach, on its own, reveals the functional form of the relationship.
Social science has long relied on two imperfect tools: hand-crafted equations built on human intuition, or black-box machine learning that produces predictions nobody can explain. Intuition-driven models embed hidden assumptions. Black-box models can’t tell you why anything happens, which makes them poor tools for understanding mechanisms or designing policy.
Julia Balla and colleagues at MIT and IAIFI show that symbolic regression, an AI-powered search for interpretable mathematical expressions, can split the difference. By extending a neuro-symbolic system called OccamNet, they find compact, human-readable equations from messy real-world social science data. The payoff is not just accurate predictions but equations that can be inspected for why social phenomena behave as they do.
How It Works
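OccamNet combines neural networks with symbolic math to search for equations. The target is a compact expression such as the Cobb-Douglas production function Y = A K^α L^(1-α), where Y is output, K is capital, L is labor, A is a productivity factor, and α is the capital share, rather than a pile of inscrutable neural network weights. The system samples candidate mathematical expressions, evaluates their fit, and iterates toward better solutions, running on GPUs to explore a vast space of possible functional forms.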
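The sample, evaluate, and keep-the-best loop can be illustrated with a deliberately tiny stand-in. The sketch below is not OccamNet: it exhaustively scores a small hand-written library of candidate forms instead of sampling from a neural network, and the data, the candidate library, and the helper `fit_scale` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not data from the paper):
# Y = A * K**alpha * L**(1 - alpha) with about 1% multiplicative noise.
A_true, alpha_true = 2.0, 0.3
K = rng.uniform(1.0, 10.0, size=200)
L = rng.uniform(1.0, 10.0, size=200)
Y = A_true * K**alpha_true * L**(1 - alpha_true) * np.exp(rng.normal(0.0, 0.01, size=200))

# Hypothetical candidate library. Each entry is a feature f(K, L); one scale
# factor c is fit per candidate so that Y is approximated by c * f(K, L).
candidates = {
    "K": K,
    "L": L,
    "K + L": K + L,
    "K * L": K * L,
}
for i in range(1, 10):
    a = i / 10
    candidates[f"K^{a:.1f} * L^{1 - a:.1f}"] = K**a * L**(1 - a)


def fit_scale(feature, target):
    """Closed-form least-squares scale c and mean squared error of c * feature."""
    c = float(feature @ target / (feature @ feature))
    mse = float(np.mean((target - c * feature) ** 2))
    return c, mse


# Evaluate every candidate and keep the best-fitting one.
scores = {name: fit_scale(feature, Y) for name, feature in candidates.items()}
best_name = min(scores, key=lambda name: scores[name][1])
best_c, best_mse = scores[best_name]
print(best_name, round(best_c, 3), best_mse)
```

A real symbolic regression system searches a far larger space of compositions and fits all constants jointly; the point here is only that each candidate is an explicit formula whose fit can be compared directly.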

Raw OccamNet wasn’t built for social science data. The researchers made two key extensions:
- Noise tolerance via complexity regularization. Social science datasets are small and messy. Adding a complexity penalty discourages overly complicated equations and produces a Pareto front of solutions: a curve showing the trade-off between equation complexity and accuracy. You can then pick the right balance rather than accepting an arbitrary model.
- Weight-sharing for panel data. Many social science datasets are panel datasets, where the same variables are measured across many countries or groups over time. Each country may have different parameter values, but the form of the equation is shared. The weight-sharing approach trains one model simultaneously across all panels, sharing structural knowledge while allowing panel-specific parameters. This matters most when individual time series are short.
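The idea of a shared equation form with panel-specific parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not OccamNet's training procedure: the synthetic panels, the four-entry library `forms`, and the helper `fit_panel` are hypothetical. The functional form is selected once using the error summed over all panels, while each panel keeps its own fitted coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic panel (illustrative assumption): five groups share the form
# y = c_i * sqrt(x) but each has its own coefficient c_i and only six points.
x = np.linspace(1.0, 10.0, 6)
true_c = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
panels = [c * np.sqrt(x) + rng.normal(0.0, 0.05, size=x.size) for c in true_c]

# Hypothetical library of shared functional forms f(x).
forms = {
    "x": x,
    "x^2": x**2,
    "sqrt(x)": np.sqrt(x),
    "log(1 + x)": np.log1p(x),
}


def fit_panel(feature, y):
    """Least-squares coefficient for y ~ c * feature, plus the squared error."""
    c = float(feature @ y / (feature @ feature))
    sse = float(np.sum((y - c * feature) ** 2))
    return c, sse


# The form is chosen once for all panels by summing per-panel errors;
# the coefficient is still fit separately for each panel.
pooled = {}
for name, feature in forms.items():
    fits = [fit_panel(feature, y) for y in panels]
    pooled[name] = (sum(sse for _, sse in fits), [c for c, _ in fits])

best_form = min(pooled, key=lambda name: pooled[name][0])
panel_coefficients = pooled[best_form][1]
print(best_form, np.round(panel_coefficients, 2))
```

Pooling the error across panels means thirty observations inform the choice of form instead of six, which is the intuition behind sharing structure when each series is short.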
Researchers supply a dataset, optionally inject inductive priors (known constants, preferred functional forms), and let OccamNet search the model space. The output is a ranked list of symbolic equations with error distributions, ready for interpretation.
Three benchmark tests show what this looks like in practice.
Solow-Swan growth model. The researchers fed OccamNet data from 18 OECD countries and asked it to discover the governing growth equation from scratch, with no theoretical guidance. It recovered the correct form. The Pareto complexity map also revealed which countries fit the standard model well and which showed systematic deviations, pointing toward where the theory needs updating.
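For reference, the textbook Solow-Swan model can be written as below. This is the standard form with Cobb-Douglas production and labor-augmenting technology; the exact parameterization fitted in the paper may differ, so treat the symbols as assumptions: s is the savings rate, n the labor growth rate, g the technology growth rate, δ the depreciation rate, α the capital share, and k capital per unit of effective labor.

```latex
\begin{aligned}
Y(t) &= K(t)^{\alpha}\,\bigl(A(t)\,L(t)\bigr)^{1-\alpha}, \qquad k(t) = \frac{K(t)}{A(t)\,L(t)},\\
\frac{dk}{dt} &= s\,k^{\alpha} - (n + g + \delta)\,k,\\
k^{*} &= \left(\frac{s}{n + g + \delta}\right)^{\frac{1}{1-\alpha}}.
\end{aligned}
```

The last line is the steady state obtained by setting dk/dt = 0, which is the quantity against which country-level deviations are usually judged.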

Epidemic dynamics. Using real COVID-19 wave data, the system recovered the Lotka-Volterra equations, the classic predator-prey framework applied to disease spread. The weight-sharing approach, trained across multiple epidemic waves simultaneously, outperformed models trained on single waves. Panel structure genuinely helps when individual time series are short.
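To make "recovering a differential equation from a time series" concrete, here is a minimal sketch that uses a swapped-in technique: ordinary least squares on finite-difference derivatives over a fixed library of terms, not OccamNet's search. The trajectory is simulated from Lotka-Volterra equations with made-up parameters (no COVID-19 data is used), and the function names are hypothetical.

```python
import numpy as np


def lotka_volterra(state, a, b, c, d):
    """Right-hand side: dx/dt = a*x - b*x*y, dy/dt = -c*y + d*x*y."""
    x, y = state
    return np.array([a * x - b * x * y, -c * y + d * x * y])


def simulate(state0, params, dt, n_steps):
    """Integrate the system with a classical fourth-order Runge-Kutta scheme."""
    traj = np.empty((n_steps + 1, 2))
    traj[0] = state0
    for i in range(n_steps):
        s = traj[i]
        k1 = lotka_volterra(s, *params)
        k2 = lotka_volterra(s + 0.5 * dt * k1, *params)
        k3 = lotka_volterra(s + 0.5 * dt * k2, *params)
        k4 = lotka_volterra(s + dt * k3, *params)
        traj[i + 1] = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj


# Illustrative parameters (assumptions): a, b, c, d.
params = (1.0, 0.5, 1.0, 0.2)
dt = 0.01
traj = simulate(np.array([4.0, 1.0]), params, dt, n_steps=2000)
x, y = traj[:, 0], traj[:, 1]

# Central-difference derivatives; drop the two endpoints, where np.gradient
# falls back to less accurate one-sided differences.
dxdt = np.gradient(x, dt)[1:-1]
dydt = np.gradient(y, dt)[1:-1]

# Fixed term library [x, y, x*y]; the regression decides which terms matter.
library = np.column_stack([x, y, x * y])[1:-1]
coef_x, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
coef_y, *_ = np.linalg.lstsq(library, dydt, rcond=None)
print(np.round(coef_x, 3), np.round(coef_y, 3))
```

On clean simulated data the fitted coefficients land close to the generating values and the unused term gets a coefficient near zero. Real epidemic data is noisy and short, which is where complexity penalties and pooling across waves become important.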
Production functions. OccamNet didn’t just recover the standard Cobb-Douglas form. It found higher-order corrections, suggesting where the standard model breaks down. The Pareto front gives researchers a systematic way to compare models rather than anchoring on whatever equation their theory happened to suggest first.
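Extracting a Pareto front from a set of scored candidates, and then picking one point on it with a complexity penalty, can be sketched like this. The candidate expressions, complexity counts, and error values below are placeholder numbers invented for illustration, not results from the paper, and `Candidate`, `pareto_front`, and `select` are hypothetical names.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    expression: str
    complexity: int  # e.g. number of operators and constants (assumed measure)
    error: float     # e.g. mean squared error on held-out data


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no simpler-or-equal candidate matches or beats in error."""
    front = []
    best_error = float("inf")
    for cand in sorted(candidates, key=lambda c: (c.complexity, c.error)):
        if cand.error < best_error:
            front.append(cand)
            best_error = cand.error
    return front


def select(front: List[Candidate], penalty: float) -> Candidate:
    """Pick the candidate minimizing error + penalty * complexity."""
    return min(front, key=lambda c: c.error + penalty * c.complexity)


# Placeholder values for illustration only.
candidates = [
    Candidate("c*K", 2, 0.40),
    Candidate("c*(K + L)", 4, 0.25),
    Candidate("c*K*L", 4, 0.90),
    Candidate("A*K^a*L^(1-a)", 7, 0.02),
    Candidate("A*K^a*L^(1-a) + b*K", 10, 0.02),
    Candidate("A*K^a*L^b + c*K*L", 12, 0.015),
]

front = pareto_front(candidates)
chosen = select(front, penalty=0.01)
for cand in front:
    print(cand)
print("chosen:", chosen.expression)
```

Raising or lowering `penalty` moves the choice along the front, which is one simple way to expose the trade-off between accuracy and interpretability to the analyst.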

Why It Matters
One of the hardest problems in social science is that many plausible theories can fit the same data. Researchers’ priors, cultural backgrounds, and theoretical commitments shape which models they even think to test. Symbolic regression doesn’t eliminate human judgment, but it systematically explores the space of possible equations, reducing the risk that an analyst’s blind spots determine the answer.
The approach also makes counterfactual exploration practical. Once you have a symbolic equation, you can ask “what alternative models fit the data almost as well?” and “what would happen if I changed this one term?” Those questions are nearly impossible with a neural network but straightforward with an interpretable equation. For policymakers, explainability isn’t a luxury.
Open questions remain. How do you handle variables that are fundamentally unobservable, like institutional quality or social norms? Can symbolic regression guide theory development in real time as new data arrives? And as language models grow more capable at mathematical reasoning, could they work alongside systems like OccamNet to propose candidate functional forms before the search even begins?
The upshot: teaching machines to search for interpretable equations, not just patterns, gives social scientists a tool they haven’t had before, one that can find mathematical structure in human behavior while remaining transparent enough to interrogate.
IAIFI Research Highlights
This work brings physics-inspired symbolic regression, originally developed for discovering laws of nature, into the social sciences. The same AI tools that find physical equations can reveal mathematical structure in economic and sociological data.
The team extended neuro-symbolic regression with two advances: improved noise tolerance via complexity regularization and a novel weight-sharing scheme for longitudinal panel data. Together, these make symbolic AI practical for small, messy real-world datasets.
Recovering governing equations like the Solow-Swan growth model and Lotka-Volterra epidemic dynamics from empirical data alone shows that the mechanistic modeling approach central to physics transfers to complex social phenomena.
Future work could integrate large language models to propose candidate functional forms and extend the framework to handle latent variables and causal inference; the full paper is available as [arXiv:2210.00563](https://arxiv.org/abs/2210.00563).
Original Paper Details
AI-Assisted Discovery of Quantitative and Formal Models in Social Science
arXiv:2210.00563
Julia Balla, Sihao Huang, Owen Dugan, Rumen Dangovski, Marin Soljacic