
Mathematical Data Science

Theoretical Physics

Authors

Michael R. Douglas, Kyu-Hwan Lee

Abstract

Can machine learning help discover new mathematical structures? In this article we discuss an approach to doing this which one can call "mathematical data science". In this paradigm, one studies mathematical objects collectively rather than individually, by creating datasets and doing machine learning experiments and interpretations. After an overview, we present two case studies: murmurations in number theory and loadings of partitions related to Kronecker coefficients in representation theory and combinatorics.

Concepts

mathematical data science, automated discovery, interpretability, murmurations, scientific workflows, feature extraction, Kronecker coefficients, regression, classification, hypothesis testing, self-supervised learning

The Big Picture

For centuries, mathematicians have worked the same way: pick a specific object, study it in detail, prove something about it, move on. Patterns showed up when someone computed enough examples by hand, but that was slow going. Now there are databases containing millions of mathematical objects, and the machine learning tools built to recommend movies and translate languages turn out to work surprisingly well on abstract structures too.

Michael R. Douglas and Kyu-Hwan Lee call this approach mathematical data science. Their new paper is part manifesto, part proof of concept. Through two case studies, they show the idea isn’t speculative. It already produces results.

Key Insight: Treating mathematical objects as data and applying machine learning lets researchers detect statistical patterns invisible to traditional analysis, generating precise new conjectures that humans can then rigorously prove.

How It Works

The mathematical data science paradigm breaks into four steps, simple to state but hard to execute well:

  1. Generate a dataset of mathematical objects, not one but thousands or millions, computed systematically.
  2. Apply ML tools to find structure in that dataset, treating invariants (measurable properties) as features.
  3. Interpret the results to understand what the patterns actually mean mathematically.
  4. Formulate conjectures and theorems from the evidence.
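
In miniature, the loop looks like this. It is a toy example, with integers standing in for the mathematical objects and divisor counts as the invariant; none of the specifics come from the paper:

```python
# Hypothetical end-to-end sketch of the four-step workflow.
# Objects: integers. Invariants: divisor count and digit sum (stand-ins).

def divisor_count(n):
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def digit_sum(n):
    return sum(int(c) for c in str(n))

# Step 1: generate the dataset systematically, not one object but many.
objects = list(range(2, 1000))

# Step 2: treat invariants as features and look for structure.
features = [(n, divisor_count(n), digit_sum(n)) for n in objects]

# Step 3: interpret. Objects with an odd divisor count stand out.
odd_div = [n for n, d, _ in features if d % 2 == 1]

# Step 4: formulate a conjecture from the evidence:
# "an integer has an odd number of divisors iff it is a perfect square."
conjecture = all(round(n ** 0.5) ** 2 == n for n in odd_div)
print(conjecture)  # prints True
```

The punchline mirrors the paradigm: the pattern only shows up across the collection, and the final statement is exact enough to prove.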

Every step still requires human mathematicians. What to measure, how to sample, which patterns matter: none of that is automated. The computer is a powerful telescope, not an autonomous explorer.

One of the paper’s conceptual contributions is the platonic dataset. Most ML datasets are fuzzy: web-scraped text, photographs with inconsistent labels. A platonic dataset is mathematically precise. It consists of a well-defined set of objects, a function mapping each to measurable invariants, and a principled rule for choosing finite subsets to analyze. That precision means any conjecture the ML suggests can be stated in exact mathematical language, which you need before you can prove anything.
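
The three ingredients of a platonic dataset can be sketched directly. The concrete choices below (integers as the object family, number of prime factors as the invariant, first-k sampling) are illustrative assumptions, not the paper's examples:

```python
# Sketch of a "platonic dataset": an exactly defined object family,
# an exact invariant map, and a principled rule for finite sampling.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class PlatonicDataset:
    objects: Callable[[], Iterable[int]]   # well-defined (possibly infinite) family
    invariant: Callable[[int], int]        # exact, computable feature map
    sample: Callable[[int], List[int]]     # deterministic rule for a finite subset

def omega(n: int) -> int:
    """Number of prime factors of n, counted with multiplicity."""
    count, d = 0, 2
    while d * d <= n:
        while n % d == 0:
            count += 1
            n //= d
        d += 1
    return count + (1 if n > 1 else 0)

ds = PlatonicDataset(
    objects=lambda: iter(range(2, 10**9)),      # conceptually the whole family
    invariant=omega,
    sample=lambda k: list(range(2, 2 + k)),     # principled: first k objects
)

rows: List[Tuple[int, int]] = [(n, ds.invariant(n)) for n in ds.sample(10)]
print(rows)
```

Because every entry is defined exactly, any regularity found in `rows` translates directly into a precise mathematical statement.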

Figure 1

The first case study involves murmurations, a phenomenon recently discovered in the statistics of elliptic curve L-functions. An L-function encodes deep arithmetic information about a mathematical object. Here, the objects are elliptic curves: smooth curves defined by cubic equations with rich number-theoretic properties. Researchers had studied these functions individually for decades.

When Lee and collaborators looked at large collections of elliptic curves, sorted them by rank (the number of independent rational solutions), and plotted average values of the L-functions, an unexpected oscillatory correlation appeared. They named it after the collective undulation of starling flocks. The pattern is invisible when you study one curve at a time. It only emerges from the crowd.
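
The computation behind such a plot is simple to sketch. In practice the coefficients would come from a database like the LMFDB; the a_p values below are synthetic stand-ins, so only the shape of the pipeline, not the numbers, is meaningful:

```python
# Sketch of the murmuration computation: group elliptic curves by rank
# and average the L-function coefficient a_p over each group, prime by prime.
# The curve data here is a random placeholder, not real arithmetic data.
import math
import random

def primes_up_to(n):
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [p for p, is_p in enumerate(sieve) if is_p]

primes = primes_up_to(200)
random.seed(0)

# Placeholder: each "curve" is (rank, {p: a_p}), with a_p in the Hasse
# bound |a_p| <= 2*sqrt(p) but otherwise random.
curves = [(r, {p: random.uniform(-2 * math.sqrt(p), 2 * math.sqrt(p))
               for p in primes})
          for r in (0, 1) for _ in range(500)]

def rank_average(rank):
    """Average a_p over all curves of a fixed rank, one value per prime."""
    group = [ap for r, ap in curves if r == rank]
    return [sum(c[p] for c in group) / len(group) for p in primes]

avg0, avg1 = rank_average(0), rank_average(1)
print(len(avg0), len(avg1))  # one average per prime
```

With real curves, plotting `avg0` and `avg1` against the primes is what reveals the oscillation; with this random placeholder the averages just hover near zero.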

Figure 2

The second case study tackles Kronecker coefficients, numbers that appear throughout the mathematics of symmetry and have applications in quantum information theory. These coefficients are notoriously hard to compute and harder still to understand conceptually.

The authors define a dataset of integer partitions (ways to write a number as an unordered sum; for example, 4 = 3+1, or 2+2, or 2+1+1) and compute associated loadings: numerical summaries capturing how Kronecker coefficients are distributed across the dataset. ML experiments reveal clustering in these loadings, pointing toward combinatorial structure that pure theory had not predicted.
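
The first half of that construction, enumerating partitions, is easy to sketch. The per-partition summary below is a hypothetical placeholder (length and largest part), not the paper's loading:

```python
# Enumerate integer partitions of n as non-increasing tuples, then attach
# a simple feature vector to each. The features here are illustrative
# stand-ins for the Kronecker-coefficient loadings in the paper.
def partitions(n, max_part=None):
    """Yield all partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

parts = list(partitions(5))
print(parts)  # the 7 partitions of 5, from (5,) down to (1,1,1,1,1)

# Placeholder "loading": (number of parts, largest part) per partition.
summaries = [(len(p), p[0]) for p in parts]
print(summaries)
```

Swapping the placeholder for genuine Kronecker-coefficient statistics is where the hard computation lives; the dataset scaffolding itself is this simple.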

Why It Matters

There is real historical precedent here. The Birch and Swinnerton-Dyer conjecture, one of the Millennium Prize Problems (with a $1 million reward for its proof), started from exactly this kind of computer experiment in the 1960s. Bryan Birch and Peter Swinnerton-Dyer generated elliptic curves by computer and ran linear regression on the results. They were doing mathematical data science before anyone had a name for it.

Douglas and Lee are honest about what sets math apart from other data sciences: interpretability. In biology or medicine, a neural network that predicts outcomes accurately is already useful even if no one can explain why. In math, a black-box prediction is close to worthless. A pattern only becomes meaningful when it can be stated precisely and, eventually, proved.

That tension between ML’s power and math’s demand for rigor is where the field has to grow. Techniques like attribution analysis, which pinpoints which input features drove a model’s prediction, offer one route from statistical pattern to mathematical insight.
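
One common form of attribution, permutation importance, fits in a few lines: destroy one feature at a time and see how much the model's error grows. The model and data here are toy stand-ins, not the paper's setup:

```python
# Attribution by permutation: shuffle one input feature and measure the
# increase in error. A feature the model relies on produces a large jump.
import random

random.seed(1)
# Toy data: the target depends on feature 0 and ignores feature 1.
X = [[random.random(), random.random()] for _ in range(400)]
y = [3.0 * x0 for x0, x1 in X]

def model(row):
    # Stand-in for a trained model; here it happens to be exact.
    return 3.0 * row[0]

def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

base = mse(X, y)
importances = []
for j in range(2):
    shuffled = [row[j] for row in X]
    random.shuffle(shuffled)
    Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled)]
    importances.append(mse(Xp, y) - base)  # error increase from destroying feature j

print(importances)  # feature 0 matters, feature 1 does not
```

In the mathematical setting, the payoff is that a large importance score points at a specific invariant, which is exactly the kind of lead a mathematician can turn into a precise conjecture.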

The conditions are good. Mathematical databases like the LMFDB (L-functions and Modular Forms Database) and KnotInfo have matured. Computing power makes million-object datasets routine. The culture is shifting too: mathematicians and ML researchers are increasingly working together. A 2021 Nature paper by DeepMind, collaborating with mathematicians Geordie Williamson and Marc Lackenby, produced new theorems in knot theory and representation theory. Douglas and Lee’s paper is an invitation to more such work.

Bottom Line: Murmurations in number theory and hidden structure in Kronecker coefficients show that collective analysis of mathematical objects can reveal patterns no individual study could find. Mathematical data science is changing how conjectures get discovered, and the most productive phase is probably still ahead.

IAIFI Research Highlights

Interdisciplinary Research Achievement
This work connects AI methodology with pure mathematics, applying supervised and unsupervised machine learning to abstract mathematical datasets to discover new structures in number theory and representation theory.
Impact on Artificial Intelligence
The paper introduces the platonic dataset framework, a principled standard for ML experiments where data has exact definitions rather than empirical noise. It offers a template for rigorous ML-driven scientific discovery.
Impact on Fundamental Interactions
The murmuration phenomenon in L-functions of elliptic curves is a new statistical regularity in arithmetic objects, with potential implications for the Langlands program and related areas of mathematical physics.
Outlook and References
Future directions include automating more of the conjecture-generation pipeline and extending MDS to geometric and topological objects; the paper is available at [arXiv:2502.08620](https://arxiv.org/abs/2502.08620).
