Thermodynamics of Reinforcement Learning Curricula
Authors
Jacob Adamczyk, Juan Sebastian Rojas, Rahul V. Kulkarni
Abstract
Connections between statistical mechanics and machine learning have repeatedly proven fruitful, providing insight into optimization, generalization, and representation learning. In this work, we follow this tradition by leveraging results from non-equilibrium thermodynamics to formalize curriculum learning in reinforcement learning (RL). In particular, we propose a geometric framework for RL by interpreting reward parameters as coordinates on a task manifold. We show that, by minimizing the excess thermodynamic work, optimal curricula correspond to geodesics in this task space. As an application of this framework, we provide an algorithm, "MEW" (Minimum Excess Work), to derive a principled schedule for temperature annealing in maximum-entropy RL.
Concepts
The Big Picture
Imagine teaching a child to play chess. You wouldn’t start with grandmaster-level opponents. You’d begin with simple puzzles, gradually introduce tougher challenges, and build complexity over time. This intuition, formalized as curriculum learning, drives much of how modern AI systems are trained. But there is little principled guidance on how to choose the sequence of tasks: most practitioners simply increase difficulty at a constant rate and hope for the best.
A team of physicists and engineers thinks that’s the wrong approach. Their tool for proving it? The laws of thermodynamics.
Jacob Adamczyk, Juan Sebastian Rojas, and Rahul V. Kulkarni argue that training an AI agent through a sequence of tasks is mathematically analogous to driving a physical system out of equilibrium. Think of slowly heating a gas versus rapidly compressing it. By borrowing results from the physics of far-from-equilibrium systems, they’ve built a geometric theory of curriculum design that tells you not just where to go, but the optimal path to get there.
Key Insight: Optimal training curricula in reinforcement learning correspond to geodesics (the shortest paths) on a curved “task manifold,” where the curvature is set by the agent’s own learning dynamics and the structure of its rewards.
How It Works
The core move is reinterpreting reward function parameters as coordinates in a geometric space. When training a reinforcement learning agent, you’re typically tuning things like reward weights, temperature parameters, or task difficulty settings. Instead of treating these as dials to twist arbitrarily, the researchers treat them as positions on a task manifold, a curved surface where the geometry encodes how hard it is to move from one task to another.
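To make “reward parameters as coordinates” concrete, here is a minimal, hypothetical sketch (the features, weights, and tasks are illustrative, not taken from the paper): the reward is a weighted sum of fixed features, the weight vector λ is a point on the task manifold, and a curriculum is a path λ(t) through that space.

```python
import numpy as np

def features(state, action):
    """Hypothetical task features; in practice these are environment-specific."""
    return np.array([
        -np.linalg.norm(state),   # e.g. negative distance to a goal at the origin
        -np.sum(action ** 2),     # e.g. control-effort penalty
        1.0,                      # constant living bonus
    ])

def reward(state, action, lam):
    """Reward parameterized by lam, a point ("coordinates") on the task manifold."""
    return float(lam @ features(state, action))

# A naive constant-rate curriculum: linear interpolation from an easy task to a
# hard one. The paper's point is that this straight-line, constant-speed protocol
# is generally not the minimum-excess-work path.
lam_easy = np.array([2.0, -0.1, 0.5])
lam_hard = np.array([0.5, -1.0, 0.0])
linear_curriculum = [(1 - s) * lam_easy + s * lam_hard for s in np.linspace(0.0, 1.0, 11)]
```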
Here’s where physics enters. In thermodynamics, driving a system between two states too quickly dissipates extra energy as heat, a quantity called excess work. This dissipation is path-dependent: a slow, careful route wastes far less than a fast, jerky one.
The researchers argue the same phenomenon happens in reinforcement learning. Move through task space too abruptly and the agent pays a learning inefficiency tax. It can’t keep up with the optimal policy and wastes training steps catching up.
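In the thermodynamic-geometry literature this picture builds on (linear-response treatments of slowly driven systems), the wasted effort has a standard quadratic form. As a sketch, with λ(t) the protocol through task space and ζ the friction tensor introduced next, and not necessarily in the paper's exact notation:

```latex
W_{\mathrm{excess}} \;\approx\; \int_{0}^{\tau} \dot{\lambda}(t)^{\top}\, \zeta\!\big(\lambda(t)\big)\, \dot{\lambda}(t)\, \mathrm{d}t
```

Read it as: the faster you move (larger λ̇) and the stickier the region (larger ζ), the more effort is dissipated; driving slowly through low-friction regions costs almost nothing.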
To quantify this, they introduce the friction tensor ζ, computed using the Green-Kubo relations, classical physics formulas that measure how a system responds to small disturbances over time. Each entry captures how much a small change in the reward signal ripples through the agent’s behavior in the long run. The friction tensor acts like a ruler warped by the agent’s own behavior:
- Some directions in task space are slippery: easy to traverse, little excess work incurred
- Others are sticky: costly to cross, requiring slower, more deliberate transitions
- The geometry varies across task space, so no single annealing rate fits everywhere
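To make the Green-Kubo construction concrete, here is a minimal sketch of estimating ζ at a single point in task space as the time-integrated correlation of fluctuations in observables conjugate to the task parameters (for instance, gradients of the reward with respect to λ along the states the current policy visits). The helper, its inputs, and the discretization are illustrative assumptions, not the paper's estimator:

```python
import numpy as np

def friction_tensor(conjugate_obs, dt=1.0, max_lag=200):
    """Green-Kubo-style estimate of the friction tensor at one fixed task lam.

    conjugate_obs: array of shape (T, d) holding, at each of T time steps, the d
    observables conjugate to the task parameters (e.g. the gradient of the reward
    with respect to lam on the visited state-action pair). Each entry zeta[i, j]
    is the time-integrated cross-correlation of the fluctuations of observables
    i and j; an overall linear-response prefactor is absorbed into the units.
    """
    x = conjugate_obs - conjugate_obs.mean(axis=0)        # fluctuations about the mean
    T, d = x.shape
    zeta = np.zeros((d, d))
    for lag in range(min(max_lag, T - 1)):
        # average of <delta_x(t + lag) outer delta_x(t)> over t, summed over lags
        corr = x[lag:].T @ x[: T - lag] / (T - lag)
        zeta += dt * 0.5 * (corr + corr.T)                # symmetrize each lag's contribution
    return zeta
```

In use, one would hold the task fixed at some λ, roll out the current policy long enough to collect the conjugate observables, call this estimator, and repeat across a grid of tasks to map out the geometry.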

With this metric in hand, minimizing total excess work over a curriculum becomes a classical geometry problem: find the geodesic, the shortest path in the curved space defined by the friction tensor. For one-dimensional task schedules, the researchers derive a closed-form expression for the optimal protocol. It is not a constant-rate sweep: the schedule spends more time in regions of high friction, treading carefully where learning is hardest.
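For a single parameter, the geodesic condition amounts to moving at constant “thermodynamic speed”, so the rate of change scales like ζ(λ)^(-1/2) and the schedule allocates time in proportion to √ζ. A minimal sketch of turning a tabulated friction profile into such a schedule (the grid, the toy friction profile, and the step count are all placeholder assumptions):

```python
import numpy as np

def geodesic_schedule(lam_grid, zeta_vals, n_steps):
    """1-D minimum-excess-work schedule: equal steps in thermodynamic length.

    lam_grid:  increasing grid of parameter values from lam_start to lam_end
    zeta_vals: friction zeta(lam) evaluated on that grid (positive)
    n_steps:   number of curriculum phases

    Returns lam values such that each phase covers the same arc length
    d(ell) = sqrt(zeta) * d(lam), i.e. lam moves slowly where friction is high.
    """
    speed = np.sqrt(zeta_vals)
    # cumulative thermodynamic length ell(lam) via trapezoidal integration
    ell = np.concatenate([[0.0], np.cumsum(0.5 * (speed[1:] + speed[:-1]) * np.diff(lam_grid))])
    targets = np.linspace(0.0, ell[-1], n_steps)   # equally spaced arc lengths
    return np.interp(targets, ell, lam_grid)       # invert ell(lam) to get lam(t)

# Toy friction profile: a "sticky" region near lam = 0.3 forces the schedule to
# slow down there instead of sweeping the parameter at a constant rate.
grid = np.linspace(0.0, 1.0, 501)
zeta = 1.0 + 20.0 * np.exp(-((grid - 0.3) / 0.05) ** 2)
schedule = geodesic_schedule(grid, zeta, n_steps=50)
```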
The algorithm they build from this, MEW (Minimum Excess Work), applies directly to temperature annealing in maximum-entropy reinforcement learning, a popular framework that encourages agents to explore broadly rather than prematurely commit to one strategy. (Soft Actor-Critic is probably the best-known example.) A temperature parameter α controls how adventurous the agent is: high values encourage exploration, low values encourage exploitation. Standard practice anneals α linearly toward zero. MEW replaces this with a non-linear schedule derived from the friction geometry, slowing down precisely where the agent is most sensitive to change.
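A hedged sketch of what this could look like in practice: instead of decreasing α at a constant rate, the temperature follows the 1-D equal-thermodynamic-length schedule from above, computed from an estimated friction profile over α. The `mew_alpha_schedule` helper and the friction profile below are placeholders; the paper's MEW algorithm estimates the friction from the agent's own dynamics.

```python
import numpy as np

def mew_alpha_schedule(alpha_hi, alpha_lo, zeta_of_alpha, n_steps, grid_size=400):
    """Temperature schedule taking equal steps in thermodynamic length.

    zeta_of_alpha: callable giving the (estimated) friction at temperature alpha.
    Returns n_steps temperatures annealing from alpha_hi down to alpha_lo, moving
    slowly where friction is large and quickly where it is small.
    """
    grid = np.linspace(alpha_lo, alpha_hi, grid_size)
    speed = np.sqrt(zeta_of_alpha(grid))
    ell = np.concatenate([[0.0], np.cumsum(0.5 * (speed[1:] + speed[:-1]) * np.diff(grid))])
    targets = np.linspace(0.0, ell[-1], n_steps)
    return np.interp(targets, ell, grid)[::-1]     # reverse so training starts hot, ends cold

# Placeholder friction profile (assumed, not from the paper): low temperatures are
# treated as the "sticky" region, so the schedule decelerates as alpha gets small.
alphas = mew_alpha_schedule(1.0, 0.01, lambda a: 1.0 / a ** 2, n_steps=10_000)
# Baseline for comparison, the standard linear schedule: np.linspace(1.0, 0.01, 10_000)
```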
Why It Matters
This goes well beyond a cleaner annealing schedule. The framework offers something unusual: a principled language for curriculum design. Right now, questions like “should I increase task difficulty fast or slow?” or “which reward parameter should I tune first?” get answered by intuition and trial-and-error. Once the friction tensor can be estimated efficiently, practitioners can measure the geometry of their problem and compute optimal schedules rather than guess them.

The paper also points toward possible unifying connections. Phenomena like potential-based reward shaping (guiding agents with engineered bonus rewards), simulated annealing (finding good solutions by gradually reducing randomness), and feature collapse (when an agent’s internal representations become repetitive and uninformative) may all fit naturally into this geometric picture. Extending the framework to multi-dimensional curricula, where geodesic computation gets harder, is the obvious next step. Developing practical friction tensor estimators for deep RL, where closed-form solutions aren’t available, is another.
The analogy to physical control theory also means decades of results from statistical physics on optimal protocols, fluctuation theorems, and entropy production could transfer directly to AI training.
Bottom Line: By treating RL training as a thermodynamic process, this work turns curriculum design from an art into a geometry problem, one where the optimal answer is a geodesic, not a guess.
IAIFI Research Highlights
This work maps reinforcement learning curriculum design onto non-equilibrium thermodynamics, importing the friction tensor and Green-Kubo relations from statistical mechanics to formalize learning efficiency as a geometric quantity.
The MEW algorithm provides the first thermodynamically principled approach to temperature annealing in maximum-entropy RL, replacing ad hoc linear schedules with geodesic-optimal protocols derived from the agent's own policy dynamics.
The framework connects statistical mechanics and machine learning at a formal level, showing that non-equilibrium control theory, developed for physical systems like colloidal particles and spin glasses, applies directly to the learning dynamics of artificial agents.
Future directions include extending MEW to multi-dimensional task spaces and developing scalable friction tensor estimators for deep RL. The paper is available at [arXiv:2603.12324](https://arxiv.org/abs/2603.12324).
Original Paper Details
Thermodynamics of Reinforcement Learning Curricula
[2603.12324](https://arxiv.org/abs/2603.12324)
Jacob Adamczyk, Juan Sebastian Rojas, Rahul V. Kulkarni