Average-Reward Soft Actor-Critic

Foundational AI

Authors

Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni

Abstract

The average-reward formulation of reinforcement learning (RL) has drawn increased interest in recent years for its ability to solve temporally-extended problems without relying on discounting. Meanwhile, in the discounted setting, algorithms with entropy regularization have been developed, leading to improvements over deterministic methods. Despite the distinct benefits of these approaches, deep RL algorithms for the entropy-regularized average-reward objective have not been developed. While policy-gradient-based approaches have recently appeared in the average-reward literature, the corresponding actor-critic framework remains less explored. In this paper, we introduce an average-reward soft actor-critic algorithm to address these gaps in the field. We validate our method by comparing with existing average-reward algorithms on standard RL benchmarks, achieving superior performance for the average-reward criterion.

Concepts

reinforcement learning, average-reward MDP, entropy regularization, actor-critic framework, reward optimization, loss function design, stochastic processes, robustness, scalability, Bayesian inference, uncertainty quantification

The Big Picture

Imagine training a dog by rewarding every trick it performs today slightly more than tricks it might perform next week. The dog learns fast, but becomes weirdly shortsighted, optimizing for the next few minutes rather than becoming a well-behaved companion over years. Most modern AI agents work this way.

In reinforcement learning, where AI agents learn by collecting rewards through trial and error, tasks often run indefinitely: a robot walking, a trading algorithm ticking, a game with no natural end. For these situations, researchers commonly apply a discount factor (γ) that makes future rewards worth less than immediate ones. It’s convenient, but it introduces an arbitrary knob. Set γ too low and the agent becomes myopic; set it slightly wrong and performance can collapse. The agent ends up optimizing a quantity nobody actually cares about, since practitioners ultimately measure performance by average reward over time anyway.

A cleaner alternative: optimize for the average reward directly. How much reward per timestep can the agent earn in the long run? Researchers at UMass Boston, San José State, and Texas Tech have combined this objective with a technique that encourages agents to keep their options open rather than prematurely committing to a single strategy. The result is Average-Reward Soft Actor-Critic (AR-SAC), a new algorithm that outperforms existing approaches while requiring minimal changes to existing codebases.

Key Insight: By combining average-reward optimization with entropy regularization, AR-SAC eliminates discount factor tuning while producing policies that are naturally stochastic and varied. No prior deep RL implementation achieved this combination.

How It Works

The workhorse of modern deep RL is Soft Actor-Critic (SAC), an algorithm that maximizes both reward and the entropy of its policy. A temperature parameter (β⁻¹) controls how strongly the agent is rewarded for staying flexible rather than committing to a single action. This produces stochastic policies (randomly varying rather than rigidly deterministic) that generalize better and handle unexpected situations more gracefully. SAC has become a standard in the discounted setting.
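In equations, SAC's standard maximum-entropy objective weights the reward against an entropy bonus controlled by the temperature β⁻¹:

```latex
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;+\; \beta^{-1}\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\,\mathbb{E}_{a \sim \pi}\big[\log \pi(a \mid s)\big].
```

Larger β⁻¹ pushes the agent toward more random, exploratory behavior; smaller β⁻¹ recovers ordinary reward maximization.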

Average-reward RL operates differently at a mathematical level. In discounted RL, value functions satisfy a clean recursive structure called the Bellman equation, which contracts toward a unique solution and makes optimization stable. Remove discounting and this contraction disappears. The math gets messier, and most existing deep RL tools simply don’t apply.

Figure 1

The authors reformulate SAC’s core quantities for the average-reward setting. Instead of a standard Q-function estimating discounted future rewards, AR-SAC tracks the differential Q-function: how much better or worse a state-action pair performs relative to the agent’s average reward rate ρ. The Bellman update subtracts this running reward rate at every step, keeping value estimates bounded without discounting. Entropy regularization folds into this modified objective consistently, producing what the authors call the Entropy-Regularized Average-Reward (ERAR) MDP framework. (An MDP, or Markov Decision Process, is the standard mathematical model for sequential decisions.)
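Schematically, the change amounts to one term in the soft Bellman backup (standard soft-value notation assumed; the paper's precise formulation may differ in detail):

```latex
% Discounted soft Bellman backup (standard SAC):
Q(s,a) \;\leftarrow\; r(s,a) \;+\; \gamma\,\mathbb{E}_{s'}\big[V(s')\big]
% Entropy-regularized average-reward backup (AR-SAC):
Q(s,a) \;\leftarrow\; r(s,a) \;-\; \rho \;+\; \mathbb{E}_{s'}\big[V(s')\big]
% with the soft value in both cases:
V(s) \;=\; \mathbb{E}_{a \sim \pi}\big[Q(s,a) \;-\; \beta^{-1}\log \pi(a \mid s)\big]
```

Subtracting ρ at every step plays the stabilizing role that γ plays in the discounted setting: it keeps the differential Q-values bounded even over infinite horizons.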

The algorithm follows a familiar actor-critic structure:

  • A critic network estimates the differential Q-function, updated by minimizing a modified Bellman residual that tracks the average reward rate
  • An actor network updates the policy to maximize both expected differential value and policy entropy
  • The reward rate ρ is learned online, adapting as the agent improves
  • The temperature β⁻¹ is tuned automatically via a dual optimization trick inherited from SAC


In practice, implementing AR-SAC requires only modest changes to an existing SAC codebase: remove the γ multiplier from the Bellman backup, add the reward rate estimate, and adjust the entropy target.
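In code, the change to a SAC critic update is correspondingly small. The sketch below uses plain floats for clarity; `beta_inv`, `update_rho`, and the learning rate are illustrative assumptions for this sketch, not the paper's exact hyperparameters or update rule:

```python
beta_inv = 0.2  # entropy temperature beta^{-1} (illustrative value, tuned automatically in practice)

def soft_value(next_q, next_logp):
    # Soft state value: V(s') = Q(s', a') - beta^{-1} * log pi(a'|s')
    return next_q - beta_inv * next_logp

def discounted_target(reward, next_q, next_logp, gamma=0.99):
    # Standard SAC critic target: r + gamma * V(s')
    return reward + gamma * soft_value(next_q, next_logp)

def average_reward_target(reward, next_q, next_logp, rho):
    # AR-SAC critic target: drop gamma, subtract the reward-rate estimate rho
    return reward - rho + soft_value(next_q, next_logp)

def update_rho(rho, td_error, lr=0.1):
    # One common online reward-rate update: nudge rho along the TD error
    # (a sketch of the general idea, not necessarily the authors' exact rule)
    return rho + lr * td_error
```

The diff against a working SAC implementation is essentially the `- rho` term plus one extra scalar to learn, which is why the authors describe the required changes as modest.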

Why It Matters

The immediate payoff is freedom from discount factor tuning. Standard SAC on the Swimmer-v5 MuJoCo environment (a widely used physics simulator for robot locomotion) fails completely at its default γ = 0.99. Only careful manual tuning unlocks good performance. AR-SAC sidesteps this problem entirely. Benchmarks across standard continuous-control tasks show AR-SAC achieving superior performance on the average-reward criterion compared to existing average-reward algorithms, including policy-gradient methods that previously represented the state of the art.

Figure 2

What makes this relevant for physics? Many physical systems (thermal machines, biological motor control, particle accelerators in steady-state operation) are inherently continuing tasks where long-run average performance is the meaningful objective, not discounted returns. Entropy regularization also connects naturally to statistical physics: maximizing entropy corresponds to finding the least-biased probability distribution consistent with observed constraints, a concept central to thermodynamics and information theory. Having these ideas together in a working deep RL algorithm gives the physics-AI community tools better matched to their actual problems.

Theoretical convergence guarantees for average-reward deep RL remain weaker than in the discounted case, and closing that gap is an active area of research. Extending AR-SAC to offline settings, multi-agent systems, and partial observability is a natural next step. The authors also note that their formulation handles general KL-divergence regularization relative to arbitrary reference policies, not just maximum entropy, which could allow incorporating prior knowledge into policy structure.

Bottom Line: AR-SAC is a principled deep RL algorithm for long-horizon continuous control that beats the competition without discount factor tuning, with entropy regularization built in for stability.

IAIFI Research Highlights

Interdisciplinary Research Achievement
This work connects statistical physics and reinforcement learning by formalizing entropy-regularized average-reward optimization, a framework where maximum-entropy principles from thermodynamics directly shape how AI agents learn optimal long-run behavior.
Impact on Artificial Intelligence
AR-SAC is the first deep RL algorithm combining entropy regularization with the average-reward objective, achieving state-of-the-art performance on continuous control benchmarks while eliminating the need to tune the discount factor.
Impact on Fundamental Interactions
Average-reward RL is naturally suited to physical systems in steady-state or ergodic regimes. AR-SAC gives researchers a practical tool for applying modern deep RL to physics problems where long-run average performance is the correct objective.
Outlook and References
Future work includes extending AR-SAC to offline learning and multi-agent settings. The paper is by Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, and Rahul V. Kulkarni, published in *Reinforcement Learning Journal* (2025). [[arXiv:2501.09080](https://arxiv.org/abs/2501.09080)]