Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

2026-05-08

Machine Learning
AI summary

The authors study how to teach computers to make decisions when risk matters in a specific way, through a criterion called exponential utility. By extending existing mathematical tools, they develop methods for finding the best decision-making strategies and prove that these methods work reliably. They create two learning algorithms: for the first they prove convergence and also quantify how fast it converges, while the second is harder to analyze but is still shown to converge. Their results lay the groundwork for teaching machines to handle risk-sensitive choices using value-based reinforcement learning.

reinforcement learning, exponential utility, Markov decision processes, Q-learning, Bellman equation, stationary policy, contraction mapping, timescale separation, Lipschitz continuity, Dini derivative
Authors
Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat
Abstract
Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied by Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log (Thompson) metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and we provide a scalar finite-time analysis that highlights the challenges of obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
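
To make the operator structure concrete, below is a minimal numerical sketch, not the paper's actual algorithm. It assumes a tabular MDP with transition kernel $P$, reward $r$, discount factor $\gamma \in (0,1)$, and fixed risk-aversion parameter $\beta$, and it takes the Bellman-type operator to have a multiplicative power-law form consistent with the abstract's description (the exponent $\gamma$ makes the map sublinear and a contraction in the Thompson metric):
$$(TQ)(s,a) \;=\; e^{\beta r(s,a)} \sum_{s'} P(s' \mid s, a)\,\Big(\max_{b} Q(s', b)\Big)^{\gamma}.$$
All symbols and the exact placement of the expectation here are illustrative assumptions; the paper's operators may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
gamma = 0.9   # discount factor; appears as a power-law exponent below
beta = 0.5    # fixed risk-aversion parameter (assumed notation)

# Random MDP for illustration: P[s, a] is a probability vector over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def T(Q):
    """One sweep of the assumed power-law Bellman-type operator."""
    V = Q.max(axis=1)                            # greedy state values, shape (n_states,)
    return np.exp(beta * r) * (P @ (V ** gamma)) # shape (n_states, n_actions)

# Model-based fixed-point iteration: for gamma < 1 the power map is a
# Thompson-metric contraction, so the iterates converge on positive Q.
Q = np.ones((n_states, n_actions))
for _ in range(500):
    Q = T(Q)

# Model-free, sample-based variant in the spirit of a Q-learning-style update
# (a one-timescale sketch only; the paper's two-timescale scheme is more involved).
Qs = np.ones((n_states, n_actions))
for t in range(1, 50_001):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s_next = rng.choice(n_states, p=P[s, a])
    target = np.exp(beta * r[s, a]) * (Qs[s_next].max() ** gamma)
    step = 1.0 / (1.0 + 0.01 * t)                # diminishing step size
    Qs[s, a] += step * (target - Qs[s, a])

print(np.max(np.abs(Q - Qs)))                    # the two estimates should be close
```

The sampled update above has no global contraction in $L_\infty$ because of the power nonlinearity in the target, which is the kind of obstacle the abstract says is handled via monotonicity, homogeneity, and Dini-derivative arguments rather than standard contraction analysis.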