Date of Award
Winter 2025
Project Type
Dissertation
Program or Major
Computer Science
Degree Name
Doctor of Philosophy
First Advisor
Marek Petrik
Second Advisor
Se Young Yoon
Third Advisor
Wheeler Ruml
Abstract
Reinforcement learning (RL) studies methods for improving sequential decision-making by autonomous agents and has achieved remarkable success in domains such as games, robotics, autonomous systems, finance, and healthcare. The Markov Decision Process (MDP) is a mathematical framework for modeling agent-environment interactions in sequential decision-making problems. There are two primary sources of uncertainty in RL: epistemic uncertainty, which stems from imperfect knowledge of the environment model, and aleatoric uncertainty, which stems from the inherent randomness of the environment. In RL, risk refers to the potential for an agent's policy to lead to undesirable outcomes, especially when the environment is uncertain. In many domains, researchers seek policies that maximize the objective while mitigating uncertainty and risk in decision-making.
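For reference, the standard discounted MDP objective optimized in the risk-neutral setting can be written as follows (a textbook formulation included only as background; the notation is standard rather than taken from this dissertation):

```latex
% Standard discounted MDP objective (background only):
% an MDP (S, A, P, r, \gamma) and a policy \pi induce the objective
\[
  \max_{\pi}\; \mathbb{E}^{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right],
  \qquad \gamma \in [0,1).
\]
```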
This dissertation makes three main contributions. First, the Multi-Model Markov Decision Process (MMDP) captures epistemic uncertainty by considering a weighted set of plausible models of the environment, and solving MMDPs optimally is NP-hard. Previous work proposed a dynamic programming algorithm that computes an approximately optimal policy but offers no optimality guarantee for the computed policy. We identify a new connection between policy gradient and dynamic programming in MMDPs and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts the model weights iteratively to guarantee monotone policy improvement and convergence to a local maximum.
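To make the weighted objective concrete, the sketch below is illustrative only: the state space, model set, and weights are hypothetical, and this is not the dissertation's CADP implementation. It evaluates the average-over-models return of a Markov policy and performs one occupancy-weighted dynamic-programming improvement pass, the kind of step that CADP iterates while re-adjusting the model weights.

```python
import numpy as np

def policy_values(P, r, policy):
    """Backward-induction values v[t][s] of a Markov policy under one model.
    P: (S, A, S) transitions, r: (S, A) rewards, policy: (T, S) action indices."""
    T, S = policy.shape
    v = np.zeros((T + 1, S))
    idx = np.arange(S)
    for t in reversed(range(T)):
        a = policy[t]
        v[t] = r[idx, a] + P[idx, a] @ v[t + 1]
    return v

def occupancies(P, policy, p0):
    """Forward state-occupancy distributions d[t][s] under one model."""
    T, S = policy.shape
    d = np.zeros((T, S))
    d[0] = p0
    for t in range(T - 1):
        d[t + 1] = d[t] @ P[np.arange(S), policy[t]]
    return d

def mmdp_objective(models, lam, policy, p0):
    """Weighted (average-over-models) return of a Markov policy."""
    return sum(l * (p0 @ policy_values(P, r, policy)[0])
               for l, (P, r) in zip(lam, models))

def weighted_dp_pass(models, lam, policy, p0):
    """One occupancy-weighted greedy improvement pass over all time steps.
    models: list of (P, r) pairs; lam: prior model weights."""
    T, S = policy.shape
    A = models[0][1].shape[1]
    vs = [policy_values(P, r, policy) for P, r in models]
    ds = [occupancies(P, policy, p0) for P, _ in models]
    new_policy = policy.copy()
    for t in range(T):
        for s in range(S):
            q = np.zeros(A)
            for m, (P, r) in enumerate(models):
                w = lam[m] * ds[m][t, s]          # occupancy-adjusted model weight
                q += w * (r[s] + P[s] @ vs[m][t + 1])
            new_policy[t, s] = np.argmax(q)
    return new_policy
```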
Second, risk-averse policies are employed in RL to handle uncertainty and risk. In the discounted setting, the discount factor plays a crucial role: it ensures that the Bellman operator used in dynamic programming is a contraction and that value functions remain bounded. However, when the agent must optimize for long-term goals, forgoing discounting can be the more suitable choice. The challenge is that the Bellman operator may not be a contraction when future rewards are not discounted. We study risk-averse objectives based on the entropic risk measure (ERM) and the entropic value at risk (EVaR) under the total reward criterion (TRC), in which future rewards are not discounted. We establish necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for the ERM-TRC and EVaR-TRC objectives. We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for ERM-TRC and EVaR-TRC.
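For context, the entropic risk measure and the entropic value at risk are commonly defined as follows in the reward-maximization convention used in this line of work (standard definitions; the TRC-specific Bellman operators and their contraction conditions are developed in the dissertation):

```latex
% Entropic risk measure (ERM) at risk level \beta > 0 and
% entropic value at risk (EVaR) at confidence level \alpha \in (0,1),
% stated for a reward random variable X (higher is better):
\[
  \operatorname{ERM}_{\beta}[X] \;=\; -\frac{1}{\beta}\,
      \log \mathbb{E}\!\left[ e^{-\beta X} \right],
  \qquad
  \operatorname{EVaR}_{\alpha}[X] \;=\; \sup_{\beta > 0}
      \left\{ \operatorname{ERM}_{\beta}[X] + \frac{\log \alpha}{\beta} \right\}.
\]
```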
Third, a major challenge in deriving practical RL algorithms is that the model of the environment is often unknown, whereas traditional definitions of risk measures assume a known discounted or transient MDP model. We propose model-free Q-learning algorithms for computing policies with the risk-averse ERM-TRC and EVaR-TRC objectives. The challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction. Instead, we use the monotonicity of the ERM Bellman operators to give a rigorous proof that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions. The proposed Q-learning algorithms compute optimal stationary policies for ERM-TRC and EVaR-TRC.
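A minimal sketch of a tabular, risk-averse Q-learning loop of this general flavor is shown below. It applies the standard entropic-utility transform of the TD error on a small hypothetical transient chain MDP with no discounting; the exact ERM-TRC and EVaR-TRC updates, step-size conditions, and the monotonicity-based convergence argument are those developed in the dissertation, not this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transient chain MDP: states 0..N-1, state N is terminal.
# Action 0 moves right (reward -1); action 1 tries to jump two steps
# (reward -2, succeeds with prob 0.7, else stays). Episodes end at state N.
N = 5
def step(s, a):
    if a == 0:
        return min(s + 1, N), -1.0
    if rng.random() < 0.7:
        return min(s + 2, N), -2.0
    return s, -2.0

def erm_q_learning(beta=0.5, episodes=5000, lr=0.05, eps=0.2):
    """Tabular Q-learning with an entropic-utility transform of the TD error
    (undiscounted, total-reward setting). beta is the risk-aversion level."""
    q = np.zeros((N + 1, 2))                      # q at the terminal state stays 0
    for _ in range(episodes):
        s = 0
        while s < N:
            a = rng.integers(2) if rng.random() < eps else int(np.argmax(q[s]))
            s_next, r = step(s, a)
            td = r + np.max(q[s_next]) - q[s, a]  # no discount factor (TRC)
            # Entropic-utility update: penalizes downside TD errors more heavily.
            q[s, a] += lr * (1.0 - np.exp(-beta * td)) / beta
            s = s_next
    return q

q = erm_q_learning()
print("greedy actions per state:", np.argmax(q[:N], axis=1))
```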
Recommended Citation
Su, Xihong, "Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning" (2025). Doctoral Dissertations. 2971.
https://scholars.unh.edu/dissertation/2971