# Course:CPSC522/Partially Observable Markov Decision Processes

## Partially Observable Markov Decision Processes (POMDPs)

A Partially Observable Markov Decision Processes (POMDs) is a mathematical model for acting under uncertainty. It expands from the Markov Decision Process (MDP) by relaxing the constraint of having a fully observable state space. In this paper we will evaluate the underlying mathematical model and explore the potential of POMDPs for belief MDP.

Principal Author: Tommaso D'Amico

Collaborators: -

## Abstract

POMDPs are a type of Decision Network that expands from Markov Decision Processes by not requiring all states to be observable by the agent. POMDPs still maintain the same system dynamics of an MDP although the acting agent cannot always observe its current state but rather it needs to maintain a probability distribution over the set of possible states it may be in based on its observations and the underlying MPD.

### Builds on

POMDPs is a mathematical model expanding from the concepts of Decision networks, specifically generalizing from the Markov Decision Process algorithm for acting under uncertainty.

### Related Pages

POMDPs are widely implemented in applications that interact with the real world, some examples are particle filtering techniques such as MC-POMDPs, an extension of Markov Chains Monte Carlo for POMDPs and Reinforcement Learning.

POMDPs are also closely related to Hidden Markov Models and Markov Chains.

## Content

### Introduction

Partially Observable Markov Decision Processes (POMDPs) are a type of Markov Process closely related to Markov Decision Processes (MDP) as:

• there exist a finite number of discrete states
• the next state is only determined by the current state and the current action taken by the agent
• there exists a probabilistic transition between states and controllable actions in each state

The way a POMDP differs from an MDP is that the agent is unsure of which state it is in as it has only partial observability of the environment states. It instead needs to develop a probability distribution over the set of states it believes to be in and condition its observations and the underlying MDP structure. A comparison can be drawn for clarity to the relationship of Markov Chains with Hidden Markov Models, as the latter too is an extension of the prior with reduced observability. In fact a helpful illustration of the difference among these models can be seen in the chart below:

 Does the agent have control over state transition? no yes Are the states fully observable? yes Markov Chain Markov Decision Process no Hidden Markov Model Partilly Observable Markov Decision Process

#### POMDP versus MPD

MPD is more tractable to solve and is relatively easy to specify although it assumes perfect knowledge of the states which is often unlikely for many use cases.

POMDP on the other hand treats all sources of uncertainty uniformly and allows for information gathering actions, although it is hugely intractable to solve optimally.

### Formal Definition

Partially Obsevable Markov Decision Process example (POMDP)

Formally a POMDP working in discrete-time models the relationship between an agent and its environment[1]. This is commonly done with a 7-tuple ${\displaystyle (S,A,T,R,\Omega ,O,\gamma )}$where:

• ${\displaystyle S}$ is a set of all states
• ${\displaystyle A}$ is a set of all actions
• ${\displaystyle T}$ is a set that specifies the conditional probability of the next state givent he previous state and action
• ${\displaystyle R:S\times A\rightarrow \mathbb {R} }$ is the reward function
• ${\displaystyle \Omega }$ is the set of observations
• ${\displaystyle O}$is the set of conditional observation probabilities
• ${\displaystyle \gamma \in [0,1]}$is the discount factor

At each time step the environment is in state ${\displaystyle s\in S}$. The agent takes action ${\displaystyle a\in A}$, which causes the environment to transition to ${\displaystyle s'}$with probability given by ${\displaystyle T(s'|s,a).}$at the same time the agent receives an observation ${\displaystyle o\in \Omega }$which is dependent on the new environment state ${\displaystyle s'}$and the action ${\displaystyle a}$just done by the agent, this is described by the probability distribution of ${\displaystyle O(o|s',a)}$. This state and action is then used to develop a reward ${\displaystyle r}$equal to ${\displaystyle R(s,a)}$. This process then repeats for the next time step until the agent completes its task or to infinity. The goal is for the agent to maximize its expected future discounted reward: ${\displaystyle E[\sum _{t=0}^{\infty }\gamma ^{t}r_{t}]}$, where ${\displaystyle r_{t}}$is the reward earned at time ${\displaystyle t}$. ${\displaystyle \gamma }$is the discount factor that determines how much immediate rewards are favoured over distant rewards. When ${\displaystyle \gamma =0}$the agent only cares about which action will yield the largest expected immediate reward, while if ${\displaystyle \gamma =1}$the agent cares about maximizing the expected sum of future rewards.

#### Belief Update

After executing one time step an agent needs to update its belief of the state of the environment it is in[2]. Since we assume a Markovian state space, all we need to describe to execute this step is our prior state belief ${\displaystyle b}$, the last action ${\displaystyle a}$taken by the agent and the last observation made ${\displaystyle o}$. The belief update is an actionless step and is therefore the same as seen on a Hidden Markov Model belief update.

The belief state defined as ${\displaystyle b}$ is a function over all states ${\displaystyle S}$ from [0,1] that sums up to 1.

The update step can be denoted as ${\displaystyle b'=\tau (b,a,o)}$.

After reaching state ${\displaystyle s'}$, the agent observes ${\displaystyle o\in \Omega }$with probability ${\displaystyle O(o|s',a)}$as described in the prior section.

If we then let ${\displaystyle b}$ be a probability distribution over the state space ${\displaystyle S}$then ${\displaystyle b(s)}$denotes the probability that the environment is in state ${\displaystyle s}$. Given ${\displaystyle b(s)}$, then after taking an action ${\displaystyle a}$and an observation ${\displaystyle o}$,

${\displaystyle b'(s')=\eta O(o|s',a)\sum _{s\in S}T(s'|s,a)b(s)}$

where ${\displaystyle \eta =1/Pr(o|b,a)}$ is used as a normalising constant with

${\displaystyle Pr(o|b,a)=\sum _{s'\in S}O(o|s',a)\sum _{s\in S}T(s'|s,a)b(s)}$

#### Interpretation of Definition

Drone Navigating around space avoiding ground agent using POMDP

From the above definition the agent does not directly observe its environment. It instead has to make decisions about the true environment state under uncertainty. However, by interacting with the environment and receiving observations the agent is able to update its belief of the true state by updating the probability distribution of the current state. Through this property of the algorithm the extrapolated optimal behaviour may often include actions that are taken purely because they improve the agent's estimate of the current state, thereby allowing it to make better decisions in future time steps.

If we compare the formal definition of the POMDP described above with that of MDP we would have a very similar structure with the exception that the MDP algorithm would not include the conditional observation probabilities set ${\displaystyle O}$as it is always certain of the observed true state.

An example of a POMDP in action can be seen on the left, this animation illustrates a drone (in yellow) navigating a tiled environment. Its objective is to reach the oposing green square without coming into contact with the ground agent (in red). The blue squares represent the drone's observable environment. Since the drone is not capale of observing the state of the ground agent at all times, MDP is not tractable for this task, POMDP enables it to estimate a probability distribution of the likely states the ground agent might be in and act accordingly.

### Belief MDP

A Markovian belief state allows a POMDP to be formulated as an Markov Decision Process where every belief is a state. The resulting belief MDP will be defined on a continuous state space even though the the originating POMDP has a finite number of states. This is because there are infinite number of probability distributions over the state set.

The belief MDP is formally described as a tuple ${\displaystyle (B,A,\tau ,r,\gamma )}$where

${\displaystyle B}$is the set of belief states over the POMDP states

${\displaystyle A}$is the same finite set of actions as in the POMDP algorithm

${\displaystyle \tau }$is the belief state transition function

${\displaystyle r:B\times A\rightarrow \mathbb {R} }$is the reward function on belief states

${\displaystyle \gamma }$is the discount factor equal to the one in the original POMDP

from these values ${\displaystyle \tau }$and ${\displaystyle r}$are derived from the original POMDP via

${\displaystyle \tau (b,a,b')=\sum _{o\in \Omega }Pr(b'|b,a,o)Pr(o|a,b)}$

where ${\displaystyle Pr(o|a,b)}$ is the value derived in the previous section and

${\displaystyle Pr(b'|b,a,o)={\begin{cases}1,&{\text{if the belief update with arguments }}b,a,o{\text{ returns }}b'\\0,&{\text{otherwise}}\end{cases}}}$

and the reward function is the expected reward from the POMDP reward function over the belief distribution as follows

${\displaystyle r(b,a)=\sum _{s\in S}b(s)R(s,a)}$

By going through these steps the belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.

This means we were able to take a partially observable discrete space system and convert it to a continuous space fully observable belief system.

POMDP decomposed into a state estimator and a policy.

### Policy and Value Function

On a Belief MDP the agent is now able to choose between all possible actions to take as it may believe to be in any state with a given probability. We therefore must define a new variable ${\displaystyle \pi }$that describes an action ${\displaystyle a=\pi (b)}$given a certain belief ${\displaystyle b}$[2].

The expected reward for the policy ${\displaystyle \pi }$starting from belief ${\displaystyle b_{0}}$is

${\displaystyle V^{\pi }(b_{0})=\sum _{t=0}^{\infty }\gamma ^{t}r(b_{t},a_{t})=\sum _{t=0}^{\infty }\gamma ^{t}E{\Bigl [}R(s_{t},a_{t})\mid b_{0},\pi {\Bigr ]}}$
where again ${\displaystyle \gamma <1}$is the discount factor.

By optimizing for long-term reward we can obtain the optimal policy

${\displaystyle \pi ^{*}={\underset {\pi }{\mbox{argmax}}}\ V^{\pi }(b_{0})}$
where ${\displaystyle b_{0}}$is the initial state.

The optimal value function therefore can be described as

${\displaystyle {\displaystyle V^{*}(b)=\max _{a\in A}{\Bigl [}r(b,a)+\gamma \sum _{o\in \Omega }\Pr(o\mid b,a)V^{*}(\tau (b,a,o)){\Bigr ]}}}$

The value function is piecewise linear and convex for a finite-horizon POMDP[3]. For an infinite-horizon POMDP a finite vector set can approximate ${\displaystyle V^{*}}$with the use of dynamic programming techniques.

### Applications of POMDP

#### Pursuit-Evasion

Imagine a fleet of robots in a constrained environment, assign one robot to pursuit the others. The pursuer's state is known, but the evader's state is only partially observed. POMDP can be applied by the evaders in a multi-agent search fashion to observe the environment and build a belief of the pursuer's location.[4]

#### Sensor Placement

A store may have a limited amount of cameras to allocate in its interior to monitor its environment, since the "world " will be partially observable, using POMDP can help reconstruct intruder position or in the contrary facilitate "stealthy" movement.

### Conclusion

Probability theory is a powerful tool for modelling action under uncertainty, thanks to the Markov assumption many algorithms have become easy to implement for this purpose. POMDPs are one of the broader algorithms in this category as they take structure from highly constrained algorithms such as MDPs and relax assumptions of observability to allow for more real-world problem solving applications.

## Annotated Bibliography

[1] Åström, K.J. “Optimal Control of Markov Processes with Incomplete State Information.” Journal of Mathematical Analysis and Applications 10, no. 1 (February 1965): 174–205. https://doi.org/10.1016/0022-247X(65)90154-X.

[2] Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. “Planning and Acting in Partially Observable Stochastic Domains.” Artificial Intelligence 101, no. 1–2 (May 1998): 99–134. https://doi.org/10.1016/S0004-3702(98)00023-X.

[3] Smallwood, R. D., & Sondik, E. J. (1973). The Optimal Control of Partially Observable Markov Processes over a Finite Horizon. Operations Research, 21(5), 1071–1088. doi:10.1287/opre.21.5.1071