MDP for Locally Differentially Private Reinforcement Learning

Title

Principal Author: Yilin Yang

Here we will look at two papers that uses Linear Mixture Markov Decision Processes for Locally Differentially Private Reinforcement Learning (LDP-RL)

https://arxiv.org/pdf/2009.09052.pdf

https://arxiv.org/pdf/1703.07710.pdf

https://link.springer.com/chapter/10.1007/978-3-540-72927-3_23

Collaborators:

Abstract

Data safety issues nowadays have become more prevalent as we rely more on technology. In particular,

image tasks such as Medical Imaging are facing the challenge of privacy risks, adversaries can easily

obtain sensitive information through various attacks such as membership inference attacks [ 12 ].

This dramatically discourages patients and clients from contributing invaluable data that may be

beneficial to research. Additionally, traditional privacy measures such as adding noise or aggregating

the statistics are infeasible in high-dimensional settings [2 ]. This problem facilitates the need for

a gold standard privacy notion.

For this page, we will be skipping/briefly mentioning the many theoretical guarantees of these two papers as it is outside the scope of this course.

Builds on

Markov Decision Process (MDP)

Differential Privacy (DP)

Related Pages

This should contain the reverse links to the "builds on" links as well as other related pages. Use in sentences.

Paper 1: Markov Descision Process (MDP) and UBEV Algorithm [todo]

Here we breifly go over MDP, here the definition is slightly different from the CPSC 522 wiki page:

Following the notation by [todo] and Vietri et al.^[1] The notation is slightly different for the two papers, here we will follow the notation by Vietri et al.^[1] and define a time-dependent fixed-horizon Markov decision process (MDP) as $M=(S,A,R,P,p_{0},H)$ , where:

$S$ a finite set of states.
$A$ is the finite set of actions.
$R$ is the reward function where $R(s_{h},a_{h},h)$ has the range $[0,1]$ with mean $r(s_{h},a_{h},h)$ , for a given time $h\in [H]$ .
$P(s_{h+1}|s_{h},a_{h},h)$ is the transition function, for state $s_{h}$ , action $a_{h}$ and time $h\in [H]$ . The next state $s_{h+1}$ will be sampled from this distribution.
$p_{0}$ is the initial state distribution at time $h=0$ for each episode.
$h$ is the number of time steps in an interval $H$ , we also call this time interval an episode.
An agent will use a policy $\pi \in \Pi \ s.t.\pi (s,h)\in A$ to determine its next action.
Here $T$ is the number of episodes, therefore the total timestep would be $HT$ after $T$ episodes. Generally, we will follow the notation by Vietri et al.^[1] and use $t$ to denote episode index.

In order to find the optimal policy, we first specify the two following definitions:

Given a policy $\pi$ and a time step $h\in [H]$ , the value function is defined as: $V_{h}^{\pi }(s)=\mathbb {E} {\bigg [}\sum _{i=h}^{H}r(s_{i},a_{i},i){\bigg |}s_{h}=s,\pi {\bigg ]}=r(s_{i},\pi (s,h),h)+\sum _{s'\in S}V_{h+1}^{\pi }(s')P(s'|s,\pi (s,h),h)$

For the entire episode, the expected total reward is defined as: $\rho ^{\pi }=\mathbb {E} {\bigg [}\sum _{i=1}^{H}r(s_{i},a_{i},i){\bigg |}\pi {\bigg ]}=p_{0}^{\top }V_{1}^{\pi }$

Using these two definitions, intuitively the optimal value function can be found by: $V_{h}^{*}(s)=\max _{\pi \in \Pi }V_{h}^{\pi }(s)$ for a given state $s\in [S]$ and time $h\in [H]$ . Any policy that achives the optimal value for all time and state is called an optimal policy. And the policy will acheieve the optimal (maximum) total reward.

The objective/goal for the agent is to learn a (near-)optimal policy after finite episodes, each episode the agent will take a policy $\pi$ given the prior information and outputs a policy $\pi$ for the next episode/output.

The UBEV Algorithm

Episodic RL

Algorithm 1: Episodic RL Protocol
Input: Agent  $\mathrm {M}$  and users  $u_{1},\dots ,u_{T}$ 
For  $t\in [T]$  do:
 | For  $h\in [H]$  do:
 |  |  $u_{t}$  sends state  $s_{h}^{(t)}$  to  $\mathrm {M}$ 
 |  |  $\mathrm {M}$  sends action  $a_{h}^{(t)}$  to  $u_{t}$ 
 |  |  $u_{t}$  sends reward  $r_{h}^{(t)}$  to  $\mathrm {M}$ 
 | End
End
 $\mathrm {M}$  releases policy  $\pi$

Background: Differential Privacy in RL

First proposed by Dwork et al. ^[2] Differential Privacy (DP) has risen in popularity in recent years as it has been recognized as the gold standard notion of privacy due to its useful properties of composition and post-processing invariance, the various mathematical details of DP will be omitted as it is outside the scope of this page and course. This section is not integral for the understanding of the paper, but is crucial to know if you are concerned with privacy in ML, therefore in this section we will first list the formal definition followed by a high-level intuition that is sufficient to understand the paper.

Joint Differential Privacy^[1]

Definition: An agent $\mathrm {M}$ is $\varepsilon$ -jointly differentialy private (JDP) is for any time $t$ , all $t$ -neighboring user sequences $U$ , $U'$ , and all events $E\subseteq A^{H\times [T-1]}\times \Pi$ we have:

\Pr[\mathrm {M} _{-t}(U)\in E]\leq e^{\varepsilon }\Pr[\mathrm {M} _{-t}(U')\in E]

Here, the user scequence $U=(u_{1},\dots ,u_{T})$ represents the scequence of $T$ users participating in the RL algorithmn. The $H$ is the depth of the tree structure for all state and reward sequences, overall the possible sequences can be denoted as $A^{H}$ , during the RL training the agent only has limited knowledge of each user, and only knows the single root-to-leaf path. We use $\mathrm {M} _{-t}(U)$ to represent the output of the agent if the episode $t$ is removed, essentially removing all the sensitive information for user $t$ . Lastly, $U$ and $U'$ are $t$ -neighboring user sequences if they only differ by the $t$ -th user.

Intuition of JDP

Generally, DP methods aim to protect specific users/data in a task by adding controlled noise from a known distribution to sensitive data so that the output will contain useful aggregated statistical information while also not revealing any specific participants data. For JDP, the controlled noise is from the Leplace distribution and is called the Laplace mechanism [todo], it ensures that when one of the users in the sequence $T$ changes, the learned information does not change greatly. It should be emphasized that this condition holds true even in the extreme case where all the remaining $T-1$ are adversarial users that use coordinated attack method to attempt to get the sensitive information of the remaining user by fixing the the state and actions.

Paper 2: Private Reinforcement Learning with PAC and Regret Guarantees ^[1]

So far we have introducted the MDP algorithmn and a privitaization method, in this paper the author privitizes the general episodic RL algorithmn and creates a DP variant that has many useful bounds and guarantees. The remaining element is the counting mechanism, which we will describe in the following section. The private counter mechanism takes a stream of data as input and releases an approximation of the prefix count.

Private Counting Mechanism

The counting mechanism is a method used to keep track of events that occur while interacting with an MDP (Markov Decision Process). Following the definitions by Vietri et al.^[1] Here we outline the notation for the counters needed for MDP to work. Let ${\widehat {n_{t}}}(s,a,h)$ be the number of vists to state $(s,a,h)$ before episode $t$ , where the state tuple contains the action $a$ taken on the state $s$ on time step $h$ before episode $t$ . However as this counter depends on sensitive data in our defined sceanario, we must derive a method that privately releases the counts. The private counter mechanism takes a data stream as input and releases an approximation of the prefix count. We denote this privacy counting mechanism as PC and it takes $\varepsilon$ and $T$ as input paramters, thus for all $1\leq t\leq T$ , we have:

\Pr {\bigg (}{\big |}\mathrm {M} (\sigma )(t)-c(\sigma )(t){\big |}\leq {\frac {4}{\varepsilon }}\ln(1/\beta )\log(T)^{5/2}{\bigg )}\geq 1-\beta

However as this bound only applies to a single $\varepsilon$ -DP counter, maintaining all $S^{2}AH$ counters will incur a large amount of weight that sclaes polynomailly with $S$ , $A$ and $H$ . Here the authors use the fact that the total change across all counters a user can have scales linearly with the episode $H$ to add a much smaller amount of DP noise.

Now that we have all the ingredients, we can finally introduce the private MDP algorithm introduced by this paper called the PUCB algorithm:

The PUCB Algorithm^[1]

At a high level, Private Upper Confidence Bound (PUCB) is a variation of the UBEV [todo] algorithm, specifically designed for privacy-sensitive environments using JDP. The pseudo-code is shown in Algorithm 2. The calculatated policy at each step depends on the average empirical reward ${\widetilde {r}}(s,a,h)$ , the number of times ${\widetilde {n}}(s,a,h)$ the agent has been to state $(s,a,h)$ and the number of times the agent has transitioned from $(s,a,h)$ to state $s'$ , denoted as ${\widetilde {m}}(s,a,s',h)$ . Thus in the PUCB, as the policy only depends on these privitized statistics from the private counters, if we set the privacy level correctly (adding the correct amount of controlled noise), JDP is satisfied.

Algorithm 2: Private Upper Confidence Bound (PUCB)                                                                                                                                

Parameters: Privacy parameter  $\varepsilon$ , target failure probability  $\beta$ 
Input: Maximum number of episodes  $T$ , horizon  $H$ , state space  $S$ , action space  $A$ 
 $\varepsilon ':=\varepsilon /(3H)$ 
For  $s,a,s',h\in S\times A\times S\times [H]$  do:
 | Initialize private counters:  ${\widetilde {r}}(s,a,h),{\widetilde {n}}(s,a,h),{\widetilde {m}}(s,a,s',h):={\text{PC}}(T,\varepsilon ',\beta )$ 
End
For  $t$  ← 1 to  $T$  do:
| Private planning:  ${\widetilde {Q}}_{t}^{+}:={\text{PrivQ}}({\tilde {r}},{\tilde {n}},{\tilde {m}},\varepsilon ,\beta )$ 
| For  $h$  ← 1 to  $H$  do:
|  |  Let  $s$  denote the state during step  $h$  and episode  $t$ 
|  |  Execute  $a:={\text{argmax}}_{a'}{\widetilde {Q}}_{t}^{+}(s,a',h)$ 
|  |  Observe  $r\sim R(s,a,h)$  and  $s'\sim P(\cdot |s,a,h)$ 
|  |  Feed r to  ${\widetilde {r}}(s,a,h)$ 
|  |  Feed 1 to  ${\widetilde {n}}(s,a,h)$  and  ${\widetilde {m}}(s,a,s',h)$  and 0 to all other counters  ${\widetilde {n}}(\cdot ,\cdot ,h)$  and
|  |   ${\widetilde {m}}(\cdot ,\cdot ,\cdot ,h)$ 
| End
End

Private Counting Details: The PUCB initilzes $2SAH+S^{2}AH$ private counters for all ${\widetilde {r}}_{t}$ , ${\widetilde {n}}_{t}$ , and ${\widetilde {m}}_{t}$ . By requiring that each set of counters ( $\{{\widetilde {r}}_{t}\}$ , $\{{\widetilde {n}}_{t}\}$ , and $\{{\widetilde {m}}_{t}\}$ ) is $\varepsilon /3$ -JDP, with

E_{\varepsilon }={\frac {3}{\varepsilon }}H\log {\bigg (}{\frac {2SAH+S^{2}AH}{\beta '}}{\bigg )}\log(T)^{5/2}

we can ensure that thus for all

1\leq t\leq T

, we have:

\Pr {\bigg (}{\big |}{\widehat {n}}_{t}(s,a,h)-{\widetilde {n}}_{t}(s,a,h){\big |}<E_{\varepsilon }{\bigg )}\geq 1-\beta

Here the

{\widehat {n}}_{t}

and

{\widetilde {n}}_{t}

are the count and release right before episode

t

, this guarantee is uniform in

(s,a,h)

and also holds simultaneously for

{\widetilde {r}}_{t}

and

{\widetilde {m}}_{t}

. Similar to the Private Counting Mechanism, the full understanding of the details above is not required.

In algorithm 3, we calculate the private Q-function for t-th episode, PrivQ is a standard batch Q-learning update, with ${\widetilde {conf}}_{t}(s,a,h)$ serving as an optimism bonus. The resulting Q-function, denoted as ${\widetilde {Q}}_{t}^{+}$ , is a greedy policy which we use for the t-th episode.

Algorithm 3: PrivQ
Input: Private counters  ${\tilde {r}},{\tilde {n}},{\tilde {m}},$  privacy parameter  $\varepsilon$ , target failure probability  $\beta$ 
 $E_{\varepsilon }:={\frac {3}{\varepsilon }}H\log {\bigg (}{\frac {2SAH+S^{2}AH}{\beta '}}{\bigg )}\log(T)^{5/2}$ 
 ${\widetilde {V}}_{H+1}(s):=0\ \ \ \ \ \forall s\in S$ 
For  $h$  ← 1 to  $H$  do:
 | For  $s,a\in S\times A$  do:
 |  | If  ${\widetilde {n}}(s,a,h)\geq 2E_{\varepsilon }$  then:
 |  |  |  ${\widetilde {conf}}_{t}(s,a,h):=(H+1){\widetilde {\phi }}_{t}(s,a,h)+{\widetilde {\psi }}_{t}(s,a,h)$ 
 |  | Else:
 |  |  |  ${\widetilde {conf}}_{t}(s,a,h):=H$ 
 |  | End
 |  |  ${\widetilde {Q}}_{t}(s,a,h):={\frac {1}{{\tilde {n}}_{t}(s,a,h)}}{\bigg (}{\tilde {r}}_{t}(s,a,h)+\sum _{s\in S}{\widetilde {V}}_{h+1}(s'){\widetilde {m}}_{t}(s,a,s',h){\bigg )}$ 
 |  |  ${\tilde {Q}}_{t}^{+}(s,a,h):=\min {\bigg \{}H,{\widetilde {Q}}_{t}(s,a,h)+{\widetilde {\text{conf}}}_{t}(s,a,h){\bigg \}}$ 
 | End
 |  ${\widetilde {V}}_{h}(s):=\max _{a}{\widetilde {Q}}_{t}^{+}(s,a,h)\ \ \ \ \ \ \forall s\in S$ 
End
Output:  ${\tilde {Q}}_{t}^{+}$

Here the bonus function ${\widetilde {conf}}_{t}(s,a,h)$ for each $(s,a,h)$ can be decomped to two components ${\widetilde {\phi }}_{t}(s,a,h)$ and ${\widetilde {\psi }}_{t}(s,a,h)$ where:

{\widetilde {\phi }}_{t}(s,a,h)={\sqrt {\frac {2\ln(T/\beta ')}{\max({\widetilde {n}}_{t}(s,a,h)-E_{\varepsilon },1)}}}

{\widetilde {\psi }}_{t}(s,a,h)=(1+SH){\bigg (}{\frac {3E_{\varepsilon }}{{\widetilde {n}}_{t}(s,a,h)}}+{\frac {2E_{\varepsilon }^{2}}{{\widetilde {n}}_{t}(s,a,h)^{2}}}{\bigg )}

The component

{\widetilde {\phi }}_{t}(s,a,h)

can be roughly thought of as the sampling error, where

{\widetilde {\psi }}_{t}(s,a,h)

represents the errors introduced by private counters.

Notible theoretical guarantees of the PUCB Algorithm

In the section we breifly showcase two of the theoretical guarantees of the PUCB alogorithmn, we first introduce two definitions used the evaulate similar problem, PAC and Regret before we show the bounds proved by this paper.

PAC in RL:

An agent is called $(\alpha ,\beta )$ -probably approximately correct (PAC) with sample complexity $f(S,A,H,{\frac {1}{\alpha }},\log({\frac {1}{\beta }}))$ , if it follows an $\alpha$ -optimal policy $\pi$ such that $\rho ^{*}-\rho ^{\pi }\leq \alpha$ except for at most $f(S,A,H,{\frac {1}{\alpha }},\log({\frac {1}{\beta }}))$ episodes with at least proability of $\beta$ .

Regret in RL:

The (expected cumulative) regret of an agent after $T$ episodes is given by:

{\text{Regret}}(T)=\sum _{t=1}^{T}(\rho ^{*}-\rho ^{\pi _{t}})

where

\pi _{1},\dots ,\pi _{T}

are the policies the agent follows for each episode

t

.

Theorem 3 (PAC guarantee for PUCB)^[1] PUCB follows a policy that with probability at least $1-\beta$ is $\alpha$ -optimal on all but

O{\bigg (}{\bigg (}{\frac {SAH^{4}}{\alpha ^{2}}}{\frac {S^{2}AH^{4}}{\varepsilon \alpha }}{\bigg )}{\text{polylog}}(T,S,A,H,{\frac {1}{\alpha }},{\frac {1}{\beta }},{\frac {1}{\varepsilon }}){\bigg )}

episodes, where

\alpha \in (0,H]

.

This mean that suffient epiodes of PUCB, a large fraction of the episodes will act near-optimally. The cost of privacy is very low as the privacy parameter $\varepsilon$ is only in the term scaling as ${\frac {1}{\alpha }}$ , and typically $\alpha$ is small and this term is of a lower order.

Theorem 3 (Regret bound for PUCB)^[1] With probability at least $1-\beta$ , the regret of PUCB up to episode $T$ is at most

O{\bigg (}{\bigg (}H^{2}{\sqrt {SAT}}+{\frac {SAH^{3}+S^{2}AH^{3}}{\varepsilon }}{\bigg )}{\text{polylog}}(T,S,A,H,{\frac {1}{\alpha }},{\frac {1}{\beta }},{\frac {1}{\varepsilon }}){\bigg )}

The remark is similar to the PAC guarantee, as the privacy parameter

\varepsilon

is only in the term scaling

{\text{polylog}}(T)

, the utilitiy cost for privacy is negligible as the leading order term scales with

{\sqrt {T}}

. If the readers are interested in the proof, a high level sketch as well as the detailed proof is provided in the paper.

Conclusion

This paper provide a novel study of reinforment learning with differential privacy, the authors utilizes joint differential privacy to select future decisions based off privitazed data rether than sensitive data. The paper established a JDP alortihmn along with an analysis for its PAC and regret utility guarantees. Surprisingly, the utility cost for privacy is asymptotically negligible, in the future the author plans to further close the bound between DP and non-DP settings as well as the designing RL algorithmns in non-tabular settings has the potential to many real-life applications.

Annotated Bibliography

Put your annotated bibliography here. Add links where appropriate.

↑ ^{Jump up to: 1.0} ^1.1 ^1.2 ^1.3 ^1.4 ^1.5 ^1.6 ^1.7 ^1.8 Vietri, Giuseppe; et al. "Private Reinforcement Learning with PAC and Regret Guarantees" (PDF). International Conference on Machine Learning – via arxiv. line feed character in |journal= at position 36 (help)
↑ Dwork, Cynthia; Roth, Aaron (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science.

To Add

Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.

https://papers.nips.cc/paper/2017/file/17d8da815fa21c57af9829fb0a869602-Paper.pdf

Following the definitions by Ngo & Vietri et al.^[1] The difference between LMDP and MDP is the additional assumtion that the reward function and transiton functions are linear. Therefore we can focus on linear action-value funtions.

Linear MDP [TODO]

A deterministic policy is

Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0

↑ Ngo, Dung Daniel; Vietri, Giuseppe; et al. "Improved Regret for Differentially Private Exploration in Linear MDP" (PDF). International Conference on Machine Learning.

[:0-1] {Jump up to: 1.0} ^1.1 ^1.2 ^1.3 ^1.4 ^1.5 ^1.6 ^1.7 ^1.8 Vietri, Giuseppe; et al. "Private Reinforcement Learning with PAC and Regret Guarantees" (PDF). International Conference on Machine Learning – via arxiv. line feed character in |journal= at position 36 (help)

[2] Dwork, Cynthia; Roth, Aaron (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science.

[3] Ngo, Dung Daniel; Vietri, Giuseppe; et al. "Improved Regret for Differentially Private Exploration in Linear MDP" (PDF). International Conference on Machine Learning.

[1]

[2]

[1]