MDP for Differentially Private Reinforcement Learning



MDP for Differentially Private Reinforcement Learning Using the PUCB Algorithm

Principal Author: Yilin Yang

Here we will look at two papers that use Markov Decision Processes for Reinforcement Learning. The first paper introduces an algorithm with very strong theoretical bounds, and the second paper constructs a variant of that algorithm with Differentially Private (DP) guarantees.

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Private Reinforcement Learning with PAC and Regret Guarantees


Collaborators:

Abstract

Data safety issues have become more prevalent as we rely more on technology. In episodic RL applications such as personalized medical treatment, the states and rewards of the MDP may contain sensitive information. In this scenario, users/patients communicate with the agent for a fixed duration and obtain a corresponding medical treatment. Because this sensitive information influences the policy that is produced, it also affects subsequent actions. Adversarial users can input intentional, coordinated data in an attempt to reveal the sensitive states and actions of a target user. These attacks motivate the need for a gold-standard privacy notion. In recent years, DP has risen in popularity as it provides many useful properties and a mathematical guarantee of privacy. The UBEV algorithm proposed by Dann et al.[1] is a recent advancement in the episodic RL regime, providing both near-optimal PAC and regret guarantees; this inspired Vietri et al.[2] to create PUCB, a privatized variant of UBEV. On this page, we introduce the definitions and background needed to understand both papers, present the pseudo-code of the two algorithms, and state their PAC and regret guarantees.

Builds on

Markov Decision Process (MDP)

Reinforcement Learning

Differential Privacy (DP)

Related Pages

Q-Learning and MDP for healthcare are two interesting reads related to these two papers

Paper 1 Background: Markov Decision Process (MDP), Q-learning, and Episodic RL

Following the notation of Dann et al.[1] and Vietri et al.[2], we briefly go over MDPs here. The notation differs slightly between the two papers; we follow Vietri et al.[2], as it is our main focus, and define a time-dependent fixed-horizon Markov decision process (MDP) as a tuple $M = (\mathcal{S}, \mathcal{A}, R, P, p_0, H)$, where:

  • $\mathcal{S}$ is a finite set of states.
  • $\mathcal{A}$ is the finite set of actions.
  • $R$ is the reward function: for a given time $h$, the reward $r_h(s, a)$ has range $[0, 1]$ with mean $\bar{r}_h(s, a)$.
  • $P$ is the transition function: $P_h(\cdot \mid s, a)$ is a distribution over next states for state $s$, action $a$ and time $h$. The next state will be sampled from this distribution.
  • $p_0$ is the initial state distribution, from which the state at time $h = 1$ is drawn for each episode.
  • $H$ is the number of time steps in an episode, i.e., each episode covers the time interval $h = 1, \dots, H$.
  • An agent will use a policy $\pi_h(s)$ to determine its next action.
  • $T$ is the number of episodes, so the total number of time steps after $T$ episodes is $TH$. We use $t$ to denote the episode index.

To find the optimal policy, we first specify the following two definitions:

  • Given a policy $\pi$ and a time step $h$, the value function is defined as $V^{\pi}_{h}(s) = \mathbb{E}\left[\sum_{i=h}^{H} r_i \,\middle|\, s_h = s, \pi\right]$.
  • For the entire episode, the expected total reward is defined as $\rho^{\pi} = \mathbb{E}_{s_1 \sim p_0}\left[V^{\pi}_{1}(s_1)\right]$.

Using these two definitions, intuitively the optimal value function is $V^{*}_{h}(s) = \max_{\pi} V^{\pi}_{h}(s)$ for a given state $s$ and time $h$. Any policy that achieves the optimal value for all times and states is called an optimal policy $\pi^{*}$; such a policy also achieves the optimal (maximum) total reward $\rho^{*} = \max_{\pi} \rho^{\pi}$.
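To make the recursive structure explicit, the optimal value function of a finite-horizon MDP satisfies the standard Bellman optimality equations (stated here in the notation above, with the convention $V^{*}_{H+1}(s) = 0$):

$Q^{*}_{h}(s, a) = \bar{r}_h(s, a) + \sum_{s' \in \mathcal{S}} P_h(s' \mid s, a)\, V^{*}_{h+1}(s'), \qquad V^{*}_{h}(s) = \max_{a \in \mathcal{A}} Q^{*}_{h}(s, a),$

so an optimal policy can be read off as $\pi^{*}_{h}(s) \in \arg\max_{a} Q^{*}_{h}(s, a)$ by backward induction from $h = H$ down to $h = 1$.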

Q-Learning

In reinforcement learning, for any finite Markov decision process (FMDP), we can use Q-learning to learn the task. For those interested in the details of Q-learning, the related page on Q-Learning is strongly recommended. At a high level, Q-learning is an algorithm that finds the optimal policy by maximizing the expected total reward over all successive steps starting from the current state. Given infinite exploration time, Q-learning can find an optimal action-selection policy for any such MDP. An example of a Q-function can be found on the Q-Learning page; in our papers, the Q-function will be defined differently.
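As a quick illustration (not taken from either paper), here is a minimal tabular Q-learning update in Python; the environment interface env.reset() / env.step(a) and all hyperparameter values are assumptions made for this sketch:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Minimal tabular Q-learning; env is assumed to expose reset() -> s and step(a) -> (s', r, done),
    with states and actions given as integer indices."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s, a) toward the bootstrapped target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q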

Episodic RL

The agent's objective is to learn a (near-)optimal policy after a finite number of episodes. In each episode, the agent follows a policy chosen from the information gathered so far, and after the last episode it outputs a final policy. Using the MDP formulation, we can describe the episodic RL protocol as follows:

Algorithm 0: Episodic RL Protocol
Input: Agent M and users u_1, ..., u_T
For t = 1, ..., T do:
 | For h = 1, ..., H do:
 |  | u_t sends state s_{t,h} to M
 |  | M sends action a_{t,h} to u_t
 |  | u_t sends reward r_{t,h} to M
 | End
End
M releases policy π̂

Here, in contrast to interacting with an environment directly, the agent M communicates with one user per episode, each for a fixed horizon H. Each user u_t interacts with the environment and sends their current state to M, M recommends an action, and u_t sends the reward received back to M.
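A minimal Python sketch of this protocol is shown below; the agent and user objects and their method names (recommend, record, release_policy, and so on) are hypothetical, chosen only to mirror the message flow of Algorithm 0:

def run_episodic_protocol(agent, users, horizon):
    """Mirror of Algorithm 0: one user per episode, H interaction rounds per episode."""
    for t, user in enumerate(users):              # episodes t = 1, ..., T
        for h in range(horizon):                  # time steps h = 1, ..., H
            state = user.current_state()          # u_t sends its state to the agent
            action = agent.recommend(t, h, state) # agent sends an action back
            reward = user.take_action(action)     # u_t executes the action and reports the reward
            agent.record(t, h, state, action, reward)
    return agent.release_policy()                 # final policy released after T episodes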

Paper 1: Markov Decision Process (MDP) and UBEV Algorithm [1]

We now introduce the UBEV algorithm. To the authors' knowledge, UBEV is the first algorithm with both near-optimal PAC and regret guarantees (we introduce PAC and regret in the analysis section below):

The UBEV Algorithm[1]

At a high level, during each episode $k$ the agent computes a policy by backward induction (optimistic planning), using curated confidence intervals for the transition probabilities in each state. The quantity $Q$ is an optimistic estimate of the Q-function for the current episode, built from the empirical estimate of the expected next-state value, the empirical expected immediate reward, and a confidence width $\phi$. Additionally, $n$, $m$ and $r$ are the counters (visits, transitions, and accumulated rewards) used to form these empirical estimates; we outline the counter details in the second paper, as the high-level intuition is very similar.

Algorithm 1: Upper Bounding the Expected Next State Value (UBEV)
Input: Failure tolerance δ ∈ (0, 1]
n(s, a, h) = m(s', s, a, h) = r(s, a, h) = 0 for all s, s', a, h;  V_{H+1}(s) = 0 for all s
For k = 1, 2, 3, ... do:
 | /* Optimistic planning */
 | For h = H down to 1 do:
 |  | For s ∈ S do:
 |  |  | For a ∈ A do:
 |  |  |  | φ = confidence width for (s, a, h), computed from n(s, a, h) and δ  // confidence bound
 |  |  |  | V̂_next = Σ_{s'} m(s', s, a, h) V_{h+1}(s') / n(s, a, h);  r̂ = r(s, a, h) / n(s, a, h)  // empirical estimates (optimistic defaults when n = 0)
 |  |  |  | Q(a) = r̂ + V̂_next + optimism bonus derived from φ
 |  |  | End
 |  |  | π_k(s, h) = argmax_a Q(a),  V_h(s) = max_a Q(a) (clipped to the maximum achievable return)
 |  | End
 | End
 | /* Execute policy for one episode */
 | s_1 ~ p_0
 | For h = 1 to H do:
 |  | a_h = π_k(s_h, h), take a_h and observe r_h and s_{h+1}
 |  | n(s_h, a_h, h)++;  m(s_{h+1}, s_h, a_h, h)++;  r(s_h, a_h, h) += r_h  // update statistics
 | End
End
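To make the planning step concrete, the following Python sketch performs optimistic backward induction with a simplified Hoeffding-style confidence width in place of UBEV's exact bound; the bonus form and the clipping constant are assumptions of this sketch, not the paper's formulas:

import numpy as np

def optimistic_plan(n, m, r_sum, S, A, H, delta=0.1):
    """Backward induction with optimism. n[s,a,h]: visit counts, m[s2,s,a,h]: transition counts,
    r_sum[s,a,h]: accumulated rewards. Returns a greedy policy pi[s,h] and values V[s,h]."""
    V = np.zeros((S, H + 1))                 # V[:, H] is the terminal value 0
    pi = np.zeros((S, H), dtype=int)
    for h in range(H - 1, -1, -1):
        for s in range(S):
            Q = np.zeros(A)
            for a in range(A):
                cnt = max(n[s, a, h], 1)
                r_hat = r_sum[s, a, h] / cnt                     # empirical mean reward
                ev_hat = m[:, s, a, h] @ V[:, h + 1] / cnt       # empirical expected next-state value
                phi = np.sqrt(np.log(2 * S * A * H / delta) / (2 * cnt))  # simplified confidence width
                Q[a] = r_hat + ev_hat + (H - h) * phi            # optimistic Q estimate
            pi[s, h] = int(np.argmax(Q))
            V[s, h] = min(H - h, Q.max())                        # clip to the maximum achievable return
    return pi, V

Re-running a planner like this after every episode with the updated counters reproduces, at a sketch level, the overall loop of Algorithm 1.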

Notable theoretical guarantees of the UBEV Algorithm

In this section we briefly showcase two of the theoretical guarantees of the UBEV algorithm. Before presenting the bounds proved by this paper, we first introduce the two definitions used to evaluate these types of algorithms: PAC and regret.

PAC in RL[2]:

An agent is called $(\epsilon, \delta)$-probably approximately correct (PAC) with sample complexity $F(S, A, H, \epsilon, \delta)$ if, with probability at least $1 - \delta$, it follows an $\epsilon$-optimal policy (one whose expected total reward satisfies $\rho^{\pi} \ge \rho^{*} - \epsilon$) on all but at most $F(S, A, H, \epsilon, \delta)$ episodes.

Regret in RL[2]:

The (expected cumulative) regret of an agent after $T$ episodes is given by:

$R(T) = \sum_{t=1}^{T} \left( \rho^{*} - \rho^{\pi_t} \right),$

where $\pi_1, \dots, \pi_T$ are the policies the agent follows in each episode $t$.

Theorem 1 (PAC guarantee for UBEV)[1]: UBEV follows a policy that, with probability at least $1 - \delta$, is $\epsilon$-optimal on all but

episodes, where the hidden factor is polylogarithmic, that is, a function bounded by a polynomial of logarithms of the problem parameters.

Theorem 2 (Regret bound for UBEV)[1]: With probability at least $1 - \delta$, the regret of UBEV after $T$ episodes is at most:

For readers interested in the proof details, a high-level sketch as well as the detailed proof is provided in the paper. To the authors' knowledge, UBEV is the first algorithm with both near-tight (up to lower-order factors) high-probability regret and $(\epsilon, \delta)$-PAC bounds, as well as the first algorithm with any nontrivial uniform-PAC bound. These strong theoretical guarantees motivated the authors of the second paper to incorporate privacy into this algorithm, starting an interesting line of research. Thus, in the following sections, we first formally define privacy and provide a high-level intuition for the privacy setting; we then introduce the PUCB algorithm and discuss the results.

Paper 2 Background: Differential Privacy in RL

First proposed by Dwork et al.[3], Differential Privacy (DP) has risen in popularity in recent years and is recognized as the gold-standard notion of privacy due to its useful properties of composition and post-processing invariance. The mathematical details of DP are omitted as they are outside the scope of this page and course. This section is not integral to understanding the paper, but it is crucial if you are concerned with privacy in ML; therefore, we first list the formal definition, followed by a high-level intuition that is sufficient to understand the paper.

Joint Differential Privacy[2]

Definition: An agent $\mathcal{M}$ is $\varepsilon$-jointly differentially private (JDP) if, for any time $t$, all $t$-neighboring user sequences $U$, $U'$, and all events $E$, we have:

$\Pr\left[\mathcal{M}_{-t}(U) \in E\right] \le e^{\varepsilon}\, \Pr\left[\mathcal{M}_{-t}(U') \in E\right].$
Here, the user sequence $U = (u_1, \dots, u_T)$ represents the sequence of users participating in the RL algorithm. Each user can be viewed as a root-to-leaf path in a tree of depth $H$ that encodes all possible state and reward sequences; during training the agent has only limited knowledge of each user, namely the single root-to-leaf path that was actually realized. We use $\mathcal{M}_{-t}(U)$ to denote the outputs of the agent to all users except $u_t$, essentially removing all the sensitive information of user $u_t$. Lastly, $U$ and $U'$ are $t$-neighboring user sequences if they differ only in the $t$-th user.

Intuition of JDP

Generally, DP methods aim to protect specific users/data in a task by adding controlled noise from a known distribution to sensitive statistics, so that the output contains useful aggregate information while not revealing any specific participant's data. In PUCB, the controlled noise is drawn from the Laplace distribution via the Laplace mechanism [3]; JDP then ensures that when one of the users in the sequence changes, the information learned about the other users does not change greatly. It should be emphasized that this condition holds even in the extreme case where all the other users are adversarial and mount coordinated attacks, fixing their own states and actions in an attempt to extract the sensitive information of the remaining target user.
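For concreteness, here is the Laplace mechanism in isolation, as a generic ε-DP release of a single count (this is the basic building block, not the paper's exact counter construction; the sensitivity value is an assumption of the example):

import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon (epsilon-DP for this single release)."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g., releasing how many of 1000 patients visited a given state, with epsilon = 0.5
print(laplace_count(137, epsilon=0.5))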

Paper 2: Private Reinforcement Learning with PAC and Regret Guarantees [2]

So far we have introduced the UBEV algorithm and a privacy notion (JDP). In this paper, the authors privatize the general episodic RL protocol and create a DP variant with strong utility bounds and guarantees. The remaining ingredient is the counting mechanism: like states and rewards, the counts of visited states and actions are also considered private. Therefore, the following section describes a private counter mechanism that takes a stream of data as input and releases an approximation of its running count.

Private Counting Mechanism

The counting mechanism is a method used to keep track of events that occur while interacting with an MDP (Markov Decision Process). Following the definitions of Vietri et al.[2], we outline the notation for the counters the algorithm needs. Let $n^{t}(s, a, h)$ be the number of visits to the tuple $(s, a, h)$ before episode $t$, i.e., the number of earlier episodes in which action $a$ was taken in state $s$ at time step $h$. However, as this counter depends on sensitive data in our defined scenario, we need a method that releases the counts privately. The private counter mechanism takes a data stream as input and releases an approximation of its running count. We denote this private counting mechanism as PC; it takes a privacy parameter $\varepsilon$ and a failure probability as input, and guarantees that, with high probability, every released count deviates from the true count by at most a factor that is polylogarithmic in the stream length and proportional to $1/\varepsilon$.

However, as this guarantee applies to a single $\varepsilon$-DP counter, naively maintaining all counters at this privacy level would require noise that scales polynomially with $S$, $A$ and $H$. Here the authors use the fact that the total change a single user can induce across all counters in one episode scales only linearly with $H$, which allows a much smaller amount of DP noise to be added.
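Below is a sketch of the standard binary ("tree-based") counting mechanism from the DP literature on which this kind of private counter is typically based; the per-level budget split is a simplification, and the exact PC mechanism used in the paper may differ:

import numpy as np

class PrivateCounter:
    """Binary-mechanism continual counter over a stream of T values in {0, 1}.
    Exact partial sums are kept internally, and each completed partial sum gets fresh
    Laplace noise, so any prefix count is covered by at most ~log2(T) noisy terms."""
    def __init__(self, T, epsilon):
        self.levels = int(np.ceil(np.log2(max(T, 2)))) + 1
        self.scale = self.levels / epsilon          # per-level noise scale (simple budget split)
        self.alpha = np.zeros(self.levels)          # exact partial sums
        self.alpha_noisy = np.zeros(self.levels)    # noisy partial sums used for releases
        self.t = 0

    def feed(self, bit):
        self.t += 1
        low = (self.t & -self.t).bit_length() - 1   # lowest set bit of t
        self.alpha[low] = bit + self.alpha[:low].sum()
        self.alpha[:low] = 0.0
        self.alpha_noisy[low] = self.alpha[low] + np.random.laplace(0.0, self.scale)
        self.alpha_noisy[:low] = 0.0

    def release(self):
        # the current count is the sum of the noisy partial sums at the set bits of t
        return float(sum(self.alpha_noisy[i] for i in range(self.levels) if (self.t >> i) & 1))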

Now that we have all the ingredients, we can finally introduce the private episodic RL algorithm proposed by this paper, the PUCB algorithm:

The PUCB Algorithm[2]

At a high level, Private Upper Confidence Bound (PUCB) is a variation of the UBEV [1] algorithm, specifically designed for privacy-sensitive environments using JDP. The pseudo-code is shown in Algorithm 2. The policy calculated at each step depends only on the privatized statistics released by the private counters: the cumulative empirical reward $\hat r(s, a, h)$, the number of times the agent has visited $(s, a, h)$, denoted $\hat n(s, a, h)$, and the number of times the agent has transitioned from $(s, a, h)$ to state $s'$, denoted $\hat m(s', s, a, h)$. Thus, if we set the privacy level of these counters correctly (adding the correct amount of controlled noise), PUCB satisfies JDP.

Algorithm 2: Private Upper Confidence Bound (PUCB)

Parameters: Privacy parameter ε, target failure probability δ
Input: Maximum number of episodes T, horizon H, state space S, action space A

For (s, a, h) ∈ S × A × [H] do:
 | Initialize private counters: r̂(s, a, h), n̂(s, a, h), and m̂(s', s, a, h) for every s' ∈ S
End
For t ← 1 to T do:
 | Private planning: Q ← PrivQ(private counters, ε, δ); let π_t be the greedy policy with respect to Q
 | For h ← 1 to H do:
 |  | Let s_h denote the state during step h and episode t
 |  | Execute a_h = π_t(s_h, h)
 |  | Observe r_h and s_{h+1}
 |  | Feed r_h to the reward counter r̂(s_h, a_h, h)
 |  | Feed 1 to the counters n̂(s_h, a_h, h) and m̂(s_{h+1}, s_h, a_h, h), and 0 to all other counters n̂ and m̂
 | End
End

Private Counting Details: PUCB initializes private counters $\hat r(s, a, h)$, $\hat n(s, a, h)$, and $\hat m(s', s, a, h)$ for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$, and $h \in [H]$. By requiring that each set of counters ($\hat r$, $\hat n$, and $\hat m$) is $\varepsilon'$-JDP, with

we can ensure that, with high probability, the released counts stay close to the true counts for all episodes $t$ and all tuples $(s, a, h)$. Here $n^{t}$ and $\hat n^{t}$ denote the true count and the privately released count right before episode $t$; this guarantee is uniform in $t$ and also holds simultaneously for $\hat r$ and $\hat m$. As with the private counting mechanism above, a full understanding of these details is not required.

In Algorithm 3, we calculate the private Q-function for the $t$-th episode. PrivQ is a standard batch Q-learning (backward induction) update, with the bonus term serving as an optimism bonus. The greedy policy with respect to the resulting Q-function is the policy we use for the $t$-th episode.

Algorithm 3: PrivQ
Input: Private counters r̂, n̂, m̂; privacy parameter ε; target failure probability δ

V_{H+1}(s) = 0 for all s ∈ S
For h ← H down to 1 do:
 | For (s, a) ∈ S × A do:
 |  | If n̂(s, a, h) > 0 then:
 |  |  | r̄ = r̂(s, a, h) / n̂(s, a, h);  P̄(·) = m̂(·, s, a, h) / n̂(s, a, h)  // private empirical estimates
 |  | Else:
 |  |  | Use optimistic default estimates for r̄ and P̄  // no (private) data for this tuple yet
 |  | End
 |  | β = β(s, a, h)  // optimism bonus, decomposed below
 |  | Q_h(s, a) = min{ H − h + 1, r̄ + Σ_{s'} P̄(s') V_{h+1}(s') + β }
 | End
 | V_h(s) = max_a Q_h(s, a) for all s ∈ S
End
Output: Q = (Q_1, ..., Q_H)

Here the bonus function for each $(s, a, h)$ can be decomposed into two components: a sampling-error term and a privacy-error term (see [2] for the exact expressions). The first component can be roughly thought of as the sampling error, analogous to UBEV's confidence width, while the second accounts for the errors introduced by the private counters.
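To tie the pieces together, here is a hedged Python sketch of a PrivQ-style planning step that uses released (noisy) counts and a bonus split into a sampling term and a privacy term; both bonus formulas are simplified placeholders rather than the paper's exact expressions:

import numpy as np

def priv_q_plan(n_hat, m_hat, r_hat, S, A, H, epsilon, delta=0.1):
    """Backward induction on privately released statistics.
    n_hat[s,a,h], m_hat[s2,s,a,h], r_hat[s,a,h] are outputs of the private counters."""
    V = np.zeros((S, H + 1))
    Q = np.zeros((S, A, H))
    count_err = np.log(1.0 / delta) / epsilon            # crude stand-in for the counter error bound
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                cnt = max(n_hat[s, a, h], 1.0)
                r_bar = r_hat[s, a, h] / cnt
                ev_bar = m_hat[:, s, a, h] @ V[:, h + 1] / cnt
                phi = (H - h) * np.sqrt(np.log(2 * S * A * H / delta) / (2 * cnt))  # sampling-error term
                nu = (H - h) * count_err / cnt                                      # privacy-error term
                Q[s, a, h] = min(H - h, r_bar + ev_bar + phi + nu)
            V[s, h] = Q[s, :, h].max()
    return Q, V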

Notable theoretical guarantees of the PUCB Algorithm

Theorem 3 (PAC guarantee for PUCB)[2]: PUCB follows a policy that, with probability at least $1 - \delta$, is $\epsilon$-optimal on all but

episodes.

This means that, given sufficiently many episodes, PUCB acts near-optimally on all but a small fraction of them. The cost of privacy is low, as the privacy parameter $\varepsilon$ appears only in a lower-order term of the bound, which is typically small compared to the leading term.

Theorem 4 (Regret bound for PUCB)[2]: With probability at least $1 - \delta$, the regret of PUCB up to episode $T$ is at most

The remark is similar to the PAC guarantee: since the privacy parameter $\varepsilon$ appears only in a lower-order term, the utility cost of privacy is negligible relative to the leading-order term of the regret. For readers interested in the proof details, a high-level sketch and the detailed proof are provided in the paper.

Conclusion

On this page, we introduced two novel RL algorithms with outstanding theoretical guarantees. The first paper introduced the UBEV algorithm: to the authors' knowledge, UBEV is the first algorithm with both near-tight high-probability regret and $(\epsilon, \delta)$-PAC bounds, as well as the first algorithm with any nontrivial uniform-PAC bound. This advancement inspired researchers to consider a DP setting in which user data is treated as sensitive; the authors of the second paper use joint differential privacy so that future decisions are selected based on privatized rather than raw sensitive data. The second paper establishes a JDP algorithm, PUCB, along with an analysis of its PAC and regret guarantees. Notably, the authors prove that the utility cost of privacy is asymptotically negligible, opening a novel direction for differentially private algorithms. For future work, the authors plan to further close the gap between the DP and non-DP settings, as well as to design private RL algorithms for non-tabular environments closer to real-life applications.

Annotated Bibliography


  1. Dann, Christoph; et al. (2017). "Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning". Neural Information Processing Systems 2017 – via arXiv.
  2. Vietri, Giuseppe; et al. (2020). "Private Reinforcement Learning with PAC and Regret Guarantees". International Conference on Machine Learning 2020 – via arXiv.
  3. Dwork, Cynthia; Roth, Aaron (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science.


