Talk:MDP for Differentially Private Reinforcement Learning

Initial Feedback

This looks good for the assignment.

The abstract says " two papers that use Markov Decision Processes for Differentially Private (DP) Reinforcement Learning", but does the first one guarantee DP?
What does "sends" mean in the pseudo-code?
Perhaps a page on Differential Privacy could be used for the next assignment (as it wasn't explained here in a way that is easy to understand)
The conclusion should talk about both papers

I really enjoyed the wiki article but I think a little bit of a reorganization could be very useful. Specifically, I think some sort of background section convering the MDP / DP fundamentals would be good to get out of the way so I can clearly understand the contributions of each of these papers
I thought the abstract was good, but perhaps a more fleshed out introduction would be nice to have. Specifically, it would be good to prime me with an expectation about why I should be interested in the problem of episodic RL and why DP in the context of that problem is important.

Introduce authors of papers, and title. Transform the raw url into a link with text.

Abstract

may contain sensitive information such as personalized medical treatment applications - what is a medical treatment application?

Most of the MDP notation is pretty standard. Perhaps move it to a background section since the MDP terminology is not the specific contribution of the paper
Generally in RL the agent interacts with the environment. How does that compare with the interaction with users in the episodic RL algorithm block?
In the UBEV algorithm, it would be nice to get a blurb about what the algorithm is before seeing a wall of text. I don’t know what to look for when reading through this algorithm because I don’t have context about what it is doing.
You talk about the Q function but you didn’t introduce the Q function in your MDP/RL background section.
Add a (JDP) after the first instance of Joint differential privacy so I know what the acronym means.
Not sure if it is relevant but I’m curious what the difference between JDP and DP is. Maybe add a sentence about that?

I don’t know what a prefix count is
I like the intro to the PUCB algorithm. That is exactly what I’d like to see for UBEV

I think your use of pseudo code is good and I can see the contribution of the second paper by making a DP variant! It worked well as a unit too.

Paper 1

Should it be $r(s_{i},\pi (s,h),h)$ or can be any number of time steps?
I was wondering how SAH is computed if S is a set of states and A is a set of actions.
What $\phi$ represents is not explicitly mentioned.

Paper 2

Related to paper 1, Q-functions and Q-learning are mentioned and a link is given but it is not really elaborated on. I think it may help to write a few sentences on Q-learning in the background information.

Minor Corrections

Abstract