Course talk:CPSC522/Diffusion Probabilistic Model

[View source↑]
[History↑]

Discussion 1

Critique

1) The abstract says: "These models consist of a forward trajectory which iteratively adds noise to the source data distribution to shift it towards a tractable prior. The reverse trajectory aims to model the reverse of this process, slowly shifting the distribution from the tractable prior back to the target data distribution."

and, the sentence above the "Forward Trajectory" sub-heading says: "These models are comprised of a forward and reverse process. In the forward process, the data distribution is iteratively transformed into a tractable distribution. The aim of the reverse process is to reverse this forward process, transforming data from a tractable source distribution into samples from the data distribution."

It's the same process explained in the same level of detail twice, so perhaps you could either revise the abstract or remove the sentences above "Forward Trajectory".

2) If $\beta _{t}$ , then the form of ${\textstyle p\left(\mathbf {x^{(t-1)}} |\mathbf {x^{(t)}} \right)}$ will match ${\textstyle q\left(\mathbf {x^{(t)}} |\mathbf {x^{(t-1)}} \right)}$ .

This sentence is incomplete. If $\beta _{t}$ what?

3) Considering the fact that there is room to write more in the page, you might want to explain the "Training Objective" section a bit more. The multiplication of the term to the top and bottom of the existing term can be re-written to understand the rearranged equation better. The first part would show how the equation looks with ${\textstyle {\frac {q\left(\mathbf {x} ^{(1\ldots T)}|\mathbf {x} ^{(0)}\right)}{q\left(\mathbf {x} ^{(1\ldots T)}|\mathbf {x} ^{(0)}\right)}}}$ on the RHS. The second part would involve re-writing parts of the equation, the third part would involve re-arrangement.

4) You need to mention what the Jensen's inequality is as this information is not in the "Builds on" pages. Then, you need to elaborate how the inequality fits in this context (either with a brief explanation or by rewriting the resulting equation in a form that is consistent with Jensen's inequality).

5) "With algebraic manipulation, the authors arrive at the final loss:" It would be wonderful if you can show the algebraic manipulation instead of just mentioning it.

6) "One key difference between this first paper and future diffusion probabilistic models is that the authors choose to learn the variance schedule of the forward trajectory $\beta _{t}$ . The authors set $\beta _{1}$ to a small value, but otherwise optimize $\beta _{2,\ldots ,T}$ through gradient ascent on the learning objective." Why? Why did they choose to do that? What is the benefit?

7) "To implement a Diffusion Probabilistic Model, multiple design choices, including the variances $\beta _{t}$ , the model architecture and how to parameterize the gaussian transitions." This is an incomplete sentence. It can potentially be merged with the previous sentence.

8) "The authors of Denoising Diffusion Probabilistic Models ^[1] connect diffusion models to denoising score matching to simplify the training objective." How does it simplify the training objective?

9)

L=\mathbb {E} _{q}\left[L_{T}+\sum _{t>1}L_{t-1}+L_{0}\right]

L_{T}=D_{KL}\left(q\left(\mathbf {x} _{T}|\mathbf {x} _{0}\right)||\ p\left(\mathbf {x} _{T}\right)\right)

L_{t-1}=D_{KL}\left(q\left(\mathbf {x} _{t-1}|\mathbf {x} _{t},\mathbf {x} _{0}\right)||\ p_{\theta }\left(\mathbf {x} _{t-1}|\mathbf {x} _{t}\right)\right)

L_{0}=-logp_{\theta }\left(\mathbf {x} _{0}|\mathbf {x} _{1}\right)

This comes out of nowhere, how does it relate to the equations we learnt about in the previous section? Explaining what these variables mean would also be great. Also, using equation numbers and referring to those equations with their equation numbers would be ideal.

10) Where is $\mu _{\theta }$ defined?

11) "The authors provide an alternate objective, by first noting that $\mathbf {x} _{t}\left(\mathbf {x} _{0},\mathbf {\epsilon } \right)={\sqrt {{\bar {\alpha }}_{t}}}\mathbf {x} _{0}+{\sqrt {1-{\bar {\alpha }}_{t}}}\mathbf {\epsilon }$ where $\mathbf {\epsilon } \sim {\mathcal {N}}\left(\mathbf {0} ,\mathbf {I} \right)$ and ${\textstyle {\bar {\alpha }}_{t}=\prod _{i=1}^{t}(1-\beta _{t})}$ . " Why?

12) What are your thoughts on the contributions?

Minor edits

1) Grammatical errors:
-> Generative models are a powerful class of *models *that are able to draw novel samples *that match the distribution of their training distribution.
-> One such class of generative model *is diffusion probabilisitic models.
-> Other members of that class of model are Generative Adversarial Networks*, Variational Auto-Encoders*, and Variational Recurrent Neural Networks.
-> Naively, ${\textstyle log\ p\left(\mathbf {x} ^{(0)}\right)}$ is intractable as *it involves evaluating the integral...

2) Incompleteness:
-> In *todo: cite paper, the authors introduce diffusion probabilistic models.

↑ Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.

HarshineeSriram (talk)‎

Thank you Harshinee for this excellent critique. I'll work on addressing the comments you have made.

MatthewNiedoba (talk)‎

I think I've addressed most of your concerns. 1. I simplified the abstract and to reduce the redundancy 2. Fixed that grammatical error (if beta is small) 3. Expanded the derivation of p(x_0) and the top/bottom multiplication steps 4. Added a sentence on Jensen's Inequality and how it is used in this context. Also added a link to the VI page which talks about the ELBO 5. Added the algebraic manipulation, and changed the final form to match the DDPM paper so its clearer when I discuss modifications in the DDPM paper 6. Selecting hyperparmeters is just a choice you make. The DDPM authors don't really motivate it in the paper, although they do provide an ablation which I've included in the experimental results section 7. I completely rewrote that section 8. I tried to rewrite this. The main idea here is that the authors were inspired by a different body of work (score matching methods), and noticed that with a specific parameterization of the reverse trajectory they end up with a score-matching like objective 9. As with 5. I tried to introduce this earlier so it doesn't come out of the blue so much 10. I had f_\mu as the function approximator for the mean in the first paper, but I changed it to \mu_\theta for clarity. 11. I tried to explain how this equation comes from the reparameterization trick. I also added the ablation section to show that this parameterization is more effective. 12. Added conclusion section.

MatthewNiedoba (talk)‎

Critique

Diffusion models are a very hot topic rn, and the fact that it builds on Markov Chains and VI makes it an organic fit for our course material. Very good topic!

For paper 1:

Don't forget to cite papers :)
Visualizations of the steps are clearer than math.
some visualizations for the first paper experiments, such as on cifar 10 would make the model convincing.

For Denoising Diffusion Probabilistic Models, you mentioned [2] "is extremely flexible, but this is somewhat to its detriment." I would love to see some elaborations on this. (The visualization/experiment section could be a great foreshadowing of this issue).

Additionally, the jump from Paper [2] to [3] needs some more justifications, its not obvious to the reader without the proper background, something along the line of "the parameters needed to train a DM is unfeasibly large, ... in this paper, the author considers simplifying it."

This and the problem above is actually just one problem, and there need to be more words dedicated to addressing the disadvantages of paper 1 and why paper 2 is a good choice to remedy 1.

I like the derivations, but the theoretical guarantees seem to be missing
similar to the theoretical guarantees, the experiments are also not addressed so it is unconvincing to readers that these models are as powerful and popular as they are.

Although it's not mandatory, a conclusion section really ties the page together, as right now the page seems to abruptly end, with no discussion or any justifications on why these models are significant.

YilinYang (talk)‎

Thanks for your comments Yilin. I've done my best to address your comments

I added some extra citations for related works and datasets on which the papers were evaluated
I included the figure from the original paper on the swiss roll dataset and moved that up to the top as I think it does a good job illustrating the distributions for the forward and reverse trajectories
I added the experimental results for the first paper, including CIFAR-10. The quality of these results is what I use to motivate the advances made in the second paper
I'm not exactly sure what you mean by theoretical guarantees? As far as I'm aware there aren't any guarantees on the model output which I can described
I added some experimental results for both papers.
I added a conclusion.

MatthewNiedoba (talk)‎

[1] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.

[1]

Thread title	Replies	Last modified
Discussion 1	2	22:41, 15 February 2023
Critique	1	22:35, 15 February 2023

Contents

Discussion 1

Critique