Diffusion language models

Diffusion models have completely taken over generative modelling of perceptual signals such as images, audio and video. Why is autoregression still the name of the game for language modelling? And can we do anything about that? Some thoughts about what it will take for other forms of iterative refinement to take over language modelling, the last bastion of autoregression.

The rise of diffusion models

Roughly three years ago, things were starting to look as if adversarial image generators were about to be supplanted by a powerful combination of autoregression and discrete representation learning. BigGAN1 and StyleGAN2 had significantly expanded the capabilities of image generators, but the mode-seeking nature of GANs made them favour realism over diversity. This presented some challenges, and people were having trouble reproducing impressive domain-specific results (e.g. generating realistic human faces) on more diverse training datasets.

VQ-VAE 23 and especially VQGAN4 extolled the virtue of a two-stage approach to generative modelling: first turn everything into a highly compressed discrete one-dimensional sequence, and then learn to predict this sequence step-by-step using a powerful autoregressive model. This idea had already proven fruitful before, going back to the original VQ-VAE5, but these two papers really drove the point home that this was our best bet for generative modelling of diverse data at scale.

But then, a challenger appeared: a new generative modelling approach based on iterative denoising was starting to show promise. Yang Song and Stefano Ermon proposed score-based models: while their NeurIPS 2019 paper6 was more of a proof-of-concept, the next year’s follow-up ‘Improved Techniques for Training Score-Based Generative Models’7 showed results that convinced some people (including me!) to take this direction of research more seriously. Another NeurIPS 2020 paper by Jonathan Ho, Ajay Jain and Pieter Abbeel, ‘Denoising Diffusion Probabilistic Models’ (DDPMs)8 showed similar results, and it didn’t take people too long to realise that DDPMs and score-based models were two sides of the same coin.

The real triumph of diffusion models over other alternatives for image generation came in 2021, with ‘Diffusion Models Beat GANs on Image Synthesis’9 by Prafulla Dhariwal and Alex Nichol. At that point, it was pretty clear to everyone in the know that this approach was poised to take over. Powerful diffusion-based text-to-image models such as GLIDE10 started to arrive by the end of that year, and proceeded to go mainstream in 2022.

If you are unfamiliar with diffusion models, I recommend reading at least the first section of my previous blog post ‘Diffusion models are autoencoders’ for context, before reading the rest of this one.

Diffusion for images: a match made in heaven

A noisy image of a mountain range, with the level of noise gradually decreasing from left to right.

Diffusion models and the human visual system have one important thing in common: they don’t care too much about high frequencies. At least, not out of the box. I discussed the reasons for this in some detail in an earlier blog post (section 5 in particular).

In a nutshell, the different levels of noise at which a diffusion model operates allow it to focus on different spatial frequency components of the image at each iterative refinement step. When sampling an image, the model effectively builds it up from low frequencies to high frequencies, first filling in large-scale structure and then adding progressively more fine-grained details.

During training, we sample a noise level for each training example, add noise to it, and then try to predict the noise. The relative weights with which we sample the different noise levels therefore determine the degree to which the model focuses on large-scale and fine-grained structure. The most commonly used formulation, with uniform weighting of the noise levels, yields a very different objective than the likelihood loss which e.g. autoregressive models are trained with.
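To make this concrete, here is a minimal sketch of a single training step with an explicit noise-level weighting (assumed PyTorch; the denoiser network, the uniform noise-level distribution and the tensor shapes are illustrative placeholders, not the formulation of any specific paper):

import torch

def diffusion_training_loss(denoiser, x, weight_fn=lambda sigma: torch.ones_like(sigma)):
    # Sample one noise level per training example (uniform here, for illustration).
    sigma = torch.rand(x.shape[0], 1, 1, 1)
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise                 # corrupt the clean image
    pred_noise = denoiser(x_noisy, sigma)       # the network predicts the added noise
    per_example = ((pred_noise - noise) ** 2).flatten(1).mean(dim=1)
    # The weighting across noise levels determines how much the model focuses on
    # large-scale versus fine-grained structure; uniform weighting is the common default.
    return (weight_fn(sigma).flatten() * per_example).mean()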

It turns out that there is a particular weighting which corresponds directly to the likelihood loss11, but this puts significantly more weight on very low noise levels. Since low noise levels correspond to high spatial frequencies, this also indirectly explains why likelihood-based autoregressive models in pixel space never really took off: they end up spending way too much of their capacity on perceptually meaningless detail, and never get around to modelling larger-scale structure.

Relative to the likelihood loss, uniform weighting across noise levels in diffusion models yields an objective that is much more closely aligned with the human visual system. I don’t believe this was actually known when people first started training diffusion models on images – it was just a lucky coincidence! But we understand this pretty well now, and I think it is one of the two main reasons why this modelling approach completely took over in a matter of two years. (The other reason is of course classifier-free guidance, which you can read more about in my previous blog post on the topic.)

The reason I bring all this up here, is that it doesn’t bode particularly well for applications of diffusion models beyond the perceptual domain. Our ears have a similar disdain for high frequencies as our eyes (though to a lesser extent, I believe), but in the language domain, what does “high frequency” even mean12? Given the success of likelihood-based language models, could the relatively lower weight of low noise levels actually prove to be a liability in this setting?

Autoregression for language: a tough baseline to beat

Autoregression at the word or token level is a very natural way to do language modelling, because to some degree, it reflects how language is produced and consumed: as a one-dimensional sequence, one element at a time, in a particular fixed order. However, if we consider the process through which an abstract thought turns into an utterance, the iterative denoising metaphor starts to look more appealing. When writing a paragraph, the core concepts are generally decided on first, and the exact wording and phrasing doesn’t materialise until later. That said, perhaps it doesn’t matter precisely how humans interact with language: just like how planes don’t fly the same way birds do (h/t Yann LeCun), the best way to build a practically useful language model need not reflect nature either.

Practically speaking, autoregressive models have an interface that is somewhat limited: they can be prompted, i.e. tasked to complete a sequence for which a prefix is given. While this has actually been shown to be reasonably versatile in itself, the ability of non-autoregressive models to fill in the blanks (i.e. be conditioned on something other than a prefix, also known as inpainting in the image domain) is potentially quite useful, and not something that comes naturally to autoregressive models (though it is of course possible to do infilling with autoregressive models13).

Training efficiency

If we compare autoregression and diffusion side-by-side as different forms of iterative refinement, the former has the distinct advantage that training can be parallelised trivially across all refinement steps. During autoregressive model training, we obtain a useful gradient signal from all steps in the sampling process. This is not true for diffusion models, where we have to sample a particular noise level for each training example. It is not practical to train on many different noise levels for each example, because that would require multiple forward and backward passes through the model. For autoregression, we get gradients for all sequence steps with just a single forward-backward pass.
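As a sketch of the contrast (assumed PyTorch; model is a placeholder mapping a token prefix to next-token logits): with teacher forcing, a single forward pass yields a loss term, and hence a gradient signal, at every position of the sequence, whereas the diffusion loss sketched earlier only touches one sampled noise level per example per pass.

import torch
import torch.nn.functional as F

def autoregressive_loss(model, tokens):
    # model(tokens[:, :-1]) returns next-token logits of shape (batch, seq - 1, vocab).
    logits = model(tokens[:, :-1])
    # Every sequence position contributes a cross-entropy term in this single pass.
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), tokens[:, 1:].reshape(-1))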

As a result, diffusion model training is almost certainly significantly less statistically efficient than autoregressive model training, and slower convergence implies higher computational requirements.

Sampling efficiency

Sampling algorithms for diffusion models are very flexible: they allow for sample quality and computational cost to be traded off without retraining, simply by changing the number of sampling steps. This isn’t practical with autoregressive models, where the number of sampling steps is tied directly to the length of the sequence that is to be produced. On the face of it, diffusion models are at an advantage here: perhaps we can get high-quality samples with a number of steps that is significantly lower than the sequence length?

For long enough sequences, this is probably true, but it is important to compare apples to apples. Simply comparing the number of sampling steps across different methods relies on the implicit assumption that all sampling steps have the same cost, and this is not the case. Leaving aside the fact that a single diffusion sampling step can sometimes require multiple forward passes through the model, the cost of an individual forward pass also differs. Autoregressive models can benefit substantially from caching, i.e. re-use of activations computed during previous sampling steps, which significantly reduces the cost of each step. This is not the case for diffusion models, because the level of noise present in the input changes throughout sampling, so each sampling step requires a full forward pass across the entire input.

Therefore, the break-even point at which diffusion sampling becomes more efficient than autoregressive sampling is probably at a number of steps significantly below the length of the sequence. Whether this is actually attainable in practice remains to be seen.

Why bother with diffusion at all?

The efficiency disadvantages with respect to autoregressive models might lead one to wonder if diffusion-based language modelling is even worth exploring to begin with. Aside from infilling capabilities and metaphorical arguments, there are a few other reasons why I believe it’s worth looking into:

  • Unlike autoregressive models, which require restricted connectivity patterns to ensure causality (usually achieved by masking), diffusion model architectures are completely unconstrained. This enables a lot more creative freedom, as well as potentially benefiting from architectural patterns that are common in other application domains, such as using pooling and upsampling layers to capture structure at multiple scales. One recent example of such creativity is Recurrent Interface Networks14, whose Perceiver IO-like15 structure enables efficient re-use of computation across sampling steps.

  • The flexibility of the sampling procedure extends beyond trading off quality against computational cost: it can also be modified to amplify the influence of conditioning signals (e.g. through classifier-free guidance), or to include additional constraints without retraining. Li et al.16 extensively explore the latter ability for text generation (e.g. controlling sentiment or imposing a particular syntactic structure).

  • Who knows what other perks we might uncover by properly exploring this space? The first few papers on diffusion models for images struggled to match results obtained with more established approaches at the time (i.e. GANs, autoregressive models). Work on diffusion models in new domains could follow the same trajectory – if we don’t try, we’ll never know.

Diffusion for discrete data

Diffusion models operate on continuous inputs by default. When using the score-based formalism, continuity is a requirement because the score function \(\nabla_\mathbf{x} \log p(\mathbf{x})\) is only defined when \(\mathbf{x}\) is continuous. Language is usually represented as a sequence of discrete tokens, so the standard formulation is not applicable. Broadly speaking, there are two ways to tackle this apparent incompatibility:

  • formulate a discrete corruption process as an alternative to Gaussian diffusion;
  • map discrete inputs to continuous vectors and apply Gaussian diffusion in that space.

The former approach has been explored extensively: D3PM17, MaskGIT18, Mask-predict19, ARDM20, Multinomial diffusion21, DiffusER22 and SUNDAE23 are all different flavours of non-autoregressive iterative refinement using a discrete corruption process. Many (but not all) of these works focus on language modelling as the target application. It should be noted that machine translation has been particularly fertile ground for this line of work, because the strong conditioning signal makes non-autoregressive methods attractive even when their ability to capture diversity is relatively limited. Several works on non-autoregressive machine translation predate the rise of diffusion models.
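As a minimal illustration of what a discrete corruption process can look like, here is a hedged sketch of absorbing-state (masking) corruption, in the spirit of the mask-based approaches above; the mask token id and the corruption schedule are arbitrary placeholders:

import torch

def mask_corrupt(tokens, mask_id, t):
    # t in (0, 1] is the corruption level: the expected fraction of masked tokens.
    corrupt = torch.rand(tokens.shape, device=tokens.device) < t
    return torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)

tokens = torch.randint(0, 1000, (2, 16))
corrupted = mask_corrupt(tokens, mask_id=1000, t=0.5)
# A model is then trained to predict the original tokens at the corrupted positions,
# and sampling iteratively unmasks the sequence over several refinement steps.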

Unfortunately, moving away from the standard continuous formulation of diffusion models tends to mean giving up on some useful features, such as classifier-free guidance and the ability to use various accelerated sampling algorithms developed specifically for this setting. Luckily, we can stick with continuous Gaussian diffusion simply by embedding discrete data in Euclidean space. This approach has recently been explored for language modelling. Some methods, like self-conditioned embedding diffusion (SED)24, use a separate representation learning model to obtain continuous embeddings corresponding to discrete tokens; others jointly fit the embeddings and the diffusion model, like Diffusion-LM16, CDCD25 and Difformer26.

Continuous diffusion for categorical data (CDCD) is my own work in this space: we set out to explore how diffusion models could be adapted for language modelling. One of the goals behind this research project was to develop a method for diffusion language modelling that looks as familiar as possible to language modelling practitioners. Training diffusion models is a rather different experience from training autoregressive Transformers, and we wanted to minimise the differences to make this as approachable as possible. The result is a model whose training procedure is remarkably close to that of BERT27: the input token sequence is embedded, noise is added to the embeddings, and the model learns to predict the original tokens using the cross-entropy loss (score interpolation). The model architecture is a standard Transformer. We address the issue of finding the right weighting for the different noise levels with an active learning strategy (time warping), which adapts the distribution of sampled noise levels on the fly during training.
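Below is a rough sketch of what such a training step could look like. It is an illustration of the general recipe described above, not the actual CDCD implementation: score interpolation, noise-level conditioning of the network and time warping are all omitted, and every hyperparameter is an arbitrary placeholder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiffusionLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.readout = nn.Linear(dim, vocab_size)

    def loss(self, tokens):
        e = self.embed(tokens)                        # embed the input token sequence
        sigma = torch.rand(tokens.shape[0], 1, 1)     # sample one noise level per sequence
        e_noisy = e + sigma * torch.randn_like(e)     # add Gaussian noise to the embeddings
        logits = self.readout(self.encoder(e_noisy))  # a standard (non-causal) Transformer
        # Predict the original tokens with the cross-entropy loss, as in BERT.
        return F.cross_entropy(logits.flatten(0, 1), tokens.flatten())

model = ToyDiffusionLM()
print(model.loss(torch.randint(0, 1000, (2, 16))))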

Another way to do language modelling with Gaussian diffusion, which to my knowledge has not been explored extensively so far, is to learn higher-level continuous representations rather than embed individual tokens. This would require a powerful representation learning approach whose representations are rich enough to be decoded back into readable text (potentially by a light-weight autoregressive decoder). Autoencoders applied to token sequences tend to produce representations that fail to capture the least predictable components of the input, which carry precisely the most salient information. Perhaps contrastive methods, or methods that try to capture the dynamics of text (such as Time Control28) could be more suitable for this purpose.

Closing thoughts

While CDCD models produce reasonable samples, and are relatively easy to scale due to their similarity to existing language models, the efficiency advantages of autoregression make it a very tough baseline to beat. I believe it is still too early to consider diffusion as a serious alternative to autoregression for generative language modelling at scale. As it stands, we also know next to nothing about scaling laws for diffusion models. Perhaps ideas such as latent self-conditioning14 could make diffusion more competitive, by improving computational efficiency, but it’s not clear that this will be sufficient. Further exploration of this space has the potential to pay off handsomely!

All in all, I have become convinced that the key to powerful generative models is iterative refinement: rather than generating a sample in a single pass through a neural network, the model is applied repeatedly to refine a canvas, and hence the unrolled sampling procedure corresponds to a much “deeper” computation graph. Exactly which algorithm one uses to achieve this might not matter too much in the end, whether it be autoregression, diffusion, or something else entirely. I have a lot more thoughts about this, so perhaps this could be the subject of a future blog post.

On an unrelated note: I’ve disabled Disqus comments on all of my blog posts, as their ads seem to have gotten very spammy. I don’t have a good alternative to hand right now, so in the meantime, feel free to tweet your thoughts at me instead @sedielem, or send me an email. When I eventually revamp this blog at some point in the future, I will look into re-enabling comments. Apologies for the inconvenience!

UPDATE (April 7): I have reenabled Disqus comments.

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{dieleman2023language,
  author = {Dieleman, Sander},
  title = {Diffusion language models},
  url = {https://benanne.github.io/2023/01/09/diffusion-language.html},
  year = {2023}
}

Acknowledgements

Thanks to my collaborators on the CDCD project, and all my colleagues at DeepMind.

References

  1. Brock, Donahue, Simonyan, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, International Conference on Learning Representations, 2019. 

  2. Karras, Laine, Aittala, Hellsten, Lehtinen, Aila, “Analyzing and Improving the Image Quality of StyleGAN”, Computer Vision and Pattern Recognition, 2020. 

  3. Razavi, van den Oord and Vinyals, “Generating Diverse High-Fidelity Images with VQ-VAE-2”, Neural Information Processing Systems, 2019. 

  4. Esser, Rombach and Ommer, “Taming Transformers for High-Resolution Image Synthesis”, Computer Vision and Pattern Recognition, 2021. 

  5. van den Oord, Vinyals and Kavukcuoglu, “Neural Discrete Representation Learning”, Neural Information Processing Systems, 2017. 

  6. Song and Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution”, Neural Information Processing Systems, 2019. 

  7. Song and Ermon, “Improved Techniques for Training Score-Based Generative Models”, Neural Information Processing Systems, 2020. 

  8. Ho, Jain and Abbeel, “Denoising Diffusion Probabilistic Models”, Neural Information Processing Systems, 2020. 

  9. Dhariwal, Nichol, “Diffusion Models Beat GANs on Image Synthesis”, Neural Information Processing Systems, 2021. 

  10. Nichol, Dhariwal, Ramesh, Shyam, Mishkin, McGrew, Sutskever, Chen, “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, arXiv, 2021. 

  11. Song, Durkan, Murray, Ermon, “Maximum Likelihood Training of Score-Based Diffusion Models”, Neural Information Processing Systems, 2021. 

  12. Tamkin, Jurafsky, Goodman, “Language Through a Prism: A Spectral Approach for Multiscale Language Representations”, Neural Information Processing Systems, 2020. 

  13. Bavarian, Jun, Tezak, Schulman, McLeavey, Tworek, Chen, “Efficient Training of Language Models to Fill in the Middle”, arXiv, 2022. 

  14. Jabri, Fleet, Chen, “Scalable Adaptive Computation for Iterative Generation”, arXiv, 2022. 

  15. Jaegle, Borgeaud, Alayrac, Doersch, Ionescu, Ding, Koppula, Zoran, Brock, Shelhamer, Hénaff, Botvinick, Zisserman, Vinyals, Carreira, “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, International Conference on Learning Representations, 2022. 

  16. Li, Thickstun, Gulrajani, Liang, Hashimoto, “Diffusion-LM Improves Controllable Text Generation”, Neural Information Processing Systems, 2022. 

  17. Austin, Johnson, Ho, Tarlow, van den Berg, “Structured Denoising Diffusion Models in Discrete State-Spaces”, Neural Information Processing Systems, 2021. 

  18. Chang, Zhang, Jiang, Liu, Freeman, “MaskGIT: Masked Generative Image Transformer”, Computer Vision and Pattern Recognition, 2022. 

  19. Ghazvininejad, Levy, Liu, Zettlemoyer, “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Empirical Methods in Natural Language Processing, 2019. 

  20. Hoogeboom, Gritsenko, Bastings, Poole, van den Berg, Salimans, “Autoregressive Diffusion Models”, International Conference on Learning Representations, 2022. 

  21. Hoogeboom, Nielsen, Jaini, Forré, Welling, “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, Neural Information Processing Systems, 2021. 

  22. Reid, Hellendoorn, Neubig, “DiffusER: Discrete Diffusion via Edit-based Reconstruction”, arXiv, 2022. 

  23. Savinov, Chung, Binkowski, Elsen, van den Oord, “Step-unrolled Denoising Autoencoders for Text Generation”, International Conference on Learning Representations, 2022. 

  24. Strudel, Tallec, Altché, Du, Ganin, Mensch, Grathwohl, Savinov, Dieleman, Sifre, Leblond, “Self-conditioned Embedding Diffusion for Text Generation”, arXiv, 2022. 

  25. Dieleman, Sartran, Roshannai, Savinov, Ganin, Richemond, Doucet, Strudel, Dyer, Durkan, Hawthorne, Leblond, Grathwohl, Adler, “Continuous diffusion for categorical data”, arXiv, 2022. 

  26. Gao, Guo, Tan, Zhu, Zhang, Bian, Xu, “Difformer: Empowering Diffusion Model on Embedding Space for Text Generation”, arXiv, 2022. 

  27. Devlin, Chang, Lee, Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, North American Chapter of the Association for Computational Linguistics, 2019. 

  28. Wang, Durmus, Goodman, Hashimoto, “Language modeling via stochastic processes”, International Conference on Learning Representations, 2022. 

Guidance: a cheat code for diffusion models

Classifier-free diffusion guidance1 dramatically improves samples produced by conditional diffusion models at almost no cost. It is simple to implement and extremely effective. It is also an essential component of OpenAI’s DALL·E 22 and Google’s Imagen3, powering their spectacular image generation results. In this blog post, I share my perspective and try to give some intuition about how it works.

Diffusion guidance

Barely two years ago, they were a niche interest on the fringes of generative modelling research, but today, diffusion models are the go-to model class for image and audio generation. In my previous blog post, I discussed the link between diffusion models and autoencoders. If you are unfamiliar with diffusion models, I recommend reading at least the first section of that post for context, before reading the rest of this one.

Diffusion models are generative models, which means they model a high-dimensional data distribution \(p(x)\). Rather than trying to approximate \(p(x)\) directly (which is what likelihood-based models do), they try to predict the so-called score function, \(\nabla_x \log p(x)\).

To sample from a diffusion model, an input is initialised to random noise, and is then iteratively denoised by taking steps in the direction of the score function (i.e. the direction in which the log-likelihood increases fastest), with some additional noise mixed in to avoid getting stuck in modes of the distribution. This is called Stochastic Gradient Langevin Dynamics (SGLD). This is a bit of a caricature of what people actually use in practice nowadays, but it’s not too far off the truth.
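A minimal sketch of this sampling loop, under the simplifications above (a fixed step size and no noise schedule; score_fn stands in for the trained network's score estimate):

import torch

def langevin_sample(score_fn, shape, num_steps=1000, step_size=1e-2):
    x = torch.randn(shape)                # initialise to random noise
    for _ in range(num_steps):
        noise = torch.randn_like(x)
        # Step in the direction of the score, with extra noise mixed in
        # to avoid getting stuck in modes of the distribution.
        x = x + step_size * score_fn(x) + (2 * step_size) ** 0.5 * noise
    return x

# Example with a known score function: a standard Gaussian has score -x.
samples = langevin_sample(lambda x: -x, (4, 2))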

In conditional diffusion models, we have an additional input \(y\) (for example, a class label or a text sequence) and we try to model the conditional distribution \(p(x \mid y)\) instead. In practice, this means learning to predict the conditional score function \(\nabla_x \log p(x \mid y)\).

One neat aspect of the score function is that it is invariant to normalisation of the distribution: if we only know the distribution \(p(x)\) up to a constant, i.e. we have \(p(x) = \frac{\tilde{p}(x)}{Z}\) and we only know \(\tilde{p}(x)\), then we can still compute the score function:

\[\nabla_x \log \tilde{p}(x) = \nabla_x \log \left( p(x) \cdot Z \right) = \nabla_x \left( \log p(x) + \log Z \right) = \nabla_x \log p(x),\]

where we have made use of the linearity of the gradient operator, and the fact that the normalisation constant \(Z = \int \tilde{p}(x) \mathrm{d} x\) does not depend on \(x\) (so its derivative w.r.t. \(x\) is zero).

Unnormalised probability distributions come up all the time, so this is a useful property. For conditional models, it enables us to apply Bayes’ rule to decompose the score function into an unconditional component, and a component that “mixes in” the conditioning information:

\[p(x \mid y) = \frac{p(y \mid x) \cdot p(x)}{p(y)}\] \[\implies \log p(x \mid y) = \log p(y \mid x) + \log p(x) - \log p(y)\] \[\implies \nabla_x \log p(x \mid y) = \nabla_x \log p(y \mid x) + \nabla_x \log p(x) ,\]

where we have used that \(\nabla_x \log p(y) = 0\). In other words, we can obtain the conditional score function as simply the sum of the unconditional score function and a conditioning term. (Note that the conditioning term \(\nabla_x \log p(y \mid x)\) is not itself a score function, because the gradient is w.r.t. \(x\), not \(y\).)

Throughout this blog post, I have mostly ignored the time dependency of the distributions estimated by diffusion models. This saves me having to add extra conditioning variables and subscripts everywhere. In practice, diffusion models perform iterative denoising, and are therefore usually conditioned on the level of input noise at each step.

Classifier guidance

The first thing to notice is that \(p(y \mid x)\) is exactly what classifiers and other discriminative models try to fit: \(x\) is some high-dimensional input, and \(y\) is a target label. If we have a differentiable discriminative model that estimates \(p(y \mid x)\), then we can also easily obtain \(\nabla_x \log p(y \mid x)\). All we need to turn an unconditional diffusion model into a conditional one, is a classifier!

The observation that diffusion models can be conditioned post-hoc in this way was mentioned by Sohl-Dickstein et al.4 and Song et al.5, but Dhariwal and Nichol6 really drove this point home, and showed how classifier guidance can dramatically improve sample quality by enhancing the conditioning signal, even when used in combination with traditional conditional modelling. To achieve this, they scale the conditioning term by a factor:

\[\nabla_x \log p_\gamma(x \mid y) = \nabla_x \log p(x) + \gamma \nabla_x \log p(y \mid x) .\]

\(\gamma\) is called the guidance scale, and cranking it up beyond 1 has the effect of amplifying the influence of the conditioning signal. It is extremely effective, especially compared to e.g. the truncation trick for GANs7, which serves a similar purpose.
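In code, a classifier-guided sampling step might compute the score like this (a hedged sketch: diffusion_score and classifier_log_prob are placeholders for trained, noise-aware models, and the guidance scale is arbitrary):

import torch

def classifier_guided_score(diffusion_score, classifier_log_prob, x, y, guidance_scale=5.0):
    x = x.detach().requires_grad_(True)
    log_p_y_given_x = classifier_log_prob(x, y).sum()
    grad = torch.autograd.grad(log_p_y_given_x, x)[0]     # the conditioning term
    # Unconditional score plus the scaled gradient of the classifier's log-probability.
    return diffusion_score(x.detach()) + guidance_scale * grad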

Samples from an unconditional diffusion model with classifier guidance, for guidance scales 1.0 (left) and 10.0 (right), taken from Dhariwal & Nichol (2021).

If we revert the gradient and the logarithm operations that we used to go from Bayes’ rule to classifier guidance, it’s easier to see what’s going on:

\[p_\gamma(x \mid y) \propto p(x) \cdot p(y \mid x)^\gamma .\]

We are raising the conditional part of the distribution to a power, which corresponds to tuning the temperature of that distribution: \(\gamma\) is an inverse temperature parameter. If \(\gamma > 1\), this sharpens the distribution and focuses it onto its modes, by shifting probability mass from the least likely to the most likely values (i.e. the temperature is lowered). Classifier guidance allows us to apply this temperature tuning only to the part of the distribution that captures the influence of the conditioning signal.

In language modelling, it is now commonplace to train a powerful unconditional language model once, and then adapt it to downstream tasks as needed (via few-shot learning or finetuning). Superficially, it would seem that classifier guidance enables the same thing for image generation: one could train a powerful unconditional model, then condition it as needed at test time using a separate classifier.

Unfortunately there are a few snags that make this impractical. Most importantly, because diffusion models operate by gradually denoising inputs, any classifier used for guidance also needs to be able to cope with high noise levels, so that it can provide a useful signal all the way through the sampling process. This usually requires training a bespoke classifier specifically for the purpose of guidance, and at that point, it might be easier to train a traditional conditional generative model end-to-end (or at least finetune an unconditional model to incorporate the conditioning signal).

But even if we have a noise-robust classifier on hand, classifier guidance is inherently limited in its effectiveness: most of the information in the input \(x\) is not relevant to predicting \(y\), and as a result, taking the gradient of the classifier w.r.t. its input can yield arbitrary (and even adversarial) directions in input space.

Classifier-free guidance

This is where classifier-free guidance1 comes in. As the name implies, it does not require training a separate classifier. Instead, one trains a conditional diffusion model \(p(x \mid y)\), with conditioning dropout: some percentage of the time, the conditioning information \(y\) is removed (10-20% tends to work well). In practice, it is often replaced with a special input value representing the absence of conditioning information. The resulting model is now able to function both as a conditional model \(p(x \mid y)\), and as an unconditional model \(p(x)\), depending on whether the conditioning signal is provided. One might think that this comes at a cost to conditional modelling performance, but the effect seems to be negligible in practice.
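Conditioning dropout itself is only a couple of lines. A minimal sketch, assuming class-label conditioning and a reserved "null" label (both the null value and the drop rate are illustrative):

import torch

NULL_CLASS = 0  # assumed special value representing "no conditioning"

def drop_conditioning(y, drop_prob=0.1):
    # Replace the conditioning signal with the null value some fraction of the time.
    drop = torch.rand(y.shape[0], device=y.device) < drop_prob
    return torch.where(drop, torch.full_like(y, NULL_CLASS), y)

# During training: model(x_noisy, sigma, drop_conditioning(y)).
# At sampling time, the same network is queried both with y and with NULL_CLASS.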

What does this buy us? Recall Bayes’ rule from before, but let’s apply it in the other direction:

\[p(y \mid x) = \frac{p(x \mid y) \cdot p(y)}{p(x)}\] \[\implies \log p(y \mid x) = \log p(x \mid y) + \log p(y) - \log p(x)\] \[\implies \nabla_x \log p(y \mid x) = \nabla_x \log p(x \mid y) - \nabla_x \log p(x) .\]

We have expressed the conditioning term as a function of the conditional and unconditional score functions, both of which our diffusion model provides. We can now substitute this into the formula for classifier guidance:

\[\nabla_x \log p_\gamma(x \mid y) = \nabla_x \log p(x) + \gamma \left( \nabla_x \log p(x \mid y) - \nabla_x \log p(x) \right),\]

or equivalently:

\[\nabla_x \log p_\gamma(x \mid y) = (1 - \gamma) \nabla_x \log p(x) + \gamma \nabla_x \log p(x \mid y) .\]

This is a barycentric combination of the conditional and the unconditional score function. For \(\gamma = 0\), we recover the unconditional model, and for \(\gamma = 1\) we get the standard conditional model. But \(\gamma > 1\) is where the magic happens. Below are some examples from OpenAI’s GLIDE model8, obtained using classifier-free guidance.

Two sets of samples from OpenAI's GLIDE model, for the prompt 'A stained glass window of a panda eating bamboo.', taken from their paper. Guidance scale 1 (no guidance) on the left, guidance scale 3 on the right.
Two sets of samples from OpenAI's GLIDE model, for the prompt 'A cozy living room with a painting of a corgi on the wall above a couch and a round coffee table in front of a couch and a vase of flowers on a coffee table.', taken from their paper. Guidance scale 1 (no guidance) on the left, guidance scale 3 on the right.
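For concreteness, the guided score used at each sampling step boils down to the following (a minimal sketch; model is a placeholder returning a score estimate, with y=None standing in for the dropped conditioning):

def cfg_score(model, x, y, guidance_scale=3.0):
    cond = model(x, y)        # conditional score estimate
    uncond = model(x, None)   # unconditional score estimate (conditioning dropped)
    # Barycentric combination; guidance_scale = 1 recovers the conditional model.
    return uncond + guidance_scale * (cond - uncond)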

Why does this work so much better than classifier guidance? The main reason is that we’ve constructed the “classifier” from a generative model. Whereas standard classifiers can take shortcuts and ignore most of the input \(x\) while still obtaining competitive classification results, generative models are afforded no such luxury. This makes the resulting gradient much more robust. As a bonus, we only have to train a single (generative) model, and conditioning dropout is trivial to implement.

It is worth noting that there was only a very brief window of time between the publication of the classifier-free guidance idea, and OpenAI’s GLIDE model, which used it to great effect – so much so that the idea has sometimes been attributed to the latter! Simple yet powerful ideas tend to see rapid adoption. In terms of power-to-simplicity ratio, classifier-free guidance is up there with dropout9, in my opinion: a real game changer!

(In fact, the GLIDE paper says that they originally trained a text-conditional model, and applied conditioning dropout only in a finetuning phase. Perhaps there is a good reason to do it this way, but I rather suspect that this is simply because they decided to apply the idea to a model they had already trained before!)

Clearly, guidance represents a trade-off: it dramatically improves adherence to the conditioning signal, as well as overall sample quality, but at great cost to diversity. In conditional generative modelling, this is usually an acceptable trade-off, however: the conditioning signal often already captures most of the variability that we actually care about, and if we desire diversity, we can also simply modify the conditioning signal we provide.

Guidance for autoregressive models

Is guidance unique to diffusion models? On the face of it, not really. People have pointed out that you can do similar things with other model classes:

You can train autoregressive models with conditioning dropout just as easily, and then use two sets of logits produced with and without conditioning to construct classifier-free guided logits, just as we did before with score functions. Whether we apply this operation to log-probabilities or gradients of log-probabilities doesn’t really make a difference, because the gradient operator is linear.
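A hedged sketch of what that looks like for next-token logits (the model signature is assumed: it maps a token prefix and an optional conditioning signal to logits, with None standing in for dropped conditioning):

import torch.nn.functional as F

def guided_next_token_logits(model, tokens, cond, guidance_scale=2.0):
    cond_logits = model(tokens, cond)
    uncond_logits = model(tokens, None)
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    return F.log_softmax(logits, dim=-1)   # renormalise after the combination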

There is an important difference however: whereas the score function in a diffusion model represents the joint distribution across all components of \(x\), \(p(x \mid y)\), the logits produced by autoregressive models represent \(p(x_t \mid x_{<t}, y)\), the sequential conditional distributions. You can obtain a joint distribution \(p(x \mid y)\) from this by multiplying all the conditionals together:

\[p(x \mid y) = \prod_{t=1}^T p(x_t \mid x_{<t}, y),\]

but guidance on each of the factors of this product is not equivalent to applying it to the joint distribution, as one does in diffusion models:

\[p_\gamma(x \mid y) \neq \prod_{t=1}^T p_\gamma(x_t \mid x_{<t}, y).\]

To see this, let’s first expand the left hand side:

\[p_\gamma(x \mid y) = \frac{p(x) \cdot p(y \mid x)^\gamma}{\int p(x) \cdot p(y \mid x)^\gamma \mathrm{d} x},\]

from which we can divide out the unconditional distribution \(p(x)\) to obtain an input-dependent scale factor that adapts the probabilities based on the conditioning signal \(y\):

\[s_\gamma(x, y) := \frac{p(y \mid x)^\gamma}{\mathbb{E}_{p(x)}\left[ p(y \mid x)^\gamma \right]} .\]

Now we can do the same thing with the right hand side:

\[\prod_{t=1}^T p_\gamma(x_t \mid x_{<t}, y) = \prod_{t=1}^T \frac{p(x_t \mid x_{<t}) \cdot p(y \mid x_{\le t})^\gamma}{\int p(x_t \mid x_{<t}) \cdot p(y \mid x_{\le t})^\gamma \mathrm{d} x_t}\]

We can again factor out \(p(x)\) here:

\[\prod_{t=1}^T p_\gamma(x_t \mid x_{<t}, y) = p(x) \cdot \prod_{t=1}^T \frac{p(y \mid x_{\le t})^\gamma}{\int p(x_t \mid x_{<t}) \cdot p(y \mid x_{\le t})^\gamma \mathrm{d} x_t}.\]

The input-dependent scale factor is now:

\[s_\gamma'(x, y) := \prod_{t=1}^T \frac{p(y \mid x_{\le t})^\gamma}{ \mathbb{E}_{p(x_t \mid x_{<t})} \left[ p(y \mid x_{\le t})^\gamma \right] },\]

which is clearly not equivalent to \(s_\gamma(x, y)\). In other words, guidance on the sequential conditionals redistributes the probability mass in a different way than guidance on the joint distribution does.

I don’t think this has been extensively tested at this point, but my hunch is that diffusion guidance works so well precisely because we are able to apply it to the joint distribution, rather than to individual sequential conditional distributions. As of today, diffusion models are the only model class for which this approach is tractable (if there are others, I’d be very curious to learn about them, so please share in the comments!).

As an aside: if you have an autoregressive model where the underlying data can be treated as continuous (e.g. an autoregressive model of images like PixelCNN10 or an Image Transformer11), you can actually get gradients w.r.t. the input. This means you can get an efficient estimate of the score function \(\nabla_x \log p(x|y)\) and sample from the model using Langevin dynamics, so you could in theory apply classifier or classifier-free guidance to the joint distribution, in a way that’s equivalent to diffusion guidance!


Update / correction (May 29th)

@RiversHaveWings on Twitter pointed out that the distributions which we modify to apply guidance are \(p_t(x \mid y)\) (where \(t\) is the current timestep in the diffusion process), not \(p(x \mid y)\) (which is equivalent to \(p_0(x \mid y)\)). This is clearly a shortcoming of the notational shortcut I took throughout this blog post (i.e. making the time dependency implicit).

This calls into question my claim above that diffusion model guidance operates on the true joint distribution of the data – though it doesn’t change the fact that guidance does a different thing for autoregressive models and for diffusion models. As ever in deep learning, whether the difference is meaningful in practice will probably have to be established empirically, so it will be interesting to see if classifier-free guidance catches on for other model classes as well!


Temperature tuning for diffusion models

One thing people often do with autoregressive models is tune the temperature of the sequential conditional distributions. More intricate procedures to “shape” these distributions are also popular: top-k sampling, nucleus sampling12 and typical sampling13 are the main contenders. They are harder to generalise to high-dimensional distributions, so I won’t consider them here.

Can we tune the temperature of a diffusion model? Sure: instead of factorising \(p(x \mid y)\) and only modifying the conditional component, we can just raise the whole thing to the \(\gamma\)‘th power simply by multiplying the score function with \(\gamma\). Unfortunately, this invariably yields terrible results. While tuning temperatures of the sequential conditionals in autoregressive models works quite well, and often yields better results, tuning the temperature of the joint distribution seems to be pretty much useless (let me know in the comments if your experience differs!).
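In code, this whole-distribution temperature tweak really is a one-liner (a sketch for illustration only, given how poorly it works in practice):

def tempered_score(score_fn, gamma):
    # Raising p(x | y) to the power gamma multiplies its score by gamma.
    return lambda x: gamma * score_fn(x)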

Just as with guidance, this is because changing the temperature of the sequential conditionals is not the same as changing the temperature of the joint distribution. Working this out is left as an exercise to the reader :)

Note that they do become equivalent when all \(x_t\) are independent (i.e. \(p(x_t \mid x_{<t}) = p(x_t)\)), but if that is the case, using an autoregressive model kind of defeats the point!

Closing thoughts

Guidance is far from the only reason why diffusion models work so well for images: the standard loss function for diffusion de-emphasises low noise levels, relative to the likelihood loss14. As I mentioned in my previous blog post, noise levels and image feature scales are closely tied together, and the result is that diffusion models pay less attention to high-frequency content that isn’t visually salient to humans anyway, enabling them to use their capacity more efficiently.

That said, I think guidance is probably the main driver behind the spectacular results we’ve seen over the course of the past six months. I believe guidance constitutes a real step change in our ability to generate perceptual signals, going far beyond the steady progress of the last few years that this domain has seen. It is striking that the state-of-the-art models in this domain are able to do what they do, while still being one to two orders of magnitude smaller than state-of-the-art language models in terms of parameter count.

I also believe we’ve only scratched the surface of what’s possible with diffusion models’ steerable sampling process. Dynamic thresholding, introduced this week in the Imagen paper3, is another simple guidance-enhancing trick to add to our arsenal, and I think there are many more such tricks to be discovered (as well as more elaborate schemes). Guidance seems like it might also enable a kind of “arithmetic” in the image domain like we’ve seen with word embeddings.

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{dieleman2022guidance,
  author = {Dieleman, Sander},
  title = {Guidance: a cheat code for diffusion models},
  url = {https://benanne.github.io/2022/05/26/guidance.html},
  year = {2022}
}

Acknowledgements

Thanks to my colleagues at DeepMind for various discussions, which continue to shape my thoughts on this topic!

References

  1. Ho, Salimans, “Classifier-Free Diffusion Guidance”, NeurIPS workshop on DGMs and Applications, 2021. 

  2. Ramesh, Dhariwal, Nichol, Chu, Chen, “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv, 2022. 

  3. Saharia, Chan, Saxena, Li, Whang, Ho, Fleet, Norouzi et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv, 2022. 

  4. Sohl-Dickstein, Weiss, Maheswaranathan and Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, International Conference on Machine Learning, 2015. 

  5. Song, Sohl-Dickstein, Kingma, Kumar, Ermon and Poole, “Score-Based Generative Modeling through Stochastic Differential Equations”, International Conference on Learning Representations, 2021. 

  6. Dhariwal, Nichol, “Diffusion Models Beat GANs on Image Synthesis”, Neural Information Processing Systems, 2021. 

  7. Brock, Donahue, Simonyan, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, International Conference on Learning Representations, 2019. 

  8. Nichol, Dhariwal, Ramesh, Shyam, Mishkin, McGrew, Sutskever, Chen, “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, arXiv, 2021. 

  9. Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, 2014. 

  10. Van den Oord, Kalchbrenner, Kavukcuoglu, “Pixel Recurrent Neural Networks”, International Conference on Machine Learning, 2016. 

  11. Parmar, Vaswani, Uszkoreit, Kaiser, Shazeer, Ku, Tran, “Image Transformer”, International Conference on Machine Learning, 2018. 

  12. Holtzman, Buys, Du, Forbes, Choi, “The Curious Case of Neural Text Degeneration”, International Conference on Learning Representations, 2020. 

  13. Meister, Pimentel, Wiher, Cotterell, “Typical Decoding for Natural Language Generation”, arXiv, 2022. 

  14. Song, Durkan, Murray, Ermon, “Maximum Likelihood Training of Score-Based Diffusion Models”, Neural Information Processing Systems, 2021 

Diffusion models are autoencoders

Diffusion models took off like a rocket at the end of 2019, after the publication of Song & Ermon’s seminal paper. In this blog post, I highlight a connection to another type of model: the venerable autoencoder.

Diffusion models

Diffusion models are fast becoming the go-to model for any task that requires producing perceptual signals, such as images and sound. They provide similar fidelity as alternatives based on generative adversarial nets (GANs) or autoregressive models, but with much better mode coverage than the former, and a faster and more flexible sampling procedure compared to the latter.

In a nutshell, diffusion models are constructed by first describing a procedure for gradually turning data into noise, and then training a neural network that learns to invert this procedure step-by-step. Each of these steps consists of taking a noisy input and making it slightly less noisy, by filling in some of the information obscured by the noise. If you start from pure noise and do this enough times, it turns out you can generate data this way!

Diffusion models have been around for a while1, but really took off at the end of 20192. The ideas are young enough that the field hasn’t really settled on one particular convention or paradigm to describe them, which means almost every paper uses a slightly different framing, and often a different notation as well. This can make it quite challenging to see the bigger picture when trawling through the literature, of which there is already a lot! Diffusion models go by many names: denoising diffusion probabilistic models (DDPMs)3, score-based generative models, or generative diffusion processes, among others. Some people just call them energy-based models (EBMs), of which they technically are a special case.

My personal favourite perspective starts from the idea of score matching4 and uses a formalism based on stochastic differential equations (SDEs)5. For an in-depth treatment of diffusion models from this perspective, I strongly recommend Yang Song’s richly illustrated blog post (which also comes with code and colabs). It is especially enlightening with regards to the connection between all these different perspectives. If you are familiar with variational autoencoders, you may find Lilian Weng or Jakub Tomczak’s takes on this model family more approachable.

If you are curious about generative modelling in general, section 3 of my blog post on generating music in the waveform domain contains a brief overview of some of the most important concepts and model flavours.

Denoising autoencoders

Autoencoders are neural networks that are trained to predict their input. In and of itself, this is a trivial and meaningless task, but it becomes much more interesting when the network architecture is restricted in some way, or when the input is corrupted and the network has to learn to undo this corruption.

A typical architectural restriction is to introduce some sort of bottleneck, which limits the amount of information that can pass through. This implies that the network must learn to encode the most important information efficiently to be able to pass it through the bottleneck, in order to be able to accurately reconstruct the input. Such a bottleneck can be created by reducing the capacity of a particular layer of the network, by introducing quantisation (as in VQ-VAEs6) or by applying some form of regularisation to it during training (as in VAEs7 8 or contractive autoencoders9). The internal representation used in this bottleneck (often referred to as the latent representation) is what we are really after. It should capture the essence of the input, while discarding a lot of irrelevant detail.

Corrupting the input is another viable strategy to make autoencoders learn useful representations. One could argue that models with corrupted input are not autoencoders in the strictest sense, because the input and target output differ, but this is really a semantic discussion – one could just as well consider the corruption procedure part of the model itself. In practice, such models are typically referred to as denoising autoencoders.

Denoising autoencoders were actually some of the first true “deep learning” models: back when we hadn’t yet figured out how to reliably train neural networks deeper than a few layers with simple gradient descent, the prevalent approach was to pre-train networks layer by layer, and denoising autoencoders were frequently used for this purpose10 (especially by Yoshua Bengio and colleagues at MILA – restricted Boltzmann machines were another option, favoured by Geoffrey Hinton and colleagues).

One and the same?

So what is the link between modern diffusion models and these – by deep learning standards – ancient autoencoders? I was inspired to ponder this connection a bit more after seeing some recent tweets speculating about autoencoders making a comeback.

As far as I’m concerned, the autoencoder comeback is already in full swing, it’s just that we call them diffusion models now! Let’s unpack this.

The neural network that makes diffusion models tick is trained to estimate the so-called score function, \(\nabla_\mathbf{x} \log p(\mathbf{x})\), the gradient of the log-likelihood w.r.t. the input (a vector-valued function): \(\mathbf{s}_\theta (\mathbf{x}) = \nabla_\mathbf{x} \log p_\theta(\mathbf{x})\). Note that this is different from \(\nabla_\theta \log p_\theta(\mathbf{x})\), the gradient w.r.t. the model parameters \(\theta\), which is the one you would use for training if this were a likelihood-based model. The latter tells you how to change the model parameters to increase the likelihood of the input under the model, whereas the former tells you how to change the input itself to increase its likelihood. (This is the same gradient you would use for DeepDream-style manipulation of images.)

In practice, we want to use the same network at every point in the gradual denoising process, i.e. at every noise level (from pure noise all the way to clean data). To account for this, it takes an additional input \(t \in [0, 1]\) which indicates how far along we are in the denoising process: \(\mathbf{s}_\theta (\mathbf{x}_t, t) = \nabla_{\mathbf{x}_t} \log p_\theta(\mathbf{x}_t)\). By convention, \(t = 0\) corresponds to clean data and \(t = 1\) corresponds to pure noise, so we actually “go back in time” when denoising.

The way you train this network is by taking inputs \(\mathbf{x}\) and corrupting them with additive noise \(\mathbf{\varepsilon}_t \sim \mathcal{N}(0, \sigma_t^2)\), and then predicting \(\mathbf{\varepsilon}_t\) from \(\mathbf{x}_t = \mathbf{x} + \mathbf{\varepsilon}_t\). The reason why this works is not entirely obvious. I recommend reading Pascal Vincent’s 2010 tech report on the subject11 for an in-depth explanation of why you can do this.

Note that the variance depends on \(t\), because it corresponds to the specific noise level at time \(t\). The loss function is typically just mean squared error, sometimes weighted by a scale factor \(\lambda(t)\), so that some noise levels are prioritised over others:

\[\arg\min_\theta \mathcal{L}_\theta = \arg\min_\theta \mathbb{E}_{t,p(\mathbf{x}_t)} \left[\lambda(t) ||\mathbf{s}_\theta (\mathbf{x} + \mathbf{\varepsilon}_t, t) - \mathbf{\varepsilon}_t||_2^2\right] .\]

Going forward, let’s assume \(\lambda(t) \equiv 1\), which is usually what is done in practice anyway (though other choices have their uses as well12).

One key observation is that predicting \(\mathbf{\varepsilon}_t\) and predicting \(\mathbf{x}\) are equivalent, so instead, we could just use

\[\arg\min_\theta \mathbb{E}_{t,p(\mathbf{x}_t)} \left[||\mathbf{s}_\theta' (\mathbf{x} + \mathbf{\varepsilon}_t, t) - \mathbf{x}||_2^2\right] .\]

To see that they are equivalent, consider taking a trained model \(\mathbf{s}_\theta\) that predicts \(\mathbf{\varepsilon}_t\) and add a new residual connection to it, going all the way from the input to the output, with a scale factor of \(-1\). This modified model then predicts:

\[\mathbf{\varepsilon}_t - \mathbf{x}_t = \mathbf{\varepsilon}_t - (\mathbf{x} + \mathbf{\varepsilon}_t) = - \mathbf{x} .\]

In other words, we obtain a denoising autoencoder (up to a minus sign). This might seem surprising, but intuitively, it actually makes sense that to increase the likelihood of a noisy input, you should probably just try to remove the noise, because noise is inherently unpredictable. Indeed, it turns out that these two things are equivalent.
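A tiny numerical check of this argument, using an oracle that predicts the noise perfectly (purely illustrative):

import torch

x = torch.randn(8)                      # clean input
eps = torch.randn(8)                    # additive noise
x_noisy = x + eps

eps_pred = eps                          # pretend the network predicts the noise exactly
x_pred = eps_pred - x_noisy             # residual connection from input to output, scale -1
assert torch.allclose(x_pred, -x)       # we recover the clean input, up to a minus sign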

A tenuous connection?

Of course, the title of this blog post is intentionally a bit facetious: while there is a deeper connection between diffusion models and autoencoders than many people realise, the models have completely different purposes and so are not interchangeable.

There are two key differences with the denoising autoencoders of yore:

  • the additional input \(t\) makes one single model able to handle many different noise levels with a single set of shared parameters;
  • we care about the output of the model, not the internal latent representation, so there is no need for a bottleneck. In fact, it would probably do more harm than good.

In the strictest sense, both of these differences have no bearing on whether the model can be considered an autoencoder or not. In practice, however, the point of an autoencoder is usually understood to be to learn a useful latent representation, so saying that diffusion models are autoencoders could perhaps be considered a bit… pedantic. Nevertheless, I wanted to highlight this connection because I think many more people know the ins and outs of autoencoders than diffusion models at this point. I believe that appreciating the link between the two can make the latter less daunting to understand.

This link is not merely a curiosity, by the way; it has also been the subject of several papers, which constitute an early exploration of the ideas that power modern diffusion models. Apart from the work by Pascal Vincent mentioned earlier11, there is also a series of papers by Guillaume Alain and colleagues13 that14 are15 worth16 checking17 out18!

[Note that there is another way to connect diffusion models to autoencoders, by viewing them as (potentially infinitely) deep latent variable models. I am personally less interested in that connection because it doesn’t provide me with much additional insight, but it is just as valid. Here’s a blog post by Angus Turner that explores this interpretation in detail.]

Noise and scale

A noisy image of a mountain range, with the level of noise gradually decreasing from left to right.

I believe the idea of training a single model to handle many different noise levels with shared parameters is ultimately the key ingredient that made diffusion models really take off. Song & Ermon2 called them noise-conditional score networks (NCSNs) and provide a very lucid explanation of why this is important, which I won’t repeat here.

The idea of using different noise levels in a single denoising autoencoder had previously been explored for representation learning, but not for generative modelling. Several works suggest gradually decreasing the level of noise over the course of training to improve the learnt representations19 20 21. Composite denoising autoencoders22 have multiple subnetworks that handle different noise levels, which is a step closer to the score networks that we use in diffusion models, though still missing the parameter sharing.

A particularly interesting observation stemming from these works, which is also highly relevant to diffusion models, is that representations learnt using different noise levels tend to correspond to different scales of features: the higher the noise level, the larger-scale the features that are captured. I think this connection is worth investigating further: it implies that diffusion models fill in missing parts of the input at progressively smaller scales, as the noise level decreases step by step. This does seem to be the case in practice, and it is potentially a useful feature. Concretely, it means that \(\lambda(t)\) can be designed to prioritise the modelling of particular feature scales! This is great, because excessive attention to detail is actually a major problem with likelihood-based models (I’ve previously discussed this in more detail in section 6 of my blog post about typicality).

This connection between noise levels and feature scales was initially baffling to me: the noise \(\mathbf{\varepsilon}_t\) that we add to the input during training is isotropic Gaussian, so we are effectively adding noise to each input element (e.g. pixel) independently. If that is the case, how can the level of noise (i.e. the variance) possibly impact the scale of the features that are learnt? I found it helpful to think of it this way:

  • Let’s say we are working with images. Each pixel in an image that could be part of a particular feature (e.g. a human face) provides evidence for the presence (or absence) of that feature.
  • When looking at an image, we implicitly aggregate the evidence provided by all the pixels to determine which features are present (e.g. whether there is a face in the image or not).
  • Larger-scale features in the image will cover a larger proportion of pixels. Therefore, if a larger-scale feature is present in an image, there is more evidence pointing towards that feature.
  • Even if we add noise with a very high variance, that evidence will still be apparent, because when combining information from all pixels, we average out the noise.
  • If more pixels are involved in this process, the tolerable noise level increases: the maximal variance at which the noise can still be cancelled out is much higher. For smaller-scale features, however, recovery becomes impossible, because the noise dominates when we can only aggregate information from a smaller set of pixels.

Concretely, if an image contains a human face and we add a lot of noise to it, we will probably no longer be able to discern the face if it is far away from the camera (i.e. covers fewer pixels in the image), whereas if it is close to the camera, we might still see a faint outline. The header image of this section provides another example: the level of noise decreases from left to right. On the very left, we can still see the rough outline of a mountain despite very high levels of noise.

This is completely handwavy, but it provides some intuition for why there is a direct correspondence between the variance of the noise and the scale of features captured by denoising autoencoders and score networks.
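
To make the intuition slightly more tangible, here is a minimal numpy sketch (the feature sizes, noise level and detection threshold are arbitrary choices of mine, not anything from the analysis above): a ‘feature’ of constant intensity covering a given number of pixels is corrupted with isotropic Gaussian noise, and we check how often it can still be detected by averaging over the pixels it covers.

import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0  # standard deviation of the isotropic Gaussian noise

for num_pixels in [4, 64, 1024]:
    # A 'feature' of constant intensity 1 covering num_pixels pixels,
    # corrupted by independent Gaussian noise on every pixel.
    noisy = 1.0 + rng.normal(0, sigma, (10000, num_pixels))
    # Aggregate the evidence by averaging over the pixels the feature covers;
    # the averaged noise has standard deviation sigma / sqrt(num_pixels).
    detected = np.mean(noisy.mean(axis=-1) > 0.5)
    print(f'{num_pixels:4d} pixels: noise std after averaging = '
          f'{sigma / np.sqrt(num_pixels):.3f}, '
          f'detected {100 * detected:.0f}% of the time')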

Closing thoughts

So there you have it: diffusion models are autoencoders. Sort of. When you squint a bit. Here are some key takeaways, to wrap up:

  • Learning to predict the score function \(\nabla_\mathbf{x} \log p(\mathbf{x})\) of a distribution can be achieved by learning to denoise examples of that distribution. This is a core underlying idea that powers modern diffusion models.
  • Compared to denoising autoencoders, score networks in diffusion models can handle all noise levels with a single set of parameters, and do not have bottlenecks. But other than that, they do the same thing.
  • Noise levels and feature scales are closely linked: high noise levels lead to models capturing large-scale features, low noise levels lead to models focusing on fine-grained features.

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{dieleman2022diffusion,
  author = {Dieleman, Sander},
  title = {Diffusion models are autoencoders},
  url = {https://benanne.github.io/2022/01/31/diffusion.html},
  year = {2022}
}

Acknowledgements

Thanks to Conor Durkan and Katie Millican for fruitful discussions!

References

  1. Sohl-Dickstein, Weiss, Maheswaranathan and Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, International Conference on Machine Learning, 2015. 

  2. Song and Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution”, Neural Information Processing Systems, 2019. 

  3. Ho, Jain and Abbeel, “Denoising Diffusion Probabilistic Models”, Neural Information Processing Systems, 2020. 

  4. Hyvarinen, “Estimation of Non-Normalized Statistical Models by Score Matching”, Journal of Machine Learning Research, 2005. 

  5. Song, Sohl-Dickstein, Kingma, Kumar, Ermon and Poole, “Score-Based Generative Modeling through Stochastic Differential Equations”, International Conference on Learning Representations, 2021. 

  6. van den Oord, Vinyals and Kavukcuoglu, “Neural Discrete Representation Learning”, Neural Information Processing Systems, 2017. 

  7. Kingma and Welling, “Auto-Encoding Variational Bayes”, International Conference on Learning Representations, 2014. 

  8. Rezende, Mohamed and Wierstra, “Stochastic Backpropagation and Approximate Inference in Deep Generative Models”, International Conference on Machine Learning, 2014. 

  9. Rifai, Vincent, Muller, Glorot and Bengio, “Contractive Auto-Encoders: Explicit Invariance During Feature Extraction”, International Conference on Machine Learning, 2011. 

  10. Vincent, Larochelle, Lajoie, Bengio and Manzagol, “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion”, Journal of Machine Learning Research, 2010. 

  11. Vincent, “A Connection Between Score Matching and Denoising Autoencoders”, Technical report, 2010. 

  12. Song, Durkan, Murray and Ermon, “Maximum Likelihood Training of Score-Based Diffusion Models”, Neural Information Processing Systems, 2021. 

  13. Bengio, Alain and Rifai, “Implicit density estimation by local moment matching to sample from auto-encoders”, arXiv, 2012. 

  14. Alain, Bengio and Rifai, “Regularized auto-encoders estimate local statistics”, Neural Information Processing Systems, Deep Learning workshop, 2012. 

  15. Bengio, Yao, Alain and Vincent, “Generalized denoising auto-encoders as generative models”, Neural Information Processing Systems, 2013. 

  16. Alain and Bengio, “What regularized auto-encoders learn from the data-generating distribution”, Journal of Machine Learning Research, 2014. 

  17. Bengio, Laufer, Alain and Yosinski, “Deep generative stochastic networks trainable by backprop”, International Conference on Machine Learning, 2014. 

  18. Alain, Bengio, Yao, Yosinski, Laufer, Zhang and Vincent, “GSNs: generative stochastic networks”, Information and Inference, 2016. 

  19. Geras and Sutton, “Scheduled denoising autoencoders”, International Conference on Learning Representations, 2015. 

  20. Chandra and Sharma, “Adaptive noise schedule for denoising autoencoder”, Neural Information Processing Systems, 2014. 

  21. Zhang and Zhang, “Convolutional adaptive denoising autoencoders for hierarchical feature extraction”, Frontiers of Computer Science, 2018. 

  22. Geras and Sutton, “Composite denoising autoencoders”, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2016. 

Musings on typicality

If you’re training or sampling from generative models, typicality is a concept worth understanding. It sheds light on why beam search doesn’t work for autoregressive models of images, audio and video; why you can’t just threshold the likelihood to perform anomaly detection with generative models; and why high-dimensional Gaussians are “soap bubbles”. This post is a summary of my current thoughts on the topic.

First, some context: one of the reasons I’m writing this, is to structure my own thoughts about typicality and the unintuitive behaviour of high-dimensional probability distributions. Most of these thoughts have not been empirically validated, and several are highly speculative and could be wrong. Please bear this in mind when reading, and don’t hesitate to use the comments section to correct me. Another reason is to draw more attention to the concept, as I’ve personally found it extremely useful to gain insight into the behaviour of generative models, and to correct some of my flawed intuitions. I tweeted about typicality a few months ago, but as it turns out, I have a lot more to say on the topic!

As with most of my blog posts, I will assume a degree of familiarity with machine learning. For certain parts, some knowledge of generative modelling is probably useful as well. Section 3 of my previous blog post provides an overview of generative models.

Overview:

  1. The joys of likelihood
  2. Motivating examples
  3. Abstraction and the curse of dimensionality
  4. Typicality
  5. Typicality in the wild
  6. The right level of abstraction
  7. Closing thoughts
  8. Acknowledgements
  9. References

The joys of likelihood

When it comes to generative modelling, my personal preference for the likelihood-based paradigm is no secret (my recent foray into adversarial methods for text-to-speech notwithstanding). While there are many other ways to build and train models (e.g. using adversarial networks, score matching, optimal transport, quantile regression, … see my previous blog post for an overview), there is something intellectually pleasing about the simplicity of maximum likelihood training: the model explicitly parameterises a probability distribution, and we fit the parameters of that distribution so it is able to explain the observed data as well as possible (i.e., assigns to it the highest possible likelihood).

It turns out that this is far from the whole story, and ‘higher likelihood’ doesn’t always mean ‘better’ in a way that we actually care about. In fact, the way likelihood behaves in relation to the quality of a model as measured by humans (e.g. by inspecting samples) can be deeply unintuitive. This has been well-known in the machine learning community for some time, and Theis et al.’s ‘A note on the evaluation of generative models’1 does an excellent job of demonstrating this with clever thought experiments and concrete examples. In what follows, I will expound on what I think is going on when likelihoods disagree with our intuitions.

One particular way in which a higher likelihood can correspond to a worse model is through overfitting on the training set. Because overfitting is ubiquitous in machine learning research, the unintuitive behaviours of likelihood are often incorrectly ascribed to this phenomenon. In this post, I will assume that overfitting is not an issue, and that we are talking about properly regularised models trained on large enough datasets.

Motivating examples

Unfair coin flips

Jessica Yung has a great blog post that demonstrates how even the simplest of probability distributions start behaving in unintuitive ways in higher-dimensional spaces, and she links this to the concept of typicality. I will borrow her example here and expand on it a bit, but I recommend reading the original post.

To summarise: suppose you have an unfair coin that lands on heads 3 times out of 4. If you toss this coin 16 times, you would expect to see 12 heads (H) and 4 tails (T) on average. Of course you wouldn’t expect to see exactly 12 heads and 4 tails every time: there’s a pretty good chance you’d see 13 heads and 3 tails, or 11 heads and 5 tails. Seeing 16 heads and no tails would be quite surprising, but it’s not implausible: in fact, it will happen about 1% of the time. Seeing all tails seems like it would be a miracle. Nevertheless, each coin toss is independent, so even this has a non-zero probability of being observed.

When we count the number of heads and tails in the observed sequence, we’re looking at the binomial distribution. We’ve made the implicit assumption that what we care about is the frequency of occurrence of both outcomes, and not the order in which they occur. We’ve made abstraction of the order, and we are effectively treating the sequences as unordered sets, so that HTHHTHHHHTTHHHHH and HHHHHTHTHHHTHTHH are basically the same thing. That is often desirable, but it’s worth being aware of such assumptions, and making them explicit.

If we do not ignore the order, and ask which sequence is the most likely, the answer is ‘all heads’. That may seem surprising at first, because seeing only heads is a relatively rare occurrence. But note that we’re asking a different question here, about the ordered sequences themselves, rather than about their statistics. While the difference is pretty clear here, the implicit assumptions and abstractions that we tend to use in our reasoning are often more subtle.

The table and figure below show how the probability of observing a given number of heads and tails can be found by multiplying the probability of a particular sequence with the number of such sequences. Note that ‘all heads’ has the highest probability out of all sequences (bolded), but there is only a single such sequence. The most likely number of heads we’ll observe is 12 (also bolded): even though each individual sequence with 12 heads is less likely, there are a lot more of them, and this second factor ends up dominating.

| #H | #T | p(sequence) | # sequences | p(#H, #T) |
| --- | --- | --- | --- | --- |
| 0 | 16 | \(\left(\frac{3}{4}\right)^0 \left(\frac{1}{4}\right)^{16} = 2.33 \cdot 10^{-10}\) | 1 | \(2.33\cdot 10^{-10}\) |
| 1 | 15 | \(\left(\frac{3}{4}\right)^1 \left(\frac{1}{4}\right)^{15} = 6.98 \cdot 10^{-10}\) | 16 | \(1.12\cdot 10^{-8}\) |
| 2 | 14 | \(\left(\frac{3}{4}\right)^2 \left(\frac{1}{4}\right)^{14} = 2.10 \cdot 10^{-9}\) | 120 | \(2.51\cdot 10^{-7}\) |
| 3 | 13 | \(\left(\frac{3}{4}\right)^3 \left(\frac{1}{4}\right)^{13} = 6.29 \cdot 10^{-9}\) | 560 | \(3.52\cdot 10^{-6}\) |
| 4 | 12 | \(\left(\frac{3}{4}\right)^4 \left(\frac{1}{4}\right)^{12} = 1.89 \cdot 10^{-8}\) | 1820 | \(3.43\cdot 10^{-5}\) |
| 5 | 11 | \(\left(\frac{3}{4}\right)^5 \left(\frac{1}{4}\right)^{11} = 5.66 \cdot 10^{-8}\) | 4368 | \(2.47\cdot 10^{-4}\) |
| 6 | 10 | \(\left(\frac{3}{4}\right)^6 \left(\frac{1}{4}\right)^{10} = 1.70 \cdot 10^{-7}\) | 8008 | \(1.36\cdot 10^{-3}\) |
| 7 | 9 | \(\left(\frac{3}{4}\right)^7 \left(\frac{1}{4}\right)^9 = 5.09 \cdot 10^{-7}\) | 11440 | \(5.83\cdot 10^{-3}\) |
| 8 | 8 | \(\left(\frac{3}{4}\right)^8 \left(\frac{1}{4}\right)^8 = 1.53 \cdot 10^{-6}\) | 12870 | \(1.97\cdot 10^{-2}\) |
| 9 | 7 | \(\left(\frac{3}{4}\right)^9 \left(\frac{1}{4}\right)^7 = 4.58 \cdot 10^{-6}\) | 11440 | \(5.24\cdot 10^{-2}\) |
| 10 | 6 | \(\left(\frac{3}{4}\right)^{10} \left(\frac{1}{4}\right)^6 = 1.37 \cdot 10^{-5}\) | 8008 | \(1.10\cdot 10^{-1}\) |
| 11 | 5 | \(\left(\frac{3}{4}\right)^{11} \left(\frac{1}{4}\right)^5 = 4.12 \cdot 10^{-5}\) | 4368 | \(1.80\cdot 10^{-1}\) |
| 12 | 4 | \(\left(\frac{3}{4}\right)^{12} \left(\frac{1}{4}\right)^4 = 1.24 \cdot 10^{-4}\) | 1820 | \(\mathbf{2.25\cdot 10^{-1}}\) |
| 13 | 3 | \(\left(\frac{3}{4}\right)^{13} \left(\frac{1}{4}\right)^3 = 3.71 \cdot 10^{-4}\) | 560 | \(2.08\cdot 10^{-1}\) |
| 14 | 2 | \(\left(\frac{3}{4}\right)^{14} \left(\frac{1}{4}\right)^2 = 1.11 \cdot 10^{-3}\) | 120 | \(1.34\cdot 10^{-1}\) |
| 15 | 1 | \(\left(\frac{3}{4}\right)^{15} \left(\frac{1}{4}\right)^1 = 3.33 \cdot 10^{-3}\) | 16 | \(5.35\cdot 10^{-2}\) |
| 16 | 0 | \(\left(\frac{3}{4}\right)^{16} \left(\frac{1}{4}\right)^0 = \mathbf{1.00 \cdot 10^{-2}}\) | 1 | \(1.00\cdot 10^{-2}\) |

import matplotlib.pyplot as plt
import numpy as np
import scipy.special

h = np.arange(16 + 1)  # number of heads, from 0 to 16
# Probability of a single sequence with h heads, the number of such
# sequences, and their product: the probability of observing h heads.
p_sequence = (3/4)**h * (1/4)**(16 - h)
num_sequences = scipy.special.comb(16, h)
p_heads_count = p_sequence * num_sequences

plt.figure(figsize=(9, 3))
plt.plot(h, p_sequence, 'C0-s',
         label='probability of a single sequence with this number of heads')
plt.plot(h, p_heads_count, 'C1-o',
         label='probability of observing this number of heads')
plt.yscale('log')
plt.xlabel('number of heads')
plt.ylabel('probability')
plt.legend()
Probabilities of observing a particular sequence with a given number of heads, and of observing a given number of heads.

Gaussian soap bubbles

Another excellent blog post about the unintuitive behaviour of high-dimensional probability distributions is Ferenc Huszar’s ‘Gaussian Distributions are Soap Bubbles’. A one-dimensional Gaussian looks like a bell curve: a big bump around the mode, with a tail on either side. Clearly, the bulk of the total probability mass is clumped together around the mode. In higher-dimensional spaces, this shape changes completely: the bulk of the probability mass of a spherical Gaussian distribution with unit variance in \(K\) dimensions is concentrated in a thin ‘shell’ at radius \(\sqrt{K}\). This is known as the Gaussian annulus theorem.

For example, if we sample lots of vectors from a 100-dimensional standard Gaussian, and measure their radii, we will find that just over 84% of them are between 9 and 11, and more than 99% are between 8 and 12. Only about 0.2% have a radius smaller than 8!
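
These percentages are easy to verify empirically. Here is a quick numpy check (the sample size is an arbitrary choice of mine):

import numpy as np

rng = np.random.default_rng(0)
# 100,000 samples from a 100-dimensional standard Gaussian.
samples = rng.normal(0, 1, (100000, 100))
radii = np.sqrt(np.sum(samples**2, axis=-1))

print(np.mean((radii > 9) & (radii < 11)))  # ~0.84
print(np.mean((radii > 8) & (radii < 12)))  # ~0.997
print(np.mean(radii < 8))                   # ~0.002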

Ferenc points out an interesting implication: high-dimensional Gaussians are very similar to uniform distributions on the sphere. This clearly isn’t true for the one-dimensional case, but it turns out that’s an exception, not the rule. Stefan Stein also discusses this implication in more detail in a recent blog post.

Where our intuition can go wrong here is that we might underestimate how quickly a high-dimensional space grows in size as we move further away from the mode. Because of the radial symmetry of the distribution, we tend to think of all points at a given distance from the mode as similar, and we implicitly group them into sets of concentric spheres. This allows us to revert back to reasoning in one dimension, which we are more comfortable with: we think of a high-dimensional Gaussian as a distribution over these sets, rather than over individual points. What we tend to overlook is that those sets differ wildly in size: as we move away from the mode, they grow larger very quickly. Note that this does not happen at all in 1D!

Abstraction and the curse of dimensionality

The curse of dimensionality is a catch-all term for various phenomena that appear very different and often counterintuitive in high-dimensional spaces. It is used to highlight poor scaling behaviour of ideas and algorithms, where one wouldn’t necessarily expect it. In the context of machine learning, it is usually used in a more narrow sense, to refer to the fact that models of high-dimensional data tend to require very large training datasets to be effective. But the curse of dimensionality manifests itself in many forms, and the unintuitive behaviour of high-dimensional probability distributions is just one of them.

In general, humans have lousy intuitions about high-dimensional spaces. But what exactly is going on when we get things wrong about high-dimensional distributions? In both of the motivating examples, the intuition breaks down in a similar way: if we’re not careful, we might implicitly reason about the probabilities of sets, rather than individual points, without taking into account their relative sizes, and arrive at the wrong answer. This means that we can encounter this issue for both discrete and continuous distributions.

We can generalise this idea of grouping points into sets of similar points, by thinking of it as ‘abstraction’: rather than treating each point as a separate entity, we think of it as an instance of a particular concept, and ignore its idiosyncrasies. When we think of ‘sand’, we are rarely concerned about the characteristics of each individual grain. Similarly, in the ‘unfair coin flips’ example, we group sequences by their number of heads and tails, ignoring their order. In the case of the high-dimensional Gaussian, the natural grouping of points is based on their Euclidean distance from the mode. A more high-level example is that of natural images, where individual pixel values across localised regions of the image combine to form edges, textures, or even objects. There are usually many combinations of pixel values that give rise to the same texture, and we aren’t able to visually distinguish these particular instances unless we carefully study them side by side.

The following is perhaps a bit of an unfounded generalisation based on my own experience, but our brains seem hardwired to perform this kind of abstraction, so that we can reason about things in the familiar low-dimensional setting. It seems to happen unconsciously and continuously, and bypassing it requires a proactive approach.

Typicality

Informally, typicality refers to the characteristics that samples from a distribution tend to exhibit on average (in expectation). In the ‘unfair coin flip’ example, a sequence with 12 heads and 4 tails is ‘typical’. A sequence with 6 heads and 10 tails is highly atypical. Typical sequences contain an average amount of information: they are not particularly surprising or (un)informative.

We can formalise this intuition using the entropy of the distribution: a typical set \(\mathcal{T}_\varepsilon \subset \mathcal{X}\) is a set of sequences from \(\mathcal{X}\) whose probability is close to \(2^{-H}\), where \(H\) is the entropy of the distribution that the sequences were drawn from, measured in bits:

\[\mathcal{T}_\varepsilon = \{ \mathbf{x} \in \mathcal{X}: 2^{-(H + \varepsilon)} \leq p(\mathbf{x}) \leq 2^{-(H - \varepsilon)} \} .\]

This means that the negative log likelihood of each such sequence is close to the entropy. Note that a distribution doesn’t have just one typical set: we can define many typical sets based on how close the probability of the sequences contained therein should be to \(2^{-H}\), by choosing different values of \(\varepsilon > 0\).
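
For the unfair coin example, this is easy to check numerically: a sequence with 12 heads and 4 tails has a probability of exactly \(2^{-H}\), because 12 out of 16 is exactly the expected fraction of heads. Here is a quick sanity check of the numbers (nothing more than that):

import numpy as np

p = 3 / 4  # probability of heads

# Entropy of the distribution over length-16 sequences, in bits.
H = 16 * -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# A 'typical' sequence with 12 heads and 4 tails:
print(H, -np.log2(p**12 * (1 - p)**4))  # both ~12.98 bits

# The all-heads sequence is far more likely than 2^(-H):
print(-np.log2(p**16))  # ~6.64 bits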

This concept was originally defined in an information-theoretic context, but I want to focus on machine learning, where I feel it is somewhat undervalued. It is often framed in terms of sequences sampled from stationary ergodic processes, but it is useful more generally for distributions of any kind of high-dimensional data points, both continuous and discrete, regardless of whether we tend to think of them as sequences.

Why is this relevant to our discussion of abstraction and flawed human intuitions? As the dimensionality increases, the probability that any random sample from a distribution is part of a given typical set \(\mathcal{T}_\varepsilon\) tends towards 1. In other words, randomly drawn samples will almost always be ‘typical’, and the typical set covers most of the support of the distribution (this is a consequence of the so-called asymptotic equipartition property (AEP)). This happens even when \(\varepsilon\) is relatively small, as long as the dimensionality is high enough. This is visualised for a 100-dimensional standard Gaussian distribution below (based on empirical measurements, to avoid having to calculate some gnarly 100D integrals).

import matplotlib.pyplot as plt
import numpy as np

N = 1000000
K = 100
samples = np.random.normal(0, 1, (N, K))
radii = np.sqrt(np.sum(samples**2, axis=-1))
# The typical set condition 2^-(H + eps) <= p(x) <= 2^-(H - eps) translates,
# for the standard Gaussian, to r^2 lying in [K - eps*ln(4), K + eps*ln(4)].
epsilon = np.logspace(-1, 2, 200)
lo = np.sqrt(np.maximum(K - epsilon * np.log(4), 0))
hi = np.sqrt(K + epsilon * np.log(4))
radius_range = hi - lo
# Fraction of samples whose radius falls inside the annulus, for each epsilon.
mass = [np.mean((lo[i] < radii) & (radii < hi[i])) for i in range(len(epsilon))]

plt.figure(figsize=(9, 3))
plt.plot(radius_range, mass)
plt.xlabel('Difference between the min. and max. radii inside '
           '$\\mathcal{T}_\\varepsilon$ for given $\\varepsilon$')
plt.ylabel('Total probability mass in $\\mathcal{T}_\\varepsilon$')
The total probability mass of a range of typical sets of a 100-dimensional standard Gaussian distribution, with their size measured by the difference between the minimal and maximal radii within the set (i.e. the width of the Gaussian annulus). An annulus with width 4 already contains most of the probability mass.

But this is where it gets interesting: for unimodal high-dimensional distributions, such as the multivariate Gaussian, the mode (i.e. the most likely value) usually isn’t part of the typical set. More generally, individual samples from high-dimensional (and potentially multimodal) distributions that have an unusually high likelihood are not typical, so we wouldn’t expect to see them when sampling. This can seem paradoxical, because they are by definition very ‘likely’ samples — it’s just that there are so few of them! Think about how surprising it would be to randomly sample the zero vector (or something very close to it) from a 100-dimensional standard Gaussian distribution.

This has some important implications: if we want to learn more about what a high-dimensional distribution looks like, studying the most likely samples is usually a bad idea. If we want to obtain a good quality sample from a distribution, subject to constraints, we should not be trying to find the single most likely one. Yet in machine learning, these are things that we do on a regular basis. In the next section, I’ll discuss a few situations where this paradox comes up in practice. For a more mathematical treatment of typicality and the curse of dimensionality, check out this case study by Bob Carpenter.

Typicality in the wild

A significant body of literature, spanning several subfields of machine learning, has sought to interpret and/or mitigate the unintuitive ways in which high-dimensional probability distributions behave. In this section, I want to highlight a few interesting papers and discuss them in relation to the concept of typicality. Note that I’ve made a selection based on what I’ve read recently, and this is not intended to be a comprehensive overview of the literature. In fact, I would appreciate pointers to other related work (papers and blog posts) that I should take a look at!

Language modelling

In conditional language modelling tasks, such as machine translation or image captioning, it is common to use conditional autoregressive models in combination with heuristic decoding strategies such as beam search. The underlying idea is that we want to find the most likely sentence (i.e. the mode of the conditional distribution, ‘MAP decoding’), but since this is intractable, we’ll settle for an approximate result instead.

With typicality in mind, it’s clear that this isn’t necessarily the best idea. Indeed, researchers have found that machine translation results, measured using the BLEU metric, sometimes get worse when the beam width is increased2 3. A higher beam width gives a better, more computationally costly approximation to the mode, but not necessarily better translation results. In this case, it’s tempting to blame the metric itself, which obviously isn’t perfect, but this effect has also been observed with human ratings4, so that cannot be the whole story.

A recent paper by Eikema & Aziz5 provides an excellent review of recent work in this space, and makes a compelling argument for MAP decoding as the culprit behind many of the pathologies that neural machine translation systems exhibit (rather than their network architectures or training methodologies). They also propose an alternative decoding strategy called ‘minimum Bayes risk’ (MBR) decoding that takes into account the whole distribution, rather than only the mode.

In unconditional language modelling, beam search hasn’t caught on, but not for want of trying! Stochasticity of the result is often desirable in this setting, and the focus has been on sampling strategies instead. In ‘The Curious Case of Neural Text Degeneration’6, Holtzman et al. observe that maximising the probability leads to poor quality results that are often repetitive. Repetitive samples may not be typical, but they have high likelihoods simply because they are more predictable.

They compare a few different sampling strategies that interpolate between fully random sampling and greedy decoding (i.e. predicting the most likely token at every step in the sequence), including the nucleus sampling technique which they propose. The motivation for trying to find a middle ground is that models will assign low probabilities to sequences that they haven’t seen much during training, which makes low-probability predictions inherently less reliable. Therefore, we want to avoid sampling low-probability tokens to some extent.
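
To make this concrete, here is a minimal sketch of the core idea behind nucleus (top-p) sampling for a single decoding step, applied to a made-up toy next-token distribution. This is just the basic mechanism, not the authors’ implementation; the distribution and threshold are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def nucleus_sample(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability exceeds
    # top_p (the 'nucleus'), renormalise, and sample from that set only.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

# Toy next-token distribution: the two lowest-probability tokens fall
# outside the nucleus, so they are never sampled.
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.03, 0.02])
samples = [nucleus_sample(probs) for _ in range(10000)]
print(np.bincount(samples, minlength=len(probs)) / 10000)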

Zhang et al.4 frame the choice of a language model decoding strategy as a trade-off between diversity and quality. However, they find that reducing diversity only helps quality up to a point, and reducing it too much makes the results worse, as judged by human evaluators. They call this ‘the likelihood trap’: human-judged quality of samples correlates very well with likelihood, up to an inflection point, where the correlation becomes negative.

In the context of typicality, this raises an interesting question: where exactly is this inflection point, and how does it relate to the typical set of the model distribution? I think it would be very interesting to determine whether the inflection point coincides exactly with the typical set, or whether it is more/less likely. Perhaps there is some degree of atypicality that human raters will tolerate? If so, can we quantify it? This wouldn’t be far-fetched: think about our preference for celebrity faces over ‘typical’ human faces, for example!

Image modelling

The previously mentioned ‘note on the evaluation of generative models’1 is a seminal piece of work that demonstrates several ways in which likelihoods in the image domain can be vastly misleading.

In ‘Do Deep Generative Models Know What They Don’t Know?’7, Nalisnick et al. study the behaviour of likelihood-based models when presented with out-of-domain data. They observe how models can assign higher likelihoods to datasets other than their training datasets. Crucially, they show this for different classes of likelihood-based models (variational autoencoders, autoregressive models and flow-based models, see Figure 3 in the paper), which clearly demonstrates that this is an issue with the likelihood-based paradigm itself, and not with a particular model architecture or formulation.

Comparing images from CIFAR-10 and SVHN, two of the datasets they use, a key difference is the prevalence of textures in CIFAR-10 images, and the relative absence of such textures in SVHN images. This makes SVHN images inherently easier to predict, which partially explains why models trained on CIFAR-10 tend to assign higher likelihoods to SVHN images. Despite this, we clearly wouldn’t ever be able to sample anything that looks like an SVHN image from a CIFAR-10-trained model, because such images are not in the typical set of the model distribution (even if their likelihood is higher).

Audio modelling

I don’t believe I’ve seen any recent work that studies sampling and decoding strategies for likelihood-based models in the audio domain. Nevertheless, I wanted to briefly discuss this setting because a question I often get is: “why don’t you use greedy decoding or beam search to improve the quality of WaveNet samples?”

If you’ve read this far, the answer is probably clear to you by now: because audio samples outside of the typical set sound really weird! In fact, greedy decoding from a WaveNet will invariably yield complete silence, even for fairly strongly conditioned models (e.g. WaveNets for text-to-speech synthesis). In the text-to-speech case, even if you simply reduce the sampling temperature a bit too aggressively, certain consonants that are inherently noisy (such as ‘s’, ‘f’, ‘sh’ and ‘h’, the fricatives) will start sounding very muffled. These sounds are effectively different kinds of noise, and reducing the stochasticity of this noise has an audible effect.

Anomaly detection

Anomaly detection, or out-of-distribution (OOD) detection, is the task of identifying whether a particular input could have been drawn from a given distribution. Generative models are often used for this purpose: train an explicit model on in-distribution data, and then use its likelihood estimates to identify OOD inputs.

Usually, the assumption is made that OOD inputs will have low likelihoods, and in-distribution inputs will have high likelihoods. However, the fact that the mode of a high-dimensional distribution usually isn’t part of its typical set clearly contradicts this. This mistaken assumption is quite pervasive. Only recently has it started to be challenged explicitly, e.g. in works by Nalisnick et al.8 and Morningstar et al.9. Both of these works propose testing the typicality of inputs, rather than simply measuring and thresholding their likelihood.
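
To illustrate the difference between thresholding the likelihood and testing typicality, here is a toy sketch. It is not the procedure from either of these papers: I am using a 100-dimensional Gaussian as a stand-in for a trained likelihood-based model, and simply checking whether an input’s negative log-likelihood falls within the range observed on held-out in-distribution data.

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
dim = 100

# Stand-in for a trained likelihood-based model of the in-distribution data:
# here, simply the true distribution (a standard Gaussian).
model = scipy.stats.multivariate_normal(mean=np.zeros(dim))

# Estimate the range of negative log-likelihoods that in-distribution
# inputs typically attain, using held-out data.
held_out_nll = -model.logpdf(rng.normal(0, 1, (10000, dim)))
lo, hi = np.quantile(held_out_nll, [0.005, 0.995])

def is_ood(x):
    # Flag inputs whose NLL falls outside the typical range, including
    # inputs that are 'too likely', such as the all-zeros input (the mode).
    nll = -model.logpdf(x)
    return not (lo <= nll <= hi)

print(is_ood(rng.normal(0, 1, dim)))  # in-distribution sample: almost always False
print(is_ood(np.zeros(dim)))          # the mode: True, despite its high likelihood
print(is_ood(rng.normal(0, 2, dim)))  # sample from a broader distribution: True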

The right level of abstraction

While our intuitive notion of likelihood in high-dimensional spaces might technically be wrong, it can often be a better representation of what we actually care about. This raises the question: should we really be fitting our generative models using likelihood measured in the input space? If we were to train likelihood-based models with ‘intuitive’ likelihood, they might perform better according to perceptual metrics, because they do not have to waste capacity capturing all the idiosyncrasies of particular examples that we don’t care to distinguish anyway.

In fact, measuring likelihood in more abstract representation spaces has had some success in generative modelling, and I think the approach should be taken more seriously in general. In language modelling, it is common to measure likelihoods at the level of word pieces, rather than individual characters. In symbolic music modelling, recent models that operate on event-based sequences (rather than sequences with a fixed time quantum) are more effective at capturing large-scale structure10. Some likelihood-based generative models of images separate or discard the least-significant bits of each pixel colour value, because they are less perceptually relevant, allowing model capacity to be used more efficiently11 12.

But perhaps the most striking example is the recent line of work where VQ-VAE13 is used to learn discrete higher-level representations of perceptual signals, and generative models are then trained to maximise the likelihood in this representation space. This approach has led to models that produce images that are on par with those produced by GANs in terms of fidelity, and exceed them in terms of diversity14 15 16. It has also led to models that are able to capture long-range temporal structure in audio signals, which even GANs had not been able to do before17 18. While the current trend in representation learning is to focus on coarse-grained representations which are suitable for discriminative downstream tasks, I think it also has a very important role to play in generative modelling.

In the context of modelling sets with likelihood-based models, a recent blog post by Adam Kosiorek drew my attention to point processes, and in particular, to the formula that expresses the density over ordered sequences in terms of the density over unordered sets. This formula quantifies how we need to scale probabilities across sets of different sizes to make them comparable. I think it may yet prove useful to quantify the unintuitive behaviours of likelihood-based models.

Closing thoughts

To wrap up this post, here are some takeaways:

  • High-dimensional spaces, and high-dimensional probability distributions in particular, are deeply unintuitive in more ways than one. This is a well-known fact, but they still manage to surprise us sometimes!

  • The most likely samples from a high-dimensional distribution usually aren’t a very good representation of that distribution. In most situations, we probably shouldn’t be trying to find them.

  • Typicality is a very useful concept to describe these unintuitive phenomena, and I think it is undervalued in machine learning — at least in the work that I’ve been exposed to.

  • A lot of work that discusses these issues (including some that I’ve highlighted in this post) doesn’t actually refer to typicality by name. I think doing so would improve our collective understanding, and shed light on links between related phenomena in different subfields.

If you have any thoughts about this topic, please don’t hesitate to share them in the comments below!

In an addendum to this post, I explore quantitatively what happens when our intuitions fail us in high-dimensional spaces.

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{dieleman2020typicality,
  author = {Dieleman, Sander},
  title = {Musings on typicality},
  url = {https://benanne.github.io/2020/09/01/typicality.html},
  year = {2020}
}

Acknowledgements

Thanks to Katie Millican, Jeffrey De Fauw and Adam Kosiorek for their valuable input and feedback on this post!

References

  1. Theis, van den Oord and Bethge, “A note on the evaluation of generative models”, International Conference on Learning Representations, 2016. 

  2. Koehn and Knowles, “Six Challenges for Neural Machine Translation”, First Workshop on Neural Machine Translation, 2017. 

  3. Ott, Auli, Grangier and Ranzato, “Analyzing Uncertainty in Neural Machine Translation”, International Conference on Machine Learning, 2018. 

  4. Zhang, Duckworth, Ippolito and Neelakantan, “Trading Off Diversity and Quality in Natural Language Generation”, arXiv, 2020. 

  5. Eikema and Aziz, “Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation”, arXiv, 2020. 

  6. Holtzman, Buys, Du, Forbes and Choi, “The Curious Case of Neural Text Degeneration”, International Conference on Learning Representations, 2020. 

  7. Nalisnick, Matsukawa, Teh, Gorur and Lakshminarayanan, “Do Deep Generative Models Know What They Don’t Know?”, International Conference on Learning Representations, 2019. 

  8. Nalisnick, Matsukawa, Teh and Lakshminarayanan, “Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality”, arXiv, 2019. 

  9. Morningstar, Ham, Gallagher, Lakshminarayanan, Alemi and Dillon, “Density of States Estimation for Out-of-Distribution Detection”, arXiv, 2020. 

  10. Oore, Simon, Dieleman, Eck and Simonyan, “This Time with Feeling: Learning Expressive Musical Performance”, Neural Computing and Applications, 2020. 

  11. Menick and Kalchbrenner, “Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling”, International Conference on Machine Learning, 2019. 

  12. Kingma and Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions”, Neural Information Processing Systems, 2018. 

  13. van den Oord, Vinyals and Kavukcuoglu, “Neural Discrete Representation Learning”, Neural Information Processing Systems, 2017. 

  14. Razavi, van den Oord and Vinyals, “Generating Diverse High-Fidelity Images with VQ-VAE-2”, Neural Information Processing Systems, 2019. 

  15. De Fauw, Dieleman and Simonyan, “Hierarchical Autoregressive Image Models with Auxiliary Decoders”, arXiv, 2019. 

  16. Ravuri and Vinyals, “Classification Accuracy Score for Conditional Generative Models”, Neural Information Processing Systems, 2019. 

  17. Dieleman, van den Oord and Simonyan, “The challenge of realistic music generation: modelling raw audio at scale”, Neural Information Processing Systems, 2018. 

  18. Dhariwal, Jun, Payne, Kim, Radford and Sutskever, “Jukebox: A Generative Model for Music”, arXiv, 2020. 

Addendum: quantifying our flawed intuitions

This post is an addendum to my blog post about typicality. Please consider reading that first, if you haven’t already. Here, I will try to quantify what happens when our intuitions fail us in high-dimensional spaces.

Note that the practical relevance of this is limited, so consider this a piece of optional extra content!

In the ‘unfair coin flips’ example from the main blog post, it’s actually pretty clear what happens when our intuitions fail us: we think of the binomial distribution, ignoring the order of the sequences as a factor, when we should actually be taking it into account. Referring back to the table from section 2.1, we use the probabilities in the rightmost column, when we should be using those in the third column. But when we think of a high-dimensional Gaussian distribution and come to the wrong conclusion, what distribution are we actually thinking of?

The Gaussian distribution \(\mathcal{N}_K\)

Let’s start by quantifying what a multivariate Gaussian distribution actually looks like: let \(\mathbf{x} \sim \mathcal{N}(\mathbf{0}, I_K)\), a standard Gaussian distribution in \(K\) dimensions, henceforth referred to as \(\mathcal{N}_K\). We can sample from it by drawing \(K\) independent one-dimensional samples \(x_i \sim \mathcal{N}(0, 1)\), and joining them into a vector \(\mathbf{x}\). This distribution is spherically symmetric, which makes it very natural to think about samples in terms of their distance to the mode (in this case, the origin, corresponding to the zero-vector \(\mathbf{0}\)), because all samples at a given distance \(r\) have the same density.

Now, let’s look at the distribution of \(r\). Intuitively, it might seem as if the multivariate Gaussian distribution \(\mathcal{N}_K\) naturally arises by taking a univariate Gaussian and rotating it around the mode in every possible direction in \(K\)-dimensional space. Because each of these individual rotated copies is Gaussian, this in turn might seem to imply that the distance from the mode \(r\) is itself Gaussian (or rather half-Gaussian, since it is a nonnegative quantity). But this is incorrect! \(r\) actually follows a chi distribution with \(K\) degrees of freedom: \(r \sim \chi_K\).

Note that for \(K = 1\), this does indeed correspond to a half-Gaussian distribution. But as \(K\) increases, the mode of the chi distribution rapidly shifts away from 0: it actually sits at \(\sqrt{K - 1}\). This leaves considerably less probability mass near 0, where the mode of our original multivariate Gaussian \(\mathcal{N}_K\) is located.

This exercise yields an alternative sampling strategy for multivariate Gaussians: first, sample a distance from the mode \(r \sim \chi_K\). Then, sample a direction, i.e. a vector on the \(K\)-dimensional unit sphere \(S^K\), uniformly at random: \(\mathbf{\theta} \sim U[S^K]\). Multiply them together to obtain a Gaussian sample: \(\mathbf{x} = r \cdot \mathbf{\theta} \sim \mathcal{N}_K\).
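
Here is a quick sanity check of this alternative sampling strategy (the sample sizes are arbitrary): both procedures should yield vectors with matching radius and coordinate statistics, up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(0)
K, N = 100, 50000

x_direct = rng.normal(0, 1, (N, K))  # direct sampling

# Alternative strategy: radius r ~ chi_K, direction uniform on the sphere.
r = np.sqrt(rng.chisquare(K, (N, 1)))
theta = rng.normal(0, 1, (N, K))
theta /= np.linalg.norm(theta, axis=-1, keepdims=True)
x_alt = r * theta

for x in (x_direct, x_alt):
    radii = np.linalg.norm(x, axis=-1)
    print(radii.mean(), radii.std(), np.abs(x).mean())  # should match closely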

The Gaussian mirage distribution \(\mathcal{M}_K\)

What if, instead of sampling \(r \sim \chi_K\), we sampled \(r \sim \mathcal{N}(0, K)\)? Note that \(\mathbb{E}\left[\chi_K^2\right] = K\), so this change preserves the scale of the resulting vectors. For \(K = 1\), we get the same distribution for \(\mathbf{x}\), but for \(K > 1\), we get something very different. The resulting distribution represents what we might think the multivariate Gaussian distribution looks like, if we rely on a mistaken intuition and squint a bit. Let’s call this the Gaussian mirage distribution, denoted by \(\mathcal{M}\): \(\mathbf{x} = r \cdot \mathbf{\theta} \sim \mathcal{M}_K\). (If this thing already has a name, I’m not aware of it, so please let me know!)

We’ve already established that \(\mathcal{M}_1 \equiv \mathcal{N}_1\). But in higher dimensions, these distributions behave very differently. One way to comprehend this is to look at a flattened histogram of samples across all coordinates:

import matplotlib.pyplot as plt
import numpy as np

def gaussian(n, k):
    return np.random.normal(0, 1, (n, k))

def mirage(n, k):
    # Sample a uniformly random direction, and a signed distance drawn from a
    # Gaussian with variance k (a negative distance simply flips the direction).
    direction = np.random.normal(0, 1, (n, k))
    direction /= np.sqrt(np.sum(direction**2, axis=-1, keepdims=True))
    distance = np.random.normal(0, np.sqrt(k), (n, 1))
    return distance * direction

def plot_histogram(x):
    plt.hist(x.ravel(), bins=100)
    plt.ylim(0, 80000)
    plt.xlim(-4, 4)
    plt.tick_params(labelleft=False, left=False, labelbottom=False, bottom=False)

plt.figure(figsize=(9, 3))
ks = [1, 3, 10, 100]
for i, k in enumerate(ks):
    plt.subplot(2, len(ks), i + 1)
    plt.title(f'K = {k}')
    plot_histogram(gaussian(10**6 // k, k))
    plt.subplot(2, len(ks), i + 1 + len(ks))
    plot_histogram(mirage(10**6 // k, k))
Histograms of the flattened coordinates of the multivariate Gaussian distribution (top) and the Gaussian mirage (bottom), for different dimensionalities (K). For the mirage, the histograms become increasingly peaked around 0 as the dimensionality increases.

For \(\mathcal{N}_K\), this predictably looks like a univariate Gaussian for all \(K\). For \(\mathcal{M}_K\), it becomes highly leptokurtic as \(K\) increases, indicating that dramatically more probability mass is located close to the mode.

Typical sets of \(\mathcal{N}_K\) and \(\mathcal{M}_K\)

Let’s also look at the typical sets for both of these distributions. For \(\mathcal{N}_K\), the probability density function (pdf) has the form:

\[f_{\mathcal{N}_K}(\mathbf{x}) = (2 \pi)^{-\frac{K}{2}} \exp \left( -\frac{\mathbf{x}^T \mathbf{x}}{2} \right),\]

and the differential entropy is given by:

\[H_{\mathcal{N}_K} = \frac{K}{2} \log \left(2 \pi e \right) .\]

To find the typical set, we just need to look for the \(\mathbf{x}\) where \(f_{\mathcal{N}_K}(\mathbf{x}) \approx 2^{-H_{\mathcal{N}_K}} = (2 \pi e)^{-\frac{K}{2}}\) (assuming the entropy is measured in bits). This is clearly the case when \(\mathbf{x}^T\mathbf{x} \approx K\), or in other words, for any \(\mathbf{x}\) whose distance from the mode is close to \(\sqrt{K}\). This is the Gaussian annulus from before.
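
This is easy to verify numerically for, say, \(K = 100\) (working in nats for convenience, and using scipy’s multivariate normal as a stand-in; the choice of \(K\) is arbitrary):

import numpy as np
import scipy.stats

K = 100
model = scipy.stats.multivariate_normal(mean=np.zeros(K))
H = 0.5 * K * np.log(2 * np.pi * np.e)  # differential entropy in nats

# A point at distance sqrt(K) from the mode has log-density exactly -H ...
x = np.random.normal(0, 1, K)
x *= np.sqrt(K) / np.linalg.norm(x)
print(model.logpdf(x), -H)  # identical (up to floating point error)

# ... whereas the mode itself has a much higher log-density.
print(model.logpdf(np.zeros(K)))  # -K/2 * log(2*pi), well above -H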

Let’s subject the Gaussian mirage \(\mathcal{M}_K\) to the same treatment. It’s not obvious how to express the pdf in terms of \(\mathbf{x}\), but it’s easier if we rewrite \(\mathbf{x}\) as \(r \cdot \mathbf{\theta}\), as before, and imagine the sampling procedure: first, pick a radius \(r \sim \mathcal{HN}(0, K)\) (the half-Gaussian distribution — using the Gaussian distribution complicates the math a bit, because the radius should be nonnegative), and then pick a position on the \(K\)-sphere with radius \(r\), uniformly at random:

\[f_{\mathcal{M}_K}(\mathbf{x}) = f_{\mathcal{HN}(0, K)}(r) \cdot f_{U[S^K(r)]}(\theta) = \frac{2}{\sqrt{2 \pi K}} \exp \left( -\frac{r^2}{2 K} \right) \cdot \frac{1}{r^{K-1}} \frac{\Gamma\left( \frac{K}{2} \right)}{2 \pi ^ \frac{K}{2}} .\]

The former factor is the density of the half-Gaussian distribution: note the additional factor 2 compared to the standard Gaussian density, because we only consider nonnegative values of \(r\). The latter is the density of a uniform distribution on the \(K\)-sphere with radius \(r\) (which is the inverse of its surface area). As an aside, this factor is worth taking a closer look at, because it behaves in a rather peculiar way. Here’s the surface area of a unit \(K\)-sphere for increasing \(K\):

import matplotlib.pyplot as plt
import numpy as np
import scipy.special

K = np.arange(0, 30 + 1)
A = (2 * np.pi**(K / 2.0)) / scipy.special.gamma(K / 2.0)
plt.figure(figsize=(9, 3))
plt.stem(K, A, basefmt=' ')
plt.ylim(0, 35)
Surface area of a K-dimensional unit sphere, for K ranging from 0 to 30.

Confused? You and me both! Believe it or not, the surface area of a \(K\)-sphere tends to zero with increasing \(K\) — but only after growing to a maximum at \(K = 7\) first. High-dimensional spaces are weird.

Another thing worth noting is that the density at the mode \(f_{\mathcal{M}_K}(\mathbf{0}) = +\infty\) for \(K > 1\), which already suggests that this distribution has a lot of its mass concentrated near the mode.

Computing the entropy of this distribution takes a bit of work. The differential entropy is:

\[H_{\mathcal{M}_K} = - \int_{\mathbb{R}^K} f_{\mathcal{M}_K}(\mathbf{x}) \log f_{\mathcal{M}_K}(\mathbf{x}) \mathrm{d}\mathbf{x} .\]

We can use the radial symmetry of this density to reformulate this as an integral of a scalar function:

\[H_{\mathcal{M}_K} = - \int_0^{+\infty} f_{\mathcal{M}_K}(r) \log f_{\mathcal{M}_K}(r) S^K(r) \mathrm{d} r,\]

where \(S^K(r)\) is the surface area of a \(K\)-sphere with radius \(r\). Filling in the density function, we get:

\[H_{\mathcal{M}_K} = - \int_0^{+\infty} \frac{2}{\sqrt{2 \pi K}} \exp \left( -\frac{r^2}{2 K} \right) \cdot \log \left( \frac{2}{\sqrt{2 \pi K}} \exp \left( -\frac{r^2}{2 K} \right) \cdot \frac{1}{r^{K-1}} \frac{\Gamma\left( \frac{K}{2} \right)}{2 \pi ^ \frac{K}{2}} \right) \mathrm{d} r,\]

where we have made use of the fact that \(S^K(r)\) cancels out with the second factor of \(f_{\mathcal{M}_K}(r)\). We can split up the \(\log\) into three different terms, \(H_{\mathcal{M}_K} = H_1 + H_2 + H_3\):

\[H_1 = - \int_0^{+\infty} \frac{2}{\sqrt{2 \pi K}} \exp \left( -\frac{r^2}{2 K} \right) \left(-\frac{r^2}{2 K} \right) \mathrm{d} r = \int_0^{+\infty} \frac{r^2}{\sqrt{2 \pi}} \exp \left( -\frac{r^2}{2} \right) \mathrm{d} r = \frac{1}{2},\] \[H_2 = - \int_0^{+\infty} \frac{2}{\sqrt{2 \pi K}} \exp \left( -\frac{r^2}{2 K} \right) \log \left( \frac{1}{r^{K-1}} \right) \mathrm{d} r = \frac{K - 1}{2} \left( \log \frac{K}{2} - \gamma \right),\] \[H_3 = - \int_0^{+\infty} \frac{2}{\sqrt{2 \pi K}} \exp \left( -\frac{r^2}{2 K} \right) \log \left( \frac{2}{\sqrt{2 \pi K}} \frac{\Gamma\left( \frac{K}{2} \right)}{2 \pi ^ \frac{K}{2}} \right) \mathrm{d} r = - \log \left( \frac{1}{\sqrt{2 \pi K}} \frac{\Gamma\left( \frac{K}{2} \right)}{\pi ^ \frac{K}{2}} \right),\]

where we have taken \(\log\) to be the natural logarithm for convenience, and \(\gamma\) is the Euler-Mascheroni constant. In summary:

\[H_{\mathcal{M}_K} = \frac{1}{2} + \frac{K - 1}{2} \left( \log \frac{K}{2} - \gamma \right) - \log \left( \frac{1}{\sqrt{2 \pi K}} \frac{\Gamma\left( \frac{K}{2} \right)}{\pi ^ \frac{K}{2}} \right) .\]

Note that \(H_{\mathcal{M}_1} = \frac{1}{2} \log (2 \pi e)\), matching the standard Gaussian distribution as expected.
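
As a sanity check on this derivation, we can compare the analytic expression to a Monte Carlo estimate of \(-\mathbb{E}\left[\log f_{\mathcal{M}_K}(\mathbf{x})\right]\). The density only depends on \(r\), so it suffices to sample radii from the half-Gaussian (the sample size below is arbitrary):

import numpy as np
import scipy.special

def mirage_entropy(k):
    # The analytic expression for H_{M_K} derived above, in nats.
    return (0.5 + (k - 1) / 2 * (np.log(k / 2) - np.euler_gamma)
            - (scipy.special.gammaln(k / 2)
               - 0.5 * np.log(2 * np.pi * k) - k / 2 * np.log(np.pi)))

def mirage_entropy_mc(k, n=10**6):
    # Monte Carlo estimate of -E[log f_{M_K}(x)], sampling radii from the
    # half-Gaussian with variance k and plugging them into the density.
    r = np.abs(np.random.normal(0, np.sqrt(k), n))
    log_f = (np.log(2) - 0.5 * np.log(2 * np.pi * k) - r**2 / (2 * k)
             - (k - 1) * np.log(r)
             + scipy.special.gammaln(k / 2)
             - np.log(2) - k / 2 * np.log(np.pi))
    return -np.mean(log_f)

for k in [1, 3, 10, 100]:
    print(k, mirage_entropy(k), mirage_entropy_mc(k))  # should agree closely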

Because this is measured in nats, not in bits, we find the typical set where \(f_{\mathcal{M}_K}(\mathbf{x}) \approx \exp(-H_{\mathcal{M}_K})\). We must find \(r \geq 0\) so that

\[\frac{r^2}{2 K} + (K - 1) \log r = \frac{1}{2} + \frac{K - 1}{2} \left( \log \frac{K}{2} - \gamma \right) .\]

We can express the solution of this equation in terms of the Lambert \(W\) function:

\[r = \sqrt{K (K - 1) W\left(\frac{1}{K (K - 1)} \exp \left( \frac{1}{K - 1} + \log \frac{K}{2} - \gamma \right) \right)} .\]
import matplotlib.pyplot as plt
import numpy as np
import scipy.special

K = np.unique(np.round(np.logspace(0, 6, 100)))
w_arg = np.exp(1 / (K - 1) + np.log(K / 2) - np.euler_gamma) / (K * (K - 1))
r = np.sqrt(K * (K - 1) * scipy.special.lambertw(w_arg)).real
r[0] = 1  # Special case for K = 1 (the expression above is indeterminate there).

plt.figure(figsize=(9, 3))
plt.plot(K, r / np.sqrt(K))
plt.xscale('log')
plt.ylim(0, 1.2)
plt.xlabel('$K$')
plt.ylabel('$\\frac{r}{\\sqrt{K}}$')
The distance from the mode at which the typical set of the Gaussian mirage is found, normalised by the standard deviation, as a function of K.

As \(K \to +\infty\), this seems to converge to the value \(0.52984 \sqrt{K}\), which is somewhere in between the mode (\(0\)) and the mean (\(\sqrt{\frac{2K}{\pi}} \approx 0.79788 \sqrt{K}\)) of the half-Gaussian distribution (which \(r\) follows by construction). This is not just an interesting curiosity: although it is clear that the typical set of \(\mathcal{M}_K\) is much closer to the mode than for \(\mathcal{N}_K\) (because \(r < \sqrt{K}\)), the mode is not unequivocally a member of the typical set. In fact, the definition of typical sets sort of breaks down for this distribution, because we need to allow for a very large range of probability densities to capture the bulk of its mass. In this sense, it behaves a lot more like the one-dimensional Gaussian. Nevertheless, even this strange concoction of a distribution exhibits unintuitive behaviour in high-dimensional space!

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{dieleman2020typicality,
  author = {Dieleman, Sander},
  title = {Musings on typicality},
  url = {https://benanne.github.io/2020/09/01/typicality.html},
  year = {2020}
}