In a recent paper, Stanford researchers Mason Kamb and Surya Ganguli proposed a mechanism that could underlie the creativity of diffusion models. The mathematical model they developed suggests that this creativity is a deterministic consequence of how those models use the denoising process to generate images.
In rough terms, diffusion models are trained to recover an image from isotropic Gaussian noise, the end state of a forward process that progressively corrupts images drawn from a finite set of training samples. Generation then consists of gradually removing the Gaussian noise using a learned score function, which points in the direction of the gradient of the log-probability, i.e., toward regions of higher probability.
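The denoising idea can be sketched in a toy one-dimensional setting. The following minimal example (an illustration only, not the paper's model) uses the exactly known score of a Gaussian N(μ, σ²), which is (μ − x)/σ², and Langevin-style updates to move noise samples toward the target distribution:

```python
import numpy as np

# Toy illustration: the score of a 1-D Gaussian N(mu, sigma^2) is
# d/dx log p(x) = (mu - x) / sigma^2. Repeatedly stepping along this
# gradient of log-probability, plus a small amount of injected noise,
# is the essence of Langevin-style denoising.
rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0

def score(x):
    return (mu - x) / sigma**2

x = rng.normal(0.0, 5.0, size=10_000)  # start far from the target
eps = 0.05
for _ in range(500):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.normal(size=x.shape)

print(round(x.mean(), 1), round(x.std(), 1))  # samples settle near N(3, 1)
```

In a real diffusion model the score is not known in closed form; it is estimated by a neural network at a sequence of decreasing noise levels.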
If a network could learn this ideal score function exactly, it would implement a perfect reversal of the forward process. A perfect reversal, in turn, can only turn Gaussian noise back into memorized training examples.
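This memorization property can be checked numerically. In the hedged 1-D sketch below (an illustration of the argument, not code from the paper), the exact score of a Gaussian-smoothed empirical distribution pulls every sample onto one of three hypothetical training points:

```python
import numpy as np

# Toy illustration: with a finite training set, the ideal score points
# back toward the training examples themselves, so a perfect, deterministic
# reversal reproduces memorized data rather than anything new.
train = np.array([-2.0, 0.0, 5.0])   # three hypothetical 1-D "training images"
sigma = 0.5                          # width of the noise-smoothed density

def ideal_score(x):
    # exact score of the Gaussian-smoothed empirical distribution:
    # a softmax-weighted pull toward the training points
    w = np.exp(-(x[:, None] - train[None, :])**2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    return (w @ train - x) / sigma**2

x = np.random.default_rng(0).normal(0.0, 4.0, size=200)  # initial noise
for _ in range(2000):
    x = x + 0.01 * ideal_score(x)    # deterministic reversal, no new noise

# every sample lands (to rounding) on a memorized training point
print(np.unique(np.round(x, 2)))
```

Creativity therefore has to come from the score being learned imperfectly, in a structured way.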
This means that, to generate new images far from the training set, the models must fail to learn the ideal score (IS) function. One way to explain how this failure occurs is to hypothesize inductive biases that provide a more exact account of what diffusion models are actually doing when they creatively generate new samples.
By analyzing how diffusion models estimate the score function using convolutional neural networks (CNNs), the researchers identified two such biases: translational equivariance and locality. Translational equivariance refers to the model’s tendency to reflect shifts in the input image: if the input is shifted by a few pixels, the generated image mirrors that shift. Locality, on the other hand, arises because the CNNs used to learn the score function consider only a small neighborhood of input pixels rather than the entire image.
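Both properties are easy to verify for a plain convolution. The short check below (a generic 1-D example, not tied to the paper's architectures) shows that shifting the input and then convolving gives the same result as convolving and then shifting, and that each output value depends only on a small local window:

```python
import numpy as np

# Toy check that a convolution is translation-equivariant:
# shifting the input shifts the output by exactly the same amount.
rng = np.random.default_rng(1)
x = rng.normal(size=32)            # a 1-D "image"
k = rng.normal(size=5)             # a local filter, as in a CNN layer

def circular_conv(signal, kernel):
    # locality: each output depends only on a 5-pixel neighborhood
    n, m = len(signal), len(kernel)
    return np.array([sum(kernel[j] * signal[(i + j) % n] for j in range(m))
                     for i in range(n)])

shift = 7
lhs = circular_conv(np.roll(x, shift), k)   # shift, then convolve
rhs = np.roll(circular_conv(x, k), shift)   # convolve, then shift
print(np.allclose(lhs, rhs))  # → True
```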
Based on these insights, the researchers built a mathematical model aimed at optimizing a score function for equivariance and locality, which they called an equivariant local score (ELS) machine.
An ELS machine is a set of equations that can directly calculate the composition of denoised images. The researchers compared its output with that of diffusion models based on ResNet and UNet architectures trained on simplified datasets, and found “a remarkable and uniform quantitative agreement between the CNN outputs and ELS machine outputs”, with an accuracy of around 90% or higher depending on the actual diffusion model and dataset considered.
According to the researchers, this is the first time an analytic theory has explained the creative outputs of a trained deep neural network-based generative model to this level of accuracy. Importantly, they note, the (E)LS machine explains all trained outputs far better than the IS machine.
According to Ganguli, their research explains how diffusion models create new images “by mixing and matching different local training set image patches at different locations in the new output, yielding a local patch mosaic model of creativity”. The theory also helps explain why diffusion models make mistakes, such as generating excess fingers or limbs, as a consequence of excessive locality.
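The patch-mosaic intuition can be sketched concretely. The following hypothetical example (an illustration of the idea, not the ELS equations from the paper) denoises each position of a 1-D signal using only the best-matching local patch drawn from anywhere in a small training set, combining locality with translation equivariance:

```python
import numpy as np

# Hypothetical sketch of a "local patch mosaic": each output pixel is
# taken from the center of the training patch that best matches its
# local neighborhood, regardless of where that patch originally sat.
rng = np.random.default_rng(2)
train = [np.sin(np.linspace(0, 2 * np.pi, 64)),
         np.sign(np.sin(np.linspace(0, 6 * np.pi, 64)))]
half = 3  # patch radius: each pixel "sees" a 7-pixel neighborhood

# collect all local patches from all training signals; allowing any
# position is where translation equivariance enters
patches = np.array([s[i - half:i + half + 1]
                    for s in train for i in range(half, len(s) - half)])

noisy = train[0] + 0.3 * rng.normal(size=64)  # a noised training sample

out = noisy.copy()
for i in range(half, 64 - half):
    window = noisy[i - half:i + half + 1]
    best = patches[np.argmin(((patches - window) ** 2).sum(axis=1))]
    out[i] = best[half]  # paste the center pixel of the closest patch

# every interior output pixel is copied verbatim from some training patch
print(np.isin(out[half:64 - half], patches[:, half]).all())  # → True
```

Because each pixel is chosen independently from its own small window, nothing prevents the mosaic from stitching together locally plausible but globally inconsistent content, which is the same failure mode as extra fingers or limbs.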
This result, while compelling, initially excluded diffusion models that incorporate highly non-local self-attention (SA) layers, which violate the locality assumption in the researchers’ hypothesis. To address this, the authors used their ELS machine to predict the output of a publicly available UNet+SA model pretrained on CIFAR-10 and found that it still achieved significantly higher accuracy than the baseline IS machine.
According to the researchers, their results suggest that locality and equivariance are sufficient to explain the creativity of convolution-only diffusion models and could form the foundation for further study of more complex diffusion models.
The researchers also shared the code they used to train the diffusion models in the study.