Deep learning tricks all the way down, with a bit of mathematics for good measure

Notes and reflections based on the first lesson (aka ‘lesson 9’) of the FastAI Part II course. This covers the fundamentals of Stable Diffusion, how it works and some core concepts or techniques.

Alex Strick van Linschoten


October 17, 2022

(This is part of a series of blog posts relating to and responding to the live FastAI course (part 2) being taught October-December 2022. To read others, see the ones listed for the ‘parttwo’ tag.)

Much awaited and anticipated, the second part of the FastAI course is being taught live again. If part one is about getting solid foundations and learning how to get going in a practical/useful way, part two is about approaching things from the foundations but with research or ‘impractical’ questions kept in mind. The backdrop of the current iteration is the breakthroughs happening in the world of generative computer vision models like Stable Diffusion, which we’ll explore and deconstruct (and reconstruct?!) over the coming weeks.

Diving into the details of how things work means that along the way we’re much more likely to encounter (legitimate) specialist vocabulary and techniques as well as a decent dose of jargon. Whereas bringing up the intricacies of particular algorithms, architectures or mathematical methods was unnecessary during part one, it seems like part two is a little bit more of a venue for that kind of material. I will use these blogs as a way of reviewing materials and concepts introduced during the lectures as well as keeping track of the big questions I have.

In this blog, in particular, I’ll keep a glossary at the bottom for some new terms which were introduced. I may repeat this for subsequent blog reviews, depending on what’s covered in those lessons. I’ll also keep a section containing new mathematical symbols that are introduced. (This blog mainly relates to the core lecture given during week 1. I’ll update it later with some small extras that came up from the 9A and 9B videos, or expand those into separate posts on their own.)

Stable Diffusion isn’t, in itself, a model that I’m especially interested in, except insofar as it teaches me fundamental principles about the craft of deep learning or about doing research in this field. As such, my plan and current intention is to stick to documenting core mental models or bigger-picture lessons that I’m taking away from the lessons rather than each individual step that Jeremy made along the way. (This seems to be the motivation behind including it in the course at all. Stable Diffusion touches so many topics (big and small) and getting to grips with this one thing will help understand many other things about machine learning and the world of research.)

💬🌄 Stable Diffusion 101

If you’ve been on the internet at all during the past 6-12 months, you’ll almost certainly have been exposed to examples of images that have been generated using techniques grounded in deep learning. Here is one that I generated just now:

These images are generated by passing a prompt into the model, which then uses it to come up with something that represents the text you passed in. (Shoutout to the creators and maintainers of the tremendously useful diffusion-nbs notebooks.) The interface between the text and the image is still not as seamless as might be hoped, and a discipline of ‘prompt engineering’ has grown around finding the best ways to coax certain kinds of images out of the model. (See this book to learn more about what works with DALL-E 2, for example. Or visit Lexica to search images based on the prompts that were used to create them.)

There are some obvious interim questions that result from the existence of such models and their outputs, notably what this allows in terms of creativity and how it might transform the kinds of tools we use for image editing and creation. The advances are certainly impressive, but (without moving on too quickly from the unquestioned achievement of these types of models) what do they mean for the rest of deep learning?

The first thing that is maybe interesting for the field is the way that these models are multi-modal, or in other words they aren’t stuck in the silo of being text-only, or image-only, and so on. We are able to translate (to a greater or lesser degree) between language and images with these models, which seems like it might open up a whole universe of interactions and behaviours that are interesting to explore.

At a very (very) high level, what’s going on with stable diffusion is that it starts with generating an image that is more or less purely random noise, and then (with subsequent iterations) slowly reveals an image and coherence that was contained within the random noise. Similar to how Michelangelo said of sculpture (“It is already there, I just have to chisel away the superfluous material”), what happens here is that we have to remove the superfluous noise.

🛠 Core Takeaways: How does it work?

Those of you not taking the course live will have to wait a few months for the lectures to be released and in any case I don’t want to parrot the order and progression of how Jeremy explained how Stable Diffusion works. With that said, I was pleasantly surprised by how much I was able to follow along given what is a fairly technically involved topic. (Note to self: the fundamentals continue to be important!)

Fundamentals still count

Even though there are a hundred and one small innovations and technologies which make something like Stable Diffusion possible, in the end we’re still dealing with Deep Learning and we’re still dealing with finding ways of converting things into numbers which can be used by machines to update weights by way of evaluating loss functions. So many of the individual pieces that make up how you build something like Stable Diffusion amount to:

  • figure out how to get this non-number-like thing into a numeric representation (ideally a vector of some kind)
  • do all the usual deep learning things that we’ve done a thousand times and that we know work
  • at the end, maybe find a way to convert the numeric representation that our model learned into some kind of form that is useful to us
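
To make that three-step recipe a bit more concrete, here is a deliberately tiny sketch of my own (not code from the course). It assumes a bag-of-words count vector as the numeric representation, and the helper names `build_vocab`, `to_vector` and `to_words` are made up for illustration. Real systems use learned tokenizers and embeddings instead.

```python
# Toy illustration only: real systems use learned tokenizers and embeddings.

def build_vocab(texts):
    """Assign every distinct word an index (hypothetical helper)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_vector(text, vocab):
    """Step 1: turn text into numbers (a bag-of-words count vector)."""
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

def to_words(vector, vocab):
    """Step 3: turn numbers back into something readable (word order is lost)."""
    index_to_word = {index: word for word, index in vocab.items()}
    return [index_to_word[i] for i, count in enumerate(vector) if count > 0]

captions = ["a cat on a mat", "a dog on a mat"]
vocab = build_vocab(captions)
cat_vector = to_vector(captions[0], vocab)
dog_vector = to_vector(captions[1], vocab)
# Step 2 would be the usual training loop, operating on vectors like these.
```

The middle step is left out on purpose: once the captions are vectors, everything in between is the familiar deep learning machinery.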

Obviously the details are important and nobody is creating magical generative art with this very high-level hand-wavy explanation, but for someone at the earlier end of their journey into deep learning it is reassuring that the fundamentals continue to have relevance and that those mental models remain useful as a way of thinking about new developments.

The tricks are the way

The other pleasant surprise was the enduring relevance of ‘tricks’. In chapter one of the FastAI book, Jeremy & Sylvain showcase a number of examples where clever approaches are taken to solve problems with Deep Learning:

  • a malware classification program is made by converting malware code into an image which is used to train a model
  • a fraud detection algorithm is trained by converting images of computer mouse movements
  • …and so on

Even amongst the Delft FastAI study group, Kurian trained a classifier to detect genre in music samples using a similar method (i.e. using images as an intermediary form for the samples which were used in training). The book emphasises:

“In general, you’ll find that a small number of general approaches in deep learning can go a long way, if you’re a bit creative in how you represent your data! You shouldn’t think of approaches like the ones described here as “hacky workarounds,” because actually they often (as here) beat previously state-of-the-art results. These really are the right ways to think about these problem domains.”

Most of these ‘tricks’ seem to relate to either performance improvements (i.e. how can we get this training to happen faster, or with fewer compute needs) or ways of getting your problem domain into a form that deep learning techniques can be applied to. In the case of Stable Diffusion, one of the problems we have to address is how to work in this multi-modal manner, where text is used to represent a particular idea (which in turn needs a vector/numeric representation) but where we also want to represent that same idea in image form.

At the same time, we have the whole autoencoder part of the story — whereby we use an encoder to turn a large image into a (smaller-sized) latent representation which can be used in training, and then we use a decoder to turn a noisy latent into a full-sized image — which seems to mainly be about making the training process more efficient.
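
A toy version of the encode/decode round trip might look like the sketch below. This is my own illustration and not how the real autoencoder works: I am assuming plain block-averaging as the ‘encoder’ and value-repetition as the ‘decoder’, whereas the real thing uses learned convolutional layers. The 512×512×3 and 64×64×4 sizes are the figures usually quoted for Stable Diffusion v1, which I am also assuming here.

```python
def encode(image, factor=2):
    """Toy 'encoder': average each factor x factor block into a single value."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / factor**2
            for c in range(0, width, factor)
        ]
        for r in range(0, height, factor)
    ]

def decode(latent, factor=2):
    """Toy 'decoder': repeat each latent value to fill a factor x factor block."""
    image = []
    for row in latent:
        wide_row = [value for value in row for _ in range(factor)]
        image.extend(list(wide_row) for _ in range(factor))
    return image

image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [1, 3, 5, 5],
    [5, 7, 5, 5],
]
latent = encode(image)      # 2x2 grid instead of 4x4
restored = decode(latent)   # back to 4x4, but the bottom-left detail is blurred

# Sizes usually quoted for Stable Diffusion v1 (an assumption, not from my notes):
full_size = 512 * 512 * 3    # values in an RGB image
latent_size = 64 * 64 * 4    # values in the corresponding latent
reduction = full_size // latent_size
```

In this toy the flat regions survive the round trip while the varied bottom-left block comes back averaged out, so at least this naive kind of compression loses information.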

Each of these techniques comes with its own complexities and history, but it’s notable to me how the story of the development of machine learning techniques seems somehow to be a succession of these small incremental innovations that progressively accrue. That’s not to say that there aren’t big breakthroughs in either understanding why things work the way they do, or in the more tactical method space, but it just seemed very apparent in the unpacking of Stable Diffusion that a great deal of creative stitching together of ideas had taken place.

The historian in me is fascinated by the different pathways that the field has explored, or the reasons why certain techniques emerged when they did, or how hardware improvements gave tried-and-rejected techniques a new lease of life, but I’m guessing that probably doesn’t help much with the work of research.

💪 What happens when we train the diffusion model

A diffusion model is a neural network that we train. The way it works is that it removes noise from an image (passed in as input along with a text prompt) such that the output more closely resembles the prompt. When we are training our network, we pass in the vectorised words along with the latent forms of the images (since those are much smaller than the full-sized images and thus faster / more efficient to train on). We use the encoder to get a latent representation of the image that we use for training.

For the text caption, we want a way to represent the association of images with text captions in vector space. In other words, if there are various phrases that all represent more or less the same image if you were to translate those phrases into an image, then those should be similar when represented as a vector. The technique or trick for this is to use ‘contrastive loss’, a particular kind of loss function which allows us to calculate the relative similarity of two vectors. This contrastive loss is what gives us the ‘C’ of ‘CLIP’ (Contrastive Language-Image Pre-training), a neural network developed by OpenAI.

The CLIP model takes some text and outputs an embedding, i.e. some features in vector form that our unet can use for training along with the images in their latent representation form.
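
Here is a rough sketch of how a contrastive loss over text and image embeddings could be computed. It is my own simplification in plain Python, not CLIP’s actual code: the function names are hypothetical, the embeddings are made-up two-dimensional vectors, and the `temperature` is a fixed assumed value (as far as I understand, CLIP learns it).

```python
import math

def dot(a, b):
    """Dot product: multiply element by element, then sum."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vector):
    """Scale a vector to length 1 so the dot product measures direction only."""
    length = math.sqrt(dot(vector, vector))
    return [x / length for x in vector]

def cross_entropy(scores, target):
    """Softmax cross-entropy for one row of similarity scores."""
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[target]

def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """Low when text i is most similar to image i, for every i."""
    texts = [normalize(v) for v in text_embs]
    images = [normalize(v) for v in image_embs]
    sims = [[dot(t, im) / temperature for im in images] for t in texts]
    n = len(sims)
    columns = [[sims[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    image_to_text = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (text_to_image + image_to_text) / 2

text_embs = [[1.0, 0.0], [0.0, 1.0]]
matching_images = [[0.9, 0.1], [0.1, 0.9]]
shuffled_images = [[0.1, 0.9], [0.9, 0.1]]

matched_loss = contrastive_loss(text_embs, matching_images)
mismatched_loss = contrastive_loss(text_embs, shuffled_images)
```

With these made-up vectors, the loss is far smaller when each caption lines up with its own image than when the images are shuffled, which is the behaviour training is meant to push the two encoders towards.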

🎨 What happens when we generate an image

When generating our image we can use the neural network we trained to progressively remove noise from our candidate image. We start off with a more or less completely noisy image, then apply the unet to it and it returns the noise that it calculates is sitting on top of a latent representation that approximates the vectorised version of our prompt. We take a fraction of that, remove it, and repeat a few times. (Currently that can take as many as 50 iterations before we reach a really impressive image, but new techniques are in review which would dramatically reduce the need for so many iterations.)
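
The loop described above can be sketched roughly as follows. This is a toy of my own and makes a big simplifying assumption: `toy_predict_noise` is a stand-in for the trained unet and simply cheats by knowing the target, whereas the real network has to infer the noise from the noisy latent and the prompt embedding. Real samplers also follow a noise schedule rather than a fixed `fraction`.

```python
import random

def toy_predict_noise(latent, target):
    """Stand-in for the trained unet (hypothetical): the 'noise' is whatever
    separates the current latent from the target."""
    return [x - t for x, t in zip(latent, target)]

def generate(target, steps=50, fraction=0.1, seed=0):
    """Start from random noise and repeatedly remove a fraction of the predicted noise."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]   # pure noise to begin with
    for _ in range(steps):
        noise = toy_predict_noise(latent, target)
        latent = [x - fraction * n for x, n in zip(latent, noise)]
    return latent

target = [0.2, -0.5, 0.9]   # pretend this is the latent our prompt describes
result = generate(target)
```

Each step shrinks the remaining noise by the same factor, so after 50 steps only a sliver of the starting noise is left. Removing only a fraction at a time is the point: no single prediction has to be perfect.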

Note that it is during the inference stage that we need the decoder part of our (VAE) autoencoder to turn a latent tensor representation of an image into a fully-fledged large picture.
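
One more detail of the generation loop is the guidance scale (see the glossary). My understanding is that at each step the network makes two noise predictions, one with the prompt and one without, and the final prediction is pushed away from the unprompted one in the direction of the prompted one. The sketch below assumes that formulation; `guided_noise` is a hypothetical name and the numbers are invented. The default of 7.5 is the value I have seen used in the HuggingFace pipeline.

```python
def guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=7.5):
    """Combine two noise predictions: start from the unprompted one and move
    guidance_scale times the difference towards the prompted one."""
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_without_prompt, noise_with_prompt)
    ]

noise_without_prompt = [0.0, 1.0]   # invented numbers, for illustration only
noise_with_prompt = [1.0, 1.0]

strong = guided_noise(noise_without_prompt, noise_with_prompt)
neutral = guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=1.0)
ignored = guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=0.0)
```

A scale of 1 just returns the prompted prediction and a scale of 0 ignores the prompt entirely. As far as I can tell, a negative prompt works by replacing the ‘without prompt’ prediction with one made for the negative prompt, so the result is steered away from it.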

🎺 How to play & practice for part II

I also wanted to briefly take a second to reflect on what might be useful as ways to get practically involved during the coming weeks. In part one, the instruction was fairly simple: “train lots of models”. In part two, the practicality is initially still there, it seems, but there will be other areas of emphasis. The things that seem to make sense to me currently are:

  • continue to blog as a way of reflecting and developing my understanding
  • understand and digest the core concepts that are introduced
  • whenever the ‘code everything from scratch’ part of the course starts, make sure to at least attempt this on my own alongside whatever is being showcased in lectures
  • discuss areas where concepts are unclear during the weekly Delft FastAI study group calls that I organise

I suspect that the discipline of the coding will be most instructive, once we get to it, though by extension probably also one that comes with the most struggle.

Following a session of the Delft Study Group, I gathered some more suggestions for how to get the most out of this part 2:

  • get hands-on as much as possible
  • ‘the details matter’ and try to go above and beyond with the course and you’ll be rewarded
  • blog and explain what you’re learning
  • answer questions on the forums as a way of cementing your learning
  • don’t let yourself get blocked by ideas that you don’t understand along the way. Keep following along with the course and more likely than not these things will clear themselves up

📖 Glossary of Core Terms

(Listed alphabetically, not in the order of exposition. Also these reflect my current understanding which is not always complete, so I’ll keep this updated as my understanding grows.)

  • Analytic derivatives — This is a faster way of calculating the gradients for our image, such that we calculate the whole set at once. This is what PyTorch uses under the hood in conjunction with a GPU to speed up the training process.
  • Autoencoder (model) — This is a combination of an encoder and a decoder. This model is a neural network with a series of layers that progressively ‘compress’ an image (through convolutions) until the point where it is much smaller. At this point the representation is called a ‘latent’. Then (in the full autoencoder) the image is progressively scaled back up into its full version.
  • CLIP — This is a model that turns text (and images) into embeddings in a shared vector space, trained using ‘contrastive loss’. It was developed by OpenAI.
  • Contrastive loss — This is a loss function built on comparing the similarity of two vectors. We multiply them together element by element and sum up all the values. (This process is also known as the dot product.) If the two vectors are similar, we would expect the number to be large; the loss rewards large values for matching text-image pairs and small values for mismatched ones. Contrastive loss is used in CLIP.
  • Convolutional layer — This is a key part of computer vision and it is a way of representing images at different resolutions. Images are either scaled up or down through a convolutional layer (which seems to be some way of averaging the values of an image). Convolutional is the C in CNN.
  • Decoder — This is the part of an autoencoder that takes a latent representation and scales it back up (i.e. ‘decompresses’ it) to its full representation.
  • Differential equations — This is a part of mathematics which is really important for Stable Diffusion and whose language forms the context and backdrop for discussions around this technique, but it is a fairly different set of vocabulary from what we use for deep learning.
  • Diffusion sampler — This is the part of the process which relates to adding or subtracting noise from an image.
  • Dot product — This is the process by which we multiply two vectors by each other and sum up the values. It is used when we are calculating contrastive loss, but it is a common linear algebra calculation.
  • Embedding — This is a representation of something as a vector, particularly useful for deep learning. In particular, it’s useful for areas like text where we might, for example, have semantic fields that we want to represent as being similar to each other, but we need to do so in a way that is comprehensible and processable by a machine. Embeddings allow us to do this in vector space.
  • Encoder — This is one part of an autoencoder that takes a full sized image and passes it through a series of convolutions such that at the end we have a significantly reduced tensor that is known as a latent.
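  • Finite differences (finite differencing) — This is one way of calculating the gradients for our image, but it is done pixel by pixel: nudge one pixel slightly, see how much the output changes, and repeat for every pixel. It is quite slow. (Contrast with analytic derivatives.)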
  • Guidance + guidance scale — This is the prompt that we pass into our Stable Diffusion model. The guidance scale is what we can pass in to our generation function call to specify how much we want the prompt to be strictly followed.
  • Latent representation(s) — This is the intermediate product in the middle of an autoencoder. It is what is produced by an encoder, and it is what is consumed by a decoder. It is sort of a compressed version of all the important pieces of information relating to a particular image, for example.
  • Momentum — This is a technique used by optimizers in which if we increase the same parameters (or weights) several times in a row, then it seems likely that we’ll do that again so we can increase the learning rate for those parameters so that we don’t have to make so many iterations.
  • Negative prompts — You can pass in a negative prompt along with your prompt and it is a way somehow of ensuring that the resulting image does not correspond to whatever was in the negative prompt. (Think of it as ‘subtracting’ from the main prompt, which seems to be what is going on under the hood, in vector space).
  • Noise — Noise is random data, with no meaning as such. It is important in the world of Stable Diffusion because the work of generating the image is the work of removing noise.
  • Perceptual loss — This is another kind of loss function that may play a role in Stable Diffusion going forward in the course.
  • Pipeline — This is the concept that is used by the HuggingFace Diffusers library, out of which our images are generated.
  • Score function — This is another way of stating the gradients for our image. I.e. the representation of what needs adjusting (and by how much) in order to remove noise from our image.
  • Step — This is one iteration of the inference process.
  • Textual Inversion — This is the process of creating a new embedding for a single specific concept or item (e.g. the Indian watercolour example in the course).
  • Time step — This is a concept from the way the original Stable Diffusion creators thought about things. It represents a way to go from a value to an amount of noise that gets added to an image.
  • VAE — This is the specific kind of autoencoder used in Stable Diffusion.
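
To illustrate the difference between the ‘finite differences’ and ‘analytic derivatives’ entries above, here is a small sketch of my own. It assumes a made-up loss (the sum of squared pixel values) purely because its derivative is easy to write down by hand; in practice PyTorch works out the analytic gradients for us.

```python
def loss(pixels):
    """Made-up loss for illustration: the sum of squared pixel values."""
    return sum(p * p for p in pixels)

def finite_difference_grad(f, pixels, eps=1e-6):
    """Nudge one pixel at a time and measure how much the loss changes.
    Needs one extra evaluation of f per pixel, which is why it is slow."""
    base = f(pixels)
    grads = []
    for i in range(len(pixels)):
        nudged = list(pixels)
        nudged[i] += eps
        grads.append((f(nudged) - base) / eps)
    return grads

def analytic_grad(pixels):
    """The derivative of p*p is 2*p, so every gradient comes out in one pass."""
    return [2 * p for p in pixels]

pixels = [0.5, -1.0, 2.0]
slow = finite_difference_grad(loss, pixels)
fast = analytic_grad(pixels)
```

Both approaches agree to within a tiny rounding error, but the first needs one extra evaluation of the loss per pixel, which becomes painful for an image with hundreds of thousands of values.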

✖️➗ New Mathematical Symbols

  • ∑ — means to sum up
  • ∇ — (nabla) we use this symbol rather than single-variable derivative notation (such as d/dx) in representing the gradient because we are talking about the gradients of many pixel values and not just a single one
  • β — (beta) — used in relation to the time steps to represent the amount of noise or variance. I think this is used instead of the letter σ (sigma), but I might be wrong on that.
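
To show how β and the time steps might fit together, here is a sketch under assumptions of my own: a linear schedule running from 1e-4 to 0.02 over 1000 steps (values I believe are common in the diffusion literature, not something taken from the lecture), with hypothetical function names. Each β is treated as the variance of the noise added at that step, and the running product of (1 − β) tracks how much of the original image survives.

```python
import math
import random

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """One beta per time step, rising evenly from beta_start to beta_end (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def signal_kept(betas):
    """Running product of (1 - beta): the share of the original variance left at each step."""
    kept, history = 1.0, []
    for beta in betas:
        kept *= 1.0 - beta
        history.append(kept)
    return history

def add_noise(values, t, kept, rng):
    """Jump straight to time step t by mixing the original values with Gaussian noise."""
    a = kept[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in values]

betas = linear_betas()
kept = signal_kept(betas)
rng = random.Random(0)
latent = [1.0, -1.0, 0.5]
slightly_noisy = add_noise(latent, 10, kept, rng)    # early step: mostly signal
mostly_noise = add_noise(latent, 999, kept, rng)     # last step: almost pure noise
```

Under these assumptions the time step is just an index into a lookup table of noise amounts: early steps keep nearly all of the image and the final step keeps almost none.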

❓ Enduring Questions

Some of the questions which I have in my mind following the class include:

  • These ‘generative’ models intersect with the field of art and creativity, at least nominally, but to what extent can we even say that they are generating something versus simply repeating things that they’ve already seen? (see also, the ‘stochastic parrots’ paper)
  • SKILL: are there tricks or best practices when deciphering jargon-rich papers down to their core and, in doing so, being able to see the parts of the paper that are new (versus the parts that are just standard practice)?
  • SKILL: what does it mean to be ‘impractical’ i.e. do research in this field? What is involved and how is it generally or most usefully done?
  • What are the useful or fundamental innovations involved in Stable Diffusion?
  • What are the parts of ML/DL and/or Stable Diffusion that we do because of time or hardware or cost limitations as opposed to the things that we do because they are the right way to approach this particular problem? (provoked by the whole detour down into autoencoders that seems mainly to be there to improve iteration speed.)
  • What’s the bigger takeaway for the field as a whole? In other words, what things can we think about doing now with these new techniques?
  • Why does the whole ‘time steps’ conversion stage happen at all? i.e. why can’t we just choose a random number to represent how much or little noise we apply to an image for our training data?
  • The autoencoder / compression step seems like an amazing technique all of its own. Is it really lossless, or is some information lost along the way?