- Seven steps for a machine to learn
- Calculating gradients with PyTorch
- 'Stepping': using learning rates to figure out how much to step
- Takeaways from the seven-step process
!pip install -Uqq fastbook nbdev torch import fastbook fastbook.setup_book() from fastai.vision.all import * from fastbook import * import torch from torch.utils.data import Dataset from torchvision import datasets from torchvision.transforms import ToTensor import matplotlib.pyplot as plt
In the previous post I showed a naive approach to calculating the similarity or difference between images, and how that could be used to create a function that did pretty well at estimating whether any particular image was a pullover or a dress.
Chapter 4 of the fastbook then takes us on a journey showing a smarter approach where the computer can make even better estimations and predictions. The broad strokes of this approach are simple to grasp, but of course the individual details are where the nuances of machine learning are to be found.
1. Initialise a set of weights
We use random values start with. There could potentially be more elaborate / fancy ways to calculate some starting values that are closer to our end goal, but in practice it is unnecessary because we have a process that we are going to use to update our values.
2. Use the weights to make a prediction
We check whether the sets of weights we have currently set as part of our function are the right ones. We check the prediction (i.e. what came out the other end of our model / function) and get a sense of how well our model did.
3. Loss: see how well we did with our predictions
We calculate the loss for our data. How well did we do at predicting what we were trying to predict? We use a number that will be small if our function is doing well. (Note: this is just a convention and otherwise there's no special reason for this.)
4. Calculate the gradients across all the weights
We are calculating the gradient for lots of numbers, not just a single value. For every stage, we get back the gradient for all of the numbers. This is done sequentially, where we calculate the gradient for one weight / number, keeping all the other numbers constant, and then we repeat this for all the other weights.
Gradient calculation relates to calculating derivatives and with this we are stepping firmly into the space of calculus. I don't fully understand how all this works, but intuitively, the important thing to know is this: we are calculating the change of the value, not the value itself. I.e. we want to know how things will change (by how much, and in what direction) if we shift this value slightly.
Note that the process of calculating the gradients is a performance optimisation. We could just as well have done this with a (slower) manual process where we adjust a little bit each time. With the gradient calculations we can take bigger steps in the direction we want, with more precision guiding our guesses about the direction and distance we want to go.
5. 'Step': Update the weights
This is where we increase or decrease our own weights by a small amount and see whether the loss goes up or down. As hinted in step 4, we use calculus to figure out:
- which direction to go in
- how much we should increase or decrease our own weights
6. Repeat starting at step 2
This is an iterative process.
7. Iterate until we decide to stop
There are various criteria determining when we should stop. Some possible stopping points might include:
- when the model is 'good enough' for our use case
- when we have run out of time (or money!)
- when the accuracy starts getting worse (i.e. the model isn't performing as well)
For those of us who don't bring a strong mathematics foundation into this space, the mention of calculus, derivatives and gradients isn't especially reassuring, but rest assured that PyTorch can help us out during this process.
There are three main parts to using PyTorch to calculate gradients (i.e. step four of the seven steps listed above.)
For any Tensor where we know we're going to want to calculate the gradients of
values, we call
.require_grad() on that Tensor.
def f(x): return x**2 x_tensor = torch.tensor(3.).requires_grad_() y_tensor = f(x_tensor) y_tensor
Here we can see that 3 squared is indeed 9, and we can see the
grad_fn as part
of the Tensor.
This actually refers to backpropagation, something which is explained much later
in the book. This step is also known as the 'backward pass'. Note, that this is
again another piece of jargon that we just have to learn. In reality this method
might as well have been called
I can't explain why this is the case, since I've never learned how to calculate gradients or derivatives (or anything about calculus, for that matter!) but in any case it's not really important.
Note that we can do this whole process over Tensors that are more complex than illustrated in the above simple example:
complex_x = tensor([3., 5., 12.]).requires_grad_() def f(x): return (x**2).sum() complex_y = f(complex_x) complex_y.backward() complex_x.grad
tensor([ 6., 10., 24.])
Something else I discovered while doing this was that gradients can only be
calculated on floating point values, so this is why when we create
complex_x we create them with floating point values (
3. etc) instead of
just integers. In reality, I think there will be some kind of normalisation of
our values as part of the process, so they would probably already be floats,
but it's worth noting.
Now that we know how to calculate gradients, the key question for the fifth step in our seven-step process is the following:
"How much should we shift our values based on what we get back from our gradient calculations?
We call this amount the 'learning rate', and it is usually a value between 0.001 and 0.1. Very small, in other words :) We adjust our weights / parameters by this basic equation:
weights -= weights.grad * learning_rate
This process is called "stepping the parameters" and it uses an optimisation step.
The learning rate shouldn't be too high, else our loss can get higher / worse or otherwise we can just bounce around within the boundaries of our function without ever reaching the optimum (== lowest) loss.
At the same time, we shouldn't use a very tiny learning rate (i.e. even tinier than the 0.001-0.1 mentioned above) since then the process will take a really long time and while we might reach the optimum loss at some point, it might not be the fastest way to get there.
Most of this makes a lot of intuitive sense to me. The parts that don't are what is going on with the gradient calculations and finding of derivatives and so on. For now, it appears that I can get away without understanding precisely how that works. It is enough to appreciate that we have a way to make these calculations, and those calculations are optimised for us