Understanding Backpropagation
The L-1 neurons’ output values, in turn, are influenced by weights applied to inputs they receive from L-2. So we can differentiate the activation functions in L-1 to find the partial derivatives with respect to the weights applied to L-2’s contributions. These partial derivatives show us how any change to an L-2 weight will affect the outputs in L-1, which would subsequently affect the output value of Lc and thereby affect the loss function.
Model training typically begins with a random initialization of weights and biases. Model hyperparameters, such as the number of hidden layers, the number of nodes in each layer and activation functions for specific neurons, are configured manually and not subject to training. The nodes of the input layer receive the input vector, and each passes its value, multiplied by some random initial weight, to the nodes of the first hidden layer.
Testing Neural Network
A key idea of neural nets is to decompose computation into a series of layers. In this chapter we will think of layers as modular blocks that can be chained together into a computation graph. Figure 14.1 shows the computation graph for the two-layer multilayer perceptron (MLP) from Chapter 12.
A low learning rate makes it more likely that each step moves in the right direction, but taking so many small steps is time-consuming and computationally expensive. A high learning rate is computationally efficient, but risks overshooting the minimum.
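The learning-rate trade-off can be seen on a toy problem. The sketch below is only an illustration: it assumes a one-parameter function f(w) = w², and the function name `gradient_descent` and the three learning rates are made up for this example.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimize f(w) = w**2 from `start`; return every value of w visited."""
    w = start
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad       # step against the gradient
        path.append(w)
    return path

slow = gradient_descent(lr=0.01)      # tiny steps: still far from 0 after 20 steps
good = gradient_descent(lr=0.3)       # moderate steps: very close to the minimum
too_high = gradient_descent(lr=1.1)   # overshoots: |w| grows on every step

print(abs(slow[-1]), abs(good[-1]), abs(too_high[-1]))
```

With the low rate the parameter has barely moved after 20 steps, while with the high rate every step jumps past the minimum and lands farther away than before.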
Backpropagation in Neural Network
The hidden units take the weighted sum of these output values as input to an activation function, whose output value (conditioned by a random initial weight) serves as input to the neurons in the next layer. This continues until the output layer, where a final prediction occurs. A neural network consists of a set of parameters – the weights and biases – which define the outcome of the network, that is the predictions.
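As a rough sketch of the forward pass just described, the following code pushes an input vector through two randomly initialized layers. It assumes sigmoid activations throughout, and the layer sizes and the helper names `init_layer` and `forward` are hypothetical choices for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Random initial weights (one row per neuron) and biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def forward(x, layers):
    """Feed x through each layer: weighted sum, add bias, apply activation."""
    a = x
    for weights, biases in layers:
        a = [sigmoid(sum(w * ai for w, ai in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a

rng = random.Random(0)                       # fixed seed for repeatability
layers = [init_layer(3, 4, rng),             # 3 inputs -> 4 hidden units
          init_layer(4, 2, rng)]             # 4 hidden units -> 2 outputs
prediction = forward([0.5, -0.2, 0.1], layers)
print(prediction)
```

Because the weights are random, the prediction is meaningless at this point; training is what adjusts the weights and biases so that the output becomes useful.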
But each mini-batch gives a pretty good approximation, and if there are 100 mini-batches, each step takes 1/100th of the total time. And after 100 steps, each piece of training data will have had its chance to influence the final result. Or, rather, in principle it should, but for computational efficiency, we’ll do a little trick later to keep you from needing to hit every single example for every single step.
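A minimal sketch of the mini-batch idea, under assumptions that are not from the text: a one-parameter model y = w·x, noiseless data generated with w = 3, a squared-error loss, and hypothetical helper names `minibatches` and `batch_gradient`.

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle a copy of the data and yield it in consecutive chunks."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

def batch_gradient(w, batch):
    """Average gradient of 0.5 * (w*x - y)**2 with respect to w over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

rng = random.Random(0)
data = [(x / 100.0, 3.0 * x / 100.0) for x in range(1, 101)]  # 100 examples

w = 0.0
for epoch in range(50):
    # 10 mini-batches of 10 examples: 10 cheap steps instead of 1 expensive one
    for batch in minibatches(data, batch_size=10, rng=rng):
        w -= 0.5 * batch_gradient(w, batch)

print(w)  # approaches 3.0
```

Each step only looks at 10 of the 100 examples, yet over one pass through the shuffled data every example gets to influence the parameter once.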
Generalized Equations
It is general to a large family of computation graphs and can be used not just for learning parameters but also for optimizing data. Since the forward pass is also a neural network (the original network), the full backpropagation algorithm (a forward pass followed by a backward pass) can be viewed as just one big neural network. The parameter gradients can be computed from this network via one additional matrix multiply (matmul) for each layer of the backward network.
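For a single linear layer, the "one additional matmul" can be sketched as below. This is an illustration under stated assumptions rather than the chapter's own code: the layer is y = Wx with column vectors and no bias, the upstream gradient `grad_y` is simply given, and the helper names are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def outer(u, v):
    """Outer product u v^T, written as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]            # 2 outputs, 3 inputs
x = [1.0, 0.0, -1.0]

y = matvec(W, x)                 # forward pass: y = W x
grad_y = [0.5, -1.0]             # assumed gradient of the loss w.r.t. y

grad_x = matvec(transpose(W), grad_y)   # backward pass: W^T times grad_y
grad_W = outer(grad_y, x)               # parameter gradient: grad_y times x^T

print(y, grad_x, grad_W)
```

The backward pass reuses the same weights (transposed), and the gradient for the weights themselves costs one more product between the incoming gradient and the layer's stored input.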
Changing the Activations
It is harder to train an RNN because the model can be very sensitive to changes in the connecting weights. During the training phase, gradient descent and a modified version of backpropagation (BPTT, as mentioned earlier) would be used to adjust all the weights and biases. However, because of the feedback loops in an RNN, connecting weights can become compounded many times until their effect becomes very large. This causes algorithms like gradient descent to perform very poorly because the high weights cause proportionally high changes in parameters, as opposed to the tiny changes required to find minimum points. On the other hand, if connecting weights are too small to begin with, then the compounded values can quickly approach zero, which is called the vanishing gradient problem.
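The compounding effect can be illustrated with a deliberately simplified model: a single scalar connecting weight multiplied into the signal once per time step, ignoring activation functions. The function name `compounded_factor` and the values 1.5, 0.5 and 50 are arbitrary choices for this sketch.

```python
def compounded_factor(weight, time_steps):
    """Multiply by the same connecting weight once per time step."""
    factor = 1.0
    for _ in range(time_steps):
        factor *= weight
    return factor

exploding = compounded_factor(1.5, 50)   # weight slightly above 1
vanishing = compounded_factor(0.5, 50)   # weight below 1

print(exploding, vanishing)
```

A weight only modestly above 1 grows into the hundreds of millions after 50 steps, while a weight of 0.5 shrinks to practically nothing, which is why both cases make gradient descent behave poorly.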
I must also attribute use of some code from his network class from the neural network series. If you’re not familiar with his channel do yourself a favor and check it out (3B1B Channel). While manim was my tool of choice it’s not the easiest and at some point between ‘I’ve gone too far to stop now’ and ‘I’ve bitten off way more than I can chew’ I may have regretted this decision, but here we are.
- In the forward pass, inputs are passed through the network, activating the hidden and output layers using the sigmoid function.
- The output z of the neuron is modified by a connecting weight w_c, and the result is included in the sum that makes up the input of the same neuron.
- Modern deep neural networks, often with dozens of hidden layers each containing many neurons, might comprise thousands, millions or—in the case of most large language models (LLMs)—billions of such adjustable parameters.
Neurons that are way off require big nudges, but neurons that are pretty close to correct only require little nudges. At this point we have a bunch of formulas that will be hard to keep track of if we don’t do some bookkeeping. That means putting the terms into matrices so we can manage them more easily. Figure 5 shows how the terms are grouped. It’s worth noting the use of the Hadamard operator (a circle with a dot inside), which denotes element-wise matrix multiplication and helps simplify the matrix operations.
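As a small illustration of the Hadamard product in this setting, the sketch below computes an output-layer error term as (a − y) ⊙ σ′(z). The quadratic cost, the sigmoid activation and the sample numbers are assumptions made for the example; they are not taken from Figure 5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def hadamard(u, v):
    """Element-wise product of two equally sized vectors."""
    return [a * b for a, b in zip(u, v)]

z = [0.0, 2.0, -1.0]                 # weighted inputs of three output neurons
a = [sigmoid(v) for v in z]          # their activations
y = [1.0, 0.0, 0.0]                  # desired outputs

cost_grad = [ai - yi for ai, yi in zip(a, y)]             # dC/da for a quadratic cost
delta = hadamard(cost_grad, [sigmoid_prime(v) for v in z])

print(delta)
```

Each neuron's error is computed independently of the others, which is exactly what the element-wise product expresses: no sums across neurons, just one multiplication per position.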
Example of Back Propagation in Machine Learning
Neural networks make predictions once the original input data has made a “forward pass” through the entire network. The analogy is not perfect since the untrained network is not really “thinking” about a 2 when it sees this example; it’s more that the label on the training data is hardcoding what the network should be thinking about. After each forward pass, a “loss function” measures the difference (or “loss”) between the model’s predicted output for a given input and the correct predictions (or “ground truth”) for that input. In other words, it measures how different the model’s actual output is from the desired output. Working backward from the model’s output, backpropagation applies the “chain rule” to calculate the influence of changes to each individual neural network parameter on the overall error of the model’s predictions.
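To make the idea of a loss function concrete, here is a minimal sketch using mean squared error, which is only one of several possible choices. The function name `mse_loss` and the example vectors are hypothetical.

```python
def mse_loss(predicted, target):
    """Mean squared difference between the model output and the ground truth."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

target = [0.0, 0.0, 1.0]             # ground truth: the third class is correct
untrained = [0.4, 0.3, 0.3]          # output of a model that has not learned yet
trained = [0.05, 0.05, 0.9]          # output of a model that is nearly right

print(mse_loss(untrained, target), mse_loss(trained, target))
```

The loss shrinks as the output approaches the ground truth, and it is this single number that backpropagation differentiates with respect to every weight and bias.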
A full treatment of backpropagation requires familiarity with matrix operations, calculus, and numerical analysis, among other things. Such topics fall well outside the scope of this text, but there are many resources online that go into more depth, such as Neural Networks and Deep Learning by Michael Nielsen (2019) and What Is Backpropagation Really Doing? More general formulations of the backpropagation algorithm can be found in the following links. We can see that even for this very small and simple neural net, the calculations easily get overwhelming. Please find the detailed calculation of the derivative of the sigmoid function in the appendix of this post. The full set of operations for a pointwise layer is shown next in Figure 14.13.
- The activation function through its derivative plays a crucial role in computing these gradients during Back Propagation.
- Because the process of backpropagation is so fundamental to how neural networks are trained, a helpful explanation of the process requires a working understanding of how neural networks make predictions.
- In the same vein, you could also drive your car, as many of us do, without the faintest idea of how an engine really works.
- It would be much more convenient to have a model that can take different amounts of input data.
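Since the derivative of the activation function appears in every gradient, it may help to see it worked out for the sigmoid. The following is the standard derivation, written with σ for the sigmoid and z for its input (symbols chosen here for illustration):

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
\qquad\Longrightarrow\qquad
\sigma'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
           = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
           = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```

The practical upshot is that the derivative can be computed from the activation the forward pass already produced, with no extra exponentials.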
Error Calculation
By formalizing that process into a straightforward equation and implementing a few lines of code in Python, you can automate that process for each weight in the network. In the forward pass, inputs are passed through the network, activating the hidden and output layers using the sigmoid function. In practice, it takes computers an extremely long time to add up the influences of every single training example for every single gradient descent step.
The terms loss function, cost function and error function all refer to a quantity we want to minimize. As will be explained in the following sections, backpropagation is a remarkably fast, efficient algorithm to untangle the massive web of interconnected variables and equations in a neural network. This process continues until the network produces the desired output. Although we can’t directly change the activations, it’s helpful to keep track of what adjustments we wish to take place in this output layer. But remember, we only have control over the weights and biases of the network. So we will have to nudge those weights and biases in a way that improves the output.
We want to adjust all these other output neurons too, which means we will have many competing requests for changes to activations in the previous layer. For this article, let’s begin with a complete disregard for notation and instead step through the effect that each training example has on the weights and biases. Hopefully, these effects will feel intuitive so that by the time we return to the notation, it acts to articulate something you already know, rather than acting as a code to be decrypted. An important point to note is that for a given layer all but the last term will be the same as the equations we just found with respect to a given weight. The last term is simply the bias, which is assigned a value of 1 (the bias weight terms are used to adjust the bias).
We’ve also discussed gradient descent, so you should know that when people describe a network as “learning,” they mean finding the weights and biases that minimize a certain cost function. The ultimate goal of backpropagation is to find the change in the error with respect to the weights in the network. If we’re looking for the change of one value with respect to another, that’s a derivative. For our computational map each node represents a function and each edge performs an operation on the attached node (multiplication by the weight).
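The "change in the error with respect to a weight" can be checked numerically. The sketch below assumes a single sigmoid neuron with a squared-error loss; the function names and the sample values are hypothetical. It compares the chain-rule derivative with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, b, x, y):
    """Squared error of one sigmoid neuron: 0.5 * (a - y)**2 with a = sigmoid(w*x + b)."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def grad_w(w, b, x, y):
    """Chain rule: dE/dw = dE/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dE_da = a - y
    da_dz = a * (1.0 - a)
    dz_dw = x
    return dE_da * da_dz * dz_dw

w, b, x, y = 0.8, -0.3, 1.5, 1.0
analytic = grad_w(w, b, x, y)

eps = 1e-6   # small nudge to the weight for the numerical estimate
numeric = (error(w + eps, b, x, y) - error(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)
```

The two numbers agree to many decimal places. The negative sign says that increasing this weight would reduce the error, so gradient descent would nudge it upward.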
The softmax value of each output neuron represents the likelihood, out of 1, that an input belongs to its category. In a perfectly trained model, the neuron representing the correct classification would have an output value close to 1 and the other neurons would have an output value close to 0. You have now seen the intuition behind all the code (like you might find in Nielsen’s book) that goes into building a simple neural network. RNNs and LSTMs are a stepping stone to very sophisticated AI models, which we will discuss in the next section. Parameter sharing consists of a single parameter being sent as input to multiple different layers. We can consider this as a branching operation, as shown in Figure 14.21.
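One consequence of treating parameter sharing as a branch is that, in the backward pass, the gradients arriving along each branch are added together. The toy example below is an assumption-laden sketch rather than the chapter's own code: a single scalar parameter `w` is used by two consecutive "layers", and the function names are hypothetical.

```python
def shared_forward(w, x):
    """The same parameter w is used by two layers in sequence."""
    h = w * x        # first use of w
    out = w * h      # second use of w
    return out

def shared_grad(w, x):
    """Gradient of out with respect to w: sum the contribution of each use."""
    h = w * x
    grad_second_use = h          # d(out)/dw with h held fixed
    grad_first_use = w * x       # d(out)/dh * dh/dw
    return grad_first_use + grad_second_use

w, x = 3.0, 2.0
analytic = shared_grad(w, x)

eps = 1e-6
numeric = (shared_forward(w + eps, x) - shared_forward(w - eps, x)) / (2 * eps)

print(analytic, numeric)
```

Here the output is w²x, whose derivative 2wx is exactly the sum of the two branch contributions.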
In supervised learning, which uses labeled data, ground truth is provided by manual annotations. In self-supervised learning, which masks or transforms parts of unlabeled data samples and tasks models with reconstructing them, the original sample serves as ground truth. Though equivalents and predecessors to backpropagation were independently proposed in varying contexts dating back to the 1960s, David E. Rumelhart, Geoffrey Hinton and Ronald J. Williams first published the formal learning algorithm. Their 1986 paper, “Learning representations by back-propagating errors,” provided the derivation of the backpropagation algorithm as used and understood in a modern machine learning context. With that, every line of code that would go into implementing backpropagation corresponds to something you have now seen, at least in informal terms. But sometimes, knowing what the math does is only half the battle, and just representing the damn thing is where it gets all muddled and confusing.
To get an intuition for Equation 14.6, it can help to draw the matrices being multiplied. Below, in Figure 14.8, on the left we have the forward operation of the layer (omitting biases) and on the right we have the backward operation in Equation 14.6. To simplify an explanation of how backpropagation works, it will be helpful to first briefly review some core mathematical concepts and terminology.