Google Professional Data Engineer – TensorFlow and Machine Learning Part 12
August 6, 2023

26. Learning XOR

Let's begin with a question which I'd like you to keep in mind as you go through the video. Consider this statement: learning or reverse engineering the XOR function requires more than one neuron, but each of those multiple neurons can make do with only an affine transformation; we do not need an activation function. Is that statement true or false? Learning linear regression using a single neuron didn't seem too difficult, so let's now turn our attention to a slightly more complicated function: the XOR function. XOR takes in two inputs which are both bits, so each can take the values zero and one. The output of the XOR function will be one if the two inputs are different from each other.

If either both of them are zero or both of them are one, then the output will be zero. Now let me tell you upfront that reverse engineering the XOR function will require three neurons arranged in two layers, and it will also require us to make use of a nonlinear activation function. When you hear this, very likely your question is going to be: how do we know? How do we know that this particular arrangement of three neurons in two layers is going to learn XOR? The answer is that previous researchers worked it out. This brings home one of the challenges of working with neural networks.

Your job really is to experiment with different architectures and ultimately settle on some architecture which meets your purposes. Ideally, it ought to be the simplest architecture which can learn the function that you are interested in. So do keep this in mind when you are building your own neural networks for image processing, text processing, or anything else: the hard part of that job is trying out different architectures and finding one that works and is not too complicated. Returning now to XOR. XOR is typically represented using a truth table rather than a pseudocode representation. The truth table, reproduced below, is very succinct: it tells us at a glance how the output y depends on the inputs x one and x two.
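
Here it is, for reference:

```
x1   x2   y
 0    0   0
 0    1   1
 1    0   1
 1    1   0
```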

XOR is a very famous example of a function that cannot be learned using a linear activation function or just an affine transformation, and that's because XOR is not linearly separable. Let's understand what exactly that term means. Given that there are two inputs, x one and x two, let's represent these along a pair of axes, x one along one axis and x two along the other. Now let's go ahead and plot the values of y corresponding to the different combinations of x one and x two. For instance, when x one and x two are both zero, y is equal to zero, so we write y equals zero for that point at the origin. From our truth table, we know that when x two is equal to one and x one is equal to zero, the output y is equal to one.

Likewise, when only x one is equal to one, the output is one. And lastly, when both of the inputs x one and x two are equal to one, the output y is going to be zero. So we end up with four points in our two-dimensional space. Now, here's the deal: try drawing any straight line through this plane such that it separates the points by their output. In other words, try to find any straight line, if you can, such that all points on one side of the line have y equal to zero and all points on the other side have y equal to one. No matter how hard you try, you will be unable to find such a straight line, and this is why we say that XOR is not a linearly separable function.
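
For readers who want to see why no such line can exist, here is a quick sketch of the standard argument, using w1, w2, and b to denote a candidate line's weights and bias (notation introduced here only for this aside). Suppose the line w1*x1 + w2*x2 + b = 0 separated the points, with the y = 1 points on its strictly positive side. The points (0, 1) and (1, 0) would then require w2 + b > 0 and w1 + b > 0, while the points (0, 0) and (1, 1) would require b <= 0 and w1 + w2 + b <= 0. Adding the first two inequalities gives w1 + w2 + 2b > 0, so w1 + w2 + b > -b >= 0, which contradicts w1 + w2 + b <= 0. No choice of weights and bias works, so no separating line exists.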

This bit about linear separation has to do with a particular type of neural network called a single-layer perceptron. A perceptron is really just one neuron, and it represents the simplest type of neural network. In TensorFlow, however, we tend to model neurons and neural networks using directed acyclic graphs rather than geometrically. Perceptrons are more of a geometric representation, and that's why, while working with TensorFlow, we tend not to talk about them at much length. In any event, it is not possible to learn the XOR function using just one perceptron, or just one neuron, and we are going to need a more complex neural network like the one that you see on screen.

Now, once again, don't worry about how the folks who came up with this architecture knew that it would work; that's beyond the scope of this course. Instead, let's focus on how this architecture is trained and how, once it's trained, it's able to perfectly reproduce the XOR function. Let's go ahead and double-click on this network. We are going to open up the individual neurons: the inputs x one and x two will both flow into each of the neurons in the first layer. Each of these neurons, in turn, will have both an affine transformation and an activation function, and each of these neurons will have its own weights and biases.

Then the outputs of these two individual neurons will be passed into a third neuron. That neuron will also have an affine transformation and an activation function, and the output of the activation function of that third neuron will represent the output of the neural network as a whole. That output is what needs to match the truth table of the XOR function. There's a lot going on here, so let's break this down. Let's start with the first neuron: both inputs x one and x two will be passed into neuron one as the inputs to its affine transformation, which will have a couple of weights as well as a bias. The second neuron will have the same structure; that's why these two are in the same layer.

Once again, both inputs will be passed into each of neuron one and neuron two. Both neuron one and neuron two are in the first layer, but neuron three is in the second layer, and that's because it receives as its input the outputs of the original two neurons. Let's now zero in on the activation functions of the layer-one neurons. The XOR function is not linearly separable, so we need some nonlinear modeling in there somewhere, and this is what the activation functions provide. The activation function that we are going to use in both of these neurons will be ReLU, that is, the rectified linear unit, which we already alluded to. ReLU basically outputs the maximum of zero and its input.

So if you pass in some input x, the output of the ReLU activation function is simply going to be max(x, 0). So let's go ahead and pencil in ReLU as the activation function in each of the neurons in our first layer. Notice that this is a change from the linear regression case: there, because we did not require any nonlinear elements at all, we could do without an activation function; the affine transformation was all that we needed. These ReLU activations will take the outputs of the corresponding affine transformations and chain each of them into the input of the layer-two neuron. Now, it turns out that this much nonlinearity is enough for our neural network to learn XOR; the activation function of the layer-two neuron does not need to add any further nonlinearity, and so here all that we need is the identity activation function. The identity activation function just takes whatever input was passed in and sends that out as the output. The identity activation is what we had made use of implicitly when we modeled linear regression using a one-neuron neural network. This is now a fairly serious neural network: it has three neurons arranged in two layers.
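
As a tiny illustration, here is a minimal sketch of ReLU in plain Python with NumPy (not taken from the course code):

```python
import numpy as np

def relu(z):
    # Rectified linear unit: pass positive values through, clamp negatives to zero.
    return np.maximum(z, 0.0)

print(relu(3.0))                    # 3.0
print(relu(-2.0))                   # 0.0
print(relu(np.array([-1.0, 0.5])))  # [0.  0.5]
```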

Let's be clear on what the inputs are: they are x one and x two, which are fed equally into each of the neurons in layer one. So in a TensorFlow computation graph, x one and x two will be placeholders. The output of this entire neural network is the output of the activation function of the last neuron in the last layer. Neurons one and two together constitute layer one; that's because they are logically similar and have the same structure. The third neuron, which is by itself in layer two, has a different structure than the two previous neurons and, more importantly, it takes in as its inputs the outputs of the first layer. In this way, information only feeds forward in this particular neural network. Not all neural networks are like this, by the way.

This is a specific type of neural network known as a two-layer feed-forward neural network. Now, we mentioned that the inputs x one and x two are placeholders in a TensorFlow sense, but that still leaves a whole bunch of variables, because each of the neurons has weights and biases, which are TensorFlow variables. The values of these weights and biases will need to be determined during the training process. We are not actually going to implement this XOR neural network in TensorFlow, so we will not specify optimizers and so on. We will, however, go through the workings of this neural network in painful detail, and we shall see how the values of these variables are what allow this neural network to learn, and to reverse engineer, the truth table of the XOR function.
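
Although we will not implement this network in the course, a minimal sketch of how the graph construction might look, assuming the TensorFlow 1.x-style API exposed through tf.compat.v1, is shown below; the variable names are illustrative only.

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Inputs: each row is one (x1, x2) pair, so the inputs are placeholders.
x = tf.placeholder(tf.float32, shape=[None, 2], name="x")

# Layer 1: two neurons, each an affine transformation followed by ReLU.
W1 = tf.Variable(tf.random_normal([2, 2]), name="W1")
b1 = tf.Variable(tf.zeros([2]), name="b1")
layer1 = tf.nn.relu(tf.matmul(x, W1) + b1)

# Layer 2: one neuron, an affine transformation with identity activation.
W2 = tf.Variable(tf.random_normal([2, 1]), name="W2")
b2 = tf.Variable(tf.zeros([1]), name="b2")
y = tf.matmul(layer1, W2) + b2  # identity activation: the affine output is the network output
```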

Let's turn back to the question we posed at the start of this video. Learning or reverse engineering the XOR function does indeed require more than one neuron; that bit is certainly true. But, as we saw, the second clause of the statement is false. Because XOR is not a linear function, there is no way to draw a single line in the plane that separates the points with different outputs; XOR is not linearly separable. We therefore need to introduce some nonlinearity into our neural network for it to successfully learn or reverse engineer XOR. So the statement on screen is false: we require multiple neurons, and those multiple neurons will require nonlinear activation functions.

27. XOR Trained

Here is a question that I'd like you to keep in mind as you go through the contents of this video. Let's say we have created a neural network which learns, or reverse engineers, the XOR function. How is that neural network going to get the best values of the weights and biases for each of its neurons? Will you need to write a separate program explicitly to compute those best values? That's the question which I'd like you to think about. Let's now keep going with our look at the training process and the neural network which learns XOR. Remember that every neural network, no matter how complicated, just consists of a large number of interconnected neurons. Each one of these neurons has, in turn, some parameters: the weights and the biases, as well as whatever parameters its activation function has.

These weights and biases are determined during the training process. This is usually performed for us by TensorFlow, but we do have to specify a bunch of information, such as the cost function, the optimizer object, the number of steps, and the training data. In any case, for now let's gloss over the mechanics of the training process; we will return to that in gory detail in a little bit. Let's skip forward and understand how a trained neural network can reverse engineer, or recreate, XOR. Let's start with a focus on the first neuron. The weights and the bias of this neuron are determined during the training process, and those are the values that you see on screen now: both of the weights are one, and the value of the bias is equal to zero.

Once again, don't worry too much about how these constants were determined. The training process, if correctly set up with enough training data and the right optimizer, will converge to these values. W and b are the variables for the first neuron. Similar variables exist for the second neuron, and once again the training process finds the values of the weights and the bias there: the two weight vector elements are also one, and the bias this time is equal to minus one. Let's now discuss the variable values for the third neuron. Here, the weights that the training process produces are one and minus two, and the bias is equal to zero. We have now nailed down all of the variables in our neural network.

We have all of the weights and biases for each of the three neurons, so this is a fully trained neural network which is capable of reverse engineering XOR. Let's ensure that this is the case by satisfying ourselves that passing in the different combinations of inputs will actually give the same outputs as XOR. Let's start with x one and x two both equal to zero. We specify the values of these input variables as zero and feed them into the neurons in our neural network. Let's start with neuron number one, where both of the weights are equal to one and the bias is equal to zero. The product of W and x works out to zero.

This is added to zero, which is the bias, and this zero passes into the ReLU. The ReLU rectifies it, and the output of the ReLU is zero as well. The second neuron also accepts the same inputs, x one and x two. It also has both of its weight variables set to one, but the bias is equal to minus one, so the output of the affine transformation is minus one. This gets rectified by our ReLU to zero. So we now have both of the inputs into the affine transformation of the third neuron, and both of these are zero, as we just saw a moment ago. Once they pass through that affine transformation, the output is zero, and the identity activation function just returns that zero without changing it in any way.

And in this way, our final output was indeed zero when both of the inputs were zero. Let's move on to the next line in our truth table. This time, x two is equal to one and x one is still equal to zero. Let's pipe these inputs into our neurons. As before, the weights of neuron one are both one and the bias is equal to zero, so the output of the affine transformation is one, and the output of the ReLU is one as well. Let's consider now the second neuron: here, the output of the affine transformation works out to be zero, because the bias of minus one cancels out the input x two, which is equal to one.

The ReLU output is zero as well. Now we have both of the inputs into the affine transformation of the third neuron. These are one and zero, and the output of that affine transformation works out to one. The identity function does not tamper with it, and so the final output is one as well. We have successfully recreated the second line of our truth table. Let's move on to the third line. This time, x one is equal to one but x two is equal to zero. As always, we plug these into our two layer-one neurons. This case merely interchanges the values of x one and x two from the previous case, and so all of the neuron outputs remain the same.

The output of neuron one is going to be one, the output of neuron two will be equal to zero, and the output of the third neuron, which is in the second layer, will be one as well. This follows because this neural network is perfectly symmetric in x one and x two: when we merely interchange the values of x one and x two, nothing in the working of the neural network changes. So yet another line from the truth table has been successfully reverse engineered. Let's now move on to the last line. This one has both of the inputs equal to one. We plug those inputs into the first neuron, and the output of the affine transformation works out to be two, and the output of the ReLU works out to two as well.

Likewise, in the second neuron, the weights are the same, but the bias is now equal to minus one, so the affine transformation output is equal to one rather than two. This is fed into the ReLU and emerges as the output one. We now have the inputs into our layer-two neuron: they are two and one respectively. When we plug these into the affine transformation, its output emerges as zero: one times two, plus minus two times one, plus a bias of zero. It's here that the weight of minus two on the second input really plays an important role. The identity leaves that zero unchanged, and so the output of the network as a whole is equal to zero. You can see that we have now successfully reverse engineered the entire truth table of the XOR function.
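
To double-check the whole walkthrough, here is a short verification of the forward pass in plain Python with NumPy, using the trained weights and biases quoted above:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Layer 1: rows are the weights into neuron 1 and neuron 2; their biases are 0 and -1.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

# Layer 2: the third neuron's weights are 1 and -2, with a bias of 0.
W2 = np.array([1.0, -2.0])
b2 = 0.0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden = relu(W1 @ np.array([x1, x2], dtype=float) + b1)  # affine + ReLU
    y = W2 @ hidden + b2                                      # affine + identity
    print(f"x1={x1}, x2={x2} -> y={y:.0f}")
# Prints 0, 1, 1, 0: exactly the XOR truth table.
```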

This example was really instructive because of the use of a nonlinear activation function. Because XOR is not linearly separable, we would not have been able to reverse engineer it if we had not made use of the ReLU activation function in the neurons of layer one. In general, smart choices of activation function are a really important component of architecting a neural network so that it actually does what you would like it to do. A common practice is to have input layers use the identity function as the activation. This means that the output of the neuron as a whole is equal to the output of the affine transformation, which is just a weighted sum of the inputs.

If, for instance, the inputs represent pixels, the output of this first input layer is just going to be some kind of weighted combination of groups of pixels in an image. Those groups of pixels can then be operated on by the inner layers, which can recognize features such as corners or edges. Inner, hidden layers usually make use of ReLU as the activation function; that is a really common choice. Perhaps it's an exaggeration to say that they usually use ReLU, but ReLU is used really often, and it has nice mathematical properties. The output layer in our XOR example was simply the identity: whatever was the output of the affine transformation, that was the output of the final neuron. Identity activation in the output layer doesn't work all that well in general.

It only makes sense to use the identity when you are modeling a linear function, such as linear regression. There is another really common choice for the output layer, called the softmax activation function. We shall discuss this in a lot more detail when we get to logistic regression, or linear classification, later on in this course. The basic idea of softmax, however, is that each output will be a number between zero and one no matter what the inputs are, and it will always be interpretable as a probability. What's nice is that these probabilities, taken across all of the labels in the output set, will sum to one.
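
As a small preview, here is a sketch of softmax in plain Python with NumPy; the score values are made up purely for illustration:

```python
import numpy as np

def softmax(logits):
    # Shift by the maximum for numerical stability; the result is mathematically unchanged.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # roughly [0.659 0.242 0.099]: each value is between 0 and 1
print(probs.sum())  # 1.0 (up to floating-point rounding): the probabilities sum to one
```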

We will get to softmax activation in a lot more detail. For now, let's just absorb what we've learned. We saw how it was possible for a neural network to reverse engineer the XOR function. The XOR function can be thought of as a small piece of pseudocode, and we were able to reverse engineer it using three neurons arranged in two layers; we also saw how this network was able to perfectly recreate the truth table of the XOR function. The general idea of neural networks is exactly this: by adding an arbitrary number of layers connected in arbitrarily complex ways, we can learn an arbitrarily complicated program. Next, we are going to plunge into implementations of linear and logistic regression in TensorFlow.

But before that, let's just summarize what we've learned so far. We saw that a neuron is the smallest entity in a neural network, and that linear regression is a really simple function which can be reverse engineered, or learned, by just a single neuron. We also saw how a nonlinear function such as XOR could be reverse engineered using three neurons, and, extending this idea, pretty much any function can be learned using combinations of interconnected neurons. The architecture of such a neural network is something that you, as the programmer or the designer of the machine learning algorithm, need to experiment with. The other crucial step, which we've only hinted at so far, is training.

The training process involves finding the optimal values of the weights, biases, and other parameters inside each of those neurons. That's what we will turn to next. Let's return to the question we posed at the start of this video. Hopefully you'll recognize that this is an easy one. We do not need to explicitly train a neural network ourselves, and we do not ever need to write a program to find the best values of the weights and biases; that is something which TensorFlow does for us. All we need to do is specify the training data, the cost function, and the optimizer that we want to use, and the training process will then be carried out for us. It is abstracted away by TensorFlow.
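
To make that last point concrete, here is a hedged sketch of what specifying the training data, the cost function, and the optimizer might look like for the XOR network, again assuming the TensorFlow 1.x-style API via tf.compat.v1. Depending on the random initialization, a run may need more steps or a different learning rate to converge.

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Training data: the four rows of the XOR truth table.
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y_train = np.array([[0], [1], [1], [0]], dtype=np.float32)

x = tf.placeholder(tf.float32, [None, 2])
y_true = tf.placeholder(tf.float32, [None, 1])

# The same two-layer feed-forward network sketched earlier.
W1 = tf.Variable(tf.random_normal([2, 2]))
b1 = tf.Variable(tf.zeros([2]))
hidden = tf.nn.relu(tf.matmul(x, W1) + b1)
W2 = tf.Variable(tf.random_normal([2, 1]))
b2 = tf.Variable(tf.zeros([1]))
y_pred = tf.matmul(hidden, W2) + b2

# The pieces we have to specify: cost function, optimizer, and number of steps.
loss = tf.reduce_mean(tf.square(y_pred - y_true))
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5000):
        sess.run(train_step, feed_dict={x: X_train, y_true: y_train})
    print(sess.run(y_pred, feed_dict={x: X_train}))  # should approach [0, 1, 1, 0]
```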
