Computers can seem pretty dumb sometimes, can’t they? Why can’t they just learn how to do things like we do? Learning comes so effortlessly to us humans; we don’t even remember learning something as extraordinarily complicated as speech – it just sort of happened. If I showed you 10 pictures, 5 with cats in them and 5 without (actually this is the internet, so 11 of those 10 pictures would have cats in them, but bear with me) you could easily identify which images contained cats. Because computers are basically math machines, unless you can very precisely define what a cat *is*, then a computer will not be very good at such a task. That’s where neural networks come in – what if we could simulate a human brain? And like a human brain, what if we could purpose our simulation to only look at cats?

## What are neural networks?

My previous post, Decision Tree Learning, briefly alluded to neural networks as an alternative machine learning technique. At their core, neural networks seek to very coarsely emulate a brain. You probably know that the brain has neurons in it. Neurons are little cells that can send an electrical signal to another neuron. Some neurons are more strongly connected to each other than others and therefore the messages they send to each other have a larger effect. If a neuron receives a strong enough message from all its neighbors, it will in turn “activate” and send a message. So how can we use this simplified concept of neurons to identify cats?

Let’s begin with the simplest of neural networks: 1 neuron. In machine learning terms, this is called a Perceptron. The neuron receives a signal, and based on that signal it either fires or it doesn’t. Let’s talk a little more about this “signal” the neuron receives.

In our neural network, a neuron can receive a signal from a variety of sources. Imagine your brain was only a single neuron instead of the billions it actually is. This neuron receives signals from your eyes, your nose, your mouth, your ears, etc. Your eyes tell you that you are sensing something with four legs. Your nose smells kitty litter. Your ears hear a “meow”, and your mouth….? Let’s hope you’re not using that sense to identify if something is a cat or not.

Anyway, all these senses get passed directly into your brain, or single neuron in this case. When a neuron receives a signal, or combination of signals, this may cause it to fire. That is, the signals it receives combine in some way to form a message that means “you too, should fire”. If your brain, like many human brains, is made for saying “YES, that’s a kitty!” or “No, not a kitty.” then the neuron firing is akin to saying “Yes”, while not firing is akin to saying “No”.

## How does a perceptron work?

Let’s take away some of the mystique surrounding the perceptron, and neural networks in general. In a nutshell, a perceptron is a function that converts your input values into some output value.

That’s it! We make a fancy function, then feed it some numbers that describe the object we’re trying to identify, and it spits out some numbers, which we then interpret as a classification prediction. In slightly more formal terms, the “signal” that our lonely perceptron receives is a list of values (called a vector) from all the attributes of the object we’re trying to classify. Based on those values, we classify the object as “Cat” or “Not a cat”.

On top of simply receiving the values, we may also want to weight certain values higher than others. For example, there are thousands of animals that have four legs, but very few will regularly smell like kitty litter. Whether the object smells like kitty or not, then, should probably be more important to our final position. That is, it has more *weight*. Giving an attribute more weight is analogous to thickening the synapse between two neurons, strengthening the connection between them.

Here’s where I start getting a bit technical, but I’ll try to explain everything clearly. Feel free to skip all the math; I’ve tried to write this article to give you an intuition on how things work without it. Head to the comments to tell me what I need to clarify!

For each input attribute we need to determine a weight such that the sum of all weighted attributes is higher than the neuron’s threshold when it **should** fire, and below that threshold when the neuron **should not** fire.

Now, like all machine learners, we need a set of training data. That is, a bunch of objects that *are* cats, and bunch of objects that *aren’t* cats. Associated with each example are attribute values like shape, smell, sound, etc. Let’s call the set of training data **T**.

Each individual training example can be labeled t_{i} where indicates the example’s position in **T**’s list. If we have 3 attributes (shape, smell, sound) for each training example, then the attributes for t_{i} can be labeled as a_{i1}, a_{i2}, a_{i3}. More generally, there may be *k* attributes for each training example, so we can refer to them as a_{i1}, a_{i2}, …, a_{ij}, …, a_{ik}.

Now that we know what our input looks like, the objective is to determine how we should weight attributes. Some attributes may help our classification, and some may hurt it. Those that help it should be given a positive weight, and those that hurt it should be given a negative weight. Usually we start with some really small nonzero weights that we increase or decrease over time.

### Input Conversion

Now, the easiest way to apply a numerical weight to some attribute like “smell” would be to multiply them together. The only problem is, what does it mean to multiply English by some number? It doesn’t make any sense!

Therefore, we should **convert **these English descriptions to numbers. For example, let’s say you have values for your attribute “Smell” of “Kitty Litter”, “Grass”, “Hay”, and “Mud” that you want to change to numbers. The most logical thing to do is assign them values of 1, 2, 3, and 4, because that’s the order they appear in. I’ll talk a bit shortly about why this is a bad idea, and how we will change the values, but for now let’s go with it.

**Recap**: Okay, at this point we have training examples, which themselves have numeric attributes, and we have weights on each attribute. How does all this come together to give us “Cat” or “Not Cat”?

### Creating a Classification

What we’ll do to actually get an answer to the question “Is it a Cat?” is combine our attributes with the weights we have on them through multiplication. Then we’ll sum up each product to get a final value, which we’ll interpret as our answer (more on the interpretation part in a sec).

Clearly, the animal we’re trying to classify is a horse, not a cat. How do we interpret that output value of 3.1 as “Not a cat”, though? What a perceptron does is this: define some number to be a *threshold* value; values that lie on one side of it are Cats, and values on the other side are Not Cats. Usually this value is 0.0:

From the figures above you can see what a perceptron does: the multiplication of our attributes by weights forms a line. Some of the elements on this line are above 0, and some are below 0. Those above 0 we’ll say are cats, and those below we’ll say aren’t. For funsies, we add another weight to our perceptron:

This new attribute, Attribute 0 is considered to always be on. The reason for adding it (and an associated weight) is so that our line from above doesn’t always necessarily go exactly through (0, 0):

Notice how our line has moved up, but the area we classify as Cats has not? This is right; the number of items that will map to a point on the line above “0” is higher! We could have easily gone the other way to make fewer items be classified as cats.

#### Brief mathematical definition:

For one training example let’s compute a sum of all weights multiplied by their attributes:

Because I don’t like the term “Sum”, and it’s not really used in the literature, let’s replace it with the term “*y*”, and give it a subscript to indicate that it’s the sum for training example *i*:

This equation is actually the same as the equation for a line through the origin. This observation helps us to realize that what we’re really doing is trying to draw a line that best separates our “cat” and “not cat” examples. (See figures above).

To add a little extra ability to this function, it’s common to add one more “bias” weight that we’ll call weight 0 (). What this term does is give us the y intercept of our line, allowing it to not be forced to go through (0,0). In neural network terms, we’re adding a new attribute that always takes on a value of 1, and is our weight for it.

Those familiar with vector math might see that this summation of products is really the same as a vector multiplication:

**To recap**: For a training example that has attributes , and neural network weights on each attribute , we generate a linear sum . What does this sum mean? It’s just a number after all. Let’s say this: if is greater than, we’ll say that the neuron fires (yes, the object is a cat!), else it doesn’t (no, it’s not a cat). That is, we can just check the sign of , :

Alright, we now know how to classify our object if we have the numerical attribute values, and appropriate weights on those attributes. The purpose of a neural network is to learn these appropriate weights, which I’ll get to in a minute. First, let’s return to why the numerical attribute values we chose before were bad.

### Input Normalization

When we’re performing our multiplication of attribute values by their weights, won’t “Mud” inherently have a higher weight than “Kitty Litter”? Our neural network should theoretically be able to compensate by modifying weights on these attributes, but we should do this ourselves; It’s generally accepted that neural networks work better on normalized input. That is, we need to “normalize” our attribute values.

Normalization means that we change our input attribute values from the ones given into ones relative to each other. Quite often storing them relative to each other causes them to be in a smaller range. For example, if you have values that go from -1000000 to 1000000, it might be better to just divide everything by a million to get the range (-1.0, 1.0) (this is called Min-Max normalization).

There’s a variety of other ways to normalize data. The way I was taught, and the way I’ll describe here, is *z-score, *or* statistical* normalization. This kind of normalization treats all the given values of attributes in our training set as a population in the statistical sense. When I say “population” you might think of humans, and this is an appropriate intuition here. That is, for a population of humans, you’ll have an average height, average weight, etc. You can think of each humans’ height in terms relative to this average. That’s exactly what z-score normalization does; it replaces a value with one relative to the average.

In more technical terms, the z-score is defined as:

z is our z-score, m is our mean, s is our standard deviation, and x is the attribute value we are replacing. Let’s change our mapping from above to reflect our new normalized values (assume that mean = 1.4, std. deviation = 0.25, calculated from some training set not shown here):

The fact that normalization usually puts numbers into a smaller range can be beneficial because computers have a limit on the biggest number they can represent. Go over that number, and you come around the other side – a really really high positive number will become a really big negative number. Because we do a lot of multiplication and squaring operations in our neural network, it’s good to keep attribute values small.

In our case, normalization expanded the range, but if you notice, Kitty Litter became the only negative value. With negative weights on the Smell attribute, we’ll easy compensate for this to show that most objects being classified that smell like Kitty Litter are cats.

### Learning Weights

Okay, at this point we have our normalized, numerical input attributes, and a way to combine them with weights to generate a classification of “Cat” or “Not a cat”. It’s about time we learn how to assign the proper weights!

Here’s an idea: let’s start with some random weights on our synapses. If we run a training example through the neuron, and the neuron says “Cat!”, but we know it’s not really a cat, then we’ll modify the weights a little so that next time we might get it right.

So how do we provide some negative reinforcement to our neuron? Imagine if you will, a hill. You are told that you need to get to the bottom of this hill, and you have to do so with a blindfold on. It’s actually pretty easy, right? You feel the pull of gravity in one direction so you walk that way until the pull of gravity goes away. You *descended* the hill. Based on how steep the *gradient* of the hill was, it took you longer or shorter to get to the bottom. Training our perceptron to use the appropriate weights follows the same principle! In fact, it’s called *gradient descent.* We start with some random initial weight, and this weight gets pulled towards its true value.

In the figure above, we have some random initial weight that has an error value assigned to it. We want to “pull” that weight to the right so as to decrease its error measure, until we reach a point where we can’t improve it anymore.

#### Mathematically (Gradient Descent):

More formally, we define a quadratic error function for our prediction. Taking the derivative of this function at a point gives us the slope of the function at that point. We want the slope to be 0, and if it is, we know that we’ve minimized our error (reached a local or global min). Now, this next part requires you to know what a partial derivative is.

Given a training example and weights on its attributes, we already know that the function we use to compute a prediction is the summation of attributes multiplied by their weights:

After we *threshold* the output using the sign function s(y_{i}), we have a prediction of 1 (it’s a cat!) or 0 (it’s not a cat…). Well, because all the training examples are labeled with what we **should** have predicted, we know whether the output *should* have been 1, or *should* have been 0. Let’s let the **real classification** of training example *i* be denoted *r _{i}*. And now let’s define our error as:

What the Error equation is saying is that we define the error on our weight vector (given the attributes and real classification of training example i) as proportional to the square of the difference between the real classification and our prediction. Because we only have 2 possible classes (cat, not a cat), the squared difference is only ever 1 or 0, but if we had many different classes this error could get much larger. The ½ that gets multiplied to it is completely man-made; it was only put there to make the next part (taking a derivative) easy.

Turning this error equation into a weight update equation requires us to take a partial derivative with respect to the weight term. Taking the derivative with respect to w, then requires us to use the chain rule twice, eventually leaving us with:

What this equation is saying is that we modify weight *h* based on the difference between the real classification and prediction, multiplied by the value for attribute h. For example, this this could be the weight on “smell”, where the attribute value was “Mud”.

One problem with the equation is that it assumes equal responsibility for error on behalf of all weights, when in fact this may not be the case. This is an artifact of us trying to optimize potentially many weights at the same time. Changing each weight entirely by equation 7, then, could make our weights “bounce around” the correct values. To address this issue, let’s only change the weight by a small percentage of what equation tells us. That is, let’s multiply the right hand side of the equation with a value in the range [0.0, 1.0), which we’ll label :

is typically small, but the best value of is yet another optimization problem. A different value tends to work better on different data sets, so knowing it beforehand is nigh impossible. Programmers use a few strategies on this, from just specifying a really small value, which makes training time take much longer, or specifying a slightly larger value and decreasing it over time, or letting the program adapt based on the network’s ongoing performance. I myself have tried all of the above, and found that an adaptive value seems to be the best generic strategy. I won’t get into how to adapt in this article for simplicity, but once you get the basics of neural networks down, it’s not too hard to implement.

**Recap:**

Alright, at this point we know the following:

- How to change qualitative attribute values into quantitative ones
- How to normalize quantitative attribute values
- How to combine a training example’s attribute values with weights to generate a classification prediction
- How to specify the error on our prediction
- How to update our weights in response to error on our prediction

### Training Epochs

The only question that remains is this: how do we know when we’ve finished training our perceptron?

What we do is this: we take a portion of all our training data, up to 1/3 of it, and we set it aside. Let’s call this our *validation set*. So now we have our training set, which is 2/3 of our training examples, and our validation set, which is 1/3 of our training examples.

What we do is this:

- Run each training example through our perceptron, calculating prediction error and updating weights in response to that error
- After we’ve done that, let’s run each example from the validation set on our perceptron, calculating prediction error, but
**NOT**updating weights in response to that error. - Repeat

Each time we repeat the above is called a “training epoch”. We’ll stop training our perceptron when it stops improving. That is, when the error on the training set doesn’t change for a few epochs. At this point, we know that we’re not getting any better, so we might as well stop. The perceptron’s performance over time will look something like this:

Notice how the early stages have a lot of variation, but the performance is generally improving. Near the end, though, it actually **decreases**! At this point we know that we’ve overfit the network; it’s too well trained for the training set, and so it performs poorly on the validation set.

When we compute the prediction error on the validation set, we do it on the set as a whole; we add up the error for each validation set example, and then divide by the size of the validation set. This is called the mean squared error (MSE):

The equation for the MSE is very similar to the error equation we used for deriving our weight update equation above.

The logic I used for training my neural network was this:

- Run for at least 100 epochs to get out of the typically noisy beginning
- Each time we run our validation set through the perceptron, compute the perceptron’s mean squared error (MSE) on the validation set. Track the epoch at which we saw the lowest MSE.
- Let Epoch
_{best}be the training epoch at which we saw the lowest MSE. If 2 * Epoch_{best}epochs have passed without finding a new minimum MSE, OR 10,000 epochs total have passed, terminate. (I also experimented with up to 100k training epochs, but saw no difference).

## Results

For some perspective, I compared the performance of our lowly perceptron against a decision tree approach. I also compared against a Multi-Layer Perceptron (MLP), which is a more complex neural network consisting of various perceptron layers. What I found may surprise you; despite the fact that the perceptron is effectively a brain consisting of only a single neuron it performs nearly as well as the other approaches.

The data sets we compared were taken from the UCI Machine Learning repository. Some are more difficult than others, as you can see that predicting Heart Disease is a difficult task that even the MLP only got about 65% accuracy on.

## Discussion

What’s a perceptron good for? You don’t exactly think of a brain as being composed of a single neuron, and on top of that, a perceptron can only learn a linear discriminant. They’re surprisingly accurate classifiers, though, and due to their size their fast too.

If you want to make your perceptron bigger and learn more complex things (and multiple class labels), it takes a little more work. You have to add more layers to the network, creating a *multi-layer perceptron *(MLP). You have an input layer that received your training examples, and then you have a series of *hidden layers*, that feed into each other, finally going into an *output layer* that will produce your prediction for each class you’re interested in (dog, cat, human, etc). Also, you can’t simply use a linear function as the output of your neurons; the output of your hidden layers needs to be thresholded using a nonlinear function like tanh or the logistic function. Finally, all this added complexity makes defining your weight update equations more difficult with each additional layer:

The above figure shows an example MLP network with a hidden layer size half that of the input layer size. The network is trying to learn four class labels: Cat, Dog, Horse, and Pig. The σ symbol implies I’m using the logistic (sigmoid) thresholding function in the hidden layers. The next layer outputs some value for each class label that implies how certain the network is about the example belonging to that particular class. The final prediction takes the most likely of these using the softmax function.

In my own experiments I found that sometimes a perceptron performs just as well as a MLP, and other times the MLP significantly outperforms the perceptron. It all comes down to the problem domain you’re trying to learn. This means messing around with different parameters like number of hidden layers, neurons per hidden layer, values for , training epochs, and thresholding functions until you end up with something that suits your needs.

Neural networks as a whole are very useful, and the subject to much research. It’s coincidental that I wrote this article just as Google is telling the world that it has made a neural network with a billion nodes in it! It’s using networks like this to identify arbitrary objects in pictures (yes, including cats!). The biggest network I created with had less than 50 neurons in it.

#### References:

- Alpaydin, E.
*Introduction to machine learning*. 2nd ed. The MIT Press, 2012. 233-245. Print. - Priddy, Kevin L., and Paul E. Keller.
*Artificial neural networks: an introduction*. Vol. 68. Society of Photo Optical, 2005.

Tags: AI, machine learning, neural network

August 3, 2016 at 6:51 am |

Hello Andy,

thanks for your explanations!

Wish you were my professor in university, cause your explanations are way better!

The only thing I don’t get is:

Why is your mean m=1.4 for the z-Score?

If I compute it correctly my mean would be m=(1+2+3+4)/4=2.5.

Cheers

Simon

August 3, 2016 at 11:31 am |

Hi Simon, Thank you for your interest in the post!I intended to imply that the mean and standard deviations I gave were computed from some list of training examples that wasn’t shown. In other words, I made it up!

I’ll try to edit the post to be more descriptive. -Andy

August 3, 2016 at 2:08 pm |

Hi Andy,

thanks for your quick reply!

Yeah already thought about that, thanks anyways!

Another Question regarding the MLP:

What results you can expect after the sigmoid function threshold is reached in the (first) hidden layer and the one or two perceptrons fire?

I mean the “Cat, Dog, Horse, Pig” Output in the second Layer is totally clear to me, but atm I’m confused about the Outputs of the first two perceptrons (1st layer).

What do we model there? Must be some kind of pre classification step to gain more certainty right?

Simon

August 3, 2016 at 8:08 pm |

Hi Simon, As far as I know, the addition of the sigmoid function allows us to model non-linear functions. Adding additional hidden layers allows us to construct what are essentially piecewise non-linear functions. The intermediate result itself is not useful per-se, but it produces an output that the next layer can then transform again into something useful for classification. “Pre-classification” is a pretty nice way of thinking about it; every hidden layer performs some additional pre-processing on the input until the last layer is hit before output.

Hope this helps! -Andy