Let’s say you want to learn how to classify something, but the ‘rules’ behind the classification are not obvious. For example, given some measurements about a person’s heart, can we tell if they have heart disease? What if you are hiking through the woods and want to know whether a mushroom you see is edible or not? Perhaps the stem has a certain color, the cap is a certain width, or it has a ‘curtain’ around it. Instead of spending years mastering which mushrooms are edible by sight, how about you simply get a computer to tell you? Well, with decision trees you can!
Bear with me, as there is much to cover. A lot of this is lifted directly from a report I submitted for a recent Machine Learning project.
So What Is a Decision Tree?
Think of a decision tree like a family tree. Let’s say you want to find yourself in your own family tree. You know some information about your heritage, and so you can reliably start at the top and work your way down. How did you go down the tree? Well you knew that your great grandmother’s first name was “Shirley”, that your grandfather’s name was “Frank”, your father’s name was also “Frank”, and your name is “Andy”. Let’s think about this for a minute. What if your grandfather was George Foreman? Then all your potential fathers would be named “George”, and you couldn’t reliably continue down the tree solely based on this information. It would be much better if the tree branched at birthdays instead of first names here. That is, you gain more “information” by splitting the tree based on birthday instead of first name.
Programmers use similar methods to create decision trees: they search for the best attribute to “split” the tree on to maximize information gained. When you make really good decisions about how to make the branches in your tree, then you can more accurately make a classification. So how do we make these good decisions? Glad you asked.
Formally, we define a decision tree as follows (These formal definitions are not needed to understand the article):
Decision tree classifiers are hierarchical models for supervised learning in which a local region of the input space is identified through a sequence of recursive splits, in a small number of steps [1]. A decision tree consists of internal decision nodes and terminal leaves. Each decision node m implements a test function f_m(x) with discrete outcomes labeling the branches. Given an input, a test is applied at each node and one of the branches is taken depending on the outcome. The process begins at the root and recurses until a leaf node is reached, at which point the label of that leaf is returned as the output.
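The recursive procedure the definition describes can be sketched in a few lines of Python. This is a minimal illustration, not the real implementation from the report; the node structure and the tiny hand-built tree are assumptions made up for the example:

```python
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this decision node
        self.branches = branches or {}  # test outcome -> child Node
        self.label = label              # class label if this is a terminal leaf

def classify(node, example):
    """Start at the root and recurse until a leaf is reached."""
    if node.label is not None:             # terminal leaf: return its label
        return node.label
    outcome = example[node.attribute]      # apply the node's test function
    return classify(node.branches[outcome], example)

# Tiny hand-built tree: split on odor first, then on cap color.
tree = Node(attribute="odor", branches={
    "almond": Node(label="e"),
    "none": Node(attribute="cap_color", branches={
        "white": Node(label="e"),
        "green": Node(label="p"),
    }),
})
print(classify(tree, {"odor": "almond"}))                      # e
print(classify(tree, {"odor": "none", "cap_color": "green"}))  # p
```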
So I mentioned before that it might make it easier to find yourself in a family tree if one level was split by birthdays instead of first names. Now, this family tree example isn’t exactly a great candidate for a decision tree, but it was intended to be intuitive enough for you to grasp the concept. Let’s go back to the mushroom example I spoke about in the introduction.
The UCI Machine Learning Repository [5] has a database of poisonous and edible mushrooms. For each mushroom, scientists carefully measured 22 attributes, including cap size, stalk size, color, odor, habitat, and abundance. Being experts, they also knew which ones were edible and which were not, so they labelled each mushroom in the database as edible ‘e’ or poisonous ‘p’.
How can we best build a tree so that we can quickly and accurately classify a new mushroom we encounter? Well, we could simply compare our mushroom to every mushroom in the database and perhaps hope we find a match within some threshold. Or, we could build a decision tree using information gain.
The theory of information gain is actually pretty simple on the surface. We have some edible mushrooms and some poisonous mushrooms in the database. Let’s play the “What If?” game. What if we separated all our mushrooms according to their odor? Let’s say a mushroom either smells like almond or it doesn’t. Perhaps this could perfectly separate all our poisonous mushrooms from the edible ones, and we’d already be done! There are plenty of single-node decision trees we could think of: separating pens from pencils, or bicycles from sedans. The measure of how well an attribute separates our examples is called information gain, and it is computed from a quantity called entropy.
However, we’re interested in more complicated decision trees. As it turns out, all mushrooms that do smell like almond are indeed edible, but there are plenty of those that have no smell and are also edible.
Entropy and information gain measure opposite things. That is, the less entropy there is, the more information there is. Let’s say you are given only poisonous mushrooms with which to build your decision tree. It wouldn’t help much to separate them according to odor… or anything, really. No matter what you do, you’ll always classify a mushroom as poisonous. Because all your examples are poisonous, the entropy is at its minimum (0): the set is perfectly pure, and no further split can give us any more information. Because entropy is at its absolute minimum, we simply stop building the decision tree and say that all mushrooms are poisonous. Easy, right? The same concept applies to individual nodes of a tree being built. Say we’ve separated our examples according to some other criteria (like smell and size) above, and then we reach a case where all the remaining examples are poisonous. We can simply create a classification node, or leaf node, that classifies any mushroom that reaches it as poisonous.
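As a small sketch, entropy over a set of class labels can be computed like this. The all-poisonous set described above comes out to exactly 0, and a 50/50 split of two classes comes out to 1 (the maximum for two classes):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0  # a pure set has zero entropy: stop and make a leaf
    total = len(labels)
    # Sum -p * log2(p) over each class present in the set.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy(["p", "p", "p", "p"]))  # 0.0  (pure: all poisonous)
print(entropy(["p", "p", "e", "e"]))  # 1.0  (maximally mixed, two classes)
```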
The above paragraph qualitatively describes what entropy is. Now, for a formal definition:
Equation 1 defines the probability that an instance at node m belongs to class i. N_m^i is the number of instances of class i at node m, and N_m is the total number of instances at node m.
Equation 2 defines the entropy, or impurity, I_m of node m as a function of the probabilities of all K classes at node m. We seek to minimize I_m.
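In symbols, a standard formulation consistent with these descriptions (where N_m is the total number of instances at node m) is:

```latex
% Equation 1: probability of class i at node m
p_m^i = \frac{N_m^i}{N_m}

% Equation 2: entropy (impurity) of node m over the K classes
I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i
```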
Splitting Attributes by Information Gain
We know when to make a node in a tree a classification node: as we read in the previous section, we simply halt when entropy is minimized. Now, how can we apply this concept of information gain to choosing the best attribute to split on when a node does not have minimal entropy?
Remember how we said that choosing to separate mushrooms according to odor could effectively separate the poisonous from the edible ones? That is the main idea behind choosing the best attribute to split on! We can measure how mixed up the mushrooms are (in terms of edibility) before a split, and compare that to how mixed up they are after we separate them by, say, color. The attribute that gives us the best separation is the one we use to make the next split in our decision tree. Easy, right? In more scientific terms, we compare the entropy before the split to the entropy after the split. The difference in entropy is the information gain. Since we are trying to minimize entropy, we want the new entropy to be much lower than the old.
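The attribute-selection step can be sketched like so. The entropy helper is defined inline so the snippet is self-contained, and the four toy mushrooms are made up for illustration (odor separates the classes perfectly; color doesn’t separate them at all):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attribute):
    """Entropy before the split minus weighted entropy after it."""
    before = entropy([label for _, label in examples])
    groups = defaultdict(list)
    for features, label in examples:          # separate examples by the
        groups[features[attribute]].append(label)  # attribute's value
    after = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return before - after

data = [
    ({"odor": "almond", "color": "white"}, "e"),
    ({"odor": "almond", "color": "brown"}, "e"),
    ({"odor": "foul",   "color": "white"}, "p"),
    ({"odor": "foul",   "color": "brown"}, "p"),
]
best = max(["odor", "color"], key=lambda a: information_gain(data, a))
print(best)  # odor
```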
Formally, we can define Information Gain, G, as follows:
Equation 3 is closely related to Equation 1: it is the proportion of examples of class i at node m that would go down branch j. Equation 3 is used in Equation 4 to calculate the entropy of each branch. Equation 4 defines the impurity of a node after a split as a function of the entropy of each child node created: the entropy after the split is the sum of the entropies of all the branches, each weighted by the proportion of examples sent down that branch. That is, if more examples go down branch 1 than branch 2, then branch 1’s entropy carries more weight than branch 2’s. Information gain G is then the difference between the entropy of the node before the split and the entropy of the node after the split.
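Written out, a formulation consistent with these descriptions (where node m is split into n branches, and N_mj examples take branch j) is:

```latex
% Equation 3: proportion of class-i examples at node m taking branch j
p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}

% Equation 4: entropy after the split: branch entropies weighted by the
% fraction of the node's examples, N_mj / N_m, sent down each branch j
I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i

% Equation 5: information gain = entropy before minus entropy after
G = I_m - I'_m
```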
The ID3 and C4.5 Algorithms
Ross Quinlan [3] is famous for inventing ID3, the algorithm that uses entropy and information gain to recursively create decision trees. ID3 had some weaknesses, such as the inability to handle numerical attributes or attributes with missing values. His extension to ID3, C4.5 [4], addressed those weaknesses.
Overfitting means that your prediction model is biased too heavily towards your training data. Think about the hypothetical case where a node in our mushroom decision tree has 1,000 examples that are poisonous and only 1 that is edible. Should we really split the tree again here, or would it be better to just assume that all mushrooms at this node are poisonous? Perhaps that 1 edible mushroom was simply mislabeled by our scientists; they are human, after all. Or perhaps there is some attribute of the mushrooms that the scientists did not take into account that might better split the examples earlier on. It’s a good idea to assume that there is noise in our examples. If we don’t, we may build a tree too affected by this noise, and thus have it overfit to our training examples. New examples would then be classified with lower accuracy.
To alleviate the problem of overfitting, it is customary to prune our decision trees.
We can define pruning a node in a decision tree as collapsing all the branches below it, making it a single node again, as if we had never split on it in the first place. “Why do we need to prune,” you may be asking yourself, “if we could just stop making the decision tree when a node’s entropy reaches some threshold, which doesn’t need to be 0!?” I would tell you, sure, why not? That’s a valid pre-pruning step, though be careful: stopping early can miss splits that only pay off further down the tree.
John Mingers [2] published a paper in 1989 comparing a variety of post-pruning methods. The two that I implemented, and will focus on, are Error Complexity (sometimes called Cost Complexity) pruning and Reduced Error pruning.
To prune trees effectively, we need to set aside some of the examples from the database for this purpose before we build our decision tree. This pruning set is used solely for pruning and not for tree construction. I used 1/3 of the examples, but you might experiment with other values.
Error Complexity Pruning
Error Complexity pruning operates on the notion of reducing errors per leaf. Mingers defines three equations (6, 7, 8) that allow one to assign a value α to an interior node. A smaller α means the subtree below a node buys less error reduction per leaf, so nodes with smaller α values should be pruned first. Iteratively pruning nodes produces a set of subtrees that vary in their misclassification rate on the pruning set. My own implementation selected the pruned tree with the lowest misclassification rate. Mingers, however, suggests choosing the smallest pruned tree (in terms of number of leaves) whose misclassification rate is within one standard error of the minimum.
Equation 6 defines the error cost for node t. The cost is a product of the error rate at node t and the proportion of pruning set examples at node t. Equation 7 extends 6 to define the error cost for a subtree rooted at node t.
Equation 8 derives α as the difference in error costs between a node and the subtree below it, divided by the number of leaves, N_T, in the subtree rooted at node t. α thus represents an amount of error per leaf in the subtree, so a node with smaller α should be pruned first.
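In symbols, following the descriptions above (with r(t) the error rate at node t on the pruning set, p(t) the proportion of pruning-set examples reaching t, and T_t the subtree rooted at t; note that some presentations of cost-complexity pruning divide by N_T − 1 rather than N_T):

```latex
% Equation 6: error cost of node t if it were a leaf
R(t) = r(t)\, p(t)

% Equation 7: error cost of the subtree T_t rooted at t
% (the sum of the error costs of its leaves)
R(T_t) = \sum_{l \in \mathrm{leaves}(T_t)} R(l)

% Equation 8: error reduction per leaf contributed by the subtree
\alpha = \frac{R(t) - R(T_t)}{N_T}
```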
Reduced Error Pruning
Reduced Error pruning is much simpler than Error Complexity pruning. The idea is to run the pruning set through the existing tree and follow each example as it goes down. At each node an example passes through, we can play the “What If?” game again, asking ourselves, “What if the current node were actually a leaf node? Would it classify this example correctly or incorrectly?” After we run all the examples through the tree, we can compare any node’s number of errors to the number of errors made by the leaves below it. If the root of the subtree makes no more errors than the subtree itself, prune it. This simple method is very effective, and Mingers found it to perform roughly second best among pruning methods, just behind Error Complexity pruning.
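The counting logic can be sketched as follows. This is a simplified illustration, not the report’s implementation; the Node fields (including a stored majority-class label per node) and the tiny demo tree are assumptions:

```python
class Node:
    def __init__(self, attribute=None, branches=None, label=None, majority=None):
        self.attribute = attribute
        self.branches = branches or {}   # test outcome -> child Node
        self.label = label               # set only on leaves
        self.majority = majority if majority is not None else label
        self.errors_as_leaf = 0          # errors this node would make as a leaf

def count_errors(node, example, true_label):
    """Run one pruning-set example down the tree, playing 'What If?' at
    every node it passes through."""
    if node.majority != true_label:
        node.errors_as_leaf += 1
    if node.label is None:               # keep descending until a leaf
        count_errors(node.branches[example[node.attribute]], example, true_label)

def subtree_errors(node):
    """Total errors made by the leaves below (or at) this node."""
    if node.label is not None:
        return node.errors_as_leaf
    return sum(subtree_errors(c) for c in node.branches.values())

def prune(node):
    """Bottom-up: collapse any subtree whose root, treated as a leaf,
    makes no more errors than the subtree's leaves do."""
    if node.label is not None:
        return
    for child in node.branches.values():
        prune(child)
    if node.errors_as_leaf <= subtree_errors(node):
        node.label = node.majority       # become a classification node
        node.branches = {}

# Demo: the subtree makes one error on this pruning set, while the root
# as a leaf would make none, so the split is pruned away.
root = Node(attribute="odor", majority="p", branches={
    "almond": Node(label="e"),
    "foul":   Node(label="p"),
})
for ex, y in [({"odor": "almond"}, "p"), ({"odor": "foul"}, "p")]:
    count_errors(root, ex, y)
prune(root)
print(root.label, root.branches)  # p {}
```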
Now that you know the basics behind building a decision tree, let’s see how well one performs. For the report I wrote, I was essentially comparing how well these pruning methods worked relative to each other as well as if we didn’t prune at all. Generally speaking, pruning really reduced the size of the tree while maintaining at least as good classification accuracy. Let’s look at some results:
The table below shows results for a decision tree applied to a census database, where the task is to predict whether a person makes more or less than $50K a year. There were about 32,500 examples in this dataset, with 14 attributes; some attributes were real-valued.
Accuracy refers to the average accuracy of the tree over all trials, normalized to be between 0 and 1. So, 0.81 means the tree correctly classified 81% of examples. “Resub” means resubstitution accuracy. That is, how accurate the tree was in classifying the examples that were used to create it. Why is this accuracy less than 100% on the unpruned tree? This is because the dataset contains attributes with unknown values. Since we don’t know the values, we simply ignore those examples in entropy calculations. Techniques exist to estimate these attribute values, like simply using the most common value, but I did not do this.
TreeSize means the number of leaf nodes (classification nodes) in the tree. Can you see how much the pruning methods reduced it, while actually improving overall accuracy? Statistically, the probability of this reduction simply being due to random chance is less than 0.01%. (For those statisticians out there, I used 10-fold cross-validation with alpha = 0.05).
Now let’s look at a dataset the decision tree does not perform well on: heart disease classification. The heart disease decision tree attempts to predict the risk of heart disease on a scale of 0 to 4, where 0 means “no heart disease” and anything else means “has heart disease”, the value indicating severity. If I’d wanted, I could simply have reduced the number of labels from five to two: heart disease or no heart disease. Instead, I opted to try to predict the severity of the heart disease. This dataset had only 303 examples with 13 different attributes. Below is the table of stats:
“Only about 50% accuracy? You might as well have simply guessed randomly!” That’s not a completely valid critique. A 20% accuracy would have implied random guessing. Still, this is not great. There are a lot of unknown attribute values in this dataset, and perhaps not as many examples of some classes as others. Reducing the number of classes to two seems to give a lot higher accuracy: about 74%. Still, you can see how well the pruning methods worked here. In some cases, the pruning methods actually did reduce the tree to only a single node.
Alright, alright, but what about the Mushroom dataset? I mean, I’ve been using it as my example all along. How well does the decision tree perform on that one? Very, very well, is the short answer. Perfectly (for the dataset), in fact:
So it looks like you could very easily classify your mushrooms. The tree was 100% accurate on the data, which does NOT necessarily mean it’s perfect in the real world. So please don’t actually use this tree to decide on eating mushrooms. Here’s the full mushroom decision tree:
Some Conclusions and Discussion
It looks like decision trees are pretty good at classifying some things, and not others. How can we know if a decision tree will perform well? Well, my advice would be to simply try it and see if it works. You could also examine how many attributes you have. ID3 and C4.5 are sensitive to attributes with high numbers of values, and likely won’t perform as well on them. Pruning also seems to be a really, really good idea in a decision tree. Other methods discussed in the Mingers paper, like Pessimistic Pruning, don’t actually require you to set aside a pruning set. It is one of the worst performing pruning methods, but useful if you don’t have many examples to spare for a pruning set.
So I’ve run you through the very basics of making a decision tree. We did not get very deep into the actual ID3 and C4.5 algorithms; for example, how they handle numerical attributes. Should one simply try splitting on every numerical value that appears in the examples? Well, you could, and it actually works pretty well. Or, you could sort all the possible values and look only at the midpoints between each pair of consecutive values. Also, remember that there are plenty of techniques for estimating unknown attribute values. We also could have discussed more weaknesses of the pruning methods (like sensitivity).
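The midpoint trick mentioned above can be sketched in a couple of lines (the numbers are made up for illustration):

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values; each is a
    candidate 'attribute <= t' split for a numeric attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_thresholds([39, 50, 38, 53, 50]))  # [38.5, 44.5, 51.5]
```

Each candidate threshold can then be scored with information gain just like a discrete attribute, keeping the best one.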
Also, remember that decision trees are most certainly not the only way of making predictions. There are neural nets, k-nearest-neighbor classifiers, polynomial regression, and many, many other methods. I hope you enjoyed this article, and thanks for reading. Email me if you’d like the source code for this project. (Or leave a comment if you think the code is worth hosting for download; there are plenty of decision tree classifiers available on the web already.)
1. Alpaydin, E., Decision Trees, in Introduction to Machine Learning, T. Dietterich, Editor. 2010, MIT Press. p. 185-192.
2. Mingers, J., An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 1989. 4(2): p. 227-243.
3. Quinlan, J.R., Induction of decision trees. Machine Learning, 1986. 1(1): p. 81-106.
4. Quinlan, J.R., C4.5: Programs for Machine Learning. Machine Learning, 1994. 16(3): p. 235-240.
5. UCI Machine Learning Repository. [cited 2012; A database of datasets for testing classification algorithms]. Available from: http://archive.ics.uci.edu/ml/datasets.html.