CS345, Machine Learning
Prof. Alvarez

Entropy-Based Decision Tree Induction (as in ID3 and C4.5)

Decision Trees

A decision tree is simply a graphical representation of a sequential decision process, one in which the final outcome is determined by the answers to a particular sequence of questions. As a model, think of the game "20 questions", in which one of the two players must guess what the other is thinking using only the answers to a sequence of yes/no questions. We want to be able to guess correctly using a single questioning strategy for each dataset. Note that the question to ask next will depend on the answers obtained so far, just as in "20 questions"; this is what gives rise to branching in the decision tree. Unlike in the game, we will not restrict the number of questions in advance; however, we will strive to minimize the average number of questions asked.

We assume that a dataset with only nominal instance attributes is given and that one of the attributes of this dataset is chosen as the target attribute. We wish to predict the value of the target attribute for a given instance given the values of a selected sequence of other attributes for the same instance. These values are the "answers" to our "questions". The goal of decision tree induction is to design as short a question sequence as possible while yielding good predictive performance. One approach to this task involves entropy, a fundamental quantity in information theory. We will briefly describe entropy and the entropy approach to decision tree induction below.

Uncertainty and Entropy

Entropy is a measure of the amount of uncertainty associated with a set of probabilities. To motivate the idea of entropy informally, suppose that there are n possible outcomes in a certain context. If each of these n outcomes is estimated to occur with probability 1/n, then uncertainty is at a maximum - we have no reason to lean toward any one of the outcomes when predicting which of them will occur; in this case we are nowhere near being able to rationally choose among the possibilities. On the other hand, if one of the outcomes is estimated to occur with probability 0.99 (99% of the time), then the amount of uncertainty is quite small; in this case, predicting that this particular outcome will occur will be correct most of the time.

Clearly, the amount of uncertainty is related to the degree of success that can be expected in the task of predicting the outcome. In decision tree induction, asking more questions will change the probabilities, normally reducing the amount of uncertainty remaining. The magnitude of the remaining uncertainty can be thought of as a heuristic estimate of the number of additional questions that we will have to ask before the outcome is certain.

Definition of entropy

Given n probabilities p1, p2, ... pn that sum to 1, the associated entropy is defined as the following quantity H:
	H(p1, p2, ... pn) = sum from i=1 to i=n of -pi*log(pi)
The logarithm is usually taken in base 2. In this case, the entropy can be interpreted as the number of bits of information needed to resolve the uncertainty associated with the probabilities p1, ... pn.

Examples

  1. The entropy of a uniform distribution over n outcomes is log n (in base 2, log2(n) bits).
  2. The entropy of a deterministic distribution (in which one of the outcomes has probability 1 and the others have probability 0) is 0.
The entropy may be shown to reach a maximum for a uniform distribution.
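
As a quick check of these two examples, here is a minimal Python sketch of the entropy formula above (the function name entropy is just an illustrative choice; base-2 logarithms are used, as in the definition):

        from math import log2

        def entropy(probs):
            """Entropy, in bits, of a list of probabilities that sum to 1."""
            # A zero probability contributes nothing: p*log2(p) -> 0 as p -> 0.
            return sum(-p * log2(p) for p in probs if p > 0)

        print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 = log2(4)
        print(entropy([1.0, 0.0, 0.0]))           # deterministic distribution: 0.0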

Definition of class entropy

In machine learning techniques for classification, we are interested in predicting the value of a selected class attribute based on the values of the remaining (non-class) attributes. In this context, entropy may be used to rank attributes in terms of the reduction in the uncertainty of the class label for a given instance that can be expected if the value of the attribute becomes known for that instance.

We define the class entropy of an attribute a as follows:

Hc(a) = sum of #(a=v)/#(instances) * H(class distribution | a=v) 
where:
  1. the sum is calculated over all values v of a,
  2. #(a=v) is the number of instances for which a has the value v,
  3. #(instances) is the total number of instances being considered,
  4. H(class distribution | a=v) is the entropy of the probability distribution of the classes over the instances that satisfy a=v.
The point is that the probability distribution of the class will depend on the specific value of the selected attribute a. The class entropy is a weighted average over all of these possibilities: each possible value of the selected attribute a is weighted in proportion to its frequency of occurrence among the instances being considered.
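
As a concrete illustration, here is one possible Python sketch of this weighted average. It assumes each instance is represented as a dictionary mapping attribute names to values; the helper names entropy and class_entropy are illustrative, not standard.

        from collections import Counter
        from math import log2

        def entropy(probs):
            return sum(-p * log2(p) for p in probs if p > 0)

        def class_entropy(instances, a, class_attr):
            """Weighted average entropy of the class attribute, given attribute a."""
            total = len(instances)
            h = 0.0
            for v in set(x[a] for x in instances):
                subset = [x for x in instances if x[a] == v]
                counts = Counter(x[class_attr] for x in subset)     # class distribution given a=v
                probs = [c / len(subset) for c in counts.values()]
                h += (len(subset) / total) * entropy(probs)         # weight by #(a=v)/#(instances)
            return h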

The ID3 Algorithm

The ID3 algorithm was invented by J.R. Quinlan ("Induction of Decision Trees", Machine Learning, vol 1, issue 1, 1986, 81-106). ID3 uses the class entropy to decide which attribute to query on at each node of a decision tree. Note that entropy in this context is relative to the previously selected class attribute. The class entropy measures the average uncertainty of the class label still remaining after learning the value of the given attribute at a given node of the partially constructed tree. The class entropy may be interpreted as a heuristic estimate of the number of additional questions that must be asked (nodes that must be visited) in order to resolve the uncertainty associated with the class attribute. At each stage of the construction, ID3 chooses to split on the attribute for which the average remaining uncertainty is minimized. Heuristically, this should minimize the expected number of remaining questions, and thus the average depth of the decision tree.

ID3 pseudocode

	Inputs: (D, A, C), where:
		D is a dataset whose instances have only nominal attributes
		A is the set of non-class attributes of D
		C is the class attribute

	Output: a decision tree T representing a sequential decision process for
	classifying instances (predicting the values of the class attribute C);
	each node of T is labeled with a non-class attribute of A

	Informal Inductive Bias: minimize the average depth of the tree (the expected number of questions needed to classify an instance)

	Procedure:

	if the set of remaining non-class attributes is empty 
	or if all of the instances in D are in the same class

		return a leaf node labeled with the most common class value in D

	else {

		compute the class entropy of each attribute over the dataset D
			let a* be an attribute with minimum class entropy	

		create a root node for a tree T; label it with a* 

		for each value b of attribute a* {

			let T(a*=b) be the tree computed recursively by ID3 on input (D|a*=b, A-a*, C), where:
				D|a*=b contains all instances of D for which a* has the value b
				A-a* consists of all attributes of A except a*

			attach T(a*=b) to the root of T as a subtree

		}
		
		return the resulting decision tree T

	}
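
The pseudocode translates fairly directly into Python. The sketch below is one possible rendering under the same assumptions as the previous sketch (instances as dictionaries of nominal attribute values, with the illustrative helpers entropy and class_entropy repeated here); it is meant as an illustration of the algorithm, not a reference implementation. A tree is represented either as a class label (a leaf) or as a pair (attribute, dictionary of subtrees keyed by attribute value).

        from collections import Counter
        from math import log2

        def entropy(probs):
            return sum(-p * log2(p) for p in probs if p > 0)

        def class_entropy(instances, a, class_attr):
            """Weighted average class entropy after splitting on attribute a."""
            total = len(instances)
            h = 0.0
            for v in set(x[a] for x in instances):
                subset = [x for x in instances if x[a] == v]
                probs = [c / len(subset) for c in Counter(x[class_attr] for x in subset).values()]
                h += (len(subset) / total) * entropy(probs)
            return h

        def id3(instances, attributes, class_attr):
            """Return a decision tree: either a class label (leaf) or (attribute, {value: subtree})."""
            labels = [x[class_attr] for x in instances]
            # Base cases: no non-class attributes remain, or all instances share one class.
            # Return a leaf labeled with the majority class so that the tree can always
            # produce a prediction. Assumes instances is nonempty.
            if not attributes or len(set(labels)) == 1:
                return Counter(labels).most_common(1)[0][0]
            # Split on the attribute a* with minimum class entropy (maximum information gain).
            best = min(attributes, key=lambda a: class_entropy(instances, a, class_attr))
            subtrees = {}
            for v in set(x[best] for x in instances):
                subset = [x for x in instances if x[best] == v]      # D | a*=v
                remaining = [a for a in attributes if a != best]     # A - a*
                subtrees[v] = id3(subset, remaining, class_attr)
            return (best, subtrees)

To classify an instance with such a tree, walk it top down: at a leaf, return the label; at a node (a, children), follow the branch children[instance[a]] and repeat.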

Example of ID3 entropy computation

Take the weather dataset, for example: if the target attribute for classification is the play attribute, then the entropy must be computed relative to that attribute.

Here is the entropy computation at the root of the tree in this case:

To compute entropy(outlook):

The outlook attribute has three values: sunny, overcast, and rainy.

Of the 14 instances in the dataset, 

        5 have outlook=sunny
        4 have outlook=overcast
        5 have outlook=rainy

That splits the training instances into three subsets as described.
Let's check the uncertainty of the class (play) attribute in each subset:

        of the 5 instances with outlook=sunny,
                3 have play=no
                2 have play=yes
        the entropy associated with this subset is 
        entropy(3/5, 2/5) = (3/5)*log2(5/3) + (2/5)*log2(5/2) = 0.97

        of the 4 instances with outlook=overcast,
                4 have play=yes
                0 have play=no
        the entropy associated with this subset is 
        entropy(1, 0) = 1*log2(1) + 0*log2(0) = 0 
        (the term 0*log2(0) is taken to be 0 by convention, since p*log2(p) -> 0 as p -> 0)
        
        of the 5 instances with outlook=rainy,
                3 have play=yes
                2 have play=no
        the entropy associated with this subset is 
        entropy(3/5, 2/5) = (3/5)*log2(5/3) + (2/5)*log2(5/2) = 0.97

The average uncertainty still remaining after asking about the outlook
attribute is therefore the following weighted average:

        entropy(outlook) = (5/14)*0.97 + (4/14)*0 + (5/14)*0.97 = 0.69

The coefficients here are the fractions of the training instances that fall into each 
of the three subsets associated with the three possible values of the outlook attribute.

ID3 would then proceed to calculate the entropy of each of the other non-class
attributes in order to choose which attribute to split on at the root level.
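
The arithmetic above is easy to check with a few lines of Python, using only the class counts listed for each outlook value (the counts are taken directly from the tallies above; the helper name entropy is illustrative):

        from math import log2

        def entropy(probs):
            return sum(-p * log2(p) for p in probs if p > 0)

        # (play=yes, play=no) counts within each outlook subset, from the tallies above
        subsets = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}
        total = sum(sum(counts) for counts in subsets.values())        # 14 instances

        h_outlook = sum((sum(counts) / total) * entropy([n / sum(counts) for n in counts])
                        for counts in subsets.values())
        print(round(h_outlook, 2))                                     # 0.69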