CS345, Machine Learning
Prof. Alvarez

Decision Tree Pruning based on Confidence Intervals (as in C4.5)

The basic entropy-based decision tree learning algorithm ID3 continues to grow a tree until it makes no errors over the set of training data. This fact makes ID3 prone to overfitting. In order to reduce overfitting, pruning is used.

Reduced-Error Pruning

One approach to pruning is to withhold a portion of the available labeled data as a validation set, which is not used during training. Once training has been completed, the tree is evaluated over the validation set. If the error rate of the original decision tree over the validation set exceeds the error rate of a pruned version of the tree (obtained by replacing a near-leaf node and its child leaves by a single leaf node), then the pruning operation is carried out. Reduced-error pruning can reduce overfitting, but it also reduces the amount of data available for training.
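
To make the procedure concrete, here is a minimal, self-contained Python sketch. The tuple-based tree encoding, the helper functions, and the toy data are purely illustrative choices of ours (not part of any standard library or of ID3/C4.5), and for simplicity the replacement leaf takes the majority label of the validation instances that reach the node.

    def classify(tree, x):
        """A tree is either ('leaf', label) or ('node', attr, {value: subtree})."""
        if tree[0] == 'leaf':
            return tree[1]
        _, attr, branches = tree
        return classify(branches[x[attr]], x)

    def error_rate(tree, data):
        return sum(1 for x, y in data if classify(tree, x) != y) / len(data)

    def majority_label(data):
        labels = [y for _, y in data]
        return max(set(labels), key=labels.count)

    def reduced_error_prune(tree, validation):
        """Bottom-up pruning: replace a subtree by a leaf whenever the leaf
        does at least as well as the subtree over the validation instances."""
        if tree[0] == 'leaf' or not validation:
            return tree
        _, attr, branches = tree
        pruned = {}
        for v, sub in branches.items():
            subset = [(x, y) for x, y in validation if x[attr] == v]
            pruned[v] = reduced_error_prune(sub, subset)
        subtree = ('node', attr, pruned)
        leaf = ('leaf', majority_label(validation))
        if error_rate(leaf, validation) <= error_rate(subtree, validation):
            return leaf
        return subtree

    # Toy usage: the split on attribute 0 does not help on the validation
    # data, so the whole node is replaced by a single '+' leaf.
    tree = ('node', 0, {'a': ('leaf', '+'), 'b': ('leaf', '-')})
    validation = [(('a',), '+'), (('b',), '+'), (('a',), '+')]
    print(reduced_error_prune(tree, validation))   # ('leaf', '+')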

Statistical Pruning

C4.5 uses a pruning technique based on statistical confidence estimates. This technique has the advantage that it allows all of the available labeled data to be used for training.

Confidence intervals

The heart of the statistical pruning technique is the calculation of a confidence interval for the error rate. In brief, one starts from an observed error rate f measured over the set of N training instances. The observed error rate is analogous to the observed fraction of heads in N tosses of a (usually loaded) coin. One wishes to estimate the true error rate p that would be observed if one could test over all possible data instances that will ever become available. The true error rate p is analogous to the actual probability of heads of the given coin. Note the difference between f and p: f is the fraction of errors (or heads) observed in one particular sample of size N, while p is the fraction of errors (heads) that would be observed over the infinitely large set of all instances, past, present, and future.

The confidence interval is calculated as follows:

	p = f +- z*sqrt( f*(1-f) / N )
Here, f, p, and N are as described above. z is a factor that depends on the desired level of confidence. Values of z for selected confidence levels appear below.
	z	confidence
	0.67	50%
	1.00	68%
	1.64	90%
	1.96	95%
With the level of confidence given in this table, one can state that the true value p will be found inside the corresponding confidence interval calculated as described above. For example, suppose that 3 errors are observed among a set of 10 instances. We thus have
	f = 0.3, 	N = 10, 	z = 1.64
The confidence interval for p is:
	p = 0.3 +- 1.64*sqrt( 0.3*(1-0.3) / 10 )
To two decimal digits, the confidence interval is:
	p = 0.3 +- 0.24
Thus, we can be 90% certain that the true error rate of the particular classifier that led to the observed error rate of 3/10 will be between 0.06 and 0.54.
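
A quick Python check of this calculation (the function name error_interval is our own, just for illustration):

    import math

    def error_interval(f, n, z):
        """Confidence interval for the true error rate p, given the
        observed error rate f over n instances and the z value for
        the desired confidence level (see the table above)."""
        half_width = z * math.sqrt(f * (1.0 - f) / n)
        return f - half_width, f + half_width

    # the worked example: 3 errors among 10 instances, 90% confidence (z = 1.64)
    low, high = error_interval(3 / 10, 10, 1.64)
    print(round(low, 2), round(high, 2))   # 0.06 0.54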

Pruning based on confidence intervals

In order to decide whether to replace a near-leaf node and its child leaves by a single leaf node, C4.5 compares the upper limits of the error confidence intervals for the two trees. For the unpruned tree, the upper error estimate is calculated as an average of the upper estimates of its child leaves, weighted by the number of instances that reach each leaf. Whichever tree has the lower estimated upper limit on the error rate "wins" and is selected.
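
The comparison can be sketched in a few lines of Python. The names upper_error and prune_wins are ours, and the upper limit uses the simple interval formula given above, so the numbers are only an approximation of what C4.5 itself computes.

    import math

    def upper_error(errors, n, z=0.69):
        """Upper limit of the error confidence interval, using the
        simple estimate p = f + z*sqrt(f*(1-f)/N) described above."""
        f = errors / n
        return f + z * math.sqrt(f * (1.0 - f) / n)

    def prune_wins(leaf_stats, node_errors, node_n, z=0.69):
        """leaf_stats: list of (errors, instances) pairs, one per child leaf.
        Returns True if the single pruned leaf has the lower (or equal)
        estimated upper error limit, i.e. if pruning is preferred."""
        total = sum(n for _, n in leaf_stats)
        # average of the leaves' upper estimates, weighted by leaf size
        subtree_upper = sum(n * upper_error(e, n, z) for e, n in leaf_stats) / total
        return upper_error(node_errors, node_n, z) <= subtree_upper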

Example from the labor negotiations dataset

The unpruned subtree is:
                        wage increase?
                <=2.5 /         \ > 2.5
                     /           \
                work hours
           <=36 /       \ >36
               /         \
                        health plan
                none /    half |   full \
                    /          |         \
                4+ 2-        1+ 1-      4+ 2-
We target the health plan node near the bottom of the tree for pruning. First we calculate the weighted average estimated upper error rate for the unpruned subtree:
        for health plan = none leaf node:
                f = 2/6, N=6 => upper p estimate = .47
                (using z=0.69, roughly the 75-th percentile of the standard normal)

        for health plan = half leaf node:
                f = 1/2, N=2 => upper p estimate = .74

        for health plan = full leaf node:
                f = 2/6, N=6 => upper p estimate = .47

        average upper error rate over the three leaves, weighted by the
        leaf counts 6, 2, and 6: (6*.47 + 2*.74 + 6*.47)/14 = .51
On the other hand, if the tree were pruned by replacing the health plan node by a single leaf covering all 14 instances (9+, 5-), the confidence interval calculation would be as follows:
        f = 5/14, N=14 => upper p estimate = .45
Since the pruned tree results in the lower estimated upper limit on the error rate (.45 vs. .51), the health plan node and its leaves are indeed pruned.
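
Using the prune_wins sketch from the previous section, the same decision can be reproduced numerically:

    # Health plan node from the example above: leaves (4+,2-), (1+,1-), (4+,2-),
    # so the majority-class error counts are 2 of 6, 1 of 2, and 2 of 6.
    leaves = [(2, 6), (1, 2), (2, 6)]
    print(prune_wins(leaves, node_errors=5, node_n=14, z=0.69))   # True: prune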