CSCI 3346, Data Mining
Prof. Alvarez

Learning Rules by Sequential Covering

Rules provide models of data that people find intuitive. Data mining techniques that produce rules are therefore of particular interest when the results will be used and interpreted by people. One option is to extract rules from decision trees: each branch of the tree corresponds to a rule whose consequent is the leaf label and whose antecedent is the conjunction of the attribute-value pairs along the path. Here I describe the sequential covering approach, which learns classification rules directly. In contrast to decision tree learning algorithms such as ID3 and C4.5, which proceed by repeatedly splitting the dataset on the most promising attribute at each stage, sequential covering algorithms proceed by repeatedly removing from the dataset all instances covered by the most promising rule at each stage. I will describe a covering algorithm known as PRISM.

Sequential Covering Pseudocode

	Inputs: labeled training dataset D

	Outputs: ruleset R that covers all instances in D

	Procedure:

	Initialize R as the empty set

	for each class C {

		while D contains instances of class C {

			Construct one rule r that correctly classifies
			some instances in D that belong to class C and 
			does not incorrectly classify any non-C instances

			Add rule r to ruleset R

			Remove from D all instances correctly classified by r
		}

	}

	return R
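The loop structure above can be sketched in Python. This is a minimal illustration, not a production implementation: instances are assumed to be dicts with a "class" key, and the rule-construction step, which the pseudocode leaves unspecified, is passed in as a function find_rule.

```python
def covers(antecedent, instance):
    """True if the instance satisfies every attribute=value constraint."""
    return all(instance[a] == v for a, v in antecedent.items())

def sequential_covering(data, classes, find_rule):
    """Learn a ruleset class by class; each rule is (antecedent, class).

    find_rule(data, cls) must return an antecedent (a dict of
    attribute=value constraints) that covers at least one instance of
    cls in data and no instances of any other class.
    """
    ruleset = []
    remaining = list(data)
    for cls in classes:
        # keep generating rules while uncovered instances of cls remain
        while any(inst["class"] == cls for inst in remaining):
            antecedent = find_rule(remaining, cls)
            ruleset.append((antecedent, cls))
            # remove the instances correctly classified by the new rule
            remaining = [inst for inst in remaining
                         if not (covers(antecedent, inst)
                                 and inst["class"] == cls)]
    return ruleset

# A trivial stand-in for find_rule: memorize one uncovered instance of cls
# as a maximally specific rule (makes no errors as long as the data contain
# no two identical instances with different classes).
def memorize(data, cls):
    inst = next(i for i in data if i["class"] == cls)
    return {a: v for a, v in inst.items() if a != "class"}
```

In PRISM, find_rule would be the rule-construction procedure described below; the memorize stand-in merely shows that any function satisfying the stated contract makes the loop terminate.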

Finding the next rule

Notice that the sequential covering pseudocode does not specify how the rule r is to be constructed on a given pass through the inner while loop. Different sequential covering algorithms do this in different ways. The PRISM algorithm constructs the next rule for a given class C by a greedy depth-first search with no backtracking. Since the consequent of the rule must be the given class C, only the antecedent needs to be constructed; this is done by starting with an empty antecedent and adding the most promising attribute=value constraint at each step, where candidate rules are rated by their classification accuracy. The search continues until the resulting rule is specific enough that it makes no classification errors over the available data instances. Pseudocode follows below.
	Inputs: nonempty dataset D, class C

	Outputs: a single rule r with C as its consequent that 
	covers some instances in D and that does not incorrectly
	classify any non-C instances in D

	Procedure:

	Initialize r as a rule with an empty antecedent,
	which predicts C without examining any attributes:

		=> C

	Initialize A as the set of all attributes over D

	while r incorrectly classifies some non-C instances of D {

		write r as ant(r) => C

		for each attribute-value pair (a=v), 
		where a belongs to A and v is a value of a,
		compute the accuracy of the rule

			ant(r) and (a=v) => C

		let (a*=v*) be the attribute-value pair of
		maximum accuracy over D; in case of a tie,
		choose the pair that covers the greatest
		number of instances of D

		update r by adding (a*=v*) to its antecedent:

			r = ( ant(r) and (a*=v*) ) => C

		remove the attribute a* from the set A:

			A = A - {a*}

	}

	return r
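The rule-construction pseudocode translates directly into Python. The sketch below assumes instances are dicts with a "class" key; as in PRISM, accuracies are computed over the instances that satisfy the current antecedent, and ties are broken in favor of greater coverage.

```python
def find_prism_rule(data, cls):
    """Grow one rule 'antecedent => cls' until it makes no errors on data."""
    antecedent = {}                      # start from the empty antecedent
    attrs = {a for inst in data for a in inst if a != "class"}
    covered = list(data)                 # instances satisfying the antecedent
    while any(inst["class"] != cls for inst in covered):
        best = None                      # (accuracy, coverage, attr, value)
        for a in attrs:
            for v in {inst[a] for inst in covered}:
                subset = [inst for inst in covered if inst[a] == v]
                p = sum(inst["class"] == cls for inst in subset)
                # higher accuracy p/t wins; ties broken by coverage t
                key = (p / len(subset), len(subset))
                if best is None or key > best[:2]:
                    best = (*key, a, v)
        _, _, a_star, v_star = best
        antecedent[a_star] = v_star      # specialize the rule
        attrs.discard(a_star)            # each attribute is used at most once
        covered = [inst for inst in covered if inst[a_star] == v_star]
    return antecedent
```

Applied to the contact lens dataset below with cls = "hard", this grows exactly the antecedent derived in the worked example: astigmatism=yes, tear-production-rate=normal, spectacle-prescription=myope.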


Example of Rule Learning using PRISM

Here is an illustration of how PRISM learns rules over the contact lens dataset, as discussed in the WEKA book by Witten and Frank.

Contact Lens Dataset

I include the ARFF file for the contact lens dataset below.
% 1. Title: Database for fitting contact lenses
%
% 2. Sources:
%      (a) Cendrowska, J. "PRISM: An algorithm for inducing modular rules",
%          International Journal of Man-Machine Studies, 1987, 27, 349-370
%      (b) Donor: Benoit Julien (Julien@ce.cmu.edu)
%      (c) Date: 1 August 1990
%
% 3. Past Usage:
%       1. See above.
%       2. Witten, I. H. & MacDonald, B. A. (1988). Using concept
%          learning for knowledge acquisition. International Journal of
%          Man-Machine Studies, 27, (pp. 349-370).
%
%  Notes:  This database is complete (all possible combinations of
%          attribute-value pairs are represented).
%
%          Each instance is complete and correct.
%
%          9 rules cover the training set.
%
% 4. Relevant Information Paragraph:
%     The examples are complete and noise free.
%     The examples highly simplified the problem. The attributes do not
%     fully describe all the factors affecting the decision as to which type,
%     if any, to fit.
%
% 5. Number of Instances: 24
%
% 6. Number of Attributes: 4 (all nominal)
%
% 7. Attribute Information:
%     -- 3 Classes
%      1 : the patient should be fitted with hard contact lenses,
%      2 : the patient should be fitted with soft contact lenses,
%      3 : the patient should not be fitted with contact lenses.
%
%     1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
%     2. spectacle prescription:  (1) myope, (2) hypermetrope
%     3. astigmatic:     (1) no, (2) yes
%     4. tear production rate:  (1) reduced, (2) normal
%
% 8. Number of Missing Attribute Values:   0
%
% 9. Class Distribution:
%    1. hard contact lenses: 4
%    2. soft contact lenses: 5
%    3. no contact lenses: 15

@relation contact-lenses

@attribute age                  {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip   {myope, hypermetrope}
@attribute astigmatism          {no, yes}
@attribute tear-prod-rate       {reduced, normal}
@attribute contact-lenses       {soft, hard, none}

@data
%
% 24 instances
%
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none

Finding the first rule

The main loop in the sequential covering pseudocode first chooses contact-lenses=hard as the class. Work then focuses on finding one rule for the hard class.

Initialization

Initially, the rule is just
	=> hard
and all non-class attributes are still being considered, that is,
A = { age, spectacle-prescription, astigmatism, tear-production-rate }

First pass through the loop

The accuracies are computed for all candidate rules of the form:
	(a = v) => C
where (a, v) ranges over all possible attribute-value pairs:
	attribute-value pair				p/t

	age=young					2/8
	age=pre-presbyopic				1/8
	age=presbyopic					1/8
	spectacle-prescription=myope			3/12
	spectacle-prescription=hypermetrope		1/12
	astigmatism=no					0/12
	astigmatism=yes					4/12
	tear-production-rate=normal			4/12
	tear-production-rate=reduced			0/12
In each case, the denominator of the fraction p/t on the right indicates the number of instances that contain the given attribute-value pair, and the numerator is the number of these instances that belong to the hard class. For example, there are 8 instances with age=young, and 2 of these 8 are in the hard class. The attribute-value pairs with the largest accuracies are astigmatism=yes and tear-production-rate=normal. Since both of these have the same coverage, one of them is chosen arbitrarily, say astigmatism=yes. That is:
	(a*, v*) = (astigmatism, yes)
The rule r is then updated by adding this attribute-value pair to the antecedent:
	r = (astigmatism=yes) => hard
and the set of attributes being considered is reduced by removing the astigmatism attribute:
	A = { age, spectacle-prescription, tear-production-rate }
Since the current rule r incorrectly classifies some instances by predicting hard for some instances that are actually in the contact-lenses=none class, another pass is made through the loop that refines the rule.
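The p/t counts in the table above can be checked mechanically. A short script (dataset and short attribute names copied from the ARFF file above):

```python
from collections import Counter

# 24 instances copied from the ARFF data section
DATA = """young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none"""
ATTRS = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "class"]
data = [dict(zip(ATTRS, row.split(","))) for row in DATA.splitlines()]

# t counts instances containing each attribute-value pair;
# p counts those among them that belong to the hard class
t, p = Counter(), Counter()
for inst in data:
    for a in ATTRS[:-1]:
        t[a, inst[a]] += 1
        p[a, inst[a]] += inst["class"] == "hard"

for (a, v), total in sorted(t.items()):
    print(f"{a}={v}\t{p[a, v]}/{total}")
```

The printed fractions agree with the table, e.g. astigmatism=yes and tear-prod-rate=normal both come out as 4/12.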

Second pass

Candidate rules for this pass are of the form
	(astigmatism=yes) and (a=v) => hard
Accuracies for the candidate attribute-value pairs are:
	attribute-value pair				p/t

	age=young					2/4
	age=pre-presbyopic				1/4
	age=presbyopic					1/4
	spectacle-prescription=myope			3/6
	spectacle-prescription=hypermetrope		1/6
	tear-production-rate=normal			4/6
	tear-production-rate=reduced			0/6
In computing these accuracies, only instances for which (astigmatism=yes) are considered. Here is the set of all such instances:
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
The attribute-value pair of greatest accuracy is tear-production-rate=normal, with an accuracy p/t = 4/6. The rule r is updated by adding this attribute-value pair to the antecedent:
	r = (astigmatism=yes) and (tear-production-rate=normal) => hard
The set of attributes still under consideration becomes
	A = { age, spectacle-prescription }
Some instances are still misclassified by the refined rule r, so another pass is made through the loop.

Third pass

Candidate rules are of the form:
	r = (astigmatism=yes) and (tear-production-rate=normal) and (a=v) => hard
Accuracies for the candidate attribute-value pairs are:
	attribute-value pair				p/t

	age=young					2/2
	age=pre-presbyopic				1/2
	age=presbyopic					1/2
	spectacle-prescription=myope			3/3
	spectacle-prescription=hypermetrope		1/3
These are computed over the set of instances that satisfy both constraints that appear in the current antecedent:
young,myope,yes,normal,hard
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,yes,normal,none
There are now two attribute-value pairs with 100% accuracy, namely (age=young) and (spectacle-prescription=myope). The tie is broken by choosing the one with greater coverage, that is:
	(a*, v*) = (spectacle-prescription, myope)
The updated rule is
	r = (astigmatism=yes) and (tear-production-rate=normal) and (spectacle-prescription=myope) => hard
Since this rule has 100% accuracy, it is accepted and returned by the rule construction procedure as the first rule generated for the hard class by the inner loop in the sequential covering pseudocode.
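The 100% accuracy claim is easy to verify directly: the final antecedent matches exactly three instances of the dataset, all of which belong to the hard class. A quick check (dataset and short attribute names copied from the ARFF file above):

```python
# 24 instances copied from the ARFF data section
DATA = """young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none"""
ATTRS = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "class"]
data = [dict(zip(ATTRS, row.split(","))) for row in DATA.splitlines()]

# the antecedent of the first rule found for the hard class
rule = {"astigmatism": "yes", "tear-prod-rate": "normal",
        "spectacle-prescrip": "myope"}
matched = [inst for inst in data
           if all(inst[a] == v for a, v in rule.items())]
print(len(matched), all(inst["class"] == "hard" for inst in matched))
# prints: 3 True
```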

Updates in the sequential covering inner loop

The first rule
	r = (astigmatism=yes) and (tear-production-rate=normal) and (spectacle-prescription=myope) => hard
is added to the ruleset R:
	R = { (astigmatism=yes) and (tear-production-rate=normal) and (spectacle-prescription=myope) => hard }
and the three instances covered by this rule are removed from the dataset D. D now contains 21 instances:
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
This completes the first pass through the inner loop in the sequential covering algorithm. Since one instance with contact-lenses=hard still remains, another pass is made in order to generate a second rule that targets the hard class. This second rule is constructed by following the procedure described above. The second rule turns out to be:
	r = (age=young) and (astigmatism=yes) and (tear-production-rate=normal) => hard
This rule covers the remaining instance in the hard class, which completes the first pass through the outer loop of the sequential covering algorithm. Subsequent passes generate rules that cover instances belonging to the soft and none classes.
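The second rule can be checked the same way. Starting from the full dataset, removing the three instances covered by the first rule leaves 21 instances; among these, the second rule's antecedent matches exactly one instance, which belongs to the hard class. A sketch (dataset copied from the ARFF file above):

```python
# 24 instances copied from the ARFF data section
DATA = """young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none"""
ATTRS = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "class"]
data = [dict(zip(ATTRS, row.split(","))) for row in DATA.splitlines()]

rule1 = {"astigmatism": "yes", "tear-prod-rate": "normal",
         "spectacle-prescrip": "myope"}
rule2 = {"age": "young", "astigmatism": "yes", "tear-prod-rate": "normal"}

def covers(rule, inst):
    return all(inst[a] == v for a, v in rule.items())

# remove the instances correctly classified by the first rule
remaining = [i for i in data if not (covers(rule1, i) and i["class"] == "hard")]
matched = [i for i in remaining if covers(rule2, i)]
print(len(remaining), len(matched), matched[0]["class"])
# prints: 21 1 hard
```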