Lab 3 - Decision Trees

Reading: Tom M. Mitchell, Machine Learning, Chapter 3.

List and description of files

The required files are the program id3.pl and the earlier data files animals.pl and loandata.pl.

Supplementary files for the PMG examples: the program dtlearn_bool.pl, the data file dtlearn_t2.pl, and as11data.pl for the assignment.

Creating, browsing and accessing decision trees and rules

?- [id3].
?- [animals].
 
?- id3(2).

The parameter we specify (2) controls the growing of the tree: if a node covers 2 or fewer examples, it becomes a leaf regardless of the class distribution.

To see the tree we use:

?- showtree.
 
  has_covering=feathers => [bird/2]
  has_covering=scales
    gills=f => [reptile/2]
    gills=t => [fish/1]
  has_covering=none => [amphibian/1, mammal/1]
  has_covering=hair => [mammal/3]

The lists at the leaves show the class distribution in the examples covered. If we change the threshold we get a different tree. With threshold 1 we will have one-class examples at the leaves. For example:

?- id3(1),showtree.
 
  has_covering=feathers => [bird/2]
  has_covering=scales
    gills=f => [reptile/2]
    gills=t => [fish/1]
  has_covering=none
    eggs=t => [amphibian/1]
    eggs=f => [mammal/1]
  has_covering=hair => [mammal/3]

The „id3” query also generates „if-then” rules. We can see these rules by using „listing”:

?- listing(if).
 
:- dynamic if/1.
 
if[has_covering=feathers]then bird.
if[has_covering=scales, gills=f]then reptile.
if[has_covering=scales, gills=t]then fish.
if[has_covering=none, eggs=t]then amphibian.
if[has_covering=none, eggs=f]then mammal.
if[has_covering=hair]then mammal.

NOTE: „listing” is a Prolog built-in procedure that lists the contents of the database for a given predicate name. This means that the if-then structures are already in the database and their name is „if”. In fact, when started by „?- id3(1).”, the ID3 program creates a decision tree (which can be viewed with „?- showtree.”) and loads into the database a set of rules that equivalently represent the tree.
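The tree-to-rules correspondence can be sketched outside Prolog as well: every root-to-leaf path becomes one rule whose conditions are the attribute tests along the path. Below is a minimal Python illustration (the nested-dict tree encoding and the function name `tree_to_rules` are my own, not part of id3.pl); it reproduces the six rules listed above.

```python
# The id3(1) tree for animals.pl as a nested dict: an internal node maps
# each attribute value to a subtree, a leaf is a class name.
tree = {"has_covering": {
    "feathers": "bird",
    "scales":   {"gills": {"f": "reptile", "t": "fish"}},
    "none":     {"eggs":  {"t": "amphibian", "f": "mammal"}},
    "hair":     "mammal",
}}

def tree_to_rules(node, path=()):
    """Flatten a decision tree into (conditions, class) rules, one per leaf."""
    if isinstance(node, str):                  # leaf: emit the path so far
        return [(list(path), node)]
    (attr, branches), = node.items()           # one test attribute per node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, path + (f"{attr}={value}",))
    return rules

for conds, cls in tree_to_rules(tree):
    print(f"if {conds} then {cls}")
```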

The rules created by ID3 are Prolog facts, but written in a special prefix format. They can also be accessed directly by Prolog queries like these:

?- if X then mammal.
 
X = [has_covering=none, eggs=f] ;
 
X = [has_covering=hair] ;
 
?- if X then Y.
 
X = [has_covering=feathers]
Y = bird ;
 
X = [has_covering=scales, gills=f]
Y = reptile ;
 
X = [has_covering=scales, gills=t]
Y = fish ;
 
X = [has_covering=none, eggs=t]
Y = amphibian ;
 
X = [has_covering=none, eggs=f]
Y = mammal ;
 
X = [has_covering=hair]
Y = mammal

If you want to run ID3 with another data file, load the data only (no need to reload id3.pl). However NOTE that after executing „?- id3(1)” the previous tree and rules are deleted and replaced with those corresponding to the currently loaded data file. For example:

?- [loandata].
 
?- id3(1),showtree.
 
  emp=no => [reject/3]
  emp=yes
    buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
    buy=comp => [approve/7]
 
?- id3(9),showtree.
 
  emp=no => [reject/3]
  emp=yes => [approve/8, reject/1]

Tools

Using the data from the animals.pl file, test the decision tree building tool.

You can import an XML file with the training set.

Note: Compute the entropy and the information gain for each attribute and identify the best root of the decision tree. Did the tool determine the root correctly?
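The computation asked for here can be sketched in Python. This is only an illustration of the formulas, not the tool's implementation; the class counts below are read off the id3 output for animals.pl shown earlier, and has_covering is just one of the attributes you should check.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts per branch of has_covering, read off the id3(1) tree:
branches = {
    "feathers": [2],        # bird/2
    "scales":   [2, 1],     # reptile/2, fish/1
    "none":     [1, 1],     # amphibian/1, mammal/1
    "hair":     [3],        # mammal/3
}
total_counts = [2, 2, 1, 1, 4]   # bird, reptile, fish, amphibian, mammal
n = sum(total_counts)            # 10 examples

# Information gain = entropy before the split
#                  - weighted entropy of the branches after the split.
gain = entropy(total_counts) - sum(
    sum(c) / n * entropy(c) for c in branches.values())
print(round(gain, 4))
```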

Comparison with concept learning

Assuming that the loandata tree and rules created by „?- id3(1)” are in the database, try this (note that [2, 5, 11, 12] = [2, 11, 12] + [5]):

?- if H then [reject/N],model(H,M).
 
H = [emp=no]
N = 3
M = [2, 11, 12] ;
 
H = [emp=yes, buy=car, married=no]
N = 1
M = [5] ;

This query accesses the two disjunctive components of the hypothesis for class „reject” (the two paths in the decision tree leading to the „reject” leaves).

Now, obviously we should be able to run VS on each of the two sets and produce consistent hypotheses. For the first part, [2, 11, 12], we put these examples at the beginning, followed by all examples from class „approve” (we may also leave the other „reject” example at the end). Let's call this new data file loandata1.pl and load it into the database.

?- [loandata1].

We may try „listing” to verify our data:

?- listing(example).
 
example(2, reject, [emp=no, buy=comp, sex=f, married=yes]).
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]).
example(12, reject, [emp=no, buy=car, sex=f, married=yes]).
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(4, approve, [emp=yes, buy=car, sex=f, married=yes]).
example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]).
example(7, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(8, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(5, reject, [emp=yes, buy=car, sex=f, married=no]).

Now, we load and run the candidate elimination algorithm (vs.pl) in batch mode.

?- [vs].
 
?- batch.
+[no, comp, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, f, yes]]
+[no, comp, m, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, ?, yes]]
+[no, car, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, f, no]
G-set: [[?, ?, ?, yes], [no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]
+[yes, car, f, no]
G-set: []
S-set: []
There is no consistent concept description in this language !
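The S-set updates in the trace above can be sketched as repeated least-general generalization: each new positive example replaces every attribute value it disagrees with by „?”. The sketch below shows only this S-set half of candidate elimination (the G-set specialization step from vs.pl is omitted); the function name `generalize` is my own.

```python
def generalize(h, example):
    """Least general generalization: keep agreeing values, wildcard the rest."""
    return [a if a == b else "?" for a, b in zip(h, example)]

# The three positive (reject) examples from loandata1.pl, in batch order.
positives = [
    ["no", "comp", "f", "yes"],
    ["no", "comp", "m", "yes"],
    ["no", "car",  "f", "yes"],
]

s = positives[0]                 # S starts as the first positive example
for p in positives[1:]:
    s = generalize(s, p)
    print("S-set:", [s])         # matches the trace above
```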

NOTE that the inconsistency occurs because of the last example (#5), which is part of the other hypothesis. The hypothesis created by ID3, [emp=no], is contained in the version space:

G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

In fact, it is the one in the G-set. Interestingly, we have another one (the S-set). This means that there may be another more specific decision tree, which looks like this:

  emp=no 
    married=yes => reject        THIS IS THE NEW BRANCH
    married=no => approve        THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE
  emp=yes
    buy=car
      married=no => reject
      married=yes => approve
    buy=comp => approve
 
NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm will create the previous tree, because the order of the examples does not matter.
 
Let's verify what the new branch covers. First, we need the "model" procedure back (it's in id3.pl).
 
?- [id3].
 
?- model([emp=no,married=yes],M).
 
M = [2, 11, 12]

NOTE that this is the same coverage as before.

?- model([emp=no,married=no],M).
 
M = []

The branch we added to complete the tree does not cover anything, but this is OK.

Similarly, the other disjunctive component of ID3's „reject” hypothesis,

if [emp=yes, buy=car, married=no] then reject

may be learned with VS. First, we create file loandata2.pl with the following order of the examples: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12].

Then running VS:

?- batch.
+[yes, car, f, no]
G-set: [[?, ?, ?, ?]]
S-set: [[yes, car, f, no]]
-[yes, comp, f, no]
G-set: [[?, car, ?, ?]]
S-set: [[yes, car, f, no]]
-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]
-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]
+[no, comp, f, yes]
G-set: []
S-set: []
There is no consistent concept description in this language !

NOTE that the version space here is bigger:

G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]

It has four hypotheses, among which is the one created by ID3.

The OneR Algorithm

?- [id3].
?- [loandata].
 
?- id3(1),showtree.
 
  emp=no => [reject/3]
  emp=yes
    buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
    buy=comp => [approve/7]

The id3.pl file defines procedures to compute the entropy and the distribution of classes in a set of examples. For example:

?- if H then C, model(H,M),entropy(M,I).
 
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;
 
H = [emp=yes, buy=car, married=no]
C = [reject/1]
M = [5]
I = 0 ;
 
H = [emp=yes, buy=car, married=yes]
C = [approve/1]
M = [4]
I = 0 ;
 
H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0

These are all 100% correct hypotheses, and that's why the entropy of the class distribution is 0. If we allow mixed classes at the leaves, we may get non-zero entropy. For example:

?- id3(2),showtree.
 
  emp=no => [reject/3]
  emp=yes
    buy=car => [approve/1, reject/1]
    buy=comp => [approve/7]
 
?- if H then C, model(H,M),entropy(M,I).
 
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;
 
H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
I = 1 ;
 
H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0 ;

The id3.pl program includes a procedure to compute the class distribution of a set of examples. For example:

?- if H then C, model(H,M),distr(M,D).
 
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
D = [reject/3] ;
 
H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
D = [approve/1, reject/1] ;
 
H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
D = [approve/7] ;

Thus we may see the class distribution and the entropy of a set of examples. For example:

?- distr([1,2,3,4],D),entropy([1,2,3,4],I).
 
D = [approve/3, reject/1]
I = 0.811278
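The entropy value above can be reproduced directly from the class distribution [approve/3, reject/1]. A quick check in Python (a sketch of the formula, not the id3.pl code):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# distr([1,2,3,4]) = [approve/3, reject/1]
print(round(entropy([3, 1]), 6))   # 0.811278
```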

Here is the entropy of a hypothesis that covers mixed class examples:

?- model([buy=car],M),distr(M,D),entropy(M,I).
 
M = [4, 5, 12]
D = [approve/1, reject/2]
I = 0.918296

The distribution of classes can be used to compute the error. For example, assuming that we assign the majority class label to the set above (reject), the error will be 1/3 (one wrong classification out of 3).
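This majority-class error can be computed from the distribution alone. A minimal sketch (the function name `majority_error` is my own, not from id3.pl):

```python
def majority_error(distribution):
    """Error of predicting the majority class: misclassified / total."""
    counts = [n for _, n in distribution]
    return (sum(counts) - max(counts)) / sum(counts)

# model([buy=car]) covers [4, 5, 12] with distribution [approve/1, reject/2]:
# predicting the majority class (reject) misclassifies 1 of 3 examples.
print(majority_error([("approve", 1), ("reject", 2)]))   # 1/3
```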

The OneR algorithm generates a one-level decision tree using error-based attribute selection. The following query can be used to find the OneR decision tree for our loan data („attr” and „vals” are described in the header of id3.pl; try the query without writeln and fail to better understand how it works).

?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail.
 
[buy=car]-[approve/1, reject/2]
[buy=comp]-[approve/7, reject/2]
 
[emp=no]-[reject/3]
[emp=yes]-[approve/8, reject/1]
 
[married=no]-[approve/4, reject/1]
[married=yes]-[approve/4, reject/3]
 
[sex=f]-[approve/4, reject/3]
[sex=m]-[approve/4, reject/1]

The class distributions above allow us to compute the total error in each of the four possible one-level trees:

1. buy=car => reject [approve/1, reject/2] 
   buy=comp => approve [approve/7, reject/2]
   -----------------------------------------
   Error = (1+2)/12 = 0.25
 
2. emp=no => reject [reject/3]
   emp=yes => approve [approve/8, reject/1]
   ----------------------------------------
   Error = (0+1)/12 = 0.0833333
 
3. married=no => approve [approve/4, reject/1]
   married=yes => approve [approve/4, reject/3]
   --------------------------------------------
   Error = (1+3)/12 = 0.333333
 
4. sex=f => approve [approve/4, reject/3]
   sex=m => approve [approve/4, reject/1]
   --------------------------------------
   Error = (3+1)/12 = 0.333333

Obviously the best OneR tree is #2, because it has the lowest error rate, 0.0833333. Interestingly, the entropy-based attribute selection used by ID3 agrees with the error-based one. This is clear from the fact that the pruned ID3 tree is the same as the OneR tree.
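The whole OneR selection can be sketched in a few lines of Python over the four class distributions listed above (this is an illustration of the selection rule, not the vs.pl or id3.pl code; the names `distributions` and `oner_error` are my own):

```python
# Class counts per attribute value, copied from the query output above.
distributions = {
    "buy":     {"car": {"approve": 1, "reject": 2},
                "comp": {"approve": 7, "reject": 2}},
    "emp":     {"no": {"reject": 3},
                "yes": {"approve": 8, "reject": 1}},
    "married": {"no": {"approve": 4, "reject": 1},
                "yes": {"approve": 4, "reject": 3}},
    "sex":     {"f": {"approve": 4, "reject": 3},
                "m": {"approve": 4, "reject": 1}},
}

def oner_error(branches, total=12):
    """Total error when each branch predicts its majority class."""
    wrong = sum(sum(d.values()) - max(d.values()) for d in branches.values())
    return wrong / total

errors = {a: oner_error(b) for a, b in distributions.items()}
best = min(errors, key=errors.get)
print(best, round(errors[best], 7))   # emp 0.0833333
```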
