====== Lab 3 - Decision Trees ======

Reading: Tom M. Mitchell, //Machine Learning//, Chapter 3.

===== List and description of files =====

The required files are the program {{:pl:prolog:prolog_lab:ml:id3.pl}} and the data files used earlier: {{:pl:prolog:prolog_lab:ml:animals.pl}} and {{:pl:prolog:prolog_lab:ml:loandata.pl}}. Supplementary files for the PMG examples: the program {{:pl:prolog:prolog_lab:ml:dtlearn_bool.pl}}, the data {{:pl:prolog:prolog_lab:ml:dtlearn_t2.pl}}, and, for the exercise, {{:pl:prolog:prolog_lab:ml:as11data.pl}}.

===== Creating, browsing and accessing decision trees and rules =====

<code prolog>
?- [id3].
?- [animals].
?- id3(2).
</code>

The parameter we specify (2) controls the growing of the tree: if a node covers 2 examples or fewer, it becomes a leaf regardless of the class distribution. To see the tree we use:

<code prolog>
?- showtree.
has_covering=feathers => [bird/2]
has_covering=scales
   gills=f => [reptile/2]
   gills=t => [fish/1]
has_covering=none => [amphibian/1, mammal/1]
has_covering=hair => [mammal/3]
</code>

The lists at the leaves show the class distribution among the examples covered. If we change the threshold, we get a different tree. With threshold 1 every leaf contains examples of a single class. For example:

<code prolog>
?- id3(1),showtree.
has_covering=feathers => [bird/2]
has_covering=scales
   gills=f => [reptile/2]
   gills=t => [fish/1]
has_covering=none
   eggs=t => [amphibian/1]
   eggs=f => [mammal/1]
has_covering=hair => [mammal/3]
</code>

The "id3" query also generates "if-then" rules. We can see these rules by using "listing":

<code prolog>
?- listing(if).
:- dynamic if/1.

if[has_covering=feathers]then bird.
if[has_covering=scales, gills=f]then reptile.
if[has_covering=scales, gills=t]then fish.
if[has_covering=none, eggs=t]then amphibian.
if[has_covering=none, eggs=f]then mammal.
if[has_covering=hair]then mammal.
</code>

NOTE: "listing" is a built-in Prolog procedure that lists the contents of the database for a given predicate name. This means that the if-then structures are already in the database, stored under the name "if". In fact, when started by "?- id3(1).", the ID3 program creates a decision tree (viewable with "?- showtree.") and loads into the database a set of rules that represent the tree equivalently. The rules created by ID3 are Prolog facts, written in a special prefix format. They can also be accessed directly by Prolog queries like these:

<code prolog>
?- if X then mammal.
X = [has_covering=none, eggs=f] ;
X = [has_covering=hair] ;

?- if X then Y.
X = [has_covering=feathers]
Y = bird ;
X = [has_covering=scales, gills=f]
Y = reptile ;
X = [has_covering=scales, gills=t]
Y = fish ;
X = [has_covering=none, eggs=t]
Y = amphibian ;
X = [has_covering=none, eggs=f]
Y = mammal ;
X = [has_covering=hair]
Y = mammal
</code>

If you want to run ID3 with another data file, load the data only (there is no need to reload id3.pl). NOTE, however, that after executing "?- id3(1)" the previous tree and rules are deleted and replaced with those corresponding to the currently loaded data file. For example:

<code prolog>
?- [loandata].
?- id3(1),showtree.
emp=no => [reject/3]
emp=yes
   buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
   buy=comp => [approve/7]

?- id3(9),showtree.
emp=no => [reject/3]
emp=yes => [approve/8, reject/1]
</code>

===== Tools =====

Using the data from the file {{:pl:prolog:prolog_lab:ml:animals.pl}}, try out the [[http://www.aispace.org/dTree/index.shtml|decision tree construction tool]]. You can import an XML file with the {{:pl:dydaktyka:ml:lab3.xml|training set}}.

**Note** Compute the entropy and the //Information Gain// for each attribute and indicate the best root for the decision tree. Did the tool choose the root correctly?
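The computation this exercise asks for can be sketched in Python (a hedged illustration, not part of id3.pl; ''entropy'' and ''info_gain'' are our own helper names). As a worked example, the per-branch class counts for ''has_covering'' are read off the showtree output above:

```python
import math

def entropy(counts):
    """Entropy of a class distribution, given as a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def info_gain(total_counts, branches):
    """Information gain of a split; branches is one list of class counts
    per attribute value."""
    n = sum(total_counts)
    remainder = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(total_counts) - remainder

# Class counts per value of has_covering, from the showtree output:
# feathers [bird/2], scales [reptile/2, fish/1],
# none [amphibian/1, mammal/1], hair [mammal/3]
branches = [[2], [2, 1], [1, 1], [3]]
whole = [2, 2, 1, 1, 4]   # bird, reptile, fish, amphibian, mammal (10 animals)
print(round(info_gain(whole, branches), 4))  # -> 1.6464
```

Repeating this for every attribute in the data set and picking the largest gain identifies the best root; compare your result with the one the tool computes.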
===== Comparison with concept learning =====

Assuming that the loandata tree and rules created by "?- id3(1)" are in the database, try this (note that the example sets satisfy [2, 5, 11, 12] = [2, 11, 12] + [5]):

<code prolog>
?- if H then [reject/N],model(H,M).
H = [emp=no]
N = 3
M = [2, 11, 12] ;
H = [emp=yes, buy=car, married=no]
N = 1
M = [5] ;
</code>

This query retrieves the two disjunctive components of the hypothesis for class "reject" (the two paths in the decision tree leading to the "reject" leaves). Now, obviously, we should be able to run VS on each of the two sets and produce consistent hypotheses. For the first part, [2, 11, 12], we put these examples at the beginning, followed by all examples from class "approve" (we may also leave the other "reject" example at the end). Let's call this new data file loandata1.pl and load it into the database:

<code prolog>
?- [loandata1].
</code>

We may use "listing" to verify our data:

<code prolog>
?- listing(example).
example(2, reject, [emp=no, buy=comp, sex=f, married=yes]).
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]).
example(12, reject, [emp=no, buy=car, sex=f, married=yes]).
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(4, approve, [emp=yes, buy=car, sex=f, married=yes]).
example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]).
example(7, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(8, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(5, reject, [emp=yes, buy=car, sex=f, married=no]).
</code>

Now we load and run the candidate elimination algorithm (vs.pl) in batch mode:

<code prolog>
?- [vs].
?- batch.
+[no, comp, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, f, yes]]

+[no, comp, m, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, ?, yes]]

+[no, car, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

-[yes, comp, f, no]
G-set: [[?, ?, ?, yes], [no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]

+[yes, car, f, no]
G-set: []
S-set: []
There is no consistent concept description in this language !
</code>

NOTE that the inconsistency occurs because of the last example (# 5), which belongs to the other hypothesis. The hypothesis created by ID3, [emp=no], is contained in the version space:

<code prolog>
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
</code>

In fact, it is the one in the G-set. Interestingly, we have another one (the S-set). This means that there may be another, more specific decision tree, which looks like this:

<code prolog>
emp=no
   married=yes => reject    THIS IS THE NEW BRANCH
   married=no => approve    THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE
emp=yes
   buy=car
      married=no => reject
      married=yes => approve
   buy=comp => approve
</code>

NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm will create the previous tree, because the order of the examples does not matter. Let's verify what the new branch covers. First, we need the "model" procedure back (it's in id3.pl):

<code prolog>
?- [id3].
?- model([emp=no,married=yes],M).
M = [2, 11, 12]
</code>

NOTE that this is the same coverage as before.

<code prolog>
?- model([emp=no,married=no],M).
M = []
</code>

The branch we added to complete the tree does not cover anything, but this is fine. Similarly, the other disjunctive component of ID3's "reject" hypothesis,

<code prolog>
if [emp=yes, buy=car, married=no] then reject
</code>

may be learned with VS.
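The version-space reasoning above can be checked with a small Python sketch (illustrative only; vs.pl itself is not reproduced, and ''covers'', ''consistent'' and ''between'' are our own helper names). It verifies that both boundary hypotheses are consistent with the first eleven examples of loandata1.pl, that positive example 5 is covered by neither (hence the collapse), and that the version space between the S- and G-sets contains exactly the two hypotheses shown:

```python
from itertools import product

def covers(hyp, ex):
    """A conjunctive hypothesis covers an example when every
    non-'?' value matches (attribute order: emp, buy, sex, married)."""
    return all(h in ("?", v) for h, v in zip(hyp, ex))

# Examples 2, 11, 12 (class "reject": the positives for VS) ...
positives = [("no", "comp", "f", "yes"),
             ("no", "comp", "m", "yes"),
             ("no", "car", "f", "yes")]
# ... and the eight "approve" examples (the negatives).
negatives = [("yes", "comp", "f", "no"), ("yes", "comp", "m", "no"),
             ("yes", "car", "f", "yes"), ("yes", "comp", "f", "yes"),
             ("yes", "comp", "f", "no"), ("yes", "comp", "m", "no"),
             ("yes", "comp", "m", "yes"), ("yes", "comp", "m", "yes")]

def consistent(hyp):
    """Consistent = covers all positives and no negatives."""
    return (all(covers(hyp, p) for p in positives)
            and not any(covers(hyp, n) for n in negatives))

def between(s, g):
    """All hypotheses between a most-specific s and a most-general g
    (assumes g generalizes s only by replacing values with '?')."""
    choices = [(sv,) if sv == gv else (sv, gv) for sv, gv in zip(s, g)]
    return list(product(*choices))

g_hyp = ("no", "?", "?", "?")    # ID3's hypothesis [emp=no]
s_hyp = ("no", "?", "?", "yes")  # the S-set hypothesis
print(consistent(g_hyp), consistent(s_hyp))      # True True
print(covers(g_hyp, ("yes", "car", "f", "no")))  # example 5: False -> collapse
print(between(s_hyp, g_hyp))                     # the whole version space
```

The same ''between'' helper also confirms the size of the version space obtained from the second VS run later in this section: between [yes, car, f, no] and [?, car, ?, no] there are exactly four hypotheses.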
First, we create the file loandata2.pl with the examples in the following order: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12]. Then we run VS:

<code prolog>
?- batch.
+[yes, car, f, no]
G-set: [[?, ?, ?, ?]]
S-set: [[yes, car, f, no]]

-[yes, comp, f, no]
G-set: [[?, car, ?, ?]]
S-set: [[yes, car, f, no]]

-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]

-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]

+[no, comp, f, yes]
G-set: []
S-set: []
There is no consistent concept description in this language !
</code>

NOTE that the version space here is bigger:

<code prolog>
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]
</code>

It contains four hypotheses, among which is the one created by ID3.

===== The OneR algorithm =====

<code prolog>
?- [id3].
?- [loandata].
?- id3(1),showtree.
emp=no => [reject/3]
emp=yes
   buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
   buy=comp => [approve/7]
</code>

The id3.pl file defines procedures to compute the entropy and the distribution of classes in a set of examples. For example:

<code prolog>
?- if H then C, model(H,M),entropy(M,I).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;

H = [emp=yes, buy=car, married=no]
C = [reject/1]
M = [5]
I = 0 ;

H = [emp=yes, buy=car, married=yes]
C = [approve/1]
M = [4]
I = 0 ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0
</code>

These are all 100% correct hypotheses, and that is why the entropy of the class distribution is 0 in each case. If we allow mixed classes at the leaves, we may get non-zero entropy. For example:

<code prolog>
?- id3(2),showtree.
emp=no => [reject/3]
emp=yes
   buy=car => [approve/1, reject/1]
   buy=comp => [approve/7]

?- if H then C, model(H,M),entropy(M,I).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;

H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
I = 1 ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0 ;
</code>

The id3.pl program also includes a procedure to compute the class distribution of a set of examples. For example:

<code prolog>
?- if H then C, model(H,M),distr(M,D).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
D = [reject/3] ;

H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
D = [approve/1, reject/1] ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
D = [approve/7] ;
</code>

Thus we may inspect the class distribution and the entropy of an arbitrary set of examples. For example:

<code prolog>
?- distr([1,2,3,4],D),entropy([1,2,3,4],I).
D = [approve/3, reject/1]
I = 0.811278
</code>

Here is the entropy of a hypothesis that covers examples of mixed classes:

<code prolog>
?- model([buy=car],M),distr(M,D),entropy(M,I).
M = [4, 5, 12]
D = [approve/1, reject/2]
I = 0.918296
</code>

The distribution of classes can also be used to compute the classification error. For example, assuming that we assign the majority class label (reject) to the set above, the error will be 1/3 (one wrong classification out of 3). The OneR algorithm generates a one-level decision tree, using error-based attribute selection. The following query can be used to find the OneR decision tree for our loan data ("attr" and "vals" are described in the header of id3.pl; try the query without writeln and fail to understand better how it works):

<code prolog>
?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail.
[buy=car]-[approve/1, reject/2]
[buy=comp]-[approve/7, reject/2]

[emp=no]-[reject/3]
[emp=yes]-[approve/8, reject/1]

[married=no]-[approve/4, reject/1]
[married=yes]-[approve/4, reject/3]

[sex=f]-[approve/4, reject/3]
[sex=m]-[approve/4, reject/1]
</code>

The class distributions above allow us to compute the total error of each of the four possible one-level trees:

<code prolog>
1. buy=car  => reject   [approve/1, reject/2]
   buy=comp => approve  [approve/7, reject/2]
   -----------------------------------------
   Error = (1+2)/12 = 0.25

2. emp=no  => reject   [reject/3]
   emp=yes => approve  [approve/8, reject/1]
   ----------------------------------------
   Error = (0+1)/12 = 0.0833333

3. married=no  => approve  [approve/4, reject/1]
   married=yes => approve  [approve/4, reject/3]
   --------------------------------------------
   Error = (1+3)/12 = 0.333333

4. sex=f => approve  [approve/4, reject/3]
   sex=m => approve  [approve/4, reject/1]
   --------------------------------------
   Error = (3+1)/12 = 0.333333
</code>

Obviously the best OneR tree is #2, because it has the lowest error rate, 0.0833333. Interestingly, the entropy-based attribute selection used by ID3 agrees with the error-based one here, which is clear from the fact that the pruned ID3 tree (threshold 9) is the same as the OneR tree.
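The error computation above can be reproduced with a short Python sketch (a hedged illustration of OneR's selection step; ''oner_error'' is our own name, not a predicate from id3.pl). The per-value class counts are taken directly from the query output above:

```python
# Per-value class counts for each attribute, read off the Prolog output:
splits = {
    "buy":     {"car": {"approve": 1, "reject": 2},
                "comp": {"approve": 7, "reject": 2}},
    "emp":     {"no": {"reject": 3},
                "yes": {"approve": 8, "reject": 1}},
    "married": {"no": {"approve": 4, "reject": 1},
                "yes": {"approve": 4, "reject": 3}},
    "sex":     {"f": {"approve": 4, "reject": 3},
                "m": {"approve": 4, "reject": 1}},
}

def oner_error(split):
    """Total error of the one-level tree that predicts the majority class
    in each branch: sum of non-majority counts over all branches."""
    wrong = sum(sum(d.values()) - max(d.values()) for d in split.values())
    total = sum(sum(d.values()) for d in split.values())
    return wrong / total

errors = {attr: oner_error(split) for attr, split in splits.items()}
best = min(errors, key=errors.get)
print(best, errors[best])  # emp is the OneR winner, error 1/12
```

Minimizing this error is OneR's attribute-selection criterion, and on this data it picks the same root (emp) as ID3's entropy-based criterion.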
pl/dydaktyka/ml/lab3.txt
· last modified: 2019/06/27 15:50 (external edit)