|
|
pl:dydaktyka:ml:lab3 [2013/03/12 20:51] esimon [Narzędzia] |
pl:dydaktyka:ml:lab3 [2019/06/27 15:50] |
====== Laboratorium 3 - Drzewa Decyzyjne ====== | |
Literatura: Tom M. Mitchell, Machine Learning, Rozdział 3. | |
| |
===== Lista i opis plików ===== | |
Potrzebne pliki to | |
program | |
{{:pl:prolog:prolog_lab:ml:id3.pl}} | |
oraz wcześniejsze dane | |
{{:pl:prolog:prolog_lab:ml:animals.pl}}, | |
{{:pl:prolog:prolog_lab:ml:loandata.pl}}. | |
| |
Pliki uzupełniające dla przykładów z PMG: | |
program | |
{{:pl:prolog:prolog_lab:ml:dtlearn_bool.pl}}, | |
dane | |
{{:pl:prolog:prolog_lab:ml:dtlearn_t2.pl}} | |
i dla zadanie | |
{{:pl:prolog:prolog_lab:ml:as11data.pl}}. | |
| |
===== Tworzenie, przeglądanie i dostęp do drzew decyzyjnych i reguł ===== | |
| |
<code prolog> | |
?- [id3]. | |
?- [animals]. | |
| |
?- id3(2). | |
</code> | |
| |
The parameter we specify (2) is used to control the growing of the tree. If a node covers | |
less than or equal to 2 examples, then it becomes a leaf no matter what the class distribution | |
is. | |
| |
To see the tree we use: | |
<code prolog> | |
?- showtree. | |
| |
has_covering=feathers => [bird/2] | |
has_covering=scales | |
gills=f => [reptile/2] | |
gills=t => [fish/1] | |
has_covering=none => [amphibian/1, mammal/1] | |
has_covering=hair => [mammal/3] | |
</code> | |
| |
The lists at the leaves show the class distribution in the examples covered. If we change | |
the threshold we get a different tree. With threshold 1 we will have one-class examples | |
at the leaves. For example: | |
| |
<code prolog> | |
?- id3(1),showtree. | |
| |
has_covering=feathers => [bird/2] | |
has_covering=scales | |
gills=f => [reptile/2] | |
gills=t => [fish/1] | |
has_covering=none | |
eggs=t => [amphibian/1] | |
eggs=f => [mammal/1] | |
has_covering=hair => [mammal/3] | |
</code> | |
| |
The "id3" query also generates "if-then" rules. We can see these rules by using "listing": | |
<code prolog> | |
?- listing(if). | |
| |
:- dynamic if/1. | |
| |
if[has_covering=feathers]then bird. | |
if[has_covering=scales, gills=f]then reptile. | |
if[has_covering=scales, gills=t]then fish. | |
if[has_covering=none, eggs=t]then amphibian. | |
if[has_covering=none, eggs=f]then mammal. | |
if[has_covering=hair]then mammal. | |
</code> | |
| |
NOTE: "listing" is a Prolog built-in procedure that lists the contents of the database | |
by specifying a name. This means that the if-then strucures are already in the database | |
and their name is "if". In fact, the ID3 program when started by "?- id3(1)." creates a | |
decision tree (that can be viewed by "?- showtree.") and loads in the database a set of | |
rules, that equivalently represent the tree. | |
| |
The rule created by ID3 are Prolog facts, but put in a special prefix format. | |
They can also be accessed directly by Prolog queries like these: | |
| |
<code prolog> | |
?- if X then mammal. | |
| |
X = [has_covering=none, eggs=f] ; | |
| |
X = [has_covering=hair] ; | |
| |
?- if X then Y. | |
| |
X = [has_covering=feathers] | |
Y = bird ; | |
| |
X = [has_covering=scales, gills=f] | |
Y = reptile ; | |
| |
X = [has_covering=scales, gills=t] | |
Y = fish ; | |
| |
X = [has_covering=none, eggs=t] | |
Y = amphibian ; | |
| |
X = [has_covering=none, eggs=f] | |
Y = mammal ; | |
| |
X = [has_covering=hair] | |
Y = mammal | |
</code> | |
| |
If you want to run ID3 with another data file, load the data only (no need to reload id3.pl). | |
However NOTE that after executing "?- id3(1)" the previous tree and rules are deleted and | |
replaced with those corresponding to the currently loaded data file. For example: | |
| |
<code prolog> | |
?- [loandata]. | |
| |
?- id3(1),showtree. | |
| |
emp=no => [reject/3] | |
emp=yes | |
buy=car | |
married=no => [reject/1] | |
married=yes => [approve/1] | |
buy=comp => [approve/7] | |
| |
?- id3(9),showtree. | |
| |
emp=no => [reject/3] | |
emp=yes => [approve/8, reject/1] | |
| |
</code> | |
| |
===== Narzędzia ===== | |
| |
Wykorzystując dane z pliku {{:pl:prolog:prolog_lab:ml:animals.pl}}, przetestuj działanie [[http://www.aispace.org/dTree/index.shtml|Narzędzie do budowania drzew decyzyjnych]]. | |
| |
**Uwaga** Oblicz Entropię i tzw. //Knowledge Gain// dla poszczególnych parametrów i wskaż najlepszy korzeń drzewa decyzyjnego. Czy narzędzie poprawnie wyznaczyło korzeń? | |
| |
| |
===== Porównanie z uczeniem pojęć ===== | |
| |
| |
Assuming that the loandata tree and rules created by "?- id3(1)" in the database, try this | |
(note the sets [2, 5, 11, 12] = [2, 11, 12] + [5]): | |
| |
<code prolog> | |
?- if H then [reject/N],model(H,M). | |
| |
H = [emp=no] | |
N = 3 | |
M = [2, 11, 12] ; | |
| |
H = [emp=yes, buy=car, married=no] | |
N = 1 | |
M = [5] ; | |
</code> | |
| |
This query accesses the two disjunctive components of the hypothesis for class "reject" | |
(the two paths in the decision tree following to the "reject" leaves). | |
| |
Now, obviously we should be able to run VS on each of the two sets and produce | |
consistent hypotheses. For the first part, [2, 11, 12] we put these examples in the | |
beginning and all from class "approve" to follow (we may also leave the other "reject" | |
example at the end). Let's call this new data file loandata1.pl and load it into the | |
database. | |
<code prolog> | |
?- [loandata1]. | |
</code> | |
| |
We may try "listing" to verify our data: | |
| |
<code prolog> | |
?- listing(example). | |
| |
example(2, reject, [emp=no, buy=comp, sex=f, married=yes]). | |
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]). | |
example(12, reject, [emp=no, buy=car, sex=f, married=yes]). | |
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]). | |
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]). | |
example(4, approve, [emp=yes, buy=car, sex=f, married=yes]). | |
example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]). | |
example(7, approve, [emp=yes, buy=comp, sex=f, married=no]). | |
example(8, approve, [emp=yes, buy=comp, sex=m, married=no]). | |
example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]). | |
example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]). | |
example(5, reject, [emp=yes, buy=car, sex=f, married=no]). | |
</code> | |
| |
Now, we load and run the candiate elimination algorithm (vs.pl) in batch mode. | |
| |
<code prolog> | |
?- [vs]. | |
| |
?- batch. | |
+[no, comp, f, yes] | |
G-set: [[?, ?, ?, ?]] | |
S-set: [[no, comp, f, yes]] | |
+[no, comp, m, yes] | |
G-set: [[?, ?, ?, ?]] | |
S-set: [[no, comp, ?, yes]] | |
+[no, car, f, yes] | |
G-set: [[?, ?, ?, ?]] | |
S-set: [[no, ?, ?, yes]] | |
-[yes, comp, f, no] | |
G-set: [[?, ?, ?, yes], [no, ?, ?, ?]] | |
S-set: [[no, ?, ?, yes]] | |
-[yes, comp, m, no] | |
-[yes, car, f, yes] | |
G-set: [[no, ?, ?, ?]] | |
S-set: [[no, ?, ?, yes]] | |
-[yes, comp, f, yes] | |
-[yes, comp, f, no] | |
-[yes, comp, m, no] | |
-[yes, comp, m, yes] | |
-[yes, comp, m, yes] | |
+[yes, car, f, no] | |
G-set: [] | |
S-set: [] | |
There is no consistent concept description in this language ! | |
</code> | |
| |
NOTE that the inconsistancy occurs because of the last example (# 5) which is a part | |
of the other hypothesis. The hypothesis created by ID3, [emp=no] is contained | |
in the version space: | |
| |
<code prolog> | |
G-set: [[no, ?, ?, ?]] | |
S-set: [[no, ?, ?, yes]] | |
</code> | |
| |
In fact, it is the one in the G-set. Interestingly, we have another one (the S-set). | |
This means that there may be another more specific decision tree, which looks like this: | |
| |
<code prolog> | |
emp=no | |
married=yes => reject THIS IS THE NEW BRANCH | |
married=no => approve THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE | |
emp=yes | |
buy=car | |
married=no => reject | |
married=yes => approve | |
buy=comp => approve | |
<code> | |
| |
NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm will | |
create the previous tree, because the order of the examples does not matter. | |
| |
Let's verify what the new brach covers. First, we need the "model" procedure back | |
(it's in id3.pl). | |
| |
<code prolog> | |
?- [id3]. | |
| |
?- model([emp=no,married=yes],M). | |
| |
M = [2, 11, 12] | |
</code> | |
| |
NOTE that this is the same coverage as before. | |
| |
<code prolog> | |
?- model([emp=no,married=no],M). | |
| |
M = [] | |
</code> | |
| |
The brach we added to complete the tree does not cover anything, but this is ok. | |
| |
Similarly the other disjunctive componenet of the ID3's "reject" hypothesis, | |
<code prolog> | |
if [emp=yes, buy=car, married=no] then reject | |
</code> | |
| |
may be learned with VS. First, we create file loandata2.pl with the following | |
order of the examples: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12]. | |
| |
Then running VS: | |
| |
<code prolog> | |
?- batch. | |
+[yes, car, f, no] | |
G-set: [[?, ?, ?, ?]] | |
S-set: [[yes, car, f, no]] | |
-[yes, comp, f, no] | |
G-set: [[?, car, ?, ?]] | |
S-set: [[yes, car, f, no]] | |
-[yes, comp, m, no] | |
-[yes, car, f, yes] | |
G-set: [[?, car, ?, no]] | |
S-set: [[yes, car, f, no]] | |
-[yes, comp, f, yes] | |
-[yes, comp, f, no] | |
-[yes, comp, m, no] | |
-[yes, comp, m, yes] | |
-[yes, comp, m, yes] | |
+[no, comp, f, yes] | |
G-set: [] | |
S-set: [] | |
There is no consistent concept description in this language ! | |
</code> | |
| |
NOTE that the version space here is bigger: | |
<code prolog> | |
G-set: [[?, car, ?, no]] | |
S-set: [[yes, car, f, no]] | |
</code> | |
| |
It has four hypotheses, among which is the one created by ID3. | |
| |
| |
===== Agorytm OneR ===== | |
| |
<code prolog> | |
?- [id3]. | |
?- [loandata]. | |
| |
?- id3(1),showtree. | |
| |
emp=no => [reject/3] | |
emp=yes | |
buy=car | |
married=no => [reject/1] | |
married=yes => [approve/1] | |
buy=comp => [approve/7] | |
</code> | |
| |
The id3.pl file defines procedures to compute the entopy and the distribution | |
of classes in a set of examples. For example: | |
| |
<code prolog> | |
?- if H then C, model(H,M),entropy(M,I). | |
| |
H = [emp=no] | |
C = [reject/3] | |
M = [2, 11, 12] | |
I = 0 ; | |
| |
H = [emp=yes, buy=car, married=no] | |
C = [reject/1] | |
M = [5] | |
I = 0 ; | |
| |
H = [emp=yes, buy=car, married=yes] | |
C = [approve/1] | |
M = [4] | |
I = 0 ; | |
| |
H = [emp=yes, buy=comp] | |
C = [approve/7] | |
M = [1, 3, 6, 7, 8, 9, 10] | |
I = 0 | |
</code> | |
| |
These are all 100% correct hypotheses and that's way the entropy of the | |
class distribution is 0. If we allow mixed classes at the leaves we may get | |
non-zero entropy. For example: | |
| |
<code prolog> | |
?- id3(2),showtree. | |
| |
emp=no => [reject/3] | |
emp=yes | |
buy=car => [approve/1, reject/1] | |
buy=comp => [approve/7] | |
| |
?- if H then C, model(H,M),entropy(M,I). | |
| |
H = [emp=no] | |
C = [reject/3] | |
M = [2, 11, 12] | |
I = 0 ; | |
| |
H = [emp=yes, buy=car] | |
C = [approve/1, reject/1] | |
M = [4, 5] | |
I = 1 ; | |
| |
H = [emp=yes, buy=comp] | |
C = [approve/7] | |
M = [1, 3, 6, 7, 8, 9, 10] | |
I = 0 ; | |
</code> | |
| |
The id3.pl program inlcudes a procedure to compute the class distribution of a set of | |
examples. For example: | |
| |
<code prolog> | |
?- if H then C, model(H,M),distr(M,D). | |
| |
H = [emp=no] | |
C = [reject/3] | |
M = [2, 11, 12] | |
D = [reject/3] ; | |
| |
H = [emp=yes, buy=car] | |
C = [approve/1, reject/1] | |
M = [4, 5] | |
D = [approve/1, reject/1] ; | |
| |
H = [emp=yes, buy=comp] | |
C = [approve/7] | |
M = [1, 3, 6, 7, 8, 9, 10] | |
D = [approve/7] ; | |
</code> | |
| |
This we may see the class distribution and the entropy in a set of examples. | |
For example: | |
| |
<code prolog> | |
?- distr([1,2,3,4],D),entropy([1,2,3,4],I). | |
| |
D = [approve/3, reject/1] | |
I = 0.811278 | |
</code> | |
| |
Here is the entropy of a hypothesis that covers mixed class examples: | |
| |
<code prolog> | |
?- model([buy=car],M),distr(M,D),entropy(M,I). | |
| |
M = [4, 5, 12] | |
D = [approve/1, reject/2] | |
I = 0.918296 | |
</code> | |
| |
The distribution of classes can be used to compute the error. | |
For example, asuming that we assigned the majority class label to | |
the set above (reject), the error will be 1/3 (one wrong classification | |
out of 3). | |
| |
The OneR algorithm generates a one-level decision tree, using an | |
error based attribute selection. The following query can be used to | |
find the OneR decision tree for our loan data ("attr" and "vals" are | |
described in the header of id3.pl. Try the query without writeln and fail to | |
understand better how it works). | |
| |
<code prolog> | |
?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail. | |
| |
[buy=car]-[approve/1, reject/2] | |
[buy=comp]-[approve/7, reject/2] | |
| |
[emp=no]-[reject/3] | |
[emp=yes]-[approve/8, reject/1] | |
| |
[married=no]-[approve/4, reject/1] | |
[married=yes]-[approve/4, reject/3] | |
| |
[sex=f]-[approve/4, reject/3] | |
[sex=m]-[approve/4, reject/1] | |
</code> | |
| |
The class distributions above allow us to compute the total error in each | |
of the four possible one-level trees: | |
| |
<code prolog> | |
1. buy=car => reject [approve/1, reject/2] | |
buy=comp => approve [approve/7, reject/2] | |
----------------------------------------- | |
Error = (1+2)/12 = 0.25 | |
| |
2. emp=no => reject [reject/3] | |
emp=yes => approve [approve/8, reject/1] | |
---------------------------------------- | |
Error = (0+1)/12 = 0.0833333 | |
| |
3. married=no => approve [approve/4, reject/1] | |
married=yes => approve [approve/4, reject/3] | |
-------------------------------------------- | |
Error = (1+3)/12 = 0.333333 | |
| |
4. sex=f => approve [approve/4, reject/3] | |
sex=m => approve [approve/4, reject/1] | |
-------------------------------------- | |
Error = (3+1)/12 = 0.333333 | |
</code> | |
| |
Obviously the best OneR tree is #2, because it has the lowest error rate of 0.0833333. | |
Interestingly the entropy-based attribute selection used by ID3 is in agreement | |
with the error-based one. This is clear from the fact the pruned ID tree is | |
the same as the OneR one. | |
| |