| ====== Laboratory 3 - Decision Trees ======
| Literature: Tom M. Mitchell, Machine Learning, Chapter 3.
| |
| ===== List and description of files =====
| The required files are the
| program
| {{:pl:prolog:prolog_lab:ml:id3.pl}}
| and the earlier data sets
| {{:pl:prolog:prolog_lab:ml:animals.pl}},
| {{:pl:prolog:prolog_lab:ml:loandata.pl}}.
|
| Supplementary files for the examples from PMG:
| the program
| {{:pl:prolog:prolog_lab:ml:dtlearn_bool.pl}},
| the data
| {{:pl:prolog:prolog_lab:ml:dtlearn_t2.pl}}
| and, for the exercise,
| {{:pl:prolog:prolog_lab:ml:as11data.pl}}.
| |
| ===== Creating, viewing and accessing decision trees and rules =====
| |
| <code prolog> |
| ?- [id3]. |
| ?- [animals]. |
| |
| ?- id3(2). |
| </code> |
| |
| The parameter we specify (2) controls the growth of the tree: if a node covers
| two examples or fewer, it becomes a leaf regardless of the class distribution.
| |
| To see the tree we use: |
| <code prolog> |
| ?- showtree. |
| |
| has_covering=feathers => [bird/2] |
| has_covering=scales |
| gills=f => [reptile/2] |
| gills=t => [fish/1] |
| has_covering=none => [amphibian/1, mammal/1] |
| has_covering=hair => [mammal/3] |
| </code> |
| |
| The lists at the leaves show the class distribution in the examples covered. If we change
| the threshold we get a different tree. With threshold 1 we will have single-class examples
| at the leaves. For example:
| |
| <code prolog> |
| ?- id3(1),showtree. |
| |
| has_covering=feathers => [bird/2] |
| has_covering=scales |
| gills=f => [reptile/2] |
| gills=t => [fish/1] |
| has_covering=none |
| eggs=t => [amphibian/1] |
| eggs=f => [mammal/1] |
| has_covering=hair => [mammal/3] |
| </code> |
| |
| The "id3" query also generates "if-then" rules. We can see these rules by using "listing": |
| <code prolog> |
| ?- listing(if). |
| |
| :- dynamic if/1. |
| |
| if[has_covering=feathers]then bird. |
| if[has_covering=scales, gills=f]then reptile. |
| if[has_covering=scales, gills=t]then fish. |
| if[has_covering=none, eggs=t]then amphibian. |
| if[has_covering=none, eggs=f]then mammal. |
| if[has_covering=hair]then mammal. |
| </code> |
| |
| NOTE: "listing" is a Prolog built-in procedure that lists the contents of the database
| entries with a given name. This means that the if-then structures are already in the
| database under the name "if". In fact, when started by "?- id3(1).", the ID3 program
| creates a decision tree (viewable with "?- showtree.") and loads into the database a set
| of rules that equivalently represent the tree.
| |
| The rules created by ID3 are Prolog facts, written in a special prefix operator format.
| They can also be accessed directly by Prolog queries like these:
| |
| <code prolog> |
| ?- if X then mammal. |
| |
| X = [has_covering=none, eggs=f] ; |
| |
| X = [has_covering=hair] ; |
| |
| ?- if X then Y. |
| |
| X = [has_covering=feathers] |
| Y = bird ; |
| |
| X = [has_covering=scales, gills=f] |
| Y = reptile ; |
| |
| X = [has_covering=scales, gills=t] |
| Y = fish ; |
| |
| X = [has_covering=none, eggs=t] |
| Y = amphibian ; |
| |
| X = [has_covering=none, eggs=f] |
| Y = mammal ; |
| |
| X = [has_covering=hair] |
| Y = mammal |
| </code> |
| |
| If you want to run ID3 with another data file, load the data only (there is no need to
| reload id3.pl). NOTE, however, that after executing "?- id3(1)" the previous tree and
| rules are deleted and replaced with those corresponding to the currently loaded data
| file. For example:
| |
| <code prolog> |
| ?- [loandata]. |
| |
| ?- id3(1),showtree. |
| |
| emp=no => [reject/3] |
| emp=yes |
| buy=car |
| married=no => [reject/1] |
| married=yes => [approve/1] |
| buy=comp => [approve/7] |
| |
| ?- id3(9),showtree. |
| |
| emp=no => [reject/3] |
| emp=yes => [approve/8, reject/1] |
| |
| </code> |
| |
| ===== Tools =====
|
| Using the data from the file {{:pl:prolog:prolog_lab:ml:animals.pl}}, test the [[http://www.aispace.org/dTree/index.shtml|decision tree building tool]].
|
| You can import the XML file with the {{:pl:dydaktyka:ml:lab3.xml|training set}}.
|
| **Note** Compute the Entropy and the so-called //Information Gain// for the individual attributes and indicate the best root of the decision tree. Did the tool determine the root correctly?
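| As a cross-check of the hand computation, the entropy of the animals training set and the information gain of has_covering can be computed in Python. The class counts below are read off the showtree output above; this is an independent sketch, not part of the lab's Prolog code.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts in the 10 animal examples: bird/2, reptile/2, fish/1,
# amphibian/1, mammal/4 (read off the showtree output above).
whole = [2, 2, 1, 1, 4]

# Partition induced by has_covering, one class-count list per value:
# feathers -> [bird/2], scales -> [reptile/2, fish/1],
# none -> [amphibian/1, mammal/1], hair -> [mammal/3]
parts = [[2], [2, 1], [1, 1], [3]]

n = sum(whole)
remainder = sum(sum(p) / n * entropy(p) for p in parts)
gain = entropy(whole) - remainder

print(round(entropy(whole), 4))  # 2.1219
print(round(gain, 4))            # 1.6464
```

| The same computation, repeated per attribute, identifies the attribute with the highest gain as the root.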
| |
| |
| ===== Comparison with concept learning =====
| |
| |
| Assuming that the loandata tree and rules created by "?- id3(1)" are in the database,
| try this (note that the sets satisfy [2, 5, 11, 12] = [2, 11, 12] + [5]):
| |
| <code prolog> |
| ?- if H then [reject/N],model(H,M). |
| |
| H = [emp=no] |
| N = 3 |
| M = [2, 11, 12] ; |
| |
| H = [emp=yes, buy=car, married=no] |
| N = 1 |
| M = [5] ; |
| </code> |
| |
| This query accesses the two disjunctive components of the hypothesis for class "reject"
| (the two paths in the decision tree leading to the "reject" leaves).
| |
| Now, obviously we should be able to run VS on each of the two sets and produce
| consistent hypotheses. For the first part, [2, 11, 12], we put these examples at the
| beginning, followed by all examples from class "approve" (we may also leave the other
| "reject" example at the end). Let's call this new data file loandata1.pl and load it
| into the database.
| <code prolog> |
| ?- [loandata1]. |
| </code> |
| |
| We may try "listing" to verify our data: |
| |
| <code prolog> |
| ?- listing(example). |
| |
| example(2, reject, [emp=no, buy=comp, sex=f, married=yes]). |
| example(11, reject, [emp=no, buy=comp, sex=m, married=yes]). |
| example(12, reject, [emp=no, buy=car, sex=f, married=yes]). |
| example(1, approve, [emp=yes, buy=comp, sex=f, married=no]). |
| example(3, approve, [emp=yes, buy=comp, sex=m, married=no]). |
| example(4, approve, [emp=yes, buy=car, sex=f, married=yes]). |
| example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]). |
| example(7, approve, [emp=yes, buy=comp, sex=f, married=no]). |
| example(8, approve, [emp=yes, buy=comp, sex=m, married=no]). |
| example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]). |
| example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]). |
| example(5, reject, [emp=yes, buy=car, sex=f, married=no]). |
| </code> |
| |
| Now, we load and run the candidate elimination algorithm (vs.pl) in batch mode.
| |
| <code prolog> |
| ?- [vs]. |
| |
| ?- batch. |
| +[no, comp, f, yes] |
| G-set: [[?, ?, ?, ?]] |
| S-set: [[no, comp, f, yes]] |
| +[no, comp, m, yes] |
| G-set: [[?, ?, ?, ?]] |
| S-set: [[no, comp, ?, yes]] |
| +[no, car, f, yes] |
| G-set: [[?, ?, ?, ?]] |
| S-set: [[no, ?, ?, yes]] |
| -[yes, comp, f, no] |
| G-set: [[?, ?, ?, yes], [no, ?, ?, ?]] |
| S-set: [[no, ?, ?, yes]] |
| -[yes, comp, m, no] |
| -[yes, car, f, yes] |
| G-set: [[no, ?, ?, ?]] |
| S-set: [[no, ?, ?, yes]] |
| -[yes, comp, f, yes] |
| -[yes, comp, f, no] |
| -[yes, comp, m, no] |
| -[yes, comp, m, yes] |
| -[yes, comp, m, yes] |
| +[yes, car, f, no] |
| G-set: [] |
| S-set: [] |
| There is no consistent concept description in this language ! |
| </code> |
| |
| NOTE that the inconsistency occurs because of the last example (#5), which is part
| of the other hypothesis. The hypothesis created by ID3, [emp=no], is contained
| in the version space:
| |
| <code prolog> |
| G-set: [[no, ?, ?, ?]] |
| S-set: [[no, ?, ?, yes]] |
| </code> |
| |
| In fact, it is the one in the G-set. Interestingly, we have another one (the S-set). |
| This means that there may be another more specific decision tree, which looks like this: |
| |
| <code prolog> |
| emp=no |
| married=yes => reject THIS IS THE NEW BRANCH |
| married=no => approve THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE |
| emp=yes |
| buy=car |
| married=no => reject |
| married=yes => approve |
| buy=comp => approve |
| </code>
| |
| NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm will |
| create the previous tree, because the order of the examples does not matter. |
| |
| Let's verify what the new branch covers. First, we need the "model" procedure back
| (it's in id3.pl).
| |
| <code prolog> |
| ?- [id3]. |
| |
| ?- model([emp=no,married=yes],M). |
| |
| M = [2, 11, 12] |
| </code> |
| |
| NOTE that this is the same coverage as before. |
| |
| <code prolog> |
| ?- model([emp=no,married=no],M). |
| |
| M = [] |
| </code> |
| |
| The branch we added to complete the tree does not cover anything, but this is OK.
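| The same coverage check can be reproduced directly from the example list: a hypothesis covers an example when all of its attribute=value conditions hold. The sketch below (an illustration, not the model/2 procedure itself) uses the twelve loandata examples listed earlier.

```python
# The twelve loan examples: id -> (class, attribute values), as listed above.
EXAMPLES = {
    1:  ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'f', 'married': 'no'}),
    2:  ('reject',  {'emp': 'no',  'buy': 'comp', 'sex': 'f', 'married': 'yes'}),
    3:  ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'no'}),
    4:  ('approve', {'emp': 'yes', 'buy': 'car',  'sex': 'f', 'married': 'yes'}),
    5:  ('reject',  {'emp': 'yes', 'buy': 'car',  'sex': 'f', 'married': 'no'}),
    6:  ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'f', 'married': 'yes'}),
    7:  ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'f', 'married': 'no'}),
    8:  ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'no'}),
    9:  ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'yes'}),
    10: ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'yes'}),
    11: ('reject',  {'emp': 'no',  'buy': 'comp', 'sex': 'm', 'married': 'yes'}),
    12: ('reject',  {'emp': 'no',  'buy': 'car',  'sex': 'f', 'married': 'yes'}),
}

def model(conditions):
    """Ids of the examples satisfying all attribute=value conditions."""
    return sorted(i for i, (_cls, attrs) in EXAMPLES.items()
                  if all(attrs[a] == v for a, v in conditions.items()))

print(model({'emp': 'no', 'married': 'yes'}))  # [2, 11, 12]
print(model({'emp': 'no', 'married': 'no'}))   # []
```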
| |
| Similarly, the other disjunctive component of ID3's "reject" hypothesis,
| <code prolog> |
| if [emp=yes, buy=car, married=no] then reject |
| </code> |
| |
| may be learned with VS. First, we create file loandata2.pl with the following |
| order of the examples: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12]. |
| |
| Then running VS: |
| |
| <code prolog> |
| ?- batch. |
| +[yes, car, f, no] |
| G-set: [[?, ?, ?, ?]] |
| S-set: [[yes, car, f, no]] |
| -[yes, comp, f, no] |
| G-set: [[?, car, ?, ?]] |
| S-set: [[yes, car, f, no]] |
| -[yes, comp, m, no] |
| -[yes, car, f, yes] |
| G-set: [[?, car, ?, no]] |
| S-set: [[yes, car, f, no]] |
| -[yes, comp, f, yes] |
| -[yes, comp, f, no] |
| -[yes, comp, m, no] |
| -[yes, comp, m, yes] |
| -[yes, comp, m, yes] |
| +[no, comp, f, yes] |
| G-set: [] |
| S-set: [] |
| There is no consistent concept description in this language ! |
| </code> |
| |
| NOTE that the version space here is bigger: |
| <code prolog> |
| G-set: [[?, car, ?, no]] |
| S-set: [[yes, car, f, no]] |
| </code> |
| |
| It has four hypotheses, among which is the one created by ID3. |
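| The count of four can be verified by enumerating every conjunctive hypothesis lying between the S-set and G-set members, i.e., every hypothesis that generalizes [yes, car, f, no] and specializes [?, car, ?, no]. A small Python enumeration (an illustration only):

```python
from itertools import product

S = ('yes', 'car', 'f', 'no')
G = ('?',   'car', '?', 'no')

# At each position a hypothesis keeps the S value; where G has '?', it may
# also generalize that position to '?'. (Where G is specific, it equals S.)
choices = [(s,) if g != '?' else (s, '?') for s, g in zip(S, G)]
version_space = list(product(*choices))

print(len(version_space))  # 4
for h in version_space:
    print(h)
```

| The four hypotheses include ID3's [emp=yes, buy=car, married=no], i.e. (yes, car, ?, no).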
| |
| |
| ===== The OneR algorithm =====
| |
| <code prolog> |
| ?- [id3]. |
| ?- [loandata]. |
| |
| ?- id3(1),showtree. |
| |
| emp=no => [reject/3] |
| emp=yes |
| buy=car |
| married=no => [reject/1] |
| married=yes => [approve/1] |
| buy=comp => [approve/7] |
| </code> |
| |
| The id3.pl file defines procedures to compute the entropy and the distribution
| of classes in a set of examples. For example:
| |
| <code prolog> |
| ?- if H then C, model(H,M),entropy(M,I). |
| |
| H = [emp=no] |
| C = [reject/3] |
| M = [2, 11, 12] |
| I = 0 ; |
| |
| H = [emp=yes, buy=car, married=no] |
| C = [reject/1] |
| M = [5] |
| I = 0 ; |
| |
| H = [emp=yes, buy=car, married=yes] |
| C = [approve/1] |
| M = [4] |
| I = 0 ; |
| |
| H = [emp=yes, buy=comp] |
| C = [approve/7] |
| M = [1, 3, 6, 7, 8, 9, 10] |
| I = 0 |
| </code> |
| |
| These are all 100% correct hypotheses and that's why the entropy of the
| class distribution is 0. If we allow mixed classes at the leaves we may get
| non-zero entropy. For example:
| |
| <code prolog> |
| ?- id3(2),showtree. |
| |
| emp=no => [reject/3] |
| emp=yes |
| buy=car => [approve/1, reject/1] |
| buy=comp => [approve/7] |
| |
| ?- if H then C, model(H,M),entropy(M,I). |
| |
| H = [emp=no] |
| C = [reject/3] |
| M = [2, 11, 12] |
| I = 0 ; |
| |
| H = [emp=yes, buy=car] |
| C = [approve/1, reject/1] |
| M = [4, 5] |
| I = 1 ; |
| |
| H = [emp=yes, buy=comp] |
| C = [approve/7] |
| M = [1, 3, 6, 7, 8, 9, 10] |
| I = 0 ; |
| </code> |
| |
| The id3.pl program includes a procedure to compute the class distribution of a set of
| examples. For example:
| |
| <code prolog> |
| ?- if H then C, model(H,M),distr(M,D). |
| |
| H = [emp=no] |
| C = [reject/3] |
| M = [2, 11, 12] |
| D = [reject/3] ; |
| |
| H = [emp=yes, buy=car] |
| C = [approve/1, reject/1] |
| M = [4, 5] |
| D = [approve/1, reject/1] ; |
| |
| H = [emp=yes, buy=comp] |
| C = [approve/7] |
| M = [1, 3, 6, 7, 8, 9, 10] |
| D = [approve/7] ; |
| </code> |
| |
| This way we may see the class distribution and the entropy in a set of examples.
| For example:
| |
| <code prolog> |
| ?- distr([1,2,3,4],D),entropy([1,2,3,4],I). |
| |
| D = [approve/3, reject/1] |
| I = 0.811278 |
| </code> |
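| The entropy values printed by id3.pl can be double-checked with the standard formula H = -sum p_i log2 p_i applied to the class distribution. A quick Python check, independent of the Prolog code:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as counts per class."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

print(round(entropy([3, 1]), 6))  # 0.811278  (approve/3, reject/1)
print(round(entropy([1, 2]), 6))  # 0.918296  (approve/1, reject/2)
```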
| |
| Here is the entropy of a hypothesis that covers mixed class examples: |
| |
| <code prolog> |
| ?- model([buy=car],M),distr(M,D),entropy(M,I). |
| |
| M = [4, 5, 12] |
| D = [approve/1, reject/2] |
| I = 0.918296 |
| </code> |
| |
| The distribution of classes can be used to compute the error.
| For example, assuming that we assign the majority class label (reject) to
| the set above, the error will be 1/3 (one wrong classification
| out of 3).
| |
| The OneR algorithm generates a one-level decision tree, using an
| error-based attribute selection. The following query can be used to
| find the OneR decision tree for our loan data ("attr" and "vals" are
| described in the header of id3.pl; try the query without "writeln" and "fail"
| to understand better how it works):
| |
| <code prolog> |
| ?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail. |
| |
| [buy=car]-[approve/1, reject/2] |
| [buy=comp]-[approve/7, reject/2] |
| |
| [emp=no]-[reject/3] |
| [emp=yes]-[approve/8, reject/1] |
| |
| [married=no]-[approve/4, reject/1] |
| [married=yes]-[approve/4, reject/3] |
| |
| [sex=f]-[approve/4, reject/3] |
| [sex=m]-[approve/4, reject/1] |
| </code> |
| |
| The class distributions above allow us to compute the total error in each |
| of the four possible one-level trees: |
| |
| <code prolog> |
| 1. buy=car => reject [approve/1, reject/2] |
| buy=comp => approve [approve/7, reject/2] |
| ----------------------------------------- |
| Error = (1+2)/12 = 0.25 |
| |
| 2. emp=no => reject [reject/3] |
| emp=yes => approve [approve/8, reject/1] |
| ---------------------------------------- |
| Error = (0+1)/12 = 0.0833333 |
| |
| 3. married=no => approve [approve/4, reject/1] |
| married=yes => approve [approve/4, reject/3] |
| -------------------------------------------- |
| Error = (1+3)/12 = 0.333333 |
| |
| 4. sex=f => approve [approve/4, reject/3] |
| sex=m => approve [approve/4, reject/1] |
| -------------------------------------- |
| Error = (3+1)/12 = 0.333333 |
| </code> |
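| The four error rates above can be recomputed mechanically: for each attribute, OneR labels every value with its majority class and counts the misclassified examples. A Python sketch over the twelve loan examples (an independent illustration, not the id3.pl predicates):

```python
from collections import Counter

# The twelve loan examples: (class, attribute values), as listed earlier.
EXAMPLES = [
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'f', 'married': 'no'}),
    ('reject',  {'emp': 'no',  'buy': 'comp', 'sex': 'f', 'married': 'yes'}),
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'no'}),
    ('approve', {'emp': 'yes', 'buy': 'car',  'sex': 'f', 'married': 'yes'}),
    ('reject',  {'emp': 'yes', 'buy': 'car',  'sex': 'f', 'married': 'no'}),
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'f', 'married': 'yes'}),
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'f', 'married': 'no'}),
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'no'}),
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'yes'}),
    ('approve', {'emp': 'yes', 'buy': 'comp', 'sex': 'm', 'married': 'yes'}),
    ('reject',  {'emp': 'no',  'buy': 'comp', 'sex': 'm', 'married': 'yes'}),
    ('reject',  {'emp': 'no',  'buy': 'car',  'sex': 'f', 'married': 'yes'}),
]

def oner_error(attr):
    """Misclassification rate of the one-level tree splitting on attr."""
    by_value = {}
    for cls, attrs in EXAMPLES:
        by_value.setdefault(attrs[attr], Counter())[cls] += 1
    # each value predicts its majority class; errors = non-majority counts
    wrong = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
    return wrong / len(EXAMPLES)

for a in ['buy', 'emp', 'married', 'sex']:
    print(a, round(oner_error(a), 4))
# best attribute: emp, with error 1/12 ~ 0.0833
```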
| |
| Obviously the best OneR tree is #2, because it has the lowest error rate of 0.0833333.
| Interestingly, the entropy-based attribute selection used by ID3 is in agreement
| with the error-based one. This is clear from the fact that the pruned ID3 tree
| (produced by "?- id3(9)") is the same as the OneR one.
| |