====== Lab 3 - Decision Trees ======
Literature: Tom M. Mitchell, Machine Learning, Chapter 3.

===== List and Description of Files =====
The files needed are the program
{{:pl:prolog:prolog_lab:ml:id3.pl}}
and the data from the earlier labs:
{{:pl:prolog:prolog_lab:ml:animals.pl}},
{{:pl:prolog:prolog_lab:ml:loandata.pl}}.

Supplementary files for the PMG examples:
the program
{{:pl:prolog:prolog_lab:ml:dtlearn_bool.pl}},
the data
{{:pl:prolog:prolog_lab:ml:dtlearn_t2.pl}}
and, for the exercise,
{{:pl:prolog:prolog_lab:ml:as11data.pl}}.
===== Creating, Browsing and Accessing Decision Trees and Rules =====

<code prolog>
?- [id3].
?- [animals].

?- id3(2).
</code>

The parameter we specify (2) controls the growth of the tree: if a node covers 2 or fewer examples, it becomes a leaf regardless of the class distribution.

To see the tree we use:
<code prolog>
?- showtree.

  has_covering=feathers => [bird/2]
  has_covering=scales
    gills=f => [reptile/2]
    gills=t => [fish/1]
  has_covering=none => [amphibian/1, mammal/1]
  has_covering=hair => [mammal/3]
</code>

The lists at the leaves show the class distribution in the examples covered. If we change
the threshold we get a different tree. With threshold 1 we will have one-class examples
at the leaves. For example:

<code prolog>
?- id3(1),showtree.

  has_covering=feathers => [bird/2]
  has_covering=scales
    gills=f => [reptile/2]
    gills=t => [fish/1]
  has_covering=none
    eggs=t => [amphibian/1]
    eggs=f => [mammal/1]
  has_covering=hair => [mammal/3]
</code>

The "id3" query also generates "if-then" rules. We can see these rules by using "listing":
<code prolog>
?- listing(if).

:- dynamic if/1.

if[has_covering=feathers]then bird.
if[has_covering=scales, gills=f]then reptile.
if[has_covering=scales, gills=t]then fish.
if[has_covering=none, eggs=t]then amphibian.
if[has_covering=none, eggs=f]then mammal.
if[has_covering=hair]then mammal.
</code>

NOTE: "listing" is a Prolog built-in procedure that lists the contents of the database
for a given name. This means that the if-then structures are already in the database
and their name is "if". In fact, when started by "?- id3(1).", the ID3 program creates a
decision tree (which can be viewed by "?- showtree.") and loads into the database a set of
rules that equivalently represent the tree.
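
The prefix "if ... then ..." syntax is made legal Prolog by operator declarations. A minimal sketch of what id3.pl presumably declares (the exact priorities are an assumption):

<code prolog>
% Hypothetical operator declarations: with these,
% "if [emp=no] then reject" parses as the term
% if(then([emp=no], reject)), which can be asserted as a fact.
:- op(900, fx, if).
:- op(700, xfx, then).
:- dynamic (if)/1.
</code>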

The rules created by ID3 are Prolog facts, but stored in a special prefix format.
They can also be accessed directly by Prolog queries like these:

<code prolog>
?- if X then mammal.

X = [has_covering=none, eggs=f] ;

X = [has_covering=hair] ;

?- if X then Y.

X = [has_covering=feathers]
Y = bird ;

X = [has_covering=scales, gills=f]
Y = reptile ;

X = [has_covering=scales, gills=t]
Y = fish ;

X = [has_covering=none, eggs=t]
Y = amphibian ;

X = [has_covering=none, eggs=f]
Y = mammal ;

X = [has_covering=hair]
Y = mammal
</code>

If you want to run ID3 with another data file, load the data only (there is no need to reload id3.pl).
However, NOTE that after executing "?- id3(1)" the previous tree and rules are deleted and
replaced with those corresponding to the currently loaded data file. For example:

<code prolog>
?- [loandata].

?- id3(1),showtree.

  emp=no => [reject/3]
  emp=yes
    buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
    buy=comp => [approve/7]

?- id3(9),showtree.

  emp=no => [reject/3]
  emp=yes => [approve/8, reject/1]

</code>

===== Tools =====

Using the data from the file {{:pl:prolog:prolog_lab:ml:animals.pl}}, test the [[http://www.aispace.org/dTree/index.shtml|decision tree building tool]].

You can import an XML file with the {{:pl:dydaktyka:ml:lab3.xml|training set}}.

**Note** Compute the entropy and the //Information Gain// for each attribute and indicate the best root for the decision tree. Did the tool determine the root correctly?
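
For hand-checking: the entropy of a class distribution with proportions p_i is H = -Σ p_i · log2(p_i), and the information gain of an attribute is the entropy of the whole set minus the weighted average entropy of the subsets induced by its values. A minimal sketch of the entropy computation in SWI-Prolog (entropy_of/2 is a hypothetical helper, distinct from entropy/2 in id3.pl, which takes example IDs):

<code prolog>
% Entropy of a class-count list such as [approve/8, reject/4].
entropy_of(Counts, E) :-
    findall(N, member(_/N, Counts), Ns),
    sum_list(Ns, Total),
    findall(T,
            ( member(N, Ns),
              P is N / Total,
              T is -P * log(P) / log(2) ),
            Ts),
    sum_list(Ts, E).
</code>

With the class counts read off the animals tree above, ?- entropy_of([bird/2, reptile/2, fish/1, amphibian/1, mammal/4], E). should give E ≈ 2.122, the entropy of the whole training set.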


===== Comparison with Concept Learning =====


Assuming that the loandata tree and rules created by "?- id3(1)" are in the database, try this
(note that the sets satisfy [2, 5, 11, 12] = [2, 11, 12] + [5]):

<code prolog>
?- if H then [reject/N],model(H,M).

H = [emp=no]
N = 3
M = [2, 11, 12] ;

H = [emp=yes, buy=car, married=no]
N = 1
M = [5] ;
</code>

This query accesses the two disjunctive components of the hypothesis for class "reject"
(the two paths in the decision tree leading to the "reject" leaves).

Now, obviously, we should be able to run VS on each of the two sets and produce
consistent hypotheses. For the first part, [2, 11, 12], we put these examples at the
beginning, followed by all examples from class "approve" (we may also leave the other
"reject" example at the end). Let's call this new data file loandata1.pl and load it into
the database.
<code prolog>
?- [loandata1].
</code>

We may try "listing" to verify our data:

<code prolog>
?- listing(example).

example(2, reject, [emp=no, buy=comp, sex=f, married=yes]).
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]).
example(12, reject, [emp=no, buy=car, sex=f, married=yes]).
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(4, approve, [emp=yes, buy=car, sex=f, married=yes]).
example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]).
example(7, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(8, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(5, reject, [emp=yes, buy=car, sex=f, married=no]).
</code>

Now, we load and run the candidate elimination algorithm (vs.pl) in batch mode.

<code prolog>
?- [vs].

?- batch.
+[no, comp, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, f, yes]]
+[no, comp, m, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, ?, yes]]
+[no, car, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, f, no]
G-set: [[?, ?, ?, yes], [no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]
+[yes, car, f, no]
G-set: []
S-set: []
There is no consistent concept description in this language !
</code>

NOTE that the inconsistency occurs because of the last example (# 5), which is part
of the other hypothesis. The hypothesis created by ID3, [emp=no], is contained
in the version space:

<code prolog>
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
</code>

In fact, it is the one in the G-set. Interestingly, we have another one (the S-set).
This means that there may be another, more specific decision tree, which looks like this:

<code prolog>
  emp=no
    married=yes => reject        THIS IS THE NEW BRANCH
    married=no => approve        THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE
  emp=yes
    buy=car
      married=no => reject
      married=yes => approve
    buy=comp => approve
</code>

NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm will
create the previous tree, because the order of the examples does not matter.

Let's verify what the new branch covers. First, we need the "model" procedure back
(it's in id3.pl).

<code prolog>
?- [id3].

?- model([emp=no,married=yes],M).

M = [2, 11, 12]
</code>

NOTE that this is the same coverage as before.

<code prolog>
?- model([emp=no,married=no],M).

M = []
</code>

The branch we added to complete the tree does not cover anything, but this is OK.

Similarly, the other disjunctive component of ID3's "reject" hypothesis,
<code prolog>
if [emp=yes, buy=car, married=no] then reject
</code>

may be learned with VS. First, we create a file loandata2.pl with the following
order of the examples: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12].
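
As a sketch, loandata2.pl contains the same example/3 facts as loandata.pl, just reordered (values copied from the listing above; only some facts shown, the elided ones follow the same format):

<code prolog>
% The single "reject" example #5 comes first, the "approve"
% examples follow, and the remaining "reject" examples come last.
example(5, reject, [emp=yes, buy=car, sex=f, married=no]).
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]).
% ... examples 4, 6, 7, 8, 9, 10 ...
example(2, reject, [emp=no, buy=comp, sex=f, married=yes]).
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]).
example(12, reject, [emp=no, buy=car, sex=f, married=yes]).
</code>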

Then running VS:

<code prolog>
?- batch.
+[yes, car, f, no]
G-set: [[?, ?, ?, ?]]
S-set: [[yes, car, f, no]]
-[yes, comp, f, no]
G-set: [[?, car, ?, ?]]
S-set: [[yes, car, f, no]]
-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]
-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]
+[no, comp, f, yes]
G-set: []
S-set: []
There is no consistent concept description in this language !
</code>

NOTE that the version space here is bigger:
<code prolog>
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]
</code>

It has four hypotheses: [yes, car, f, no], [yes, car, ?, no], [?, car, f, no] and [?, car, ?, no]. Among them is the one created by ID3, [emp=yes, buy=car, married=no], i.e. [yes, car, ?, no].


===== The OneR Algorithm =====

<code prolog>
?- [id3].
?- [loandata].

?- id3(1),showtree.

  emp=no => [reject/3]
  emp=yes
    buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
    buy=comp => [approve/7]
</code>

The id3.pl file defines procedures to compute the entropy and the distribution
of classes in a set of examples. For example:

<code prolog>
?- if H then C, model(H,M),entropy(M,I).

H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;

H = [emp=yes, buy=car, married=no]
C = [reject/1]
M = [5]
I = 0 ;

H = [emp=yes, buy=car, married=yes]
C = [approve/1]
M = [4]
I = 0 ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0
</code>

These are all 100% correct hypotheses, and that's why the entropy of the
class distribution is 0. If we allow mixed classes at the leaves we may get
non-zero entropy. For example:

<code prolog>
?- id3(2),showtree.

  emp=no => [reject/3]
  emp=yes
    buy=car => [approve/1, reject/1]
    buy=comp => [approve/7]

?- if H then C, model(H,M),entropy(M,I).

H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;

H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
I = 1 ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0 ;
</code>

The id3.pl program includes a procedure to compute the class distribution of a set of
examples. For example:

<code prolog>
?- if H then C, model(H,M),distr(M,D).

H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
D = [reject/3] ;

H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
D = [approve/1, reject/1] ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
D = [approve/7] ;
</code>

Thus we may see the class distribution and the entropy in a set of examples.
For example:

<code prolog>
?- distr([1,2,3,4],D),entropy([1,2,3,4],I).

D = [approve/3, reject/1]
I = 0.811278
</code>

Here is the entropy of a hypothesis that covers mixed-class examples:

<code prolog>
?- model([buy=car],M),distr(M,D),entropy(M,I).

M = [4, 5, 12]
D = [approve/1, reject/2]
I = 0.918296
</code>

The distribution of classes can be used to compute the error.
For example, assuming that we assign the majority class label (reject) to
the set above, the error will be 1/3 (one wrong classification
out of 3).
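
This majority-class error can also be computed directly from the distribution. A minimal sketch (majority_error/2 is a hypothetical helper, not part of id3.pl):

<code prolog>
% Fraction of examples misclassified when the whole set is
% labelled with its majority class, e.g.
% ?- majority_error([approve/1, reject/2], Err).   Err = 1/3.
majority_error(Distr, Err) :-
    findall(N, member(_/N, Distr), Ns),
    sum_list(Ns, Total),
    max_list(Ns, Max),
    Err is (Total - Max) / Total.
</code>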

The OneR algorithm generates a one-level decision tree, using
error-based attribute selection. The following query can be used to
find the OneR decision tree for our loan data ("attr" and "vals" are
described in the header of id3.pl; try the query without writeln and fail to
better understand how it works):

<code prolog>
?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail.

[buy=car]-[approve/1, reject/2]
[buy=comp]-[approve/7, reject/2]

[emp=no]-[reject/3]
[emp=yes]-[approve/8, reject/1]

[married=no]-[approve/4, reject/1]
[married=yes]-[approve/4, reject/3]

[sex=f]-[approve/4, reject/3]
[sex=m]-[approve/4, reject/1]
</code>

The class distributions above allow us to compute the total error in each
of the four possible one-level trees:

<code prolog>
1. buy=car => reject [approve/1, reject/2]
   buy=comp => approve [approve/7, reject/2]
   -----------------------------------------
   Error = (1+2)/12 = 0.25

2. emp=no => reject [reject/3]
   emp=yes => approve [approve/8, reject/1]
   ----------------------------------------
   Error = (0+1)/12 = 0.0833333

3. married=no => approve [approve/4, reject/1]
   married=yes => approve [approve/4, reject/3]
   --------------------------------------------
   Error = (1+3)/12 = 0.333333

4. sex=f => approve [approve/4, reject/3]
   sex=m => approve [approve/4, reject/1]
   --------------------------------------
   Error = (3+1)/12 = 0.333333
</code>

Obviously the best OneR tree is #2, because it has the lowest error rate, 0.0833333.
Interestingly, the entropy-based attribute selection used by ID3 agrees
with the error-based one. This is clear from the fact that the pruned ID3 tree
(produced by id3(9) above) is the same as the OneR one.
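
The per-attribute errors above can also be computed programmatically, combining vals/2, model/2 and distr/2 from id3.pl (oner_error/2 itself is a hypothetical helper, only a sketch):

<code prolog>
% Total OneR error of one attribute: for each of its values, count
% the covered examples outside the majority class, then divide the
% summed misses by the total number of covered examples.
oner_error(Attr, Err) :-
    vals(Attr, Vals),
    findall(Miss-Tot,
            ( member(V, Vals),
              model([V], M), distr(M, D),
              findall(N, member(_/N, D), Ns),
              sum_list(Ns, Tot),
              max_list(Ns, Max),
              Miss is Tot - Max ),
            Pairs),
    findall(Ms, member(Ms-_, Pairs), Misses),
    findall(Ts, member(_-Ts, Pairs), Tots),
    sum_list(Misses, TotalMiss),
    sum_list(Tots, Total),
    Err is TotalMiss / Total.
</code>

The query ?- attr(L), member(A, L), oner_error(A, E), writeln(A-E), fail. should then reproduce the four error rates above, with emp scoring the best 0.0833333.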
  