====== Lab 3 - Decision Trees ======

Reading: Tom M. Mitchell, //Machine Learning//, Chapter 3.

===== List and description of files =====

The required files are the program {{:pl:prolog:prolog_lab:ml:id3.pl}} and the data files used earlier: {{:pl:prolog:prolog_lab:ml:animals.pl}} and {{:pl:prolog:prolog_lab:ml:loandata.pl}}. Supplementary files for the PMG examples: the program {{:pl:prolog:prolog_lab:ml:dtlearn_bool.pl}}, the data {{:pl:prolog:prolog_lab:ml:dtlearn_t2.pl}}, and, for the exercise, {{:pl:prolog:prolog_lab:ml:as11data.pl}}.

===== Creating, browsing and accessing decision trees and rules =====

<code prolog>
?- [id3].
?- [animals].
?- id3(2).
</code>

The parameter we specify (2) controls the growing of the tree: if a node covers 2 examples or fewer, it becomes a leaf regardless of the class distribution. To see the tree we use:

<code prolog>
?- showtree.
has_covering=feathers => [bird/2]
has_covering=scales
   gills=f => [reptile/2]
   gills=t => [fish/1]
has_covering=none => [amphibian/1, mammal/1]
has_covering=hair => [mammal/3]
</code>

The lists at the leaves show the class distribution among the examples covered. If we change the threshold, we get a different tree. With threshold 1 every leaf contains examples of a single class. For example:

<code prolog>
?- id3(1),showtree.
has_covering=feathers => [bird/2]
has_covering=scales
   gills=f => [reptile/2]
   gills=t => [fish/1]
has_covering=none
   eggs=t => [amphibian/1]
   eggs=f => [mammal/1]
has_covering=hair => [mammal/3]
</code>

The "id3" query also generates "if-then" rules. We can see these rules by using "listing":

<code prolog>
?- listing(if).
:- dynamic if/1.

if[has_covering=feathers]then bird.
if[has_covering=scales, gills=f]then reptile.
if[has_covering=scales, gills=t]then fish.
if[has_covering=none, eggs=t]then amphibian.
if[has_covering=none, eggs=f]then mammal.
if[has_covering=hair]then mammal.
</code>

NOTE: "listing" is a built-in Prolog procedure that lists the contents of the database for a given predicate name. This means that the if-then structures are already in the database, stored under the name "if". In fact, when started by "?- id3(1).", the ID3 program creates a decision tree (viewable with "?- showtree.") and loads into the database a set of rules that represent the tree equivalently. The rules created by ID3 are Prolog facts, written in a special prefix format. They can also be accessed directly by Prolog queries like these:

<code prolog>
?- if X then mammal.
X = [has_covering=none, eggs=f] ;
X = [has_covering=hair] ;

?- if X then Y.
X = [has_covering=feathers]
Y = bird ;
X = [has_covering=scales, gills=f]
Y = reptile ;
X = [has_covering=scales, gills=t]
Y = fish ;
X = [has_covering=none, eggs=t]
Y = amphibian ;
X = [has_covering=none, eggs=f]
Y = mammal ;
X = [has_covering=hair]
Y = mammal
</code>

If you want to run ID3 with another data file, load the data only (there is no need to reload id3.pl). NOTE, however, that after executing "?- id3(1)" the previous tree and rules are deleted and replaced with those corresponding to the currently loaded data file. For example:

<code prolog>
?- [loandata].
?- id3(1),showtree.
emp=no => [reject/3]
emp=yes
   buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
   buy=comp => [approve/7]

?- id3(9),showtree.
emp=no => [reject/3]
emp=yes => [approve/8, reject/1]
</code>

===== Tools =====

Using the data from the file {{:pl:prolog:prolog_lab:ml:animals.pl}}, try out the [[http://www.aispace.org/dTree/index.shtml|decision tree construction tool]]. You can import an XML file with the {{:pl:dydaktyka:ml:lab3.xml|training set}}.

**Note** Compute the entropy and the //Information Gain// for each attribute and indicate the best root for the decision tree. Did the tool choose the root correctly?
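The computation this exercise asks for can be sketched in Python (a hedged illustration, not part of id3.pl; ''entropy'' and ''info_gain'' are our own helper names). As a worked example, the per-branch class counts for ''has_covering'' are read off the showtree output above:

```python
import math

def entropy(counts):
    """Entropy of a class distribution, given as a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def info_gain(total_counts, branches):
    """Information gain of a split; branches is one list of class counts
    per attribute value."""
    n = sum(total_counts)
    remainder = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(total_counts) - remainder

# Class counts per value of has_covering, from the showtree output:
# feathers [bird/2], scales [reptile/2, fish/1],
# none [amphibian/1, mammal/1], hair [mammal/3]
branches = [[2], [2, 1], [1, 1], [3]]
whole = [2, 2, 1, 1, 4]   # bird, reptile, fish, amphibian, mammal (10 animals)
print(round(info_gain(whole, branches), 4))  # -> 1.6464
```

Repeating this for every attribute in the data set and picking the largest gain identifies the best root; compare your result with the one the tool computes.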
===== Comparison with concept learning =====

Assuming that the loandata tree and rules created by "?- id3(1)" are in the database, try this (note that the example sets satisfy [2, 5, 11, 12] = [2, 11, 12] + [5]):

<code prolog>
?- if H then [reject/N],model(H,M).
H = [emp=no]
N = 3
M = [2, 11, 12] ;
H = [emp=yes, buy=car, married=no]
N = 1
M = [5] ;
</code>

This query retrieves the two disjunctive components of the hypothesis for class "reject" (the two paths in the decision tree leading to the "reject" leaves). Now, obviously, we should be able to run VS on each of the two sets and produce consistent hypotheses. For the first part, [2, 11, 12], we put these examples at the beginning, followed by all examples from class "approve" (we may also leave the other "reject" example at the end). Let's call this new data file loandata1.pl and load it into the database:

<code prolog>
?- [loandata1].
</code>

We may use "listing" to verify our data:

<code prolog>
?- listing(example).
example(2, reject, [emp=no, buy=comp, sex=f, married=yes]).
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]).
example(12, reject, [emp=no, buy=car, sex=f, married=yes]).
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(4, approve, [emp=yes, buy=car, sex=f, married=yes]).
example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]).
example(7, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(8, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(5, reject, [emp=yes, buy=car, sex=f, married=no]).
</code>

Now we load and run the candidate elimination algorithm (vs.pl) in batch mode:

<code prolog>
?- [vs].
?- batch.
+[no, comp, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, f, yes]]

+[no, comp, m, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, ?, yes]]

+[no, car, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

-[yes, comp, f, no]
G-set: [[?, ?, ?, yes], [no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]

+[yes, car, f, no]
G-set: []
S-set: []
There is no consistent concept description in this language !
</code>

NOTE that the inconsistency occurs because of the last example (# 5), which belongs to the other hypothesis. The hypothesis created by ID3, [emp=no], is contained in the version space:

<code prolog>
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
</code>

In fact, it is the one in the G-set. Interestingly, we have another one (the S-set). This means that there may be another, more specific decision tree, which looks like this:

<code prolog>
emp=no
   married=yes => reject    THIS IS THE NEW BRANCH
   married=no => approve    THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE
emp=yes
   buy=car
      married=no => reject
      married=yes => approve
   buy=comp => approve
</code>

NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm will create the previous tree, because the order of the examples does not matter. Let's verify what the new branch covers. First, we need the "model" procedure back (it's in id3.pl):

<code prolog>
?- [id3].
?- model([emp=no,married=yes],M).
M = [2, 11, 12]
</code>

NOTE that this is the same coverage as before.

<code prolog>
?- model([emp=no,married=no],M).
M = []
</code>

The branch we added to complete the tree does not cover anything, but this is fine. Similarly, the other disjunctive component of ID3's "reject" hypothesis,

<code prolog>
if [emp=yes, buy=car, married=no] then reject
</code>

may be learned with VS.
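The version-space reasoning above can be checked with a small Python sketch (illustrative only; vs.pl itself is not reproduced, and ''covers'', ''consistent'' and ''between'' are our own helper names). It verifies that both boundary hypotheses are consistent with the first eleven examples of loandata1.pl, that positive example 5 is covered by neither (hence the collapse), and that the version space between the S- and G-sets contains exactly the two hypotheses shown:

```python
from itertools import product

def covers(hyp, ex):
    """A conjunctive hypothesis covers an example when every
    non-'?' value matches (attribute order: emp, buy, sex, married)."""
    return all(h in ("?", v) for h, v in zip(hyp, ex))

# Examples 2, 11, 12 (class "reject": the positives for VS) ...
positives = [("no", "comp", "f", "yes"),
             ("no", "comp", "m", "yes"),
             ("no", "car", "f", "yes")]
# ... and the eight "approve" examples (the negatives).
negatives = [("yes", "comp", "f", "no"), ("yes", "comp", "m", "no"),
             ("yes", "car", "f", "yes"), ("yes", "comp", "f", "yes"),
             ("yes", "comp", "f", "no"), ("yes", "comp", "m", "no"),
             ("yes", "comp", "m", "yes"), ("yes", "comp", "m", "yes")]

def consistent(hyp):
    """Consistent = covers all positives and no negatives."""
    return (all(covers(hyp, p) for p in positives)
            and not any(covers(hyp, n) for n in negatives))

def between(s, g):
    """All hypotheses between a most-specific s and a most-general g
    (assumes g generalizes s only by replacing values with '?')."""
    choices = [(sv,) if sv == gv else (sv, gv) for sv, gv in zip(s, g)]
    return list(product(*choices))

g_hyp = ("no", "?", "?", "?")    # ID3's hypothesis [emp=no]
s_hyp = ("no", "?", "?", "yes")  # the S-set hypothesis
print(consistent(g_hyp), consistent(s_hyp))      # True True
print(covers(g_hyp, ("yes", "car", "f", "no")))  # example 5: False -> collapse
print(between(s_hyp, g_hyp))                     # the whole version space
```

The same ''between'' helper also confirms the size of the version space obtained from the second VS run later in this section: between [yes, car, f, no] and [?, car, ?, no] there are exactly four hypotheses.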
First, we create the file loandata2.pl with the examples in the following order: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12]. Then we run VS:

<code prolog>
?- batch.
+[yes, car, f, no]
G-set: [[?, ?, ?, ?]]
S-set: [[yes, car, f, no]]

-[yes, comp, f, no]
G-set: [[?, car, ?, ?]]
S-set: [[yes, car, f, no]]

-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]

-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]

+[no, comp, f, yes]
G-set: []
S-set: []
There is no consistent concept description in this language !
</code>

NOTE that the version space here is bigger:

<code prolog>
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]
</code>

It contains four hypotheses, among which is the one created by ID3.

===== The OneR algorithm =====

<code prolog>
?- [id3].
?- [loandata].
?- id3(1),showtree.
emp=no => [reject/3]
emp=yes
   buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
   buy=comp => [approve/7]
</code>

The id3.pl file defines procedures to compute the entropy and the distribution of classes in a set of examples. For example:

<code prolog>
?- if H then C, model(H,M),entropy(M,I).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;

H = [emp=yes, buy=car, married=no]
C = [reject/1]
M = [5]
I = 0 ;

H = [emp=yes, buy=car, married=yes]
C = [approve/1]
M = [4]
I = 0 ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0
</code>

These are all 100% correct hypotheses, and that is why the entropy of the class distribution is 0 in each case. If we allow mixed classes at the leaves, we may get non-zero entropy. For example:

<code prolog>
?- id3(2),showtree.
emp=no => [reject/3]
emp=yes
   buy=car => [approve/1, reject/1]
   buy=comp => [approve/7]

?- if H then C, model(H,M),entropy(M,I).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;

H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
I = 1 ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0 ;
</code>

The id3.pl program also includes a procedure to compute the class distribution of a set of examples. For example:

<code prolog>
?- if H then C, model(H,M),distr(M,D).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
D = [reject/3] ;

H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
D = [approve/1, reject/1] ;

H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
D = [approve/7] ;
</code>

Thus we may inspect the class distribution and the entropy of an arbitrary set of examples. For example:

<code prolog>
?- distr([1,2,3,4],D),entropy([1,2,3,4],I).
D = [approve/3, reject/1]
I = 0.811278
</code>

Here is the entropy of a hypothesis that covers examples of mixed classes:

<code prolog>
?- model([buy=car],M),distr(M,D),entropy(M,I).
M = [4, 5, 12]
D = [approve/1, reject/2]
I = 0.918296
</code>

The distribution of classes can also be used to compute the classification error. For example, assuming that we assign the majority class label (reject) to the set above, the error will be 1/3 (one wrong classification out of 3). The OneR algorithm generates a one-level decision tree, using error-based attribute selection. The following query can be used to find the OneR decision tree for our loan data ("attr" and "vals" are described in the header of id3.pl; try the query without writeln and fail to understand better how it works):

<code prolog>
?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail.
[buy=car]-[approve/1, reject/2]
[buy=comp]-[approve/7, reject/2]

[emp=no]-[reject/3]
[emp=yes]-[approve/8, reject/1]

[married=no]-[approve/4, reject/1]
[married=yes]-[approve/4, reject/3]

[sex=f]-[approve/4, reject/3]
[sex=m]-[approve/4, reject/1]
</code>

The class distributions above allow us to compute the total error of each of the four possible one-level trees:

<code prolog>
1. buy=car  => reject   [approve/1, reject/2]
   buy=comp => approve  [approve/7, reject/2]
   -----------------------------------------
   Error = (1+2)/12 = 0.25

2. emp=no  => reject   [reject/3]
   emp=yes => approve  [approve/8, reject/1]
   ----------------------------------------
   Error = (0+1)/12 = 0.0833333

3. married=no  => approve  [approve/4, reject/1]
   married=yes => approve  [approve/4, reject/3]
   --------------------------------------------
   Error = (1+3)/12 = 0.333333

4. sex=f => approve  [approve/4, reject/3]
   sex=m => approve  [approve/4, reject/1]
   --------------------------------------
   Error = (3+1)/12 = 0.333333
</code>

Obviously the best OneR tree is #2, because it has the lowest error rate, 0.0833333. Interestingly, the entropy-based attribute selection used by ID3 agrees with the error-based one here, which is clear from the fact that the pruned ID3 tree (threshold 9) is the same as the OneR tree.
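The error computation above can be reproduced with a short Python sketch (a hedged illustration of OneR's selection step; ''oner_error'' is our own name, not a predicate from id3.pl). The per-value class counts are taken directly from the query output above:

```python
# Per-value class counts for each attribute, read off the Prolog output:
splits = {
    "buy":     {"car": {"approve": 1, "reject": 2},
                "comp": {"approve": 7, "reject": 2}},
    "emp":     {"no": {"reject": 3},
                "yes": {"approve": 8, "reject": 1}},
    "married": {"no": {"approve": 4, "reject": 1},
                "yes": {"approve": 4, "reject": 3}},
    "sex":     {"f": {"approve": 4, "reject": 3},
                "m": {"approve": 4, "reject": 1}},
}

def oner_error(split):
    """Total error of the one-level tree that predicts the majority class
    in each branch: sum of non-majority counts over all branches."""
    wrong = sum(sum(d.values()) - max(d.values()) for d in split.values())
    total = sum(sum(d.values()) for d in split.values())
    return wrong / total

errors = {attr: oner_error(split) for attr, split in splits.items()}
best = min(errors, key=errors.get)
print(best, errors[best])  # emp is the OneR winner, error 1/12
```

Minimizing this error is OneR's attribute-selection criterion, and on this data it picks the same root (emp) as ID3's entropy-based criterion.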
pl/dydaktyka/ml/lab3.txt
· last modified: 2019/06/27 15:50 (external edit)