Data Mining Using Genetic Programming: A PhD Talk Jeroen Eggermont
Knowledge is Power ! Sir Francis Bacon (1561-1626)
Information Age ? 2002: • 5 x 10^18 bytes produced • 92% on hard-disk • More than twice the amount of 1999
Information vs Knowledge "Information is not Knowledge" (Albert Einstein) "Where is the Knowledge we have lost in Information ?" (T.S. Eliot)
Knowledge Discovery Knowledge Discovery in Databases: "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" Data Mining phase: identifies, searches or constructs patterns
Classification Construct or find a model in order to predict the category of some data value. Question: Do we want to have a BBQ ?
Decision Trees [Figure: decision tree with the tests Wind < 3 and Rain = yes, true/false branches, and the leaves BBQ:=yes, BBQ:=yes, BBQ:=no]
Evolutionary Computation • is based on biological metaphors • has great practical potential • is (getting) popular in many fields • yields powerful, diverse applications • gives high performance at low cost • AND IT’S FUN !
The Metaphor EVOLUTION ↔ PROBLEM SOLVING • Individual ↔ Candidate Solution • Fitness ↔ Quality • Environment ↔ Problem
Classification using EC ? EVOLUTION ↔ CLASSIFICATION • Individual ↔ Decision Tree • Fitness ↔ Accuracy • Environment ↔ Data Set
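The mapping "Fitness = Accuracy, Environment = Data Set" can be made concrete with a small sketch. It uses the four-row X/Y/Z table from the "Simple Representation" slide; the function names `accuracy`, `split_on_x` and `split_on_y` are hypothetical and stand in for evolved decision trees.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# The four-row example table shown on the "Simple Representation" slide.
DATA: List[Row] = [
    {"X": 1, "Y": "a", "Z": "yes"},
    {"X": 2, "Y": "b", "Z": "yes"},
    {"X": 3, "Y": "a", "Z": "no"},
    {"X": 4, "Y": "b", "Z": "no"},
]


def accuracy(classifier: Callable[[Row], str], rows: List[Row], target: str = "Z") -> float:
    """Fitness of one individual: the fraction of rows it classifies correctly."""
    correct = sum(1 for row in rows if classifier(row) == row[target])
    return correct / len(rows)


# Two hypothetical individuals, written as plain functions for brevity.
def split_on_x(row: Row) -> str:
    return "yes" if row["X"] < 3 else "no"


def split_on_y(row: Row) -> str:
    return "yes" if row["Y"] == "a" else "no"


print(accuracy(split_on_x, DATA))  # the X-based individual
print(accuracy(split_on_y, DATA))  # the Y-based individual
```

Selection would then prefer the individual with the higher accuracy on the data set.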
Genetic Programming: Evolutionary Computation using Trees [Figure: the BBQ decision tree again, with the tests Wind < 3 and Rain = yes and the leaves BBQ:=yes, BBQ:=yes, BBQ:=no]
Genetic Programming [Figure: the evolutionary cycle linking Population, Parents and Offspring, with the steps labelled parent selection, crossover, mutation, and evaluation & selection]
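The cycle on this slide can be sketched as a short loop. This is a minimal illustration and not the implementation used in the talk: the tuple-based tree encoding, tournament parent selection, the mutation rate, and the "keep the best of parents plus offspring" survivor selection are all assumptions, and the sketch does not enforce any limit on tree size.

```python
import random

DATA = [(1, "a", "yes"), (2, "b", "yes"), (3, "a", "no"), (4, "b", "no")]  # (X, Y, Z)
TESTS = [("X", "<", v) for v in (1, 2, 3, 4)] + [("Y", "=", v) for v in ("a", "b")]
LEAVES = ["yes", "no"]


def random_tree(rng, depth=3):
    """A leaf is a class label; an internal node is (test, if_true, if_false)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES)
    return (rng.choice(TESTS), random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def classify(tree, row):
    while not isinstance(tree, str):
        (attr, op, value), if_true, if_false = tree
        x = row[0] if attr == "X" else row[1]
        holds = x < value if op == "<" else x == value
        tree = if_true if holds else if_false
    return tree


def fitness(tree):
    """Evaluation: accuracy on the data set."""
    return sum(classify(tree, row) == row[2] for row in DATA) / len(DATA)


def paths(tree, path=()):
    """Yield the path (child indices 1 or 2) to every node of the tree."""
    yield path
    if not isinstance(tree, str):
        yield from paths(tree[1], path + (1,))
        yield from paths(tree[2], path + (2,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    if not path:
        return new
    node = list(tree)
    node[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(node)


def crossover(rng, a, b):
    """Subtree crossover: copy a random subtree of b into a random position of a."""
    return replace(a, rng.choice(list(paths(a))), get(b, rng.choice(list(paths(b)))))


def mutate(rng, tree):
    """Subtree mutation: replace a random node by a freshly grown subtree."""
    return replace(tree, rng.choice(list(paths(tree))), random_tree(rng, depth=2))


def tournament(rng, population, k=2):
    """Parent selection: best of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def evolve(seed=0, pop_size=20, generations=10):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            child = crossover(rng, tournament(rng, population), tournament(rng, population))
            if rng.random() < 0.2:
                child = mutate(rng, child)
            offspring.append(child)
        # Survivor selection: keep the best pop_size of parents plus offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]


best = evolve()
print(best, fitness(best))
```

Because survivors are drawn from parents and offspring together, the best fitness in the population never decreases from one generation to the next.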
Classification using GP ? • WHY: • ML: Local Search • EC: Global Search • EC copes well with attribute interactions • Easy to adapt for different types of decision trees • FUN !!!
Simple Representation • Binary Trees • Each node contains an atom: • Internal Nodes: < or = • Leaf Nodes: assignment • Atoms can occur more than once • Maximum of 63 nodes
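One way to picture this representation in code is sketched below. The class names `Leaf` and `Test` and the helper functions are hypothetical; only the binary shape, the `<` / `=` tests, the assignment leaves and the 63-node limit come from the slide.

```python
from dataclasses import dataclass
from typing import Union

MAX_NODES = 63  # limit stated on the slide


@dataclass
class Leaf:
    """Leaf atom: an assignment such as Z := yes."""
    target: str
    value: str


@dataclass
class Test:
    """Internal atom: attribute < value or attribute = value."""
    attribute: str
    operator: str  # "<" or "="
    value: object
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Test]


def size(node: Node) -> int:
    """Number of nodes in the tree."""
    if isinstance(node, Leaf):
        return 1
    return 1 + size(node.if_true) + size(node.if_false)


def is_valid(node: Node) -> bool:
    return size(node) <= MAX_NODES


def classify(node: Node, row: dict) -> str:
    """Follow the tests from the root until a leaf assignment is reached."""
    while isinstance(node, Test):
        x = row[node.attribute]
        holds = x < node.value if node.operator == "<" else x == node.value
        node = node.if_true if holds else node.if_false
    return node.value


# The atom Z := no occurs twice, which the representation allows.
example = Test("X", "<", 3,
               Leaf("Z", "yes"),
               Test("Y", "=", "a", Leaf("Z", "no"), Leaf("Z", "no")))
print(size(example), classify(example, {"X": 1, "Y": "a"}))
```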
Simple Representation • Data set (X, Y, Z): (1, a, yes), (2, b, yes), (3, a, no), (4, b, no) • Internal atoms: X < 1, X < 2, X < 3, X < 4, Y = a, Y = b • Leaf atoms: Z := yes, Z := no • Attribute operator value combinations • Six internal atoms • Two leaf atoms • Maximum of 63 nodes • Yields 2 x 10^103 trees
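The atoms on this slide follow mechanically from the data set: one `<` atom per value of a numerical attribute, one `=` atom per value of a non-numerical attribute, and one assignment per class. A minimal sketch, where the function name `build_atoms` and the tuple encoding are assumptions:

```python
DATA = [
    {"X": 1, "Y": "a", "Z": "yes"},
    {"X": 2, "Y": "b", "Z": "yes"},
    {"X": 3, "Y": "a", "Z": "no"},
    {"X": 4, "Y": "b", "Z": "no"},
]


def build_atoms(rows, attributes, target):
    """Return (internal atoms, leaf atoms) as (attribute, operator, value) tuples."""
    internal = []
    for name in attributes:
        values = sorted({row[name] for row in rows})
        numeric = all(isinstance(v, (int, float)) for v in values)
        op = "<" if numeric else "="
        internal += [(name, op, v) for v in values]
    leaves = [(target, ":=", v) for v in sorted({row[target] for row in rows})]
    return internal, leaves


internal, leaves = build_atoms(DATA, ["X", "Y"], "Z")
print(len(internal), len(leaves))
```

With more rows and more distinct values the number of atoms grows quickly, which is what motivates refining the search space.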
Refining the Search Space How can we reduce the search space? • reduce number of classes • reduce atoms for non-numerical attributes • reduce atoms for numerical attributes
Refining the Search Space Split the domain of a numerical attribute • Heuristics: • gain • gain_ratio • K-means clustering
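As an illustration of the gain heuristic, the sketch below scores every candidate threshold of a numerical attribute by information gain and keeps the best one, which splits the domain into two parts. This is textbook information gain for a binary split and is only assumed to resemble what the talk used; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain of splitting the rows into value < threshold and value >= threshold."""
    left = [c for v, c in zip(values, labels) if v < threshold]
    right = [c for v, c in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Pick the candidate threshold with the highest information gain."""
    return max(sorted(set(values)), key=lambda t: information_gain(values, labels, t))


xs = [1, 2, 3, 4]
zs = ["yes", "yes", "no", "no"]
print(best_threshold(xs, zs))
```

On the four-row example, only one threshold on X separates the two classes perfectly, so a single atom can replace the four `X < v` atoms.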
Refined Representation (k = 2) • Data set (X, Y, Z): (1, a, yes), (2, b, yes), (3, a, no), (4, b, no) • Original numerical atoms: X < 1, X < 2, X < 3, X < 4 • Remaining atoms shown: Y = a, Y = b, Z := yes, Z := no • Three internal atoms • Two leaf atoms • Maximum of 63 nodes • Smaller Search Space
Fuzzy Decision Trees [Figure: decision tree with the tests Storm and Showers, true/false branches, and the leaves BBQ:=no, BBQ:=yes, BBQ:=no]
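Introns [Figure: decision tree with the tests X > 3 and X < 5 and the leaves B, B, A, annotated with the domain of X at each node: D(X) = [0,10], D(X) = (3,10], D(X) = [0,3], D(X) = (3,10], D(X) = Ø]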
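The idea behind the domain annotations can be sketched as follows: propagate the possible values of X down the tree and drop any test whose outcome is already decided, because one of its branches has an empty domain. To keep the sketch short it assumes a single attribute X, tests of the form `X < threshold` only, and half-open domains [lo, hi); the tuple encoding and the function name are hypothetical.

```python
def prune_unreachable(tree, lo, hi):
    """Remove tests on X that are already decided by the domain [lo, hi).

    A leaf is a class label (str); an internal node is
    (threshold, if_true, if_false) and stands for the test X < threshold.
    """
    if isinstance(tree, str):
        return tree
    threshold, if_true, if_false = tree
    true_dom = (lo, min(hi, threshold))    # values of X for which the test holds
    false_dom = (max(lo, threshold), hi)   # values of X for which it fails
    if true_dom[0] >= true_dom[1]:         # empty domain: true branch is never taken
        return prune_unreachable(if_false, *false_dom)
    if false_dom[0] >= false_dom[1]:       # empty domain: false branch is never taken
        return prune_unreachable(if_true, *true_dom)
    return (threshold,
            prune_unreachable(if_true, *true_dom),
            prune_unreachable(if_false, *false_dom))


# Below the false branch of X < 3 the test X < 2 can never hold.
tree = (3, "A", (2, "A", "B"))
print(prune_unreachable(tree, 0, 10))
```

The pruned tree classifies every value in the domain exactly as the original does, but with fewer nodes to evaluate.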
Introns [Figure: decision tree with the tests X > 3 and Y < 2 and the leaves B, A, A, annotated with the class sets {A, B}, {B}, {A}, {A}, {A}]
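The class-set annotations suggest a second kind of intron: a subtree whose leaves all predict the same class can be replaced by a single leaf. A minimal sketch, assuming a nested-tuple tree encoding; which branch of X > 3 leads to the leaf B is not recoverable from the slide and is chosen arbitrarily here.

```python
def classes(tree):
    """Set of class labels that occur at the leaves below this node."""
    if isinstance(tree, str):
        return {tree}
    _test, if_true, if_false = tree
    return classes(if_true) | classes(if_false)


def collapse(tree):
    """Replace every subtree that can only predict one class by that class."""
    if isinstance(tree, str):
        return tree
    labels = classes(tree)
    if len(labels) == 1:
        return next(iter(labels))
    test, if_true, if_false = tree
    return (test, collapse(if_true), collapse(if_false))


# Tree from the slide (branch order assumed): the test Y < 2 has class set {A}.
tree = ("X > 3", "B", ("Y < 2", "A", "A"))
print(collapse(tree))
```

The collapsed tree makes the same predictions while skipping the redundant test.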
Conclusions • Refining the search space can greatly improve performance • Fuzzy decision trees more robust ? • Removing introns increases speed • Nothing always works Free C++ Library for Evolutionary Computation http://eodev.sourceforge.net