Learn Genetic Programming (GP) by classifying forest fires using a dataset with 517 samples and 12 attributes. Understand GP's evolutionary algorithms represented by tree structures, fitness functions like relative squared error, and selection methods such as tournament selection. Implement function regression for binary classification. Follow guidelines for running experiments using recommended GP libraries in C++, Java, or Matlab.
Artificial Intelligence Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr Biointelligence laboratory
Contents • Project outline • Description on the data set • Genetic Programming • Brief overview • Fitness function & Selection methods • Classification with GP (in this project) • Guide to writing reports • Style & contents • Submission guide / Marking scheme (C) 2008, SNU Biointelligence Laboratory
Outline • Goal • Gain a deeper understanding of Genetic Programming (GP) • Practice researching and writing a paper • Forest Fires problem (classification) • Predict whether a fire occurs or not • Using Genetic Programming • Estimating several statistics on the dataset • Data set • A variation of the ‘Forest Fires data set’ • http://archive.ics.uci.edu/ml/datasets/Forest+Fires
Forest Fires Data Set • Description • Database of 517 samples • You can use at most 500 samples for training • 17 samples for prediction • 12 attributes plus a class label • X, Y, month, day, FFMC, DMC, DC, ISI, temp, RH, wind, rain, label • Integer or real values • Label (Class) • Two classes • 0: a fire does not occur • 1: a fire occurs
Brief Summary of GP • A kind of evolutionary algorithm • Each individual is represented as a tree structure • You need to set up the following elements for a GP run • The set of terminals (input attributes, the class variable, constants) • The set of functions (numerical / conditional operators) • The fitness measure • The algorithm parameters • population size, maximum number of generations • crossover rate and mutation rate • maximum depth of GP trees, etc. • The method for designating a result and the criterion for terminating a run
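To make the tree representation concrete, here is a minimal sketch of how one GP individual could be stored and evaluated. It is not taken from any of the recommended libraries; the names `FUNCTIONS`, `evaluate`, and `example_tree` are hypothetical, and the function set shown is only an assumed example.

```python
import operator

# Hypothetical function set: name -> (arity, implementation).
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
}

def evaluate(tree, sample):
    """Evaluate a tree on one sample (a dict mapping attribute name to value)."""
    if isinstance(tree, tuple):        # internal node: (function name, child, ...)
        name, *children = tree
        _, func = FUNCTIONS[name]
        return func(*(evaluate(child, sample) for child in children))
    if isinstance(tree, str):          # terminal: an input attribute
        return sample[tree]
    return tree                        # terminal: a numeric constant

# Example individual encoding (temp - RH) + 2.0 * wind.
example_tree = ("+", ("-", "temp", "RH"), ("*", 2.0, "wind"))
```

Calling `evaluate(example_tree, {"temp": 20.0, "RH": 50.0, "wind": 3.0})` walks the tree bottom-up and returns the value of the encoded expression for that sample.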
GP Flowchart • [Figure: flowchart showing the GA loop and the GP loop]
Initialization • A maximum initial tree depth Dmax is set. • Full method (each branch has depth = Dmax): • nodes at depth d < Dmax are randomly chosen from the function set F • nodes at depth d = Dmax are randomly chosen from the terminal set T • Grow method (each branch has depth ≤ Dmax): • nodes at depth d < Dmax are randomly chosen from F ∪ T • nodes at depth d = Dmax are randomly chosen from T • Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
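The full and grow methods can be sketched as follows. This is an illustrative sketch only: the function and terminal sets are hypothetical, trees are nested tuples, and cycling the depth limit over 1..Dmax in `ramped_half_and_half` is an assumption (the slide only states that each method supplies half of the population).

```python
import random

FUNCTION_ARITY = {"+": 2, "-": 2, "*": 2}    # hypothetical function set F
TERMINALS = ["temp", "RH", "wind", 1.0]      # hypothetical terminal set T

def random_tree(max_depth, method, rng, depth=0):
    """Build a random tree; the root is at depth 0."""
    at_limit = depth == max_depth
    if method == "full":
        pick_terminal = at_limit             # terminals only at depth Dmax
    else:                                    # grow: draw uniformly from F ∪ T
        n_f, n_t = len(FUNCTION_ARITY), len(TERMINALS)
        pick_terminal = at_limit or rng.random() < n_t / (n_f + n_t)
    if pick_terminal:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTION_ARITY))
    children = [random_tree(max_depth, method, rng, depth + 1)
                for _ in range(FUNCTION_ARITY[name])]
    return (name, *children)

def tree_depth(tree):
    """Depth of a tree; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(pop_size, d_max, rng):
    """Alternate full and grow while cycling the depth limit over 1..d_max."""
    population = []
    for i in range(pop_size):
        depth_limit = 1 + (i // 2) % d_max   # assumed ramp: 1, 1, 2, 2, ...
        method = "full" if i % 2 == 0 else "grow"
        population.append(random_tree(depth_limit, method, rng))
    return population
```

For example, `ramped_half_and_half(20, 4, random.Random(0))` yields 20 trees, none deeper than 4; every tree built with the full method has exactly its depth limit on every branch.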
Fitness Functions • Relative squared error • The number of outputs that are within a given percentage of the correct value • You can also try other well-defined fitness functions suited to the problem
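The formula on this slide did not survive, so the sketch below assumes the commonly used definition of relative squared error: the model's summed squared error divided by the summed squared error of always predicting the mean target. The function names and the `tolerance_pct` parameter are hypothetical.

```python
def relative_squared_error(predictions, targets):
    """Squared error of the model relative to predicting the mean target."""
    mean_target = sum(targets) / len(targets)
    baseline = sum((y - mean_target) ** 2 for y in targets)
    if baseline == 0:
        raise ValueError("targets are constant, so the ratio is undefined")
    error = sum((p - y) ** 2 for p, y in zip(predictions, targets))
    return error / baseline

def hits_within_tolerance(predictions, targets, tolerance_pct):
    """Count outputs that lie within tolerance_pct percent of the correct value."""
    return sum(
        abs(p - y) <= abs(y) * tolerance_pct / 100.0
        for p, y in zip(predictions, targets)
    )
```

Under this definition a value of 0 means a perfect fit and a value of 1 means the model is no better than the mean predictor, so lower is better. A percentage tolerance collapses to an exact match when the correct value is 0, which matters for 0/1 labels.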
Selection methods (1/2) • Fitness proportional (roulette wheel) selection • The roulette wheel can be constructed as follows. • Calculate the total fitness for the population. • Calculate the selection probability p_k for each chromosome v_k. • Calculate the cumulative probability q_k for each chromosome v_k.
Procedure: Proportional_Selection • Generate a random number r from the range [0,1]. • If r ≤ q_1, then select the first chromosome v_1; else, select the kth chromosome v_k (2 ≤ k ≤ pop_size) such that q_(k-1) < r ≤ q_k.
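The roulette wheel construction and the selection procedure can be sketched as below. The sketch assumes fitness values are non-negative with a positive sum and that larger is better; an error-style measure such as relative squared error would first need to be converted. The function names are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def cumulative_probabilities(fitnesses):
    """Return q_k, the running sum of p_k = fitness_k / total fitness."""
    total = sum(fitnesses)
    probabilities = [f / total for f in fitnesses]
    return list(accumulate(probabilities))

def proportional_selection(population, fitnesses, rng):
    """Select one individual with probability proportional to its fitness."""
    q = cumulative_probabilities(fitnesses)
    r = rng.random()
    k = bisect_left(q, r)                  # smallest index k with r <= q[k]
    k = min(k, len(population) - 1)        # guard against rounding in q[-1]
    return population[k]
```

A typical call is `proportional_selection(population, fitnesses, random.Random())`, repeated once per parent needed.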
Selection methods (2/2) • Tournament selection • Tournament size q • Ranking-based selection • 2 POP_SIZE • 1 ≤ η⁺ ≤ 2 and η⁻ = 2 − η⁺ • Elitism • Preserve the n best solutions into the next generation
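A minimal sketch of these three ideas follows, assuming larger fitness is better. The ranking function assumes the standard linear ranking scheme, in which the best individual receives weight η⁺ between 1 and 2 and the worst receives η⁻ = 2 − η⁺; all function names are hypothetical.

```python
import random

def tournament_selection(population, fitnesses, q, rng):
    """Draw q distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), q)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

def linear_ranking_probabilities(n, eta_plus=1.5):
    """Selection probabilities for ranks 0 (worst) to n - 1 (best), n >= 2."""
    eta_minus = 2.0 - eta_plus
    return [
        (eta_minus + (eta_plus - eta_minus) * rank / (n - 1)) / n
        for rank in range(n)
    ]

def elites(population, fitnesses, n):
    """Return the n fittest individuals, to be copied unchanged."""
    ranked = sorted(range(len(population)),
                    key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in ranked[:n]]
```

A larger tournament size q increases selection pressure; with q equal to the population size the tournament always returns the current best individual.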
Classification with GP (in this project) • Function Regression • Search for a function f(x) s.t. • f(x) ≥ threshold t when y = 1 • f(x) < threshold t when y = 0 • Converting to a Boolean value • [Figure: example GP trees; node labels include IF, f(x), t, 1, 0, ∧, ¬, ∨, >, <, =, +, rain, RH, wind, FFMC, ISI and the constants 0 and 50]
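The thresholding step can be sketched as follows. The `model` lambda is a hypothetical stand-in for an evolved function f(x), and the two samples and their labels are made-up values used only to show the mechanics, not data from the Forest Fires set.

```python
def classify(f_value, threshold=0.0):
    """Map the regression output to a class: 1 if f(x) >= t, else 0."""
    return 1 if f_value >= threshold else 0

def accuracy(model, samples, labels, threshold=0.0):
    """Fraction of samples whose thresholded output matches the label."""
    predictions = [classify(model(x), threshold) for x in samples]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical stand-in for an evolved function f(x).
model = lambda x: x["temp"] - 0.5 * x["RH"]

# Made-up samples for illustration only.
samples = [{"temp": 30.0, "RH": 20.0}, {"temp": 10.0, "RH": 80.0}]
labels = [1, 0]
```

Here `accuracy(model, samples, labels)` thresholds f(x) at t = 0 and compares the result with each label; the same score could itself serve as a fitness measure.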
What to do for the experiment? • Select a library that implements GP • You can find various libraries written in C++/Java/Matlab • See the list of recommended libraries on the next page • Write your own code for the experiment • Check the libraries' sample code and tutorials for a quick start • Add comments to explain the flow of your program • Caution • Running GP may take a long time
Recommended Libraries for GP • C++ • GPLib: http://www.cs.bham.ac.uk/~cmf/GPLib/index.html • Java • JGAP: http://jgap.sourceforge.net/ • ECJ: http://cs.gmu.edu/~eclab/projects/ecj/ • Matlab toolbox • GPLAB: http://gplab.sourceforge.net/ • More References • Implementations section in Wiki – Genetic Programming: http://en.wikipedia.org/wiki/Genetic_programming
Report Style • English only!! • Scientific journal style • How to Write A Paper in Scientific Journal Style and Format • http://abacus.bates.edu/~ganderso/biology/resources/writing/HTWsections.html
Report Contents (1/3) • System description • Programming language used and running environment • Result tables • Analysis & discussion (Very Important!!)
Report Contents (2/3) • Graphs • Average and maximum fitness versus generation • Tree size versus generation
Report Contents (3/3) • Basic experiments • Changing parameters for the crossover and mutation • Various function sets: arithmetic, numerical • Optional experiments • Various selection methods • Depth limitation • Population size, generation numbers • Comparison to Neural Network • … • References
Submission Guide • Due date: Nov. 19 (Wed) 18:00 • Submit both a hardcopy and an e-mail • Hardcopy submission to the office (301-417) • E-mail submission to mhkim@bi.snu.ac.kr • Subject: [AI Project1 Report] Student number, Name • Report + your source code with comments + executable file(s) • Length: the report should be no longer than 12 pages. • We are NOT interested in accuracy or your programming skill, but in your creativity and research ability. • If you are not a C.S. major, you may work as a team with a C.S. major student (use the class board to find a partner and send your team information to the TA (bhkim@bi.snu.ac.kr) by Nov. 5)
Marking Scheme • 5 points for programming • 5 points for result prediction • 30 points for experiment & analysis • 15 pts for experiments, 15 pts for analysis • 10 points for report • Late work • −10% per day • Maximum 7 days
Q&A