
Evolutionary Learning: Genetic Algorithms & Knowledge Representations

This lecture explores the use of evolutionary learning techniques, specifically genetic algorithms, for machine learning tasks. It covers different knowledge representations and learning paradigms, and provides examples of complete evolutionary learning systems.


Presentation Transcript


  1. G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT • Dr. Jaume Bacardit (jqb@cs.nott.ac.uk) • Topic 3: Data Mining • Lecture 2: Evolutionary Learning

  2. Outline of the lecture • Introduction and taxonomy • Genetic algorithms • Knowledge Representations • Paradigms • Two complete examples • GAssist • BioHEL • Resources

  3. Evolutionary Learning • Application of any kind of evolutionary computation method (see the list below) to machine learning tasks • Genetic Algorithms • Genetic Programming • Evolution Strategies • Ant Colony Optimization • Particle Swarm Optimization • Also known as • Genetics-Based Machine Learning (GBML) • Learning Classifier Systems (LCS) (a subset of GBML)

  4. Paradigms and representation • EL involves a huge mix of • Search methods (previous slide) • Representations • Learning paradigms • Learning paradigms: how the solution to the machine learning problem is generated • Representations: rules, decision trees, synthetic prototypes, hyperspheres, etc.

  5. Genetic Algorithm working cycle • Diagram: Evaluation, Selection, Crossover and Mutation are applied in turn, transforming population A into populations B, C and D before the cycle starts again

  6. Genetic Algorithms: terms • Population • Candidate solutions to the problem • Traditionally represented as bit-strings (e.g. each bit associated with a feature, indicating whether it is selected or not) • Each bit of an individual is called a gene • Initial population is created at random • Evaluation • Giving a goodness value to each individual in the population • Selection • Process that rewards good individuals • Good individuals will survive, and get more than one copy in the next population. Bad individuals will disappear
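
A minimal sketch of the whole cycle, assuming a bit-string encoding and a user-supplied fitness function; all names, parameter values and the tournament selection scheme are illustrative, not any particular system's implementation:

    import random

    def run_ga(fitness, n_bits, pop_size=50, generations=100, pc=0.7, pm=0.01):
        # Initial population: random bit-strings (e.g. one bit per feature)
        pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
        for _ in range(generations):
            scores = [fitness(ind) for ind in pop]          # evaluation: goodness of each individual
            def select():                                   # selection: binary tournament
                a, b = random.sample(range(pop_size), 2)
                return pop[a] if scores[a] >= scores[b] else pop[b]
            nxt = []
            while len(nxt) < pop_size:
                p1, p2 = select(), select()
                if random.random() < pc:                    # crossover with probability Pc
                    cut = random.randint(1, n_bits - 1)     # 1-point crossover
                    p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
                for child in (p1, p2):                      # mutation: flip each gene with prob. pm
                    nxt.append([1 - g if random.random() < pm else g for g in child])
            pop = nxt[:pop_size]
        return max(pop, key=fitness)                        # best individual found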

  7. Genetic Algorithms • Crossover • Exchanging subparts of the solutions • The crossover stage takes two individuals from the population (parents) and, with a certain probability Pc, generates two offspring • Common variants: 1-point crossover and uniform crossover
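
A small sketch of the two variants on bit-string parents (illustrative code, not tied to any specific system):

    import random

    def one_point_crossover(p1, p2):
        # Cut both parents at the same random point and swap the tails
        cut = random.randint(1, len(p1) - 1)
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

    def uniform_crossover(p1, p2, swap_prob=0.5):
        # Decide gene by gene whether the offspring swap that position
        c1, c2 = list(p1), list(p2)
        for i in range(len(p1)):
            if random.random() < swap_prob:
                c1[i], c2[i] = p2[i], p1[i]
        return c1, c2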

  8. Knowledge representations • For nominal attributes • Ternary representation • GABIL representation • For real-valued attributes • Hyperrectangles • Decision tree • Synthetic prototypes • Others

  9. Ternary representation • Used by XCS (Michigan LCS) • Three-letter alphabet {0,1,#} for binary problems • # means “don’t care”, that is, that the attribute is irrelevant • If A1=0 and A2=1 and A3 is irrelevant → class 0 • Genotype: 01#|0 • For non-binary nominal attributes: {0, 1, 2, …, n, #} • Crossover and mutation act as in a classic GA
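
A small sketch of how such a rule condition can be matched against a binary instance ('#' means don't care); the function name is illustrative:

    def ternary_match(rule, instance):
        # rule: string over {'0', '1', '#'}, e.g. '01#'; instance: string over {'0', '1'}
        return all(r == '#' or r == x for r, x in zip(rule, instance))

    # '01#' matches any instance with A1=0 and A2=1, regardless of A3
    assert ternary_match('01#', '010') and ternary_match('01#', '011')
    assert not ternary_match('01#', '110')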

  10. GABIL representation • Predicate → Class • Predicate: Conjunctive Normal Form (CNF): (A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm) • Ai: ith attribute • Vij: jth value of the ith attribute • The rules can be mapped into a binary string, e.g. 1100|0010|1001|1 • Example with 2 variables: • Sky = {clear, partially cloudy, dark clouds} • Pressure = {Low, Medium, High} • 2 Classes: {no rain, rain} • Rule: If [sky is (partially cloudy or has dark clouds)] and [pressure is low] then predict rain • Genotype: “011|100|1”
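
A sketch of the GABIL matching idea for the weather example above; the attribute sizes come from the slide, while the function itself is illustrative:

    def gabil_match(predicate_bits, sizes, instance):
        # predicate_bits: concatenated bit-strings, one block per attribute
        # sizes: number of values per attribute; instance: value index per attribute
        pos = 0
        for n_values, value in zip(sizes, instance):
            block = predicate_bits[pos:pos + n_values]
            if block[value] != '1':      # an attribute matches only if the bit of its value is 1
                return False
            pos += n_values
        return True

    # Genotype '011|100|1' -> predicate bits '011100', class bit '1' (rain)
    assert gabil_match('011100', [3, 3], [1, 0])       # partially cloudy, low pressure: matches
    assert not gabil_match('011100', [3, 3], [0, 0])   # clear sky: does not match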

  11. Hyper-rectangle representation • The rule’s predicate encodes an interval for each of the dimensions of the domain, effectively generating a hyperrectangle • Different ways of encoding the interval • X < value, X > value, X in [l,u] • Encoding the actual bounds (UBR, NAX) • Encoding the interval as center±spread (XCSR) • What if u < l? • Flipping them (UBR) • Declaring the attribute as irrelevant (NAX) • Example rule: If (X < 0.25 and Y < 0.25) then …
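
A hedged sketch of interval matching with the two ways of handling u < l mentioned above (UBR-style flipping vs. NAX-style "irrelevant"); function and parameter names are illustrative:

    def interval_match(bounds, instance, flip_if_swapped=True):
        # bounds: list of (l, u) pairs, one per attribute; instance: list of real values
        for (l, u), x in zip(bounds, instance):
            if u < l:
                if flip_if_swapped:
                    l, u = u, l          # UBR: flip the bounds
                else:
                    continue             # NAX: treat the attribute as irrelevant
            if not (l <= x <= u):
                return False
        return True

    # Rule "If X in [0, 0.25] and Y in [0, 0.25] then ..." over two attributes
    print(interval_match([(0.0, 0.25), (0.0, 0.25)], [0.1, 0.2]))   # True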

  12. Decision tree representation • Each individual literally encodes a complete decision tree [Llora, 02] • Only suitable for the Pittsburgh approach • Decision tree can be axis-parallel or oblique • Crossover • Exchange of sub-branches of a tree between parents • Mutation • Change of the definition of a node/leaf • Total replacement of a tree’s sub-branch

  13. Synthetic Prototypes representation [Llora, 02] • Each individual is a set of synthetic instances • These instances are used as the core of a nearest-neighbor classifier • Example individual (four prototypes in a 2-D domain): (-0.125, 0, yellow), (0.125, 0, red), (0, -0.125, blue), (0, 0.125, green)
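
A minimal sketch of how such an individual could be used as a nearest-neighbor classifier; Euclidean distance is assumed and the code is illustrative:

    def classify(prototypes, x):
        # prototypes: list of (coordinates, class); x: tuple of attribute values
        def dist2(coords):
            return sum((a - b) ** 2 for a, b in zip(coords, x))
        coords, label = min(prototypes, key=lambda proto: dist2(proto[0]))
        return label

    prototypes = [((-0.125, 0.0), 'yellow'), ((0.125, 0.0), 'red'),
                  ((0.0, -0.125), 'blue'), ((0.0, 0.125), 'green')]
    print(classify(prototypes, (0.3, 0.05)))   # 'red': the closest synthetic prototype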

  14. Other representations for continuous problems • Hyperellipsoid representation (XCS) • Each rule encodes a (hyper)ellipsoid over the search space • Smooth, non-linear frontiers • Arbitrary rotation • Encoded as • Center • Stretches across dimensions • Rotation angles • Neural representation (XCS) • Each individual is a complete MLP, and evolution can change both the weights and the network topology

  15. Learning Paradigms • Different ways of generating a solution • Is each individual a rule, or a rule set? • Is the solution the best individual, or the whole population? • Is the solution generated in a single GA run, or across several runs? • Three main LCS paradigms: • The Pittsburgh approach • The Michigan approach • The Iterative Rule Learning approach

  16. The Pittsburgh Approach • Each individual is a complete solution to the classification problem • Traditionally this means that each individual is a variable-length set of rules • The final solution is the best individual from the population after the GA run • Fitness function is based on the rule set accuracy on the training set (usually also on complexity) • GABIL [De Jong & Spears, 91] is a classic example

  17. Pittsburgh approach: recombination • Crossover operator (diagram: two parents producing two offspring) • Mutation operator: classic GA mutation of bit inversion

  18. The Michigan Approach • Each individual (classifier) is a single rule • The whole population cooperates to solve the classification problem • A reinforcement learning system is used to identify the good rules • A GA is used to explore the search space for more rules • XCS [Wilson, 95] is the most well-known Michigan LCS

  19. The Michigan approach • What is Reinforcement Learning? • “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96] • Rules will be evaluated example by example, receiving a positive/negative reward • Rule fitness will be updated incrementally with this reward • After enough trials, good rules should have high fitness
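
A hedged sketch of the incremental, reward-driven update (a Widrow-Hoff style estimate in the spirit of XCS, not its exact formulae; beta is an illustrative learning rate):

    def update_rule_estimate(prediction, reward, beta=0.2):
        # Move the rule's reward prediction a fraction beta towards the latest reward
        return prediction + beta * (reward - prediction)

    p = 0.0
    for r in [1000, 1000, 0, 1000]:          # rewards received example by example
        p = update_rule_estimate(p, r)
    print(p)                                  # the estimate drifts towards the average reward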

  20. Michigan system’s working cycle

  21. Iterative Rule Learning approach • This approach implements the separate-and-conquer method of rule learning • Each individual is a rule • A GA run ends up generating a single good rule • Examples covered by the rule are removed from the training set, and the process starts again • First used in evolutionary learning in the SIA system [Venturini, 93]
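
A sketch of the separate-and-conquer loop; run_ga_for_one_rule and matches are illustrative helpers standing in for the GA run and the rule-matching function:

    def iterative_rule_learning(examples, run_ga_for_one_rule, matches):
        rules = []
        remaining = list(examples)
        while remaining:
            rule = run_ga_for_one_rule(remaining)             # one GA run -> one good rule
            covered = [e for e in remaining if matches(rule, e)]
            if not covered:                                   # stop if no progress can be made
                break
            rules.append(rule)
            remaining = [e for e in remaining if not matches(rule, e)]  # separate...
        return rules                                          # ...and conquer: the learned rule set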

  22. The Gassist Pittsburgh LCS [Bacardit, 04] • Genetic clASSIfier SysTem • Designed with three aims • Generate compact and accurate solutions • Run-time reduction • Be able to cope with both continuous and discrete data • Objectives achieved by several components • ADI rule representation (3) • Explicit default rule mechanism (1) • ILAS windowing scheme (2) • MDL-based fitness function (1) • Initialization policies (1) • Rule deletion operator (1)

  23. GAssist components in the GA cycle • Representation: ADI representation, explicit default rule mechanism • GA cycle (diagram): Initialization (initialization policies), Evaluation (MDL fitness function, ILAS windowing), Selection, Crossover and Mutation (standard operators)

  24. GAssist: Default Rule mechanism • When we encode this rule set as a decision list we can observe an interesting behavior: the emergent generation of a default rule • Using a default rule can help generate a more compact rule set • Easier to learn (smaller search space) • Potentially less sensitive to overlearning • To maximize these benefits, the knowledge representation is extended with an explicit default rule

  25. GAssist: Default Rule mechanism • What class is assigned to the default rule? • Simple policies such as using the majority/minority class are not robust enough • Automatic determination of the default class • The initial population contains individuals with all possible default classes • Evolution will choose the correct default class • In the first few iterations the different default classes are kept isolated: each forms a separate subpopulation • Different default classes learn at different rates • Afterwards, restrictions are lifted and the system is free to pick the best policy

  26. GAssist: Initialisation policy • Initialization policy • Probability of a rule matching a random instance • In GABIL, each gene associated to a value of an attribute is independent of the other values • Therefore the probability of matching an attribute equals the probability of initializing a gene to value 1 • Probability of a rule set matching a random instance
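
A back-of-the-envelope sketch of these two probabilities for the GABIL case, assuming each bit is independently initialized to 1 with probability p1 and one relevant bit per attribute must be set; this is an illustration of the reasoning, not GAssist's exact formula:

    def p_rule_matches(p1, n_attributes):
        # A GABIL rule matches an instance only if, for every attribute,
        # the bit of the instance's value was initialized to 1
        return p1 ** n_attributes

    def p_ruleset_matches(p1, n_attributes, n_rules):
        # A rule set matches an instance if at least one of its rules does
        return 1 - (1 - p_rule_matches(p1, n_attributes)) ** n_rules

    print(p_rule_matches(0.75, 10), p_ruleset_matches(0.75, 10, 20))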

  27. GAssist: Initialisation policy • Initialization policy • How can we derive a formula to adjust P1? • We use an explicit default rule mechanism • If we assume an equal class distribution, we have to make sure that we match all but one of the classes

  28. GAssist: Initialisation policy • Covering operator • Each time a new rule has to be created, an instance is sampled from the training set • The rule is created as a generalized version of the example • It is guaranteed to match the example • It covers not just the example, but a larger area of the search space • Two methods of sampling instances from the training set • Uniform probability for each instance • Class-wise sampling probability
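
A hedged sketch of the covering idea for interval rules: build a rule whose intervals contain the sampled example, enlarged by a random amount; the spread parameter and names are illustrative, not GAssist's exact operator:

    import random

    def cover(example, label, max_spread=0.3):
        # One interval per attribute, centred on the example's value, so the
        # new rule matches the example plus some surrounding search space
        bounds = []
        for value in example:
            spread = random.uniform(0.0, max_spread)
            bounds.append((value - spread, value + spread))
        return bounds, label

    rule = cover([0.2, 0.7], 'rain')
    print(rule)   # e.g. ([(0.05, 0.35), (0.55, 0.85)], 'rain')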

  29. GAssist: Rule deletion operator • Operator applied after the fitness computation • Rules that do not match any training example are eliminated • The operator leaves a small number of ‘dead’ rules in each individual, acting as protective neutral code • If crossover is applied over a dead rule it does no harm: it will not break a good rule • However, if too many dead rules are present, exploration is inefficient and the population loses diversity

  30. GAssist: ILAS windowing scheme • Windowing: use of a subset of examples to perform fitness computations • Incremental Learning with Alternating Strata (ILAS) • The mechanism uses a different subset of training examples in each GA iteration • Diagram: the training set of Ex examples is split into strata of Ex/n examples each (0, Ex/n, 2·Ex/n, 3·Ex/n, …, Ex) and the iterations cycle through the strata
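
A small sketch of the stratum rotation; the round-robin split is a simplification for illustration (it ignores the class-balanced stratification a real implementation would use):

    def make_strata(examples, n_strata):
        # Split the training set into n_strata disjoint subsets (round-robin for simplicity)
        return [examples[i::n_strata] for i in range(n_strata)]

    def stratum_for_iteration(strata, iteration):
        # Each GA iteration evaluates fitness on a different stratum, cycling through them
        return strata[iteration % len(strata)]

    strata = make_strata(list(range(20)), 4)
    print(stratum_for_iteration(strata, 0))   # first stratum
    print(stratum_for_iteration(strata, 5))   # iteration 5 -> second stratum again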

  31. BioHEL [Bacardit et al, 09] • BIO-inspired HiErarchical Learning • Successor of GAssist, but changing paradigms: uses the Iterative Rule Learning approach • Created to overcome the scalability limitations of GAssist • It still employs • Default Rule (no auto policy) • ILAS windowing scheme

  32. BioHEL: fitness function • Fitness function definition is trickier than in GAssist, as it is impossible to have global control over the solution • As in any separate-and-conquer method, the system should favor rules that are • Accurate (do not make mistakes) • General (cover many examples) • These two objectives are contradictory, especially in real-world problems: the easiest way to increase accuracy is to create very specific rules • BioHEL redefines coverage as a piece-wise function, which rewards rules that cover at least a certain fraction of the training set

  33. BioHEL: fitness function • Coverage term penalizes rules that do not cover a minimum percentage of examples • Choice of the coverage break is crucial for the proper performance of the system
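
A hedged sketch of such a piecewise coverage term; the break point and the shape of the two segments are illustrative, not BioHEL's published formula:

    def coverage_term(covered, total, coverage_break=0.05):
        # Fraction of the training set matched by the rule
        cov = covered / total
        if cov < coverage_break:
            # Below the break: strong penalty, growing linearly with coverage
            return 0.5 * cov / coverage_break
        # Above the break: smaller additional reward for extra coverage
        return 0.5 + 0.5 * (cov - coverage_break) / (1 - coverage_break)

    print(coverage_term(10, 1000), coverage_term(200, 1000))   # heavily vs. mildly penalized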

  34. BioHEL: ALKR • The Attributes List Knowledge Representation (ALKR) • This representation exploits a very frequent situation • In high-dimensionality domains it is usual that each rule only uses a very small subset of the attributes • Example of a rule from a bioinformatics dataset [Bacardit and Krasnogor, 2009]: Att Leu-2 ∈ [-0.51,7] and Glu ∈ [0.19,8] and Asp+1 ∈ [-5.01,2.67] and Met+1 ∈ [-3.98,10] and Pro+2 ∈ [-7,-4.02] and Pro+3 ∈ [-7,-1.89] and Trp+3 ∈ [-8,13] and Glu+4 ∈ [0.70,5.52] and Lys+4 ∈ [-0.43,4.94] → alpha • Only 9 attributes out of 300 actually appear in the rule

  35. BioHEL: ALKR • Match process for a rule over the full attribute set:
        Function match(instance x, rule r)
          Foreach attribute att in the domain
            If att is relevant in rule r and (x.att < r.att.lower or x.att > r.att.upper)
              Return false
            EndIf
          EndFor
          Return true
      • Given the previous example of a rule, 293 iterations of this loop are wasted! • Can we get rid of them?

  36. BioHEL: ALKR • ALKR automatically identifies the relevant attributes in the domain for each rule and tracks only those
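
A sketch of the attribute-list idea: the rule stores only its relevant attributes, so matching loops over those instead of the whole domain. The dict-based encoding and names are illustrative, not ALKR's actual data structure:

    def alkr_match(rule, instance):
        # rule: mapping attribute index -> (lower, upper), for relevant attributes only
        # instance: sequence with one value per attribute in the domain
        for att, (lower, upper) in rule.items():
            if instance[att] < lower or instance[att] > upper:
                return False
        return True

    # A rule over 300 attributes that constrains only 2 of them: the loop runs twice, not 300 times
    rule = {12: (-0.51, 7.0), 57: (0.19, 8.0)}
    print(alkr_match(rule, [1.0] * 300))   # True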

  37. BioHEL’s ALKR • Simulated 1-point crossover

  38. BioHEL: ALKR • In ALKR two operators (specialize and generalize) add or remove attributes from the list with a given probability, hence exploring the rule-wise space of relevant attributes • The ALKR match process is more efficient; however, exploration is costlier and it requires two extra operators • Since the ALKR chromosome only contains relevant information, the exploration process is more efficient

  39. BioHEL: CUDA-based fitness computation • NVIDIA’s Compute Unified Device Architecture (CUDA) is a parallel computing architecture that exploits the capacity of NVIDIA’s Graphics Processing Units • CUDA runs thousands of threads at the same time, following the Single Program, Multiple Data paradigm • In the last few years GPUs have been extensively used in the evolutionary computation field • Many papers and applications are available at http://www.gpgpgpu.com • Using GPGPUs for machine learning is more challenging because more data is involved, but this also means the computation is potentially more parallelizable

  40. CUDA architecture

  41. CUDA memory management • Different types of memory with different access speeds • Global memory (slow and large) • Shared memory (block-wise; fast but quite small) • Constant memory (very fast but very small) • The memory is limited • Memory copy operations take a considerable amount of execution time • Since we aim to work with large-scale datasets, a good strategy for minimizing the execution time is to manage memory usage carefully

  42. CUDA for matching a set of rules • The match process is the most computationally expensive stage • However, performing only the match inside the GPU means downloading from the card a structure of size O(N×M) (N = population size, M = training set size) • In most cases we don’t need to know the specific matches of a classifier, just how many: the data can be reduced • Performing this second (reduction) stage inside the GPU as well reduces the memory traffic to O(N)
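
A NumPy sketch (on the CPU) of the two-stage idea: compute the full match matrix, then reduce it to per-classifier counts so only O(N) numbers need to leave the device; in the CUDA version both stages would run as GPU kernels. All names and sizes are illustrative:

    import numpy as np

    def match_counts(lower, upper, data):
        # lower, upper: (N, d) rule bounds; data: (M, d) training instances
        # Stage 1: O(N*M) boolean match matrix (kept on the device in the CUDA version)
        matches = np.all((data[None, :, :] >= lower[:, None, :]) &
                         (data[None, :, :] <= upper[:, None, :]), axis=2)
        # Stage 2: reduce to one count per classifier -> only O(N) values to copy back
        return matches.sum(axis=1)

    rng = np.random.default_rng(0)
    lower = rng.uniform(0.0, 0.4, size=(5, 3))
    upper = lower + 0.5
    data = rng.uniform(0.0, 1.0, size=(100, 3))
    print(match_counts(lower, upper, data))    # how many examples each of the 5 rules matches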

  43. CUDA in BioHEL

  44. Performance of CUDA alone • We used CUDA on a Tesla C1060 card with 4GB of global memory, and compared the run-time to that of Intel Xeon E5472 3.0GHz processors • The biggest speedups were obtained on large problems (|T| or #Att), especially in domains with continuous attributes • Run time for the largest dataset was reduced from 2 weeks to 8 hours

  45. CUDA fitness in combination with ILAS • The speedups of CUDA and ILAS are cumulative

  46. Resources • A very thorough survey on GBML is available here • Thesis of Martin Butz on XCS, including theoretical models and advanced exploration methods (later a book) • My thesis, about Gassist (code) • Complete description of BioHEL (code)

  47. Questions?
