Evolutionary Computation

Evolutionary Computation • Genetic Algorithms • Genetic Programming • Learning Classifier Systems

Genetic Algorithms • Population-based technique for discovery of knowledge structures • Based on idea that evolution represents search for optimum solution set • Massively parallel

The Vocabulary of GAs • Population: the set of individuals, each represented by one or more strings of characters • Chromosome: the string representing an individual

The Vocabulary of GAs, contd. • Gene: the basic informational unit on a chromosome • Allele: the value of a specific gene • Locus: the ordinal position on a chromosome where a specific gene is found
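The vocabulary can be made concrete with a minimal Python sketch. It assumes binary chromosomes stored as strings and 0-based loci; the variable names are illustrative only and do not come from the slides.

```python
# A population is a set of individuals; here each individual is one chromosome.
population = ["10110", "00101", "11100"]

chromosome = population[0]   # the string representing one individual
locus = 2                    # ordinal position of a gene (0-based in this sketch)
allele = chromosome[locus]   # the value of the gene found at that locus

print(f"chromosome={chromosome} locus={locus} allele={allele}")
```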

Thus...

Genetic operators • Reproduction: increases the representation of strong individuals • Crossover: explores the search space • Mutation: recaptures genes “lost” through crossover

Genetic operators illustrated...
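As a rough sketch of the three operators on bit-string chromosomes: the code below assumes fitness-proportionate (roulette-wheel) reproduction, single-point crossover, and bit-flip mutation. These are common choices, not necessarily the variants the original figure showed, and the function names are hypothetical.

```python
import random

def reproduce(population, fitness, rng):
    """Fitness-proportionate (roulette-wheel) selection of one individual."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(parent_a, parent_b, point):
    """Single-point crossover: the parents swap tails at the given locus."""
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[g] if rng.random() < rate else g for g in chromosome)

rng = random.Random(0)
child_a, child_b = crossover("11111", "00000", point=2)
print(child_a, child_b)            # 11000 00111
print(mutate("00000", 1.0, rng))   # 11111
```

Mutation is what allows an allele that has disappeared from every chromosome in the population to reappear, which crossover alone cannot do.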

GAs rely on the concept of “fitness” • Ability of an individual to survive into the next generation • “Survival of the fittest” • Usually calculated in terms of an objective fitness function: maximization, minimization, or other functions
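A small sketch of an objective fitness function, assuming a toy maximization problem (count the 1-bits) and a simple way to recast a cost to be minimized as a fitness to be maximized. Both functions are hypothetical illustrations.

```python
def ones_count(chromosome):
    """Maximization objective: more 1-bits means higher fitness."""
    return chromosome.count("1")

def as_maximization(cost, worst_cost):
    """Recast a cost to be minimized as a fitness to be maximized."""
    return worst_cost - cost

population = ["10110", "00101", "11100"]
fittest = max(population, key=ones_count)   # ties resolve to the first maximum
print(fittest, ones_count(fittest))
```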

Genetic Programming • Based on adaptation and evolution • Structures undergoing adaptation are computer programs of varying size and shape • Computer programs are genetically “bred” over time
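To illustrate programs as the structures being evolved, the sketch below represents each program as a nested tuple (an expression tree) and performs one step of subtree crossover by grafting a subtree from one parent into another. The tree encoding, the operator set, and the helper names are assumptions made for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, x):
    """Evaluate a program tree; leaves are the variable 'x' or numeric constants."""
    if tree == "x":
        return x
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced.
    A path is a list of child indices: 1 for the left child, 2 for the right."""
    if not path:
        return new_subtree
    op, left, right = tree
    if path[0] == 1:
        return (op, replace_subtree(left, path[1:], new_subtree), right)
    return (op, left, replace_subtree(right, path[1:], new_subtree))

parent_a = ("+", "x", 1)                 # x + 1
parent_b = ("*", ("-", "x", 2), "x")     # (x - 2) * x
child = replace_subtree(parent_a, [2], parent_b[1])   # x + (x - 2)
print(evaluate(child, 5))   # 8
```

Because subtrees of any size can be exchanged, offspring programs vary in size and shape, unlike the fixed-length chromosomes of a basic genetic algorithm.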

The Learning Classifier System • Rule-based knowledge discovery and concept learning tool • Operates by means of evaluation, credit assignment, and discovery applied to a population of “chromosomes” (rules) each with a corresponding “phenotype” (outcome)

Components of a Learning Classifier System • Performance: provides interaction between environment and rule base; performs matching function • Reinforcement: rewards accurate classifiers; punishes inaccurate classifiers • Discovery: uses the genetic algorithm to search for plausible rules

The Learning Classifier System • Rule-based knowledge discovery and concept learning tool • EpiCS • First Learning Classifier System designed for use in epidemiologic surveillance • Supervised learning environment

Knowledge Representation • Classifiers: IF-THEN rules; condition = “genotype”; action = “phenotype”; strength metric; encoded as bit strings or numerics • Population: fixed-size collection of classifiers

Low-level knowledge representation: The Classifier • Taxon is analogous to a condition (LHS) of an IF-THEN rule • Action bit is analogous to an action (RHS) of an IF-THEN rule • Strength is an internal fitness function
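A classifier with these three parts might be sketched as follows. The sketch assumes a bit-string taxon over the alphabet {0, 1, #}, where "#" is a "don't care" symbol as is conventional in learning classifier systems; the class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    taxon: str       # condition (LHS), assumed here to use the alphabet {0, 1, #}
    action: str      # action bit (RHS)
    strength: float  # internal fitness

    def matches(self, case: str) -> bool:
        """True if every non-wildcard position of the taxon equals the input."""
        return len(case) == len(self.taxon) and all(
            t == "#" or t == c for t, c in zip(self.taxon, case)
        )

rule = Classifier(taxon="1#0", action="1", strength=10.0)
print(rule.matches("110"), rule.matches("111"))   # True False
```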

High-level knowledge representation: Macrostate Population

Components of a learning classifier system • Performance: provides interaction between environment and classifier population; performs matching function • Reinforcement: rewards accurate classifiers; punishes inaccurate classifiers • Discovery: uses the genetic algorithm to search for plausible knowledge structures

Generic Machine Learning Model

A Generic Learning Classifier System

EpiCS: A Learning Classifier System

EpiCS: Performance Component

Performance component • Creates a subset (the matchset, [M]) of all classifiers in population [P] whose conditions match a string received from the environment • From [M], a single classifier is selected, based on its strength as a proportion of the sum of all strengths in [M] • The action of this classifier is then used as the output of the system
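The matching and selection steps can be sketched as below. Classifiers are plain dictionaries, the "#" wildcard is an assumption, and the handling of an empty matchset is left open because the slide does not describe it.

```python
import random

def matches(taxon, case):
    """True if each taxon position is a wildcard or equals the input bit."""
    return all(t == "#" or t == c for t, c in zip(taxon, case))

def perform(population, case, rng):
    """Build the matchset [M], then pick one classifier with probability
    equal to its share of the total strength in [M]."""
    match_set = [cl for cl in population if matches(cl["taxon"], case)]
    if not match_set:
        return None, match_set   # behavior for unmatched inputs is not covered here
    strengths = [cl["strength"] for cl in match_set]
    chosen = rng.choices(match_set, weights=strengths, k=1)[0]
    return chosen["action"], match_set

population = [
    {"taxon": "1#0", "action": "1", "strength": 30.0},
    {"taxon": "11#", "action": "0", "strength": 10.0},
    {"taxon": "0##", "action": "0", "strength": 20.0},
]
action, match_set = perform(population, "110", random.Random(1))
print(action, len(match_set))
```

With these strengths, the first classifier would be selected about three times out of four for the input "110".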

EpiCS: Reinforcement Component

Reinforcement component • Correct set [C] is created from classifiers in [M] advocating correct decisions • Remaining classifiers in [M] form Not[C] • Tax is deducted from the strengths of all classifiers in [C] • Reward is added to the strengths of all classifiers in [C], biased for generality • Penalty is deducted from the strengths of all classifiers in Not[C]
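One possible reading of this regime is sketched below. The slide gives no formulas, so every numeric detail is an assumption: the tax and penalty are taken as fractions of current strength, the reward is shared equally among [C], and the generality bias is modeled by scaling the reward by the fraction of wildcards in the taxon.

```python
def reinforce(match_set, correct_action, tax=0.1, reward=100.0, penalty=0.5):
    """Apply tax, reward, and penalty to the classifiers in [M], in place.
    All parameter values and formulas here are illustrative assumptions."""
    correct = [cl for cl in match_set if cl["action"] == correct_action]
    not_correct = [cl for cl in match_set if cl["action"] != correct_action]
    for cl in correct:
        cl["strength"] -= tax * cl["strength"]                  # tax on [C]
        generality = cl["taxon"].count("#") / len(cl["taxon"])
        # Reward shared among [C], with more general rules receiving more.
        cl["strength"] += reward * (1.0 + generality) / len(correct)
    for cl in not_correct:
        cl["strength"] -= penalty * cl["strength"]              # penalty on Not[C]
    return correct, not_correct

match_set = [
    {"taxon": "1#0", "action": "1", "strength": 30.0},
    {"taxon": "11#", "action": "0", "strength": 10.0},
]
correct, not_correct = reinforce(match_set, correct_action="1")
print(correct[0]["strength"], not_correct[0]["strength"])
```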

EpiCS: Discovery Component

Discovery component • Genetic algorithm invoked once per iteration • One new offspring is created, from parents deterministically selected based on strength • The single offspring replaces weakest classifier in the population
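A minimal sketch of one discovery step, assuming the two strongest classifiers serve as parents, single-point crossover on the taxon, and per-gene mutation over {0, 1, #}. The choices that the offspring inherits the first parent's action and the mean parental strength are assumptions, as the slide does not specify them.

```python
import random

def discover(population, rng, mutation_rate=0.01):
    """One GA step: the two strongest classifiers produce a single offspring,
    which replaces the weakest classifier so the population size stays fixed."""
    ranked = sorted(population, key=lambda cl: cl["strength"], reverse=True)
    mother, father = ranked[0], ranked[1]            # deterministic, strength-based
    point = rng.randrange(1, len(mother["taxon"]))   # single-point crossover
    taxon = mother["taxon"][:point] + father["taxon"][point:]
    taxon = "".join(
        rng.choice("01#") if rng.random() < mutation_rate else g for g in taxon
    )
    child = {
        "taxon": taxon,
        "action": mother["action"],                               # assumption
        "strength": (mother["strength"] + father["strength"]) / 2,  # assumption
    }
    weakest = min(range(len(population)), key=lambda i: population[i]["strength"])
    population[weakest] = child
    return child

population = [
    {"taxon": "1#0", "action": "1", "strength": 30.0},
    {"taxon": "11#", "action": "0", "strength": 5.0},
    {"taxon": "0##", "action": "0", "strength": 20.0},
]
child = discover(population, random.Random(7), mutation_rate=0.0)
print(child, len(population))
```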

Features of EpiCS • Object-oriented implementation • Stimulus-response architecture • Payoff/Penalty reinforcement regime • Syntactic control of overgeneralization • Differential penalty control of undergeneralization • Ability to compute risk of outcome

Discovering risk with EpiCS • Output decision of the learning classifier system is probability of disease (CSPD), rather than dichotomous decision • CSPD determined from proportion of classifiers matching a given input case’s taxon
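One way to read this is sketched below: among the classifiers whose taxa match the input case, the CSPD is taken as the proportion that advocate the disease-positive action. That interpretation, the "#" wildcard, and the function names are assumptions; the exact EpiCS computation is not reproduced here.

```python
def matches(taxon, case):
    """True if each taxon position is a wildcard or equals the input bit."""
    return all(t == "#" or t == c for t, c in zip(taxon, case))

def cspd(population, case, disease_action="1"):
    """Proportion of matching classifiers that advocate the disease action."""
    matching = [cl for cl in population if matches(cl["taxon"], case)]
    if not matching:
        return None   # no estimate when nothing matches
    positive = sum(1 for cl in matching if cl["action"] == disease_action)
    return positive / len(matching)

population = [
    {"taxon": "1#0", "action": "1"},
    {"taxon": "11#", "action": "1"},
    {"taxon": "#10", "action": "0"},
    {"taxon": "1##", "action": "1"},
    {"taxon": "0##", "action": "0"},
]
print(cspd(population, "110"))   # 0.75
```

A continuous estimate like this can be thresholded at different cutoffs, which is what makes a ROC analysis of the system's output possible.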

Discovering risk with EpiCS: The specifics

Discovery of Predictive Models in an Injury Surveillance Database: An Application of Data Mining in Clinical Research

Partners for Child Passenger Safety: Information Infrastructure

Why data mining is needed for PCPS • Large number of raw and derived variables renders traditional “manual” methods for discovering patterns in data unwieldy • Hypothesis-driven (biased) analyses may lead to missed associations • Constantly changing patterns in prospective data require constantly changing analytic approaches that can be informed by data mining

Candidate Predictors • Demographics • Kinematics • Characteristics of crash • Restraint use

Outcome: Head Injury • Major burns involving the head • Skull fracture • Evidence of brain injury reported by respondent • Excessive sleepiness • Difficulty in arousing • Unresponsiveness • Amnesia after accident

Data Preparation • Pool of 8,334 records • 20 separate datasets created • All cases of head injury included (N=415) • Equal number of non-head injury cases randomly drawn from pool • Each dataset randomly sampled to create mutually exclusive training and testing sets of equal size
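The sampling scheme can be sketched with synthetic record identifiers. The sketch assumes that the pool of 8,334 records includes the 415 head injury cases and that the training/testing split is a simple random halving; neither detail is stated on the slide.

```python
import random

def make_dataset(case_ids, noncase_ids, rng):
    """All cases plus an equal number of randomly drawn non-cases,
    split at random into mutually exclusive halves."""
    sampled = rng.sample(noncase_ids, len(case_ids))
    records = [(i, 1) for i in case_ids] + [(i, 0) for i in sampled]
    rng.shuffle(records)
    half = len(records) // 2
    return records[:half], records[half:]

rng = random.Random(42)
case_ids = list(range(415))             # stand-ins for the head injury records
noncase_ids = list(range(415, 8334))    # assumes the pool of 8,334 includes the cases
datasets = [make_dataset(case_ids, noncase_ids, rng) for _ in range(20)]
train, test = datasets[0]
print(len(datasets), len(train), len(test))   # 20 415 415
```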

Comparison methods: Logistic Regression • Variables from training sets stepped into model to determine significant terms • Significant terms used to create new risk model • Risk model applied to cases in testing set • Risk estimates categorized by deciles and used to construct ROC curves
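The last step might look like the following sketch, which uses decile cutoffs of the risk estimates as classification thresholds and integrates the resulting ROC points with the trapezoidal rule. This is a simplified stand-in written for illustration, with a hypothetical function name and toy data, not the analysis code used in the study.

```python
def roc_auc_by_deciles(risks, outcomes):
    """ROC points from decile cutoffs of the risk estimates; AUC by trapezoids.
    Assumes both outcome classes (0 and 1) are present."""
    ordered = sorted(risks)
    n = len(ordered)
    cutoffs = sorted({ordered[min(n - 1, (k * n) // 10)] for k in range(10)})
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    points = {(0.0, 0.0), (1.0, 1.0)}
    for cut in cutoffs:
        tp = sum(1 for r, y in zip(risks, outcomes) if r >= cut and y == 1)
        fp = sum(1 for r, y in zip(risks, outcomes) if r >= cut and y == 0)
        points.add((fp / negatives, tp / positives))   # (1 - specificity, sensitivity)
    curve = sorted(points)
    auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(curve, curve[1:]))
    return curve, auc

risks = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
outcomes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # toy data, perfectly separated
curve, auc = roc_auc_by_deciles(risks, outcomes)
print(round(auc, 3))   # 1.0
```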

Comparison Methods: Decision Tree Induction • C4.5 used to create decision trees from training sets • 10-fold cross-validation used to optimize trees • Optimized trees used by C4.5RULES to classify cases in testing set

Experimental Procedure

Results: Training

Results: Training • EpiCS: 5,000 unique classifiers reduced to 2,314 by the end of training • Logistic regression: single model with eight significant terms, no significant interactions • C4.5: 11 rules created for each training set, most with single conjuncts

Results: Prediction • Area under the ROC curve obtained on testing, averaged over the 20 separate studies

And now for something a little different • The XCS model

XCS: A little history • Wilson, SW: Evolutionary Computation, 2(1), 1-18 (1994), the ZCS paper • Wilson, SW: Evolutionary Computation, 3(2), 149-175 (1995), the seminal work on XCS • Many papers by Lanzi, Barry, Butz, and others • Butz, M and Wilson, SW: Advances in Learning Classifier Systems. Third International Workshop (IWLCS-2000), Lecture Notes in Artificial Intelligence (LNAI-1996). Berlin: Springer-Verlag (2001), the algorithm paper

What is XCS? • An LCS that differs from the traditional Holland model • Classifier fitness is based on the accuracy of the classifier's payoff prediction, rather than the prediction itself • The genetic algorithm is restricted to niches in the action set, rather than applied to the classifier population as a whole • The major feature is graceful, accurate generalization
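Accuracy-based fitness can be sketched following the update described in the Butz and Wilson algorithm paper: a classifier's accuracy is 1 when its prediction error is below a threshold and falls off as a power function otherwise, and fitness moves toward the classifier's accuracy relative to the rest of the action set. The parameter values below are commonly used defaults assumed for illustration, and the dictionary layout is hypothetical.

```python
def accuracy(error, eps0=10.0, alpha=0.1, nu=5.0):
    """Accuracy is 1 below the error threshold eps0, then drops off steeply."""
    return 1.0 if error < eps0 else alpha * (error / eps0) ** -nu

def update_fitness(action_set, beta=0.2):
    """Move each fitness toward the classifier's share of accuracy in the action set."""
    kappas = [accuracy(cl["error"]) * cl["numerosity"] for cl in action_set]
    total = sum(kappas)
    for cl, kappa in zip(action_set, kappas):
        cl["fitness"] += beta * (kappa / total - cl["fitness"])

action_set = [
    {"error": 5.0, "numerosity": 1, "fitness": 0.0},    # accurate predictor
    {"error": 20.0, "numerosity": 1, "fitness": 0.0},   # inaccurate predictor
]
update_fitness(action_set)
print(action_set[0]["fitness"], action_set[1]["fitness"])
```

Note that a classifier predicting a low payoff accurately earns high fitness here, which is the key departure from strength-based systems.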

XCS in a nutshell • [Figure: XCS schematic; surviving labels include the computation ((43*99)+(27*3))/102 and the actions 00 and 01] • Source: Wilson, XCS tutorial
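The expression ((43*99)+(27*3))/102 has the form of a fitness-weighted average of payoff predictions, which is how XCS computes the system prediction for an action. The sketch below assumes that 43 and 27 are the predictions of two matching classifiers advocating the same action and that 99 and 3 are their fitnesses (so 102 is the fitness sum); the original figure is not available to confirm this reading.

```python
def system_prediction(classifiers):
    """Fitness-weighted average of payoff predictions for one action.
    `classifiers` is a list of (prediction, fitness) pairs."""
    weighted = sum(p * f for p, f in classifiers)
    total_fitness = sum(f for _, f in classifiers)
    return weighted / total_fitness

value = system_prediction([(43, 99), (27, 3)])
print(round(value, 2))   # 42.53
```

The high-fitness classifier dominates the result, so the prediction stays close to 43 despite the second classifier's much lower estimate.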

EpiXCS: An XCS-Based Learning Classifier System for Epidemiologic Research

Outline • What is it? • EpiXCS architecture • Data encoding • Evaluation metrics • Reinforcement • Missing values handling • Classifier ranking • Risk assessment • Test case: Pima Indians Diabetes Data

What is EpiXCS? • Learning classifier system based on the XCS paradigm • Uses the Lanzi C++ kernel • Designed for use in epidemiologic research, specifically mining disease surveillance databases in supervised learning environments • Visualization by non-LCS users • Sensitive to demands of clinical data

Data Encoding in EpiXCS • All numeric data formats permissible: binary, categorical, ordinal, real • Non-binary data represented using “center-spread” approach, with two genes per feature • Actions are limited to binary (for now)
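A center-spread condition can be sketched as one (center, spread) pair per feature, matching an input value when it lies within the interval from center minus spread to center plus spread. Whether EpiXCS treats the interval bounds as inclusive is not stated, so inclusive bounds are an assumption here, as is the function name.

```python
def interval_matches(condition, case):
    """`condition` holds one (center, spread) pair, i.e. two genes, per feature.
    A case matches when every value lies within center +/- spread (inclusive)."""
    return all(c - s <= x <= c + s for (c, s), x in zip(condition, case))

condition = [(5.0, 2.0), (3.0, 1.0)]   # feature 0 in [3, 7], feature 1 in [2, 4]
print(interval_matches(condition, [6.0, 3.5]))   # True
print(interval_matches(condition, [8.0, 3.0]))   # False
```

A larger spread makes the condition more general, playing the role that the "#" symbol plays in bit-string conditions.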

Sample input data format (labeled Pima Indians Diabetes Database; the attribute names shown match the Wisconsin Breast Cancer data)
ATTRIBUTE 0 <WILD "99"><REAL><STRING "Clump Thickness">
ATTRIBUTE 1 <WILD "99"><REAL><STRING "Uniformity of Cell Size">
ATTRIBUTE 2 <WILD "99"><REAL><STRING "Uniformity of Cell Shape">
ATTRIBUTE 3 <WILD "99"><REAL><STRING "Marginal Adhesion">
ATTRIBUTE 4 <WILD "99"><REAL><STRING "Single Epithelial Cell Size">
ATTRIBUTE 5 <WILD "99"><REAL><STRING "Bare Nuclei">
ATTRIBUTE 6 <WILD "99"><REAL><STRING "Bland Chromatin">
ATTRIBUTE 7 <WILD "99"><REAL><STRING "Normal Nucleoli">
ATTRIBUTE 8 <WILD "99"><REAL><STRING "Mitoses">
ACTION 9 <STRING "Malignant">
5 4 4 5 7 10 3 2 1 0
3 1 1 1 2 2 3 1 1 0
8 10 10 8 7 10 9 7 1 1
…

Classifier Population Initialization • Minima and maxima for each attribute determined automatically at start of run • Center values can be initialized by user: mean, median, or random value between spread • Spread values can be initialized by user: standard deviation or quantile
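The per-attribute statistics named above can be computed with Python's standard `statistics` module, as in the sketch below. The function name, the choice of quartiles as the quantiles, and the sample values are assumptions for illustration; how EpiXCS maps these statistics onto initial center and spread genes is not specified here.

```python
import statistics

def attribute_summary(values):
    """Summary statistics a user might choose from when seeding centers and spreads."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),              # candidate center
        "median": statistics.median(values),          # candidate center
        "stdev": statistics.stdev(values),            # candidate spread
        "quartiles": statistics.quantiles(values, n=4),  # candidate spread bounds
    }

values = [5, 3, 8, 4, 6]   # illustrative values for one attribute
summary = attribute_summary(values)
print(summary["min"], summary["max"], summary["mean"], summary["median"])
```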
