Evolutionary Computation • Genetic Algorithms • Genetic Programming • Learning Classifier Systems
Genetic Algorithms • Population-based technique for discovery of knowledge structures • Based on idea that evolution represents search for optimum solution set • Massively parallel
The Vocabulary of GAs • Population • Set of individuals, each represented by one or more strings of characters • Chromosome • The string representing an individual
The vocabulary of GAs, contd. • Gene • The basic informational unit on a chromosome • Allele • The value of a specific gene • Locus • The ordinal place on a chromosome where a specific gene is found
Genetic operators • Reproduction • Increase representations of strong individuals • Crossover • Explore the search space • Mutation • Recapture “lost” genes due to crossover
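The three operators can be sketched on bit-string chromosomes as follows. This is a minimal illustration, not code from the presentation: the roulette-wheel selection scheme, single-point crossover, bit-flip mutation, and the toy `count_ones` objective are all assumptions chosen for brevity.

```python
import random

def select(population, fitness, rng):
    """Reproduction: fitness-proportionate (roulette-wheel) selection."""
    # A tiny constant keeps the weights valid if every fitness is zero.
    weights = [fitness(ind) + 1e-9 for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two parents at a random locus."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate, rng):
    """Mutation: flip each allele independently with probability `rate`."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < rate else gene
        for gene in chromosome
    )

def count_ones(chromosome):
    # Toy objective (assumption): maximize the number of 1 alleles.
    return chromosome.count("1")

rng = random.Random(0)
population = ["".join(rng.choice("01") for _ in range(8)) for _ in range(6)]
parent_a = select(population, count_ones, rng)
parent_b = select(population, count_ones, rng)
child_a, child_b = crossover(parent_a, parent_b, rng)
child_a = mutate(child_a, 0.05, rng)
```

Crossover only recombines alleles already present in the parents, which is why mutation is needed to reintroduce alleles that have disappeared from the population.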
GAs rely on the concept of “fitness” • Ability of an individual to survive into the next generation • “Survival of the fittest” • Usually calculated in terms of an objective fitness function • Maximization • Minimization • Other functions
Genetic Programming • Based on adaptation and evolution • Structures undergoing adaptation are computer programs of varying size and shape • Computer programs are genetically “bred” over time
The Learning Classifier System • Rule-based knowledge discovery and concept learning tool • Operates by means of evaluation, credit assignment, and discovery applied to a population of “chromosomes” (rules) each with a corresponding “phenotype” (outcome)
Components of a Learning Classifier System • Performance • Provides interaction between environment and rule base • Performs matching function • Reinforcement • Rewards accurate classifiers • Punishes inaccurate classifiers • Discovery • Uses the genetic algorithm to search for plausible rules
The Learning Classifier System • Rule-based knowledge discovery and concept learning tool • EpiCS • First Learning Classifier System designed for use in epidemiologic surveillance • Supervised learning environment
Knowledge Representation • Classifiers • IF-THEN rules • Condition=“genotype” • Action=“phenotype” • Strength metric • Encoded as bit strings or numerics • Population • Fixed size collection of classifiers
Low-level knowledge representation: The Classifier • Taxon is analogous to a condition (LHS) of an IF-THEN rule • Action bit is analogous to an action (RHS) of an IF-THEN rule • Strength is an internal fitness function
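A classifier of this kind can be sketched as a small record with a matching method. The `Classifier` class name and the `#` "don't care" symbol are assumptions here; the wildcard is conventional in the learning classifier system literature but is not named on the slide.

```python
from dataclasses import dataclass

WILDCARD = "#"  # assumed "don't care" symbol; not specified in the slides

@dataclass
class Classifier:
    taxon: str       # condition (LHS of the IF-THEN rule), e.g. "1#0"
    action: str      # action bit (RHS of the IF-THEN rule), e.g. "1"
    strength: float  # internal fitness of the rule

    def matches(self, case: str) -> bool:
        """True if every non-wildcard position of the taxon equals the case."""
        return len(case) == len(self.taxon) and all(
            t == WILDCARD or t == c for t, c in zip(self.taxon, case)
        )

rule = Classifier(taxon="1#0", action="1", strength=10.0)
print(rule.matches("110"), rule.matches("111"))  # prints: True False
```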
Components of a learning classifier system • Performance • Provides interaction between environment and classifier population • Performs matching function • Reinforcement • Rewards accurate classifiers • Punishes inaccurate classifiers • Discovery • Uses the genetic algorithm to search for plausible knowledge structures
Performance component • Creates a subset (the matchset, [M]) of all classifiers in population [P] whose conditions match a string received from the environment • From [M], a single classifier is selected, based on its strength as a proportion of the sum of all strengths in [M] • The action of this classifier is then used as the output of the system
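The matching and selection steps above can be sketched as follows. Classifiers are plain dictionaries for brevity, and the `#` wildcard and the function names are assumptions rather than details taken from the slides.

```python
import random

WILDCARD = "#"  # assumed "don't care" symbol

def matches(taxon, case):
    return all(t == WILDCARD or t == c for t, c in zip(taxon, case))

def match_set(population, case):
    """Build [M]: all classifiers in [P] whose condition matches the input."""
    return [cl for cl in population if matches(cl["taxon"], case)]

def select_action(m, rng):
    """Pick one classifier from [M] with probability strength / sum(strengths)."""
    strengths = [cl["strength"] for cl in m]
    chosen = rng.choices(m, weights=strengths, k=1)[0]
    return chosen["action"]

population = [
    {"taxon": "1#0", "action": "1", "strength": 30.0},
    {"taxon": "##0", "action": "0", "strength": 10.0},
    {"taxon": "011", "action": "1", "strength": 50.0},
]
m = match_set(population, "110")          # first two classifiers match
output = select_action(m, random.Random(0))
```

In this example the first classifier is chosen with probability 30/40 and the second with probability 10/40.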
Reinforcement component • Correct set [C] is created from classifiers in [M] advocating correct decisions • Remaining classifiers in [M] form Not[C] • Tax is deducted from the strengths of all classifiers in [C] • Reward is added to the strengths of all classifiers in [C], biased for generality • Penalty is deducted from the strengths of all classifiers in Not[C]
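One possible reading of this payoff/penalty scheme is sketched below. The parameter names and values (`tax_rate`, `reward`, `penalty_rate`), the equal sharing of the reward across [C], and the way generality is measured (fraction of `#` positions) are all assumptions; the slides do not give the actual EpiCS formulas.

```python
def reinforce(m, correct_action, tax_rate=0.1, reward=100.0, penalty_rate=0.5):
    """Update strengths in the match set [M] after the correct answer is known."""
    correct = [cl for cl in m if cl["action"] == correct_action]      # [C]
    not_correct = [cl for cl in m if cl["action"] != correct_action]  # Not[C]
    for cl in correct:
        # Tax: a fixed fraction of current strength (assumed form).
        cl["strength"] -= tax_rate * cl["strength"]
        # Reward: shared across [C], scaled up for more general taxa (assumed form).
        generality = cl["taxon"].count("#") / len(cl["taxon"])
        cl["strength"] += (reward / len(correct)) * (1.0 + generality)
    for cl in not_correct:
        # Penalty: a fixed fraction of current strength (assumed form).
        cl["strength"] -= penalty_rate * cl["strength"]
    return correct, not_correct

m = [
    {"taxon": "1#", "action": "1", "strength": 10.0},
    {"taxon": "11", "action": "0", "strength": 10.0},
]
correct, not_correct = reinforce(m, correct_action="1")
```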
Discovery component • Genetic algorithm invoked once per iteration • One new offspring is created, from parents deterministically selected based on strength • The single offspring replaces weakest classifier in the population
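A single discovery step might look like the sketch below. Taking the two strongest classifiers as parents is one interpretation of "deterministically selected"; the single-point crossover, the mutation alphabet, the inherited action, and the averaged initial strength are further assumptions.

```python
import random

def discover(population, rng, mutation_rate=0.01):
    """Create one offspring and overwrite the weakest classifier in place."""
    ranked = sorted(population, key=lambda cl: cl["strength"], reverse=True)
    p1, p2 = ranked[0], ranked[1]  # assumed: the two strongest are the parents
    point = rng.randrange(1, len(p1["taxon"]))
    taxon = p1["taxon"][:point] + p2["taxon"][point:]
    taxon = "".join(
        rng.choice("01#") if rng.random() < mutation_rate else gene
        for gene in taxon
    )
    child = {
        "taxon": taxon,
        "action": p1["action"],                              # assumed inheritance
        "strength": (p1["strength"] + p2["strength"]) / 2,   # assumed initial value
    }
    weakest = min(range(len(population)), key=lambda i: population[i]["strength"])
    population[weakest] = child
    return child

population = [
    {"taxon": "1100", "action": "1", "strength": 9.0},
    {"taxon": "0011", "action": "0", "strength": 5.0},
    {"taxon": "####", "action": "0", "strength": 1.0},
]
child = discover(population, random.Random(0), mutation_rate=0.0)
```

Because exactly one classifier is replaced per call, the population size stays fixed.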
Features of EpiCS • Object-oriented implementation • Stimulus-response architecture • Payoff/Penalty reinforcement regime • Syntactic control of overgeneralization • Differential penalty control of undergeneralization • Ability to compute risk of outcome
Discovering risk with EpiCS • Output decision of the learning classifier system is probability of disease (CSPD), rather than dichotomous decision • CSPD determined from proportion of classifiers matching a given input case’s taxon
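As a rough sketch, a CSPD-style estimate could be computed as the share of matching classifiers that advocate the disease outcome. This reading of "proportion of classifiers" is an assumption, as are the `#` wildcard, the `"1"` disease action, and the function names; the actual EpiCS calculation may weight classifiers differently.

```python
WILDCARD = "#"  # assumed "don't care" symbol

def matches(taxon, case):
    return all(t == WILDCARD or t == c for t, c in zip(taxon, case))

def cspd(population, case, disease_action="1"):
    """Proportion of classifiers matching `case` whose action is disease."""
    m = [cl for cl in population if matches(cl["taxon"], case)]
    if not m:
        return None  # no matching classifier, so no estimate
    return sum(cl["action"] == disease_action for cl in m) / len(m)

population = [
    {"taxon": "1#0", "action": "1"},
    {"taxon": "##0", "action": "1"},
    {"taxon": "11#", "action": "0"},
    {"taxon": "1##", "action": "1"},
    {"taxon": "001", "action": "0"},
]
print(cspd(population, "110"))  # prints: 0.75
```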
Discovery of Predictive Models in an Injury Surveillance Database: An Application of Data Mining in Clinical Research
Partners for Child Passenger Safety: Information Infrastructure
Why data mining is needed for PCPS • Large number of raw and derived variables renders traditional “manual” methods for discovering patterns in data unwieldy • Hypothesis-driven (biased) analyses may lead to missed associations • Constantly changing patterns in prospective data require constantly changing analytic approaches that can be informed by data mining
Candidate Predictors • Demographics • Kinematics • Characteristics of crash • Restraint use
Outcome: Head Injury • Major burns involving the head • Skull fracture • Evidence of brain injury reported by respondent • Excessive sleepiness • Difficulty in arousing • Unresponsiveness • Amnesia after accident
Data Preparation • Pool of 8,334 records • 20 separate datasets created • All cases of head injury included (N=415) • Equal number of non-head injury cases randomly drawn from pool • Each dataset randomly sampled to create mutually exclusive training and testing sets of equal size
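The sampling scheme above can be sketched as follows, using placeholder records rather than real data. The function name is hypothetical, and the sketch assumes that the 415 head-injury cases are counted within the pool of 8,334 and that the training/testing split is a simple random halving.

```python
import random

def make_datasets(cases, non_cases, n_datasets, rng):
    """Build balanced datasets, each split into equal train and test halves."""
    datasets = []
    for _ in range(n_datasets):
        # Draw as many non-cases as there are cases, without replacement.
        controls = rng.sample(non_cases, len(cases))
        pooled = cases + controls
        rng.shuffle(pooled)
        half = len(pooled) // 2
        datasets.append((pooled[:half], pooled[half:]))  # mutually exclusive
    return datasets

rng = random.Random(42)
cases = [("head_injury", i) for i in range(415)]               # placeholder records
non_cases = [("no_head_injury", i) for i in range(8334 - 415)]  # assumed pool split
datasets = make_datasets(cases, non_cases, 20, rng)
```

Each of the 20 datasets then holds 830 records, split into a training set and a testing set of 415 records each.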
Comparison methods: Logistic Regression • Variables from training sets stepped into model to determine significant terms • Significant terms used to create new risk model • Risk model applied to cases in testing set • Risk estimates categorized by deciles and used to construct ROC curves
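Building an ROC curve from risk estimates cut at fixed thresholds, and taking the area under it, can be sketched as below. The function names, the use of nine evenly spaced cut points as a stand-in for decile boundaries, and the toy scores are assumptions for illustration only.

```python
def roc_points(scores, labels, thresholds):
    """Return sorted (false positive rate, true positive rate) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = {(0.0, 0.0), (1.0, 1.0)}  # curve always spans both corners
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.add((fp / neg, tp / pos))
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

scores = [0.1, 0.2, 0.8, 0.9]   # toy risk estimates
labels = [0, 0, 1, 1]           # 1 = outcome present
cuts = [i / 10 for i in range(1, 10)]
area = auc(roc_points(scores, labels, cuts))
```

With these perfectly separated toy scores the area is 1.0; scores that carry no information give an area near 0.5.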
Comparison Methods: Decision Tree Induction • C4.5 used to create decision trees from training sets • 10-fold cross-validation used to optimize trees • Optimized trees used by C4.5RULES to classify cases in testing set
Results: Training • EpiCS • 5,000 unique classifiers reduced to 2,314 by the end of training • Logistic regression • Single model with eight significant terms, no significant interactions • C4.5 • 11 rules created for each training set, most with single conjuncts
Results: Prediction • Area under the ROC curve obtained on testing, averaged over the 20 separate studies
And now for something a little different The XCS model
XCS: A little history • Wilson, SW: Evolutionary Computation, 2(1), 1-18 (1994) • ZCS • Wilson, SW: Evolutionary Computation, 3(2), 149-175 (1995) • The seminal work on XCS • Many papers by Lanzi, Barry, Butz, and others • Butz, M and Wilson, SW: Advances in Learning Classifier Systems. Third International Workshop (IWLCS-2000), Lecture Notes in Artificial Intelligence (LNAI-1996). Berlin: Springer-Verlag (2001) • The algorithm paper
What is XCS? • An LCS that differs from the traditional Holland model • Classifier fitness is based on the accuracy of the classifier’s payoff prediction, rather than the prediction itself • The genetic algorithm is restricted to niches in the action set, rather than applied to the classifier population as a whole • The major feature is graceful, accurate generalization
XCS in a nutshell • [Figure: schematic with panels labeled “Action: 00” and “Action: 01” and the worked expression ((43*99)+(27*3))/102] • Source: Wilson, XCS tutorial
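The expression ((43*99)+(27*3))/102 has the form of a fitness-weighted average of payoff predictions, which is how XCS combines the classifiers that advocate the same action. The sketch below assumes that 43 and 27 are the two classifiers' predictions and that 99 and 3 are their fitnesses (so 102 is the fitness sum); the function name is hypothetical.

```python
def system_prediction(classifiers):
    """Fitness-weighted mean of predictions for one action.

    `classifiers` is a list of (prediction, fitness) pairs.
    """
    total_fitness = sum(fitness for _, fitness in classifiers)
    return sum(p * fitness for p, fitness in classifiers) / total_fitness

# Two classifiers advocating the same action (values from the slide's expression).
value = system_prediction([(43, 99), (27, 3)])
print(round(value, 2))  # prints: 42.53
```

The high-fitness classifier dominates, so the combined prediction sits close to 43 rather than midway between 43 and 27.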
EpiXCS: An XCS-Based Learning Classifier System for Epidemiologic Research
Outline • What is it? • EpiXCS architecture • Data encoding • Evaluation metrics • Reinforcement • Missing values handling • Classifier ranking • Risk assessment • Test case: Pima Indians Diabetes Data
What is EpiXCS? • Learning classifier system based on the XCS paradigm • Uses the Lanzi C++ kernel • Designed for use in epidemiologic research, specifically mining disease surveillance databases in supervised learning environments • Visualization by non-LCS users • Sensitive to demands of clinical data
Data Encoding in EpiXCS • All numeric data formats permissible • Binary • Categorical • Ordinal • Real • Non-binary data represented using “center-spread” approach • Two genes per feature • Actions are limited to binary (for now)
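Under a center-spread encoding, each feature is represented by two genes, a center c and a spread s, and a value x is usually taken to match when it lies within c - s and c + s. The sketch below assumes inclusive bounds and uses a hypothetical function name; it does not reproduce EpiXCS internals.

```python
def interval_matches(condition, case):
    """True if every feature value lies inside its [center - spread, center + spread].

    `condition` is a list of (center, spread) gene pairs, one pair per feature.
    """
    return all(c - s <= x <= c + s for (c, s), x in zip(condition, case))

# Two features: the first covers 3..7, the second covers 0..2.
condition = [(5.0, 2.0), (1.0, 1.0)]
print(interval_matches(condition, [4.0, 2.0]))  # prints: True
print(interval_matches(condition, [8.0, 1.0]))  # prints: False
```

A larger spread makes a classifier more general for that feature, playing the role that a wildcard plays in a bit-string condition.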
Sample input data format (Pima Indians Diabetes Database)
ATTRIBUTE 0 <WILD "99"><REAL><STRING "Clump Thickness">
ATTRIBUTE 1 <WILD "99"><REAL><STRING "Uniformity of Cell Size">
ATTRIBUTE 2 <WILD "99"><REAL><STRING "Uniformity of Cell Shape">
ATTRIBUTE 3 <WILD "99"><REAL><STRING "Marginal Adhesion">
ATTRIBUTE 4 <WILD "99"><REAL><STRING "Single Epithelial Cell Size">
ATTRIBUTE 5 <WILD "99"><REAL><STRING "Bare Nuclei">
ATTRIBUTE 6 <WILD "99"><REAL><STRING "Bland Chromatin">
ATTRIBUTE 7 <WILD "99"><REAL><STRING "Normal Nucleoli">
ATTRIBUTE 8 <WILD "99"><REAL><STRING "Mitoses">
ACTION 9 <STRING "Malignant">
5 4 4 5 7 10 3 2 1 0
3 1 1 1 2 2 3 1 1 0
8 10 10 8 7 10 9 7 1 1
…
Classifier Population Initialization • Minima and maxima for each attribute determined automatically at start of run • Center values can be initialized by user • Mean • Median • Random value between spread • Spread values can be initialized by user • Standard deviation • Quantile
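The initialization options listed above could be sketched with the standard library as follows. The function and option names are hypothetical, and two details are assumptions: a "random" center is drawn uniformly between the attribute's minimum and maximum, and a "quantile" spread is taken as half the interquartile range.

```python
import random
import statistics

def init_center(values, method, rng):
    """Choose a starting center gene for one attribute."""
    if method == "mean":
        return statistics.mean(values)
    if method == "median":
        return statistics.median(values)
    if method == "random":
        # Assumed: uniform draw between the attribute's observed min and max.
        return rng.uniform(min(values), max(values))
    raise ValueError(f"unknown center method: {method}")

def init_spread(values, method):
    """Choose a starting spread gene for one attribute."""
    if method == "stdev":
        return statistics.stdev(values)
    if method == "quantile":
        # Assumed: half the interquartile range.
        q1, _, q3 = statistics.quantiles(values, n=4)
        return (q3 - q1) / 2
    raise ValueError(f"unknown spread method: {method}")

attribute = [1, 2, 3, 4, 5]  # toy attribute values
rng = random.Random(0)
center = init_center(attribute, "median", rng)
spread = init_spread(attribute, "quantile")
```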