Probabilistic Modeling for Combinatorial Optimization
Scott Davies, School of Computer Science, Carnegie Mellon University
Joint work with Shumeet Baluja
Combinatorial Optimization • Maximize “evaluation function” f(x) • input: fixed-length bitstring x • output: real value • x might represent: • job shop schedules • TSP tours • discretized numeric values • etc. • Our focus: “Black Box” optimization • No domain-dependent heuristics • [Diagram: black box maps x = 1001001... to f(x) = 37.4]
Most Commonly Used Approaches • Hill-climbing, simulated annealing • Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit) • Typically make no attempt to model how particular bits affect solution quality • Genetic algorithms • Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions • Use crossover and mutation operators on population members to generate new candidate solutions
Using Explicit Probabilistic Models • Maintain an explicit probability distribution P from which we generate new candidate solutions • Initialize P to uniform distribution • Until termination criteria met: • Stochastically generate K candidate solutions from P • Evaluate them • Update P to make it more likely to generate solutions “similar” to the “good” solutions • Several different choices for what sorts of P to use and how to update it after candidate solution evaluation
Probability Distributions Over Bitstrings • Let x = (x1, x2, …, xn), where xi can take one of the values {0, 1} and n is the length of the bitstring. • Can factorize any distribution P(x1…xn) bit by bit: P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn | x1, …, xn-1) • In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings. • Obviously too many parameters to estimate from limited data!
Representing Independencies with Bayesian Networks • Graphical representation of probability distributions • Each variable is a vertex • Each variable’s probability distribution is conditioned only on its parents in the directed acyclic graph (“dag”) • [Example DAG: F = Wean on Fire, D = Ice Cream Truck Nearby, I = Fire Alarm Activated, A = Office Door Open, H = Hear Bells, with arcs F→I, F→A, D→H, I→H, A→H] • P(F,D,I,A,H) = P(F) · P(D) · P(I|F) · P(A|F) · P(H|D,I,A)
“Bayesian Networks” for Bitstring Optimization? • [Diagrams: fully connected network over x1 … xn vs. network with no edges] • P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn | x1, …, xn-1) • Yuck. Let’s just assume all the bits are independent instead. (For now.) • P(x1,…,xn) = P(x1) P(x2) P(x3)…P(xn) • Ahhh. Much better.
Population-Based Incremental Learning • Population-Based Incremental Learning (PBIL) [Baluja, 1995] • Maintains a vector of probabilities: one independent probability P(xi) for each bit xi. • Until termination criteria met: • Generate a population of K bitstrings from P • Evaluate them • Use the best M of the K to update P, with learning rate α: P(xi) ← (1 - α)·P(xi) + α if xi is set to 1, or P(xi) ← (1 - α)·P(xi) if xi is set to 0 • Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings • Return best bitstring ever evaluated
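As a concrete illustration of the loop above, here is a minimal Python sketch of PBIL. The evaluation function f, the settings K, M, alpha, and the number of generations are illustrative placeholders rather than values from the talk, and the optional negative update from the worst bitstring is omitted.

```python
import random

def pbil(f, n, K=100, M=2, alpha=0.1, generations=1000):
    """Minimal PBIL sketch: maintain one independent P(x_i = 1) per bit."""
    p = [0.5] * n                                   # start from the uniform distribution
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        # Generate a population of K bitstrings from the probability vector P
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)] for _ in range(K)]
        scored = sorted(((f(x), x) for x in pop), reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        # Move each P(x_i) toward the bit values of the best M samples
        for _, x in scored[:M]:
            for i in range(n):
                p[i] = (1 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val
```

For example, pbil(sum, n=20) maximizes the number of ones (the OneMax problem).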
PBIL vs. Discrete Learning Automata • Equivalent to a team of Discrete Learning Automata, one automaton per bit [Thathachar & Sastry, 1987] • Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions • PBIL update rule equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975] with “success” defined as “best in the bunch” • However, Discrete Learning Automata were previously used mainly on problems with few variables but noisy evaluation functions
PBIL vs. Genetic Algorithms • PBIL originated as tool for understanding GA behavior • Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993] • Regenerates P from scratch after every generation • All K used to update P, weighted according to probabilities that a GA would have selected them for reproduction • Why might normal GAs be better? • Implicitly capture inter-bit dependencies with population • However, because model is only implicit, crossover must be randomized. Also, limited population size often leads to premature convergence based on noise in samples.
Four Peaks Problem • Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging. • Given input vector X with N bits, and difficulty parameter T: • FourPeaks(T,X)=MAX(head(1,X), tail(0,X))+Bonus(T,X) • head(b,X) = # of contiguous leading bits in X set to b • tail(b,X) = # of contiguous trailing bits in X set to b • Bonus(T,X) = 100 if (head(1,X)>T) AND (tail(0,X) > T), or 0 otherwise
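A direct transcription of this definition into Python (a sketch; the bitstring is assumed to be a list of 0/1 integers):

```python
def head(b, x):
    """Number of contiguous leading bits of x equal to b."""
    n = 0
    for bit in x:
        if bit != b:
            break
        n += 1
    return n

def tail(b, x):
    """Number of contiguous trailing bits of x equal to b."""
    return head(b, x[::-1])

def four_peaks(T, x):
    """MAX(head(1,X), tail(0,X)) plus a bonus of 100 when both exceed T."""
    bonus = 100 if head(1, x) > T and tail(0, x) > T else 0
    return max(head(1, x), tail(0, x)) + bonus
```

For instance, on a 100-bit input four_peaks(10, [1]*12 + [0]*88) scores 88 + 100 = 188.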
Four Peaks Problem Results • 80 GA settings tried; best five shown here • Average over 50 runs
Large-Scale PBIL Empirical Comparison • See table in other slide
Modeling Inter-Bit Dependencies • How about automatically learning probability distributions in which at least some dependencies between variables are modeled? • Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability: P(Bi | D) ∝ P(D | Bi) · P(Bi)
Equations. (Mwuh hah hah hah!) • Maximize P(B | D) ∝ P(D | B) · P(B), with log P(D | B) = Σj Σi log P(di^j | πi^j) • Where: • d^j is the jth datapoint • di^j is the value assigned to xi by d^j • Πi is the set of xi’s parents in B • πi^j is the set of values assigned to Πi by d^j • P’ is the empirical probability distribution exhibited by D
Mmmm...Entropy. • Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P’. So: • Pick B to Maximize: Σj Σi log P’(di^j | πi^j) • Pick S to Maximize: -Σi H(xi | Πi), where H(xi | Πi) is the conditional entropy of xi given its parents under P’ (the two criteria differ only by the constant factor |D|)
Entropy Calculation Example • [Diagram: node x20 with parents x17 and x32, with a table of counts over 80 total datapoints] • x20’s contribution to score: -80 · H(x20 | x17, x32), computed from the empirical counts, where H(p, q) = -(p log p + q log q)
Single-Parent Tree-Shaped Networks • Now let’s allow each bit to be conditioned on at most one other bit. • [Diagram: a tree-shaped network over bits x17, x22, x5, x2, x13, x19, x6, x14, x8] • Adding an arc from xj to xi increases network score by: • H(xi) - H(xi | xj) (xj’s “information gain” with xi) • = H(xj) - H(xj | xi) (not necessarily obvious, but true) • = I(xi, xj), the mutual information between xi and xj
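Since the tree-learning step on the next slide scores candidate arcs by mutual information, here is a small sketch of estimating I(xi, xj) from a dataset of bitstrings (the list-of-bit-lists representation is an assumption for illustration):

```python
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information I(x_i, x_j) from a list of bit-lists."""
    n = len(data)
    joint = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi
```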
Optimal Single-Parent Tree-Shaped Networks • To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]: • Start with an arbitrary root node xr. • Until all n nodes have been added to the tree: • Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout). • Add xout to the tree with xin as its parent. • Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics)
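A sketch of this Prim-style construction, reusing mutual_information from the previous sketch. For clarity it recomputes mutual information inside the loop; caching the full pairwise matrix up front is what gives the O(n^2) time quoted on the slide.

```python
def chow_liu_tree(data, n):
    """Return a dict parent[i] (None for the root) forming the maximum
    spanning tree with mutual information as edge weights."""
    parent = {0: None}                 # arbitrary root: bit 0
    outside = set(range(1, n))
    while outside:
        best = None
        for x_in in parent:
            for x_out in outside:
                w = mutual_information(data, x_in, x_out)
                if best is None or w > best[0]:
                    best = (w, x_in, x_out)
        _, x_in, x_out = best
        parent[x_out] = x_in           # add x_out to the tree with x_in as its parent
        outside.remove(x_out)
    return parent
```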
Optimal Dependency Trees for Combinatorial Optimization [Baluja & Davies, 1997] • Start with a dataset D initialized from the uniform distribution • Until termination criteria met: • Build optimal dependency tree T with which to model D. • Generate K bitstrings from probability distribution represented by T. Evaluate them. • Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1. • Return best bitstring ever evaluated. • Running time: O(K·n + n^2) per iteration
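The “generate K bitstrings from T” step needs the conditional probabilities P(xi | parent(xi)) estimated from the dataset, followed by ancestral sampling. A sketch under the same list-of-bit-lists assumption, ignoring datapoint weights and using add-one smoothing (both illustrative simplifications):

```python
import random

def estimate_cond_probs(data, parent):
    """cond[i][v]: empirical P(x_i = 1 | parent bit = v); v is None for the root."""
    cond = {}
    for i, p in parent.items():
        cond[i] = {}
        for v in ([None] if p is None else [0, 1]):
            rows = data if p is None else [x for x in data if x[p] == v]
            ones = sum(x[i] for x in rows)
            cond[i][v] = (ones + 1) / (len(rows) + 2)   # add-one smoothing
    return cond

def sample_from_tree(parent, cond, n):
    """Ancestral sampling: sample each bit after its parent has been sampled."""
    order, seen = [], set()
    def visit(i):
        if i in seen:
            return
        seen.add(i)
        if parent[i] is not None:
            visit(parent[i])
        order.append(i)
    for i in range(n):
        visit(i)
    x = [0] * n
    for i in order:
        v = None if parent[i] is None else x[parent[i]]
        x[i] = 1 if random.random() < cond[i][v] else 0
    return x
```

In the full loop above, the tree is rebuilt each generation from the decayed dataset and the best M samples are added back into D.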
Tree-based Optimization vs. MIMIC • Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet, et al., 1997] • Learned chain-shaped networks rather than tree-shaped networks [Diagram: chain over x4, x17, x8, x5] • Dataset: best N% of all bitstrings ever evaluated • +: dataset has simple, well-defined interpretation • -: have to remember bitstrings • -: seems to converge too quickly on some larger problems
MIMIC dataset vs. Exp. Decay dataset • Solving a system of linear equations the hard way • Error of best solution as a function of generation #, averaged over 10 problems • 81 bits
Graph-Coloring Example • Noisy graph-coloring example • For each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5
Peaks problems results • Oops. Forgot this slide.
Checkerboard problem results • 16×16 grid of bits • For each bit in the middle 14×14 subgrid, add +1 to evaluation for each of its four neighbors set to the opposite value • Results: Genetic Algorithm (avg: 740), PBIL (avg: 742), Chain (avg: 760), Tree (avg: 776)
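A sketch of this evaluation function, assuming the 256 bits are laid out as a 16×16 list of lists:

```python
def checkerboard_eval(grid):
    """+1 for each of the four neighbors of every interior (14x14) bit
    that holds the opposite value."""
    score = 0
    for r in range(1, 15):
        for c in range(1, 15):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r + dr][c + dc] != grid[r][c]:
                    score += 1
    return score
```

A perfect checkerboard scores 14 × 14 × 4 = 784, which puts the averages above in context.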
Linear Equations Results • 81 bits again; 9 problems, each with results averaged over 25 runs • Minimize error
Modeling Higher-Order Dependencies • The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent. • What about finding the best network in which each node has at most K parents, for K > 1? • NP-complete problem! [Chickering, et al., 1995] • However, can use search heuristics such as hillclimbing to look for “good” network structures (e.g., [Heckerman, et al., 1995]).
Scoring Function for Arbitrary Networks • Rather than restricting K directly, add a penalty term -λ|B| to the scoring function to limit total network size (λ: penalty factor; |B|: number of parameters in B) • Equivalent to priors favoring simpler network structures • Alternatively, lends itself nicely to an MDL interpretation • Size of penalty controls exploration/exploitation tradeoff
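A sketch of this penalized score for a candidate structure, with the structure given as a dict mapping each bit to a tuple of parent indices and |B| counted as one Bernoulli parameter per parent configuration (these representation choices are assumptions for illustration):

```python
from collections import Counter
from math import log

def penalized_score(data, parents, lam):
    """Log-likelihood of data under the empirical conditional probabilities
    implied by `parents`, minus lam * (number of parameters)."""
    loglik, n_params = 0.0, 0
    for i, pa in parents.items():
        joint = Counter((tuple(x[j] for j in pa), x[i]) for x in data)
        marginal = Counter(tuple(x[j] for j in pa) for x in data)
        for (pvals, _), count in joint.items():
            loglik += count * log(count / marginal[pvals])
        n_params += 2 ** len(pa)       # one P(x_i = 1 | parent config) per configuration
    return loglik - lam * n_params
```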
Bayesian Network-Based Combinatorial Optimization • Initialize D with C bitstrings from uniform distribution, and Bayesian network B to empty network containing no edges • Until termination criteria met: • Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score: • Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score • Perform the change that maximizes the increase in score. • Set B to B’. • Generate and evaluate K bitstrings from B. • Decay weight of datapoints in D by α. • Add best M of the K recently generated datapoints to D. • Return best bitstring ever evaluated.
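A sketch of one steepest-ascent step over network structures, reusing penalized_score from the previous sketch. For brevity it tries only single-arc additions and removals (no reversals) and uses a simple DFS cycle check; the caching tricks on the next slide are omitted.

```python
def creates_cycle(parents):
    """True if following child -> parent links from some node returns to it."""
    state = {i: 0 for i in parents}            # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(i):
        state[i] = 1
        for p in parents[i]:
            if state[p] == 1 or (state[p] == 0 and dfs(p)):
                return True
        state[i] = 2
        return False
    return any(state[i] == 0 and dfs(i) for i in parents)

def hillclimb_step(data, parents, lam):
    """Try every single-arc addition/removal; return the best improving structure."""
    best_parents, best_score = parents, penalized_score(data, parents, lam)
    improved = False
    for child in parents:
        for p in parents:
            if p == child:
                continue
            pa = set(parents[child])
            new_pa = pa - {p} if p in pa else pa | {p}
            candidate = dict(parents)
            candidate[child] = tuple(sorted(new_pa))
            if creates_cycle(candidate):
                continue
            score = penalized_score(data, candidate, lam)
            if score > best_score:
                best_parents, best_score, improved = candidate, score, True
    return best_parents, improved
```

The outer loop from the slide simply calls this repeatedly until no change increases the score.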
Cutting Computational Costs • Can cache contingency tables for all possible one-arc changes to network structure • Only have to recompute scores associated with at most two nodes after arc added, removed, or reversed. • Prevents having to slog through the entire dataset recomputing score changes for every possible arc change when dataset changes. • However, kiss memory goodbye • Only a few network structure changes required after each iteration since dataset hasn’t changed much (one or two structural changes on average)
Evolution of Network Complexity • 100 bits; +1 added to evaluation function for every bit set to the parity of the previous K bits
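For reference, a sketch of the parity evaluation function described above (how the first K bits, which have too few predecessors, should be handled is not specified on the slide; this sketch simply skips them):

```python
def parity_eval(x, K):
    """+1 for every bit equal to the XOR (parity) of the previous K bits."""
    return sum(1 for i in range(K, len(x)) if x[i] == sum(x[i - K:i]) % 2)
```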
Summation Cancellation • Minimize magnitudes of cumulative sums of discretized numeric parameters (s1, …, sn) represented with standard binary encoding • [Results table: average value over 50 runs of best solution found in 2000 generations]
Bayesian Networks: Empirical Results Summary • Does better than Tree-based optimization algorithm on some toy problems • Significantly better on “Summation Cancellation” problem • 10% reduction in error on System of Linear Equation problems • Roughly the same as Tree-based algorithm on others, e.g. small Knapsack problems • Significantly more computation despite efficiency hacks, however. • Why not much better performance? • Too much emphasis on exploitation rather than exploration? • Steepest-ascent hillclimbing over network structures not good enough?
Using Probabilistic Models for Intelligent Restarts • Tree-based algorithm’s O(n^2) execution time per generation very expensive for large problems • Even more so for more complicated Bayesian networks • One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing
COMIT • Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]: • Initialize dataset D with bitstrings drawn from uniform distribution • Until termination criteria met: • Build optimal dependency tree T with which to model D. • Generate K bitstrings from the distribution represented by T. Evaluate them. • Execute a hillclimbing run starting from single best bitstring of these K. • Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run. • Return best bitstring ever evaluated
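Putting the pieces together, a sketch of the COMIT loop that reuses chow_liu_tree, estimate_cond_probs, and sample_from_tree from the earlier sketches. The hillclimb callback (assumed to return the bitstrings visited during its run) and the choice to overwrite the worst members of D are illustrative assumptions.

```python
import random

def comit(f, n, hillclimb, K=100, M=50, generations=50):
    """COMIT sketch: a dependency tree over D picks restart points for hillclimbing."""
    data = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    best = max(data, key=f)
    for _ in range(generations):
        parent = chow_liu_tree(data, n)                      # model D with a tree
        cond = estimate_cond_probs(data, parent)
        samples = [sample_from_tree(parent, cond, n) for _ in range(K)]
        start = max(samples, key=f)                          # best of the K samples
        visited = hillclimb(f, start)                        # bitstrings seen while climbing
        best = max([best, start] + visited, key=f)
        # Replace up to M of the worst members of D with the run's best bitstrings
        new = sorted(visited, key=f, reverse=True)[:M]
        data.sort(key=f)
        data[:len(new)] = new
    return best
```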
COMIT, cont’d • Empirical tests performed with stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting • Compare COMIT vs.: • Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution • Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution • Genetic algorithms?
COMIT: Example of Behavior • TSP domain: 100 cities, 700 bits • [Plot: tour length (×10^3) vs. evaluation number (×10^3)]
COMIT: Empirical Comparisons • Each number is average over at least 25 runs • Highlighted: better than each non-COMIT hillclimber with P > 95% • AHCxxx: pick best of xxx randomly generated starting points • COMITxxx: pick best of xxx starting points generated by tree
Summary • PBIL uses very simple probability distribution from which new solutions are generated, yet works surprisingly well • Algorithm using tree-based distributions seems to work even better, though at significantly more computational expense • More sophisticated networks: past the point of diminishing marginal returns? “Future research” • COMIT makes tree-based algorithm applicable to much larger problems
Future Work • Making algorithm based on complex Bayesian Networks more practical • Combine w/ simpler search algorithms, à la COMIT? • Applying COMIT to more interesting problems • WALKSAT? • Using COMIT to combine results of multiple search algorithms • Optimization in real-valued state spaces. What sorts of PDF representations might be useful? • Gaussians? • Kernel-based representations? • Hierarchical representations? • Hands off :-)
Acknowledgements • Shumeet Baluja • Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, …?