Probabilistic Modeling for Combinatorial Optimization
Scott Davies, School of Computer Science, Carnegie Mellon University
Joint work with Shumeet Baluja
Combinatorial Optimization • Maximize “evaluation function” f(x) • input: fixed-length bitstring x • output: real value • x might represent: • job shop schedules • TSP tours • discretized numeric values • etc. • Our focus: “Black Box” optimization • No domain-dependent heuristics • [Diagram: black box maps x = 1001001... to f(x) = 37.4]
Most Commonly Used Approaches • Hill-climbing, simulated annealing • Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit) • Typically make no attempt to model how particular bits affect solution quality • Genetic algorithms • Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions • Use crossover and mutation operators on population members to generate new candidate solutions
Using Explicit Probabilistic Models • Maintain an explicit probability distribution P from which we generate new candidate solutions • Initialize P to uniform distribution • Until termination criteria met: • Stochastically generate K candidate solutions from P • Evaluate them • Update P to make it more likely to generate solutions “similar” to the “good” solutions • Several different choices for what sorts of P to use and how to update it after candidate solution evaluation
Probability Distributions Over Bitstrings • Let x = (x1, x2, …, xn), where xi can take one of the values {0, 1} and n is the length of the bitstring. • Can factorize any distribution P(x1…xn) bit by bit: P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn | x1, …, xn-1) • In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings. • Obviously too many parameters to estimate from limited data!
Representing Independencies with Bayesian Networks • Graphical representation of probability distributions • Each variable is a vertex • Each variable’s probability distribution is conditioned only on its parents in the directed acyclic graph (“dag”) • [Example DAG: F = Wean on Fire, D = Ice Cream Truck Nearby, I = Fire Alarm Activated, A = Office Door Open, H = Hear Bells, with arcs F→I, F→A, D→H, I→H, A→H] • P(F,D,I,A,H) = P(F) · P(D) · P(I|F) · P(A|F) · P(H|D,I,A)
“Bayesian Networks” for Bitstring Optimization? • [Diagrams: fully connected network over x1 … xn vs. network with no edges] • P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn | x1, …, xn-1) • Yuck. Let’s just assume all the bits are independent instead. (For now.) • P(x1,…,xn) = P(x1) P(x2) P(x3)…P(xn) • Ahhh. Much better.
Population-Based Incremental Learning • Population-Based Incremental Learning (PBIL) [Baluja, 1995] • Maintains a vector of probabilities: one independent probability P(xi) for each bit xi. • Until termination criteria met: • Generate a population of K bitstrings from P • Evaluate them • Use the best M of the K to update P, with learning rate α: P(xi) ← (1 - α)·P(xi) + α if xi is set to 1, or P(xi) ← (1 - α)·P(xi) if xi is set to 0 • Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings • Return best bitstring ever evaluated
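As a concrete illustration of the loop above, here is a minimal Python sketch of PBIL. The evaluation function f, the settings K, M, alpha, and the number of generations are illustrative placeholders rather than values from the talk, and the optional negative update from the worst bitstring is omitted.

```python
import random

def pbil(f, n, K=100, M=2, alpha=0.1, generations=1000):
    """Minimal PBIL sketch: maintain one independent P(x_i = 1) per bit."""
    p = [0.5] * n                                   # start from the uniform distribution
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        # Generate a population of K bitstrings from the probability vector P
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)] for _ in range(K)]
        scored = sorted(((f(x), x) for x in pop), reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        # Move each P(x_i) toward the bit values of the best M samples
        for _, x in scored[:M]:
            for i in range(n):
                p[i] = (1 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val
```

For example, pbil(sum, n=20) maximizes the number of ones (the OneMax problem).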
PBIL vs. Discrete Learning Automata • Equivalent to a team of Discrete Learning Automata, one automaton per bit [Thathachar & Sastry, 1987] • Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions • PBIL update rule equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975] with “success” defined as “best in the bunch” • However, Discrete Learning Automata were previously used mainly on problems with few variables but noisy evaluation functions
PBIL vs. Genetic Algorithms • PBIL originated as tool for understanding GA behavior • Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993] • Regenerates P from scratch after every generation • All K used to update P, weighted according to probabilities that a GA would have selected them for reproduction • Why might normal GAs be better? • Implicitly capture inter-bit dependencies with population • However, because model is only implicit, crossover must be randomized. Also, limited population size often leads to premature convergence based on noise in samples.
Four Peaks Problem • Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging. • Given input vector X with N bits, and difficulty parameter T: • FourPeaks(T,X)=MAX(head(1,X), tail(0,X))+Bonus(T,X) • head(b,X) = # of contiguous leading bits in X set to b • tail(b,X) = # of contiguous trailing bits in X set to b • Bonus(T,X) = 100 if (head(1,X)>T) AND (tail(0,X) > T), or 0 otherwise
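A direct transcription of this definition into Python (a sketch; the bitstring is assumed to be a list of 0/1 integers):

```python
def head(b, x):
    """Number of contiguous leading bits of x equal to b."""
    n = 0
    for bit in x:
        if bit != b:
            break
        n += 1
    return n

def tail(b, x):
    """Number of contiguous trailing bits of x equal to b."""
    return head(b, x[::-1])

def four_peaks(T, x):
    """MAX(head(1,X), tail(0,X)) plus a bonus of 100 when both exceed T."""
    bonus = 100 if head(1, x) > T and tail(0, x) > T else 0
    return max(head(1, x), tail(0, x)) + bonus
```

For instance, on a 100-bit input four_peaks(10, [1]*12 + [0]*88) scores 88 + 100 = 188.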
Four Peaks Problem Results • 80 GA settings tried; best five shown here • Average over 50 runs
Large-Scale PBIL Empirical Comparison • See table in other slide
Modeling Inter-Bit Dependencies • How about automatically learning probability distributions in which at least some dependencies between variables are modeled? • Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability: P(Bi | D) ∝ P(D | Bi) · P(Bi)
Equations. (Mwuh hah hah hah!) • Maximize P(B | D) ∝ P(D | B) · P(B), with log P(D | B) = Σj Σi log P(di^j | πi^j) • Where: • d^j is the jth datapoint • di^j is the value assigned to xi by d^j • Πi is the set of xi’s parents in B • πi^j is the set of values assigned to Πi by d^j • P’ is the empirical probability distribution exhibited by D
Mmmm...Entropy. • Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P’. So: • Pick B to Maximize: Σj Σi log P’(di^j | πi^j) • Pick S to Maximize: -Σi H(xi | Πi), where H(xi | Πi) is the conditional entropy of xi given its parents under P’ (the two criteria differ only by the constant factor |D|)
Entropy Calculation Example • [Diagram: node x20 with parents x17 and x32, with a table of counts over 80 total datapoints] • x20’s contribution to score: -80 · H(x20 | x17, x32), computed from the empirical counts, where H(p, q) = -(p log p + q log q)
Single-Parent Tree-Shaped Networks • Now let’s allow each bit to be conditioned on at most one other bit. • [Diagram: a tree-shaped network over bits x17, x22, x5, x2, x13, x19, x6, x14, x8] • Adding an arc from xj to xi increases network score by: • H(xi) - H(xi | xj) (xj’s “information gain” with xi) • = H(xj) - H(xj | xi) (not necessarily obvious, but true) • = I(xi, xj), the mutual information between xi and xj
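Since the tree-learning step on the next slide scores candidate arcs by mutual information, here is a small sketch of estimating I(xi, xj) from a dataset of bitstrings (the list-of-bit-lists representation is an assumption for illustration):

```python
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information I(x_i, x_j) from a list of bit-lists."""
    n = len(data)
    joint = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi
```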
Optimal Single-Parent Tree-Shaped Networks • To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]: • Start with an arbitrary root node xr. • Until all n nodes have been added to the tree: • Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout). • Add xout to the tree with xin as its parent. • Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics)
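A sketch of this Prim-style construction, reusing mutual_information from the previous sketch. For clarity it recomputes mutual information inside the loop; caching the full pairwise matrix up front is what gives the O(n^2) time quoted on the slide.

```python
def chow_liu_tree(data, n):
    """Return a dict parent[i] (None for the root) forming the maximum
    spanning tree with mutual information as edge weights."""
    parent = {0: None}                 # arbitrary root: bit 0
    outside = set(range(1, n))
    while outside:
        best = None
        for x_in in parent:
            for x_out in outside:
                w = mutual_information(data, x_in, x_out)
                if best is None or w > best[0]:
                    best = (w, x_in, x_out)
        _, x_in, x_out = best
        parent[x_out] = x_in           # add x_out to the tree with x_in as its parent
        outside.remove(x_out)
    return parent
```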
Optimal Dependency Trees for Combinatorial Optimization [Baluja & Davies, 1997] • Start with a dataset D initialized from the uniform distribution • Until termination criteria met: • Build optimal dependency tree T with which to model D. • Generate K bitstrings from probability distribution represented by T. Evaluate them. • Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1. • Return best bitstring ever evaluated. • Running time: O(K·n + n^2) per iteration
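The “generate K bitstrings from T” step needs the conditional probabilities P(xi | parent(xi)) estimated from the dataset, followed by ancestral sampling. A sketch under the same list-of-bit-lists assumption, ignoring datapoint weights and using add-one smoothing (both illustrative simplifications):

```python
import random

def estimate_cond_probs(data, parent):
    """cond[i][v]: empirical P(x_i = 1 | parent bit = v); v is None for the root."""
    cond = {}
    for i, p in parent.items():
        cond[i] = {}
        for v in ([None] if p is None else [0, 1]):
            rows = data if p is None else [x for x in data if x[p] == v]
            ones = sum(x[i] for x in rows)
            cond[i][v] = (ones + 1) / (len(rows) + 2)   # add-one smoothing
    return cond

def sample_from_tree(parent, cond, n):
    """Ancestral sampling: sample each bit after its parent has been sampled."""
    order, seen = [], set()
    def visit(i):
        if i in seen:
            return
        seen.add(i)
        if parent[i] is not None:
            visit(parent[i])
        order.append(i)
    for i in range(n):
        visit(i)
    x = [0] * n
    for i in order:
        v = None if parent[i] is None else x[parent[i]]
        x[i] = 1 if random.random() < cond[i][v] else 0
    return x
```

In the full loop above, the tree is rebuilt each generation from the decayed dataset and the best M samples are added back into D.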
Tree-based Optimization vs. MIMIC • Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet, et al., 1997] • Learned chain-shaped networks rather than tree-shaped networks [Diagram: chain over x4, x17, x8, x5] • Dataset: best N% of all bitstrings ever evaluated • +: dataset has simple, well-defined interpretation • -: have to remember bitstrings • -: seems to converge too quickly on some larger problems
MIMIC dataset vs. Exp. Decay dataset • Solving a system of linear equations the hard way • Error of best solution as a function of generation #, averaged over 10 problems • 81 bits
Graph-Coloring Example • Noisy graph-coloring example • For each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5
Peaks problems results • Oops. Forgot this slide.
Checkerboard problem results • 16×16 grid of bits • For each bit in the middle 14×14 subgrid, add +1 to evaluation for each of its four neighbors set to the opposite value • Results: Genetic Algorithm (avg: 740), PBIL (avg: 742), Chain (avg: 760), Tree (avg: 776)
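A sketch of this evaluation function, assuming the 256 bits are laid out as a 16×16 list of lists:

```python
def checkerboard_eval(grid):
    """+1 for each of the four neighbors of every interior (14x14) bit
    that holds the opposite value."""
    score = 0
    for r in range(1, 15):
        for c in range(1, 15):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r + dr][c + dc] != grid[r][c]:
                    score += 1
    return score
```

A perfect checkerboard scores 14 × 14 × 4 = 784, which puts the averages above in context.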
Linear Equations Results • 81 bits again; 9 problems, each with results averaged over 25 runs • Minimize error
Modeling Higher-Order Dependencies • The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent. • What about finding the best network in which each node has at most K parents, for K > 1? • NP-complete problem! [Chickering, et al., 1995] • However, can use search heuristics such as hillclimbing to look for “good” network structures (e.g., [Heckerman, et al., 1995]).
Scoring Function for Arbitrary Networks • Rather than restricting K directly, add a penalty term -λ|B| to the scoring function to limit total network size (λ: penalty factor; |B|: number of parameters in B) • Equivalent to priors favoring simpler network structures • Alternatively, lends itself nicely to an MDL interpretation • Size of penalty controls exploration/exploitation tradeoff
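A sketch of this penalized score for a candidate structure, with the structure given as a dict mapping each bit to a tuple of parent indices and |B| counted as one Bernoulli parameter per parent configuration (these representation choices are assumptions for illustration):

```python
from collections import Counter
from math import log

def penalized_score(data, parents, lam):
    """Log-likelihood of data under the empirical conditional probabilities
    implied by `parents`, minus lam * (number of parameters)."""
    loglik, n_params = 0.0, 0
    for i, pa in parents.items():
        joint = Counter((tuple(x[j] for j in pa), x[i]) for x in data)
        marginal = Counter(tuple(x[j] for j in pa) for x in data)
        for (pvals, _), count in joint.items():
            loglik += count * log(count / marginal[pvals])
        n_params += 2 ** len(pa)       # one P(x_i = 1 | parent config) per configuration
    return loglik - lam * n_params
```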
Bayesian Network-Based Combinatorial Optimization • Initialize D with C bitstrings from uniform distribution, and Bayesian network B to empty network containing no edges • Until termination criteria met: • Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score: • Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score • Perform the change that maximizes the increase in score. • Set B to B’. • Generate and evaluate K bitstrings from B. • Decay weight of datapoints in D by α. • Add best M of the K recently generated datapoints to D. • Return best bitstring ever evaluated.
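A sketch of one steepest-ascent step over network structures, reusing penalized_score from the previous sketch. For brevity it tries only single-arc additions and removals (no reversals) and uses a simple DFS cycle check; the caching tricks on the next slide are omitted.

```python
def creates_cycle(parents):
    """True if following child -> parent links from some node returns to it."""
    state = {i: 0 for i in parents}            # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(i):
        state[i] = 1
        for p in parents[i]:
            if state[p] == 1 or (state[p] == 0 and dfs(p)):
                return True
        state[i] = 2
        return False
    return any(state[i] == 0 and dfs(i) for i in parents)

def hillclimb_step(data, parents, lam):
    """Try every single-arc addition/removal; return the best improving structure."""
    best_parents, best_score = parents, penalized_score(data, parents, lam)
    improved = False
    for child in parents:
        for p in parents:
            if p == child:
                continue
            pa = set(parents[child])
            new_pa = pa - {p} if p in pa else pa | {p}
            candidate = dict(parents)
            candidate[child] = tuple(sorted(new_pa))
            if creates_cycle(candidate):
                continue
            score = penalized_score(data, candidate, lam)
            if score > best_score:
                best_parents, best_score, improved = candidate, score, True
    return best_parents, improved
```

The outer loop from the slide simply calls this repeatedly until no change increases the score.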
Cutting Computational Costs • Can cache contingency tables for all possible one-arc changes to network structure • Only have to recompute scores associated with at most two nodes after arc added, removed, or reversed. • Prevents having to slog through the entire dataset recomputing score changes for every possible arc change when dataset changes. • However, kiss memory goodbye • Only a few network structure changes required after each iteration since dataset hasn’t changed much (one or two structural changes on average)
Evolution of Network Complexity • 100 bits; +1 added to evaluation function for every bit set to the parity of the previous K bits
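For reference, a sketch of the parity evaluation function described above (how the first K bits, which have too few predecessors, should be handled is not specified on the slide; this sketch simply skips them):

```python
def parity_eval(x, K):
    """+1 for every bit equal to the XOR (parity) of the previous K bits."""
    return sum(1 for i in range(K, len(x)) if x[i] == sum(x[i - K:i]) % 2)
```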
Summation Cancellation • Minimize magnitudes of cumulative sums of discretized numeric parameters (s1, …, sn) represented with standard binary encoding • [Results table: average value over 50 runs of best solution found in 2000 generations]
Bayesian Networks: Empirical Results Summary • Does better than Tree-based optimization algorithm on some toy problems • Significantly better on “Summation Cancellation” problem • 10% reduction in error on System of Linear Equation problems • Roughly the same as Tree-based algorithm on others, e.g. small Knapsack problems • Significantly more computation despite efficiency hacks, however. • Why not much better performance? • Too much emphasis on exploitation rather than exploration? • Steepest-ascent hillclimbing over network structures not good enough?
Using Probabilistic Models for Intelligent Restarts • Tree-based algorithm’s O(n^2) execution time per generation very expensive for large problems • Even more so for more complicated Bayesian networks • One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing
COMIT • Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]: • Initialize dataset D with bitstrings drawn from uniform distribution • Until termination criteria met: • Build optimal dependency tree T with which to model D. • Generate K bitstrings from the distribution represented by T. Evaluate them. • Execute a hillclimbing run starting from single best bitstring of these K. • Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run. • Return best bitstring ever evaluated
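Putting the pieces together, a sketch of the COMIT loop that reuses chow_liu_tree, estimate_cond_probs, and sample_from_tree from the earlier sketches. The hillclimb callback (assumed to return the bitstrings visited during its run) and the choice to overwrite the worst members of D are illustrative assumptions.

```python
import random

def comit(f, n, hillclimb, K=100, M=50, generations=50):
    """COMIT sketch: a dependency tree over D picks restart points for hillclimbing."""
    data = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    best = max(data, key=f)
    for _ in range(generations):
        parent = chow_liu_tree(data, n)                      # model D with a tree
        cond = estimate_cond_probs(data, parent)
        samples = [sample_from_tree(parent, cond, n) for _ in range(K)]
        start = max(samples, key=f)                          # best of the K samples
        visited = hillclimb(f, start)                        # bitstrings seen while climbing
        best = max([best, start] + visited, key=f)
        # Replace up to M of the worst members of D with the run's best bitstrings
        new = sorted(visited, key=f, reverse=True)[:M]
        data.sort(key=f)
        data[:len(new)] = new
    return best
```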
COMIT, cont’d • Empirical tests performed with stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting • Compare COMIT vs.: • Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution • Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution • Genetic algorithms?
COMIT: Example of Behavior • TSP domain: 100 cities, 700 bits • [Plot: tour length (×10^3) vs. evaluation number (×10^3)]
COMIT: Empirical Comparisons • Each number is average over at least 25 runs • Highlighted: better than each non-COMIT hillclimber with P > 95% • AHCxxx: pick best of xxx randomly generated starting points • COMITxxx: pick best of xxx starting points generated by tree
Summary • PBIL uses very simple probability distribution from which new solutions are generated, yet works surprisingly well • Algorithm using tree-based distributions seems to work even better, though at significantly more computational expense • More sophisticated networks: past the point of diminishing marginal returns? “Future research” • COMIT makes tree-based algorithm applicable to much larger problems
Future Work • Making algorithm based on complex Bayesian Networks more practical • Combine w/ simpler search algorithms, à la COMIT? • Applying COMIT to more interesting problems • WALKSAT? • Using COMIT to combine results of multiple search algorithms • Optimization in real-valued state spaces. What sorts of PDF representations might be useful? • Gaussians? • Kernel-based representations? • Hierarchical representations? • Hands off :-)
Acknowledgements • Shumeet Baluja • Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, …?