240 likes | 379 Views
Effective Heuristics for NP-Hard Problems Arising in Molecular Biology. Richard M. Karp Bangalore, January 5, 2011. NP-Hard Problems. The P vs. NP problem: Is finding a solution to a combinatorial search problem as easy as checking a solution. The answer is expected to be “No.”
E N D
Effective Heuristics for NP-Hard Problems Arising in Molecular Biology Richard M. Karp Bangalore, January 5, 2011
NP-Hard Problems • The P vs. NP problem: Is finding a solution to a combinatorial search problem as easy as checking a solution. The answer is expected to be “No.” • NP-Hard Problems: Solvable in polynomial time only if P=NP. • General belief: Solving an NP-hard problem requires worst case exponential time.
Understanding NP-Hard ProblemsThrough Worst-Case Analysis • Exact solution methods: exponential running time in worst case. • Polynomial-time approximation algorithms for optimization problems, yielding a worst-case upper bound of the ratio between the cost of an approximate solution and the cost of optimal solution. Unfortunately, these guaranteed approximation ratios are unrealistically high.
Probabilistic Analysis and Heuristics • In probabilistic analysis problem instances are drawn from simple probability distributions. Often one can prove excellent performance on the average. However, the probability distributions may not correspond to real-life instances. • Heuristics are typically evaluated empirically on examples drawn from, or representative of. real-life instances. Heuristics are often “unreasonably effective,” for reasons not well understood.
Famous Unreasonably Effective Heuristics • Large traveling-salesman problems can be solved by quick tour construction methods, local improvement methods or cutting plane methods. • Local improvement methods find near-optimal solutions to graph bisection problems. • Huge satisfiability problems are routinely solved rapidly by branch-and-bound methods. • The greedy set cover algorithm typically gives solutions within a few percent of optimal.
NP-Hard Problems Arising in Molecular Biology and Genetics • Genome Sequencing • Global alignment of multiple genomes • Identifying siblings, cousins, second cousins etc. through comparison of genomes • Finding protein modules containing specified types of proteins • Computational discovery of dysregulated pathways in human diseases
Patterns of Inheritance • In each region of the genome, each individual has two haplotypes, one inherited from each parent. A haplotype is a sequence of alleles. • The haplotype inherited from a parent is a mosaic of segments inherited from the parent’s two haplotypes. Recombination occurs at the boundaries between segments. • In a pedigree graph the vertices are individuals and the edges represent parent-child relations.
Reconstructing Pedigrees • Given the haplotypes of individuals in the current generation, we wish to reconstruct the pedigree that gave rise to that generation and chart the flow of alleles.
Assumptions of a Generative Model • Monogamy • Layered structure: each individual and its mate lie in generation g, have parents in generation g-1, and children in generation g+1. Generation 1 is the founding generation. • The number of children of each couple is drawn from a Poisson distribution with mean 2. • In each haplotype, sites of recombination occur according to a Poisson process with known rate.
Working Backwards • We construct the pedigree generation by generation, working backwards from the current generation. • It suffices to determine, in each generation, which individuals are siblings. • Two alleles are identical by descent (IBD) if they are inherited from the same allele in the founding generation. • To test whether two individuals in generation g are likely to be siblings, we observe the amount of IBD between their descendants in the current generation
Inferring Siblinghood • Problem: determine which individuals in generation g are siblings. • Using IBD, we construct a compatibility graph with a vertex for each individual in generation g, and edges indicating pairs of individuals that are likely to be siblings on the basis of the IBD of their descendants. • Problem: Infer the siblinghood graph from the compatibility graph.
Inferring Siblinghood • Because of the monogamy assumption, the siblinghood graph must be a union of cliques. • Problem: Given a compatibility graph C determine the “closest” siblinghood graph S. • The algorithm maintains a partition of the vertices of C. The parts of the partition are called quasi-cliques. The score of a partition is A times the number of edges of C whose end points lie in the same quasi-clique, minus the number of non-edges of C whose end points lie in the same in the same quasi-clique. We seek a partition of maximum score.
Justifying the Scoring Function • Assumptions: The compatibility graph C is obtained by randomly perturbing the siblinghood graph S. S is a random union of disjoint cliques with sizes uniformly distributed between 1 and a parameter t. • If u is adjacent to v in S then u is adjacent to v in C with probability p; if u is not adjacent to v in S then u is adjacent to v in C with probability q, where q <p. • Under these assumptions maximizing the score produces a siblinghood graph of maximum conditional probability given C.
Heuristic Algorithm • The heuristic algorithm creates an initial partition by greedily constructing disjoint quasi-cliques. It then performs the following local operations to improve the score: Move a vertex; Extract a vertex; Split a quasi-clique; Merge two quasi-cliques; Restructure two quasi-cliques adjacent to a vertex v; Dynamic Programming: given a chain of quasi-cliques, make an optimal simultaneous move of a small set of vertices from each quasi-clique in the chain to its successor quasi-clique.
Performance of the Algorithm Typically the algorithm produces a partition with a slightly higher score than the “true” partition from which the compatibility graph was generated by perturbation. However, the fraction of vertices placed in the “correct” partition lies between 93% and 98%, depending on the fraction of edges deleted from cliques and the fraction of edges added between cliques in creating the compatibility graph C from the siblinghood graph S.
The Colorful Subgraph Problem • Input: A graph G and an assignment of a color to each vertex. • Find, if one exists, a connected subgraph H containing exactly one vertex of each color. • Optimization version: Minimize a x (number of extra vertices) + b x (number of omitted colors) • The problem is NP-hard, even on planar graphs.
Interpretation • In the protein-protein interaction (PPI) graph of a species the vertices represent proteins and the edges represent pairs of physically interacting proteins. • Given a connected set X of proteins performing a regulatory function in species A, we seek a similar connected set of proteins in species B. The color of each protein in species B indicates its similarity to a particular protein in X.
Dynamic Programming • For each vertex v and set of colors S, determine whether there is a tree containing exactly one vertex of each color in S, no vertices of any other color, and containing vertex v. The computation is recursive, running through sets S in order of increasing cardinality. The running time is of order n3k where n is the number of vertices and k is the number of colors.
Integer Programming plus Constraint Generation • We may assume that the desired connected subgraph is a tree T • Variables: x(i)= 1 iff vertex i is included in T y(e) = 1 iff edge e is included in T • Constraints: Exactly one vertex of each color is included; Exactly n-1 edges are included in T; If an edge is included then its endpoints are included; For each set of colors X, the number of edges of T connecting two vertices in X is at most |X| -1.
Performance • An implementation of integer programming plus constraint generation solves typical instances with 100 vertices in less than a minute. • Using a heuristic not yet implemented, one can solve typical instances with 100 vertices by hand in 20-30 minutes.
Heuristic Algorithmic Strategy • Repeat: (1) Delete vertices with frequent colors; (2) In the remaining graph, select a minimal set of connected components covering all infrequent colors; (3) Insert minimal set of vertices with frequent colors to restore connectedness and cover all colors.
Example on Grid W E L C O M E T O T H E W E B S I T E O F T H E A N N U A L S Y MP O S I U M O N C O MB I N A T O R I A L P A T T E R NMA T
After One Iteration W e B S F t H N N U a L S Y P S I U N C o m B I N R I L P R N
After Two Iterations W E B F T H n A l s Y P I u C O M B