Flowers, Bees, and Algorithms: Adventures in Cophylogenetics Ran Libeskind-Hadas Department of Computer Science Harvey Mudd College Joint work with: Mike Charleston (Univ. of Sydney) Chris Conow (USC) Ben Cousins (Clemson) Daniel Fielder (HMC) John Peebles (HMC) Tselil Schramm (HMC) Anak Yodpinyanee (HMC)
Integrated CS/Bio Course Send e-mail to: ran@cs.hmc.edu
Overview • A 75-minute “research lecture” to first-year students in our CS/Bio intro course • Show first-year students that what they’ve learned is relevant to current research • Showcase research done with senior students • What have they done so far? • Biology: Genes, alignment, phylogenetic trees, RNA folding • CS: Programming, recursion, “memoization”
Specifically… • Pairwise global alignment and RNA folding • Why you should care • Designed and implemented recursive solutions • Why are they slow? • How do we make them faster? • “Memoization” idea • Wow, that’s fast! (but no actual analysis yet) • Designed and implemented “memoized” versions • Used their implementations to investigate questions Around 10 lines of Python code!
Specifically… • Phylogenetic trees • Why you should care • Implemented simple algorithm (e.g. UPGMA) • Used their implementation to answer questions… • Existence and relative merits of other algorithms (mention maximum likelihood… but it’s slow!)
Actual 75-minute lecture starts here! (Also a chapter in new B4B) Cophylogenetics “I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.” Charles Darwin, The Origin of Species
Obligate Mutualism of Figs and Fig Wasps ovipositor From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
The Cophylogeny Problem From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259
Indigobirds and Finches • High level of host specificity (e.g. mouth markings) www.indigobirds.com
The Question… Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution? This seems to be a “hard” problem!
Measuring the “Hardness” of Computational Problems There are three kinds of problems… Easy Hard Impossible!
“Easy” Problems • Sorting a list of n numbers: [42, 3, 17, 26, … , 100] • Multiplying two n x n matrices [Figure: two n x n matrices of numbers multiplied to give an n x n result]
Global Alignment is “easy”! • Reminder of 2^n running time of alignment • Informally motivate n^2 running time of memoized version
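The contrast between the two running times can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the scoring values `MATCH`, `MISMATCH`, and `GAP` and the function names are assumptions chosen for the example. Both functions use the same recurrence; the second one caches each (i, j) subproblem, so it solves at most (len(s) + 1) x (len(t) + 1) of them instead of an exponential number.

```python
from functools import lru_cache

# Illustrative scoring values (assumptions, not taken from the lecture).
MATCH, MISMATCH, GAP = 1, -1, -2


def align_slow(s, t):
    """Best global alignment score of s and t by plain recursion (exponential time)."""
    if not s:
        return GAP * len(t)
    if not t:
        return GAP * len(s)
    diag = align_slow(s[1:], t[1:]) + (MATCH if s[0] == t[0] else MISMATCH)
    return max(diag,
               align_slow(s[1:], t) + GAP,   # gap in t
               align_slow(s, t[1:]) + GAP)   # gap in s


def align_fast(s, t):
    """Same recurrence, but each (i, j) subproblem is solved only once."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(s):
            return GAP * (len(t) - j)
        if j == len(t):
            return GAP * (len(s) - i)
        diag = best(i + 1, j + 1) + (MATCH if s[i] == t[j] else MISMATCH)
        return max(diag, best(i + 1, j) + GAP, best(i, j + 1) + GAP)

    return best(0, 0)
```

Calling `align_fast("GATTACA", "GCATGCU")` returns the same score as `align_slow` on the same inputs; the difference only becomes visible as the strings grow.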
“Hard” Problems Snowplows of Northern Minnesota Burrsburg Frostbite City Tundratown Shiversville Freezeapolis Brute-force? Greed?
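A small sketch of the two strategies named on the slide. It assumes the snowplow puzzle asks for a shortest round trip through all five towns (a travelling-salesperson-style reading; the slide does not state the exact rules), and the distances below are hypothetical numbers invented purely for illustration. `brute_force` tries every ordering, while `greedy` always drives to the nearest unvisited town.

```python
from itertools import permutations

TOWNS = ["Burrsburg", "Frostbite City", "Tundratown", "Shiversville", "Freezeapolis"]

# HYPOTHETICAL symmetric distances, invented for this example only.
_D = [
    [0, 4, 9, 3, 20],
    [4, 0, 5, 6, 7],
    [9, 5, 0, 10, 2],
    [3, 6, 10, 0, 11],
    [20, 7, 2, 11, 0],
]
DIST = {(a, b): _D[i][j] for i, a in enumerate(TOWNS) for j, b in enumerate(TOWNS)}


def tour_length(tour):
    """Length of the round trip that visits the towns in order and returns home."""
    return sum(DIST[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def brute_force(towns):
    """Try every ordering of the remaining towns: (n - 1)! tours."""
    start, rest = towns[0], towns[1:]
    return min(([start, *p] for p in permutations(rest)), key=tour_length)


def greedy(towns):
    """Always go to the nearest town not yet visited."""
    tour, left = [towns[0]], list(towns[1:])
    while left:
        nxt = min(left, key=lambda t: DIST[tour[-1], t])
        tour.append(nxt)
        left.remove(nxt)
    return tour
```

With these made-up distances the greedy tour is longer than the brute-force optimum, which is the usual trade-off: greed is fast but can be wrong, and brute force is right but its work grows factorially with the number of towns.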
n^2 versus 2^n
The Ran-O-Matic performs 10^9 operations/sec

        n = 10             n = 30            n = 50             n = 70
n^2     100, < 1 sec       900, < 1 sec      2500, < 1 sec      4900, < 1 sec
2^n     1024, < 1 sec      10^9, 1 sec       10^15, 13 days     10^21, about 37 thousand years

Computers double in speed every 2 years. Let’s just wait 10 years! That is 5 doublings, a factor of 32: about 37 thousand years -> still more than 1,000 years!
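The arithmetic behind this comparison can be checked directly. The sketch below assumes the slide's machine speed of 10^9 operations per second and the "doubles every 2 years" rule; the function names are illustrative.

```python
OPS_PER_SEC = 10**9                      # speed of the "Ran-O-Matic" from the slide
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(operations, ops_per_sec=OPS_PER_SEC):
    """Time in seconds to perform the given number of operations."""
    return operations / ops_per_sec


def speedup_after(years_waited, doubling_period=2):
    """Speed-up factor if machines double in speed every `doubling_period` years."""
    return 2 ** (years_waited / doubling_period)


def table(ns=(10, 30, 50, 70)):
    """Rows of (n, seconds for n^2 operations, seconds for 2^n operations)."""
    return [(n, seconds_needed(n**2), seconds_needed(2**n)) for n in ns]


for n, quadratic, exponential in table():
    print(f"n = {n:2d}   n^2: {quadratic:.1e} s   2^n: {exponential:.1e} s")
```

For n = 70 the exponential algorithm needs about 1.18 x 10^12 seconds. Dividing by `speedup_after(10)`, which is 32, shows how little ten years of faster hardware helps.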
Snowplows and Travelling Salesperson Revisited! NP-complete problems: • Travelling Salesperson Problem • Snowplow Problem • Protein Folding • Multiple sequence alignment • Phylogenetic trees by maximum likelihood • Tens of thousands of other known problems go in this cloud!!
“I can’t find an efficient algorithm. I guess I’m too dumb.” Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
“I can’t find an efficient algorithm because no such algorithm is possible!” Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
“I can’t find an efficient algorithm, but neither can all these famous people.” Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
$1 million Vinay Deolalikar
Coping with NP-completeness… • Brute force • Ad hoc Heuristics • Meta heuristics • Approximation algorithms
Obligate Mutualism of Figs and Fig Wasps ovipositor From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
The Cophylogeny Problem… [Figure: a host tree and a parasite tree, with nodes labeled a through e and the tip associations drawn between them]
Input / Possible Solutions [Figure: panels titled Input and Possible Solutions, each showing nodes a, b, c, d, e]
Event Cost Model: cospeciation [Figure: the two trees with the cospeciation events marked]
Event Cost Model: duplication [Figure: the two trees with a duplication event marked]
Event Cost Model: host-switch [Figure: the two trees with a host-switch event marked]
Event Cost Model: loss [Figure: the two trees with the loss events marked]
Event Cost Model [Figure: two candidate solutions on the same trees, annotated with their events] • One solution: Cost = cospeciation + host-switch + loss • The other: Cost = duplication + cospeciation + 3 * loss
Some typical costs • cospeciation: + 0 • duplication: + 2 • loss: + 2 • host-switch: + 3 • With these values, cospeciation + host-switch + loss gives Cost = 5, and duplication + cospeciation + 3 * loss gives Cost = 8
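The totals on this slide can be reproduced with a few lines of Python. The cost table below is a reading of the slide's annotations (cospeciation 0, duplication 2, loss 2, host-switch 3); the function name and the event lists are illustrative, built from the two cost formulas on the previous slide.

```python
from collections import Counter

# Event costs as read from the slide's annotations.
COSTS = {"cospeciation": 0, "duplication": 2, "loss": 2, "host-switch": 3}


def reconciliation_cost(events, costs=COSTS):
    """Total cost of a solution, given the list of events it uses."""
    counts = Counter(events)
    return sum(costs[event] * k for event, k in counts.items())


# The two candidate solutions, written as lists of events.
solution_a = ["cospeciation", "host-switch", "loss"]
solution_b = ["duplication", "cospeciation", "loss", "loss", "loss"]

print(reconciliation_cost(solution_a))  # 0 + 3 + 2
print(reconciliation_cost(solution_b))  # 2 + 0 + 3 * 2
```

Changing the numbers in `COSTS` can change which solution is cheaper, which is why the choice of event costs matters when judging how plausible a reconstruction is.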
This problem is hard! • How hard? NP-complete! (Joint work with Charleston, Ovadia, Conow, Fielder) • The host-switches are the culprits [Figure: tree fragment with nodes e, f, g, h]
A Metaheuristic Approach • Fix a timing • We can solve the problem optimally for a given timing using Dynamic Programming (Memoization)
Dynamic Programming Compute Cost[a,su,2] [Figure: host tree with nodes r, s, t, u, v, w, x, y drawn against times t = 0 to t = 4, and parasite nodes a, b, c. The placement shown uses one host-switch and two losses and leads to the subproblems Cost[b,tw,3] and Cost[c,uy,4].]
Dynamic Programming Candidate for Cost[a,su,2]: Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch
Dynamic Programming Running Time • O(n^3) cells to fill in • O(n^2) positions for first child • O(n^2) positions for second child • O(n) to count #losses from each child, but this is precomputable • O(n^3 x (n^2 x n^2)) = O(n^7) total • Can be improved to O(n^3)