A Step Toward Barcoding Life I: A Model Based, Decision Theoretic Method to Assign Genes to Pre-existing Species Groups. Zaid Abdo and G. Brian Golding, McMaster University, Hamilton, Ontario, Canada.
Road Map • Introduction • Model & Implementation • Validation • Results • Conclusions • Future Work
Introduction • In barcoding we simply: (1) find a universally available DNA sequence, (2) archive it, and (3) compare against the archive.
Introduction • The last aspect described involves assigning newly sampled individuals to existing groups. • Our current research is concerned with performing this task accurately and quickly. • To accomplish this we developed a model-based, decision-theoretic framework built on the coalescent.
Introduction • Model based methods capitalize on understanding the process governing the system under study and result in more informative and powerful tools to analyze data generated from such systems.
Road Map • Introduction • Model & Implementation • Validation • Results • Conclusions • Future Work
Models & Methods Motivation • Assignment is a decision-making problem. • Want to make an informed decision to correctly assign an individual without knowing its true origin. • In particular: want to assign an individual, x, to the group i that has the minimum posterior risk, R_i, of assignment.
Models & Methods Risk • Posterior risk is defined to be the expected loss: $R_i = \sum_{k=1}^{K} L(i,k)\,P(k \mid x, D)$, where $L(i,k)$ is a loss function that quantifies the error of assignment and $P(k \mid x, D)$ is the posterior probability that x originated from group k given x and the observed data, D.
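The expected-loss definition of posterior risk can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the loss matrix, and the posterior values below are all hypothetical, and the loss is treated as an arbitrary K-by-K table.

```python
def posterior_risks(loss, posterior):
    """Return the expected loss of assigning x to each candidate group i.

    loss[i][k]   : hypothetical loss of choosing group i when k is the true origin
    posterior[k] : posterior probability that x originated from group k
    """
    n_groups = len(posterior)
    return [
        sum(loss[i][k] * posterior[k] for k in range(n_groups))
        for i in range(n_groups)
    ]


def assign_min_risk(loss, posterior):
    """Pick the group with the minimum posterior risk."""
    risks = posterior_risks(loss, posterior)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Made-up numbers for three groups, for illustration only.
loss = [
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.3],
    [0.4, 0.3, 0.0],
]
posterior = [0.7, 0.2, 0.1]
best, risks = assign_min_risk(loss, posterior)
```

With these made-up inputs the first group carries the smallest expected loss, so it is the one chosen.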
Models & Methods Loss • Any loss function can work. • We chose a distance measure built from three quantities: the % distance from the consensus, the length of the sequence, and the sequence data forming group k.
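One ingredient named on this slide, the % distance of a sequence from a group consensus, can be sketched as follows. This does not reproduce the loss function itself; the majority-rule consensus and the helper names are assumptions made for illustration, and sequences are assumed to be aligned to equal length.

```python
from collections import Counter


def consensus(seqs):
    """Majority-rule consensus of aligned, equal-length sequences.

    Assumption: ties are broken by whichever base was seen first in the column.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def percent_distance(x, reference):
    """Percentage of aligned sites at which x differs from the reference."""
    if len(x) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(x, reference))
    return 100.0 * mismatches / len(x)


# Toy data, for illustration only.
group_k = ["ACGT", "ACGA", "ACGT"]
group_consensus = consensus(group_k)
distance = percent_distance("ACCT", group_consensus)
```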
Models & Methods Posterior • The posterior can be rewritten using Bayes’ rule as follows: $P(k \mid x, D) = \dfrac{P(x, D \mid k)\,P(k)}{\sum_{j=1}^{K} P(x, D \mid j)\,P(j)}$ • Evaluating this requires knowledge of the phylogeny in addition to within-group population genetic behavior. • Which makes life very hard.
Models & Methods Assumptions • To overcome this we made the following simplifying assumptions: (1) Groups, or populations, have been separated for long enough that they evolve independently. (2) Within-group evolution follows a Wright-Fisher neutral model of evolution with no migration or recombination. • This allowed us to use the neutral coalescent to model within-group evolution.
Models & Methods More Assumptions • We also assumed that we know all there is to know about the evolutionary process within each group, i.e. we know, for certain, the vector of evolutionary parameters governing within-group evolution.
Models & Methods Posterior Again • The posterior reduces to:
Models & Methods Last Assumptions • Last two simplifying assumptions, I promise: First, we assumed a uniform prior on the groups. Second, we assumed that knowing that group k is the origin of x does not affect the likelihood of the observed data in k.
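Under a uniform prior on the groups, Bayes' rule makes the posterior proportional to each group's likelihood term, so normalizing those terms is all that remains. The sketch below assumes a hypothetical list of per-group log-likelihood scores for x (how they are computed is outside this sketch) and normalizes them stably in log space.

```python
import math


def posterior_from_loglik(logliks):
    """Turn per-group log-likelihoods into posterior probabilities.

    Assumes a uniform prior over groups, so the prior cancels out.
    The maximum is subtracted before exponentiating to avoid underflow.
    """
    largest = max(logliks)
    weights = [math.exp(value - largest) for value in logliks]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log-likelihoods for two groups with a 3:1 likelihood ratio.
post = posterior_from_loglik([math.log(3.0), math.log(1.0)])
```

Working in log space matters here because sequence likelihoods are typically far too small to exponentiate directly.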
Models & Methods Final Posterior • Hence, our posterior reduces to:
Models & Methods Final Risk • And the risk becomes:
Models & Methods Doubt • We were in doubt when R_i > 0.25/K, where 0.25 is the proportion of nucleotides expected to match by chance between two different sequences and K is the total number of groups. • Being in doubt highlights the need to consider the possibility that x was misplaced.
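The doubt rule is a simple threshold on the risk of the chosen group. A minimal sketch, with a hypothetical function name and the 0.25 chance-match proportion exposed as a parameter:

```python
def in_doubt(risk, n_groups, chance_match=0.25):
    """Flag an assignment as doubtful when its risk exceeds chance_match / K."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    return risk > chance_match / n_groups


# With K = 10 groups the threshold is 0.25 / 10 = 0.025.
flag_high = in_doubt(0.03, 10)
flag_low = in_doubt(0.01, 10)
```

Because the threshold shrinks as K grows, the same risk value is more likely to be flagged when there are many candidate groups.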
Models & Methods Implementation • Implementation was straightforward when noting that: and that: Genealogy
Models & Methods Implementation • Approximated the likelihood by replacing the θ’s with their maximum likelihood estimates (MLEs). And, Genealogy
Models & Methods Implementation • Used FLUCTUATE (Kuhner et al. 1998) to find the MLEs. • Used an approach similar to Kuhner et al. (1995, 1998) and Raftery (1996) to construct an MCMC sampler that calculates the likelihoods directly. • In both, we used the F81 model for the mutation process.
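For readers unfamiliar with F81, its transition probabilities have a closed form. The sketch below is the textbook formula, not code from FLUCTUATE or from the authors; the function name is hypothetical, and the branch length is assumed to be measured in expected substitutions per site.

```python
import math


def f81_transition_probs(pi, branch_length):
    """F81 transition probabilities P[i][j] over a branch.

    pi            : dict of equilibrium base frequencies summing to 1
    branch_length : expected substitutions per site along the branch
    """
    if abs(sum(pi.values()) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    # Scaling factor so that branch_length is in expected substitutions per site.
    beta = 1.0 / (1.0 - sum(p * p for p in pi.values()))
    decay = math.exp(-beta * branch_length)
    return {
        i: {j: (decay if i == j else 0.0) + pi[j] * (1.0 - decay) for j in pi}
        for i in pi
    }


# Illustrative base frequencies (assumed, not taken from the slides).
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
probs = f81_transition_probs(freqs, 0.1)
```

With equal base frequencies the same formula reduces to the Jukes-Cantor model, which is a convenient sanity check.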
Road Map • Introduction • Model & Implementation • Validation • Results • Conclusions • Future Work
Validation preview • Utilized both simulated data and real data. • Compared to a distance based method where the “correct” group of assignment, k, was one minimizing: • We were in doubt when shortest distance was the same for more than 1 sequence; when > 1 sequence had distance < 3%; and when shortest distance was > 3%.
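The distance-based comparison and its three doubt conditions can be sketched as below. Several things are assumed here: the distance is taken to be the proportion of mismatching sites between the query and each reference sequence, the query is assigned to the group of the nearest sequence, and "more than 1 sequence" is read literally (one could instead count only sequences from different groups). Function names are hypothetical.

```python
def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_assign(query, groups, threshold=0.03):
    """Assign query to the group of its nearest sequence and report doubt.

    groups : dict mapping group name -> list of aligned sequences
    Doubt is raised when the shortest distance is shared by more than one
    sequence, when more than one sequence lies under the threshold, or when
    the shortest distance exceeds the threshold.
    """
    scored = [
        (p_distance(query, seq), name)
        for name, seqs in groups.items()
        for seq in seqs
    ]
    shortest, assigned = min(scored, key=lambda pair: pair[0])
    n_tied = sum(1 for d, _ in scored if d == shortest)
    n_close = sum(1 for d, _ in scored if d < threshold)
    doubt = n_tied > 1 or n_close > 1 or shortest > threshold
    return assigned, doubt


# Toy example: one identical reference and one distant reference.
query = "A" * 100
clear_case = distance_assign(query, {"g1": ["A" * 100], "g2": ["A" * 90 + "C" * 10]})
```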
Validation Simulation • Simulation allows control over the parameters of the evolutionary process. • This results in a simplification of this process, yet provides the advantage of knowing the truth.
Validation Simulation • Relaxed the assumption of independence of groups by using a phylogeny to govern the history of group evolution. • Used different degrees of phylogenetic evolution, ν, crossed with different levels of within-group evolution, θ, to create multiple levels of between-group overlap. • Used group sizes of 5, 10, and 25.
Validation Simulation Parameters used in the simulation process
Validation Simulation • Used GTR as the model of evolution. • An extra individual was generated within each group, only one of which was chosen to be reassigned. • Generated 100 replicate data sets for each combination of θ, sample size, and phylogenetic divergence, ν.
Validation Real Data • Real data provided a means to evaluate performance when θ differed between groups. • Used neotropical skipper butterfly (Astraptes fulgerator) data to validate our method. • These data are well characterized and have been suggested, on the basis of barcodes, to represent 10 sympatric species.
Validation Real Data Caterpillars of 10 species in the Astraptes fulgerator complex from the Area de Conservación Guanacaste. Interim names reflect the primary larval food plant and, in some cases, a color character of the adult. (Hebert et al. 2004)
Validation Real Data • Drew one “query” sequence at random from all available sequences. • Groups were reconstructed in the absence of this removed sequence. • Applied our method to reassign the query.
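The hold-one-out step described here can be sketched as follows. The data structure (a dict from group name to a list of sequences) and the function name are assumptions for illustration; the random generator is passed in so a run can be reproduced.

```python
import random


def hold_out_query(groups, rng):
    """Draw one query at random and rebuild the groups without it.

    groups : dict mapping group name -> list of sequences
    rng    : a random.Random instance
    Returns (query, true_group, reduced_groups); the input is not modified.
    """
    labelled = [
        (name, idx) for name, seqs in groups.items() for idx in range(len(seqs))
    ]
    name, idx = rng.choice(labelled)
    query = groups[name][idx]
    reduced = {
        g: [s for j, s in enumerate(seqs) if not (g == name and j == idx)]
        for g, seqs in groups.items()
    }
    return query, name, reduced


# Placeholder sequence labels, for illustration only.
data = {"a": ["s1", "s2"], "b": ["s3"]}
query, origin, reduced = hold_out_query(data, random.Random(0))
```

Note that removing the only member of a group leaves that group empty, a case any downstream assigner would need to handle.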
Validation Comparison • In comparing the two methods we combined both “correct” assignment and “correct” assignment with doubt. • This favored the distance method because the distance measure is limited to a certain range: between-group variation must not be so small that x is close to all groups, and within-group variation must not be so large that x is far from all groups, including its origin.
Road Map • Introduction • Model & Implementation • Validation • Results • Conclusions • Future Work
Results Simulation • Our method performed better than the distance method, especially at sample sizes 5 and 10 and at θ = 0.01 and 0.1, where within-group evolution is on the same level as, or a bit faster than, that between groups. • The power of our method dropped much more slowly than that of the distance method as θ increased. • Both methods clearly improved as the divergence of the phylogeny, ν, increased relative to within-group evolution.
Results Simulation [Figure: percent correct assignment using the coalescent assigner (left panel, “Our Method”) and the distance method (right panel, “The other method”). Within-group sample size, n = 5.] Used a chain length of 50000 and FLUCTUATE with its default settings. (ν is in substitutions per site and θ is per site.)
Results Simulation [Figure: percent correct assignment using the coalescent assigner (left panel, “Our Method”) and the distance method (right panel, “The other method”). Within-group sample size, n = 10.] Used a chain length of 50000 and FLUCTUATE with its default settings. (ν is in substitutions per site and θ is per site.)
Results Simulation [Figure: percent correct assignment using the coalescent assigner (left panel, “Our Method”) and the distance method (right panel, “The other method”). Within-group sample size, n = 25.] Used a chain length of 50000 and FLUCTUATE with its default settings. (ν is in substitutions per site and θ is per site.)
Results Simulation [Figure: power of the coalescent assigner (solid line) compared to that of the distance method (dotted line) as ν increases at different group sizes when θ = 0.001.] ν is in substitutions per site and θ is per site.
Results Simulation [Figure: power of the coalescent assigner (solid line) compared to that of the distance method (dotted line) as ν increases at different group sizes when θ = 0.01.] ν is in substitutions per site and θ is per site.
Results Simulation [Figure: power of the coalescent assigner (solid line) compared to that of the distance method (dotted line) as ν increases at different group sizes when θ = 0.1.] ν is in substitutions per site and θ is per site.
Results Simulation • Assigning an individual based on distance alone can be misleading when within-group evolution is on the same scale as, or slightly faster than, between-group evolution; the closeness of the different groups in their history can cause some individuals to be closer to the wrong group. • Taking the within-group evolutionary history into account, along with the distance, gives a more informed assignment decision.
Results Simulation • Comparison of the number of wrong assignments using our method (light grey) and the distance method (dark grey) as distance from the true group increases. ν = 0.001 substitutions per site, θ = 0.1 per site, n = 10.
Results Real Data • Our algorithm highlighted a misclassification problem in the Astraptes data. • Correcting and re-running resulted in 100% correct assignment. • It was not surprising that the distance method resulted in a 100% correct assignment after correction.
Road Map • Introduction • Model & Implementation • Validation • Results • Conclusions • Future Work
Conclusions • Our method performs better than the distance method when within-group evolution is on the same level as, or slightly faster than, that between the different groups. • The power of our method dropped much more slowly than that of the distance method as θ increased. • The power improved faster than that of the distance method as phylogenetic divergence increased. • Both our method and the distance-based method improved as phylogenetic divergence increased relative to within-group evolution.
Conclusions • Our method is robust to violations of some of its major assumptions. • Witness to its capacity was its ability to highlight a misclassification problem in the Astraptes data. • The power of our method comes at a computational expense that grows with within-group sample size and with θ.
Road Map • Introduction • Model & Implementation • Validation • Results • Conclusions • Future Work
Future Work • A model-based clustering approach is under construction. • An approach to overcome the limitation of the coalescent when a group contains only one individual is being tested. • Outliers are also being dealt with. • An expansion of the number of possible models and loss functions to use in the analysis is also under way.
Models & Methods The Coalescent • The coalescent is a stochastic process that looks at the evolutionary process backwards in time. • The neutral coalescent is governed by one parameter: θ. [Figure: a genealogy traced back along the Time axis to the Most Recent Common Ancestor]
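As a sketch of the backwards-in-time view, the standard neutral coalescent can be simulated by drawing exponential waiting times between successive coalescence events. This is a textbook illustration, not the authors' code: the function names are hypothetical, time is in coalescent units (the conversion to generations depends on population size and ploidy), and mutations are not simulated.

```python
import random


def simulate_tmrca(n_samples, rng):
    """Simulate the time to the most recent common ancestor of n_samples lineages.

    While k lineages remain, the waiting time to the next coalescence is
    exponential with rate k * (k - 1) / 2 in coalescent time units.
    """
    elapsed = 0.0
    k = n_samples
    while k > 1:
        rate = k * (k - 1) / 2.0
        elapsed += rng.expovariate(rate)
        k -= 1  # two lineages merge into one
    return elapsed


def expected_tmrca(n_samples):
    """Closed-form expectation: sum over k of 2 / (k (k - 1)) = 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n_samples)


rng = random.Random(42)
draws = [simulate_tmrca(5, rng) for _ in range(20000)]
mean_tmrca = sum(draws) / len(draws)
```

Averaging many simulated genealogies should land close to the closed-form expectation, which gives a quick check that the waiting-time rates are right.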