Approximate Bayesian Computation
Studying demographic parameters
Joao Lopes, Mark Beaumont
University of Reading
joao.lopes@rdg.ac.uk
ABC algorithm:
• Assumptions:
  • Discordance between gene and species trees is not expected
  • Mutation rate is variable in space, but not in time
• Features:
  • Based on construction of gene trees using the coalescent model
  • Easily applied to 4 or 5 populations/species
  • Some tweaks are necessary to use it with more populations
• But most importantly:
  • Handles large datasets (typically hundreds of samples per population/species)
  • Complex population/species models can be used (e.g. presence of gene flow)
  • Assumptions can be greatly relaxed (e.g. variable mutation rate over time)
ABC algorithm
Parameters: F = {Ne1, Ne2, NeA, m1, m2, t}
1. Sample from the prior(s): Fi ~ p(F)
2. Simulate data, given Fi: Di ~ p(D | Fi)
3. Summarize Di with a set of summary statistics, obtaining Si; go to 1 until N points (S, F) have been created.
4. Accept the points whose S is within a distance d from s', the real data summarized by the same set of statistics.
5. Correct the values F according to their distance from the real data by performing a local linear regression.
[Figure: The population model. Populations: Popanc, Pop1, Pop2. Parameters: NeA, Ne1, Ne2, m1, m2, t]
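The five steps above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions, not the authors' implementation: the coalescent simulator is replaced by a toy model (a single parameter `theta` with a uniform prior, data drawn from Normal(theta, 1), and the sample mean as the only summary statistic), the acceptance distance d is set implicitly by keeping the fraction `tol` of closest points, and the regression weights use an Epanechnikov kernel. All function and parameter names are hypothetical.

```python
import random


def simulate_summary(theta, n_samples, rng):
    """Toy stand-in for the coalescent simulator (assumption):
    draw n_samples values from Normal(theta, 1) and summarize by the mean."""
    draws = [rng.gauss(theta, 1.0) for _ in range(n_samples)]
    return sum(draws) / n_samples


def abc_with_regression(s_obs, n_points=5000, tol=0.02, n_samples=20, seed=1):
    """Rejection ABC followed by a local linear regression adjustment."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        theta = rng.uniform(0.0, 10.0)                # step 1: sample from the prior
        s = simulate_summary(theta, n_samples, rng)   # steps 2-3: simulate and summarize
        points.append((abs(s - s_obs), s, theta))

    # Step 4: accept the fraction `tol` of points closest to the observed summary.
    points.sort()
    accepted = points[:max(2, int(tol * n_points))]
    d_max = accepted[-1][0]

    # Epanechnikov weights: closer points count more, the farthest gets weight 0.
    weights = [1.0 - (d / d_max) ** 2 for d, _, _ in accepted]

    # Step 5: weighted least-squares slope of theta on s, then shift every
    # accepted theta to the value it would take at s = s_obs.
    w_sum = sum(weights)
    s_bar = sum(w * s for w, (_, s, _) in zip(weights, accepted)) / w_sum
    t_bar = sum(w * t for w, (_, _, t) in zip(weights, accepted)) / w_sum
    s_var = sum(w * (s - s_bar) ** 2 for w, (_, s, _) in zip(weights, accepted))
    s_cov = sum(w * (s - s_bar) * (t - t_bar)
                for w, (_, s, t) in zip(weights, accepted))
    slope = s_cov / s_var
    adjusted = [t - slope * (s - s_obs) for _, s, t in accepted]
    return adjusted, weights


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

Calling `abc_with_regression(4.0)` returns the adjusted parameter values and their weights; `weighted_mean(adjusted, weights)` then gives a point estimate of the posterior mean. With several parameters and summary statistics, the regression step would become a multivariate weighted regression fitted per parameter.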
Simulated data
DNA sequence data (1 locus)
• Pop1: 45 samples
• Pop2: 55 samples
• ABC: 200 data sets
• Comparison with MCMC: 10 data sets
Summary statistics used:
• mean of pairwise differences
  • in each population
  • both populations joined together
• number of segregating sites
  • in each population
  • both populations joined together
• number of haplotypes
  • in each population
  • both populations joined together
Accuracy measure: Relative Mean Integrated Square Error (relMISE), computed over the n accepted points, where fi is the value of a given parameter for the ith point and f' is the true value of the parameter.
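As a rough illustration of the quantities on this slide, the sketch below computes the three summary statistics for each population and for both joined together (9 values in total), plus a relMISE function. It assumes aligned sequences given as equal-length strings, and it assumes the usual form relMISE = (1/n) Σ ((fi − f') / f')², since the formula itself is not given in the text. Function names are hypothetical.

```python
from itertools import combinations


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns showing more than one state."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def number_of_haplotypes(seqs):
    """Number of distinct sequences."""
    return len(set(seqs))


def summary_stats(pop1, pop2):
    """Each statistic for Pop1, Pop2 and both joined: 3 x 3 = 9 values."""
    groups = (pop1, pop2, pop1 + pop2)
    stats = (mean_pairwise_differences, segregating_sites, number_of_haplotypes)
    return [stat(group) for stat in stats for group in groups]


def rel_mise(accepted_values, true_value):
    """Assumed form: mean squared relative deviation from the true value."""
    n = len(accepted_values)
    return sum(((f - true_value) / true_value) ** 2 for f in accepted_values) / n
```

For example, `summary_stats(["AAAA", "AAAT"], ["AATT", "AATT"])` returns the nine statistics in the order listed on the slide, and `rel_mise` can be applied to the accepted values of any one parameter when the true value is known, as it is for simulated data.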
Simulated data: "real" data and prior information
[Figure: one panel per parameter (Ne1, Ne2, NeA, m1, m2, t). Legend: "real" data, ABC, prior distribution, MCMC]
Simulated data
ABC (500 000 iterations, tol = 0.02, logit transformation, 9 summary statistics)
[Figure: results for simulation 8, one panel per parameter (Mig1, Mig2, Tev, Ne1, Ne2, Neanc), and average relMISE over 10 data sets]
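The "logit transformation" in the settings above refers to transforming parameters before the regression step. One common way to do this, assumed here rather than taken from the slides, is to map each parameter from its prior interval onto the whole real line, run the regression there, and map back, so that adjusted values cannot leave the prior bounds. The bounds in the usage note are hypothetical.

```python
import math


def to_logit(x, lower, upper):
    """Map x in the open interval (lower, upper) onto the real line."""
    p = (x - lower) / (upper - lower)
    return math.log(p / (1.0 - p))


def from_logit(y, lower, upper):
    """Inverse of to_logit; the result always lies within [lower, upper]."""
    # Use the numerically stable form of the logistic function so that
    # large negative y does not overflow math.exp.
    if y >= 0:
        p = 1.0 / (1.0 + math.exp(-y))
    else:
        e = math.exp(y)
        p = e / (1.0 + e)
    return lower + (upper - lower) * p
```

With a hypothetical prior of (0, 10000) on an effective population size, `from_logit(to_logit(2500, 0, 10000), 0, 10000)` recovers 2500 up to rounding, and any regression-adjusted value on the logit scale maps back inside the prior range.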
Simulated data: optimized ABC method
ABC (2 500 000 iterations, tol = 0.004, log transformation, 9 summary statistics)
[Figure: results for simulation 8, one panel per parameter (Mig1, Mig2, Tev, Ne1, Ne2, Neanc), and average relMISE over 10 data sets]
Simulated data: adding summary statistics
ABC (2 500 000 iterations, tol = 0.004, log transformation, 21 summary statistics)
[Figure: results for simulation 8, one panel per parameter (Mig1, Mig2, Tev, Ne1, Ne2, Neanc), and average relMISE over 10 data sets]
Model-choice: migration present/absent
ABC (1 000 000 iterations, tol = 0.004, log transformation, 21 summary statistics)
• Population model 1 (M = M1): pM1 = 2%
• Population model 2 (M = M2): pM2 = 98%
(10 data sets)
[Figure: the two candidate population models, each with populations Popanc, Pop1 and Pop2]
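One simple way to obtain model probabilities such as pM1 and pM2 from ABC is to treat the model index as an extra parameter with its own prior and report the fraction of accepted points that came from each model. The sketch below shows that rejection-based estimate on a toy problem; it is an assumption about the general technique, not necessarily the estimator used for the results above. The two toy models (a parameter fixed at 0 versus a parameter drawn from a uniform prior) and all names are hypothetical.

```python
import random


def simulate_under_model(model, rng, n_samples=10):
    """Toy stand-in for the two population models (assumption):
    model 1 fixes the parameter at 0, model 2 draws it from Uniform(0, 5).
    The data are summarized by the sample mean."""
    theta = 0.0 if model == 1 else rng.uniform(0.0, 5.0)
    draws = [rng.gauss(theta, 1.0) for _ in range(n_samples)]
    return sum(draws) / n_samples


def abc_model_choice(s_obs, n_points=20000, tol=0.01, seed=1):
    """Estimate posterior model probabilities from the accepted points."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        model = rng.choice((1, 2))            # equal prior probability per model
        s = simulate_under_model(model, rng)
        points.append((abs(s - s_obs), model))

    # Keep the fraction `tol` of simulations closest to the observed summary.
    points.sort()
    accepted = points[:max(1, int(tol * n_points))]

    # Posterior probability of a model = its share of the accepted points.
    p_m1 = sum(1 for _, model in accepted if model == 1) / len(accepted)
    return p_m1, 1.0 - p_m1
```

Calling `abc_model_choice(3.0)` strongly favours the second toy model, because an observed mean of 3 is very unlikely when the parameter is fixed at 0.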
Simulated data: using model-choice step
ABC (2 500 000 iterations, tol = 0.004, log transformation, 21 summary statistics)
[Figure: results for simulation 8, one panel per parameter (Mig1, Mig2, Tev, Ne1, Ne2, Neanc), and average relMISE over 10 data sets]
Simulated data: 10 vs 200 data sets
ABC (2 500 000 iterations, tol = 0.004, log transformation, 21 summary statistics)
[Figure: results for simulation 8, one panel per parameter (Mig1, Mig2, Tev, Ne1, Ne2, Neanc), and average relMISE over 10 data sets and over 200 data sets]
Conclusions:
• Comparison between ABC and MCMC methods:
  • ABC is up to 2 orders of magnitude faster than the MCMC method for a single locus
  • ABC modes are similar to those from MCMC (full-likelihood method)
  • ABC can easily incorporate more complex population models with relaxed assumptions
  • A model-choice framework follows naturally from the ABC approach
  • ABC easily handles multi-modal posterior distributions
  • ABC does not have the problems associated with local maxima in likelihood distributions
• ABC improves with:
  • parameter transformation
  • more iterations
  • more summary statistics
  • a model-choice framework
Take-home message:
• Phylogenetic methods based on gene trees using the coalescent are being actively explored.
• These methods will be available in the near future.
Acknowledgements
I would like to thank David Balding for frequent meetings on the subject, and give special thanks to Mark Beaumont for advice and comments on the work. Support for this work was provided by EPSRC.
joao.lopes@rdg.ac.uk
http://www.rdg.ac.uk/~sar05sal