290 likes | 406 Views
Barking up the wrong tree: gaps in current phylogenetic methodology and a dawg-gone good solution. Reed A. Cartwright Department of Genetics University of Georgia. Phylogenies. Phylogenies are not known. They are inferred.
E N D
Barking up the wrong tree: gaps in current phylogenetic methodology and a dawg-gone good solution Reed A. Cartwright Department of Genetics University of Georgia
Phylogenies • Phylogenies are not known. They are inferred. • The accuracy of the inference is dependent on the quality of the data and quality of the methodology. RA Cartwright rac@uga.edu - http://scit.us/
Estimating Phylogenies • Retrieve sequences from a database or a sequencer. • Align sequences. • Estimate the phylogeny from the alignment. • However, bad alignments give bad phylogenies. RA Cartwright rac@uga.edu - http://scit.us/
How do we know it works? • Intuition • Concordance • Using rare instances of know phylogenies. • Simulations RA Cartwright rac@uga.edu - http://scit.us/
Why Simulate Phylogenies? • Techniques are often based on certain models of evolution. • Simulating sequence evolution based on these models produces an ideal situation to test the techniques. • Using other models can test how robust a technique is. RA Cartwright rac@uga.edu - http://scit.us/
A A B B C C D D A B C D Testing Procedure 1. Start with a “known” tree. 3. Estimate the trees of the simulated data. 2. Simulate sequence sets based on the tree. 4. Compare estimated trees to the original tree. A AATTCTTTGAGTTAA B AATTCTTTGAGTTAA C AATTCTTAAAGTTAA D AATTCTTAAAGTTAA A AAAAGATAAAGCAAA--A B GAAAGATAAAGCAAA--A C GAAAGATAAAGAAAAACA D GAAAGATAAAGAAAAACA RA Cartwright rac@uga.edu - http://scit.us/
Measuring the accuracy of estimates • What do you want to measure, topology or branch lengths? • Simply binary system: correct or wrong. • More flexible system: accuracy of clade estimation. RA Cartwright rac@uga.edu - http://scit.us/
Clade Estimation • Sensitivity = TP/(TP+FN) • Specificity = TN/(FP+TN) • Positive Predictive Accuracy = TP/(TP+FP) • Negative Predictive Accuracy = TN/(FN+TN) RA Cartwright rac@uga.edu - http://scit.us/
How to best combine different clades into one metric? • Look at each clade separately? • Lump all clades of the same size together? • Lump all clades of different sizes together? • How to do it? RA Cartwright rac@uga.edu - http://scit.us/
Alignment Techniques • The quality of a phylogeny depends on the quality of an alignment. • There are two different classes of alignment techniques. • Pairwise alignments • Multiple alignments RA Cartwright rac@uga.edu - http://scit.us/
Pairwise alignments • Pairwise alignments align pairs of sequences. • Typically use a spreadsheet like technique with penalties for mismatches and gaps. RA Cartwright rac@uga.edu - http://scit.us/
Multiple alignments • Align multiple sequences. • Cannot directly use the techniques used for pairwise alignments. • Typical implementations use a guide tree derived from sequence similarity scores of pairwise alignments. RA Cartwright rac@uga.edu - http://scit.us/
Alignment Models • The typical model of alignment is the affine gap model. • In this model the cost of a gap is a linear function of the size of a gap: C(L) = O+E(L). • This corresponds to a geometric-exponential model of gap sizes. RA Cartwright rac@uga.edu - http://scit.us/
Power-Law • The only problem with this is that indels do not obey this model. • Several studies and some theory in nucleic acids and proteins have found that indels sizes obey a power law. • The appropriate cost model for a power law is logarithmic:C(L) = O+E*Log(L). RA Cartwright rac@uga.edu - http://scit.us/
So what? • Only one program does logarithmic alignments, and it only works on protein pairs. • Monotone by Richard Mott. • Clustal uses the affine gap model. • Above that Clustal uses a simple evolutionary model to estimate a guide tree for aligning multiple sequences. RA Cartwright rac@uga.edu - http://scit.us/
Dealing with gaps • Sequences are typically aligned before they are phylogenized. • This is silly. We should estimate alignments at the same we estimate phylogenies. • For now we are stuck doing it in pieces, but must be wary of introducing biases into our phylogenies when aligning. RA Cartwright rac@uga.edu - http://scit.us/
Dealing with gaps • Gaps contain phylogenetic signal which is usually ignored by researchers. • Can look at how gaps influence phylogenetics? RA Cartwright rac@uga.edu - http://scit.us/
Dealing with gaps • To study how gaps can influence phylogenies we need a program that can simulate molecular evolution with indels. • However, existing programs model indel formation rather poorly if they do at all. RA Cartwright rac@uga.edu - http://scit.us/
Dawg is the solution to this gap • Dawg stands for “DNA Assembly With Gaps.” • A portable and robust program for simulating molecular evolution. • Development Website: http://scit.us/dawg/ RA Cartwright rac@uga.edu - http://scit.us/
Comparing Software RA Cartwright rac@uga.edu - http://scit.us/
Parameters • Tree phylogeny • TreeScale coefficient to scale branch lengths by • Sequence root sequences • Length length of generated root sequences • Rates rate of evolution of each root nucleotide • Model model of evolution: GTR|JC|K2P|K3P|HKY|F81|F84|TN • Freqs nucleotide (ACGT) frequencies • Params parameters for the model of evolution • Width block width for indels and recombination • Scale block position scales • Gamma coefficients of variance for rate heterogeneity • Alpha shape parameters • Iota proportions of invariant sites • GapModel models of indel formation: NB|PL|US • Lambda rates of indel formation • GapParams parameter for the indel model • Reps number of data sets to output • File output file • Format output format: Fasta|Nexus|Phylip|Clustal • GapSingleChar output gaps as a single character • GapPlus distinguish insertions from deletions in alignment • LowerCase output sequences in lowercase • Translate translate outputed sequences to amino acids • NexusCode text or file to include between datasets in Nexus format • Seed PRNG seed (integers) RA Cartwright rac@uga.edu - http://scit.us/
Sample Input File # example.dawg Tree = ((AY727331:0.001359,AY727330:0.001359):0.084512, (AY727327:0.006116,AY727326:0.006116):0.079756); Model = "GTR" Params = {1.08031, 2.45581, 0.44452, 1.09145, 4.06519, 1.00000} Freqs = {0.353470, 0.143681, 0.178206, 0.324643} Length = 300 Lambda = 0.143120 GapModel = "NB" GapParams = {1, 0.753247} Format = "Clustal" File = "example.aln" Seed = 1981 RA Cartwright rac@uga.edu - http://scit.us/
CLUSTAL multiple sequence alignment (Created by DAWG Version 1.0.0) AY727326 TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAAAAAGATAAAGCAAA--A AY727327 TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAGAAAGATAAAGCAAA--A AY727330 TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA AY727331 TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA AY727326 ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT AY727327 ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT AY727330 GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT AY727331 GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT AY727326 TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTAACATCATGTTGTATTTAGAT AY727327 TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTACCATCATGTTGTATTTAGAT AY727330 TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT AY727331 TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT AY727326 GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGCATTTAAGAAGTACAT AY727327 GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGTATTTAAGAAGTACAT AY727330 GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT AY727331 GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT AY727326 GTTTAATAGTGTTTAAAA-TATATATGAAATTGATCATAAGGA---TCTATAAATGCGGT AY727327 GTTTAATAGTGTTTATAA-TATATATGAAATTGATCGTAAGGA---TCTATAAATGCAGT AY727330 GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC AY727331 GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC AY727326 TCTTCAATTTCTTG AY727327 TCTTCAATTTCTTG AY727330 TCTTCAATTTCCTA AY727331 TCTTCAATTTCCTA RA Cartwright rac@uga.edu - http://scit.us/
Estimating Indel Rate • Dawg would be of little benefit if biologists could not estimate parameters of indel formation from real data. • Dawg’s indel model allows such estimation, which is implemented in a Perl script, lambda.pl. RA Cartwright rac@uga.edu - http://scit.us/
Lambda.pl • Take an alignment and a phylogeny. • The number of unique gaps in this alignment is approximately distributed as a Poisson with mean (λLT) • λ = rate of indel formation • L = average sequence length • T = total branch length • Therefore the rate of indel formation can be estimated as λ = N/(LT) • N = number of unique gaps in alignment RA Cartwright rac@uga.edu - http://scit.us/
Example Usage:Confidence Interval of Indel Rate • I aligned the sequences of chloroplast trnK introns from two Hibiscus and two Prunus species. • Using Paup*, I estimated the phylogeny and substitution parameters. • Using lambda.pl, I estimated the indel formation parameters. RA Cartwright rac@uga.edu - http://scit.us/
Example Usage • From these estimated parameters of evolution, I constructed an input file for Dawg. • From the input file Dawg produced a thousand simulated sequence sets. • The rate of indel formation was estimated for each of the simulated sequences. RA Cartwright rac@uga.edu - http://scit.us/
Results • The estimated rate of indel formation was 0.143120. • Bootstrapping gave a 95% CI of 0.078530 to 0.213560. • Biologically this is 8 to 21 indels per 100 substitutions. RA Cartwright rac@uga.edu - http://scit.us/
Thanks RA Cartwright rac@uga.edu - http://scit.us/