
Phylogenetic Analysis


Presentation Transcript


  1. Phylogenetic Analysis

  2. Subroutine • Lets you reuse code that is needed in more than one place • A good way to organize code • Called a "function" in some languages • A subroutine has a name, a return value, and parameters (passed in the array @_)

  3. #!/usr/bin/perl -w
     # Read a FASTA-style file: first line is the name, the rest is sequence.
     print "Please type the filename: ";
     $dna_filename = <STDIN>;
     chomp $dna_filename;
     open(DNAFILE, $dna_filename) or die "Cannot open $dna_filename: $!";
     $name = <DNAFILE>;
     @DNA = <DNAFILE>;
     close DNAFILE;
     $DNA = join('', @DNA);
     $DNA =~ s/\s//g;                  # strip whitespace, including newlines
     $count_of_G = countG($DNA);
     print $count_of_G;

     # Count G's (upper- or lowercase) with the tr/// operator.
     sub countG {
         my($dna) = @_;
         my($count) = 0;
         $count = ( $dna =~ tr/Gg// );
         return $count;
     }

  4. #!/usr/bin/perl -w
     print "Please type the filename: ";
     $dna_filename = <STDIN>;
     chomp $dna_filename;
     open(DNAFILE, $dna_filename) or die "Cannot open $dna_filename: $!";
     $name = <DNAFILE>;
     @DNA = <DNAFILE>;
     close DNAFILE;
     $DNA = join('', @DNA);
     $DNA =~ s/\s//g;
     $count_of_G = count($DNA, 'Gg');
     print $count_of_G;

     # Generalized counter: tr/// does not interpolate variables, so the
     # expression is built as a string and eval'd. \$count and \$dna are
     # escaped so that only $pattern interpolates before eval runs.
     sub count {
         my($dna, $pattern) = @_;
         my($count) = 0;
         eval "\$count = (\$dna =~ tr/$pattern//);";
         return $count;
     }

  5. @fred = (1,2,3);
     @barney = @fred;
     @huh = 1;                       # the one-element list (1)
     @fred = qw(one two);
     @barney = (4,5,@fred,6,7);      # lists flatten: (4,5,"one","two",6,7)
     @barney = (8,@barney);
     ($a,$b,$c) = (1,2,3);
     @fred = (@barney = (2,3,4));
     @fred = @barney = (2,3,4);      # same effect as the line above
     @fred = (1,2,3);
     $fred[3] = "hi";
     $fred[6] = "ho";
     # @fred is now (1,2,3,"hi",undef,undef,"ho")

  6. Array operators • push and pop (right-most element) • @mylist = (1,2,3); push(@mylist,4,5,6); • $oldvalue = pop(@mylist); • shift and unshift (left-most element) • @fred = (5,6,7); unshift(@fred,2,3,4); • $x = shift(@fred); • reverse: @a = (7,8,9); @b = reverse(@a); • sort: @a = (7,9,9); @b = sort(@a);
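
A runnable demo of these operators, with results in the comments (a sketch; variable names beyond the slide's are mine):

    #!/usr/bin/perl -w
    @mylist = (1, 2, 3);
    push(@mylist, 4, 5, 6);           # @mylist is now (1,2,3,4,5,6)
    $oldvalue = pop(@mylist);         # $oldvalue = 6; @mylist = (1,2,3,4,5)
    @fred = (5, 6, 7);
    unshift(@fred, 2, 3, 4);          # @fred is now (2,3,4,5,6,7)
    $x = shift(@fred);                # $x = 2; @fred = (3,4,5,6,7)
    @a = (7, 8, 9);
    @b = reverse(@a);                 # @b is (9,8,7)
    @c = sort(@b);                    # default sort is string order: (7,8,9)
    @n = sort { $a <=> $b } (10, 9);  # numeric sort needs a comparator: (9,10)
    print "$oldvalue $x | @mylist | @fred | @b | @c | @n\n";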

  7. Print to file • Open a file to print • open FILE, ">filename.txt"; • open(FILE, ">filename.txt"); • Print to the file • print FILE $str;
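
Put together as a complete sketch (the filename is arbitrary; the or die check is good practice the slide omits):

    #!/usr/bin/perl -w
    # ">" truncates an existing file; ">>" would append instead.
    open(FILE, ">filename.txt") or die "Cannot open filename.txt: $!";
    $str = "Hello, file!\n";
    print FILE $str;
    close(FILE);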

  8. #!/usr/bin/perl
     print "My name is $0 \n";
     print "First arg is: $ARGV[0] \n";
     print "Second arg is: $ARGV[1] \n";
     print "Third arg is: $ARGV[2] \n";
     $num = $#ARGV + 1;      # $#ARGV is the index of the last argument
     print "How many args? $num \n";
     print "The full argument string was: @ARGV \n";

  9. Parsimony and Tree Reconstruction

  10. Small vs. Large Parsimony • We break the problem into two: • Small parsimony: given the topology, find the best assignment of states to internal nodes • Large parsimony: find the topology that gives the best score • Large parsimony is NP-hard • We'll show solutions to small parsimony (Fitch's and Sankoff's algorithms) • Input to small parsimony: a tree with character-state assignments to its leaves • Example [tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]: A:CAGGTA B:CAGACA C:CGGGTA D:TGCACT E:TGCGTA

  11. Scoring Matrices • Small Parsimony Problem • Weighted Parsimony Problem • [scoring-matrix figures]

  12. Unweighted vs. Weighted • Small Parsimony • Scoring Matrix: [figure] • Small Parsimony Score: 5

  13. Unweighted vs. Weighted • Weighted Parsimony • Scoring Matrix: [figure] • Weighted Parsimony Score: 22

  14. Sankoff Algorithm (cont.) • Begin at the leaves: • If the leaf has the character in question, the score is 0 • Else, the score is ∞

  15. Sankoff Algorithm (cont.) • s_t(v) = \min_i \{ s_i(u) + \delta_{i,t} \} + \min_j \{ s_j(w) + \delta_{j,t} \} • For t = A: s_A(v) = \min_i \{ s_i(u) + \delta_{i,A} \} + \min_j \{ s_j(w) + \delta_{j,A} \}, and the left-subtree term \min_i \{ s_i(u) + \delta_{i,A} \} = 0

  16. Sankoff Algorithm (cont.) • s_t(v) = \min_i \{ s_i(u) + \delta_{i,t} \} + \min_j \{ s_j(w) + \delta_{j,t} \} • Adding the right-subtree term: s_A(v) = 0 + 9 = 9

  17. Sankoff Algorithm (cont.) • s_t(v) = \min_i \{ s_i(u) + \delta_{i,t} \} + \min_j \{ s_j(w) + \delta_{j,t} \} • Repeat for t = T, G, and C

  18. Sankoff Algorithm (cont.) Repeat for right subtree

  19. Sankoff Algorithm (cont.) Repeat for root

  20. Sankoff Algorithm (cont.) • The smallest score at the root is the minimum weighted parsimony score • In this case it is 9, so label the root with T

  21. Sankoff Algorithm: Traveling down the Tree • The scores at the root vertex were computed by going up the tree • After the root scores are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character

  22. Sankoff Algorithm (cont.) • The root's score of 9 is derived from 7 + 2 • So the left child is labeled T, and the right child is labeled T

  23. Sankoff Algorithm (cont.) And the tree is thus labeled…
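
The up and down phases just illustrated fit in a short script. This is a minimal sketch, not the slides' code: the nested-hash tree encoding, the unit-cost delta(), and the four-leaf example are assumptions of mine; substituting the weighted scoring matrix from the figures turns it into weighted parsimony.

    #!/usr/bin/perl -w
    use strict;

    my @states = ('A', 'C', 'G', 'T');

    # Substitution cost delta(i,j). Unit costs as a stand-in for the
    # weighted matrix shown on the slides.
    sub delta {
        my ($i, $j) = @_;
        return $i eq $j ? 0 : 1;
    }

    # Up phase: s_t(v) = min_i{s_i(u)+delta(i,t)} + min_j{s_j(w)+delta(j,t)}.
    # Leaves score 0 for their own state and "infinity" (1e9) otherwise.
    sub up {
        my ($node) = @_;
        if (!$node->{left}) {
            $node->{score}{$_} = ($_ eq $node->{state} ? 0 : 1e9) for @states;
            return;
        }
        up($node->{left});
        up($node->{right});
        for my $t (@states) {
            my ($lmin, $rmin) = (1e9, 1e9);
            for my $i (@states) {
                my $l = $node->{left}{score}{$i}  + delta($i, $t);
                my $r = $node->{right}{score}{$i} + delta($i, $t);
                $lmin = $l if $l < $lmin;
                $rmin = $r if $r < $rmin;
            }
            $node->{score}{$t} = $lmin + $rmin;
        }
    }

    # Down phase: give each child the state that realized its parent's minimum.
    sub down {
        my ($node) = @_;
        return unless $node->{left};
        for my $child ($node->{left}, $node->{right}) {
            my ($best, $bestcost) = ('', 1e9);
            for my $i (@states) {
                my $c = $child->{score}{$i} + delta($i, $node->{state});
                ($best, $bestcost) = ($i, $c) if $c < $bestcost;
            }
            $child->{state} = $best;
            down($child);
        }
    }

    # A small four-leaf example (not the slides' tree).
    my $tree = {
        left  => { left => { state => 'T' }, right => { state => 'C' } },
        right => { left => { state => 'A' }, right => { state => 'G' } },
    };
    up($tree);
    my ($root_state, $min) = ('', 1e9);
    for my $t (@states) {
        ($root_state, $min) = ($t, $tree->{score}{$t}) if $tree->{score}{$t} < $min;
    }
    $tree->{state} = $root_state;
    down($tree);
    print "Minimum parsimony score: $min, root labeled $root_state\n";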

  24. Fitch’s Algorithm (cont’d) An example: a c t a {a,c} {t,a} a c t a a a a a {a,c} {t,a} a a a c t t a c

  25. Fitch Algorithm (cont.)
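
For comparison, a minimal Perl sketch of Fitch's rule (the tree encoding and names are mine, not the slides' code): going up, take the intersection of the children's candidate sets when it is non-empty, otherwise their union, counting one mutation per union.

    #!/usr/bin/perl -w
    use strict;

    my $mutations = 0;

    # Bottom-up Fitch pass: a node's candidate set is the intersection of
    # its children's sets when non-empty, otherwise their union (one mutation).
    sub fitch {
        my ($node) = @_;
        return { $node->{state} => 1 } unless $node->{left};   # leaf
        my $l = fitch($node->{left});
        my $r = fitch($node->{right});
        my %inter = map { $_ => 1 } grep { $r->{$_} } keys %$l;
        return \%inter if %inter;
        $mutations++;
        return { %$l, %$r };
    }

    # The four-leaf example from slide 24: leaves a, c, t, a.
    my $tree = {
        left  => { left => { state => 'a' }, right => { state => 'c' } },
        right => { left => { state => 't' }, right => { state => 'a' } },
    };
    my $root = fitch($tree);
    print "Parsimony score: $mutations; root set: {",
          join(',', sort keys %$root), "}\n";
    # Prints: Parsimony score: 2; root set: {a}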

  26. Large Parsimony Problem • Input: An n x m matrix M describing n species, each represented by an m-character string • Output: A tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices such that the parsimony score is minimized over all possible trees and all possible labelings of internal vertices

  27. Large Parsimony Problem (cont.) • The search space is huge, especially as n increases • (2n - 3)!! possible rooted trees • (2n - 5)!! possible unrooted trees • The problem is NP-complete • Exhaustive search is only possible for n < 10 • Hence, branch-and-bound or heuristics are used

  28. (2n-3)!! and (2n-5)!! • (2n-5)!! = (2n-5) × (2n-7) × … × 3 • For n=10, (2n-5)!! is 2,027,025 • For n=13, it is 13,749,310,575 • For n=20, it is 221,643,095,476,699,771,875
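
These counts are easy to verify; a small script (the sub name unrooted_trees is mine), using bignum so the n=20 product stays exact:

    #!/usr/bin/perl -w
    use strict;
    use bignum;    # exact integer arithmetic for the large counts

    # Number of unrooted binary trees on n leaves: (2n-5)!! = 3 x 5 x ... x (2n-5)
    sub unrooted_trees {
        my ($n) = @_;
        my $count = 1;
        for (my $k = 3; $k <= 2 * $n - 5; $k += 2) { $count *= $k; }
        return $count;
    }

    for my $n (10, 13, 20) {
        print "n=$n: ", unrooted_trees($n), " unrooted trees\n";
    }
    # Prints 2027025, 13749310575, and 221643095476699771875, matching the slide.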

  29. Hill Climbing • Start with an arbitrary tree and check its neighbors: swap branches, etc. • Move to a neighboring tree if it provides the best improvement in parsimony score • There is no way of knowing if the result is the most parsimonious tree • The search can get stuck in a local optimum • Methods: NNI, TBR, SPR

  30. Nearest Neighbor Interchange: A Greedy Algorithm • A branch-swapping algorithm • Only evaluates a subset of all possible trees • Defines a neighbor of a tree as one reachable by a nearest neighbor interchange • A rearrangement of the four subtrees defined by one internal edge • Only three different rearrangements per edge

  31. Nearest Neighbor Interchange (cont.)

  32. Subtree Pruning and Regrafting: Another Branch Swapping Algorithm http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/SPR.gif

  33. Tree Bisection and Reconnection: Another Branch Swapping Algorithm • The most extensive swapping routine

  34. Problems with current techniques for MP • [Chart: average MP scores above optimal for the best methods at 24 hours, across 10 datasets] • The best current techniques fail to come within 0.01% of optimal on large datasets by the end of 24 hours

  35. Problems with current techniques for MP • The best methods combine simulated annealing, divide-and-conquer, and genetic algorithms, as implemented in the software package TNT • However, they do not come within 0.01% of optimal on large datasets in 24 hours • [Chart: performance of TNT with time]

  36. Probabilistic modeling and molecular phylogeny

  37. Probabilistic Modeling • Probabilistic models • Describe the likelihood of outcomes • Used to predict future events • Probability distribution • Allocates likelihood to outcomes • Often represented by parameterized functions

  38. Parameter Estimation • Converse of model-based prediction • Take sampled data as given • Estimate the most likely model fitting that data set • Parameter Estimation • Construct model and constraints based on domain knowledge • Solve to find most likely values for parameters of model

  39. Likelihood Function • Define a model by its pdf f(x; θ), where θ are the parameters of the pdf, constant across the sampled data • The likelihood function is L(θ) = \prod_{i=1}^{n} f(x_i; θ), where there are n sets of sample data • (Note: x and θ are often vectors)

  40. Example: Biased coin • n independent coin tosses, k heads • Binomial distribution, parameter p: L(p) = \binom{n}{k} p^k (1-p)^{n-k} • Given data: 100 trials, 56 heads • Numerical solution yields the maximum at p = 0.56 • [Plot: L against p]

  41. Log Likelihood • Take the natural log of both sides • Set d(\ln L)/dp = 0 and solve for p
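
Spelling out the algebra behind those two bullets (a standard maximum-likelihood calculation):

    \ln L(p) = \ln \binom{n}{k} + k \ln p + (n - k)\ln(1 - p)

    \frac{d \ln L}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0
    \quad\Rightarrow\quad k(1 - p) = (n - k)\,p
    \quad\Rightarrow\quad \hat{p} = \frac{k}{n} = \frac{56}{100} = 0.56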

  42. The Jukes-Cantor model of site evolution • Each "site" is a position in a sequence • The state (i.e., nucleotide) of each site at the root is random • The sites evolve independently and identically (i.i.d.) • If a site changes its state on an edge, it changes with equal probability to each of the other states • For every edge e, a probability p(e) is defined: the probability of change for a random site on edge e

  43. General Markov (GM) Model • A GM model tree is a pair (T, {M(e)}), where • T is a rooted binary tree • M(e) is, for each edge e of T, a stochastic substitution matrix (non-negative entries, each row summing to 1) • The state at the root of T is random • GM contains models like Jukes-Cantor (JC), Kimura 2-Parameter (K2P), and the Generalized Time Reversible (GTR) models

  44. Models of Sequence Evolution • [Figure: substitution-rate diagrams over the states A, C, G, T for three models] • Jukes-Cantor: all changes at one rate • Kimura: transitions and transversions at different rates • General: a distinct rate (a through f) for each pair of states

  45. Variation across sites • Standard assumption of how sites can vary is that each site has a multiplicative scaling factor • Typically these scaling factors are drawn from a Gamma distribution (or Gamma plus invariant)

  46. Special issues • Molecular clock: the expected number of changes for a site is proportional to time • No-common-mechanism model: there is a random variable for every combination of edge and site

  47. Jukes & Cantor’s one-parameter model Assumption: • Substitutions occur with equal probabilities among the four nucleotide types.

  48. The Jukes-Cantor model (1969) • We need to develop a formula for DNA evolution via Prob(y | x, t), where x and y are taken from {A, C, G, T} and t is the time length • Jukes-Cantor assumes an equal rate of change α between every pair of distinct nucleotides, so the rate matrix over (A, G, C, T) is

      Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}

  49. If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, P_A(t), that this site will be occupied by A at time t? Since we start with A, P_A(0) = 1. At time 1, the probability of still having A at this site is P_A(1) = 1 - 3α, where 3α is the probability of A changing to T, C, or G, and 1 - 3α is the probability that A has remained unchanged.
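
For reference, the derivation continues to a standard closed form (a textbook result; α is the per-time-unit rate defined above). One more step of the recurrence gives

    P_A(t+1) = (1 - 3\alpha)\,P_A(t) + \alpha\,\bigl(1 - P_A(t)\bigr)

which in continuous time becomes

    \frac{dP_A(t)}{dt} = -4\alpha\,P_A(t) + \alpha, \qquad P_A(0) = 1

with solution

    P_A(t) = \frac{1}{4} + \frac{3}{4}\,e^{-4\alpha t}

As t grows, P_A(t) approaches 1/4: the four nucleotides become equally likely.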
