
Lecturer: Shlomo Moran. External advisor: Yossi Shiloach. Website: webcourse.cs.technion.ac.il/236503/

236503 Project in Programming. A comparative study of evolutionary tree reconstruction: existing algorithms versus integer programming. Spring 2013. Lecturer: Shlomo Moran. External advisor: Yossi Shiloach. Website: http://webcourse.cs.technion.ac.il/236503/


Presentation Transcript


  1. 236503 Project in Programming. A comparative study of evolutionary tree reconstruction: existing algorithms versus integer programming. Spring 2013. Lecturer: Shlomo Moran. External advisor: Yossi Shiloach. Website: http://webcourse.cs.technion.ac.il/236503/

  2. Evolution. Evolution of new organisms is driven by: • Diversity: different individuals carry different variants of the same basic blueprint. • Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc. • Selection bias.

  3. The Phylogenetic Reconstruction Problem. MPI, June 2012

  4. Evolution is modeled by a Tree. Species are represented by their DNA sequences, consisting of {A,G,C,T}. [Figure: a tree whose nodes are labeled with 7-letter DNA sequences such as ACGGTCA, AAAGTCA and ACGGATA.]

  5. Phylogenetic Reconstruction. [Figure: only the ten leaf sequences are shown: GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG.]

  6. Phylogenetic Reconstruction. Input sequences: A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG. [Figure: the leaves A to J and a tree reconstructed over them, with a marked root.] Goal: reconstruct the 'true' tree as accurately as possible. Distance methods: use "evolutionary distances" between sequences.

  7. Reconstructing a weighted tree from exact interleaf distances. An unknown edge-weighted tree induces exact (additive) distances between its leaves; a (linear-time) reconstruction algorithm recovers the tree from these distances. [Figure: the unknown tree and the identical reconstructed tree on leaves A to G, with edge weights 2, 2, 5, 3, 0.3, 0.4, 4, 6, 6, 5.]

  8. Formal statement of the problem for exact distances. Input: an n×n distance matrix D=(d(i,j)) such that: • d(i,i)=0, and for i≠j, d(i,j)>0. • d(i,j)=d(j,i). • For all i,j,k it holds that d(i,k) ≤ d(i,j)+d(j,k). Output: if the distances can be realized by a weighted tree (i.e., the distances are additive), return that tree; otherwise return nothing.

  9. Distance-based reconstruction methods (since the 1960s). [Figure: a distance matrix; its layout is not recoverable from the transcript.]

  10. Solution for 3 objects. For n=3: each distance metric can be realized by a (unique) tree with one internal node. A distance metric on 4 objects may not be realizable by any tree. [Figure: leaves i, j, k attached to an internal node v by edges a, b, c.]
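
The three-object case can be worked out directly. The sketch below is an illustration, not taken from the slides: it assumes the three edge lengths are named a, b, c for the edges from i, j, k to the internal node v, and solves the three equations a+b=d(i,j), a+c=d(i,k), b+c=d(j,k).

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Edge lengths (a, b, c) from leaves i, j, k to the single internal node v.

    The names a, b, c are illustrative; they solve
    a + b = d_ij, a + c = d_ik, b + c = d_jk.
    """
    a = (d_ij + d_ik - d_jk) / 2  # edge i-v
    b = (d_ij + d_jk - d_ik) / 2  # edge j-v
    c = (d_ik + d_jk - d_ij) / 2  # edge k-v
    return a, b, c


a, b, c = three_leaf_tree(5, 6, 7)
print(a, b, c)
```

The triangle inequality guarantees that all three lengths are non-negative, which is why every metric on three objects has such a tree.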

  11. The Four Points Condition. Definition: a distance metric on n objects satisfies the four points condition iff any subset of four objects can be labeled i,j,k,l so that: d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l). Theorem: a distance metric is additive iff it satisfies the four points condition. [Figure: a quartet tree on i, j, k, l.]
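
A brute-force check of the four points condition might look like the sketch below. It uses an equivalent formulation: among the three pairwise sums of a quartet, the two largest must be equal. The function name, the list-of-lists matrix layout and the tolerance `tol` are assumptions made for this example.

```python
from itertools import combinations


def satisfies_four_points(d, tol=1e-9):
    """Return True iff the symmetric matrix d (list of lists) satisfies
    the four points condition on every quartet of indices. O(n^4) time."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest sums must coincide (up to tolerance).
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances from a quartet tree: leaf edges 1, 2, 3, 4 and internal edge 5.
additive = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(satisfies_four_points(additive))
```

Perturbing a single entry (and its mirror) of `additive` is usually enough to make the check fail, which is what happens with estimated distances.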

  12. Neighbor Joining. Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other leaf. The formula d(k,v) = ½[d(i,k) + d(j,k) − d(i,j)] shows that we can compute the distances from v to all other leaves.

  13. Reconstructing trees by Neighbor Joining Algorithms • This suggests the following method to construct a tree from a distance matrix: • Find neighboring leaves i,j in the tree, • Replace i,j by their parent v and recursively construct a tree T for the smaller set. • Add i,j as children of v in T.

  14. Neighbor Finding: Saitou & Nei method. Definitions: r(i) = Σ_k d(i,k) and Q(i,j) = (n−2)·d(i,j) − r(i) − r(j). Theorem (Saitou & Nei): Assume all internal edge weights are positive. If Q(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.

  15. S&N Neighbor Joining Algorithm • If n=3, return the tree on the three objects • Compute Q(i,j) for all i,j • Choose i,j such that Q(i,j) is minimal • Create a new vertex v, and set d(k,v) for every other object k • Remove i,j, and add v to the set of objects • Recursively construct a tree on the smaller set, then add i,j as children of v, at distances d(i,v) and d(j,v).
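
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: it uses the standard Saitou & Nei quantities r(i) = Σ_k d(i,k) and Q(i,j) = (n−2)·d(i,j) − r(i) − r(j), the standard branch-length formula for d(i,v), a loop instead of recursion, and hypothetical names (`neighbor_joining`, internal nodes `v0`, `v1`, ...).

```python
def neighbor_joining(names, dist):
    """names: list of at least 3 leaf labels.
    dist: dict of dicts with dist[a][b] for all a != b (symmetric).
    Returns a list of (node, node, edge_length) for an unrooted tree."""
    d = {a: dict(dist[a]) for a in names}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair (i, j) with minimal Q(i, j).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, i, j = best
        v = f"v{next_id}"
        next_id += 1
        # Distances from i and j to their new parent v.
        d_iv = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        d_jv = d[i][j] - d_iv
        edges += [(i, v, d_iv), (j, v, d_jv)]
        # Distances from v to every remaining object k.
        d[v] = {}
        for k in nodes:
            if k not in (i, j):
                d[v][k] = d[k][v] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [v]
    # Base case: three objects joined at one internal node.
    i, j, k = nodes
    c = f"v{next_id}"
    edges.append((i, c, (d[i][j] + d[i][k] - d[j][k]) / 2))
    edges.append((j, c, (d[i][j] + d[j][k] - d[i][k]) / 2))
    edges.append((k, c, (d[i][k] + d[j][k] - d[i][j]) / 2))
    return edges


# Additive distances from a quartet tree: leaf edges 1, 2, 3, 4, internal edge 5.
demo = {
    "A": {"B": 3, "C": 9, "D": 10},
    "B": {"A": 3, "C": 10, "D": 11},
    "C": {"A": 9, "B": 10, "D": 7},
    "D": {"A": 10, "B": 11, "C": 7},
}
for edge in neighbor_joining(["A", "B", "C", "D"], demo):
    print(edge)
```

Recomputing r and Q from scratch in every iteration keeps the sketch short at the cost of O(n²) work per round; for real experiments the PHYLIP implementation is the better choice.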

  16. Complexity of S&N Neighbor Joining Algorithm. Initialization: Θ(n²) to compute r(i) and Q(i,j) for all i,j ∈ L. Each iteration: • O(n²) to find the minimal Q(i,j). • O(n) to compute {D(v,k): k ∈ L} for the new node v, and to update the matrix. • O(n²) to update the values Q(i,j). Total of O(n³).

  17. NEEDED: Additive Distances Between DNA Sequences

  18. Additive evolutionary distance: the number of substitutions which occurred during the sequence evolution. Some substitutions are hidden, due to overwriting. Therefore, the exact number of substitutions is usually larger than the number of observed changes. [Figure: substitutions at sites 1, 2 and 3 of a sequence.]

  19. Edge weight = expected number of substitutions per site. [Figure: an edge (u,v) labeled with 0.321 substitutions per site.]

  20. Interleaf distances: sum of edge weights. [Figure: leaves u and v in a tree with edge weights 0.3, 0.5, 0.42; d(u,v) = 1.12.] When the exact number of substitutions between any two sequences is known, any algorithm which reconstructs trees from the exact distances returns the correct evolutionary tree.

  21. What we see is only the observed number of substitutions between pairs of leaf sequences. The expected number of substitutions is estimated from the observed number of substitutions.

  22. The estimation is based on a Substitution Model. The simplest model: the Jukes-Cantor model. On each tree edge e, each letter mutates to any other letter at the same rate r_e. The length of an edge is the expected number of mutations per site, i.e. t = 3r_e. [Figure: an edge (u,v) of length t.]
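
To make the model concrete, here is a small simulation sketch of evolving a sequence along one edge of length t. It relies on a standard fact about the Jukes-Cantor model that is not stated on the slide: after t expected substitutions per site, a site shows a different letter with probability ¾(1 − e^(−4t/3)), each of the three other letters being equally likely. The function name `evolve_jc` and the `rng` parameter are assumptions of this example.

```python
import math
import random


def evolve_jc(seq, t, rng=random):
    """Evolve a DNA string along an edge of length t (expected
    substitutions per site) under the Jukes-Cantor model."""
    # Probability that a site ends up with a letter different from its start.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # All three other letters are equally likely.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


child = evolve_jc("ACGGTCA", 0.3, random.Random(2013))
print(child)
```

Passing a seeded `random.Random` instance makes simulated data sets reproducible, which helps when comparing reconstruction methods on the same inputs.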

  23. The expected number of substitutions is estimated from the observed changes by a correction formula.
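
The formula itself did not survive in the transcript. For the Jukes-Cantor model the standard correction is d = −¾ ln(1 − 4p/3), where p is the observed fraction of differing sites. The sketch below assumes that this is the formula meant; the function names and the choice to return infinity for saturated pairs (p ≥ ¾) are decisions made for this example.

```python
import math


def p_distance(s1, s2):
    """Observed fraction of sites at which two aligned sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def jc_distance(s1, s2):
    """Jukes-Cantor estimate of the expected substitutions per site."""
    p = p_distance(s1, s2)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Sequences A and B from the reconstruction example.
print(p_distance("AATGGGC", "AATCCTG"), jc_distance("AATGGGC", "AATCCTG"))
```

The corrected distance is never smaller than the observed one, reflecting the hidden substitutions discussed earlier.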

  24. Reconstruction from estimated distances. The edge-weighted 'true' tree defines exact (additive) distances between species; distance estimation, assuming a DNA substitution model, yields estimated distances, from which a tree is reconstructed. Challenge: minimize reconstruction errors. [Figure: the 'true' tree and the reconstructed tree on leaves A to G.]

  25. Correct and incorrect reconstruction of edges. Each (internal) edge defines a split of the leaves. Comparing the edge-weighted 'true' tree T with the reconstructed tree T': • The edge {ABC | DEFG} is correctly reconstructed. • The edge {ABCD | EFG} is a false negative. • The edge {AC | BDEFG} is a false positive. [Figure: T and T' on leaves A to G.]

  26. Robinson Foulds Distance. Robinson Foulds distance = (false positives + false negatives) / (total number of internal edges). [Figure: the 'true' tree T and the reconstructed tree T' on leaves A to G.]
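
Given the splits induced by the internal edges of both trees, the distance can be computed with set operations. The sketch below makes two assumptions that the transcript does not settle: each split is passed as one side of the bipartition, and the "total number of internal edges" is counted over both trees. The example uses only the three splits named on the previous slide, so it is a toy input rather than the full trees of the figure.

```python
def canonical_split(side, leaves):
    """Represent a split as an unordered pair of leaf sets."""
    side = frozenset(side)
    return frozenset([side, frozenset(leaves) - side])


def robinson_foulds(true_splits, recon_splits, leaves):
    """Return (false_positives, false_negatives, normalized_distance)."""
    t = {canonical_split(s, leaves) for s in true_splits}
    r = {canonical_split(s, leaves) for s in recon_splits}
    fn = len(t - r)  # edges of the true tree missing from the reconstruction
    fp = len(r - t)  # edges of the reconstruction absent from the true tree
    total = len(t) + len(r)  # assumption: internal edges of both trees
    return fp, fn, ((fp + fn) / total if total else 0.0)


leaves = "ABCDEFG"
true_tree = ["ABC", "ABCD"]   # {ABC | DEFG}, {ABCD | EFG}
reconstructed = ["ABC", "AC"]  # {ABC | DEFG}, {AC | BDEFG}
print(robinson_foulds(true_tree, reconstructed, leaves))
```

Storing each split as an unordered pair means that "ABC" and "DEFG" denote the same edge, so callers need not agree on which side to list.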

  27. Formal statement of the problem for estimated distances. Input: an n×n distance matrix whose entries are estimates of tree (additive) distances. Output: a tree with small Robinson Foulds distance from the true tree.

  28. Project's Goal • Practice the current algorithm (NJ) for phylogenetic reconstruction by distance methods. • Simulate the evolution of DNA sequences, and generate evolutionary distances. • Study a new method for tree reconstruction, based on mixed integer programming with CPLEX. • Compare the accuracy of this new method with that of Neighbor Joining. You should use the PHYLIP phylogenetic package for most of the required tasks: http://evolution.genetics.washington.edu/phylip.html

  29. Time Line

  30. Grading Scheme • 10%: work plan • 60%: final report + submitted code, roughly distributed as 40% meeting project requirements, 10% code organization and documentation, 10% innovation and creativity • 30%: final presentation
