Bioinformatics. Phylogenetic trees. LU, 2014, Juris Vīksna. Today: phylogenetic trees; hierarchical clustering and dendrograms; types of phylogenetic trees; the "molecular clock"; methods for constructing trees from distance matrices and from character matrices.
Bioinformatics. Phylogenetic trees. LU, 2014, Juris Vīksna
Today: • Phylogenetic trees • Hierarchical clustering and dendrograms • Types of phylogenetic trees • The "molecular clock" • Methods for tree construction (from distance matrices; from character matrices) • Some further problems related to tree construction • Programs for constructing and visualizing phylogenetic trees
Haeckel's "Tree of Life": "higher" organisms, "lower" organisms. A phylogenetic tree is a hierarchical, graphical representation of relationships. [Adapted from M.Thomas]
What is clustering? DOGS!! PETS!! CATS!! [Adapted from V.Olman]
Clustering and classification [Adapted from R.B.Altman]
What is a cluster? The "natural" definition is rather broad: • the elements of one cluster are far away from the others • there is a strict boundary between the elements of two clusters • a high concentration of elements compared with the background [Adapted from V.Olman]
Hierarchical clustering. Initially each object is in its own cluster. At each subsequent step two clusters are merged into one (until only one cluster remains). [Adapted from Y.Guo]
Hierarchical clustering - variants • single-link clustering (also called the connectedness or minimum method): the distance between two clusters is the shortest distance from any member of one cluster to any member of the other. If the data consist of similarities, the similarity between two clusters is the greatest similarity from any member of one cluster to any member of the other. • complete-link clustering (also called the diameter or maximum method): the distance between two clusters is the longest distance from any member of one cluster to any member of the other. • average-link clustering: the distance between two clusters is the average distance from any member of one cluster to any member of the other. [Adapted from Y.Guo]
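The three linkage rules differ only in how the pairwise distances between two clusters are combined. The following is a minimal sketch, not the code of any particular clustering package; the function names and the four one-dimensional points are hypothetical and were chosen so that single-link and complete-link produce different merge orders.

```python
from itertools import combinations

def cluster_distance(c1, c2, dist, method):
    """Distance between two clusters under the chosen linkage rule."""
    pair_d = [dist[frozenset((a, b))] for a in c1 for b in c2]
    if method == "single":      # shortest member-to-member distance
        return min(pair_d)
    if method == "complete":    # longest member-to-member distance
        return max(pair_d)
    if method == "average":     # mean over all member pairs
        return sum(pair_d) / len(pair_d)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(items, dist, method="single"):
    """Start with singletons and merge the two closest clusters until one remains."""
    clusters = [(x,) for x in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda p: cluster_distance(p[0], p[1], dist, method))
        d = cluster_distance(c1, c2, dist, method)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges

# Hypothetical points on a line (illustration only, not the slide's data).
points = {"a": 0, "b": 10, "c": 30, "d": 54}
dist = {frozenset((p, q)): abs(points[p] - points[q])
        for p, q in combinations(points, 2)}

for method in ("single", "complete", "average"):
    print(method, agglomerate(list(points), dist, method))
```

With these points, single-link attaches c to {a, b} before d joins, whereas complete-link and average-link pair c with d first.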
Single-Link Method (Euclidean distance). [Figure: distance matrix and merge steps (1) {a,b}, c, d; (2) {a,b,c}, d; (3) {a,b,c,d}.] [Adapted from Y.Guo]
Complete-Link Method (Euclidean distance). [Figure: distance matrix and merge steps (1) {a,b}, c, d; (2) {a,b}, {c,d}; (3) {a,b,c,d}.] [Adapted from Y.Guo]
Compare dendrograms: single-link vs. complete-link. [Figure: two dendrograms on a distance axis from 0 to 6.] [Adapted from Y.Guo]
Single-link vs Complete-link Nadler and Smith, Pattern Recognition Engineering, 1993 [Adapted from Y.Guo]
OC - a hierarchical clustering program
Input (number of items, labels, upper triangle of the score matrix):
7
First Second Third Fourth Fifth Sixth Seventh
100.0 100.0 50.0 33.0 25.0 20.0
100.0 50.0 50.0 33.0 33.0
100.0 33.0 20.0 25.0
33.0 20.0 25.0
100.0 100.0
100.0
Output:
## 0 20 2 Entity Score: 20 Number of members: 2
0 6
## 1 20 2 Entity Score: 20 Number of members: 2
2 5
## 2 20 3 Entity Score: 20 Number of members: 3
3 2 5
## 3 25 5 Entity Score: 25 Number of members: 5
0 6 3 2 5
## 4 33 6 Entity Score: 33 Number of members: 6
4 0 6 3 2 5
## 5 33 7 Entity Score: 33 Number of members: 7
1 4 0 6 3 2 5
http://www.compbio.dundee.ac.uk/Software/OC/oc.html
Phylogenetic trees [Adapted from E.Willasen]
Phylogenetic trees • Single origin to all species • Also describes evolution of DNA • Leaves - contemporary • Internal nodes - ancestral • Tree may be rooted or unrooted • Branch length - distance between sequences [Adapted from I.Pe’er]
Cladistics and phenetics • Cladistic approach: trees are drawn based on the conserved characters • Phenetic approach: trees are based on some measure of distance between the leaves • Molecular phylogenies are inferred from molecular (usually sequence) data and are either cladistic (e.g. gene order) or phenetic [Adapted from C.Seoighe]
Clade Clade: A set of species which includes all of the species derived from a single common ancestor [Adapted from C.Seoighe]
Tree types: cladograms • A cladogram shows the relative recency of common descent (no time scale). • It does not imply that ancestors on the same line necessarily speciated at the same time. • t1 can be before or after t2, but not before t3. [Figure: cladogram with nodes t1, t2, t3.] [Adapted from E.Willasen]
Tree types: phylograms. A phylogram (additive tree: branch lengths can be summed) shows the relative recency of common descent, and branch lengths = amount of change. [Adapted from E.Willasen]
Tree types: ultrametric trees. In an ultrametric tree (linearized tree) the amount of change can be scaled to time: scale = time. [Adapted from E.Willasen]
Molecular clock (Emile Zuckerkandl, Linus Pauling, ~1960). Accepted DNA mutations happen at a constant rate; thus, the number of mutations is proportional to the time of evolution. But mutation frequencies can differ between proteins: fibrinopeptides > hemoglobin > cytochrome c. For longer proteins the mutation frequency can differ between regions. Neutral Theory of Molecular Evolution (Motoo Kimura): natural selection vs. random genome changes.
Models of evolutionary change • The simplest model, Jukes-Cantor, assumes all probabilities of change to be equal. For it to be realistic, the base frequencies must be equal and the rates of change must be equal. • Kimura 2-parameter model: one might expect ts / tv rates to be 4 / 8 = 0.5, but transitions are usually more common. The Kimura model allows for unequal rates of transitions and transversions. [Figure: purines A and G, pyrimidines C and T; transitions occur within each group, transversions between the groups.] [Adapted from E.Willasen]
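Both models are commonly used to turn an observed fraction of differing sites into an estimated evolutionary distance. The sketch below uses the standard textbook correction formulas, which the slide itself does not state: Jukes-Cantor, d = -(3/4) ln(1 - 4p/3), and Kimura 2-parameter, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the fractions of sites showing transitions and transversions. It assumes two aligned, gap-free sequences over A, C, G, T; the function names and example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines

def count_differences(s1, s2):
    """Return (P, Q): fractions of sites with transitions and with transversions."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to equal length")
    ts = tv = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    return ts / len(s1), tv / len(s1)

def jukes_cantor(s1, s2):
    """Jukes-Cantor distance: every kind of substitution is equally likely."""
    p, q = count_differences(s1, s2)
    return -0.75 * math.log(1 - 4 * (p + q) / 3)

def kimura_2p(s1, s2):
    """Kimura 2-parameter distance: transitions and transversions have separate rates."""
    p, q = count_differences(s1, s2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# Hypothetical sequences: three differences, all of them transitions (A -> G).
s1 = "AAAAAAAAAAAA"
s2 = "GGGAAAAAAAAA"
print(jukes_cantor(s1, s2), kimura_2p(s1, s2))
```

When transitions make up exactly one third of the observed differences, the two estimates coincide; when transitions dominate, as in the example, the Kimura estimate is larger.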
Tree representations • Cladogram showing the phylogenetic relationships between four species. • Relationships of the same four species represented as a set of nested parentheses. • Evolutionary relationships of the same four species with nine synapomorphies (shared, derived characters) plotted on the branches. [Adapted from M.Thomas]
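The nested-parentheses form can be produced mechanically from a tree. The following minimal sketch stores a rooted tree as nested Python tuples of leaf names; the species labels A to D and the topology are hypothetical, since the slide's figure is not reproduced here. The output resembles the widely used Newick notation, which additionally ends with a semicolon.

```python
def to_parentheses(tree):
    """Render a tree (nested tuples of leaf names) as nested parentheses."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_parentheses(child) for child in tree) + ")"

def leaves(tree):
    """List the leaf names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades(tree):
    """Collect the leaf set below every internal node; each such set is a clade."""
    if isinstance(tree, str):
        return []
    found = [frozenset(leaves(tree))]
    for child in tree:
        found.extend(clades(child))
    return found

# Hypothetical four-species tree: A and B are closest, then C, then D.
tree = ((("A", "B"), "C"), "D")
print(to_parentheses(tree))
print([sorted(c) for c in clades(tree)])
```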
Applications. Using Phylogeny to Understand Gene Duplication and Loss • A gene tree. • The gene tree superimposed on a species tree, allowing identification of the duplication and loss events. [Adapted from M.Thomas]
Applications [Adapted from R.Shamir]
Distance matrices • MIN and MAX matrices: • MIN matrix - the moment in time at which the divergence occurred • MAX matrix - the time elapsed since the divergence
Data for constructing phylogenetic trees. Laboratory methods: hybridize a mixture of DNA from two organisms, then check at which temperature the hybridized strands separate. Bird evolution (Sibley, Ahlquist, 1986): this produced markedly ultrametric data. Sequence-based methods: compute something similar to ED (repeated mutations, and mutations that do not change the protein, must be taken into account).
Data for constructing phylogenetic trees - "have wings"; "walk on four legs"... - DNA contains a specific subsequence... - a specific nucleotide at a fixed DNA position - whether protein (gene) expression is regulated by a specific protein (mice and humans have very similar proteins but very different regulation...) - similarity between gaps in a multiple alignment of sequences from different organisms (in this way one can demonstrate that fungi are closer to animals than to plants)
Character state matrices - an example [Adapted from M.Thomas]
Data - orthologs and paralogs [Adapted from R.Shamir]
Data - orthologs and paralogs [Adapted from R.Shamir]
Methods for tree construction • Distance-based methods • Maximum parsimony methods - MP • Maximum likelihood methods - ML
Ultrametric matrices and trees [Adapted from D.Gusfield]
Ultrametric matrices and trees. A more "informal" definition: an n×n distance matrix is ultrametric if it corresponds to an ultrametric tree, i.e., if one can build a rooted tree with weighted edges whose leaf set is the set of matrix rows {1,...,n} and in which, for every pair of rows i, j, the distances from i and from j to their nearest common ancestor are M(i,j). There is a "simple" test of whether a matrix is ultrametric: a symmetric matrix is ultrametric if, for every triple i, j, k, at least two of the values M(i,j), M(j,k), M(i,k) are equal to the maximum of the three.
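The triple test above translates directly into code. A minimal sketch, assuming the matrix is given as a list of lists; both example matrices are hypothetical, and a small tolerance is used so that floating-point input does not fail the equality check by accident.

```python
from itertools import combinations

def is_ultrametric(m, tol=1e-9):
    """Check symmetry and the three-point condition on a square distance matrix."""
    n = len(m)
    for i, j in combinations(range(n), 2):
        if abs(m[i][j] - m[j][i]) > tol:
            return False  # not symmetric
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((m[i][j], m[j][k], m[i][k]))
        if abs(largest - middle) > tol:
            return False  # the maximum is attained only once
    return True

# Hypothetical matrix in which every triple has its maximum attained twice.
ultra = [[0, 2, 6, 8],
         [2, 0, 6, 8],
         [6, 6, 0, 8],
         [8, 8, 8, 0]]

# Changing one pair of entries breaks the condition for the triple (0, 1, 2).
not_ultra = [[0, 2, 6, 8],
             [2, 0, 5, 8],
             [6, 5, 0, 8],
             [8, 8, 8, 0]]

print(is_ultrametric(ultra), is_ultrametric(not_ultra))
```

The check examines every triple, so it takes O(n^3) time, which is enough for illustrating the condition.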
Constructing ultrametric trees. There is a simple O(n^2) algorithm. [Adapted from D.Gusfield]
Constructing ultrametric trees. Which ultrametric tree corresponds to the given matrix?
Non-ultrametric data? One can try to find the "smallest change" that makes the data ultrametric. If values may only be decreased, there is a polynomial solution. If values may be decreased or increased, there is a polynomial solution that minimizes the maximum change. If values may only be increased, the problem is NP-complete.
Additive matrices and trees [Adapted from D.Gusfield]
Additive matrices and trees. A more "informal" definition: an n×n distance matrix is additive if it corresponds to an additive tree (phylogram), i.e., if one can build a tree with weighted edges whose vertex set contains {1,...,n} (all rows of the matrix) and in which, for every pair of rows i, j, the distance between leaves i and j is M(i,j).
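Additivity can be tested without building a tree. The sketch below uses the standard four-point condition, which is not stated on the slide: for every four rows i, j, k, l, the two largest of the sums M(i,j)+M(k,l), M(i,k)+M(j,l), M(i,l)+M(j,k) must be equal. The triangle inequality is checked as well, since it is the degenerate case of the same condition. The example matrix is hypothetical; it was derived from a four-leaf tree with pendant edges 2, 3, 4, 5 and an internal edge of length 1.

```python
from itertools import combinations, permutations

def is_additive(m, tol=1e-9):
    """Four-point check on a symmetric distance matrix with a zero diagonal."""
    n = len(m)
    for i, j, k in permutations(range(n), 3):
        if m[i][j] > m[i][k] + m[k][j] + tol:
            return False  # triangle inequality violated
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((m[i][j] + m[k][l],
                       m[i][k] + m[j][l],
                       m[i][l] + m[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False  # the two largest sums differ
    return True

# Hypothetical additive matrix (leaves a, b on one side of the internal edge, c, d on the other).
additive = [[0, 5, 7, 8],
            [5, 0, 8, 9],
            [7, 8, 0, 9],
            [8, 9, 9, 0]]

# Lowering one distance makes the three pair sums 14, 15 and 16.
not_additive = [[0, 5, 7, 8],
                [5, 0, 8, 8],
                [7, 8, 0, 9],
                [8, 8, 9, 0]]

print(is_additive(additive), is_additive(not_additive))
```

The first matrix is additive but not ultrametric: in the triple (a, b, c) the values 5, 7, 8 have a unique maximum.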
Additive trees - construction. Problem: given a symmetric n×n matrix D with zeros on the diagonal and positive values elsewhere, find an additive tree corresponding to D or determine that none exists. O(n^2) algorithms are known. The problem can be reduced to the problem of constructing ultrametric trees. Every ultrametric matrix is additive. An additive matrix is ultrametric if it has an additive tree in which one of the vertices is at the same distance from all leaves. If a compact additive tree exists for D, then it is the unique minimum spanning tree of the graph G(D).
Additive trees - construction [Adapted from D.Gusfield]
Rooting a tree • In an unrooted tree the direction of evolution is unknown • The root is the hypothesized ancestor of the sequences in the tree • The root can be placed either on a branch or at a node • You should start by viewing an unrooted tree [Adapted from C.Seoighe]
Rooting a tree - an example [Adapted from C.Seoighe]
Rooting a tree - an example [Adapted from C.Seoighe]
What to do with non-additive distance matrices? One can try to define the "best" possible tree and then construct it according to the chosen definition. All "reasonable" definitions of the best tree lead to an NP-complete problem. In practice only heuristic solutions are normally used, i.e., algorithms that seem "reasonable" but do not guarantee that the result has any particular properties. Fitch-Margoliash trees - "least squares" fit. To construct optimal trees a more or less exhaustive search is required; the algorithm provides a heuristic "approximation".
UPGMA method. UPGMA = Unweighted Pair Group Method with Arithmetic mean
       A      B      C      D      E
A      -    0.08   0.19   0.70   0.65
B             -    0.17   0.75   0.70
C                    -    0.80   0.60
D                           -    0.12
E                                  -
[Figure: tree on A, B, C, D, E with branches labelled 0.04, 0.04, 0.06, 0.06, 0.09, 0.09, 0.35, 0.35.]
• find the shortest distance, 0.08 • group OTUs (AB) • A and B each have branch length 0.04 (because the sum is 0.08) • find the next shortest distance, 0.12 (DE) - distance level 0.06 • find the next shortest distance, 0.17 (BC) - but B has been "used" • so d = (0.19 + 0.17) / 2 = 0.18 - distance level 0.09 • and finally d = (0.70+0.65+0.75+0.70+0.80+0.60) / 6 = 0.70 [Adapted from E.Willasen]
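The worked example above can be reproduced with a few lines of code. This is a minimal sketch of UPGMA, not the implementation of any particular package: the distance between two clusters is the arithmetic mean over all leaf pairs, and each merge is placed at half the merge distance. The function name and the data layout are assumptions; the distances are those of the slide.

```python
from itertools import combinations

def upgma(names, d):
    """Merge the two closest clusters until one remains.

    `d` maps frozenset({x, y}) of leaf names to their distance.
    Returns a list of (cluster1, cluster2, merge_distance, distance_level).
    """
    clusters = [(n,) for n in names]

    def avg(c1, c2):
        # Unweighted mean over all leaf pairs between the two clusters.
        return sum(d[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    steps = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(p[0], p[1]))
        dist = avg(c1, c2)
        steps.append((c1, c2, dist, dist / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return steps

names = ["A", "B", "C", "D", "E"]
d = {frozenset(pair): value for pair, value in {
    ("A", "B"): 0.08, ("A", "C"): 0.19, ("A", "D"): 0.70, ("A", "E"): 0.65,
    ("B", "C"): 0.17, ("B", "D"): 0.75, ("B", "E"): 0.70,
    ("C", "D"): 0.80, ("C", "E"): 0.60,
    ("D", "E"): 0.12,
}.items()}

for c1, c2, dist, level in upgma(names, d):
    print(c1, c2, round(dist, 2), round(level, 2))
```

The merges come out in the order (A,B), (D,E), (C with AB), and finally (DE with ABC), at distance levels 0.04, 0.06, 0.09 and 0.35.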
UPGMA trees are additive. [Same distance matrix and tree as on the preceding slide; branches labelled 0.04, 0.04, 0.06, 0.06, 0.09, 0.09, 0.35, 0.35.] • additive: distances between nodes can be summed • the distance from A to E is (0.04+0.09+0.35+0.35+0.06) = 0.89 [Adapted from E.Willasen]
UPGMA method. The result is an additive tree, but UPGMA will not construct the "correct" additive tree for an additive matrix :) [Adapted from R.Shamir]
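A small example makes the failure concrete. The sketch below builds an additive matrix from a hypothetical unrooted four-leaf tree with unequal branch lengths (all names and lengths are assumptions made for illustration) and then looks at the pair UPGMA would merge first, which is always the closest pair.

```python
from itertools import combinations

# Hypothetical tree: A and B attach to internal node u, C and D to internal node v.
pendant = {"A": 1, "B": 5, "C": 1, "D": 5}   # leaf -> length of its pendant edge
attach = {"A": "u", "B": "u", "C": "v", "D": "v"}
internal = 1                                  # length of the edge between u and v

def tree_distance(x, y):
    """Path length between two leaves of the hypothetical tree."""
    total = pendant[x] + pendant[y]
    if attach[x] != attach[y]:
        total += internal  # the path crosses the internal edge
    return total

dist = {pair: tree_distance(*pair) for pair in combinations("ABCD", 2)}

# UPGMA starts by merging the closest pair of leaves.
closest = min(dist, key=dist.get)
print(dist)
print("first UPGMA merge:", closest)
print("neighbours in the true tree:", attach[closest[0]] == attach[closest[1]])
```

The matrix is additive by construction, yet the closest pair is (A, C), two leaves that sit on opposite sides of the internal edge. The long branches leading to B and D break the constant-rate assumption on which UPGMA relies.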
UPGMA algorithm [Adapted from R.Shamir]