Pipeline to construct phylogenetic tree. Kou.
General ways to construct phylogenetic tree
1. Collect homologous sequences

>Arath6|AT1G01140.3
YEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDE
ARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYL
PFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWF
>Arath6|AT1G01450.1
YQVKKRLGNGSQYKEITWLGESFALRHFFGDIDALLPQITPLLSLSHPNIVYYLCGFTDEEKKECFLVMELMRKTLGMHIKEVCGPRKKNTLSLPVAVDL
MLQIALGMEYLHSKRIYHGELNPSNILVKPRSNQSGDGYLLGKIFGFGLNSVKGFSSKSASLTSQNENFPFIWYSPEVLEEQEQSGTAGSLKYSDKSDVY
SFGMVSFELLTGKVPFEDSHLQGDKMSRNIRAGERPLFPFNSPKFITNLTKRCWHADPNQRPTFSSISRILRYI
>Arath6|AT1G01540.2
LCEENVIGEGGYGIVYRGILTDGTKVAVKNLLNNRGQAEKEFKVEVEVIGRVRHKNLVRLLGYCVEGAYRMLVYDFVDNGNLEQWIHGDVGDVSPLTWDI
RMNIILGMAKGLAYLHEGLEPKVVHRDIKSSNILLDRQWNAKVSDFGLAKLLGSESSYVTTRVMGTFGYVAPEYACTGMLNEKSDIYSFGILIMEIITGR
NPVDYSRPQGETNLVDWLKSMVGNRRSEEVVDPKIPEPPSSKALKRVLLVALRCVDPDANKRPKMGHIIHML
>Arath6|AT1G01560.1
VPPLRPIGRGASGIVCAAWNSETGEEVAIKKIGNAFGNIIDAKRTLREIKLLKHMDHDNVIAIIDIIRPPQPDNFNDVHIVYELMDTDLHHIIRSNQPLT
DDHSRFFLYQLLRGLKYVHSANVLHRDLKPSNLLLNANCDLKIGDFGLARTKSETDFMTEYVVTRWYRAPELLLNCSEYTAAIDIWSVGCILGEIMTREP
LFPGRDYVQQLRLITEVNFSLFHLTILFRFNL
>Arath6|AT1G01740.1
ENVVSEHGETAPNVVYQGKLENHMKIAIKRFSGTAWPDPRQFLEEARLVGQLRSKRMANLLGYCCEGGERLLVAEFMPNETLAKHLFHWDTEPMKWAMRL
RVALYISEALEYCSNNGHTLYHDLNAYRVLFDEECNPRLSTFGLMKNSRDGKSYSTNLAFTPPEYLRTGRITAESVIYSFGTLLLDLLTGKHIPPSHALD
LIRDRNLQTLTDSCLEGQFSDSDGTELVRLTSCCLQYEARERPNIKSLVTAL
>Arath6|AT1G02970.1
FHEIRQIGAGHFSRVFKVLKRMDGCLYAVKHSTRKLYLDSERRKAMMEVQALAALGFHENIVGYYSSWFENEQLYIQLELCDHSLSALPKKSSLKVSERE
ILVIMHQIAKALHFVHEKGIAHLDVKPDNIYIKNGVCKLGDFGCATRLDKSLPVEEGDARYMPQEILNEDYEHLDKVDIFSLGVTVYELIKGSPLTESRN
QSLNIKEGKLPLLPGHSLQLQQLLKTMMDRDPKRRPSARELLDHPMF
>Arath6|AT1G03740.1
FEKLEKIGQGTYSSVYRARDLLHNKIVALKKVRFDLNDMESVKFMAREIIVMRRLDHPNVLKLEGLITAPVSSSLYLVFEYMDHDLLGLSSLPGVKFTEP
QVKCYMRQLLSGLEHCHSRGVLHRDIKGSNLLIDSKGVLKIADFGLATFFDPAKSVSLTSHVVTLWYRPPELLLGASHYGVGVDLWSTGCILGELYAGKP
ILPGKTEVEQLHKIFKLCGSPTENYWRKQKLPSSAGFKTAIPYRRKVSEMFKDFPASVLSLLETLLSIDPDHRSSADRALESEYF
>Arath6|AT1G03920.1
FELLTMIGKGAFGEVRVVREINTGHVFAMKKLKKSEMLRRGQVEHVRAERNLLAEVDSNCIVKLYCSFQDNEYLYLIMEYLPGGDMMTLLMRKDTLSEDE
AKFYIAESVLAIESIHNRNYIHRDIKPDNLLLDRYGHLRLSDFGLCKPLDCSVIDGEDFTVGNAGSGGGSESVSTTPKRSQQEQLEHWQKNRRMLAYSTV
GTPDYIAPEVLLKKGYGMECDWWSLGAIMYEMLVGYPPFYADDPMSTCRKIVNWKTHLKFPEESRLSRGARDLIGKLLCSVNQRLGSTGASQIKAHPWF
2. Construct multiple alignment

Arath6|AT1G01560.1 VPPLRPIGRGASGIVCAAWNSETGEEVAIKKIG-NAFGNIIDAKRTLREIKLLKHMDHDN
Arath6|AT1G03740.1 FEKLEKIGQGTYSSVYRARDLLHNKIVALKKVR-FDLNDMESVKFMAREIIVMRRLDHPN
Arath6|AT1G01140.3 YEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPN
Arath6|AT1G03920.1 FELLTMIGKGAFGEVRVVREINTGHVFAMKKLKKSEMLRRGQVEHVRAERNLLAEVDSNC
Arath6|AT1G02970.1 FHEIRQIGAGHFSRVFKVLKRMDGCLYAVKHSTRKLYLDSERRKAMMEVQALAALGFHEN
Arath6|AT1G01540.2 LCEENVIGEGGYGIVYRGILTDGTKVAVKNLLN----NRGQAEKEFKVEVEVIGRVRHKN
Arath6|AT1G01740.1 ENVVSEHGETAPNVVYQGKLENHMKIAIKRFSG----TAWPDPRQFLEEARLVGQLRSKR
Arath6|AT1G01450.1 YQVKKRLGNGSQ---YKEITWLGESFALRHFFG--------DIDALLPQITPLLSLSHPN

Arath6|AT1G01560.1 VIAIIDIIRPPQPDNFNDVHIVYELMDTD-----LHHIIRS-NQPLTDDHSRFFLYQLLR
Arath6|AT1G03740.1 VLKLEGLITAPVS---SSLYLVFEYMDHD-----LLGLSSLPGVKFTEPQVKCYMRQLLS
Arath6|AT1G01140.3 VVEIIEVMASKTK-----IYIVLELVNGG-----ELFDKIAQQGRLKEDEARRYFQQLIN
Arath6|AT1G03920.1 IVKLYCSFQDNEY-----LYLIMEYLPGG-----DMMTLLMRKDTLSEDEAKFYIAESVL
Arath6|AT1G02970.1 IVGYYSSWFENEQ-----LYIQLELCDHS-----LSALPKKSSLKVSEREILVIMHQIAK
Arath6|AT1G01540.2 LVRLLGYCVEGAYR-----MLVYDFVDNGN-LEQWIHGDVGDVSPLTWDIRMNIILGMAK
Arath6|AT1G01740.1 MANLLGYCCEGGER-----LLVAEFMPNET---LAKHLFHWDTEPMKWAMRLRVALYISE
Arath6|AT1G01450.1 IVYYLCGFTDEEKK---ECFLVMELMRKTLGMHIKEVCGPRKKNTLSLPVAVDLMLQIAL

Arath6|AT1G01560.1 GLKYVHS---ANVLHRDLKPSNLLLNANCDLKIGDFGLAR--------------------
Arath6|AT1G03740.1 GLEHCHS---RGVLHRDIKGSNLLIDSKGVLKIADFGLATFF------------------
Arath6|AT1G01140.3 AVDYCHS---RGVYHRDLKPENLILDANGVLKVSDFGLSAFSR-----------------
Arath6|AT1G03920.1 AIESIHN---RNYIHRDIKPDNLLLDRYGHLRLSDFGLCKPLDCSVIDGEDFTVGNAGSG
Arath6|AT1G02970.1 ALHFVHE---KGIAHLDVKPDNIYIKN-GVCKLGDFGCAT--------------------
Arath6|AT1G01540.2 GLAYLHEGLEPKVVHRDIKSSNILLDRQWNAKVSDFGLAKLLG-----------------
Arath6|AT1G01740.1 ALEYCSN--NGHTLYHDLNAYRVLFDEECNPRLSTFGLMK--------------------
Arath6|AT1G01450.1 GMEYLHS---KRIYHGELNPSNILVKPRSNQSGDGYLLGKIFGFGLNSVKG---------
Gapped positions were removed

Arath6|AT1G01560.1 VPPLRPIGRGASCAAWNSETGEEVAIKKIGDAKRTLREIKLLKHMDHDN
Arath6|AT1G03740.1 FEKLEKIGQGTYYRARDLLHNKIVALKKVRSVKFMAREIIVMRRLDHPN
Arath6|AT1G01140.3 YEMGRTLGEGSFKYAKNTVTGDQAAIKILDMVEQLKREISTMKLIKHPN
Arath6|AT1G03920.1 FELLTMIGKGAFRVVREINTGHVFAMKKLKQVEHVRAERNLLAEVDSNC
Arath6|AT1G02970.1 FHEIRQIGAGHFFKVLKRMDGCLYAVKHSTRRKAMMEVQALAALGFHEN
Arath6|AT1G01540.2 LCEENVIGEGGYYRGILTDGTKVAVKNLLNAEKEFKVEVEVIGRVRHKN
Arath6|AT1G01740.1 ENVVSEHGETAPYQGKLENHMKIAIKRFSGDPRQFLEEARLVGQLRSKR
Arath6|AT1G01450.1 YQVKKRLGNGSQYKEITWLGESFALRHFFGDIDALLPQITPLLSLSHPN

Arath6|AT1G01560.1 VIAIIDIIRPPQPHIVYELMDTDLHHIIRS-NQPLTDDHSRFFLYQLLR
Arath6|AT1G03740.1 VLKLEGLITAPVSYLVFEYMDHDLLGLSSLPGVKFTEPQVKCYMRQLLS
Arath6|AT1G01140.3 VVEIIEVMASKTKYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLIN
Arath6|AT1G03920.1 IVKLYCSFQDNEYYLIMEYLPGGDMMTLLMRKDTLSEDEAKFYIAESVL
Arath6|AT1G02970.1 IVGYYSSWFENEQYIQLELCDHSLSALPKKSSLKVSEREILVIMHQIAK
Arath6|AT1G01540.2 LVRLLGYCVEGAYMLVYDFVDNGWIHGDVGDVSPLTWDIRMNIILGMAK
Arath6|AT1G01740.1 MANLLGYCCEGGELLVAEFMPNEAKHLFHWDTEPMKWAMRLRVALYISE
Arath6|AT1G01450.1 IVYYLCGFTDEEKFLVMELMRKTKEVCGPRKKNTLSLPVAVDLMLQIAL

Arath6|AT1G01560.1 GLKYVHSANVLHRDLKPSNLLLNANCDLKIGDFGLAR
Arath6|AT1G03740.1 GLEHCHSRGVLHRDIKGSNLLIDSKGVLKIADFGLAT
Arath6|AT1G01140.3 AVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSA
Arath6|AT1G03920.1 AIESIHNRNYIHRDIKPDNLLLDRYGHLRLSDFGLCK
Arath6|AT1G02970.1 ALHFVHEKGIAHLDVKPDNIYIKN-GVCKLGDFGCAT
Arath6|AT1G01540.2 GLAYLHEPKVVHRDIKSSNILLDRQWNAKVSDFGLAK
Arath6|AT1G01740.1 ALEYCSNGHTLYHDLNAYRVLFDEECNPRLSTFGLMK
Arath6|AT1G01450.1 GMEYLHSKRIYHGELNPSNILVKPRSNQSGDGYLLGK

The same regions were compared in all pairs
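Removing gapped positions can be sketched in a few lines of Python. This is only an illustration, not the tool used for the slides: the function name remove_gapped_columns is hypothetical, and the input is the first 18 columns of three rows from the alignment in step 2.

```python
def remove_gapped_columns(alignment):
    """Keep only the columns in which no sequence has a gap ('-').

    `alignment` maps sequence names to aligned strings of equal length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Column indices where every row has a residue.
    keep = [i for i in range(len(rows[0])) if all(row[i] != "-" for row in rows)]
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


# First 18 alignment columns of three sequences from the slides.
aln = {
    "Arath6|AT1G01560.1": "VPPLRPIGRGASGIVCAA",
    "Arath6|AT1G03740.1": "FEKLEKIGQGTYSSVYRA",
    "Arath6|AT1G01450.1": "YQVKKRLGNGSQ---YKE",
}
trimmed = remove_gapped_columns(aln)
for name, seq in trimmed.items():
    print(name, seq)
```

Columns 13 to 15 are dropped from every sequence because one row has gaps there, so all remaining columns are comparable across all pairs.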
3. Construct phylogenetic tree
1. Distance-matrix-based tree
• Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
• Neighbor Joining method (NJ)
2. Character-based tree
• Maximum parsimony tree
• Maximum likelihood tree
• Bayesian tree
Evolutionary distance = the number of substitutions per site (nucleotide or amino acid)
p-distance = the number of sites that differ / the total number of sites compared
Poisson distance = -ln(1 - p) (a simple correction for multiple substitutions at the same site)

Pairwise distance matrix (columns are in the same order as the rows):

Arath6|AT1G01140.3  0.000000 1.950179 2.157422 1.590441 2.646174 2.196675 1.780283 1.583913
Arath6|AT1G01450.1  1.950179 0.000000 1.987557 2.273655 2.420061 1.829983 2.173979 2.279700
Arath6|AT1G01540.2  2.157422 1.987557 0.000000 2.243251 1.680676 2.234272 2.228696 2.297521
Arath6|AT1G01560.1  1.590441 2.273655 2.243251 0.000000 2.339071 2.299861 1.285121 1.969318
Arath6|AT1G01740.1  2.646174 2.420061 1.680676 2.339071 0.000000 3.066779 2.690132 2.376925
Arath6|AT1G02970.1  2.196675 1.829983 2.234272 2.299861 3.066779 0.000000 1.984833 2.233078
Arath6|AT1G03740.1  1.780283 2.173979 2.228696 1.285121 2.690132 1.984833 0.000000 2.671995
Arath6|AT1G03920.1  1.583913 2.279700 2.297521 1.969318 2.376925 2.233078 2.671995 0.000000
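The two distance formulas can be written out as a minimal Python sketch. The function names are hypothetical, the inputs are assumed to be gap-free aligned sequences of equal length, and the example uses two 9-residue fragments from the gap-free alignment; it does not reproduce the values in the matrix.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def poisson_distance(p):
    """Poisson-corrected distance, d = -ln(1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


p = p_distance("GLKYVHSAN", "GLEHCHSRG")  # 5 of the 9 sites differ
d = poisson_distance(p)
print(round(p, 4), round(d, 4))
```

The correction matters more as p grows: for small p the two values are almost equal, while d grows without bound as p approaches 1.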
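As an illustration of a distance-matrix-based method, here is a minimal UPGMA sketch in plain Python. It is not the program used for the slides. It returns only the tree topology in Newick-like form (no branch lengths), the function name upgma is hypothetical, and the input is a four-sequence subset of the distance matrix with the "Arath6|" prefix dropped from the names.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as a Newick-like string.

    `dist` is a symmetric matrix (list of lists) in the same order as `names`.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest distance.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        new_distances = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted mean equals the average over all member pairs.
            new_distances[(c, next_id)] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_distances)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


names = ["AT1G01140.3", "AT1G01560.1", "AT1G03740.1", "AT1G03920.1"]
dist = [
    [0.000000, 1.590441, 1.780283, 1.583913],
    [1.590441, 0.000000, 1.285121, 1.969318],
    [1.780283, 1.285121, 0.000000, 2.671995],
    [1.583913, 1.969318, 2.671995, 0.000000],
]
tree = upgma(names, dist)
print(tree)
```

The first join is the pair with the smallest distance in the matrix (AT1G01560.1 and AT1G03740.1, 1.285121). Neighbor Joining uses the same kind of input but a different joining criterion, so it can give a different tree.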
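How to obtain a statistical support value for each branch (bootstrap method)
1. Randomly reconstruct an alignment from the original alignment by choosing sites at random (the same site can be chosen more than once)

Original alignment
Site 123456789
A1   GLKYVHSAN
A2   GLEHCHSRG
A3   AVDYCHSRG
A4   ………….
A5   ………….
A6   ………….

Resampled alignment
Site 456346897
A1   YVHKYHANS
A2   HCHEHHRGS
A3   YCHDYHRGS
A4   ………….
A5   ………….
A6   ………….

2. Construct a tree from each resampled alignment, 1000 times

The value on each branch is the percentage of the 1000 trees in which that branch appears:
324/1000 = 32.4
789/1000 = 78.9
989/1000 = 98.9

(Figure: example trees of A1 to A6 with bootstrap values on the branches)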
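The resampling step can be sketched as follows. This is a minimal illustration with hypothetical function names; it only builds resampled alignments and converts counts to percentages, and leaves tree construction and branch counting to whatever tree-building program is used.

```python
import random


def resample(alignment, sites):
    """Build a new alignment from the given 0-based column indices."""
    return {name: "".join(seq[i] for i in sites) for name, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, at random with replacement."""
    length = len(next(iter(alignment.values())))
    sites = [rng.randrange(length) for _ in range(length)]
    return resample(alignment, sites)


def support_percent(times_found, replicates):
    """Bootstrap value of a branch as a percentage of replicates."""
    return 100.0 * times_found / replicates


original = {"A1": "GLKYVHSAN", "A2": "GLEHCHSRG", "A3": "AVDYCHSRG"}

# The site order shown on the slide (1-based), converted to 0-based indices.
slide_sites = [4, 5, 6, 3, 4, 6, 8, 9, 7]
fixed = resample(original, [s - 1 for s in slide_sites])
print(fixed["A2"], fixed["A3"])

# One random replicate; the seed is arbitrary and only makes the run repeatable.
replicate = bootstrap_replicate(original, random.Random(0))
print(replicate)
print(support_percent(324, 1000))
```

In a full run, bootstrap_replicate would be called 1000 times, a tree built from each replicate, and support_percent applied to the number of trees containing each branch.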
Our approach to construct phylogenetic tree
1. Construct multiple alignment
2. Estimate evolutionary distance in each pair

The distances are estimated from the gapped multiple alignment shown under "2. Construct multiple alignment" above, without removing the gapped positions. For each pair, only the sites where both sequences have a residue are compared.

Problem: The same regions were not compared in all pairs
Why do we estimate evolutionary distance in each pair?
Because we have alignments like the following:

seq44  LGKNGRSVSS----YSFTTDLRTFSYKG
seq46  --------------RQPAGLQIFDGYGR
seq47  --------------YIYSHGLQIFDGYG
seq107 LDQDDDEHQP------------------
seq43  LDKNGRSVQL----YSKLYPLHTFTEKG
seq45  LDRARKPSYS----RSSGKADVFLNTGW
seq106 LAAAHDGSHP--PPP--RSDQSTATRS-
seq55  VDPTQKLRPS-----EMKTVSHKFHKPG

• If gapped positions are removed, no sites remain (every column has a gap in at least one sequence), so we cannot estimate evolutionary distance
• If sequences are compared in each pair, we can estimate evolutionary distance

Estimate evolutionary distance for each pair:

seq44  LGKNGRSVSS----YSFTTDLRTFSYKG
seq46  --------------RQPAGLQIFDGYGR

seq44  LGKNGRSVSS----YSFTTDLRTFSYKG
seq47  --------------YIYSHGLQIFDGYG

seq44  LGKNGRSVSS----YSFTTDLRTFSYKG
seq107 LDQDDDEHQP------------------
…………….
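Comparing each pair separately can be sketched as below. This is a minimal illustration, not the slides' implementation: the function names are hypothetical, the Poisson correction is assumed as the distance measure, and None is returned when a pair shares no sites or when every shared site differs.

```python
import math


def shared_sites(seq_a, seq_b):
    """Residue pairs at the columns where neither sequence has a gap."""
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]


def pairwise_poisson_distance(seq_a, seq_b):
    """Poisson distance over the sites shared by this pair only."""
    pairs = shared_sites(seq_a, seq_b)
    if not pairs:
        return None  # no overlapping region, distance undefined
    p = sum(1 for a, b in pairs if a != b) / len(pairs)
    if p >= 1.0:
        return None  # every shared site differs, -ln(0) is undefined
    return -math.log(1.0 - p)


seq44 = "LGKNGRSVSS----YSFTTDLRTFSYKG"
seq46 = "--------------RQPAGLQIFDGYGR"
seq107 = "LDQDDDEHQP------------------"

print(len(shared_sites(seq44, seq46)), pairwise_poisson_distance(seq44, seq46))
print(len(shared_sites(seq44, seq107)), pairwise_poisson_distance(seq44, seq107))
```

seq44 is compared with seq46 over the last 14 columns and with seq107 over the first 10 columns, so each pair gives a distance, but the two distances come from different regions.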
However, we could not obtain evolutionary distances with our approach, because we have alignments like the following:

Chlre3|142771 -----------G--------------AGHGAAAPARASDVLQ--------
Chlre3|143360 -LQEAI--------------------------------------------
Chlre3|144026 ------------------SSEQLGR-GVSG-----------T--------
Chlre3|144070 -----V-----------------------G--------------------
Chlre3|144460 -KYSQV----YLARETH---------------------------------

Even if sequences are compared in each pair, we cannot estimate evolutionary distance.
Also, if the data set is very large, it is impossible to make a multiple alignment.

New approach
• Make pairwise alignments
• Estimate evolutionary distance in each pairwise alignment
• Construct phylogenetic tree

Problem: Totally different regions are sometimes compared in each pair. We cannot estimate bootstrap values.
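The slides do not say which pairwise alignment method is used. As one possible illustration, here is a minimal global alignment (Needleman-Wunsch) sketch with a simple match/mismatch/gap score. The scoring values are assumptions chosen for brevity; a real protein alignment would normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Two ungapped fragments taken from the sequences shown earlier.
aligned_a, aligned_b, best = needleman_wunsch("GLKYVHSANVLHRD", "GLAYLHEGLEPKVVHRD")
print(aligned_a)
print(aligned_b)
print(best)
```

Each pair is aligned independently, which avoids building one large multiple alignment. The cost is the one named above: the columns of different pairwise alignments do not correspond to each other, so sites cannot be resampled consistently for a bootstrap.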