Multiple Sequence Alignment (MSA) and Phylogeny
Remarks About Homework • Write detailed answers • Pay attention to details in the questions • “… nor can the shy man learn…”
MSA input: multiple sequence Fasta file
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLT
KGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLT
LTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSI
VYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPL
HLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
>gi|57113961|ref|NP_001009043.1| CD4 antigen [Pan troglodytes]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQTKILGNQGSFLT
KGPSKLNDRVDSRRSLWDQGNFTLIIKNLKIEDSDTYICEVGDQKEEVQLLVFGLTANSDTHLLQGQSLT
LTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSI
VYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPL
HLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAQRMSQIKRLLSEKKTCQCPHRFQKTCSPI
>gi|50054438|ref|NP_001001908.1| CD4 antigen [Sus scrofa]
MDPGTSLRHLFLVLQLAMLPAASGTQEKYLVLGKAGDLAELPCHSSQKKNLPFNWKNSNQTKILGGHGSF
WHTASVTELTSRLDSKKNMWDHGSFPLIIKNLEVTDSGIYICEVEDKRIEVQLLVFRLTASVTRVLLGQS
LTLTLEGPSGSHPTVQWKGPGNKSKNDVKSLLLPQVGLEDSGLWTCTVSQDQKTLVFRSNIFVLAFQKVP
STVYVKEGDQVALSFPLTFEAESLSGELMWRQTKGASSPQSWITFSLKDRKVTVQKSLQNLKLRMAEKLP
LQITLLQALPQYAGSGNLTLVLPEGRLHREVNLVVMRATQSKNEVTCEVLGPTPPKVVLSLKLGNQSMKV
SDQQKLVTVLDPEAGMWRCLLRDKDKVLLESQVEVLPTAFTRAWPELLASVIGGIIGLLFLAGFCIACVK
CWHRRRRAERMSQIKRLLSEKKTCQCAHRQQKNYSLT
>gi|6978631|ref|NP_036837.1| Cd4 molecule [Rattus norvegicus]
MCRGFSFRHLLPLLLLQLSKLLVVTQGKTVVLGKEGGSAELPCESTSRRSASFAWKSSDQKTILGYKNKL
LIKGSLELYSRFDSRKNAWERGSFPLIINKLRMEDSQTYVCELENKKEEVELWVFRVTFNPGTRLLQGQS
LTLILDSNPKVSDPPIECKHKSSNIVKDSKAFSTHSLRIQDSGIWNCTVTLNQKKHSFDMKLSVLGFAST
SITAYKSEGESAEFSFPLNLGEESLQGELRWKAEKAPSSQSWITFSLKNQKVSVQKSTSNPKFQLSETLP
LTLQIPQVSLQFAGSGNLTLTLDRGILYQEVNLVVMKVTQPDSNTLTCEVMGPTSPKMRLILKQENQEAR
VSRQEKVIQVQAPEAGVWQCLLSEGEEVKMDSKIQVLSKGLNQTMFLAVVLGSAFSFLVFTGLCILFCVR
CRHQQRQAARMSQIKRLLSEKKTCQCSHRMQKSHNLI
Uploaded sequences A little unclear…
Edit Fasta headers…
Replace each long header with a short one (the sequence lines are unchanged):
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens] → >Homo_sapiens_CD4
>gi|57113961|ref|NP_001009043.1| CD4 antigen [Pan troglodytes] → >Pan_troglodytes_CD4
>gi|50054438|ref|NP_001001908.1| CD4 antigen [Sus scrofa] → >Sus_scrofa_CD4
>gi|6978631|ref|NP_036837.1| Cd4 molecule [Rattus norvegicus] → >Rattus_norvegicus_CD4
Uploaded sequences Much better
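For more than a handful of sequences, the header edit can be scripted rather than done by hand. The following is a minimal sketch, assuming every header ends with the organism name in square brackets (as the NCBI-style headers on these slides do). The function names `short_header` and `rename_headers` and the `gene` parameter are illustrative, not part of any tool used in the course.

```python
import re


def short_header(header: str, gene: str = "CD4") -> str:
    """Turn a long NCBI-style header into '>Genus_species_GENE'.

    Assumption: the organism name sits in [square brackets] at the end.
    Headers that do not match are returned unchanged.
    """
    match = re.search(r"\[([^\]]+)\]\s*$", header)
    if match is None:
        return header
    organism = match.group(1).strip().replace(" ", "_")
    return f">{organism}_{gene}"


def rename_headers(fasta_text: str, gene: str = "CD4") -> str:
    """Rewrite only the header lines (those starting with '>') of a Fasta text."""
    lines = []
    for line in fasta_text.splitlines():
        lines.append(short_header(line, gene) if line.startswith(">") else line)
    return "\n".join(lines)


if __name__ == "__main__":
    example = ">gi|6978631|ref|NP_036837.1| Cd4 molecule [Rattus norvegicus]\nMCRGFSFRHLLPLLLLQLSK"
    print(rename_headers(example))
```

Sequence lines pass through untouched, so the edited file aligns exactly as the original would; only the labels shown in the alignment and the tree change.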
The Newick tree format is used to represent trees as strings. A tree in which A groups with C and B groups with D is written in Newick format as: ((A,C),(B,D)); • Each pair of parentheses () encloses a clade in the tree • A comma “,” separates the members of the corresponding clade • A semicolon “;” is always the last character
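The three rules can be illustrated with a short sketch that writes a tree as a Newick string. It assumes a tree is held as nested Python tuples whose leaves are plain strings, and it ignores branch lengths and support values; `to_newick` is a hypothetical helper name.

```python
def to_newick(tree) -> str:
    """Serialize a nested-tuple tree (leaves are strings) as a Newick string."""

    def fmt(node) -> str:
        if isinstance(node, str):
            return node  # a leaf is written as its name
        # a clade: parentheses around its comma-separated members
        return "(" + ",".join(fmt(child) for child in node) + ")"

    return fmt(tree) + ";"  # the semicolon is always the last character


if __name__ == "__main__":
    print(to_newick((("A", "C"), ("B", "D"))))
```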
Step 4: View tree with NJPlot Note: unrooted tree
[Figure: several drawings of the same tree with leaves A, B, and C, marked as equal (=)]
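Rooted vs. unrooted trees [Figure: one unrooted tree with leaves A, B, C and branches numbered 1, 2, 3, next to three rooted trees numbered 1, 2, 3 that are marked as different (≠) from one another]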
How would each tree look in Newick format?
Unrooted tree: (A,B,C)
Rooted tree 1: ((C,B),A)
Rooted tree 2: ((A,C),B)
Rooted tree 3: ((A,B),C)
Step 4: View tree with NJPlot Note: The order inside a split doesn’t matter
[Figure: four drawings of the same Gorilla, Human, Chimp tree with the branches flipped] (Gorilla,(Human,Chimp)) = (Gorilla,(Chimp,Human)) = ((Human,Chimp),Gorilla) = ((Chimp,Human),Gorilla)
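One way to check that two Newick strings describe the same rooted tree is to sort the members of every clade before comparing. The sketch below assumes label-only Newick (no branch lengths or bootstrap values) and compares trees as rooted; `parse_newick`, `canonical`, and `same_rooted_tree` are illustrative names.

```python
def parse_newick(text: str):
    """Parse a label-only Newick string into nested tuples of leaf names."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]  # a leaf name

    return parse_node()


def canonical(node):
    """Return the tree with the members of every clade in a fixed order."""
    if isinstance(node, str):
        return node
    # repr gives a common sort key for leaves (str) and sub-clades (tuple)
    return tuple(sorted((canonical(child) for child in node), key=repr))


def same_rooted_tree(a: str, b: str) -> bool:
    return canonical(parse_newick(a)) == canonical(parse_newick(b))


if __name__ == "__main__":
    print(same_rooted_tree("(Gorilla,(Human,Chimp));", "((Chimp,Human),Gorilla);"))
    print(same_rooted_tree("((A,B),C);", "((A,C),B);"))
```

The first comparison reports the trees as equal because only the order inside the splits differs; the second reports them as different because the groupings themselves differ.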
How robust is our tree? • We need some statistical way to estimate the confidence in the tree topology (like we need the E-value to estimate the confidence of a BLAST hit) • But we don’t know anything about the distribution of tree topologies • The only data source we have is our data (MSA) • So, we must rely on our own resources: “pull up by your own bootstraps”
Bootstrap 1. Create n (100-1000) new MSAs (pseudo-datasets) by randomly sampling K positions from our original MSA with replacement
Original MSA (positions 1 2 3 4 5 … K):
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
4 : ACCTA…T
Pseudo-dataset (sampled positions 1 1 2 4 4 … 3):
1 : AATTT…C
2 : AATTT…C
3 : AACTT…T
4 : AACTT…C
Pseudo-dataset (sampled positions 9 7 4 7 8 … 10):
1 : TTTTA…T
2 : CATAC…A
3 : CATAC…T
4 : AGTGG…A
Pseudo-dataset (sampled positions 5 1 5 7 8 … 12):
1 : GAGTA…T
2 : GAGAC…G
3 : AAAAC…A
4 : AAAGG…C
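The resampling step can be sketched in a few lines. This is a minimal illustration, assuming the MSA is a dict mapping sequence names to equal-length aligned strings; the function names and the `seed` parameter are hypothetical, and column indices are 0-based in the code (the slide counts from 1).

```python
import random


def sample_columns(k: int, rng: random.Random) -> list:
    """Draw k column indices from range(k) with replacement."""
    return [rng.randrange(k) for _ in range(k)]


def resample(msa: dict, columns: list) -> dict:
    """Build a pseudo-dataset by taking the same columns from every sequence."""
    return {name: "".join(seq[i] for i in columns) for name, seq in msa.items()}


def bootstrap_replicates(msa: dict, n: int, seed: int = 0) -> list:
    """Create n pseudo-datasets, each as wide as the original MSA."""
    rng = random.Random(seed)
    k = len(next(iter(msa.values())))  # number of alignment columns
    return [resample(msa, sample_columns(k, rng)) for _ in range(n)]


if __name__ == "__main__":
    # First five columns of the original MSA on the slide
    msa = {"1": "ATCTG", "2": "ATCTG", "3": "ACTTA", "4": "ACCTA"}
    # Slide positions 1 1 2 4 4 correspond to 0-based indices 0 0 1 3 3
    print(resample(msa, [0, 0, 1, 3, 3]))
    print(len(bootstrap_replicates(msa, 100)))
```

Whole columns are sampled, so every sequence in a pseudo-dataset receives the same positions and the alignment stays intact.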
Bootstrap 2. Reconstruct a pseudo-tree from each pseudo-dataset using the same method used for reconstructing the original tree [Figure: the original tree over Sp1, Sp2, Sp3, Sp4 and, for each pseudo-dataset (sampled positions 1 1 2 4 4 … 3, 9 7 4 7 8 … 10, and 5 1 5 7 8 … 12), a pseudo-tree over the same four species]
Bootstrap 3. For each node in our original tree, we count the number of times it appeared in the pseudo-trees [Figure: the pseudo-trees over Sp1, Sp2, Sp3, Sp4 and the original tree with its nodes labeled 67% and 100%]
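The counting step can be sketched as follows, assuming each tree is a nested tuple of leaf names and that a node is identified by the set of leaves below it. This treats the trees as rooted; for unrooted trees one would compare bipartitions (splits) instead. The names `clades` and `bootstrap_support` are illustrative, and the three pseudo-trees in the example are made up for the demonstration.

```python
def clades(tree) -> set:
    """Return every clade of a nested-tuple tree as a frozenset of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)  # record the clade rooted at this internal node
        return members

    leaves(tree)
    return found


def bootstrap_support(original, pseudo_trees) -> dict:
    """Percentage of pseudo-trees that contain each clade of the original tree."""
    pseudo_clades = [clades(t) for t in pseudo_trees]
    return {
        clade: 100.0 * sum(clade in pc for pc in pseudo_clades) / len(pseudo_clades)
        for clade in clades(original)
    }


if __name__ == "__main__":
    original = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
    pseudo = [
        (("Sp1", "Sp2"), ("Sp3", "Sp4")),
        (("Sp2", "Sp1"), ("Sp4", "Sp3")),
        (("Sp1", "Sp3"), ("Sp2", "Sp4")),
    ]
    for clade, pct in bootstrap_support(original, pseudo).items():
        print(sorted(clade), round(pct))
```

The clade containing all leaves is present in every tree by construction, so its 100% carries no information and is usually not reported.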
Bootstrap values on NJPlot Note: ClustalX saves trees with the .ph extension. Trees with bootstrap values are saved with the .phb extension
Darwin’s vision of the tree of life from the Origin of Species
Based on molecular data (SSU rRNA), the branching order of several kingdoms remains in dispute
Lateral Gene Transfer (LGT) Challenges the Conceptual Basis of Phylogenetic Classification
Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science, 3 March 2006: Vol. 311, no. 5765, pp. 1283-1287
Methodology • Started with 36 genes universally present in 191 species (spanning all 3 domains of life), for which orthologs could be unambiguously identified • Eliminated 5 genes that are LGT suspects (mostly tRNA synthetases) • Constructed an MSA for each of the 31 orthogroups • Concatenated all 31 MSAs into a super-MSA of 8090 columns • The phylogeny was reconstructed based on the super-MSA using the maximum likelihood approach
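The concatenation step can be sketched as below. This is only an illustration of the idea, not the pipeline used in the paper: each MSA is assumed to be a dict mapping species names to aligned strings of equal length, and padding a missing species with gap characters is an assumption made here for robustness (the slide states that the genes are present in all species). `concatenate_msas` is a hypothetical name.

```python
def concatenate_msas(msas: list, gap: str = "-") -> dict:
    """Join per-gene MSAs side by side into one super-MSA per species."""
    species = sorted({name for msa in msas for name in msa})
    parts = {name: [] for name in species}
    for msa in msas:
        width = len(next(iter(msa.values())))  # columns in this gene's MSA
        for name in species:
            # Assumption: a species absent from a gene is filled with gaps
            parts[name].append(msa.get(name, gap * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


if __name__ == "__main__":
    gene1 = {"Sp1": "AC", "Sp2": "AG"}
    gene2 = {"Sp1": "T-T", "Sp2": "TTT"}
    print(concatenate_msas([gene1, gene2]))
```

The width of the super-MSA is the sum of the widths of the individual MSAs, which is how 31 alignments add up to a single alignment of 8090 columns.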
Archaea Eukaryota Bacteria
Tree support • 81.7% of the branches show bootstrap support of over 80% • 65% of the branches show bootstrap support of 100% • However, several deep branchings show low support