T-COFFEE, a novel method for combining biological information. Cédric Notredame
Potential Uses of A Multiple Sequence Alignment?

chite  ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat  --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr  KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse  -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
       ***. ::: .: .. . : . . * . *: *

chite  AATAKQNYIRALQEYERNGG-
wheat  ANKLKGEYNKAIAAYNKGESA
trybr  AEKDKERYKREM---------
mouse  AKDDRIRYDNEMKSWEEQMAE
       * : .* . :

Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques.
Uses: Extrapolation, Phylogeny, Motifs/Patterns, Struc. Prediction, Profiles
Why Is It Difficult To Compute A Multiple Sequence Alignment? A CROSSROAD PROBLEM
BIOLOGY: What is A Good Alignment
COMPUTATION: What is THE Good Alignment

chite  ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat  --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr  KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse  -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
       ***. ::: .: .. . : . . * . *: *
Why Is It Difficult To Compute A Multiple Sequence Alignment? BIOLOGY, COMPUTATION. CIRCULAR PROBLEM.... Good Sequences, Good Alignment
Progressive Alignment: Dynamic Programming Using A Substitution Matrix
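As an illustration of the pairwise step that progressive alignment repeats, here is a minimal sketch of global dynamic programming (Needleman-Wunsch style) scored with a substitution function. The function name `needleman_wunsch`, the toy scoring function (+2 match, -1 mismatch) and the linear gap penalty of -4 are assumptions chosen for the example; the slide names neither a matrix nor a gap model.

```python
def needleman_wunsch(a, b, score, gap=-4):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # substitution
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def toy_score(x, y):
    """Hypothetical stand-in for a real substitution matrix."""
    return 2 if x == y else -1


print(needleman_wunsch("ACGT", "AGT", toy_score))
```

In a real setting `toy_score` would be replaced by a lookup into a substitution matrix; the recurrence itself stays the same.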
The Triplet Assumption SEQ A SEQ B
Weighting And Extension. Weighting = Using The Surrounding Information (Coffee). Extension = Using Information from Other Sequences.
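The extension idea, using information from other sequences, can be sketched as follows. The slide does not spell out the combination rule, so this sketch assumes one: the extended weight of a residue pair is its primary weight plus, for every intermediate residue from a third sequence that is linked to both, the minimum of the two linking weights. The names `extend_library` and `primary`, and the weights in the example, are hypothetical.

```python
from collections import defaultdict


def extend_library(primary):
    """primary maps (res_a, res_b) -> weight; a residue is a (sequence, position) tuple.

    Returns a dict keyed by sorted residue pairs holding the extended weights
    (assumed rule: primary weight + sum over third-sequence residues z of
    min(weight(a, z), weight(z, b))).
    """
    neigh = defaultdict(dict)
    for (a, b), w in primary.items():
        neigh[a][b] = w
        neigh[b][a] = w

    extended = {}
    residues = list(neigh)
    for a in residues:
        for b in residues:
            # Visit each unordered pair once and skip pairs within one sequence.
            if a >= b or a[0] == b[0]:
                continue
            w = neigh[a].get(b, 0)
            for z, w_az in neigh[a].items():
                if z[0] in (a[0], b[0]):
                    continue  # the intermediate must come from a third sequence
                w_zb = neigh[z].get(b)
                if w_zb is not None:
                    w += min(w_az, w_zb)
            if w > 0:
                extended[(a, b)] = w
    return extended


primary = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("C", 1), ("B", 1)): 100,
}
print(extend_library(primary)[(("A", 1), ("B", 1))])  # 88 + min(77, 100) = 165
```

A pair that is absent from the primary library can still receive a weight this way, as long as a third sequence links the two residues.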
T-Coffee (Notredame, Higgins, Heringa, 2000). Progressive Alignment: Dynamic Programming Using The Extended Library
Mixing Local and Global Alignments Global Alignment Local Alignment Extension Multiple Sequence Alignment
What is a library?

2
Seq1 MySeq
Seq2 MyotherSeq
#1 2
1 1 25
3 8 70
….

3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
1 1 25
#1 3
3 8 70
….

Extension + T-Coffee: Library Based Multiple Sequence Alignment
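To make the layout concrete, here is a minimal sketch of a reader for the simplified library shown on this slide: a sequence count, one `name sequence` line per sequence, then `#i j` headers each followed by `residue_i residue_j weight` lines. The function name `parse_library` is hypothetical, and the files used by the actual program may carry more fields than the slide shows; those are not modeled here.

```python
def parse_library(text):
    """Parse the simplified library layout into (sequences, pairs).

    sequences: dict name -> sequence string
    pairs: dict (name_i, name_j) -> list of (residue_i, residue_j, weight)
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    names, sequences = [], {}
    for ln in lines[1:1 + n]:
        name, seq = ln.split()
        names.append(name)
        sequences[name] = seq

    pairs = {}
    current = None
    for ln in lines[1 + n:]:
        if ln.startswith("#"):
            # Header such as "#1 3": 1-based indices into the sequence list.
            i, j = map(int, ln[1:].split())
            current = (names[i - 1], names[j - 1])
            pairs.setdefault(current, [])
        else:
            if current is None:
                raise ValueError("weight line found before any '#i j' header")
            r1, r2, w = map(int, ln.split())
            pairs[current].append((r1, r2, w))
    return sequences, pairs


example = """3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
1 1 25
#1 3
3 8 70
"""
print(parse_library(example))
```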
What Is BaliBase? BaliBase is a collection of reference multiple alignments. The structures of the sequences are known and were used to assemble the multiple alignments (MALN). Evaluation is carried out by comparing the structure-based reference alignment with its sequence-based counterpart.
BaliBase Method X DALI, Sap … Comparison
T-Coffee Results Validation Using BaliBase
Mixing Heterogeneous Information With T-Coffee: Local Alignment, Global Alignment, Multiple Alignment, Specialist, Structural. Multiple Sequence Alignment
Why Do We Want To Mix Sequences and Structures? STRUCTURE, FUNCTION
• Sequences are Cheap and Common.
• Structures are Expensive and Rare.
• We WANT to use Structural information in multiple alignments:
  • To help the alignment
  • To extrapolate from Structures to Sequences.
Helping an Alignment With Structures? Better gap penalties (ClustalW). Low gap penalties, high gap penalties
Helping an Alignment With Structures? Better gap penalties (ClustalW). Revealing Very Distant Relationships 1tc3c 1hstA
Is It Possible to Use Structural Information? Any_pair, THE new T-Coffee method. Seq Vs Seq: Local, Global. Seq Vs Struct: FUGUE. Struct Vs Struct: SAP. Evaluation on HOMSTRAD
Validation of Any_pair on the HOMSTRAD Database (Orla O’Sullivan, Des Higgins and C. Notredame)
Is It Possible to Use Structural Information?
CW: Clustal W
TC: T-Coffee default
SA: T-Coffee Using SAP
FU: T-Coffee Using FUGUE
Result: % of columns correctly aligned as judged from the HOMSTRAD reference alignment
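The result measure, the percentage of columns correctly aligned with respect to a reference alignment, can be sketched like this. The exact column definition used in the evaluation is not given on the slide, so the sketch assumes a reference column counts only if it holds at least two residues and is "correct" when the tested alignment contains a column with exactly the same residues. The names `columns` and `column_score` are hypothetical.

```python
def columns(aln, names):
    """Turn an alignment (dict name -> gapped string) into a list of columns.

    Each column is a tuple with, per sequence, the index of the residue found
    there or None for a gap.
    """
    counters = {n: 0 for n in names}
    cols = []
    for c in range(len(aln[names[0]])):
        col = []
        for n in names:
            if aln[n][c] == "-":
                col.append(None)
            else:
                col.append(counters[n])
                counters[n] += 1
        cols.append(tuple(col))
    return cols


def column_score(test, ref):
    """Percentage of reference columns reproduced exactly in the test alignment."""
    names = sorted(ref)
    ref_cols = [c for c in columns(ref, names)
                if sum(x is not None for x in c) >= 2]
    if not ref_cols:
        return 0.0
    test_cols = set(columns(test, names))
    hits = sum(c in test_cols for c in ref_cols)
    return 100.0 * hits / len(ref_cols)


ref = {"a": "AC-GT", "b": "ACTGT"}
test = {"a": "ACG-T", "b": "ACTGT"}
print(column_score(test, ref))  # 3 of the 4 scored reference columns match: 75.0
```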
Of the Importance of being Trustworthy… Identifying Good Bits in an Alignment
How Good Is my Alignment?

cah2_human  NGPEHWHK-DFPIAKGERQSPVDIDTHTAKYDP------------SLKPLSVS--YDQAT
cahp_mouse  --GVEWGL-VFPDANGEYQSPINLNSREARYDP------------SLLDVRLSPNYVVCR
cah4_rat    SGPEQWTG----DCKKNQQSPINIVTSKTKLNP------------SLTPFTFVG-YDQKK
ptpg_mouse  YGPEHWVT-SSVSCGGSHQSPIDILDHHARVGD------------EYQELQLDG-FDNES
cah6_human  LDEAHWPQ-HYPACGGQRQSPINLQRTKVRYNP------------SLKGLNMTGYETQAG
cah_dunsa   -VGFDWTGGVCVNTGTSKQSPINIETDSLAEESERLGTADDTSRLALKGLLSS--SYQLT
cahh_varv   --------------MSQQLSPINIETKKAISNA------------RLKPLNIH--YNESK
cah2_chlre  EGKDGAG-NPWVCKTGRKQSPINVPQYHVLDGK------------GSK--IATGLQTQWS
            **:::

cah2_human  ---------SLRILNNGHAFNVEFDD-SQDKAVLK--------------------GGPLD
cahp_mouse  ---------DCEVTNDGHTIQVILKS----KSVLS--------------------GGPLP
cah4_rat    ---------KWEVKNNQHSVEMSLGE----DIYIF--------------------GGDLP
ptpg_mouse  SN-------KTWMKNTGKTVAILLKD----DYFVS--------------------GAGLP
cah6_human  ---------EFPMVNNGHTVQIGLPS----TMRMT--------------------VAD-G
cah_dunsa   ---------SEVAINLEQDMQFSFNAPDEDLPQLT--------------------IGGVV
cahh_varv   ---------PTTIQNTGKLVRINFKG-----GYLS--------------------GGFLP
cah2_chlre  YPDLMSNGSSVQVINNGHTIQVQWTY----DYAGHATIAIPAMRNQSNRIVDVLEMRPND
            * : . .

cah2_human  G----TYRLIQFHFHWGSLD--GQGSEHTVDKKKYAAELHLVHWNTK-YGDFGKAVQQPD
cahp_mouse  Q--GQEFELYEVRFHWGREN--QRGSEHTVNFKAFPMELHLIHWNSTLFGSIDEAVGKPH
cah4_rat    T----QYKAIQLHLHWSEES--NKGSEHSIDGKHFAMEMHVVHKKMTTGDKVQDSDSKD-
ptpg_mouse  G----RFKAEKVEFHWGHSNG-SAGSEHSVNGRRFPVEMQIFFYNPDDFDSFQTAISENR
cah6_human  I----VYIAQQMHFHWGGASSEISGSEHTVDGIRHVIEIHIVHYNS-KYKTYDIAQDAPD
cah_dunsa   H----TFKPVQIHFH-------HFASEHAIDGQLYPLEAHMVMASQN-DGS--------D
cahh_varv   N----EYVLSSLHIYWGKED--DYGSNHLIDVYKYSGEINLVHWNKKKYSSYEEAKKHDD
cah2_chlre  ASDRVTAVPTQFHFH--------STSEHLLAGKIFPLELHIVHKVTD---KLEACKG--G
            ...:: *:* : . * ::.
Measuring The Local Reliability: CORE

cah2_human  NGPEHWHK-DFPIAKGERQSPVDIDTHTAKYDPSLKPLSVS
cahp_mouse  --GVEWGL-VFPDANGEYQSPINLNSREARYDPSLLDVRLS
cah4_rat    SGPEQWTG----DCKKNQQSPINIVTSKTKLNPSLTPFTFV
ptpg_mouse  YGPEHWVT-SSVSCGGSHQSPIDILDHHARVGDEYQELQLD
cah6_human  LDEAHWPQ-HYPACGGQRQSPINLQRTKVRYNPSLKGLNMT

Measure of Reliability:

$$\mathrm{Core}(Q) = \frac{\sum_{x} \mathrm{Escore}(Q, x)}{N \times \mathrm{Max\ Escore}}$$
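A minimal sketch of how such a reliability measure could be computed, reading the slide's formula as the sum of Escore(Q, x) divided by N times the maximum Escore. Several points are assumptions, since the slide does not define its symbols: Q is taken to be one residue, x runs over the N residues aligned with Q in its column, Escore is a lookup of extended-library weights, and Max Escore is the largest weight in that library. The names `core_index`, `escore` and `max_escore` are hypothetical.

```python
def core_index(q, column, escore, max_escore):
    """Normalized reliability of residue q within its alignment column.

    column: residues aligned together, each a (sequence, position) tuple,
            with None standing for a gap.
    escore: dict frozenset({res_a, res_b}) -> weight (assumed extended weights).
    """
    others = [x for x in column if x is not None and x != q]
    if not others or max_escore <= 0:
        return 0.0
    total = sum(escore.get(frozenset((q, x)), 0) for x in others)
    return total / (len(others) * max_escore)


q = ("s1", 5)
column = [q, ("s2", 4), ("s3", 7), None]
escore = {
    frozenset((q, ("s2", 4))): 60,
    frozenset((q, ("s3", 7))): 20,
}
print(core_index(q, column, escore, max_escore=100))  # (60 + 20) / (2 * 100)
```

Under this reading the value lies between 0 and 1, which makes it easy to threshold when flagging unreliable positions.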
[Figure: Specificity and Sensitivity; axis label: CORE index; marked value: 0.48]
Using Consistency For Automatic Annotation?

T-COFFEE, Version_1.24(Wed Nov 15 18:31:29 PST 2000)
Notredame, Higgins, Heringa, JMB(302)pp205-217,2000
CPU TIME:11 sec.
SCORE=39
*
 BAD AVG GOOD
*
cah2_human : 42
cah4_rat : 41
cah6_human : 40
cahp_mouse : 43
cah_dunsa : 33

cah2_human  77664444-454555557666665554444444------------33322222-
cah4_rat    54553332----233445655555554444444------------443323221
cah6_human  44333443-333344445555444444444444------------444433331
cahp_mouse  --633453-333345565554444334444455------------555444331
cah_dunsa   -34334320212223456555555543333333ERLGTADDTSRL22222111-
cah2_chlre  7663333-0333334566666555444343322------------222--1110
ptpg_mouse  67763343-333334445444433333333333------------332222221
cahh_varv   --------------5555555555554444433------------33322211-
Cons        655433430333334455555554444444443------------333322221

cah2_human  -11121---------22223334333322321-00011222-------------
cah4_rat    -22222---------23333344443344442----22222-------------
cah6_human  001122---------22233344333333433----22222-------------
cahp_mouse  022333---------34344455554444543----33334-------------
cah_dunsa   -11111---------11111111111111110P00000111-------------
cah2_chlre  00000000DLMSNGS11223333333433332----22111ATIAIPAMRNQSN
ptpg_mouse  -1111100-------12234445444544433----33333-------------
cahh_varv   -11222---------22233333333333322-----1122-------------
Cons        01112100-------22233334333333332-00022222-------------
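One simple way to act on per-residue scores like these is to flag the low-scoring positions in the alignment itself. The sketch below lowercases every residue whose digit falls below a cutoff; positions where the score row carries a gap or a letter instead of a digit are left untouched. The function name `mask_unreliable` and the cutoff of 3 are hypothetical choices, not something the output above prescribes.

```python
def mask_unreliable(residues, scores, threshold=3):
    """Lowercase residues whose per-position score digit is below threshold.

    residues and scores are two rows of equal length; score characters that
    are not digits (gaps, unscored letters) leave the residue unchanged.
    """
    if len(residues) != len(scores):
        raise ValueError("residue row and score row must have the same length")
    out = []
    for res, sc in zip(residues, scores):
        if sc.isdigit() and int(sc) < threshold:
            out.append(res.lower())
        else:
            out.append(res)
    return "".join(out)


# Illustration: eight residues paired with eight score digits.
print(mask_unreliable("SLKPLSVS", "33322222"))  # SLKplsvs
```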
Evaluating An Alignment Not Generated With T-Coffee: T_coffee -infile CLUSTALW_ALN -in Library -do_score
WHERE ? Cedric.notredame@europe.com igs-server.cnrs-mrs.fr/~cnotred igs-server.cnrs-mrs.fr/Tcoffee
The T-Coffee Server: ES45, 4 Proc, 1 Gb RAM
T-Coffee Server HP/Compaq-ES45/4-2G