
Medical Natural Sciences Year 2: Introduction to Bioinformatics

Medical Natural Sciences Year 2: Introduction to Bioinformatics. Lecture 9: Multiple sequence alignment (III) Centre for Integrative Bioinformatics VU. Intermezzo: Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP).


Presentation Transcript


  1. Medical Natural Sciences Year 2: Introduction to Bioinformatics. Lecture 9: Multiple sequence alignment (III). Centre for Integrative Bioinformatics VU

  2. Intermezzo: Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP). Victor Simossis, Jaap Heringa. Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands

  3. Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP) • Modern state-of-the-art methods use multiple sequence alignments • Methods like PHD, Profs, SSPro, etc., predict for the top sequence in the alignment by cutting out the alignment positions where the top sequence has a gap • What if two helices ‘out of phase’ are pasted together? Or a strand and a helix? • Approach: correct for this by permuting the alignment (each sequence in turn on top) and taking a consensus prediction

  4. Secondary structure periodicity patterns [Figure: hydrophobic/hydrophilic periodicity patterns for a buried β-strand, an edge β-strand and an α-helix]

  5. Symmetry-derived secondary structure prediction using MA (SymSSP) [Figure: permuted orderings of a four-sequence alignment (3 1 2 4, 4 1 2 3, 1 2 3 4, 2 1 3 4) with the predicted secondary structure strings (E, H, ?) obtained for each ordering, e.g. EEEEE HHHHHH EEEEE HH and EEEE? ?HHHHH EEE H]

  6. Optimal segmentation of predicted secondary structures. Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment. The predictions are recorded by secondary structure type and region position in a single matrix.
      [Figure: predictions 1->1, 1->2, 1->3 and 1->4 for a four-sequence alignment, e.g. EEEEE HHHHHH EEEEE HH; EEEE? ?HHHHH EEE H; EEEEE HHHHH? ??EE HH; EEEEEE ?HHHHH EEEE HH; segments labelled C, E, H]
      Position   1  2  3  4  5  ...
      H score    0  0  0  0  0  ...
      E score    3  4  4  4  3  ...
      C score    1  0  0  0  0  ...
      ? score    0  0  0  0  1  ...
      Region     0  1  1  1  0  ...
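The per-position tally described on this slide can be sketched as follows. This is a minimal illustration, not the SymSSP implementation: the function name `tally_predictions` and the four toy prediction strings are assumptions, chosen only so that the counts reproduce the first five columns shown on the slide. The "Region" row is not modelled.

```python
def tally_predictions(library):
    """Count, per position, how often each state is predicted.

    library: list of equal-length prediction strings over the
    states 'H', 'E', 'C' and '?' (hypothetical input format).
    """
    length = len(library[0])
    if any(len(pred) != length for pred in library):
        raise ValueError("all predictions must have the same length")
    scores = {state: [0] * length for state in "HEC?"}
    for pred in library:
        for pos, state in enumerate(pred):
            scores[state][pos] += 1  # KeyError on an unknown state symbol
    return scores


# Toy library of four predictions for the same five positions (assumed data).
library = ["CEEEE", "EEEE?", "EEEEE", "EEEEE"]
scores = tally_predictions(library)
for state in "HEC?":
    print(state, scores[state])
```

Each row of the result corresponds to one score row of the matrix on the slide.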

  7. Optimal segmentation of predicted secondary structures by dynamic programming. The recorded values are used in a weighted function according to their secondary structure type, which gives each position a window-specific score. The more probable the secondary structure element, the higher the score. Restrictions: H only if window size (ws) >= 4; E only if ws >= 2. [Figure: the H, E, C, ? and Region scores feed a matrix of window size (ticks 2, 6) against sequence position; cell entries: max score, offset, label (shown: 5, H); the segmentation score is the total score of each path]
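A segmentation of this kind can be sketched with a small dynamic program. The slide does not give the weighted scoring function, so the sketch below assumes the simplest possible one: a segment's score is the plain sum of the per-position scores for its state. Only the two length restrictions (H needs at least 4 positions, E at least 2) come from the slide; the function name `segment`, the coil minimum of 1 and the toy scores are assumptions.

```python
def segment(scores, min_len=None):
    """Best-scoring segmentation into H/E/C runs with minimum run lengths.

    scores: dict mapping 'H', 'E', 'C' to per-position score lists.
    Returns (label string, total score).
    """
    min_len = min_len or {"H": 4, "E": 2, "C": 1}
    n = len(scores["C"])
    # Prefix sums make the score of any segment an O(1) lookup.
    prefix = {s: [0] * (n + 1) for s in min_len}
    for s in min_len:
        for i, value in enumerate(scores[s]):
            prefix[s][i + 1] = prefix[s][i] + value
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best score for positions 0..i-1
    back = [None] * (n + 1)      # back[i]: (segment start, state)
    best[0] = 0
    for end in range(1, n + 1):
        for s, shortest in min_len.items():
            for start in range(0, end - shortest + 1):
                if best[start] == neg_inf:
                    continue
                cand = best[start] + prefix[s][end] - prefix[s][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, s)
    # Traceback from the last position to recover the labels.
    labels = []
    pos = n
    while pos > 0:
        start, s = back[pos]
        labels.append(s * (pos - start))
        pos = start
    return "".join(reversed(labels)), best[n]


toy = {
    "H": [0, 0, 0, 0, 3, 4, 4, 4, 4, 1],
    "E": [3, 4, 4, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 0, 0, 3, 1, 0, 0, 0, 0, 3],
}
labels, total = segment(toy)
print(labels, total)
```

Because coil segments may have length 1, every position is reachable, so the traceback always succeeds. A helix can never be emitted for a stretch shorter than four positions, however high its per-position scores.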

  8. Example of an optimally segmented secondary structure prediction library for sequence 3chy
      3chy                ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
      3chy <- 1fx1        ??????????????? ee ?? hhhhhhhhhhhhhh ????????
      3chy <- FLAV_DESDE  ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
      3chy <- FLAV_DESVH  ??????????????? ee ?? hhhhhhhhhhhhhh ????????
      3chy <- FLAV_DESGI  ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
      3chy <- FLAV_DESSA  ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
      3chy <- 4fxn        ??????????????? eee ?? hhhhhhhhhhhhh ?????????
      3chy <- FLAV_MEGEL  ????????????????eee ?? hh?hhhhhhhhhhh ?????????
      3chy <- 2fcr        e ? eeeeeee hhhhhhhhhhhhhhh ??????
      3chy <- FLAV_ANASP  ? eeeeeee hhhhhhhhhhhhhhh ??????
      3chy <- FLAV_ECOLI  eeeeeee hhhhhhhhhhhhhhh hhhhh
      3chy <- FLAV_AZOVI  ? eeeeeee hhhhhhhhhhhhhhh ????
      3chy <- FLAV_ENTAG  e eeeeeeee hhhhhhhhhhhhhhhh? ??????
      3chy <- FLAV_CLOAB  eeeeeee hhhhhhhhhh ???????????
      3chy <- 3chy        --------------- ----- hhhhhhhhhhhhhh ------
      Consensus           ---------------EEEE----- HHHHHHHHHHHHH ------
      Consensus-DSSP      ...............****.....****xx***************......
      PHD                 --------------- ----- HHHHHHHHHHHHHH ------
      PHD-DSSP            ...............xxxx.....******************x**......
      DSSP                ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
      LumpDSSP            ...............EEEE..... HHHHHHHHHHHHHHH ......

  9. Symmetry-derived secondary structure prediction (SymSSP) • Tried over 120 different consensus weighting schemes (global, regional, positional) • Tested on ~2700 HOMSTRAD alignments and compared to PHD: on average 0.5% better • 60% of the alignments are improved, 20% are not affected and 20% are made worse • Tried to correlate schemes with “cheap” a priori data (pairwise identities, sequence lengths, number of sequences, etc.)

  10. Integrating secondary structure prediction and multiple sequence alignment • The example shown is low-key, with fairly homogeneous data (strings of letters in both cases) • But it is already difficult to do, and the methods are not easily tunable • How to scale up to knowledge-integrating and inference engines?

  11. Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Globalised local alignment • Matrix extension Objective: try to avoid (early) errors

  12. Globalised local alignment • Aim: fill each cell of the DP search matrix with the score of the highest-scoring local alignment going through that cell • Problem: forward calculation + traceback for each local alignment is too slow • Solution: double dynamic programming • Step 1: local DP in forward and reverse direction (no traceback) + matrix summation • Step 2: global DP over the matrix from step 1 + traceback

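  13. Globalised local alignment: double dynamic programming. 1. Local (SW) alignment using exchange matrix M and gap penalties Po,e [Figure: forward matrix + reverse matrix = summed matrix]. 2. Global (NW) alignment over the summed matrix (no M or Po,e)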
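The two-step procedure can be sketched as below. This is a simplified illustration under several assumptions that are not in the slides: a match/mismatch score stands in for the exchange matrix M, a single linear gap penalty replaces the open/extension pair Po, Pe, and the cell's own exchange value is subtracted once when the forward and reverse matrices are summed so that it is not counted twice. All function names are hypothetical.

```python
def sub(a, b, match=2, mismatch=-1):
    """Toy exchange value (stand-in for a matrix such as BLOSUM62)."""
    return match if a == b else mismatch


def local_pair_scores(x, y, gap=2):
    """D[i][j]: best local (SW) score of an alignment ending with x[i]~y[j]."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    D = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i - 1][j - 1] = H[i - 1][j - 1] + sub(x[i - 1], y[j - 1])
            H[i][j] = max(0, D[i - 1][j - 1], H[i - 1][j] - gap, H[i][j - 1] - gap)
    return D


def through_matrix(x, y, gap=2):
    """Step 1: forward + reverse local DP, summed per cell (no traceback)."""
    n, m = len(x), len(y)
    fwd = local_pair_scores(x, y, gap)
    rev = local_pair_scores(x[::-1], y[::-1], gap)
    return [[fwd[i][j] + rev[n - 1 - i][m - 1 - j] - sub(x[i], y[j])
             for j in range(m)] for i in range(n)]


def global_align(x, y, T):
    """Step 2: global (NW) DP over T, without exchange matrix or gap penalties."""
    n, m = len(x), len(y)
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = max(G[i - 1][j - 1] + T[i - 1][j - 1], G[i - 1][j], G[i][j - 1])
    ax, ay, i, j = [], [], n, m
    while i > 0 and j > 0:  # traceback
        if G[i][j] == G[i - 1][j - 1] + T[i - 1][j - 1]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif G[i][j] == G[i - 1][j]:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    while i > 0:
        ax.append(x[i - 1]); ay.append("-"); i -= 1
    while j > 0:
        ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


x, y = "HEAGAWGHEE", "PAWHEAE"
T = through_matrix(x, y)
aligned_x, aligned_y = global_align(x, y, T)
print(aligned_x)
print(aligned_y)
```

Each cell of `T` holds the score of the best local alignment that passes through that residue pair, so the global step favours paths that run along strong local signals.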

  14. M = BLOSUM62, Po = 0, Pe = 0

  15. M = BLOSUM62, Po = 12, Pe = 1

  16. M = BLOSUM62, Po = 60, Pe = 5

  17. Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Globalised local alignment • Matrix extension Objective: try to avoid (early) errors

  18. Integrating alignment methods and alignment information with T-Coffee • Integrating different pair-wise alignment techniques (NW, SW, ..) • Combining different multiple alignment methods (consensus multiple alignment) • Combining sequence alignment methods with structural alignment techniques • Plug in user knowledge

  19. Matrix extension • T-Coffee • Tree-based Consistency Objective Function For alignmEnt Evaluation • Cedric Notredame • Des Higgins • Jaap Heringa • J. Mol. Biol., 302, 205-217; 2000

  20. Using different sources of alignment information [Figure: structure alignments, Clustal (shown twice), Dialign, Lalign and manual alignments all feed into T-Coffee]

  21. Progressive multiple alignment [Figure: pairwise alignments of five sequences (score 1-2, score 1-3, ..., score 4-5) -> 5×5 similarity matrix of scores -> guide tree -> multiple alignment]

  22. Default T-COFFEE • Uses information from all sequences for each pair-wise alignment • Reconciles global and local alignment information

  23. T-Coffee matrix extension [Figure: pairwise alignments for the sequence pairs 2-1, 3-1, 4-1, 3-2, 4-2 and 4-3]

  24. Search matrix extension

  25. T-Coffee • Combine different alignment techniques by adding scores: • $W(A(x), B(y)) = \sum S(A(x), B(y))$ • A(x) is residue x in sequence A • The summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y)) • S is the sequence identity percentage of the associated alignment • Combine the direct alignment seqA-seqB with each seqA-seqI-seqB: • $W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min(W(A(x), I(z)), W(I(z), B(y)))$ • The summation is over all third sequences I other than A or B
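The extension formula can be sketched directly on a small residue-pair library. This is an illustrative sketch rather than T-Coffee's own code: the function name `extend_library` and the data layout (a residue as a `(sequence id, position)` tuple, a pair as a `frozenset`) are assumptions. The four weights are the ones listed on the "T-Coffee library system" slide.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Apply W'(a, b) = W(a, b) + sum over third residues z of min(W(a, z), W(z, b)).

    primary: dict {frozenset({res_a, res_b}): weight}, where a residue is a
    (sequence id, position) tuple and each pair joins two different sequences.
    """
    # For every residue, collect the residues it is paired with and the weights.
    neighbours = defaultdict(dict)
    for pair, weight in primary.items():
        a, b = tuple(pair)
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    extended = dict(primary)
    # Every residue acts in turn as the intermediate I(z) of a triplet.
    for linked in neighbours.values():
        for a, b in combinations(linked, 2):
            if a[0] == b[0]:
                continue  # both residues lie in the same sequence: not a pair
            key = frozenset((a, b))
            extended[key] = extended.get(key, 0) + min(linked[a], linked[b])
    return extended


primary = {
    frozenset({(3, 31), (5, 33)}): 10,
    frozenset({(3, 31), (6, 34)}): 14,
    frozenset({(5, 33), (6, 35)}): 21,
    frozenset({(5, 33), (6, 36)}): 35,
}
extended = extend_library(primary)
for pair, weight in sorted(extended.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), weight)
```

In this example, residue 31 of sequence 3 and residue 35 of sequence 6 are never aligned directly, but they gain a weight of min(10, 21) = 10 through residue 33 of sequence 5.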

  26. T-Coffee [Figure: the direct alignment of two sequences alongside the alignments via the other sequences]

  27. T-Coffee library system
      Seq1  AA1  Seq2  AA2  Weight
      3     V31  5     L33  10
      3     V31  6     L34  14
      5     L33  6     R35  21
      5     L33  6     I36  35

  28. T-Coffee progressive alignment [Figure: search matrix for MDAGSTVILCFVG against MDAASTILCGS; labels: amino acid exchange matrix, search matrix, gap penalties (open, extension)]
      MDAGSTVILCFVG-
      MDAAST-ILC--GS

  29. Kinase nucleotide binding sites

  30. Comparing T-coffee with other methods

  31. But ... T-COFFEE (V1.23) multiple sequence alignment of Flavodoxin-cheY
      1fx1        ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
      FLAV_DESVH  ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
      FLAV_DESGI  ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----
      FLAV_DESSA  ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----
      FLAV_DESDE  ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK-----
      4fxn        ------MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE-------ESEFEPF-IEEIS-TKISGKK-----
      FLAV_MEGEL  -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE-------DSVVEPF-FTDLA-PKLKGKK-----
      FLAV_CLOAB  ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN---------ISWEMKKW-IDESSEFNLEGKL-----
      2fcr        -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----
      FLAV_ENTAG  ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----
      FLAV_ANASP  ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----
      FLAV_AZOVI  ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----
      FLAV_ECOLI  ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA--------QCDWDDF-FPTLEEIDFNGKL-----
      3chy        ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE--------------LLKTIRADGAMSALPVLMV

      1fx1        ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
      FLAV_DESVH  ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
      FLAV_DESGI  ---------VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS---------------------LKIDGEPDSA----EVLDWAREVLARV--------
      FLAV_DESSA  ---------VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS---------------------LKIDGDPE----RDEIVSWGSGIADKI--------
      FLAV_DESDE  ---------VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG---------------------LKMEGDASND--PEAVASFAEDVLKQL--------
      4fxn        ---------VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP---------------------LIVQNEPD--EAEQDCIEFGKKIANI---------
      FLAV_MEGEL  ---------VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA---------------------IV--NEMP--DNAPECKELGEAAAKA---------
      FLAV_CLOAB  ---------GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------
      2fcr        ---------VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
      FLAV_ENTAG  ---------VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------
      FLAV_ANASP  ---------VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------
      FLAV_AZOVI  ---------VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----
      FLAV_ECOLI  ---------VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA
      3chy        TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM----------------------------------------------------------

  32. Evaluating multiple alignments • Conflicting standards of truth: evolution, structure, function • With orphan sequences, no additional information is available • Benchmarks depend on reference alignments • Quality issues with the available reference alignment databases • Different ways to quantify agreement with the reference alignment (sum-of-pairs, column score) • “Charlie Chaplin” problem

  33. Evaluating multiple alignments • As a standard of truth, often a reference alignment based on structural superpositioning is taken

  34. Evaluation measures [Figure: a query alignment compared with a reference alignment, scored by column score and sum-of-pairs score]
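The two agreement measures can be sketched as follows, under commonly used definitions that the slide itself does not spell out: the column score is taken as the fraction of reference columns reproduced exactly in the query, and the sum-of-pairs score as the fraction of residue pairs aligned in the reference that are also aligned in the query. The function names and the toy alignments are assumptions.

```python
from itertools import combinations


def residue_columns(alignment):
    """Rewrite each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        indexed = []
        for k, char in enumerate(column):
            if char == "-":
                indexed.append(None)
            else:
                indexed.append(counters[k])
                counters[k] += 1
        columns.append(tuple(indexed))
    return columns


def aligned_pairs(alignment):
    """Set of (seq s, residue i, seq t, residue j) pairs sharing a column."""
    pairs = set()
    for column in residue_columns(alignment):
        present = [(k, idx) for k, idx in enumerate(column) if idx is not None]
        for (s, i), (t, j) in combinations(present, 2):
            pairs.add((s, i, t, j))
    return pairs


def sp_agreement(query, reference):
    """Fraction of reference residue pairs that the query also aligns."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(query)) / len(ref)


def column_score(query, reference):
    """Fraction of reference columns that appear unchanged in the query."""
    ref_columns = residue_columns(reference)
    query_columns = set(residue_columns(query))
    return sum(col in query_columns for col in ref_columns) / len(ref_columns)


reference = ["AC-GT", "ACAGT"]
query = ["ACG-T", "ACAGT"]
print(column_score(query, reference), sp_agreement(query, reference))
```

The column score is the stricter of the two: one misplaced residue spoils a whole column, while the sum-of-pairs score still credits the pairs in that column that remain correct.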

  35. Scoring a multiple alignment (query) • Sum-of-Pairs score: for each alignment position, take the sum over all residue pairs (add amino acid exchange values) • As an option, subtract gap penalties
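A minimal sketch of this scoring scheme is given below. The identity-based exchange function and the flat per-pair gap penalty are simplifying assumptions (a real evaluation would use an exchange matrix such as BLOSUM62 and possibly affine penalties), and the function name `sum_of_pairs` is hypothetical. The example alignment is the one shown on the "T-Coffee progressive alignment" slide.

```python
from itertools import combinations


def sum_of_pairs(alignment, exchange, gap_penalty=0):
    """Sum exchange values over all residue pairs in every column.

    A residue paired with a gap costs gap_penalty; gap-gap pairs are ignored.
    """
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total -= gap_penalty
            else:
                total += exchange(a, b)
    return total


def identity(a, b):
    """Toy exchange value: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


alignment = ["MDAGSTVILCFVG-", "MDAAST-ILC--GS"]
print(sum_of_pairs(alignment, identity))
print(sum_of_pairs(alignment, identity, gap_penalty=1))
```

With more than two sequences, the same function sums over every pair of rows in each column.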

  36. Evaluating multiple alignments SP BAliBASE alignment nseq * len

  37. Summary • Weighting schemes simulating simultaneous multiple alignment: profile pre-processing (global/local), matrix extension (well-balanced scheme) • Smoothing alignment signals: globalised local alignment • Using additional information: secondary structure-driven alignment • Schemes strike a balance between speed and sensitivity

  38. References • Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364. • Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217. • Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

  39. Where to find this: http://www.ibivu.cs.vu.nl/teaching
