
Gabriel Pons, Departament de Ciències Fisiològiques II, Campus de

Topic 13. Sequence comparison. Concept of homology. Sequence alignment. Comparison strategies. BLAST, PSI-BLAST. Multiple alignment, profiles. Families of proteins. Functional prediction based on sequence. Gabriel Pons, Departament de Ciències Fisiològiques II, Campus de


Presentation Transcript


  1. Topic 13. Sequence comparison. Concept of homology. Sequence alignment. Comparison strategies. BLAST, PSI-BLAST. Multiple alignment, profiles. Families of proteins. Functional prediction based on sequence. Gabriel Pons, Departament de Ciències Fisiològiques II, Campus de Ciències de la salut. Bellvitge. Universitat de Barcelona

  2. Sequence comparison

  3. Goals • To exploit functional or structural information by identifying homologies between sequences • Differences between homology and identity • Two sequences are homologous when: • They have the same evolutionary origin • They typically have similar function and structure

  4. • Homologous sequences - sequences that share a common evolutionary ancestry • Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge) IMPORTANT: • Sequence homology: • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity • Homology is qualitative • Sequence similarity: • The direct result of observation from a sequence alignment • Similarity is quantitative; can be described using percentages

  5. More definitions • Orthologs: sequences that correspond exactly to the same function/structure in different species • Paralogs: sequences produced by gene duplication within the same organism. Duplication usually involves a change in function, although a functional relationship is often retained.

  6. Homology

  7. Homology and prediction • Very divergent protein sequences may adopt similar structures • Similar protein structures will probably have related or similar functions

  8. 3D STRUCTURE VERSUS SEQUENCE Sequence alignment between human myoglobin and the α- and β-globins of hemoglobin

  9. Comparison of 3D structures of human myoglobin and the α- and β-globins of hemoglobin

  10. Superposition of 3D structures of human myoglobin and globin from hemoglobin

  11. Homology and prediction • Sequence comparison is the simplest way to detect homology between sequences. • Identity > 30% in proteins generally implies homology (> 65% for nucleic acids) • Identity > 80-90% is usual in orthologs from closely related species • Identity 10-30%: homology, if present, may not be detectable (“twilight zone”)

  12. “No me gusta la bioinformatica” versus “Teme usted la ionosfera optica”. Aligned: Nomegusta-labioin-forma--tica Teme-ustedla-ionosfer-aoptica 64% identity? But… the sentences mean “I don't like bioinformatics” and “Do you fear the optical ionosphere?”
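The 64% on this slide can be reproduced by counting identical columns and dividing by the length of the shorter sentence; dividing by the number of alignment columns gives a lower figure. The following is a minimal sketch, not part of the original slides: the function name `identity_stats` and the two denominators are assumptions chosen for illustration.

```python
def identity_stats(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count identical columns in a pairwise alignment given as two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A column is identical when both rows carry the same non-gap character.
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    len_a = len(row_a.replace(gap, ""))  # ungapped length of the first sequence
    len_b = len(row_b.replace(gap, ""))  # ungapped length of the second sequence
    return {
        "identical": identical,
        "columns": len(row_a),
        "pct_of_shorter": 100.0 * identical / min(len_a, len_b),
        "pct_of_columns": 100.0 * identical / len(row_a),
    }


stats = identity_stats("Nomegusta-labioin-forma--tica",
                       "Teme-ustedla-ionosfer-aoptica")
print(stats)
```

The two percentages differ because the denominator is a convention, which is one reason a bare identity value says little about homology for short sequences.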

  13. DNA or protein? • Both give information about homology • Protein: there is functional equivalence between amino acids

  14. DNA: only identity is relevant. Canonical base pairing (Watson-Crick). Mismatches do not have a variable cost: usually no substitution is better than another

  15. Genetic code. Third-base degeneracy: XYC = XYU, XYA ≈ XYG • Codons per amino acid: • Trp, Met (1) • Leu, Ser, Arg (6) • others (2 to 4) • Initiation: AUG • Stop (3)
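The codon counts on this slide can be checked directly against the standard genetic code. The sketch below builds the table from the usual 64-letter compact encoding (codons ordered UUU, UUC, UUA, UUG, UCU, ...); the variable names are illustrative, not taken from the slides.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that encode each amino acid (or stop).
codons_per_residue = Counter(CODON_TABLE.values())

# Third-base degeneracy: XYC and XYU always agree; XYA and XYG usually agree.
prefixes = ["".join(p) for p in product(BASES, repeat=2)]
cu_always_equal = all(CODON_TABLE[p + "C"] == CODON_TABLE[p + "U"] for p in prefixes)
ag_exceptions = [p for p in prefixes if CODON_TABLE[p + "A"] != CODON_TABLE[p + "G"]]

for residue, count in sorted(codons_per_residue.items(), key=lambda kv: kv[1]):
    print(residue, count)
print("XYC == XYU for every XY:", cu_always_equal)
print("XY prefixes where XYA != XYG:", ag_exceptions)
```

The XYA/XYG exceptions are the prefixes AU (Ile versus Met) and UG (stop versus Trp), which is why the slide writes "≈" rather than "=" for that pair.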

  16. “Equivalent amino acids” • Hydrophobic • Ala (A), Val (V), Met (M), Leu (L), Ile (I), Phe (F), Trp (W), Tyr (Y) • Small • Gly (G), Ala (A), Ser (S) • Polar • Ser (S), Thr (T), Asn (N), Gln (Q), Tyr (Y) • On the protein surface, polar and charged residues are equivalent • Charged • Asp (D), Glu (E) / Lys (K), Arg (R) • Difficult to substitute • Gly (G), Pro (P), Cys (C), His (H) • BE CAREFUL: amino acids do not always perform the same function in proteins

  17. 3D visualization of some conserved residues in the globin family (myoglobin structure) • Proline in a turn • Histidine for the heme coordination bonds • Two conserved glycines in two separate helices that cross each other

  18. DNA sequence diverges faster than protein sequence • Mutation or recombination may alter the DNA, but function/structure must be maintained • Protein sequence comparison makes it possible to find and locate very distant homologous proteins

  19. Sequence alignment • Measuring the degree of similarity/identity, and thus inferring homology, requires an “alignment” • Strong identity/similarity: AWTRRATVHDGLMEDEFAA AWTRRATVHDGLCEDEFAA • Weak identity/similarity: AWTKLATAVVVFEGLCEDEWGG AWTRRAT---VHDGLMEDEFAA
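To make "identity" and "similarity" concrete, the two example alignments on this slide can be scored column by column. The sketch below treats two residues as similar when they are identical or share one of the groups listed on slide 16 (hydrophobic, small, polar, acidic, basic). That rule, the group encoding, and the function names are assumptions made for illustration; real tools use a substitution matrix instead.

```python
# Residue groups taken from the "equivalent amino acids" slide.
GROUPS = {
    "hydrophobic": set("AVMLIFWY"),
    "small": set("GAS"),
    "polar": set("STNQY"),
    "acidic": set("DE"),
    "basic": set("KR"),
}


def is_similar(x: str, y: str) -> bool:
    """True if the residues are identical or share at least one group."""
    return x == y or any(x in g and y in g for g in GROUPS.values())


def alignment_summary(row_a: str, row_b: str, gap: str = "-"):
    """Return (aligned columns, identical columns, similar columns)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = identical = similar = 0
    for x, y in zip(row_a, row_b):
        if gap in (x, y):
            continue  # gapped columns count toward neither identity nor similarity
        aligned += 1
        identical += x == y
        similar += is_similar(x, y)
    return aligned, identical, similar


strong = alignment_summary("AWTRRATVHDGLMEDEFAA", "AWTRRATVHDGLCEDEFAA")
weak = alignment_summary("AWTKLATAVVVFEGLCEDEWGG", "AWTRRAT---VHDGLMEDEFAA")
for name, (aligned, identical, similar) in (("strong", strong), ("weak", weak)):
    print(f"{name}: identity {identical}/{aligned}, similarity {similar}/{aligned}")
```

In the weak example the similar columns clearly outnumber the identical ones, which is the extra signal protein comparison offers over DNA comparison.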

  20. Alignments • “pairwise” • 2 sequences • Multiple • More than 2 sequences • Global • Whole sequence is considered • Local • Only similar regions are aligned

  21. Strategies: depend on the goal • Sequence comparison • Goal: establish homology, identify equivalent amino acids • Global, “pairwise”/multiple • Database search • Goal: identify homologous proteins in a large set of sequences • Local, “pairwise”

  22. Automatic alignment • Requires • An objective method to compare amino acids or bases in order to “score” the alignment (comparison matrix) • An algorithm to find the best alignment, the one with the maximal score • Quick and easy to reproduce • In general, does not allow additional information to be introduced

  23. Matrix types • Identity • Physico-chemical properties • Genetics (codon substitution) • Evolution

  24. BLOSUM62 • Small positive score for substitutions between similar amino acids • Small positive score for common amino acids • Infrequent amino acids have a high score • High penalty for very different amino acids • Same score regardless of position!
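The properties listed for BLOSUM62 can be seen in a handful of its entries. The sketch below hard-codes a small excerpt; the values are quoted from the standard BLOSUM62 table and should be checked against the full matrix before being relied on. The helper names `pair_score` and `ungapped_score` are hypothetical.

```python
# Excerpt of BLOSUM62; only the pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,   # common residue, modest score for identity
    ("L", "L"): 4,   # common residue
    ("C", "C"): 9,   # infrequent residue, high score for identity
    ("W", "W"): 11,  # infrequent residue, highest diagonal value
    ("I", "V"): 3,   # similar residues, small positive score
    ("D", "E"): 2,   # similar residues
    ("K", "R"): 2,   # similar residues
    ("D", "L"): -4,  # very different residues, strong penalty
}


def pair_score(x: str, y: str) -> int:
    """Look up a symmetric substitution score; raises KeyError outside the excerpt."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def ungapped_score(seq_a: str, seq_b: str) -> int:
    """Sum of column scores for two equal-length sequences aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


print(pair_score("W", "W"), pair_score("A", "A"), pair_score("V", "I"), pair_score("L", "D"))
print(ungapped_score("ALKW", "ALRW"))
```

Because the same table is applied at every column, a conserved catalytic residue and a surface residue of the same type score identically, which is the limitation the last bullet points out.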

  25. Choice of a matrix! From closely related sequences (rat versus mouse protein) to distant ones (rat versus bacterial protein): • BLOSUM90 / PAM30 • BLOSUM80 / PAM120 • BLOSUM62 / PAM180 • BLOSUM45 / PAM240

  26. PAM: Point Accepted Mutation

  27. Gaps (insertions/deletions) • Usually located in loops AWTKLATAVVVFEGLCEDEWGG AWTRRAT---VHDGLMEDEFAA

  28. Global versus local alignment • Global alignment • Finds best possible alignment across entire length of 2 sequences • Aligned sequences assumed to be generally similar over entire length • Local alignment • Finds local regions with highest similarity between 2 sequences • Aligns these without regard for rest of sequence • Sequences are not assumed to be similar over entire length
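The difference between the two modes shows up in the dynamic-programming recurrence. The sketch below computes only the optimal score (no traceback), in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms. The scoring scheme (+1 match, -1 mismatch, -1 per gap position) is an assumption for illustration; the slides score with a substitution matrix instead.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score of an alignment that spans both sequences end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: leading gaps in a
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]  # column 0: leading gaps in b
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score over all pairs of substrings; a cell never drops below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Two sequences that share only the block "ACGT", at opposite ends.
x, y = "TTTTACGT", "ACGTCCCC"
print("global:", global_score(x, y))
print("local: ", local_score(x, y))
```

The local score recovers the shared block, while the global score is dragged down by the unrelated flanks. The same effect explains why global alignment handles multidomain proteins poorly.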

  29. Comparing sequences against databases • Sequence database: AGLM...WTKR TCGGLMN..HICG WRKCPGL ... • Query sequence: ATTVG...LMN • Requires very fast comparison algorithms

  30. Alignments • “pairwise” • 2 sequences • Multiple • More than 2 sequences • Global • Whole sequence is considered • Local • Only similar regions are aligned

  31. Disadvantages of global alignment (global alignment server) • Slow • Scores the whole sequence • Does not recognize multidomain proteins (e.g. domain architectures A B C versus A C’ B D)

  32. Alpha-globin: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK YR Beta-globin: MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH

  33. Alpha-actinin: MNQIEPGVQYNYVYDEDEYMIQEEEWDRDLLLDPAWEKQQRKTFTAWCNSHLRKAGTQIE NIEEDFRNGLKLMLLLEVISGERLPKPDRGKMRFHKIANVNKALDYIASKGVKLVSIGAE EIVDGNVKMTLGMIWTIILRFAIQDISVEETSAKEGLLLWCQRKTAPYRNVNIQNFHTSW KDGLGLCALIHRHRPDLIDYSKLNKDDPIGNINLAMEIAEKHLDIPKMLDAEDIVNTPKP DERAIMTYVSCFYHAFAGAEQAETAANRICKVLAVNQENERLMEEYERLASELLEWIRRT IPWLENRTPEKTMQAMQKKLEDFRDYRRKHKPPKVQEKCQLEINFNTLQTKLRISNRPAF MPSEGKMVSDIAGAWQRLEQAEKGYEEWLLNEIRRLERLEHLAEKFRQKASTHETWAYGK EQILLQKDYESASLTEVRALLRKHEAFESDLAAHQDRVEQIAAIAQELNELDYHDAVNVN DRCQKICDQWDRLGTLTQKRREALERMEKLLETIDQLHLEFAKRAAPFNNWMEGAMEDLQ DMFIVHSIEEIQSLITAHEQFKATLPEADGERQSIMAIQNEVEKVIQSYNIRISSSNPYS TVTMDELRTKWDKVKQLVPIRDQSLQEELARQHANERLRRQFAAQANAIGPWIQNKMEEI ARSSIQITGALEDQMNQLKQYEHNIINYKNNIDKLEGDHQLIQEALVFDNKHTNYTMEHI RVGWELLLTTIARTINEVETQILTRDAKGITQEQMNEFRASFNHFDRRKNGLMDHEDFRA CLISMGYDLGEAEFARIMTLVDPNGQGTVTFQSFIDFMTRETADTDTAEQVIASFRILAS DKPYILAEELRRELPPDQAQYCIKRMPAYSGPGSVPGALDYAAFSSALYGESDL Calmodulin: MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADG NGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDE EVDEMIREADIDGDGQVNYEEFVQMMTAK

  34. Local alignment • 10-100x faster • Recognizes individual domains • Does not necessarily give the best alignment! • BLAST, FASTA

  35. BLAST: Basic Local Alignment Search Tool (NCBI)

  36. E value (Expect) • $E = K \cdot m \cdot n \cdot e^{-\lambda S}$, where m is the number of letters in the query, n is the number of letters in the database, S is the score, and K and λ are normalization factors • Expect: This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. • Warning: • Lowering E → false negatives

  37. E parameter (more) • Expect: For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to 0, the more "significant" the match is. However, keep in mind that hits to short query sequences can be virtually identical and still have a relatively high E value. This is because the calculation of the E value also takes into account the length of the query sequence: shorter sequences have a higher probability of occurring in the database purely by chance.

  38. Exercise • Find the mouse ortholog. Data • Find the closest human paralog • Find the most significant homolog in Drosophila
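The Karlin-Altschul expression from slide 36, E = K·m·n·e^(−λS), is easy to explore numerically. In the sketch below the default values of `lam` and `k` are illustrative assumptions (values of this order are commonly quoted for gapped BLOSUM62 searches); actual values depend on the matrix and gap penalties and are reported in the BLAST output. The function names are hypothetical.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float = 0.267, k: float = 0.041) -> float:
    """E = K * m * n * exp(-lambda * S); lam and k are assumed example values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float = 0.267, k: float = 0.041) -> float:
    """Normalized score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


DB_LETTERS = 10**9  # hypothetical database size in residues

for score in (30, 60, 120):
    e = expect_value(score, query_len=300, db_len=DB_LETTERS)
    print(f"S = {score:3d}  bits = {bit_score(score):6.1f}  E = {e:.3g}")
```

E falls exponentially with the score but grows only linearly with query and database size. A short query, even when matched perfectly, can only reach a small raw score S, so its E value stays comparatively high.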
