1 / 72

Data bases ( Biosequences, Structures, Genomes, DNA Chips, Proteomics, Interactomics ) Design

BIOINFORMATICS. Data bases ( Biosequences, Structures, Genomes, DNA Chips, Proteomics, Interactomics ) Design Curation Data Mining. Computational Biology Tools for: Sequence analysis Structure prediction Docking Structural genomics Functional genomics Proteomics Interactomics.

yagil
Download Presentation

Data bases ( Biosequences, Structures, Genomes, DNA Chips, Proteomics, Interactomics ) Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BIOINFORMATICS • Data bases • (Biosequences, Structures, Genomes, DNA Chips, Proteomics, Interactomics) • Design • Curation • Data Mining • Computational Biology • Tools for: • Sequence analysis • Structure prediction • Docking • Structural genomics • Functional genomics • Proteomics • Interactomics

  2. The problem: Sequence comparison • How to compare two sequences • How to compare one sequence (target) to many sequences (database search) The solution: Sequence alignment

  3. Why to compare • Similarity search is necessary for: • Family assignment • Sequence annotation • Construction of phylogenetic trees • Protein structure prediction

  4. Sequence Alignment Rita Casadio

  5. Types of alignments: • Aligment of Pairs of Sequences • Multiple Sequence Alignment of three or more protein sequences

  6. Pairwise and multiple sequence alignments can be global or local • Globlal: the whole sequence is aligned • Local : only fragments of the sequence are aligned

  7. Pairwise sequence alignments can be global or local

  8. A basic concept: Measures of sequence similarity

  9. Given two character strings, two measures of the distance between them are: 1) The Hamming distance defined between two strings of equal length is the number of positions with mismatching characters agtc cgta Hd=2 2) The Levenshtein or edit distance between two strings of not necessarily equal length is the minimal number of edit operations required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence. A given sequence of edit operations induces an unique alignment, but not viceversa ag-tcc cgctca Ld=3

  10. Scoring schemes A scoring scheme must account for residue substitutions, insertions or deletions (gaps) Scores are measures of sequence similarity (similar sequences have small distances (high scores), dissimilar sequences give large distances (low scores))

  11. Algorithms for optimal aligment can seek either to minimize a dissimilarity measure or maximize a scoring function

  12. For nucleic acid sequences (Genetic Code Scoring) A simple scheme for substitution for nucleic acid sequences: match +1 mismatch -1 More complicated scheme are based on the higher frequency of transition mutations than transverse mutations

  13. Example: a t g c a 2010 5 5 t 10 205 5 g 5 5 2010 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine, and pyrimidine=pyrimidine are more common than purine=pyrimidine, or pyrimidine=purine.

  14. For proteins a variety of scoring schemes have been proposed Similarity of physicochemical type (Chemical similarity scoring. Eg. McLachlan similarity matrix:polar, non polar;size, shape, charge, rare (F)) Substitution matrices

  15. Substitution matrices for proteins • PAM • BLOSUM • Matrices derived from tertiary structure aligment

  16. Derivation of substitution matrices The Dayhoff mutation matrix: As sequence diverge, mutations accumulate. To measure the relative probability of any particular substitution we can count the number of changes in pairs of aligned similar sequences (relative frequency of such changes to form a scoring matrix for substitution) 1PAM: 1 Percent Accept Mutation two sequences 1PAM apart have 99% identical residues. One change in any position Collecting statistics from pairs of sequences closely related (1PAM) and correcting for different aminoacid abundances produces the 1PAM substitution matrix

  17. For more widely diverged sequences powers of 1PAM are used PAM250 (20% overall sequence identity) Different PAMS for different level of sequence identity PAM 0 30 80 110 200 250 % Identity 100 75 50 60 25 20 M.Dayhoff (1978) PET91 (Jones et al.,1992) is based on 2621 families

  18. Score of mutation i,j log observed i,j mutation rate/ mutation rate expected from aminoacid frequencies log-odds values (x10) 2 =0.2 (scaling) log 10= 10^0.2=1.6˜ 2 The value is the expectation value of the mutation The probability of two independent mutational events is the product of their probabilities (addition when we consider logs)

  19. PAM250

  20. The BLOSUM Matrices Henikoff and Henikoff (1991) Best performing in identifying distant relationships, making use of the much larger amount of data that had become available since M. Dayhoff’s work Based on a data base of multiple alignments without gaps for short regions of related sequences. Within each alignment in the data base, the sequences were clustered into groups where the sequences are similar at some threshold value of percentage identity log odds BLOSUM(blocks substitution matrix) BLOSUM40, 62, 80..

  21. BLOSUM62

  22. DOTPLOT ANALYSIS a simple picture that gives an overwiev of the similarites between two sequences The doplot is a table or matrix based on scoring schemes

  23. Dotlet - A Java applet for sequence comparisons using the dot matrix method http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

  24. How to search for internal repeats in the sequence

  25. How to search for conserved domains

  26. Exons and Introns: How to find them ANCALM (J05545): TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTGGACTAAGCGACGCCC TTGCCAGGCATCCAGCTTAGTGGCTGTTGGTTTATTTGTAGAGTCCCCTTAACTCTCTCTCCCCCACATCGCCCATCTCC ACCGACGCCTCTCTCTCTCGTGTTATTTCTCCCCATTCTCGCTTCATTTCCCATCCATTTTCGAGTTCTGCAATATCCTC ACTAACTAGTATAGCCATGGTACGCCTCACTCGATCATCATCGTTGTTCGTGCGCTCAAACGCATCCGCTGTGCGGGGCA GATCTACTGGTGTCCTCCTGCGTAGATGAGCTGACGACTTCACTTCCAGGCCGACTCTCTGACCGAAGAGCAAGTTTCCG AGTACAAGGAGGCCTTCTCCCTATTTGTAAGTGCCATTGGTTACTGTTATATCAAAATCGAATTTGTATTGAGAGTATAC TAATACATTCCGCACTAAACAGGACAAGGATGGCGATGGTTAGTGCATCTGTCCCCCCAGGCTTGATCGCATTCGCCCAG CATGTCTGCTGTAGCTCTATATAACCGTTTCTGACAAACGGCGACAGGCCAGATTACCACTAAGGAGCTTGGCACTGTCA TGCGCTCGCTCGGTCAGAATCCTTCAGAGTCTGAGCTTCAGGACATGATCAACGAAGTTGACGCCGACAACAATGGCACC ATTGACTTTCCAGGTACGCGAACTCCCCAATCTACTTCGCACCAGCCTAGAAATGTACTAATGCTAAACAGAGTTCCTTA CCATGATGGCCAGAAAGATGAAGGACACCGATTCCGAGGAGGAAATTCGGGAGGCGTTCAAGGTCTTCGACCGTGACAAC AATGGTTTCATCTCCGCTGCTGAGCTGCGTCACGTCATGACCTCGATCGGTGAGAAGCTCACCGATGACGAAGTCGACGA GATGATCCGCGAGGCGGACCAGGATGGCGACGGCCGAATTGACTGTACGTTGGCTCCCCGCTTATCCTTGACCGTAGAAG AGGTATGATACTGATCGGCTGCAGACAACGAATTCGTCCAACTTATGATGCAAAAATAAACGCTCTTACCTTTGATGTTT ATCGTTAGCGAAGAAGGTGTGGACACTTTCCAGCTGTCTCATCTTAGTTGTCATATCATTGAATGTAGCCTATCTGATTG CGGATAAGCAACTGATGGTTGTAACGGCTTCCATTTTGCTCTGACTTCTGAGTACCCTTTTCCTTCATGTTTGTTCGTCG ACCATTCTGCTAGTGAGATATGCGTAGAGTTGGGTAGGCTGAATTTACGAGTCTCTGTTGGGGGATATCACATGCTTCAC TACAATCTTTCTCTAC CALM_EMENI (P19533): ADSLTEEQVSEYKEAFSLFDKDGDGQITTKELGTVMRSLGQNPSESELQDMINEVDADNNGTIDFPEFLTMMARKMKDTD SEEEIREAFKVFDRDNNGFISAAELRHVMTSIGEKLTDDEVDEMIREADQDGDGRIDYNEFVQLMMQK

  27. Sequence Alignment:Methods Rita Casadio

  28. Alignment of pairs of sequences • Dot matrix analysis (dotplot) • The dynamic programming algorithm • Word or K-tuple methods (FASTA, BLAST)

  29. Sequence comparison with gaps Deletions are referred to as ``gaps'', while insertions and deletions are collectively referred to as ``indels''. Insertions and deletions are needed to align accurately even quite closely related sequences such as the and globins The naive approach to finding the best alignment of two sequences including gaps is to generate all possible alignments, add up the scores for equivalencing each amino acid pair in each alignment then select the highest scoring alignment. However, for two sequences of 100 residues there are alternative alignments so such an approach would be time consuming and infeasible for longer sequences.

  30. Finding the best alignment with dynamic programming A group of algorithms calculate the best score and alignment in the order of steps. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch (J Mol Biol 48, 443, 1970).

  31. Pairwise and multiple sequence alignments can be global or local • Globlal: the whole sequence is aligned • Local : only fragments of the sequence are aligned

  32. Pairwise sequence alignments can be global or local

  33. DATABASE SCANNING • Word or K-tuple methods (FAST, BLAST)

  34. Heuristic strategies for sequence comparison

  35. Sequence similarity with FASTA

  36. FASTA

  37. Sequence similarity with BLAST (Basic Local Alignment Search Tool)

  38. BLAST

  39. PAM250

  40. Blosum50 A R N D C Q E G H I L K M F P S T W Y V A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0 R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3 N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3 D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4 C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3 E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3 G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4 I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4 L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1 K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3 M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1 F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1 P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3 S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0 W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3 Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1 V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5

  41. BLOSUM62

  42. MEGABLAST Search Mega BLAST uses the greedy algorithm for nucleotide sequence alignment search. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When larger word size is used (see explanation below), it is up to 10 times faster than more common sequence similarity programs. Mega BLAST is also able to efficiently handle much longer DNA sequences than the blastn program of traditional BLAST algorithm.

More Related