
BCB 444/544



Presentation Transcript


  1. BCB 444/544 Finish: Lecture 2- Biological Databases Lecture 4 Sequence Alignment #4_Aug27 BCB 444/544 F07 ISU Dobbs #4 - Sequence Alignment

  2. Required Reading (before lecture) Mon Aug 27 - for Lecture #4 Pairwise Sequence Alignment • Chp 3 - pp 31-41 Xiong Textbook Wed Aug 29 - for Lecture #5 Dynamic Programming • Eddy: What is Dynamic Programming? Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment Fri Aug 31 - for Lecture #6 Scoring Matrices and Alignment Statistics • Chp 3 - pp 41-49

  3. HW#2:

  4. Back to: Chp 2 - Biological Databases • Xiong: Chp 2 Introduction to Biological Databases • What is a Database? • Types of Databases • Biological Databases • Pitfalls of Biological Databases • Information Retrieval from Biological Databases

  5. What is a Database? Duh!! OK: we'll skip that!

  6. Types of Databases 3 major types of electronic databases: • Flat files - simple text files • no organization to facilitate retrieval • Relational - data organized as tables ("relations") • shared features among tables allow rapid search • Object-oriented - data organized as "objects" • objects associated hierarchically

  7. Biological Databases Currently - all 3 types, but MANY flat files What are the goals of biological databases? • Information retrieval • Knowledge discovery Important issue: Interconnectivity

  8. Types of Biological Databases 1 - Primary • "simple" archives of sequences, structures, images, etc. • raw data, minimal annotations, not always well curated! 2 - Secondary • enhanced with more complete annotation of sequences, structures, images, etc. • usually curated! 3 - Specialized • focused on a particular research interest or organism • usually - not always - highly curated

  9. Examples of Biological Databases 1 - Primary • DNA sequences • GenBank - US • European Molecular Biology Lab - EMBL • DNA Data Bank of Japan - DDBJ • Structures (Protein, DNA, RNA) • PDB - Protein Data Bank • NDB - Nucleic Acid Database

  10. Examples of Biological Databases 2 - Secondary • Protein sequences • Swiss-Prot, TrEMBL, PIR • these recently combined into UniProt 3 - Specialized • Species-specific (or "taxonomic" specific) • FlyBase, WormBase, AceDB, PlantDB • Molecule-specific, disease-specific

  11. Pitfalls of Biological Databases • Errors! • Lack of documentation re: quality or reliability of data • Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!) • Redundancy • Inconsistency • Incompatibility (format, terminology, data types, etc.)

  12. Information Retrieval from Biological Databases 2 most popular retrieval systems: • ENTREZ - NCBI • will use a LOT - was introduced in Lab 1 • SRS - Sequence Retrieval System - EBI • will use less, similar to ENTREZ Both: • Provide access to multiple databases • Allow complex queries

  13. Web Resources: Bioinformatics & Computational Biology • NCBI - National Center for Biotechnology Information • ISCB - International Society for Computational Biology • JCB - Jena Center for Bioinformatics • Pitt - OBRC Online Bioinformatics Resources Collection • UBC - Bioinformatics Links Directory • UWash - BioMolecules • ISU - Bioinformatics Resources - Andrea Dinkelman • ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU) • Wikipedia: Bioinformatics

  14. ISU Resources & Experts ISU Research Centers & Graduate Training Programs: • LH Baker Center - Bioinformatics & Biological Statistics • BCB - Bioinformatics & Computational Biology • BCB Lab - (Student-Led Consulting & Resources) • CIAG - Center for Integrated Animal Genomics • CCILD - Computational Intelligence, Learning & Discovery • IGERT Training Grant - Computational Molecular Biology ISU Facilities: • Biotechnology - Instrumentation Facilities • PSI - Plant Sciences Institute • PSI Centers

  15. SUMMARY: #2 - Biological Databases BEWARE!

  16. Chp 3 - Sequence Alignment SECTION II SEQUENCE ALIGNMENT Xiong: Chp 3 Pairwise Sequence Alignment • Evolutionary Basis • Sequence Homology versus Sequence Similarity • Sequence Similarity versus Sequence Identity • Methods • Scoring Matrices • Statistical Significance of Sequence Alignment Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

  17. Motivation for Sequence Alignment "Sequence comparison lies at the heart of bioinformatics analysis." (Jin Xiong) Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins Pairwise sequence alignment is fundamental; it is used to: • Search for common patterns of characters • Establish pairwise correspondence between related sequences Pairwise sequence alignment is the basis for: • Database searching (e.g., BLAST) • Multiple sequence alignment (MSA)

  18. Why Align Sequences? Databases contain many sequences with known functions & many sequences with unknown functions. Genes (or proteins) with similar sequences may have similar structures and/or functions. Sequence alignment can provide important clues to the function of a novel gene or protein.

  19. Examples of Bioinformatics Tasks that Rely on Sequence Alignment • Genomic sequencing (> 500 complete genomes sequenced!) • Assembling multiple sequence reads into contigs, scaffolds • Aligning sequences with chromosomes • Finding genes and regulatory regions • Identifying gene products • Identifying function of gene products • Studying the structural organization of genomes • Comparative genomics

  20. Evolutionary Basis • DNA, RNA and proteins are "molecular fossils" • they encode the history of millions of years of evolution • During evolution, molecular sequences accumulate random changes (mutations/variants) • some of which provide a selective advantage or disadvantage, and some of which are neutral • Sequences that are structurally and/or functionally important tend to be conserved • (e.g., chromosomal telomeric sequences; enzyme active sites) • Significant sequence conservation allows inference of evolutionary relatedness

  21. Homology Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly For us: Homology = similarity due to descent from a common evolutionary ancestor But, HOMOLOGY ≠ SIMILARITY When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous We can infer homology from similarity (can't prove it!)

  22. Orthologs vs Paralogs 2 types of homologous sequences: • Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions (e.g., human α-globin & mouse α-globin) • Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions (e.g., human α-globin & human β-globin) [Figure: A is the parent gene; speciation leads to B & C; duplication leads to C'. B and C are orthologous; C and C' are paralogous.]

  23. Sequence Homology vs Similarity • Homologous sequences - sequences that share a common evolutionary ancestry • Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge) IMPORTANT: • Sequence homology: • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity • Homology is qualitative • Sequence similarity: • The direct result of observation from a sequence alignment • Similarity is quantitative; can be described using percentages

  24. Sequence Similarity vs Identity For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning: • Two DNA sequences can share a high degree of sequence identity (or similarity) -- means the same thing • Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!) For protein sequences, sequence similarity and identity have different meanings: • Identity = % of exact matches between two aligned sequences • Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles) • Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)
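The identity/similarity distinction for proteins can be made concrete with a small Python sketch. This is an illustration only, not how any particular tool computes these values: the residue groups in `SIMILAR_GROUPS` are a hand-picked assumption, the function name is hypothetical, and using gap-free columns as the denominator is just one of several conventions. The example pair is the protein alignment from slide 26.

```python
# Hypothetical grouping of amino acids by physicochemical class.
# This is an assumption for illustration; real tools typically derive
# "similar" pairs from a substitution matrix instead.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("STNQ"),  # polar, uncharged
]


def percent_identity_and_similarity(a, b, gap="-"):
    """Return (% identity, % similarity) over gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither row has a gap (a convention, assumed here).
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0, 0.0
    identical = sum(x == y for x, y in columns)
    # Identical residues also count as similar.
    similar = sum(
        x == y or any(x in g and y in g for g in SIMILAR_GROUPS)
        for x, y in columns
    )
    n = len(columns)
    return 100.0 * identical / n, 100.0 * similar / n


identity, similarity = percent_identity_and_similarity("RKVA-GMA", "RKIAVAMA")
print(f"{identity:.1f}% identity, {similarity:.1f}% similarity")
# prints "71.4% identity, 85.7% similarity"
```

Under these assumptions the V/I column counts toward similarity but not identity, which is why the two percentages differ.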

  25. What is Sequence Alignment? Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence. Align: 1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT. 2: THIS IS A SHORT SENTENCE. 1: THIS IS A RATHER LONGERSENTENCE THAN THE NEXT. 2: THIS IS A######SHORT##SENTENCE##############. OR 1: THIS IS A RATHER LONGERSENTENCE THANTHENEXT. 2: THIS IS A ##SHORT###SENT#EN###CE##############. Is one of these alignments "optimal"? Which is better?

  26. Goal of Sequence Alignment Find the best pairing of 2 sequences, such that there is maximum correspondence between residues • DNA: 4-letter alphabet (+ gap), e.g. TTGACAC aligned with TTTACAC • Proteins: 20-letter alphabet (+ gap), e.g. RKVA-GMA aligned with RKIAVAMA

  27. Statement of Problem Given: • 2 sequences • Scoring system for evaluating match (or mismatch) of two characters • Penalty function for gaps in sequences Find: Optimal pairing of sequences that • Retains the order of characters • Introduces gaps where needed • Maximizes total score

  28. Types of Sequence Variation • Sequences can diverge from a common ancestor through various types of mutations: • Substitutions ACGA → AGGA • Insertions ACGA → ACCGA • Deletions ACGA → AGA • Insertions or deletions ("indels") result in gaps in alignments • Substitutions result in mismatches • No change results in a match
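As a rough illustration of how these mutation types show up in an alignment, the Python sketch below labels each column of an already-aligned pair. The function name and labels are hypothetical, and where the gap is placed (for example `AC-GA` rather than `A-CGA`) is an assumption, since both placements describe the same insertion equally well.

```python
from collections import Counter


def classify_columns(ancestor, descendant, gap="-"):
    """Label each aligned column as match, substitution, insertion or deletion."""
    if len(ancestor) != len(descendant):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for x, y in zip(ancestor, descendant):
        if x == gap and y == gap:
            continue  # a column of two gaps carries no information
        if x == gap:
            labels.append("insertion")     # residue present only in descendant
        elif y == gap:
            labels.append("deletion")      # residue lost in descendant
        elif x == y:
            labels.append("match")         # no change
        else:
            labels.append("substitution")  # shows up as a mismatch
    return labels


# The three examples from the slide, written as aligned pairs.
print(Counter(classify_columns("ACGA", "AGGA")))    # substitution
print(Counter(classify_columns("AC-GA", "ACCGA")))  # insertion
print(Counter(classify_columns("ACGA", "A-GA")))    # deletion
```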

  29. Gaps Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein

  30. Avoiding Random Alignments with a Scoring Function • Introducing too many gaps generates nonsense alignments, e.g.: s--e-----qu---en--ce aligned with sometimesquipsentice • Need to distinguish between alignments that occur due to homology and those that occur by chance • Define a scoring function that accounts for mismatches and gaps Scoring Function (F), e.g.: match score m = +1, mismatch score s = -1, gap score d = -2; F = m(#matches) + s(#mismatches) + d(#gaps)
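The scoring function F can be applied directly to an alignment that has already been written out. The Python sketch below is a minimal illustration using the example values m = +1, s = -1, d = -2; the function name is hypothetical, and charging d once per gap character (a linear gap penalty) is an assumption, since the slide does not say how longer gaps are penalized.

```python
def alignment_score(row1, row2, m=1, s=-1, d=-2, gap="-"):
    """Compute F = m(#matches) + s(#mismatches) + d(#gaps) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            score += d  # gap column (assumed: one penalty per gap character)
        elif x == y:
            score += m  # match
        else:
            score += s  # mismatch
    return score


# The gap-heavy "nonsense" alignment: 8 matches, 12 gaps.
print(alignment_score("s--e-----qu---en--ce", "sometimesquipsentice"))  # prints -16

# An ungapped DNA pair: 6 matches, 1 mismatch.
print(alignment_score("TTGACAC", "TTTACAC"))  # prints 5
```

Although every letter of "sequence" finds a partner in the first example, the twelve gaps drive the score well below zero, which is the point of penalizing gaps.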

  31. Not All Mismatches are the Same • Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala • A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions • Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

  32. Substitution Matrix s(a,b) corresponds to score of aligning character a with character b Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
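A substitution matrix is, in effect, a symmetric lookup table for s(a,b). The Python sketch below shows the idea for four residues only. The numbers in `ILLUSTRATIVE_SCORES` are made up for this example (they are not PAM or BLOSUM values) and are chosen only so that Ser/Thr scores higher than Trp/Ala, as slide 31 describes; `score_ungapped` is a hypothetical helper.

```python
# Hand-picked, illustrative scores for S, T, W, A only (an assumption,
# not taken from any published matrix). Each unordered pair is stored once.
ILLUSTRATIVE_SCORES = {
    ("S", "S"): 4, ("T", "T"): 4, ("W", "W"): 4, ("A", "A"): 4,
    ("S", "T"): 1,                  # chemically similar: mild positive score
    ("S", "A"): 0, ("T", "A"): 0,
    ("W", "A"): -3, ("S", "W"): -3, ("T", "W"): -3,  # dissimilar: penalized
}


def s(a, b):
    """Score of aligning residue a with residue b (symmetric lookup)."""
    if (a, b) in ILLUSTRATIVE_SCORES:
        return ILLUSTRATIVE_SCORES[(a, b)]
    return ILLUSTRATIVE_SCORES[(b, a)]


def score_ungapped(seq1, seq2):
    """Sum s(a, b) over the columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(s(a, b) for a, b in zip(seq1, seq2))


print(s("S", "T"), s("W", "A"))      # prints "1 -3"
print(score_ungapped("STA", "TTW"))  # 1 + 4 + (-3), prints 2
```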

  33. Methods • Global and Local Alignment • Alignment Algorithms • Dot Matrix Method • Dynamic Programming Method • Gap penalties • DP for Global Alignment • DP for Local Alignment • Scoring Matrices • Amino acid scoring matrices • PAM • BLOSUM • Comparisons between PAM & BLOSUM • Statistical Significance of Sequence Alignment

  34. Global vs Local Alignment Global alignment • Finds best possible alignment across entire length of 2 sequences • Aligned sequences assumed to be generally similar over entire length Local alignment • Finds local regions with highest similarity between 2 sequences • Aligns these without regard for rest of sequence • Sequences are not assumed to be similar over entire length

  35. Global vs Local Alignment - example S = CTGTCGCTGCACG T = TGCCGTG Candidate alignments shown (local and global): (a) CTGTCG-CTGCACG aligned with -TGC-CG-TG---- (b) CTGTCGCTGCACG-- aligned with -------TGC-CGTG (c) CTGTCG-CTGCACG aligned with -TGCCG--TG---- Which is better?

  36. Global vs Local Alignment When to use which? Both are important but it is critical to use the right method for a given task! Global alignment: • Good for: aligning closely related sequences of approx. same length • Not good for: divergent sequences or sequences with different lengths Local Alignment: • Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences • Not good for: generating alignment of closely related sequences Global and local alignments are fundamentally similar and differ only in optimization strategy used in aligning similar residues
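The remark that global and local alignment "differ only in optimization strategy" can be seen in a short Python sketch of the two dynamic programming scores. Dynamic programming is the topic of the next lecture, so treat this as a preview under stated assumptions: it uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, the example scores m = +1, s = -1, d = -2 from slide 30 with a linear gap penalty, and it returns only the optimal score, not the alignment itself. The function names are hypothetical.

```python
def global_score(a, b, m=1, s=-1, d=-2):
    """Optimal global (Needleman-Wunsch) score; both sequences aligned end to end."""
    prev = [j * d for j in range(len(b) + 1)]  # row 0: leading gaps are penalized
    for i in range(1, len(a) + 1):
        curr = [i * d]                         # column 0: leading gaps are penalized
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (m if a[i - 1] == b[j - 1] else s)
            curr.append(max(diag, prev[j] + d, curr[j - 1] + d))
        prev = curr
    return prev[-1]                            # score in the bottom-right cell


def local_score(a, b, m=1, s=-1, d=-2):
    """Optimal local (Smith-Waterman) score; best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)                  # row 0: starting anywhere is free
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (m if a[i - 1] == b[j - 1] else s)
            # Same three choices as the global case, plus 0 = "start afresh here".
            cell = max(0, diag, prev[j] + d, curr[j - 1] + d)
            curr.append(cell)
            best = max(best, cell)             # the best cell may be anywhere
        prev = curr
    return best


S, T = "CTGTCGCTGCACG", "TGCCGTG"  # sequences from the slide 35 example
print("global:", global_score(S, T), "local:", local_score(S, T))
```

The only differences are the initialization, the extra `0` inside `max`, and where the final score is read from. With sequences of very different length, the global score is dragged down by the unavoidable end gaps, while the local score ignores them.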

  37. Alignment Algorithms 3 major methods for alignment: • Dot matrix analysis • Dynamic Programming • Word or k-tuple methods (later, in Chp 4)

  38. Dot Matrix Method (Dot Plots) • Place 1 sequence along top row of matrix • Place 2nd sequence along left column of matrix • Plot a dot each time there is a match between an element of row sequence and an element of column sequence • For proteins, usually use more sophisticated scoring schemes than "identical match" • Diagonal lines indicate areas of match • Reverse diagonals (perpendicular to diagonal) indicate inversions [Figure: example dot plot; axis letters C G G A C A C A C G] Exploring Dot Plots
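The dot matrix procedure described above is simple enough to sketch in a few lines of Python. This is a bare-bones illustration that uses exact identity as the match rule and applies no windowing or filtering, which real dot plot programs usually add; the function names and the example sequences are hypothetical and are not the ones in the slide's figure.

```python
def dot_matrix(top, left):
    """Return a grid with True wherever a row letter equals a column letter."""
    # One row per letter of `left`, one column per letter of `top`.
    return [[r == c for c in top] for r in left]


def render(top, left):
    """Draw the dot plot as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(top)]  # header: sequence along the top row
    for r, row in zip(left, dot_matrix(top, left)):
        lines.append(r + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Comparing a sequence with itself puts a full line on the main diagonal;
# off-diagonal dots reflect repeated letters.
print(render("GATTACA", "GATTACA"))
```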
