BCB 444/544 Finish: Lecture 2- Biological Databases Lecture 4 Sequence Alignment #4_Aug27 BCB 444/544 F07 ISU Dobbs #4 - Sequence Alignment
Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41, Xiong Textbook
Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49
HW#2:
Back to: Chp 2 - Biological Databases
Xiong: Chp 2 Introduction to Biological Databases
• What is a Database?
• Types of Databases
• Biological Databases
• Pitfalls of Biological Databases
• Information Retrieval from Biological Databases
What is a Database? Duh!! OK: we'll skip that!
Types of Databases
3 major types of electronic databases:
• Flat files - simple text files
  • no organization to facilitate retrieval
• Relational - data organized as tables ("relations")
  • shared features among tables allow rapid search
• Object-oriented - data organized as "objects"
  • objects associated hierarchically
Biological Databases
Currently - all 3 types, but MANY flat files
What are the goals of biological databases?
• Information retrieval
• Knowledge discovery
Important issue: Interconnectivity
Types of Biological Databases
1 - Primary
• "simple" archives of sequences, structures, images, etc.
• raw data, minimal annotations, not always well curated!
2 - Secondary
• enhanced with more complete annotation of sequences, structures, images, etc.
• usually curated!
3 - Specialized
• focused on a particular research interest or organism
• usually - not always - highly curated
Examples of Biological Databases
1 - Primary
• DNA sequences
  • GenBank - US
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Database
Examples of Biological Databases
2 - Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
3 - Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific
Pitfalls of Biological Databases
• Errors!
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)
Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
  • will use a LOT - was introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
  • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries
Web Resources: Bioinformatics & Computational Biology
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• Pitt - OBRC Online Bioinformatics Resources Collection
• UBC - Bioinformatics Links Directory
• UWash - BioMolecules
• ISU - Bioinformatics Resources - Andrea Dinkelman
• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)
• Wikipedia: Bioinformatics
ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
• LH Baker Center - Bioinformatics & Biological Statistics
• BCB - Bioinformatics & Computational Biology
• BCB Lab - (Student-Led Consulting & Resources)
• CIAG - Center for Integrated Animal Genomics
• CCILD - Computational Intelligence, Learning & Discovery
• IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
• Biotechnology - Instrumentation Facilities
• PSI - Plant Sciences Institute
• PSI Centers
SUMMARY: #2 - Biological Databases
BEWARE!
Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• Evolutionary Basis
• Sequence Homology versus Sequence Similarity
• Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Motivation for Sequence Alignment
"Sequence comparison lies at the heart of bioinformatics analysis." - Jin Xiong
Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
• Search for common patterns of characters
• Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is the basis for:
• Database searching (e.g., BLAST)
• Multiple sequence alignment (MSA)
Why Align Sequences?
Databases contain many sequences with known functions & many sequences with unknown functions.
Genes (or proteins) with similar sequences may have similar structures and/or functions.
Sequence alignment can provide important clues to the function of a novel gene or protein.
Examples of Bioinformatics Tasks that Rely on Sequence Alignment
• Genomic sequencing (> 500 complete genomes sequenced!)
• Assembling multiple sequence reads into contigs, scaffolds
• Aligning sequences with chromosomes
• Finding genes and regulatory regions
• Identifying gene products
• Identifying function of gene products
• Studying the structural organization of genomes
• Comparative genomics
Evolutionary Basis
• DNA, RNA and proteins are "molecular fossils"
  • they encode the history of millions of years of evolution
• During evolution, molecular sequences accumulate random changes (mutations/variants)
  • some of which provide a selective advantage or disadvantage, and some of which are neutral
• Sequences that are structurally and/or functionally important tend to be conserved
  • (e.g., chromosomal telomeric sequences; enzyme active sites)
• Significant sequence conservation allows inference of evolutionary relatedness
Homology
Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly
For us: Homology = similarity due to descent from a common evolutionary ancestor
But, HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)
Orthologs vs Paralogs
2 types of homologous sequences:
• Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions (e.g., human α-globin & mouse α-globin)
• Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions (e.g., human α-globin & human β-globin)
Diagram: gene A, with Speciation and Duplication events giving B, C, and C'
• A is the parent gene
• Speciation leads to B & C
• Duplication leads to C'
• B and C are orthologous
• C and C' are paralogous
Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages
Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:
• Two DNA sequences can share a high degree of sequence identity (or similarity) - they mean the same thing
• Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity have different meanings:
• Identity = % of exact matches between two aligned sequences
• Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)
• Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)
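As a rough illustration of the identity/similarity distinction for proteins, the sketch below computes both percentages for a pair of already-aligned sequences. The residue groups in `SIMILAR_GROUPS`, the function names, and the choice of alignment length as the denominator are all assumptions made for this example, not definitions from the lecture or the Xiong text; real tools usually define similarity through a substitution matrix.

```python
# Illustrative residue groups (an assumption for this sketch, not a standard).
SIMILAR_GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "CGP"]


def _check(a: str, b: str) -> None:
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and equal in length")


def percent_identity(a: str, b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows."""
    _check(a, b)
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)  # denominator: alignment length (assumed convention)


def percent_similarity(a: str, b: str) -> float:
    """Percent of columns that are identical or fall in the same residue group."""
    _check(a, b)

    def similar(x: str, y: str) -> bool:
        if "-" in (x, y):
            return False  # a gap column is never counted as similar
        return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

    return 100.0 * sum(similar(x, y) for x, y in zip(a, b)) / len(a)


# Protein pair that appears on the "Goal of Sequence Alignment" slide.
print(percent_identity("RKVA-GMA", "RKIAVAMA"))    # 62.5
print(percent_similarity("RKVA-GMA", "RKIAVAMA"))  # 75.0
```

Under these assumed groups, the V/I column counts as similar but not identical, which is why the two percentages differ for this protein pair; for DNA the two functions would need no separate grouping at all.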
What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.
Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.
Alignment A:
1: THIS IS A RATHER LONGERSENTENCE THAN THE NEXT.
2: THIS IS A######SHORT##SENTENCE##############.
OR Alignment B:
1: THIS IS A RATHER LONGERSENTENCE THANTHENEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"? Which is better?
Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4-letter alphabet (+ gap)
  TTGACAC
  TTTACAC
• Proteins: 20-letter alphabet (+ gap)
  RKVA-GMA
  RKIAVAMA
Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score
Types of Sequence Variation
• Sequences can diverge from a common ancestor through various types of mutations:
  • Substitutions: ACGA → AGGA
  • Insertions: ACGA → ACCGA
  • Deletions: ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change → match
Gaps
Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein
Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
  s--e-----qu---en--ce
  sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that accounts for mismatches and gaps
Scoring Function (F):
  Match: +m (e.g., +1)
  Mismatch: -s (e.g., -1)
  Gap: -d (e.g., -2)
F = m(#matches) - s(#mismatches) - d(#gaps), where m, s and d are positive (here m = 1, s = 1, d = 2)
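The scoring function can be tried out with a few lines of Python. This is a minimal sketch, assuming the example values from the slide (match +1, mismatch -1, gap -2) and assuming that every gap position is charged separately; the function name `score_alignment` is made up for illustration.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum per-column scores for two aligned rows of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # each gap position is penalized on its own
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# The gap-heavy "nonsense" alignment: 8 matches, 12 gap positions.
print(score_alignment("s--e-----qu---en--ce", "sometimesquipsentice"))  # -16

# The DNA pair from the "Goal of Sequence Alignment" slide: 6 matches, 1 mismatch.
print(score_alignment("TTGACAC", "TTTACAC"))  # 5
```

The strongly negative score for the first pair shows how a gap penalty keeps an alignment from being "improved" simply by inserting gaps until letters line up.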
Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)
Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
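To make s(a,b) concrete, here is a small sketch of a symmetric lookup table. The numbers in `TOY_SCORES` are invented for illustration only (they are not taken from PAM or BLOSUM); they simply encode the earlier point that a Ser/Thr substitution should cost less than a Trp/Ala substitution. `DEFAULT_MISMATCH` is likewise an assumed fallback.

```python
# Toy scores, invented for illustration only (not PAM or BLOSUM values).
TOY_SCORES = {
    frozenset("S"): 4, frozenset("T"): 4, frozenset("W"): 4, frozenset("A"): 4,
    frozenset("ST"): 1,   # Ser/Thr: similar residues, mildly rewarded
    frozenset("WA"): -3,  # Trp/Ala: dissimilar residues, penalized
}
DEFAULT_MISMATCH = -1  # assumed score for any pair not listed above


def s(a: str, b: str) -> int:
    """Score for aligning residue a with residue b; symmetric in a and b."""
    return TOY_SCORES.get(frozenset((a, b)), DEFAULT_MISMATCH)


print(s("S", "T"), s("T", "S"))  # 1 1
print(s("W", "A"))               # -3
print(s("A", "A"))               # 4
```

Keying the table on an unordered pair (`frozenset`) guarantees s(a,b) = s(b,a) without storing each entry twice.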
Methods
• Global and Local Alignment
• Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Candidate alignments shown on the slide (global and local):
  CTGTCG-CTGCACG
  -TGC-CG-TG----

  CTGTCGCTGCACG--
  -------TGC-CGTG

  CTGTCG-CTGCACG
  -TGCCG--TG----
Which is better?
Global vs Local Alignment
When to use which? Both are important, but it is critical to use the right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of approx. same length
• Not good for: divergent sequences or sequences with different lengths
Local alignment:
• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
• Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in the optimization strategy used in aligning similar residues
Alignment Algorithms
3 major methods for alignment:
• Dot matrix analysis
• Dynamic Programming
• Word or k-tuple methods (later, in Chp 4)
Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Reverse diagonals (perpendicular to diagonal) indicate inversions
Exploring Dot Plots
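The dot-plot recipe above can be sketched in a few lines of Python. This is a minimal text-only version that uses the simple "identical match" rule; the function name `dot_plot` and the two example sequences are hypothetical choices for illustration, not the sequences from the slide's figure.

```python
from typing import List


def dot_plot(top: str, left: str) -> List[str]:
    """Return one text row per character of `left`; '*' marks an identical match."""
    rows = []
    for row_char in left:          # second sequence runs down the left column
        row = "".join("*" if row_char == col_char else "." for col_char in top)
        rows.append(row)           # first sequence runs along the top row
    return rows


# Hypothetical sequences: the top one contains the left one twice.
top, left = "ACGTACGT", "ACGT"
print("  " + top)
for ch, row in zip(left, dot_plot(top, left)):
    print(ch + " " + row)
```

In the printed grid, the two runs of `*` along diagonals correspond to the two places where ACGT occurs in the top sequence, which is the "diagonal lines indicate areas of match" pattern described in the bullets.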