Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
• Sequence approximation in computational molecular biology: the premise and the limits
• Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology
• The notion of string/sequence similarity
• Substitution matrices for sequence alignment

Before we start: literature watch
A draft of the rat genome has been published! (RGSPC, Nature 428)
• What are the first conclusions from the comparison with other mammalian genomes?
• What approaches and tools have been used to perform this comparative analysis?
[Figure: genome sizes. Rat (R): 2.75 Gb; mouse (M): 2.5 Gb; human (H): 2.9 Gb. R unique: 0.7 Gb; R common with both H and M: 1.1 Gb.]

Biological Polymers and Central Dogma
Bio-Polymer (alphabet)    Process (algorithm)
DNA (A,T,G,C)             replication, transcription
mRNA (U,A,C,G)            splicing, translation
Proteins (20 a.a.)        folding, interactions
Lipids, polysaccharides, membranes, signal transduction, environmental signals etc.
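
The "alphabet plus algorithm" view of the central dogma can be sketched as plain string operations. This is only a toy illustration: the function names are made up here, the codon table holds just five entries of the standard genetic code, and splicing, folding and interactions are not modeled at all.

```python
# Toy model: each step of the central dogma as an operation on strings.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Only a handful of codons from the standard genetic code, for illustration.
CODON_TABLE = {"AUG": "M", "GUU": "V", "CUG": "L", "UCU": "S", "UAA": "*"}


def reverse_complement(strand: str) -> str:
    """Replication: the complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def transcribe(coding_strand: str) -> str:
    """Transcription: the mRNA reads like the coding strand with U for T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons in frame 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons not in the toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGTTCTGTCTTAA"
mrna = transcribe(dna)
print(mrna, translate(mrna))
```

The example DNA string was chosen so that its translation spells the first four residues of the myoglobin query sequence used later in the lecture.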

Complexity of “DNA computing” http://www.genecrc.org/site/lc/lc2d.htm

Get the relevant sequences to compare them: conservation and differences
Problem → Algorithms → Programs
• Sequencing → fragment assembly problem → the Shortest Superstring Problem → Phrap (Green, 1994)
• Gene finding → Hidden Markov Models, pattern recognition methods → GenScan (Burge & Karlin, 1997)
• Sequence comparison → pairwise and multiple sequence alignments → dynamic programming, heuristic methods → BLAST (Altschul et al., 1990)

Redundancy in biological systems
An example: sperm whale vs. human myoglobin:
Query: 1    MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 60
            M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
Sbjct: 1    MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60
Query: 61   DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH 120
            DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
Sbjct: 61   DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120
Query: 121  PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG 154
            PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
Sbjct: 121  PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154
Ex. Find the sequence of 1mba in the PDB and “blast” it against nr using NCBI.
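
One simple way to quantify the redundancy visible in such an alignment is the fraction of identical positions. The sketch below assumes an ungapped alignment of two equal-length strings, as in the example above; `percent_identity` is a name introduced here for illustration, not part of BLAST.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions carrying the same letter.

    Assumes an ungapped alignment, i.e. two strings of equal length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# First ten aligned residues of the query and subject sequences.
print(percent_identity("MVLSEGEWQL", "MGLSDGEWQL"))  # 8 of 10 positions match -> 80.0
```

Applied to the full 154-residue alignment, the same function gives the overall sequence identity of the two myoglobins.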

Limits of the sequence approximation
• All the information and various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seem to be included in nucleotide and protein sequences
• However, there are limits to this simple approximation: actual understanding of molecular processes requires structure, chemistry, kinetics and thermodynamics
• On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measurements etc.

Strings, sequences and string operations
String vs. sequence duality will be important for exact vs. inexact string matching
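
The definitions on this slide did not survive in the text, so the sketch below assumes the common convention: a substring is a contiguous run of characters, while a subsequence keeps the order of characters but may skip some. Exact matching then asks about substrings, and inexact matching (alignment) about subsequences. The function names are illustrative.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if pattern occurs in text as a contiguous block (exact matching)."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator past the first match of ch,
    # so later characters can only be matched further to the right.
    return all(ch in remaining for ch in pattern)


text = "MVLSEGEWQL"
print(is_substring("GEW", text))    # contiguous occurrence
print(is_substring("MLS", text))    # not contiguous: V sits between M and L
print(is_subsequence("MLS", text))  # in order, with one skipped character
```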

Beyond the letters: how to find better models (e.g. GC content for gene finding) http://www.imb-jena.de/IMAGE_BPDIR.html
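
As a minimal sketch of one such "beyond the letters" feature, GC content can be computed globally or in a sliding window along a sequence. The window and step sizes are arbitrary parameters chosen for illustration; real gene finders combine this signal with many others.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C letters in a DNA string (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def windowed_gc(seq: str, window: int, step: int = 1) -> list[float]:
    """GC content of each full-length window, moving `step` letters at a time."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGC"))
print(windowed_gc("AAGGCC", window=2, step=2))
```

Regions where the windowed value departs clearly from the genome-wide average are candidates for closer inspection.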

Another example: active sites, functional motifs and multiple alignment

Distance and similarity measures

Edit distance vs. substitution score
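
The contrast in this slide title can be sketched as follows. The slide body is not available, so two assumptions are made: edit distance is taken as the unit-cost (Levenshtein) distance, which is minimized, and the substitution score is a toy match/mismatch score over an ungapped alignment, which is maximized. Both function names and the +1/-1 values are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i letters of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def ungapped_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Similarity of two equal-length strings: sum of per-position scores."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs strings of equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(edit_distance("kitten", "sitting"))
print(ungapped_score("ACGT", "ACGA"))
```

A distance is zero for identical strings and grows with dissimilarity, whereas a score grows with similarity and can become negative; replacing the fixed match/mismatch values with entries of a substitution matrix leads to the scoring discussed next.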

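Substitution matrices for protein sequence alignment: learning and extrapolating from examples
• PAM matrices (Dayhoff et al.): extrapolating to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time): $s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}$ (the log-odds ratio $\log \frac{p_{ab}}{q_a q_b}$ with $p_{ab} = q_a P(b \mid a,t)$), where e.g. $P(b \mid a,2) = \sum_c P(b \mid c,1)\,P(c \mid a,1)$
• BLOSUM matrices (Henikoff & Henikoff): multiple alignments of more distantly related proteins (e.g. BLOSUM50, built from sequences clustered at 50% sequence identity): $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$, where $p_{ab} = F_{ab} / \sum_{cd} F_{cd}$
• Expected score: $\sum_{ab} q_a q_b\, s(a,b) = -\sum_{ab} q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q \,\|\, p)$, i.e. minus the relative entropy of the background pair distribution $q_a q_b$ with respect to $p_{ab}$, so the expected score of a random pair is never positive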
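
The formulas above can be tried out on a toy two-letter alphabet. This is a sketch under stated assumptions, not the published PAM or BLOSUM procedure: the pair counts are invented, the count table is assumed to be symmetric and to list both (a, b) and (b, a), scores are in bits (log base 2), and the function names are introduced here for illustration.

```python
import math


def log_odds_scores(pair_counts):
    """BLOSUM-style scores s(a,b) = log2(p_ab / (q_a q_b)) from pair counts F_ab."""
    total = sum(pair_counts.values())
    p = {pair: count / total for pair, count in pair_counts.items()}
    q = {}  # background frequencies as marginals of p
    for (a, _b), p_ab in p.items():
        q[a] = q.get(a, 0.0) + p_ab
    scores = {
        (a, b): math.log2(p_ab / (q[a] * q[b]))
        for (a, b), p_ab in p.items()
        if p_ab > 0  # pairs never observed are skipped (log of zero)
    }
    return scores, q


def expected_score(scores, q):
    """Sum over pairs of q_a q_b s(a,b): the score expected for unrelated letters."""
    return sum(q[a] * q[b] * s for (a, b), s in scores.items())


def compose(P):
    """PAM-style extrapolation: P(b|a,2) = sum_c P(b|c,1) P(c|a,1), with P[a][b] = P(b|a,1)."""
    letters = list(P)
    return {
        a: {b: sum(P[a][c] * P[c][b] for c in letters) for b in letters}
        for a in letters
    }


# Invented counts: identical pairs are seen four times as often as mismatches.
counts = {("A", "A"): 4, ("A", "B"): 1, ("B", "A"): 1, ("B", "B"): 4}
scores, q = log_odds_scores(counts)
print(scores[("A", "A")], scores[("A", "B")], expected_score(scores, q))

one_step = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
print(compose(one_step)["A"])
```

Pairs observed more often than chance receive positive scores, rarer pairs negative ones, and the expected score comes out negative, in line with the relative-entropy identity.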

Summary

Web resources and materials for the course
• Protein Modeling Lab
• Remote access to PML and the Citrix software
• All lectures and other materials available electronically from the PML servers
• Electronic tests and homework, web submission interfaces
• The web site for the Introduction to Bioinformatics course
• Updates
http://folding.chmcc.org
http://folding.chmcc.org/protlab/protlab.html
http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html
