Chapter 6: The Computational Foundations of Genomics. Applying algorithms to analyze genomics data. © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Contents • What are computational biology and bioinformatics? • Understanding computers and algorithms • Sequence alignment • Gene prediction • Algorithms for analysis of phylogeny • Analysis of microarray data • Computer simulation
Computational Biology and Bioinformatics • Computational biology • Development of computational methods to solve problems in biology • Bioinformatics • Application of computational biology to analysis and management of real molecular biology data • Why do molecular biologists need computer science? • Discrete nature of sequence data is ideal for analysis using digital computers • Size and complexity of genomics data make the data impossible to analyze without computers
Algorithm • An algorithm is a procedure (a finite set of well-defined instructions) for accomplishing some task • A recipe to perform a task • Algorithms often have steps that repeat (iterate) or require decisions (such as logic or comparison) • Algorithms can be composed to create more complex algorithms • Concept formalized in 1936
A historical perspective • The 1960s: the birth of bioinformatics • High-level computer languages • Protein sequence data • Academic access to computers • Margaret Oakley Dayhoff • First protein database • First program for automatic sequence assembly (Figure: IBM 7090 computer)
Solving problems in computer science • Necessary parameters for assessing the difficulty of a computer science problem • Algorithmic complexity • Is the problem theoretically solvable? • If so, what is the most efficient solution? • Current state of computer technology • Memory • CPU speed • Cost • Example: sequencing entire genomes via the shotgun approach was not possible until the mid-1990s because the necessary computational power was unavailable until then
Algorithmic problems • Example: searching for a number in an unordered list • If the list has N numbers, the average time the search takes is proportional to N • A more clever approach • Place the numbers in order • Do a binary search • Step 1: Pick the number in the middle of the list • Step 2: Restrict the search to the half that contains your number • Return to Step 1 until you find your number • Time for this approach is proportional to log2 N
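The two search strategies on this slide can be compared with a minimal Python sketch. The function names and the comparison counters are illustrative assumptions added here, not part of the original slides.

```python
def linear_search(values, target):
    """Scan an unordered list front to back.

    Returns (index, comparisons); index is -1 if the target is absent.
    """
    for i, value in enumerate(values):
        if value == target:
            return i, i + 1
    return -1, len(values)


def binary_search(sorted_values, target):
    """Repeatedly halve a sorted list around its middle element.

    Returns (index, steps); index is -1 if the target is absent.
    """
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2          # Step 1: pick the middle element
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            lo = mid + 1              # Step 2: keep the upper half
        else:
            hi = mid - 1              # Step 2: keep the lower half
    return -1, steps


numbers = list(range(0, 2000, 2))      # 1000 sorted even numbers
print(linear_search(numbers, 1998))    # worst case: every element is inspected
print(binary_search(numbers, 1998))    # about log2(1000), i.e. roughly 10 steps
```

For a list of 1000 numbers the linear scan may need up to 1000 comparisons, while the binary search never needs more than 10 halving steps.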
The digital computer • Represents everything in a code of zeros and ones • Computer architecture • CPU (Central Processing Unit) • Memory • Input / Output • Advantages of digital computer • Deterministic • Minimization of noise (Figure: block diagram of Input, CPU, Memory, and Output)
The limitations of digital computers • The limitations of digital computers are conceptual, not just technological • Digital computers are deterministic • Incapable of truly random behavior • Digital computers deal with strictly discrete values • Can only approximate continuous behavior • Many interesting biological phenomena occur in the continuous realm of space and time
Sequence databases • What is a database? • An indexed set of records • Records retrieved using a query language • Database technology is well established • Examples of sequence databases • GenBank (NCBI) • Encompasses all publicly available protein and nucleotide sequences • Protein Data Bank • Contains 3-D structures of proteins
The client-server model: from a single computer, to GCG, to the Internet • The clients and servers are software processes • Clients request data from servers • Servers and clients can reside on the same or different machines • Clients can act as servers to other processes and vice versa (Figure: Web Browser, Web Server, BLAST Search Engine, Database)
Sequence alignment • Sequence alignments search for matches between sequences • Two broad classes of sequence alignments • Global (wide): maximize overall score • Local (narrow): high score in a limited area • Alignment can be performed between two or more sequences • Example of global alignment: QKESGPSSSYC with VQQESGLVRTTC • Example of local alignment: ESG with ESG
The biological importance of sequence alignment • Sequence alignments assess the degree of similarity between sequences • Similar sequences suggest similar function • Proteins with similar sequences are likely to play similar biochemical roles • Regulatory DNA sequences that are similar will likely have similar roles in gene regulation • Sequence similarity suggests evolutionary history • Fewer differences mean more recent divergence • Orthologs versus paralogs
The algorithmic problem of aligning sequences • Comparison of similar sequences of similar length is straightforward • How does one deal with insertions and gaps that may hide true similarity? • How does one interpret minimal similarity? • Are sequences actually related? • Is alignment by chance? • Example pairs: QKESGPSRSYC with QQESGPVRSTC; RQQEPVRSTC with QQESGPVRSTC; QKGSYQEKGYC with QQESGPVRSTC
Methods of sequence alignment • Graphical methods: visual inspection • Dynamic-programming methods: guaranteed to find the best-scoring alignment, but slow • Heuristic methods: approximate, but usually close to the best answer
Dot matrix analysis • A graphical method • Shows all possible alignments • Caveats • Some guesswork in picking parameters • Window size • Stringency • Not as rigorous or quantitative as other methods (Figure: dot matrix of RQQEPVRSTC against QQESGPVRSTC)
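The roles of window size and stringency can be illustrated with a short Python sketch. It assumes the common convention that a dot is placed at position (i, j) when at least `stringency` residues match inside a window of length `window` starting at i and j; the function name and this exact convention are assumptions, since the slides do not define them.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed when the windows seq_a[i:i+window] and
    seq_b[j:j+window] agree in at least `stringency` positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(window)
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots


a, b = "RQQEPVRSTC", "QQESGPVRSTC"
loose = dot_matrix(a, b, window=1, stringency=1)   # every single-residue match
strict = dot_matrix(a, b, window=3, stringency=3)  # only runs of 3 identical residues

# Print the strict plot as text: rows follow seq_a, columns follow seq_b.
for i in range(len(a)):
    print(a[i], "".join("*" if (i, j) in strict else "." for j in range(len(b))))
```

Raising the window size and stringency removes isolated chance matches, so the diagonal runs that indicate real similarity stand out.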
Dot matrix analysis: a real example • Panel 1: window size 1, stringency 1 • Panel 2: window size 23, stringency 15 • Noise-to-signal ratio
Devising a scoring system • Scoring matrices allow biologists to quantify the quality of sequence alignments • Use different scoring matrices for different purposes • Score for similar structural domains in proteins • Score for evolutionary relationship • Some popular scoring matrices • PAM for evolutionary studies (Percent Accepted Mutation) • BLOSUM for finding common motifs (BLOcks amino acid SUbstitution Matrix)
An example of scoring: a sequence comparison (BLOSUM62)
Sequence 1:  A  D  D  R  Q  C  E  R  A  D
Sequence 2:  A  Q  E  R  Q  E  C  Q  A  Q
Score:       4  0  2  5  5 -4 -4  1  4  0
Total score: 13
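Scoring an ungapped comparison amounts to looking up each aligned residue pair and summing. The Python sketch below assumes a lookup table that holds only the BLOSUM62 entries quoted on this slide, not the full matrix; the names are illustrative.

```python
# Only the BLOSUM62 entries that appear on the slide (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("D", "Q"): 0,
    ("D", "E"): 2,
    ("R", "R"): 5,
    ("Q", "Q"): 5,
    ("C", "E"): -4,
    ("R", "Q"): 1,
}


def pair_score(x, y):
    """Look up a residue pair in either order; KeyError if it is not in the subset."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def ungapped_alignment_score(seq1, seq2):
    """Sum the substitution scores column by column (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


# 4 + 0 + 2 + 5 + 5 - 4 - 4 + 1 + 4 + 0 = 13
print(ungapped_alignment_score("ADDRQCERAD", "AQERQECQAQ"))
```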
Dynamic programming (DP) • Possibility of gaps (or insertions) makes number of possible sequence alignments astronomical • Dynamic programming makes sequence alignment possible by abandoning low scoring alignments among subsequences as the algorithm progresses • Mathematically proven to provide optimal alignments • DP algorithms for sequence alignment • Needleman-Wunsch-Gotoh algorithm for global alignments • Smith-Waterman algorithm for local alignments • DP alignment algorithms still too slow for searching an entire sequence database
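A minimal Python sketch of the two DP recurrences follows. It is a simplification under stated assumptions: it uses a flat match/mismatch score and a linear gap penalty (the values +1, -1, and -2 are arbitrary), rather than a scoring matrix or the affine gap penalties of the Gotoh variant, and it returns only the best score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score for aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]   # row 0: b aligned to gaps only
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # column 0: a aligned to gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Keep only the best of the three ways to reach this cell.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score for the best-matching pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets a local alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGT", "AGT"))                    # one gap is unavoidable
print(local_score("QKESGPSSSYC", "VQQESGLVRTTC"))     # driven by the shared ESG
```

Each cell is computed once from three neighbors, so the work grows with the product of the two sequence lengths instead of with the astronomical number of possible alignments.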
Heuristic methods with k-tuples • Example: BLAST/FASTA • Using the query sequence, derive a list of words (tuples) of length w (e.g., 3) • Keep high-scoring matching words • High-scoring words are compared with database sequences • Sequences with many matches to high-scoring words are used as anchors (not just the words but also their order) for final alignments
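The word-matching idea can be sketched in Python as follows. This is a deliberate simplification: it finds only exact word matches and counts them per diagonal, whereas BLAST also scores similar (neighborhood) words with a substitution matrix and then extends the seeds. All function names are hypothetical.

```python
from collections import defaultdict


def words(seq, w=3):
    """List every overlapping word of length w with its start position."""
    return [(seq[i:i + w], i) for i in range(len(seq) - w + 1)]


def seed_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact word shared by both."""
    index = defaultdict(list)
    for word, pos in words(subject, w):
        index[word].append(pos)          # word -> positions in the subject
    hits = []
    for word, qpos in words(query, w):
        for spos in index.get(word, []):
            hits.append((qpos, spos))
    return hits


def best_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos); return (diagonal, count)."""
    counts = defaultdict(int)
    for qpos, spos in hits:
        counts[spos - qpos] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


hits = seed_hits("RQQEPVRSTC", "QQESGPVRSTC")
print(hits)
print(best_diagonal(hits))   # the diagonal with the most word hits anchors the alignment
```

Hits that fall on the same diagonal preserve the order of the words, which is why they make useful anchors for a final alignment.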
Statistical significance • Chance alignments have no biological significance • Statistical significance implies low probability of generating a chance alignment • Probability of long chance alignments increases with longer sequences • The extreme-value distribution • Describes the scores expected from chance alignments and is used to calculate the probability of a chance alignment (reported by BLAST as the E value, the expected number of chance hits) • Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared
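The scrambling procedure can be sketched directly in Python. This sketch makes two simplifying assumptions: the score is just the largest number of identical residues on any ungapped diagonal, and the p-value is read off the shuffled scores empirically rather than by fitting an extreme-value distribution as production tools do. Function names, the trial count, and the seed are illustrative.

```python
import random


def ungapped_score(a, b):
    """Largest number of identical residues over all ungapped offsets of a against b."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        matches = sum(
            1
            for i, residue in enumerate(a)
            if 0 <= i + offset < len(b) and b[i + offset] == residue
        )
        best = max(best, matches)
    return best


def empirical_p_value(a, b, trials=1000, seed=0):
    """Estimate how often a scrambled copy of b scores at least as well as b itself."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    observed = ungapped_score(a, b)
    residues = list(b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(residues)            # same composition, random order
        if ungapped_score(a, "".join(residues)) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate above zero.
    return observed, (at_least_as_good + 1) / (trials + 1)


score, p = empirical_p_value("QKESGPSRSYC", "QQESGPVRSTC")
print(score, p)
```

Shuffling preserves the amino acid composition of the sequence, so the scrambled scores show what chance alone produces for sequences of this length and composition.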
A practical example of sequence alignment • Query: MASH-1, a transcription factor • http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST results
Detailed BLAST results
A pairwise alignment with MASH-1 • HASH-2, a human homolog of MASH-1 • “+” indicates conservative amino acid substitution • “–” indicates gap/insertion • XXXX… shows areas of low complexity (common occurrence)
Multiple-sequence alignments • Uses of multiple-sequence alignments • Automated reconstruction of sequence fragments • Phylogenetic analysis • Identification of sequence families • The problem of multiple-sequence alignment • O(N^M) for optimal methods, where N is the average sequence length and M is the number of sequences being aligned • Dynamic programming will work only for small M • Heuristic methods are required
Some methods for global multiple-sequence alignment • Progressive methods • Align most closely related sequences, and then less related ones • Use phylogenetic trees to quantify similarities • Downside: poor results with distantly related sequences • Iterative methods • Start with progressive alignment • Realign sequences after leaving one sequence out • Add left-out sequence • Repeat until acceptable alignment is achieved • Probabilistic methods • Hidden Markov models (discussed later)
Phylogenetic analysis • Phylogenetic trees • Describe evolutionary relationships between sequences • Three common methods • Maximum parsimony • Distance • Maximum likelihood (Figure: phylogenetic tree of human immunodeficiency viruses from around the world)
Comparison of methods for phylogenetic analysis • Maximum parsimony (machine input; closely related sequences) • Finds optimal tree (or trees) requiring minimum number of substitutions to explain sequence variation • Maximum likelihood (user input; distantly related sequences) • Finds most probable tree • Similar to maximum parsimony • Distance (mix of closely and distantly related sequences) • Compare pairs of sequences for number of differences between them • Use many methods to get consensus tree
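The first step of a distance method, counting differences between pairs of sequences, can be sketched in Python. The sketch assumes the sequences are already aligned to equal length and uses the simple proportion of differing sites (often called the p-distance); tree building from the resulting matrix is not shown. The labels in the example are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(aligned):
    """Map each unordered pair of sequence names to its p-distance."""
    names = sorted(aligned)
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(names, 2)
    }


# Hypothetical labels; the sequences are taken from the alignment slides.
aligned = {
    "seq1": "QKESGPSRSYC",
    "seq2": "QQESGPVRSTC",
    "seq3": "QKGSYQEKGYC",
}
matrix = distance_matrix(aligned)
closest = min(matrix, key=matrix.get)
print(matrix)
print(closest)   # the pair with the fewest differences would be joined first
```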
Algorithmic complexity and phylogenetic analysis • Four steps • Sequence alignment • Substitution model (scoring matrix) • Tree building • Tree evaluation • Tree building and evaluation are computationally expensive • Heuristic methods required in most cases
Gene prediction • A problem of pattern recognition • Algorithms look for features of genes: • E.g., splice sites, ORFs, starting methionine • Identification of regulatory regions is very difficult • Statistical understanding of genes is ongoing • Problems of this type require machine learning algorithms that learn the pattern from a small training dataset
Central Dogma in Molecular Biology
Artificial neural networks • Machine learning algorithms that mimic the brain • Connections between “neurons” vary in strength • Connection weights (wij) (strength) change while network is exposed to training set • Fully trained network recognizes pattern in novel input • Example: GRAIL (Figure: a feed-forward neural network with input, hidden, and output layers)
Hidden Markov models • Can be used for machine learning • Units constitute transition states • Transitions not dependent on history • Many uses in genomics • Gene prediction • Multiple sequence alignment • Finding periodic patterns (Figure: HMM with start and end states)
HMMs • The example of a dishonest gambler is often used to illustrate this point. The gambler may carry a loaded die that he or she occasionally substitutes for a fair die, but not so often that the other players would notice. The fair die has a one-in-six chance of showing any particular number. When using the loaded die, a player will have a 50% chance of rolling a one and a 10% chance of rolling each of the other numbers. It is in these types of situations that stochastic models called hidden Markov models (HMMs) are useful, because they take into account unknown (or hidden) states. For example, exactly when the cheating gambler is using a fair or loaded die is hidden from the other players, but insight may still be gained by looking at the outcome of the cheater’s rolls. If he or she rolls three ones in a row, the loaded die is the more likely explanation: it has a 12.5% chance of generating three ones in a row, whereas the fair die has only about a 0.5% chance. Hidden Markov models describe the probability of transitions between hidden states, as well as the probabilities associated with each state. In the example of the cheating gambler, an HMM would describe the probabilities of rolling particular numbers given the loaded or fair die and the probability that the dishonest gambler would switch from one die to another.
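The two percentages in the gambler example can be reproduced with a few lines of Python. The die probabilities come from the text; the 50/50 prior used in the last function is an assumption added only to show how the two likelihoods could be combined, and the names are illustrative.

```python
from fractions import Fraction

# Emission probabilities from the example: face -> probability.
FAIR = {face: Fraction(1, 6) for face in range(1, 7)}
LOADED = {1: Fraction(1, 2), **{face: Fraction(1, 10) for face in range(2, 7)}}


def likelihood(rolls, die):
    """Probability of a sequence of rolls if the same die is used throughout."""
    p = Fraction(1)
    for roll in rolls:
        p *= die[roll]
    return p


def posterior_loaded(rolls, prior_loaded=Fraction(1, 2)):
    """Bayes' rule: chance the loaded die was used, given an assumed prior."""
    loaded = likelihood(rolls, LOADED) * prior_loaded
    fair = likelihood(rolls, FAIR) * (1 - prior_loaded)
    return loaded / (loaded + fair)


three_ones = [1, 1, 1]
print(float(likelihood(three_ones, LOADED)))   # 0.125, i.e. 12.5%
print(float(likelihood(three_ones, FAIR)))     # 1/216, i.e. about 0.46%
print(posterior_loaded(three_ones))            # 27/28 under the assumed 50/50 prior
```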
HMMs continued • Hidden Markov models can be used to answer three types of questions. The first type is the likelihood question: Given a particular HMM, what is the probability of obtaining a particular outcome (e.g., rolling three ones)? The second type is the decoding question: Given a particular HMM, what is the most likely sequence of transitions between states for a particular outcome? In the case of the cheating gambler, this sequence would be the order in which he or she transitioned from one die to another. The third type is the learning question: Given a particular outcome and set of assumptions about possible transition states, what are the best model parameters (e.g., probabilities between transition states)? This third question allows HMMs to be used for machine learning.

The figure in the slide shows a simple example of a hidden Markov model being used to account for the DNA sequence at the bottom. Every HMM has a start and end state, denoted by the S and E, respectively, in the slide. Hidden states lie between the start and end states. In the figure, the squares are states, and the lines between them indicate the probability of one state transitioning to another. The loops on the upper and lower states show the probabilities associated with the state remaining the same. States transition back and forth until the HMM reaches the end state. In this HMM, the top square represents a state that has equal probabilities of generating A, G, C, or T. The bottom state has probabilities of 0.1, 0.1, 0.1, and 0.7 of generating A, G, C, and T, respectively.
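The likelihood and decoding questions can be sketched in Python for the dishonest-gambler model using the forward and Viterbi algorithms. The emission probabilities come from the gambler example; the start and transition probabilities below are invented for illustration, the explicit end state is omitted for brevity, and both functions assume at least one roll.

```python
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                      # assumed
TRANS = {                                                 # assumed switching rates
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMIT = {
    "fair": {face: 1 / 6 for face in range(1, 7)},
    "loaded": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
}


def forward(rolls):
    """Likelihood question: probability of the rolls, summed over all hidden paths."""
    alpha = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    for roll in rolls[1:]:
        alpha = {
            s: EMIT[s][roll] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(rolls):
    """Decoding question: the single most probable sequence of hidden states."""
    score = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: score[p] * TRANS[p][s])
            pointers[s] = prev
            new_score[s] = score[prev] * TRANS[prev][s] * EMIT[s][roll]
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


rolls = [3, 5, 2, 6, 1, 1, 1, 1, 1, 1]
print(forward(rolls))
print(viterbi(rolls))
```

The learning question would additionally re-estimate `TRANS` and `EMIT` from observed rolls, for example with the Baum-Welch algorithm, which is not shown here.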
HMMs for gene prediction • HMMs are trained on sequences that are members of a known gene class • HMM gives the probability that a particular sequence belongs to the gene class • Length of the bar indicates probability • The longer the bar, the higher the probability • Genscan: a gene-prediction program (Figure: 2000 human introns)
Algorithms for secondary-structure determination • Chou-Fasman / GOR method • Based on experimentally determined frequency of amino acids in secondary structures • Machine learning algorithms • Neural networks: trained on proteins whose three-dimensional structures have already been determined • Nearest-neighbor methods: find the closest matches among known structures • Trained on previously deduced structures to detect amino acid patterns in secondary structures
Analysis of microarray data • Microarrays can measure the expression of thousands of genes simultaneously • Vast amounts of data require computers • Types of analysis • Gene-by-gene • Method: Statistical techniques • Categorizing groups of genes • Method: Clustering algorithms • Deducing patterns of gene regulation • Method: Under development
Unsupervised techniques • Make no assumptions about how the data should behave • Cluster genes based on similar patterns of gene expression • Examples • Hierarchical clustering • Principal components analysis (PCA) (Figures: hierarchical clustering; PCA)
Metrics for gene expression • Need a method to measure how similar genes are based on expression • Examples • Euclidean distance • Pearson correlation coefficient (Figures: formulas for Euclidean distance and the Pearson correlation coefficient)
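Both metrics can be written out in a few lines of Python using their standard textbook definitions. The expression profiles in the example are made-up numbers chosen to show how the two metrics can disagree.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Covariance of the two profiles divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression levels of two genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same pattern, ten times the magnitude

print(euclidean_distance(gene_a, gene_b))    # large: the magnitudes differ
print(pearson_correlation(gene_a, gene_b))   # close to 1: the patterns agree
```

Euclidean distance is sensitive to absolute expression level, while the Pearson correlation coefficient compares only the shape of the profiles, so the choice of metric changes which genes end up clustered together.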
Supervised techniques • Divide groups of genes based on sample properties • Can predict sample condition based on gene expression pattern • Examples • Support vector machine • Nearest neighbor (Figures: support vector machine; nearest neighbor)
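A nearest-neighbor classifier is the simpler of the two examples and can be sketched in Python. The training profiles and the labels "tumor" and "normal" are hypothetical, and Euclidean distance is assumed as the similarity measure; `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_neighbor(training, sample):
    """Return the label of the training profile closest to the sample.

    training: list of (expression_profile, label) pairs with known conditions.
    sample:   expression profile of a sample whose condition is unknown.
    """
    best_label, best_distance = None, math.inf
    for profile, label in training:
        distance = math.dist(profile, sample)   # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Hypothetical expression levels of three genes in labeled samples.
training = [
    ([8.1, 1.2, 5.0], "tumor"),
    ([7.9, 0.9, 5.3], "tumor"),
    ([2.0, 6.5, 5.1], "normal"),
    ([2.3, 6.1, 4.8], "normal"),
]

print(nearest_neighbor(training, [7.5, 1.5, 5.2]))
print(nearest_neighbor(training, [2.1, 6.0, 5.0]))
```

Unlike clustering, the labels of the training samples are known in advance, which is what makes the technique supervised.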
The usefulness of simulation • Why simulate when you can experiment? • Models involving many parameters may be difficult to conceptualize without simulations • A simulation may suggest ways of testing a hypothesis • Some experiments cannot be done in vivo or in vitro, and must therefore be done in silico • If a simulation is good, it can be used in place of more expensive or time-consuming experiments (e.g., nuclear experiments by the US)
Numerical methods • Numerical methods are needed because of the discrete nature of computers • Differential equations are turned into difference equations that deal with discrete rather than continuous quantities • Smaller steps lead to greater simulation accuracy
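The effect of step size can be demonstrated with a small Python sketch that applies the forward Euler method to a simple decay equation, dx/dt = -k x. The equation (for instance, degradation of an mRNA at rate k) and all parameter values are assumptions chosen because the exact solution, x0 * exp(-k t), is known and the error can be measured.

```python
import math


def euler_decay(x0, k, t_end, dt):
    """Integrate dx/dt = -k * x with the difference equation x += dt * (-k * x)."""
    steps = round(t_end / dt)
    x = x0
    for _ in range(steps):
        x += dt * (-k * x)    # one discrete time step
    return x


x0, k, t_end = 1.0, 1.0, 1.0
exact = x0 * math.exp(-k * t_end)

for dt in (0.1, 0.01, 0.001):
    approx = euler_decay(x0, k, t_end, dt)
    print(dt, approx, abs(approx - exact))   # the error shrinks as dt shrinks
```

The price of the extra accuracy is more steps: dividing the step size by ten multiplies the number of calculations by ten.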
Examples of computer simulations in biology • Gene regulatory networks • Simulations of cells • Networks of neurons • Population genetics (Figure: a model of gene regulation)
Prospects for a fully simulated cell
Limitations of computer simulation • Algorithmic • Computers can process only discrete values • Simulating continuous behavior accurately often requires an infeasible number of calculations • Experimental • A simulation is only as good as the data it is based on • Critical data are often missing from a simulation • Conceptual • Overly complex simulations do not contribute to understanding of a biological system
Summary • Vast amounts of data require bioinformatics • Bioinformatics methods are limited by the following: • Algorithmic complexity of bioinformatics problems • Computer hardware performance • Heuristic methods are used to get around these limitations • Bioinformatics methods are used in the following areas: • Sequence alignment • Phylogenetic-tree construction • Gene prediction • Secondary-structure determination • Analysis of microarray data • Simulation of biological systems