420 likes | 693 Views
Biological sequence analysis. Terry Speed Wald Lecture II, August 8, 2001. The objects of our study. DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units. DNA : units are the nucleotide residues A, C, G and T
E N D
Biological sequence analysis Terry Speed Wald Lecture II, August 8, 2001
The objects of our study • DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units. • DNA: units are the nucleotide residues A, C, G and T • RNA: units are the nucleotide residues A, C, G and U • Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. • To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.
The use of statistics to study linear sequences of biomolecular units • Can be descriptive, predictive or everything else in between…..almost business as usual. • Stochastic mechanisms should never be takenliterally, but nevertheless can be amazingly useful. • Care is always needed: a model or method can breakdown at any time without notice. • Biological confirmation of predictions is almost always necessary.
The statistics of biological sequences can be global or local • Base composition of genomes: • E. coli: 25% A, 25% C, 25% G, 25% T • P. falciparum: 82%A+T • Translation initiation: • ATG is the near universal motif indicating the • start of translation in DNA coding sequence.
From certainty to statistical models: a brief case study 1 ZNF: Cys-Cys-His-His zinc finger DNA binding domain
Cys-Cys-His-Hiszinc finger DNA binding domain • Its characteristic motif has regular expression • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 1ZNF:XYKCGLCERSFVEKSALSRHQRVHKNX • Regular expressions can be limiting, and position-specific distributionscame to represent the variability in motifs.
Cys-Cys-His-His profile: sequence logo form A sequence logo is a scaled position-specific a.a.distribution. Scaling is by a measure of a position’s information content.
Representation of motifs: the next steps • Missing from the position-specific distribution • representation of motifs are good ways of dealing with: • Length distributions for insertions/deletions • Cross-position association of amino acids • Hidden Markov models help with the first. The second remains a hard unsolved problem.
Hidden Markov models • Processes {(St,Ot), t=1,…}, where Stis the hidden • state and Ot the observation at time t, such that • pr(St | St-1,Ot-1,St-2 ,Ot-2 …) = pr(St | St-1) • pr(Ot | St-1,Ot-1,St-2 ,Ot-2 …) = pr(Ot | St, St-1) • The basics of HMMs were laid bare in a series of beautiful papers by L E Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day.
Hidden Markov models:extensions • Many variants are now used. For example, the distribution of O may not depend on previous S but on previous O values, • pr(Ot | St , St-1 ,Ot-1 ,..) = pr(Ot | St ), or • pr(Ot | St , St-1 ,Ot-1 ,..) = pr(Ot | St ,St-1 ,Ot-1) . • Most importantly for us, the times of S and O may be decoupled, permitting the Observation corresponding to Statetimet to be a string whose length and composition depends on St(and possibly St-1 and part or all of the previous Observations). This is called a hidden semi-Markov or generalized hidden Markov model.
A 0.4 A 0.05 C 0.1 C 0.4 G 0.1 G 0.5 T 0.4 T 0.05 A T C A A G G C G A T A simple HMM (Churchill, 1989) O.O1 O.99 O.9 O.1 hidden states . . . . . . observations . . . . . .
The algorithms • As the name suggests, with an HMM the series O = (O1,O2 ,O3 ,……., OT) is observed, while the states S = (S1 ,S2 ,S3 ,……., ST) are not. • There are elegant algorithms for calculating pr(O|),arg max pr(O|) in certain specialcases, and arg maxS pr(S|O,). • Here are the parameters of the model, e.g. transition and observation probabilities.
Some current applications of HMMs to biology mapping chromosomes aligning biological sequences predicting sequence structure inferring evolutionary relationships finding genes in DNA sequence Some early applications of HMMs finance, but we never saw them speech recognition modelling ion channels In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.
HMMs representing coiled-coil domains 2TMA: Tropomyosin
Coiled-coil domains, schematically dimeric parallel helices, heptad repeats, knobs-into-holes
HMM: decoding Sequence Labels Path1 Path2 WGPARQLNESVKDINKMLERHP BBBCCCCCCCCCCCCCCCCCBB 000abcdefgabcdefgabc00 00cdefgabcdefgabcdefg0 VITERBI decoding: of all possible state-paths, we determine the maximum probability one given the amino acid sequence O; POSTERIOR decoding: at each position, we determine the state with the highest probability given O. Issue: how to measure the strength of a potential CC domain, and how should this depend on the length of a domain?
CC-PROBABILITY PROFILE Fusion protein of simian parainfluenza virus 5
Assessing performance: terms • TP true positive: a predicted fragment that overlaps the annotated fragment (aa in the annotated region) • FP false positive: a predicted fragment does not overlap any annotated fragment (aa not in the annotated region) • LS learning set of sequences • NTS negative test set; sequences with no CCD used for estimating FP • PTS positive test set; used for estimating TP • Much care/effort required to create these sets
Assessing performance: study design • Study the variability of performance under variation of the sequences used for determining the model parameters. • Compare methods using the same set of aa-frequencies / emission probabilities. • Use the same set of domains for learning and testing instead of testing on different protein families. • Choose a number of FP-rates and calculate the corresponding TP-rates (ROC curve).
PTS subdivided 150 times at random (stratified) into 2/3 for learning and the rest for testing. LEARNING PHASE => PARAMETERS TESTING ON NTS => THRESHOLDS 150 X TESTING ON PTS => TP-VALUES ANALYSIS OF RESULTS
Assessing performance: summaries • TP-rate at given FP-rate: per family / per length-class • TP- and FP-rates for aa’s: per family / per length-class • Accuracy: at the borders and of length prediction
Finding genes in DNA sequence This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.
DNA transcription mRNA translation Protein What is a (protein-coding) gene? CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA PEPTIDE
What is a gene, ctd? • In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequence) are removed, leaving exons (expressed sequence), and yet other regions remain untranslated. The translated sequence comes in triples called codons, beginning and ending with a unique start (ATG) and one of three stop (TAA, TAG, TGA) codons. • There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcriptionstart sites, polyA sites,branching sites, and so on. • All of the foregoing have statistical characterizations.
In more detail (color ~state)
Some facts about human genes Comprise about 3% of the genome Average gene length: ~ 8,000 bp Average of 5-6 exons/gene Average exon length: ~200 bp Average intron length: ~2,000 bp ~8% genes have a single exon Some exons can be as small as 1 or 3 bp. HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising 3% of a 67,000 bp gene
The idea behind a GHMM genefinder • States represent standard gene features: intergenic region, exon, intron, perhaps more (promotor, 5’UTR, 3’UTR, Poly-A,..). • Observationsembody state-dependent base composition, dependence, and signal features. • In a GHMM, durationmust be included as well. • Finally, reading frames and both strands must be dealt with.
Beyond position-specific distributions The bases in splice sites exhibit dependence, and not simply of the nearest neighbor kind. High-order (non-stationary) Markov models would be one option, but the number of parameters in relation to the amount of data rules them out. The class of variable length Markov models (VLMMs) deriving from early research by Rissanen prove to be valuable in this context. However, there is likely to be room for more research here.
GENSCAN (Burge & Karlin) 6 2 0 0 1 A G G A C A G G T A C G G C T G T C A T C A C T T A G A C C T C A C C C T G T G G A G C C A C A C C 6 2 0 5 1 C T A G G G T T G G C C A A T C T A C T C C C A G G A G C A G G G A G G G C A G G A G C C A G G G C 6 2 1 0 1 T G G G C A T A A A A G T C A G G G C A G A G C C A T C T A T T G C T T A C A T T T G C T T C T G A 6 2 1 5 1 C A C A A C T G T G T T C A C T A G C A A C C T C A A A C A G A C A C C A T G G T G C A C C T G A C E E E 6 2 2 0 1 T C C T G A G G A G A A G T C T G C C G T T A C T G C C C T G T G G G G C A A G G T G A A C G T G G 0 1 2 6 2 2 5 1 A T G A A G T T G G T G G T G A G G C C C T G G G C A G G T T G G T A T C A A G G T T A C A A G A C 6 2 3 0 1 A G G T T T A A G G A G A C C A A T A G A A A C T G G G C A T G T G G A G A C A G A G A A G A C T C 6 2 3 5 1 T T G G G T T T C T G A T A G G C A C T G A C T C T C T C T G C C T A T T G G T C T A T T T T C C C 6 2 4 0 1 A C C C T T A G G C T G C T G G T G G T C T A C C C T T G G A C C C A G A G G T T C T T T G A G T C I I I 0 1 2 6 2 4 5 1 C T T T G G G G A T C T G T C C A C T C C T G A T G C T G T T A T G G G C A A C C C T A A G G T G A 6 2 5 0 1 A G G C T C A T G G C A A G A A A G T G C T C G G T G C C T T T A G T G A T G G C C T G G C T C A C 6 2 5 5 1 C T G G A C A A C C T C A A G G G C A C C T T T G C C A C A C T G A G T G A G C T G C A C T G T G A 6 2 6 0 1 C A A G C T G C A C G T G G A T C C T G A G A A C T T C A G G G T G A G T C T A T G G G A C C C T T E E i t 6 2 6 5 1 G A T G T T T T C T T T C C C C T T C T T T T C T A T G G T T A A G T T C A T G T C A T A G G A A G 6 2 7 0 1 G G G A G A A G T A A C A G G G T A C A G T T T A G A A T G G G A A A C A G A C G A A T G A T T G C 6 2 7 5 1 A T C A G T G T G G A A G T C T C A G G A T C G T T T T A G T T T C T T T T A T T T G C T G T T C A 6 2 8 0 1 T A A C A A T T G T T T T C T T T T G T T T A A T T C T T G C T T T C T T T T T T T T T C T T C T C 6 2 8 5 1 C G C A A T T T T T A C T A T T A T A C T T A A T G C C T T A A C A T T G T G T A T A A C A A A A G E 5'UTR s 3'UTR 6 2 9 0 1 G A A A T A T C T C T G A G A T A C A T T A A G T A A C T T A A A A A A A A A C T T T A C A C A G T 6 2 9 5 1 C T G C C T A G T A C A T T A C T A T T T G G A A T A T A T G T G T G C T T A T T T G C A T A T T C 6 3 0 0 1 A T A A T C T C C C T A C T T T A T T T T C T T T T A T T T T T A A T T G A T A C A T A A T C A T T 6 3 0 5 1 A T A C A T A T T T A T G G G T T A A A G T G T A A T G T T T T A A T A T G T G T A C A C A T A T T poly-A 6 3 1 0 1 G A C C A A A T C A G G G T A A T T T T G C A T T T G T A A T T T T A A A A A A T G C T T T C T T C promoter 6 3 1 5 1 T T T T A A T A T A C T T T T T T G T T T A T C T T A T T T C T A A T A C T T T C C C T A A T C T C 6 3 2 0 1 T T T C T T T C A G G G C A A T A A T G A T A C A A T G T A T C A T G C C T C T T T G C A C C A T T 6 3 2 5 1 C T A A A G A A T A A C A G T G A T A A T T T C T G G G T T A A G G C A A T A G C A A T A T T T C T 6 3 3 0 1 G C A T A T A A A T A T T T C T G C A T A T A A A T T G T A A C T G A T G T A A G A G G T T T C A T Forward (+) strand Forward (+) strand intergenic 6 3 3 5 1 A T T G C T A A T A G C A G C T A C A A T C C A G C T A C C A T T C T G C T T T T A T T T T A T G G region Reverse (-) strand Reverse (-) strand 6 3 4 0 1 T T G G G A T A A G G C T G G A T T A T T C T G A G T C C A A G C T A G G C C C T T T T G C T A A T 6 3 4 5 1 C A T G T T C A T A C C T C T T A T C T T C C T C C C A C A G C T C C T G G G C A A C G T G C T G G 6 3 5 0 1 T C T G T G T G C T G G C C C A T C A C T T T G G C A A A G A A T T C A C C C C A C C A G T G C A G 6 3 5 5 1 G C T G C C T A T C A G A A A G T G G T G G C T G G T G T G G C T A A T G C C C T G G C C C A C A A 6 3 6 0 1 G T A T C A C T A A G C T C G C T T T C T T G C T G T C C A A T T T C T A T T A A A G G T T C C T T 6 3 6 5 1 T G T T C C C T A A G T C C A A C T A C T A A A C T G G G G G A T A T T A T G A A G G G C C T T G A 6 3 7 0 1 G C A T C T G G A T T C T G C C T A A T A A A A A A C A T T T A T T T T C A T T G C A A T G A T G T
Remark • In general the problem of identifying (annotating) human genes is considerably harder than ß-globin might suggest. • The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron. • The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.
Challenges in the analysis of sequence data • Understanding the biology well enough to begin. • Designing HMM architecture, e.g. in Marcoil for coiled-coils. • Modelling the parts, e.g. VLMMs for splice sites. • Coding = software engineering, is the hardest and most important task of all: making it all work. • Obtaining good data sets for use in careful evaluation and comparison with competing algorithms; designing the studies. • Opportunities for methodological research.
Topics not mentioned include • Molecular evolution, including phylogenetic inference (building trees from aligned sequence data) • Sequence alignment (pairwise, multiple), including use of Gibbs sampler • Stochastic context-free grammar models and the analysis of RNA sequence data.
Acknowledgements • Mauro Delorenzi (WEHI) • Simon Cawley (Affymetrix) • Tony Wirth (CS, Princeton) • Lior Pachter (Math, UCB) • Marina Alexandersson (Stat, UCB)
References • Biological Sequence Analysis • R Durbin, S Eddy, A Krogh and G Mitchison • Cambridge University Press, 1998. • Bioinformatics The machine learning approach • P Baldi and S Brunak • The MIT Press, 1998 • Post-Genome Informatics • M Kanehisa • Oxford University Press, 2000