Pattern Discovery in Sequences under a Markov assumption
Darya Chudova, Padhraic Smyth
Information and Computer Science, University of California, Irvine
Funding acknowledgements: National Science Foundation, Microsoft Research, IBM Research
Outline • Pattern discovery problem • Problem statement • Research questions • Bayes error rate framework • Experimental results • Conclusions
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCDBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADABCCCBBCDBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBAABBCCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCABBBCBBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBABBCCAADCBCDACBCABABCCBACBBBCADDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
Applications in Biology • Motif discovery problem • Task: Identification of potential binding sites of proteins • Input: Upstream regions of co-regulated genes • Patterns are non-deterministic • fixed length motifs • may have substitution errors • substitutions are independent
Generative Model • Hidden Markov Model: a background state BG emits {A,B,C,D} uniformly (0.25 each) and enters a pattern chain P1 → P2 → P3 → P4 with probability 0.01 (staying in BG with 0.99); each pattern state advances deterministically and emits its pattern symbol with probability 0.9, returning to BG after the last state • Extensions: • Variable length patterns • Multiple patterns, multi-part patterns • Richer background [Figure: state diagram of the pattern HMM]
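As a rough illustration, the generative model above can be simulated directly. This is my own sketch, not the authors' code; the pattern string `ABBD`, pattern-entry probability `F`, and substitution probability `eps` are illustrative choices:

```python
import random

random.seed(0)
ALPHABET = "ABCD"

def generate(n, pattern="ABBD", F=0.01, eps=0.1):
    """Emit a length-n string; labels[i] is 1 where a pattern was embedded."""
    seq, labels = [], []
    while len(seq) < n:
        if random.random() < F:                  # enter the pattern chain
            for c in pattern:
                if random.random() < eps:        # substitution error
                    c = random.choice(ALPHABET)
                seq.append(c)
                labels.append(1)
        else:                                    # background state, uniform emission
            seq.append(random.choice(ALPHABET))
            labels.append(0)
    return "".join(seq[:n]), labels[:n]

seq, labels = generate(500)
```

With a low entry probability the patterns are sparse, which is what makes their discovery against the uniform background non-trivial.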
Pattern Detection and Discovery • Unsupervised discovery of embedded patterns: EM applied to unlabeled data yields a pattern model • Supervised detection of known patterns: Viterbi decoding applied to unlabeled data with a known pattern model yields pattern locations
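Supervised detection with a known model can be sketched with a standard Viterbi decoder over the background-plus-pattern-chain HMM. This is my own minimal sketch (not the authors' implementation); the pattern and parameter values are illustrative:

```python
import math

ALPHABET = "ABCD"

def viterbi_pattern(seq, pattern="ABBD", F=0.01, eps=0.1):
    """Most likely state path under the background + pattern-chain HMM.
    State 0 is background; state k (1..L) is the k-th pattern position."""
    L, A = len(pattern), len(ALPHABET)
    S = L + 1

    def emit(s, c):
        if s == 0:
            return 1.0 / A                       # uniform background emission
        # pattern symbol with prob 1 - eps, plus uniform substitution mass
        return (1.0 - eps) * (c == pattern[s - 1]) + eps / A

    def trans(p, s):
        if p == 0:                               # background: stay, or enter chain
            return {0: 1.0 - F, 1: F}.get(s, 0.0)
        nxt = p + 1 if p < L else 0              # chain advances, then exits to BG
        return 1.0 if s == nxt else 0.0

    NEG = float("-inf")
    V = [[NEG] * S for _ in seq]
    back = [[0] * S for _ in seq]
    V[0][0] = math.log(emit(0, seq[0]))          # assume we start in background
    for t in range(1, len(seq)):
        for s in range(S):
            best, arg = NEG, 0
            for p in range(S):
                tp = trans(p, s)
                if tp > 0.0 and V[t - 1][p] > NEG:
                    score = V[t - 1][p] + math.log(tp)
                    if score > best:
                        best, arg = score, p
            if best > NEG:
                V[t][s] = best + math.log(emit(s, seq[t]))
                back[t][s] = arg
    # backtrack from the best final state
    s = max(range(S), key=lambda k: V[-1][k])
    path = [s]
    for t in range(len(seq) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

path = viterbi_pattern("DDABBDDD")
```

For the unsupervised case, EM wraps a soft version of this decoding (forward-backward) inside a parameter re-estimation loop.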
Research Questions • Current state • Successful algorithms for motif discovery • Cannot solve some seemingly simple problems • Why is this happening? • Are the algorithms suboptimal? • Is the data set too small? • Is the problem inherently difficult and ambiguous? • Can such questions be answered in a principled way?
Outline • Pattern discovery problem • Bayes error rate framework • Definitions • Examples • Application to pattern discovery • Experimental results • Conclusions
Difficulty of Pattern Discovery • Assume true model is given • Measure of difficulty • classification performance of pattern detection • Multiple factors influence the difficulty • Single characteristic that quantifies it? • Bayes error rate
Bayes Error Rate: Definition • Bayes error rate: • average error rate of the optimal decision rule • a lower bound on the error rate of any algorithm on a given problem • Optimal rule: • pick the class with the highest posterior probability
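In symbols (the standard definition; the notation here is mine, not taken from the slides), the optimal rule and its average error are:

```latex
\hat{c}(x) = \arg\max_{c} P(c \mid x),
\qquad
P_e^{*} = \mathbb{E}_{x}\!\left[\, 1 - \max_{c} P(c \mid x) \,\right]
```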
Classification Example • Simple problem • Harder problem
Pattern Discovery Example • Simple problem • BDCBBDCABCDADCADAADABDAABBCBCACADAADCDDABDCADAADCBBAADBDBDBDAABACABBABBCAADDBCBADCBDDABCDABBBDBBCCBDAABCAABACDDADCADBCABADBCABAAABBCABBCDAABDCDAABACDDACCCDBCDDDBAADCBDAADDBBADAAADAADAADBAABDBDACADBDCCBACBACBADABDCACBCBDDCBACBAAADDCABBADDDCABCDCCCCBDDCADBBCDDACCDBBBACAACADBDACDAADCDACBADAADCBABACADAADBAABAAAADAADDDADBDDCBCDDCCBDDCDCBBBDAADDBDBBCDACBCCCBCBCDAD • Hard problem • DDDBACABBCDDCDCBBCACCBDADDBBACDACCBDADCCCDBDBDADABABDCBCDABDBABABCCADBBDCDBBBBDACBBAABBBBBADCCAACACDACCCBBCADDADDBACCCBDABCCCBDADDADABDBABAAACCDBCBCDCBCABABCDBCDDAAACBADACCBCDABAACDCDCDDBCCACBDDADAACABDADDBBDDBCAADBAADBACBDADDBDBDACACDBBBBCADACCBDDBDBCCAACAADABDCBDDCCDDACBDDDCCBCCBCDCACCBDACDCDADCDCDDDADCCCBDACBCBDCACCDDBBACCBBCCDBBABAADABABDCDDBBDCDDAADDABBCBAB
Bayes Error in Markov Context • Closed-form expressions are hard to obtain • Special cases considered in the 1960s and 70s • Raviv (1967) • Used context to improve text classification • Chu (1970), Lee (1974) • Lower and upper bounds for a 2-state HMM • The context is limited to 1 or 2 symbols
Bayes Error for Pattern Discovery • Analytical approximation • fixed length patterns • uniform background / substitution probabilities • Limited context (IID assumption): • HMM: P(hidden state | observed sequence) • IID: P(hidden state | next L symbols)
Insights from Bayes Error Rate • Example: • Alphabet size |A| = 4 • Pattern length L = 5 • Pattern frequency F = 0.005 • Substitution probability ε = 0.2 • How hard is this problem? • How sensitive is “problem hardness” to these parameters?
Example • 1. Input L, F, ε; problem: normalized Bayes error rate? Solution: Pe* = 0.87 • 2. Input L, F; problem: ε s.t. all patterns recognized as background? Solution: ε ≈ 0.28 • 3. Input L, F, ε; problem: false negative / false positive rate? Solution: FN = 77%
Extensions • Loss functions other than 0/1 loss • Multiple distinct patterns • Variable length patterns (insertions and deletions) • Insertions and deletions increase the Bayes error
The Autocorrelation Effect • Which pattern is easier to learn: DACBDDBADB or AAAAAAAAAA?
Outline • Pattern discovery problem • Bayes error rate framework • Experimental results • Comparison of algorithms • Application to real-world problem • Conclusions
Probabilistic Algorithms • HMM-EM • IID-Gibbs • Motif Sampler, Liu, Neuwald, Lawrence (1993, 1995, …) • IID-EM • MEME, Bailey & Elkan (1995, 1998, ...) • Comparison: Context is Local (IID Gibbs), Local (IID EM), Global (HMM EM); Learning is Stochastic (IID Gibbs), Deterministic (IID EM), Deterministic (HMM EM)
Using the Bayes Error Framework • The estimation problem can be decomposed into • Estimation of pattern locations • Estimation of emissions given pattern locations • What is the effect of each of these factors? • How far are these algorithms from their theoretical optimum?
Test Accuracy vs. Training Size [Figure: test accuracy as a function of training data size for the discovery algorithms, the known-locations estimator, and the Bayes error rate]
Can Algorithms be Improved? • Three gaps to be bridged: • From “current algorithms” to “known locations”: • “location noise”: reduce with better algorithms? • From “known locations” to Bayes error: • “estimation noise”: need more data or prior knowledge • From Bayes error to zero error: • need additional features/measurements
Test Accuracy vs. Bayes Error [Figure: test accuracy against Bayes error at 2K training size]
Test Accuracy vs. Bayes Error [Figure: test accuracy against Bayes error at 2K and 4K training sizes]
Application to Real Problems • DPInteract database - Robison et al. (1999) • experimentally verified binding sites • 55 protein families in E. coli • Supervised learning: fit a pattern model from aligned sites (ACAGAATAAAAATACACT, TTCGAATAATCATGCAAA, ..., AGTGAGTGAATATTCTCT) • Unsupervised learning • Some problems are not solvable due to high Bayes error
Summary of Contributions • Analyzed sequential pattern discovery using Bayes error rate • Bayes error = lower bound on error rate of any discovery algorithm • Explicit analytical form for error rate dependence on pattern parameters • Provides insight into what makes a learning problem hard • Example: autocorrelated patterns are harder to learn • Experimental results • Test error = Bayes error + location error + estimation error • Current algorithms • tend to perform similarly, can be quite far away from Bayes error • future improvements? • Real world motif discovery problems can have very high Bayes error
Pattern Discovery Problem • Input • Set of strings over finite alphabet (e.g. {A,B,C,D}) • Task • Unsupervised identification of recurrent patterns embedded in a background process
Future work • Further analysis of Bayes error rate • Quantifying the effect of insertions / deletions • Application and insights into biological problems • Development of suitable learning algorithms • Flat likelihood surface
More Complex Models • Variable length patterns: an insertion state I1 (entered and exited with probability 0.1) added to the pattern chain P1…P4 • Multiple patterns • Multi-part patterns • Richer background [Figure: extended HMM state diagram]
Extensions of Generative Model • Variable length patterns • special insertion / deletion states • Multiple patterns • multiple pattern chains connected to the background • Multi-part patterns • insertion states at the gaps • Richer background • multiple background states
Learnability of patterns • Multiple factors influence learnability • alphabet size • pattern length • pattern frequency • variability of the pattern • similarity of pattern and background • Single characteristic that quantifies the difficulty? • In multivariate statistics, Bayes error rate • Bayes error rate applies to classification problems
Bayes Error Rate: Definition • Optimal decision rule: ĉ(x) = argmax_c P(c | x) • Probability of error for each example is P(error | x) = 1 - max_c P(c | x) • Bayes error rate is obtained by averaging over examples: Pe* = E_x[1 - max_c P(c | x)]
Example: Pattern Discovery • Pattern model is known • Classify each symbol as pattern or background • ABDABBDABDDACABBDDBCBDBDBCADBD • 000111100000011110000000000000 • Optimal classification • relies on posterior probabilities of states • evaluated by forward-backward algorithm • makes mistakes • How to evaluate Bayes error?
Approximation of Bayes Error • Bayes error for patterns is approximated in closed form under the IID and PURE assumptions • Quality of approximation: the true Bayes error can be estimated empirically and compared with the closed-form approximation
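The empirical estimate mentioned above can be sketched by Monte Carlo: simulate the pattern HMM, compute per-symbol state posteriors with forward-backward, and average the probability that the Bayes-optimal pattern/background label is wrong. This is my own construction under the slide's model assumptions (the slides additionally report a normalized Pe*; this sketch returns the raw per-symbol error):

```python
import random

random.seed(1)
ALPHABET = "ABCD"

def estimate_bayes_error(pattern="ABBD", F=0.01, eps=0.2, n=1000):
    """Monte Carlo estimate of the per-symbol Bayes error of
    pattern-vs-background classification under the pattern HMM."""
    L, A = len(pattern), len(ALPHABET)
    S = L + 1                                     # state 0 = background

    def emit(s, c):
        if s == 0:
            return 1.0 / A
        return (1.0 - eps) * (c == pattern[s - 1]) + eps / A

    def trans(p, s):
        if p == 0:
            return {0: 1.0 - F, 1: F}.get(s, 0.0)
        return 1.0 if s == (p + 1 if p < L else 0) else 0.0

    # simulate a sequence from the model
    seq, state = [], 0
    for _ in range(n):
        if state == 0 or random.random() < eps:   # background / substitution
            seq.append(random.choice(ALPHABET))
        else:
            seq.append(pattern[state - 1])
        r, acc, nxt = random.random(), 0.0, 0
        for s in range(S):
            acc += trans(state, s)
            if r < acc:
                nxt = s
                break
        state = nxt

    # scaled forward-backward for per-symbol posteriors
    fwd = [[0.0] * S for _ in range(n)]
    fwd[0][0] = 1.0                               # assume chain starts in background
    for t in range(1, n):
        row = [emit(s, seq[t]) * sum(fwd[t-1][p] * trans(p, s) for p in range(S))
               for s in range(S)]
        z = sum(row)
        fwd[t] = [v / z for v in row]
    bwd = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [sum(trans(p, s) * emit(s, seq[t+1]) * bwd[t+1][s] for s in range(S))
               for p in range(S)]
        z = sum(row)
        bwd[t] = [v / z for v in row]

    # average probability that the optimal symbol label is wrong
    err = 0.0
    for t in range(n):
        post = [fwd[t][s] * bwd[t][s] for s in range(S)]
        z = sum(post)
        p_pattern = sum(post[1:]) / z             # posterior of "in pattern"
        err += min(p_pattern, 1.0 - p_pattern)
    return err / n

pe_hat = estimate_bayes_error()
```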
Bayes Error for Pattern Discovery • Analytical approximation • fixed length patterns • uniform background / substitution probabilities • Limited context (IID assumption): • HMM: P(hidden state | observed sequence) ABDABBDABDDACABBDDBCBDBDBCADBD • IID: P(hidden state | next L symbols) ABDABBDABDDACABBDDBCBDBDBCADBD
IID and PURE Assumptions • BayesError < BayesErrorIID • tight for non-autocorrelated patterns • leads to complex closed-form expression • BayesErrorIID ≥ BayesErrorPURE • PURE: windows are BBBB or PPPP only • BBPP not allowed
The Autocorrelation Effect • Autocorrelated (e.g., ABABABAB) patterns have higher Bayes error rate • harder to detect when true model is known • harder to learn when true model is not known • Boundaries of periodic patterns are fuzzy when substitutions are allowed • Illustrated by the posterior pattern probability
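The self-overlap structure behind this effect can be computed directly. Following the Guibas & Odlyzko definition, entry k of the autocorrelation vector is 1 iff the pattern agrees with a copy of itself shifted by k positions wherever they overlap (a standard construction, sketched here in my own code):

```python
def autocorrelation(p):
    """Autocorrelation (self-overlap) vector of pattern p:
    entry k is 1 iff p agrees with itself shifted right by k."""
    return [int(all(p[i] == p[i + k] for i in range(len(p) - k)))
            for k in range(len(p))]
```

A periodic pattern like ABABABAB has many nonzero entries, so shifted copies of the pattern explain the data almost as well as the true alignment; its boundaries are fuzzy under substitutions, which raises the Bayes error. A pattern like DACBDDBADB overlaps itself only at shift 0.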
Example: |A| = 4, L = 5, F = 0.005, ε = 0.2; Pe* = 0.87 • 1. Input L, F; problem: Pe* as ε → 0? Solution: Pe* = 0.2 • 2. Input L, F; problem: ε s.t. all patterns recognized as background? Solution: ε ≈ 0.28 • 3. Input L, ε; problem: F s.t. all patterns recognized as background? Solution: F ≈ 0.003 • 4. Input L, F, ε; problem: k*, the max allowed errors in the pattern? Solution: k* = 0 • 5. Input L, F, ε; problem: false negative / false positive rate? Solution: FP = 77%
Algorithms for motif discovery • Combinatorial search • Pevzner & Sze (2000) • finds the largest cliques in the graph induced by the edit distance between L-mers • Hill climbing • Hu, Kibler & Sandmeyer (1999) • objective function maximizes the difference between background and pattern distributions
Algorithms for motif discovery • Detection of over-represented exact k-mers • Van Helden, André & Collado-Vides (1998) • comparing the number of occurrences in the upstream regions of co-regulated genes and in the whole genome • Detection of over-represented non-exact k-mers • Buhler & Tompa (2000) • method of random projections • May be used to initialize probabilistic models
Quality of solution • Pattern structure affects the quality of fitted models • Higher Bayes error means lower quality of solutions • Parameters: Length = 10, Ppat = 0.01, E[#Errors] = 2
Pattern Structure • Bayes error depends on pattern structure through the autocorrelation vector (Guibas, Odlyzko, 1981) • translation-invariant patterns are harder to label / learn [Figure: example patterns ranked by increasing difficulty]