Pattern Discovery in Sequences under a Markov assumption
Darya Chudova, Padhraic Smyth
Information and Computer Science, University of California, Irvine
Funding acknowledgements: National Science Foundation, Microsoft Research, IBM Research
Outline • Pattern discovery problem • Problem statement • Research questions • Bayes error rate framework • Experimental results • Conclusions
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCDBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADABCCCBBCDBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBAABBCCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCABBBCBBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBABBCCAADCBCDACBCABABCCBACBBBCADDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
Applications in Biology • Motif discovery problem • Task: Identification of potential binding sites of proteins • Input: Upstream regions of co-regulated genes • Patterns are non-deterministic • fixed length motifs • may have substitution errors • substitutions are independent
Generative Model • Hidden Markov Model: a background state BG emits {A,B,C,D} uniformly (0.25 each) and enters a pattern chain P1 → P2 → P3 → P4 with probability 0.01 (staying in BG with 0.99); each pattern state advances deterministically and emits its pattern symbol with probability 0.9, returning to BG after the last state • Extensions: • Variable length patterns • Multiple patterns, multi-part patterns • Richer background [Figure: state diagram of the pattern HMM]
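As a rough illustration, the generative model above can be simulated directly. This is my own sketch, not the authors' code; the pattern string `ABBD`, pattern-entry probability `F`, and substitution probability `eps` are illustrative choices:

```python
import random

random.seed(0)
ALPHABET = "ABCD"

def generate(n, pattern="ABBD", F=0.01, eps=0.1):
    """Emit a length-n string; labels[i] is 1 where a pattern was embedded."""
    seq, labels = [], []
    while len(seq) < n:
        if random.random() < F:                  # enter the pattern chain
            for c in pattern:
                if random.random() < eps:        # substitution error
                    c = random.choice(ALPHABET)
                seq.append(c)
                labels.append(1)
        else:                                    # background state, uniform emission
            seq.append(random.choice(ALPHABET))
            labels.append(0)
    return "".join(seq[:n]), labels[:n]

seq, labels = generate(500)
```

With a low entry probability the patterns are sparse, which is what makes their discovery against the uniform background non-trivial.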
Pattern Detection and Discovery • Unsupervised discovery of embedded patterns: EM applied to unlabeled data yields a pattern model • Supervised detection of known patterns: Viterbi decoding applied to unlabeled data with a known pattern model yields pattern locations
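Supervised detection with a known model can be sketched with a standard Viterbi decoder over the background-plus-pattern-chain HMM. This is my own minimal sketch (not the authors' implementation); the pattern and parameter values are illustrative:

```python
import math

ALPHABET = "ABCD"

def viterbi_pattern(seq, pattern="ABBD", F=0.01, eps=0.1):
    """Most likely state path under the background + pattern-chain HMM.
    State 0 is background; state k (1..L) is the k-th pattern position."""
    L, A = len(pattern), len(ALPHABET)
    S = L + 1

    def emit(s, c):
        if s == 0:
            return 1.0 / A                       # uniform background emission
        # pattern symbol with prob 1 - eps, plus uniform substitution mass
        return (1.0 - eps) * (c == pattern[s - 1]) + eps / A

    def trans(p, s):
        if p == 0:                               # background: stay, or enter chain
            return {0: 1.0 - F, 1: F}.get(s, 0.0)
        nxt = p + 1 if p < L else 0              # chain advances, then exits to BG
        return 1.0 if s == nxt else 0.0

    NEG = float("-inf")
    V = [[NEG] * S for _ in seq]
    back = [[0] * S for _ in seq]
    V[0][0] = math.log(emit(0, seq[0]))          # assume we start in background
    for t in range(1, len(seq)):
        for s in range(S):
            best, arg = NEG, 0
            for p in range(S):
                tp = trans(p, s)
                if tp > 0.0 and V[t - 1][p] > NEG:
                    score = V[t - 1][p] + math.log(tp)
                    if score > best:
                        best, arg = score, p
            if best > NEG:
                V[t][s] = best + math.log(emit(s, seq[t]))
                back[t][s] = arg
    # backtrack from the best final state
    s = max(range(S), key=lambda k: V[-1][k])
    path = [s]
    for t in range(len(seq) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

path = viterbi_pattern("DDABBDDD")
```

For the unsupervised case, EM wraps a soft version of this decoding (forward-backward) inside a parameter re-estimation loop.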
Research Questions • Current state • Successful algorithms for motif discovery • Cannot solve some seemingly simple problems • Why is this happening? • Are the algorithms suboptimal? • Is the data set too small? • Is the problem inherently difficult and ambiguous? • Can such questions be answered in a principled way?
Outline • Pattern discovery problem • Bayes error rate framework • Definitions • Examples • Application to pattern discovery • Experimental results • Conclusions
Difficulty of Pattern Discovery • Assume true model is given • Measure of difficulty • classification performance of pattern detection • Multiple factors influence the difficulty • Single characteristic that quantifies it? • Bayes error rate
Bayes Error Rate: Definition • Bayes error rate: • average error rate of the optimal decision rule • a lower bound on the error rate of any algorithm on a given problem • Optimal rule: • pick the class with the highest posterior probability
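In symbols (the standard definition; the notation here is mine, not taken from the slides), the optimal rule and its average error are:

```latex
\hat{c}(x) = \arg\max_{c} P(c \mid x),
\qquad
P_e^{*} = \mathbb{E}_{x}\!\left[\, 1 - \max_{c} P(c \mid x) \,\right]
```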
Classification Example • Simple problem • Harder problem
Pattern Discovery Example • Simple problem • BDCBBDCABCDADCADAADABDAABBCBCACADAADCDDABDCADAADCBBAADBDBDBDAABACABBABBCAADDBCBADCBDDABCDABBBDBBCCBDAABCAABACDDADCADBCABADBCABAAABBCABBCDAABDCDAABACDDACCCDBCDDDBAADCBDAADDBBADAAADAADAADBAABDBDACADBDCCBACBACBADABDCACBCBDDCBACBAAADDCABBADDDCABCDCCCCBDDCADBBCDDACCDBBBACAACADBDACDAADCDACBADAADCBABACADAADBAABAAAADAADDDADBDDCBCDDCCBDDCDCBBBDAADDBDBBCDACBCCCBCBCDAD • Hard problem • DDDBACABBCDDCDCBBCACCBDADDBBACDACCBDADCCCDBDBDADABABDCBCDABDBABABCCADBBDCDBBBBDACBBAABBBBBADCCAACACDACCCBBCADDADDBACCCBDABCCCBDADDADABDBABAAACCDBCBCDCBCABABCDBCDDAAACBADACCBCDABAACDCDCDDBCCACBDDADAACABDADDBBDDBCAADBAADBACBDADDBDBDACACDBBBBCADACCBDDBDBCCAACAADABDCBDDCCDDACBDDDCCBCCBCDCACCBDACDCDADCDCDDDADCCCBDACBCBDCACCDDBBACCBBCCDBBABAADABABDCDDBBDCDDAADDABBCBAB
Bayes Error in Markov Context • Closed-form expressions are hard to obtain • Special cases considered in the 1960s and 70s • Raviv (1967) • Used context to improve text classification • Chu (1970), Lee (1974) • Lower and upper bounds for a 2-state HMM • The context is limited to 1 or 2 symbols
Bayes Error for Pattern Discovery • Analytical approximation • fixed length patterns • uniform background / substitution probabilities • Limited context (IID assumption): • HMM: P(hidden state | observed sequence) • IID: P(hidden state | next L symbols)
Insights from Bayes Error Rate • Example: • Alphabet size |A| = 4 • Pattern length L = 5 • Pattern frequency F = 0.005 • Substitution probability ε = 0.2 • How hard is this problem? • How sensitive is “problem hardness” to these parameters?
Example • 1. Input L, F, ε; problem: normalized Bayes error rate? Solution: Pe* = 0.87 • 2. Input L, F; problem: ε s.t. all patterns recognized as background? Solution: ε ≈ 0.28 • 3. Input L, F, ε; problem: false negative / false positive rate? Solution: FN = 77%
Extensions • Loss functions other than 0/1 loss • Multiple distinct patterns • Variable length patterns (insertions and deletions) • Insertions and deletions increase the Bayes error
The Autocorrelation Effect • Which pattern is easier to learn: DACBDDBADB or AAAAAAAAAA?
Outline • Pattern discovery problem • Bayes error rate framework • Experimental results • Comparison of algorithms • Application to real-world problem • Conclusions
Probabilistic Algorithms • HMM-EM • IID-Gibbs • Motif Sampler, Liu, Neuwald, Lawrence (1993, 1995, …) • IID-EM • MEME, Bailey & Elkan (1995, 1998, ...) • Comparison: Context is Local (IID Gibbs), Local (IID EM), Global (HMM EM); Learning is Stochastic (IID Gibbs), Deterministic (IID EM), Deterministic (HMM EM)
Using the Bayes Error Framework • The estimation problem can be decomposed into • Estimation of pattern locations • Estimation of emissions given pattern locations • What is the effect of each of these factors? • How far are these algorithms from their theoretical optimum?
Test Accuracy vs. Training Size [Figure: test accuracy as a function of training data size for the discovery algorithms, the known-locations estimator, and the Bayes error rate]
Can Algorithms be Improved? • Three gaps to be bridged: • From “current algorithms” to “known locations”: • “location noise”: reduce with better algorithms? • From “known locations” to Bayes error: • “estimation noise”: need more data or prior knowledge • From Bayes error to zero error: • need additional features/measurements
Test Accuracy vs. Bayes Error [Figure: test accuracy against Bayes error at 2K training size]
Test Accuracy vs. Bayes Error [Figure: test accuracy against Bayes error at 2K and 4K training sizes]
Application to Real Problems • DPInteract database - Robison et al. (1999) • experimentally verified binding sites • 55 protein families in E. coli • Supervised learning: fit a pattern model from aligned sites (ACAGAATAAAAATACACT, TTCGAATAATCATGCAAA, ..., AGTGAGTGAATATTCTCT) • Unsupervised learning • Some problems are not solvable due to high Bayes error
Summary of Contributions • Analyzed sequential pattern discovery using Bayes error rate • Bayes error = lower bound on error rate of any discovery algorithm • Explicit analytical form for error rate dependence on pattern parameters • Provides insight into what makes a learning problem hard • Example: autocorrelated patterns are harder to learn • Experimental results • Test error = Bayes error + location error + estimation error • Current algorithms • tend to perform similarly, can be quite far away from Bayes error • future improvements? • Real world motif discovery problems can have very high Bayes error
Pattern Discovery Problem • Input • Set of strings over finite alphabet (e.g. {A,B,C,D}) • Task • Unsupervised identification of recurrent patterns embedded in a background process
Future work • Further analysis of Bayes error rate • Quantifying the effect of insertions / deletions • Application and insights into biological problems • Development of suitable learning algorithms • Flat likelihood surface
More Complex Models • Variable length patterns: an insertion state I1 (entered and exited with probability 0.1) added to the pattern chain P1…P4 • Multiple patterns • Multi-part patterns • Richer background [Figure: extended HMM state diagram]
Extensions of Generative Model • Variable length patterns • special insertion / deletion states • Multiple patterns • multiple pattern chains connected to the background • Multi-part patterns • insertion states at the gaps • Richer background • multiple background states
Learnability of patterns • Multiple factors influence learnability • alphabet size • pattern length • pattern frequency • variability of the pattern • similarity of pattern and background • Single characteristic that quantifies the difficulty? • In multivariate statistics, Bayes error rate • Bayes error rate applies to classification problems
Bayes Error Rate: Definition • Optimal decision rule: ĉ(x) = argmax_c P(c | x) • Probability of error for each example is P(error | x) = 1 - max_c P(c | x) • Bayes error rate is obtained by averaging over examples: Pe* = E_x[1 - max_c P(c | x)]
Example: Pattern Discovery • Pattern model is known • Classify each symbol as pattern or background • ABDABBDABDDACABBDDBCBDBDBCADBD • 000111100000011110000000000000 • Optimal classification • relies on posterior probabilities of states • evaluated by forward-backward algorithm • makes mistakes • How to evaluate Bayes error?
Approximation of Bayes Error • Bayes error for patterns is approximated in closed form under the IID and PURE assumptions • Quality of approximation: the true Bayes error can be estimated empirically and compared with the closed-form approximation
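The empirical estimate mentioned above can be sketched by Monte Carlo: simulate the pattern HMM, compute per-symbol state posteriors with forward-backward, and average the probability that the Bayes-optimal pattern/background label is wrong. This is my own construction under the slide's model assumptions (the slides additionally report a normalized Pe*; this sketch returns the raw per-symbol error):

```python
import random

random.seed(1)
ALPHABET = "ABCD"

def estimate_bayes_error(pattern="ABBD", F=0.01, eps=0.2, n=1000):
    """Monte Carlo estimate of the per-symbol Bayes error of
    pattern-vs-background classification under the pattern HMM."""
    L, A = len(pattern), len(ALPHABET)
    S = L + 1                                     # state 0 = background

    def emit(s, c):
        if s == 0:
            return 1.0 / A
        return (1.0 - eps) * (c == pattern[s - 1]) + eps / A

    def trans(p, s):
        if p == 0:
            return {0: 1.0 - F, 1: F}.get(s, 0.0)
        return 1.0 if s == (p + 1 if p < L else 0) else 0.0

    # simulate a sequence from the model
    seq, state = [], 0
    for _ in range(n):
        if state == 0 or random.random() < eps:   # background / substitution
            seq.append(random.choice(ALPHABET))
        else:
            seq.append(pattern[state - 1])
        r, acc, nxt = random.random(), 0.0, 0
        for s in range(S):
            acc += trans(state, s)
            if r < acc:
                nxt = s
                break
        state = nxt

    # scaled forward-backward for per-symbol posteriors
    fwd = [[0.0] * S for _ in range(n)]
    fwd[0][0] = 1.0                               # assume chain starts in background
    for t in range(1, n):
        row = [emit(s, seq[t]) * sum(fwd[t-1][p] * trans(p, s) for p in range(S))
               for s in range(S)]
        z = sum(row)
        fwd[t] = [v / z for v in row]
    bwd = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [sum(trans(p, s) * emit(s, seq[t+1]) * bwd[t+1][s] for s in range(S))
               for p in range(S)]
        z = sum(row)
        bwd[t] = [v / z for v in row]

    # average probability that the optimal symbol label is wrong
    err = 0.0
    for t in range(n):
        post = [fwd[t][s] * bwd[t][s] for s in range(S)]
        z = sum(post)
        p_pattern = sum(post[1:]) / z             # posterior of "in pattern"
        err += min(p_pattern, 1.0 - p_pattern)
    return err / n

pe_hat = estimate_bayes_error()
```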
Bayes Error for Pattern Discovery • Analytical approximation • fixed length patterns • uniform background / substitution probabilities • Limited context (IID assumption): • HMM: P(hidden state | observed sequence) ABDABBDABDDACABBDDBCBDBDBCADBD • IID: P(hidden state | next L symbols) ABDABBDABDDACABBDDBCBDBDBCADBD
IID and PURE Assumptions • BayesError < BayesErrorIID • tight for non-autocorrelated patterns • leads to complex closed-form expression • BayesErrorIID ≥ BayesErrorPURE • PURE: windows are BBBB or PPPP only • BBPP not allowed
The Autocorrelation Effect • Autocorrelated (e.g., ABABABAB) patterns have higher Bayes error rate • harder to detect when true model is known • harder to learn when true model is not known • Boundaries of periodic patterns are fuzzy when substitutions are allowed • Illustrated by the posterior pattern probability
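The self-overlap structure behind this effect can be computed directly. Following the Guibas & Odlyzko definition, entry k of the autocorrelation vector is 1 iff the pattern agrees with a copy of itself shifted by k positions wherever they overlap (a standard construction, sketched here in my own code):

```python
def autocorrelation(p):
    """Autocorrelation (self-overlap) vector of pattern p:
    entry k is 1 iff p agrees with itself shifted right by k."""
    return [int(all(p[i] == p[i + k] for i in range(len(p) - k)))
            for k in range(len(p))]
```

A periodic pattern like ABABABAB has many nonzero entries, so shifted copies of the pattern explain the data almost as well as the true alignment; its boundaries are fuzzy under substitutions, which raises the Bayes error. A pattern like DACBDDBADB overlaps itself only at shift 0.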
Example: |A| = 4, L = 5, F = 0.005, ε = 0.2; Pe* = 0.87 • 1. Input L, F; problem: Pe* as ε → 0? Solution: Pe* = 0.2 • 2. Input L, F; problem: ε s.t. all patterns recognized as background? Solution: ε ≈ 0.28 • 3. Input L, ε; problem: F s.t. all patterns recognized as background? Solution: F ≈ 0.003 • 4. Input L, F, ε; problem: k*, the max allowed errors in the pattern? Solution: k* = 0 • 5. Input L, F, ε; problem: false negative / false positive rate? Solution: FP = 77%
Algorithms for motif discovery • Combinatorial search • Pevzner & Sze (2000) • finds the largest cliques in the graph induced by the edit distance between L-mers • Hill climbing • Hu, Kibler & Sandmeyer (1999) • objective function maximizes the difference between background and pattern distributions
Algorithms for motif discovery • Detection of over-represented exact k-mers • Van Helden, André & Collado-Vides (1998) • comparing the number of occurrences in the upstream regions of co-regulated genes and in the whole genome • Detection of over-represented non-exact k-mers • Buhler & Tompa (2000) • method of random projections • May be used to initialize probabilistic models
Quality of solution • Pattern structure affects the quality of fitted models • Higher Bayes error means lower quality of solutions • Parameters: Length = 10, Ppat = 0.01, E[#Errors] = 2
Pattern Structure • Bayes error depends on pattern structure through the autocorrelation vector (Guibas, Odlyzko, 1981) • translation-invariant patterns are harder to label / learn [Figure: example patterns ranked by increasing difficulty]