SPLASH: S tructural P attern L ocalization A nalysis by S equential H istograms

SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14th, 2004

Motif: A functional regain of a DNA or protein sequence Amino Acids sequences How to discover the functional regains automatically?

Automatic Motif discovery • Problem • Use A, B, C, … stands for different amino acids • A protein sequence: ABABAABCDBAA… • Motifs are certain patterns in sequences for example: ABCA • Previous Methods: small scale discovery • Several sequences  similar functions  alignment • Can we use data mining to generate motifs candidates first?

Automatically discover motifs: What properties should a motif have? • It has a specific function  conservative  frequent appearing in sequences • Evolution  likely not continually identical For example: ABCBABABA AB--ABAB- string matching,suffix tree … AB-BAB-B-  how?

Formal problem definition • Input: A string of characters: S=s1s2,…,sL • Output:A frequent pattern: (∑ U ●)* ●: a wild card to match a single character, ∑: a full character * : repeat arbitrary times Note: NO arbitrary-length gap. ABCD, AED are different

Regular Expression: to describe a certain type of patterns • | or : A|B means A or B • ● wild card to match any characters A●B means: AAB, ABB, ACB, … • * to repeat any times (including 0 times) (AB)* means null, AB, ABAB, ABABAB, … • + to repeat any times (not including 0 times) • …

Any requirements for output patterns? • Can wild card be anywhere? Do we need some constraints on wild cards? • What means “frequent”? • How long should a qualified pattern be?

Can wild card be anywhere? • A pattern can have ●: for example A●BA●●B • But, A●●●●●●●●●●●●●●●●BBA ?? Probably, it cannot be too “sparse”… • Naïve solution: no more than n ● • But, for example n=5 A●●●●●B : 5 ● A●BB●●A●●B●●A : 7●

Two ● at most Density: how “sparse” do we allow? Given a pattern P, any length l0 region in P must have k0 full characters Example: l0 = 5, k0 = 3 s1s2s3s4s5s6s7s8s9……

Density: how “sparse” do we allow? Given a pattern P, any length l0 region in P must have k0 full characters Example: l0 = 5, k0 = 3 s1s2s3s4s5s6s7s8s9…… A●●ABB●A √ BA●●A●BB X

Frequency and length • At least, the patterns have K0 full characters repeating J0times • Example J0 =3 ABCBABABA √ ABCBABABA X • Example K0 =3 ABCBABABA X ABCBABABA √

Summary of parameters for a pattern • Sequence S, and its length L • Pattern P, K full character, appears J times • Length constraints: K ≥K0 • Frequency constraint: J ≥J0 • Density constraint: l0 , k0

Apriori property • A constraint has a-priori property means: If a setviolates this constraint, any its superset will violate this constraint as well. • For example max(S) < 5 • Frequency constraint has a-priori property! For example, BA●A●BB appears less than J0times, any its super patterns CANNOT appears more than J0times!

A whole picture of the algorithm • To form longer pattern only from short qualified patterns. • First, to generate candidates/seed (length l0): every seed should repeat at least J0 times • To generate longer patterns from short patterns, iteratively 1. Two patterns are together 2. Longer patterns repeat at least J0 ……

Generate the seeds: enumerating … • To generate seeds (shortest patterns) first ABAABBCBACBDB… J0=4 A: 4, B: 6, C:2, D: 1 • Are length 1 seeds too short? How long could those seeds be? • Too long: enumerating costs too much time • Too short: maybe not efficient, also not consider the density constraints • Maybe, we should start from the patterns with length l0 .

How to generate seeds with length l0 ? • Give l0 and k0, and character sets ABC… • Enumerating all possible patterns with length l0 • Scan the sequence the count the frequency • For example, l0=3, k0=2, ABC AAA, AAB, AAC, ABA, … AA●, AB●, AC●, … A●A, A●B, A●C, … …

Can we do it more efficiently? • Give l0 and k0, 1: full character 0:wild card Enumerating all possible patterns by 1 and 0? • Example l0=5, k0=3, to find comb 11111, 11110, 11101, 11100, 11011, 11010, 11001, 10111, 10110, 10101 10011, 01111, 01110, 01101, 01011, 00111

A●A●B How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B

B●A●A How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B

A●B●B How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B

How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●A●B 3, B●A●A 2, A●B●B 2 B●B●A 2, B●B●B 2, A●A●A 1 A●B●A 1, B●A●B 1 J0 = 3? only A●A●B left By the same way, use others combs to generate other seeds, different combs won’t generate the same patterns

How to get long patterns? • Long pattern  two patterns could be merged  need short patterns and their locations • Pattern: A●B●●C {A:0,B:2,C:5} • Locus: the locations where a pattern occurs: Patten AB in string ABBCABAB Its locus {0, 4, 6}

Append operation: to connect two small patterns to a longer pattern Patten S1: A●●B●C and S2: B●D●  S1S2: A●●B●CB●D● conditional on: Their locus have intersection S1 locus: {1, 20, 32, 57 …} S2 locus: {7, 13, 38, 63 … }  {1,7,32,57,…} S1S2 locus: {1,32,57,…} -6

Add: to make the patterns more “dense” • Patten A●●B●C●●●D and ●●●B●CE  A●●B●CE●●D on the conditions: Their locus (with shifting) have intersection

Significance: whether it can be generated by randomly sampling ? • hypothesis: A pattern is not randomly generated Given: character set: {A,B,C,D,E} sequence length: L A pattern: A●BA●AA●B Its frequency j Probability to generate this pattern j times pure randomly?

Statistical significance • Pure random sampling, the frequency should satisfy normal distribution • Z score, (A-E[A])/σA --- normalized into N(0,1)

Experiments • Two questions to answer. • How efficient is this algorithm? • How effective is this algorithm? • Baseline algorithm • PRATT(EBI), MEME(UCSD)

Efficiency PRATT SPLASH

Effectiveness Search against SWISS-PROT Rel. 36, 578 GPCR proteins returned, only 4 false positive MEME cannot find it, PRATT program crashed

Conclusions • Deterministic algorithm: It can discover all patterns satisfying the requirements • Efficient and scalable: It beats PRATT and MEME. More scalable … • Effective: It can discover useful patterns.

Problems • All problems that A-priori algorithm could have: too many results, cannot really avoid worse-case exponential … • Doesn’t really consider the 3D structure of proteins • The software crashes sometimes

SPLASH: S tructural P attern L ocalization A nalysis by S equential H istograms