Parameter Tuning for Differential Mining of String Patterns

J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut
DDDM'08, Pisa - 15/12/2008

Tuning extraction parameters
• Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings
• … under constraints (monotonic, anti-monotonic, or neither; pattern shapes; occurrence properties; measures …)
• can select/focus …
• … where to look in the parameter space?
• often easy with a single threshold
• … but what about multiple constraints/multiple thresholds?

Two different kinds of tuning
• 1) Exploratory stage: find promising areas in the parameter space
• 2) Fine-grained tuning: a kind of greedy strategy based on small local explorations of the parameter space

Tools?
• The best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining? …

Tools
• GREP + Word Count
• method: a manual mix
• count the extracted patterns
• choose points in the parameter space
• random walk
• try a local greedy strategy
• keeping in mind known properties of the constraints (when applicable) and domain knowledge

Tools
• … with several parameters and several thresholds, e.g., a minimal support on one dataset and a maximal support on another dataset …
• perform a more exhaustive exploration of the pattern space
• draw curves depicting the extraction landscape

Tools / landscape
• Examples

Obtaining extraction landscapes
• Use a script
  - can need a lot of resources to execute
  - too much time needed to explore a large parameter space (several parameters)
• Use a global model of the presence of the local patterns to estimate the number of patterns
  - reuse/adapt a model: not many exist
  - develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
  - incorporate domain knowledge? A global analytical model is then even more complex to derive …

What about sampling the pattern space?
• sounds too naive, needing complicated frameworks
• how to sample?
• size of the sample?
• number of patterns in the sample that satisfy the constraints?
• using domain knowledge?
• how to estimate the value for the whole pattern space?

What about simple choices?
• Sampling with replacement among the patterns that satisfy the syntactic constraints (conjunction of constraints)
• Number of patterns in the sample that satisfy the constraints
  - compute, for each pattern in the sample, the probability that it satisfies the constraints (incorporating domain knowledge)
  - this gives the approximate number of patterns that satisfy the constraints (in the sample)
• Sample size: grow the sample until the percentage of patterns satisfying the constraints converges
• Estimate of the number of patterns in the pattern space that satisfy the constraints: apply this percentage to the patterns that satisfy the syntactic constraints

Whole process
• 1) build an initial sample of Psynt
• 2) compute an estimate of E(N) from the sample
• 3) add more patterns to the sample
• 4) recompute the estimate of E(N) from the sample
• 5) if the estimate changed a lot, go to 3)
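
The five steps above can be sketched as a small loop. This is only an illustrative sketch, not the authors' implementation: the function name, the callables `draw_pattern` and `prob_satisfies`, the batch sizes, the 5% relative tolerance and the `max_samples` safety cap are all assumptions introduced here. `draw_pattern` is assumed to draw uniformly, with replacement, from Psynt, and `prob_satisfies(p)` is assumed to return the probability that pattern p satisfies the constraints.

```python
import random
from typing import Callable


def estimate_expected_count(
    draw_pattern: Callable[[random.Random], str],  # hypothetical: uniform draw from Psynt
    prob_satisfies: Callable[[str], float],        # hypothetical: E(Xp) for a pattern p
    psynt_size: int,                               # number of patterns in Psynt
    initial_size: int = 1000,                      # assumed starting sample size
    batch_size: int = 1000,                        # assumed size of each added subset
    rel_tol: float = 0.05,                         # assumed convergence criterion (5%)
    max_samples: int = 1_000_000,                  # assumed safety cap
    seed: int = 0,
) -> float:
    """Estimate E(N) by growing a sample of Psynt until the estimate stabilizes."""
    rng = random.Random(seed)
    total = 0.0  # running sum of E(Xp) over the sample
    n = 0        # current sample size

    def grow(k: int) -> None:
        nonlocal total, n
        for _ in range(k):
            total += prob_satisfies(draw_pattern(rng))
        n += k

    grow(initial_size)                     # step 1: initial sample
    estimate = total / n * psynt_size      # step 2: first estimate of E(N)
    while n < max_samples:
        grow(batch_size)                   # step 3: add more patterns
        new_estimate = total / n * psynt_size  # step 4: recompute the estimate
        change = abs(new_estimate - estimate)
        estimate = new_estimate
        # step 5: stop once the estimate no longer changes a lot
        if change <= rel_tol * max(abs(estimate), 1e-300):
            break
    return estimate
```

Keeping a running sum means each added batch costs only the evaluation of the new patterns; the earlier part of the sample is never revisited.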

Using it in frequent substring mining
• Two datasets: R1 and R2 (two sets of strings)
• Constraints
  - having size Z
  - appearing at least min times in R1
  - appearing no more than max times in R2
• Consider exact and approximate matching
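
For the exact-matching case, the three constraints can be checked by brute force as in the sketch below. It is a naive illustration, not the extraction algorithm used in the talk. It assumes that "appearing k times" means "occurring in k strings of the dataset"; the function names are hypothetical.

```python
from typing import Iterable, List, Set


def support(pattern: str, strings: Iterable[str]) -> int:
    """Number of strings containing `pattern` at least once (exact match).

    Assumption: support counts strings, not individual occurrences.
    """
    return sum(1 for s in strings if pattern in s)


def differential_substrings(
    r1: List[str], r2: List[str], size: int, min_support: int, max_support: int
) -> Set[str]:
    """Substrings of length `size` with support >= min_support in r1
    and support <= max_support in r2."""
    if min_support < 1:
        # With min_support >= 1 every solution occurs in r1, so it is
        # enough to enumerate the substrings of r1 as candidates.
        raise ValueError("this sketch requires min_support >= 1")
    candidates = {
        s[i:i + size] for s in r1 for i in range(len(s) - size + 1)
    }
    return {
        p for p in candidates
        if support(p, r1) >= min_support and support(p, r2) <= max_support
    }
```

Counting `len(differential_substrings(...))` for each point of a grid of (min, max) values is the expensive, extraction-based way of obtaining one landscape curve for a fixed size Z.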

Pattern space and domain knowledge
• Strings over an alphabet of 4 or 8 symbols
• Domain knowledge as three models of symbol distribution
  - Me: independent symbols with equal frequencies
  - Md: independent symbols with different frequencies
  - Mm: first-order Markov model
• For a given p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string
• From the binomial distribution we have the probability that p satisfies the min and max support constraints
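
A minimal sketch of this computation for the Me and Md models with exact matching follows. It rests on simplifying assumptions that are not taken from the slides: match positions inside a string are treated as independent (overlaps between occurrences are ignored), all strings in a dataset have the same length, and R1 and R2 are independent. The function names are hypothetical.

```python
from math import comb, prod
from typing import Mapping


def occurrence_prob(pattern: str, freqs: Mapping[str, float], string_length: int) -> float:
    """Approximate P(pattern occurs at least once in one random string).

    Assumption: the match events at different positions are independent.
    """
    q = prod(freqs[c] for c in pattern)  # P(match at one fixed position)
    positions = string_length - len(pattern) + 1
    if positions <= 0:
        return 0.0
    return 1.0 - (1.0 - q) ** positions


def binom_tail_at_least(n: int, k: int, p: float) -> float:
    """P(Binomial(n, p) >= k)."""
    return sum(
        comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(max(k, 0), n + 1)
    )


def prob_satisfies(
    pattern: str, freqs: Mapping[str, float],
    n1: int, len1: int, n2: int, len2: int,
    min_support: int, max_support: int,
) -> float:
    """P(support in R1 >= min_support and support in R2 <= max_support).

    Assumption: R1 (n1 strings of length len1) and R2 (n2 strings of
    length len2) are generated independently from the same model.
    """
    p1 = occurrence_prob(pattern, freqs, len1)
    p2 = occurrence_prob(pattern, freqs, len2)
    at_least_min = binom_tail_at_least(n1, min_support, p1)
    at_most_max = max(0.0, 1.0 - binom_tail_at_least(n2, max_support + 1, p2))
    return at_least_min * at_most_max
```

For Me, `freqs` maps every symbol to the same value (for example 0.25 for 4 symbols); for Md it holds the different frequencies. The Mm model and approximate matching would need a different `occurrence_prob` and are not covered by this sketch.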

Example / random data
• 4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact match

Example / random data
• 4 symbols, Mm, 100 strings of length 1000 in R1 and R2, exact and approximate match

Example / gene promoter sequences
• 4 symbols (A, C, G, T), Md, strings of 4000 symbols, 29 in R1 and 21 in R2, approximate match

Example / gene promoter sequences
• Estimate vs. extraction

Conclusion
• Drawing the extraction landscape for parameter tuning in local pattern extraction, using pattern space sampling …
• seems possible …
• … at least in some cases
• … using a simple framework
• … incorporating domain knowledge (to some extent: there are many works on the probability that a given pattern satisfies constraints)
• simpler than building a global analytical model
• faster than running real extractions
• … sufficient in the exploratory stage?
• … companion software?

Example / random data
• 8 symbols, Me, 100 strings of length 30000 in R1 and R2, approximate match

Problem - Sampling / estimate
• kind of sampling (with replacement?)
• specific sampling (a kind of stratified sampling) for some constraints?
• kinds of patterns?
• quality of the estimates … occurrences of different patterns are not independent

Problem - Other parameters added
• size of the starting set
• convergence criterion? 5%?
• size of the additional subsets
• … not so hard to tune?

Number of patterns
• Conjunction of constraints C
• Patterns in the pattern space PS
• For each pattern p, let the variable Xp = 1 if p satisfies C, and Xp = 0 otherwise
• N = number of patterns that satisfy C = sum of Xp over PS
• E(N) = sum of E(Xp) over PS
• E(Xp) = probability that p satisfies C
• Psynt = patterns in PS that satisfy the syntactic constraints in C
• E(N) = sum of E(Xp) over Psynt

Number of patterns
• Compute NS = sum of E(Xp) over a sample of Psynt
• Compute the ratio NR = NS / sample size
• Use NR * size of Psynt as an estimate of E(N)
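
Written out, the definitions on these two "Number of patterns" slides chain together as follows. The symbol S for the sample and the hat notation for the estimate are introduced here for readability and are not in the slides.

```latex
\begin{align*}
N &= \sum_{p \in PS} X_p, \qquad X_p \in \{0, 1\} \\
E(N) &= \sum_{p \in PS} E(X_p) = \sum_{p \in P_{synt}} E(X_p)
  && \text{(linearity of expectation; } E(X_p) = 0 \text{ for } p \notin P_{synt}\text{)} \\
NS &= \sum_{p \in S} E(X_p), \qquad NR = \frac{NS}{|S|} \\
\widehat{E(N)} &= NR \cdot |P_{synt}|
\end{align*}
```

Linearity of expectation holds even when the Xp are not independent, so the expression for E(N) is exact; independence only matters for the quality of the sample-based estimate. Assuming each pattern of the sample is drawn uniformly from Psynt, NR is an unbiased estimate of the average of E(Xp) over Psynt. As a purely illustrative calculation with made-up numbers: if the only syntactic constraint is size Z = 10 over 4 symbols, then |Psynt| = 4^10 = 1,048,576, and a ratio NR = 0.002 would give an estimate of about 2,097 patterns.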

Example / gene promoter sequences
• Estimate vs. extraction

Often repeat exploratory stage
• Redo the exploratory stage after important changes such as:
  - data selection (e.g., part of the sequences)
  - encoding (e.g., mapping onto event types)
  - discretization (e.g., binarization threshold)
  - …
