Linear-Time Computation of Similarity Measures for Sequential Data

Presenter: Cheng-Feng Weng • Authors: Konrad Rieck and Pavel Laskov • 2008/09/11 • ML.26 (2008)

Outline • Introduction • Motivation • Objective • Methods • Experimental results • Conclusion • Comments

Introduction • Sequential data is a fundamental data representation in computer science. • Applications range from search engines to document ranking, from gene finding to prediction of protein functions, and from network surveillance tools to anti-virus programs. • Providing an interface to sequential data is therefore an essential prerequisite for applying machine learning in these domains. • Example: DNA sequence …ATGCAACTAAT…

Motivation • Most learning algorithms impose a much looser constraint on the type of data that can be handled. • A powerful abstraction between algorithms and data representations must therefore be established. • Numerous applications define relationships between objects through metric or non-metric distances used as similarity measures. • It is imperative to address pairwise comparison of objects in the most general setup.

Objective • The aim of this contribution is to develop a general framework for pairwise comparison of sequences. • The authors focus on algorithms with linear-time asymptotic complexity in the sequence lengths. • The contribution also provides linear-time algorithms of different complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures.

Embedding Sequences using a Formal Language • The authors focus on three definitions of the embedding language L. • Bag-of-Words: L = Dictionary (explicit) or L = (A \ D)∗ (implicit). • Examples: "This is a book." → this, is, a, book; ADDACCTACA → ADD, ACCT, AC, A
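The implicit bag-of-words language can be sketched as a split at delimiter symbols. This is only an illustration: the function name `bag_of_words`, the default delimiter set and the lower-casing step are assumptions made here, not details taken from the slides.

```python
import re
from collections import Counter

def bag_of_words(x, delimiters=" .,;:!?"):
    """Count the words of x, where a word is a maximal run of non-delimiter symbols.

    The delimiter set and the lower-casing are assumptions for this sketch.
    """
    # Build a character class from the delimiter symbols and split on runs of them.
    pattern = "[" + re.escape(delimiters) + "]+"
    words = [w for w in re.split(pattern, x.lower()) if w]
    return Counter(words)

counts = bag_of_words("This is a book.")
print(dict(counts))
```

With an explicit dictionary, the same idea would keep only those words that appear in the dictionary.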

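Embedding Sequences using a Formal Language (cont.) • K-grams: L = A^k, all words of length k. Example (k = 4): abbaac → abba, bbaa, baac • Contiguous sequences: L = A∗, all contiguous subsequences. Slide examples: abbac, abbbbbbbad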
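The k-gram example on this slide can be reproduced with a sliding window. A minimal sketch follows; the helper names `kgrams` and `contiguous_subsequences` are hypothetical and chosen only for illustration.

```python
def kgrams(x, k):
    """All k-grams of x, in order of occurrence (sliding window of width k)."""
    return [x[i:i + k] for i in range(len(x) - k + 1)]

def contiguous_subsequences(x):
    """All non-empty contiguous subsequences of x, i.e. k-grams for every k."""
    return [x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)]

print(kgrams("abbaac", 4))  # ['abba', 'bbaa', 'baac']
```

Note that the number of contiguous subsequences grows quadratically with the sequence length, so enumerating them explicitly as above is only practical for short examples.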

The embedding function • Given an embedding language L, a sequence x can be mapped into the |L|-dimensional feature space by calculating a function φw(x) for every w ∈ L appearing in x. • The value φw(x) combines an occurrence term for w in x (a frequency, probability or binary flag) with a weight for w.
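A sparse version of such an embedding can be sketched with a dictionary that stores only the words actually occurring in x. The sketch assumes k-grams as the embedding language and a plain product of occurrence term and weight; the function name `embed` and its parameters `weights` and `binary` are hypothetical.

```python
from collections import Counter

def embed(x, k, weights=None, binary=False):
    """Sparse embedding of x over k-grams: word -> occurrence term * weight.

    Assumptions of this sketch: the occurrence term is a raw count (or a
    binary flag), and missing weights default to 1.0.
    """
    occ = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    phi = {}
    for w, n in occ.items():
        value = 1.0 if binary else float(n)
        weight = weights.get(w, 1.0) if weights else 1.0
        phi[w] = value * weight
    return phi

phi_x = embed("abbaa", 3)
phi_y = embed("baaaab", 3)
```

Only words with non-zero values are stored, so the size of the representation is bounded by the sequence length rather than by |L|.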

Weighting schemes • The following three weighting schemes for defining W have been proposed in previous research: • Corpus dependent weighting: • Length dependent weighting: • Position dependent weighting: Decay factor 0 ≤ λ ≤ 1

A Generic Framework for Similarity Measures • All of the similarity measures share a similar mathematical construction: an inner component-wise function is aggregated over each dimension using an outer operator.

A Generic Framework for Similarity Measures (cont.) • Unified formulation of similarity measures: s(x, y) = ⊕_{w ∈ L} m(φw(x), φw(y)), where m is the inner function and ⊕ is the outer operator.

A Generic Framework for Similarity Measures (cont.) • Define m(0,0) = e, where e is the neutral element for the operator ⊕. • Conjunctive similarity measures: • Disjunctive similarity measures:
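The framework can be sketched as a generic function parameterized by an inner function m and an outer operator ⊕. The sketch assumes the unified form s(x, y) = ⊕ m(φw(x), φw(y)) and assumes, as the terms suggest, that conjunctive measures aggregate only over words present in both sequences while disjunctive measures aggregate over words present in at least one. All names (`kgram_counts`, `similarity`) are hypothetical.

```python
from collections import Counter
from functools import reduce
import operator

def kgram_counts(x, k):
    """Occurrence counts of all k-grams in x (hypothetical helper)."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def similarity(phi_x, phi_y, inner, outer, neutral, conjunctive):
    """Aggregate inner(phi_x[w], phi_y[w]) over words w with the outer operator."""
    if conjunctive:
        words = phi_x.keys() & phi_y.keys()   # words in both sequences
    else:
        words = phi_x.keys() | phi_y.keys()   # words in at least one sequence
    values = (inner(phi_x.get(w, 0), phi_y.get(w, 0)) for w in words)
    return reduce(outer, values, neutral)

phi_x = kgram_counts("abbaa", 3)
phi_y = kgram_counts("baaaab", 3)

# Linear kernel: inner = product, outer = sum (conjunctive, since a * 0 = 0).
linear_kernel = similarity(phi_x, phi_y, operator.mul, operator.add, 0, conjunctive=True)

# Manhattan distance: inner = absolute difference, outer = sum (disjunctive).
manhattan = similarity(phi_x, phi_y, lambda a, b: abs(a - b), operator.add, 0, conjunctive=False)
```

Words absent from both sequences are never visited, which is consistent with m(0,0) = e: they would contribute only the neutral element.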

Algorithms and Data Structures • The authors present three approaches that differ in capabilities and implementation complexity: simple sorted arrays, tries and generalized suffix trees. • Sorted arrays are simple but limited in capabilities. • Tries are more involved, yet they do not cover all embedding languages. • Generalized suffix trees are relatively complex and support the full range of embedding languages.

Sorted Arrays • Sorted arrays of 3-grams for x = abbaa and y = baaaab (figure shows a disjunctive comparison).
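A comparison of two sorted arrays can be sketched as a merge-style loop that advances through both arrays simultaneously. This is an illustration under assumptions, not the authors' implementation: the names are hypothetical, and the sketch uses Python's general-purpose sort, so only the comparison loop itself is linear in the array lengths.

```python
from collections import Counter

def sorted_kgrams(x, k):
    """Sorted array of (k-gram, count) pairs for x (hypothetical helper)."""
    counts = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    return sorted(counts.items())

def compare_sorted(ax, ay, inner, outer, neutral, conjunctive=False):
    """Merge-style pass over two sorted arrays, aggregating inner values."""
    i = j = 0
    s = neutral
    while i < len(ax) and j < len(ay):
        (wx, cx), (wy, cy) = ax[i], ay[j]
        if wx == wy:                      # word occurs in both sequences
            s = outer(s, inner(cx, cy))
            i += 1
            j += 1
        elif wx < wy:                     # word occurs only in x
            if not conjunctive:
                s = outer(s, inner(cx, 0))
            i += 1
        else:                             # word occurs only in y
            if not conjunctive:
                s = outer(s, inner(0, cy))
            j += 1
    if not conjunctive:                   # leftovers occur in one sequence only
        for _, cx in ax[i:]:
            s = outer(s, inner(cx, 0))
        for _, cy in ay[j:]:
            s = outer(s, inner(0, cy))
    return s

ax = sorted_kgrams("abbaa", 3)
ay = sorted_kgrams("baaaab", 3)
manhattan = compare_sorted(ax, ay, lambda a, b: abs(a - b), lambda a, b: a + b, 0)
kernel = compare_sorted(ax, ay, lambda a, b: a * b, lambda a, b: a + b, 0, conjunctive=True)
```

In the conjunctive case, words found in only one array are skipped, so the loop can stop as soon as either array is exhausted.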

Tries • Tries of 3-grams for x = abbaa and y = baaaab (figure labels: root = nil; word).
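The trie-based comparison can be sketched as a parallel traversal of two tries whose depth-k nodes store k-gram counts. This is a simplified illustration with hypothetical names (`Node`, `build_trie`, `compare_tries`); it fixes the outer operator to addition and handles the disjunctive case, which are assumptions of the sketch.

```python
class Node:
    """Trie node: children by symbol, plus the count of the word ending here."""
    def __init__(self):
        self.children = {}
        self.count = 0

def build_trie(x, k):
    """Insert every k-gram of x into a trie; the root carries no symbol."""
    root = Node()
    for i in range(len(x) - k + 1):
        node = root
        for ch in x[i:i + k]:
            node = node.children.setdefault(ch, Node())
        node.count += 1
    return root

def compare_tries(nx, ny, inner):
    """Sum inner(count_x, count_y) over all words stored in either trie."""
    cx = nx.count if nx else 0
    cy = ny.count if ny else 0
    total = inner(cx, cy) if (cx or cy) else 0
    symbols = set(nx.children if nx else ()) | set(ny.children if ny else ())
    for ch in symbols:
        child_x = nx.children.get(ch) if nx else None
        child_y = ny.children.get(ch) if ny else None
        total += compare_tries(child_x, child_y, inner)
    return total

tx = build_trie("abbaa", 3)
ty = build_trie("baaaab", 3)
manhattan = compare_tries(tx, ty, lambda a, b: abs(a - b))
```

A conjunctive variant would descend only into symbols present in both nodes, pruning every subtree that exists in just one trie.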

Generalized Suffix Trees • Generalized suffix tree for x = abbaa$1 and y = baaaab$2 (nodes annotated with occ(w,x), occ(w,y)).

Generalized Suffix Trees (cont.) Construct the tree

Run-time Experiments • Embedding language: bag-of-words (textual data).

Run-time Experiments • Embedding language: k-grams (all data sets).

Applications • Unsupervised text categorization. better

Applications • Network intrusion detection.

Applications • Transcription start site recognition.

Conclusions • The framework for comparison of sequences proposed in this article provides means for efficient computation of a large variety of similarity measures, including kernels, distances and non-metric similarity coefficients. • As realizations of the framework, it provides linear-time algorithms of different complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures. • Sorted arrays are the most efficient but the most limited in applicability. • Generalized suffix trees can handle unrestricted embedding languages but at a higher cost.

Comments • Advantage • Practical for these domains • Drawback • Unclear presentation, too many references • Application • …
