
Linear-Time Computation of Similarity Measures for Sequential Data

This research paper discusses the development of a general framework for pairwise comparison of sequences with linear-time algorithms using different data structures. It covers embedding sequences using formal languages, weighting schemes, similarity measures, algorithms, and data structures, with examples of sorted arrays, tries, and generalized suffix trees. Applications include unsupervised text categorization, network intrusion detection, and transcription start site recognition. The framework enables the efficient computation of various similarity measures, including kernels, distances, and non-metric coefficients.

ghenson




Presentation Transcript


  1. Linear-Time Computation of Similarity Measures for Sequential Data • Presenter: Cheng-Feng Weng • Authors: Konrad Rieck and Pavel Laskov • 2008/09/11 • ML.26 (2008)

  2. Outline • Introduction • Motivation • Objective • Methods • Experimental results • Conclusion • Comments

  3. Introduction • Sequential data is a fundamental data representation in computer science. • Applications range from search engines and document ranking, through gene finding and prediction of protein functions, to network surveillance tools and anti-virus programs. • Providing an interface to sequential data is therefore an essential prerequisite for applications of machine learning in these domains. • Example DNA sequence: …ATGCAACTAAT….

  4. Motivation • Most learning algorithms impose a much looser constraint on the type of data that can be handled. • A powerful abstraction between algorithms and data representations must therefore be established. • Numerous applications exist in which relationships are defined as metric or non-metric distances serving as similarity measures. • It is imperative to address pairwise comparison of objects in the most general setup.

  5. Objective • The aim of this contribution is to develop a general framework for pairwise comparison of sequences. • The authors focus on algorithms with linear-time asymptotic complexity in the sequence lengths. • The framework also provides linear-time algorithms of differing complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures.

  6. Embedding Sequences using a Formal Language • The authors focus on three definitions of the embedding language L. • Bag-of-Words: L = Dictionary (explicit), L = (A \ D)∗ (implicit). • Examples: "This is a book." → this, is, a, book; ADDACCTACA → ADD, ACCT, AC, A.

  7. Embedding Sequences using a Formal Language (cont.) • K-grams: abbaac (k=4) → abba, bbaa, baac • Contiguous sequences: abbac, abbbbbbbad

  8. The embedding function • Given an embedding language L, a sequence x can be mapped into the |L|-dimensional feature space by calculating a function φw(x) for every w ∈ L appearing in x. • Formula annotations: a weight; frequency, probability or binary flag.
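The embedding described on slides 7 and 8 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `kgrams` and `embed` are hypothetical, the embedding language is assumed to be k-grams, and every word weight is assumed to be 1 (no weighting scheme is applied).

```python
from collections import Counter


def kgrams(x, k):
    """Return all contiguous substrings of length k in x, in order of occurrence."""
    return [x[i:i + k] for i in range(len(x) - k + 1)]


def embed(x, k, binary=False):
    """Map x to a sparse feature vector {k-gram: value}.

    Only k-grams that actually occur in x are stored, so the vector has at
    most len(x) - k + 1 entries even though the feature space is |L|-dimensional.
    Assumption: all word weights are 1.
    """
    counts = Counter(kgrams(x, k))
    if binary:
        return {w: 1 for w in counts}   # binary flag: word present or not
    return dict(counts)                 # frequency: number of occurrences


print(kgrams("abbaac", 4))   # ['abba', 'bbaa', 'baac']
print(embed("baaaab", 3))    # {'baa': 1, 'aaa': 2, 'aab': 1}
```

Storing the vector sparsely is what makes a comparison in time proportional to the sequence lengths plausible: words absent from both sequences are never visited.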

  9. Weighting schemes • The following three weighting schemes for defining W have been proposed in previous research: • Corpus dependent weighting: • Length dependent weighting: • Position dependent weighting: Decay factor 0 ≤ λ ≤ 1

  10. A Generic Framework for Similarity Measures • All of the similarity measures share a similar mathematical construction: • an inner component-wise function is aggregated over each dimension using an outer operator. • Formula annotations: inner function, outer operator.

  11. A Generic Framework for Similarity Measures (cont.) • Unified formulation of similarity measures:
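The formula itself did not survive in this transcript. Using the notation of the surrounding slides (embedding function φw, inner function m, outer operator ⊕), the unified form described on slide 10 can plausibly be written as follows; the three instances are standard textbook measures given for illustration, not a list taken from the slides.

```latex
s(x, y) = \bigoplus_{w \in L} m\bigl(\phi_w(x), \phi_w(y)\bigr)

% Illustrative instances:
% linear kernel:       \oplus = +,     m(a, b) = a \cdot b
% Manhattan distance:  \oplus = +,     m(a, b) = |a - b|
% Chebyshev distance:  \oplus = \max,  m(a, b) = |a - b|
```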

  12. A Generic Framework for Similarity Measures (cont.) • Define m(0,0) = e, where e is the neutral element for the operator ⊕. • Conjunctive similarity measures: • Disjunctive similarity measures:
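The inner-function/outer-operator construction can be sketched generically. The function name `similarity` and the dictionary representation of the feature vectors are assumptions made for illustration; words missing from both vectors are skipped, which is valid because m(0,0) = e contributes nothing under ⊕.

```python
from functools import reduce
import operator


def similarity(phi_x, phi_y, inner, outer, neutral):
    """Aggregate inner(phi_x[w], phi_y[w]) with outer over all words present
    in at least one of the two sparse vectors, starting from the neutral element."""
    words = set(phi_x) | set(phi_y)
    terms = (inner(phi_x.get(w, 0), phi_y.get(w, 0)) for w in words)
    return reduce(outer, terms, neutral)


# 3-gram frequency vectors for x = abbaa and y = baaaab
phi_x = {"abb": 1, "bba": 1, "baa": 1}
phi_y = {"baa": 1, "aaa": 2, "aab": 1}

# Conjunctive example: linear kernel (only words shared by both contribute)
print(similarity(phi_x, phi_y, operator.mul, operator.add, 0))                 # 1
# Disjunctive examples: Manhattan and Chebyshev distances
print(similarity(phi_x, phi_y, lambda a, b: abs(a - b), operator.add, 0))      # 5
print(similarity(phi_x, phi_y, lambda a, b: abs(a - b), max, 0))               # 2
```

For the linear kernel, m(a,0) = m(0,b) = 0 = e, so only shared words matter; for the distances, a word occurring in just one sequence still contributes, which is why the disjunctive case must visit the union of words.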

  13. Algorithms and Data Structures • The authors present three approaches that differ in capabilities and implementation complexity: simple sorted arrays, tries and generalized suffix trees. • Sorted arrays are simple but limited in capabilities. • Tries are more involved, yet they do not cover all embedding languages. • Generalized suffix trees are relatively complex and support the full range of embedding languages.

  14. Sorted Arrays • Figure: sorted arrays of 3-grams for x = abbaa and y = baaaab (disjunctive case).
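A comparison over sorted arrays can be sketched as a merge-style loop over two sorted lists of (word, count) pairs. The names `sorted_array` and `compare_sorted` are hypothetical, and Python's built-in comparison sort is used for brevity, so the construction step in this sketch is not linear-time; only the comparison loop runs in time linear in the array sizes.

```python
import operator


def sorted_array(x, k):
    """Sorted list of (k-gram, count) pairs for sequence x."""
    counts = {}
    for i in range(len(x) - k + 1):
        w = x[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return sorted(counts.items())


def compare_sorted(ax, ay, inner, outer, neutral, disjunctive=True):
    """Merge-style pass over two sorted arrays.

    Conjunctive measures only touch words present in both arrays;
    disjunctive measures also account for words present in just one.
    """
    i = j = 0
    s = neutral
    while i < len(ax) and j < len(ay):
        (wx, cx), (wy, cy) = ax[i], ay[j]
        if wx == wy:                      # word occurs in both sequences
            s = outer(s, inner(cx, cy))
            i += 1
            j += 1
        elif wx < wy:                     # word occurs only in x
            if disjunctive:
                s = outer(s, inner(cx, 0))
            i += 1
        else:                             # word occurs only in y
            if disjunctive:
                s = outer(s, inner(0, cy))
            j += 1
    if disjunctive:                       # leftovers of the longer array
        for _, cx in ax[i:]:
            s = outer(s, inner(cx, 0))
        for _, cy in ay[j:]:
            s = outer(s, inner(0, cy))
    return s


ax = sorted_array("abbaa", 3)    # [('abb', 1), ('baa', 1), ('bba', 1)]
ay = sorted_array("baaaab", 3)   # [('aaa', 2), ('aab', 1), ('baa', 1)]
print(compare_sorted(ax, ay, lambda a, b: abs(a - b), operator.add, 0))           # 5
print(compare_sorted(ax, ay, operator.mul, operator.add, 0, disjunctive=False))   # 1
```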

  15. Tries • Figure: tries of 3-grams for x = abbaa and y = baaaab (annotations: root = nil; word).
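The trie-based comparison can be sketched as a parallel traversal of two tries. The class and function names (`Node`, `build_trie`, `compare_tries`) are hypothetical; the sketch assumes k-grams as the embedding language, stores the occurrence count at the node where a word ends, and requires the outer operator to be associative with the given neutral element.

```python
import operator


class Node:
    def __init__(self):
        self.children = {}   # symbol -> child Node
        self.count = 0       # occurrences of the word ending at this node


def build_trie(x, k):
    """Insert every k-gram of x into a trie; O(len(x) * k) node visits."""
    root = Node()
    for i in range(len(x) - k + 1):
        node = root
        for ch in x[i:i + k]:
            node = node.children.setdefault(ch, Node())
        node.count += 1
    return root


def compare_tries(tx, ty, inner, outer, neutral, disjunctive=True):
    """Traverse both tries in parallel; tx or ty may be None for a missing subtree."""
    s = neutral
    cx = tx.count if tx else 0
    cy = ty.count if ty else 0
    if cx or cy:                                  # a word ends at this node
        s = outer(s, inner(cx, cy))
    kx = tx.children if tx else {}
    ky = ty.children if ty else {}
    # Disjunctive measures follow every branch, conjunctive ones only shared branches.
    keys = (set(kx) | set(ky)) if disjunctive else (set(kx) & set(ky))
    for ch in keys:
        sub = compare_tries(kx.get(ch), ky.get(ch), inner, outer, neutral, disjunctive)
        s = outer(s, sub)
    return s


tx = build_trie("abbaa", 3)
ty = build_trie("baaaab", 3)
print(compare_tries(tx, ty, lambda a, b: abs(a - b), operator.add, 0))           # 5
print(compare_tries(tx, ty, operator.mul, operator.add, 0, disjunctive=False))   # 1
```

In the conjunctive case, whole subtrees that exist in only one trie are never entered, which is one reason a trie can be cheaper than a flat scan when the sequences share few prefixes.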

  16. Generalized Suffix Trees • Figure: generalized suffix tree for x = abbaa$1 and y = baaaab$2 (nodes annotated with occ(w,x), occ(w,y)).

  17. Generalized Suffix Trees (cont.) • Constructing the tree.

  18. Run-time Experiments • Embedding language: bag-of-words (textual data).

  19. Run-time Experiments (cont.) • Embedding language: k-grams (all data sets).

  20. Applications • Unsupervised text categorization. better

  21. Applications • Network intrusion detection.

  22. Applications • Transcription start site recognition.

  23. Conclusions • The framework for comparison of sequences proposed in this article provides means for efficient computation of a large variety of similarity measures, including kernels, distances and non-metric similarity coefficients. • As realizations of the framework, it provides linear-time algorithms of differing complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. • Sorted arrays are the most efficient but the most limited in applicability. • Generalized suffix trees can handle unrestricted embedding languages, but at a higher cost.

  24. Comments • Advantage • Practical for these domains • Drawback • Unclear in places, too many references • Application • …
