This introduction surveys sequential pattern mining in two settings, multiple data sequences and a single long data sequence, together with the main challenges. It covers the frameworks, algorithms, and applications of this data mining method in various fields, then turns to technical topics and ongoing and future research directions.
A Short Introduction to Sequential Data Mining • Koji IWANUMA and Hidetomo NABESHIMA, University of Yamanashi • The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence
Two Main Frameworks of Sequential Mining • Sequential pattern mining for multiple data sequences • Sequential pattern mining for a single data sequence
What Is Sequential Pattern Mining? • Given a set of sequences, find the complete set of frequent subsequences. • A sequence: <(ef)(ab)(df)cb>. A sequence database is a set of such sequences. • An element may contain a set of items. Items within an element are unordered, and we list them alphabetically. • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>. • Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern. J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj
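The subsequence relation and the support count on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited textbook: the helper names `parse`, `is_subsequence`, and `support` are hypothetical, items are assumed to be single characters, and the database holds only the two sequences quoted on the slide (the slide's full database table is not reproduced in this text).

```python
from typing import FrozenSet, List

Seq = List[FrozenSet[str]]  # a sequence is an ordered list of elements (item sets)


def parse(text: str) -> Seq:
    """Parse the slide's notation, e.g. 'a(abc)(ac)d(cf)', into a list of item sets."""
    text = text.replace(" ", "")
    elements: Seq = []
    i = 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))  # items inside one element
            i = j + 1
        else:
            elements.append(frozenset(text[i]))  # a bare item is a one-item element
            i += 1
    return elements


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """True if each pattern element is a subset of a distinct, later element of `sequence`."""
    pos = 0
    for element in pattern:
        # Greedy leftmost matching is enough to decide containment.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern: Seq, database: List[Seq]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


# Assumption: only the two sequences quoted on the slide.
database = [parse("a(abc)(ac)d(cf)"), parse("(ef)(ab)(df)cb")]

print(is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)")))  # True
print(support(parse("(ab)c"), database))  # 2
```

With min_sup = 2, the second call shows why <(ab)c> qualifies as a sequential pattern even in this reduced database: both sequences contain it.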
Challenges in Sequential Pattern Mining • A huge number of possible sequential patterns are hidden in databases • A mining algorithm should • find, when possible, the complete set of patterns satisfying the minimum support (frequency) threshold • be highly efficient and scalable, involving only a small number of database scans • be able to incorporate various kinds of user-specific constraints J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj
Sequential Pattern Mining Algorithms for Multiple Data Sequences • Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96) • Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01) • Vertical format-based mining: SPADE (Zaki @ Machine Learning’00) • Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02) • Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03) J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj
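To give a feel for the pattern-growth idea behind PrefixSpan, here is a minimal sketch under strong simplifying assumptions: every element holds exactly one item, so a sequence is a plain string, and the toy database is hypothetical. It is not the published algorithm in full (no item-set elements, no pseudo-projection optimizations); the function name `prefixspan` is only a label for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prefixspan(database: List[str], min_sup: int) -> Dict[str, int]:
    """Return each frequent subsequence with its support (number of containing sequences)."""
    results: Dict[str, int] = {}

    def grow(prefix: str, projected: List[Tuple[int, int]]) -> None:
        # `projected` lists (sequence id, offset): the suffix of each sequence
        # that remains after the leftmost match of `prefix`.
        next_offset: Dict[str, Dict[int, int]] = defaultdict(dict)
        for sid, start in projected:
            seq = database[sid]
            for pos in range(start, len(seq)):
                # Keep only the first occurrence of each item in this suffix.
                next_offset[seq[pos]].setdefault(sid, pos + 1)
        for item in sorted(next_offset):
            occurrences = next_offset[item]
            if len(occurrences) >= min_sup:  # extend the prefix only with frequent items
                pattern = prefix + item
                results[pattern] = len(occurrences)
                grow(pattern, list(occurrences.items()))

    grow("", [(sid, 0) for sid in range(len(database))])
    return results


# Hypothetical toy database of three single-item sequences.
patterns = prefixspan(["abcb", "acb", "bca"], min_sup=2)
print(patterns)
```

The recursion never generates a candidate that is absent from the projected suffixes, which is the main contrast with the level-wise candidate generation of an Apriori-based method such as GSP.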
Mining Sequential Patterns from a Very Long Single Sequence • Input: a series of daily newspaper articles, viewed as one long sequence < ... typhoon (flood, landslide) ... typhoon (flood, landslide) ... > • Extracted pattern: <typhoon(flood,landslide)>
Sequential Pattern Mining Algorithms for a Single Data Sequence • Discovery of frequent episodes in event sequences, based on a sliding window [Mannila 1998]: • The frequency measure is anti-monotonic, but it has a problem: the same occurrence is counted more than once (duplicate counting). • Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]: • No anti-monotonic frequency measure has been investigated. • On-line approximation algorithms for mining frequent items, not frequent subsequences: • Lossy counting algorithm [Manku and Motwani, VLDB’02]
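The duplicate-counting problem of a window-based frequency can be seen in a small sketch. The assumptions here are deliberate simplifications: events sit at consecutive integer positions, only windows lying fully inside the sequence are counted, and an episode is a serial (ordered) list of events. The names `occurs_in` and `window_count` are hypothetical, and the definition is a simplified variant rather than the exact one in the cited work.

```python
from typing import Sequence


def occurs_in(episode: Sequence[str], window: Sequence[str]) -> bool:
    """True if the events of `episode` appear in `window` in the same order (gaps allowed)."""
    remaining = iter(window)
    # `event in remaining` consumes the iterator up to the first match,
    # so later events must be found after earlier ones.
    return all(event in remaining for event in episode)


def window_count(episode: Sequence[str], events: Sequence[str], width: int) -> int:
    """Number of windows of `width` consecutive events that contain `episode`."""
    return sum(
        occurs_in(episode, events[start:start + width])
        for start in range(len(events) - width + 1)
    )


events = "xxabxx"  # the episode "ab" occurs exactly once
print(window_count("ab", events, width=4))  # 3
```

The single occurrence of "ab" is seen by three overlapping windows ("xxab", "xabx", "abxx"), so the count overstates how often the episode happens. The measure is still anti-monotonic, because any window containing an episode also contains each of its sub-episodes.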
Research in Our Laboratory • Sequential data mining from a very large single data sequence. • Main target: sequential textual data, especially corpora of newspaper articles • Objective: to generate a robust and useful large-scale corpus of event sequences. • Application 1: topic tracking/detection in information retrieval. • Application 2: automated content tracking on the Web. • Application 3: semi-automatic creation of scenarios/stories • Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.
Technical Topics (1/2) • A new framework for extracting frequent subsequences from a single long data sequence, in IEEE Inter. Conf. on Data Mining 2005 (ICDM2005): • A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting. • A fast on-line algorithm for some limited cases
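For reference, the Apriori (anti-monotonic) property named on this slide can be written as follows. The notation is an assumption made for this note, not taken from the ICDM 2005 paper: freq stands for any frequency measure, and α ⊑ β means that α is a subsequence of β.

```latex
\alpha \sqsubseteq \beta \;\Longrightarrow\; \mathrm{freq}(\alpha) \ge \mathrm{freq}(\beta)
```

Equivalently, if freq(α) falls below the minimum threshold, no supersequence of α can be frequent, so a level-wise search may prune all of them without counting.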
Technical Topics (2/2) Ongoing and future work • On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from the system output • Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system • A method using compression based on context-free grammar inference/learning • A faster extraction algorithm based on a method for simultaneously searching for multiple strings over compressed data.
References: • Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj