
One Small Step for You, One Giant Leap for Me

One Small Step for You, One Giant Leap for Me. Jen-Wei Huang 黃仁暐 jwhuang@gmail.com National Taiwan University.


Presentation Transcript


  1. One Small Step for You, One Giant Leap for Me Jen-Wei Huang 黃仁暐 jwhuang@gmail.com National Taiwan University

  2. Jen-Wei Huang

  3. * http://www.wretch.cc/blog/EtudeBIKE Jen-Wei Huang

  4. * http://www.giant-bicycles.com/zh-TW/ Jen-Wei Huang

  5. Jen-Wei Huang

  6. Jen-Wei Huang

  7. * http://cape7.pixnet.net/blog Jen-Wei Huang

  8. * http://cape7.pixnet.net/blog Jen-Wei Huang

  9. * http://cape7.pixnet.net/blog Jen-Wei Huang

  10. * http://www.wretch.cc/blog/orzboyz * http://blog.sina.com.tw/9winds/ * http://atomcinema.pixnet.net/blog Jen-Wei Huang

  11. Jen-Wei Huang

  12. * http://www.amazon.com Jen-Wei Huang

  13. * http://www.amazon.com Jen-Wei Huang

  14. * http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html Jen-Wei Huang

  15. A General Model for Sequential Pattern Mining with a Progressive Database Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen National Taiwan University * IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008

  16. Outline • Introduction • Preliminaries • Algorithm Pisa • Experiments • Conclusions • Q & A

  17. Introduction to Sequential Pattern Mining (SPM) • “Mining of frequently occurring patterns related to time or other sequences.” • J. Han, Data Mining: Concepts and Techniques • “Given a set of sequences, find the complete set of frequent subsequences” • J. Pei, PrefixSpan • Example: predicting which items a customer will buy next, given the items he or she has already bought.

  18. Time-related data • Customers’ buying behavior • Natural phenomena • Sensor network data • Web access patterns • Stock price changes • DNA sequence applications

  19. Definition • Let I = {x1, x2, ..., xn} be a set of distinct items. • An element e, denoted by (xi xj ...), is a subset of I whose items appear in a sequence at the same time. • A sequence s, denoted by < e1, e2, ..., em >, is an ordered list of elements. • A sequence database Db contains a set of sequences, and |Db| represents the number of sequences in Db.

  20. Definition • A sequence α = < a1, a2, ..., an > is a subsequence of another sequence β = < b1, b2, ..., bm > if • there exist integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 ⊆ b_{i1}, a2 ⊆ b_{i2}, ..., and an ⊆ b_{in}.
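The subsequence definition above can be checked mechanically. The following is a minimal sketch, assuming elements are represented as Python frozensets and a sequence as a list of elements; the names `is_subsequence` and `seq` are illustrative and do not come from the slides. It matches each element of α greedily against the earliest element of β that contains it, which is sufficient because an earlier match never rules out a later one.

```python
from typing import FrozenSet, List

Element = FrozenSet[int]
Sequence = List[Element]


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Return True if alpha is a subsequence of beta (a_k is a subset of b_{i_k}, indices increasing)."""
    j = 0  # next position in beta that may still be used
    for a in alpha:
        # Advance to the earliest remaining element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # indices must be strictly increasing
    return True


def seq(*elements):
    """Hypothetical helper: build a sequence from plain sets."""
    return [frozenset(e) for e in elements]


beta = seq({30}, {40, 70}, {90})
print(is_subsequence(seq({30}, {90}), beta))   # 30, later 90
print(is_subsequence(seq({40}, {30}), beta))   # wrong order
print(is_subsequence(seq({40}, {70}), beta))   # 40 and 70 occur together, not one after the other
```

The last call illustrates that items bought at the same time (one element) do not satisfy a pattern that requires them at two different times.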

  21. Definition • Sequential pattern mining can be defined as follows: • "Given a sequence database, Db, and a user-defined minimum support, min_sup, find the complete set of subsequences whose occurrence frequencies ≥ min_sup ∗ |Db|."
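To make the threshold min_sup ∗ |Db| concrete, here is a small sketch that counts how many sequences of a database contain a candidate pattern. It uses the five customer sequences C01 to C05 that appear in the transformation-phase slide of this transcript; the function names and the chosen min_sup values are assumptions made for illustration.

```python
def is_subsequence(alpha, beta):
    """Greedy check that alpha is a subsequence of beta (elements are sets)."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support_count(pattern, db):
    """Number of sequences in db that contain pattern."""
    return sum(1 for s in db.values() if is_subsequence(pattern, s))


def is_frequent(pattern, db, min_sup):
    """A pattern is frequent if its occurrence frequency >= min_sup * |Db|."""
    return support_count(pattern, db) >= min_sup * len(db)


db = {
    "C01": [{1, 5}, {2}, {3}, {4}],
    "C02": [{1}, {3}, {4}, {3, 5}],
    "C03": [{1}, {2}, {3}, {4}],
    "C04": [{1}, {3}, {5}],
    "C05": [{4}, {5}],
}

pattern = [{1}, {2}, {3}, {4}]
print(support_count(pattern, db))      # sequences containing <(1) (2) (3) (4)>
print(is_frequent(pattern, db, 0.4))   # threshold 0.4 * 5 = 2.0
print(is_frequent(pattern, db, 0.5))   # threshold 0.5 * 5 = 2.5
```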

  22. Three Categories • Depending on how the underlying database is managed, sequential pattern mining can be divided into three categories: mining with • a static database. • an incremental database. • a progressive database.

  23. How To Do Sequential Pattern Mining on a Static Database An Overview

  24. How? • Apriori-like algorithms • AprioriAll – by Agrawal et al. • GSP – by R. Srikant et al. • Partition-based algorithms • FreeSpan – by J. Han et al. • PrefixSpan – by J. Pei et al. • Vertical format algorithms • SPADE – by Zaki et al. • SPAM – by Ayres et al.

  25. Apriori-like Algorithms • 1. Sort phase • Sort the database • Customer id as the primary key and time as the secondary key • 2. Litemset phase • Count the support of each itemset • The support is the fraction of customers who bought the itemset
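The sort and litemset phases can be sketched as follows. The transaction table is made up for illustration (it is not the data behind the slides), as are the function names `sort_phase` and `litemset_phase`. The litemset phase here enumerates every non-empty subset of each transaction by brute force, which is only reasonable for tiny examples; each customer is counted at most once per itemset.

```python
from collections import defaultdict
from itertools import combinations


def sort_phase(transactions):
    """Sort by (customer id, time) and group transactions into one sequence per customer."""
    ordered = sorted(transactions, key=lambda t: (t[0], t[1]))
    sequences = defaultdict(list)
    for cid, _time, items in ordered:
        sequences[cid].append(frozenset(items))
    return dict(sequences)


def litemset_phase(sequences, min_sup):
    """Return {itemset: support} for itemsets bought by at least min_sup of the customers."""
    buyers = defaultdict(set)  # itemset -> customers who bought it in a single transaction
    for cid, sequence in sequences.items():
        for element in sequence:
            for k in range(1, len(element) + 1):
                for combo in combinations(sorted(element), k):
                    buyers[frozenset(combo)].add(cid)
    n = len(sequences)
    return {s: len(c) / n for s, c in buyers.items() if len(c) >= min_sup * n}


# Hypothetical (customer id, time, items) rows, deliberately unsorted.
transactions = [
    (2, 3, {40, 70}), (1, 1, {30}), (2, 1, {30}),
    (1, 2, {90}), (3, 1, {30, 70}), (4, 1, {90}),
]

sequences = sort_phase(transactions)
litemsets = litemset_phase(sequences, min_sup=0.5)
for itemset, support in sorted(litemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```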

  26. Apriori-like Algorithms • 3. Transformation phase • Transform each transaction into the set of all litemsets it contains, in the form of C01: <(1,5) (2) (3) (4)> C02: <(1) (3) (4) (3,5)> C03: <(1) (2) (3) (4)> C04: <(1) (3) (5)> C05: <(4) (5)>

  27. Jen-Wei Huang

  28. Jen-Wei Huang

  29. Apriori-like Algorithms • 4. Mining phase • Apriori-like algorithm • 5. Maximal phase • Find the maximal patterns

  30. Jen-Wei Huang

  31. Therefore, the frequent sequential patterns are: <1 2> <3 4> <3 5> <3 6> <3 7> <4 6> <5 6> <7 6> <3 4 6> <3 5 6> <3 7 6>. According to the mappings, the original frequent sequential patterns are: <10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>.

  32. According to the mappings, the original frequent sequential patterns are: <10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>. Because <30 40> and <30 70> are contained by <30 {40 70}>; <40 90> and <70 90> are contained by <{40 70} 90>; <30 40 90> and <30 70 90> are contained by <30 {40 70} 90>; and <30 90>, <30 {40 70}>, and <{40 70} 90> are themselves contained by <30 {40 70} 90>, the final maximal sequential patterns are: <10 20> <30 {40 70} 90>.
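The maximal phase can be sketched with the subsequence definition given earlier: a frequent pattern is maximal when no other frequent pattern contains it. The code below is a brute-force illustration over the eleven original patterns listed above, not the AprioriAll implementation; `maximal_patterns` and `P` are hypothetical names.

```python
def is_subsequence(alpha, beta):
    """Greedy check that alpha is a subsequence of beta (elements are frozensets)."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def maximal_patterns(patterns):
    """Keep only the patterns that are not contained in any other pattern."""
    return [
        p for p in patterns
        if not any(p != q and is_subsequence(p, q) for q in patterns)
    ]


def P(*elements):
    """Hypothetical helper: build a pattern from plain sets."""
    return [frozenset(e) for e in elements]


frequent = [
    P({10}, {20}), P({30}, {40}), P({30}, {70}), P({30}, {90}),
    P({30}, {40, 70}), P({40}, {90}), P({70}, {90}), P({40, 70}, {90}),
    P({30}, {40}, {90}), P({30}, {70}, {90}), P({30}, {40, 70}, {90}),
]

for pattern in maximal_patterns(frequent):
    print([sorted(e) for e in pattern])
```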

  33. Related Works • Static database • AprioriAll – by Agrawal et al. • GSP – by R. Srikant et al. • SPADE – by Zaki et al. • FreeSpan – by J. Han et al. • PrefixSpan – by J. Pei et al. • SPAM – by Ayres et al.

  34. Related Works • Incremental database • ISM – by Parthasarathy et al. • IncSP – by Lin et al. • ISE – by Masseglia et al. • IncSpan – by Cheng et al. • MILE – by Chen et al.

  35. Motivation • The assumption of having a static database may not hold in practice. • Real-world data change on the fly. • Sequential patterns found in an incremental database may be of little interest to users. • Users are usually more interested in recent data than in old data.

  36. Motivation • In an incremental database, a sequence that receives no newly arriving elements still stays in the database and undesirably contributes to |Db|. • As a result, new sequential patterns that appear frequently in recent sequences may not be recognized as frequent sequential patterns.

  37. Definition -- Period of Interest • The Period of Interest (abbreviated as POI) is a sliding window • whose length is a user-specified time interval, • and which advances continuously as time goes by. • Only the sequences having elements whose timestamps fall into the POI contribute to |Db| for the current sequential patterns.
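As a rough illustration of how the POI determines |Db|, the sketch below restricts a timestamped database to one window and counts the sequences that still have at least one element inside it. The data and the function name `sub_database` are invented for this example; windows are written Db p,q with q = p + POI − 1, following the notation used in the transcript.

```python
def sub_database(db, start, poi):
    """Return the part of db that falls into the window [start, start + poi - 1].

    db maps a sequence id to a list of (timestamp, element) pairs.
    Sequences with no element inside the window are left out, so they
    do not contribute to |Db| for this POI.
    """
    end = start + poi - 1
    window = {}
    for sid, sequence in db.items():
        inside = [(t, e) for t, e in sequence if start <= t <= end]
        if inside:
            window[sid] = inside
    return window


# Hypothetical timestamped sequences (not the data from the slides).
db = {
    "S01": [(1, {"A"}), (2, {"B"})],
    "S02": [(3, {"C"}), (7, {"A"})],
    "S03": [(8, {"B"})],
}

for start in (1, 3, 4):
    window = sub_database(db, start, poi=5)
    print(f"Db{start},{start + 4}: |Db| = {len(window)}, sequences = {sorted(window)}")
```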

  38. [Figure: a progressive sequence database with sequences S01 to S06 (SID axis) over timestamps t1 to t10 (time axis), and the sub-databases Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, and Db6,10 obtained with POI = 5 and min_supp = 0.5.]

  39. Outline • Introduction • Preliminaries • Algorithm Pisa • Experiments • Conclusions • Q & A

  40. Progressive Sequential Pattern • The progressive sequential pattern mining problem is defined as follows: • "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."

  41. Naïve Algorithm • Use conventional static sequential pattern mining algorithms to mine sequential patterns separately from every POI • e.g., Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, etc. • For a sequence database whose elements span an interval of n timestamps, the total number of POIs in this interval is equal to (n − POI + 1).
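The window count (n − POI + 1) is easy to verify by enumeration. The sketch below lists the windows and shows the shape of the naive loop; `mine_static` is a placeholder argument standing in for any static miner (GSP, PrefixSpan, and so on) and is an assumption of this example, not an API from the slides.

```python
def poi_windows(n, poi):
    """Enumerate the (start, end) timestamp pairs of every POI within n timestamps."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


def naive_progressive_mining(db, n, poi, mine_static):
    """Re-mine every window from scratch with a static algorithm.

    db maps a sequence id to a list of (timestamp, element) pairs;
    mine_static is any function that takes a sub-database and returns patterns.
    """
    results = {}
    for start, end in poi_windows(n, poi):
        window = {
            sid: [(t, e) for t, e in seq if start <= t <= end]
            for sid, seq in db.items()
        }
        window = {sid: seq for sid, seq in window.items() if seq}
        results[(start, end)] = mine_static(window)
    return results


windows = poi_windows(n=9, poi=5)
print(windows)        # Db1,5 through Db5,9
print(len(windows))   # n - POI + 1

# Toy run with a stand-in "miner" that only reports |Db| of each window.
toy_db = {"S01": [(1, {"A"})], "S02": [(6, {"B"})]}
print(naive_progressive_mining(toy_db, n=9, poi=5, mine_static=len))
```

Every window is mined independently, so work done for Db1,5 is not reused for Db2,6 even though the two windows share four of their five timestamps.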

  42. Prior Work • The only prior works on progressive databases are GSP+ and MFS+, proposed by Zhang and based on the static algorithms GSP and MFS (MFS being derived by the same authors). • However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS. • Consequently, the performance improvement of GSP+ and MFS+ over GSP and MFS is within 15%, as reported by their authors.

  43. Algorithm DirApp • DirApp stands for Direct Append. • It consists of two procedures: • Progressively Updating • abbreviated as PrUp • Immediately Filtering • abbreviated as ImFi

  44. Procedure PrUp • When progressively reading newly incoming elements, Procedure PrUp can • update each sequence in the sequence database • generate candidate sequential patterns • calculate occurrence frequencies of all candidate sequential patterns in the current POI.

  45. Procedure ImFi • DirApp uses Procedure ImFi to • filter out obsolete data from the existing sequence database • prune away obsolete candidate sequential patterns from the candidate set • report the most up-to-date frequent sequential patterns to the user in every POI.
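The division of labor between updating and filtering can be illustrated with a much simpler stand-in. The sketch below is not DirApp: the slides do not spell out its candidate data structure, so this version only appends new elements, drops elements that have fallen out of the POI, and recounts single items by brute force instead of maintaining candidate sequential patterns. The class name `ProgressiveDb` and all method names are hypothetical.

```python
from collections import Counter, defaultdict


class ProgressiveDb:
    """Toy sliding-window database; reports frequent single items only."""

    def __init__(self, poi, min_sup):
        self.poi = poi
        self.min_sup = min_sup
        self.sequences = defaultdict(list)  # sid -> [(timestamp, element)]

    def append(self, sid, timestamp, items):
        """Updating step: attach a newly arrived element to its sequence."""
        self.sequences[sid].append((timestamp, frozenset(items)))

    def filter_obsolete(self, now):
        """Filtering step: drop elements older than the current POI and empty sequences."""
        oldest = now - self.poi + 1
        for sid in list(self.sequences):
            kept = [(t, e) for t, e in self.sequences[sid] if t >= oldest]
            if kept:
                self.sequences[sid] = kept
            else:
                del self.sequences[sid]

    def frequent_items(self):
        """Items contained in at least min_sup * |Db| of the remaining sequences."""
        n = len(self.sequences)
        counts = Counter()
        for sequence in self.sequences.values():
            counts.update(set().union(*(e for _, e in sequence)))
        return {item for item, c in counts.items() if c >= self.min_sup * n}


db = ProgressiveDb(poi=3, min_sup=0.5)
db.append("S01", 1, {"A"})
db.append("S02", 2, {"B"})
db.append("S01", 3, {"B"})
db.filter_obsolete(now=3)
print(sorted(db.frequent_items()))

db.append("S03", 5, {"C"})
db.append("S02", 5, {"C"})
db.filter_obsolete(now=5)
print(sorted(db.frequent_items()), sorted(db.sequences))
```

Because obsolete elements are removed before counting, a sequence that has received nothing new within the POI disappears from |Db| instead of diluting the support of recent patterns.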

  46. [Figure: the example sequence database, SID versus time (t1 to t10), with one sequence shown on its own timeline.]

  47. Example [Figure: one sequence on the timeline t1 to t10.]

  48. [Figure: example steps (1) to (4) on the timeline t1 to t10.]

  49. [Figure: example steps (4) and (5) on the timeline t1 to t10.]

  50. [Figure: example steps (5) and (6) on the timeline t1 to t10.]
