
Incremental Clustering for Mining in a Data Warehousing Environment

Incremental Clustering for Mining in a Data Warehousing Environment. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu. 24th VLDB Conference, 1998. Group members: 邱建舜、韋宇航、曾學正、楊岳霖. OUTLINE: Introduction, Related Work, The Algorithm DBSCAN, IncrementalDBSCAN


Presentation Transcript


  1. Incremental Clustering for Mining in a Data Warehousing Environment. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu. 24th VLDB Conference, 1998. Group members: 邱建舜、韋宇航、曾學正、楊岳霖

  2. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

  3. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

  4. Introduction • The strategic importance of the knowledge • Data warehouse

  5. Introduction: Data Warehouse • Data • Collection from multiple sources • Integrated • Analysis • Typical results • Clusters of items • Symptoms • Updated periodically

  6. Introduction • Maintenance of derived information • Views • Summary tables • They present the first incremental clustering algorithm • Based on DBSCAN

  7. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions

  8. Related Work • Association rule • Incremental algorithm and non-incremental algorithm • Attribute-oriented generalization • Clustering algorithm

  9. Related Work: Clustering algorithm • Partitioning Algorithm • K-Means • K-Medoids • Hierarchical Algorithm • Agglomerative • Divisive • Density-Based Algorithms • DBSCAN

  10. Related Work: K-Means • K=3

  11. Related Work: K-Means

  12. Related Work: K-Medoids • K=3

  13. Related Work: Hierarchical Algorithm • [Figure labels: 1, 2, 3, 4, 5]

  14. Related Work: Density-Based Algorithms • DBSCAN

  15. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

  16. The Algorithm DBSCAN • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) • Two parameters: radius (Eps) and minimum number of objects (MinPts) • [Figure: objects A, B, C, D and the Eps-neighborhood of A; A is a core object for MinPts = 3 and a non-core object for MinPts = 4]
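The core-object test on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `region_query` and `is_core`, the Euclidean distance, and the coordinates are assumptions, not taken from the slides.

```python
from math import dist

def region_query(points, q, eps):
    """Eps-neighborhood of q: every object within distance eps of q (q itself included)."""
    return [p for p in points if dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object if its Eps-neighborhood holds at least min_pts objects."""
    return len(region_query(points, q, eps)) >= min_pts

# Hypothetical coordinates: C and D lie within Eps of A, B lies far away.
A, B, C, D = (0.0, 0.0), (3.0, 3.0), (0.8, 0.0), (-0.8, 0.0)
points = [A, B, C, D]

print(is_core(points, A, eps=1.0, min_pts=3))  # True: the neighborhood is {A, C, D}
print(is_core(points, A, eps=1.0, min_pts=4))  # False: only 3 objects in the neighborhood
```

Raising MinPts from 3 to 4 turns the same object from core into non-core, which is the contrast the figure draws.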

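  17. The Algorithm DBSCAN • Directly Density-Reachable: an object p is directly density-reachable from an object q wrt. Eps and MinPts if • p ∈ N_Eps(q), where N_Eps(q) is the subset of D contained in the Eps-neighborhood of q, and • Card(N_Eps(q)) ≥ MinPts (q is a core object) • [Figure: objects A, B, C, D with MinPts = 3; A is a core object] • C and D are directly density-reachable from core object A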
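The two conditions translate directly into a predicate. The sketch below assumes Euclidean distance and made-up coordinates; the name `directly_density_reachable` is illustrative.

```python
from math import dist

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q wrt. eps and min_pts."""
    neighborhood = [x for x in points if dist(x, q) <= eps]  # N_Eps(q)
    return p in neighborhood and len(neighborhood) >= min_pts

# Hypothetical coordinates: C and D lie within Eps of A, B lies far away.
A, B, C, D = (0.0, 0.0), (3.0, 3.0), (0.8, 0.0), (-0.8, 0.0)
points = [A, B, C, D]

print(directly_density_reachable(points, C, A, eps=1.0, min_pts=3))  # True: A is core, C is in N_Eps(A)
print(directly_density_reachable(points, A, C, eps=1.0, min_pts=3))  # False: C is not a core object
```

The second call shows that the relation is not symmetric when one of the two objects is not a core object.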

  18. The Algorithm DBSCAN • Density-Reachable • An object p is density-reachable from an object q wrt. Eps and MinPts in the set of objects D, denoted as p >D q, if there is a chain of objects p1, ..., pn, p1 = q, pn = p such that pi ∈ D and pi+1 is directly density-reachable from pi wrt. Eps and MinPts • This relation is transitive, but it is not symmetric • [Figure: objects A to G with MinPts = 3; A, E and F are core objects, G is a border object] • G >D A: G is density-reachable from A • A >D G? No: A is not density-reachable from G
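Density-reachability amounts to a graph search that only expands core objects. The following sketch uses breadth-first search; the coordinates are hypothetical and merely reproduce the roles on the slide (A, E, F core; G border).

```python
from collections import deque
from math import dist

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain q = p1, ..., pn = p exists in which every object is
    directly density-reachable from its predecessor."""
    seen, queue = {q}, deque([q])
    while queue:
        current = queue.popleft()
        hood = [x for x in points if dist(x, current) <= eps]
        if len(hood) < min_pts:
            continue  # only core objects can extend the chain
        for nxt in hood:
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical layout on a line, spacing 0.8, Eps = 1, MinPts = 3.
C, A, E, F, G = (-0.8, 0.0), (0.0, 0.0), (0.8, 0.0), (1.6, 0.0), (2.4, 0.0)
points = [C, A, E, F, G]

print(density_reachable(points, G, A, eps=1.0, min_pts=3))  # True: chain A, E, F, G
print(density_reachable(points, A, G, eps=1.0, min_pts=3))  # False: G is not a core object
```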

  19. The Algorithm DBSCAN • Density-Connected • An object p is density-connected to an object q wrt. Eps and MinPts in the set of objects D if there is an object o ∈ D such that both p and q are density-reachable from o wrt. Eps and MinPts in D • Density-connectivity is a symmetric relation • [Figure: objects A to G with MinPts = 3] • G and C are density-connected to each other via E
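A brute-force check of density-connectivity simply tries every object o as the common starting point. The sketch below is quadratic and meant only to make the definition concrete; coordinates and function names are assumptions.

```python
from collections import deque
from math import dist

def density_reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q (search that expands core objects only)."""
    seen, queue = {q}, deque([q])
    while queue:
        current = queue.popleft()
        hood = [x for x in points if dist(x, current) <= eps]
        if len(hood) < min_pts:
            continue
        for nxt in hood:
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def density_connected(points, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density-reachable."""
    return any(
        density_reachable(points, p, o, eps, min_pts)
        and density_reachable(points, q, o, eps, min_pts)
        for o in points
    )

# Hypothetical layout: C and G are border objects at the two ends, H is isolated.
C, A, E, F, G, H = (-0.8, 0.0), (0.0, 0.0), (0.8, 0.0), (1.6, 0.0), (2.4, 0.0), (10.0, 10.0)
points = [C, A, E, F, G, H]

print(density_connected(points, G, C, eps=1.0, min_pts=3))  # True, e.g. via o = E
print(density_reachable(points, G, C, eps=1.0, min_pts=3))  # False: C is not core
print(density_connected(points, G, H, eps=1.0, min_pts=3))  # False: H is isolated
```

Two border objects of the same cluster are density-connected even though neither is density-reachable from the other, which is why the cluster definition relies on this symmetric relation.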

  20. The Algorithm DBSCAN • Cluster • Let D be a set of objects. A cluster C wrt. Eps and MinPts in D is a non-empty subset of D satisfying the following conditions: • Maximality: ∀p,q ∈ D: if p ∈ C and q >D p wrt. Eps and MinPts, then also q ∈ C. • Connectivity: ∀p,q ∈ C: p is density-connected to q wrt. Eps and MinPts in D. • [Figure: objects A to H with MinPts = 3]

  21. The Algorithm DBSCAN • Noise • Let C1, ..., Ck be the clusters wrt. Eps and MinPts in D. Then the noise is defined as the set of objects in the database D not belonging to any cluster Ci, i.e. noise = {p ∈ D | ∀ i: p ∉ Ci}. • [Figure: objects A to O with MinPts = 3 and clusters C1, ..., Ck; H is a noise object]

  22. The Algorithm DBSCAN
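Putting the definitions of core object, cluster and noise together, a compact batch DBSCAN might look like the sketch below. It is an illustrative reimplementation under simplifying assumptions (linear-scan region queries instead of a spatial index, Euclidean distance, made-up data), not the code from the paper.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or NOISE."""
    def region_query(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region_query(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border object
            continue
        labels[i] = cluster_id
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            hood = region_query(j)
            if len(hood) >= min_pts:  # j is a core object: expand the cluster
                seeds.extend(hood)
        cluster_id += 1
    return labels

# Hypothetical data: two dense groups and one isolated object.
points = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5), (10, 0)]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Every object triggers at most one region query for its own expansion, so a batch run over n objects costs on the order of n region queries.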

  23. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

  24. IncrementalDBSCAN • Affected objects • Insertions • Deletions

  25. Affected objects • [Figure label: o]

  26. Affected objects

  27. [Figure labels: D, q']

  28. Insertions • Noise: • UpdSeed_Ins is empty. • Creation: • UpdSeed_Ins contains only core objects which did not belong to a cluster before the insertion of p. • Absorption: • UpdSeed_Ins contains core objects which were members of exactly one cluster C before the insertion. • Merge: • UpdSeed_Ins contains core objects which were members of several clusters before the insertion.
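The four insertion cases depend only on which clusters the seed objects belonged to before p was inserted. The sketch below assumes the seed set (called UpdSeed_Ins in the original paper) has already been computed and is passed in as a list of prior cluster labels; the function name and the use of `None` for "no cluster" are illustrative choices.

```python
NO_CLUSTER = None  # label of former noise objects and of the new object p itself

def classify_insertion(seed_labels):
    """seed_labels: the pre-insertion cluster label of each object in the seed set."""
    if not seed_labels:
        return "noise"        # p becomes a noise object
    clusters = {c for c in seed_labels if c is not NO_CLUSTER}
    if not clusters:
        return "creation"     # a new cluster is formed
    if len(clusters) == 1:
        return "absorption"   # p and any former noise objects join that cluster
    return "merge"            # all involved clusters are merged into one

print(classify_insertion([]))                  # noise
print(classify_insertion([NO_CLUSTER]))        # creation
print(classify_insertion([2, 2, NO_CLUSTER]))  # absorption
print(classify_insertion([1, 2]))              # merge
```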

  29. Insertions: Merge • Transitive: Cluster A is merged with B and C, implying the merge of B and C. • Transitive merging can occur if MinPts is larger than 5.

  30. Deletions • Removal • UpdSeed_Del is empty. • Reduction • All objects in UpdSeed_Del are directly density-reachable from each other. • Potential Split • The objects in UpdSeed_Del are not directly density-reachable from each other. • These objects belonged to exactly one cluster C before the deletion of p.
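The deletion cases can be told apart with a pairwise test on the seed set (called UpdSeed_Del in the original paper). The sketch below assumes the seed set and a directly-density-reachable predicate `ddr` are supplied by the caller; both names and the sample data are hypothetical. In the potential-split case a further connectivity check is still needed to decide whether the cluster really falls apart.

```python
from itertools import permutations
from math import dist

def classify_deletion(seed, ddr):
    """seed: objects in the seed set; ddr(a, b): True if a is directly density-reachable from b."""
    if not seed:
        return "removal"
    if all(ddr(a, b) for a, b in permutations(seed, 2)):
        return "reduction"
    return "potential split"

# Hypothetical predicate: all seed objects are assumed to be core objects,
# so direct density-reachability reduces to a distance test with Eps = 1.
def ddr(a, b):
    return dist(a, b) <= 1.0

print(classify_deletion([], ddr))                        # removal
print(classify_deletion([(0.0, 0.0), (0.5, 0.0)], ddr))  # reduction
print(classify_deletion([(0.0, 0.0), (3.0, 0.0)], ddr))  # potential split
```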

  31. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

  32. First Experiment (I) • Database type: Synthetic database • Application: Earthquake catalog • Goal: Clustering earthquake epicenters • 1,000,000 2D points • K = 40 • Noise: 21.7% • Note that the value had to be rather large due to the high percentage of noise

  33. First Experiment (II) • Other synthetic database • n varying from 100,000 to 1,000,000 • K varying from 7 to 40 • noise percentage varying from 10% up to 20%

  34. Second Experiment (I) • Database type: WWW access log database of the University of Munich • Goal: Discovering groups of similar sessions • Session structure:

  35. Second Experiment (II) • Property of : • Domain : [0 . . 1] • symmetric, the triangle inequality

  36. IncrementalDBSCAN versus DBSCAN • Comparison measure • Typically, the number of page accesses is used as a cost measure for database algorithms because the I/O time heavily dominates CPU time. In both algorithms, region queries are the only operations requiring page accesses. • Absolute performance vs. relative performance

  37. Cost • In general, a deletion affects more objects than an insertion.

  38. Speed-up Factor • The speed-up factor depends on • Database size • Number of updates • The experimental speed-up factor is larger than the analytically derived one

  39. Result (I)

  40. Result (II) • : number of updates when setting the speed-up factor to 1.0

  41. Result (III)

  42. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

  43. Conclusions • A data warehouse is updated periodically • DBSCAN is inefficient for a dynamic database • IncrementalDBSCAN is more efficient

  44. OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion

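  45. Discussion • DBSCAN requires Eps and MinPts to be set, but how can suitable values be found? • The approach suits data that needs short-cycle, time-sensitive updates, for example epidemic classification or market analysis of seasonal products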

  46. Q&A
