Incremental Clustering for Mining in a Data Warehousing Environment • Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu • 24th VLDB Conference, 1998 • Group members: 邱建舜, 韋宇航, 曾學正, 楊岳霖
OUTLINE • Introduction • Related Work • The Algorithm DBSCAN • IncrementalDBSCAN • Performance Evaluation • Conclusions • Discussion
Introduction • The strategic importance of the knowledge • Data warehouse
Introduction: Data Warehouse • Data • Collection from multiple sources • Integrated • Analysis • Typical results • Clusters of items • Symptoms • Updated periodically
Introduction • Maintenance of derived information • Views • Summary tables • The authors present the first incremental clustering algorithm • Based on DBSCAN
Related Work • Association rule • Incremental algorithm and non-incremental algorithm • Attribute-oriented generalization • Clustering algorithm
Related Work: Clustering algorithm • Partitioning Algorithm • K-Means • K-Medoids • Hierarchical Algorithm • Agglomerative • Divisive • Density-Based Algorithms • DBSCAN
Related Work: K-Means • K=3
Related Work: K-Medoids • K=3
Related Work: Hierarchical Algorithm • [Figure: hierarchical clustering of five objects labeled 1 to 5]
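The Algorithm DBSCAN • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) • Two parameters: radius (Eps) and minimum number of objects (MinPts) • An object is a core object if its Eps-neighborhood contains at least MinPts objects • [Figure: objects A, B, C, D and the Eps-neighborhood of A. With MinPts = 3, A is a core object; with MinPts = 4, A is a non-core object.]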
The Algorithm DBSCAN • Directly Density-Reachable • An object p is directly density-reachable from an object q wrt. Eps and MinPts in D if: • p ∈ N_Eps(q), where N_Eps(q) is the subset of D contained in the Eps-neighborhood of q • Card(N_Eps(q)) ≥ MinPts, i.e. q is a core object • [Figure: MinPts = 3; A is a core object, and C and D are directly density-reachable from A.]
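The two conditions above can be sketched in a few lines of Python. This is only an illustration: the function names, the Euclidean distance, and the linear-scan neighborhood search are assumptions, not taken from the slides.

```python
from math import dist  # Euclidean distance, Python 3.8+

def region_query(points, q, eps):
    """Indices of all points in the Eps-neighborhood of points[q] (q itself included)."""
    return [i for i, p in enumerate(points) if dist(p, points[q]) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object if Card(N_Eps(q)) >= MinPts."""
    return len(region_query(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in N_Eps(q) and q is a core object."""
    return p in region_query(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical toy data: three nearby points and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]
print(is_core(points, 0, eps=1.0, min_pts=3))   # point 0 has 3 neighbors
print(is_core(points, 0, eps=1.0, min_pts=4))   # not enough for MinPts = 4
print(directly_density_reachable(points, 1, 0, eps=1.0, min_pts=3))
```

Raising MinPts from 3 to 4 turns point 0 from a core object into a non-core object, mirroring the A example on the previous slide.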
The Algorithm DBSCAN • Density-Reachable • An object p is density-reachable from an object q wrt. Eps and MinPts in the set of objects D, denoted p >D q, if there is a chain of objects p1, ..., pn with p1 = q and pn = p such that pi ∈ D and pi+1 is directly density-reachable from pi wrt. Eps and MinPts • This relation is transitive, but it is not symmetric • [Figure: MinPts = 3; A, E and F are core objects, G is a border object.] • G >D A holds: G is density-reachable from A • A >D G does not hold: A is not density-reachable from G
The Algorithm DBSCAN • Density-Connected • An object p is density-connected to an object q wrt. Eps and MinPts in the set of objects D if there is an object o ∈ D such that both p and q are density-reachable from o wrt. Eps and MinPts in D • This is a symmetric relation • [Figure: MinPts = 3; G and C are density-connected to each other via E.]
The Algorithm DBSCAN • Cluster • Let D be a set of objects. A cluster C wrt. Eps and MinPts in D is a non-empty subset of D satisfying the following conditions: • Maximality: ∀p, q ∈ D: if p ∈ C and q >D p wrt. Eps and MinPts, then also q ∈ C. • Connectivity: ∀p, q ∈ C: p is density-connected to q wrt. Eps and MinPts in D. • [Figure: MinPts = 3; objects A, B, C, E, F, G, H in D.]
The Algorithm DBSCAN • Noise • Let C1, ..., Ck be the clusters wrt. Eps and MinPts in D. Then the noise is the set of objects in the database D not belonging to any cluster Ci, i.e. noise = {p ∈ D | ∀i: p ∉ Ci}. • [Figure: MinPts = 3; clusters C1 and Ck in D; H is a noise object.]
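The definitions of cluster and noise suggest a simple batch procedure: start from an unvisited core object and collect everything density-reachable from it. The following is a minimal sketch of that idea, assuming Euclidean distance and a linear-scan region query; the names `dbscan` and `NOISE` are illustrative, and a real system would use a spatial index for the region queries.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for objects that belong to no cluster

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Region query by linear scan; includes i itself.
        return [j for j, p in enumerate(points) if dist(p, points[i]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border object
            continue
        labels[i] = cluster_id         # i is a core object: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nj = neighbors(j)
            if len(nj) >= min_pts:      # only core objects extend the cluster
                queue.extend(nj)
        cluster_id += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Every point costs one region query here, which is why the number of region queries is a natural cost measure for both the batch and the incremental variant.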
IncrementalDBSCAN • Affected objects • Insertions • Deletions
Insertions • The case depends on the set of update seeds (UpdSeed_Ins in the paper) after inserting an object p: • Noise: UpdSeed_Ins is empty. • Creation: UpdSeed_Ins contains only core objects which did not belong to a cluster before the insertion of p. • Absorption: UpdSeed_Ins contains core objects which were members of exactly one cluster C before the insertion. • Merge: UpdSeed_Ins contains core objects which were members of several clusters before the insertion.
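The four insertion cases can be read as a classification over the seed set. The sketch below assumes the seed set has already been computed and is passed as a list of the cluster ids those objects had before the insertion, with `None` standing for "was noise, or is the inserted object itself". The function name and this encoding are hypothetical, chosen only for illustration.

```python
def classify_insertion(seed_cluster_ids):
    """Classify an insertion from the prior cluster ids of the update seeds.

    seed_cluster_ids: iterable of cluster ids (None = no cluster before insertion).
    """
    seed_cluster_ids = list(seed_cluster_ids)
    if not seed_cluster_ids:
        return "noise"                 # no seeds: the new object is noise
    clusters = {c for c in seed_cluster_ids if c is not None}
    if not clusters:
        return "creation"              # seeds belonged to no cluster: new cluster
    if len(clusters) == 1:
        return "absorption"            # seeds come from exactly one cluster
    return "merge"                     # seeds come from several clusters

print(classify_insertion([]))
print(classify_insertion([None, None]))
print(classify_insertion([3, 3, None]))
print(classify_insertion([3, 7]))
```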
Insertions: Merge • Merging is transitive: if cluster A is merged with B and with C, then B and C are merged as well. • Transitive merging can occur if MinPts is larger than 5.
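Deletions • The case depends on the set of update seeds (UpdSeed_Del in the paper) after deleting an object p: • Removal: UpdSeed_Del is empty. • Reduction: all objects in UpdSeed_Del are directly density-reachable from each other. • Potential Split: the objects in UpdSeed_Del are not directly density-reachable from each other. These objects belonged to exactly one cluster C before the deletion of p.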
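The three deletion cases can be sketched the same way. Here the seed set and a predicate for direct density-reachability are taken as given; both the function name and the predicate interface are assumptions made for this illustration. In the potential-split case, a further reachability check is needed to decide whether the cluster actually falls apart, and that check is not shown.

```python
from itertools import permutations

def classify_deletion(seeds, directly_reachable):
    """Classify a deletion from its update seeds.

    seeds: list of object ids in the seed set after deleting p.
    directly_reachable(a, b): True if a is directly density-reachable from b.
    """
    seeds = list(seeds)
    if not seeds:
        return "removal"
    # Reduction requires every ordered pair of seeds to be directly reachable.
    if all(directly_reachable(a, b) for a, b in permutations(seeds, 2)):
        return "reduction"
    return "potential split"

# Hypothetical reachability relation over object ids.
close_pairs = {(1, 2), (2, 1)}
reachable = lambda a, b: (a, b) in close_pairs

print(classify_deletion([], reachable))
print(classify_deletion([1, 2], reachable))
print(classify_deletion([1, 2, 3], reachable))
```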
First Experiment (I) • Database type: synthetic database • Application: earthquake catalog • Goal: clustering earthquake epicenters • 1,000,000 2D points • K = 40 • Noise: 21.7% • Note that the MinPts value had to be rather large due to the high percentage of noise
First Experiment (II) • Other synthetic databases • n varying from 100,000 to 1,000,000 • K varying from 7 to 40 • Noise percentage varying from 10% up to 20%
Second Experiment (I) • Database type : WWW access log database of the University of Munich • Goal : Discovering groups of similar sessions • Session structure :
Second Experiment (II) • Properties of the distance function on sessions: • Values lie in [0 .. 1] • Symmetric, and satisfies the triangle inequality
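The slides do not show the formula for the session distance. One function with exactly these properties is the Jaccard distance between the sets of pages visited in two sessions, so the sketch below uses it as an assumption; the function name and the treatment of a session as a set of URLs are illustrative only.

```python
def session_distance(s1, s2):
    """Jaccard distance between two sessions, each given as a collection of pages.

    Returns a value in [0, 1]: 0 for identical page sets, 1 for disjoint ones.
    """
    a, b = set(s1), set(s2)
    union = a | b
    if not union:
        return 0.0  # two empty sessions are treated as identical
    return (len(union) - len(a & b)) / len(union)

# Hypothetical sessions as sets of requested pages.
s1 = {"/index.html", "/courses.html"}
s2 = {"/courses.html", "/staff.html"}
print(session_distance(s1, s2))   # 2 of the 3 distinct pages are not shared
print(session_distance(s1, s1))
```

Because this distance is symmetric and obeys the triangle inequality, it can be plugged into the Eps-neighborhood test in place of a Euclidean distance.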
IncrementalDBSCAN versus DBSCAN • Comparison measure • Typically, the number of page accesses is used as a cost measure for database algorithms because the I/O time heavily dominates CPU time. In both algorithms, region queries are the only operations requiring page accesses. • Absolute performance vs. relative performance
Cost • In general, a deletion affects more objects than an insertion.
Speed-up Factor • The speed-up factor depends on: • Database size • Number of updates • The experimental speed-up factor is larger than the analytically derived one
Result (II) • Break-even point: the number of updates at which the speed-up factor equals 1.0
Conclusions • Data warehouses are updated periodically • DBSCAN is inefficient for dynamic databases • IncrementalDBSCAN is more efficient
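Discussion • DBSCAN requires setting Eps and MinPts, but how can suitable values be found? • The approach suits data that needs short-cycle, time-sensitive updates, for example epidemic classification and seasonal product market analysis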