Information-theoretic distance measures for clustering validation: Generalization and normalization. Presenter: Lin, Shu-Han. Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2009.
Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments
Motivation σ: the “true” partition π: clustering output • External criteria for clustering validation: • Information-theoretic distance measures are used to compare the clustering output with the “true” partition • Clustering ability of algorithms: compare different clustering algorithms on a given dataset • Clustering difficulty of datasets: compare different datasets under a given algorithm
Objectives • Datasets differ in dimension, size, sparseness, and attribute scales • As a result, the range of a distance measure differs from dataset to dataset • For a fair comparison: normalize the distance
Methodology – Conditional Entropy π: group label σ: class label The equality C1 = C2 yields the Shannon entropy
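The formula on this slide did not survive extraction. As a rough illustration of the Shannon case only, the sketch below computes the conditional entropy H(π | σ) from two label lists. The function name `conditional_entropy` and the toy labels are assumptions made for this example; the generalized form used in the paper is not reproduced here.

```python
from collections import Counter
from math import log2


def conditional_entropy(pi, sigma):
    """Shannon conditional entropy H(pi | sigma) in bits.

    pi    : group labels produced by a clustering algorithm
    sigma : class labels of the "true" partition
    (hypothetical helper, not taken from the slides)
    """
    n = len(pi)
    joint = Counter(zip(pi, sigma))  # size of each (group, class) overlap
    class_sizes = Counter(sigma)     # size of each class in sigma
    h = 0.0
    for (_, cls), count in joint.items():
        # weight by P(group, class), take the log of P(group | class)
        h -= (count / n) * log2(count / class_sizes[cls])
    return h


sigma = ["a", "a", "b", "b"]  # toy class labels
pi = [0, 1, 0, 1]             # toy clustering that splits every class in half

h_pi_given_sigma = conditional_entropy(pi, sigma)
h_sigma_given_sigma = conditional_entropy(sigma, sigma)
```

Knowing the class tells us nothing about the group in this toy case, so the first value is as large as it can be for two groups, while conditioning a partition on itself leaves no uncertainty.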
Methodology – Quasi-Distance σ: the “true” partition π: clustering output • Minimum reachable: d(π,σ) reaches its minimum over both π and σ iff π = σ • Symmetry: d(π,σ) = d(σ,π) • Triangle law: d(π,σ) + d(σ,τ) ≥ d(π,τ)
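The three properties can be checked numerically on a small example. The sketch below assumes the Shannon-based distance d(π,σ) = H(π|σ) + H(σ|π), which the slides do not state explicitly, and tests it exhaustively over all partitions of a 4-object set. The helper names (`cond_entropy`, `distance`, `partitions`) are made up for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def cond_entropy(x, y):
    """Shannon conditional entropy H(x | y) in bits, from two label sequences."""
    n = len(x)
    joint, y_sizes = Counter(zip(x, y)), Counter(y)
    return -sum(c / n * log2(c / y_sizes[b]) for (_, b), c in joint.items())


def distance(pi, sigma):
    """Assumed distance: d(pi, sigma) = H(pi | sigma) + H(sigma | pi)."""
    return cond_entropy(pi, sigma) + cond_entropy(sigma, pi)


def partitions(n):
    """Yield every partition of n objects as a canonical label tuple.

    Labels follow a restricted-growth pattern, so two tuples are equal
    exactly when they describe the same partition.
    """
    def grow(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(used + 1):
            yield from grow(prefix + [label], max(used, label + 1))
    yield from grow([], 0)


parts = list(partitions(4))
eps = 1e-12  # tolerance for floating-point comparisons

symmetric = all(abs(distance(p, s) - distance(s, p)) < eps
                for p, s in product(parts, repeat=2))
minimum_reachable = all((distance(p, s) < eps) == (p == s)
                        for p, s in product(parts, repeat=2))
triangle = all(distance(p, s) + distance(s, t) + eps >= distance(p, t)
               for p, s, t in product(parts, repeat=3))
```

An exhaustive check on four objects is only a sanity test, not a proof; the paper establishes the properties analytically.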
Methodology – Normalization Issue How to get it?
Methodology – Computation of The worst result of π (m groups) Generate a π0 ∈ PART(A) such that σ: n
Methodology – Computation of There is an difference between and
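The construction of the worst-case partition is missing from the extracted slides. As a stand-in, the sketch below finds the maximum by brute force: it enumerates every partition of the objects into exactly m groups and takes the largest distance to σ as the normalizer. This is not the paper's construction, it only scales to tiny inputs, and it again assumes the Shannon-based distance d(π,σ) = H(π|σ) + H(σ|π). All function names are hypothetical.

```python
from collections import Counter
from math import log2


def cond_entropy(x, y):
    """Shannon conditional entropy H(x | y) in bits, from two label sequences."""
    n = len(x)
    joint, y_sizes = Counter(zip(x, y)), Counter(y)
    return -sum(c / n * log2(c / y_sizes[b]) for (_, b), c in joint.items())


def distance(pi, sigma):
    """Assumed distance: d(pi, sigma) = H(pi | sigma) + H(sigma | pi)."""
    return cond_entropy(pi, sigma) + cond_entropy(sigma, pi)


def partitions_into(n, m):
    """Yield every partition of n objects into exactly m non-empty groups."""
    def grow(prefix, used):
        if len(prefix) == n:
            if used == m:
                yield tuple(prefix)
            return
        # reuse an existing label, or open a new group while fewer than m exist
        for label in range(min(used + 1, m)):
            yield from grow(prefix + [label], max(used, label + 1))
    yield from grow([], 0)


def normalized_distance(pi, sigma):
    """Divide d(pi, sigma) by the brute-force maximum over m-group partitions."""
    m = len(set(pi))
    d_max = max(distance(p, sigma) for p in partitions_into(len(sigma), m))
    return distance(pi, sigma) / d_max if d_max > 0 else 0.0


sigma = (0, 0, 1, 1)    # toy "true" partition with two classes
worst = (0, 1, 0, 1)    # splits every class evenly across the two groups
perfect = (0, 0, 1, 1)  # identical to sigma

score_worst = normalized_distance(worst, sigma)
score_perfect = normalized_distance(perfect, sigma)
```

After normalization the scores lie in [0, 1] regardless of dataset size or number of classes, which is what makes comparisons across datasets meaningful.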
Experiments • Shannon Entropy • Gini Index • Goodman-Kruskal • Pal Entropy
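The slides name the four measures without giving their formulas. For orientation, the sketch below evaluates the commonly cited forms of each on a probability vector: Shannon entropy, the Gini index 1 − Σp², the Goodman-Kruskal coefficient 1 − max p, and Pal's exponential entropy Σp(e^(1−p) − 1). These definitions are assumed from general usage and may differ in scaling from the ones used in the paper.

```python
from math import exp, log2


def shannon_entropy(p):
    """-sum p_i log2 p_i, skipping zero-probability blocks."""
    return -sum(q * log2(q) for q in p if q > 0)


def gini_index(p):
    """1 - sum p_i^2."""
    return 1.0 - sum(q * q for q in p)


def goodman_kruskal(p):
    """1 - max p_i."""
    return 1.0 - max(p)


def pal_entropy(p):
    """sum p_i * (e^(1 - p_i) - 1), assumed form of Pal's exponential entropy."""
    return sum(q * (exp(1.0 - q) - 1.0) for q in p)


measures = {
    "Shannon": shannon_entropy,
    "Gini": gini_index,
    "Goodman-Kruskal": goodman_kruskal,
    "Pal": pal_entropy,
}

uniform = [0.5, 0.5]  # maximally mixed two-block distribution
pure = [1.0]          # a single block, no uncertainty

for name, measure in measures.items():
    print(f"{name}: uniform={measure(uniform):.4f}, pure={measure(pure):.4f}")
```

All four vanish on a pure distribution but peak at different values on the uniform one, which illustrates why unnormalized distances built from them are not directly comparable.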
Conclusions • Quasi-distance: external measure for clustering validation • Symmetry • Triangle law • Minimum reachable • Normalization: maximum value of a distance measure • Compare clustering performances of an algorithm on different datasets • The normalized distance measures outperform the original distance measures • Normalized Shannon distance has the best performance among the 4 observed distance measures
Comments • Advantage • Idea is intuitive • Theoretical analysis • Drawback • Describe why they think quasi-distance is better than DCV. • Application • The same use as DCV?