Unsupervised Learning with Mixed Numeric and Nominal Data • Advisor: Dr. Hsu • Graduate: Yu-Cheng Chen • Authors: Cen Li, Gautam Biswas • 2002 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Outline • Motivation • Objective • Introduction • Background • SBAC • Experimental results • Conclusions • Personal Opinion
Motivation • Traditional clustering algorithms assume that features are either all numeric or all categorical. • The majority of useful data is described by a mix of numeric and nominal features.
Objective • Developing unsupervised learning techniques that exhibit good performance with mixed data.
Introduction • Traditional approaches to handling mixed data include: • Binary encoding. • Discretizing numeric attributes. • Generalizing criterion functions to handle mixed data.
Background • COBWEB/3 • Uses the category utility (CU) measure for categorical attributes: • For numeric attributes:
Background (cont.) • COBWEB/3 • CU measure for numeric attributes is defined as: • The overall CU is defined as:
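For reference, the category utility measures named on the COBWEB/3 slides are commonly written as below. This is a sketch based on the usual statement in the COBWEB/CLASSIT literature, not a copy of the slide's own formulas, so the notation is an assumption: A_i is attribute i, V_ij its j-th value, C_k class k, σ_ik the standard deviation of numeric attribute i within class k, σ_i its standard deviation over the whole data set, and K the number of classes.

```latex
\begin{align*}
CU_k^{\text{nominal}} &= P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right] \\
CU_k^{\text{numeric}} &= \frac{P(C_k)}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i} \right) \\
CU &= \frac{1}{K} \sum_{k=1}^{K} \left( CU_k^{\text{nominal}} + CU_k^{\text{numeric}} \right)
\end{align*}
```

The 1/σ_ik term in the numeric form is the reason a class with a single repeated value (σ_ik = 0) needs special handling.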
Background (cont.) • COBWEB/3 • Limitations: • Numeric data are assumed to be normally distributed. • The accuracy of the estimate is suspect when the sample size is small. • When all objects in Ck share a single value, σik = 0 and 1/σik → ∞, so 1/σik is set to 1 whenever σik < 1.
Background (cont.) • ECOBWEB aims to remedy the disadvantages of COBWEB/3: • The normal distribution assumption • The case σik = 0
Background (cont.) • ECOBWEB • Limitations: • The choice of the parameters has a significant effect on CU computation.
Background (cont.) • AUTOCLASS • Uses a Bayesian method for clustering. • Derives the most probable class distribution for the data given prior information. • Limitations: • Computational complexity is too high. • Overfitting problem.
SBAC System • SBAC • Uses a similarity measure defined by Goodall. • Adopts a hierarchical agglomerative approach to build partition structures. • The similarity is decided by: • The uncommonality of feature value matches. • Example: X1 = {a, b}, X2 = {a, b}, X3 = {c, d}, X4 = {c, d} • P(a) = P(b) ≥ P(c) = P(d) • Because c and d are less common, the similarity of X3 and X4 should be greater than that of X1 and X2.
SBAC System • Summary • For numeric feature values, the similarity takes into account: • The feature value difference • The uniqueness of the feature value pair
SBAC System • Computing Similarity for Numeric Attributes • We define the More Similar Feature Segment Set (MSFSS): • The set of all pairs of values for feature k that are equally or more similar than the pair ((Vi)k, (Vj)k).
SBAC System • The probability of picking a pair of values ((Vl)k, (Vm)k) ∈ MSFSS((Vi)k, (Vj)k) is defined as: • The dissimilarity (Dij)k of the pair is defined as the sum of these probabilities. • The similarity of the pair ((Vi)k, (Vj)k) is defined as:
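The two definitions above can be written compactly as follows. This is a sketch only: the pair probability is left as a generic P(·), and the complement form of the similarity is an assumption taken from Goodall-style measures.

```latex
\begin{align*}
(D_{ij})_k &= \sum_{((V_l)_k,\,(V_m)_k)\,\in\,\mathrm{MSFSS}((V_i)_k,\,(V_j)_k)} P\big((V_l)_k, (V_m)_k\big) \\
(S_{ij})_k &= 1 - (D_{ij})_k
\end{align*}
```

Under this reading, a pair is highly similar when few other pairs are as close and as unusual as it is, so its MSFSS carries little probability mass.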
SBAC System • For nominal feature values, the similarity is computed as follows: • We define the More Similar Feature Value Set (MSFVS): • The set of all pairs of values for feature k that are equally or more similar than the pair ((Vi)k, (Vi)k). • Example: f(a)=3, f(b)=3, f(c)=4 • MSFVS(c, c) = {(a, a), (b, b), (c, c)} • MSFVS(b, b) = {(a, a), (b, b)}
SBAC System • The probability of picking a pair ((Vl)k, (Vl)k) ∈ MSFVS((Vi)k) is defined as follows: • The dissimilarity (Dii)k of the pair is defined as the sum of these probabilities.
SBAC System • Example: f(a)=3, f(b)=3, f(c)=4 • MSFVS(c, c) = {(a, a), (b, b), (c, c)} • MSFVS(b, b) = {(a, a), (b, b)}
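The nominal example can be worked through in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the probability of drawing a matching pair of value v is taken to be f(v)(f(v)-1) / (n(n-1)), the MSFVS of (v, v) is taken to contain the matching pairs of every value whose frequency is at most f(v) (which reproduces the sets listed above), and similarity is taken as 1 minus dissimilarity. The function name `nominal_match_similarity` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def nominal_match_similarity(values):
    """Return {value: similarity of a matching pair (value, value)}.

    Assumptions (not taken from the slides):
      * P(pair (v, v)) = f(v) * (f(v) - 1) / (n * (n - 1))
      * MSFVS(v, v) = matching pairs of all values u with f(u) <= f(v)
      * similarity = 1 - dissimilarity
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    freq = Counter(values)
    # Probability of drawing two objects that both carry value v.
    pair_prob = {v: Fraction(f * (f - 1), n * (n - 1)) for v, f in freq.items()}

    similarity = {}
    for v, f in freq.items():
        # Dissimilarity: total probability of pairs that are equally or more
        # "surprising" than (v, v), i.e. matches on equally rare or rarer values.
        dissimilarity = sum(pair_prob[u] for u, fu in freq.items() if fu <= f)
        similarity[v] = 1 - dissimilarity
    return similarity


# The slide's example: f(a)=3, f(b)=3, f(c)=4.
sims = nominal_match_similarity(["a"] * 3 + ["b"] * 3 + ["c"] * 4)
for value in sorted(sims):
    print(value, sims[value])
```

With these assumptions, a match on the more common value c scores 11/15, while a match on a or b scores 13/15, so rarer matches count for more.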
SBAC System • Aggregating Similarity from Multiple Features • The per-feature results are expressed as Fisher's χ² statistics. • For numeric features: • For nominal features:
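Fisher's χ² aggregation can be sketched as follows. This is an illustration under stated assumptions rather than the formulas from the slide: the per-feature values are treated as independent probabilities in (0, 1], the statistic is χ² = -2 Σ ln p_k with 2t degrees of freedom for t features, and the combined value is the upper-tail probability of that statistic. The function name `fisher_combine` is hypothetical, and the sketch covers only the plain Fisher form.

```python
import math


def fisher_combine(probabilities):
    """Combine independent probabilities with Fisher's chi-square method.

    Returns (chi2, tail) where chi2 = -2 * sum(ln p) and tail is the
    probability that a chi-square variable with 2 * t degrees of freedom
    exceeds chi2. Every p must lie in (0, 1].
    """
    t = len(probabilities)
    if t == 0:
        raise ValueError("need at least one probability")
    if any(not (0.0 < p <= 1.0) for p in probabilities):
        raise ValueError("probabilities must lie in (0, 1]")

    chi2 = -2.0 * sum(math.log(p) for p in probabilities)
    half = chi2 / 2.0
    # For an even number of degrees of freedom (2t) the upper tail has the
    # closed form exp(-x/2) * sum_{i<t} (x/2)^i / i!, so no SciPy is needed.
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(t))
    return chi2, tail


# Two per-feature dissimilarities combined into one overall dissimilarity.
chi2, combined = fisher_combine([0.5, 0.5])
print(round(chi2, 4), round(combined, 4))
```

A single feature is returned unchanged by this combination, which is a quick sanity check on the closed-form tail.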
SBAC System • Combining the two types of features: • {c, 9} • {a, 7.5} • {c, 10.5} • 8 {c, 9}
SBAC System • The agglomerative clustering algorithm:
SBAC System • The predefined threshold t • We set t = 0.3 × D(root); with D(root) = 0.876, t = 0.263. • If the drop in dissimilarity is larger than t, then stop.
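The clustering and threshold steps can be sketched as below. This is a minimal illustration, not the algorithm from the slide: it assumes average linkage between clusters, records the merge dissimilarity at each tree node, and reads the stopping rule as "walk down from the root and cut wherever the dissimilarity drops by more than t = 0.3 × D(root) from parent to child". The names `agglomerate` and `extract_partition` are hypothetical.

```python
def agglomerate(dissim):
    """Build a cluster tree from a symmetric dissimilarity matrix.

    Assumes average linkage. Each node is a dict with the object indices
    it covers, the dissimilarity at which it was formed, and its children.
    """
    nodes = {i: {"members": [i], "height": 0.0, "children": []}
             for i in range(len(dissim))}

    def average(a, b):
        # Mean pairwise dissimilarity between two groups of objects.
        return sum(dissim[i][j] for i in a for j in b) / (len(a) * len(b))

    next_id = len(dissim)
    while len(nodes) > 1:
        ids = sorted(nodes)
        pairs = [(x, y) for pos, x in enumerate(ids) for y in ids[pos + 1:]]
        # Merge the two clusters with the smallest average dissimilarity.
        a, b = min(pairs, key=lambda p: average(nodes[p[0]]["members"],
                                                nodes[p[1]]["members"]))
        left, right = nodes.pop(a), nodes.pop(b)
        nodes[next_id] = {
            "members": left["members"] + right["members"],
            "height": average(left["members"], right["members"]),
            "children": [left, right],
        }
        next_id += 1
    return next(iter(nodes.values()))


def extract_partition(root, fraction=0.3):
    """Cut the tree where dissimilarity drops by more than fraction * D(root)."""
    t = fraction * root["height"]

    def walk(node):
        clusters = []
        for child in node["children"]:
            is_leaf = not child["children"]
            if is_leaf or node["height"] - child["height"] > t:
                clusters.append(sorted(child["members"]))  # stop here
            else:
                clusters.extend(walk(child))  # keep descending
        return clusters

    return walk(root) if root["children"] else [root["members"]]


# Two tight pairs (0, 1) and (2, 3) that are far from each other.
D = [[0.0, 0.1, 0.9, 0.9],
     [0.1, 0.0, 0.9, 0.9],
     [0.9, 0.9, 0.0, 0.1],
     [0.9, 0.9, 0.1, 0.0]]
print(extract_partition(agglomerate(D)))
```

In this toy matrix the root forms at 0.9 and both children at 0.1, so the drop of 0.8 exceeds t = 0.27 and the two pairs come out as the final clusters.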
Experimental results • Artificial data • 180 data points in three classes: G1, G2, G3 • Two nominal and two numeric attributes. • Each class has 60 data points.
Experimental results (cont.) • [Figure: results on the artificial data for COBWEB, SBAC, AUTOCLASS, and ECOBWEB]
Experimental results (cont.) • Real data • Hand Written Character (8OX) Data • Numeric features • 45 objects • Mushroom Data • Nominal features • 200 objects (100 of them were poisonous) • Heart disease Data • Mixed features • 303 patients
Experimental results (cont.) • Results
Conclusions • This paper proposed a new similarity measure that assigns greater weight to feature value matches that are uncommon in the population. • The approach clusters better than the other methods compared in the experiments.
Personal Opinion • The time complexity of this approach is too high. • The procedures for computing similarity and for clustering are overly complicated.