Unsupervised Learning with Mixed Numeric and Nominal Data

Unsupervised Learning with Mixed Numeric and Nominal Data • Advisor: Dr. Hsu • Graduate: Yu-Cheng Chen • Authors: Cen Li, Gautam Biswas • IEEE Transactions on Knowledge and Data Engineering, 2002

Outline • Motivation • Objective • Introduction • Background • SBAC • Experimental results • Conclusions • Personal Opinion

Motivation • Traditional clustering algorithms assume that features are either all numeric or all categorical. • Most useful data is described by a mix of numeric and nominal valued features.

Objective • Developing unsupervised learning techniques that exhibit good performance with mixed data.

Introduction • Traditional approaches used to handle mixed data are listed below: • Binary encoding of nominal attributes. • Discretizing numeric attributes. • Generalizing criterion functions to handle mixed data.
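
The first two approaches listed above can be illustrated with a minimal sketch. The function names and the toy data are hypothetical, and equal-width binning is only one of several possible discretization schemes.

```python
def one_hot(values):
    """Binary-encode a nominal feature: one 0/1 column per distinct value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def equal_width_bins(values, n_bins):
    """Discretize a numeric feature into n_bins equal-width intervals (labels 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a constant feature falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical toy data: one nominal and one numeric feature.
colors = ["red", "green", "red", "blue"]
sizes = [1.0, 2.5, 4.0, 10.0]
encoded = one_hot(colors)            # columns ordered blue, green, red
binned = equal_width_bins(sizes, 3)  # bin labels per object
```

Both transformations lose information: binary encoding inflates the feature count, and binning discards the ordering within a bin. That loss is what motivates a measure that handles both feature types directly.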

Background • COBWEB/3 • Uses the category utility (CU) measure for categorical attributes: • For numeric attributes:

Background (cont.) • COBWEB/3 • CU measure for numeric attributes is defined as: • The overall CU is defined as:
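
For reference, the category utility measures are commonly written in the COBWEB and CLASSIT literature as shown below. These standard forms are given as an assumption about what the slides displayed. The symbols K (number of classes), A_i (feature i), V_ij (value j of feature i), and σ_i (standard deviation of feature i over the whole data set) are notation chosen here; σ_ik is the standard deviation of feature i within class C_k.

```latex
CU_k^{\text{nominal}} = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]

CU_k^{\text{numeric}} = \frac{P(C_k)}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i} \right)

CU = \frac{1}{K} \sum_{k=1}^{K} \left( CU_k^{\text{nominal}} + CU_k^{\text{numeric}} \right)
```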

Background (cont.) • COBWEB/3 • Limitations: • It assumes that numeric data are normally distributed. • The accuracy of the estimate is suspect when the sample size is small. • When all objects in Ck share a single value, σik = 0 and 1/σik → ∞, so 1/σik is set to 1 when σik < 1.

Background (cont.) • ECOBWEB attempts to remedy the disadvantages of COBWEB/3: • The normal distribution assumption • The case σik = 0

Background (cont.) • ECOBWEB • Limitations: • The choice of the parameters has a significant effect on CU computation.

Background (cont.) • AUTOCLASS • Uses a Bayesian method for clustering. • Derives the most probable class distribution for the data given prior information. • Limitations: • The computational complexity is high. • It is prone to overfitting.

SBAC System • SBAC • Uses a similarity measure defined by Goodall. • Adopts a hierarchical agglomerative approach to build partition structures. • Similarity is determined by the uncommonality of feature value matches: • X1 = {a, b}, X2 = {a, b}, X3 = {c, d}, X4 = {c, d} • (P(a) = P(b)) >= (P(c) = P(d)) • Because matches on less common values carry more weight, the similarity of X3 and X4 should be greater than that of X1 and X2.

SBAC System • Summary • For numeric feature values, the similarity takes into account: • The difference between the feature values • The uniqueness of the feature value pair

SBAC System • Computing similarity for numeric attributes • We define the More Similar Feature Segment Set (MSFSS): • The set of all pairs of values for feature k that are equally or more similar to the pair ((Vi)k, (Vj)k).

SBAC System • The probability of picking two objects having values ((Vl)k, (Vm)k) ∈ MSFSS((Vi)k, (Vj)k) is defined as: • The dissimilarity of the pair, (Dij)k, is defined as the summation of these probabilities. • The similarity of the pair ((Vi)k, (Vj)k) is defined as:
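
The numeric similarity computation can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that a pair belongs to the MSFSS when its value difference is smaller, or equal with no more objects falling inside the segment; that pair probabilities are frequency counts over n(n-1) ordered object pairs; and that similarity is 1 minus dissimilarity. The function name and toy data are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement


def numeric_similarity(values, vi, vj):
    """Similarity of the value pair (vi, vj) for one numeric feature.

    values holds the feature's value for every object in the data set.
    """
    n = len(values)
    freq = Counter(values)
    total_pairs = n * (n - 1)

    def segment(a, b):
        # Width of the segment and number of objects whose value lies inside it.
        lo, hi = min(a, b), max(a, b)
        population = sum(f for v, f in freq.items() if lo <= v <= hi)
        return hi - lo, population

    def pair_probability(a, b):
        # Chance that two distinct, randomly drawn objects carry values a and b.
        if a == b:
            return Fraction(freq[a] * (freq[a] - 1), total_pairs)
        return Fraction(2 * freq[a] * freq[b], total_pairs)

    target_width, target_pop = segment(vi, vj)
    dissimilarity = Fraction(0)
    for a, b in combinations_with_replacement(sorted(freq), 2):
        width, pop = segment(a, b)
        # Assumed MSFSS rule: narrower segment, or same width with no more objects.
        if width < target_width or (width == target_width and pop <= target_pop):
            dissimilarity += pair_probability(a, b)
    return 1 - dissimilarity


# Hypothetical feature column for four objects.
column = [1, 1, 2, 4]
s_close = numeric_similarity(column, 1, 2)  # small difference
s_far = numeric_similarity(column, 1, 4)    # the widest possible segment
```

Exact fractions are used so that ties in width and population are compared without rounding error; with floating-point feature values the equality test on widths would need a tolerance.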

SBAC System • For nominal feature values, the similarity is determined as follows: • We define the More Similar Feature Value Set (MSFVS): • The set of all pairs of values for feature k that are equally or more similar to the pair ((Vi)k, (Vi)k). • Example: f(a) = 3, f(b) = 3, f(c) = 4 • MSFVS(c, c) = {(a, a), (b, b), (c, c)} • MSFVS(b, b) = {(a, a), (b, b)}

SBAC System • The probability of picking a pair ((Vl)k, (Vl)k) ∈ MSFVS((Vi)k) is defined as follows: • The dissimilarity of the pair, (Dii)k, is defined as the summation of these probabilities.

SBAC System • f(a) = 3, f(b) = 3, f(c) = 4 • MSFVS(c, c) = {(a, a), (b, b), (c, c)} • MSFVS(b, b) = {(a, a), (b, b)}
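
The frequency example can be worked through with a short sketch. It assumes that the probability of picking a matching pair with value l is f(l)(f(l) - 1) / (n(n - 1)), that mismatched nominal values get similarity 0, and that similarity is 1 minus dissimilarity. The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def nominal_similarity(values, vi, vj):
    """Similarity of the value pair (vi, vj) for one nominal feature."""
    if vi != vj:
        return Fraction(0)  # assumption: a mismatch counts as fully dissimilar
    n = len(values)
    freq = Counter(values)
    # MSFVS(vi, vi): matching pairs whose value is no more frequent than vi.
    dissimilarity = sum(
        Fraction(f * (f - 1), n * (n - 1))
        for f in freq.values()
        if f <= freq[vi]
    )
    return 1 - dissimilarity


# Frequencies from the slide: f(a) = 3, f(b) = 3, f(c) = 4, so n = 10.
column = list("aaabbbcccc")
s_cc = nominal_similarity(column, "c", "c")
s_bb = nominal_similarity(column, "b", "b")
```

Under these assumptions a match on the rarer value b scores 13/15, while a match on the more common value c scores 11/15, which is the uncommonality weighting described earlier.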

SBAC System • Aggregating similarity from multiple features • The per-feature results are combined using Fisher's χ² transformation: • For numeric features: • For nominal features:
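
Fisher's method for combining independent probabilities can be sketched as below. The sketch covers only the plain Fisher transformation, χ² = -2 Σ ln D_k with 2t degrees of freedom for t features, and treats the combined dissimilarity as the upper-tail probability of that statistic. The separate nominal-feature formula is not reproduced here. The function name is hypothetical.

```python
import math


def fisher_combine(dissimilarities):
    """Combine per-feature dissimilarities (each in (0, 1]) into a single one."""
    t = len(dissimilarities)
    if t == 0 or any(not 0 < d <= 1 for d in dissimilarities):
        raise ValueError("need at least one dissimilarity, each in (0, 1]")
    chi2 = -2.0 * sum(math.log(d) for d in dissimilarities)
    half = chi2 / 2.0
    # Upper tail of a chi-square distribution with 2t degrees of freedom.
    # For even degrees of freedom it has the closed form
    # exp(-x/2) * sum_{i<t} (x/2)^i / i!.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(t))


combined = fisher_combine([0.5, 0.5])
similarity = 1.0 - combined
```

The closed form avoids a dependency on a statistics library; a single feature passes through unchanged, which is a quick sanity check on the transformation.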

SBAC System • Combining the two types of features: • {c, 9} • {a, 7.5} • {c, 10.5} • 8 {c, 9}

SBAC System • The agglomerative clustering algorithm:
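
A generic agglomerative procedure over a pairwise dissimilarity matrix can be sketched as follows. Group-average linkage (UPGMA) is assumed here as the merging rule; the function name and the toy matrix are hypothetical, and this is not presented as the authors' exact algorithm.

```python
def average_linkage(dist):
    """Agglomerative clustering with group-average linkage.

    dist is a symmetric list-of-lists matrix of pairwise dissimilarities.
    Returns the merge history as (members_a, members_b, dissimilarity) tuples.
    """
    clusters = {i: (i,) for i in range(len(dist))}  # cluster id -> member indices
    next_id = len(dist)
    merges = []

    def average(a, b):
        # Mean dissimilarity over all cross-cluster object pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Pick the pair of clusters with the smallest average dissimilarity.
        d, p, q = min(
            (average(clusters[p], clusters[q]), p, q)
            for idx, p in enumerate(ids)
            for q in ids[idx + 1:]
        )
        a, b = clusters.pop(p), clusters.pop(q)
        merges.append((a, b, d))
        clusters[next_id] = a + b
        next_id += 1
    return merges


# Hypothetical dissimilarities among four objects.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
history = average_linkage(D)
```

The final entry of the merge history corresponds to the root of the dendrogram, and its dissimilarity plays the role of D(root).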

SBAC System • The predefined threshold t • We set t = 0.3 × D(root); with D(root) = 0.876, t = 0.263. • If the drop in dissimilarity is larger than t, stop.
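
One reading of this stopping rule is a top-down traversal of the dendrogram that stops at a child node whenever the dissimilarity drops by more than t between parent and child. The sketch below follows that reading; the `Node` class, the function name, and the toy tree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    dissimilarity: float          # dissimilarity at which this node was formed
    members: tuple                # objects under this node
    children: list = field(default_factory=list)


def extract_clusters(root, fraction=0.3):
    """Cut the tree where the parent-to-child dissimilarity drop exceeds t."""
    t = fraction * root.dissimilarity
    if not root.children:
        return [root.members]
    clusters = []

    def visit(node):
        for child in node.children:
            drop = node.dissimilarity - child.dissimilarity
            # Stop at a large drop, or at a leaf, where there is nothing to descend into.
            if drop > t or not child.children:
                clusters.append(child.members)
            else:
                visit(child)

    visit(root)
    return clusters


# Hypothetical tree with D(root) = 0.876, so t = 0.3 * 0.876 = 0.2628.
b1 = Node(0.1, (2, 3), [Node(0.0, (2,)), Node(0.0, (3,))])
b2 = Node(0.0, (4,))
a = Node(0.2, (0, 1), [Node(0.0, (0,)), Node(0.0, (1,))])
b = Node(0.7, (2, 3, 4), [b1, b2])
root = Node(0.876, (0, 1, 2, 3, 4), [a, b])
found = extract_clusters(root)
```

In this toy tree the drop from the root to node `a` exceeds t, so `a` becomes a cluster, while node `b` is close to the root in dissimilarity and is split further.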

Experimental results • Artificial data • 180 data points in three classes: G1, G2, G3 • Two nominal and two numeric attributes • Each class has 60 data points.

Experimental results (cont.)

Experimental results (cont.) • [Figure panels: COBWEB, SBAC, AUTOCLASS, ECOBWEB]

Experimental results (cont.) • Real data • Hand Written Character (8OX) Data • Numeric features • 45 objects • Mushroom Data • Nominal features • 200 objects (100 of them were poisonous) • Heart disease Data • Mixed features • 303 patients

Experimental results (cont.) • Results

Conclusions • This paper proposes a new similarity measure that assigns greater weight to feature value matches that are uncommon in the population. • The approach clusters better than the other methods it was compared with.

Personal Opinion • The time complexity of this approach is too high. • The processes of computing similarity and clustering are overly complicated.
