Jierui Xie, Boleslaw Szymanski, Mohammed J. Zaki Department of Computer Science

Learning Dissimilarities for Categorical Symbols Jierui Xie, Boleslaw Szymanski, Mohammed J. Zaki Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180, USA {xiej2, szymansk, zaki}@cs.rpi.edu

Presentation Outline • Introduction • Related Work • Learning Dissimilarity (LD) Algorithm • Experimental Results • Conclusion

Introduction • Distance plays an important role in many data mining tasks • Distance is rarely defined precisely for categorical data • nominal and ordinal • e.g., rating of a movie {very bad, bad, fair, good, very good} • Goal: derive dissimilarities between categorical symbols • To enable the full power of distance-based methods. • Hopefully easier for interpretation as well.

Notation • A dataset X = {x1, x2, …, xt} of t data points. Each point xi has m attribute values xi = (x1i, …, xmi). • Each attribute Ai takes one of ni discrete values {ai1, …, aini}. Each aij is also called a symbol. • The similarity between symbols aik and ail : • The dissimilarity or • The distance between two data points xi and xj is defined in terms of the distance between symbols

Notation (cont.) • Let the frequency of symbol ai in the dataset be then the probability • Class label • Output of the classifier on point xi : • The error of misclassifying point xi: • Total classification error:
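The frequency and probability of a symbol can be illustrated with a short sketch. The formulas on this slide did not survive extraction, so the code assumes the usual reading: the frequency is a raw count over the dataset and the probability is that count divided by the number of data points t. The function name and the sample column are hypothetical.

```python
from collections import Counter

def symbol_probabilities(column):
    """Return (frequency, probability) dicts for one categorical attribute.

    Assumption: probability = frequency / t, where t is the number of points.
    """
    freq = Counter(column)
    t = len(column)
    prob = {symbol: count / t for symbol, count in freq.items()}
    return dict(freq), prob

# Hypothetical movie-rating column with t = 5 data points.
ratings = ["good", "bad", "good", "fair", "good"]
freq, prob = symbol_probabilities(ratings)
print(freq)
print(prob)
```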

Related Work • Unsupervised methods: • Assign similarity based on frequency; emphasize mismatch or match for frequent or rare symbols from a probabilistic or information-theoretic point of view. • Lin • Burnaby • Smirnov • Goodall • Gambaryan • Eskin • Occurrence Frequency (OF) • Inverse Occurrence Frequency (IOF) • Supervised methods: • Take the class information into account • Value Difference Metric (VDM) • Cheng et al.

Unsupervised Method Examples • Goodall: on a match, less frequent attribute values contribute more to the overall similarity than frequent ones. That is, the similarity takes a frequency-dependent value if ai = aj and is 0 otherwise. • Inverse Occurrence Frequency (IOF): assigns higher weight to mismatches on less frequent symbols. That is, the similarity takes a frequency-dependent value if ai ≠ aj and is 1 otherwise.
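As an illustration of the IOF measure, here is a minimal sketch. The formula itself is missing from the slide, so the code assumes the form commonly given in the literature on categorical similarity measures: 1 on a match, and 1 / (1 + log f(ai) · log f(aj)) on a mismatch, where f is the symbol frequency and log is the natural logarithm. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter

def iof_similarity(a, b, freq):
    """IOF similarity between two symbols of one attribute.

    Assumed form: 1 on a match, 1 / (1 + ln f(a) * ln f(b)) on a mismatch.
    freq maps each symbol to its count in the dataset.
    """
    if a == b:
        return 1.0
    return 1.0 / (1.0 + math.log(freq[a]) * math.log(freq[b]))

# Hypothetical attribute column: x is frequent, y is rare.
column = ["x"] * 8 + ["y"] * 2 + ["z"] * 4
freq = Counter(column)

print(iof_similarity("x", "x", freq))  # match
print(iof_similarity("y", "z", freq))  # mismatch involving the rare symbol y
print(iof_similarity("x", "z", freq))  # mismatch involving the frequent symbol x
```

Under this assumed form, a mismatch that involves a rare symbol scores higher than one that involves a frequent symbol, which matches the description on the slide.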

Supervised Method Examples • VDM: • Symbols are similar if they occur with a similar relative frequency in all the classes. In the VDM formula, Cai,c is the number of times symbol ai occurs in class c, Cai is the total number of times ai occurs in the whole dataset, and h is a constant. • Cheng et al.: • Based on an RBF classifier • Evaluates all pairwise distances between symbols and optimizes the error function using gradient descent
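The VDM idea can be sketched in a few lines. The slide's formula is missing, so the code assumes the standard form of the Value Difference Metric: the distance between symbols ai and aj is the sum over classes c of |Cai,c / Cai − Caj,c / Caj|^h. The function name, the default h = 2, and the sample data are illustrative assumptions.

```python
from collections import Counter, defaultdict

def vdm_distance(column, labels, a, b, h=2):
    """VDM distance between symbols a and b of one attribute.

    Assumed form: sum over classes c of |C[a,c]/C[a] - C[b,c]/C[b]| ** h.
    h = 2 is only an illustrative default; the slide just calls h a constant.
    """
    total = Counter(column)            # C[a]: occurrences of each symbol
    per_class = defaultdict(Counter)   # per_class[c][a]: occurrences of a in class c
    for symbol, label in zip(column, labels):
        per_class[label][symbol] += 1
    return sum(
        abs(per_class[c][a] / total[a] - per_class[c][b] / total[b]) ** h
        for c in per_class
    )

# Hypothetical data: "a" and "b" always occur in class 0, "c" always in class 1.
column = ["a", "a", "b", "b", "c", "c"]
labels = [0, 0, 0, 0, 1, 1]
print(vdm_distance(column, labels, "a", "b"))  # identical class profiles
print(vdm_distance(column, labels, "a", "c"))  # opposite class profiles
```

Symbols with identical class profiles end up at distance 0 even though they are different symbols, which is the behaviour the slide describes.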

Learning Dissimilarity Algorithm • Motivation: • Learning a mapping from the symbols of each categorical attribute Ai onto an interval of real numbers, guided by the class information, is feasible and may facilitate the classification task.

Learning Dissimilarity Algorithm (cont.) • Based on a nearest neighbor classifier and the distance difference from two classes • Iterative learning • Guided by gradient descent to minimize the total classification error
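The ingredients listed above (a real-valued mapping per attribute, a nearest neighbor distance difference between two classes, and gradient descent on a classification error) can be sketched as follows. This is not the authors' exact formulation, whose objective and update equations are not reproduced in this transcript. The sketch assumes a squared-difference distance between mapped symbols, a sigmoid of ∆d (nearest same-class distance minus nearest other-class distance) as the soft error, and a fixed learning rate; all names are hypothetical.

```python
import math

def point_distance(x, y, r):
    """Assumed distance: sum of squared differences of the mapped symbols."""
    return sum((r[k][xs] - r[k][ys]) ** 2 for k, (xs, ys) in enumerate(zip(x, y)))

def nearest(i, X, y, r, same_class):
    """Index of the nearest other point with the same (or a different) label.

    Requires at least two points per class and at least two classes.
    """
    cands = [j for j in range(len(X)) if j != i and (y[j] == y[i]) == same_class]
    return min(cands, key=lambda j: point_distance(X[i], X[j], r))

def sigmoid(z):
    # Overflow-safe identity: 1 / (1 + exp(-z)) == 0.5 * (1 + tanh(z / 2)).
    return 0.5 * (1.0 + math.tanh(z / 2.0))

def total_error(X, y, r):
    """Assumed soft error: sum over points of sigmoid(d_same - d_diff)."""
    err = 0.0
    for i in range(len(X)):
        d_same = point_distance(X[i], X[nearest(i, X, y, r, True)], r)
        d_diff = point_distance(X[i], X[nearest(i, X, y, r, False)], r)
        err += sigmoid(d_same - d_diff)
    return err

def learn_mapping(X, y, r, lr=0.05, epochs=50):
    """Gradient descent on total_error, holding neighbors fixed within an epoch."""
    r = [dict(m) for m in r]  # work on a copy of the initial mapping
    for _ in range(epochs):
        grad = [{s: 0.0 for s in m} for m in r]
        for i in range(len(X)):
            js = nearest(i, X, y, r, True)
            jd = nearest(i, X, y, r, False)
            delta = point_distance(X[i], X[js], r) - point_distance(X[i], X[jd], r)
            w = sigmoid(delta) * (1.0 - sigmoid(delta))  # derivative of the sigmoid
            for k in range(len(r)):
                # d_same enters delta with sign +1, d_diff with sign -1.
                for j, sign in ((js, 1.0), (jd, -1.0)):
                    diff = r[k][X[i][k]] - r[k][X[j][k]]
                    grad[k][X[i][k]] += sign * w * 2.0 * diff
                    grad[k][X[j][k]] -= sign * w * 2.0 * diff
        for k in range(len(r)):
            for s in r[k]:
                r[k][s] -= lr * grad[k][s]
    return r

# Hypothetical one-attribute dataset: symbol "a" is class 0, symbol "b" is class 1.
X = [("a",), ("a",), ("b",), ("b",)]
y = [0, 0, 1, 1]
r0 = [{"a": 0.0, "b": 0.1}]
r1 = learn_mapping(X, y, r0)
print(total_error(X, y, r0), total_error(X, y, r1))
print(r1)
```

In this toy case the update pushes the values of "a" and "b" apart, so points from different classes move farther from each other and the soft error drops.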

Learning Dissimilarity Algorithm (cont.) • Objective Function and Update Equation

Learning Dissimilarity Algorithm (cont.) • The Derivative of ∆d • The full update equation

Learning Dissimilarity Algorithm (cont.) • Intuitive meaning of assignment update

Experimental Result • Datasets

Experimental Result (cont.) • Redundancy among symbols

Experimental Result (cont.) • Comparison with Various Data-Driven Methods • On average, LD and VDM achieve the best accuracy, indicating that supervised dissimilarities attain better results than unsupervised ones. Among the unsupervised measures, IOF and Lin are slightly superior to the others.

Experimental Result (cont.) • Analysis with confidence intervals (accuracy +/- standard deviation) • LD performed statistically worse than Lin on datasets Splice and Tic-tac-toe but better than Lin on datasets Connection-4, Hayes and Balance Scale. • LD performed statistically worse than VDM on only one dataset (Splice) but better on two datasets (Connection-4 and Tic-tac-toe). • Finally, LD performed statistically at least as well as (and on some datasets, e.g. Connection-4, better than) the remaining methods.

Experimental Result (cont.) • Comparison with Various Classifiers • LD performed statistically worse than the other methods on only one dataset (Splice) but performed better on at least three other datasets than each of the other methods.

Conclusion • A task-oriented (supervised) iterative learning approach that learns a distance function for categorical data. • Explores the relationships between categorical symbols by using the classification error as guidance. • The real-valued mappings found by our algorithm provide discriminative information, which can be used to refine features and improve classification accuracy.

Thank you!
