Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks
Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I.
In IEEE Trans. on PAMI, 23(6), 2001.
Summarized by Kyu-Baek Hwang
Abstract
• Feature selection for unsupervised learning of conditional Gaussian networks
  • Unsupervised learning of Bayesian networks?
  • Which features are good for the learning task?
• Assessment of the relevance of each feature to the learning process
  • How to determine the cutting threshold?
• Goal: accelerate learning while still obtaining reasonable models
• Experiments: two artificial data sets and two benchmark data sets from the UCI repository
Unsupervised Learning of Conditional Gaussian Networks
• Data clustering = learning a probabilistic graphical model from unlabeled data
  • Cluster membership is a hidden variable.
• Conditional Gaussian networks
  • The cluster variable is an ancestor of all the other variables.
  • The joint probability distribution over all the other variables given the cluster membership is multivariate Gaussian.
• Feature selection in classification ≠ feature selection in clustering
  • In clustering, all the features are eventually considered in order to describe the domain.
Conditional Gaussian Distribution
• Data clustering setting
  • X = (Y, C) = (Y1, …, Yn, C)
• Conditional Gaussian distribution
  • The pdf of Y given C = c is f(y | c) = N(y; μ(c), Σ(c)), whenever p(c) = p(C = c) > 0.
  • Each Σ(c) must be positive definite.
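A minimal sketch of such a conditional Gaussian distribution (not from the paper; the cluster probabilities, means, and covariances below are invented for illustration):

```python
# Conditional Gaussian distribution over X = (Y, C): for each value c of the
# cluster variable C, Y | C = c is multivariate Gaussian with its own mean
# vector mu(c) and positive definite covariance Sigma(c).
import numpy as np
from scipy.stats import multivariate_normal

p_c = np.array([0.5, 0.3, 0.2])                          # p(C = c), each > 0
means = [np.zeros(3), np.ones(3), 2 * np.ones(3)]        # mu(c) per cluster
covs = [np.eye(3), 0.5 * np.eye(3), np.diag([1.0, 2.0, 0.5])]  # Sigma(c), pos. def.

def joint_density(y, c):
    """p(y, c) = p(c) * N(y; mu(c), Sigma(c))."""
    return p_c[c] * multivariate_normal.pdf(y, mean=means[c], cov=covs[c])

y = np.array([0.1, -0.2, 0.3])
print([joint_density(y, c) for c in range(3)])
```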
Conditional Gaussian Networks
• A factorization of the conditional Gaussian distribution
  • The conditional independencies among the variables are encoded by the network structure s.
  • Each local probability distribution of a continuous variable, given its continuous parents and the cluster membership, is linear Gaussian.
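A hedged sketch of the factorization, assuming linear-Gaussian local distributions; the structure and all parameters are made up for illustration:

```python
# Given the cluster label c, each continuous variable Y_i is a linear Gaussian
# of its continuous parents Pa(Y_i) in the structure s, so
# f(y | c) = prod_i N(y_i; m_i(c) + sum_j b_ij(c) * y_j, v_i(c)).
import numpy as np
from scipy.stats import norm

structure = {0: [], 1: [0], 2: [0, 1]}   # Pa(Y1)={}, Pa(Y2)={Y1}, Pa(Y3)={Y1,Y2}

def local_density(i, y, params):
    """N(y_i; m_i + b_i . y_parents, v_i) for one local distribution."""
    m, b, v = params[i]
    mean = m + sum(b_j * y[j] for b_j, j in zip(b, structure[i]))
    return norm.pdf(y[i], loc=mean, scale=np.sqrt(v))

def conditional_density(y, params):
    """f(y | c) as the product of the local linear-Gaussian densities."""
    return np.prod([local_density(i, y, params) for i in structure])

# illustrative parameters for one cluster value c
params_c0 = {0: (0.0, [], 1.0), 1: (0.5, [0.8], 0.7), 2: (-0.2, [0.3, 0.4], 0.5)}
print(conditional_density(np.array([0.1, 0.2, 0.0]), params_c0))
```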
Learning CGNs from Data
• Incomplete data set d (the cluster variable is never observed)
• Structural EM algorithm
Structural EM Algorithm
• Expected score: each candidate model is scored by the expectation of its complete-data score, taken with respect to the current model given the observed data.
• Relaxed version: instead of the exact expectation, the score is evaluated on the expected complete data (the data completed with expected sufficient statistics), as in the next slide.
Scoring Metrics for the Structural Search
• The log marginal likelihood of the expected complete data
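A self-contained toy sketch of the structural EM loop for clustering. It is not the paper's algorithm: a BIC-style penalized expected log likelihood stands in for the log marginal likelihood of the expected complete data, and the "structural search" is just a choice between two candidate structures (no edge vs. an edge between the two features):

```python
# Each iteration (i) completes the data with expected sufficient statistics
# (responsibilities) under the current model, and (ii) picks the structure
# with the best score on the completed data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal([0, 0], 1, (200, 2)), rng.normal([4, 4], 1, (200, 2))])
n, K = len(Y), 2

def fit_and_score(R, diagonal):
    """M-step on responsibilities R (n x K), plus a BIC-style penalized score."""
    w = R.sum(axis=0) / n
    models, loglik, n_params = [], 0.0, 0
    for k in range(K):
        r = R[:, k]
        mu = (r[:, None] * Y).sum(axis=0) / r.sum()
        diff = Y - mu
        cov = (r[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / r.sum()
        if diagonal:
            cov = np.diag(np.diag(cov))        # structure without the Y1-Y2 edge
        models.append((w[k], mu, cov))
        loglik += (r * (np.log(w[k]) + multivariate_normal.logpdf(Y, mu, cov))).sum()
        n_params += 2 + (2 if diagonal else 3)
    return models, loglik - 0.5 * n_params * np.log(n)

R = rng.dirichlet(np.ones(K), size=n)          # random initial data completion
for _ in range(50):
    # structural search step: keep the structure with the best (expected) score
    candidates = {False: fit_and_score(R, False), True: fit_and_score(R, True)}
    diagonal = max(candidates, key=lambda d: candidates[d][1])
    models, score = candidates[diagonal]
    # E-step: recompute responsibilities under the chosen model
    dens = np.column_stack([w * multivariate_normal.pdf(Y, mu, cov) for w, mu, cov in models])
    R = dens / dens.sum(axis=1, keepdims=True)

print('chose diagonal structure:', diagonal, 'score:', round(score, 1))
```

As with any EM-style procedure, the sketch only converges to a local optimum and depends on the random initialization.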
Feature Selection
• Large databases
  • Many instances
  • Many attributes
• Dimensionality reduction is required
  • Select features according to some criterion.
  • The criterion depends on the purpose of learning: learning speed, accurate predictions, or the comprehensibility of the learned models.
• Exhaustive search over the 2^n feature subsets is infeasible
  • Sequential selection (forward or backward); see the sketch below
  • Evolutionary, population-based, randomized search based on EDAs
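A minimal sketch of sequential forward selection, one of the non-exhaustive search strategies listed above. The `score` function is a placeholder for whatever criterion (filter or wrapper) is used:

```python
# Greedily add the feature that most improves the score until no addition helps.
def forward_selection(n_features, score):
    selected = []
    best = score(selected)
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        new_best, f = max((score(selected + [f]), f) for f in candidates)
        if new_best <= best:
            break
        selected, best = selected + [f], new_best
    return selected

# toy usage: the score favors features 0 and 2 and penalizes subset size
print(forward_selection(5, lambda S: sum(1.0 for f in S if f in (0, 2)) - 0.1 * len(S)))
```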
Wrapper and Filter
• Wrapper
  • Feature subsets are tailored to the performance function of the learning process.
  • e.g., predictive accuracy on a test data set
• Filter
  • Based on the intrinsic properties of the data set
  • e.g., the correlation between the class label and each attribute
  • Both are usually formulated for supervised learning.
• Two problems in unsupervised learning
  • Absence of the class label → a different criterion is needed for feature selection.
  • No standard, accepted performance task → e.g., multiple predictive accuracy or class prediction
Feature Selection in Learning CGNs
• Data analysis (clustering) aims at description, not prediction.
  • All the features are necessary for the description.
  • But learning a CGN with many features is time-consuming.
• Three-stage approach (sketched below)
  • Preprocessing: feature selection
  • Learning CGNs on the selected features
  • Postprocessing: addition of the remaining features as conditionally independent given the cluster membership
• The goal: how to measure relevance so as to obtain
  • fast learning time
  • accuracy, measured as the log likelihood of the test data
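A sketch of the three-stage pipeline with the learner and the relevance measure abstracted away; `relevance`, `threshold`, and `learn_cgn` are hypothetical placeholders, not the paper's API:

```python
# (1) keep only features judged relevant, (2) learn the CGN over the kept
# features, (3) re-attach the discarded features as children of the cluster
# variable only, i.e. conditionally independent of the rest given C.
import numpy as np

def select_learn_augment(Y, relevance, threshold, learn_cgn):
    kept = [i for i in range(Y.shape[1]) if relevance(Y, i) > threshold]
    dropped = [i for i in range(Y.shape[1]) if i not in kept]
    model = learn_cgn(Y[:, kept])                      # the expensive step, on fewer features
    for i in dropped:                                  # postprocessing: Y_i depends on C only
        model['extra'][i] = {'parents': ['C'], 'params': None}   # params left unfit in this toy
    return model, kept, dropped

# toy usage with dummy components (relevance = sum of |correlations| with the rest)
Y = np.random.default_rng(1).normal(size=(100, 5))
model, kept, dropped = select_learn_augment(
    Y,
    relevance=lambda Y, i: np.abs(np.corrcoef(Y, rowvar=False)[i]).sum() - 1,
    threshold=0.3,
    learn_cgn=lambda Ysub: {'features': Ysub.shape[1], 'extra': {}},
)
print(kept, dropped)
```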
Relevance
• Features that exhibit low correlation with the rest of the features can be considered irrelevant to the learning process.
  • They are modeled as conditionally independent of the rest given the cluster membership.
• A first attempt at this kind of relevance assessment in the continuous domain
Relevance Measure
• The relevance measure is built on the edge exclusion test.
  • Null hypothesis: the edge between Yi and Yj can be excluded (their partial correlation given the rest is zero).
  • The test statistic is based on r²ij|rest, the squared sample partial correlation of Yi and Yj given the rest of the variables.
  • The partial correlations are obtained from the maximum likelihood estimates (MLEs) of the elements of the inverse variance matrix.
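A sketch of the standard edge exclusion quantities behind the relevance measure. From the MLE of the precision (inverse variance) matrix W, the sample partial correlation is r_ij|rest = -w_ij / sqrt(w_ii * w_jj), and the usual deviance statistic for excluding the edge i-j is -n * log(1 - r_ij|rest²); how the paper aggregates these into a per-feature relevance score is not reproduced here:

```python
import numpy as np

def edge_exclusion_statistics(Y):
    n, d = Y.shape
    W = np.linalg.inv(np.cov(Y, rowvar=False, bias=True))   # MLE of the inverse variance matrix
    dW = np.sqrt(np.diag(W))
    partial = -W / np.outer(dW, dW)                          # sample partial correlations
    np.fill_diagonal(partial, 0.0)                           # diagonal zeroed for convenience
    stats = -n * np.log(1.0 - partial ** 2)                  # edge exclusion test statistics
    return partial, stats

Y = np.random.default_rng(2).normal(size=(500, 4))
partial, stats = edge_exclusion_statistics(Y)
print(np.round(stats, 2))    # large values suggest the corresponding edge is real
```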
Relevance Threshold
• Based on the distribution of the test statistic under the null hypothesis
  • G(x): pdf of a χ²₁ random variable
• A 5 percent test is used.
• Solving the resulting equation for the threshold is an optimization problem.
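A sketch of a 5 percent cutoff for the test statistic. Under the simplest approximation the statistic is χ²₁, so the critical value is just the 0.95 quantile; the paper instead works with a refined approximation of the statistic's distribution and solves for the threshold numerically, for which the root finder below merely stands in:

```python
from scipy.stats import chi2
from scipy.optimize import brentq

alpha = 0.05
threshold_chi2 = chi2.ppf(1 - alpha, df=1)     # ~3.84, naive chi-square cutoff

# Illustrative "solve P(T > t) = alpha numerically" version; here the tail
# function is just the chi-square tail again, so the root matches the quantile.
tail = lambda t: chi2.sf(t, df=1) - alpha
threshold_refined = brentq(tail, 1e-6, 50.0)

print(round(threshold_chi2, 3), round(threshold_refined, 3))
```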
Experimental Settings
• Model specifications
  • Tree augmented naive Bayes (TANB) models
  • Each predictive attribute may have, at most, one other predictive attribute as a parent, in addition to the cluster variable C.
  • An example structure is sketched below.
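A small made-up example of a TANB structure and a check of its defining constraint (the attribute names and edges are illustrative only):

```python
# Every attribute has C as a parent and at most one other attribute as an
# extra parent; the extra edges form a tree over the attributes.
structure = {
    'Y1': ['C'],              # root of the attribute tree
    'Y2': ['C', 'Y1'],
    'Y3': ['C', 'Y1'],
    'Y4': ['C', 'Y3'],
}

def is_tanb(structure):
    """Check the TANB constraint: C plus at most one attribute parent each."""
    for parents in structure.values():
        attr_parents = [p for p in parents if p != 'C']
        if 'C' not in parents or len(attr_parents) > 1:
            return False
    return True

print(is_tanb(structure))   # True
```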
Data Sets
• Synthetic data sets (4,000 training : 1,000 test instances)
  • A TANB model with 25 (15:14[-1, 1]) attributes, (0, 4, 8), 1; C: uniform, (0, 1)
  • A TANB model with 30 (15:14[-1, 1]) attributes, (0, 4, 8), 2; C: uniform, (0, 5)
• Waveform (artificial data, 4,000 : 1,000)
  • 3 clusters, 40 attributes; the last 19 are noise attributes
• Pima
  • 768 cases (700 training : 68 test)
  • 8 attributes
Performance Criteria
• The log marginal likelihood of the training data
• Multiple predictive accuracy
  • A probabilistic version of the standard multiple predictive accuracy
• Runtime
• Experimental protocol
  • 10 independent runs for the synthetic data sets and the waveform data
  • 50 independent runs for the Pima data
  • All runs on a Pentium 366 MHz machine
Conclusions and Future Work
• Relevance assessment for feature selection in unsupervised learning, in the continuous domain
  • Reasonable learning performance is obtained.
• Future work
  • Extension to the categorical domain
  • The redundant feature problem
  • Relaxation of the model structure
  • More realistic data sets