Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks
Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I.
In IEEE Trans. on PAMI, 23(6), 2001.
Summarized by Kyu-Baek Hwang
Abstract
• Feature selection for unsupervised learning of conditional Gaussian networks
  • Unsupervised learning of Bayesian networks?
  • Which features are good for the learning task?
• Assessment of the relevance of each feature to the learning process
  • How to determine the cutting threshold?
• Goal: accelerate learning while still obtaining reasonable models
• Experiments: two artificial data sets and two benchmark data sets from the UCI repository
Unsupervised Learning of Conditional Gaussian Networks
• Data clustering = learning a probabilistic graphical model from unlabeled data
  • Cluster membership is a hidden variable.
• Conditional Gaussian networks
  • The cluster variable is an ancestor of all the other variables.
  • The joint probability distribution over all the other variables given the cluster membership is multivariate Gaussian.
• Feature selection in classification ≠ feature selection in clustering
  • In clustering, all the features are eventually considered in order to describe the domain.
Conditional Gaussian Distribution
• Data clustering setting
  • X = (Y, C) = (Y1, …, Yn, C)
• Conditional Gaussian distribution
  • The pdf of Y given C = c is f(y | c) = N(y; μ(c), Σ(c)), whenever p(c) = p(C = c) > 0.
  • Each Σ(c) must be positive definite.
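A minimal sketch of such a conditional Gaussian distribution (not from the paper; the cluster probabilities, means, and covariances below are invented for illustration):

```python
# Conditional Gaussian distribution over X = (Y, C): for each value c of the
# cluster variable C, Y | C = c is multivariate Gaussian with its own mean
# vector mu(c) and positive definite covariance Sigma(c).
import numpy as np
from scipy.stats import multivariate_normal

p_c = np.array([0.5, 0.3, 0.2])                          # p(C = c), each > 0
means = [np.zeros(3), np.ones(3), 2 * np.ones(3)]        # mu(c) per cluster
covs = [np.eye(3), 0.5 * np.eye(3), np.diag([1.0, 2.0, 0.5])]  # Sigma(c), pos. def.

def joint_density(y, c):
    """p(y, c) = p(c) * N(y; mu(c), Sigma(c))."""
    return p_c[c] * multivariate_normal.pdf(y, mean=means[c], cov=covs[c])

y = np.array([0.1, -0.2, 0.3])
print([joint_density(y, c) for c in range(3)])
```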
Conditional Gaussian Networks
• A factorization of the conditional Gaussian distribution
  • The conditional independencies among the variables are encoded by the network structure s.
  • Each local probability distribution of a continuous variable, given its continuous parents and the cluster membership, is linear Gaussian.
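A hedged sketch of the factorization, assuming linear-Gaussian local distributions; the structure and all parameters are made up for illustration:

```python
# Given the cluster label c, each continuous variable Y_i is a linear Gaussian
# of its continuous parents Pa(Y_i) in the structure s, so
# f(y | c) = prod_i N(y_i; m_i(c) + sum_j b_ij(c) * y_j, v_i(c)).
import numpy as np
from scipy.stats import norm

structure = {0: [], 1: [0], 2: [0, 1]}   # Pa(Y1)={}, Pa(Y2)={Y1}, Pa(Y3)={Y1,Y2}

def local_density(i, y, params):
    """N(y_i; m_i + b_i . y_parents, v_i) for one local distribution."""
    m, b, v = params[i]
    mean = m + sum(b_j * y[j] for b_j, j in zip(b, structure[i]))
    return norm.pdf(y[i], loc=mean, scale=np.sqrt(v))

def conditional_density(y, params):
    """f(y | c) as the product of the local linear-Gaussian densities."""
    return np.prod([local_density(i, y, params) for i in structure])

# illustrative parameters for one cluster value c
params_c0 = {0: (0.0, [], 1.0), 1: (0.5, [0.8], 0.7), 2: (-0.2, [0.3, 0.4], 0.5)}
print(conditional_density(np.array([0.1, 0.2, 0.0]), params_c0))
```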
Learning CGNs from Data
• Incomplete data set d (the cluster variable is never observed)
• Structural EM algorithm
Structural EM Algorithm
• Expected score: each candidate model is scored by the expectation of its complete-data score, taken with respect to the current model given the observed data.
• Relaxed version: instead of the exact expectation, the score is evaluated on the expected complete data (the data completed with expected sufficient statistics), as in the next slide.
Scoring Metrics for the Structural Search
• The log marginal likelihood of the expected complete data
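A self-contained toy sketch of the structural EM loop for clustering. It is not the paper's algorithm: a BIC-style penalized expected log likelihood stands in for the log marginal likelihood of the expected complete data, and the "structural search" is just a choice between two candidate structures (no edge vs. an edge between the two features):

```python
# Each iteration (i) completes the data with expected sufficient statistics
# (responsibilities) under the current model, and (ii) picks the structure
# with the best score on the completed data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal([0, 0], 1, (200, 2)), rng.normal([4, 4], 1, (200, 2))])
n, K = len(Y), 2

def fit_and_score(R, diagonal):
    """M-step on responsibilities R (n x K), plus a BIC-style penalized score."""
    w = R.sum(axis=0) / n
    models, loglik, n_params = [], 0.0, 0
    for k in range(K):
        r = R[:, k]
        mu = (r[:, None] * Y).sum(axis=0) / r.sum()
        diff = Y - mu
        cov = (r[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / r.sum()
        if diagonal:
            cov = np.diag(np.diag(cov))        # structure without the Y1-Y2 edge
        models.append((w[k], mu, cov))
        loglik += (r * (np.log(w[k]) + multivariate_normal.logpdf(Y, mu, cov))).sum()
        n_params += 2 + (2 if diagonal else 3)
    return models, loglik - 0.5 * n_params * np.log(n)

R = rng.dirichlet(np.ones(K), size=n)          # random initial data completion
for _ in range(50):
    # structural search step: keep the structure with the best (expected) score
    candidates = {False: fit_and_score(R, False), True: fit_and_score(R, True)}
    diagonal = max(candidates, key=lambda d: candidates[d][1])
    models, score = candidates[diagonal]
    # E-step: recompute responsibilities under the chosen model
    dens = np.column_stack([w * multivariate_normal.pdf(Y, mu, cov) for w, mu, cov in models])
    R = dens / dens.sum(axis=1, keepdims=True)

print('chose diagonal structure:', diagonal, 'score:', round(score, 1))
```

As with any EM-style procedure, the sketch only converges to a local optimum and depends on the random initialization.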
Feature Selection
• Large databases
  • Many instances
  • Many attributes
• Dimensionality reduction is required
  • Select features according to some criterion.
  • The criterion depends on the purpose of learning: learning speed, accurate predictions, or the comprehensibility of the learned models.
• Exhaustive search over the 2^n feature subsets is infeasible
  • Sequential selection (forward or backward); see the sketch below
  • Evolutionary, population-based, randomized search based on EDAs
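A minimal sketch of sequential forward selection, one of the non-exhaustive search strategies listed above. The `score` function is a placeholder for whatever criterion (filter or wrapper) is used:

```python
# Greedily add the feature that most improves the score until no addition helps.
def forward_selection(n_features, score):
    selected = []
    best = score(selected)
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        new_best, f = max((score(selected + [f]), f) for f in candidates)
        if new_best <= best:
            break
        selected, best = selected + [f], new_best
    return selected

# toy usage: the score favors features 0 and 2 and penalizes subset size
print(forward_selection(5, lambda S: sum(1.0 for f in S if f in (0, 2)) - 0.1 * len(S)))
```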
Wrapper and Filter
• Wrapper
  • Feature subsets are tailored to the performance function of the learning process.
  • e.g., predictive accuracy on a test data set
• Filter
  • Based on the intrinsic properties of the data set
  • e.g., the correlation between the class label and each attribute
  • Both are usually formulated for supervised learning.
• Two problems in unsupervised learning
  • Absence of the class label → a different criterion is needed for feature selection.
  • No standard, accepted performance task → e.g., multiple predictive accuracy or class prediction
Feature Selection in Learning CGNs
• Data analysis (clustering) aims at description, not prediction.
  • All the features are necessary for the description.
  • But learning a CGN with many features is time-consuming.
• Three-stage approach (sketched below)
  • Preprocessing: feature selection
  • Learning CGNs on the selected features
  • Postprocessing: addition of the remaining features as conditionally independent given the cluster membership
• The goal: how to measure relevance so as to obtain
  • fast learning time
  • accuracy, measured as the log likelihood of the test data
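A sketch of the three-stage pipeline with the learner and the relevance measure abstracted away; `relevance`, `threshold`, and `learn_cgn` are hypothetical placeholders, not the paper's API:

```python
# (1) keep only features judged relevant, (2) learn the CGN over the kept
# features, (3) re-attach the discarded features as children of the cluster
# variable only, i.e. conditionally independent of the rest given C.
import numpy as np

def select_learn_augment(Y, relevance, threshold, learn_cgn):
    kept = [i for i in range(Y.shape[1]) if relevance(Y, i) > threshold]
    dropped = [i for i in range(Y.shape[1]) if i not in kept]
    model = learn_cgn(Y[:, kept])                      # the expensive step, on fewer features
    for i in dropped:                                  # postprocessing: Y_i depends on C only
        model['extra'][i] = {'parents': ['C'], 'params': None}   # params left unfit in this toy
    return model, kept, dropped

# toy usage with dummy components (relevance = sum of |correlations| with the rest)
Y = np.random.default_rng(1).normal(size=(100, 5))
model, kept, dropped = select_learn_augment(
    Y,
    relevance=lambda Y, i: np.abs(np.corrcoef(Y, rowvar=False)[i]).sum() - 1,
    threshold=0.3,
    learn_cgn=lambda Ysub: {'features': Ysub.shape[1], 'extra': {}},
)
print(kept, dropped)
```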
Relevance
• Features that exhibit low correlation with the rest of the features can be considered irrelevant to the learning process.
  • They are modeled as conditionally independent of the rest given the cluster membership.
• A first attempt at this kind of relevance assessment in the continuous domain
Relevance Measure
• The relevance measure is built on the edge exclusion test.
  • Null hypothesis: the edge between Yi and Yj can be excluded (their partial correlation given the rest is zero).
  • The test statistic is based on r²ij|rest, the squared sample partial correlation of Yi and Yj given the rest of the variables.
  • The partial correlations are obtained from the maximum likelihood estimates (MLEs) of the elements of the inverse variance matrix.
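A sketch of the standard edge exclusion quantities behind the relevance measure. From the MLE of the precision (inverse variance) matrix W, the sample partial correlation is r_ij|rest = -w_ij / sqrt(w_ii * w_jj), and the usual deviance statistic for excluding the edge i-j is -n * log(1 - r_ij|rest²); how the paper aggregates these into a per-feature relevance score is not reproduced here:

```python
import numpy as np

def edge_exclusion_statistics(Y):
    n, d = Y.shape
    W = np.linalg.inv(np.cov(Y, rowvar=False, bias=True))   # MLE of the inverse variance matrix
    dW = np.sqrt(np.diag(W))
    partial = -W / np.outer(dW, dW)                          # sample partial correlations
    np.fill_diagonal(partial, 0.0)                           # diagonal zeroed for convenience
    stats = -n * np.log(1.0 - partial ** 2)                  # edge exclusion test statistics
    return partial, stats

Y = np.random.default_rng(2).normal(size=(500, 4))
partial, stats = edge_exclusion_statistics(Y)
print(np.round(stats, 2))    # large values suggest the corresponding edge is real
```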
Relevance Threshold
• Based on the distribution of the test statistic under the null hypothesis
  • G(x): pdf of a χ²₁ random variable
• A 5 percent test is used.
• Solving the resulting equation for the threshold is an optimization problem.
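A sketch of a 5 percent cutoff for the test statistic. Under the simplest approximation the statistic is χ²₁, so the critical value is just the 0.95 quantile; the paper instead works with a refined approximation of the statistic's distribution and solves for the threshold numerically, for which the root finder below merely stands in:

```python
from scipy.stats import chi2
from scipy.optimize import brentq

alpha = 0.05
threshold_chi2 = chi2.ppf(1 - alpha, df=1)     # ~3.84, naive chi-square cutoff

# Illustrative "solve P(T > t) = alpha numerically" version; here the tail
# function is just the chi-square tail again, so the root matches the quantile.
tail = lambda t: chi2.sf(t, df=1) - alpha
threshold_refined = brentq(tail, 1e-6, 50.0)

print(round(threshold_chi2, 3), round(threshold_refined, 3))
```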
Experimental Settings
• Model specifications
  • Tree augmented naive Bayes (TANB) models
  • Each predictive attribute may have, at most, one other predictive attribute as a parent, in addition to the cluster variable C.
  • An example structure is sketched below.
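A small made-up example of a TANB structure and a check of its defining constraint (the attribute names and edges are illustrative only):

```python
# Every attribute has C as a parent and at most one other attribute as an
# extra parent; the extra edges form a tree over the attributes.
structure = {
    'Y1': ['C'],              # root of the attribute tree
    'Y2': ['C', 'Y1'],
    'Y3': ['C', 'Y1'],
    'Y4': ['C', 'Y3'],
}

def is_tanb(structure):
    """Check the TANB constraint: C plus at most one attribute parent each."""
    for parents in structure.values():
        attr_parents = [p for p in parents if p != 'C']
        if 'C' not in parents or len(attr_parents) > 1:
            return False
    return True

print(is_tanb(structure))   # True
```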
Data Sets
• Synthetic data sets (4,000 training : 1,000 test instances)
  • A TANB model with 25 (15:14[-1, 1]) attributes, (0, 4, 8), 1; C: uniform, (0, 1)
  • A TANB model with 30 (15:14[-1, 1]) attributes, (0, 4, 8), 2; C: uniform, (0, 5)
• Waveform (artificial data, 4,000 : 1,000)
  • 3 clusters, 40 attributes; the last 19 are noise attributes
• Pima
  • 768 cases (700 training : 68 test)
  • 8 attributes
Performance Criteria
• The log marginal likelihood of the training data
• Multiple predictive accuracy
  • A probabilistic version of the standard multiple predictive accuracy
• Runtime
• Experimental protocol
  • 10 independent runs for the synthetic data sets and the waveform data
  • 50 independent runs for the Pima data
  • All runs on a Pentium 366 MHz machine
Conclusions and Future Work
• Relevance assessment for feature selection in unsupervised learning, in the continuous domain
  • Reasonable learning performance is obtained.
• Future work
  • Extension to the categorical domain
  • The redundant feature problem
  • Relaxation of the model structure
  • More realistic data sets