Generalized Model Selection for Unsupervised Learning in High Dimension. Vaithyanathan and Dom, IBM Almaden Research Center, NIPS '99
Abstract • Bayesian approach to model selection in unsupervised learning • proposes a unified objective function whose arguments include both the feature space and the number of clusters • determines the feature set (dividing the feature set into noise features and useful features) • determines the number of clusters • marginal likelihood under the Bayesian scheme vs. cross-validation (cross-validated likelihood) • DC (distributional clustering of terms) for initial feature selection
Model Selection in Clustering • Bayesian approaches 1), cross-validation techniques 2), MDL approaches 3) • Need for a unified objective function: the optimal number of clusters depends on the feature space in which the clustering is performed • cf. feature selection in clustering
Model Selection in Clustering (Cont'd) • Generalized model for clustering • data D = {d1,…,dn}, feature space T with dimension M • maximize the likelihood P(D^T | Ω, Θ), where Ω (with parameter set Θ) is the structure of the model: the # of clusters, the partitioning of the feature set into U (useful set) and N (noise set), and the assignment of patterns to clusters • Bayesian approach to model selection • regularization using the marginal likelihood
Bayesian Approach to Model Selection for Clustering • Data: D = {d1,…,dn}, feature space T with dimension M • Clustering D: finding Ω and Θ such that P(D | Ω, Θ) is maximized, where Ω is the structure of the model and Θ is the set of all parameter vectors • the model structure Ω consists of the # of clusters, the partitioning of the feature set, and the assignment of patterns to clusters (the objective is written out below)
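As a point of reference, the maximum-likelihood clustering objective implied by this slide can be written out as follows; this is a minimal LaTeX reconstruction using the Ω, Θ notation above, not a formula quoted from the paper:

(\hat{\Omega}, \hat{\Theta}) \;=\; \arg\max_{\Omega,\, \Theta} \; P(D \mid \Omega, \Theta)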
Plain maximum likelihood lacks regularization • regularize by using the marginal (integrated) likelihood instead • Assumptions: 1. The feature sets of T represented by U and N are conditionally independent, so the likelihood of the data factorizes over U and N. 2. The data D = {d1,…,dn} are i.i.d.
3. All parameter vectors are independent. • The marginal likelihood accounts for model complexity but is computationally very expensive • prune the search space by reducing the number of feature partitions • Approximations to the Marginal Likelihood / Stochastic Complexity (a reconstruction of the integrated likelihood is sketched below)
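A hedged reconstruction of the integrated likelihood referred to above: the factorization over the noise set N and the K clusters of the useful set U follows from assumptions 1-3, but the notation (D^N for the noise-feature data, D_k^U for cluster k's useful-feature data) is assumed here rather than quoted from the paper:

P(D \mid \Omega) \;=\; \int P(D \mid \Omega, \Theta)\, \pi(\Theta \mid \Omega)\, d\Theta
\;=\; \int P(D^{N} \mid \theta_{N})\, \pi(\theta_{N})\, d\theta_{N}
\;\prod_{k=1}^{K} \int P(D^{U}_{k} \mid \theta_{k})\, \pi(\theta_{k})\, d\theta_{k}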
Document Clustering • Marginal likelihood (Eq. 11): adapt multinomial models using term counts as the features, assuming Dirichlet priors π(·), which are conjugate to the multinomial • NLML (Negative Log Marginal Likelihood) is used as the objective (a sketch follows)
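A minimal sketch of the Dirichlet-multinomial marginal likelihood for one cluster's aggregated term counts; the symmetric prior value alpha and the helper name are illustrative assumptions, not the paper's exact formulation:

import numpy as np
from scipy.special import gammaln

def neg_log_marginal_likelihood(counts, alpha=1.0):
    """NLML of term counts under a multinomial with a symmetric Dirichlet(alpha) prior.

    counts: 1-D array of aggregated term counts for one cluster.
    Standard Dirichlet-multinomial marginal (up to the multinomial coefficient,
    which is constant across candidate models):
      P(counts) = Gamma(M*alpha)/Gamma(N + M*alpha) * prod_j Gamma(n_j + alpha)/Gamma(alpha)
    """
    counts = np.asarray(counts, dtype=float)
    m = counts.size                      # number of features (terms)
    n = counts.sum()                     # total number of tokens in the cluster
    log_ml = (gammaln(m * alpha) - gammaln(n + m * alpha)
              + np.sum(gammaln(counts + alpha) - gammaln(alpha)))
    return -log_ml

# Candidate clusterings can then be compared by summing the NLML over their clusters:
# nlml = sum(neg_log_marginal_likelihood(c) for c in per_cluster_counts)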
Document Clustering (Cont'd) • Cross-Validated likelihood: hold out part of the data, fit the model on the rest, and score the held-out part (a sketch of the procedure follows)
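A hedged sketch of the cross-validated likelihood idea for choosing the number of clusters. GaussianMixture is used here only as a stand-in density model (the paper works with multinomial document models), and the fold count is an assumption:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cross_validated_loglik(X, n_clusters, n_folds=5, seed=0):
    """Average held-out log-likelihood per sample for a given number of clusters."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=seed).split(X):
        gm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(X[train_idx])
        scores.append(gm.score(X[test_idx]))   # mean per-sample log-likelihood on held-out data
    return float(np.mean(scores))

# Pick the number of clusters with the highest cross-validated likelihood:
# best_k = max(range(2, 11), key=lambda k: cross_validated_loglik(X, k))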
Distributional clustering for feature subset selection • heuristic method to obtain a subset of tokens that are topical and can be used as features in the bag-of-words model to cluster documents • reduce the feature-space size from M to C • by clustering words based on their distributions over the documents • A histogram for each token (a sketch of its construction follows) • first bin: # of documents with zero occurrences of the token • second bin: # of documents with a single occurrence of the token • third bin: # of documents that contain two or more occurrences of the term
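A minimal sketch of the per-token histogram described above, computed from a document-term count matrix; the function name and the normalization to probabilities are assumptions made so the histograms can be compared with KL distance in the next step:

import numpy as np

def token_histograms(doc_term_counts, smoothing=1e-12):
    """3-bin histogram per token: P(0 occurrences), P(exactly 1), P(2 or more) over documents.

    doc_term_counts: (n_docs, n_terms) array of raw term counts.
    Returns an (n_terms, 3) array of probabilities (rows sum to 1).
    """
    zero = (doc_term_counts == 0).sum(axis=0)
    one = (doc_term_counts == 1).sum(axis=0)
    two_plus = (doc_term_counts >= 2).sum(axis=0)
    hist = np.stack([zero, one, two_plus], axis=1).astype(float) + smoothing
    return hist / hist.sum(axis=1, keepdims=True)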
DC for feature subset selection (Cont'd) • measure of similarity of the histograms: relative entropy, i.e. the K-L distance D(·||·) • e.g. for two terms with probability distributions p1(·), p2(·), D(p1||p2) = Σ_x p1(x) log(p1(x)/p2(x)) • k-means-style distributional clustering of the histograms (a sketch follows)
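A hedged sketch of k-means-style distributional clustering of the token histograms under the KL distance; the centroid update (averaging member histograms) is an assumed choice rather than the paper's stated rule, and the histograms are assumed to be smoothed so all entries are positive:

import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) between two discrete distributions (all entries > 0)."""
    return float(np.sum(p * np.log(p / q)))

def kmeans_dc(histograms, n_word_clusters=3, n_iters=20, seed=0):
    """k-means-style distributional clustering of (n_terms, 3) token histograms."""
    rng = np.random.default_rng(seed)
    centroids = histograms[rng.choice(len(histograms), n_word_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each token to the nearest centroid in KL distance.
        labels = np.array([np.argmin([kl(h, c) for c in centroids]) for h in histograms])
        # Recompute each centroid as the mean histogram of its word cluster (assumed update).
        for k in range(n_word_clusters):
            if np.any(labels == k):
                centroids[k] = histograms[labels == k].mean(axis=0)
    return labels, centroids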
Experimental Setup • AP Reuters Newswire articles from TREC-6 • 8235 documents from the routing track, 25 classes; documents belonging to multiple classes were disregarded • 32450 unique terms (discarding terms that appeared in fewer than 3 documents) • Evaluation measure of clustering: MI (mutual information between cluster assignments and class labels; a sketch follows)
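MI is read here as the mutual information between the induced cluster assignments and the known class labels; a minimal sketch under that assumption:

import numpy as np

def mutual_information(class_labels, cluster_labels):
    """MI (in nats) between true class labels and cluster assignments."""
    classes, cls_idx = np.unique(class_labels, return_inverse=True)
    clusters, clu_idx = np.unique(cluster_labels, return_inverse=True)
    joint = np.zeros((len(classes), len(clusters)))
    for i, j in zip(cls_idx, clu_idx):
        joint[i, j] += 1                          # joint counts of (class, cluster)
    joint /= joint.sum()
    p_class = joint.sum(axis=1, keepdims=True)    # marginal over classes
    p_cluster = joint.sum(axis=0, keepdims=True)  # marginal over clusters
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (p_class @ p_cluster)[nz])))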
Results of Distributional Clustering • cluster the 32450 tokens into 3, 4, and 5 word clusters • eliminate the function-word clusters. [Figure 1: centroid of a typical high-frequency function-words cluster]
Finding the Optimum Features and Document Clusters for a Fixed Number of Clusters • Now apply the objective function (Eq. 11) to the feature subsets selected by DC • EM/CEM (Classification EM: a hard-assignment version of EM) 1) • initialization: k-means algorithm (a sketch of one CEM loop follows)
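A hedged sketch of a CEM loop for a multinomial mixture over term counts; the random initialization, the smoothing constant, and the function name are assumptions (the slide initializes with k-means), and this illustrates the hard-assignment idea rather than the paper's exact procedure:

import numpy as np

def cem_multinomial(doc_term_counts, n_clusters, n_iters=30, alpha=1e-2, seed=0):
    """CEM for a multinomial mixture: score each document under every cluster (E-step),
    hard-assign it to the best cluster (C-step), then re-estimate smoothed term
    probabilities and mixing weights per cluster (M-step)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = doc_term_counts.shape
    labels = rng.integers(n_clusters, size=n_docs)   # random init (the slide uses k-means)
    for _ in range(n_iters):
        # M-step: per-cluster term distributions and mixing weights.
        theta = np.vstack([doc_term_counts[labels == k].sum(axis=0) + alpha
                           for k in range(n_clusters)])
        theta /= theta.sum(axis=1, keepdims=True)
        log_pi = np.log(np.bincount(labels, minlength=n_clusters) + 1.0) - np.log(n_docs + n_clusters)
        # E-step + C-step: hard assignment to the highest-scoring cluster.
        log_scores = doc_term_counts @ np.log(theta).T + log_pi
        labels = np.argmax(log_scores, axis=1)
    return labels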
Comparison of feature-selection heuristics (a sketch of the frequency-based variants follows) • FBTop20: removal of the top 20% of the most frequent terms • FBTop40: removal of the top 40% of the most frequent terms • FBTop40Bot10: removal of the top 40% of the most frequent terms and removal of all tokens that do not appear in at least 10 documents • NF: no feature selection • CSW: common stop words removed
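A minimal sketch of the frequency-based (FB) heuristics named above; "frequency" is read here as document frequency, which is an assumption, and the helper name is illustrative:

import numpy as np

def frequency_based_selection(doc_term_counts, top_fraction=0.4, min_doc_freq=0):
    """Boolean mask over terms: drop the top fraction by document frequency and,
    optionally, terms appearing in fewer than min_doc_freq documents."""
    doc_freq = (doc_term_counts > 0).sum(axis=0)
    n_drop = int(top_fraction * doc_term_counts.shape[1])
    top_terms = np.argsort(doc_freq)[::-1][:n_drop]          # the most frequent terms
    keep = np.ones(doc_term_counts.shape[1], dtype=bool)
    keep[top_terms] = False
    keep &= doc_freq >= min_doc_freq
    return keep

# FBTop40Bot10 as described above:
# mask = frequency_based_selection(X, top_fraction=0.4, min_doc_freq=10)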