Bayesian Hierarchical Clustering. Paper by K. Heller and Z. Ghahramani, ICML 2005. Presented by HAO-WEI, YEH. Outline: Background - Traditional Methods; Bayesian Hierarchical Clustering (BHC); Basic ideas; Dirichlet Process Mixture Model (DPM); Algorithm; Experiment results; Conclusion.
Bayesian Hierarchical Clustering • Paper by K. Heller and Z. Ghahramani • ICML 2005 • Presented by HAO-WEI, YEH
Outline • Background - Traditional Methods • Bayesian Hierarchical Clustering (BHC) • Basic ideas • Dirichlet Process Mixture Model (DPM) • Algorithm • Experiment results • Conclusion
Background Traditional Method
Hierarchical Clustering • Given: data points • Output: a tree (series of clusters) • Leaves: data points • Internal nodes: nested clusters • Examples • Evolutionary tree of living organisms • Internet newsgroups • Newswire documents
Traditional Hierarchical Clustering • Bottom-up agglomerative algorithm • Closeness based on given distance measure (e.g. Euclidean distance between cluster means)
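The bottom-up procedure above can be sketched in a few lines. This is only an illustrative sketch, assuming Euclidean distance between cluster means as the closeness measure; the function names `centroid` and `agglomerate` are made up for this example and are not from the paper.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def agglomerate(points):
    """Greedy bottom-up clustering; returns the merge history.

    Each entry is (cluster_a, cluster_b, distance): the clusters are tuples
    of point indices, the distance is measured between their means.
    """
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    history = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the chosen distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([points[i] for i in clusters[a]])
                cb = centroid([points[i] for i in clusters[b]])
                dist = math.dist(ca, cb)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        history.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(points)
for left, right, dist in history:
    print(left, right, round(dist, 3))
```

Note that the loop always runs until a single cluster remains: nothing in the distance values themselves says where the tree should be cut.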
Traditional Hierarchical Clustering (cont’d) • Limitations • No guide to choosing the correct number of clusters, or where to prune the tree. • Distance metric selection (especially for data such as images or sequences) • Evaluation (Probabilistic model) • How to evaluate how good a result is? • How to compare to other models? • How to make predictions and cluster new data with an existing hierarchy?
BHC Bayesian Hierarchical Clustering
Bayesian Hierarchical Clustering • Basic ideas: • Use marginal likelihoods to decide which clusters to merge • P(Data to merge were from the same mixture component) vs. P(Data to merge were from different mixture components) • Generative Model: Dirichlet Process Mixture Model (DPM)
Dirichlet Process Mixture Model (DPM) • Formal Definition • Different Perspectives • Infinite version of Mixture Model (Motivation and Problems) • Stick-breaking Process (what the generated distribution looks like) • Chinese Restaurant Process, Polya urn scheme • Benefits • Conjugate prior • Unlimited clusters • “Rich-Get-Richer”: does it really work? Depends! • Pitman-Yor process, Uniform Process, …
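The Chinese Restaurant Process view can be simulated directly, which makes the "rich-get-richer" behaviour easy to inspect. The sketch below assumes the standard CRP seating rule (an existing table is chosen with weight equal to its current size, a new table with weight α); `sample_crp` is a hypothetical helper name.

```python
import random

def sample_crp(n_customers, alpha, seed=0):
    """Draw one seating arrangement from a CRP with concentration alpha > 0.

    Returns (assignments, table_sizes): the table index chosen by each
    customer, and the final number of customers at each table.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # Existing table t has weight size_t; a new table has weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table (new cluster)
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes

assignments, table_sizes = sample_crp(20, alpha=1.0)
print(assignments)
print(table_sizes)
```

Larger values of `alpha` tend to open more tables; the number of clusters is not fixed in advance.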
BHC Algorithm - Overview • Same as traditional • One-pass, bottom-up method • Initializes each data point in its own cluster, and iteratively merges pairs of clusters. • Difference • Uses a statistical hypothesis test to choose which clusters to merge.
BHC Algorithm - Concepts • Two hypotheses to compare • 1. All data was generated i.i.d. from the same probabilistic model with unknown parameters. • 2. Data has two or more clusters in it.
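Hypothesis H1 • Probability of the data under H1: $p(D_k \mid H_1^k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta = \int \Big[ \prod_{x^{(i)} \in D_k} p(x^{(i)} \mid \theta) \Big]\, p(\theta \mid \beta)\, d\theta$ • $p(\theta \mid \beta)$: prior over the parameters • $D_k$: data in the two trees to be merged • Integral is tractable with conjugate prior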
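As an illustration of why a conjugate prior makes the H1 integral tractable, here is a minimal sketch for binary data, assuming an independent Beta(a, b) prior on each dimension's Bernoulli parameter. The model choice and the names `log_beta` and `log_marginal_bernoulli` are assumptions made for this example only.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_bernoulli(data, a=1.0, b=1.0):
    """log p(D | H1) for a list of binary vectors.

    With a Beta(a, b) prior per dimension, the integral over the Bernoulli
    parameter has the closed form B(a + ones, b + zeros) / B(a, b).
    """
    n = len(data)
    total = 0.0
    for d in range(len(data[0])):
        ones = sum(row[d] for row in data)
        total += log_beta(a + ones, b + n - ones) - log_beta(a, b)
    return total

same = log_marginal_bernoulli([[1], [1]])   # two identical observations
mixed = log_marginal_bernoulli([[1], [0]])  # two different observations
print(math.exp(same), math.exp(mixed))
```

With a uniform prior (a = b = 1), two identical one-dimensional observations have marginal probability 1/3 and two different ones 1/6, so the single-cluster hypothesis explains the identical pair better.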
Hypothesis H2 • Probability of the data under H2: $p(D_k \mid H_2^k) = p(D_i \mid T_i)\, p(D_j \mid T_j)$ • Product over sub-trees $T_i$ and $T_j$
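BHC Algorithm - Working Flow • From Bayes Rule, the posterior probability of the merged hypothesis: $r_k = p(H_1^k \mid D_k) = \frac{\pi_k\, p(D_k \mid H_1^k)}{\pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)}$ • The pair of trees with the highest probability is merged. • Natural place to cut the final tree: where $r_k < 0.5$ • Data number, concentration (DPM) • Hidden features (underlying distribution)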
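A small numerical sketch of the merge test, assuming the posterior takes the form given in the BHC paper: r_k = π_k p(D_k|H1) / [π_k p(D_k|H1) + (1 − π_k) p(D_i|T_i) p(D_j|T_j)]. Everything is kept in log space to avoid underflow; `merge_posterior` and `log_add` are hypothetical helper names.

```python
import math

def log_add(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def merge_posterior(log_pi, log_h1, log_tree_i, log_tree_j):
    """Return (r_k, log p(D_k | T_k)) for a candidate merge.

    log_pi     : log prior of the merged hypothesis
    log_h1     : log p(D_k | H1), all data from one component
    log_tree_i : log p(D_i | T_i) of the first sub-tree
    log_tree_j : log p(D_j | T_j) of the second sub-tree
    """
    log_merged = log_pi + log_h1
    if log_pi >= 0.0:
        log_split = -math.inf  # pi = 1 leaves no mass for the split hypothesis
    else:
        log_split = math.log1p(-math.exp(log_pi)) + log_tree_i + log_tree_j
    log_tree_k = log_add(log_merged, log_split)
    return math.exp(log_merged - log_tree_k), log_tree_k

r_k, log_tree_k = merge_posterior(
    math.log(0.5), math.log(0.2), math.log(0.5), math.log(0.2)
)
print(r_k, math.exp(log_tree_k))
```

In this toy case the merged term is 0.5 × 0.2 = 0.1 and the split term is 0.5 × 0.5 × 0.2 = 0.05, so r_k = 2/3 and the merge would be kept under the 0.5 threshold.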
Tree-Consistent Partitions • Consider the tree (((1 2) 3) 4) and all 15 possible partitions of {1,2,3,4}: (1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4) • (1 2)(3)(4) and (1 2 3)(4) are tree-consistent partitions. • (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.
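Tree-consistent partitions can be enumerated recursively: a subtree is either kept as one cluster, or split into any tree-consistent partition of its left child combined with any of its right child. The sketch below assumes the cascading tree (((1 2) 3) 4), which matches the examples on this slide, encoded as nested tuples; the function names are made up for illustration.

```python
from itertools import product

def leaves(tree):
    """All leaf labels under a nested-tuple binary tree, left to right."""
    if not isinstance(tree, tuple):
        return (tree,)
    left, right = tree
    return leaves(left) + leaves(right)

def tree_consistent_partitions(tree):
    """List every partition of the leaves that is consistent with the tree.

    Each partition is a tuple of clusters; each cluster is a tuple of leaves.
    """
    whole = [(leaves(tree),)]          # keep the whole subtree as one cluster
    if not isinstance(tree, tuple):
        return whole
    left, right = tree
    split = [
        p_left + p_right
        for p_left, p_right in product(
            tree_consistent_partitions(left),
            tree_consistent_partitions(right),
        )
    ]
    return whole + split

tree = (((1, 2), 3), 4)
partitions = tree_consistent_partitions(tree)
for partition in partitions:
    print(partition)
```

For this tree only 4 of the 15 partitions are tree-consistent; for balanced trees the count grows much faster with the number of leaves.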
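Merged Hypothesis Prior (πk) • Based on DPM (CRP perspective) • $\pi_k$ = P(all points belong to one cluster) $= \frac{\alpha\, \Gamma(n_k)}{d_k}$, with $d_k = \alpha\, \Gamma(n_k) + d_{\mathrm{left}_k}\, d_{\mathrm{right}_k}$ (leaves: $d_i = \alpha$, $\pi_i = 1$) • The $d$'s sum the prior mass over all tree-consistent partitions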
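The prior can be computed bottom-up while the tree is built. This sketch assumes the recursion from the BHC paper (leaves: d_i = α, π_i = 1; internal nodes: d_k = αΓ(n_k) + d_left · d_right, π_k = αΓ(n_k) / d_k) and works in log space because Γ(n_k) grows quickly; `node_prior` is a hypothetical helper name.

```python
import math

def log_add(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def node_prior(n_k, log_d_left, log_d_right, alpha):
    """Return (log d_k, log pi_k) for an internal node holding n_k points."""
    log_alpha_gamma = math.log(alpha) + math.lgamma(n_k)  # log(alpha * Gamma(n_k))
    log_d_k = log_add(log_alpha_gamma, log_d_left + log_d_right)
    return log_d_k, log_alpha_gamma - log_d_k

alpha = 1.0
log_d_leaf = math.log(alpha)  # leaves: d_i = alpha, pi_i = 1

# Merge two single points, then merge that pair with a third point.
log_d_pair, log_pi_pair = node_prior(2, log_d_leaf, log_d_leaf, alpha)
log_d_triple, log_pi_triple = node_prior(3, log_d_pair, log_d_leaf, alpha)
print(math.exp(log_pi_pair), math.exp(log_pi_triple))
```

For two points the result is 1 / (1 + α), which agrees with the CRP probability that the second customer joins the first customer's table.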
Predictive Distribution • BHC allows predictive distributions to be defined for new data points. • Note: P(x|D) != P(x|Dk) for the root!?
Approximate Inference for DPM prior • BHC forms a lower bound for the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions. • Idea: deterministically sum over partitions with high probability, thereby accounting for most of the mass. • Compared with MCMC methods, this is deterministic and more efficient.
Learning Hyperparameters • α: Concentration parameter • β: Defines G0 • Learned by recursive gradients and an EM-like method
To Sum Up for BHC • Statistical model for comparing merges and deciding when to stop. • Allows predictive distributions to be defined for new data points. • Approximate inference for the DPM marginal. • Parameters • α: Concentration parameter • β: Defines G0
Unique Aspects of BHC Algorithm • Hierarchical way of organizing nested clusters, not a hierarchical generative model. • Derived from DPM. • Hypothesis test: one vs. many other clusterings (compared to one vs. two clusters at each stage) • Not iterative and does not require sampling (except for learning parameters).
Results from the experiments
Conclusion and some take home notes
Conclusion • Limitations solved!! • No guide to choosing the correct number of clusters, or where to prune the tree -> Natural stop criterion • Distance metric selection -> Model-based criterion • Evaluation, comparison, inference -> Probabilistic model • Some useful results for DPM
Summary • Defines probabilistic model of data, can compute probability of new data point belonging to any cluster in tree. • Model-based criterion to decide on merging clusters. • Bayesian hypothesis testing used to decide which merges are advantageous, and to decide appropriate depth of tree. • Algorithm can be interpreted as approximate inference method for a DPM; gives new lower bound on marginal likelihood by summing over exponentially many clusterings of the data.
Limitations • Inherent greediness • Lack of any incorporation of tree uncertainty • O(n^2) complexity for building tree