University of Turin, Italy, Department of Computer Science. Parameter-free Hierarchical Co-Clustering by n-Ary Splits. Dino Ienco, Ruggero G. Pensa and Rosa Meo, {ienco, pensa, meo}@di.unito.it. ECML-PKDD 2009 – Bled (Slovenia).
Outline
- Motivations
- Our Idea
- Co-Clustering and Background
- Hierarchical Co-Clustering
- Results
- Conclusions
MOTIVATIONS
Co-Clustering:
- Effective approach that obtains interesting results
- Commonly involved with high-dimensional data
- Partitions rows and columns simultaneously
MOTIVATIONS
Many co-clustering algorithms:
- Spectral approach (Dhillon et al. KDD01)
- Information theoretic approach (Dhillon et al. KDD03)
- Minimum Sum-Squared Residue approach (Cho et al. SDM04)
- Bayesian approach (Shan et al. ICDM08)
MOTIVATIONS
All previous techniques:
- Require the number of row/column clusters as a parameter
- Produce flat partitions, without any structural information
MOTIVATIONS
In general:
- Parameters are difficult to set
- Structured output (like hierarchies) helps the user to understand data
Hierarchical structures are useful to:
- Index and visualize data
- Explore the parent-child relationships
- Derive generalization/specialization concepts
OUR IDEA
Co-clustering + hierarchical approach: allows building two hierarchies on both dimensions simultaneously
Proposed approach:
- Extend a previous flat co-clustering algorithm (Robardet02) to the hierarchical setting
CO-CLUSTERING
Background: τ-CoClust (Robardet02)
- Co-clustering for counting or frequency data
- No number of row/column clusters needed
- Maximizes a statistical measure, the Goodman and Kruskal τ, between the row and column partitions
CO-CLUSTERING
Goodman and Kruskal τ:
- Measures the proportional reduction in the prediction error of a dependent variable given an independent variable
Example partitions: CO1 = {O1, O2}, CO2 = {O3, O4}; CF1 = {F2}, CF2 = {F1, F3}
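To make the example partitions concrete, the sketch below aggregates a small object-by-feature count matrix into the co-cluster contingency table on which τ is evaluated. The counts are hypothetical (the data shown on the slide is not available here), and the name `cocluster_table` is an illustrative choice, not taken from the talk.

```python
import numpy as np

def cocluster_table(data, row_labels, col_labels):
    """Sum the entries of `data` inside each (row cluster, column cluster) block."""
    data = np.asarray(data, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    table = np.zeros((row_labels.max() + 1, col_labels.max() + 1))
    # np.add.at accumulates every cell into the block selected by its labels.
    np.add.at(table, (row_labels[:, None], col_labels[None, :]), data)
    return table

# Hypothetical counts for objects O1..O4 (rows) and features F1..F3 (columns).
data = [[1, 5, 0],
        [0, 4, 1],
        [3, 0, 4],
        [2, 1, 3]]
row_labels = [0, 0, 1, 1]  # CO1 = {O1, O2}, CO2 = {O3, O4}
col_labels = [1, 0, 1]     # CF1 = {F2},     CF2 = {F1, F3}
table = cocluster_table(data, row_labels, col_labels)
```

Each cell of `table` holds the total count shared by one row cluster and one column cluster; rows correspond to CO1, CO2 and columns to CF1, CF2.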
CO-CLUSTERING
Goodman and Kruskal τ:
- ECO: prediction error on CO without knowledge about the CF partition
- ECO|CF: prediction error on CO with knowledge about the CF partition
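Read as a proportional reduction in error, the measure is τCO|CF = (ECO − ECO|CF) / ECO. A minimal sketch follows, using the standard Goodman-Kruskal definitions of the two errors on a contingency table of co-cluster totals: ECO = 1 − Σi pi.² and ECO|CF = 1 − Σi Σj pij² / p.j. The function name is illustrative, and the sketch assumes at least two non-empty row clusters and no all-zero column cluster.

```python
import numpy as np

def goodman_kruskal_tau(table):
    """tau for predicting the row cluster (CO) given the column cluster (CF)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                    # joint proportions p_ij
    p_row = p.sum(axis=1)              # row-cluster marginals p_i.
    p_col = p.sum(axis=0)              # column-cluster marginals p_.j
    e_co = 1.0 - np.sum(p_row ** 2)    # error without knowing CF
    # Broadcasting divides every column j of p**2 by p_col[j].
    e_co_given_cf = 1.0 - np.sum(p ** 2 / p_col)
    return (e_co - e_co_given_cf) / e_co
```

Since τ is asymmetric, the companion value τCF|CO can be obtained by passing the transposed table.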
CO-CLUSTERING
Optimization strategy:
- τ is asymmetrical; for this reason the algorithm alternates the optimization of two functions, τCO|CF and τCF|CO
- Stochastic optimization (example on rows):
    # Start with an initial partition on rows
    for i in 1..n_times
        # Augment the current partition with an empty cluster
        # Move one element at random from one cluster to another
        # If the objective function improves, keep the solution; else undo the operation
        # If there is an empty cluster, remove it
    end
- This optimization allows the number of clusters to grow or decrease
- In (Robardet02) an efficient way to update the objective function incrementally was introduced
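The row-side loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it recomputes τCO|CF from scratch at every step instead of using the incremental update of (Robardet02), and the function names, the `seed` argument and the default `n_times` are assumptions made for the example. It also assumes every column cluster has a positive total.

```python
import numpy as np

def tau_rows_given_cols(data, row_labels, col_labels):
    """Goodman-Kruskal tau_CO|CF of the co-cluster table induced by the labels."""
    table = np.zeros((row_labels.max() + 1, col_labels.max() + 1))
    np.add.at(table, (row_labels[:, None], col_labels[None, :]), data)
    p = table / table.sum()
    p_row, p_col = p.sum(axis=1), p.sum(axis=0)
    e_co = 1.0 - np.sum(p_row ** 2)
    if e_co == 0.0:                     # single row cluster: nothing to predict
        return 0.0
    return (e_co - (1.0 - np.sum(p ** 2 / p_col))) / e_co

def optimize_rows(data, row_labels, col_labels, n_times=1000, seed=0):
    """Random single-row moves, kept only when tau_CO|CF strictly improves."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    col_labels = np.asarray(col_labels)
    labels = np.asarray(row_labels).copy()
    best = tau_rows_given_cols(data, labels, col_labels)
    for _ in range(n_times):
        n_clusters = labels.max() + 1   # index n_clusters is the extra empty cluster
        row = rng.integers(len(labels))
        target = rng.integers(n_clusters + 1)
        if target == labels[row]:
            continue
        candidate = labels.copy()
        candidate[row] = target
        # Dropping empty clusters amounts to relabelling to 0..k-1.
        _, candidate = np.unique(candidate, return_inverse=True)
        score = tau_rows_given_cols(data, candidate, col_labels)
        if score > best:                # otherwise the move is simply discarded
            labels, best = candidate, score
    return labels, best
```

Because the empty cluster is always available as a target and emptied clusters are removed, the number of row clusters can grow or shrink during the search, as noted on the slide.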
HIERARCHICAL CO-CLUSTERING
HiCC:
- Hierarchical co-clustering algorithm that extends τ-CoClust
- Divisive approach
- No parameter settings needed
- No predefined number of splits for each node of the hierarchy
HIERARCHICAL CO-CLUSTERING
HiCC:
At the beginning, use τ-CoClust
repeat
    - From the current row/column partitions:
    - Fix the column partition
    - For each cluster in the row partition: re-cluster with τ-CoClust and optimize the objective function τCO|CF
    - Update the row hierarchy
    - Fix the new row partition
    - For each cluster in the column partition: re-cluster with τ-CoClust and optimize the objective function τCF|newCO
    - Update the column hierarchy
until (TERMINATION)
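The control flow of this loop can be sketched as a skeleton in which the flat re-clustering step is a pluggable function. Everything named here is hypothetical: `recluster_rows` and `recluster_cols` stand in for a τ-CoClust run restricted to one cluster, `halve` is only a toy splitter used to exercise the loop, and the stopping rule (no cluster splits on either dimension, or `max_levels` reached) is an assumption, since the slide only says TERMINATION.

```python
def refine(partition, fixed_partition, recluster):
    """Re-cluster every cluster; return the finer partition and parent links."""
    new_partition, parents = [], []
    for parent_id, cluster in enumerate(partition):
        for sub_cluster in recluster(cluster, fixed_partition):
            new_partition.append(sub_cluster)
            parents.append(parent_id)   # n-ary split: any number of children
    return new_partition, parents

def hicc_skeleton(rows, cols, recluster_rows, recluster_cols, max_levels=20):
    """Alternate row and column refinement, recording one hierarchy level per pass."""
    row_levels = [(rows, None)]         # (partition, parent index of each cluster)
    col_levels = [(cols, None)]
    for _ in range(max_levels):
        cur_rows, cur_cols = row_levels[-1][0], col_levels[-1][0]
        # Columns fixed: split each row cluster.
        new_rows, row_parents = refine(cur_rows, cur_cols, recluster_rows)
        # New rows fixed: split each column cluster.
        new_cols, col_parents = refine(cur_cols, new_rows, recluster_cols)
        if len(new_rows) == len(cur_rows) and len(new_cols) == len(cur_cols):
            break                       # assumed termination: nothing was split
        row_levels.append((new_rows, row_parents))
        col_levels.append((new_cols, col_parents))
    return row_levels, col_levels

def halve(cluster, fixed_partition):
    """Toy splitter (illustration only): cut a cluster into two halves."""
    if len(cluster) < 2:
        return [cluster]
    mid = len(cluster) // 2
    return [cluster[:mid], cluster[mid:]]

row_levels, col_levels = hicc_skeleton([[0, 1, 2, 3]], [[0, 1]], halve, halve)
```

In this skeleton the initial partitions passed as `rows` and `cols` play the role of the first τ-CoClust run, and the parent links stored at each level are what define the two hierarchies.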
HIERARCHICAL CO-CLUSTERING
A simple example: goes on … until the termination condition is satisfied
RESULTS
Experimentation:
- No previous hierarchical co-clustering algorithm exists
- We use a flat co-clustering algorithm with the same number of clusters obtained by our approach for each level
- We choose the information theoretic approach (KDD03); for each level we perform 50 runs and compute the average
- We use document-word datasets to validate our approach:
    * OHSUMED (collection of PubMed abstracts) {oh0, oh15}
    * REUTERS-21578 (collected and labeled by Carnegie Group) {re0, re1}
    * TREC (Text Retrieval Conference) {tr11, tr21}
RESULTS
An example of a row hierarchy on OHSUMED. We label each cluster with the majority class.
RESULTS
An example of a column hierarchy on REUTERS. We label each cluster with the top 10 words ranked by mutual information.
RESULTS
External validation indices:
- Purity
- Normalized Mutual Information (NMI)
- Adjusted Rand Index
Hierarchical setting: we combine the per-level results with a weighted formula (not preserved in the slide text), in which:
- one term is one of the external validation indices
- αi is a weight for the hierarchy level i; in our case αi is equal to 1/i
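As an illustration, the sketch below computes purity for one level and then combines per-level scores with the weights αi = 1/i. The combining formula itself is not legible in the slide text, so the normalized weighted mean used in `combine_levels` is an assumption, as is numbering the levels from 1 at the top of the hierarchy. Both function names are illustrative.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    counts_by_cluster = {}
    for cluster, klass in zip(cluster_labels, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[klass] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(class_labels)

def combine_levels(index_per_level):
    """Weighted mean of per-level index values with alpha_i = 1/i.

    ASSUMPTION: dividing by the sum of the weights is a guess at the
    slide's missing formula, not a reconstruction of it.
    """
    weights = [1.0 / i for i in range(1, len(index_per_level) + 1)]
    weighted_sum = sum(w * v for w, v in zip(weights, index_per_level))
    return weighted_sum / sum(weights)
```

With αi = 1/i, levels near the top of the hierarchy weigh more than deeper ones, so a poor coarse split cannot be fully offset by good fine-grained splits.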
RESULTS
Performance results
RESULTS
Performance results on the re1 dataset
CONCLUSIONS
We propose:
- A new approach to hierarchical co-clustering
- Parameter-free
- No a priori fixed number of splits (n-ary splits)
- Obtains good results
- Builds hierarchies on both dimensions simultaneously
- Improves the exploration of co-clustering results
CONCLUSIONS
Future work:
- Parallelize the algorithm to improve time performance
- Push constraints inside it to use background knowledge
- Extend the framework to manage continuous data
Any questions? Thank you for your attention.