Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema Yizhou Sun, Yintao Yu and Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign 9/1/2009
Outline • Background and Motivation • Preliminaries • NetClus Algorithm • Experiments • Conclusions and Future Work
Homogeneous vs. Heterogeneous Networks • Information networks are ubiquitous • Homogeneous network • Collaboration networks, friendship networks, citation networks, and so on • Usually converted from a heterogeneous network • Heterogeneous network • Bibliographic networks, movie networks, tagging networks, and so on • Represents the real-world relations directly
Why Clustering on Heterogeneous Information Networks? • Why clustering on heterogeneous networks? • Understand the hidden structure • Understand the role each object plays in the network • Existing work • Clustering on homogeneous networks • SimRank + clustering methods on homogeneous networks • Time-consuming • The meaning of similarity becomes controversial • RankClus [EDBT’09] • Clusters only one type of objects • Experiments are on two-typed heterogeneous networks
Better and More Efficient Clustering • Motivation 1: Generate clusters that are • More meaningful • Propose a new definition of cluster, called net-cluster, that follows the schema of the original network and is comprised of different types of objects • More understandable • Provide ranking information for each type of objects in each cluster • Motivation 2: Provide an efficient algorithm • NetClus: linear in the number of links in the network
Sub-Network Clusters: An Illustration • A database net-cluster of a bibliographic network (figure omitted; the sub-network links objects of the conference, author, and term types)
NetClus Methodology: An Illustration • Split a network into different layers, each represented by a net-cluster
Outline • Background and Motivation • Preliminaries • Star Network Schema • Ranking Functions • Net-Cluster Definition • NetClus Algorithm • Experiments • Conclusions and Future Work
Star Network Schema • NetClus addresses a specific type of heterogeneous network: the star network schema • Center type: the target type • E.g., a paper, a movie, a tagging event • A center object is a co-occurrence of a bag of objects of different types, and stands for a multi-relation among those objects • Surrounding types: attribute types
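In code, a center (target) object under a star schema can be represented as a bag of attribute objects of each surrounding type. A minimal sketch — all names and values below are illustrative, not from the paper's data:

```python
# A target (center) object as a co-occurrence of attribute objects.
# All names here are made up for illustration.
paper = {
    "type": "paper",                            # the center / target type
    "conference": ["KDD"],                      # attribute (surrounding) types
    "author": ["Alice", "Bob"],
    "term": ["frequent", "pattern", "mining"],  # a bag: repeats are allowed
}

# The paper node is the multi-relation tying these attribute objects together:
attribute_objects = [(t, x) for t in ("conference", "author", "term")
                     for x in paper[t]]
```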
Ranking Functions • Ranking objects in a network, denoted p(x | T_x, G) • Gives a score to each object according to its importance • Different rules define different ranking functions: • Simple Ranking • The ranking score is assigned according to the degree of an object • Authority Ranking • Ranking scores are mutually enhanced by propagating scores through links • E.g., following the rules that (1) highly ranked conferences accept many good papers published by highly ranked authors, and (2) highly ranked authors publish many good papers in highly ranked conferences
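The two mutual-enhancement rules above can be sketched as a power-iteration on a toy conference–author bipartite network. The adjacency matrix and iteration count are made up; the real algorithm propagates scores over every attribute type of the star network:

```python
import numpy as np

# Toy adjacency: W[c, a] = number of papers author a published in
# conference c (all numbers are made up for illustration).
W = np.array([[3.0, 1.0, 0.0],
              [0.0, 2.0, 4.0]])

def authority_ranking(W, iters=50):
    """Mutually enhancing ranking: conferences are scored via the authors
    publishing in them and vice versa; renormalizing each vector to sum
    to 1 turns the scores into ranking distributions."""
    n_conf, n_auth = W.shape
    r_auth = np.full(n_auth, 1.0 / n_auth)
    for _ in range(iters):
        r_conf = W @ r_auth       # rule (1): conferences ranked via authors
        r_conf /= r_conf.sum()
        r_auth = W.T @ r_conf     # rule (2): authors ranked via conferences
        r_auth /= r_auth.sum()
    return r_conf, r_auth

r_conf, r_auth = authority_ranking(W)
```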
Ranking Functions (Cont.) • Ranking distribution • Normalize the ranking scores to sum to 1, giving them a probabilistic meaning • Similar in spirit to PageRank • Priors can be added: • P_P(X | T_X, G_k) = (1 − λ_P) P(X | T_X, G_k) + λ_P P_0(X | T_X, G_k) • P_0(X | T_X, G_k) is the prior knowledge, usually given as a distribution over only a few terms • λ_P controls how strongly we believe in the prior distribution
Net-Cluster • Given an information network G, a net-cluster C contains two sorts of information: • Topology: a node set and a link set forming a sub-network of G • Statistical info: a membership indicator P(x ∈ C) for each node x • Given an information network G and a cluster number K, a clustering of G is a set {C_k}, k = 1, …, K, where each C_k is a net-cluster of G
Outline • Background and Motivation • Preliminaries • NetClus Algorithm • Framework of NetClus • Net-Cluster Generative Model • Posterior Probability Estimation (PPE) • Impact of Ranking Functions • Experiments • Conclusions and Future Work
Framework of NetClus • General idea: map each target object into a new low-dimensional feature space according to the current net-clustering, then adjust the clustering in the new measure space • Step 0: generate initial random clusters • Step 1: build a ranking-based generative model for the target objects of each net-cluster • Step 2: calculate posterior probabilities for target objects, which serve as the new measure, and assign each target object to the nearest cluster accordingly • Step 3: repeat Steps 1 and 2 until the clusters do not change significantly • Step 4: calculate posterior probabilities for the attribute objects in each net-cluster
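Steps 0–3 can be sketched as a compact, runnable loop. This is a deliberate simplification: instead of ranking-based generative models over a star network, each "paper" is reduced to a bag of terms and the per-cluster model is just a smoothed term distribution. All names and toy data are illustrative:

```python
import random
random.seed(7)

# Toy corpus: each target object is a bag of terms (made-up data).
VOCAB = ["database", "query", "mining", "pattern"]
papers = [["database", "query"], ["query", "database", "database"],
          ["mining", "pattern"], ["pattern", "mining", "mining"],
          ["database", "mining"]]
K = 2

def fit_models(assign):
    """Step 1: one generative model (term distribution) per cluster."""
    models = []
    for k in range(K):
        counts = {t: 1.0 for t in VOCAB}  # add-one smoothing
        for doc, c in zip(papers, assign):
            if c == k:
                for t in doc:
                    counts[t] += 1
        total = sum(counts.values())
        models.append({t: v / total for t, v in counts.items()})
    return models

def reassign(models):
    """Step 2: put each paper in the cluster under which it is most likely."""
    new = []
    for doc in papers:
        likes = []
        for m in models:
            p = 1.0
            for t in doc:
                p *= m[t]
            likes.append(p)
        new.append(likes.index(max(likes)))
    return new

assign = [random.randrange(K) for _ in papers]  # step 0: random initial clusters
for _ in range(10):                             # step 3: iterate until stable
    new_assign = reassign(fit_models(assign))
    if new_assign == assign:
        break
    assign = new_assign
```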
Generative Model for Target Objects Given a Net-Cluster • Recall that each target object stands for a co-occurrence of a bag of attribute objects • Defining the probability of a target object ⇔ defining the probability of the co-occurrence of all its associated attribute objects • Generative probability of target object d in cluster G_k: p(d | G_k) = ∏_x [ p(x | T_x, G_k) · p(T_x | G_k) ], taken over the attribute objects x linked to d, where p(x | T_x, G_k) is the ranking distribution and p(T_x | G_k) is the type probability • Two independence assumptions • The probabilities of visiting objects of different types are independent of each other • The probabilities of visiting two objects of the same type are independent of each other
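A minimal sketch of this generative probability, with made-up ranking distributions and type probabilities for one net-cluster:

```python
# Toy ranking distributions p(x|T_x, G_k) and type probabilities
# p(T_x|G_k) for one net-cluster; every number here is made up.
rank = {
    "author": {"A1": 0.5, "A2": 0.3, "A3": 0.2},
    "term":   {"data": 0.6, "mining": 0.4},
}
p_type = {"author": 0.4, "term": 0.6}

def p_target(doc, rank, p_type):
    """p(d|G_k) = product, over the attribute objects x attached to d, of
    p(x|T_x,G_k) * p(T_x|G_k), using both independence assumptions; a
    repeated object contributes one factor per occurrence."""
    prob = 1.0
    for typ, objs in doc.items():
        for x in objs:
            prob *= rank[typ][x] * p_type[typ]
    return prob

paper = {"author": ["A1", "A2"], "term": ["data", "data", "mining"]}
p = p_target(paper, rank, p_type)
```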
PPE: Smoothing and Background Generative Model • Smoothing of the ranking distribution of each type of objects in each net-cluster • Smooth each conditional ranking distribution with the global ranking distribution: • P_S(X | T_X, G_k) = (1 − λ_S) P(X | T_X, G_k) + λ_S P(X | T_X, G) • Goal: avoid zero probabilities for unseen objects • Background generative model (BG) • The probability of generating target object d in the original network: p(d | G) • Target objects that are not highly related to any specific cluster should have high probability under the background model
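The smoothing formula is a simple linear interpolation. A sketch with made-up toy distributions and an illustrative λ_S value:

```python
# Smoothing of a cluster-conditional ranking distribution with the global
# one: P_S(x) = (1 - lam) * P(x|T_x,G_k) + lam * P(x|T_x,G).
def smooth(p_cond, p_global, lam=0.1):
    # iterate over the global support so that objects unseen in the
    # cluster still receive a small, nonzero probability
    return {x: (1 - lam) * p_cond.get(x, 0.0) + lam * p
            for x, p in p_global.items()}

p_cond = {"sql": 0.7, "index": 0.3}                # "xml" unseen in cluster
p_global = {"sql": 0.5, "index": 0.3, "xml": 0.2}
ps = smooth(p_cond, p_global)
```

Since both inputs are distributions, the interpolated result still sums to 1.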
PPE: Posterior Probability Estimation for Target Objects • Now we have K net-clusters, corresponding to K generative models, plus a background model • Given p(d|G_1), p(d|G_2), …, p(d|G_K) and p(d|G), what are the posterior probabilities p(k|d), k = 1, 2, …, K, K+1? • Estimation solution: • Maximize the log-likelihood of the whole collection • Use the EM algorithm to estimate the best p(z = k) • Hidden variable: the cluster label z of each target object • Iterative formulas: E-step p(z = k | d) ∝ p(z = k) p(d | G_k); M-step p(z = k) = (1/|D|) Σ_d p(z = k | d)
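A sketch of this EM estimation with the component likelihoods held fixed; the likelihood matrix below is made up for illustration:

```python
import numpy as np

# Rows: one target object d with (p(d|G_1), ..., p(d|G_K), p(d|G)),
# i.e. K cluster models plus the background model; numbers are made up.
L = np.array([[0.80, 0.10, 0.05],
              [0.20, 0.70, 0.10],
              [0.30, 0.30, 0.40]])

def em_mixture_weights(L, iters=200):
    """EM for the component priors p(z=k), maximizing the collection
    log-likelihood sum_d log sum_k p(z=k) p(d|G_k).
    E-step: p(z=k|d) proportional to p(z=k) p(d|G_k);
    M-step: p(z=k) = average of the posteriors over all d."""
    n, K = L.shape
    pz = np.full(K, 1.0 / K)
    for _ in range(iters):
        post = L * pz                           # E-step (unnormalized)
        post /= post.sum(axis=1, keepdims=True)
        pz = post.mean(axis=0)                  # M-step
    return pz, post

pz, post = em_mixture_weights(L)
```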
PPE: Posterior Probability Estimation for Attribute Objects • Posterior probabilities of attribute objects are only needed once the sub-networks of the net-clustering are stable (Step 4) • Aim: calculate the membership of each attribute object • Solution: use the target objects linked to each attribute object — its membership is the average of those target objects' memberships • E.g., a conference's membership indicator is the percentage of its papers in each cluster
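The averaging step is straightforward; a sketch with made-up paper posteriors for one conference:

```python
# Membership of an attribute object = average of the posterior
# memberships of the target objects linked to it; e.g., a conference's
# indicator is the average membership of its papers. Toy numbers below.
def attribute_posterior(target_posts):
    K = len(target_posts[0])
    n = len(target_posts)
    return [sum(p[k] for p in target_posts) / n for k in range(K)]

# posteriors over K=2 net-clusters for three papers of one conference
conf_membership = attribute_posterior([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
```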
Cluster Adjustment • Use the posterior probabilities of target objects as the new feature space • Each target object ⇒ a K-dimensional vector • Each net-cluster center ⇒ a K-dimensional vector • The average over the objects in the cluster • Assign each target object to the nearest cluster center (e.g., by cosine similarity) • A sub-network corresponding to the new net-cluster is then built • by extracting all the target objects in that cluster and all their linked attribute objects
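The cosine-similarity assignment can be sketched as follows; the posterior vectors and centers below are made-up toy data:

```python
import numpy as np

# Posterior vectors of target objects form the new K-dimensional feature
# space; each object moves to the center with highest cosine similarity.
def assign_to_nearest(X, centers):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return (Xn @ Cn.T).argmax(axis=1)  # cosine sim = dot of unit vectors

X = np.array([[0.90, 0.10],   # clearly cluster-0-like
              [0.05, 0.95],   # clearly cluster-1-like
              [0.60, 0.40]])
centers = np.array([[0.85, 0.15],
                    [0.10, 0.90]])
labels = assign_to_nearest(X, centers)
```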
Time Complexity Analysis • Global ranking of attribute objects: O(t1|E|) • During each iteration • Conditional ranking: O(t1|E|) • Conditional probabilities of target objects: O(|E|) • Posteriors of target objects: O(t2(K+1)N) • Cluster adjustment: O(K²N) • Posteriors of attribute objects: O(K|E|) • In all: O(|E|) for fixed K
Impact of Ranking Functions • Which ranking function is better? • Consider a simple 3-type star network on object set {X, Y, Z}, where Z is the center type • The joint probability estimated under simple ranking has an error of I(X, Y), the mutual information between X and Y • The joint probability estimated under authority ranking gives the best approximation of the propagation matrix between X and Y under the Frobenius norm
Outline • Background and Motivation • Preliminaries • NetClus Algorithm • Experiments • Conclusions and Future Work
Experiments • Data Set • DBLP “all-area” data set • All conferences • Top 50K authors • DBLP “four-area” data set • 20 conferences from DB, DM, ML, IR • All authors from these conferences • All papers published in these conferences • Running case illustration
NetClus: Database System Cluster
Top-ranked authors: Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Åke Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185
Top-ranked terms: database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707
Top-ranked conferences: VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849
Ranking authors in XML
Parameter Study: Parameter Settings • The prior parameter λ_P is relatively stable once it exceeds 0.4; the bigger, the better • The smoothing parameter λ_S is relatively stable; the smaller, the better (except with no smoothing at all)
Accuracy Study: Experiments • Compare accuracy with • PLSA: a pure text model; no other types of objects or links are used; uses the same prior as NetClus • RankClus: a bi-typed clustering method that clusters only one type of objects
NetClus: Distinguishing Conferences
Posterior probabilities of each conference over the five net-clusters (column order inferred from the dominant entries: DB, DM, ML/AI, IR, background):
AAAI 0.0022667 0.00899168 0.934024 0.0300042 0.0247133
CIKM 0.150053 0.310172 0.00723807 0.444524 0.0880127
CVPR 0.000163812 0.00763072 0.931496 0.0281342 0.032575
ECIR 3.47023e-05 0.00712695 0.00657402 0.978391 0.00787288
ECML 0.00077477 0.110922 0.814362 0.0579426 0.015999
EDBT 0.573362 0.316033 0.00101442 0.0245591 0.0850319
ICDE 0.529522 0.376542 0.00239152 0.0151113 0.0764334
ICDM 0.000455028 0.778452 0.0566457 0.113184 0.0512633
ICML 0.000309624 0.050078 0.878757 0.0622335 0.00862134
IJCAI 0.00329816 0.0046758 0.94288 0.0303745 0.0187718
KDD 0.00574223 0.797633 0.0617351 0.067681 0.0672086
PAKDD 0.00111246 0.813473 0.0403105 0.0574755 0.0876289
PKDD 5.39434e-05 0.760374 0.119608 0.052926 0.0670379
PODS 0.78935 0.113751 0.013939 0.00277417 0.0801858
SDM 0.000172953 0.841087 0.058316 0.0527081 0.0477156
SIGIR 0.00600399 0.00280013 0.00275237 0.977783 0.0106604
SIGMOD Conference 0.689348 0.223122 0.0017703 0.00825455 0.0775055
VLDB 0.701899 0.207428 0.00100012 0.0116966 0.0779764
WSDM 0.00751654 0.269259 0.0260291 0.683646 0.0135497
WWW 0.0771186 0.270635 0.029307 0.451857 0.171082
Case Study: DBLP "all-area" data set, K = 8 • An "xml" net-cluster derived from the database net-cluster
NetClus: KDD Field
Top-ranked terms: mining 0.0790963, data 0.0509959, association 0.0424484, frequent 0.0413659, rule 0.0223015, pattern 0.0221282, based 0.012448, clustering 0.00915418, efficient 0.00870164, databases 0.00654573, rules 0.00638362, web 0.00618587, approach 0.00558388, patterns 0.00546508, time 0.00532743, discovery 0.00520791, queries 0.00512735, large 0.00505302, algorithm 0.00495221, classification 0.00477521
Top-ranked conferences: ICDE 0.193106, KDD 0.177786, SIGMOD Conf. 0.116497, VLDB 0.112015, ICDM 0.0968135
Top-ranked authors: Philip S. Yu 0.00984668, Jiawei Han 0.0080883, Charu C. Aggarwal 0.00688184, Christos Faloutsos 0.00534601, Wei Wang 0.0039633, Hans-Peter Kriegel 0.0036941, Rakesh Agrawal 0.00352178, Jian Pei 0.00352033, Nick Koudas 0.00326135, Heikki Mannila 0.00302283, Eamonn J. Keogh 0.00285453, Haixun Wang 0.00277766, Divesh Srivastava 0.00275084, Beng Chin Ooi 0.00270741, Ming-Syan Chen 0.00252245, Johannes Gehrke 0.00248227, Mohammed Javeed Zaki 0.0024233, Ke Wang 0.00237186, Yufei Tao 0.00234508, H. V. Jagadish 0.0023317
NetClus: ML
Top-ranked terms: learning 0.0785149, recognition 0.0616076, pattern 0.0569329, machine 0.0210515, based 0.012122, knowledge 0.0062703, model 0.00563725, system 0.00538452, approach 0.00534144, reasoning 0.00518959, models 0.00482448, data 0.00428022, analysis 0.00427453, planning 0.00416088, search 0.00414499, systems 0.00407711, logic 0.00371819, multi 0.00349816, algorithm 0.0034679, classification 0.00321972
Top-ranked conferences: IJCAI 0.427665, AAAI 0.403056, ICML 0.0899892, ECML 0.0245488, CVPR 0.0229665
Top-ranked authors: Richard E. Korf 0.00299098, Craig Boutilier 0.00246557, Tuomas Sandholm 0.00244961, Judea Pearl 0.00242606, Hector J. Levesque 0.00234726, Yoav Shoham 0.00230554, Kenneth D. Forbus 0.00211045, Rina Dechter 0.00208683, Stuart J. Russell 0.00188014, Johan de Kleer 0.00187524, Toby Walsh 0.00186112, Benjamin Kuipers 0.00185742, Subbarao Kambhampati 0.00175271, Peter Stone 0.00170711, Kurt Konolige 0.00170513, James P. Delgrande 0.00167945, Joseph Y. Halpern 0.00164386, Jeffrey S. Rosenschein 0.00161199, Brian C. Williams 0.00157864, Daniel S. Weld 0.00156658
NetClus: IR
Top-ranked terms: retrieval 0.0833119, information 0.0777979, text 0.0689247, search 0.0306999, web 0.0145188, based 0.0143753, document 0.00950089, query 0.00783011, system 0.0064804, classification 0.00618953, model 0.00614568, language 0.00540877, data 0.00517338, learning 0.0050341, analysis 0.00480311, approach 0.00462792, models 0.0046184, clustering 0.00460905, documents 0.00453735, user 0.00449431
Top-ranked conferences: SIGIR 0.638595, CIKM 0.14482, ECIR 0.0726454, WWW 0.0366223, KDD 0.015487
Top-ranked authors: W. Bruce Croft 0.0141826, James Allan 0.00630046, Norbert Fuhr 0.00547785, ChengXiang Zhai 0.00493936, James P. Callan 0.00481386, C. J. van Rijsbergen 0.00471779, Ellen M. Voorhees 0.00467488, Gerard Salton 0.00462283, Mark Sanderson 0.00437391, K. L. Kwok 0.00427169, Chris Buckley 0.00404819, Abraham Bookstein 0.00383661, Justin Zobel 0.00374904, Tetsuya Sakai 0.0035631, Yiming Yang 0.0034947, Donna Harman 0.00335238, Clement T. Yu 0.00330327, Alistair Moffat 0.003292, Ian Soboroff 0.00324063, Nicholas J. Belkin 0.00313201
Outline • Background and Motivation • Preliminaries • NetClus Algorithm • Experiments • Conclusions and Future Work
Conclusions and Future Work • A new kind of cluster, the net-cluster, is proposed for heterogeneous information networks comprised of multiple types of objects. • An effective and efficient algorithm, NetClus, is proposed that detects net-clusters in star networks with an arbitrary number of types. • Experiments on real data sets show that our algorithm gives reasonable clustering and ranking results, with clustering accuracy much higher than the baseline methods. • See our iNextCube system demo at VLDB’09 • http://inextcube.cs.uiuc.edu/DBLP • http://inextcube.cs.uiuc.edu/netclus • Future work • How to automatically set the cluster number K? • How to select a good ranking function in a complex network?
END. Q & A