On Node Classification in Dynamic Content-based Networks
Charu Aggarwal (IBM T. J. Watson Research Center, charu@us.ibm.com) and Nan Li (University of California, Santa Barbara, nanli@cs.ucsb.edu)
Presented by Nan Li
Motivation: snapshots of a co-authorship network whose structure and content both evolve over time (network diagrams).
Year 2001 snapshot: nodes Ke Wang, Jian Pei, Jiawei Han, and Kenneth A. Ross, annotated with keywords such as "Sequential Pattern", "Data Mining", "Systems", "Rules", "Mining", "Efficient", "Association Rules", "Databases", "Clustering", and "Algorithms".
Year 2002 snapshot: nodes Ke Wang, Jian Pei, Jiawei Han, Marianne Winslett, Xifeng Yan, and Philip S. Yu, annotated with keywords such as "Association Rules", "Data Mining", "Ranking", "Web", "Pattern", "Stream", "Semantics", "Parallel", "Automated", "Clustering", "Distributed", "Databases", and "Pattern Mining".
Year 2003 snapshot: nodes Ke Wang, Jian Pei, Jiawei Han, Charu Aggarwal, Xifeng Yan, and Philip S. Yu, annotated with keywords such as "Sequential Pattern", "Mining", "Systems", "Rules", "Efficient", "Association", "Clustering", "Indexing", "Knowledge", "XML", "Wireless", "Web", and "Graph".
Motivation • Networks are annotated with an increasing amount of text • Citation networks, co-authorship networks, product databases with large amounts of text content, etc. • These networks are highly dynamic • Node classification problem • Arises in many network scenarios in which the underlying nodes are associated with content. • A subset of the nodes in the network may be labeled. • Can we use these labeled nodes, in conjunction with the network structure, to classify the nodes that are not currently labeled? • Applications
Challenges • Information networks are very large • The method must be scalable and efficient • Many such networks are dynamic • The method must be updatable in real time • The method must be self-adaptable and robust • Such networks are often noisy • The method must be intelligent and selective • Heterogeneous correlations exist in such networks (diagram: label patterns over A, B, C illustrating heterogeneous correlations)
Outline • Related Work • DYCOS: DYnamic Classification algorithm with cOntent and Structure • Semi-bipartite content-structure transformation • Classification using a series of text and link-based random walks • Accuracy analysis • Experiments • NetKit-SRL • Conclusion
Related Work • Link-based classification (Bhagat et al., WebKDD 2007) • Local iterative • Global nearest neighbor • Content-only classification (Nigam et al., Machine Learning 2000) • Uses each object's own attributes only • Relational classification (Sen et al., Technical Report 2004) • Uses each object's own attributes • Plus the attributes and known labels of its neighbors • Collective classification (Macskassy & Provost, JMLR 2007; Sen et al., Technical Report 2004; Chakrabarti, SIGMOD 1998) • Local classification • Flexible: ranging from a decision tree to an SVM • Approximate inference algorithms • Iterative classification • Gibbs sampling • Loopy belief propagation • Relaxation labeling
Outline • Related Work • DYCOS: DYnamic Classification algorithm with cOntent and Structure • Semi-bipartite content-structure transformation • Classification using a series of text and link-based random walks • Accuracy analysis • Experiments • NetKit-SRL • Conclusion
DYCOS in a Nutshell • Node classification in a dynamic environment • Dynamic network: the entire network at time t is denoted by Gt = (Nt, At, Tt), where Nt is the set of nodes, At the set of edges, and Tt ⊆ Nt the set of currently labeled nodes. • Problem statement: • Classify the unlabeled nodes (Nt \ Tt), using both the content and the structure of the network, for all time stamps in an efficient and accurate manner. Both the structure and the content of the network change over time! (illustrated for time stamps t, t+1, t+2)
Semi-bipartite Transformation • Text-augmented representation • Leveraged by a random walk-based classification model that uses both links and text • Two partitions: structural nodes and word nodes • Semi-bipartite: one partition (the structural nodes) may have edges both within the set and to nodes in the other partition, while word nodes connect only across partitions • Supports efficient updates upon dynamic changes (see the sketch below)
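To make the representation concrete, here is a minimal Python sketch of such a text-augmented, semi-bipartite structure. The class and method names (SemiBipartiteGraph, add_word, etc.) are our own illustration of the slide's description, not the authors' implementation.

```python
from collections import defaultdict

class SemiBipartiteGraph:
    """Minimal sketch of the text-augmented, semi-bipartite representation.

    Structural nodes keep ordinary edges among themselves; each word node
    connects a discriminative keyword to every structural node containing it.
    """

    def __init__(self):
        self.neighbors = defaultdict(set)   # structural node -> structural neighbors
        self.node_words = defaultdict(set)  # inverted list: node -> keywords
        self.word_nodes = defaultdict(set)  # inverted list: keyword -> nodes

    def add_edge(self, u, v):
        # Edge within the structural partition (e.g., a citation or co-authorship).
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

    def add_word(self, node, word):
        # Edge across the partitions; word nodes have no edges among themselves.
        self.node_words[node].add(word)
        self.word_nodes[word].add(node)

    def remove_word(self, node, word):
        # Incremental update when the content of a node changes over time.
        self.node_words[node].discard(word)
        self.word_nodes[word].discard(node)
```

Because updates touch only the two inverted lists of the affected node and keyword, dynamic changes stay cheap, which is the point of the semi-bipartite design.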
Random Walk-Based Classification • Random walks over the augmented structure • Starting node: the unlabeled node to be classified. • Structural hop • A random jump from a structural node to one of its neighbors • Content-based multi-hop • A jump from a structural node to another through implicit common word nodes • Structural parameter ps: at each step the walk takes a structural hop with probability ps, and a content-based multi-hop otherwise • Classification • Assign the starting node the class label encountered most frequently during the random walks (see the sketch below)
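Continuing the SemiBipartiteGraph sketch above, here is a hedged sketch of the walk-and-vote procedure; the vote-counting and tie-handling details are assumptions, since the slides state only that the most frequently encountered label wins.

```python
import random
from collections import Counter

def classify_node(graph, start, labels, p_s=0.7, num_walks=20, walk_length=5):
    """Illustrative DYCOS-style classification: majority vote over l short
    random walks (parameters p_s, l = num_walks, h = walk_length)."""
    votes = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            if random.random() < p_s and graph.neighbors[node]:
                # Structural hop: follow an ordinary network edge.
                node = random.choice(list(graph.neighbors[node]))
            else:
                # Content-based multi-hop: node -> shared keyword -> node.
                words = list(graph.node_words[node])
                if not words:
                    break  # isolated node with no keywords: abandon this walk
                word = random.choice(words)
                node = random.choice(list(graph.word_nodes[word]))
            if node in labels:
                votes[labels[node]] += 1  # every labeled node on the walk votes
    return votes.most_common(1)[0][0] if votes else None
```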
Gini-Index & Inverted Lists • Discriminative keywords • A set Mt of the top m words with the highest discriminative power is used to construct the word-node partition. • Gini-index • The value of G(w) lies in the range (0, 1). • Words with a higher gini-index value are more discriminative for classification purposes. • Inverted lists • An inverted list of keywords for each node • An inverted list of nodes for each keyword (see the sketch below)
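The slides omit the formula behind G(w). A common form, assumed here, scores word w by the sum of its squared per-class fractional presences, so a word concentrated in one class scores near 1 while a word spread evenly over k classes scores 1/k. A minimal Python sketch:

```python
from collections import Counter, defaultdict

def gini_index(class_counts):
    """Assumed form G(w) = sum_i p_i(w)^2, where p_i(w) is the fraction of
    w's occurrences falling in class i; higher means more discriminative."""
    total = sum(class_counts.values())
    return sum((c / total) ** 2 for c in class_counts.values())

def top_m_words(docs, m):
    """docs: iterable of (class_label, word_list) pairs. Returns the m words
    with the highest gini-index, i.e., the discriminative keyword set M_t."""
    per_word = defaultdict(Counter)
    for label, words in docs:
        for w in set(words):
            per_word[w][label] += 1
    return sorted(per_word, key=lambda w: gini_index(per_word[w]), reverse=True)[:m]
```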
Analysis • Why do we care? • DYCOS essentially uses Monte-Carlo sampling to sample various paths from each unlabeled node. • Advantage: fast • Disadvantage: loss of accuracy • Can we analyze how accurate the DYCOS sampling is? • Probabilistic bound: bi-class classification • Two classes C1 and C2 • E[Pr[C1]] = f1, E[Pr[C2]] = f2, f1 − f2 = b ≥ 0 • Pr[mis-classification] ≤ exp(−l·b²/2), where l is the number of random walks • Probabilistic bound: multi-class classification • k classes {C1, C2, …, Ck} • b-accurate sampling • Pr[b-accurate] ≥ 1 − (k−1)·exp(−l·b²/2)
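To get a feel for the scale of these bounds, a worked instance (our own arithmetic, not from the slides):

```latex
% Bi-class bound with a gap b = 0.2 between the expected class
% frequencies and l = 100 sampled walks:
\Pr\left[\text{mis-classification}\right] \le e^{-l b^{2}/2}
    = e^{-100 \cdot (0.2)^{2}/2} = e^{-2} \approx 0.135
% Doubling the number of walks to l = 200 tightens the bound to
% e^{-4} \approx 0.018, so modest sampling already gives sharp guarantees.
```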
Outline • Related Work • DYCOS: DYnamic Classification algorithm with cOntent and Structure • Semi-bipartite content-structure transformation • Classification using a series of text and link-based random walks • Accuracy analysis • Experiments • NetKit-SRL • Conclusion
Experimental Results • Data sets • CORA: a set of research papers and the citation relations among them. • Each node is a paper and each edge is a citation relation. • A total of 12,313 English words are extracted from the paper titles. • We segment the data into 10 synthetic time periods. • DBLP: a set of authors and their collaborations. • Each node is an author and each edge is a collaboration. • A total of 194 English words in the domain of computer science are used. • We segment the data into 36 annual graphs from 1975 to 2010.
Experimental Results • NetKit-SRL toolkit • An open-source, publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007). • Provides instantiations of previous relational and collective classification algorithms • Configuration • Local classifier: domain-specific class prior • Relational classifier: network-only multinomial Bayes classifier • Collective inference: relaxation labeling • DYCOS parameters • m: the number of most discriminative words • a: the size constraint of the inverted list for each keyword • q: the number of top content-hop neighbors • l: the number of random walks • h: the length of each random walk • ps: the structural parameter • The results demonstrate that DYCOS improves classification accuracy over NetKit by 7.18% to 17.44%, while reducing runtime to only 14.60% to 18.95% of NetKit's.
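For concreteness, the sketches above can be wired together end to end. The toy data below is invented, and the parameter values merely echo the listed settings (l = 3, h = 5, ps = 70%); none of this is the paper's tuned configuration.

```python
# Illustrative end-to-end run on a tiny invented network.
docs = [("DB", ["mining", "databases"]), ("DB", ["mining", "indexing"]),
        ("ML", ["learning", "classification"]), ("ML", ["learning", "mining"])]
M_t = set(top_m_words(docs, m=3))            # discriminative keyword set

graph = SemiBipartiteGraph()
for u, v in [(1, 2), (2, 3), (3, 4)]:        # structural edges
    graph.add_edge(u, v)
node_text = {1: ["mining", "databases"], 2: ["mining", "indexing"],
             3: ["learning", "classification"], 4: ["learning", "mining"]}
for node, words in node_text.items():
    for w in words:
        if w in M_t:                          # keep only discriminative words
            graph.add_word(node, w)

known_labels = {1: "DB", 2: "DB", 4: "ML"}   # node 3 is the unlabeled query
print(classify_node(graph, start=3, labels=known_labels,
                    p_s=0.7, num_walks=3, walk_length=5))
```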
Experimental Results • DYCOS vs. NetKit on CORA: classification accuracy comparison and classification time comparison (figures).
Experimental Results • Parameter sensitivity of DYCOS on CORA and DBLP data (figures): sensitivity to m, l, and h (a = 30, ps = 70%); sensitivity to a, m, and ps (l = 3, h = 5).
Experimental Results • Dynamic updating time on CORA and DBLP (figures).
Outline • Related Work • DYCOS: DYnamic Classification algorithm with cOntent and Structure • Semi-bipartite content-structure transformation • Classification using a series of text and link-based random walks • Accuracy analysis • Experiments • NetKit-SRL • Conclusion
Conclusion • We propose an efficient, dynamic, and scalable method for node classification in dynamic networks. • We provide an analysis of how accurate the proposed method will be in practice. • We present experimental results on real data sets, showing that our algorithm is more effective and efficient than competing algorithms.
Experimental Results • DYCOS vs. NetKit on DBLP: classification accuracy comparison and classification time comparison (figures).
Experimental Results • Additional parameter-sensitivity figures: sensitivity to m, l, and h; to a, l, and h; to m, a, and ps; and to a, m, and ps.