This research paper analyzes the blogosphere using community factorization: it identifies dense subgraphs and studies the temporal dynamics of communication within communities, with the goal of understanding the structure and behavior of the blogosphere over time.
Structural and Temporal Analysis of the Blogosphere Through Community Factorization Y. Chi, S. Zhu, X. Song, J. Tatemura, B.L. Tseng Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007 Advisor: Dr. Jia-Ling Koh Speaker: Chien-Liang Wu
Outline • Introduction • Definition • Basic Idea • Community Factorization • Experimental Studies
Introduction • Blog • Self-publishing media on the web • Grows quickly and has become important • Key difference between the blogosphere and the Web • The lifetime of their contents (i.e., pages and links) • A blog consists of a temporal sequence of entries • An event → entries are published → entries become obsolete • Short lifetime • Web • Longer lifetime • A new page may refer to a very old page (such as an authoritative page)
Introduction (cont.) • Traditional web analysis • A dense subgraph of web pages → a community • Temporal dynamics → observe how the subgraph grows over time • Blogosphere analysis • A dense subgraph grows only within a short time span • Traditional analysis can only capture the dynamics of a short-term activity (such as a single thread of discussion)
Introduction (cont.) • Alternative method • Accumulate such links over a very long period • Generate a graph of blogs • The graph is static and changes only incrementally → detailed temporal behavior is missed • Example: two communities, Politics (P) and Economics (E) • One becomes inactive while the other becomes active • Some blogs are interested in both P and E, moving from one community to the other • The two communities overlap → they appear as a single community in the aggregated graph
Definition • Community • A set of blogs that communicate with each other in a synchronized manner • Communication is triggered by some events • Results in a number of dense subgraphs, each of which is a short-term thread of discussion • Community graph • Structure of a community, representing how much one blog communicates with another • Community intensity • Represents the activity level of a community at a particular time
Observation • Communication in a community → dense subgraphs • Structure of the subgraphs → community graph structure • A single dense subgraph does not necessarily reflect the entire structure of a community • Since a member does not always participate in a thread of discussion, a community may appear as smaller pieces of disconnected subgraphs at a particular time • Members of different communities may participate in a single subgraph → a single subgraph does not necessarily reflect a single community
Basic Idea • Represent a community structure as a combination of the observed subgraphs • Transfer the problem → how to find the coefficients of such a combination as well as the values of the community intensity over time • Community factorization • Find the parameters that give the best explanation of the observed data • Use intensities as weighting factors → approximate the observed data
Overview of Method • Identify dense subgraphs by applying a graph partitioning algorithm • A community graph is defined as a linear combination of the dense subgraphs
Problem Formulation • n blogs bi (i = 1,…, n) in the blogosphere • Linking activity → aggregated as a graph structure As for each time window s (s = 1,…, t) • These graphs As (s = 1,…, t) are stacked as a tensor A • Extract dense subgraphs from each observed graph As; these dense subgraphs are stacked as another tensor B • Given A and B, find k communities such that their community graphs and intensities best explain the observed data A • Community graphs {Cl} (l = 1,…, k) • Community intensities {vsl} (s = 1,…, t)
Data Tensor A • A = [A1,…, At]: represents the blogosphere over time • Snapshot graph As • Adjacency matrix that represents a snapshot of activity in the blogosphere at time s • (As)ij: the count of links from bi to bj in the s-th time window • A link from bi to bj at time s (i ≠ j) • bi publishes an entry at time s that has a hyperlink pointing to any content of bj • These links are counted for each time window (say, a day); see the sketch below
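A minimal sketch of how such a data tensor could be assembled from time-stamped links, assuming the links are already given as (source, target, time-window) index triples; the function name and data layout are illustrative, not from the paper:

```python
import numpy as np

def build_data_tensor(links, n_blogs, n_windows):
    """Build A = [A_1, ..., A_t], where (A_s)_ij counts links from b_i to b_j in window s.

    `links` is an iterable of (i, j, s) triples meaning: blog i publishes an entry
    in time window s that links to blog j (0-based indices). Self-links are skipped.
    """
    A = np.zeros((n_windows, n_blogs, n_blogs))
    for i, j, s in links:
        if i != j:                      # only count links between different blogs
            A[s, i, j] += 1
    return A

# Toy usage: 4 blogs, 2 daily time windows
links = [(0, 1, 0), (0, 1, 0), (2, 3, 1), (1, 0, 1)]
A = build_data_tensor(links, n_blogs=4, n_windows=2)
print(A[0])   # snapshot graph A_1
```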
Basis Tensor B • For each snapshot graph As, identify dense subgraphs • Apply a graph partitioning algorithm • Shi's normalized cut or Newman's optimal modularity • Remove insignificant subgraphs • e.g., a subgraph with only a couple of nodes • This yields ms basis subgraphs for time window s • Over the t time windows there are m = Σs ms basis subgraphs in total • Stack them together to get the basis tensor B (see the sketch below)
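A hedged sketch of this step, using scikit-learn's spectral clustering on the symmetrized snapshot as a stand-in for the normalized-cut partitioning named on the slide; the number of partitions and the pruning threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def basis_subgraphs(A_s, n_parts=4, min_size=3):
    """Extract dense subgraphs from one snapshot graph A_s.

    Spectral clustering on the symmetrized adjacency matrix is used here as a
    stand-in for normalized-cut partitioning. Partitions with fewer than
    `min_size` nodes are discarded as insignificant. Returns a list of n x n
    matrices, each keeping only the links inside one partition.
    """
    n = A_s.shape[0]
    sym = A_s + A_s.T                            # symmetric affinity for clustering
    labels = SpectralClustering(n_clusters=n_parts,
                                affinity="precomputed",
                                assign_labels="discretize",
                                random_state=0).fit_predict(sym + 1e-9)
    subgraphs = []
    for c in range(n_parts):
        members = np.where(labels == c)[0]
        if len(members) < min_size:
            continue                             # drop subgraphs with only a couple of nodes
        mask = np.zeros((n, n))
        mask[np.ix_(members, members)] = 1.0
        subgraphs.append(A_s * mask)             # keep only intra-partition links
    return subgraphs
```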
Community Graphs and Intensities • Stack the basis subgraphs: B = [B1, B2, …, Bm] • A community graph is a linear combination of the basis subgraphs: Cl = u1l B1 + u2l B2 + u3l B3 + … + uml Bm (1) • upl is a weight that indicates how important the p-th basis subgraph is to the l-th community • The coefficients {upl} are parameters that need to be estimated (see the sketch below)
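A small numpy illustration of Equation (1), assuming the m basis subgraphs are stacked as an array of shape (m, n, n) and the coefficients upl form a non-negative matrix U of shape (m, k); the names are illustrative:

```python
import numpy as np

def community_graphs(B, U):
    """Equation (1): C_l = sum_p u_pl * B_p for each community l.

    B has shape (m, n, n) (stacked basis subgraphs); U has shape (m, k).
    Returns a (k, n, n) array whose l-th slice is the community graph C_l.
    """
    # tensordot contracts over the basis index p
    return np.tensordot(U.T, B, axes=(1, 0))
```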
Community Graphs and Intensities (cont.) • Communities behave concurrently • At time s, multiple community graphs can affect the structure of As • Community intensity vsl • How much the l-th community contributes at time s • Use the weighted sum Σl vsl Cl to approximate the observed data As • The problem is formulated as minimization of the approximation error
Community Graphs and Intensities (cont.) • For the t time windows, minimize the total error: min Σs || As − Σl vsl Cl ||² (2) • Plug equation (1) into equation (2) • Minimize the objective function J = || A − B ×3 (VUᵀ) ||², where U = [upl] and V = [vsl] (3) • ×3 denotes the 3-mode multiplication of a tensor by a matrix
Solution by Non-negative Matrix Factorization • The objective function (3) can be equivalently written in matrix form: J1 = || A − BUVᵀ ||² (4) • Each column A(s) of the matrix A is obtained by stacking the columns of the snapshot graph As into an n²×1 vector; the matrix B is obtained from the basis subgraphs in the same way • In equation (4), given A and B • Task: search for non-negative matrices U and V that minimize J1 (see the unfolding sketch below)
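A minimal sketch of this unfolding step, assuming the snapshot graphs (or basis subgraphs) are stored as a stacked numpy array; the function name is illustrative:

```python
import numpy as np

def unfold(T):
    """Flatten a stack of graphs of shape (t, n, n) into an (n*n, t) matrix.

    Column s is the graph T_s with its columns stacked into an n^2 x 1 vector,
    matching the matrix form of Equation (4).
    """
    t, n, _ = T.shape
    # flatten(order="F") stacks the columns of T_s on top of each other
    return np.stack([T[s].flatten(order="F") for s in range(t)], axis=1)

# With A_mat = unfold(A)  (n^2 x t)  and  B_mat = unfold(B)  (n^2 x m),
# the task is to find non-negative U (m x k) and V (t x k) that minimize
# || A_mat - B_mat @ U @ V.T ||^2.
```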
Smoothing by Regularization • Incorporate prior knowledge into the objective function via Tikhonov regularization: J2 = || A − BUVᵀ ||² + γ1 || R1U ||² + γ2 || R2V ||² (5), where γ1 and γ2 are user-defined parameters • In this paper, γ1 is set to 1 and R1 to the identity matrix • For V, apply intuitive prior knowledge about temporal trends • The difference between two consecutive (in temporal order) elements in the same column of V should be small • Set R2 to be a difference matrix • Tune γ2 to demonstrate the effect of smoothness in the experiments (see the sketch below)
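A sketch of one common choice of first-order difference matrix R2 and of the resulting regularized objective; the exact scaling and boundary handling are assumptions, not taken from the paper:

```python
import numpy as np

def difference_matrix(t):
    """(t-1) x t first-order difference matrix: (R2 @ V)[s] = V[s+1] - V[s].

    Penalizing ||R2 @ V||^2 keeps consecutive intensities in each column of V
    close to each other, which is the temporal-smoothness prior described above.
    """
    R2 = np.zeros((t - 1, t))
    R2[np.arange(t - 1), np.arange(t - 1)] = -1.0
    R2[np.arange(t - 1), np.arange(1, t)] = 1.0
    return R2

def regularized_objective(A_mat, B_mat, U, V, gamma1=1.0, gamma2=10.0):
    """Equation (5) with R1 = I: ||A - B U V^T||^2 + g1 ||U||^2 + g2 ||R2 V||^2."""
    R2 = difference_matrix(V.shape[0])
    fit = np.linalg.norm(A_mat - B_mat @ U @ V.T) ** 2
    return fit + gamma1 * np.linalg.norm(U) ** 2 + gamma2 * np.linalg.norm(R2 @ V) ** 2
```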
Iterative Updating Rules • Solve for U and V in Equation (5) • Start by setting U and V to some random non-negative matrices and then iteratively update U and V • Theorem 1: the multiplicative updating rules given in the paper (omitted on this slide) converge to non-negative solutions of the optimization problem whose objective function is given by Equation (5); a generic sketch of this style of update follows
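A hedged sketch in the spirit of such multiplicative updates, not a verbatim reproduction of the paper's Theorem 1: it applies the standard regularized-NMF update pattern to Equation (5) with R1 = I, splitting the sign-indefinite matrix R2ᵀR2 into positive and negative parts so that both factors stay non-negative; all parameter values are illustrative:

```python
import numpy as np

def factorize(A_mat, B_mat, k, gamma1=1.0, gamma2=10.0, n_iter=1000, eps=1e-9):
    """Regularized-NMF-style multiplicative updates for Equation (5) with R1 = I.

    A_mat: (n^2, t) unfolded data tensor; B_mat: (n^2, m) unfolded basis tensor.
    Returns non-negative U (m, k) and V (t, k).
    """
    rng = np.random.default_rng(0)
    m, t = B_mat.shape[1], A_mat.shape[1]
    U = rng.random((m, k))                     # random non-negative starting points
    V = rng.random((t, k))
    # first-order difference matrix R2 and the split of R2^T R2
    R2 = (np.diag(-np.ones(t)) + np.diag(np.ones(t - 1), 1))[: t - 1]
    L = R2.T @ R2
    Lpos, Lneg = np.maximum(L, 0), np.maximum(-L, 0)
    BtA, BtB = B_mat.T @ A_mat, B_mat.T @ B_mat
    for _ in range(n_iter):
        U *= (BtA @ V) / (BtB @ U @ (V.T @ V) + gamma1 * U + eps)
        V *= (BtA.T @ U + gamma2 * (Lneg @ V)) / (V @ (U.T @ BtB @ U) + gamma2 * (Lpos @ V) + eps)
    return U, V
```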
Some Practical Issues • Size of time windows and of basis subgraphs • The algorithm is not very sensitive to these two sizes, as long as they are not overly large • Size of time windows: days or weeks, and not necessarily uniform • Size of basis subgraphs: different numbers can be chosen in each time window • Number of communities • Try different k's and compare the reconstruction error • Choose a k that is reasonably small and explains the data reasonably well (see the sketch below)
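A small sketch of that model-selection loop, assuming a factorization routine such as the one sketched above is passed in; the candidate values of k and the relative-error criterion are illustrative assumptions:

```python
import numpy as np

def scan_k(A_mat, B_mat, factorize_fn, candidate_ks=(2, 4, 8, 16)):
    """Compare relative reconstruction error for different numbers of communities k.

    `factorize_fn(A_mat, B_mat, k)` should return non-negative factors (U, V),
    e.g. the multiplicative-update routine sketched earlier.
    """
    errors = {}
    for k in candidate_ks:
        U, V = factorize_fn(A_mat, B_mat, k)
        errors[k] = np.linalg.norm(A_mat - B_mat @ U @ V.T) / np.linalg.norm(A_mat)
    return errors   # pick the smallest k whose error is already reasonably low
```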
Experimental Studies • Synthetic data set • 150 blogs and 2 communities • NEC Laboratories America data set • 407 English blogs with 274,679 entries over 441 days (63 weeks), between 2005/07/10 and 2006/09/23 • These entries are connected by 148,681 links • Roughly two groups of blogs: technology-focused and politics-focused • Benchmark data set • From the WWW 2006 Workshop on the Weblogging Ecosystem • 8.37 million entries from 1.43 million different blog sites during a 3-week period • Restricted to the subset of blogs that contain at least one link • 141K blogs and 1.62 million links among them
Synthetic Data Set • Separate two overlapping communities that have different temporal trends
Synthetic Data Set (cont.) • With the temporal regularization, the recovered community intensities become smoother
Synthetic Data Set (cont.) • Generate a random number p between 1 and 3 and aggregate the next p time windows into a single one (non-uniform time windows)
NEC Data Set • Blog graph, with nodes grouped by focus: technology focus vs. politics focus
NEC Data Set (cont.) • Use the normalized cut algorithm → 50 communities • Report the number of links within each community every day
NEC Data Set (cont.) • Size of a node: the corresponding row sum in the community graph Cl in Equation (1) • Width of a link: the corresponding entry in Cl • This community is formed around an authoritative blog by David Sifry • David Sifry posted a comprehensive study on the current status of the blogosphere
Benchmark Data Set • Scalability • A weak point of this algorithm: as the number of blogs grows and the number of time windows t shrinks → difficult to extract meaningful communities • t/n is large → this method is preferable • t/n is small → a traditional approach is preferable
Running Time • Implemented in Matlab • Run on a PC with a 2 GHz Pentium IV processor and 2 GB of memory • Running time is reported in seconds per iteration • Under its convergence criterion, the algorithm converges within 1000 iterations