PET: A Statistical Model for Popular Events Tracking in Social Communities. Cindy Xide Lin (1), Bo Zhao (1), Qiaozhu Mei (2), Jiawei Han (1). (1) University of Illinois at Urbana-Champaign, (2) University of Michigan. KDD 2010. 2010. 09. 16.
Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
Contents • Introduction • Concept Definition • Problem Definition • Model • Interest model • Topic model • Experiment • Data Collection • Baseline and Gold standard • Analysis on Popularity Trend • Analysis on Content Evolution • Conclusions & Discussions
Introduction • Boom of online communities • e.g., Facebook, Blogger, Twitter, … • Facilitates information creation, sharing and diffusion • A popular topic or event can spread much faster • Need to track the diffusion and evolution of a popular event • Hot topics emerge, prevail and die • It is desirable to monitor whether people like an event, what they like, and how their interests change over time • e.g., Who is still interested in watching Avatar 50 days after its release date?
Introduction • Tracking the evolution of a popular topic is challenging • Diffusion of an event is vague • e.g., You don’t know whether I am interested in an event • e.g., and even if you do, you don’t know from whom I got this interest • Fortunately, a large volume of text data is generated by social communities • Besides communicating with friends, a web user also constantly generates text content such as blog posts • A network structure and a text collection evolve simultaneously and interrelatedly
Goal • Tracking a popular event in a time-variant social community • A stream of text information • A stream of network structures • Modeling the interest of users • Modeling the change of topic
Concept Definition: Network Stream • [Figure: example network snapshot with six nodes, v1 to v6] • Gk: the snapshot of the network at time tk • Network stream G = {G1, G2, …, Gn}
Concept Definition: Document Stream • [Figure: the same six-node network, with each node vi annotated by its document dk,i, a sequence of words such as w1, w2, w3, …] • Document collection at time tk: Dk = {dk,1, dk,2, …, dk,N} • Document collection stream D = {D1, D2, …, DT}
Concept Definition: Topic and Event • Topic • A topic θ is a multinomial distribution over words, {p(w|θ)}w∈W • A topic has different versions over time; the version at time tk is denoted θk • Event • A stream of topics ΘE = {θ0E, θ1E, θ2E, …, θTE} • θ0E is the primitive topic of the event • θkE corresponds to the version of θ0E at time tk • Indicates the major aspects of the event in network Gk
Concept Definition: Interest • Interest • hk(i): the level of interest that node vi in Gk has in the particular event at time tk • Real value between 0 and 1 • Hk = {hk(1), hk(2), …, hk(N)}
Problem: Popular Event Tracking • Inputs • Network stream G • Document stream D • Primitive topic of the event, θ0E • Task 1: Popularity Tracking • Inferring the latent stream of interests (Hk) • providing much richer information about how the interest e • Task 2: Topic Tracking • Inferring the latent stream of topics about the event, ΘE • Keeping track of new developments about the event • Understanding event evolution
Intuitions • Observation 1. Interest and Connections • The behavior of each individual is usually influenced by his or her friends. • Observation 2. Interest and History • The behavior of each individual should be generally consistent over time. • Events should not change dramatically. • Observation 3. Content and Interest • When an individual has a higher level of interest in an event, the content she generates should be more likely to be related to the event.
The General Model • Current interest and topic depend on • Current network • Current documents • Previous history (Markovian simplification) • Formal representation • P(Hk, Θk | Gk, Dk, Hk-1)
Assumption • How to model P(Hk, Θk | Gk, Dk, Hk-1)? • Assumption 1. • Given the current network structure Gk and the previous interest status Hk-1, • the current interest status Hk is independent of the document collection Dk • Hk ⊥ Dk | Gk, Hk-1 • People first become interested in the event and therefore generate discussion about it • Assumption 2. • Given the current interest status Hk and the document collection Dk, • the current topic model Θk is independent of Gk and Hk-1 • Θk ⊥ Gk, Hk-1 | Hk, Dk • Once the author has developed an interest in the event, the content she writes depends only on the event itself and her level of interest • P(Hk, Θk | Gk, Dk, Hk-1) = P(Hk | Gk, Hk-1) P(Θk | Hk, Dk)
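The factorization on this slide follows from the chain rule plus the two independence assumptions. The intermediate step is not shown on the slide; a short derivation, using only the slide's own symbols, would read:

```latex
\begin{aligned}
P(H_k, \Theta_k \mid G_k, D_k, H_{k-1})
  &= P(H_k \mid G_k, D_k, H_{k-1}) \,
     P(\Theta_k \mid H_k, G_k, D_k, H_{k-1})
     && \text{(chain rule)} \\
  &= P(H_k \mid G_k, H_{k-1}) \,
     P(\Theta_k \mid H_k, G_k, D_k, H_{k-1})
     && \text{(Assumption 1: } H_k \perp D_k \mid G_k, H_{k-1}\text{)} \\
  &= P(H_k \mid G_k, H_{k-1}) \,
     P(\Theta_k \mid H_k, D_k)
     && \text{(Assumption 2: } \Theta_k \perp G_k, H_{k-1} \mid H_k, D_k\text{)}
\end{aligned}
```

The first factor is the interest model and the second is the topic model, which is why the two can be described separately.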
Interest Model • [Figure: a node and its weighted neighbors] • Example: h’ = 1*0.2 + 0.3*0.8 + 0.2*0.1 = 0.46 • Gibbs random field • Widely used in studying natural processes • Based on the Gibbs distribution (cf. the Gaussian distribution is a special member of the Gibbs distribution family) • P(Hk | Gk, Hk-1) • h’(k) is the weighted sum of friends’ interest • The first part is the transition energy of node i • The last part represents the neighbors’ expectation
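The neighbor expectation h’ in the worked example can be reproduced with a few lines of Python. This is only a sketch of the weighted sum, not of the full Gibbs random field. The function name `neighbor_expectation` is hypothetical, and reading the first factor of each product as the edge weight and the second as the neighbor's interest is an assumption, since the slide does not label them.

```python
def neighbor_expectation(weights, interests):
    """Weighted sum of neighbors' interest levels (h' on the slide).

    weights[j] is the edge weight to neighbor j and interests[j] is that
    neighbor's interest level. Which number is which is an assumption.
    """
    if len(weights) != len(interests):
        raise ValueError("weights and interests must have the same length")
    return sum(w * h for w, h in zip(weights, interests))


# Numbers from the slide: 1*0.2 + 0.3*0.8 + 0.2*0.1
h_prime = neighbor_expectation([1.0, 0.3, 0.2], [0.2, 0.8, 0.1])
print(round(h_prime, 2))  # 0.46
```

Note that the sum is not normalized by the total weight here, which matches the arithmetic on the slide.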
Topic Model • Each document is assumed to be generated by a mixture of two multinomial component models • Background model: θkB • Models common words • Latent event topic model: θkE • Models discriminative and meaningful words • The probability of generating a word • P(Θk | Hk, Dk)
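A two-component mixture of this kind can be sketched as follows. The slide's formula for the word probability did not survive, so the mixing rule below is an assumption made for illustration: the author's interest level is used directly as the weight of the event topic, in the spirit of Observation 3. The function name `word_probability` and the toy distributions are hypothetical.

```python
def word_probability(word, background, event_topic, interest):
    """Probability of a word under a two-component multinomial mixture.

    background and event_topic map words to probabilities.
    interest (between 0 and 1) is used as the mixing weight of the event
    topic. This weighting is an illustrative assumption, not a formula
    taken from the slides.
    """
    if not 0.0 <= interest <= 1.0:
        raise ValueError("interest must lie in [0, 1]")
    p_background = background.get(word, 0.0)
    p_event = event_topic.get(word, 0.0)
    return (1.0 - interest) * p_background + interest * p_event


# Toy distributions over a three-word vocabulary (made-up numbers).
background = {"the": 0.6, "movie": 0.3, "avatar": 0.1}
event_topic = {"the": 0.1, "movie": 0.3, "avatar": 0.6}

for h in (0.0, 0.5, 1.0):
    print(h, word_probability("avatar", background, event_topic, h))
```

Under this assumption, a user with no interest writes purely from the background model, and the event-specific word "avatar" becomes more likely as the interest level rises.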
Twitter Data Collection • Selecting 5000 users with follower-followee relationships • Considering each day as a time point (tk: the kth day) • Document dk,i is obtained by concatenating the tweets displayed by user i on day k • The weight of the relationship between users i and j equals the number of tweets displayed by user i by following user j during the period from tk-30 to tk
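The two construction steps on this slide (daily documents and edge weights) can be sketched in Python. The record layouts, the function names `build_documents` and `edge_weight`, and the choice to treat the 30-day window as inclusive at both ends are all assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_documents(tweets):
    """Concatenate each user's tweets per day into one document d_{k,i}.

    tweets: iterable of (user, day, text) tuples (hypothetical layout).
    Returns a dict mapping (day, user) to the concatenated text.
    """
    parts = defaultdict(list)
    for user, day, text in tweets:
        parts[(day, user)].append(text)
    return {key: " ".join(texts) for key, texts in parts.items()}


def edge_weight(displays, i, j, day, window=30):
    """Count tweets by user j displayed to follower i in the last `window` days.

    displays: iterable of (viewer, author, day) tuples (hypothetical layout).
    The window [day - window, day] is assumed inclusive at both ends.
    """
    start = day - timedelta(days=window)
    return sum(
        1
        for viewer, author, d in displays
        if viewer == i and author == j and start <= d <= day
    )


tweets = [
    ("u1", date(2010, 1, 2), "saw avatar"),
    ("u1", date(2010, 1, 2), "great movie"),
    ("u2", date(2010, 1, 2), "rainy day"),
]
docs = build_documents(tweets)
print(docs[(date(2010, 1, 2), "u1")])
```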
Baseline and Gold Standard • BOM: daily box office earnings extracted from Box Office Mojo • Box office earnings are a trustworthy criterion for a movie’s popularity • GInt: Google Insight • PET • PET- : a special version of PET with the network structure removed • JonK / Cont
Analysis on Popularity Trend • PET always has the best performance • Historic, textual and structural information is reflected well • PET- cannot respond sufficiently to sudden changes
Conclusion & Discussion • Proposes the novel problem of Popular Event Tracking • Proposes a popular event tracking model, PET • A unified probabilistic framework to model different factors • Covers classical models • Experimental studies show that PET outperforms existing models • PET is not a good framework for tracking interest • More accurate data, such as Google Insight, already exist. • Tracking topic change is a novel problem. • PET detects and tracks topic evolution well.