Data Association for Topic Intensity Tracking
Andreas Krause, Jure Leskovec, Carlos Guestrin
School of Computer Science, Carnegie Mellon University
Document classification
• Emails from two topics: Conference and Hiking
• "Will you go to ICML too?" → P(C | words) = .9 → Conference
• "Let's go hiking on Friday!" → P(C | words) = .1 → Hiking
A more difficult example
• Emails from two topics: Conference and Hiking
• 2:00 pm: "Let's have dinner after the talk." → P(C | words) = .7 → Conference
• 2:03 pm: "Should we go on Friday?" → P(C | words) = .5 (could refer to both topics!)
• What if we had temporal information?
• How about modeling emails as an HMM? [Figure: HMM with hidden topics C1, C2, …, Ct, Ct+1 emitting documents D1, D2, …, Dt, Dt+1]
• An HMM assumes equal time steps and "smooth" topic changes. Are these valid assumptions?
Typical email traffic (Enron data)
[Figure: arrival times of Topic 1 and Topic 2 emails over time, showing dense bursts separated by stretches with no emails]
• Email traffic is very bursty; we cannot model it with uniform time steps!
• Topic intensities change over time, separately per topic
• Bursts tell us how intensely a topic is pursued; bursts are potentially very interesting!
Identifying both topics and bursts
• Given:
  • A stream of documents (emails): d1, d2, d3, …
  • and corresponding document inter-arrival times (time between consecutive documents): Δ1, Δ2, Δ3, …
• Simultaneously:
  • Classify (or cluster) documents into K topics
  • Predict the topic intensities, i.e., the time between consecutive documents from the same topic
Data association problem
[Figure: timeline of Conference and Hiking emails; phases of high intensity for "Conference" with low intensity for "Hiking" and vice versa are visible when topics are known, but both intensities are "???" when they are not]
• If we know the email topics, we can identify bursts
• If we don't know the topics, we can't identify bursts!
• Two-step solution: first classify documents, then identify bursts [Kleinberg '03]. Can fail badly!
• This paper: simultaneously identify topics and bursts!
The Task
• Have to solve a data association problem:
• We observe: message Δ's, the time between the arrivals of consecutive documents
• We want to estimate: topic Δ's, the time between messages of the same topic
• We can then compute the topic intensity L = E[1/Δ] (see the sketch below)
• Therefore, we need to associate each document with a topic
• Chicken-and-egg problem: need topics to identify intensity; need intensity to classify (better)
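A minimal sketch of the intensity notion (not the paper's code; the rate and sample size are made up): for exponentially distributed topic Δ's, the intensity L is the exponential rate, and its maximum-likelihood estimate from observed deltas is 1/mean(Δ).

```python
# A minimal sketch, not the paper's code: for exponentially distributed
# topic deltas, the intensity L is the exponential rate, and its maximum-
# likelihood estimate from observed deltas is 1 / mean(delta).
import numpy as np

rng = np.random.default_rng(0)
true_intensity = 2.0                                 # e.g. 2 messages per hour
topic_deltas = rng.exponential(1.0 / true_intensity, size=1000)

estimated_intensity = 1.0 / topic_deltas.mean()
print(f"true L = {true_intensity}, estimated L = {estimated_intensity:.2f}")
```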
How to reason about topic deltas?
• Associate with each email a vector τ of expected arrival times per topic:
  • Email 1, Conference, at 2:00 pm: τ1 = (C: 2:00 pm, H: 2:30 pm)
  • Email 2, Hiking, at 2:30 pm: τ2 = (C: 4:15 pm, H: 2:30 pm)
  • Email 3, Conference, at 4:15 pm: τ3 = (C: 4:15 pm, H: 7:30 pm)
• Topic Δ = 2h 15min (between consecutive messages of the same topic, here Conference)
• Message Δ2 = min τ2 − min τ1 (= 30 min)
• Topic C2 = argmin τ2 (= Hiking)
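To make the bookkeeping concrete, here is a minimal sketch (variable names and dates are mine, not the paper's) that recovers the slide's message Δ's and topic assignments from the τ vectors:

```python
# A minimal sketch: deriving message deltas and topic assignments from
# per-topic expected-arrival-time (ETA) vectors, as on the slide.
from datetime import datetime

# tau_t: expected next-arrival time per topic (dates are placeholders)
tau = [
    {"C": datetime(2001, 1, 1, 14, 0),  "H": datetime(2001, 1, 1, 14, 30)},  # tau_1
    {"C": datetime(2001, 1, 1, 16, 15), "H": datetime(2001, 1, 1, 14, 30)},  # tau_2
    {"C": datetime(2001, 1, 1, 16, 15), "H": datetime(2001, 1, 1, 19, 30)},  # tau_3
]

for t in range(1, len(tau)):
    delta_t = min(tau[t].values()) - min(tau[t - 1].values())  # message delta
    topic_t = min(tau[t], key=tau[t].get)                      # argmin = topic
    print(f"Delta_{t + 1} = {delta_t}, C_{t + 1} = {topic_t}")
# Prints Delta_2 = 0:30:00, C_2 = H -- matching the slide's 30 min / Hiking.
```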
Generating message arrival times
• Want a generative model for the time vectors τ1, τ2, τ3, governed by per-topic intensities L(C)1, L(C)2, L(C)3 and L(H)1, L(H)2, L(H)3
[Figure: τ1 = (C: 2:00 pm, H: 2:30 pm) → τ2 = (C: 4:15 pm, H: 2:30 pm) → τ3 = (C: 4:15 pm, H: 7:30 pm)]
• The active topic's arrival time is incremented by an exponential distribution, e.g. Exp(L(C)) for Conference
• The other topic's arrival time does not change, as that topic is not "active"
Generative Model (conceptual)
[Figure: graphical model with intensity chains L(C)t−1 → L(C)t → L(C)t+1 ("Conference") and L(H)t−1 → L(H)t → L(H)t+1 ("Hiking") feeding the ETA vectors τt−1 → τt → τt+1, which generate the message Δt and topic Ct; the topic emits the document Dt]
• Problem: need to reason about the entire history of arrival times τt! (The domain of τt grows linearly with time.) Makes inference intractable, even for few topics!
Do we really need ETA vectors?
• We know Δt = min τt − min τt−1.
• Since topic Δ's follow an exponential distribution, memorylessness implies
  P(τt+1(C) > 4pm | τt(C) = 2pm, it's now 3pm) = P(τt+1(C) > 4pm | τt(C) = 3pm, it's now 3pm)
  The last arrival time is irrelevant!
• Hence Δt is distributed as min {Exp(Lt(C)), Exp(Lt(H))}
  • Closed form: Δt ~ Exp(Lt(C) + Lt(H))
• Similarly, Ct ~ argmin {Exp(Lt(C)), Exp(Lt(H))}
  • Closed form: Ct ~ Bernoulli(Lt(C) / (Lt(C) + Lt(H)))
• Can discard the ETA vectors! A quite general modeling trick! (A quick simulation check follows.)
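A quick simulation sketch (rates are made up) confirming both closed forms from the slide: the minimum of independent exponentials is exponential with the summed rate, and the argmin is Bernoulli with probability proportional to the rate.

```python
# Verify the exponential order statistics used by the slide:
# min{Exp(a), Exp(b)} ~ Exp(a + b), and P(argmin = C) = a / (a + b).
import numpy as np

rng = np.random.default_rng(0)
rate_c, rate_h = 3.0, 1.0                 # intensities L(C), L(H) (made up)
n = 200_000

samples_c = rng.exponential(1.0 / rate_c, n)
samples_h = rng.exponential(1.0 / rate_h, n)

deltas = np.minimum(samples_c, samples_h)   # message deltas
topic_is_c = samples_c < samples_h          # argmin picks the topic

print(f"E[Delta] = {deltas.mean():.4f}  (theory: {1 / (rate_c + rate_h):.4f})")
print(f"P(C)     = {topic_is_c.mean():.4f}  (theory: {rate_c / (rate_c + rate_h):.4f})")
```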
Implicit Data Association (IDA) Model
[Figure: the conceptual model with the ETA vectors τt−1, τt, τt+1 removed; the intensity chains L(C) and L(H) now directly generate the message Δt and topic Ct, which emits the document Dt]
• Turns the model (essentially) into a Factorial HMM
• Many efficient inference techniques available!
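A minimal generative sketch of the simplified model (the intensities, vocabulary, and document model are my toy choices, not the paper's): the intensities determine Δt and Ct via the order-statistics trick, and the topic emits a bag-of-words document.

```python
# Toy generative sketch of the simplified IDA model.
import numpy as np

rng = np.random.default_rng(1)
T = 50
rates = {"C": 2.0, "H": 0.5}                      # assumed intensities
word_probs = {"C": [0.7, 0.3], "H": [0.2, 0.8]}   # toy 2-word vocabulary

stream = []
for t in range(T):
    # (In the full model the intensities would also evolve over time;
    #  here they are held fixed for brevity.)
    total = sum(rates.values())
    delta = rng.exponential(1.0 / total)          # Delta_t ~ Exp(L_C + L_H)
    topic = "C" if rng.random() < rates["C"] / total else "H"  # Bernoulli argmin
    words = rng.multinomial(10, word_probs[topic])  # document D_t given topic
    stream.append((delta, topic, words))

print(stream[:3])
```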
Exponential distribution appropriate?
• Previous work on document streams (e.g., Kleinberg '03) frequently used it to model transition times
• When adding hidden variables, we can model arbitrary transition distributions (Nodelman et al.)
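As a sketch of that hidden-variable idea (parameters made up, and this is my illustration rather than Nodelman et al.'s construction): chaining k hidden exponential stages yields an Erlang distribution, a simple phase-type family that is no longer memoryless.

```python
# Chaining k hidden Exp(rate) stages gives an Erlang(k, rate) transition
# time: a simple phase-type distribution built from exponentials.
import numpy as np

rng = np.random.default_rng(2)
k, rate = 3, 2.0                                   # 3 hidden stages, rate 2.0

# Each transition passes through k hidden exponential stages before firing.
samples = rng.exponential(1.0 / rate, size=(100_000, k)).sum(axis=1)

print(f"mean {samples.mean():.3f} (theory {k / rate:.3f}), "
      f"var {samples.var():.3f} (theory {k / rate**2:.3f})")
```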
Experimental setup
• Inference procedures:
  • Full (conceptual) model: particle filter
  • Simplified model: particle filter, fully factorized mean field, exact inference
• Comparison to the two-step approach (first classify, then identify bursts)
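For illustration, a minimal bootstrap particle filter for the simplified model; this is my own simplification rather than the paper's implementation (the intensity levels, jump probability, and inputs are assumptions).

```python
# Bootstrap particle filter sketch for the simplified IDA model: each
# particle carries per-topic intensities; weights combine the delta
# likelihood Exp(L_C + L_H) with the topic prior times the word likelihood.
import numpy as np

rng = np.random.default_rng(3)
N = 1000                                   # number of particles
levels = np.array([0.5, 2.0])              # assumed discrete intensity levels

def filter_step(particles, delta, word_loglik):
    """particles: (N, 2) intensities [L_C, L_H]; word_loglik: per-topic
    document log-likelihoods log p(words | topic)."""
    # Transition: each intensity jumps to a random level with prob. 0.1
    jump = rng.random(particles.shape) < 0.1
    particles = np.where(jump, rng.choice(levels, size=particles.shape), particles)
    total = particles.sum(axis=1)
    # p(delta | L) = total * exp(-total * delta)
    log_w = np.log(total) - total * delta
    # ... times sum_c P(c | L) p(words | c), with P(C | L) = L_C / total
    prior_c = particles[:, 0] / total
    log_w += np.log(prior_c * np.exp(word_loglik[0])
                    + (1.0 - prior_c) * np.exp(word_loglik[1]))
    w = np.exp(log_w - log_w.max())
    idx = rng.choice(N, size=N, p=w / w.sum())   # multinomial resampling
    return particles[idx]

particles = rng.choice(levels, size=(N, 2))      # columns: L(C), L(H)
particles = filter_step(particles, delta=0.3,
                        word_loglik=np.array([-1.0, -2.5]))
print("posterior mean intensities (C, H):", particles.mean(axis=0))
```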
Results (Synthetic data)
• Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…
[Figure: topic delta vs. message number, comparing the true topic sequence, exact inference, the particle filter (full model), and the weighted automaton (first classify, then bursts)]
• Naïve Bayes misclassifies based on features
• Implicit Data Association gets both topics and intensity right, in spite of severe (30%) label noise; the memorylessness trick identifies the true intensity
• Separate topic and burst identification fails badly
Inference comparison (Synthetic data)
• Two topics, with different frequency patterns
[Figure: topic delta for topic 1 vs. message number (both topics combined), comparing exact inference, the particle filter, and mean field; lower delta means more bursty]
• Implicit Data Association identifies the true frequency parameters (does not get distracted by the observed Δ)
• In addition to exact inference (feasible for few topics), several approximate inference techniques perform well
Experiments on real document streams
• Enron email corpus
  • 517,431 emails from 151 employees
  • Selected 554 messages from the tech-memos and universities folders of Kaminski
  • Stream between December 1999 and May 2001
• Reuters news archive
  • Contains 810,000 news articles
  • Selected 2,303 documents from four topics: wholesale prices, environment issues, fashion, and obituaries
Intensity identification for Enron data
[Figure: topic delta over time for the true topic, the weighted automaton (WAM), and IDA-IT; lower delta means more bursty]
• Implicit Data Association identifies bursts which are missed by the two-step approach
Reuters news archive
[Figure: topic delta vs. message number for the true topic, WAM, and IDA-IT]
• Again, simultaneous topic and burst identification outperforms the separate approach
What about classification?
• Temporal modeling effectively changes the class prior over time (a small sketch follows)
• Impact on classification accuracy?
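A minimal sketch of the effect (all numbers made up): the topic posterior combines the word likelihood with an intensity-dependent prior P(c) = L_c / Σ L, so a burst can flip a decision that words alone would get wrong.

```python
# Intensity-dependent class prior: P(C | words, Delta) combines the word
# likelihood with the Bernoulli topic prior from the current intensities.
import numpy as np

log_lik = {"C": -4.0, "H": -3.5}     # Naive Bayes log p(words | topic)
rates = {"C": 5.0, "H": 0.5}         # current intensity estimates: C is bursting

log_post = {c: log_lik[c] + np.log(rates[c] / sum(rates.values()))
            for c in log_lik}
z = np.logaddexp(log_post["C"], log_post["H"])   # normalizer
print({c: np.exp(lp - z) for c, lp in log_post.items()})
# Words alone slightly favor H, but the burst in C flips the decision.
```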
Classification performance
[Figure: classification error of the IDA Model vs. Naïve Bayes; lower is better]
• Modeling intensity leads to improved classification accuracy
Generalizations
• Learning paradigms: not just the supervised setting, but also:
  • Unsupervised / semi-supervised learning
  • Active learning (select the most informative labels)
  • See paper for details
• Other document representations / classifiers (just need P(Dt | Ct))
• Other applications: fault detection, activity recognition, …
Tracking topic drift over time
[Figure: the IDA model extended with a chain of topic parameters θt−1 → θt → θt+1 alongside the intensity chains L(C) and L(H); Δt and Ct as before, and the document Dt is observed in LSI representation]
• The topic parameters θt (means of the LSI representation) track the topic means via a Kalman Filter
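A minimal sketch (my simplification, not the paper's implementation) of tracking a drifting topic mean in LSI space with a per-dimension Kalman filter under random-walk dynamics.

```python
# Scalar-per-dimension Kalman filter: theta_t = theta_{t-1} + process noise,
# observed document d_t = theta_t + observation noise (both Gaussian).
import numpy as np

rng = np.random.default_rng(4)
dim, T = 5, 100
q, r = 0.01, 0.25                        # assumed process / observation variances

theta_true = np.zeros(dim)
mean, var = np.zeros(dim), np.ones(dim)  # filter state (diagonal covariance)
for t in range(T):
    theta_true += rng.normal(0, np.sqrt(q), dim)        # topic drifts
    doc = theta_true + rng.normal(0, np.sqrt(r), dim)   # observed LSI document
    var += q                                            # predict
    gain = var / (var + r)                              # Kalman gain
    mean += gain * (doc - mean)                         # update mean
    var *= (1 - gain)                                   # update variance

print("tracking error:", np.linalg.norm(mean - theta_true))
```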
Conclusion
• General model for data association in data streams
• Exponential order statistics enable implicit data association and tractable exact inference
• A principled model for "changing class priors" over time
• Can be used in supervised, unsupervised, semi-supervised, and active learning settings
• Synergistic effect between intensity estimation and classification on several real-world data sets