Mining Correlated Bursty Topic Patterns from Coordinated Text Streams Xuanhui Wang, ChengXiang Zhai, Xiao Hu, Richard Sproat From KDD 07
Outline. • Introduction • Preliminaries • Coordinated mixture model • Experiment • Conclusion
Introduction. • Text mining research has almost exclusively focused on mining a single text stream. • Topic Detection and Tracking (TDT) • Others
Preliminaries. • Text Stream: A text stream S of length n and with vocabulary V is an ordered sequence of text samples (S1, S2, ..., Sn) indexed by time, where Si is a sequence of words from the vocabulary set V at time point i.
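As a concrete illustration, coordinated streams can be represented as parallel lists of per-time word counts; this minimal Python sketch is hypothetical (the paper does not prescribe a data structure):

```python
from collections import Counter

# A text sample S_it is a bag of words; a stream is the list of its
# samples indexed by time. Coordinated streams share the time index.
english = [Counter({"storm": 3, "coast": 1}), Counter({"election": 2})]
chinese = [Counter({"风暴": 2, "海岸": 1}), Counter({"选举": 4})]

streams = [english, chinese]                 # S = {S_1, ..., S_m}
assert len({len(s) for s in streams}) == 1   # same length n, shared time index
```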
Preliminaries. (cont) • Coordinated Text Streams: A set of text streams is called coordinated text streams if all the streams share the same time index and have the same length.
Preliminaries. (cont) • Topic: A topic in stream Si is defined as a probability distribution over the words in vocabulary set Vi. We also call such a word distribution a topic model.
Preliminaries. (cont) • Bursty Topic: Let θ be a topic (model) in stream Si. Let t ∈ [1, n] be a time index variable and p(θ|t, Si) be the relative coverage of the topic θ at time t in stream Si. θ is a bursty topic in stream Si if ∃t1, t2 ∈ [1, n] such that t2 − t1 ≥ σ and ∀t ∈ [t1, t2], p(θ|t, Si) ≥ κ where σ is a span threshold and κ is a coverage threshold.
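The bursty-topic condition can be checked directly on the coverage series p(θ|t, Si); a minimal sketch with illustrative names (not from the paper):

```python
def is_bursty(coverage, sigma, kappa):
    """coverage[t] = p(theta | t, S_i). True iff there exist t1, t2 with
    t2 - t1 >= sigma and coverage >= kappa at every t in [t1, t2],
    i.e. a run of at least sigma + 1 consecutive high-coverage points."""
    run = 0  # length of the current run of high-coverage time points
    for p in coverage:
        run = run + 1 if p >= kappa else 0
        if run >= sigma + 1:
            return True
    return False
```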
Preliminaries. (cont) • Correlated Bursty Topic Pattern: A correlated bursty topic pattern in a set of coordinated text streams S = {S1, ..., Sm} is defined as a set of topics {θ1 , ..., θm} such that θi is a bursty topic in stream Si and ∃t1, t2 ∈ [1, n] such that t2 − t1 ≥ σ and ∀t ∈ [t1, t2], ∀i ∈ [1,m], p(θi |t, Si) ≥ κ where σ is a span threshold and κ is a coverage threshold.
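The correlated pattern additionally requires one shared window in which every stream clears the coverage threshold; a sketch in the same illustrative style:

```python
def is_correlated_bursty(coverages, sigma, kappa):
    """coverages[i][t] = p(theta_i | t, S_i) for stream i. True iff some
    shared window [t1, t2] with t2 - t1 >= sigma has coverage >= kappa
    in every stream at every time point of the window."""
    n = len(coverages[0])
    run = 0
    for t in range(n):
        ok = all(cov[t] >= kappa for cov in coverages)
        run = run + 1 if ok else 0
        if run >= sigma + 1:
            return True
    return False
```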
Coordinated mixture model. • The basic idea of our approach is to align the text samples from different streams based on the shared time stamps and discover topics from multiple streams simultaneously with a single probabilistic mixture model.
Coordinated mixture model. (cont) • There are two problems with this simple approach: (1) We will need to match topics across different streams, which is difficult because the vocabularies of different streams do not necessarily overlap. (2) The topics discovered in each stream may explain the corresponding stream well but not necessarily match the common topics shared by multiple streams.
Coordinated mixture model. (cont) • Formal Definition: Let S = {S1, ..., Sm} be m coordinated text streams with vocabularies V1, ..., Vm. Assume there are k correlated bursty topic patterns in the streams. Let z ∈ [1, k] be a latent pattern variable and w ∈ Vi a word in stream Si.
Coordinated mixture model. (cont) • The generative model: • We assume that a word w appears at time t in stream Si with probability P(w|t, i). • λB is the mixture weight of the background model. • P(z|t) is the probability of choosing pattern z at time point t.
Coordinated mixture model. (cont) • P(w|t, i) = λB P(w|Bi) + (1 − λB) Σz∈[1,k] P(z|t) P(w|z, i) • P(w|Bi) is the background model of stream Si, estimated from the whole stream: P(w|Bi) = Σt c(w, Sit) / Σw'∈Vi Σt c(w', Sit) • P(w|z, i) is the word distribution of pattern z in stream Si; the time-dependent mixing weights P(z|t) are shared by all streams.
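A sketch of the mixture probability under an assumed table layout (p_bg, p_wz, and p_zt are hypothetical names for the model parameters above):

```python
def p_word(w, t, i, lam_b, p_bg, p_wz, p_zt, k):
    """P(w|t,i) = lam_b * P(w|B_i) + (1 - lam_b) * sum_z P(z|t) * P(w|z,i).
    p_bg[i][w] is the background model, p_wz[z][i][w] the word distribution
    of pattern z in stream i, p_zt[t][z] the shared time-dependent weight."""
    topical = sum(p_zt[t][z] * p_wz[z][i].get(w, 0.0) for z in range(k))
    return lam_b * p_bg[i].get(w, 0.0) + (1.0 - lam_b) * topical
```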
Coordinated mixture model. (cont) • The log-likelihood of generating text sample Sit is L(Sit) = Σw∈Vi c(w, Sit) log P(w|t, i) • c(w, Sit) is the count of word w in Sit. • The log-likelihood of generating all the m coordinated streams is L(S) = Σi∈[1,m] Σt∈[1,n] L(Sit)
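Weighting each word's log-probability by its count gives the total log-likelihood; this sketch reuses the hypothetical p_word above and assumes the background model gives positive probability to every observed word:

```python
import math

def log_likelihood(streams, lam_b, p_bg, p_wz, p_zt, k):
    """L(S) = sum_i sum_t sum_w c(w, S_it) * log P(w|t,i)."""
    total = 0.0
    for i, stream in enumerate(streams):      # streams[i][t] is a Counter
        for t, sample in enumerate(stream):
            for w, c in sample.items():
                total += c * math.log(p_word(w, t, i, lam_b, p_bg, p_wz, p_zt, k))
    return total
```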
Coordinated mixture model. (cont) • Parameter Estimation: • The parameters P(w|z, i) and P(z|t) are estimated iteratively with the expectation-maximization (EM) algorithm.
Coordinated mixture model. (cont) • The expectation step calculates the posterior probability that word w at time t in stream Si was generated by pattern z: P(z|w, t, i) = (1 − λB) P(z|t) P(w|z, i) / (λB P(w|Bi) + (1 − λB) Σz' P(z'|t) P(w|z', i)) • The maximization step updates the probabilities: P(z|t) ∝ Σi Σw∈Vi c(w, Sit) P(z|w, t, i) and P(w|z, i) ∝ Σt c(w, Sit) P(z|w, t, i)
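One EM iteration under these updates might look like the following sketch (Counter-based streams as above; plain, unsmoothed updates, assuming the background model covers every observed word):

```python
def em_step(streams, lam_b, p_bg, p_wz, p_zt, k):
    """One EM iteration over all streams and time points."""
    n, m = len(streams[0]), len(streams)
    zt_counts = [[1e-12] * k for _ in range(n)]             # counts for P(z|t)
    wz_counts = [[{} for _ in range(m)] for _ in range(k)]  # counts for P(w|z,i)
    for i, stream in enumerate(streams):
        for t, sample in enumerate(stream):
            for w, c in sample.items():
                topical = [p_zt[t][z] * p_wz[z][i].get(w, 0.0) for z in range(k)]
                denom = lam_b * p_bg[i][w] + (1 - lam_b) * sum(topical)
                for z in range(k):
                    # E-step: posterior that this occurrence came from pattern z.
                    post = c * (1 - lam_b) * topical[z] / denom
                    zt_counts[t][z] += post          # pooled over ALL streams
                    wz_counts[z][i][w] = wz_counts[z][i].get(w, 0.0) + post
    # M-step: normalize the expected counts into probability distributions.
    new_zt = [[v / sum(row) for v in row] for row in zt_counts]
    new_wz = []
    for row in wz_counts:
        new_row = []
        for d in row:
            total = sum(d.values())
            new_row.append({w: v / total for w, v in d.items()} if total else {})
        new_wz.append(new_row)
    return new_wz, new_zt
```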
Coordinated mixture model. (cont) • Constraining EM with Temporal Dependency: after each maximization step, P(z|t) is smoothed with its values at the neighboring time points, with interpolation weight λ, so that a pattern's coverage varies smoothly over time and its bursts form contiguous spans.
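One possible form of this constraint, shown purely as an assumption (the exact smoothing scheme is not reproduced here), interpolates each P(z|t) with the average of its temporal neighbors after the M-step, using λ as the smoothing weight:

```python
def smooth_temporal(p_zt, lam=0.1):
    """Assumed scheme: P'(z|t) = (1 - lam) * P(z|t)
    + lam * mean of P(z|t-1) and P(z|t+1), where they exist."""
    n, k = len(p_zt), len(p_zt[0])
    out = []
    for t in range(n):
        nbrs = [p_zt[u] for u in (t - 1, t + 1) if 0 <= u < n]
        if not nbrs:                  # single time point: nothing to smooth
            out.append(list(p_zt[t]))
            continue
        out.append([(1 - lam) * p_zt[t][z]
                    + lam * sum(nb[z] for nb in nbrs) / len(nbrs)
                    for z in range(k)])
    return out
```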
Coordinated mixture model. (cont) • Mutual Reinforcement across Streams: because the mixing weights P(z|t) are shared by all streams, strong evidence for a pattern in one stream raises P(z|t), which in turn pulls the word distributions P(w|z, i) of the other streams toward the same pattern; correlated bursty topics thus reinforce each other across streams.
Experiment. • The news streams consist of six months of news articles from the Xinhua English and Chinese newswires, dated from June 8th, 2001 through November 7th, 2001.
Experiment. (cont) • We use λB = 0.95 in our experiments. • We set λ = 0.1 in the following experiments. • Only bursty patterns that satisfy σ = 5 with κ = 0.01 are kept.
Experiment. (cont) • PLSA (document-based clustering) is used as a baseline for comparison.
Experiment. (cont) • Mutual reinforcement: noisy words such as "APEC" and "economic."