Modeling Document Dynamics: An Evolutionary Approach
Jahna Otterbacher, Dragomir Radev
Computational Linguistics and Information Retrieval (CLAIR)
{jahna, radev}@umich.edu
What are dynamic texts?
• Sets of topically related documents (news stories, Web pages, etc.)
• Multiple sources
• Written/published at different points in time – may change over time
• Challenging features:
  • Paraphrases
  • Contradictions
  • Incorrect/biased information
Milan plane crash: April 18, 2002
04/18/02 13:17 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.
04/18/02 13:42 (ABCNews) The plane was destined for Italy’s capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.
04/18/02 13:42 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.
04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan’s Linate airport, De Simone said.
Problem for IR systems
• User poses a question or query to a system
• Known facts change at different points in time
• Sources contradict one another
• Many paraphrases – similar but not necessarily equivalent – information
• What is the “correct” information? What should be returned to the user?
Current Goals
• Propose that dynamic texts “evolve” over time
• Chronology recovery task: given a cluster of related documents, recover the order in which they were written
• Approaches:
  • Phylogenetics: reconstruct the history of a set of species based on their DNA
  • Language modeling: an LM constructed from the first document should fit later documents less and less well over time
Phylogenetic models
• [Fitch & Margoliash, 67]
• Given a set of species and information about their DNA, construct a tree that describes how they are related, w.r.t. a common ancestor
• The statistically optimal tree minimizes the deviation between the original distances and those represented in the tree
[Figure: example distance matrix and candidate tree for bear, dog and wolf, with pairwise distances and branch lengths]
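To make the distance-to-tree step concrete, here is a minimal sketch in Python. It is not the Fitch-Margoliash program itself: SciPy’s average-linkage (UPGMA-style) clustering stands in for it, and the species names and distance values are illustrative placeholders rather than data from the slide.

```python
# Sketch: build a tree from a pairwise distance matrix (UPGMA-style stand-in
# for Fitch-Margoliash). Species and distances are placeholders.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, to_tree

species = ["bear", "dog", "wolf"]                # hypothetical taxa
D = np.array([[0.0, 22.0, 20.0],                 # hypothetical pairwise distances
              [22.0, 0.0, 24.0],
              [20.0, 24.0, 0.0]])

# Condense the symmetric matrix and cluster; each merge creates an internal node.
tree = to_tree(linkage(squareform(D), method="average"))

def show(node, depth=0):
    """Print the tree; internal nodes are labelled with their merge distance."""
    label = species[node.id] if node.is_leaf() else f"node@{node.dist:.1f}"
    print("  " * depth + label)
    if not node.is_leaf():
        show(node.get_left(), depth + 1)
        show(node.get_right(), depth + 1)

show(tree)
```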
Phylogenetic models (2)
• History of chain letters [Bennett et al., 03]
• “Genes” were facts in the letters:
  • Names/titles of people
  • Dates
  • Threats to those who don’t send the letter on
• Distance metric was the amount of shared information between two chain letters
• Used the Fitch/Margoliash method to construct trees
• Result: an almost perfect phylogeny. Letters that were close to one another in the tree shared similar dates, “genes” and even geographical properties.
Procedure: Phylogenetics
For each document cluster and representation, generate a phylogenetic tree using Fitch [Felsenstein, 95]
• Representations: full document, extractive summaries
• Generate the Levenshtein distance matrix
• Input the matrix into Fitch to obtain an unrooted tree
• “Reroot” the unrooted tree at the first document in the cluster
• To obtain the chronological ordering, traverse the rerooted tree
• Assign chronological ranks, starting with ‘1’ for the root
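As a sketch of the first step only (the tree itself is produced by the Fitch program cited above), the pairwise Levenshtein matrix can be computed as below; the documents are toy strings standing in for full articles or extractive summaries.

```python
# Sketch of the distance-matrix step: pairwise Levenshtein (edit) distances
# between documents. The documents here are toy placeholders.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

docs = ["plane crashed into the tower",        # placeholder documents,
        "plane smashed into the building",     # listed in publication order
        "aircraft hit the Pirelli building"]

# The square matrix of pairwise distances is what gets fed to the tree builder.
matrix = [[levenshtein(x, y) for y in docs] for x in docs]
for row in matrix:
    print(row)
```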
Unrooted tree
[Figure: unrooted tree over documents S1–S4 with two internal nodes and branch distances, drawn along a time axis t]
Rerooted tree
[Figure: the same tree rerooted at S1 (d=0); cumulative distances to S2, S3 and S4 increase along the time axis t]
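The rerooting and ranking steps can be sketched as a graph traversal, assuming a toy tree whose edge weights loosely follow the diagram above (the internal-node names n1/n2 and the exact weights are illustrative, not taken from the paper):

```python
# Sketch: treat the unrooted tree as a weighted graph, reroot at the earliest
# document (S1) and rank the others by path distance from it.
from collections import defaultdict

# Placeholder edges; n1 and n2 are internal nodes of the tree.
edges = [("S1", "n1", 3.5), ("n1", "S2", 6.5), ("n1", "n2", 8.5),
         ("n2", "S3", 0.0), ("n2", "S4", 1.0)]

graph = defaultdict(list)
for a, b, w in edges:
    graph[a].append((b, w))
    graph[b].append((a, w))

def distances_from(root):
    """Depth-first traversal accumulating path length from the chosen root."""
    dist, stack = {root: 0.0}, [root]
    while stack:
        node = stack.pop()
        for nbr, w in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + w
                stack.append(nbr)
    return dist

# Chronological ranks: 1 for the root, then by increasing distance from it.
dist = distances_from("S1")
for rank, doc in enumerate(sorted(["S1", "S2", "S3", "S4"], key=dist.get), 1):
    print(rank, doc, dist[doc])
```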
Procedure: LM Approach
• Inspiration: document ranking for IR
  • If a candidate document’s LM assigns high probability to the query, the document is considered relevant [Ponte & Croft, 98]
• Create an LM from the earliest document
  • Trigram backoff model using the CMU-Cambridge toolkit [Clarkson & Rosenfeld, 97]
• Evaluate it on the remaining documents
• Use fit to rank them: OOV rates (increasing), trigram-hit ratios (decreasing) and unigram-hit ratios (increasing)
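The fit-based ranking can be sketched in a simplified form: the paper uses a trigram backoff model from the CMU-Cambridge toolkit, while the snippet below only measures OOV rates against the earliest document’s vocabulary and ranks the later documents by increasing OOV (the toy documents are placeholders).

```python
# Simplified sketch of the LM-fit idea: rank later documents by how much of
# their vocabulary falls outside the earliest document (OOV rate, increasing).
def tokens(text):
    return text.lower().split()

earliest = "the plane crashed into the pirelli building in milan"  # placeholder
later_docs = {
    "doc_b": "the plane smashed into the pirelli building on thursday",
    "doc_c": "investigators later ruled out terrorism in the milan crash",
}

vocab = set(tokens(earliest))

def oov_rate(text):
    """Fraction of tokens not covered by the earliest document's vocabulary."""
    toks = tokens(text)
    return sum(t not in vocab for t in toks) / len(toks)

# Documents whose language drifts further from the first one get later ranks.
scored = sorted(later_docs, key=lambda name: oov_rate(later_docs[name]))
for rank, name in enumerate(scored, start=2):      # rank 1 is the earliest doc
    print(rank, name, round(oov_rate(later_docs[name]), 2))
```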
Evaluation
• Metric: Kendall’s rank-order correlation coefficient (Kendall’s τ) [Siegel & Castellan, 88]
• −1 ≤ τ ≤ 1
• Expresses the extent to which the chronological rankings assigned by the algorithm agree with the actual rankings
• Randomly assigned rankings have, on average, τ = 0
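As a quick illustration of the metric (with made-up rankings), SciPy’s kendalltau compares the algorithm’s chronological ranks against the true ones:

```python
# Sketch: evaluating a predicted chronological ordering with Kendall's tau.
from scipy.stats import kendalltau

gold_ranks      = [1, 2, 3, 4, 5]   # true chronological order (placeholder)
predicted_ranks = [1, 3, 2, 4, 5]   # ranks assigned by the algorithm (placeholder)

tau, p_value = kendalltau(gold_ranks, predicted_ranks)
print(f"Kendall's tau = {tau:.2f}")  # 1 = perfect agreement, 0 = chance, -1 = reversed
```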
Dataset
• 36 document sets
  • Manually collected (6)
  • NewsInEssence clusters (3)
  • TREC Novelty clusters (27) [Soboroff & Harman, 03]
• 15 training, 6 dev/test, 15 test
• Example topics [table not recovered]
Conclusions
• Over all clusters, the LM approach based on OOV rate had the best performance
• The LM and phylogenetic models had similar performance on the manually collected clusters, which have more salient “evolutionary” properties
Future work
• Tracking facts in multiple news stories over time
  • Produce a timeline of known facts
  • Determine whether the facts have settled at each point in time