
Jahna Otterbacher, Dragomir Radev Computational Linguistics And Information Retrieval (CLAIR)

Modeling Document Dynamics: An Evolutionary Approach. Jahna Otterbacher, Dragomir Radev, Computational Linguistics And Information Retrieval (CLAIR), {jahna, radev}@umich.edu


Presentation Transcript


  1. Modeling Document Dynamics: An Evolutionary Approach Jahna Otterbacher, Dragomir Radev Computational Linguistics And Information Retrieval (CLAIR) {jahna, radev} @ umich.edu

  2. What are dynamic texts? • Sets of topically related documents (news stories, Web pages, etc.) • Multiple sources • Written/published at different points in time – may change over time • Challenging features: • Paraphrases • Contradictions • Incorrect/biased information

  3. Milan plane crash: April 18, 2002 • 04/18/02 13:17 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday. • 04/18/02 13:42 (ABCNews) The plane was destined for Italy’s capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria. • 04/18/02 13:42 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday. • 04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan’s Linate airport, De Simone said.

  4. Problem for IR systems • User poses a question or query to a system • Known facts change at different points in time • Sources contradict one another • Many paraphrases: similar, but not necessarily equivalent, information • What is the “correct” information? What should be returned to the user?

  5. Current Goals • Propose that dynamic texts “evolve” over time • Chronology recovery task • Approaches • Phylogenetics: reconstruct history of a set of species based on DNA • Language modeling: LM constructed from first document should fit less well over time

  6. Phylogenetic models • [Fitch & Margoliash, 67] • Given a set of species and information about their DNA, construct a tree that describes how they are related, w.r.t. a common ancestor • Statistically optimal tree minimizes the deviation between the original distances and those represented in the tree • [Figure: distance matrix and candidate tree for bear, dog, and wolf]
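The "deviation" on this slide can be made concrete with a small sketch. It assumes the weighted least-squares form usually associated with the Fitch/Margoliash method, where each squared difference is divided by the squared observed distance; the slide itself does not state the weighting, and the function name and `power` parameter are hypothetical. Off-diagonal observed distances are assumed to be positive.

```python
def fitch_margoliash_error(observed, tree_implied, power=2):
    """Weighted squared deviation between two symmetric distance matrices.

    observed[i][j]     : original pairwise distance (assumed > 0 for i != j)
    tree_implied[i][j] : path length between i and j in a candidate tree
    power              : exponent on the observed distance in the weight
                         (2 is an assumption about the usual weighting)
    """
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = observed[i][j] - tree_implied[i][j]
            total += diff ** 2 / observed[i][j] ** power
    return total
```

Among candidate trees, the one with the smallest value would be preferred under this criterion; a tree that reproduces every observed distance exactly scores 0.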

  7. Phylogenetic models (2) • History of chain letters [Bennett & al., 03] • “Genes” were facts in the letters: • Names/titles of people • Dates • Threats to those who don’t send the letter on • Distance metric was the amount of shared information between two chain letters • Used Fitch/Margoliash method to construct trees • Result: An almost perfect phylogeny. Letters that were close to one another in the tree shared similar dates, “genes” and even geographical properties.

  8. Procedure: Phylogenetics • For each document cluster and representation, generate a phylogenetic tree using Fitch [Felsenstein, 95] • Representations: full document, extractive summaries • Generate the Levenshtein distance matrix • Input matrix into Fitch to obtain unrooted tree • “Reroot” the unrooted tree at the first document in the cluster. • To obtain the chronological ordering, traverse the rerooted tree. • Assign chronological ranks, starting with ‘1’ for the root.
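The distance-matrix step can be sketched as follows. This is a minimal illustration, assuming word-level tokens split on whitespace; the slides do not say whether the Levenshtein distance was computed over characters or words, and the function names are hypothetical. The resulting matrix would then be written out in whatever input format the tree-building program expects.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(docs):
    """Symmetric matrix of pairwise word-level edit distances."""
    tokens = [doc.split() for doc in docs]  # assumption: whitespace tokens
    n = len(tokens)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(tokens[i], tokens[j])
    return matrix
```

Passing strings instead of token lists to `levenshtein` gives the character-level variant without further changes.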

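  9. Unrooted tree • [Figure: unrooted tree over documents S1 to S4 with internal nodes 1 and 2; labels: node 1 (d=0), S1 (d=3.5), S2 (d=6.5), S3 (d=0), node 2 (d=8.5), S4 (d=1); documents S1 to S4 shown along a time axis t]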

  10. Rerooted tree • [Figure: the same tree rerooted at S1; labels: S1 (d=0), node 1 (d=3.5), S2 (d=10), node 2 (d=12), S3 (d=12), S4 (d=13); documents S1 to S4 shown along a time axis t]
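The rerooting step can be illustrated with the numbers from these two slides. The sketch below assumes that the d values on the unrooted tree are branch lengths, that node 2 hangs off node 1, and that chronological rank follows cumulative distance from the new root; the slides only say to "traverse" the rerooted tree, so the ranking rule is an assumption. The names `n1` and `n2` stand for the internal nodes labelled 1 and 2.

```python
from collections import defaultdict

# Branch lengths as read off the unrooted-tree slide (an interpretation).
EDGES = [("n1", "S1", 3.5), ("n1", "S2", 6.5), ("n1", "n2", 8.5),
         ("n2", "S3", 0.0), ("n2", "S4", 1.0)]


def root_distances(edges, root):
    """Path length from `root` to every node of an undirected tree."""
    adjacency = defaultdict(list)
    for u, v, length in edges:
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))
    dist = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, length in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist


def chronological_ranks(edges, root, documents):
    """Rank documents by distance from the root, starting at 1 for the root."""
    dist = root_distances(edges, root)
    ordered = sorted(documents, key=lambda doc: dist[doc])
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}
```

With `root="S1"`, the cumulative distances under these assumptions match the labels on the rerooted-tree slide (3.5, 10, 12, 12, 13), which supports reading the unrooted d values as branch lengths.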

  11. Procedure: LM Approach • Inspiration: document ranking for IR • If a candidate document’s LM assigns high probability to the query, the document is considered relevant [Ponte & Croft, 98] • Create LM from earliest document • Trigram backoff model using CMU-Cambridge toolkit [Clarkson & Rosenfeld, 97] • Evaluate it on remaining documents • Use fit to rank them: OOV rates (increasing), trigram-hit ratios (decreasing) and unigram-hit ratios (increasing)
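The OOV-based ranking can be sketched without a full trigram model. The code below is a simplified stand-in, not the CMU-Cambridge toolkit: it takes the vocabulary of the earliest document, measures the fraction of out-of-vocabulary tokens in each later document, and orders the documents by increasing OOV rate. Lowercasing and whitespace tokenization are assumptions, and all function names are hypothetical.

```python
def tokenize(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def oov_rate(reference_tokens, candidate_tokens):
    """Fraction of candidate tokens that never occur in the reference."""
    vocabulary = set(reference_tokens)
    if not candidate_tokens:
        return 0.0
    unseen = sum(1 for token in candidate_tokens if token not in vocabulary)
    return unseen / len(candidate_tokens)


def rank_by_oov(first_document, later_documents):
    """Indices of later documents, ordered by increasing OOV rate."""
    reference = tokenize(first_document)
    scored = [(oov_rate(reference, tokenize(doc)), index)
              for index, doc in enumerate(later_documents)]
    scored.sort()  # ties fall back to the original index
    return [index for _, index in scored]
```

The intuition from the slide is that documents written later drift further from the first one, so a lower OOV rate is taken as a sign of an earlier document.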

  12. Evaluation • Metric: Kendall’s rank-order correlation coefficient (Kendall’s τ) [Siegel & Castellan, 88] • -1 ≤ τ ≤ 1 • Expresses extent to which the chronological rankings assigned by the algorithm agree with the actual rankings • Randomly assigned rankings have, on average, τ = 0.
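A minimal sketch of the metric, assuming the simple pair-counting form of Kendall's τ with no correction for ties (the slides do not say which variant was used): count the document pairs that the two rankings order the same way, subtract the pairs they order differently, and divide by the total number of pairs.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau between two rankings of the same items (no tie correction)."""
    if len(predicted) != len(actual):
        raise ValueError("rankings must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give 1, fully reversed rankings give -1, and swapping one adjacent pair among four documents gives (5 - 1) / 6, about 0.67.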

  13. Dataset • 36 document sets • Manually collected (6) • NewsInEssence clusters (3) • TREC Novelty clusters (27) [Soboroff & Harman, 03] • 15 training, 6 dev/test, 15 test • Example topics

  14. Training Phase

  15. Training Phase (2)

  16. Test Phase (15 clusters)

  17. Manual Clusters

  18. Conclusions • Over all clusters, LM approach based on OOV had best performance • LM and phylogenetic models had similar performance on manual clusters • These clusters have more salient “evolutionary” properties

  19. Future work • Tracking facts in multiple news stories over time • Produce a timeline of known facts • Determine if the facts have settled at each time
