
Synchronicity Real Time Recovery of Missing Web Pages Martin Klein mklein@cs.odu.edu Introduction to Digital Libraries W

Synchronicity Real Time Recovery of Missing Web Pages Martin Klein mklein@cs.odu.edu Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011. Who are you again?. Ph.D. student w/ MLN since 2005 Diagnostic exam in 2006, dissertation proposal in 2008


Presentation Transcript


  1. Synchronicity Real Time Recovery of Missing Web Pages Martin Klein mklein@cs.odu.edu Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011

  2. Who are you again? • Ph.D. student w/ MLN since 2005 • Diagnostic exam in 2006, dissertation proposal in 2008 • 17 publications to date • Outstanding RA award CS dept • CoS dissertation fellowship • 3 ACM SIGWEB + 2 misc travel grants • CS595 (S10) & CS518 (F10)

  3. The Problem http://www.jcdl2007.org http://www.jcdl2007.org/JCDL2007_Program.pdf

  4. The Problem • Web users experience 404 errors • expected lifetime of a web page is 44 days [Kahle97] • 2% of web disappears every week [Fetterly03] • Are they really gone? Or just relocated? • has anybody crawled and indexed it? • do Google, Yahoo!, Bing or the IA have a copy of that page? • Information retrieval techniques needed to (re-)discover content

  5. The Environment Web Infrastructure (WI)[McCown07] • Web search engines (Google, Yahoo!, Bing) and their caches • Web archives (Internet Archive) • Research projects (CiteSeer)

  6. Refreshing and Migration in the WI Digital preservation happens in the WI Google Scholar CiteSeerX Internet Archive http://waybackmachine.org/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf

  7. URI – Content Mapping Problem [Figure: timelines A and B of URI-to-content mappings over time, with labels U1, U2, C1, C2, 404 and ???]

  8. Content Similarity JCDL 2005 http://www.jcdl2005.org/ July 2005 http://www.jcdl2005.org/ Today

  9. Content Similarity Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today

  10. Content Similarity PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html August 2003 http://www.pspcentral.org/events/archive/annual_meeting_2003.html Today

  11. Content Similarity ECDL 1999 http://www-rocq.inria.fr/EuroDL99/ October 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html Today

  12. Content Similarity Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ?

  13. Lexical Signatures (LSs) • First introduced by Phelps and Wilensky [Phelps00] • Small set of terms capturing “aboutness” of a document, “lightweight” metadata Resource Abstract

  14. Generation of Lexical Signatures • Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88] • Term frequency (TF): • “How often does this word appear in this document?” • Inverse document frequency (IDF): • “In how many documents does this word appear?”
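The TF and IDF questions on this slide can be turned into a small ranking routine. The following is only a minimal sketch: the tokenizer, the function names, and the add-one smoothing in the IDF term are assumptions made for illustration and are not taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score."""
    # TF: how often does each term appear in this document?
    tf = Counter(tokenize(document))
    # DF: in how many corpus documents does each term appear?
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    n_docs = len(corpus)
    scores = {}
    for term, freq in tf.items():
        # Add-one smoothing (an assumption) avoids division by zero
        # for terms that never occur in the corpus.
        idf = math.log((n_docs + 1) / (df[term] + 1))
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Terms that appear in every corpus document receive a score of zero here, which is why very common words drop out of the signature.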

  15. LS as Proposed by Phelps and Wilensky • “Robust Hyperlink” • 5 terms are suitable • Append LS to URL http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago • Limitations: • Applications (browsers) need to be modified to exploit LSs • LSs need to be computed a priori • Works well with most URLs but not with all of them
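Appending a signature to a URL in the form shown on this slide can be sketched with the Python standard library. The function names are hypothetical; only the `lexical-signature` parameter name and the `+` separators come from the example URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def make_robust_hyperlink(url, terms):
    """Append the LS terms to `url` as a lexical-signature query parameter."""
    parts = urlsplit(url)
    # urlencode turns the spaces between terms into '+' characters.
    extra = urlencode({"lexical-signature": " ".join(terms)})
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit(parts._replace(query=query))


def read_lexical_signature(url):
    """Recover the list of LS terms from a robust hyperlink, if present."""
    values = parse_qs(urlsplit(url).query).get("lexical-signature", [])
    return values[0].split() if values else []
```

A client that understands the parameter could call `read_lexical_signature` after a 404 and use the terms as a search query; a client that does not simply ignores the extra query string.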

  16. Generation of Lexical Signatures • Park et al. [Park03] investigated performance of various LS generation algorithms • Evaluated “tunability” of TF and IDF component • Weight on TF increases recall (completeness) • Weight on IDF improves precision (exactness)

  17. Lexical Signatures -- Examples

  18. Synchronicity
      404 error occurs while browsing
      look for same or older page in WI (1)
      if user satisfied
          return page (2)
      else
          generate LS from retrieved page (3)
          query SEs with LS
          if result sufficient
              return “good enough” alternative page (4)
          else
              get more input about desired content (5) (link neighborhood, user input, ...)
              re-generate LS && query SEs
              ...
              return pages (6)
      The system may not return any results at all
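The control flow on this slide can be sketched as a single function. This is an illustration under stated assumptions, not the actual Synchronicity implementation: every callable passed in (`find_archived_copy`, `user_accepts`, `make_signature`, `search`, `gather_more_input`) is a hypothetical stand-in, and "result sufficient" is simplified to "the result list is not empty".

```python
def recover_page(url, find_archived_copy, user_accepts,
                 make_signature, search, gather_more_input):
    """Sketch of the recovery loop; every callable is a hypothetical stand-in."""
    # (1) Look for the same or an older copy of the page in the WI.
    copy = find_archived_copy(url)
    if copy is not None:
        # (2) If the user is satisfied with the copy, we are done.
        if user_accepts(copy):
            return [copy]
        # (3) Otherwise derive an LS from the copy and query search engines.
        results = search(make_signature(copy))
        # (4) A non-empty result list counts as "good enough" in this sketch.
        if results:
            return results
    # (5) Gather more input (link neighborhood, user input, ...).
    extra_text = gather_more_input(url)
    if not extra_text:
        return []  # the system may not return any results at all
    # (6) Re-generate the LS, query again, and return whatever comes back.
    return search(make_signature(extra_text))
```

Passing the steps in as callables keeps the sketch independent of any particular archive or search engine API.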

  19. Synchro…What? Synchronicity • Experience of causally unrelated events occurring together in a meaningful manner • Events reveal underlying pattern, framework bigger than any of the synchronous systems • Carl Gustav Jung (1875-1961) • “meaningful coincidence” • Deschamps – de Fontgibu plum pudding example • picture from http://www.crystalinks.com/jung.html

  20. 404 Errors

  21. 404 Errors

  22. “Soft 404” Errors

  23. “Soft 404” Errors

  24. A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)

  25. The Problem • LSs are usually generated following the TF-IDF scheme • TF rather trivial to compute • IDF requires knowledge about: • overall size of the corpus (# of documents) • # of documents a term occurs in • Not complicated to compute for bounded corpora (such as TREC) • If the web is the corpus, values can only be estimated
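The two quantities this slide says IDF depends on can be written out explicitly. The slides do not give an exact formula, so the following is one common textbook form, stated as an assumption: $N$ is the number of documents in the corpus, $\mathrm{DF}(t)$ the number of documents containing term $t$, and $\mathrm{TF}(t,d)$ the frequency of $t$ in document $d$.

```latex
\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t)},
\qquad
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```

For a bounded corpus both $N$ and $\mathrm{DF}(t)$ can be counted directly; when the web is the corpus, both have to be replaced by estimates.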

  26. The Idea • Use IDF values obtained from • Local collection of web pages • “screen scraping” SE result pages • Validate both methods through comparison to baseline • Use Google N-Grams as baseline • Note: N-Grams provide term count (TC) and not DF values – details to come

  27. Accurate IDF Values for LSs Screen scraping the Google web interface

  28. The Dataset Local universe consisting of copies of URLs from the IA between 1996 and 2007

  29. The Dataset Same as above, follows Zipf distribution 10,493 observations 254,384 total terms 16,791 unique terms

  30. The Dataset Total terms vs new terms

  31. LSs Example Based on all 3 methods URL: http://www.perfect10wines.com Year: 2007 Union: 12 unique terms

  32. Comparing LSs • Normalized term overlap • Assume term commutativity • k-term LSs normalized by k • Kendall Tau • Modified version since LSs to compare may contain different terms • M-Score • Penalizes discordance in higher ranks
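The first measure on this slide, normalized term overlap, is simple enough to sketch directly. The function name is an assumption, as is the choice to normalize by the longer signature when the two differ in length; the modified Kendall Tau and the M-Score are not reproduced here because the slides do not define them.

```python
def normalized_overlap(ls_a, ls_b):
    """Fraction of shared terms between two LSs, ignoring term order."""
    # Term commutativity: order does not matter, so compare as sets.
    shared = set(ls_a) & set(ls_b)
    # Normalize by k, the signature length (the longer one if they differ).
    k = max(len(ls_a), len(ls_b))
    return len(shared) / k if k else 0.0
```

A score of 1.0 means the two signatures contain exactly the same terms, and 0.0 means they share none.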

  33. Comparing LSs Top 5, 10 and 15 terms LC – local universe SC – screen scraping NG – N-Grams

  34. Conclusions • Both methods for the computation of IDF values provide accurate results • compared to the Google N-Gram baseline • Screen scraping method seems preferable since • similarity scores slightly higher • feasible in real time

  35. Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)

  36. The Problem • Need for a reliable source to accurately compute IDF values of web pages (in real time) • Screen scraping has been shown to work, but • validation of the baseline (Google N-Grams) is missing • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?

  37. Background & Motivation • Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept • Used (among others) to generate lexical signatures (LSs) • TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus • When the entire web is the corpus, IDF can only be estimated! • Most text corpora provide term count values (TC) • D1 = “Please, Please Me” D2 = “Can’t Buy Me Love” • D3 = “All You Need Is Love” D4 = “Long, Long, Long” • TC >= DF but is there a correlation? Can we use TC to estimate DF?
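The difference between TC and DF can be checked on the four example titles from this slide. This is a small illustrative sketch; the tokenizer and function name are assumptions.

```python
import re
from collections import Counter

DOCS = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}


def term_and_document_counts(docs):
    """Return (TC, DF): total occurrences and number of documents per term."""
    tc, df = Counter(), Counter()
    for text in docs.values():
        terms = re.findall(r"[a-z']+", text.lower())
        tc.update(terms)       # every occurrence counts toward TC
        df.update(set(terms))  # each document counts once toward DF
    return tc, df


tc, df = term_and_document_counts(DOCS)
```

Here "please" and "long" have a TC larger than their DF because they repeat within a single title, while "me" and "love" have equal TC and DF because each occurs once in two different titles.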

  38. The Idea • Investigate relationship between: • TC and DF within the Web as Corpus (WaC) • WaC based TC and Google N-Gram based TC • TREC, BNC could be used but: • they are not free • TREC has been shown to be somewhat dated [Chiang05]

  39. The Experiment • Analyze correlation of list of terms ordered by their TC and DF rank by computing: • Spearman's Rho • Kendall Tau • Display frequency of TC/DF ratio for all terms • Compare TC (WaC) and TC (N-Grams) frequencies
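Both rank correlation measures named on this slide can be sketched in plain Python. The sketch assumes the two lists contain the same terms, at least two of them, and no tied ranks; it uses the standard no-ties formulas rather than whatever variant the experiment used.

```python
from itertools import combinations


def rank_positions(ordering):
    """Map each term to its 1-based position in the ordered list."""
    return {term: i for i, term in enumerate(ordering, start=1)}


def spearman_rho(order_a, order_b):
    """Spearman's rho for two rankings of the same terms (no ties)."""
    ra, rb = rank_positions(order_a), rank_positions(order_b)
    n = len(ra)
    d_squared = sum((ra[t] - rb[t]) ** 2 for t in ra)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(order_a, order_b):
    """Kendall tau for two rankings of the same terms (no ties)."""
    ra, rb = rank_positions(order_a), rank_positions(order_b)
    concordant = discordant = 0
    for s, t in combinations(order_a, 2):
        # A pair is concordant if both rankings order it the same way.
        agreement = (ra[s] - ra[t]) * (rb[s] - rb[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n = len(order_a)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For the experiment described here, `order_a` would be the terms sorted by TC and `order_b` the same terms sorted by DF; values near 1 indicate that the two orderings largely agree.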

  40. Experiment Results • Investigate correlation between TC and DF • within “Web as Corpus” (WaC) • Rank similarity of all terms

  41. Experiment Results • Investigate correlation between TC and DF • within “Web as Corpus” (WaC) • Spearman’s ρ and Kendall τ

  42. Experiment Results Top 10 terms in decreasing order of their TF-IDF values taken from http://ecir09.irit.fr • ∪ = 14, ∩ = 6 • Strong indicator that TC can be used to estimate DF for web pages! • Google: screen scraping DF values from the Google web interface

  43. Experiment Results Frequency of TC/DF Ratio Within the WaC Two Decimals One Decimal Integer Values

  44. Experiment Results • Show similarity between WaC based TC and • Google N-Gram based TC • TC frequencies N-Grams have a threshold of 200

  45. Conclusions • TC and DF ranks within the WaC show strong correlation • TC frequencies of WaC and Google N-Grams are very similar • Together with results shown earlier (high correlation between baseline and two other methods) N-Grams seem suitable for accurate IDF estimation for web pages • Does not mean everything correlated to TC can be used as DF substitute!

  46. Inter-Search Engine Lexical Signature Performance (JCDL 2009)

  47. Inter-Search Engine Lexical Signature Performance http://en.wikipedia.org/wiki/Elephant • Elephant, Tusks, Trunk, African, Loxodonta • Elephant, Asian, African, Species, Trunk • Elephant, African, Tusks, Asian, Trunk

  48. Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)

  49. How to Evaluate the Evolution of LSs over Time Idea: • Conduct overlap analysis of LSs generated over time • LSs based on local universe mentioned above • Neither Phelps and Wilensky nor Park et al. did that • Park et al. just re-confirmed their findings after 6 months
