Synchronicity Real Time Recovery of Missing Web Pages Martin Klein mklein@cs.odu.edu Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011
Who are you again? • Ph.D. student w/ MLN since 2005 • Diagnostic exam in 2006, dissertation proposal in 2008 • 17 publications to date • Outstanding RA award CS dept • CoS dissertation fellowship • 3 ACM SIGWEB + 2 misc travel grants • CS595 (S10) & CS518 (F10)
The Problem http://www.jcdl2007.org http://www.jcdl2007.org/JCDL2007_Program.pdf
The Problem • Web users experience 404 errors • expected lifetime of a web page is 44 days [Kahle97] • 2% of web disappears every week [Fetterly03] • Are they really gone? Or just relocated? • has anybody crawled and indexed it? • do Google, Yahoo!, Bing or the IA have a copy of that page? • Information retrieval techniques needed to (re-)discover content
The Environment Web Infrastructure (WI)[McCown07] • Web search engines (Google, Yahoo!, Bing) and their caches • Web archives (Internet Archive) • Research projects (CiteSeer)
Refreshing and Migration in the WI Digital preservation happens in the WI Google Scholar CiteSeerX Internet Archive http://waybackmachine.org/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf
URI – Content Mapping Problem [Figure: timelines (panels A and B) showing the mapping between URIs (U1, U2) and content (C1, C2) over time, including a 404 case and an unknown (???) case]
Content Similarity JCDL 2005 http://www.jcdl2005.org/ July 2005 http://www.jcdl2005.org/ Today
Content Similarity Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today
Content Similarity PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html August 2003 http://www.pspcentral.org/events/archive/annual_meeting_2003.html Today
Content Similarity ECDL 1999 http://www-rocq.inria.fr/EuroDL99/ October 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html Today
Content Similarity Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ?
Lexical Signatures (LSs) • First introduced by Phelps and Wilensky [Phelps00] • Small set of terms capturing “aboutness” of a document, “lightweight” metadata Resource Abstract
Generation of Lexical Signatures • Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88] • Term frequency (TF): • “How often does this word appear in this document?” • Inverse document frequency (IDF): • “In how many documents does this word appear?” (the fewer documents, the higher the weight)
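The TF-IDF ranking above can be sketched in a few lines of Python. This is a minimal illustration, not the generation code used in the talk: the function names, the tokenizer, the raw-count TF, the `log(N / DF)` form of IDF and the three-document toy corpus are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic terms (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of document strings and must include `doc`,
    so that every term of `doc` has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    # DF: number of corpus documents that contain each term at least once.
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    # TF: raw number of occurrences of each term in the target document.
    tf = Counter(tokenize(doc))
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Toy corpus, invented for illustration only.
corpus = [
    "elephant tusks trunk elephant african",
    "african species list",
    "trunk of a tree",
]
print(lexical_signature(corpus[0], corpus, k=3))
```

Terms that are frequent in the document but rare in the corpus rise to the top, which is the "aboutness" property an LS is meant to capture.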
LS as Proposed by Phelps and Wilensky • “Robust Hyperlink” • 5 terms are suitable • Append LS to URL: http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago • Limitations: • Applications (browsers) need to be modified to exploit LSs • LSs need to be computed a priori • Works well with most URLs but not with all of them
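As a small illustration of the "append LS to URL" idea, the sketch below builds such a link with Python's standard library. The function name `robust_hyperlink` is hypothetical; only the `lexical-signature` parameter name and the example URL come from the slide.

```python
from urllib.parse import urlencode

def robust_hyperlink(url, signature_terms):
    """Append a lexical signature to a URL as a query parameter."""
    # urlencode joins the terms with "+" because spaces are encoded as "+".
    query = urlencode({"lexical-signature": " ".join(signature_terms)})
    # Use "&" if the URL already carries a query string, "?" otherwise.
    separator = "&" if "?" in url else "?"
    return url + separator + query

link = robust_hyperlink(
    "http://www.cs.berkeley.edu/~wilensky/NLP.html",
    ["texttiling", "wilensky", "disambiguation", "subtopic", "iago"],
)
print(link)
```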
Generation of Lexical Signatures • Park et al. [Park03] investigated performance of various LS generation algorithms • Evaluated “tunability” of TF and IDF component • Weight on TF increases recall (completeness) • Weight on IDF improves precision (exactness)
Synchronicity
404 error occurs while browsing
look for same or older page in WI (1)
if user satisfied: return page (2)
else: generate LS from retrieved page (3), query SEs with LS
if result sufficient: return “good enough” alternative page (4)
else: get more input about desired content (5) (link neighborhood, user input, ...)
re-generate LS && query SEs ... return pages (6)
The system may not return any results at all
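The control flow above can be sketched as a single function. This is only an outline under stated assumptions: every callable passed in (`find_copy`, `user_satisfied`, `make_ls`, `search`, `good_enough`, `more_input`) is a hypothetical stand-in for a component the slide names but does not specify, and the stubs in the usage example are invented.

```python
def recover(uri, find_copy, user_satisfied, make_ls, search, good_enough, more_input):
    """Sketch of the recovery workflow; all callables are hypothetical stand-ins."""
    copy = find_copy(uri)                  # (1) same or older page in the WI
    if copy is not None:
        if user_satisfied(copy):
            return [copy]                  # (2) the cached/archived copy suffices
        ls = make_ls(copy)                 # (3) LS from the retrieved page
        results = search(ls)               #     query the search engines with it
        if good_enough(results):
            return results                 # (4) "good enough" alternative pages
    extra = more_input(uri)                # (5) link neighborhood, user input, ...
    if extra is None:
        return []                          # nothing more to go on
    return search(make_ls(extra))          # (6) may still be an empty list

# Usage with trivial stubs, invented for illustration.
pages = recover(
    "http://example.org/missing",
    find_copy=lambda uri: "archived text about jcdl 2007 program",
    user_satisfied=lambda page: False,
    make_ls=lambda text: text.split()[:5],
    search=lambda ls: ["http://example.org/moved"] if "jcdl" in ls else [],
    good_enough=lambda results: len(results) > 0,
    more_input=lambda uri: None,
)
print(pages)
```

Passing the components in as parameters keeps the sketch independent of any particular archive or search engine API.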
Synchro…What? Synchronicity • Experience of causally unrelated events occurring together in a meaningful manner • Events reveal underlying pattern, framework bigger than any of the synchronous systems • Carl Gustav Jung (1875-1961) • “meaningful coincidence” • Deschamps – de Fontgibu plum pudding example picture from http://www.crystalinks.com/jung.html
A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)
The Problem • LSs are usually generated following the TF-IDF scheme • TF rather trivial to compute • IDF requires knowledge about: • overall size of the corpus (# of documents) • # of documents a term occurs in • Not complicated to compute for bounded corpora (such as TREC) • If the web is the corpus, values can only be estimated
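The two quantities listed above enter IDF directly. One common formulation is shown below as an illustration; the slides do not state which variant of the formula was used, so the exact form is an assumption. Here $N$ is the overall number of documents in the corpus and $\mathrm{DF}(t)$ is the number of documents in which term $t$ occurs.

```latex
\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t)}
```

For a bounded corpus both values can be counted exactly; for the web, both $N$ and $\mathrm{DF}(t)$ have to be estimated.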
The Idea • Use IDF values obtained from • Local collection of web pages • “screen scraping” SE result pages • Validate both methods through comparison to baseline • Use Google N-Grams as baseline • Note: N-Grams provide term count (TC) and not DF values – details to come
Accurate IDF Values for LSs Screen scraping the Google web interface
The Dataset Local universe consisting of copies of URLs from the IA between 1996 and 2007
The Dataset Same as above, follows Zipf distribution 10,493 observations 254,384 total terms 16,791 unique terms
The Dataset Total terms vs new terms
LSs Example Based on all 3 methods URL: http://www.perfect10wines.com Year: 2007 Union: 12 unique terms
Comparing LSs • Normalized term overlap • Assume term commutativity • k-term LSs normalized by k • Kendall Tau • Modified version since LSs to compare may contain different terms • M-Score • Penalizes discordance in higher ranks
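The first of these measures, normalized term overlap, is simple enough to sketch. The function below treats each k-term LS as an unordered set (term commutativity) and divides the number of shared terms by k. The function name and the example terms are illustrative; the modified Kendall Tau and the M-Score are not shown because the slide does not give their definitions.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two k-term LSs have in common, ignoring term order."""
    if len(ls_a) != len(ls_b) or not ls_a:
        raise ValueError("both LSs must have the same, non-zero number of terms")
    k = len(ls_a)
    # Set intersection ignores rank, which reflects the commutativity assumption.
    return len(set(ls_a) & set(ls_b)) / k

# Two illustrative 5-term LSs that share four terms.
a = ["elephant", "asian", "african", "species", "trunk"]
b = ["elephant", "african", "tusks", "asian", "trunk"]
print(normalized_overlap(a, b))
```

A score of 1 means the two LSs contain the same terms, regardless of their order; 0 means they share none.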
Comparing LSs Top 5, 10 and 15 terms LC – local universe SC – screen scraping NG – N-Grams
Conclusions • Both methods for the computation of IDF values provide accurate results • compared to the Google N-Gram baseline • Screen scraping method seems preferable since • similarity scores slightly higher • feasible in real time
Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)
The Problem • Need for a reliable source to accurately compute IDF values of web pages (in real time) • Screen scraping has been shown to work, but • missing validation of baseline (Google N-Grams) • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?
Background & Motivation • Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept • Used (among others) to generate lexical signatures (LSs) • TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus. When the entire web is the corpus, IDF can only be estimated! • Most text corpora provide term count values (TC) • D1 = “Please, Please Me” D2 = “Can’t Buy Me Love” • D3 = “All You Need Is Love” D4 = “Long, Long, Long” • TC >= DF but is there a correlation? Can we use TC to estimate DF?
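The four example documents make the difference between TC and DF easy to check by hand. The short sketch below counts both; the tokenizer (lowercasing, keeping apostrophes) is an assumption made for the example.

```python
import re
from collections import Counter

docs = [
    "Please, Please Me",      # D1
    "Can't Buy Me Love",      # D2
    "All You Need Is Love",   # D3
    "Long, Long, Long",       # D4
]

tc = Counter()  # term count: total occurrences across all documents
df = Counter()  # document frequency: number of documents containing the term
for doc in docs:
    terms = re.findall(r"[a-z']+", doc.lower())
    tc.update(terms)
    df.update(set(terms))  # each document counts at most once per term

for term in sorted(tc):
    print(term, tc[term], df[term])
```

“long” has TC 3 but DF 1, while “love” has TC 2 and DF 2: TC can never fall below DF, but the gap varies by term, which is why the correlation has to be measured.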
The Idea • Investigate relationship between: • TC and DF within the Web as Corpus (WaC) • WaC based TC and Google N-Gram based TC • TREC, BNC could be used but: • they are not free • TREC has been shown to be somewhat dated [Chiang05]
The Experiment • Analyze correlation of list of terms ordered by their TC and DF rank by computing: • Spearman's Rho • Kendall Tau • Display frequency of TC/DF ratio for all terms • Compare TC (WaC) and TC (N-Grams) frequencies
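To make the rank-correlation step concrete, here is a minimal sketch of Kendall Tau computed over paired TC and DF values. It implements the plain tau-a variant, in which tied pairs count as neither concordant nor discordant; the slides do not say which variant was used, and the TC/DF numbers are invented for illustration.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a between two equally long lists of paired values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both values order the pair the same way
        elif sign < 0:
            discordant += 1   # the two values order the pair differently
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical TC and DF values for four terms.
term_tc = [120, 45, 30, 8]
term_df = [60, 40, 12, 7]
print(kendall_tau(term_tc, term_df))
```

A value near 1 means that ordering terms by TC gives almost the same ranking as ordering them by DF.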
Experiment Results • Investigate correlation between TC and DF • within “Web as Corpus” (WaC) • Rank similarity of all terms
Experiment Results • Investigate correlation between TC and DF • within “Web as Corpus” (WaC) • Spearman’s ρ and Kendall τ
Experiment Results Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr U = 14 ∩ = 6 Strong indicator that TC can be used to estimate DF for web pages! Google: screen scraping DF values from the Google web interface
Experiment Results Frequency of TC/DF Ratio Within the WaC Two Decimals One Decimal Integer Values
Experiment Results • Show similarity between WaC based TC and • Google N-Gram based TC • TC frequencies N-Grams have a threshold of 200
Conclusions • TC and DF Ranks within the WaC show strong correlation • TC frequencies of WaC and Google N-Grams are very similar • Together with results shown earlier (high correlation between baseline and two other methods) N-Grams seem suitable for accurate IDF estimation for web pages • Does not mean everything correlated to TC can be used as DF substitute!
Inter-Search Engine Lexical Signature Performance http://en.wikipedia.org/wiki/Elephant • Elephant, Tusks, Trunk, African, Loxodonta • Elephant, Asian, African, Species, Trunk • Elephant, African, Tusks, Asian, Trunk
Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)
How to Evaluate the Evolution of LSs over Time Idea: • Conduct overlap analysis of LSs generated over time • LSs based on local universe mentioned above • Neither Phelps and Wilensky nor Park et al. did that • Park et al. just re-confirmed their findings after 6 months