Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure -- Martin Klein & Michael L. Nelson Old Dominion University
“Moved” but not lost • Reasons for “404” • Change in website structure • Original webpage relocated in the same website • Server/domain name issues • Original webpage captured by other websites
Rediscovering Missing Webpages • Search-based solutions • URL • Lexical Signature (LS) • Title • Social bookmarking tags • Link Neighbourhood Lexical Signature (LNLS)
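A lexical signature is commonly described as the handful of terms that best characterise a page, often chosen by TF-IDF weight. The slides do not state how the signatures were computed, so the following is only a minimal sketch under that assumption; the function names, the smoothed IDF formula, and the default of five terms are illustrative choices, not details from the talk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus_texts is a hypothetical background collection used only to
    estimate document frequencies.
    """
    tf = Counter(tokenize(page_text))
    n_docs = len(corpus_texts)
    doc_tokens = [set(tokenize(t)) for t in corpus_texts]

    scores = {}
    for term, count in tf.items():
        df = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = count * idf

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

The resulting terms would be joined with spaces and submitted to a search engine as a query for the missing page.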
Evaluation • Corpus • 500 random samples from Open Directory Project • “Pretend” to be missing • Search Engines: • Google/Yahoo/MSN • Metric • Percentage of webpages rediscovered from the top-N search results (N=1, 2-10, 11-100)
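The metric above can be sketched as a simple bucketing of the rank at which each "missing" page reappears in the search results. This is an illustration only: the bucket labels follow the slide, while the function names and the convention of using `None` for a page that was not returned are assumptions.

```python
from collections import Counter

# Bucket labels follow the N ranges named on the slide.
BUCKETS = ("top-1", "top-2-10", "top-11-100", "undiscovered")


def rank_bucket(rank):
    """Map a 1-based result rank (None if the page was not returned) to a bucket."""
    if rank is None or rank > 100:
        return "undiscovered"
    if rank == 1:
        return "top-1"
    if rank <= 10:
        return "top-2-10"
    return "top-11-100"


def rediscovery_rates(ranks):
    """Return the percentage of pages that fall into each bucket."""
    total = len(ranks)
    if total == 0:
        return {bucket: 0.0 for bucket in BUCKETS}
    counts = Counter(rank_bucket(r) for r in ranks)
    return {bucket: 100.0 * counts[bucket] / total for bucket in BUCKETS}
```

For example, `rediscovery_rates([1, 1, 3, None])` reports half the pages in the top-1 bucket and a quarter each in top-2-10 and undiscovered.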
Results • LS • Majority either rediscovered in top-10 or undiscovered • Yahoo!: 67.6% top-1, 7.5% top-2-10, 22% undiscovered • Title • Similar distribution but with more webpages rediscovered • Google: 69.3% top-1, 8.1% top-2-10, 19.7% undiscovered • Unquoted better than quoted • Tags and LNLS • Poor performance from both
Results • Combining LS and Title • Better performance than any single method • Yahoo! uniformly outperforms the rest • 76.4% top-1, 7.8% top-2-10, 13.6% undiscovered • Title analysis • Titles of 3 to 6 words are the most frequent and perform well • Further improvement by removing stopwords
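The title findings above (unquoted queries, stopword removal) and the idea of combining two methods can be sketched as follows. The slides do not say how LS and title results were merged, so taking the better of the two ranks is an assumption made for illustration; the small stopword list and all names here are likewise hypothetical.

```python
# A tiny illustrative stopword list; a real evaluation would use a fuller one.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "for", "on", "from"}
)


def title_query(title, quoted=False, drop_stopwords=True):
    """Build a search query from a page title."""
    words = title.split()
    if drop_stopwords:
        words = [w for w in words if w.lower() not in STOPWORDS]
    query = " ".join(words)
    # Quoting forces phrase matching; the slides report unquoted worked better.
    return f'"{query}"' if quoted else query


def combined_rank(title_rank, ls_rank):
    """Return the better (lower) of two ranks; None means not found by either."""
    found = [r for r in (title_rank, ls_rank) if r is not None]
    return min(found) if found else None
```

Under this reading, a page counts as rediscovered if either the title query or the LS query returns it, which is one plausible way a combination could outperform each method on its own.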
Research Insights • Common but non-trivial problem • Simple methodology • Detailed, multi-step evaluation