Dynamic Reference Sifting. A Case Study in the Homepage Domain. Jonathan Shakes, Marc Langheinrich, and Oren Etzioni University of Washington Department of Computer Science and Engineering. Outline. Introduction Softbots and Dynamic Reference Sifters Searching the Web
Dynamic Reference Sifting: A Case Study in the Homepage Domain Jonathan Shakes, Marc Langheinrich, and Oren Etzioni University of Washington Department of Computer Science and Engineering
Outline • Introduction • Softbots and Dynamic Reference Sifters • Searching the Web • Case Study: Personal Homepages • Ahoy! The Homepage Finder • Experimental Results • Future and Related Work • Other Domains for DRS Introduction - Outline
Softbots and Dynamic Reference Sifters • Dynamic Reference Sifters • Part of “Internet Softbots Project” [Etzioni and Weld, 1994] • Softbots • person states what • softbot determines how and where Introduction - Softbots & DRS
Information Retrieval Definitions • Precision • Measure of Search Service Accuracy • Recall • Measure of Search Service Comprehensiveness Introduction - IR Definitions
Precision • Precision = Relevant Search Results / All Search Results • [Figure labels: Search Space, Irrelevant Documents, Relevant Documents, All Search Results] Introduction - IR Definitions
Recall • Recall = Relevant Search Results / All Relevant Documents • [Figure labels: Search Space, Irrelevant Documents, Relevant Documents, All Search Results] Introduction - IR Definitions
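The two ratios above can be illustrated with a small sketch. It treats search results and relevant documents as sets of identifiers; the function names and the toy data are assumptions made for illustration and do not come from the slides.

```python
def precision(results: set, relevant: set) -> float:
    """Relevant search results divided by all search results."""
    if not results:
        return 0.0  # convention chosen here: no results means precision 0
    return len(results & relevant) / len(results)


def recall(results: set, relevant: set) -> float:
    """Relevant search results divided by all relevant documents."""
    if not relevant:
        return 0.0  # convention chosen here: nothing to find means recall 0
    return len(results & relevant) / len(relevant)


# Hypothetical toy data: four relevant documents, three returned references.
relevant_docs = {"a", "b", "c", "d"}
search_results = {"a", "b", "x"}

print(precision(search_results, relevant_docs))
print(recall(search_results, relevant_docs))
```

In this toy case two of the three returned references are relevant (precision 2/3), and two of the four relevant documents are found (recall 1/2).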
Searching the Web • Web Indices (AltaVista, Hotbot) • Automated - high recall • Keyword based - low precision • Web Directories (Yahoo, A2Z) • Classified manually - high precision - low recall • Manual Search • slow Introduction - Searching the Web
Searching the Web • Dynamic Reference Sifter: an information retrieval tool that uses • multiple, complementary data sources for high recall, • domain-specific filtering techniques for high precision, and • machine learning to improve performance over time. Introduction - Searching the Web
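The first two ingredients listed above (several complementary sources, then domain-specific filtering) might be sketched as follows. This is a minimal illustration, not Ahoy!'s actual implementation: the `sift` function, the two toy sources, the `looks_personal` filter, and all URLs are hypothetical, and the learning component is left out.

```python
from typing import Callable, Iterable, List

Source = Callable[[str], Iterable[str]]  # query -> candidate reference URLs
Filter = Callable[[str, str], bool]      # (query, url) -> keep this URL?


def sift(query: str, sources: List[Source], filters: List[Filter]) -> List[str]:
    """Pool candidates from every source, then keep those passing all filters."""
    seen = set()
    kept = []
    for source in sources:
        for url in source(query):
            if url in seen:
                continue  # the same reference may come from several sources
            seen.add(url)
            if all(f(query, url) for f in filters):
                kept.append(url)
    return kept


# Hypothetical stand-ins for a web index and a web directory.
def index_source(query: str) -> List[str]:
    return ["http://example.edu/~jdoe/", "http://example.com/news/doe.html"]


def directory_source(query: str) -> List[str]:
    return ["http://example.edu/~jdoe/", "http://example.org/people/jdoe/"]


# Hypothetical domain-specific filter: keep URLs shaped like personal pages.
def looks_personal(query: str, url: str) -> bool:
    return "~" in url or "/people/" in url


hits = sift("J Doe", [index_source, directory_source], [looks_personal])
print(hits)
```

Pooling sources aims at recall, while the filters aim at precision; here the news article is dropped and the duplicate reference is reported once.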
Case Study: The Personal Homepage Domain • “Conventional” Search Services • Indices find too much • Directories find too little • Manual Search takes too long • Failures are expensive • Ahoy! The Homepage Finder attempts to provide • High Recall • High Precision • Speed Case Study - Overview
Ahoy! Architecture • [Figure labels: User Input, Web Page Reference Source, Institutional Information Source, E-mail Address Sources, Filters, Output] Case Study - Ahoy! Architecture
Performance Analysis • Test using lists of known homepages • Researchers sample: 582 homepages • Transportation sample: 53 homepages • Compare against • MetaCrawler, Hotbot, AltaVista, Yahoo! • Maximize competitors’ performance by • using “expert” options • allowing up to 200 references Case Study - Performance Analysis
Performance Analysis • “Precision” - Researcher Sample Case Study - Performance Analysis
Performance Analysis • Top 10 References - Researcher Sample Case Study - Performance Analysis
Performance Analysis • Recall (all References) - Researcher Sample Case Study - Performance Analysis
Performance Analysis • Recall (all References) - Transportation Sample Case Study - Performance Analysis
Learning in Ahoy! • Learns URL ‘patterns’ • http://sdcc3.ucsd.edu/home-pages/<Login>/ • 50,000+ patterns in 3 months • Indexes patterns by institution • 11,000+ institutions indexed in 3 months • Performance Impact • Up to 8% gain in recall Case Study - Learning in Ahoy!
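One way to picture learning URL patterns and indexing them by institution is sketched below. The slides do not describe the mechanism, so everything here is an assumption: the helper names, the use of the last two host labels as an institution key, and the logins `jsmith` and `mlee` are all hypothetical. Only the `<Login>` placeholder form follows the example on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set
from urllib.parse import urlparse

PLACEHOLDER = "<Login>"


def institution_key(url: str) -> str:
    """Hypothetical institution key: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def learn_pattern(url: str, login: str) -> Optional[str]:
    """Generalize a known homepage URL by replacing the login in its path."""
    parts = urlparse(url)
    if login not in parts.path:
        return None  # nothing to generalize
    path = parts.path.replace(login, PLACEHOLDER)
    return f"{parts.scheme}://{parts.netloc}{path}"


def remember(patterns: Dict[str, Set[str]], url: str, login: str) -> None:
    """Store the learned pattern under the URL's institution."""
    pattern = learn_pattern(url, login)
    if pattern is not None:
        patterns[institution_key(url)].add(pattern)


def candidate_urls(patterns: Dict[str, Set[str]], institution: str, login: str) -> List[str]:
    """Instantiate every pattern known for an institution with a new login."""
    known = patterns.get(institution, set())
    return [p.replace(PLACEHOLDER, login) for p in sorted(known)]


patterns: Dict[str, Set[str]] = defaultdict(set)
remember(patterns, "http://sdcc3.ucsd.edu/home-pages/jsmith/", "jsmith")
print(sorted(patterns["ucsd.edu"]))
print(candidate_urls(patterns, "ucsd.edu", "mlee"))
```

A candidate URL produced this way is only a guess and would still need to be checked, but it gives the sifter something to try when the other sources return no reference for a person at a known institution.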
Domain Characteristics • Many elements • Easily identifiable target • Some targets found in web indices • User can form specific query Future Work - Domain characteristics
Domain Examples • Personal Homepages • Articles or Papers • Product Reviews • Price Lists • Transportation Schedules • Recipes • Jokes • and more Future Work - Domain examples
(un)Related Work • Automated Index Generation • WebCrawler, Lycos, AltaVista, ... • Automated Directory Generation • IAF, OKRA, WhoWhere? • Dynamic Internet Search • Netfind • Learning User Preferences on web • WebWatcher, Syskill & Webert, Firefly • Learning about the web • ShopBot, auto-generated wrappers Future Work - (un)Related Work
Summary and Conclusions • Dynamic Reference Sifting • domain-specific, high precision, high recall, fast • Ahoy! the Homepage Finder • 2000 searches per day • 1-2 references returned per search • 50-75% targets found • 25% not found, often correctly so • 10-15 seconds per search • Future domains • Academic Papers, Jokes Summary & Conclusions
Ahoy! the Homepage Finder http://www.cs.washington.edu/research/ahoy/ Ahoy! The Homepage Finder