Choosing and Using the Best Metas: Hyper-searching the Web • Michael Hunter, Reference Librarian, Hobart and William Smith Colleges • For Rochester Regional Library Council Member Libraries’ Staff • Sponsored by the Rochester Regional Library Council • Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library • 2002
For Today … • Metas: History and Functions • Search and Retrieval Issues • Major Players in 2003 • Clustering Technology • More Good Metas • Web Search Agents • Evaluating Metasearch Services
Metasearch defined . . . • Group of search engines, subject directories and/or databases made searchable through a common interface. • Results may or may not follow the original source’s rankings • Today our focus is free metaengines using subject directories (Yahoo, LII, OD) and crawler-based engines as sources (Google, FAST, Teoma) • We will NOT examine specialized or Deep Web metas
A GOOD Meta will . . . • Re-format queries to be compatible with the search syntax of each source • Enable searchers to use advanced features (when the sources support them) • Indicate overlapping results without repeating them • Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc. • Use only sources with unique databases
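As an illustration of the first point, here is a minimal Python sketch of re-formatting one source-neutral query for two different syntaxes. The `Query` class, the `reformat` function and the two syntax profiles (`plus_minus`, `boolean_words`) are hypothetical names invented for this example; they do not describe any particular engine, whose actual syntax would have to be checked source by source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Query:
    """Source-neutral query: required terms, an optional phrase, excluded terms."""
    terms: List[str]
    phrase: Optional[str] = None
    exclude: List[str] = field(default_factory=list)


# Hypothetical syntax profiles (assumptions for illustration only).
SYNTAX = {
    "plus_minus": {"and": "+{t}", "not": "-{t}", "phrase": '+"{p}"', "joiner": " "},
    "boolean_words": {"and": "{t}", "not": "NOT {t}", "phrase": '"{p}"', "joiner": " AND "},
}


def reformat(query: Query, style: str) -> str:
    """Render the neutral query in the syntax profile named by `style`."""
    rules = SYNTAX[style]
    parts = [rules["and"].format(t=t) for t in query.terms]
    if query.phrase:
        parts.append(rules["phrase"].format(p=query.phrase))
    parts += [rules["not"].format(t=t) for t in query.exclude]
    return rules["joiner"].join(parts)


q = Query(terms=["art"], phrase="kids of survival", exclude=["painting"])
print(reformat(q, "plus_minus"))     # +art +"kids of survival" -painting
print(reformat(q, "boolean_words"))  # art AND "kids of survival" AND NOT painting
```

The searcher types the query once; the meta keeps one profile per source and renders the query separately for each.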
The beginnings of metasearch • A conceptual descendant of Veronica • March 1995 –Harvest (later Savvysearch, now Search.com) developed at Colorado State by Daniel Dreilinger • July 1995 – Metacrawler developed at U. of Washington by Selberg and Etzioni • “Metacrawler Architecture for Resource Aggregation on the Web” 1996
The beginnings of metasearch • 1996 - Dogpile • 1998 - Ixquick • 1999 - Kartoo • 2000 - Ithaki • 2001 - Vivisimo
More facts about metas • “Flavor” determined by choice of sources • Comprehensive • Vivisimo, Ixquick, Metacrawler • General Lifestyle, popular culture • Dogpile, Profusion • Commercial • Search.com, Excite@home
Metas and retrieval • Metas search quickly but not deeply • Search time or a quantity of searches is purchased from sources (typically the top 10-50 hits from each) • Metas are subject to time-out limits from their sources • Usually, NOT every source is searched for every query
Metas and retrieval • “Dumbing Down the Query” • Advanced features are often unavailable; when offered, they are limited to those shared among the sources • Default setting for time-out is the shortest; set to maximum for more comprehensive searches (when available) • For most metas, advertising is the only source of revenue; software sales are rare
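The “dumbing down” effect can be pictured as a set intersection: a feature is safe to offer only if every selected source supports it. The sketch below assumes a hand-made feature table; the source names and feature sets are invented for illustration.

```python
# Hypothetical feature table (assumption): which query features each source accepts.
SOURCE_FEATURES = {
    "source_a": {"phrase", "and", "or", "not", "title_field"},
    "source_b": {"phrase", "and", "or"},
    "source_c": {"phrase", "and", "not"},
}


def shared_features(selected):
    """Return the features supported by every selected source."""
    sets = [SOURCE_FEATURES[name] for name in selected]
    return set.intersection(*sets) if sets else set()


print(sorted(shared_features(["source_a", "source_b", "source_c"])))  # ['and', 'phrase']
print(sorted(shared_features(["source_a", "source_b"])))              # ['and', 'or', 'phrase']
```

Dropping a limited source from the selection widens the set of usable features, which is one reason to choose sources deliberately when a meta allows it.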
Metas and retrieval • What is their place in my search strategy? • Metas best used for simple searches, with little (or no) syntactic complexity • Use them to find the top few sites on a topic • For a quick overview of a topic’s coverage on the Web in general • Use them “as a last resort” for highly focused topics that elude your usual search tools • As a possible indication of coverage of a topic among several engines (NOTE: problematic)
Searching the metas • Results depend on • Choice of sources • Query processing speed OF THE SOURCE • Length of time spent at each source
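A rough way to see how these three factors interact is a small simulation. The sketch below makes no network calls: each source is described by an assumed response time and a ranked hit list, and the function `gather`, the source names and the numbers are all hypothetical.

```python
def gather(sources, timeout, per_source_limit=10):
    """Collect hits from every simulated source that answers within `timeout` seconds.

    sources: dict mapping source name -> (latency_seconds, ranked list of hits)
    """
    collected = {}
    for name, (latency, hits) in sources.items():
        if latency <= timeout:  # the source answered before the time-out
            collected[name] = hits[:per_source_limit]  # keep only the top hits
    return collected


# Simulated sources (assumed latencies and hits, for illustration only).
SOURCES = {
    "fast_source": (0.5, ["a", "b", "c"]),
    "slow_source": (4.0, ["c", "d"]),
}

print(gather(SOURCES, timeout=1.0))  # only fast_source contributes
print(gather(SOURCES, timeout=5.0))  # both sources contribute
```

With a short time-out the slow source drops out entirely, so the same query can return a different result set depending on the time-out setting and on how busy each source is.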
A search comparison . . . • Searched heterotropia (abnormal binocular vision) on 4/21/03 • Vivisimo: 77 (shortest time-out), 126 (longest) • Ixquick: 37 “from at least 450 results” • Profusion: 30 (shortest), 39 (longest) • Metacrawler: 42 (shortest), 61 (longest) • Webcrawler: 31 (shortest), 80 (longest) • Dogpile: 29 (no time-out option) • Excite: 41 (shortest), 31 (longest)
Stability of Results • Searched “kids of survival” (modern art group) as a phrase at 3-minute intervals on 4/21/03 (time-outs at default setting)
Metas and ranking options • Listing by SOURCE • Usually retains ranking of source • COMBINED Listing options • Indicate source of each result • Indicate duplicates without repeating them • Indicate position in original source’s ranking • “Most duplicated hits” listed first • Disclose paid listings (if disclosed by source)
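The combined-listing options above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `combine` is hypothetical, URLs are compared as exact strings (a real meta would need to normalize them first), and ties are broken by best original rank, which is a choice made for this example.

```python
def combine(results_by_source):
    """Merge ranked lists from several sources into one combined listing.

    results_by_source: dict mapping source name -> ranked list of URLs.
    Each URL appears once, with the sources that returned it and its rank
    in each; URLs found by the most sources are listed first.
    """
    merged = {}
    for source, urls in results_by_source.items():
        for rank, url in enumerate(urls, start=1):
            # setdefault keeps the best (first) rank if a source repeats a URL
            merged.setdefault(url, {}).setdefault(source, rank)
    rows = [{"url": url, "found_in": ranks} for url, ranks in merged.items()]
    # Most-duplicated first, then best original rank, then URL for stability.
    rows.sort(key=lambda r: (-len(r["found_in"]), min(r["found_in"].values()), r["url"]))
    return rows


results = {
    "A": ["x.example", "y.example", "z.example"],
    "B": ["y.example", "w.example"],
    "C": ["y.example", "x.example"],
}
for row in combine(results):
    print(row["url"], row["found_in"])
```

Here `y.example` is listed first because all three sources returned it, and each row still shows the source and the position in that source's own ranking.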
Vivisimo • http://vivisimo.com • Sources: Altavista, Yahoo, MSN, Netscape, Lycos, LookSmart, Gigablast, Vizzavi, BBC, Librarian’s Index to the Internet plus 11 specialized news sources and 7 specialized business, medical and governmental sources • Offers full Boolean and phrase search (if supported by the source)
Vivisimo • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”) • Results combined • Source for each result given • Ranking data from that source given • Duplicates noted, but not repeated
Vivisimo • Other features: • Results are clustered by keyword prevalence or website of origin • Offers a preview of each result in a separate window • Offers vertical searches: Top News, Business News, Tech News, Sports News
Clustering results (“folders”) • Automated “subject analysis” • Facilitates navigation and query refinement • Can be hierarchical (folders within folders) • One document may appear in several folders • Northern Light first public search engine to make use of folders
Clustering technology in a metasearch environment • Real-time processing of results retrieved from sources • Variety of data can be returned from each source • URL • Title • First few sentences • Human-created summary • Folder creation varies according to data from sources and processing time available at the moment of the query
Clustering -- Step 1 • Significant terms are identified from all results based on • Frequency of term(s) • Position of term(s) • Normalization algorithms applied • Documents analyzed for word variants (stemming) • Norms set (“authority control”) “game downloads” “download games” “downloading games” • Folder “labels” created
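To make the normalization idea concrete, here is a toy Python sketch in which the three phrases from the slide collapse to one norm. It is not any vendor's algorithm: `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer, word order is discarded by sorting, and `folder_labels` with its `min_count` threshold is a hypothetical stand-in for the frequency test.

```python
from collections import Counter


def crude_stem(word):
    """Naive stemmer (assumption): strip a trailing 'ing' or 's' from longer words."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def norm(phrase):
    """Map word variants and word orders to one canonical form."""
    return " ".join(sorted(crude_stem(w) for w in phrase.split()))


def folder_labels(candidate_phrases, min_count=2):
    """Keep the norms that occur at least `min_count` times as folder labels."""
    counts = Counter(norm(p) for p in candidate_phrases)
    return [label for label, n in counts.most_common() if n >= min_count]


phrases = ["game downloads", "download games", "downloading games", "chess rules"]
print({p: norm(p) for p in phrases})
print(folder_labels(phrases))  # ['download game']
```

A production system would pick a readable display form for each norm (for example the most frequent original phrase) rather than showing the stemmed key.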
Clustering – Step 2 • Each result from the sources is matched against the set of folder labels and assigned to one or more folders • By linguistic analysis (term position, predictive descriptive importance) • By statistical analysis (term frequency) • Final, proprietary analysis combines these (and more) • Remember: The full documents are not available to a meta for this type of processing
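The assignment step can be sketched in the same toy style. The scoring below is an assumption made for illustration, not the proprietary analysis the slide mentions: a result joins a folder when every stem of the label occurs in its title or snippet, title hits count double (a crude stand-in for term position), and repeated occurrences raise the score (term frequency). Only the title and snippet are used, since the full documents are not available to a meta.

```python
import re


def crude_stem(word):
    """Naive stemmer (assumption): strip a trailing 'ing' or 's' from longer words."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Tokenize on letters and digits, then stem each token."""
    return [crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def assign_to_folders(results, labels):
    """Place each result in every folder whose label stems all occur in it."""
    folders = {label: [] for label in labels}
    for r in results:
        title, body = stems(r["title"]), stems(r["snippet"])
        for label in labels:
            wanted = label.split()
            if all(w in title or w in body for w in wanted):
                score = sum(2 * title.count(w) + body.count(w) for w in wanted)
                folders[label].append((score, r["url"]))
    # Within each folder, list the highest-scoring results first.
    return {
        label: [url for _, url in sorted(hits, key=lambda h: (-h[0], h[1]))]
        for label, hits in folders.items()
    }


results = [
    {"url": "a.example", "title": "Download games free", "snippet": "Downloading games for your PC"},
    {"url": "b.example", "title": "Chess rules", "snippet": "Game rules and downloads"},
    {"url": "c.example", "title": "Gardening tips", "snippet": "Soil and seeds"},
]
print(assign_to_folders(results, ["download game", "rule"]))
```

In this example `b.example` lands in both folders and `c.example` in none, so a real service would also need a catch-all folder for unassigned results.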
Profusion • http://profusion.com • Sources: Altavista, Yahoo, MSN, About.com, Adobe PDF, AOL, LookSmart, Lycos, Netscape, Raging Search, Teoma, WiseNut • Offers full Boolean and phrase search (if supported by the source)
Profusion • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”) • Offers option of results listed by source or combined listing • Source for each result given • Ranking data from that source given • Duplicates noted, but not repeated
Profusion • Other features: • Results can be sorted by relevance score, title or URL • “Similar Result” enhancement • Profusion Relevance Score shown • Search terms highlighted in results listing • “Set Search Alert” feature stores searches and alerts user to page changes; requires setting up a (free) account • Search Analysis available • Offers vertical searches: Deep Web content in 21 broad categories; News
Ixquick • http://ixquick.com • Sources: Altavista, Netscape, Gigablast, Adobe PDF, Avaya PDF, AskJeeves, Teoma, Go, Open Directory, Overture, Kanoodle, LookSmart, WiseNut, FindWhat, Yahoo, MSN • Offers full Boolean and phrase search (if supported by the source) • Offers the following customizations: • Selection of sources searched • Length of search (“time-out period”)
Ixquick • Results combined • Source for each result given • Ranking data from that source given • Duplicates noted, but not repeated
Ixquick • Other features: • Offers 7 field searches (when supported by sources) • Clusters hits from same site • Highlights search terms in each hit • Offers “Related Searches” • Offers vertical searches: MP3, News, Pictures
iBoogie • http://iboogie.com • Sources: Altavista, Yahoo, MSN, FAST, FindWhat, Teoma, WiseNut, OpenFind • Boolean and phrase search somewhat unreliable • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”)
iBoogie • Results combined • Source for each result given • Duplicates noted, but not repeated • Other features: • Adult content filter (when supported by source) • Language limit (when supported by source) • Clusters results by keyword and/or website • Offers “Similar Pages” enhancement • Offers vertical searches: Newspapers, Bookstores, Reference, Shopping
Metacrawler • http://metacrawler.com • Sources: FAST, Google, About.com, AskJeeves, FindWhat, LookSmart, Inktomi (?), Open Directory, Overture, Search Hippo, Sprinks, Teoma • Offers Boolean “and”, “or” (no “not”) and phrase search (if supported by the source) • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”)
Metacrawler • Offers option of results listed by source or combined listing • Source for each result given • Duplicates noted, but not repeated • Other features: • Offers Related Searches • “More like this” results enhancement • Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards
Dogpile • http://dogpile.com • Sources: Google, FAST, About.com, Ah-ha, AskJeeves, FindWhat, LookSmart, Open Directory, Search Hippo, Sprinks, Overture, Inktomi (?) • Offers Boolean “and”, “or” (no “not”) and phrase search (if supported by the source) • Offers the following customization: • Selection of sources searched
Dogpile • Results listed ONLY by source • Source for each result given • Other features: • Offers Related Searches • Offers a wide range of vertical searches, similar to Metacrawler: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards
Web Search Agents, aka desktop client search programs • Software must be purchased • Queries a fixed set of engines, directories, news and other databases • Sites that review and feature search agents • Searchenginewatch.com • Searchengineshowdown.com • www.botspot.com • www.agentland.com
Web Search Agents: typical features • Queries are re-formulated to follow syntax of source databases • Duplicates removed • Additional ranking performed • Source given • Optional sort orders • Optional grouping of results into “folders” • Many output options (html, word processor, xml, e-mail and more)
Web Search Agents: different from other metas? • Differences from the (good) free metas • Many more sources queried • Several output options • Update option (re-running the search at specified intervals) • Customizable search parameters
Web Search Agents • BullsEye Pro 3.0 $199 • BullsEye Plus $49.99 • Covers 1000+ sources • Removes dead links • Multiple language capability • Government and News search groups • Customization of sources available for an additional fee • All other “typical features” • Available at intelliseek.com
Web Search Agents • Copernic Pro 5.02 $79.95 • Copernic 2001 Plus $39.95 • Copernic Plus Basic Free • Pro version covers 1000+ sources • Removes dead links • Post-search refinement and processing of retrieved results • Automatic document summarizations (requires more software) • All other “typical features” • Available at www.copernic.com
Ultrabar: choosing your own sources • Free download • Searches a small set of pre-selected engines and allows more to be added, including Deep Web resources • Offers search term highlighting • Does not re-formulate queries for each source • No output options • Available at ultrabar.com
Evaluating metasearch services • What are the sources for the results? • Good general search engines and high-quality directories? Shopping engines? Do any sources share the same database? • What search features are offered? • Remember, these are only in effect for the sources that support them. • What results-based enhancements are offered? • Clustering? “More like this”? Highlighting of search terms? “Related Searches”?
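One rough, do-it-yourself check for the question of whether sources share a database is to run the same query against two sources and measure how much their top results overlap. The sketch below uses the Jaccard ratio (shared results divided by all distinct results); the function name `overlap` and the sample lists are hypothetical, and a high score is only a hint, since two independent engines can also agree on a popular topic.

```python
def overlap(hits_a, hits_b):
    """Jaccard overlap of two result lists: 0.0 = disjoint, 1.0 = identical sets."""
    a, b = set(hits_a), set(hits_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


source_1 = ["x.example", "y.example", "z.example"]
source_2 = ["y.example", "z.example", "w.example"]
print(overlap(source_1, source_2))  # 0.5
```

Repeating the comparison over several unrelated queries gives a steadier picture than a single search.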
Evaluating metasearch services • What factors determine the ranking of results? • Is there any processing of results after retrieval from the sources? • Is the source and/or ranking in that source given for each hit? • Can the user expand the number of sources searched and/or the search time?
Evaluating metasearch services • Use your own test-drive questions and compare with results from other meta-engines and good single engines and directories. • Search for questions in specialized subject areas you are familiar with (tests database depth). • Search for very recent topics (tests database freshness)
Evaluating metasearch services • Check its popularity through an independent rating or popularity monitoring service • Media Metrix http://www.mediametrix.com/ • The oldest user-based rating service on the Web: lists top 50 most visited sites. • PC Data Online http://www.pcdataonline.com/reports/ • Check for information at the site • About, FAQ, Contact Us
A GOOD meta will . . . • Re-format queries to be compatible with the search syntax of each source • Enable searchers to use advanced features (when the sources support them) • Indicate overlapping results without repeating them • Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc. • Use only sources with unique databases
In conclusion . . . • How do metas fit into my search strategy? • Metas best used for simple searches, with little (or no) syntactic complexity • Use them to find the top few sites on a topic • For a quick overview of a topic’s coverage on the Web in general • Use them “as a last resort” for highly focused topics that elude your usual search tools • As a possible indication of coverage of a topic among several engines (NOTE: problematic) • Other uses??