Introductory Survey of Internet Search Services


Presentation Transcript


  1. Introductory Survey of Internet Search Services. Michael Hunter, Reference Librarian, Hobart and William Smith Colleges. For Rochester Regional Library Council Member Libraries’ Staff. Sponsored by the Rochester Regional Library Council. Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library, 2000.

  2. What is a search engine? • A searchable database of resources extracted from the Internet by computer-generated search and retrieval processes. • Updated frequently • Search features vary among engines • Results of searches are ranked for “relevance” as predicted by automated logical algorithms.

  3. Search Engines and Subject Directories in 2000: Genres in Flux • What types are available today? • Automatic: “pure” crawler, crawler “plus”, specialized crawler • Human-compiled: peer-reviewed

  4. Search Engines in 2000 • “Pure” crawler-based: Google, Fast • Crawler “plus”: • Subject directory (HB, Lycos, Excite, AV, Infoseek, WebCrawler) • Special collection (NL) • Pre-programmed answers - Ask Jeeves (AV)

  5. Search Engines in 2000 • Specialized (chiefly crawler-based) • SearchEdu.com • Specialized (crawler/human-compiled) • Scicentral.com metasite • Peer-reviewed • Hippias - Philosophy http://hippias.evansville.edu/ • Argos - Classics and ancient history http://argos.evansville.edu/

  6. How large is a search engine? • Typical personal computer - 64 MB RAM • “General” search engines - 4,000 MB RAM (and more) • Database - 1,000 GB of storage

  7. [Diagram: multiple crawlers (CR) downloading pages from web servers (WS) and feeding them into a central database]

  8. [Diagram: many simultaneous users (User 1 through User 7) querying the search engine’s database]

  9. [Diagram repeated: crawlers (CR) harvesting pages from web servers (WS) into the database]

  10. Crawling the Web: The Big Picture ... • Crawlers • download a page • extract links to other web pages • index words from the page • “crawl” the extracted links and • continue the cycle
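The cycle just described can be sketched in a few lines of code. The following is only an illustrative toy (the seed URL, page limit, and helper names are invented for the example), not the implementation of any actual search service:

```python
# Toy sketch of the crawl cycle: download a page, index its words,
# extract its links, then crawl those links in turn.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])   # links waiting to be visited
    seen = {seed_url}
    index = {}                  # URL -> words found on that page
    while queue and len(index) < max_pages:
        url = queue.popleft()
        page = urlopen(url).read().decode("utf-8", errors="ignore")  # download a page
        index[url] = page.split()                                    # index words from the page
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:                                    # extract links to other pages
            target = urljoin(url, link)
            if target not in seen:
                seen.add(target)
                queue.append(target)                                 # continue the cycle
    return index
```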

  11. Crawling the Web: The Detailed View ... • While downloading one page the crawlers simultaneously … • check for the next page to download in the “queue” • check for any “robots exclusion” files that prohibit downloading of pages from a web server • download the whole page • extract all links from the page and add them to the “queue”

  12. Crawling the WebThe Detailed View ... • Index contents (extract all words and save them to a database associated with the page’s URL; also save the order of the words to allow for phrase searching) • Optionally filter for adult content, language of document, other criteria • Save (or make) summary of the page • Record the date downloaded for future reference in scheduling re-visits to the site
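A rough sketch of a few of these details follows: honouring a server’s robots exclusion file, saving each word’s position so phrase searches can be answered later, keeping a short summary, and recording the download date for scheduling re-visits. This is a simplified teaching example, not production crawler code.

```python
import datetime
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

def allowed_by_robots(url, agent="ExampleCrawler"):
    """Check the server's robots exclusion file before downloading the page."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(agent, url)

def index_page(url, database):
    if not allowed_by_robots(url):
        return                                   # page is off-limits to crawlers
    words = urlopen(url).read().decode("utf-8", errors="ignore").split()
    positions = {}
    for place, word in enumerate(words):
        # Saving word order (positions) is what makes phrase searching possible.
        positions.setdefault(word.lower(), []).append(place)
    database[url] = {
        "positions": positions,
        "summary": " ".join(words[:25]),                      # crude page summary
        "downloaded": datetime.date.today().isoformat(),      # for scheduling re-visits
    }
```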

  13. Scale? • One page at a time? • Covering the Internet would take several years • Instead … • Thousands of pages are processed simultaneously by multiple crawlers (Google has ca. 4,000)
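The difference is easy to picture with a thread pool that fetches many pages at once instead of one after another. The URLs and worker count below are placeholders; real services spread thousands of crawlers across many machines.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Download one page; each worker thread handles one URL at a time."""
    return url, len(urlopen(url, timeout=10).read())

urls = ["http://example.com/page%d" % n for n in range(100)]  # placeholder URLs

# Fifty downloads proceed simultaneously instead of one after another.
with ThreadPoolExecutor(max_workers=50) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size, "bytes")
```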

  14. Performance? • What about maintenance down-time? • Services have duplicate machines so no interruptions occur during maintenance • Why are interface changes so rare? • Updating software on complex systems is expensive • Usually slows service down, or stops it completely

  15. Performance? • If I execute the same search in the same engine several times in succession I get different results. Why? • The query is run against multiple machines in parallel • Ranking may be performed on a limited subset of the hits (i.e., those returned first) rather than the entire set of results.

  16. Why do search engines exist? • To make money!!! • Advertising • Banner ads • Allied services • Pay-for-placement in search results • Many other commercial endeavors

  17. In pursuit of user loyalty . . . • Advertisers want “stickiness”, i.e., users that return often and at length • “Stickiness” drives design • Portalization – “One-stop access for all your Internet needs” • Speed • Freshness • Relevance of results • Value-added search features such as customization (My Yahoo, etc.)

  18. How Search Engines Differ . . . • Content • Update frequency (“freshness”) • Ways you can search • Ways results are presented to you

  19. Breadth of Content • How much of the “geographic” Internet is searched and to what degree? • What types of files are included? • Web sites • Usenet News • Software • Image/Video/Audio • Multimedia • FTP

  20. Depth of Content • How much of a given site has been downloaded? • URL? • Title? • First heading? • First 200 words? • Full text? • Full text and some of the documents linked to? • Full text and all of the documents linked to? • Full text and documents that are linking to this one?

  21. Update frequency • When was the content last refreshed or rebuilt from direct searching of the Internet?

  22. Ways you can search • Boolean operators • Requiring, combining or excluding words or phrases • Searching for a phrase • Searching by word stem (truncation) • Searching by location in the document (field searching) • Searching by date • Searching by media • Searching by language
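A few illustrative queries show how these features are typically expressed. The exact operators vary from engine to engine, so the syntax below is only indicative:
• plate AND tectonics NOT atlantic (Boolean operators)
• +recycling -plastics (requiring and excluding words)
• “plate tectonics” (phrase searching)
• librar* (truncation: library, libraries, librarian)
• title:“search engines” (field searching restricted to the document title)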

  23. Ways results are presented to you: Relevance Prediction • Based on: • Text on the page • Factors external to the page

  24. Relevance Prediction: Text on the page • Based on • Word frequency profiles • “More like this” • “Suggested similar sites” • Relational clustering • Northern Light’s “Custom Folders”
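As a toy illustration of text-on-the-page ranking, the sketch below scores pages by how densely the query’s words occur on them. It is a drastically simplified stand-in for the word-frequency profiles mentioned above, not any engine’s actual formula; the example pages are invented.

```python
def text_score(query, page_words):
    """Score a page by how often the query's words appear, relative to page length."""
    wanted = [w.lower() for w in query.split()]
    words = [w.lower() for w in page_words]
    return sum(words.count(w) for w in wanted) / (len(words) or 1)

# Invented example pages: the one that mentions the query words more often ranks first.
pages = {
    "site-a": "plate tectonics and plate boundaries in northern California".split(),
    "site-b": "a general introduction to geology and earth science".split(),
}
for url in sorted(pages, key=lambda u: text_score("plate tectonics", pages[u]), reverse=True):
    print(url, round(text_score("plate tectonics", pages[url]), 3))
```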

  25. Relevance Prediction: Problems with text-on-the-page ranking • Designed for “text-heavy” pages; “design-heavy” pages may be ranked lower as a result • No added weight possible for evaluated, rated or reviewed sites • Ill-suited for a web that grows so rapidly

  26. Relevance Prediction: Factors external to the page • Link popularity • Sites with more links pointing to them ranked higher • Click popularity • Sites visited more often and longer ranked higher (Direct Hit’s knowledge base of users’ click paths) • “Sector” popularity • Tracking demographic or social groups’ click paths
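Link popularity can be illustrated with a toy calculation: count how many other pages link to each page and rank the most-linked-to pages higher. The link graph below is invented, and real engines weight links in far more elaborate ways.

```python
from collections import Counter

# Invented link graph: each page lists the pages it links out to.
links_from = {
    "page-a": ["page-c", "page-d"],
    "page-b": ["page-c"],
    "page-d": ["page-c", "page-a"],
}

# Link popularity: pages with more links pointing to them are ranked higher.
inbound = Counter(target for targets in links_from.values() for target in targets)
for page, count in inbound.most_common():
    print(page, "is linked to by", count, "other page(s)")
```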

  27. Relevance Prediction: Factors external to the page • Pre-packaged human-generated questions with answers (Ask Jeeves) • Business alliances among services • Editorial partnerships • Pay-for-placement options (GoTo)

  28. Relevance Prediction: Factors external to the page • Advantages • Helps focus and limit results for popular, common queries • Human-generated criteria improve quality of results • Disadvantages • Increases the “invisible layer” between the searcher and the results • “How did I get these results?” • “Who is controlling the search process?” • Privacy issues around tracking users’ click paths

  29. What is a subject directory? • A human-generated listing of resources usually classified and hierarchically arranged by subject category, often containing descriptions of the resources included.

  30. Subject Directories • Ways directories differ from search engines • Sites are examined and cataloged by a human being • Descriptions of the sites are often included • Generally fewer ways of searching • Generally not updated as frequently

  31. Types of Subject Directories • “My favorite links” • Personal homepages • Subject-focused sites with “related links” • The Cervantes Home Page • Subject-focused metasites • Scicentral http://sciquest.com • Sections of the WWW Virtual Library http://wwwvl.org • General “comprehensive” directories • Yahoo, Snap, Excite

  32. Important aspects of subject directories • Authorship/sponsorship • Intended audience • Update frequency

  33. How are users faring? NPD User Study, April 2000 • 40,000 respondents chosen randomly • October-November, 1999 • Conducted by NPD New Media Services “on behalf of 13 major search services” • Summary at http://searchenginewatch.com/reports/npd.html • See http://www.npd.com for more information

  34. Search Engine or Subject Directory—Which one do I use? • Portalization has blurred the distinctions; however --- • Use a search engine for • Narrowly defined topics “Plate tectonics in northern California” • Up-to-date news and research • Occurrences of a name or phrase

  35. Search Engine or Subject Directory—Which one do I use? • Use a subject directory for • Broadly defined topics “geophysical research” • Subject-specific gateways or “vortals” • websites • discussion groups • media files • “A few good sites” • General browsing

  36. Improving search strategy • Little overlap in coverage among engines (Greg Notess at http://searchengineshowdown.com) • Even the largest ones cover no more than 20 – 25% of the Internet • Therefore use 2 or more engines you know and trust to ensure a wider range of results

  37. Improving search strategy • Know the advanced features of your favorite engine(s) and use them. • Use unique identifiers or keywords • Use phrase searching when possible • Restrict search to title or other fields • Incorporate date searching when available • Use the “Find in page” function to locate your search term(s) quickly

  38. What NO search engine covers . . . • Dynamic Web content • Created through user interaction • File extensions include *.asp, *.php, *.jsp • PDF files (see Adobe’s new engine for these at http://searchpdf.adobe.com) • Pages requiring a login • Wireless content • WAP (Wireless Application Protocol) engine available at FAST (http://alltheweb.com)

  39. Once you have a list of hits ask yourself . . . • How might the domain type influence the content of this site? • Do I trust the author/creator? Why or why not? • How might the organization responsible influence the content?

  40. Once you have a list of hits ask yourself . . . • Is the date of publication critical or important in this case? • Is the intended audience appropriate for this information need?

  41. Search is . . . • Intriguing • Frustrating • Exciting • Maddening • Gratifying

  42. The Internet is . . . • Vast • Constantly changing • Uncataloged • Of wildly varying quality
