
ILS 501 Unit 3 Searching Issues


Presentation Transcript


  1. ILS 501 Unit 3: Searching Issues. ILS 501 / Dr. Liu, ILS SCSU

  2. The Rise of Search (Pew Internet & American Life Project) • Search is the 2nd most popular online activity, after email. • The percentage of net users who search on a typical day grew 70% from 2002 to 2009.

  3. The Rise of Search

  4. What is a search engine? • A program that searches documents for specified keywords and returns a list of the documents where the keywords were found. • Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document.
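
The indexer step described on this slide can be sketched as a small inverted index. This is a minimal illustration, not how any particular engine is built: the three documents, the page IDs, and the function names `build_index` and `search` are made up for the example, and the "spider" is replaced by an in-memory dictionary.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Indexer: map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical documents standing in for pages fetched by a spider.
docs = {
    "page1": "Dolphins are caught by tuna fishermen",
    "page2": "Tuna recipes for the home cook",
    "page3": "Porpoise and dolphin watching tours",
}
index = build_index(docs)
print(search(index, "tuna"))
print(search(index, "tuna fishermen"))
```

Note that this sketch matches whole words only, so a search for "dolphin" does not find the page that says "Dolphins".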

  5. What is a web search engine? A search engine is a huge database of web page files that has been assembled automatically by machine.

  6. What does a search engine do? It uses software indexers (spiders or "robots") to "crawl" around the Web and build indexes based on what they find in available Web pages.

  7. How Do Search Engines Work? • Crawling: A ‘spider’ or ‘robot’ explores your site, following links from page to page. • Indexing: Data from the crawl is stored in the search engine index. The stored copy is referred to as the ‘cached page’. • Ranking: The Search Engine algorithm looks at a variety of factors (over 200) to determine the importance of a web page and where it should rank for any given keyword phrase.

  8. How do search engines work? • Crawler-Based Search Engines • They "crawl" or "spider" the web, then people search through what they have found. • Human-Powered Directories • Hybrid Search Engines (Source: http://www.searchenginewatch.com)

  9. Web site to explain PageRank [Figure: link graph with nodes a1, b1, b2, b3, b4, c1, d1, d2, e1, e2]

  10. PageRank - Motivation • The number of incoming links to a page is a measure of the importance and authority of the page. • PageRank also takes into account the quality of each recommendation, so a page is more important if the sources of its incoming links are important.
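
Both points can be seen in a short iterative computation. The sketch below is a simplified, textbook-style PageRank, not Google's production algorithm; the damping factor of 0.85 is the conventionally cited value, and the link graph (pages `a`, `b`, `c`, `d`, `hub`, `x`, `y`) is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "x" and "y" each have exactly one incoming link,
# but x's link comes from "hub", which itself has three incoming links.
links = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"],
    "hub": ["x"],
    "d": ["y"],
    "x": [], "y": [],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph `hub` outranks `d` because it has more incoming links, and `x` outranks `y` even though each has a single incoming link, because the page recommending `x` is itself more important.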

  11. Expanding the Root Set

  12. PageRank

  13. Three elements of a Crawler-Based Search Engine • The spider (crawler). The spider visits a web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or two, to look for changes. • The index. It is like a catalog containing a copy of every web page that the spider finds. If a web page changes, then this catalog is updated with new information. • Search engine software. It is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what is most relevant. (Source: http://www.searchenginewatch.com)

  14. A search engine is an index compiler. Search engines compile their databases by employing "spiders" or "robots" to crawl through web space from link to link, identifying pages. Once the spiders get to a web site, they typically index most of the words on the publicly available pages at the site.

  15. Two Search Methods • The Searchable Subject Index: searches titles and meta tags, e.g. Yahoo • The Full-Text Search Engine: uses a spider to index not only titles but also page content, e.g. Google

  16. What are the top 10 search providers in 2009? • They are …..? • Ranked by Nielsen MegaView Search: • Top 10 Search Providers for August 2009, Ranked by Searches (U.S.)

  17. How many types of search engines exist? Three common types: • Directory – subject search • Individual search engine – keyword search • Metasearch engine – one search sent across multiple engines

  18. DIRECTORY by Subjects • Galaxy • GoGuides • LookSmart • NexTag • OpenDirectory • Yahoo* • Zeal

  19. INDIVIDUAL SEARCH ENGINES by Keywords • AllTheWeb • AltaVista • Entireweb • Google • WiseNut • HotBot • Lycos • Yahoo • NexTag • Overture

  20. What is a metasearch engine? • It does not crawl the web to compile its own searchable database. Instead, it searches the databases of multiple individual search engines simultaneously. • It provides a quick way of finding out which engines retrieve the best results for your search.

  21. What Are "Meta-Search" Engines? How Do They Work? • “In a meta-search engine, you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you get back results from all the search engines queried. Meta-search engines do not own a database of Web pages; they send your search terms to the databases maintained by search engine companies.” • From http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
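
The forwarding-and-merging behavior described in the quotation can be sketched as follows. Everything here is an assumption made for illustration: the three "engines" are stand-in functions returning fixed lists instead of real services, they are queried one after another instead of simultaneously, and the merge rule (sum of reciprocal ranks) is just one simple choice, since the slides do not say how results are combined.

```python
# Hypothetical stand-ins for individual search engines. Each returns its own
# ranked list of results for the query; a real metasearch engine would send
# the query over the network instead.
def engine_a(query):
    return ["site1.example", "site2.example", "site3.example"]


def engine_b(query):
    return ["site2.example", "site4.example"]


def engine_c(query):
    return ["site2.example", "site1.example"]


def metasearch(query, engines):
    """Send the query to every engine and merge the ranked lists.

    A result earns 1/position from each engine that returns it, so results
    ranked highly by several engines float to the top.
    """
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


merged = metasearch("dolphin tuna", [engine_a, engine_b, engine_c])
print(merged)
```

The metasearch function holds no index of its own; it only combines what the individual engines return.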

  22. Better Meta-Searchers (UC Berkeley - Teaching Library Internet Workshops)

  23. Meta-Search Engines for SERIOUS Deep Digging (UC Berkeley - Teaching Library Internet Workshops)

  24. METASEARCH ENGINES • Dogpile • Clusty • Ixquick • Mamma • MetaCrawler • Metor • Profusion • qbSearch • Surfwax • Vivisimo

  25. Free search engine for your site? • For your website • Freefind • Atomz • For your desktop • Google Desktop 5 (search engine) • Microsoft desktop search engine • Copernic Desktop Search Professional 3.1 • Everything

  26. Atomz Search: Add site search to your site in minutes. • Create an account. • Crawl your site. • Add a search box to your site.

  27. http://www.freefind.com/ • Add a site search engine to your website • Easy to install: • Enter your website address • Enter your email address • Click the button. You're done!

  28. http://www.freefind.com/

  29. Why so many search engines? • Because of different ….

  30. Why so many search engines – Different Coverage • They vary in coverage. In fact, coverage is far from complete: even the largest search engine provides access to only a minor portion of the web.

  31. Why so many search engines – Different Search Capabilities • They have different tools and capabilities. Some have NEAR as an operator, some can search by different parameters, and so forth.

  32. Why so many search engines – Different Spiders or Crawlers They use different spiders or crawlers to index the web. The spiders go out at different intervals, crawl to different depths (only the first page, the first three pages, or perhaps all pages), and differ in indexing techniques.

  33. Why so many search engines – Different Ways of Ranking They differ in how they rank items for display after the items are retrieved. Most rank on the basis of how many times the terms you search for are found or where they are found (more weight to higher placement) in the target websites.
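
A toy version of this kind of ranking is sketched below. It counts how often the query terms occur and gives extra weight to occurrences in the title, as a stand-in for "higher placement". The weight of 3, the sample pages, and the function names are assumptions chosen for the example; real engines use their own formulas.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query, title_weight=3):
    """Count query-term occurrences; a hit in the title counts title_weight times."""
    title_words = tokenize(page["title"])
    body_words = tokenize(page["body"])
    total = 0
    for term in tokenize(query):
        total += title_weight * title_words.count(term)
        total += body_words.count(term)
    return total


def rank(pages, query):
    """Return the titles of matching pages, highest score first."""
    scored = [(score(page, query), page["title"]) for page in pages]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]


# Hypothetical pages.
pages = [
    {"title": "Wild horses", "body": "A mustang is a free-roaming horse."},
    {"title": "Mustang buying guide",
     "body": "The Mustang is a classic car. Every Mustang owner agrees."},
    {"title": "Tuna recipes", "body": "Grill the tuna."},
]
print(rank(pages, "mustang"))
```

The buying guide scores 5 (one title hit weighted 3, plus two body hits) and the horse page scores 1, so the guide is listed first; the tuna page does not match and is left out.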

  34. Why so many search engines – Different Ranking Protocols They differ in ranking protocols. Google, for example, uses an algorithm that ranks output on the basis of the number of other websites that link to the websites your search retrieves.

  35. Portal definitions …. A portal is a Web site that is commonly used as a gateway to other Web sites. (Source: http://www.searchenginewatch.com)

  36. What is a Portal? • A portal is a client-server application (including web-based interface pages, related Java applets, configuration files, and Perl and C-CGI scripts) for use on an organization’s web server. • It is a set of support materials for target community members. • It is designed to facilitate substantive communication between members in the community(ies).

  37. Boolean Search? • Boolean search is named after the 19th-century mathematician George Boole, who developed theories for working with sets of information. • Boolean search allows you to specify the relationships among your keywords and phrases.

  38. What are the Boolean search commands? • AND • OR • NOT • NEAR • NESTING

  39. Boolean AND command The Boolean AND command is used to require that all search terms be present on the web pages listed in results. Example command: cats AND dogs

  40. Boolean OR command The Boolean OR command is used to allow any of the specified search terms to be present on the web pages listed in results. Example command: house OR home

  41. Boolean NOT command The Boolean NOT command is used to require that a particular search term NOT be present on web pages listed in results. Examples: cats NOT dogs; canine NOT dog

  42. Be careful using the NOT Boolean operator. If you seek documents on the Mustang automobile, many of the documents retrieved might be about the mustang horse, so you might try "Mustang NOT horse". What’s the problem? This search strategy would also reject articles or websites that mention the term "horse power."

  43. Boolean Nesting command Nesting ( ) allows you to build complex queries. You nest queries using parentheses. Example: impeachment AND (clinton OR johnson)
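
Since Boolean search works on sets of documents, AND, OR, NOT, and nesting map directly onto set intersection, union, difference, and parentheses. The sketch below assumes five made-up documents and a helper named `matching`; it is an illustration of the logic, not any engine's query parser.

```python
import re

# Hypothetical document collection, keyed by document ID.
docs = {
    1: "Clinton impeachment trial in the Senate",
    2: "Johnson impeachment of 1868",
    3: "Clinton campaign speech",
    4: "Mustang engine with 300 horse power",
    5: "The mustang is a wild horse",
}


def matching(term):
    """Return the set of document IDs whose text contains the word `term`."""
    term = term.lower()
    return {
        doc_id
        for doc_id, text in docs.items()
        if term in re.findall(r"[a-z0-9]+", text.lower())
    }


# clinton AND impeachment: both terms must be present (set intersection).
and_result = matching("clinton") & matching("impeachment")

# clinton OR johnson: either term may be present (set union).
or_result = matching("clinton") | matching("johnson")

# impeachment AND (clinton OR johnson): parentheses nest the OR inside the AND.
nested_result = matching("impeachment") & (matching("clinton") | matching("johnson"))

# mustang NOT horse: remove every document that mentions "horse" (set difference).
not_result = matching("mustang") - matching("horse")

print(and_result, or_result, nested_result, not_result)
```

The last query shows the pitfall from the NOT slide: document 4 is about the car, but it is rejected along with document 5 because it contains "horse power", so the result is empty.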

  44. Advanced Search - Google • Search with “quotes” for better phrase matching • keyword site:www.site.com – search only a specific site • keyword site:www.site.com/folder/ – search only a specific folder • intitle:keyword phrase – match page titles only • keyword filetype:ppt (or doc, mp3, pdf, etc.) – search by file type • keyword + site: + folder + filetype – starting to see the power! • keyword + site: + folder + filetype + DownThemAll + prefs = research powerhouse: check a specific folder on a website for a specific file type, show all the matches, and download everything in the folder with one click.
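
Combining these operators is plain string assembly, as the following sketch suggests. The helper `build_query` and its parameter names are hypothetical; only the `site:`, `intitle:`, and `filetype:` operators and the quoted-phrase syntax come from the slide. The example site is the placeholder address used on the slide.

```python
from urllib.parse import urlencode


def build_query(keywords, site=None, filetype=None, intitle=None, exact_phrase=False):
    """Assemble a query string from keywords and optional search operators."""
    parts = [f'"{keywords}"' if exact_phrase else keywords]
    if intitle:
        parts.append(f"intitle:{intitle}")
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


query = build_query(
    "information retrieval",
    site="www.site.com/folder/",
    filetype="pdf",
    exact_phrase=True,
)
print(query)

# urlencode escapes the spaces, quotes, and colons so the query can go in a URL.
print("https://www.google.com/search?" + urlencode({"q": query}))
```

The same string can simply be typed into the search box; the URL form is only needed when a search is generated by a program or saved as a link.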

  45. SEO = Search Engine Optimization • Definition: using targeted keywords and phrases so that a website’s pages will rank high on SERPs. • SERP = Search Engine Results Page • Note that SEO also stands for Search Engine Optimizer.

  46. What do you need to do before searching? • Find the focus of your question • Clarify the key concepts • Determine the key terms for the concepts • Prepare alternative terms to describe these concepts • Choose a way to start looking

  47. Google search with “Index of/” • Example: “Index of/” inurl:lib • Terms to pair with “index of”: mpeg4, mp3, cnki, rmvb, rm, movie, swf, jpg, admin, pdf, doc, wmv, mdb, mpg, mtv, software, mov, asf, lib, vod, rar, exe, iso, video, book, soft, chm, password, game, music, dvd, mid, ebook, download

  48. Find an exact file you need • “index of/” MTV • “index of/” MPEG • “index of/” rmvb ...

  49. Recall and Precision • How is the quality of a search measured? By recall and precision. • There is normally an inverse relationship between recall and precision.

  50. Recall is a measure of the proportion of relevant documents that are captured by a search formulation: $\text{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}$ • For example, if you are searching a database with 100 articles dealing with dolphins caught by tuna fishermen and you retrieve only ten of the 100 because you searched only for the terms dolphin AND tuna, your recall would be ten percent. • You can improve your recall by finding more relevant terms and using the Boolean OR to increase the set. Thus porpoise OR dolphin would have better recall.
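
The dolphin example can be worked through in a few lines. Recall follows the definition on this slide; precision uses its standard definition (relevant documents retrieved divided by all documents retrieved), which the slides name but do not spell out. The document IDs and the broader search's 40 irrelevant hits are invented for illustration.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that the search retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# The database holds 100 relevant articles, numbered 1 to 100 here.
relevant = set(range(1, 101))

# Narrow search (dolphin AND tuna): retrieves only 10 articles, all relevant.
narrow = set(range(1, 11))

# Hypothetical broader search (adding OR terms): 60 relevant articles
# plus 40 irrelevant ones, numbered 101 to 140.
broad = set(range(1, 61)) | set(range(101, 141))

print(recall(narrow, relevant), precision(narrow, relevant))
print(recall(broad, relevant), precision(broad, relevant))
```

The narrow search has 10% recall with 100% precision, while the broader one reaches 60% recall at 60% precision, which illustrates the usual trade-off between the two measures.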
