ILS 501 Unit 3: Searching Issues ILS 501 / Dr. Liu, ILS SCSU
The Rise of Search Pew Internet & American Life Project • Search is the 2nd most popular online activity, after email. • The percentage of net users who search on a typical day grew 70% from 2002 to 2009.
What is a search engine? • A program that searches documents for specified keywords and returns a list of the documents where the keywords were found. • Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document.
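The indexer step described above can be sketched as a small inverted index. This is a minimal illustration, not how any particular search engine is built; the document names and texts are made up for the example, and splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, keyword):
    """Return the sorted ids of the documents that contain the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical documents, standing in for pages fetched by a spider.
docs = {
    "page1": "cats and dogs",
    "page2": "dogs chase cars",
    "page3": "cats sleep",
}
index = build_index(docs)
print(search(index, "cats"))
```

Because the index is built once, each keyword lookup does not need to re-read the documents.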
What is a web search engine? A search engine is a huge database of web page files that has been assembled automatically by machine.
What does a search engine do? It uses software indexers (spiders or "robots") to “crawl” around the Web and build indexes based on what they find in available Web pages.
How Do Search Engines Work? • Crawling: A ‘spider’ or ‘robot’ explores your site, following links from page to page. • Indexing: Data from the crawl is stored in the search engine index. The stored copy is referred to as the ‘cached page’. • Ranking: The Search Engine algorithm looks at a variety of factors (over 200) to determine the importance of a web page and where it should rank for any given keyword phrase.
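The crawling step can be sketched as a breadth-first walk over links. The sketch below assumes a made-up in-memory site (the `SITE` dictionary and its paths are hypothetical) instead of real HTTP requests, so it only shows the link-following logic.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/widget", "/about"],
    "/products/widget": [],
    "/orphan": ["/"],  # no page links here, so the crawler never finds it
}


def crawl(start, get_links):
    """Visit pages breadth-first, following each link only once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl("/", lambda page: SITE.get(page, []))
print(pages)
```

A page that nothing links to is never reached, which is one reason a crawler-based index can miss parts of a site.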
How do search engines work? • Crawler-Based Search Engines • They "crawl" or "spider" the web, then people search through what they have found. • Human-Powered Directories • Hybrid Search Engines (Source: http://www.searchenginewatch.com)
Web site to explain PageRank [Figure: link diagram with nodes a1, b1, b2, b3, b4, c1, d1, d2, e1, e2]
PageRank - Motivation • The number of incoming links to a page is a measure of the importance and authority of the page. • Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.
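The two motivating ideas above can be sketched with a simple iterative calculation. This is a simplified illustration and not Google's production algorithm. The four-page link graph is made up, the damping factor of 0.85 and the iteration count are assumed conventional values, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's rank along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a and b both link to c, and c links to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this graph, c outranks a and b because it has two incoming links, and d scores well with a single incoming link because that link comes from the well-linked page c.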
Three elements of a Crawler-Based Search Engine • The spider (crawler). The spider visits a web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or two, to look for changes. • The index. It is like a catalog containing a copy of every web page that the spider finds. If a web page changes, then the catalog is updated with the new information. • Search engine software. It is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what is most relevant. (Source: http://www.searchenginewatch.com)
A search engine is an index compiler Search engines compile their databases by employing "spiders" or "robots" to crawl through web space from link to link, identifying and reading pages. Once the spiders get to a web site, they typically index most of the words on the publicly available pages at the site.
Two Search Methods • The Searchable Subject Index • Searches titles and meta tags, e.g. Yahoo • The Full-Text Search Engine • Uses a spider to search not only titles but also content, e.g. Google
What are the top 10 search providers in 2009? • They are …..? • Ranked by Nielsen MegaView Search: • Top 10 Search Providers for August 2009, Ranked by Searches (U.S.)
How many types of search engines exist? Three common types: • Directory – subject search • Individual search engine – keyword search • Metasearch engine – searches through multiple engines at once
DIRECTORY by Subjects • Galaxy • GoGuides • LookSmart • NexTag • OpenDirectory • Yahoo* • Zeal
INDIVIDUAL SEARCH ENGINES by Keywords • AllTheWeb • AltaVista • Entireweb • Google • WiseNut • HotBot • Lycos • Yahoo • NexTag • Overture
What is a metasearch engine? • It does not crawl the web to compile its own searchable database. Instead, it searches the databases of multiple individual search engines simultaneously. • It provides a quick way of finding out which engines retrieve the best results for your search.
What Are "Meta-Search" Engines? How Do They Work? • “In a meta-search engine, you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you get back results from all the search engines queried. Meta-search engines do not own a database of Web pages; they send your search terms to the databases maintained by search engine companies.” • From http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
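The behavior quoted above can be sketched as a function that forwards one query to several engines and merges what comes back. The two engines here are hypothetical stand-ins that return fixed lists of made-up URLs, and the round-robin merge is only one possible way to combine results. For simplicity the engines are called one after another, not simultaneously.

```python
from itertools import zip_longest


def engine_a(query):
    """Hypothetical engine: returns a fixed ranked list of URLs."""
    return ["http://example.com/a1", "http://example.com/shared"]


def engine_b(query):
    """Hypothetical engine with a different, partly overlapping list."""
    return [
        "http://example.com/shared",
        "http://example.com/b2",
        "http://example.com/b3",
    ]


def metasearch(query, engines):
    """Send the query to every engine and interleave the results."""
    result_lists = [engine(query) for engine in engines]
    merged = []
    seen = set()
    # Take the first result from each engine, then the second, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


results = metasearch("dolphin", [engine_a, engine_b])
print(results)
```

The metasearch function keeps no database of its own. It depends entirely on the lists returned by the engines it queries, and a URL returned by both engines is listed once.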
Better Meta-Searchers (UC Berkeley - Teaching Library Internet Workshops)
Meta-Search Engines for SERIOUS Deep Digging (UC Berkeley - Teaching Library Internet Workshops)
METASEARCH ENGINES • Dogpile • Clusty • Ixquick • Mamma • MetaCrawler • Metor • Profusion • qbSearch • Surfwax • Vivisimo
Free search engine for your site? • For your website • Freefind • Atomz • For your desktop • Google Desktop 5 (search engine) • Microsoft desktop search engine • Copernic Desktop Search Professional 3.1 • Everything
Atomz Search Add site search to your site in minutes. • Create an account. • Crawl your site. • Add search box to your site.
http://www.freefind.com/ • Add a site search engine to your website • Easy to install: • Enter your website address • Enter your email address • Click the button. You're done!
Why so many search engines? • Because of different ….
Why so many search engines – Different Coverage • They vary in coverage. In fact, coverage is far from complete: even the largest search engine provides access to only a minor portion of the web.
Why so many search engines – Different Search Capabilities • They have different tools and capabilities. Some have NEAR as an operator, some can search by different parameters, and so forth.
Why so many search engines – Different Spiders or Crawlers They have different spiders or crawlers indexing the web. The spiders go out at different intervals, crawl to different depths (only the first page, the first three pages, or perhaps all pages), and differ in indexing techniques.
Why so many search engines – Different Ways of Ranking They differ in how they rank items for display after the items are retrieved. Most rank on the basis of how many times the terms you search for are found or where they are found (more weight to higher placement) in the target websites.
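Ranking by term count and placement can be sketched as a scoring function. This is only an illustration: the pages are made up, and counting a title match three times as much as a body match is an assumed weight, not a value used by any real engine.

```python
def score(page, terms, title_weight=3.0):
    """Count term matches, giving matches in the title extra weight."""
    wanted = {term.lower() for term in terms}
    title_hits = sum(1 for word in page["title"].lower().split() if word in wanted)
    body_hits = sum(1 for word in page["body"].lower().split() if word in wanted)
    return title_weight * title_hits + body_hits


def rank_pages(pages, terms):
    """Order pages from highest to lowest score for the given terms."""
    return sorted(pages, key=lambda page: score(page, terms), reverse=True)


# Hypothetical retrieved pages.
pages = [
    {"url": "p1", "title": "Gardening tips", "body": "dolphin dolphin"},
    {"url": "p2", "title": "Dolphin facts", "body": "the dolphin swims"},
    {"url": "p3", "title": "Tuna", "body": "no match here"},
]
ranked = rank_pages(pages, ["dolphin"])
print([page["url"] for page in ranked])
```

Changing `title_weight` changes the order of results, which is one way two engines can rank the same pages differently.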
Why so many search engines – Different Ranking Protocols They differ in ranking protocols. Google, for example, uses an algorithm that ranks output on the basis of the number of other websites that link to the websites your search retrieves.
Portal definitions …. A portal is a Web site that is commonly used as a gateway to other Web sites. (Source: http://www.searchenginewatch.com)
What is a Portal? • A portal is a client-server application (including web-based interface pages, related Java applets, configuration files, and Perl and C-CGI scripts) for use on an organization’s web server. • It is a set of support materials for target community members. • It is designed to facilitate substantive communication between members in the community(ies).
Boolean Search? • Boolean search is named after the 19th-century mathematician George Boole, who developed theories for working with sets of information. • Boolean search allows you to specify the relationships among your keywords and phrases.
What are Boolean search commands? • AND • OR • NOT • NEAR • NESTING
Boolean AND command The Boolean AND command is used to require that all search terms be present on the web pages listed in results. Your example command is? cats AND dogs
Boolean OR command The Boolean OR command is used to allow any of the specified search terms to be present on the web pages listed in results. Your example command is? house OR home
Boolean NOT command The Boolean NOT command is used to require that a particular search term NOT be present on web pages listed in results. Examples: • cats NOT dogs • canine NOT dog
Be careful using the NOT Boolean operator. If you seek documents on the Mustang automobile, many of the documents retrieved might be about the mustang horse. What about "Mustang NOT horse"? What’s the problem? This search strategy would also reject articles or websites that mention the term "horse power."
Boolean Nesting command Nesting ( ) allows you to build complex queries. You nest queries using parentheses. Example: impeachment AND (clinton OR johnson)
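Since Boolean search works on sets, the commands above map directly onto set operations. The sketch below uses four made-up documents: AND is set intersection, OR is union, NOT is difference, and parentheses in the code play the role of nesting.

```python
# Hypothetical documents keyed by id.
docs = {
    1: "clinton impeachment trial",
    2: "johnson impeachment vote",
    3: "clinton health plan",
    4: "nixon impeachment inquiry",
}


def matching(term):
    """Return the set of ids of documents that contain the term as a word."""
    return {doc_id for doc_id, text in docs.items() if term.lower() in text.lower().split()}


# impeachment AND (clinton OR johnson)
nested = matching("impeachment") & (matching("clinton") | matching("johnson"))

# clinton NOT impeachment
excluded = matching("clinton") - matching("impeachment")

print(sorted(nested), sorted(excluded))
```

Without the parentheses, Python evaluates `&` before `|`, so the query would become (impeachment AND clinton) OR johnson. This is why nesting matters in complex queries.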
Advanced Search: Google • Search with “quotes” for better phrase matching • Keyword + site:www.site.com – search only a specific site • Keyword + site:www.site.com/folder/ – search a specific folder • intitle:keyword phrase – titles only with the keyword • Keyword + filetype:ppt/doc/mp3/pdf/etc. – search by file type • Keyword + site: + folder + filetype – starting to see the power! • Keyword + site: + folder + filetype + DownThemAll + prefs = research powerhouse: check a specific folder on a website for a specific file type, then show them all, and with one click download everything in the folder!
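Combining these operators is just a matter of joining them into one query string. The helper below is a hypothetical convenience function, not part of any Google API. It only builds the text you would type into the search box, and www.site.com is the placeholder address from the slide.

```python
def build_query(keywords, phrase=None, site=None, intitle=None, filetype=None):
    """Join keywords and optional operators into a single query string."""
    parts = [keywords]
    if phrase:
        parts.append(f'"{phrase}"')  # exact phrase match
    if site:
        parts.append(f"site:{site}")  # restrict to a site or folder
    if intitle:
        parts.append(f"intitle:{intitle}")  # keyword must be in the title
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict by file type
    return " ".join(parts)


query = build_query("search engines", site="www.site.com/folder/", filetype="pdf")
print(query)
```

Each operator narrows the result set further, so a query that uses several of them can return very few results.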
SEO = Search Engine Optimization • Definition: using targeted keywords and phrases so a website’s pages will rank high on SERPs. • SERP = Search Engine Results Page • Note that SEO also stands for Search Engine Optimizer.
What do you need to do before searching? • Find the focus of your question • Clarify the key concepts • Determine the key terms for the concepts • Prepare alternative terms to describe these concepts • Choose a way to start looking
Google search with “Index of/” • “Index of/” inurl:lib • 1. index of mpeg4 3. index of mp3 4. index of cnki 5. index of rmvb 6. index of rm 7. index of movie 8. index of swf 9. index of jpg 10. index of admin 12. index of pdf 13. index of doc 14. index of wmv 15. index of mdb 16. index of mpg 17. index of mtv 18. index of software 19. index of mov 20. index of asf 23. index of lib 24. index of vod 25. index of rar 27. index of exe 28. index of iso 29. index of video 30. index of book 31. index of soft 32. index of chm 33. index of password 34. index of game 35. index of music 36. index of dvd 37. index of mid 38. index of ebook 40. index of download
Find an exact file you need • “index of/” MTV • “index of/” MPEG • “index of/” rmvb ...
Recall and Precision • How is quality measured in search? By recall and precision. • There is normally an inverse relationship between recall and precision.
Recall is a measure of the proportion of relevant documents that are captured by a search formulation. Recall = (N of relevant retrieved docs) / (N of relevant docs) • For example, if you are searching a database with 100 articles dealing with dolphins caught by tuna fishermen and you retrieve only ten of the 100 because you searched only for the terms dolphin AND tuna, your recall would be ten percent. • You can improve your recall by finding more relevant terms and using the Boolean OR to increase the set. Thus porpoise OR dolphin would have better recall.
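The recall formula, together with the standard definition of precision (the share of retrieved documents that are relevant), can be checked with a short calculation. The document ids are made up. The narrow search mirrors the dolphin example above, and the broad search is an assumed case that retrieves 60 relevant and 40 irrelevant articles.

```python
def recall(retrieved, relevant):
    """Share of all relevant documents that the search retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


relevant = set(range(100))  # the 100 relevant articles in the database

# Narrow search (dolphin AND tuna): finds 10 articles, all relevant.
narrow = set(range(10))

# Broad search (hypothetical OR query): 60 relevant plus 40 irrelevant articles.
broad = set(range(60)) | set(range(100, 140))

print(recall(narrow, relevant), precision(narrow, relevant))
print(recall(broad, relevant), precision(broad, relevant))
```

Broadening the search raises recall from ten percent to sixty percent, but precision drops because irrelevant articles come along. This is the inverse relationship between the two measures.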