Computer Science 1000

Computer Science 1000 Information Searching II Redistribution of these slides without permission is strictly prohibited

Search Engine • a collection of computer programs designed to help us find information on the Web • typically served through a website • different search providers exist, but basic functionality is consistent • type keywords into a text box • page returns links to other pages

Search Engine • why is a search engine like an index? • recall that an index maps keywords to a location in some medium (like a page number in a book) • a search engine does a very similar thing • takes keywords of interest from a user • maps these keywords to relevant web pages • in fact, one of the key components of a search engine is its index

Search Engine • what differentiates a search engine from other indexes (like a book index)? • the ability to quickly combine keywords in searches • e.g. search for information on ducks and foxes • result ranking • personalization • among others …

Search Engine – How it Works • different search engines employ different technologies • the full details of commercial search engines are typically not public • however, some of the basics are consistent • crawling • indexing • query processing

Crawling • for a search engine to be able to link to a web page, it must know about its existence • search engines find pages by crawling the web • programs called crawlers or spiders • e.g. Googlebot • a crawler visits web pages, in much the same way that you do • as each page is visited, information is remembered about the page (indexing)

Crawling – Todo List • the todo list is the list of pages still to be visited by the crawler • the crawling process starts with an initial todo list, populated with sites from previous crawls • however, the list is updated as the crawl takes place • hyperlinks on visited sites are added to the list Todo List http://www.uleth.ca http://www.tsn.ca http://www.usask.ca ...

Crawling – Example • suppose that this page was being processed by a crawler • as a consequence of this page being crawled, its links would be added to the todo list (if they aren't already there) • those pages would subsequently be checked by the crawler at some point Kev's Page • Favorite Stuff: • New York Islanders • Saskatchewan Roughriders • John Deere
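The todo-list process described above can be sketched in a few lines of Python. This is only an illustration: it walks a small in-memory link graph instead of fetching real pages, and every URL ending in `.example` is a made-up placeholder rather than a real site.

```python
from collections import deque

# Hypothetical link graph standing in for the live web: page -> pages it links to.
LINKS = {
    "http://www.uleth.ca": ["http://www.tsn.ca", "http://kev.example/page"],
    "http://www.tsn.ca": ["http://www.uleth.ca"],
    "http://kev.example/page": ["http://islanders.example", "http://riders.example"],
    "http://islanders.example": [],
    "http://riders.example": [],
    "http://orphan.example": [],  # no page links here, so the crawler never finds it
}


def crawl(seed_urls, links):
    """Visit pages in todo-list order, adding newly seen hyperlinks to the list."""
    todo = deque(seed_urls)   # the todo list
    seen = set(seed_urls)     # pages that are already on (or were on) the list
    visited = []
    while todo:
        url = todo.popleft()
        visited.append(url)   # a real crawler would index the page here
        for target in links.get(url, []):
            if target not in seen:   # only add links that aren't already there
                seen.add(target)
                todo.append(target)
    return visited


visited = crawl(["http://www.uleth.ca"], LINKS)
print(visited)
```

Starting from a single seed, the crawl reaches every page that some visited page links to. The page `http://orphan.example` is never visited, because it is neither on the initial list nor linked from any other page.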

The "Invisible Web" • not all information is crawled, which means it is not visible to search engines • some pages are new, and haven't yet had a chance to be crawled • however, there are other reasons that certain information does not get crawled

The "Invisible Web" • 1) No hyperlinks to that page • recall that in order for a page to be crawled, it must be either: • on the todo list, or • linked from a page that the crawler reaches from the todo list • without a hyperlink, that page will never be found [Figure: a todo list containing Pages 1, 2 and 3, beside a set of web pages (Pages 1 to 6) and the links between them] Page 5 will not be crawled, as it is not on the todo list, and no other pages link to it.

The "Invisible Web" • 2) The Page is synthetic • a synthetic page is created on demand, depending on user input • e.g. the results of a search on another search engine My personal search for "New York Islanders" on Bing results in an on-demand page that is not stored. Hence, it will not be crawled.

The "Invisible Web" • 3) The content is unreadable to the crawler • search engines are primarily text-based • certain data, such as movie content, is not crawlable The webpage containing the movie might be crawled, but not the movie itself. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=72746

The "Invisible Web" • 4) The content is password-protected • if you require a password to access a page, then so does a search engine*

The "Invisible Web" • 5) You ask the search engine to ignore your site • the presence of certain files stored with your website will restrict your site from being crawled • e.g. the Robots Exclusion Protocol • a file called robots.txt can be stored that requests that your site (or just certain pages) is not indexed • unlike the previous four examples, this does not prevent search engines from crawling your site • they can choose to ignore robots.txt
Example:
User-agent: Google
Disallow:
User-agent: *
Disallow: /
http://www.robotstxt.org/
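To see how a well-behaved crawler might interpret the example file, here is a small sketch using Python's standard `urllib.robotparser` module. The crawler name `SomeOtherBot` and the `www.example.com` URL are placeholders chosen for illustration, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide: allow "Google", disallow everyone else.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

# Placeholder URL and crawler names, used only for this illustration.
url = "http://www.example.com/index.html"
google_allowed = parser.can_fetch("Google", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
print(google_allowed, other_allowed)
```

An empty `Disallow:` line means nothing is off limits for that crawler, while `Disallow: /` covers the whole site. The check is voluntary: nothing stops a crawler from skipping it.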

Indexing • the primary role of the crawler is to build an index • an index is a list of tokens • words • phrases (not considered here)* • each token is associated with a list of URLs • in other words, like a book index, but with page URLs instead of page numbers • other information might be stored with URLs (e.g. page location of token) • these indexes are saved by the search provider • search queries use information from the indexes (fast), rather than crawling the web for each query (slow) *http://www.google.com/patents/US7536408

Index Lists – Example * from text – Figure number might be different

Indexing – What Makes a Token? • page text • a common approach • search providers differ on which text is selected* • some may use all text • others may only use certain text, such as: • titles and headings • frequently occurring words • words occurring early in a page • sometimes, stop words (a, an, the) are ignored • hyperlink text • the term from a hyperlink on another page may be used to describe the page that it links to *http://computer.howstuffworks.com/internet/basics/search-engine1.htm
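As a rough sketch of the idea, the following Python builds a tiny index that maps each token to a list of URLs, using all page text and ignoring the stop words a, an and the. The pages and their text are invented for the example; real search providers choose and store tokens in their own ways.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "http://pets.example/cats": "The cat is a popular pet",
    "http://pets.example/fish": "A red fish and a guppy",
    "http://zoo.example/big-cats": "The lion is a big cat",
}


def build_index(pages):
    """Map each token to the alphabetized list of URLs whose text contains it."""
    index = defaultdict(list)
    for url in sorted(pages):                     # sorted, so each list is alphabetical
        tokens = set(pages[url].lower().split())  # each word counts once per page
        for token in tokens - STOP_WORDS:         # stop words are ignored
            index[token].append(url)
    return dict(index)


index = build_index(PAGES)
print(index["cat"])
```

Like a book index, looking up `cat` gives the "page numbers" directly, here the two URLs whose text contains the word, without rereading any page.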

Query Processing • the part of the search engine that we see • the query processor: • reads words/phrases from the user interface • returns pages that are relevant to that query • modern query processors: • are extremely fast • are very accurate • allow a considerable variety in their capabilities • how does this all work?

Query Processing – How it works • let's start simple: suppose we search for a single word (e.g. cat) • in a nutshell: • the search engine finds the list for the token 'cat' • this list holds the pages that contain 'cat' in the appropriate text (e.g. title) • this list is ranked according to perceived relevance • the ranked list is returned as an ordered set of hyperlinks

Query Processing – How it works • Step 1: the search engine finds the list for the token 'cat'

Query Processing – How it works • Step 2: this list is ranked according to perceived relevance www.cat.com en.wikipedia.org/wiki/Cat www.youtube.com/watch?v=J---aiyznGQ ...

Query Processing – How it works • Step 3: the ranked list is returned as an ordered set of hyperlinks www.cat.com en.wikipedia.org/wiki/Cat www.youtube.com/watch?v=J---aiyznGQ ...
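The three steps can be sketched as follows. The relevance scores are invented purely so that the output matches the order shown on the slides; how a real search engine scores pages is not public.

```python
# Hypothetical index entry for the token 'cat' (stored alphabetically).
INDEX = {
    "cat": [
        "en.wikipedia.org/wiki/Cat",
        "www.cat.com",
        "www.youtube.com/watch?v=J---aiyznGQ",
    ],
}

# Made-up relevance scores (higher means more relevant); an assumption for this sketch.
SCORES = {
    "www.cat.com": 0.9,
    "en.wikipedia.org/wiki/Cat": 0.8,
    "www.youtube.com/watch?v=J---aiyznGQ": 0.5,
}


def search(word, index, scores):
    """Step 1: find the token's list. Step 2: rank it. Step 3: return it in order."""
    urls = index.get(word.lower(), [])
    return sorted(urls, key=lambda url: scores.get(url, 0.0), reverse=True)


results = search("cat", INDEX, SCORES)
print(results)
```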

Query Processing • what about multi-word searching? • as mentioned, some search engines index phrases as well • however, what if a particular phrase is not indexed? • e.g. (text) red fish guppy • solution: intersecting queries • the webpages that are common to all of the search words are returned

Intersecting Queries • example (text): suppose the query was “red fish guppy” • further suppose that the indexes for each word were as follows: • result is the set of sites that contain all of the keywords • in other words, the sites that are found on all three lists guppy: en.wikipedia.org/wiki/guppy www.ifga.org www.fullredguppy.com www.sciencedaily.com www.tropicalfish.com red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com Result: www.fullredguppy.com www.sciencedaily.com
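The example can be checked with a few lines of Python. Using the built-in `set` type is a convenience for this sketch, not a claim about how any search engine stores its lists.

```python
# The three index lists from the example.
GUPPY = ["en.wikipedia.org/wiki/guppy", "www.ifga.org", "www.fullredguppy.com",
         "www.sciencedaily.com", "www.tropicalfish.com"]
RED = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
FISH = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

# Keep only the sites found on all three lists.
result = sorted(set(GUPPY) & set(RED) & set(FISH))
print(result)
```

Note that newsroom.urc.edu appears on the red and fish lists but not on the guppy list, so it is left out of the result.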

Intersecting Queries - Efficiency • the size of index lists can be large • 'cat' returns over 2.3 billion results • modern search engines are fast • hence, clever algorithms must be developed for optimizing queries • example: intersecting queries

Intersecting Queries - Efficiency • suppose you had two search terms • e.g. red and fish • the query processor has a list for each token • suppose each list contained 1 billion URLs • let's consider a method for performing the intersecting query • that is, how do we find all pages that occur on both lists?

The Naive Approach • for each entry in the 'red' list • search through the entire 'fish' list • if we find the entry from the red list, then add that to our result red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

The Naive Approach • First search: www.sciencedaily.com • do we find it in second list? • yes – add it to result red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com

The Naive Approach • Second search: en.wikipedia.org/wiki/red • do we find it in second list? • no red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com

The Naive Approach • Third search: newsroom.urc.edu • do we find it in second list? • yes, add it to list red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com newsroom.urc.edu

The Naive Approach • Fourth search: www.red.com • do we find it in second list? • no red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com newsroom.urc.edu

The Naive Approach • Fifth search: www.fullredguppy.com • do we find it in second list? • yes – add it to list red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com newsroom.urc.edu www.fullredguppy.com

The Naive Approach • problems? • slow!! • for each URL in left list, we potentially had to compare it to every URL in right list • under our previous assumption (lists of 1 billion URLs each), we may have to do 1 billion x 1 billion comparisons • even for a powerful computer, this would require a considerable amount of time
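A minimal sketch of the naive approach, with a counter added to show how the work grows. The function name and the counter are additions for illustration; the lists are the ones used in the walkthrough. The inner search stops as soon as a match is found, which is why the slide says "potentially".

```python
RED = ["www.sciencedaily.com", "en.wikipedia.org/wiki/red", "newsroom.urc.edu",
       "www.red.com", "www.fullredguppy.com"]
FISH = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]


def naive_intersect(left, right):
    """For each entry in the left list, search through the right list."""
    result = []
    comparisons = 0
    for url in left:
        for candidate in right:
            comparisons += 1
            if url == candidate:
                result.append(url)   # found it, so add it to the result
                break                # no need to look further for this URL
    return result, comparisons


result, comparisons = naive_intersect(RED, FISH)
print(result)
print(comparisons)
```

On these five-entry lists the sketch makes 21 comparisons, close to the worst case of 5 x 5 = 25. Every URL that is missing from the right list costs a full pass over it.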

Alphabetized Lists • suppose that each list was maintained alphabetically • then we could employ the following approach • place a marker at start of each list • if markers point to same URL: • add URL to result list • move both markers down • otherwise, move the marker whose URL is lexicographically smaller • stop when at least one marker goes off the end of the list
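The marker procedure above can be sketched in Python, with list positions playing the role of the markers. The function name and the step counter are additions for illustration, and Python's ordinary string comparison stands in for the lexicographic comparison.

```python
# Both lists must already be in alphabetical order.
RED = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
FISH = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]


def sorted_intersect(left, right):
    """Intersect two alphabetized lists by walking one marker down each."""
    i = j = 0          # markers at the start of each list
    steps = 0
    result = []
    while i < len(left) and j < len(right):   # stop when a marker goes off the end
        steps += 1
        if left[i] == right[j]:
            result.append(left[i])   # same URL: add it and move both markers
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1                   # left URL is smaller: move the left marker
        else:
            j += 1                   # right URL is smaller: move the right marker
    return result, steps


result, steps = sorted_intersect(RED, FISH)
print(result)
print(steps)
```

On the red and fish lists this takes 7 steps and finds the three URLs common to both lists.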

The Sorted Approach • place markers at the start of each list red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

The Sorted Approach • do markers point to same URL? • no • since right marker's URL is less than left marker's URL, move right marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

The Sorted Approach • do markers point to same URL? • no • since left marker's URL is less than right marker's URL, move left marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

The Sorted Approach • do markers point to same URL? • yes • add URL to result • move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu

The Sorted Approach • do markers point to same URL? • no • since right marker's URL is less than left marker's URL, move right marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu

The Sorted Approach • do markers point to same URL? • yes • add URL to result • move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com

The Sorted Approach • do markers point to same URL? • no • since left marker's URL is less than right marker's URL, move left marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com

The Sorted Approach • do markers point to same URL? • yes • add URL to result • move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com www.sciencedaily.com

The Sorted Approach • at least one marker has completed its list, so we can stop • notice that our result contains correct values red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com www.sciencedaily.com

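The Sorted Approach • how many comparisons are done? • note that every step involves moving at least one marker, and markers only ever move down • with 1 billion URLs on each list, the two markers can move at most 2 billion times in total • hence, the maximum number of steps is 2 billion • this is considerably less than (1 billion) squared • result: a massive speedup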
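To put the two counts side by side, here is a back-of-the-envelope calculation. The rate of one billion comparisons per second is an assumption made only to turn the counts into times; real hardware and real index structures will differ.

```python
LIST_SIZE = 1_000_000_000               # 1 billion URLs per list, as assumed above
COMPARISONS_PER_SECOND = 1_000_000_000  # assumed machine speed, for illustration only

naive_comparisons = LIST_SIZE * LIST_SIZE   # worst case for the naive approach
sorted_steps = 2 * LIST_SIZE                # maximum steps for the sorted approach

seconds_per_year = 365 * 24 * 3600
naive_years = naive_comparisons / COMPARISONS_PER_SECOND / seconds_per_year
sorted_seconds = sorted_steps / COMPARISONS_PER_SECOND
speedup = naive_comparisons // sorted_steps

print(f"naive:  about {naive_years:.1f} years")
print(f"sorted: about {sorted_seconds:.0f} seconds")
print(f"speedup factor: {speedup:,}")
```

Under that assumption the naive approach needs decades in the worst case, while the sorted approach finishes in about two seconds.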

The Sorted Approach – Notes • remember: commercial search engines don't fully publicize their strategies • hence, some search engines may use alternate approaches for efficient intersections • the previous strategy can also be applied to more than two lists simultaneously • hence, we can search for multiple tokens, rather than just two
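One simple way to handle more than two tokens is sketched below. Note that it is a variation, not the simultaneous version the slide mentions: instead of moving one marker per list at the same time, it intersects the lists two at a time and feeds each result into the next intersection. This works because the result of intersecting two alphabetized lists is itself alphabetized. The function names are illustrative.

```python
from functools import reduce

# Alphabetized index lists for the three tokens.
GUPPY = ["en.wikipedia.org/wiki/guppy", "www.fullredguppy.com", "www.ifga.org",
         "www.sciencedaily.com", "www.tropicalfish.com"]
RED = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
FISH = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]


def intersect_two(left, right):
    """Two-marker intersection of two alphabetized lists."""
    i = j = 0
    result = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(lists):
    """Intersect any number of alphabetized lists, two at a time."""
    return reduce(intersect_two, lists)


result = intersect_all([GUPPY, RED, FISH])
print(result)
```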

Example (from text):

Ranking Results • a typical search can produce millions of results • however, we often find what we are looking for in the first few results • according to Optify, the first returned result from Google gets clicked 36.4% of the time • the first page of results gets clicked through 90% of the time • how does this occur? • via a page ranking system http://searchenginewatch.com/article/2049695/Top-Google-Result-Gets-36.4-of-Clicks-Study

Ranking Results • search providers have different ways of ranking the results of the search • Google: PageRank • proprietary (not all details available) • some details are public (considered next) • the higher the PageRank score, the closer to the top of the search results a page will be http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897

PageRank • a scoring system • links from other pages add to a page's score • the link from Page 1 adds to Page 4's score • the links from Pages 1, 2 and 3 add to Page 5's score • the links from Pages 2 and 3 add to Page 6's score [Figure: Web pages. Page 1 links to Pages 4 and 5; Page 2 links to Pages 5 and 6; Page 3 links to Pages 5 and 6]

PageRank • the score from each page is not weighted equally • the higher a page's PageRank, the more important its contribution is • suppose that Page 3 has one incoming link (from Page 1), and Page 4 has one incoming link (from Page 2) • since Page 2's rank is higher than Page 1's, Page 4's rank will be higher than Page 3's [Figure: Web pages. Page 1 (low rank) links to Page 3; Page 2 (high rank) links to Page 4]
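The two ideas (links add to a score, and links from higher-ranked pages add more) can be illustrated with a simplified, textbook-style calculation. This is a sketch only, not Google's proprietary system: the damping value 0.85, the number of rounds, and the way pages without outgoing links are handled are all assumptions. The link structure is the six-page example in which Page 1 links to Pages 4 and 5, and Pages 2 and 3 each link to Pages 5 and 6.

```python
# page -> pages it links to (the six-page example).
LINKS = {
    "Page 1": ["Page 4", "Page 5"],
    "Page 2": ["Page 5", "Page 6"],
    "Page 3": ["Page 5", "Page 6"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its links (simplified sketch)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with equal scores
    for _ in range(iterations):
        # Assumption: pages with no outgoing links share their score with everyone.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        for page, targets in links.items():
            for target in targets:
                # A page's score is split evenly among the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page 5, with three incoming links, ends up with the highest score, followed by Page 6 (two links) and Page 4 (one link). Because each contribution is scaled by the linking page's own score, a link from a higher-ranked page is worth more.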
