
Google’s Search Engine

Google’s Search Engine. Shuying Wang, 2003.09. Outline: how search engines work; Google architecture overview; what is PageRank?; how PageRank is calculated; analysis of PageRank.


Presentation Transcript


  1. Google’s Search Engine Shuying Wang 2003.09

  2. Outline • How search engines work • Google Architecture Overview • What is PageRank? • How PageRank is calculated • Analysis of PageRank

  3. Types of search engines: Search engines are computer programs that explore the net in search of webpages. • Crawler-based engines send crawlers out into cyberspace. A crawler visits a website, reads the information on it, and follows the links that the site connects to. The crawler returns all that information to a central repository, where the data is indexed. • Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index. • A combination of the two.

  4. How search engines work • Crawling the web (crawler, spider, or robot) • Indexing the webpages • Searching the index
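The three stages above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the in-memory `TOY_WEB` dictionary stands in for the real Web, and the names `crawl`, `build_index` and `search` are hypothetical, not part of any real search engine.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("google search engine", ["b.example", "c.example"]),
    "b.example": ("pagerank ranks pages", ["c.example"]),
    "c.example": ("search engine index", ["a.example"]),
}

def crawl(seed, web):
    """Crawling: visit a page, read its text, follow its links (breadth-first)."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Indexing: build an inverted index mapping word -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Searching: return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

pages = crawl("a.example", TOY_WEB)
index = build_index(pages)
print(sorted(search(index, "search engine")))
```

A real crawler would fetch pages over HTTP and a real index would store positions and weights; the sketch only shows how the stages hand data to each other.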

  5. What is Google? • A googol is 1 followed by 100 zeroes = 10^100. • Google is a privately held and profitable company focused on search services. Named for the mathematical term "googol", Google operates a web site at www.google.com that is widely recognized as the "World's Best Search Engine" and is fast, accurate and easy to use. • 3.7 billion pages, with 15,000 servers in 8 search centers. • 0.34 seconds per query.

  6. Google Architecture Overview

  7. Google Query Process
     1. Parse the query.
     2. Convert words into wordIDs.
     3. Seek to the start of the doclist in the short barrel for every word.
     4. Scan through the doclists until there is a document that matches all the search terms.
     5. Compute the rank of that document for the query.
     6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
     7. If we are not at the end of any doclist, go to step 4.
     8. Sort the documents that have matched by rank and return the top results.
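The query steps above can be sketched roughly as follows. This is a simplified illustration, not Google's implementation: the barrels are assumed to be plain dictionaries from wordID to a sorted list of docIDs, the `rank` callback is a hypothetical stand-in for the real ranking function, and the names `match_all` and `run_query` are made up for this example.

```python
import heapq

def match_all(doclists):
    """Scan sorted doclists in step; yield docIDs that appear in every list."""
    if not doclists or any(not dl for dl in doclists):
        return
    pos = [0] * len(doclists)
    while all(p < len(dl) for p, dl in zip(pos, doclists)):
        current = [dl[p] for p, dl in zip(pos, doclists)]
        target = max(current)
        if all(c == target for c in current):
            yield target
            pos = [p + 1 for p in pos]
        else:
            # Advance every list that is behind the largest current docID.
            pos = [p + 1 if c < target else p for p, c in zip(pos, current)]

def run_query(query, lexicon, short_barrels, full_barrels, rank, k=10):
    words = query.lower().split()                        # parse the query
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]               # words -> wordIDs
    matched = {}
    for barrels in (short_barrels, full_barrels):        # short barrels first, then full
        doclists = [barrels.get(wid, []) for wid in word_ids]
        for doc_id in match_all(doclists):               # documents matching all terms
            matched.setdefault(doc_id, rank(doc_id, word_ids))
    return heapq.nlargest(k, matched, key=matched.get)   # sort by rank, keep the top k

# Hypothetical toy data.
lexicon = {"google": 1, "search": 2}
short_barrels = {1: [2, 5], 2: [5, 9]}
full_barrels = {1: [1, 2, 5, 7], 2: [1, 5, 7, 9]}
scores = {1: 0.2, 5: 0.9, 7: 0.5}

def rank(doc_id, word_ids):
    return scores.get(doc_id, 0.0)

print(run_query("google search", lexicon, short_barrels, full_barrels, rank, k=2))
```

The sketch always continues into the full barrels once the short barrels are exhausted; a production system would presumably stop early once enough matches have been found.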

  8. What is PageRank? • PageRank is one of the methods Google uses to determine a page’s relevance or importance. • PageRank is a “vote”, by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support. If there is no link, there is no support. • If a page links to another page, it is casting a vote, which indicates that the linked page is good. • If many pages link to a page, it has more votes and its worth should be higher. • The underlying assumption is that people link only to pages they think are good.

  9. Google Toolbar
     Toolbar PageRank    Real PageRank
     0                   0.15 - 0.9
     1                   0.9 - 5.4
     2                   5.4 - 32.4
     3                   32.4 - 194.4
     4                   194.4 - 1,166.4
     5                   1,166.4 - 6,998.4
     6                   6,998.4 - 41,990.4
     7                   41,990.4 - 251,942.4
     8                   251,942.4 - 1,511,654.4
     9                   1,511,654.4 - 9,069,926.4
     10                  9,069,926.4 - 0.85 × N + 0.15
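In the table above, each range boundary is 6 times the previous one, starting from 0.15. A minimal sketch of that mapping follows, assuming this base-6 scale holds throughout; the scale is what the slide shows, not a confirmed Google formula, and the function name `toolbar_pagerank` is hypothetical.

```python
def toolbar_pagerank(real_pr, base=6.0, lowest=0.15, top=10):
    """Map a real PageRank value to a 0-10 toolbar value.

    Assumes the slide's scale: level 0 covers [0.15, 0.9) and every
    further level covers a range 6 times larger than the previous one.
    """
    level = 0
    upper = lowest * base  # upper bound of level 0, i.e. 0.9
    while level < top and real_pr >= upper:
        upper *= base
        level += 1
    return level

for value in (0.5, 1.0, 100.0, 5000.0):
    print(value, toolbar_pagerank(value))
```

Multiplying the boundary step by step avoids rounding trouble that a direct logarithm could cause at the exact range edges.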

  10. Important factors of a webpage

  11. How PageRank is calculated: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • PR(A) is the PageRank of page A • d is a damping factor (between 0 and 1), nominally set to 0.85 • PR(T1) is the PageRank of a page T1 pointing to page A • C(T1) is the number of links going out of that page • PR(Tn)/C(Tn) means the same term is added for each page pointing to page A • Minimum: 1-d. Maximum: (1-d) + dN, where N is the total number of pages
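A single application of the formula above can be written as a short function. This is a sketch for illustration; the name `pagerank_of` and the representation of inbound pages as `(PR(T), C(T))` pairs are assumptions.

```python
def pagerank_of(inbound, d=0.85):
    """Apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) once.

    inbound: list of (PR(T), C(T)) pairs, one per page T that links to A.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)

# One inbound page with PageRank 1 and two outgoing links.
print(round(pagerank_of([(1.0, 2)]), 6))

# No inbound links at all: the result is the minimum, 1 - d.
print(round(pagerank_of([]), 6))
```

With one inbound page of PageRank 1 that has two outgoing links, the result is 0.15 + 0.85 × 0.5 = 0.575; with no inbound links it is the minimum value 1 - d = 0.15.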

  12. The Iterative Computation of PageRank
      (Diagram: three pages A, B and C)
      PR(A) = (1 - 0.85) + 0.85 (PR(C) / 1)
      PR(B) = (1 - 0.85) + 0.85 (PR(A) / 2)
      PR(C) = (1 - 0.85) + 0.85 (PR(A) / 2 + PR(B))
      Initial PR: 1; d: 0.85 (1 - d = 0.15)
      Iteration    PR(A)         PR(B)         PR(C)
      0            1             1             1
      1            1             0.575         1.244375
      2            0.9232344     0.6788594     1.2232645
      …            …             …             …
      50           0.9999999     0.7017544     1.2982456
      51           1.00000000    0.76923077    1.15384615
      52           1.00000000    0.76923077    1.15384615

  13. Internal Structures and Linkages
      (Diagrams: three pages A, B, C linked in three different ways)
      Structure       Page A       Page B       Page C
      Hierarchical    1.4594595    0.7702703    0.7702703
      Looping         1            1            1
      Interlinking    1            1            1
      Initial PR: 1; d: 0.85 (1 - d = 0.15); iterations: 100; total PR: 3.0
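The values on this slide can be checked with a small iterative calculation. The sketch below makes several assumptions: the link lists are one reading of the three diagrams (hierarchical: A links to B and C, which both link back to A; looping: A to B to C to A; interlinking: every page links to the other two), d is 0.85, and each iteration updates values in place. The function name `pagerank` is hypothetical.

```python
def pagerank(links, d=0.85, iterations=100, initial=1.0):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages.

    links: dict mapping each page to the list of pages it links to.
    Values are updated in place, so later pages in an iteration already
    see the new values of earlier pages.
    """
    pr = {page: initial for page in links}
    for _ in range(iterations):
        for page in links:
            inbound = sum(pr[src] / len(out)
                          for src, out in links.items() if page in out)
            pr[page] = (1 - d) + d * inbound
    return pr

# Assumed readings of the three diagrams.
hierarchical = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
looping = {"A": ["B"], "B": ["C"], "C": ["A"]}
interlinking = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

for name, graph in [("hierarchical", hierarchical),
                    ("looping", looping),
                    ("interlinking", interlinking)]:
    print(name, {page: round(value, 7) for page, value in pagerank(graph).items()})
```

Under these assumptions the hierarchical structure should converge to roughly 1.4594595 for page A and 0.7702703 for pages B and C, while the other two structures stay at 1 for every page, so the total stays at 3.0 in all three cases.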

  14. Link to your site
      (Diagrams: the same three structures, each with an external page X, PR(X) = 0.15, linking to page A)
      Structure       PR(A)        PR(B)        PR(C)
      Hierarchical    1.9189189    0.9655405    0.9655405
      Looping         1.3304179    1.2387269    1.2808552
      Interlinking    1.3429825    1.2535088    1.2535088
      Initial PR: 1; d: 0.85 (1 - d = 0.15); iterations: 100; total PR: 3.85

  15. Link out of your site
      (Diagrams: pages A, B, C, D in two configurations)
      Configuration   PR(A)        PR(B)        PR(C)        PR(D)        Total PR
      First           0.5342466    0.3013699    0.3013699    0.3013699    2.3383562
      Second          0.9536153    0.420191     0.420191     0.420191     3.1141883
      Initial PR: 1; d: 0.85 (1 - d = 0.15); iterations: 100

  16. Link exchange (1)
      (Diagrams: two three-page sites, A-B-C and D-E-F, in two configurations)
      Configuration   Page A       Page B       Page C       Total PR    Page D       Page E       Page F       Total PR
      First           1.4594595    0.7702703    0.7702703    3.0         1.4594595    0.7702703    0.7702703    3.0
      Second          1.7234043    0.6382979    0.6382979    3.0         1.7234043    0.6382979    0.6382979    3.0

  17. Link exchange (2)
      (Diagrams: the two sites A-B-C and D-E-F in three further configurations)
      Configuration   Page A       Page B       Page C       Total PR     Page D       Page E       Page F       Total PR
      First           1.8623058    0.6776533    0.6776533    3.2176124    1.3183416    0.7537509    0.7102952    2.7823877
      Second          1.4934575    0.9967762    0.9967762    3.4870099    1.1675242    0.6992681    0.6461978    2.5129901
      Third           1.3913813    0.9464778    0.9464778    3.2843369    1.5419126    0.5868752    0.5868752    2.7156630

  18. Dangling Pages
      A dangling link is a link to a page that has no outgoing links, or a link to a page that Google hasn't indexed.
      (Diagrams: pages A, B, C in two configurations)
      Configuration   PR(A)        PR(B)       PR(C)
      First           0.4344423    0.334638    0.334638
      Second          1.0          1.0         0.575

  19. PR0: When Google wants to penalize a page, it assigns the page a PageRank of zero. • Spam (e.g., excessive repetition of keywords, text in the same color as the background, deceptive or misleading links) • Link farms (reciprocal links)

  20. References • Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine” • Chris Ridings and Mike Shishigin, “PageRank Uncovered” • http://pr.efactory.de/ • Tools: PageRank Calculator
