Google’s Search Engine Shuying Wang 2003.09
Outline
• How search engines work
• Google Architecture Overview
• What is PageRank?
• How PageRank is calculated
• Analysis of PageRank
Types of search engines
Search engines are computer programs that explore the net in search of webpages.
• Crawler-based engines send crawlers out into cyberspace. A crawler visits a web site, reads the information on it and follows the links that the site connects to. The crawler returns all that information to a central repository, where the data is indexed.
• Human-powered search engines rely on humans to submit information, which is subsequently indexed and catalogued. Only information that is submitted is put into the index.
• A combination of the two.
How search engines work
• Crawling the web (with a crawler, also called a spider or robot)
• Indexing the webpages
• Searching the index
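The three stages above can be sketched on a toy, in-memory "web". Everything here is hypothetical and chosen for illustration: the `TOY_WEB` data, the `.example` page names, and the helper names `crawl`, `build_index` and `search`. A real crawler fetches pages over HTTP and a real index stores far more than word membership.

```python
from collections import defaultdict, deque

# Hypothetical toy web: page name -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("google search engine", ["b.example", "c.example"]),
    "b.example": ("pagerank ranks pages", ["c.example"]),
    "c.example": ("search index of pages", ["a.example"]),
    "d.example": ("page nobody links to", []),
}

def crawl(web, start):
    """Stage 1: follow links breadth-first from a start page, collecting page text."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Stage 2: build an inverted index mapping each word to the pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Stage 3: return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

pages = crawl(TOY_WEB, "a.example")
index = build_index(pages)
print(sorted(search(index, "search pages")))
```

Because the crawler only discovers pages by following links, `d.example` never reaches the index, which is why link structure matters even before ranking.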
What is Google?
• A googol is 1 followed by 100 zeroes: 10^100.
• Google is a privately held, profitable company focused on search services. Named for the mathematical term "googol", Google operates a web site at www.google.com that is widely recognized as the "World's Best Search Engine" and is fast, accurate and easy to use.
• 3.7 billion pages, with 15,000 servers in 8 search centers.
• About 0.34 seconds per query.
Google Query Process
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
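A much simplified sketch of this query loop follows. It is not Google's implementation: the dictionaries `LEXICON`, `SHORT_BARRELS`, `FULL_BARRELS` and `SCORES`, and the function names, are hypothetical, and a precomputed per-document score stands in for the real rank computation of step 5. The sketch also always continues into the full barrels, whereas the steps above interleave scanning and ranking.

```python
def intersect(doclists):
    """Scan sorted doclists in parallel and return docIDs present in all of them."""
    if not doclists or any(not doclist for doclist in doclists):
        return []
    positions = [0] * len(doclists)
    matches = []
    while all(pos < len(doclist) for pos, doclist in zip(positions, doclists)):
        current = [doclist[pos] for pos, doclist in zip(positions, doclists)]
        target = max(current)
        if all(doc == target for doc in current):
            matches.append(target)                      # document matches every term
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the lists that are behind the largest current docID.
            positions = [pos + (1 if doc < target else 0)
                         for pos, doc in zip(positions, current)]
    return matches

def answer_query(query, lexicon, short_barrels, full_barrels, scores, k=10):
    words = query.lower().split()                       # step 1: parse the query
    if not words or any(word not in lexicon for word in words):
        return []
    word_ids = [lexicon[word] for word in words]        # step 2: words -> wordIDs
    matched = []
    for barrels in (short_barrels, full_barrels):       # steps 3 and 6
        doclists = [barrels.get(word_id, []) for word_id in word_ids]
        for doc in intersect(doclists):                 # steps 4 and 7
            if doc not in matched:
                matched.append(doc)
    # Steps 5 and 8: look up each document's score, sort, return the top k.
    return sorted(matched, key=lambda doc: scores[doc], reverse=True)[:k]

# Hypothetical sample data.
LEXICON = {"google": 1, "search": 2}
SHORT_BARRELS = {1: [3, 7], 2: [7, 9]}
FULL_BARRELS = {1: [2, 3, 5, 7], 2: [2, 5, 7, 9]}
SCORES = {2: 0.4, 3: 0.1, 5: 0.9, 7: 0.7, 9: 0.2}

print(answer_query("Google search", LEXICON, SHORT_BARRELS, FULL_BARRELS, SCORES, k=2))
```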
What is PageRank?
• PageRank is one of the methods Google uses to determine a page’s relevance or importance.
• PageRank is a “vote”, by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support. If there is no link, there is no support.
• If a page links to another page, it is casting a vote, which indicates that the other page is good.
• If many pages link to a page, it has more votes and its worth should be higher.
• The underlying assumption is that people link only to pages they think are good.
Google Toolbar

Toolbar PageRank   Real PageRank
0                  0.15 - 0.9
1                  0.9 - 5.4
2                  5.4 - 32.4
3                  32.4 - 194.4
4                  194.4 - 1,166.4
5                  1,166.4 - 6,998.4
6                  6,998.4 - 41,990.4
7                  41,990.4 - 251,942.4
8                  251,942.4 - 1,511,654.4
9                  1,511,654.4 - 9,069,926.4
10                 9,069,926.4 - 0.85 × N + 0.15
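In the table, every range starts at 0.15 × 6^k, so each toolbar step corresponds to a sixfold increase in real PageRank. A minimal sketch of that mapping is below. The function name `toolbar_pagerank` and its parameters are hypothetical, and treating the lower bound of each range as inclusive is an assumption, since the table does not say which bucket a boundary value belongs to.

```python
def toolbar_pagerank(real_pr, base=6.0, floor_pr=0.15):
    """Map a real PageRank to a 0-10 toolbar value using the table's ranges."""
    if real_pr < floor_pr:
        raise ValueError("real PageRank cannot be below 1 - d = 0.15")
    level = 0
    upper = floor_pr * base          # upper bound of the current range
    while level < 10 and real_pr >= upper:
        level += 1
        upper *= base                # each step is six times wider than the last
    return level

for value in (0.15, 1.0, 6.0, 200.0, 1e7):
    print(value, toolbar_pagerank(value))
```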
How PageRank is calculated
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• PR(A) is the PageRank of page A.
• d is a damping factor between 0 and 1, nominally set to 0.85.
• T1 ... Tn are the pages that link to page A, and PR(T1) is the PageRank of page T1.
• C(T1) is the number of outgoing links on page T1.
• PR(Tn)/C(Tn) means the same share is computed for each page pointing to page A.
• Minimum: 1 - d. Maximum: (1 - d) + dN, where N is the total number of pages.
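Written with a summation, the formula also shows why the total PageRank of a closed set of pages equals the number of pages. This is a sketch under one assumption: every page has at least one outgoing link. The page indices p and q are introduced here for illustration.

```latex
\begin{align}
PR(A) &= (1-d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \\
\sum_{p=1}^{N} PR(p) &= N(1-d) + d \sum_{q=1}^{N} C(q)\,\frac{PR(q)}{C(q)}
                      = N(1-d) + d \sum_{q=1}^{N} PR(q) \\
\Rightarrow \quad \sum_{p=1}^{N} PR(p) &= N
\end{align}
```

The middle step uses the fact that page q passes PR(q)/C(q) to each of its C(q) link targets, so its whole PageRank is counted exactly once on the right-hand side.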
The Iterative Computation of PageRank
[Diagram: three pages A, B and C]
PR(A) = (1 - 0.85) + 0.85 (PR(C) / 1)
PR(B) = (1 - 0.85) + 0.85 (PR(A) / 2)
PR(C) = (1 - 0.85) + 0.85 (PR(A) / 2 + PR(B))
Initial PR: 1, d: 0.85

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.575        1.244375
2           0.9232344    0.6788594    1.2232645
…           …            …            …
50          0.9999999    0.7017544    1.2982456
51          1.00000000   0.76923077   1.15384615
52          1.00000000   0.76923077   1.15384615
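The iteration can be sketched in a few lines. The link graph below (A links to B and C, B links to C, C links to A) is an assumption read off the three equations, since the diagram itself is not reproduced, and the names `iterate_pagerank` and `LINKS` are hypothetical. Values are updated in place, so each page sees the newest values of the pages computed before it. The printed numbers depend on the graph and on the update order, so they need not match the tabulated rows digit for digit.

```python
def iterate_pagerank(links, d=0.85, iterations=100, initial=1.0):
    """Return the PageRank values after every iteration (index 0 is the start)."""
    pr = {page: initial for page in links}
    # For each page, list the pages that link to it.
    inbound = {page: [src for src, targets in links.items() if page in targets]
               for page in links}
    history = [dict(pr)]
    for _ in range(iterations):
        for page in links:
            # PR(page) = (1 - d) + d * sum of PR(T) / C(T) over inbound pages T
            pr[page] = (1 - d) + d * sum(pr[src] / len(links[src])
                                         for src in inbound[page])
        history.append(dict(pr))
    return history

# Assumed graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

history = iterate_pagerank(LINKS)
for step in (0, 1, 2, 100):
    row = history[step]
    print(step, *(f"{row[page]:.7f}" for page in "ABC"))
```

Whatever the starting values, the total across the three pages settles at 3, the number of pages.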
Internal Structures and Linkages
[Diagrams: pages A, B and C linked in three ways: hierarchical, looping, interlinking]

Page     Hierarchical   Looping   Interlinking
A        1.4594595      1         1
B        0.7702703      1         1
C        0.7702703      1         1

Initial PR: 1, d: 0.85, Iterations: 100, Total PR: 3.0
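The three structures can be compared with a short script. The exact link sets in `STRUCTURES` are assumptions, because the diagrams are not reproduced: hierarchical is taken as A linking to B and C with both linking back, looping as A to B to C to A, and interlinking as every page linking to both others. This variant updates all pages simultaneously and stops once the values change by less than a tolerance; the function name and parameters are hypothetical.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Simultaneous-update PageRank that stops when values stabilise."""
    pr = {page: 1.0 for page in links}
    for _ in range(max_iter):
        new = {page: 1 - d for page in links}
        for src, targets in links.items():
            for target in targets:
                # Each page splits its PageRank evenly across its outgoing links.
                new[target] += d * pr[src] / len(targets)
        change = max(abs(new[page] - pr[page]) for page in links)
        pr = new
        if change < tol:
            break
    return pr

# Assumed link sets for the three structures.
STRUCTURES = {
    "hierarchical": {"A": ["B", "C"], "B": ["A"], "C": ["A"]},
    "looping":      {"A": ["B"], "B": ["C"], "C": ["A"]},
    "interlinking": {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]},
}

results = {name: pagerank(links) for name, links in STRUCTURES.items()}
for name, ranks in results.items():
    print(name, {page: round(value, 7) for page, value in ranks.items()})
```

Under these assumptions, only the hierarchical layout concentrates PageRank on one page; the other two spread it evenly.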
Link to your site
[Diagrams: the same three structures, each with an external page X (PR = 0.15) linking to page A]

Page     Hierarchical   Looping     Interlinking
PR(A)    1.9189189      1.3304179   1.3429825
PR(B)    0.9655405      1.2387269   1.2535088
PR(C)    0.9655405      1.2808552   1.2535088

Initial PR: 1, d: 0.85, Iterations: 100, Total PR: 3.85
Link out of your site
[Diagrams: two arrangements of pages A, B, C and D]

Page       Diagram 1   Diagram 2
PR(A)      0.5342466   0.9536153
PR(B)      0.3013699   0.420191
PR(C)      0.3013699   0.420191
PR(D)      0.3013699   0.420191
Total PR   2.3383562   3.1141883

Initial PR: 1, d: 0.85, Iterations: 100
Link exchange (1)
[Diagrams: two sites, one with pages A, B, C and one with pages D, E, F, without and with a link exchange]

           Without exchange   With exchange
Page A     1.4594595          1.7234043
Page B     0.7702703          0.6382979
Page C     0.7702703          0.6382979
Total PR   3.0                3.0

Page D     1.4594595          1.7234043
Page E     0.7702703          0.6382979
Page F     0.7702703          0.6382979
Total PR   3.0                3.0
Link exchange (2)
[Diagrams: three further link-exchange arrangements between the site A, B, C and the site D, E, F]

           Arrangement 1   Arrangement 2   Arrangement 3
Page A     1.8623058       1.4934575       1.3913813
Page B     0.6776533       0.9967762       0.9464778
Page C     0.6776533       0.9967762       0.9464778
Total PR   3.2176124       3.4870099       3.2843369

Page D     1.3183416       1.1675242       1.5419126
Page E     0.7537509       0.6992681       0.5868752
Page F     0.7102952       0.6461978       0.5868752
Total PR   2.7823877       2.5129901       2.7156630
Dangling Pages
A dangling link is a link to a page that has no links going from it, or a link to a page that Google hasn't indexed.
[Diagrams: two arrangements of pages A, B and C]

Page     Diagram 1   Diagram 2
PR(A)    0.4344423   1.0
PR(B)    0.334638    1.0
PR(C)    0.334638    0.575
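The effect of a dangling page can be sketched as follows. The graph is an assumption, since the diagrams are not reproduced: A links to B and C, B links back to A, and C has no outgoing links. The second half shows one commonly described treatment, which is to drop the dangling link, compute PageRank for the remaining pages, and then assign the dangling page its value in a final step. The names `pagerank`, `WITH_DANGLING` and `PRUNED` are hypothetical.

```python
D = 0.85  # damping factor

def pagerank(links, d=D, iterations=100):
    """In-place iterative PageRank over a dict of page -> outgoing links."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        for page in links:
            # Shares passed on by every page that links to this one.
            shares = (pr[src] / len(targets)
                      for src, targets in links.items() if page in targets)
            pr[page] = (1 - d) + d * sum(shares)
    return pr

# Assumed graph: C receives a link but links nowhere, so its PageRank is lost.
WITH_DANGLING = {"A": ["B", "C"], "B": ["A"], "C": []}
leaky = pagerank(WITH_DANGLING)

# Same site with the dangling link removed from the calculation.
PRUNED = {"A": ["B"], "B": ["A"]}
core = pagerank(PRUNED)
# Afterwards C receives its share from A, using A's original two links.
rank_c = (1 - D) + D * core["A"] / len(WITH_DANGLING["A"])

print({page: round(value, 7) for page, value in leaky.items()})
print({page: round(value, 7) for page, value in core.items()}, round(rank_c, 7))
```

In the first calculation the three pages total well under 3, because whatever flows into C is never passed on.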
PR0
When Google wants to penalize a page, it assigns the page a PageRank of zero.
• Spam (e.g., excessive repetition of keywords, text in the same color as the background, deceptive or misleading links)
• Link farms (reciprocal links)
References
• Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
• Chris Ridings and Mike Shishigin, "PageRank Uncovered"
• http://pr.efactory.de/
Tools
• PageRank Calculator