Set 10 Search Engines & SEO

Set 10 Search Engines & SEO IT452 Advanced Web and Internet

Outline • How do search engines work? • Basic operation • What makes a good one? • What makes it difficult? • Web Design with search engines in mind

Search Engines – Basic Operation • Crawler • Indexer • Query Engine • Separate step b/c delay in downloading

Crawler Early spiders: DMOZ directory Links from pages Submitted URLs Site indexes • How does it find the pages? • Does it crawl everything? • How fast does it crawl? No. “bow-tie” structure of web, see next slide • Courtesy wait • Robots.txt

The Web is a Bow-Tie • Early study of 200 million web pages and links • Broder et al. 2000 • Structure of the web: a bow-tie shape • http://www9.org/w9cdrom/160/160.html SCC: strongly connected component IN: any node has a path to the SCC OUT: any node has a backward path to the SCC A spider starting in OUT won’t find most of the Web!

Indexer • Parse document • Remember • Whole text • Words • Phrases • Link text • Builds an “inverted index” Whole text -- often Maps words to document IDs barista burrito burro 531235, 4324, 6981, 125793, 41009, … 344, 7173, 574527, 14513, 2451245, … 8375, 75346, 345231, 5123523, 52388, …

Query Engine • Process text query from user • Inverse index merges document IDs • Return ranked set of hopefully relevant pages • Ranking factors • 1. Query-specific • 2. Page-specific • 3. Page Genre • 4. Contains all query words? Freq of query words? Reputable page? Popular page? “PageRank” Show some blogs, some news, some photos Hacks! – er proprietary intelligent query features

PageRank • Original basis of Google – still important • Developed in 1998. • http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427 • Basic Model • Two interpretations: • Random walk • Pages voting Fw = links in w pointing out Bw = pages pointing to w c = normalizing factor Draw graph on board, edges, walk. Pages with higher page rank count more

PageRank • Two interpretations: • Random walk • Pages voting

PageRank • Who owns the PageRank patent? • (hint: not Google) Stanford University Stanford gives Google the exclusive rights to use it. • - Stanford received 1.8 million shares for Google to use it. • Sold them in 2005 for $336 million. • 2012 they would be worth $1.1 billion

SEO • Goal • What does it consider? • Types Improve type/quantity of traffic to site from search engines 1.) search algorithm specifics 2.) what people search for 1.) white hat 2.) black hat -- deception

SE0 0.1 • Early search engines heavily dependent on meta tags • What to do? • White hat: • Black hat: • Key issue: easy to _____________________ Provide relevant tags Stuff tags Manipulate with one site

SEO 1.0 • Modern search engines depend heavily on links • What to do? • White hat: • Black hat: Get good, relevant sites to link to you Link farms

SEO 2.0 • Machine Learning • You search for “cats”, which result do you click first? • Learn from user clicks which they prefer • Smarter algorithms cluster words that “mean” the same thing • What to do? • White hat: • Black hat: Nothing? Automate clicks on search engine result pages?

Good principles • Clear hierarchy • Links to all pages (static), not as images • Useful content • Links from relevant sites • Good title / alt / meta • Limit dynamically generated pages (or # args) • No broken links, < 100 links • Use robots.txt – exclude internal search results • Fresh content

Bad principles • Stuff with lots of irrelevant content • Show different version of content to crawler • Link schemes, farms • Hidden text and links • Pages designed just for search engines, not users • Automated querying • Deception in general

Set 10 Search Engines & SEO