Search and the New Economy Session 1 Basics of Web Search Engines. Prof. Panos Ipeirotis. Who am I?. Prof. Panagiotis Ipeirotis (a.k.a. Panos) Email: panos@stern.nyu.edu AIM: ipeirotis Office: KMC 8-84 (see “Staff Information” on Blackboard)
Search and the New Economy Session 1 Basics of Web Search Engines Prof. Panos Ipeirotis
Who am I? • Prof. Panagiotis Ipeirotis (a.k.a. Panos) • Email: panos@stern.nyu.edu • AIM: ipeirotis • Office: KMC 8-84 • (see “Staff Information” on Blackboard) • Joined Stern in 2004, “A Computer Scientist in a Business School” • Research in web mining and in data integration • EconoMining Project: • Is there positive buzz about the iPod Touch? Which feature would customers pay the most for? • Which seller on eBay has a reputation for delivering fast? How much more can that merchant charge and still make a sale? • Web Searching: • What mergers and acquisitions took place in 2007?
Who are you? http://pages.stern.nyu.edu/~panos/teaching/W08.html Mixture of marketing / media + technology backgrounds • 70% have LinkedIn accounts • Penetration of Facebook accounts less obvious
Course Overview Class days, time, place • KMC 3-120 • Tuesday Jan 29 (6pm-9pm) • Thursday Jan 31 (6pm-9pm) • Sunday Feb 3 (9am-12n) • Sunday Feb 3 (1pm-4pm) • Tuesday Feb 5 (6pm-9pm) • Thursday Feb 7 (6pm-9pm) Teaching assistant • Nikolay Archak (narchak@stern.nyu.edu)
Course Overview Class requirements • 6 Assignments • One take-home final exam or a project • Both due on February 14th • Submit your proposal for a project by February 1st
Course Overview Blackboard • http://sternclasses.nyu.edu/ • Use your Stern username and password • Confirm that you can access the course as soon as possible • Information about your classroom colleagues • All readings • All assignment descriptions • All assignment submissions (well, almost) • All online discussions • Grades, announcements, exam guidelines, stock tips….
Key Objectives of Course A. Understand the technology behind “search” (Jan 29) How do search engines discover and rank web pages? How can we identify issues and opportunities in a web site? (Mainly lecture-based) B. Understand search engine advertising (Jan 31) Advertising on the web: banner ads, contextual ads, keyword ads. Optimizing a website for organic and paid search (Lecture + example discussion) C. Harness the wisdom of the crowds (Feb 3) Leveraging social networks for marketing, blog analysis, opinion mining and buzz tracking, long tail and recommender/reputation systems, prediction markets and wikis (Lecture + Case Discussion, focus on cases) D. Data ownership issues (Feb 5) Who owns your data? Privacy threats, the changing face of intellectual property (Case presentation + discussion) At its core: A hands-on class with a “how-to” mentality
Objectives of today’s class • Understand the disruptive power of information • Learn how information is stored on the Web • Learn how search engines discover and rank information • Learn how users search for information (Analytics)
Information is ubiquitous How has IT changed these industries? Banking Music Travel Newspapers Radio Email Video/TV Advertising Telephony Manufacturing Stock Market Retail / POS
Information technology is ubiquitous What is common to all disruptive changes? Banking Music Travel Newspapers Radio Email Video/TV Advertising Telephony Manufacturing Stock Market Retail / POS
Key concepts • Digitization • Information Asymmetries • At the root of every disruption caused by search technologies • “Web search” is only part of the equation Google's mission is to organize the world's information and make it universally accessible and useful
Objectives of today’s class • Understand the disruptive power of information • Learn how information is stored on the Web • Learn how search engines discover and rank information • Learn how users search for information
In Assignment 1 you created a website • Can you find it on Google? • If yes, how? • If no, why not?
Why is this important? Search Engines Influence Consumers
Internet vs. WWW Let’s cover the basics • Internet and Web are not synonymous • The Internet is a global communication network connecting millions of computers • The World Wide Web (WWW) is one component of the Internet, along with e-mail, chat, etc. (Slide adapted from Marti Hearst, Lew & Davis)
How Does the WWW Work? • You created a web page index.html for the class on your PC • Then you copy the page to a directory /sne/w08/ on the NYU computer that runs a “web server” • The computer’s name is “homepages.nyu.edu”
Reading a URL http://homepages.nyu.edu/sne/w08/index.html • http:// = HyperText Transfer Protocol (i.e., Web) • homepages = service name (often www) • .nyu = domain name • .edu = top-level domain • sne/ = directory name • w08/ = directory name • index.html = file name of web page
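The URL breakdown above can be reproduced with Python's standard `urllib.parse` module. This is a minimal sketch: the function name `read_url` and the dictionary keys are illustrative choices, and it assumes a host of the form service.domain.tld and a path that ends in a file name.

```python
from urllib.parse import urlsplit


def read_url(url):
    """Split a URL into the parts named on the slide (illustrative helper)."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")          # e.g. homepages, nyu, edu
    path_pieces = [p for p in parts.path.split("/") if p]
    return {
        "protocol": parts.scheme,                    # http
        "service_name": host_labels[0],              # often "www"
        "domain_name": host_labels[-2],
        "top_level_domain": host_labels[-1],
        "directories": path_pieces[:-1],             # everything before the file
        "file_name": path_pieces[-1],
    }


info = read_url("http://homepages.nyu.edu/sne/w08/index.html")
for name, value in info.items():
    print(f"{name}: {value}")
```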
Publishing on the Web 1. You create the web page on your computer (Diagram: NYU Student, Internet, NYU Web Server, Random Web User)
Publishing on the Web 2. You send the files to the NYU Web server (Diagram: the NYU Student sends the files over the Internet to the NYU Web Server using FTP)
Publishing on the Web 3. A web user requests your home page URL (Diagram: the Random Web User sends an http request over the Internet to the NYU Web Server)
Publishing on the Web 4. The NYU Web server serves up your page (Diagram: the NYU Web Server sends an http response over the Internet to the Random Web User, the client)
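The four publishing steps can be tried on a single machine with Python's standard library. This is only a sketch under stated assumptions: a local file copy stands in for the FTP upload, a throwaway server on 127.0.0.1 stands in for the NYU web server, and the names `publish_and_fetch` and `QuietHandler` are made up for illustration.

```python
import http.client
import http.server
import tempfile
import threading
from functools import partial
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not print a log line per request."""

    def log_message(self, format, *args):
        pass


def publish_and_fetch(html):
    with tempfile.TemporaryDirectory() as root:
        # Steps 1-2: create the page and place it under /sne/w08/ on the "server".
        page_dir = Path(root) / "sne" / "w08"
        page_dir.mkdir(parents=True)
        (page_dir / "index.html").write_text(html, encoding="utf-8")

        # Start a web server on a free local port, serving files from root.
        handler = partial(QuietHandler, directory=root)
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            # Step 3: a web user sends an http request for the page URL.
            port = server.server_address[1]
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            conn.request("GET", "/sne/w08/index.html")
            # Step 4: the server answers with an http response.
            response = conn.getresponse()
            status = response.status
            body = response.read().decode("utf-8")
            conn.close()
        finally:
            server.shutdown()
            server.server_close()
        return status, body


status, body = publish_and_fetch("<h1>Hello from my class page</h1>")
print(status, body)
```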
Information on the Web When anyone can publish, how do we find what we need? • The information is spread across multiple autonomous computers • With millions of choices, how do we find what we need?
Objectives of today’s class • Understand the disruptive power of information • Learn how information is stored on the Web • Learn how search engines discover and rank information • Learn how users search for information
How Search Engines Work Three main parts: • Gather the contents of all web pages (using a program called a crawler or spider) • Organize the contents of the pages in a way that allows efficient retrieval (indexing) • Take in a query, determine which pages match, and show the results (ranking and display of results)
How do Search Engines Discover Information? • How do crawlers find web pages? • Start with a list of domain names and visit the home pages there. • Look at the hyperlinks on each home page, and follow those links to more pages. • Keep a list of URLs visited, and a list of those still to be visited. • Each time the program loads a new HTML page, add the links in that page to the list to be crawled.
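The crawl loop described above can be sketched in a few lines of Python. To keep the example self-contained it crawls a toy in-memory "web" (`TOY_WEB`, a made-up dictionary of URL to HTML) instead of the real network; the `fetch` parameter, the `max_pages` limit, and the example URLs are assumptions for illustration, not part of the lecture.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    to_visit = deque(seed_urls)      # URLs still to be visited
    seen = set(seed_urls)            # every URL ever queued
    visited = []                     # URLs actually loaded, in order
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:             # page could not be loaded
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)    # resolve relative links
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# A made-up three-page web used in place of real HTTP requests.
TOY_WEB = {
    "http://a.example/": '<a href="/about.html">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about.html": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="/missing.html">?</a>',
}

crawled = crawl(["http://a.example/"], TOY_WEB.get)
print(crawled)
```

Passing a different `fetch` function (for example one that downloads pages over HTTP) would turn the same loop into a real crawler; politeness rules such as rate limits are left out of this sketch.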
Standard Web Search Engine Architecture • Crawler machines send discovered pages to the mothership (document storage) • Create an “inverted index”: for each word, the pages that contain the word • Search engine servers take the user query and look it up in the inverted index • Show results to user
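An inverted index in the sense used above (for each word, the pages that contain it) can be sketched as a dictionary of sets. The page texts, URLs, and function names below are made up for illustration, and the lookup simply intersects the page sets of the query words; real engines store far more per word.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


pages = {
    "http://a.example/": "Prediction markets and web search",
    "http://b.example/": "Web search engines rank pages",
    "http://c.example/": "An MBA course on prediction markets",
}
index = build_inverted_index(pages)
print(sorted(search(index, "prediction markets")))
```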
Crawler behavior varies • Parts of a web page that are indexed • Until recently, only the first few parts of the page were retrieved/stored • How deeply a site is indexed • Google/Yahoo/MSN get only the top few levels • How frequently the site is crawled • Can be a few minutes (news), hours (blogs), days, or weeks (my site) • What are the implications?
Indexing Record the following information about each page • List of words • Is the word in the title? • How far down in the page? • Was the word in boldface? • URLs of pages pointing to this one • Anchor text on pages pointing to this one • …many other “secret ingredients”
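A few of the per-page facts listed above (is the word in the title, how far down the page, was it in boldface) can be collected with the standard `html.parser` module. This is a rough sketch, not how any particular engine stores its index: the class name `PageFeatures`, the field names, and the sample page are assumptions for illustration.

```python
import re
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Record, for each word: first position, count, in title?, in bold?"""

    def __init__(self):
        super().__init__()
        self.in_title = 0       # nesting depth of <title>
        self.in_bold = 0        # nesting depth of <b> / <strong>
        self.position = 0       # running word offset from the top of the page
        self.words = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title += 1
        elif tag in ("b", "strong"):
            self.in_bold += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = max(0, self.in_title - 1)
        elif tag in ("b", "strong"):
            self.in_bold = max(0, self.in_bold - 1)

    def handle_data(self, data):
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            entry = self.words.setdefault(word, {
                "first_position": self.position,
                "count": 0,
                "in_title": False,
                "in_bold": False,
            })
            entry["count"] += 1
            entry["in_title"] = entry["in_title"] or self.in_title > 0
            entry["in_bold"] = entry["in_bold"] or self.in_bold > 0
            self.position += 1


html = (
    "<html><head><title>Prediction markets</title></head>"
    "<body><p>A course on <b>prediction</b> markets and search.</p></body></html>"
)
features = PageFeatures()
features.feed(html)
print(features.words["prediction"])
```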
The importance of anchor text <a href=http://behind-the-enemy-lines…> Finally, another course on prediction markets</a> <a href=http://behind-the-enemy-lines …> An MBA course the way it should be</a> The anchor text summarizes what the website is about. (This also gives rise to the “Google bombing” phenomenon: http://en.wikipedia.org/wiki/Google_bomb)
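The idea that anchor text describes the page being linked to can be sketched by collecting, for each link target, the text of the links that point to it. The target URL `http://course.example/` and the class name are placeholders invented for this example; the two anchor texts are the ones from the slide.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Group anchor text by the URL each link points to."""

    def __init__(self):
        super().__init__()
        self.anchor_text = defaultdict(list)   # target URL -> list of texts
        self._href = None                      # target of the <a> being read
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._chunks).split())   # tidy whitespace
            if text:
                self.anchor_text[self._href].append(text)
            self._href = None


# Two hypothetical pages that link to the same (placeholder) course site.
linking_pages = [
    '<p><a href="http://course.example/"> Finally, another course on prediction markets</a></p>',
    '<p><a href="http://course.example/"> An MBA course the way it should be</a></p>',
]
collector = AnchorTextCollector()
for page in linking_pages:
    collector.feed(page)
print(dict(collector.anchor_text))
```

An indexer could then add these collected phrases to the words recorded for the target page, which is also why coordinated links with the same anchor text can skew results.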
Text-based retrieval is not enough • So far, we examined how text is used for retrieving pages • However, text alone is not enough. Why?
Measuring Importance of Linking PageRank Algorithm • Idea: important pages are pointed to by other important pages • Method: • Each link from one page to another is counted as a “vote” for the destination page • The number of incoming links is important, but it is not enough: each “vote” is different! • PageRank gives more weight to votes that come from pages that themselves have a large number of votes (and so on, and so on) • Compare, for example, the circled page in cases A and B
Computing PageRank (don’t need to ‘know’) • Four pages: Page A, Page B, Page C, Page D • Each page has a “People who bought this also bought…” box that links to other pages • Seven links in total: two point to Page A, one to Page B, three to Page C, one to Page D • (ignoring damping factor for illustration)
PageRank: start • Every page starts with the same score • Page A .250 • Page B .250 • Page C .250 • Page D .250
PageRank: step 1 shares • Each page splits its score equally among its outgoing links • The page with three links sends .250/3 along each • The page with two links sends .250/2 along each • The two pages with one link send .250 each
PageRank: after step 1 • Page A .375 • Page B .083 • Page C .458 • Page D .083
PageRank: step 2 shares • .375/3 along each of three links • .083/2 along each of two links • .458 and .083 along the single links
PageRank: after step 2 • Page A .500 • Page B .125 • Page C .250 • Page D .125
PageRank: final scores • Page A .400 • Page B .133 • Page C .333 • Page D .133
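The numbers in this worked example can be reproduced with a short Python sketch that, like the slides, ignores the damping factor. The link structure in `LINKS` is reconstructed from the scores rather than read from the diagram: Page A must link to B, C and D and Page C to A, while which of B and D has two links (to A and C) and which has one (to C) is an assumption; the scores come out the same either way. The sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, iterations=50):
    """Repeatedly pass each page's score along its outgoing links.

    Simplified PageRank without a damping factor; returns the list of
    score tables, one per step (step 0 is the uniform start).
    """
    pages = list(links)
    scores = {page: 1.0 / len(pages) for page in pages}
    history = [dict(scores)]
    for _ in range(iterations):
        new_scores = {page: 0.0 for page in pages}
        for page, targets in links.items():
            share = scores[page] / len(targets)   # split the vote equally
            for target in targets:
                new_scores[target] += share
        scores = new_scores
        history.append(dict(scores))
    return history


# Assumed link structure (see the note above about B and D).
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["C"],
}

history = pagerank(LINKS)
for step in (0, 1, 2, len(history) - 1):
    row = "  ".join(f"{page} {history[step][page]:.3f}" for page in LINKS)
    print(f"step {step}: {row}")
```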
How PageRank is used • Locate the pages that contain the query text • Weight the “text score” with the “link score” • Rank results Lesson: PageRank of competitors matters! Do not obsess (only) about your PageRank
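The three steps above can be sketched as follows. The scoring here is a deliberately simple stand-in: the text score is just the share of words on the page that match the query, and multiplying it by the link score is an assumption made for illustration, not the formula any real search engine uses. All URLs and numbers are made up.

```python
def rank_results(query, pages, link_scores):
    """Find pages containing every query word, then order them by
    text score multiplied by link score (illustrative weighting only)."""
    terms = query.lower().split()
    scored = []
    for url, text in pages.items():
        words = text.lower().split()
        if not all(term in words for term in terms):
            continue                       # step 1: page must contain the query
        text_score = sum(words.count(term) for term in terms) / len(words)
        combined = text_score * link_scores.get(url, 0.0)   # step 2: weight
        scored.append((combined, url))
    return [url for combined, url in sorted(scored, reverse=True)]   # step 3


pages = {
    "http://you.example/": "london hotels discount london hotels",
    "http://rival.example/": "london hotels and travel guide",
    "http://other.example/": "paris hotels",
}
link_scores = {
    "http://you.example/": 0.1,
    "http://rival.example/": 0.4,
    "http://other.example/": 0.5,
}

ranking = rank_results("london hotels", pages, link_scores)
print(ranking)
```

In this toy data the rival page mentions the query words less often but still ranks first because of its higher link score, which is the point of the lesson on the slide: your position depends on your competitors' scores as well as your own.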
Cool! Let’s Get some PageRank • Obvious incentives to game the system • Or at least to speed up the process of going up in the results
A few spam technologies • Cloaking • Serve fake content to search engine robot • (Diagram: “Is this a search engine spider?” If yes, serve the fake doc; if no, serve the spam page) • DNS cloaking: switch IP address, impersonate • Doorway pages • Pages optimized for a single keyword that redirect to the real target page (typically get real content from legitimate pages and synthesize) • Keyword spam • Misleading meta-keywords, excessive repetition of a term, fake “anchor text” • Hidden text with colors, CSS tricks, etc. • Example: Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
Gaming PageRank: Link spam • Link spam: Inflating the rank of a page by creating nepotistic links to it • From own sites: Link farms • From partner sites: Link exchanges • From unaffiliated sites (e.g. blogs, guest books, web forums, etc.) • The more links, the better • Generate links automatically • Use scripts to post to blogs • Synthesize entire web sites • Synthesize many web sites (DNS spam) • The more important the linking page, the better • Buy expired highly-ranked domains • Post links to high-quality blogs
Gaming PageRank and Trust TrustRank Algorithm • Initial votes come only from trusted pages • Compare, for example, the circled page in cases A and B (Diagram labels: MIT student, NYU student, links from untrusted sources) • The main reason behind the initial success of Google • Get links from trusted, quality sites!
Other ranking factors • Location, Location, Location...and Frequency • Query words in title, or in first few sentences • The more frequent the query words, the better • Clickthrough measurement • How often users click on your URL, when they see it • How long do they stay (using toolbars!)
How to rank high in the results • Position your keywords (title, headings, early on page) • Make text visible (no tiny fonts, no white-on-white) • “Alt text” for images: Accessibility + search engines • Frames can kill (Flash and AJAX are also problematic) • Have relevant content • Do not change topics • Build links (nice to build a real community) • Just say no to search engine spamming • Submit your key pages • Verify your listing often
Objectives of today’s class • Understand the disruptive power of information • Learn how information is stored on the Web • Learn how search engines discover and rank information • Learn how users search for information (after the break)