
jan-sigir-tutorial



Presentation Transcript


    Slide 1:Web Search Tutorial

    Jan Pedersen and Knut Magne Risvik
    Yahoo! Inc., Search and Marketplace

    Slide 2:Agenda

    A Short History
    Internet Search Fundamentals
    Web Pages
    Indexing
    Ranking and Evaluation
    Third Generation Technologies

    Slide 3:A Short History

    Slide 4:Precursors

    Information Retrieval (IR) systems
        online catalogs and news
        limited scale, homogeneous text
        recall focus
        empirical: driven by results on evaluation collections
        free text queries shown to win over Boolean
    Specialized Internet access
        Gopher, WAIS, Archie
        FTP archives and special databases
        never achieved critical mass

    Slide 5:First Generation Systems

    1993: Mosaic opens the WWW
    1993: Architext/Excite (Stanford/Kleiner Perkins)
    1994: Webcrawler (full-text indexing)
    1994: Yahoo! (human-edited directory)
    1994: Lycos (400K indexed pages)
    1994: Infoseek (subscription service)
    Power systems
        1994: AltaVista (DEC Labs, advanced query syntax, large index)
        1996: Inktomi (massively distributed solution)

    Slide 6:Second Generation Systems

    Relevance matters
        1998: Direct Hit (clickthrough-based re-ranking)
        1998: Google (link-authority-based re-ranking)
    Size matters
        1999: FAST/AllTheWeb (scalable architecture)
    The user matters
        1996: Ask Jeeves (question answering)
    Money matters
        1997: Goto/Overture (pay-for-performance search)

    Slide 7:Third Generation Systems

    Market consolidation
        2002: Yahoo! purchases Inktomi
        2003: Overture purchases AV and FAST/AllTheWeb
        2003: MSN announces intention to build a search engine
    Search matures
        $2B market projected to grow to $6B by 2005
        required capital investment limits new players (Gigablast?)
        traffic focused in a few sites: Yahoo!, MSN, Google, AOL
        consumer use driven by brand marketing

    Slide 8:Web Search Fundamentals

    Slide 9:Web Fundamentals

    [Diagram labels: URL, User, Browser, Web Server, HTML Page, Page Rendering, Page Serving, Hyper Links, HTTP Request]

    Slide 10:Definitions

    URLs refer to WWW content
        referential integrity is not guaranteed
        roughly 10% of URLs go 404 every month
    HTTP requests fetch content from a server
        stateless protocol
        cookies provide partial state
    Web servers generate HTML pages
        pages can be static or dynamic (output of a program)
        markup tags determine page rendering
    HTML pages contain hyperlinks
        a link consists of a URL and anchor text
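    The last point, that a link is a URL plus anchor text, can be illustrated with a minimal sketch using Python's standard html.parser module. The class name LinkExtractor and the helper extract_links are hypothetical names chosen for this example, and the sample markup is a shortened piece of the HTML sample shown later in the deck; real crawlers handle many more cases (relative URLs, nested tags, malformed markup).

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (url, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only text between <a ...> and </a> counts as anchor text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    """Return the list of (url, anchor text) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<p><a href="http://www.amazon.com">Amazon.com</a> might be the '
    'largest laboratory [<a href="http://www.weigend.com/amazonjobs.html">'
    'Job openings.</a>]</p>'
)
for url, anchor in extract_links(sample):
    print(url, "->", anchor)
```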

    Slide 11:URLs

    URL definition
        http://host:port/path;params?query#fragment
        fragment is not considered part of the URL
        params are considered part of the path
        params are not frequently used
    Examples
        http://www.cnn.com/
        http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=6981062172
        http://us.imdb.com/Title?0068646
        http://www.sky.com/skynews/article/0,,30000-12261027,00.html
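    As an illustration of the URL anatomy above, here is a small sketch using Python's standard urllib.parse module. The helper name split_url is hypothetical. Note that urlparse reports params as a separate field, whereas the slide treats them as part of the path; urldefrag drops the fragment, matching the statement that the fragment is not part of the URL.

```python
from urllib.parse import urlparse, urldefrag


def split_url(url):
    """Break a URL into the components named in the slide."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,          # None when no explicit port is given
        "path": parts.path,
        "params": parts.params,      # text after ';' in the last path segment
        "query": parts.query,
        "fragment": parts.fragment,
    }


for example in (
    "http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=6981062172",
    "http://us.imdb.com/Title?0068646",
):
    print(split_url(example))

# The fragment is a client-side notion, so it is removed before fetching.
print(urldefrag("http://www.cnn.com/#top").url)
```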

    Slide 12:Dynamic URLs

    URLs with dynamic components
        Path (including params) and host are not dynamic
            if you change the path and/or host you will get a 404 or similar error
        Query is dynamic
            if you change the query part, you will get a valid page back
            source of a potentially infinite number of pages
    Examples
        http://www.cnn.com/index.html?test
            returns a valid 200 page, even if test is not a valid query term
        http://www.cnn.com/index.html;test
            returns a 404 error page
    Not all URLs follow this convention
        http://www.internetnews.com/xSP/article.php/1378731

    Slide 13:Dynamic Content

    Content depends on factors external to the URL
        cookies
        IP
        referrer
        User-Agent
    Examples
        http://my.yahoo.com/
        http://forum.doom9.org/forumdisplay.php?s=af9ddb31710c7b314b75262c1031d8af&forumid=65
    Dynamic URLs and dynamic content are orthogonal
        static URLs can refer to dynamic content
        dynamic URLs can refer to static content

    Slide 14:HTML Sample

    <html> <head> <title>Andreas S. WEIGEND, PhD</title> </head> <body> <blockquote><font face="Verdana,Tahoma,Arial" size=2> <h2><font size="4" face="Verdana, Arial, Helvetica, sans-serif">Andreas S. WEIGEND, </font><font size="3" face="Verdana, Arial, Helvetica, sans-serif">Ph.D.</font><font face="Verdana, Arial, Helvetica, sans-serif"><br> <font size="2">Chief Scientist, Amazon.com</font></font></h2> </font> <blockquote> <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><i>"Sophisticated algorithms have always been a big part of creating the Amazon.com customer experience." (Jeff Bezos, Founder and CEO of Amazon.com)</i></font></p></blockquote> <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> <a href="http://www.amazon.com">Amazon.com</a> might be the world's largest laboratory to study human behavior and decision making. It for sure is a place with very smart people, with a healthy attitude towards data, measurement, and modeling. I am responsible for research in machine learning and computational marketing. Applications range from real-time predictions of customer intent and satisfaction, to personalization and long-term optimization of pricing and promotions.<font size="-2"> [<a href="http://www.weigend.com/amazonjobs.html" onclick="window.open(this.href);return false;">Job openings.</a>] </font> I'm also the point person for academic relations.</font></p> </blockquote> <font face="Verdana,Tahoma,Arial" size=2> <h3> <font face="Verdana, Arial, Helvetica, sans-serif"><i><font size="3"> Schedule Summer 2003</font></i></font></h3> </font>

    Slide 15:Rendered Page

    Slide 16:WWW Size

    How many pages are in the WWW?
        Lawrence and Giles, 1999: 800M pages, with most pages not indexed
        dynamically generated pages imply effective size is infinite
    How many sites are registered?
        churn due to SPAM

    Slide 17:Crawling

    Search engine robot visits every page that will be indexed
        traversal behavior depends on crawl policy
    Index parameterized by size and freshness
        freshness is the time since last revisit if the page has changed
    Batch vs incremental
        Batch crawl has several distinct batch processing stages
            discover, grab, index
            AV discovery phase takes 10 days, grab another 10, etc.
            sharp freshness curve
        Incremental crawl
            crawler constantly operates, intermixing discovery with grab
            mild drop-off in freshness

    Slide 18:Typical Crawl/Build Architecture

    Slide 19:Relative Size

    From Search Engine Showdown
        Google claims 3B
        FAST claims 2.5B
        AV claims 1B

    Slide 20:Freshness

    From Search Engine Showdown
    Note hybrid indices: subindices with differing update rates

    Slide 21:Query Language

    Free text with implicit AND and implicit proximity
        syntax-free input
    Explicit Boolean
        AND (+)
        OR (|)
        AND NOT (-)
    Explicit phrasing (“”)
    Filters
        domain:  filetype:  host:  title:  link:  image:  url:  anchor:

    Slide 22:Query Serving Architecture

    Index divided into segments, each served by a node
    Each row of nodes replicated for query load
    Query integrator distributes query and merges results
    Front end creates an HTML page with the query results
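    The distribute-and-merge step of the query integrator can be sketched as follows. This is a toy model under stated assumptions, not the architecture of any particular engine: each segment is an in-memory dict, the score is a simple count of query-term occurrences, and the names segment_top_k and integrate are hypothetical.

```python
import heapq
from itertools import chain


def segment_top_k(segment, query_terms, k):
    """Work done by one index node: score its own documents only.

    segment maps doc_id -> text. A document matches when it contains every
    query term (implicit AND); the toy score counts term occurrences.
    """
    hits = []
    for doc_id, text in segment.items():
        words = text.lower().split()
        if all(term in words for term in query_terms):
            score = sum(words.count(term) for term in query_terms)
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def integrate(segments, query, k=10):
    """Query integrator: send the query to every segment, merge the top-k lists."""
    terms = query.lower().split()
    partial_results = [segment_top_k(seg, terms, k) for seg in segments]
    # Each node returns at most k hits, so the merge looks at <= k * nodes items.
    return heapq.nlargest(k, chain.from_iterable(partial_results))


segments = [
    {1: "cobra snake venom", 2: "cobra cobra health insurance"},
    {3: "travel deals", 4: "cobra"},
]
print(integrate(segments, "cobra", k=2))
```

    Because every node returns only its own top k, the integrator never needs the full posting lists, which is what lets the index be split across many machines.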

    Slide 23:Query Evaluation

    Index has two tables
        term to posting
        document ID to document data
    Postings record term occurrences
        may include positions
    Ranking employs postings to score documents
    Display employs document info
        fetched for top-scoring documents
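    The two-table layout can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a toy score (total number of query-term occurrences); build_index and evaluate are hypothetical names, and the example.* URLs are placeholders.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> (url, text). Returns the two tables from the slide."""
    postings = defaultdict(dict)   # term -> {doc_id: [positions]}
    doc_table = {}                 # doc_id -> document data used for display
    for doc_id, (url, text) in docs.items():
        doc_table[doc_id] = {"url": url, "text": text}
        for position, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    return postings, doc_table


def evaluate(query, postings, doc_table, k=10):
    """Rank with postings only, then fetch document data for the top k."""
    terms = query.lower().split()
    if not terms:
        return []
    # Implicit AND: keep documents that appear in every term's posting list.
    candidates = set(postings.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(postings.get(term, {}))
    scored = [
        (sum(len(postings[t][doc_id]) for t in terms), doc_id)
        for doc_id in candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # The document table is touched only for the documents that are displayed.
    return [(doc_table[doc_id]["url"], score) for score, doc_id in scored[:k]]


docs = {
    1: ("http://a.example/", "web search tutorial"),
    2: ("http://b.example/", "web web search engines"),
    3: ("http://c.example/", "cooking tutorial"),
}
postings, doc_table = build_index(docs)
print(evaluate("web search", postings, doc_table))
```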

    Slide 24:Scale

    Indices typically cover billions of pages
        terabytes of data
    Tens of millions of queries served every day
        translates to hundreds of queries per second
    Users require rapid response
        query must be evaluated in under 300 msecs
    Data centers typically employ thousands of machines
        individual component failures are common
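    As a rough check of the queries-per-second figure, assume 30 million queries per day (an illustrative value within "tens of millions", not a number from the slides) spread evenly over the day:

```latex
\[
\frac{3 \times 10^{7}\ \text{queries/day}}{86{,}400\ \text{s/day}}
\approx 347\ \text{queries/s}
\]
```

    Traffic is not uniform in practice, so peak load would sit above this average.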

    Slide 25:Search Results Page

    Blended results multiple sources Relevance ranked Assisted search Spell correction Specialized indices via Tabs Sponsored listing monetization Localization Country language experience

    Slide 26:Relevance Evaluation

    Slide 27:Relevance is Everything

    The search paradigm: 2.4 words, a few clicks, and you’re done
        only possible if results are very relevant
    Relevance is ‘speed’
        time from task initiation to resolution
        important factors: location of useful result, UI clutter, latency
    Relevance is relative
        context dependent, e.g. ‘football’ in the UK vs the US
        task dependent, e.g. ‘mafia’ when shopping vs researching

    Slide 28:Relevance is Hard to Measure

    Poorly defined, subjective notion
        depends on task, user context, etc.
    Analysts have focused on easier-to-measure surrogates
        index size, traffic, speed
        anecdotal relevance tests, e.g. vanity queries
    Requires survey methodology
        averaged over queries
        averaged over users

    Slide 29:Survey Methodologies

    Internal expert assessments
        assessments typically not replicated
        models absolute notion of relevance
    External consumer assessments
        assessments heavily replicated
        models statistical notion of relevance
    A/B surveys
        compare whole result sets
        visual relevance plays a large role
    URL surveys
        judge relevance of a particular URL for a query

    Slide 30:A/B Test Design

    Strategy: compare two ranking algorithms by asking panelists to compare pairs of search results
    Queries: 1000 semi-random queries, filtered for family-friendliness and understandability
        users can select from a list of 20 queries
    URLs: top 10 search results from 2 algorithms
    Voting: 5-point scale, 7 replications
        each user rates 6 queries, one of which is a control query
        control query has AV results on one side, random URLs on the other
        reject voters who take less than 10 seconds to vote

    Slide 31:Query selection screen

    Slide 32:Rating screen

    Slide 33:A/B Test Scoring

    Test ran until we had 400 decisive votes
        margin of error = 5%
    Compute
        Majority vote: count of queries where more than half of the users said one engine was “somewhat better” or “much better”
        Total vote: count of users that rated a result set “somewhat better” or “much better” for each engine
    Compare percentages
        test if one system ‘out votes’ the other
        determine if the difference is statistically significant
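    The scoring above can be sketched in Python. Several details here are assumptions made for illustration rather than the authors' procedure: votes on the 5-point scale are encoded as integers from -2 (engine A much better) to +2 (engine B much better), a 0 is a non-decisive vote, and the significance check simply asks whether one engine's share of decisive votes differs from 50% by more than the 1/sqrt(n) margin of error quoted in the slides. All function names are hypothetical.

```python
import math


def margin_of_error(n):
    """Approximate 95% margin of error for a proportion from n votes."""
    return 1.0 / math.sqrt(n)


def total_vote(votes):
    """Count decisive votes for each engine (negative = A, positive = B)."""
    votes_a = sum(1 for v in votes if v < 0)
    votes_b = sum(1 for v in votes if v > 0)
    return votes_a, votes_b


def majority_vote(votes_by_query):
    """Count queries where more than half of the raters preferred one engine."""
    wins_a = wins_b = 0
    for votes in votes_by_query.values():
        votes_a, votes_b = total_vote(votes)
        if votes_a > len(votes) / 2:
            wins_a += 1
        elif votes_b > len(votes) / 2:
            wins_b += 1
    return wins_a, wins_b


def is_significant(votes_a, votes_b):
    """True when A's share of decisive votes is outside 50% +/- the margin."""
    decisive = votes_a + votes_b
    if decisive == 0:
        return False
    share_a = votes_a / decisive
    return abs(share_a - 0.5) > margin_of_error(decisive)


print(margin_of_error(400))                      # 400 decisive votes
print(majority_vote({"q1": [-2, -1, -1, 0, 1, -2, -1],
                     "q2": [1, 2, 0, 1, 1, -1, 2]}))
print(is_significant(230, 170))
```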

    Slide 34:Results

    Control votes (error bar = 1/sqrt(160) = 7.9%)
    Test One: AV vs SE1 (error bar = 1/sqrt(400) = 5%)

    Slide 35:Results

    Test Three: SE1 vs SE2
    Test Two: AV vs SE2 (with UI issue)

    Slide 36:Ranking

    Given 2.4 query terms, search 2B documents and return 10 highly relevant results in 300 msecs
    Problem queries
        Travel (matches 32M documents)
        John Ellis (which one?)
        Cobra (medical or animal?)
    Query types
        Navigational (known-item retrieval)
        Informational
    Ingredients
        Keyword match (title, abstract, body)
        Anchor text (referring text)
        Quality (link connectivity)
        User feedback (clickrate analysis)

    Slide 37:The Components of Relevance

    First generation
        keyword matching
        title and abstract worth more
    Second generation
        computed document authority
            based on link analysis
        anchor text matching
            webmaster voting
    Development cycle: tune ranking, evaluate metrics

    Slide 38:Connectivity

    Slide 39:Connectivity Goals

    An indicator of authority
        as measured by static links
    Each link is a ‘vote’ in favor of a site
        webmasters are the voters
    Not all links are equal
        links from authoritative sites are worth more
            introduces an interesting circularity
        votes from sites with many links are discounted
            use your vote wisely
        discount navigational links
            not all links are editorial
        account for link SPAM

    Slide 40:Connectivity Network

    [Figure: example link graph with nodes A and B]
    What is the authority score for nodes A and B?
        Inlink count computes: A = 3, B = 2
        Page Rank computes: A = .225, B = .295

    Slide 41:Definitions

    Connectivity graph
        nodes are pages (or hosts)
        directed edges are links
        graph edges can be represented as a transition matrix, A
            the ith row of A represents the links out from node i
    Authority score
        score associated with each node
        some function of inlinks to node and outlinks from node
        simplest authority score is inlink count

    Page Rank score r
        contribution of each node is averaged over all of its outlinks
        node score is the sum of contributions: $r(j) = \sum_i A_{ij}\, r(i)$
    Fixed point equation: $r = A^{T} r$
        if A is normalized, each row sums to 1.0

    Slide 42:Page Rank (Without Random Jump)

    [Figure: example link graph with Page Rank scores A = .25 and B = .3; remaining labels: .1, .1, .1, 1/2, 1/2]

    Slide 43:Page Rank Implications

    A is a stochastic matrix
        r(i) can be interpreted as a probability
        suppose a surfer takes an outlink at random
        r(i) is the long-run probability of landing at a particular node
    Solution to the fixed point equation is the principal eigenvector
        principal eigenvalue is 1.0
    Solution can be found by iteration
        If [equation] then [equation]
        start with a random initial value for r
        iterate multiplication by A
        contribution of smaller eigenvalues will drop out
        final value is a good estimate of the fixed point solution
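    The iteration described for Page Rank without a random jump can be sketched in plain Python. The three-node graph below is a hypothetical example, not the network from the slides, and the function name page_rank is an assumption. The sketch starts from a uniform vector rather than a random one and assumes every node has at least one outlink and that the graph is strongly connected and aperiodic, so the iteration converges.

```python
def page_rank(links, iterations=100):
    """Power iteration on the row-normalized link matrix.

    links maps each node to the list of nodes it links to. Every node must
    appear as a key and have at least one outlink (no random jump here).
    """
    nodes = sorted(links)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        next_rank = {node: 0.0 for node in nodes}
        for source in nodes:
            # A node's score is split evenly over its outlinks.
            share = rank[source] / len(links[source])
            for target in links[source]:
                next_rank[target] += share
        rank = next_rank
    return rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for node, score in page_rank(example).items():
    print(node, round(score, 3))
```

    For this graph the fixed point can be checked by hand: b receives half of a's score, and c receives the other half plus all of b's, so a and c end up equal and b gets half as much.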

    What’s the score for a node with no in-links?
    Revised equation
    Fixed point equation
    Probability interpretation
        as before, with a chance ε of jumping randomly
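    A minimal sketch of Page Rank with a random jump follows. The exact revised equation is not recoverable from the transcript, so this uses one common normalized form as an assumption: with probability epsilon the surfer jumps to a node chosen uniformly at random, otherwise follows a random outlink. The graph, the function name page_rank_with_jump, and the uniform treatment of nodes without outlinks are all choices made for this example.

```python
def page_rank_with_jump(links, epsilon=0.1, iterations=100):
    """Power iteration with a uniform random jump of probability epsilon.

    links maps each node to the list of nodes it links to; every node must
    appear as a key. Nodes without outlinks spread their score uniformly.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump mass, even with no in-links.
        next_rank = {node: epsilon / n for node in nodes}
        for source in nodes:
            targets = links[source] if links[source] else nodes
            share = (1.0 - epsilon) * rank[source] / len(targets)
            for target in targets:
                next_rank[target] += share
        rank = next_rank
    return rank


# Hypothetical graph: d has no in-links; a and b link to each other.
example = {"d": ["a"], "a": ["b"], "b": ["a"]}
for node, score in page_rank_with_jump(example, epsilon=0.1).items():
    print(node, round(score, 4))
```

    Under this form a node with no in-links keeps a score of exactly epsilon / n, which answers the question posed on the slide for this particular normalization.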

    Slide 44:Page Rank (with random jump)

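    [Figure: example link graph with Page Rank scores A = .225 and B = .293 for random jump probability ε = 0.1; remaining labels: .1, .1, .1, 1/2, 1/2]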

    Slide 45:Eigenrank

    Separates internal from external links
        internal transition matrix I
        external transition matrix E
    Introduces a new parameter
        ε is the random jump probability
        β is the probability of taking an internal link
        (1 - ε - β) is the probability of taking an external link

    Revised equation
    Fixed point equation
    Probability interpretation
        ε chance of random jump
        β chance of internal link
        (1 - ε - β) chance of external link

    Slide 46:Eigenrank

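    [Figure: example link graph with Eigenrank scores A = .2 and B = .202, with random jump probability 0.1 and internal link probability 0.1; remaining labels: .1, .1, .1, 1/2, 1/2]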

    Slide 47:Computational Issues

    Nodes with no outlinks
        transition matrix with zero row
        internal or external
    Leave out of computation(?)
    Redistribute mass to random jump(?)
        currently mass is redistributed
        complex formula that prefers external links

    Slide 48:Kleinberg

    Two scores
        authority score, a
        hub score, h
    Fixed point equations
        authority
        hub
    Principal eigenvectors are solutions
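    Kleinberg's hub and authority scores can be sketched as an alternating iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with both vectors rescaled each round. The equations on the original slide are not in the transcript, so this follows the standard formulation; the graph and the function name hits are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (authority, hub) score dicts for a link graph.

    links maps a node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores over in-links.
        authority = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # Hub update: sum of authority scores over out-links.
        hub = {
            node: sum(authority[t] for t in links.get(node, ()))
            for node in nodes
        }
        # Rescale to unit length so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        authority = {node: v / a_norm for node, v in authority.items()}
        hub = {node: v / h_norm for node, v in hub.items()}
    return authority, hub


# Hypothetical graph: h1 links to x and y, h2 links to x only.
authority, hub = hits({"h1": ["x", "y"], "h2": ["x"]})
print({k: round(v, 3) for k, v in authority.items()})
print({k: round(v, 3) for k, v in hub.items()})
```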

    Slide 49:SPAM

    Manipulation of content purely to influence ranking
        dictionary SPAM
        link sharing
        domain hi-jacking
        link farms
    Robotic use of search results
        meta-search engines
        search engine optimizers
        fraud

    Slide 50:Third Generation Technologies

    Slide 51:Handling Ambiguity

    Results for query: Cobra

    Slide 52:Impression Tracking

    Incoherent urls are those that receive high rank for a large diversity of queries. Many incoherent urls indicate SPAM or a bug (as in this case).
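    One simple way to flag incoherent URLs from an impression log is to count how many distinct queries each URL ranks highly for. The sketch below is an assumption about how such tracking might be done, not the method behind the slide: the log format (query, url, rank), both thresholds, and the function name incoherent_urls are hypothetical.

```python
from collections import defaultdict


def incoherent_urls(impressions, max_rank=10, min_queries=3):
    """Return URLs shown at rank <= max_rank for at least min_queries queries.

    impressions is an iterable of (query, url, rank) tuples, rank 1 = top.
    """
    queries_by_url = defaultdict(set)
    for query, url, rank in impressions:
        if rank <= max_rank:
            queries_by_url[url].add(query)
    return sorted(
        url for url, queries in queries_by_url.items()
        if len(queries) >= min_queries
    )


log = [
    ("cobra", "http://spam.example/", 1),
    ("travel", "http://spam.example/", 2),
    ("john ellis", "http://spam.example/", 1),
    ("cobra", "http://snakes.example/", 3),
    ("travel", "http://snakes.example/", 40),
]
print(incoherent_urls(log))
```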

    Slide 53:Clickrate Relevance Metric

    Average highest rank clicked perceptibly increased with the release of a new rank function.
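    The metric named here can be computed from click logs roughly as follows. The interpretation is an assumption: rank 1 is the top result, the "highest rank clicked" in a session is the numerically smallest clicked position, and sessions without clicks are ignored. The function name and the input format are hypothetical.

```python
def average_highest_rank_clicked(sessions):
    """Average, over sessions with clicks, of the best clicked position.

    sessions is a list of lists of clicked result positions (1 = top).
    Returns None when no session contains a click.
    """
    best_positions = [min(clicks) for clicks in sessions if clicks]
    if not best_positions:
        return None
    return sum(best_positions) / len(best_positions)


# Four query sessions; the third had no clicks.
print(average_highest_rank_clicked([[3, 1, 7], [2], [], [5, 4]]))
```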

    Slide 54:User Interface

    Ranked result lists
        document summaries are critical
            hit highlighting
            dynamic abstracts
            URL
        no recent innovation
            graphical presentations not well fit to the task
    Blending
        predefined segmentation, e.g. paid listings
        intermixed with results from other sources, e.g. news

    Slide 55:Future Trends

    Question answering
        WWW as language model
        enables simple methods, e.g. Dumais et al. (SIGIR 2002)
    New contexts
        ubiquitous searching: toolbars, desktop, phone
        implicit searching: computed links
    New tasks
        e.g. local/country search

    Slide 56:Bibliography

    Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. John Wiley & Sons; May 28, 2003.
    Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti. Morgan Kaufmann; August 15, 2002.
    The Anatomy of a Large-scale Hypertextual Web Search Engine, by S. Brin and L. Page. 7th International WWW Conference, Brisbane, Australia; April 1998.
    Websites
        http://www.searchenginewatch.com/
        http://www.searchengineshowdown.com/
    Presentations
        http://infonortics.com/searchengines/sh03/slides/evans.pdf
