1 / 41

Web Algorithmics

Web Algorithmics. Web Search Engines. Goal of a Search Engine. Retrieve docs that are “relevant” for the user query Doc : file word or pdf, web page, email, blog, e-book,... Query : paradigm “bag of words” Relevant ?!?. The Web: Language and encodings: hundreds…

raylawrence
Download Presentation

Web Algorithmics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Web Algorithmics Web Search Engines

  2. Goal of a Search Engine Retrieve docs that are “relevant” for the user query • Doc: file word or pdf, web page, email, blog, e-book,... • Query: paradigm “bag of words” Relevant ?!?

  3. The Web: Language and encodings:hundreds… Distributed authorship: SPAM, format-less,… Dynamic: in one year 35% survive, 20% untouched The User: Query composition: short (2.5 terms avg) and imprecise Query results: 85% users look at just one result-page Several needs: Informational, Navigational, Transactional Two main difficulties Extracting “significant data” is difficult !! Matching “user needs” is difficult !!

  4. Evolution of Search Engines • First generation-- use only on-page, web-text data • Word frequency and language • Second generation-- use off-page, web-graph data • Link (or connectivity) analysis • Anchor-text (How people refer to a page) • Third generation-- answer “the need behind the query” • Focus on “user need”, rather than on query • Integrate multiple data-sources • Click-through data 1995-1997 AltaVista, Excite, Lycos, etc 1998: Google Google, Yahoo, MSN, ASK,……… Fourth generation  Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

  5. +$ -$

  6. This is a search engine!!!

  7. Wolfram Alpha

  8. Clusty

  9. Yahoo! Correlator

  10. Web Algorithmics The structure of a Search Engine

  11. ? Page archive Crawler Web Page Analizer Indexer Query resolver Ranker Control text auxiliary Structure The structure Query

  12. Generating the snippets !

  13. The big fight: find the best ranking...

  14. Ranking: Google vs Google.cn

  15. Problem: Indexing • Consider Wikipedia En: • Collection size ≈ 10 Gbytes • # docs ≈4 * 106 • #terms in total > 1 billion(avg term len = 6 chars) • #terms distinct = several millions • Which kind of data structure do we build to support word-based searches ?

  16. DB-based solution: Term-Doc matrix #docs ≈ 4M #terms > 1M Space ≈ 4Tb ! 1 if play contains word, 0 otherwise

  17. 2 4 6 10 32 1 2 3 5 8 13 21 34 Current solution: Inverted index • A term like Calpurnia may use log2 N bits per occurrence • A term like the should take about 1 bit per occurrence Brutus the Calpurnia 13 16 Currently they get 13% original text

  18. Gap-coding for postings • Sort the docIDs • Store gaps between consecutive docIDs: • Brutus: 33, 47, 154, 159, 202 … 33, 14, 107, 5, 43 … Two advantages: • Space: store smaller integers (clustering?) • Speed: query requires just a scan

  19. g-code for integer encoding Length-1 • v > 0 and Length = log2 v +1 e.g., v=9 represented as <000,1001>. • g-code for v takes 2 log2 v +1 bits (ie. factor of 2 from optimal) • Optimal for Pr(v) = 1/2v2, and i.i.d integers

  20. Rice code (simplification of Golomb code) [q times 0s] 1 Log k bits • It is a parametric code: depends on k • Quotient q=(v-1)/k, and the rest is r= v – k * q – 1 • Useful when integers concentrated around k • How do we choose k ? • Usually k  0.69 * mean(v) [Bernoulli model] • Optimal for Pr(v) = p (1-p)v-1, where mean(v)=1/p, and i.i.d ints

  21. PForDelta coding Use b (e.g. 2) bits to encode 128 numbers or create exceptions 3 42 2 3 3 1 1 … 3 3 23 1 2 11 10 11 11 01 01 … 11 11 01 10 42 23 a block of 128 numbers Translate data: [base, base + 2b-1]  [0,2b-1] Encode exceptions: ESC or pointers Choose b to encode 90% values, or trade-off: b waste more bits, b more exceptions

  22. Interpolative coding D = 1 1 1 2 2 2 2 4 3 1 1 1 M = 1 2 3 5 7 9 11 15 18 19 20 21 lo=9+1=10, hi=21, num=6 lo=1, hi=9-1=8, num=5 • Recursive coding  preorder traversal of a balanced binary tree • At every step we know (initially, they are encoded): • num= |M| = 12, Lidx=1, low= 1, Ridx=12, hi= 21 • Take the middle element: h= (Lidx+Ridx)/2=6  M[6]=9, • left_size= h – Lidx = 5, right_size= Ridx-h = 6 • low+left_size =1+5 = 6 ≤ M[h] ≤ hi– right_size=(21 – 6)= 15 • We can encode 9 in log2 (15-6+1) = 4 bits

  23. 2 4 6 13 32 1 2 3 5 8 13 21 34 Query processing • Retrieve all pages matching the query Brutus the Caesar 4 13 17

  24. 2 4 6 13 32 1 2 3 5 8 13 21 34 Some optimization Best order for query processing ? • Shorter lists first… Brutus The Calpurnia 4 13 17 Query: BrutusANDCalpurniaANDThe

  25. Phrase queries • Expand the posting lists with word positions • to: • 2:1,17,74,222,551;4:8,16,190,429,433;7:13,23,191; ... • be: • 1:17,19; 4:17,191,291,430,434;5:14,19,101; ... • Larger space occupancy,  5÷8% on Web

  26. 2 4 6 13 32 1 2 3 5 8 13 21 34 Query processing • Retrieve all pages matching the query • Order pages according to various scores: • Term position & freq (body, title, anchor,…) • Link popularity • User clicks or preferences Brutus the Caesar 4 13 17

  27. ? Page archive Crawler Web Page Analizer Indexer Query resolver Ranker Control text auxiliary Structure The structure Query

  28. Web Algorithmics Text-based Ranking (1° generation)

  29. æ ö n = ç ÷ idf log where nt = #docs containing term t n = #docs in the indexed collection è ø t n t A famous “weight”: tf-idf = tf Frequency of term t in doc d = #occt / |d| t,d Vector Space model

  30. Postulate: Documents that are “close together” in the vector space talk about the same things. Euclidean distance sensible to vector length !! Easy to Spam A graphical example t3 cos(a)= v w / ||v|| * ||w|| d2 d3 Sophisticated algos to find top-k docs for a query Q d1 a t1 d5 t2 d4 The user query is a very short doc

  31. Approximate top-k results • Preprocess: Assign to each term, its m best documents • Search: • If |Q| = q terms, merge their preferred lists ( mq answers). • Compute COS between Q and these docs, and choose the top k. Need to pick m>k to work well empirically. Now SE use tf-idf PLUS PageRank (PLUS other weights)

  32. Web Algorithmics Link-based Ranking (2° generation)

  33. Query-independent ordering • First generation: using link counts as simple measures of popularity. • Undirected popularity: • Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5). • Directed popularity: • Score of a page = number of its in-links (es. 3). Easy to SPAM

  34. Second generation: PageRank • Each link has its own importance!! • PageRank is • independent of the query • many interpretations…

  35. Any node d 1-d Neighbors Basic Intuition…

  36. Principal eigenvector Google’s Pagerank Fixed value B(i) : set of pages linking to i. #out(i): number of outgoing links from i.

  37. Any node d “In the steady state” each page has a long-term visit rate - use this as the page’s score. 1-d Neighbors Three different interpretations • Graph (intuitive interpretation) • Co-citation • Matrix (easy for computation) • Eigenvector computation or a linear system solution • Markov Chain (useful to prove convergence) • a sort of Usage Simulation

  38. Pagerank: use in Search Engines • Preprocessing: • Given graph of links, build matrix L • Compute its principal eigenvector r • r[i] is the pagerank of page i We are interested in the relative order • Query processing: • Retrieve pages containing query terms • Rank them by their Pagerank The final order is query-independent

More Related