Understanding Web Search Engines: Algorithms, Goals & Evolution

Web Algorithmics Web Search Engines

Goal of a Search Engine Retrieve docs that are “relevant” for the user query • Doc: file word or pdf, web page, email, blog, e-book,... • Query: paradigm “bag of words” Relevant ?!?

The Web: Language and encodings:hundreds… Distributed authorship: SPAM, format-less,… Dynamic: in one year 35% survive, 20% untouched The User: Query composition: short (2.5 terms avg) and imprecise Query results: 85% users look at just one result-page Several needs: Informational, Navigational, Transactional Two main difficulties Extracting “significant data” is difficult !! Matching “user needs” is difficult !!

Evolution of Search Engines • First generation-- use only on-page, web-text data • Word frequency and language • Second generation-- use off-page, web-graph data • Link (or connectivity) analysis • Anchor-text (How people refer to a page) • Third generation-- answer “the need behind the query” • Focus on “user need”, rather than on query • Integrate multiple data-sources • Click-through data 1995-1997 AltaVista, Excite, Lycos, etc 1998: Google Google, Yahoo, MSN, ASK,……… Fourth generation  Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

This is a search engine!!!

Wolfram Alpha

Clusty

Yahoo! Correlator

Web Algorithmics The structure of a Search Engine

? Page archive Crawler Web Page Analizer Indexer Query resolver Ranker Control text auxiliary Structure The structure Query

Generating the snippets !

The big fight: find the best ranking...

Ranking: Google vs Google.cn

Problem: Indexing • Consider Wikipedia En: • Collection size ≈ 10 Gbytes • # docs ≈4 * 106 • #terms in total > 1 billion(avg term len = 6 chars) • #terms distinct = several millions • Which kind of data structure do we build to support word-based searches ?

DB-based solution: Term-Doc matrix #docs ≈ 4M #terms > 1M Space ≈ 4Tb ! 1 if play contains word, 0 otherwise

2 4 6 10 32 1 2 3 5 8 13 21 34 Current solution: Inverted index • A term like Calpurnia may use log2 N bits per occurrence • A term like the should take about 1 bit per occurrence Brutus the Calpurnia 13 16 Currently they get 13% original text

Gap-coding for postings • Sort the docIDs • Store gaps between consecutive docIDs: • Brutus: 33, 47, 154, 159, 202 … 33, 14, 107, 5, 43 … Two advantages: • Space: store smaller integers (clustering?) • Speed: query requires just a scan

g-code for integer encoding Length-1 • v > 0 and Length = log2 v +1 e.g., v=9 represented as <000,1001>. • g-code for v takes 2 log2 v +1 bits (ie. factor of 2 from optimal) • Optimal for Pr(v) = 1/2v2, and i.i.d integers

Rice code (simplification of Golomb code) [q times 0s] 1 Log k bits • It is a parametric code: depends on k • Quotient q=(v-1)/k, and the rest is r= v – k * q – 1 • Useful when integers concentrated around k • How do we choose k ? • Usually k  0.69 * mean(v) [Bernoulli model] • Optimal for Pr(v) = p (1-p)v-1, where mean(v)=1/p, and i.i.d ints

PForDelta coding Use b (e.g. 2) bits to encode 128 numbers or create exceptions 3 42 2 3 3 1 1 … 3 3 23 1 2 11 10 11 11 01 01 … 11 11 01 10 42 23 a block of 128 numbers Translate data: [base, base + 2b-1]  [0,2b-1] Encode exceptions: ESC or pointers Choose b to encode 90% values, or trade-off: b waste more bits, b more exceptions

Interpolative coding D = 1 1 1 2 2 2 2 4 3 1 1 1 M = 1 2 3 5 7 9 11 15 18 19 20 21 lo=9+1=10, hi=21, num=6 lo=1, hi=9-1=8, num=5 • Recursive coding  preorder traversal of a balanced binary tree • At every step we know (initially, they are encoded): • num= |M| = 12, Lidx=1, low= 1, Ridx=12, hi= 21 • Take the middle element: h= (Lidx+Ridx)/2=6  M[6]=9, • left_size= h – Lidx = 5, right_size= Ridx-h = 6 • low+left_size =1+5 = 6 ≤ M[h] ≤ hi– right_size=(21 – 6)= 15 • We can encode 9 in log2 (15-6+1) = 4 bits

2 4 6 13 32 1 2 3 5 8 13 21 34 Query processing • Retrieve all pages matching the query Brutus the Caesar 4 13 17

2 4 6 13 32 1 2 3 5 8 13 21 34 Some optimization Best order for query processing ? • Shorter lists first… Brutus The Calpurnia 4 13 17 Query: BrutusANDCalpurniaANDThe

Phrase queries • Expand the posting lists with word positions • to: • 2:1,17,74,222,551;4:8,16,190,429,433;7:13,23,191; ... • be: • 1:17,19; 4:17,191,291,430,434;5:14,19,101; ... • Larger space occupancy,  5÷8% on Web

2 4 6 13 32 1 2 3 5 8 13 21 34 Query processing • Retrieve all pages matching the query • Order pages according to various scores: • Term position & freq (body, title, anchor,…) • Link popularity • User clicks or preferences Brutus the Caesar 4 13 17

? Page archive Crawler Web Page Analizer Indexer Query resolver Ranker Control text auxiliary Structure The structure Query

Web Algorithmics Text-based Ranking (1° generation)

æ ö n = ç ÷ idf log where nt = #docs containing term t n = #docs in the indexed collection è ø t n t A famous “weight”: tf-idf = tf Frequency of term t in doc d = #occt / |d| t,d Vector Space model

Postulate: Documents that are “close together” in the vector space talk about the same things. Euclidean distance sensible to vector length !! Easy to Spam A graphical example t3 cos(a)= v w / ||v|| * ||w|| d2 d3 Sophisticated algos to find top-k docs for a query Q d1 a t1 d5 t2 d4 The user query is a very short doc

Approximate top-k results • Preprocess: Assign to each term, its m best documents • Search: • If |Q| = q terms, merge their preferred lists ( mq answers). • Compute COS between Q and these docs, and choose the top k. Need to pick m>k to work well empirically. Now SE use tf-idf PLUS PageRank (PLUS other weights)

Web Algorithmics Link-based Ranking (2° generation)

Query-independent ordering • First generation: using link counts as simple measures of popularity. • Undirected popularity: • Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5). • Directed popularity: • Score of a page = number of its in-links (es. 3). Easy to SPAM

Second generation: PageRank • Each link has its own importance!! • PageRank is • independent of the query • many interpretations…

Any node d 1-d Neighbors Basic Intuition…

Principal eigenvector Google’s Pagerank Fixed value B(i) : set of pages linking to i. #out(i): number of outgoing links from i.

Any node d “In the steady state” each page has a long-term visit rate - use this as the page’s score. 1-d Neighbors Three different interpretations • Graph (intuitive interpretation) • Co-citation • Matrix (easy for computation) • Eigenvector computation or a linear system solution • Markov Chain (useful to prove convergence) • a sort of Usage Simulation

Pagerank: use in Search Engines • Preprocessing: • Given graph of links, build matrix L • Compute its principal eigenvector r • r[i] is the pagerank of page i We are interested in the relative order • Query processing: • Retrieve pages containing query terms • Rank them by their Pagerank The final order is query-independent

Understanding Web Search Engines: Algorithms, Goals & Evolution

Understanding Web Search Engines: Algorithms, Goals & Evolution

Presentation Transcript

GAMMA: Generative Aspect Modeling for Middleware Algorithmics

Experimental Algorithmics Reading Group, UBC, CS

The Algorithmics of Probabilistic Automata Weak Bisimulation

Introduction to Computer Algorithmics and Programming Ceng 113

Introduction and overview of FPT algorithmics

Algorithmics for Software Transactional Memory

Introduction to Computer Algorithmics and Programming Ceng 113

ENEE651/CMSC751: Parallel Algorithmics, Spring 2014 Time and Location

Web Algorithmics

Web Algorithmics

Algorithmics in braid groups

Algorithmics of -1 frameshift RNA sequences

Algorithmics and Complexity

Algorithmics animation workshop Minimum Spanning trees

ENEE651/CMSC751: Parallel Algorithmics, Spring 2014 Time and Location

Experimental Algorithmics Reading Group, UBC, CS

Algorithmics and Complexity

Introduction to Computer Algorithmics and Programming Ceng 113

Introduction to Computer Algorithmics and Programming Ceng 113

Introduction to Computer Algorithmics and Programming Ceng 113