Search Engines IST 516 Fall 2011 Dongwon Lee, Ph.D.
Search Engine Overview
• A search engine typically consists of:
  • Crawler: crawls the web to identify interesting URLs to fetch
  • Fetcher: fetches web documents from the stored URLs
  • Indexer: builds local indexes (e.g., an inverted index) from the fetched web documents
  • Query Handler: processes users' queries using the indexes and prepares answers accordingly
  • Presenter: user interface component that presents answers to users
1. Crawler (more later)
• Also called a robot or spider
• Views the Web as a graph
• Employs different graph traversal strategies (see the sketch below):
  • Depth-first
  • Breadth-first
  • Frontier: the set of discovered-but-not-yet-visited URLs
• Objectives:
  • Completeness
  • Freshness
  • Resource maximization
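A minimal sketch of a frontier-based crawler, assuming a hypothetical `fetch_links(url)` helper that downloads a page and returns its outgoing URLs; swapping `popleft()` for `pop()` turns the breadth-first traversal into a depth-first one:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl. `fetch_links(url)` is a hypothetical helper
    that downloads a page and returns the URLs it links to."""
    frontier = deque(seed_urls)   # discovered but not yet visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # popleft() = breadth-first; pop() = depth-first
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```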
2. Fetcher
• The crawler generates a stack of URLs to visit
• The fetcher retrieves the web document at each URL
• Typically multi-threaded, as in the sketch below
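A minimal multi-threaded fetcher sketch using only Python's standard library; the worker count, timeout, and error handling are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one document; returns (url, bytes) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception:
        return url, None

def fetch_all(urls, workers=8):
    """Fetch many URLs concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))
```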
3. Indexer
• To handle the large-scale data of the Web, one needs to build an index structure
• An index is a small (memory-resident) data structure that helps locate data fast (at the cost of extra space)
  • Trading space for time
• In search engines and IR, a popular form of index is the Inverted Index
Inverted Index
• A list for every word (index term)
• The list of term t holds the locations (documents + offsets) where t appears (see the sketch below)
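A minimal sketch of building an inverted index mapping each term to (document id, word offset) postings; the three tiny documents are made-up examples reused from the vector-space slides later in this deck:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, offset), ...]},
    where offset is the term's word position within the document."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].append((doc_id, offset))
    return index

index = build_inverted_index({1: "penn state football",
                              2: "state gov",
                              3: "psu football"})
print(index["state"])   # [(1, 1), (2, 0)]
```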
4. Query Handler
• Given a query keyword Q, which web documents are the right answers? E.g.:
  • Boolean-matching model: return all documents that contain Q (see the sketch below)
  • Vector space model: return a ranked list of documents that have the largest cosine similarity to Q
  • PageRank model: return a ranked list of documents that have the highest PageRank values, in combination with another matching model
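Building on the inverted-index sketch above, a minimal Boolean-AND query handler: it intersects the posting lists of every query term.

```python
def boolean_and(index, query):
    """Return the ids of documents containing every query term (Boolean AND)."""
    postings = [{doc for doc, _ in index.get(term, [])}
                for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Reusing the `index` built in the inverted-index sketch above:
print(boolean_and(index, "state football"))   # {1}
```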
Vector Space Model
• Cosine similarity: the similarity of two same-length vectors, measured by the cosine of the angle between them
  • cos(V1, V2) = (V1 · V2) / (||V1|| ||V2||)
  • Range: -1 to +1
  • Dot product: V1 · V2
  • Magnitude: ||V1||
Vector Space Model
• Given N web documents gathered, extract the set of all significant tokens (i.e., words), say T
  • |T| becomes the dimension of the vectors
• Convert each web document w_i to a |T|-length Boolean vector, say V_w_i
• Given a query string Q, convert it to a |T|-length Boolean vector, say V_Q
• Compute the cosine similarity between V_Q and each V_w_i
• Sort the similarity scores in descending order
Vector Space Model Example
• 3 documents:
  • D1 = {penn, state, football}
  • D2 = {state, gov}
  • D3 = {psu, football}
• Vector space representation over V = {football, gov, penn, psu, state}:
  • D1 = [1, 0, 1, 0, 1]
  • D2 = [0, 1, 0, 0, 1]
  • D3 = [1, 0, 0, 1, 0]
• Query Q = {state, football}
  • V_Q = [1, 0, 0, 0, 1]
• Which document is the closest to Q? (worked out in the sketch below)
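A minimal sketch that works this example out with cosine similarity; the vector layout follows the vocabulary V above:

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = (v1 . v2) / (||v1|| ||v2||)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Vocabulary V = {football, gov, penn, psu, state}
docs = {"D1": [1, 0, 1, 0, 1], "D2": [0, 1, 0, 0, 1], "D3": [1, 0, 0, 1, 0]}
q = [1, 0, 0, 0, 1]
for name, vec in docs.items():
    print(name, round(cosine(q, vec), 4))
# D1 0.8165, D2 0.5, D3 0.5  ->  D1 is the closest to Q
```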
Term-Weighting
• Instead of the Boolean vector space model, each dimension carries an importance weight for the corresponding token
  • [1, 0, 0, 1, 0] → [0.9, 0.12, 0.14, 0.89, 0.13]
Term-Weighting
• E.g., in tf-idf, the importance of a term t (a sketch follows):
  • Increases proportionally to the number of times t appears in the document (tf)
  • Is offset by the number of documents in the corpus that contain t (idf)
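A minimal tf-idf sketch using one common weighting variant, w(t, d) = tf(t, d) × log(N / df(t)); the slides do not pin down the exact formula, so this variant is an assumption:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using the common variant w(t, d) = tf(t, d) * log(N / df(t))."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)                                    # term frequency
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["penn", "state", "football"], ["state", "gov"], ["psu", "football"]]
print(tf_idf(docs)[0])
# 'penn' (in 1 of 3 docs) outweighs 'state' and 'football' (each in 2 of 3 docs)
```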
PageRank
• Uses the link graph of the web
• Prioritizes the results of a search
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1 … Tn are the pages linking to A
  • d: damping factor (e.g., 0.85)
  • C(Ti): # of outgoing links of page Ti
• A page has high PageRank if:
  • Many pages point to it, or
  • Some pages that point to it have high PageRank
PageRank Example
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• E.g., suppose four pages link to A, with PageRanks 8, 4, 1, 2 and out-degrees 1, 2, 4, 3 respectively (figure omitted):
  • PR(A) = (1 - 0.85) + 0.85 (8/1 + 4/2 + 1/4 + 2/3) ≈ 0.15 + 0.85 × 10.9167 ≈ 9.4292
Effect of Link Structure (1)
• Example of simple PR
• Initial PRs of pages A, B, and C are all 0.15
• (Figure: four three-page link graphs over pages A, B, and C with different link structures and the resulting PR values: A = 0.15 / 1 / 1.85 / 1.4594, B = 0.2775 / 1 / 0.575 / 1, C = 0.15 / 1 / 0.575 / 0.575.)
Effect of Link Structure (2)
• Example of practical PR (see the power-iteration sketch below)
• Rank sink: a page with no outbound links, so PR leaks out of the graph
• (Figure: a five-page link graph over pages A–E computed as "Simple PR" vs. "Practical PR" with a sink; one run yields A = 0.6277, B = 0.6836, C = 0.4405, D = 1.5735, E = 1.6747, and the other A = 1.1922, B = 1.1634, C = 0.6444, D = 0.1500, E = 0.1500.)
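A minimal power-iteration sketch of the slide's (non-normalized) PageRank formula; the tiny graph, damping factor, and iteration count are illustrative, and pages that are absent from `links` simply act as rank sinks:

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Iterates the slide's formula
    PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}          # arbitrary initial values
    for _ in range(iters):
        pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                                   for q in links if p in links[q])
              for p in pages}
    return pr

# Tiny graph: A <-> B, A -> C (C has no outbound links, i.e., a rank sink)
print(pagerank({"A": ["B", "C"], "B": ["A"]}))
```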
5. Presenter
• Different presentation models
• Simple keyword vs. advanced interface
5. Presenter
• Different presentation models
• Ranked list (Google, Bing) vs. clustered list (Yippy)
Evaluation of Results
• The deciding factor for a search engine is its effectiveness
• Two factors:
  • Precision — the percentage of the retrieved documents that are actually relevant to the query
  • Recall — the percentage of all relevant documents that were actually retrieved
Evaluation of Results (cont.)
• T: true-relevant documents
• R: retrieved documents
• Precision = |T ∩ R| / |R|
• Recall = |T ∩ R| / |T|
• F-measure = 2 × Precision × Recall / (Precision + Recall)
• P-R graph (recall on one axis, precision on the other); a small sketch of these measures follows
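A minimal sketch computing precision, recall, and F-measure directly from the set definitions above; the document-id sets are made-up:

```python
def evaluate(relevant, retrieved):
    """relevant (T) and retrieved (R) are sets of document ids."""
    hit = len(relevant & retrieved)                  # |T intersect R|
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# 4 of the 5 retrieved docs are relevant; 6 relevant docs exist in total
print(evaluate(relevant={1, 2, 3, 4, 5, 6}, retrieved={1, 2, 3, 4, 9}))
# (0.8, 0.666..., 0.727...)
```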
A Lot More Details Later
• Through the second half of the semester, we will review:
  • Each component of search engines, and
  • The principles behind it, in more detail
• We will use materials from the IIR textbook
• Contents are freely available at:
  • http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html