Search Engines IST 516 Fall 2011 Dongwon Lee, Ph.D.
Search Engine Overview
• A search engine typically consists of:
  • Crawler: crawls the web to identify interesting URLs to fetch
  • Fetcher: fetches web documents from the stored URLs
  • Indexer: builds local indexes (e.g., an inverted index) from the fetched web documents
  • Query Handler: processes users' queries using the indexes and prepares answers accordingly
  • Presenter: user interface component that presents answers to users
1. Crawler (more later)
• Also called a robot or spider
• Views the Web as a graph
• Employs different graph traversal strategies (see the sketch below):
  • Depth-first
  • Breadth-first
  • Frontier: the set of discovered-but-not-yet-visited URLs
• Objectives:
  • Completeness
  • Freshness
  • Resource maximization
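A minimal sketch of a frontier-based crawler, assuming a hypothetical `fetch_links(url)` helper that downloads a page and returns its outgoing URLs; swapping `popleft()` for `pop()` turns the breadth-first traversal into a depth-first one:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl. `fetch_links(url)` is a hypothetical helper
    that downloads a page and returns the URLs it links to."""
    frontier = deque(seed_urls)   # discovered but not yet visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # popleft() = breadth-first; pop() = depth-first
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```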
2. Fetcher
• The crawler generates a stack of URLs to visit
• The fetcher retrieves the web document at each URL
• Typically multi-threaded, as in the sketch below
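A minimal multi-threaded fetcher sketch using only Python's standard library; the worker count, timeout, and error handling are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one document; returns (url, bytes) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception:
        return url, None

def fetch_all(urls, workers=8):
    """Fetch many URLs concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))
```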
3. Indexer
• To handle the large-scale data of the Web, one needs to build an index structure
• An index is a small (memory-resident) data structure that helps locate data fast (at the cost of extra space)
  • Trading space for time
• In search engines and IR, a popular form of index is the Inverted Index
Inverted Index
• A list for every word (index term)
• The list of term t holds the locations (documents + offsets) where t appears (see the sketch below)
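A minimal sketch of building an inverted index mapping each term to (document id, word offset) postings; the three tiny documents are made-up examples reused from the vector-space slides later in this deck:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, offset), ...]},
    where offset is the term's word position within the document."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].append((doc_id, offset))
    return index

index = build_inverted_index({1: "penn state football",
                              2: "state gov",
                              3: "psu football"})
print(index["state"])   # [(1, 1), (2, 0)]
```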
4. Query Handler
• Given a query keyword Q, which web documents are the right answers? E.g.:
  • Boolean-matching model: return all documents that contain Q (see the sketch below)
  • Vector space model: return a ranked list of documents that have the largest cosine similarity to Q
  • PageRank model: return a ranked list of documents that have the highest PageRank values, in combination with another matching model
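Building on the inverted-index sketch above, a minimal Boolean-AND query handler: it intersects the posting lists of every query term.

```python
def boolean_and(index, query):
    """Return the ids of documents containing every query term (Boolean AND)."""
    postings = [{doc for doc, _ in index.get(term, [])}
                for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Reusing the `index` built in the inverted-index sketch above:
print(boolean_and(index, "state football"))   # {1}
```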
Vector Space Model
• Cosine similarity: the similarity of two same-length vectors, measured by the cosine of the angle between them
  • cos(V1, V2) = (V1 · V2) / (||V1|| ||V2||)
  • Range: -1 to +1
  • Dot product: V1 · V2
  • Magnitude: ||V1||
Vector Space Model
• Given N web documents gathered, extract the set of all significant tokens (i.e., words), say T
  • |T| becomes the dimension of the vectors
• Convert each web document w_i to a |T|-length Boolean vector, say V_w_i
• Given a query string Q, convert it to a |T|-length Boolean vector, say V_Q
• Compute the cosine similarity between V_Q and each V_w_i
• Sort the similarity scores in descending order
Vector Space Model Example
• 3 documents:
  • D1 = {penn, state, football}
  • D2 = {state, gov}
  • D3 = {psu, football}
• Vector space representation over V = {football, gov, penn, psu, state}:
  • D1 = [1, 0, 1, 0, 1]
  • D2 = [0, 1, 0, 0, 1]
  • D3 = [1, 0, 0, 1, 0]
• Query Q = {state, football}
  • V_Q = [1, 0, 0, 0, 1]
• Which document is the closest to Q? (worked out in the sketch below)
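A minimal sketch that works this example out with cosine similarity; the vector layout follows the vocabulary V above:

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = (v1 . v2) / (||v1|| ||v2||)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Vocabulary V = {football, gov, penn, psu, state}
docs = {"D1": [1, 0, 1, 0, 1], "D2": [0, 1, 0, 0, 1], "D3": [1, 0, 0, 1, 0]}
q = [1, 0, 0, 0, 1]
for name, vec in docs.items():
    print(name, round(cosine(q, vec), 4))
# D1 0.8165, D2 0.5, D3 0.5  ->  D1 is the closest to Q
```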
Term-Weighting
• Instead of the Boolean vector space model, each dimension carries an importance weight for the corresponding token
  • [1, 0, 0, 1, 0] → [0.9, 0.12, 0.14, 0.89, 0.13]
Term-Weighting
• E.g., in tf-idf, the importance of a term t (a sketch follows):
  • Increases proportionally to the number of times t appears in the document (tf)
  • Is offset by the number of documents in the corpus that contain t (idf)
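A minimal tf-idf sketch using one common weighting variant, w(t, d) = tf(t, d) × log(N / df(t)); the slides do not pin down the exact formula, so this variant is an assumption:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using the common variant w(t, d) = tf(t, d) * log(N / df(t))."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)                                    # term frequency
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["penn", "state", "football"], ["state", "gov"], ["psu", "football"]]
print(tf_idf(docs)[0])
# 'penn' (in 1 of 3 docs) outweighs 'state' and 'football' (each in 2 of 3 docs)
```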
PageRank
• Uses the link graph of the web
• Prioritizes the results of a search
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1 … Tn are the pages linking to A
  • d: damping factor (e.g., 0.85)
  • C(Ti): # of outgoing links of page Ti
• A page has high PageRank if:
  • Many pages point to it, or
  • Some pages that point to it have high PageRank
PageRank Example
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• E.g., suppose four pages link to A, with PageRanks 8, 4, 1, 2 and out-degrees 1, 2, 4, 3 respectively (figure omitted):
  • PR(A) = (1 - 0.85) + 0.85 (8/1 + 4/2 + 1/4 + 2/3) ≈ 0.15 + 0.85 × 10.9167 ≈ 9.4292
Effect of Link Structure (1)
• Example of simple PR
• Initial PRs of pages A, B, and C are all 0.15
• (Figure: four three-page link graphs over pages A, B, and C with different link structures and the resulting PR values: A = 0.15 / 1 / 1.85 / 1.4594, B = 0.2775 / 1 / 0.575 / 1, C = 0.15 / 1 / 0.575 / 0.575.)
Effect of Link Structure (2)
• Example of practical PR (see the power-iteration sketch below)
• Rank sink: a page with no outbound links, so PR leaks out of the graph
• (Figure: a five-page link graph over pages A–E computed as "Simple PR" vs. "Practical PR" with a sink; one run yields A = 0.6277, B = 0.6836, C = 0.4405, D = 1.5735, E = 1.6747, and the other A = 1.1922, B = 1.1634, C = 0.6444, D = 0.1500, E = 0.1500.)
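A minimal power-iteration sketch of the slide's (non-normalized) PageRank formula; the tiny graph, damping factor, and iteration count are illustrative, and pages that are absent from `links` simply act as rank sinks:

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Iterates the slide's formula
    PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}          # arbitrary initial values
    for _ in range(iters):
        pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                                   for q in links if p in links[q])
              for p in pages}
    return pr

# Tiny graph: A <-> B, A -> C (C has no outbound links, i.e., a rank sink)
print(pagerank({"A": ["B", "C"], "B": ["A"]}))
```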
5. Presenter
• Different presentation models
• Simple keyword vs. advanced interface
5. Presenter
• Different presentation models
• Ranked list (Google, Bing) vs. clustered list (Yippy)
Evaluation of Results
• The deciding factor for a search engine is its effectiveness
• Two factors:
  • Precision — the percentage of the retrieved documents that are actually relevant to the query
  • Recall — the percentage of all relevant documents that were actually retrieved
Evaluation of Results (cont.)
• T: true-relevant documents
• R: retrieved documents
• Precision = |T ∩ R| / |R|
• Recall = |T ∩ R| / |T|
• F-measure = 2 × Precision × Recall / (Precision + Recall)
• P-R graph (recall on one axis, precision on the other); a small sketch of these measures follows
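A minimal sketch computing precision, recall, and F-measure directly from the set definitions above; the document-id sets are made-up:

```python
def evaluate(relevant, retrieved):
    """relevant (T) and retrieved (R) are sets of document ids."""
    hit = len(relevant & retrieved)                  # |T intersect R|
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# 4 of the 5 retrieved docs are relevant; 6 relevant docs exist in total
print(evaluate(relevant={1, 2, 3, 4, 5, 6}, retrieved={1, 2, 3, 4, 9}))
# (0.8, 0.666..., 0.727...)
```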
A Lot More Details Later
• Through the second half of the semester, we will review:
  • Each component of search engines, and
  • The principles behind it, in more detail
• We will use materials from the IIR textbook
• Contents are freely available at:
  • http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html