Web Search – Summer Term 2006
VII. Selected Topics - The Hilltop Algorithm
(c) Wolfgang Hürst, Albert-Ludwigs-University
Recap (PageRank and HITS)
PageRank and HITS both search the web based on
a) Relevance (content, anchor text, ...)
b) Quality, importance, authority, ...
The latter is based on link structure.
- PageRank: global, query-independent, recursive calculation over all pages
- HITS: local subgraph containing relevant documents, distinguishes between hubs and authorities
Hilltop [1]: Basic Idea
Observation: Many web users (authors)
- Create web pages with link lists about topics they are very familiar with (experts)
- Maintain these pages well / try to keep them up to date
- Link to good, high-quality pages
Idea: Try to find such pages automatically and use their link structure for ranking
Compare HITS: Similar to hubs, but with a more explicit description of "expert" sources and a global view (i.e. query independent)
Hilltop [1]: Basic Idea (cont.)
An expert page is a page that is about a certain topic and has links to many non-affiliated pages on that topic
- "non-affiliated" means that the authors belong to non-affiliated organizations (modeled, e.g., by URL processing)
- "links to many ... pages" can be modeled, e.g., by a threshold
A page is an authority on a query topic if, and only if, some of the best experts on that query topic point to it
Hilltop [1]: Basic Idea (cont.)
General approach:
1. Identify experts (in advance, i.e. query independent)
2. Select experts for a particular topic (depending on a specific query)
3. Use these experts to find and rank authorities for this topic
Identifying good expert pages
What makes a good expert, and how can experts be found?
A good expert is objective, diverse, unbiased, and points to numerous non-affiliated pages.
Two hosts can be defined as affiliated if ...
- ... they share the same first 3 octets of the IP address OR
- ... the rightmost non-generic token in the hostname is the same (token = substring of a hostname delimited by ".")
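The two affiliation rules above can be sketched in a few lines of Python. This is only an illustration: the set GENERIC_TOKENS and the function names are assumptions made for this example, and a real system would need a much more complete list of generic hostname tokens.

```python
# Assumption: a small, hand-picked set of "generic" hostname tokens.
GENERIC_TOKENS = {"com", "org", "net", "edu", "gov", "co", "uk", "de"}


def ip_prefix(ip: str) -> str:
    """Return the first three octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])


def rightmost_non_generic_token(hostname: str) -> str:
    """Scan tokens from right to left and return the first non-generic one."""
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return ""  # hostname consists of generic tokens only


def affiliated(host_a: str, ip_a: str, host_b: str, ip_b: str) -> bool:
    """Two hosts are affiliated if either rule from the slide applies."""
    same_subnet = ip_prefix(ip_a) == ip_prefix(ip_b)
    token_a = rightmost_non_generic_token(host_a)
    token_b = rightmost_non_generic_token(host_b)
    same_token = token_a != "" and token_a == token_b
    return same_subnet or same_token
```

With this sketch, "www.ibm.com" and "ibm.co.uk" count as affiliated because both reduce to the token "ibm", regardless of their IP addresses.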
Identifying good expert pages
1st: Divide all (indexed) web pages into groups of affiliated ones
2nd: Get experts (i.e. pages pointing to lots of non-affiliated pages) based on their number of links to different groups (e.g. using a threshold)
Note: This is all topic-independent!
Possible extensions:
- Consider topic-related clusters (if available)
- Consider special characteristics of a page (e.g. similar formatting, etc.)
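A minimal sketch of these two steps, assuming that pairs of affiliated hosts have already been determined. The function names, the union-find grouping, and the threshold parameter are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlparse


def group_hosts(hosts, affiliated_pairs):
    """Step 1: merge affiliated hosts into groups (simple union-find).

    Returns a dict mapping each host to a representative of its group.
    """
    parent = {h: h for h in hosts}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving
            h = parent[h]
        return h

    for a, b in affiliated_pairs:
        parent[find(a)] = find(b)
    return {h: find(h) for h in hosts}


def is_expert(page_url, out_links, group_of, threshold):
    """Step 2: a page is an expert if it links to at least `threshold`
    distinct groups other than its own. The threshold is a free parameter."""
    own_group = group_of.get(urlparse(page_url).hostname)
    linked_groups = set()
    for link in out_links:
        group = group_of.get(urlparse(link).hostname)
        if group is not None and group != own_group:
            linked_groups.add(group)
    return len(linked_groups) >= threshold
```

Links into the page's own group are ignored, so a page that only links within its own organization never qualifies, however many links it has.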
Indexing experts
- Identification of experts: done in advance, i.e. topic / query independent
- Selection of experts for a particular topic: done during the search process, i.e. query dependent
Therefore: create an inverted file for all pages that have been identified as experts
Only index so-called key phrases, i.e.
- Take all words in the title, in headlines (<h1>, <h2>, ... tags), and in the anchor text of a URL
- Associate these phrases with the respective URLs
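The inverted file over key phrases could look roughly like the following sketch. It assumes the key phrases have already been extracted from the expert pages; the KeyPhrase record and the function names are hypothetical and only serve this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPhrase:
    expert: str     # URL of the expert page containing the phrase
    level: str      # "title", "heading" or "anchor"
    text: str       # the phrase itself
    targets: tuple  # URLs the phrase is associated with


def build_index(phrases):
    """Map each term to the positions of the key phrases containing it."""
    index = defaultdict(set)
    for position, phrase in enumerate(phrases):
        for term in phrase.text.lower().split():
            index[term].add(position)
    return index


def experts_for_term(index, phrases, term):
    """Return the URLs of all experts with a key phrase containing `term`."""
    return {phrases[pos].expert for pos in index.get(term.lower(), set())}
```

Only the key phrases are indexed, not the full page text, which keeps the index small compared to a regular full-text index.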
Search: Get and rank authorities
With this, we have:
- Experts for different topics
- All information we need to select all experts for a particular topic given the query terms q_i
Query processing is now done in two steps:
1. Select & rate experts (based on query)
2. Select & rate authorities (based on experts)
1. Select & rate experts
Select a page as an expert (e.g.) if at least one of its URLs is associated with all query terms q_i
Rate the selected experts by calculating an expert score for each expert
For this, we define for each key phrase p and query q:
- LevelScore(p) = weighting of the type of key phrase (e.g. title: 16, heading: 6, anchor: 1)
- FullnessFactor(p,q) = measure of how many of the terms in p are covered by the query terms in q. With plen = number of terms in p and m = number of terms in p that are not query terms:
$$\mathrm{FullnessFactor}(p,q) = \begin{cases} 1 & \text{if } m \le 2 \\ 1 - \dfrac{m-2}{\mathit{plen}} & \text{otherwise} \end{cases}$$
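As a small worked example, the two quantities can be written down as follows. Here m is read as the number of terms in the key phrase that are not query terms and plen as the length of the phrase, following [1]; the names LEVEL_SCORE and fullness_factor are assumptions made for this sketch.

```python
# Example weights from the slide: title 16, heading 6, anchor 1.
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """Return 1 if at most two phrase terms are not query terms,
    otherwise 1 - (m - 2) / plen."""
    plen = len(phrase_terms)
    query = set(query_terms)
    m = sum(1 for term in phrase_terms if term not in query)  # surplus terms
    if m <= 2:
        return 1.0
    return 1.0 - (m - 2) / plen
```

For the phrase "best cheap used car dealers" and the query "car", m = 4 and plen = 5, so the factor drops to 1 - 2/5 = 0.6, while the shorter phrase "used car" keeps the full factor of 1.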
1. Select & rate experts (cont.)
Based on the LevelScore and FullnessFactor, the measures S_i are calculated as follows (k = number of query terms):
$$S_i = \sum_{\text{key phrases } p \text{ with } k-i \text{ query terms}} \mathrm{LevelScore}(p) \times \mathrm{FullnessFactor}(p,q)$$
The expert score is finally calculated as
$$\mathrm{Expert\_Score} = 2^{32} S_0 + 2^{16} S_1 + S_2$$
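The expert score can be sketched as below, reading the garbled formula as Expert_Score = 2^32 * S_0 + 2^16 * S_1 + S_2, where S_i sums over key phrases containing k - i of the k query terms. One deviation is an assumption of this sketch and not stated on the slide: phrases that contain no query term at all are skipped. All names are hypothetical.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}  # weights from the slide


def fullness_factor(phrase_terms, query):
    """1 if at most two phrase terms are not query terms, else 1 - (m-2)/plen."""
    m = sum(1 for term in phrase_terms if term not in query)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / len(phrase_terms)


def expert_score(phrases, query_terms):
    """phrases: list of (level, list_of_terms) key phrases of one expert."""
    query = set(query_terms)
    k = len(query)
    s = [0.0, 0.0, 0.0]  # S_0, S_1, S_2
    for level, terms in phrases:
        matched = len(query & set(terms))
        i = k - matched
        # Assumption: ignore phrases without any query term.
        if matched > 0 and i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(terms, query)
    return 2**32 * s[0] + 2**16 * s[1] + s[2]
```

The large factors make the ranking lexicographic in practice: a phrase matching all query terms outweighs any realistic number of phrases matching fewer terms.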
2. Select & rate authorities
Select pages as targets if they are referenced by at least two of these experts
Rate them by calculating a target score:
1. Calculate an Edge_Score(E,T) for each edge (i.e., link) from any expert E to target page T
$$\mathrm{Edge\_Score}(E,T) = \mathrm{Expert\_Score}(E) \times \sum_{\text{query terms } q} \mathrm{occ}(q,T)$$
with occ(q,T) = number of different key phrases for T containing q
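A minimal sketch of the edge score, reading the formula as Expert_Score(E) times the sum of occ(q,T) over all query terms q. The function name and the input layout (key phrases given as lists of terms) are assumptions of this example.

```python
def edge_score(expert_score, phrases_for_target, query_terms):
    """Score the edge from one expert E to one target T.

    phrases_for_target: key phrases (lists of terms) of E that are
    associated with the URL of T.
    """
    # "Different key phrases": count each distinct phrase only once.
    distinct_phrases = {tuple(terms) for terms in phrases_for_target}
    total_occurrences = 0
    for q in set(query_terms):
        # occ(q, T): number of distinct key phrases for T containing q
        total_occurrences += sum(1 for phrase in distinct_phrases if q in phrase)
    return expert_score * total_occurrences
```

For the query "hilltop algorithm" and the key phrases "hilltop algorithm" and "hilltop paper", occ is 2 for "hilltop" and 1 for "algorithm", so the edge score is three times the expert score.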
2. Select & rate authorities (cont.)
1. Calculate an Edge_Score(E,T) for each edge (i.e., link) from any expert E to target page T
2. Check all experts pointing to the same target; among affiliated experts, remove all edges but the one with the highest Edge_Score
3. The Target_Score is now calculated as the sum of all remaining Edge_Scores
Possible extension: Combine Target_Scores with a page-dependent Match_Score (depending on the appearance of search terms on the page)
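Steps 2 and 3 can be sketched as follows, assuming the edge scores and the affiliation group of each expert are already known. The function name, the input layout, and the min_experts parameter (set to two, as on the previous slide) are assumptions of this example.

```python
from collections import defaultdict


def target_scores(edges, group_of, min_experts=2):
    """edges: list of (expert, target, edge_score) triples.
    group_of: dict mapping each expert to its affiliation group."""
    best_per_group = defaultdict(dict)  # target -> group -> highest edge score
    experts_per_target = defaultdict(set)
    for expert, target, score in edges:
        experts_per_target[target].add(expert)
        group = group_of[expert]
        # Step 2: keep only the best edge among affiliated experts.
        if score > best_per_group[target].get(group, float("-inf")):
            best_per_group[target][group] = score
    # Step 3: sum the remaining edge scores for each qualifying target.
    return {
        target: sum(by_group.values())
        for target, by_group in best_per_group.items()
        if len(experts_per_target[target]) >= min_experts
    }
```

Keeping one edge per affiliation group prevents an organization from boosting a target simply by linking to it from many of its own expert pages.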
Hilltop: Summary
Preprocessing:
- Divide the web into groups of affiliated pages (based on their authors / URLs)
- Select experts (based on linkage and groups)
Searching: Select and rate
1. Experts referencing pages about a particular topic (represented by the query)
2. Authorities for this particular topic
Hilltop: Discussion
Main properties (when compared to PageRank and HITS):
- Topic/query-dependent (unlike PageRank)
- Pre-selection of experts (unlike HITS), i.e.
  - all experts are considered (no subgraph)
  - efficient online calculation can be done
- Page content and structure are considered
Potential problems / criticism:
- Relies on many intuitive assumptions that are modeled by heuristics
References
[1] Bharat, Mihaila: When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. ACM Transactions on Information Systems, Vol. 20, No. 1, Jan. 2002