
CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB

CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB. Kemal Efe, Vijay Raghavan, and Arun Lakhotia, University of Louisiana. Presented by Lan Nie, 09/01/2005, Lehigh University.


Presentation Transcript


  1. CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB
     Kemal Efe, Vijay Raghavan, and Arun Lakhotia, University of Louisiana
     Presented by Lan Nie, 09/01/2005, Lehigh University

  2. Introduction
     • Search engine
       • Crawls, indexes, and retrieves information about web pages
       • Finds all of the relevant pages
       • Ranks them by relevance to the user query
       • Presents a rank-ordered result
       • Recall and precision
     • Early search engines
       • Relied solely on keyword matching
       • Returned many low-quality pages; rankings rarely agreed with the user’s interests
       • Suffered from synonymy and polysemy
     • Modern search engines
       • Linkage structure provides valuable information
       • Link analysis is combined with content analysis
       • The combination substantially improves search quality
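
The slide names recall and precision without defining them. As a minimal sketch, assuming the standard definitions (precision is the fraction of retrieved pages that are relevant, recall is the fraction of relevant pages that were retrieved), they can be computed as follows. The page identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant pages that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 2 of 4 retrieved are relevant; 2 of 3 relevant were found
```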

  3. Link Analysis
     • Authority Flow Model
       • Link: a channel for authority flow
       • A page q with authority rank r_q(i) at iteration i distributes all of its current authority equally among its outgoing links
       • However, an authoritative page on a subject is likely to be co-cited with other authoritative pages on the same subject
       • The rank of a page should therefore be augmented by its co-citation degree
     • Random Walk Model
       • A surfer walks on the web graph and makes random decisions about where to go next
     • PageRank is a combination of the authority flow model and the random walk model
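
The authority flow rule on this slide can be sketched as one update step over a small graph. This is an illustrative sketch, not the authors' implementation: the three-page graph is hypothetical, and dropping the authority of a page without outlinks is an assumption the slide does not address.

```python
def authority_flow_step(links, rank):
    """One iteration: each page q splits its current rank equally over its outlinks."""
    new_rank = {page: 0.0 for page in links}
    for q, outlinks in links.items():
        if not outlinks:
            continue  # assumption: a page with no outlinks passes nothing on
        share = rank[q] / len(outlinks)
        for p in outlinks:
            new_rank[p] += share
    return new_rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {page: 1.0 / len(links) for page in links}
for _ in range(50):
    rank = authority_flow_step(links, rank)
print(rank)  # settles near a = 0.4, b = 0.2, c = 0.4 for this graph
```

Because every page here has at least one outlink, the total authority stays at 1 across iterations.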

  4. Continued: Authority and Hub
     • Co-citation matrix A^T A
       • Entry (p, q): the number of co-citations received jointly by p and q
       • Entry (p, p): the indegree of page p
     • Bibliographic coupling matrix A A^T
     • Authority / Hub
       • Diagonal term: authority is influenced by the number of citations
       • Non-diagonal term: authority is influenced by the degree of co-citation
       • Influence (co-citation) >> Influence (citation)
     • A more general model: different weights for the diagonal and non-diagonal terms in the above computation
     • The HITS algorithm combines the authority and hub ideas
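
To make the two matrices concrete, the sketch below builds them for a hypothetical four-page graph. It assumes the usual convention that A[p][q] = 1 when page p links to page q, which the slide does not state explicitly.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]


# Hypothetical adjacency matrix: A[p][q] = 1 if page p links to page q.
A = [
    [0, 0, 1, 1],  # page 0 links to pages 2 and 3
    [0, 0, 1, 1],  # page 1 links to pages 2 and 3
    [0, 0, 0, 1],  # page 2 links to page 3
    [0, 0, 0, 0],  # page 3 has no outlinks
]

cocitation = matmul(transpose(A), A)  # A^T A
coupling = matmul(A, transpose(A))    # A A^T

print(cocitation[2][3])  # pages citing both 2 and 3
print(cocitation[3][3])  # indegree of page 3
print(coupling[0][1])    # pages cited by both 0 and 1
```

In this example pages 2 and 3 are co-cited by pages 0 and 1, so the off-diagonal entry (2, 3) is 2, while the diagonal entry (3, 3) is the indegree 3.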

  5. Content Analysis
     • Which pages are important in the web graph? (Link analysis)
     • Which pages are relevant to the query? (Content analysis)
     • Tasks of content analysis
       • How relevant a page is to the user query
         • Similarity between documents in vector space
         • Cosine similarity
         • Okapi measure, Three-Level Scoring, Cover Density Ranking
       • Where on the page to search for the query terms
         • Fields: title, anchor text, abstract
         • Properties: font, highlighting, capitalization, distance between subquery terms
       • Dealing with synonymy and polysemy
         • LSI, GVSM
       • Applications in classification, document search, and relevance ranking
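
Cosine similarity in the vector space model can be sketched as below. The sketch assumes raw term-frequency vectors and whitespace tokenization; real systems typically add term weighting such as tf-idf, which the slide does not specify. The sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


query = "jaguar car"
print(cosine_similarity(query, "jaguar car dealer"))  # shares two terms with the query
print(cosine_similarity(query, "rainforest birds"))   # shares no terms, so 0.0
```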

  6. Combining Content and Link Analysis (PageRank)
     • PageRank: a random surfer (Brin and Page [1998])
       • With probability 1-d, the surfer jumps to a random page; with probability d, it follows a random outlink
       • The rank is independent of the query or topic
     • Topic-Sensitive PageRank: multiple focused surfers (Haveliwala [2002])
       • A set of predefined topics (the top-level categories of the ODP), with C_t as the set of URLs in ODP category t
       • Each page is assigned a rank vector, with one rank for each topic
       • Each surfer is focused on a specific topic t
       • With probability 1-d, the surfer jumps to a page in C_t; with probability d, it follows a random outlink
       • For a given query, a page’s query-sensitive score is the inner product of the page’s rank vector and the query’s topic distribution vector
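
Both surfers on this slide differ only in where the random jump lands, so one sketch can cover them. The code below is an illustration under stated assumptions: d = 0.85 is an assumed value, a page without outlinks is assumed to always jump, every linked page must appear as a key of `links`, and the graph, topic sets, and query topic distribution are all hypothetical.

```python
def pagerank(links, d=0.85, teleport=None, iterations=100):
    """Power iteration for the random-surfer model.

    teleport=None gives plain PageRank (jump to any page);
    teleport=C_t gives the topic-focused surfer for topic t.
    """
    pages = list(links)
    targets = list(teleport) if teleport else pages
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for q in pages:
            outlinks = links[q]
            follow = d if outlinks else 0.0  # assumption: dangling pages always jump
            for p in outlinks:
                new_rank[p] += follow * rank[q] / len(outlinks)
            for p in targets:
                new_rank[p] += (1.0 - follow) * rank[q] / len(targets)
        rank = new_rank
    return rank


def query_score(page, topic_ranks, query_topics):
    """Inner product of the page's rank vector and the query's topic distribution."""
    return sum(w * topic_ranks[t][page] for t, w in query_topics.items())


# Hypothetical graph and topic sets C_t.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["c"]}
topics = {"sports": ["a"], "cars": ["c", "d"]}
topic_ranks = {t: pagerank(links, teleport=c_t) for t, c_t in topics.items()}

query_topics = {"sports": 0.2, "cars": 0.8}  # assumed output of a query classifier
for page in links:
    print(page, query_score(page, topic_ranks, query_topics))
```

Page d has no inlinks, so it only receives rank from the surfer whose jump set contains it; under the hypothetical "sports" topic its rank is zero.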

  7. Combining Content and Link Analysis (HITS)
     • HITS
       • Sampling
         • Use the query to collect a root set of pages from a text-based search engine
         • Expand the root set into a base set by adding pages that link to or are linked from the root set
       • Calculation of authorities and hubs
     • Problems
       • Tightly Knit Community (TKC) effect: HITS converges to regions of the web graph that are highly connected. What if the TKC is irrelevant to the topic?
       • A page propagates the same authority weight to each page it links to
       • The result is dominated by one community; a page is deemed unimportant if it is popular only within a smaller community. Example: Jaguar
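
The two HITS phases can be sketched as follows. This is a simplified illustration: the link graph and root set are hypothetical, the root set would normally come from a text search engine, and normalizing by the Euclidean norm after each round is one common choice that the slide does not specify.

```python
import math


def expand_root_set(root, links):
    """Base set: the root set plus pages that link to it or are linked from it."""
    base = set(root)
    for page, outlinks in links.items():
        if page in root:
            base.update(outlinks)          # pages linked from the root set
        elif any(q in root for q in outlinks):
            base.add(page)                 # pages linking into the root set
    return base


def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()} if norm else scores


def hits(links, pages, iterations=50):
    """Iterate authority and hub scores over the subgraph induced by `pages`."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in links.get(q, []):
                if p in pages:
                    auth[p] += hub[q]      # authority: sum of hub scores of in-neighbours
        hub = {q: sum(auth[p] for p in links.get(q, []) if p in pages) for q in pages}
        auth, hub = normalize(auth), normalize(hub)
    return auth, hub


# Hypothetical link graph and root set.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"],
         "a1": [], "x": ["h1"], "far": ["x"]}
base = expand_root_set({"h1", "a1"}, links)
auth, hub = hits(links, base)
print(sorted(base))
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this example the densely linked group (h1, h2, a1, a2) ends up with nearly all of the weight, which is a small-scale picture of the TKC effect listed under Problems.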

  8. Continued: Improvements to HITS
     • Chakrabarti et al. [1999]
       • Outlinks in different parts of a page may point to different topics
       • Page splitting: outlinks within each small page tend to be on the same topic
     • Li et al. [2002]
       • A good hub is likely to be cited; hub weights of pages are increased depending on their authority weights
     • Cohn and Chang [2001]
       • PHITS: a probabilistic model that ranks a page within its own community rather than within the entire base set
     • Dean et al. [1999]
       • What’s Related? Given the seed page, find its parents, the children of its parents, its children, and the parents of its children
       • Given a seed page, find the pages that link to it, and what else they link to
       • Output the pages that are most frequently co-cited with the seed URL
     • Bharat and Henzinger [1998], Chakrabarti et al. [1998]
       • Weighted HITS
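
The co-citation procedure described under Dean et al. can be sketched in a few lines: find the pages that link to the seed, then count what else they link to. The graph and the `top_k` parameter are hypothetical, and the sketch omits any limits on how many parents or siblings are examined.

```python
from collections import Counter


def related_by_cocitation(seed, links, top_k=3):
    """Rank pages by how many parents of the seed also link to them."""
    parents = [p for p, outlinks in links.items() if seed in outlinks]
    counts = Counter()
    for parent in parents:
        for sibling in set(links[parent]):  # count each parent at most once per sibling
            if sibling != seed:
                counts[sibling] += 1
    return counts.most_common(top_k)


# Hypothetical graph: three pages link to the seed, one does not.
links = {
    "p1": ["seed", "a", "b"],
    "p2": ["seed", "a"],
    "p3": ["seed", "a", "c"],
    "p4": ["b", "c"],
}
print(related_by_cocitation("seed", links, top_k=1))  # "a" is co-cited by all three parents
```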

  9. Weighted HITS
     • CLEVER project (Chakrabarti et al. [1998])
       • A relevance weight is computed for each link
       • W(p, q): the number of query matches in the text surrounding the link p -> q
     • Query expansion (Bharat and Henzinger [1998])
       • A relevance weight is assigned to each page
       • Broader query Q: the concatenation of the first 1000 words from each document in the root set
       • W(p): the cosine similarity between page p and the broader query Q
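
The link weight W(p, q) described for the CLEVER project can be sketched as a count of query-term matches, followed by one weighted authority update. This follows the slide's wording only; the tokenization (lowercasing and whitespace splitting), the sample texts, and the single-step update are simplifying assumptions rather than the published method.

```python
def link_weight(query, surrounding_text):
    """W(p, q): number of query-term occurrences in the text around the link."""
    terms = set(query.lower().split())
    return sum(1 for word in surrounding_text.lower().split() if word in terms)


def weighted_authority_step(weighted_links, hub):
    """auth[q] = sum over links p -> q of W(p, q) * hub[p]."""
    auth = {}
    for (p, q), weight in weighted_links.items():
        auth[q] = auth.get(q, 0.0) + weight * hub[p]
    return auth


# Hypothetical text surrounding two links on the same page p.
contexts = {
    ("p", "q1"): "best jaguar car dealers and jaguar parts",
    ("p", "q2"): "photos of big cats",
}
query = "jaguar car"
weighted_links = {edge: link_weight(query, text) for edge, text in contexts.items()}
auth = weighted_authority_step(weighted_links, hub={"p": 1.0})
print(weighted_links)
print(auth)
```

With these weights, a link whose surrounding text never mentions the query contributes nothing, so page p no longer propagates the same authority to every page it links to.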
