Explore web mining techniques, bow-tie theory, authorities & hubs, Hits, and PageRank in the context of vast and dynamic web data, understanding link structures and search challenges.
Link Structure and Web Mining Shuying Wang 2003.11
Outline
Part one: Link Structure and Web Mining
Part two: Analysis of Link Structure
Topics covered:
- Web mining methods
- Text-based Web mining
- Web graph: bow-tie theory
- Eigenvalues and eigenvectors
- Authorities & Hubs
- HITS (Hyperlink-Induced Topic Search)
- PageRank
Challenges for Web Search • The WWW is a vast collection of information: over 3 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day. • Huge • Complex • Dynamic • Diverse • Different user groups • How do we find the information we need in such a large collection? • Search is the most common activity on the web after email.
Web Mining Method • Web content mining - Context, Keyword, Document classification • Web structure mining - Link structure and link text • Web usage mining - Weblog, URL, timestamp, IP and web page content
Limitations of text-based analysis • Text-based ranking function • E.g., could www.harvard.edu be recognized as one of the most authoritative pages for “harvard”, when many other web pages contain the term “harvard” more often? • Pages are not sufficiently self-descriptive • Usually the term “search engine” doesn't appear on search engine web pages [Figure: Web database, Keyword, Web pages]
What are the benefits of link building? • Following a link is one of the most popular ways for people to find new sites. • By providing links to other material people don't have to re-invent the wheel. • Inbound links help to build trust. • Link structure and link text provide a lot of information for making relevance judgments and quality filtering • The link structure implies an underlying social structure in the way that pages and links are created, and it is an understanding of this social organization that can provide us the most leverage.
Queries and Authoritative Sources Types of queries • Specific queries E.g., “Does Netscape support the JDK 1.3?” • Broad-topic queries E.g., “Find information about the Java programming language.” • Similar-page queries E.g., “Find pages similar to java.sun.com.” Authoritative pages (relative to a broad-topic query) • It is not sufficient to collect a large number of potentially relevant pages from text-based methods. • Authorities are often not particularly self-descriptive.
Authorities and Hubs • A good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities. • This is the mutually reinforcing relationship. Authority pages are those that contain the most definitive, central, and useful information in the context of a particular topic. Hub pages are those that link to a collection of prominent sites on a common topic. [Figure: hubs pointing to authorities]
HITS (Hyperlink-Induced Topic Search) • The focused subgraph is created by first taking the highest-ranked pages from a text-based search engine as a root set R. • R is expanded into the base set S by adding every page that points to, or is pointed to by, a page in R. • Note that while R may fail to contain some “important” authorities, S will probably contain them. [Figure: root set R (R1 … Rn) inside the base set S (S1 … Sn)]
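The expansion from root set to base set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_base_set`, the edge-list representation, and the toy page names are all assumptions made for the example, and it does not include refinements such as limiting how many in-linking pages are added per root page.

```python
def build_base_set(root, links):
    """Expand root set R into base set S.

    root  -- set of page identifiers (hypothetical names)
    links -- iterable of (source, target) pairs, meaning source links to target
    """
    base = set(root)
    for source, target in links:
        # Keep any page that points to, or is pointed to by, a root page.
        if source in root or target in root:
            base.add(source)
            base.add(target)
    return base


# Toy link graph (made up for illustration).
links = [("a", "r1"), ("r1", "b"), ("c", "d"), ("r2", "b"), ("e", "r2")]
root = {"r1", "r2"}
base = build_base_set(root, links)
```

Here pages `c` and `d` stay outside the base set because neither touches a root page.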
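Computing Hubs and Authorities (1)
For each page p, we associate a non-negative authority weight $a_p$ and a non-negative hub weight $h_p$:
$$a_p = \sum_{q:\, q \to p} h_q \qquad (1)$$
$$h_p = \sum_{q:\, p \to q} a_q \qquad (2)$$
Number the pages {1, 2, …, n} and define their adjacency matrix A to be the n × n matrix whose (i, j)th entry is equal to 1 if page i links to page j, and is 0 otherwise. Define $a = (a_1, a_2, \dots, a_n)$ and $h = (h_1, h_2, \dots, h_n)$. Then (1) and (2) become
$$a = A^T h \qquad (3)$$
$$h = A a \qquad (4)$$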
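A small worked example may help connect the adjacency matrix to the two update rules. The sketch below uses plain Python lists and 0-indexed pages; the helper names and the four-page graph are assumptions chosen for illustration only.

```python
def adjacency_matrix(n, edges):
    """A[i][j] = 1 if page i links to page j, else 0 (pages are 0-indexed here)."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


def authority_update(A, h):
    """a_p = sum of h_q over all pages q that link to p (a = A^T h)."""
    n = len(A)
    return [sum(A[q][p] * h[q] for q in range(n)) for p in range(n)]


def hub_update(A, a):
    """h_p = sum of a_q over all pages q that p links to (h = A a)."""
    n = len(A)
    return [sum(A[p][q] * a[q] for q in range(n)) for p in range(n)]


# Toy graph: page 0 links to 2; page 1 links to 2 and 3.
A = adjacency_matrix(4, [(0, 2), (1, 2), (1, 3)])
h = [1, 1, 1, 1]
a = authority_update(A, h)   # with all hub weights 1, this is each page's in-degree
h_new = hub_update(A, a)
```

With unit hub weights the first authority update simply counts in-links, so page 2 (two in-links) scores higher than page 3 (one in-link).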
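Computing Hubs and Authorities (2)
Substituting (4) into (3) gives, up to the normalization constant,
$$a = A^T h = A^T A a \qquad (5)$$
Let
$$B = A^T A \qquad (6)$$
• In other words, a is an eigenvector of B:
$$B a = \lambda a \qquad (7)$$
• B is the co-citation matrix: B(i,j) is the number of pages that jointly point to both i and j.
• B is symmetric and has n orthogonal unit eigenvectors.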
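To make the co-citation interpretation concrete, the following sketch computes $A^T A$ for a small hypothetical graph and checks that entry (i, j) counts the pages linking to both i and j. The function name `cocitation` and the example matrix are assumptions for illustration.

```python
def cocitation(A):
    """Return B = A^T A, where B[i][j] counts pages k that link to both i and j."""
    n = len(A)
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Toy adjacency matrix: page 0 links to 2; page 1 links to 2 and 3.
A = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
B = cocitation(A)
```

In this example only page 1 links to both page 2 and page 3, so `B[2][3]` is 1, while the diagonal entry `B[2][2]` equals the in-degree of page 2.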
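Computing Hubs and Authorities (3)
• We initialize a(p) = h(p) = 1 for all p.
• We iterate the following operations:
$$a(p) \leftarrow \sum_{q:\, q \to p} h(q), \qquad h(p) \leftarrow \sum_{q:\, p \to q} a(q)$$
• And renormalize after each iteration (for example, so that the squared weights sum to 1).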
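The iteration can be sketched in a few lines of standard-library Python. This is an illustrative version under stated assumptions: weights are normalized to unit Euclidean length, a fixed iteration count stands in for a convergence test, and the four-page adjacency matrix is made up for the example.

```python
import math


def normalize(weights):
    """Scale a weight vector so that its squared entries sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


def hits(A, iterations=20):
    """Iterate the authority and hub updates on adjacency matrix A.

    A[i][j] = 1 if page i links to page j, else 0.
    Returns (authority_weights, hub_weights).
    """
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum hub weights of pages linking to p.
        a = normalize([sum(A[q][p] * h[q] for q in range(n)) for p in range(n)])
        # Hub update: sum authority weights of pages that p links to.
        h = normalize([sum(A[p][q] * a[q] for q in range(n)) for p in range(n)])
    return a, h


# Toy graph: page 0 links to 2; page 1 links to 2 and 3.
A = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
```

In this toy graph page 2 ends up as the strongest authority and page 1 as the strongest hub, which matches the intuition that page 1 points to both authorities.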
Computing Hubs and Authorities (4) • The eigenvectors of B are precisely the stationary points of this process. • a is the principal eigenvector of $A^T A$, and h is the principal eigenvector of $A A^T$. • The principal eigenvector represents the “densest cluster” within the focused subgraph. • By initializing a(p) = h(p) = 1, a will converge to the principal eigenvector of B. • Initializing differently may lead to convergence to a different eigenvector. • In practice, convergence is achieved after only 10-20 iterations.
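The stationary-point claim can be checked numerically: repeatedly multiplying by B and renormalizing should yield a vector a with B a = λ a for the largest eigenvalue λ. The sketch below is a generic power iteration, with a hypothetical 2 × 2 co-citation matrix chosen so the answer is easy to verify by hand (its largest eigenvalue is (3 + √5)/2).

```python
import math


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def unit(x):
    """Scale x to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def power_iteration(M, x0, iterations=50):
    """Repeatedly apply M and renormalize, starting from x0."""
    x = unit(x0)
    for _ in range(iterations):
        x = unit(matvec(M, x))
    return x


# Hypothetical co-citation matrix for two authority pages.
B = [[2, 1],
     [1, 1]]
a = power_iteration(B, [1.0, 1.0])

Ba = matvec(B, a)
eigenvalue = sum(p * q for p, q in zip(Ba, a))            # Rayleigh quotient
residual = max(abs(p - eigenvalue * q) for p, q in zip(Ba, a))
```

A residual near zero indicates that the iterate is (numerically) an eigenvector of B, and the Rayleigh quotient recovers the corresponding eigenvalue.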
PageRank (Simple structure of Google search engine) [Figure: offline, TextIndex() builds the inverted text index and PageRank() computes the page rank from the Web; at query time, the Query Processor uses both to turn a query into ranked results.]
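PageRank Computing
u: a web page
$B_u$: the set of pages that link to u
v: a page that links to u ($v \in B_u$)
$N_v$: the number of links going out of page v
c: a factor for normalization (c < 1)
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} \qquad (1)$$
Let A be a square matrix with rows and columns corresponding to web pages. Let $A_{u,v} = 1/N_v$ if page v links to page u, and $A_{u,v} = 0$ otherwise. If we treat R as a vector over web pages, then
$$R = cAR \qquad (2)$$
That is, $AR = (1/c)R$, so R is an eigenvector of A with eigenvalue 1/c.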
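The simplified rank equation can be solved by repeated substitution. The sketch below is only an illustration of equation (1): it assumes every page has at least one outgoing link, rescales the ranks to sum to 1 after each pass (which plays the role of the factor c), and uses a made-up three-page graph. It does not include the damping or rank-source terms used in practical PageRank implementations.

```python
def simple_pagerank(out_links, iterations=100):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v.

    out_links -- dict mapping each page to the list of pages it links to
                 (assumed non-empty for every page)
    """
    pages = list(out_links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for v, targets in out_links.items():
            share = rank[v] / len(targets)      # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        total = sum(new_rank.values())          # normalization (the factor c)
        rank = {page: r / total for page, r in new_rank.items()}
    return rank


# Toy graph (hypothetical): x links to y and z, y links to z, z links to x.
out_links = {"x": ["y", "z"], "y": ["z"], "z": ["x"]}
rank = simple_pagerank(out_links)
```

In this toy graph `x` and `z` end up with equal rank and `y` with half as much, since `y` receives only half of the rank that `x` passes on.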
HITS and PageRank
PageRank:
- Offline computation
- Focuses on authoritative pages
- Computed over all web pages
HITS:
- Query-time computation
- Seeks good hub pages
- Computed over the base-set pages
Conclusion • A technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic. • Related work: • Standing and influence in social networks, scientific citations, etc. • Hypertext and WWW rankings • …
References
• Jon Kleinberg. Mining the Link Structure of the World Wide Web.
• Jon Kleinberg. Authoritative Sources in a Hyperlinked Environment.
• Larry Page. The PageRank Citation Ranking: Bringing Order to the Web.
• Jingyu Hou, Yanchun Zhang. Effectively Finding Relevant Web Pages from Linkage Information.
• Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques.