Web structure mining / link mining and Web communities. Bettina Berendt, „Knowledge and the Web“, summer semester 2005. http://www.wiwi.hu-berlin.de/~berendt/lehre/2005s/kaw/ Last updated: 2005-05-04
Acknowledgements and: how to use these slides • Some of these slides were taken from the slide set of the Web Mining book by Baldi, Frasconi, and Smyth (http://ibook.ics.uci.edu/Slides/MIW%20Chapter%205.ppt); thank you for a great book and slides! • These slides are marked at the bottom left corner • I also based the slide layout on that slide set • Some figures were taken from the two presented articles (see p.4). • Further materials can be found in the directory of this session (http://www.wiwi.hu-berlin.de/~berendt/lehre/2005s/kaw/Session4) • Slides that just carry a title were developed in class and on the blackboard. • Please feel free to re-use these slides in your own teaching, and please credit their origin.
Objectives • To explore “what’s in a link” and to see what knowledge can therefore be extracted by analysing links • To calculate the popularity of a site based on link analysis • To see how linkage defines communities
Outline: Theory and applications of link analysis for ... • Search: ranking of search engine results • Scientific communities: co-citation analysis and other bibliometrics • Chen, C. & Carr, L. (1999) Visualizing the evolution of a subject domain: A case study. Proc. IEEE Visualization 1999. • An example of a resulting archive: citeseer • Identification of Web communities • Flake, G.W., Lawrence, S., & Giles, C.L. (2000). Efficient identification of web communities. Proc. KDD 2000. • Outlook: Social network analysis
Recall: Trees (slide from ISI) • A is the root node • B is the parent of D and E • D and E are children of B • (C,F) is an edge • 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves • A, B, C, D, E, F, G, H, I are internal nodes • The level (or depth) of E is 2 (number of edges to root) • The height (or order) of the tree is 4 (max number of edges from root to a leaf node) • The degree of node B is 2 (number of children) Based on Tom Blough, Introduction to Programming. http://www.rh.edu/~blought/fall02_cish4960/notes/lecture11-12.ppt
Graphs (data structure def.) • Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of vertices and a binary relation between vertices, adjacency. • Formal Definition: A graph G can be defined as a pair (V,E), where V is a set of vertices, and E is a set of edges between the vertices E = {(u,v) | u, v in* V}. If the graph is undirected, the adjacency relation defined by the edges is symmetric, or E = {{u,v} | u, v in V} (sets of vertices rather than ordered pairs). If the graph does not allow self-loops, adjacency is irreflexive. (http://www.nist.gov/dads/HTML/graph.html) Note: Edges are also called links (esp. in hypertext graphs like the WWW). • * „in“ denotes the „element-of“ relation
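The (V, E) definition above can be sketched in a few lines of Python. This is only an illustration: the vertex names and the helper functions `is_symmetric`, `is_irreflexive` and `undirected` are assumptions made for the example, not part of the quoted definition.

```python
# A directed graph G = (V, E): E is a set of ordered pairs (u, v).
V = {"a", "b", "c"}
E = {("a", "b"), ("b", "c"), ("c", "c")}  # ("c", "c") is a self-loop


def is_symmetric(edges):
    """True if the adjacency relation is symmetric (undirected graph)."""
    return all((v, u) in edges for (u, v) in edges)


def is_irreflexive(edges):
    """True if the graph has no self-loops."""
    return all(u != v for (u, v) in edges)


def undirected(edges):
    """Undirected view: each edge becomes the set {u, v} of its endpoints."""
    return {frozenset((u, v)) for (u, v) in edges}


print(is_symmetric(E), is_irreflexive(E))
print(undirected(E))
```

In a hypertext graph such as the WWW, the ordered pair (u, v) would stand for a link from page u to page v.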
What‘s in a link? 1. „This is good.“ Basic assumptions of early link analysis • Hyperlinks contain information about the human judgment of a site • The more incoming links a site has, the more important it is judged to be
Outline of the “link analysis for search engine ranking” part • Early Approaches to Link Analysis • Hubs and Authorities: HITS • PageRank • Stability • Probabilistic Link Analysis • Limitations of Link Analysis [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
Early Approaches: Bray 1996 • The visibility of a site is measured by the number of other sites pointing to it • The luminosity of a site is measured by the number of other sites to which it points • Limitation: fails to capture the relative importance of different parent (child) sites [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
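A minimal sketch of the two measures, assuming the Web is given as a list of (source site, target site) pairs. The toy `links` data and the function names are hypothetical; visibility is simply the in-degree and luminosity the out-degree over distinct other sites.

```python
from collections import Counter

# Hypothetical site-level link list: (source site, target site).
links = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("A", "C")]


def visibility(links):
    """Number of distinct other sites pointing to each site (in-degree)."""
    return Counter(dst for src, dst in set(links) if src != dst)


def luminosity(links):
    """Number of distinct other sites each site points to (out-degree)."""
    return Counter(src for src, dst in set(links) if src != dst)


vis = visibility(links)
lum = luminosity(links)
print(vis["C"], lum["A"])
```

Both counts treat every linking site alike, which is exactly the limitation named on the slide: a link from an obscure site counts as much as a link from a prominent one.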
Early Approaches: Mark 1988 • To calculate the score S of a document at vertex v: $S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$ • v: a vertex in the hypertext graph G = (V, E) • S(v): the global score • s(v): the score if the document is isolated • ch[v]: the children of the document at vertex v • Limitations: • Requires G to be a directed acyclic graph (DAG) • If v has a single link to w, S(v) > S(w) • If v has a long path to w and s(v) < s(w), then S(v) > S(w), which is unreasonable [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
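The recursive score can be sketched as a memoised function over a DAG. The graph, the isolated scores and the names `children`, `local_score` and `global_score` are assumptions chosen for illustration; a vertex without children is taken to have S(v) = s(v).

```python
from functools import lru_cache

# Hypothetical DAG: v -> w, w -> x, w -> y.
children = {"v": ["w"], "w": ["x", "y"], "x": [], "y": []}
# Hypothetical isolated scores s(v).
local_score = {"v": 1.0, "w": 2.0, "x": 4.0, "y": 2.0}


@lru_cache(maxsize=None)
def global_score(v):
    """S(v) = s(v) + mean of S(w) over the children w of v."""
    ch = children[v]
    if not ch:
        return local_score[v]  # leaf: nothing to add
    return local_score[v] + sum(global_score(w) for w in ch) / len(ch)


print(global_score("w"), global_score("v"))
```

In this toy graph s(v) < s(w), yet S(v) comes out larger than S(w) because v inherits all of w's score through its single link, which illustrates the second and third limitations. The recursion only terminates because the graph is acyclic.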
HITS: Kleinberg’s Algorithm • HITS: Hypertext Induced Topic Selection • For each vertex v ∈ V in a subgraph of interest: a(v) is the authority of v, h(v) is the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
Authority and Hubness [Figure: nodes 2, 3, 4 link to node 1; node 1 links to nodes 5, 6, 7] • h(1) = a(5) + a(6) + a(7) • a(1) = h(2) + h(3) + h(4) [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
Authority and Hubness Convergence • Recursive dependency: • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$ • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$ • Here pa[v] denotes the parents and ch[v] the children of v • Using linear algebra, we can prove that a(v) and h(v) converge [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
HITS Example • Find a base subgraph: • Start with a root set R = {1, 2, 3, 4}: nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R • The result is a new set S (the base subgraph) [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
HITS Example • Hubs and authorities: two n-dimensional vectors a and h
HubsAuthorities(G)
  $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  $t \leftarrow 1$
  repeat
    for each v in V
      do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
         $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
    $a_t \leftarrow a_t / \|a_t\|$
    $h_t \leftarrow h_t / \|h_t\|$
    $t \leftarrow t + 1$
  until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
  return $(a_t, h_t)$
[Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
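The iteration above could look roughly as follows in Python. This is a sketch under stated assumptions, not the book's implementation: the graph is a hypothetical dict of child lists, the norm is taken to be Euclidean, and a `max_iter` cap is added because the simultaneous update can oscillate on some graphs instead of converging.

```python
import math


def hits(children, eps=1e-10, max_iter=1000):
    """Return (authority, hubness) dicts for a graph given as {node: [children]}."""
    nodes = set(children) | {w for ws in children.values() for w in ws}
    parents = {v: [] for v in nodes}
    for u, ws in children.items():
        for w in ws:
            parents[w].append(u)

    def normalize(x):
        # Euclidean normalisation; an all-zero vector is left unchanged.
        norm = math.sqrt(sum(val * val for val in x.values())) or 1.0
        return {v: val / norm for v, val in x.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates read the values of the previous step (t - 1).
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children.get(v, [])) for v in nodes})
        change = (math.sqrt(sum((new_a[v] - a[v]) ** 2 for v in nodes))
                  + math.sqrt(sum((new_h[v] - h[v]) ** 2 for v in nodes)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: pages 1, 2, 3 link to pages 4 and 5.
graph = {1: [4, 5], 2: [4], 3: [4, 5]}
authority, hubness = hits(graph)
print(authority)
print(hubness)
```

In this toy graph page 4 receives links from all three hubs and therefore ends up with a higher authority weight than page 5, while pages 1 and 3 are better hubs than page 2.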
HITS Example Results [Figure: authority and hubness weights for nodes 1 to 15] [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
HITS Improvements: Bharat and Henzinger (1998) • HITS problems • A document can contain many identical links to the same document on another host • Links are generated automatically (e.g. messages posted on newsgroups) • Solutions • Assign weights to identical multiple edges that are inversely proportional to their multiplicity • Prune irrelevant nodes or regulate the influence of a node with a relevance weight [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
Markov Chain Notation • Random surfer model • Description of a random walk through the Web graph • Interpreted via a transition matrix: the rank of a page is the asymptotic probability that a surfer is currently browsing that page • $r_t = M r_{t-1}$, where M is the transition matrix of a first-order Markov chain (stochastic) • Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks? [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
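The recurrence r_t = M r_{t-1} can be tried out directly by repeated multiplication. The sketch below assumes a column-stochastic M, where M[i][j] is the probability of moving from page j to page i; the three-page Web and the function name `stationary_ranks` are made up for the example. For a chain that is irreducible and aperiodic, as this toy chain is, the iteration converges to the same vector from any initial probability distribution.

```python
def stationary_ranks(M, tol=1e-12, max_iter=10_000):
    """Iterate r_t = M r_{t-1} from the uniform distribution until it settles."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        new_r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


# Hypothetical 3-page Web with links 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# Column j holds the transition probabilities out of page j.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
ranks = stationary_ranks(M)
print(ranks)
```

Solving r = M r by hand for this matrix gives r = (0.4, 0.2, 0.4), which the iteration approaches.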
Limits of Link Analysis • META tags / invisible text • Search engines relying on meta tags in documents are often misled (intentionally) by web developers • Pay-for-place • Search engine bias: organizations pay search engines to improve their page rank • Advertisements: organizations pay high-ranking pages for advertising space • The primary effect is increased visibility to end users; a secondary effect is increased respectability due to relevance to a high-ranking page [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
Limits of Link Analysis • Stability • Adding even a small number of nodes/edges to the graph can have a significant impact • Topic drift (similar to TKC, the tightly knit community effect) • A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page • Content evolution • Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
What‘s in a link? 2. „This has something to do with my document / me.“
Co-citation analysis and bibliographic coupling: basic ideas
Matrix Notation • Adjacency matrix A = [matrix not recovered from the slide] * • * http://www.kusatro.kyoto-u.com [Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine]
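With an adjacency matrix, co-citation and bibliographic coupling counts reduce to two matrix products. The sketch below assumes the convention A[i][j] = 1 if document i cites (links to) document j; the four-document matrix and the helper names are invented for illustration. Under that convention, AᵀA counts how many documents cite both i and j (co-citation), and AAᵀ counts how many documents are cited by both i and j (bibliographic coupling).

```python
# Hypothetical citation matrix: A[i][j] = 1 if document i cites document j.
A = [
    [0, 0, 1, 1],  # doc 0 cites docs 2 and 3
    [0, 0, 1, 1],  # doc 1 cites docs 2 and 3
    [0, 0, 0, 1],  # doc 2 cites doc 3
    [0, 0, 0, 0],  # doc 3 cites nothing
]


def transpose(X):
    return [list(row) for row in zip(*X)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


cocitation = matmul(transpose(A), A)  # entry [i][j]: docs citing both i and j
coupling = matmul(A, transpose(A))    # entry [i][j]: docs cited by both i and j

print(cocitation[2][3], coupling[0][1], coupling[0][2])
```

The diagonal entries carry the plain degrees: cocitation[i][i] is the number of citations document i receives, and coupling[i][i] is the number of references it makes.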
Web communities • A community is a collection of web pages created by individuals or any kind of association that share a common interest in a specific topic, such as fan pages of a baseball team or official pages of PC vendors. Blog communities are another example. • Formally: Flake et al.‘s definition of an „ideal community“ • Equivalently: „A Pokemon web site is a site that links to or is linked by more Pokemon sites than non-Pokemon sites.“
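The informal Pokemon statement can be turned into a direct membership test. The sketch below is an illustration under assumptions: links are counted in both directions, the inequality is strict as in the wording above, self-loops are assumed absent, and the toy link list and the name `is_community` are hypothetical.

```python
def is_community(members, links):
    """True if every member has more links to/from members than to/from non-members."""
    members = set(members)
    for v in members:
        # Neighbours over outgoing and incoming links of v.
        neighbours = [w for (u, w) in links if u == v] + [u for (u, w) in links if w == v]
        inside = sum(1 for w in neighbours if w in members)
        outside = len(neighbours) - inside
        if inside <= outside:
            return False
    return True


# Hypothetical Web graph as (source, target) pairs.
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p3", "x"), ("x", "y")]

print(is_community({"p1", "p2", "p3"}, links))
print(is_community({"p3", "x"}, links))
```

Checking a given set is easy; the hard part, which the following slides address, is finding such a set when only part of the Web graph is available.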
Approximate communities • To apply this nice theorem, we would need to have the whole Web on our hard disk! • Realistically, we crawl a part of the Web starting with some pages that are in the community we are interested in. • Questions: • What is crawling? • What pages are retrieved during this crawl? • What other assumptions have to be made?
Crawling • Archives are not always given • Crawling = techniques for assembling archives from the Web • Simple: Unix command-line utility wget • Sophisticated: WIRE (contains analysis), covered next week • Crawling involves graph search
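The graph-search core of a crawler can be sketched as a breadth-first search with a depth limit. Everything here is a simplification: `get_links` is a hypothetical stand-in for fetching a page and extracting its links, and the toy Web is an in-memory dict so that the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_depth):
    """Breadth-first crawl from the seed pages; returns {page: depth reached}."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not expand pages at the depth limit
        for target in get_links(page):
            if target not in depth:  # visit each page only once
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical link structure standing in for the real Web.
toy_web = {"s": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"], "d": []}
pages = crawl(["s"], lambda p: toy_web.get(p, []), max_depth=2)
print(pages)
```

Starting from community pages as seeds, the depth limit determines which part of the Web ends up in the archive: here page d lies three links away from the seed and is not retrieved.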
What is the virtual sink (= the site that is definitely not in the community)? • In the ideal version: • In the approximate version, use an artificial virtual sink (a theorem ensures correctness even if this is not really at the center of the graph)
What's in a link? 3. "This is my boss." • Examples of problems created by such „nepotistic links“: • Web: link farms • Much work since 2000 - http://www.cse.lehigh.edu/~brian/pubs/2000/aaaiws/aaai2000ws.pdf • Science / citation analysis
Outlook: social network analysis • Bibliometrics and link mining have their roots in a much older area: social network analysis • Direct transfer of the link analysis methods we have found: find "opinion leaders" in ciao.de and similar sites • see also viral marketing • Others: analyse communication patterns, prestige, power, ...