Padmini Srinivasan Computer Science Department Department of Management Sciences http://cs.uiowa.edu/~psriniva padmini-srinivasan@uiowa.edu Ch. 13 Structure of the Web
Origins • Origins of WWW (1989/1990: HTTP) • Sir Tim Berners-Lee & Robert Cailliau • First prototype of browser: WorldWideWeb • 1st popular graphical browser: Mosaic (NCSA), Marc Andreessen and others • Mozilla -> Netscape -> Firefox • Lynx • 2000: Internet Explorer • WAIS, Gopher, Veronica • 1994: W3C • 1994: 1st World Wide Web conference • 1995: Yahoo! 1998: Google 2006: Live Search -> Bing
Network Metaphor • Information network: • Different from social network • Notion of a logical document: different • Decentralized, over many computers • Annotation • Network metaphor: “inspired and non-obvious” • Origins in hypertext; hypertext's origins in citation nets • Citation nets: distinctly temporal; is the Web? • Citation maps (popular): co-citation; bibliographic coupling • h-index (Hirsch); g-index; f-index • Patents; legal cases (precedents); medical literature • Indexes: cross-linkages; “see also”; Wikipedia
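The citation measures named on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the toy citation map `cites` and its paper IDs are hypothetical, and `cites[p]` is assumed to hold the set of papers that `p` cites.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each (Hirsch)."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def bibliographic_coupling(cites, a, b):
    """Number of references that papers a and b have in common."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def co_citation(cites, a, b):
    """Number of papers that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


# Hypothetical toy data: P1..P3 are citing papers, A..C are cited papers.
cites = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "P3": {"B", "C"},
}

print(h_index([10, 8, 5, 4, 3]))                  # 4
print(bibliographic_coupling(cites, "P1", "P2"))  # 2
print(co_citation(cites, "A", "B"))               # 2
```

Bibliographic coupling looks backward (shared references), while co-citation looks forward (shared citers), so co-citation counts can grow over time as new papers appear.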
Links/Associations • Directed edges • Friendship nets, name-recognition, business colleagues, collaboration [Erdős number, Bacon number], IM nets, email graphs, etc. • Paths, shortest paths… • Associative memory • Semantic nets, aka conceptual networks (free-association studies) • Vannevar Bush, “As We May Think” (1945), Atlantic Monthly. WW2. MEMEX (on web) • Associative connections between all of knowledge • Acknowledged by most • A way to rechannel human resources
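An Erdős number or Bacon number is a shortest-path length in a collaboration graph. A minimal sketch using breadth-first search follows; the `coauthors` graph and its names are hypothetical, and each undirected edge is assumed to be listed in both directions.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical collaboration graph (undirected: edges appear both ways).
coauthors = {
    "Erdos": ["Alice", "Bob"],
    "Alice": ["Erdos", "Carol"],
    "Bob": ["Erdos"],
    "Carol": ["Alice", "Dave"],
    "Dave": ["Carol"],
    "Eve": [],
}

erdos_number = shortest_path_lengths(coauthors, "Erdos")
print(erdos_number["Dave"])   # 3
print("Eve" in erdos_number)  # False: Eve has no path to Erdos
```

Nodes missing from the result have no path to the source, which is the usual convention for an undefined (infinite) Erdős number.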
Paths and Connectivity • Connected graphs • Path: a sequence of nodes beginning at node X and ending at node Y, in which each consecutive pair of nodes is joined by an edge. • A directed graph is strongly connected if there is a directed path from every node to every other node. • If it is not strongly connected, we need to examine its ‘reachability’ properties. • Easier in an undirected graph: find the connected components • Directed? Find strongly connected components
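One way to test strong connectivity is sketched below, assuming the graph is given as a dictionary mapping each node to a list of its out-neighbors (the example graphs are hypothetical). It relies on the fact that a directed graph is strongly connected exactly when some node can reach every node and every node can reach it.

```python
def reachable(graph, start):
    """Set of nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(graph):
    """Copy of the graph with every edge direction flipped."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev


def is_strongly_connected(graph):
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    if not nodes:
        return True
    start = next(iter(nodes))
    # Forward search: start reaches everyone.
    # Search on the reversed graph: everyone reaches start.
    return (reachable(graph, start) == nodes
            and reachable(reverse(graph), start) == nodes)


cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
chain = {"A": ["B"], "B": ["C"], "C": []}
print(is_strongly_connected(cycle))  # True
print(is_strongly_connected(chain))  # False
```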
Strongly Connected Component • An SCC in a directed graph is a subset of nodes such that • (1) every node in it has a path to every other node in it • (2) the subset is not a part of a larger set of nodes that has the same property. [So it is maximal: no node can be added without losing property (1)] • Why is it interesting to know about such components in the Web?
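The definition above can be turned into a computation. The sketch below uses Kosaraju's two-pass algorithm, which is one standard choice and not necessarily the method used in the studies discussed here; the example graph is hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; returns a list of node sets, one per SCC."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}

    # Pass 1: iterative DFS, recording nodes in order of finishing time.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(graph.get(child, ()))))
                    break
            else:
                # All children explored: this node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    rev = {node: [] for node in nodes}
    for node, targets in graph.items():
        for target in targets:
            rev[target].append(node)

    # Pass 2: search the reversed graph in decreasing finishing time.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        component = {root}
        assigned.add(root)
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Hypothetical example: a 3-cycle, a 2-cycle, and a lone page linking in.
graph = {
    "A": ["B"], "B": ["C"], "C": ["A", "D"],
    "D": ["E"], "E": ["D"],
    "F": ["A"],
}
sccs = strongly_connected_components(graph)
print(sorted(sorted(c) for c in sccs))
# [['A', 'B', 'C'], ['D', 'E'], ['F']]
```

Note that F forms an SCC by itself: it links into the 3-cycle, but nothing links back to it.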
Bow-Tie Structure of the Web • 1999 Andrei Broder (now Yahoo!), then AltaVista • SCC; IN; OUT; Tendrils; Tubes; Disconnected • Macro-model • Properties of a reasonable model: • Should have a succinct and fairly natural description • Should be rooted in a plausible macro-level process for the creation of Web content • Should not require some prior static set of topics • Should reflect many of the structural phenomena observed in the Web
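The bow-tie regions can be derived from reachability relative to the largest SCC: IN is everything that can reach the core, OUT is everything the core can reach. The sketch below is a simplified illustration on a hypothetical toy graph, not the procedure used on real crawls. It lumps tendrils, tubes, and disconnected pages into one `OTHER` bucket, and it finds the largest SCC with a simple quadratic method that would not scale to Web-sized graphs.

```python
def reach(adjacency, sources):
    """All nodes reachable from any node in sources (sources included)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(graph):
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rev = {}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)

    # Largest SCC: the SCC of v is (reachable from v) & (can reach v).
    core, unassigned = set(), set(nodes)
    while unassigned:
        v = next(iter(unassigned))
        component = reach(graph, [v]) & reach(rev, [v])
        unassigned -= component
        if len(component) > len(core):
            core = component

    out_part = reach(graph, core) - core  # reachable from the core
    in_part = reach(rev, core) - core     # can reach the core
    other = nodes - core - in_part - out_part
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}


# Hypothetical toy web: a 3-page core, one IN page, two OUT pages,
# a tendril hanging off IN, and an isolated page.
web = {
    "in1": ["a", "t1"],
    "a": ["b"], "b": ["c"], "c": ["a", "out1"],
    "out1": ["out2"],
    "island": [],
}
parts = bow_tie(web)
print({name: sorted(members) for name, members in parts.items()})
# {'SCC': ['a', 'b', 'c'], 'IN': ['in1'], 'OUT': ['out1', 'out2'],
#  'OTHER': ['island', 't1']}
```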
Similar Studies • Donato et al., ACM TOIT, 2007. The Web as a Graph: How Far We Are • WebBase, 200 million Stanford crawl • 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million); next-largest SCC: 10 thousand!
Similar Studies • Buriol et al. (includes Donato): Temporal analysis of Wikigraph.
Bow-Tie • Why a single SCC? Why not two large ones? • Any other explanations? • Interlinked world? • Hard to be disconnected? • What about a new page? • Is the SCC static/fixed? How does it change? • Are links permanent? (2004: 25% remain after 1 year and 50% of pages stay the same; Ntoulas et al., 2004) • Many naturally occurring graphs have a giant SCC • IM (nodes: people; links: messages): almost all nodes are in the SCC; median path length is 7, mean 6.6.
Bow-Tie: points to note • Incomplete picture • Doesn’t tell you how this structure is generated, just that it exists. • Macro model: • Thematic collections; differences? • Organization-specific collections • Regional: economic incentives/disincentives? • Community-based: education levels? • Bipartite cliques (small in size, many in number) • Fans pointing to centers • Will it always be observed? How about now?
Web 2.0 • “an attitude, not a technology” • Collaboration/collective maintenance • Annotation, tags, links, editing, revisions • Data generated by individuals for individual and group sharing: Flickr, Gmail • Connections between entities beyond “documents” • Social feedback key; ‘wisdom of crowds’; long tail
Web Links • Navigational – static pages – passive services • Transactional – dynamic / computational services. Deep web • Search engines – heuristics • What kinds of rules would you use? • Implications for crawlers
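As one possible answer to the question of what rules to use, here is a deliberately crude sketch of a link classifier. Every rule in it is an assumption made for illustration (the `TRANSACTIONAL_HINTS` keywords, the query-string test, and the `example.org` URLs are all hypothetical), and it does not describe how any real search engine or crawler classifies links.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical path keywords that suggest a dynamic/computational service.
TRANSACTIONAL_HINTS = ("search", "login", "cart", "checkout", "query")


def classify_link(url):
    """Guess whether a link is navigational or transactional."""
    parts = urlparse(url)
    # Rule 1 (assumed): query parameters suggest a dynamically generated page.
    if parse_qs(parts.query):
        return "transactional"
    # Rule 2 (assumed): certain path keywords suggest a service endpoint.
    path = parts.path.lower()
    if any(hint in path for hint in TRANSACTIONAL_HINTS):
        return "transactional"
    return "navigational"


print(classify_link("http://example.org/about.html"))        # navigational
print(classify_link("http://example.org/search?q=bow+tie"))  # transactional
print(classify_link("http://example.org/cart"))              # transactional
```

A crawler built on rules like these might follow navigational links freely but skip or rate-limit transactional ones, which is part of why deep-web content is hard to reach by crawling alone.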
Summary • Web: origins, network metaphor • Citations, MEMEX • Paths • Structures (macro) • SCC • Bow-Tie model • Next • Ch 14: Hubs and Authorities; PageRank