400 likes | 546 Views
Studying Blogspace. Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com. Etymology. From the OED new ed. (draft entry, Mar 2003) … blog intr. To write or maintain a weblog. Also: to read or browse through weblogs, esp. habitually.
E N D
Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com
Etymology From the OED new ed. (draft entry, Mar 2003) … blog intr. To write or maintain a weblog. Also: to read or browse through weblogs, esp. habitually. web¢logn. 2. A frequently updated web site consisting of personal observations, excerpts from other sources, etc., typically run by a single person, and usually with hyperlinks to other sites; an online journal or diary. From WWW 2003 (Kumar, Novak, Raghavan, Tomkins) … blog¢spacen. The collection of weblogs; = blogosphere, blogsphere, blogistan, …
Blogs 101 • Characteristics • Pages with reverse chronological sequences of dated entries • Usually contain a persistent sidebar containing profile (and other blogs read by the author – “blogroll”) • Usually maintained and published by one of the common variants of public-domain blog software • From Slashdot, 1999 “… a new, personal, and determinedly non-hostile evolution of the electric community. They are also the freshest example of how people use the Net to make their own, radically different new media”
Look and feel • Quirky • Highly personal • Consumed by a small number of regular repeat visitors • Often updated multiple times each day • Highly interwoven into a network of small but active micro-communities • Eg: LiveJournal, Xanga, DeadJournal, Blogger, Memepool, …
The blog era • Blogs began in 1996, but exploded in popularity in 1999 • Proliferation of authoring tools • Newsweek 2002 estimates ~500K • LiveJournal 2005 estimates ~3.5M • Annual Blogathon for charity • Bloggers update their Blogs every 30m for 24h • Sponsors pay … • Impact of blogs • “Miserable failure” on Google
Structural study(Kumar, Novak, Raghavan, Tomkins, CACM 2004)
Livejournal blogspace • Livejournal.com: popular blog site • 1.3M bloggers (Feb 2004) • 3.5M bloggers (Apr 2005) • Each blogger has a profile • Name, age, … • Geographic information (city, state, zip, …) • Friends and friend of • Interests/communities
LJ bloggers in US < 1K < 5K < 10K < 25K < 50K ~ 100K
LJ bloggers world-wide < 1K < 2K < 5K ~ 25K ~ 50K ~ 75K
Who are they? Age % Representative interests
Friendship graph • Directed • 80% mutual • Average degree ~ 14 • Power law degrees • Clustering coeff. ~ 0.2 • Most friendships explained by age, location, interest Age 1% 5% 16% Location 20% Interest 16% 22%
Evolutionary study(Kumar, Novak, Raghavan, Tomkins, WWW 2003)
Blogs and evolution • Every blog contains a dated record of • Every word ever written to the blog • Every link ever added in the blog • Blogs are an increasingly important medium, but • Few systematic studies have been performed • Such study should take an evolutionary perspective [Brewington et al] [Bharat et al] [Fetterly et al] [Cho et al] • Tools for understanding evolution not fully understood
Time graphs Jan v1 v2 Feb Mar v3 v1 v2 Apr time May v4 v3 v4 Jun Jul Aug Underlyinggraph Time graph
Community evolution in blogs • What are the communities within the time graph? • Community definition, extraction • Graph-based methods (trawling) [Kumar Raghavan Rajagopalan Tomkins, WWW 99] • How active are these communities, and over what timeframe? • Burst analysis[Kleinberg, KDD 02]
Community extraction • Community analysis based on graph structure • Idea: there are many subgraphs that would never occur in a random graph – if we find such subgraphs, there must be some reason • In blogspace, we enumerate dense subgraphs using a greedy heuristic
Dense subgraph enumeration(heuristic) • Scan edges, find triangles • For each triangle, greedily grow its neighbor set • Growth is allowable based on a measure of connectivity to the current dense subgraph • Extracted “communities” are not unique
Bursts: Static to dynamic communities • Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity” • Model source as multi-state • Each state has certain emission properties • Traversal between states is controlled by a Markov model • Determine most likely underlying state sequence over time, given observable output
An example State 2: Output rate: very high State 1: Output rate: very low 0.01 1 2 0.005 I’ve been thinking about your idea with the asparagus… Uh huh I think I see… Uh huh Yeah, that’s what I’m saying… So then I said “Hey, let’s give it a try” And anyway she said maybe, okay? Time Most likely “hidden” sequence
Some experiments • Crawled 24,109 blogs from popular sites (2003) • Extract archive links from blogs • Extract all dates on blog pages, and tag each word and link with a date • Simple heuristics to automatically extract time-stamps from entries (regular expressions, training, …) • Obtained dates for ~90% of edges
Experiments (contd.) • The time graph • 22,299 nodes, 70,472 unique edges • 0.77M multiedges (average edge multiplicity = 11) • Consider graphs formed by prefixes from Jan 1, 1999 to some later month – generate 47 “prefix graphs” for analysis • Enumerate communities and analyze their burstiness
SCC growth Largest SCC as fraction of all nodes 2nd and 3rd largest SCCs as fraction of all nodes
Connectivity in Blogspace Number of nodes participating in a community Number of communities Fraction of nodes participating In some community
Burstiness of communities Number of communities in “high state” during each time period
Are these results fluke? • “Randomized Blogspace”: A distribution over time graphs that look very much like the time graph of Blogspace, but remove some of the locality of the true graph • Vertices and edges arrive at the same times, each edge has the same source, but a randomly-chosen destination • If randomized blogspace behaves like blogspace, then community structure is a fake
SCC evolution Blogspace Randomized Blogspace Randomized Blogspace forms an SCC much earlier
Community evolution Blogspace Randomized Blogspace Blogspace has many more communities
Exogenous events Number of blog pages that belong to a community Number of blog communities Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace Wired magazine publishes an article on weblogs that impacts the tech community NewsWeek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption
Some questions … • Modeling • Edge arrivals • `Interesting’ events • Algorithms • Prediction • Information percolation • Search • o(t ¢ T(n)) • Studies • Sociological • Effect on search and ranking
Prediction via blogs(Gruhl, Guha, Kumar, Novak, Tomkins, 2005)
Blogs as trend indicators • Can blogs be used to predict trends? • Data • Amazon sales rank of some books • Blog chatter in an index • Questions • How well do they correlate? • Can sales rank be predicted using blogs?
The Lance Armstrong Performance Program Query: Lance Armstrong OR Tour de France
Simple inferences • How to formulate queries automatically • Depends on the object (book, movie, …) • Simple heuristics work well • Predicting sales motion is hard • Predicting spikes appears relatively easier • More to be done …
Blogs and social networks(Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)
Social networks • Blog friendship graph is a social network • Is there a simple model to describe this network? • Desiderata • Fit experimental observations • Exhibit “six-degrees of separation” • Theoretically tractable
RBF: Rank-Based Friendship • Population network model • Each person has a geographic location • d(¢, ¢) = measures geographic distance • rankA(B) = #{ C : d(A, C) < d(A, B) } • Pr[A “befriends” B] / 1/rankA(B) • Independent of distance • Works with arbitrary population densities • Plus local links to neighbors
RBF: Preliminary results • Fits LiveJournal friendship experimental graph data (using geo data in the profile) • Greedy routing: Is able to route messages from source to destination most of the time, just using geographic information • Theoretical analysis: Can show that this model guarantees geographic routing to work