Stochastic Models of User-Contributory Web Sites

Tad Hogg (HP Labs) and Kristina Lerman (USC Information Sciences Institute)

The Social Web • Example sites: Bugzilla, Essembly, Delicious • “wisdom of crowds”

Activities • View existing content • Rate existing content • simple: vote • complex: write a review • Add new content • Link to other users • [Callout: focus of this presentation]

Aggregate group behavior • Determines structure and usefulness of user-participatory sites • Models enable • Predicting trends or behaviors • E.g., which newly contributed content will become popular • Designing web sites • E.g., productive information displays • Altering user incentives • E.g., improve content quality or participation

Stochastic Modeling summary • Start with individual user behavior • Specify states and transitions between states • Determine collective behavior • Aggregate behavior of interest • Individual user behaviors create transitions among aggregate states • Rate equations give dynamics • How average collective behavior changes in time • How collective behavior depends on user characteristics

Illustration – Stochastic Model of Digg • Phenomenology of Digg • Users submit and vote on news stories • Digg promotes popular stories to front page • Digg allows social networking • Users can designate Friends • and view their friends’ activity on Digg • Directed social network • Friends of user A are everyone A is watching • Fans of A are all users who are watching A • [Diagram: Bob is Alice’s friend; Alice is Bob’s fan]

Lifecycle of a story • User submits a story to the Upcoming Stories queue • Others vote on (digg) the story • If story accumulates enough votes in short time, it is promoted to the Front page • The Friends Interface lets users see • Stories friends submitted • Stories friends voted on, …

Model of Digg voting behavior • Stochastic model based on Digg user interface • visibility and interestingness → votes • Extension to prior model: [Lerman 2007] • “law of surfing” for viewing web pages [Huberman et al. 1998] • instead of geometric distribution • incremental average growth in number of voters’ fans • i.e., people who can see story via friends interface • Related work: aggregate phenomenological models • behavior for Digg, Wikipedia, YouTube, … • e.g., [Wu & Huberman 2007; Crane & Sornette 2008; Wilkinson 2008]

Voting on stories • [Diagram: user comes to Digg → see the story? → (yes) vote on the story?] • combination of • visibility: does user see the story? • user interface • browse • recommended by friends • search • interest: does user like the story? • novelty, …

Story location • Digg shows stories as lists • most recent first • 15 stories per page • user must click to view subsequent pages • visibility decreases with distance from top of list • A given story • moves down the list as new stories added • eventually moves to later pages • switches from upcoming to top of front page if promoted

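User behavioral model • [Diagram: state-transition diagram of user behavior; states include upcoming1 … upcomingq, front1 … frontp, friends, vote, and Ø; transitions are grouped under the labels “visibility” and “interest”]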

Dynamical model of aggregate behavior • How the number of votes N_vote(t) for a story changes • ν_f: rate users find the story on the front page queue • ν_u: rate users find the story on the upcoming stories queue • ν_friends: rate users find the story through the friends interface • (these three rates together describe visibility) • r: fraction of users who see the story and choose to vote for it
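The rate equation itself did not survive on this slide. A minimal sketch, assuming the vote rate is the interest fraction r times the sum of the three visibility rates, i.e. dN_vote/dt = r (ν_f + ν_u + ν_friends). The function names, the forward Euler step, and the constant rates in the usage note are illustrative assumptions, not the authors' code.

```python
def vote_rate(r, nu_front, nu_upcoming, nu_friends):
    """Assumed form of the rate equation: dN_vote/dt = r * (sum of visibility rates)."""
    return r * (nu_front + nu_upcoming + nu_friends)


def integrate_votes(r, rates, hours, dt=0.1, n0=0.0):
    """Forward-Euler integration of the assumed rate equation.

    rates: callable t -> (nu_front, nu_upcoming, nu_friends), in users per hour
           (hypothetical interface chosen for this sketch).
    """
    n = n0
    t = 0.0
    for _ in range(int(round(hours / dt))):
        n += dt * vote_rate(r, *rates(t))
        t += dt
    return n
```

For example, with made-up constant rates of 10 users/hr via the upcoming queue and 2 users/hr via friends, and r = 0.1, `integrate_votes(0.1, lambda t: (0.0, 10.0, 2.0), 10.0)` accumulates about 12 votes over 10 hours.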

Estimating model parameters • Need model parameters for • Story visibility • Story interestingness • Estimate from behavior of sample of users

Digg data set • Stories from front and upcoming pages • number of votes vs. time since submission • for several days in May 2006 • prior to availability of Digg API • sampled more extensively from front than upcoming pages • Number of fans for active users • 2152 stories with at least 4 observations • submitted by 1212 distinct users • 510 of these stories promoted to front page

Story visibility • User viewing behavior not available: • which stories users look at • how they find stories • front page, friends interface, … • Estimate indirectly from models & data

Modeling story visibility • Story location • Navigating web sites • Number of fans

Story location vs. time in each list • [Plots: example stories’ location vs. time, upcoming q(t) and front page p(t)] • For upcoming and front page lists: • location on page (1 to 15), which page (1st, 2nd, …) • distance from top of list increases linearly with time • Rate story position increases: • front page: ~0.2 pages/hr • upcoming: ~4 pages/hr • 1/15th the rates new stories are • promoted to front page (~3/hr) • submitted as new stories (~60/hr) • since each page holds 15 stories • Averages over hourly variation • [Szabo & Huberman 2008]
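The arithmetic behind these rates can be sketched as follows. The function names are hypothetical, and the sketch assumes a story starts at the top of page 1 and that new stories arrive at a constant average rate, ignoring the hourly variation the slide mentions.

```python
STORIES_PER_PAGE = 15  # from the slide: each page holds 15 stories


def page_rate(new_stories_per_hour, per_page=STORIES_PER_PAGE):
    """Pages per hour a story moves down a list: arrival rate / stories per page."""
    return new_stories_per_hour / per_page


def page_at(hours, new_stories_per_hour, per_page=STORIES_PER_PAGE):
    """1-based page holding the story after `hours`, under the constant-rate assumption."""
    stories_above = int(new_stories_per_hour * hours)  # newer stories listed above it
    return 1 + stories_above // per_page
```

With ~60 submissions/hr the upcoming list advances at `page_rate(60)` = 4 pages/hr, and with ~3 promotions/hr the front page advances at roughly 0.2 pages/hr, matching the figures quoted on the slide.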

Story location: promotion to front page • Digg promotion decision algorithm not public • based on popularity expressed by user votes • Approximation from data: • story promoted if • at least 40 votes within 24 hours of submission
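The approximate promotion rule can be written as a small predicate. This is a sketch of the slide's data-derived approximation only, not Digg's actual (non-public) algorithm; the function name and the observation format are assumptions made for illustration.

```python
def is_promoted(observations, vote_threshold=40, window_hours=24.0):
    """Approximate promotion rule from the slide.

    observations: iterable of (hours_since_submission, cumulative_votes) pairs
                  (hypothetical data layout).
    Returns True if the story has at least `vote_threshold` votes at some
    observation taken within `window_hours` of submission.
    """
    return any(
        hours <= window_hours and votes >= vote_threshold
        for hours, votes in observations
    )
```

For instance, a story observed with 42 votes at 10 hours satisfies the rule, while one that only reaches 100 votes at 30 hours does not.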

Modeling story visibility • Story location • Navigating web sites • Number of fans

Navigating through a web site • Empirical model of user following links on a Web site • “law of surfing” [Huberman et al. 1998] • Inverse Gaussian distribution of #pages viewed before leaving web site • few users go beyond 1st page • parameters estimated from Digg data & model
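To make the "law of surfing" concrete, the sketch below evaluates the standard inverse Gaussian cumulative distribution with mean mu and shape lam, and uses it to estimate the fraction of users who view more than a given number of pages. Treating the page count as a continuous variable is a simplifying assumption, the function names are hypothetical, and any parameter values passed in are placeholders rather than the values estimated from the Digg data.

```python
import math


def _std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_gaussian_cdf(x, mu, lam):
    """CDF of the inverse Gaussian distribution with mean mu and shape lam."""
    if x <= 0:
        return 0.0
    a = math.sqrt(lam / x)
    return (_std_normal_cdf(a * (x / mu - 1.0))
            + math.exp(2.0 * lam / mu) * _std_normal_cdf(-a * (x / mu + 1.0)))


def fraction_viewing_beyond(pages, mu, lam):
    """Estimated fraction of users who view more than `pages` pages."""
    return 1.0 - inverse_gaussian_cdf(pages, mu, lam)
```

The fraction falls quickly with page number for small mu, which is the qualitative point of the slide: few users go beyond the first page.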

Modeling story visibility • Story location • Navigating web sites • Number of fans: visibility via friends interface

Story visibility via friends interface • Each voter enables their fans to see story • via friends interface • Model of number of fans not yet viewing story, s(t) • based on number of votes on the story • story visible to submitter’s fans at submission time: s(0) • s(t) decreases as fans of prior voters visit Digg • s(t) increases with new fans from new votes
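The equation for s(t) is missing from the transcript. One plausible form, assumed here purely for illustration, is ds/dt = -ω s + a dN_vote/dt, where ω is the rate at which fans visit Digg and a is the average number of new fans gained per vote; both symbols and the function name are assumptions. The sketch couples it to voting through the friends interface alone (ν_friends = ω s), ignoring the front page and upcoming queues.

```python
def simulate_friends_only(s0, r, omega, a, hours, dt=0.01):
    """Forward-Euler sketch of votes arriving only via the friends interface.

    s0:    fans of the submitter at submission time, s(0)
    r:     fraction of viewers who vote (interestingness)
    omega: assumed rate (per hour) at which a fan visits Digg and sees the story
    a:     assumed average number of new fans exposed per new vote
    Returns (fans not yet seeing the story, cumulative votes) after `hours`.
    """
    s = float(s0)
    votes = 0.0
    for _ in range(int(round(hours / dt))):
        new_votes = r * omega * s * dt        # viewers arriving via friends who vote
        s += -omega * s * dt + a * new_votes  # fans leave the pool; new voters add fans
        votes += new_votes
    return s, votes
```

Under these assumptions, and when a·r < 1, the pool of unexposed fans decays exponentially and the total votes level off at r·s(0)/(1 - a·r), so a story relying only on the friends interface stalls unless it reaches the promotion threshold first.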

Story interestingness • Reasons users vote for story not available, e.g., • topic • novelty [Wu & Huberman 2007] • popularity (determining interest, not just visibility) • e.g., “cool” fashion or gadgets • … • One approach: web-based experiments • e.g., [Salganik et al. 2006] • Estimate from models & data • from vote history after accounting for visibility

Model results

Solutions: votes vs. time • model vs. observations for 6 stories • model captures qualitative features • slow growth initially • influence of fans on promotion • rapid growth if story promoted (much more visible to users)

Model: requirements for promotion • Values of S and r to get the story on front page • [Plot labels: promotion time, number of votes, number of fans not yet seeing story, 40-vote promotion threshold]

Promotion to front page • model prediction vs. data: 95% accurate • [Plot: promotion threshold from model, logarithmic scale] • most stories are not promoted and come from people with no fans

Additional model insights • Heterogeneity • user activity • content quality (“interestingness”) • Predictability from early reactions to new story

Story interestingness • Long-tail distribution (lognormal) • a few stories much more interesting than average • after accounting for visibility via user interface part of model • [Plots: distribution of estimated interestingness values with lognormal fit; quantile-quantile plot shows good fit; good fit with Kolmogorov-Smirnov test] • Open question: why? • A multiplicative process underlying user interests?
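A lognormal fit and a Kolmogorov-Smirnov check of the kind mentioned on this slide can be sketched with the standard library alone. The function names are hypothetical, the fit uses the mean and population standard deviation of the log values, and the sketch returns only the KS distance, not a p-value; it is not the authors' analysis code.

```python
import math
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(values):
    """Estimate (mu, sigma) of log(values); values must be positive and not all equal."""
    logs = [math.log(v) for v in values]
    return fmean(logs), pstdev(logs)


def ks_distance(values, mu, sigma):
    """Largest gap between the empirical CDF and the fitted lognormal CDF."""
    fitted = NormalDist(mu, sigma)
    logs = sorted(math.log(v) for v in values)
    n = len(logs)
    gap = 0.0
    for i, x in enumerate(logs):
        f = fitted.cdf(x)
        # compare against the empirical CDF just before and just after x
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap
```

A small KS distance for the estimated r values would be consistent with the lognormal shape reported on the slide; judging significance would additionally need the KS critical values, which this sketch omits.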

Predictions from early behavior • Estimate story interestingness • from full history, or • using initial votes • Behavior predictable from early reaction to story • also with YouTube • e.g., [Crane & Sornette 2008; Lerman & Galstyan 2008; Szabo & Huberman 2008] • Example: use first 4 observations • r estimates correlate 0.9 with those based on full history • prediction of final votes accounts for 75% of variance • rms prediction error: 244 votes
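The two comparison measures quoted here, a correlation between early and full-history estimates and an rms prediction error, can be computed as in the sketch below. The function names and any input data are hypothetical; the slide's numbers come from the authors' data set, not from this code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rms_error(predicted, actual):
    """Root-mean-square difference between predicted and actual values."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

Here `pearson` would be applied to the r estimates from the first 4 observations versus those from the full history, and `rms_error` to predicted versus observed final vote counts.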

Model based on votes only? • [Diagram: the voting flow (user comes to Digg, see the story?, vote on the story?)] • Estimate based on initial votes only • not including visibility model • i.e., ignore effects of “law of surfing” and social network

Model based on votes only? • full model is better than the model without visibility (differences significant, p-value < 10^-4)

Future work on models of activities: new content & links • View existing content • Rate existing content • Add new content • What motivates high-quality contribution? • Link to other users • How do users choose whom to link to? • What does a link signify? • common interests? • trust in recommendations? • [Callout: focus of this presentation]

Conclusion • Stochastic process approach • connect user and system behaviors • Applicability: • users have limited information and actions • limited use of personalized history • e.g., user communities on the web • not face-to-face small group interactions • Example: news aggregator Digg • votes from visibility + interestingness • user model from info and actions provided by Digg UI
