This presentation focuses on the use of stochastic models to understand and predict the behavior of user-participatory web sites, such as Digg. The models enable the prediction of trends and behaviors, as well as the design of effective web sites and incentives for users. The presentation includes an illustration of a stochastic model of Digg's voting behavior.
Stochastic Models of User-Contributory Web Sites • Tad Hogg, HP Labs • Kristina Lerman, USC Information Sciences Institute
The Social Web • Examples: Bugzilla, Essembly, Delicious • “wisdom of crowds”
Activities • View existing content • Rate existing content • simple: vote • complex: write a review • Add new content • Link to other users • [Callout on slide: focus of this presentation]
Aggregate group behavior • Determines structure and usefulness of user-participatory sites • Models enable • Predicting trends or behaviors • E.g., which newly contributed content will become popular • Designing web sites • E.g., productive information displays • Altering user incentives • E.g., improve content quality or participation
Stochastic Modeling summary • Start with individual user behavior • Specify states and transitions between states • Determine collective behavior • Aggregate behavior of interest • Individual user behaviors create transitions among aggregate states • Rate equations give dynamics • How average collective behavior changes in time • How collective behavior depends on user characteristics
Illustration – Stochastic Model of Digg • Phenomenology of Digg • Users submit and vote on news stories • Digg promotes popular stories to front page • Digg allows social networking • Users can designate Friends • and view their friends’ activity on Digg • Directed social network • Friends of user A are everyone A is watching • Fans of A are all users who are watching A • [Diagram: Bob is Alice’s friend; Alice is Bob’s fan]
Lifecycle of a story • User submits a story to the Upcoming Stories queue • Others vote on (digg) the story • If the story accumulates enough votes in a short time, it is promoted to the Front page • The Friends Interface lets users see • Stories friends submitted • Stories friends voted on, …
Model of Digg voting behavior • Stochastic model based on Digg user interface • visibility and interestingness determine votes • Extension to prior model [Lerman 2007]: • “law of surfing” for viewing web pages [Huberman et al. 1998] • instead of geometric distribution • incremental average growth in number of voters’ fans • i.e., people who can see story via friends interface • Related work: aggregate phenomenological models • behavior for Digg, Wikipedia, YouTube, … • e.g., [Wu & Huberman 2007; Crane & Sornette 2008; Wilkinson 2008]
Voting on stories • combination of • visibility: does user see the story? • user interface • browse • recommended by friends • search • interest: does user like the story? • novelty, … • [Flowchart: user comes to Digg → see the story? → (yes) vote on the story?]
Story location • Digg shows stories as lists • most recent first • 15 stories per page • user must click to view subsequent pages • visibility decreases with distance from top of list • A given story • moves down the list as new stories are added • eventually moves to later pages • switches from upcoming to top of front page if promoted
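User behavioral model • [State diagram: states upcoming 1 … q, front 1 … p, friends, vote, and Ø (no vote); the transitions into the page and friends states are labeled “visibility”, the transition to vote is labeled “interest”; edge labels include r and wS]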
Dynamical model of aggregate behavior • How number of votes N_vote(t) for a story changes • n_f - rate users find story on the front page queue • n_u - rate users find story on the upcoming stories queue • n_friends - rate users find story through the friends interface (n_f, n_u and n_friends together describe visibility) • r - fraction of users who see the story and choose to vote for it
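The rate equation itself did not survive on this slide. Reading the definitions above, a natural form is dN_vote/dt = r (n_f + n_u + n_friends); this is an assumption, not a quotation from the slide. The sketch below integrates that form with a simple Euler step. The function name, the constant rates, and the initial count of one vote are hypothetical choices for illustration only.

```python
def integrate_votes(r, n_f, n_u, n_friends, hours, dt=0.01, n0=1.0):
    """Euler-integrate the assumed form dN_vote/dt = r * (n_f + n_u + n_friends).

    n_f, n_u, n_friends: functions of time t (hours) giving the rate, in users
    per hour, at which users find the story through each route.
    r: fraction of users seeing the story who vote for it.
    n0: initial vote count (assumed here to be the submitter's own vote).
    """
    steps = int(round(hours / dt))
    n_vote = n0
    t = 0.0
    for _ in range(steps):
        visibility = n_f(t) + n_u(t) + n_friends(t)  # total rate of users seeing the story
        n_vote += r * visibility * dt
        t += dt
    return n_vote


# Hypothetical, constant rates: story still on the upcoming list, not yet promoted.
votes = integrate_votes(
    r=0.1,
    n_f=lambda t: 0.0,
    n_u=lambda t: 20.0,
    n_friends=lambda t: 5.0,
    hours=10,
)
print(round(votes, 3))
```

In the model described in the slides the three rates change over time (page position, promotion, fan pool), which is why they are passed as functions rather than numbers.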
Estimating model parameters • Need model parameters for • Story visibility • Story interestingness • Estimate from behavior of sample of users
Digg data set • Stories from front and upcoming pages • number of votes vs. time since submission • for several days in May 2006 • prior to availability of Digg API • sampled more extensively from front than upcoming pages • Number of fans for active users • 2152 stories with at least 4 observations • submitted by 1212 distinct users • 510 of these stories promoted to front page
Story visibility • User viewing behavior not available: • which stories users look at • how they find stories • front page, friends interface, … • Estimate indirectly from models & data
Modeling story visibility • Story location • Navigating web sites • Number of fans
Story location vs. time in each list • For upcoming and front page lists: • location on page (1 to 15), which page (1st, 2nd, …) • distance from top of list increases linearly with time • Rate at which story position increases: • front page: ~0.2 pages/hr • upcoming: ~4 pages/hr • i.e., 1/15 of the rates at which stories are • promoted to front page (~3/hr) • submitted as new stories (~60/hr) • since each page holds 15 stories • Averages over hourly variation • [Szabo & Huberman 2008] • [Plots: example story positions vs. time, upcoming page q(t) and front page p(t)]
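The arithmetic on this slide can be checked directly: a list gaining k new stories per hour pushes a story down k/15 pages per hour. The sketch below assumes a story starts at the top of page 1 and uses a fractional page number; both are illustrative assumptions, and the function names are hypothetical.

```python
STORIES_PER_PAGE = 15  # from the slide: each page holds 15 stories


def page_rate(new_stories_per_hour):
    """Pages per hour a story moves down a list that gains new stories at the given rate."""
    return new_stories_per_hour / STORIES_PER_PAGE


def page_at(hours_in_list, new_stories_per_hour):
    """Fractional page number after a given time, assuming the story starts on page 1."""
    return 1.0 + page_rate(new_stories_per_hour) * hours_in_list


front_rate = page_rate(3)      # ~3 promotions/hr  -> about 0.2 pages/hr
upcoming_rate = page_rate(60)  # ~60 submissions/hr -> about 4 pages/hr
print(front_rate, upcoming_rate, page_at(2, 60))
```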
Story location: promotion to front page • Digg promotion decision algorithm not public • based on popularity expressed by user votes • Approximation from data: • story promoted if • at least 40 votes within 24 hours of submission
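The approximate rule above can be written as a small function. This is a sketch of the slide's approximation only, not Digg's actual (non-public) algorithm; the function name and the representation of votes as hours since submission are assumptions.

```python
def promotion_time(vote_times_hours, threshold=40, window_hours=24.0):
    """Return the time (hours since submission) at which the story reaches
    `threshold` votes, or None if that does not happen within `window_hours`.
    """
    times = sorted(vote_times_hours)
    if len(times) < threshold:
        return None  # never reaches the threshold
    t_threshold = times[threshold - 1]  # time of the threshold-th vote
    return t_threshold if t_threshold <= window_hours else None


fast_story = [0.5 * i for i in range(1, 41)]  # 40 votes, one every 30 minutes
slow_story = [1.0 * i for i in range(1, 41)]  # 40 votes, one every hour
print(promotion_time(fast_story), promotion_time(slow_story))
```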
Modeling story visibility • Story location • Navigating web sites • Number of fans
Navigating through a web site • Empirical model of user following links on a Web site • “law of surfing” [Huberman et al. 1998] • Inverse Gaussian distribution of #pages viewed before leaving web site • few users go beyond 1st page • parameters estimated from Digg data & model
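To illustrate how an inverse Gaussian gives a steep drop in visibility with page number, the sketch below evaluates the standard inverse Gaussian CDF and takes the fraction of users reaching page m as 1 - CDF(m - 1). That discretization and the parameter values (mu = 0.6, lam = 0.6) are assumptions for illustration; the slide does not state the fitted parameters.

```python
import math


def _normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_gaussian_cdf(x, mu, lam):
    """CDF of the inverse Gaussian distribution with mean mu and shape lam."""
    if x <= 0:
        return 0.0
    a = math.sqrt(lam / x)
    return (_normal_cdf(a * (x / mu - 1.0))
            + math.exp(2.0 * lam / mu) * _normal_cdf(-a * (x / mu + 1.0)))


def fraction_reaching_page(m, mu, lam):
    """Assumed discretization: fraction of users who view at least m pages."""
    return 1.0 - inverse_gaussian_cdf(m - 1, mu, lam)


# Hypothetical parameters, chosen only to show the shape of the curve.
fractions = [fraction_reaching_page(m, mu=0.6, lam=0.6) for m in range(1, 6)]
for page, frac in enumerate(fractions, start=1):
    print(page, round(frac, 4))
```

With these illustrative values every user sees page 1 and the fraction falls off quickly on later pages, matching the slide's remark that few users go beyond the first page.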
Modeling story visibility • Story location • Navigating web sites • Number of fans: visibility via friends interface
Story visibility via friends interface • Each voter enables their fans to see story • via friends interface • Model of number of fans not yet viewing story, s(t) • based on number of votes on the story • story visible to submitter’s fans at submission time: s(0) • s(t) decreases as fans of prior voters visit Digg • s(t) increases with new fans from new votes
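The equation for s(t) is not preserved on this slide. One form consistent with the two effects described (fans visiting Digg leave the pool, new votes add fans to it) is ds/dt = -w s + a dN_vote/dt. This form, and the names w (rate at which waiting fans visit) and a (average new fans per vote), are assumptions for illustration. A minimal Euler sketch:

```python
def simulate_fans(s0, w, a, vote_rate, hours, dt=0.01):
    """Euler-integrate the assumed form ds/dt = -w * s + a * vote_rate(t).

    s0: fans able to see the story at submission (the submitter's fans).
    w: hypothetical per-hour rate at which waiting fans visit Digg.
    a: hypothetical average number of new fans added per vote.
    vote_rate: function of time (hours) giving votes per hour.
    """
    s = s0
    t = 0.0
    for _ in range(int(round(hours / dt))):
        ds = (-w * s + a * vote_rate(t)) * dt
        s = max(0.0, s + ds)  # the pool of waiting fans cannot go negative
        t += dt
    return s


# No new votes: the pool of waiting fans simply drains.
decayed = simulate_fans(s0=100.0, w=0.1, a=5.0, vote_rate=lambda t: 0.0, hours=10)
# Steady votes: the pool settles where inflow a * rate balances outflow w * s.
steady = simulate_fans(s0=0.0, w=0.1, a=5.0, vote_rate=lambda t: 2.0, hours=200)
print(round(decayed, 2), round(steady, 2))
```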
Story interestingness • Reasons users vote for story not available, e.g., • topic • novelty [Wu & Huberman 2007] • popularity (determining interest, not just visibility) • e.g., “cool” fashion or gadgets • … • One approach: web-based experiments • e.g., [Salganik et al. 2006] • Estimate from models & data • from vote history after accounting for visibility
Solutions: votes vs. time • [Plots: model vs. observations for 6 stories] • model captures qualitative features • slow growth initially • influence of fans on promotion • rapid growth if story promoted (much more visible to users)
Model: requirements for promotion • Values of S and r to get the story on front page • [Plot labels: number of votes, number of fans not yet seeing story, promotion time, 40-vote promotion threshold]
Promotion to front page: model prediction vs. data • 95% accurate • most stories not promoted, and from people with no fans • [Plot: promotion threshold from model; logarithmic scale]
Additional model insights • Heterogeneity • user activity • content quality (“interestingness”) • Predictability from early reactions to new story
Story interestingness • Long-tail distribution (lognormal) • a few stories much more interesting than average • after accounting for visibility via user interface part of model • Open question: why? • A multiplicative process underlying user interests? • [Plots: distribution of estimated interestingness values with lognormal fit (good fit with Kolmogorov-Smirnov test); quantile-quantile plot shows good fit]
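As a rough sketch of what a lognormal fit check involves, the code below fits a lognormal to a sample by taking the mean and standard deviation of the logs, then computes the Kolmogorov-Smirnov distance between the sample and the fitted distribution. The data are synthetic (drawn with hypothetical parameters -2.0 and 1.0), not the Digg interestingness estimates, and this is not necessarily the procedure used for the slide.

```python
import math
import random
import statistics


def fit_lognormal(values):
    """Estimate lognormal parameters as the mean and standard deviation of log(values)."""
    logs = [math.log(v) for v in values]
    return statistics.mean(logs), statistics.stdev(logs)


def ks_statistic(values, mu, sigma):
    """Largest gap between the empirical CDF and the fitted lognormal CDF."""
    logs = sorted(math.log(v) for v in values)
    n = len(logs)
    d = 0.0
    for i, x in enumerate(logs):
        cdf = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


# Synthetic stand-in for per-story interestingness estimates (hypothetical parameters).
rng = random.Random(0)
sample = [rng.lognormvariate(-2.0, 1.0) for _ in range(2000)]
mu_hat, sigma_hat = fit_lognormal(sample)
d_stat = ks_statistic(sample, mu_hat, sigma_hat)
print(round(mu_hat, 3), round(sigma_hat, 3), round(d_stat, 4))
```

A small distance indicates the fitted lognormal tracks the sample closely; turning it into a p-value needs care when the parameters are estimated from the same data.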
Predictions from early behavior • Estimate story interestingness • from full history, or • using initial votes • Behavior predictable from early reaction to story • also with YouTube • e.g., [Crane & Sornette 2008; Lerman & Galstyan 2008; Szabo & Huberman 2008] • Example: use first 4 observations • r estimates correlate 0.9 with those based on full history • predictions of final votes account for 75% of variance • rms prediction error: 244 votes
Model based on votes only? • Estimate based on initial votes only • not including visibility model • i.e., ignore effects of ‘law of surfing’ and social network • [Flowchart: user comes to Digg → see the story? → (yes) vote on the story?]
Model based on votes only? • full model is better than not including visibility (differences significant, p-value < 10^-4)
Future work on models of activities: new content & links • View existing content • Rate existing content • Add new content • What motivates high-quality contribution? • Link to other users • How do users choose who to link to? • What does a link signify? • common interests? • trust in recommendations? • [Callout on slide: focus of this presentation]
Conclusion • Stochastic process approach • connect user and system behaviors • Applicability: • users have limited information and actions • limited use of personalized history • e.g., user communities on the web • not face-to-face small group interactions • Example: news aggregator Digg • votes from visibility + interestingness • user model from info and actions provided by Digg UI