Finding Event-Specific Influencers in Dynamic Social Networks Masters Thesis – Chris Schenk December 1st, 2010
Outline • Problem overview • Influencers, reputation, validation and security • Summary of analysis methods • Boulder fire data • Twitter Data • API, formats, collection and data limitations • Statistics • Finding event-specific influencers – Rankings • Stats • Hyperlink-Induced Topic Search (HITS) • Context-specific in-degree (original work) • Conclusions and Future Work
Influencers • Social dynamics vs online social dynamics • Social network features • Search, friends, re-tweets • Influencers and sheep • What is meant by influence? • Understanding the data • Sampling and baseline statistics • Similarity measures, clustering • Semantics, intent (NLP) • Baseline activity
Influencers – Network Structure • Betweenness/Closeness centrality • PageRank/TwitterRank/TunkRank • Local/Global hierarchical clustering • K-core decomposition • K-clique percolation • Nearest Neighbor Networks • Assortative mixing • HITS • Activity Network
Twitter Data Stats – Boulder Fire • Tweets • First day – September 6th, 2010 10:00am to September 7th, 2010 10:00am, Mountain time • First week – September 6th, 2010 10:00am to September 13th, 2010 10:00am, Mountain time • Social graph • Five one-day snapshots beginning September 7th, 2010 12:40pm, Mountain time • Tweet example • RT @garytx: Article on Twitter's use during #eqnz, #boulderfire, and #sanbrunofire: http://bit.ly/cwI1fi • kate30_CU - 2010-09-13 15:29:24+00:00 • Keywords: boulder, boulderfire, fourmilefire, fourmilecanyon, 4milefire
Qualitatively Influential Users • Sixteen users gathered by Jo White • Used as “ground truth” data for ranking comparison
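The slides do not say which metric was used to compare rankings against the sixteen ground-truth users. As one plausible illustration only, the sketch below measures what fraction of a ranking's top k users appear in the ground-truth set. The function name `overlap_at_k` and the sample usernames are hypothetical.

```python
def overlap_at_k(ranking, ground_truth, k):
    """Fraction of the top-k ranked users that are in the ground-truth set.

    ranking: list of usernames, best first.
    ground_truth: set of usernames judged influential by hand.
    This is an assumed comparison metric, not one named in the slides.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    found = [user for user in ranking[:k] if user in ground_truth]
    return len(found) / k


# Hypothetical data for illustration.
example_ranking = ["user_a", "user_b", "user_c", "user_d"]
example_truth = {"user_a", "user_d", "user_z"}
print(overlap_at_k(example_ranking, example_truth, 2))
```

A higher value means the ranking places more hand-identified users near the top, which gives a simple way to compare the tweet-based, HITS, and in-degree rankings on the same scale.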
Twitter API and Data Collection • Search+Track+REST • Unique users for a given event • Profiles • Periodic collection • Friends/Followers • Periodic collection • Tweets • One-time collection • Limitations • Rate limits, multi-threading • Improper SQL query
Graph Stats • Timezone: Mountain
Finding Influencers - Rankings • Tweets • Number of tweets • Username mentions • Number of re-tweets • Graph • In-degree • HITS • all users (sorted by frequency) • active users • Mentions • addressed messages (replies) • Context-specific in-degree • Global followers count • Active edges (pre-existing network) • New Edges
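The tweet-based rankings above (number of tweets, username mentions, number of re-tweets) can be sketched as simple counters over tweet text. This is a minimal illustration, assuming tweets arrive as `(author, text)` pairs and that a re-tweet is any text starting with `RT @user`. The regular expressions and the function name `tweet_rankings` are assumptions, not the thesis's actual code.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT @(\w+)")


def tweet_rankings(tweets):
    """Count tweets per author, mentions per user, and re-tweets per user.

    tweets: iterable of (author, text) pairs.
    Usernames are lowercased because Twitter handles are case-insensitive.
    """
    tweet_counts, mentions, retweets = Counter(), Counter(), Counter()
    for author, text in tweets:
        tweet_counts[author.lower()] += 1
        rt = RETWEET.match(text)
        if rt:
            retweets[rt.group(1).lower()] += 1
        # Note: a re-tweeted user is also counted as mentioned here.
        for name in MENTION.findall(text):
            mentions[name.lower()] += 1
    return tweet_counts, mentions, retweets


# The example tweet from the Boulder fire data set.
sample = [(
    "kate30_CU",
    "RT @garytx: Article on Twitter's use during #eqnz, #boulderfire, "
    "and #sanbrunofire: http://bit.ly/cwI1fi",
)]
counts, mentioned, retweeted = tweet_rankings(sample)
print(retweeted.most_common(5))
```

`Counter.most_common` returns users sorted by count, highest first, which matches the "sorted by frequency" ordering used for these rankings.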
Hyperlink-Induced Topic Search (HITS) • Hubs • Those that link to many authorities • Authorities • Those that are linked to by many hubs • Process • Calculate the principal eigenvector of two matrices • Followers adjacency matrix (authorities) • Friends adjacency matrix (hubs) • Iterative • Rankings by highest value descending in eigenvectors
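The iterative process can be sketched as a power iteration: authority scores are repeatedly set to the sum of the hub scores pointing at a user, hub scores to the sum of the authority scores a user points at, and both are normalized each round. For an adjacency matrix A this converges toward the principal eigenvectors of AᵀA (authorities) and AAᵀ (hubs). The sketch assumes an edge `(follower, followed)` means "follower links to followed"; the function names, iteration count, and sample usernames are hypothetical.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hubs, authorities) score dicts for a directed edge list.

    edges: iterable of (follower, followed) pairs.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of everyone following the user.
        new_auths = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auths[dst] += hubs[src]
        # Hub update: sum the new authority scores of everyone the user follows.
        new_hubs = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hubs[src] += new_auths[dst]
        auths = normalize(new_auths)
        hubs = normalize(new_hubs)
    return hubs, auths


# Hypothetical follow graph: a and b follow c, a also follows d.
hub_scores, auth_scores = hits([("a", "c"), ("b", "c"), ("a", "d")])
ranking = sorted(auth_scores, key=auth_scores.get, reverse=True)
print(ranking[0])
```

Sorting the authority scores in descending order gives the "highest value descending" ranking; restricting `edges` to users who were active during the event gives the active-users variant.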
Context-specific In-degree Ranking • Global followers count • Periodically download user profiles • Calculate change in followers count for each snapshot • Rank based on overall change, descending • Active edges (includes pre-existing edges) • Periodically download friend/follower lists • Calculate change in followers count for each snapshot • Rank based on overall change, descending • New Edges • Periodically download friend/follower lists • Calculate change in followers count for each snapshot • Do not count edges that existed prior to the start of the event • Rank based on overall change, descending
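One possible reading of the three variants is sketched below. It assumes profile snapshots give a list of follower counts per user, follower-list snapshots give a list of follower-ID sets per user, "active" means the follower also tweeted with the event keywords, and a separate `baseline` mapping holds the edges known to exist before the event. All names and data shapes here are hypothetical, and the slides do not confirm these definitions.

```python
def rank_desc(scores):
    """Usernames sorted by score, highest first (ties broken by name)."""
    return sorted(scores, key=lambda user: (-scores[user], user))


def global_follower_change(profile_counts):
    """Overall change in followers count from first to last profile snapshot."""
    return {user: counts[-1] - counts[0] for user, counts in profile_counts.items()}


def active_edge_change(follower_snapshots, active_users):
    """Change in the number of event-active followers (pre-existing edges kept)."""
    scores = {}
    for user, snaps in follower_snapshots.items():
        first = len(snaps[0] & active_users)
        last = len(snaps[-1] & active_users)
        scores[user] = last - first
    return scores


def new_edge_count(follower_snapshots, active_users, baseline):
    """Event-active followers in the last snapshot that were not in the baseline."""
    scores = {}
    for user, snaps in follower_snapshots.items():
        existing = baseline.get(user, set())
        scores[user] = len((snaps[-1] & active_users) - existing)
    return scores


# Hypothetical snapshots for two users.
counts = {"alice": [100, 130], "bob": [5000, 5010]}
followers = {
    "alice": [{"u1"}, {"u1", "u2", "u3"}],
    "bob": [{"u1", "u4"}, {"u1", "u4", "u9"}],
}
active = {"u1", "u2", "u3", "u4"}
pre_event = {"alice": {"u1"}, "bob": {"u1", "u4"}}

print(rank_desc(global_follower_change(counts)))
print(rank_desc(new_edge_count(followers, active, pre_event)))
```

In this toy data "bob" has far more followers overall, but "alice" gains more followers during the event, so she ranks first in every variant. That is the sense in which a context-specific ranking differs from global in-degree.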
Limitations and Modifications • On-going influence • Can only measure when a user becomes influential • Global popularity masking local influence • User “andrewhyde” • News and bot activity • Extra data needed to ignore these users • Large events • Data collection limitations • How important is a de-follow? • Can identify individual user activity • Identifying the sheep • Can equivalently count friends (out-links) created
Conclusions • Notions of influence and interaction are heavily dependent on social network features • No agreement on definitions • Influence measured by features not 100% in use • Or features not used in the same way by everyone • Composability problem • HITS ranking no better than global in-degree • Context-specific in-degree ranking good! • Needs to be tested on multiple events of varying sizes
Future Work • Understanding “baseline” behavior • For users active (using keywords) during an event • Calculate all given statistics for a user (Klout.com?) • Lots of ways to cut the data • Composable factors/measures/attributes • Explaining new links created • Models for searching, re-tweeting, hashtags, #ff, etc • Incorporating blogs, forums, news websites • Real-time vs not • Informing algorithms with other techniques • NLP and more automation • Qualitative analysis (crowdsourcing?)
Reputation • Definitions? • Scores • Composability • Explicit reputation • Ratings, votes • Implicit reputation • Client • Server
Validation • Ground truth • Authorities • Armies of grad students • Crowd-sourcing? • More data • Cross-referencing • News websites • Blogs • Public health and safety (or other)
Security • Malicious users • Inflation of reputation • Sybil attacks • Reporting • Audience? • Anonymization